THE CORRENTROPY MACE FILTER FOR IMAGE RECOGNITION

By

KYU-HWA JEONG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007
6-1 Estimated computational complexity for training with N images and testing with one image. Matrix inversion and multiplication are considered . . . 59
7-1 Comparison of standard deviations of all the Monte-Carlo simulation outputs . . . 63
7-2 Comparison of ROC areas with different kernel sizes . . . 64
7-3 Case A: Comparison of ROC areas with different kernel sizes . . . 72
7-4 Comparison of computation time and error for one test image between the direct method (CMACE) and the FGT method (Fast CMACE) with p = 4 and kc = 4 . . . 75
7-5 Comparison of computation time and error for one test image in the FGT method with a different number of orders and clusters . . . 76
8-1 Comparison of the memory and computation time between the original CMACE and CMACE-RP . . . 92
E-1 Comparisons of the beampattern for three beamformers in Gaussian noise with 10dB of SNR . . . 113
E-2 Comparisons of BER for three beamformers in Gaussian noise with different SNRs . . . 114
E-3 Comparisons of BER for three beamformers with different characteristic exponent α levels . . . 115
E-4 Comparisons of the beampattern for three beamformers in non-Gaussian noise . . . 116
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
THE CORRENTROPY MACE FILTER FOR IMAGE RECOGNITION
By
Kyu-Hwa Jeong
August 2007
Chair: Jose C. Principe
Major: Electrical and Computer Engineering
The major goal of my research was to develop nonlinear methods in the family of
distortion invariant filters, specifically the minimum average correlation energy (MACE)
filter, a well-known correlation filter for pattern recognition. My research investigated a
closed-form solution of a nonlinear version of the MACE filter using the recently
introduced correntropy function.
Correntropy is a positive definite function that generalizes the concept of correlation
by utilizing higher order moments of the signal statistics. Because of its positive definite
nature, correntropy induces a new reproducing kernel Hilbert space (RKHS). Taking
advantage of the linear structure of the RKHS, it is possible to formulate and solve the
MACE filter equations in the RKHS induced by correntropy. Due to the nonlinear relation
between the feature space and the input space, the correntropy MACE (CMACE) can
potentially improve upon the MACE performance while preserving the shift-invariant
property (additional computation for all shifts will be required in the CMACE).
To alleviate the computational complexity of the solution, my research also presents
the fast CMACE using the Fast Gauss Transform (FGT). Both the MACE and CMACE
are basically memory-based algorithms, and due to the high dimensionality of image
data, the computational cost of the CMACE filter is one of the critical issues in practical
applications. Therefore, my research also used a dimensionality reduction method based
on random projections (RP), which has emerged as a powerful method for dimensionality
reduction in machine learning.
We applied the CMACE filter to face recognition using facial expression data and the
MSTAR public release Synthetic Aperture Radar (SAR) data set, and experimental results
show that the proposed CMACE filter indeed outperforms the traditional linear MACE
and the kernelized MACE in both generalization and rejection abilities. In addition,
simulation results in face recognition show that the CMACE filter with random projections
(CMACE-RP) also outperforms the traditional linear MACE, with only a small degradation
relative to the full CMACE but great savings in storage and computational complexity.
CHAPTER 1
INTRODUCTION
1.1 Background
The goal of pattern recognition is to detect and assign an observation to one of
multiple classes to be recognized. The observation can be a speech signal, an image, or a
higher-dimensional object. In general there are two broad classes of classification problems:
open set and closed set. Most classification problems deal with closed sets, which means
that we have prior information on all the classes and classify among those given classes. In
an open set classification problem, we only have prior information on one specific class and
no prior information on the out-of-class data, which can be the universe. The object
recognition problem that we present in this research is an open set problem; therefore the
method that we are going to use is based on one class versus the universe.
There are many applications for object recognition. In automatic target recognition,
the goal is to quickly and automatically detect and classify objects which may be present
within large amounts of data (typically imagery) with a minimum of human intervention:
vehicle vs. non-vehicle, tanks vs. trucks, one type of tank vs. another.
Another pattern recognition application, and an emerging research field, is
biometrics, such as face, iris, and fingerprint recognition for human identification and
verification. Biometrics technology is rapidly being adopted in a wide variety of security
applications such as computer and physical access control, electronic commerce, homeland
security, and defense.
Figure 1-1 shows the block diagram of the common pattern recognition process. In
the preprocessing block, denoising, normalization, edge detection, pose estimation, etc.,
are conducted for each application.
Feature extraction involves simplifying the amount of resources required to describe a
large set of data accurately. When performing analysis of complex data one of the major
problems stems from the number of variables involved. Analysis with a large number
of variables generally requires a large amount of memory and computation power or a
classification algorithm which overfits the training sample and generalizes poorly to new
samples. Feature extraction is a general term for methods of constructing combinations of
the variables to get around these problems while still describing the data with sufficient
accuracy.
The goal of classification is to assign the features derived from the input pattern to
one of the classes. There are a variety of classifiers including statistical classifiers, artificial
neural networks, support vector machine (SVM), and so on.
Another important pattern recognition methodology is to use the training data
directly instead of extracting some features and performing classification based on those
features. While feature extraction works well in many pattern recognition applications, it
is not always easy for humans to identify what the good features may be.
Correlation filters have been applied successfully to automatic target detection
and recognition (ATR) [1] for SAR imagery [2],[3],[4] and to biometric identification such as
face, iris, and fingerprint recognition [5],[6], by virtue of their shift-invariance property [7]:
if the test image contains the reference object at a shifted location, the correlation
output is also shifted by exactly the same amount. Due to this property, there is no need
to center the input image prior to recognizing it. Also, in some ATR applications, it is
desirable not only to recognize various targets but also to locate them with some degree of
accuracy, and the location can easily be found by searching for the peak of the correlation
output. Another advantage of correlation filters is that they are linear, and therefore the
solution can be computed analytically.
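The shift-invariance property is easy to verify numerically. In this sketch (synthetic data and circular shifts, not from the dissertation's experiments), the correlation surface computed via the FFT peaks exactly at the object's displacement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference "object": a bright 6x6 patch in a 32x32 frame (synthetic example).
ref = np.zeros((32, 32))
ref[12:18, 12:18] = 1.0

# Test scene: the same object circularly shifted by (dy, dx), plus mild noise.
dy, dx = 5, -3
scene = np.roll(ref, (dy, dx), axis=(0, 1)) + 0.05 * rng.standard_normal((32, 32))

# Circular cross-correlation via the FFT (matches the circular shift above).
corr = np.real(np.fft.ifft2(np.fft.fft2(scene) * np.conj(np.fft.fft2(ref))))

# The correlation peak sits exactly at the object's displacement.
peak = np.unravel_index(np.argmax(corr), corr.shape)
print(peak)   # -> (5, 29), i.e. (dy, dx mod 32)
```

Searching the correlation plane for this peak is exactly the localization step described above.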
Figure 1-1. Block diagram for pattern recognition.
Figure 1-2 depicts the simple block diagram for image recognition process using
correlation filters. Object recognition can be performed by cross-correlating an input
image with a synthesized template (filter) and the correlation output is searched for the
peak, which is used to determine whether the object of interest is present or not.
It is well known that matched filters are the optimal linear filters for signal detection
under linear channel and white noise conditions [8][9]. Matched spatial filters (MSF) are
optimal in the sense that they provide the maximum output signal to noise ratio (SNR)
for the detection of a known image in the presence of white noise, under the reasonable
assumption of Gaussian statistics [10]. However, the performance of the MSF is very
sensitive to even small changes in the reference image and the MSF cannot be used for
multiclass pattern recognition since it is only optimum for a single image. Therefore
distortion invariant composite filters have been proposed in various papers [1].
Distortion invariant composite filters are a generalization of matched spatial filtering
for the detection of a single object to the detection of a class of objects, usually in the
image domain. Typically the object class is represented by a set of training exemplars.
The exemplar images represent the image class through the entire range of distortions of
a single object. The goal is to design a single filter which will recognize an object class
through the entire range of distortion. Under the design criterion the filter is equally
matched to the entire range of distortion as opposed to a single viewpoint as in a matched
filter.

Figure 1-2. Block diagram for image recognition process using correlation filter.
The most well known of such composite correlation filters belong to the synthetic
discriminant function (SDF) class [11] and its variations. One of the appeals of the SDF
class is that it can be computed analytically and effectively using frequency domain
techniques. In the conventional SDF approach, the filter is matched to a composite
template that is a linear combination of the training image vectors such that the cross
correlation output at the origin has the same value with all training images. The hope
is that this composite template will correlate equally well not only with the training
images but also with distorted versions of the training images, as well as with test images
in the same class. One of the problems with the original SDF filters is that because
only the origin of the correlation plane is constrained, it is quite possible that some
other value in the correlation plane is higher than this value at the origin even when the
input is centered at the origin. Since the processing of resulting correlation outputs is
based on detecting peaks, we can expect a high probability of false peaks and resulting
misclassifications in such situations.
The minimum variance SDF (MVSDF) filter has been proposed in [12], taking
additive input noise into consideration. The MVSDF minimizes the output variance due to
zero-mean input noise while satisfying the same linear constraints of the SDF. One of the
major difficulties in MVSDF is that the noise covariance is not known exactly; even when
known, an inversion is required and it may be computationally demanding.
Another correlation filter that is widely used is the minimum average correlation
energy (MACE) filter [13]. The MACE minimizes the average correlation energy of the
output over the training images to produce a sharp correlation peak subject to the same
linear constraints as the MVSDF and SDF filters. In practice, the MACE filter performs
better than the MVSDF with respect to out-of-class rejection. The MACE filter however,
has been shown to have poor generalization properties, that is, images in the recognition
class but not in the training exemplar set are not recognized well. The MACE filter is
generally known to be sensitive to distortions but readily able to suppress clutter. In
general, it was observed that filters that produce broader correlation peaks (such as
early SDFs) offer better distortion tolerance. However, they may also provide poorer
discrimination between classes since these filters tend to correlate broadly with low
frequency information in which the classes may be difficult to separate.
By minimizing the average energy in the correlation plane, we hope to keep the side
lobes low while maintaining the origin values at prespecified levels. This is an indirect
method of reducing the false peak or side lobe problem. However, in their attempt to produce
delta-function type correlation outputs, MACE filters emphasize high frequencies and yield
low correlation outputs with images not in the training set.
Therefore, some advanced MACE approaches such as the Gaussian MACE (G-MACE)
[14], the minimum noise and correlation energy (MINACE) [15] and optimal trade-off
filters [16] have been proposed to combine the properties of various SDF’s. In the
G-MACE, the correlation outputs are made to approximate Gaussian shaped functions.
This represents a direct method to control the shape of the correlation outputs. The
MINACE and G-MACE variations have improved generalization properties with a slight
degradation in the average output plane variance and sharpness of the central peak
respectively.
In most of the previous research on SDF-type filters, linear constraints are
imposed on the training images to yield a known value at specific locations in the
correlation plane. However, placing such constraints satisfies conditions only at isolated
points in the image space but does not explicitly control the filter’s ability to generalize
over the entire domain of the training images.
A new correlation filter design based on relaxed constraints, called the maximum
average correlation height (MACH) filter, has been proposed in [17]. The MACH filter
adopts a statistical approach in which training images are treated not as deterministic
representations of the object but as samples of a class whose characteristic parameters
should be used in encoding the filter.
The concept of relaxing the correlation constraints and utilizing the entire correlation
output for multi-class recognition was explicitly first addressed by the distance classifier
correlation filter (DCCF)[18].
1.2 Motivation
Most of the members of the distortion invariant filter family are linear filters, which
are optimal only when the underlying statistics are Gaussian. For the non-Gaussian
statistics case, we need to extract information beyond second-order statistics. This is the
fundamental motivation of this research.
A nonlinear version of correlation filters called the polynomial distance classifier
correlation filter (PDCCF) has been proposed in [19]. A nonlinear extension to the
MACE filter using neural network topology has also been proposed in [20]. Since the
MACE filter is equivalent to a cascade of a linear pre-processor followed by a linear
associative memory (LAM) [21], the LAM portion of the MACE can be replaced with a
nonlinear associative memory structure, specifically a feed-forward multi-layer perceptron
(MLP). It is well known that non-linear associative memory structures can outperform
their linear counterparts on the basis of generalization and dynamic range. However,
in general, they are more difficult to design as their parameters cannot be computed in
closed form. Results have also shown that it is not enough to simply train a MLP using
backpropagation. Careful analysis of the final solution is necessary to confirm reasonable
results. Experimental results in [22] showed better generalization and classification
performance than the linear MACE on the MSTAR ATR data set (at 80% probability of
false alarms, the probability of detection dropped from 4.37% (MACE) to 2.45% in the
nonlinear MACE).
Recently, kernel based learning algorithms have been applied to classification and
pattern recognition due to the fact that they easily produce nonlinear extensions to linear
classifiers and boost performance [23]. By transforming the data into a high-dimensional
reproducing kernel Hilbert space (RKHS) and constructing optimal linear algorithms in
that space, the kernel-based learning algorithms effectively perform optimal nonlinear
pattern recognition in the input space to achieve better separation, estimation, regression,
etc. The nonlinear versions of a number of signal processing techniques such as principal
component analysis (PCA) [24], Fisher discriminant analysis [25] and linear classifiers [25]
have already been defined in a kernel space. Also the kernel matched spatial filter (KMSF)
has been proposed for hyperspectral target detection in [26] and the kernel SDF has
been proposed for face recognition [27]. The kernel correlation filter (KCF), which is the
kernelized MACE filter after prewhitening, has been proposed in [28] for face recognition.
Similar to Fisher's idea in [20], in the KCF prewhitening is performed in the input space
with linear methods, which may affect the overall performance. We will later present the
difference between the KCF and the proposed method, in which all the computation,
including prewhitening, is conducted in the feature space.
More recently, a new generalized correlation function, called correntropy has been
introduced by our group [29]. Correntropy is a positive definite function, which measures
a generalized similarity between random variables (or stochastic processes) and it involves
high-order statistics of input signals, therefore it could be a promising candidate for
machine learning and signal processing. Correntropy defines a new reproducing kernel
Hilbert space (RKHS), which has the same dimensionality as the one defined by the
covariance matrix in the input space and it simplifies the formulation of analytic solutions
in this finite dimensional RKHS. Applications to the matched filter [30] and to chaotic
time series prediction have been presented in the literature.
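For concreteness, the sample estimator of correntropy between two signals, following the definition in [29], is simply the average of a Gaussian kernel applied to the pointwise differences; the kernel size σ = 1 below is an arbitrary illustrative choice:

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """Sample estimate v(X,Y) ~= (1/N) sum_i G_sigma(x_i - y_i), Gaussian kernel."""
    d = np.asarray(x) - np.asarray(y)
    return np.mean(np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma))

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
print(correntropy(x, x))        # identical signals hit the kernel maximum ~0.399
print(correntropy(x, x + 5.0))  # far-apart signals give a value near zero
```

Because the Gaussian kernel's Taylor expansion contains all even moments of the difference, this single scalar carries higher-order statistical information, which is what the CMACE exploits.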
Based on the promising properties of correntropy and the MACE filter, the main goal
of this research is to exploit the generalized nonlinear MACE filter for image recognition.
As the first step, we applied the kernel method to the SDF and obtained the kernel SDF.
Application of the kernel SDF to face recognition has been presented in this research. As
the main part of the research, this dissertation establishes the mathematical foundations of
the correntropy MACE filter (called here the CMACE filter) and evaluates its performance
in face recognition and synthetic aperture radar (SAR) ATR applications.
The formulation exploits the linear structure of the RKHS induced by correntropy
to formulate the correntropy MACE filter in the same manner as the original MACE,
and solves the problem with virtually the same equations (e.g., without regularization)
in the RKHS. Due to the nonlinear relation between the input space and this feature space,
the CMACE corresponds to a nonlinear filter in the input space. In addition, the
CMACE preserves the shift-invariant property of the linear MACE; however, it requires
additional computation for each input image shift. In order to reduce the computational
complexity of the CMACE, the fast CMACE filter based on the Fast Gauss Transform
(FGT) is also proposed.
We also introduce a dimensionality reduction method based on random projections
(RP) and apply RP to the CMACE in order to decrease the storage and better match
readily available computational resources, and we show that the RP method works well
with the CMACE filter for image recognition.
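The random projection step itself is a single matrix multiply. The sketch below (illustrative dimensions, not those of the experiments) shows that a scaled Gaussian random matrix approximately preserves pairwise distances, which is the Johnson-Lindenstrauss property RP relies on:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, N = 4096, 256, 50          # image pixels, reduced dimension, images

X = rng.standard_normal((d, N))  # columns stand in for vectorized images

# Gaussian random projection; the 1/sqrt(k) scaling approximately preserves
# pairwise Euclidean distances (Johnson-Lindenstrauss lemma).
R = rng.standard_normal((k, d)) / np.sqrt(k)
Y = R @ X                        # k x N matrix of projected images

orig = np.linalg.norm(X[:, 0] - X[:, 1])
proj = np.linalg.norm(Y[:, 0] - Y[:, 1])
print(orig, proj)                # close, up to O(1/sqrt(k)) relative error
```

The k × N projected matrix replaces the d × N image matrix in storage and in all subsequent pairwise computations, which is where the savings reported later come from.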
The CMACE formulation for image recognition can be seen as one instance of a
general class of energy minimization problems. We can also apply the same methodology,
minimizing the correntropy energy of the output, to the beamforming problem, whose
conventional linear solution is obtained by minimizing the output power.
Appendix E presents the new application of the correntropy to the beamforming problem
in wireless communications with some preliminary results.
CHAPTER 2
FUNDAMENTAL DISTORTION INVARIANT LINEAR CORRELATION FILTERS
2.1 Introduction
Distortion invariant composite filters are a generalization of matched spatial filtering
for the detection of a single object to the detection of a class of objects, and they are
widely used for image recognition. There are many variations of correlation filters.
The SDF and the MACE filter are fundamental correlation-based distortion invariant
filters for object recognition. Most of the correlation filters are based on them. In this
research, we present the nonlinear extensions to the SDF and MACE. The formulations
of the SDF and MACE filter are briefly introduced in this chapter. We consider a
2-dimensional image as a d × 1 column vector by lexicographically reordering the image,
where d is the number of pixels.
2.2 Synthetic Discriminant Function (SDF)
The SDF filter is matched to a composite image h, where h is a linear combination of
the training image vectors xi
h = \sum_{i=1}^{N} a_i x_i, (2–1)

where N is the number of training images and the coefficients a_i are chosen to satisfy the
following constraints

h^T x_j = u_j, j = 1, 2, · · · , N, (2–2)
where T denotes the transpose and uj is a desired cross correlation output peak value. In
vector form, we define the training image data matrix X as
X = [x1,x2, · · · ,xN ], (2–3)
where the size of matrix X is d × N . Then the SDF is the solution to the following
optimization problem
\min_h h^T h, subject to X^T h = u. (2–4)
Figure 2-1. Example of the correlation output plane of the SDF [31].
It is assumed that N < d and so the problem formulation is a quadratic optimization
subject to an under-determined system of linear constraints. The optimal solution is
h = X(X^T X)^{-1} u. (2–5)
Once h is determined, we apply an appropriate threshold to the output of the cross
correlation, which is the inner product of the test input image and the filter h and decide
on the class of the test image.
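As an illustration, Eq. (2-5) can be computed directly with linear algebra; the sketch below uses random vectors in place of lexicographically ordered training images and an all-ones constraint vector u:

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 1024, 10                      # pixels per image, number of training images

X = rng.standard_normal((d, N))      # columns stand in for training images
u = np.ones(N)                       # desired correlation values at the origin

# SDF solution of Eq. (2-5): h = X (X^T X)^{-1} u
h = X @ np.linalg.solve(X.T @ X, u)

# The linear constraints X^T h = u of Eq. (2-2) hold exactly.
print(np.allclose(X.T @ h, u))       # -> True

# Classification: threshold the inner product of a test image with h.
test = X[:, 0] + 0.1 * rng.standard_normal(d)
print(h @ test)                      # close to 1 for this in-class image
```

Since N < d, `X.T @ X` is a small N × N matrix, which is why the SDF solution is cheap to compute.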
Figure 2-1 shows the general shape of the correlation output plane of the SDF using
inverse synthetic aperture radar (ISAR) imagery [31]. As stated earlier, the SDF has
a broad output plane response and it means that the SDF has a good generalization
performance with the true class images, but a poorer discrimination between true class
and out of class images.
2.3 Minimum Average Correlation Energy (MACE) Filter
Let x_i denote the ith image vector after reordering. The conventional MACE
filter is better formulated in the frequency domain. The discrete Fourier transform (DFT)
of the column vector xi is denoted by Xi and we define the training image data matrix X
as
X = [X_1, X_2, · · · , X_N], (2–6)

where the size of X is d × N and N is the number of training images. Let the vector h be
the filter in the space domain and represent by H its Fourier transform vector. We are
interested in the correlation between the input image and the filter. The correlation of the
ith image sequence xi(n) with filter sequence h(n) can be written as
g_i(n) = x_i(n) ⊗ h(n). (2–7)
By Parseval’s theorem, the correlation energy of the ith image can be written as a
quadratic form
E_i = H^H D_i H, (2–8)
where Di is a diagonal matrix of size d × d whose diagonal elements are the magnitude
squared of the associated element of Xi, that is, the power spectrum of xi(n) and the
superscript H denotes the Hermitian transpose. The objective of the MACE filter is
to minimize the average correlation energy over the image class while simultaneously
satisfying an intensity constraint at the origin for each image. The value of the correlation
at the origin can be written as
g_i(0) = X_i^H H = c_i, (2–9)

for i = 1, 2, · · · , N training images, where c_i is the user-specified output correlation
value at the origin for the ith image. Then the average energy over all training images is
expressed as
E_{avg} = H^H D H, (2–10)

where

D = \frac{1}{N} \sum_{i=1}^{N} D_i. (2–11)

Figure 2-2. Example of the correlation output plane of the MACE [31].
The MACE design problem is to minimize E_{avg} while satisfying the constraint
X^H H = c, where c = [c_1, c_2, · · · , c_N] is an N-dimensional vector. This optimization
problem can
be solved using Lagrange multipliers, and the solution is
H = D^{-1} X (X^H D^{-1} X)^{-1} c. (2–12)
It is clear that the spatial filter h can be obtained from H by an inverse DFT. Once h is
determined, we apply an appropriate threshold to the output correlation plane and decide
whether the test image belongs to the class of the template or not.
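A numerical sketch of the MACE synthesis of Eq. (2-12); for brevity it uses 1-D random signals in place of 2-D images, and represents the diagonal matrix D by the vector of its diagonal (the average power spectrum of Eq. (2-11)):

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 256, 5                        # samples per (1-D) image, training images

x = rng.standard_normal((d, N))
X = np.fft.fft(x, axis=0)            # columns X_i: DFTs of the training images
c = np.ones(N)                       # constrained correlation values at the origin

# D of Eq. (2-11): average power spectrum, kept as the vector of its diagonal.
Dvec = np.mean(np.abs(X) ** 2, axis=1)

# MACE solution of Eq. (2-12): H = D^-1 X (X^H D^-1 X)^-1 c
DinvX = X / Dvec[:, None]
H = DinvX @ np.linalg.solve(X.conj().T @ DinvX, c)

# The origin constraints of Eq. (2-9), X_i^H H = c_i, are met.
print(np.allclose(X.conj().T @ H, c))   # -> True

h = np.real(np.fft.ifft(H))          # spatial-domain filter via the inverse DFT
```

Because D is diagonal, its inverse is an elementwise division, so the dominant cost is the N × N solve rather than a d × d inversion.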
Figure 2-2 shows the general shape of the correlation output plane of the MACE [31].
It shows a sharp peak at the origin and as a result the abilities of finding the location
of the target and discrimination between true class and out of class images have been
improved. However, a sharp output plane causes worse distortion tolerance and poor
generalization.

Figure 2-3. Example of the correlation output plane of the OTSDF [31].
2.4 Optimal Trade-off Synthetic Discriminant (OTSDF) Function
The optimal trade-off filter (OTSDF) is a well-known correlation filter designed to
overcome the poor generalization of the MACE when input noise is present. The OTSDF
trades off the MACE filter criterion against the MVSDF filter criterion.
The OTSDF filter in the frequency domain is given by
H = T^{-1} X (X^H T^{-1} X)^{-1} c, (2–13)

where T = \alpha D + \sqrt{1 - \alpha^2}\, C with 0 ≤ α ≤ 1, D is the diagonal matrix of
the MACE, and C is the diagonal matrix containing the input noise power spectral density
as its diagonal entries.
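A matching sketch of Eq. (2-13) with 1-D random data; the input noise is assumed white here, so C is the identity (an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 256, 5
X = np.fft.fft(rng.standard_normal((d, N)), axis=0)  # DFTs of training signals
c = np.ones(N)

Dvec = np.mean(np.abs(X) ** 2, axis=1)   # MACE term: average power spectrum
Cvec = np.ones(d)                        # assumed white-noise PSD (C = I)

alpha = 0.9                              # trade-off: alpha = 1 recovers the MACE
Tvec = alpha * Dvec + np.sqrt(1.0 - alpha**2) * Cvec

# OTSDF solution of Eq. (2-13): H = T^-1 X (X^H T^-1 X)^-1 c
TinvX = X / Tvec[:, None]
H = TinvX @ np.linalg.solve(X.conj().T @ TinvX, c)
print(np.allclose(X.conj().T @ H, c))    # origin constraints hold -> True
```

Sweeping α between 0 and 1 moves the design continuously between the MVSDF-like and MACE-like extremes.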
The correlation output response of the OTSDF is shown in Figure 2-3 [31]. Compared
to the MACE filter response, the output peak is not nearly as sharp, but it is still more
localized than in the SDF case.
CHAPTER 3
KERNEL-BASED CORRELATION FILTERS
3.1 Brief review on Kernel Method
3.1.1 Introduction
Kernel-based algorithms have recently been developed in the machine learning
community, where they were first used to solve binary classification problems via the
so-called support vector machine (SVM) [32]. There is now an extensive literature on the
SVM [33],[34] and the family of kernel-based algorithms [23].
A kernel-based algorithm is a nonlinear version of a linear algorithm where the data
has been previously (and most often nonlinearly) transformed to a higher dimensional
space in which we only need to be able to compute inner products (via a kernel function).
It is clear that many problems arising in signal processing are of statistical nature
and require automatic data analysis methods. Furthermore, the algorithms used in
signal processing are usually linear and their transformation for nonlinear processing is
sometimes unclear. Signal processing practitioners can benefit from a deeper understanding
of kernel methods, because they provide a different way of taking nonlinearities into account
without losing the original properties of the linear method. Another aspect is dealing
with the amount of available data in a space of a given dimensionality: one needs methods
that can use little data and avoid the curse of dimensionality.
Aronszajn [35] and Parzen [36] were among the first to employ positive definite
kernels in statistics. Later, based on statistical learning theory, the support vector machine
[32] and other kernel-based learning algorithms [23], such as kernel principal component
analysis, were developed.
Many algorithms for data analysis are based on the assumption that the data can
be represented as vectors in a finite dimensional vector space.

Figure 3-1. The example of kernel method (left: input space, right: feature space).

These algorithms, such as linear discrimination, principal component analysis, or least
squares regression, make extensive use of the linear structure. Roughly speaking, kernels
allow one to naturally
derive nonlinear versions of linear algorithms through the implicit nonlinear mapping. The
general idea is the following. Given a linear algorithm (i.e. an algorithm which works in
a vector space), one first maps the data living in a space χ (the input space) to a vector
space H (the feature space) via a nonlinear mapping Φ(·) : χ −→ H; and then runs the
algorithm on the vector representation Φ(x) of the data. In other words, one performs
nonlinear analysis of the data using a linear method. The purpose of the map Φ(·) is to
translate nonlinear structures of the data into linear ones in H.
Consider the following discrimination problem (see Figure 3-1) where the goal is
to separate two sets of points. In the input space, the problem is nonlinear, but after
applying the transformation Φ(x_1, x_2) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2), which maps
each vector to the three monomials of degree 2 formed by its coordinates, the separation
boundary becomes
linear. We have just transformed the data and we hope that in the new representation,
linear structures will emerge.
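The map above can also be checked numerically (a small sketch, not from the dissertation): the inner product of the explicit degree-2 feature vectors equals a kernel evaluated purely in the input space, κ(x, y) = (x^T y)^2.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 monomial map (x1^2, sqrt(2) x1 x2, x2^2) from Section 3.1."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = phi(x) @ phi(y)   # inner product computed in the 3-D feature space
rhs = (x @ y) ** 2      # the same value from the input space: (x . y)^2
print(lhs, rhs)         # both equal 1 (up to float rounding)
```

This is the kernel trick in miniature: the feature-space inner product never has to be formed explicitly.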
Working directly with the data in the feature space may be difficult because the space
can be infinite dimensional, or the transformation only implicitly defined. The basic idea of
a kernel algorithm is to transform the data x from the input space to a high dimensional
feature space of vectors Φ(x), where the inner products can be computed using a positive
definite kernel function satisfying Mercer’s condition [23],
κ(x, y) = ⟨Φ(x), Φ(y)⟩. (3–1)
Mercer's Theorem: Suppose κ(t, s) is a continuous symmetric non-negative definite
function on a closed finite interval T × T. Denote by {λ_k, k = 1, 2, . . .} the sequence
of non-negative eigenvalues of κ(t, s) and by {ϕ_k(t), k = 1, 2, . . .} the sequence of
corresponding normalized eigenfunctions; in other words, for all s, t ∈ T,

\int_T κ(t, s) ϕ_k(t) dt = λ_k ϕ_k(s), (3–2)

\int_T ϕ_k(t) ϕ_j(t) dt = δ_{k,j}, (3–3)

where δ_{k,j} is the Kronecker delta function, equal to 1 if k = j and 0 otherwise. Then

κ(t, s) = \sum_{k=0}^{∞} λ_k ϕ_k(t) ϕ_k(s), (3–4)

where the series above converges absolutely and uniformly on T × T [37].
This simple and elegant idea allows us to obtain nonlinear versions of any linear
algorithm expressed in terms of inner products, without even knowing the exact mapping
function Φ. A particularly interesting characteristic of the feature space is that it is a
reproducing kernel Hilbert space (RKHS), i.e., the span of functions {κ(·, x) : x ∈ χ}
defines a unique functional Hilbert space [35]. The crucial property of these spaces is the
reproducing property of the kernel

f(x) = ⟨κ(·, x), f⟩, ∀f ∈ F. (3–5)
In particular, we can define our nonlinear mapping from the input space to RKHS as
Φ(x) = κ(·,x), then we have
⟨Φ(x), Φ(y)⟩ = ⟨κ(·, x), κ(·, y)⟩ = κ(x, y), (3–6)
and thus Φ(x) = κ(·,x) defines the Hilbert space associated with the kernel.
In this research, we use the Gaussian kernel, which is the most widely used Mercer
kernel,
κ(x − y) = \frac{1}{\sqrt{2π}\,σ} \exp\left(-\frac{‖x − y‖^2}{2σ^2}\right). (3–7)
3.2 Kernel Synthetic Discriminant Function
Based on the kernel methodology, the previous optimization problem for the SDF can
be solved in an infinite dimensional kernel feature space by transforming each element of
the matrix of exemplars X to Φ(Xij) and h to Φ(h) with sample by sample mapping, thus
forming a higher dimensional matrix Φ(X) whose (i, j)th feature vector is Φ(Xij). Let the
matrix X of N training images be
X = [x1,x2, · · · ,xN ], (3–8)
where x_i is the ith training image vector given by
xi = [xi(1), · · · , xi(d)], (3–9)
Then we can extend the SDF optimization problem to the nonlinear feature space by

min Φ^T(h)Φ(h), subject to Φ^T(X)Φ(h) = u, (3–10)

where the dimensions of the transformed Φ(X) and Φ(h) are ∞ × N and ∞ × 1, respectively, for the Gaussian kernel. Then the solution in kernel space becomes

Φ(h) = Φ(X)(Φ^T(X)Φ(X))^{-1} u. (3–11)
We denote K_XX = Φ^T(X)Φ(X), which is an N × N full rank matrix whose (i, j)th element is given by

(K_XX)_ij = Σ_{k=1}^d κ(x_i(k), x_j(k)), i, j = 1, 2, · · · , N. (3–12)
Although Φ(h) is an infinite-dimensional vector, the output of this filter is an N × 1 vector, which can be easily computed using these kernels.

Let Z be the matrix of vector images for testing, with L testing images. We denote K_ZX = Φ^T(Z)Φ(X), which is an L × N matrix whose (i, j)th element is given by

(K_ZX)_ij = Σ_{k=1}^d κ(z_i(k), x_j(k)), i = 1, · · · , L, j = 1, · · · , N. (3–13)

Then the L × 1 output vector of the kernel SDF is given by

y = Φ^T(Z)Φ(h) = K_ZX K_XX^{-1} u. (3–14)
We can compute K_XX off-line with the given training data. Then K_ZX and y can be computed on-line with a given test image. Given N training images and one test image, the computational complexities are O(dN²) off-line and O(dN) + O(N³) on-line, respectively. In general, N ≪ d, so the dominant part of the computational complexity, the O(N³) matrix inversion in (3–14), is not a critical computational issue in the KSDF. Also, the required memory for K_XX, which is O(N²), is much less than a cost depending on the number of image pixels, d.
By applying an appropriate threshold to the output in (3–14), we can detect and
recognize the testing data without generating the composite filter in a feature space. In
object recognition and classification senses, the proposed kernel SDF is simpler than the
kernel matched filter.
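Equations (3–12)–(3–14) can be sketched directly in NumPy. The following is a minimal illustration with hypothetical toy data (the scalar per-pixel kernel realizes the sample-by-sample mapping described above):

```python
import numpy as np

def scalar_gauss(a, b, sigma):
    # elementwise Gaussian kernel, per (3-7)
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def kernel_sdf_output(X, Z, u, sigma=1.0):
    """Kernel SDF output y = K_ZX K_XX^{-1} u of (3-14).

    X: d x N training matrix (images as columns), Z: d x L test matrix,
    u: N-vector of desired constraint values.
    """
    N, L = X.shape[1], Z.shape[1]
    # (K_XX)_ij = sum_k kappa(x_i(k), x_j(k)), per (3-12); K_ZX analogously, per (3-13)
    K_XX = np.array([[scalar_gauss(X[:, i], X[:, j], sigma).sum() for j in range(N)]
                     for i in range(N)])
    K_ZX = np.array([[scalar_gauss(Z[:, i], X[:, j], sigma).sum() for j in range(N)]
                     for i in range(L)])
    return K_ZX @ np.linalg.solve(K_XX, u)

# sanity check on toy data: testing with the training images themselves
# returns u exactly, since y = K_XX K_XX^{-1} u = u
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(16, 3))
y_toy = kernel_sdf_output(X_toy, X_toy, np.ones(3))
```

Because K_XX is a sum over pixels of Gaussian Gram matrices, it is positive definite for generic data, so the linear solve is well posed.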
3.3 Application of the Kernel SDF to Face Recognition
3.3.1 Problem Description
In this section, we show the performance of the proposed kernel based SDF filter for
face image recognition. In the simulations, we used the facial expression database collected
at the Advanced Multimedia Processing Lab at the Electrical and Computer Engineering
Department of Carnegie Mellon university [38]. The database consists of 13 subjects,
whose facial images were captured with 75 varying expressions. The size of each image is
64×64. Sample images are depicted in Figure 3-2. In this research, we tested the proposed
kernel SDF method with the original database images as well as with noisy images.
Sample images with additive Gaussian noise with a 10 dB SNR are shown in Figure 3-2(c).
In order to evaluate the performance of the SDF and kernel SDF filters on this data set, we examined 975 (13×75) correlation outputs. From these results and the ones reported in [5], we picked and report the results of the two most difficult cases, which produced the worst performance with the conventional SDF method. We test with all the images of each person's data set, resulting in 75 outputs for each class. The simulation results have been obtained by averaging (Monte-Carlo approach) over 100 different training sets (each training set chosen randomly) to minimize the problem of performance differences due to splitting the relatively small database into training and testing sets. In this data set, it has been observed that a kernel size around 30%-50% of the standard deviation of the input data is appropriate.
3.3.2 Simulation Results
Figure 3-2. Sample images: (a) Person A, (b) Person B, (c) Person A with additive Gaussian noise (SNR = 10 dB).
Figure 3-3. The output peak values when only 3 images are used for training (N = 3). (Top): SDF, (Bottom): Kernel SDF.
Figure 3-3 shows the average output peak values for image recognition when only N = 3 images are used for training. The desired output peak value should be close to one when the test image belongs to the training image class. Figure 3-3 (Top) shows that the correlation output peak values of the conventional SDF in both the true and false classes not only overlap but are also close to one. As a result, the system will have great difficulty differentiating these two individuals because they can be interpreted as belonging to the same class. Figure 3-3 (Bottom) shows the output values of the kernel SDF, and we can see that the two classes can be recognized well even with a small number of training images.
Figure 3-4 shows the ROC curves for different numbers of training images (N). For the kernel SDF with N = 3, the probability of detection at zero false alarm rate is 1. However, the conventional SDF needs at least 25 training images in order to achieve the same detection performance as the kernel SDF.
Figure 3-4. The comparison of ROC curves for different numbers of training images.
One of the major problems of the conventional SDF is that its performance is easily degraded by additive noise in the test image, since the SDF has no special mechanism to account for input noise. Therefore, it has a poor ability to reject false class images. Figure 3-5 (Top) shows the noise effect on the conventional SDF. When the class images are seriously distorted by additive Gaussian noise with a very low SNR (-2 dB), the correlation output peaks of some test images become greater than 1, and incorrect recognition results. The results in Figure 3-5 (Bottom) are obtained with the kernel SDF, which shows much better performance even in a very low SNR environment. The comparison of ROC curves between the kernel SDF and the conventional SDF for noisy test inputs with different SNRs is shown in Figure 3-6. We can see that the kernel SDF outperforms the SDF and achieves robust pattern recognition performance even in very noisy environments.
Figure 3-5. The output values of noisy test input images with additive Gaussian noise when 25 images are used for training (N = 25). (Top): SDF, circle: true class with SNR = 10 dB, cross: false class with SNR = -2 dB, diamond: false class with no noise. (Bottom): Kernel SDF, circle: true class with SNR = 10 dB, cross: false class with SNR = -2 dB.
Figure 3-6. The ROC curves of noisy test input images with different SNRs when 10 images are used for training (N = 10).
CHAPTER 4
A RKHS PERSPECTIVE OF THE MACE FILTER
4.1 Introduction
This section presents the interpretation of the MACE filter in the RKHS. The original linear MACE filter was formulated in the frequency domain; however, the MACE filter can also be understood through the theory of Hilbert space representations of random functions proposed by Parzen [39]. Parzen analyzed the connection between RKHS and second-order random (or stochastic) processes by using the isometric isomorphism¹ that exists between the Hilbert space spanned by the random variables of a stochastic process and the RKHS determined by its covariance function. Here, we first present the basic theory of the RKHS, then show the interpretation of the MACE filter formulation in the RKHS.
4.2 Reproducing Kernel Hilbert Space (RKHS)
A reproducing kernel Hilbert space (RKHS) is a special Hilbert space associated with a kernel that reproduces (via an inner product) each function in the space or, equivalently, in which every point evaluation functional is bounded. Let H be a Hilbert space of functions on some set E, define an inner product ⟨·, ·⟩_H in H, and consider a complex-valued

¹ Consider two Hilbert spaces H1 and H2 with inner products denoted ⟨f1, f2⟩1 and ⟨g1, g2⟩2, respectively. H1 and H2 are said to be isomorphic if there exists a one-to-one and surjective mapping ψ from H1 to H2 satisfying the following properties

ψ(f1 + f2) = ψ(f1) + ψ(f2), ψ(αf1) = αψ(f1), (4–1)

for all functionals in H1 and any real number α. The mapping ψ is called an isomorphism between H1 and H2. The Hilbert spaces H1 and H2 are said to be isometric if there exists a mapping ψ that preserves inner products,

⟨f1, f2⟩1 = ⟨ψ(f1), ψ(f2)⟩2, (4–2)

for all functions in H1. A mapping ψ satisfying both properties (4–1) and (4–2) is said to be an isometric isomorphism or congruence. The congruence maps both linear combinations of functionals and limit points from H1 into corresponding linear combinations of functionals and limit points in H2.
bivariate function κ(x, y) on E × E. Then the function κ(x, y) is said to be positive definite if for any finite point set {x1, x2, . . . , xn} ∈ E and any not-all-zero corresponding complex numbers {α1, α2, . . . , αn} ∈ C,

Σ_{i=1}^n Σ_{j=1}^n α_i ᾱ_j κ(x_i, x_j) > 0. (4–3)
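Positive definiteness in the sense of (4–3) is easy to verify numerically for the Gaussian kernel: on any set of distinct points, the Gram matrix has strictly positive eigenvalues. An illustrative sketch (the points are arbitrary):

```python
import numpy as np

def gram_matrix(points, sigma=1.0):
    """Gram matrix K_ij = kappa(x_i, x_j) for the Gaussian kernel on scalar points."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None] - pts[None, :]
    return np.exp(-diff ** 2 / (2 * sigma ** 2))

# for distinct points the quadratic form (4-3) is strictly positive,
# i.e., every eigenvalue of the Gram matrix is positive
eigvals = np.linalg.eigvalsh(gram_matrix([0.0, 0.7, 1.5, 3.2]))
```

A positive spectrum means the quadratic form Σ α_i α_j κ(x_i, x_j) is positive for every nonzero real coefficient vector.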
Any positive definite bivariate function κ(x, y) is a reproducing kernel because of the
following fundamental theorem.
Moore-Aronszajn Theorem: Given any positive definite function κ(x, y), there
exists a uniquely determined (possibly finite dimensional) Hilbert space H consisting of
functions on E such that
(i) for every x ∈ E, κ(x, ·) ∈ H and (4–4)
(ii) for every x ∈ E and f ∈ H, f(x) = 〈f, κ(x, ·)〉H. (4–5)
Then H := H(κ) is said to be a reproducing kernel Hilbert space with reproducing kernel
κ. The properties (i) and (ii) are called the reproducing property of κ(x, y) in H(κ).
Parzen [39] analyzed the connection between RKHSs and orthonormal expansions for
second-order stochastic processes obtaining a general expression for the reproducing kernel
inner product in terms of the eigenvalues and eigenfunctions of a certain operator defined
on an appropriate Hilbert space. In addition, Parzen showed that there exists an isometric
isomorphism between the Hilbert space spanned by the random variables of a stochastic
process and the RKHS determined by its covariance function.
Given a zero mean second-order random vector {xi : i ∈ I} with I being an index set,
the covariance function is defined as
R(i, j) = E [xixj] . (4–6)
It is well known that the covariance function R is non-negative definite, therefore it
determines a unique RKHS, H(R), according to the Moore-Aronszajn Theorem. By the
Mercer’s theorem [35],

R(i, j) = Σ_{k=0}^∞ λ_k ϕ_k(i) ϕ_k(j), (4–7)

where {λ_k, k = 1, 2, · · · } and {ϕ_k(i), k = 1, 2, · · · } are the sequences of non-negative eigenvalues and corresponding normalized eigenfunctions of R(i, j), respectively.
H(R) has two important properties which make it a reproducing kernel Hilbert space.
First, let R(i, ·) be the function on I with value at j in I equal to R(i, j); then by Mercer's theorem, the eigen-expansion of the covariance function (4–7), we have

R(i, j) = Σ_{k=0}^∞ λ_k a_k ϕ_k(j), a_k = ϕ_k(i). (4–8)

Therefore, R(i, ·) ∈ H(R) for each i in I. Second, for every function f(·) ∈ H(R) of the form f(i) = Σ_{k=0}^∞ λ_k a_k ϕ_k(i) and every i in I,

⟨f, R(i, ·)⟩ = Σ_{k=0}^∞ λ_k a_k ϕ_k(i) = f(i). (4–9)

By the Moore-Aronszajn Theorem, H(R) is a reproducing kernel Hilbert space with R(i, j) as the reproducing kernel. It follows that

⟨R(i, ·), R(j, ·)⟩ = Σ_{k=0}^∞ λ_k ϕ_k(i) ϕ_k(j) = R(i, j). (4–10)
Thus H(R) is a representation of the random vector {x_i : i ∈ I} with covariance function R(i, j).
One may define a congruence G from H(R) onto the linear space L2(x_i, i ∈ I) such that

G(R(i, ·)) = x_i. (4–11)

The congruence G can be explicitly represented as

G(f) = Σ_{k=0}^∞ a_k ξ_k, (4–12)

where {ξ_k} is a set of orthogonal random variables in L2(x_i, i ∈ I) and f is any element of H(R) of the form f(i) = Σ_{k=0}^∞ λ_k a_k ϕ_k(i) for every i in I.
Summary: Let {x_i : i ∈ I} be a continuous random function defined on a closed finite interval I. Then the following conclusions hold:
• The covariance kernel R possesses the expansion (4–7).
• There exists a Hilbert space L2(ϕ(i), i ∈ I) of sequences which is a representation of
the random function.
• There exists a reproducing kernel Hilbert space H(R) of functions on I, which is a
representation of the random function.
4.3 Interpretation of the MACE filter in the RKHS
The original MACE filter was derived in the frequency domain for simplicity; however, it can also be derived in the space domain [40], and this helps us understand the RKHS perspective of the MACE. Let us consider the case of one training image and construct the following matrix

U = [ x(d)      0        · · ·   0        0
      x(d−1)    x(d)     0       · · ·    0
      ⋮         ⋮                ⋮        ⋮
      x(1)      x(2)     · · ·   · · ·    x(d)
      0         x(1)     x(2)    · · ·    x(d−1)
      ⋮         ⋮                ⋮        ⋮
      0         · · ·    · · ·   · · ·    x(1) ],   (4–13)
where the dimension of the matrix U is (2d − 1) × d. Here we denote the ith column of the matrix U as U_i. Then the column space of U is

L2(U) = {Σ_{i=1}^d α_i U_i | α_i ∈ R, i = 1, · · · , d}, (4–14)
which is congruent to the RKHS induced by the correlation kernel
R(i, j) = ⟨U_i, U_j⟩ = U_i^T U_j, i, j = 1, · · · , d, (4–15)

where ⟨·, ·⟩ represents the inner product operation. If all the columns of U are linearly independent, R(i, j) is positive definite and the dimensionality of L2(U) is d. If U is singular, the dimensionality is smaller than d. In either case, however, all the vectors in this space can be expressed as linear combinations of the column vectors. The optimization problem of the MACE is to find a vector g_o = Σ_{i=1}^d h_i U_i in the L2(U) space, with coordinates h = [h_1 h_2 · · · h_d]^T, such that g_o^T g_o is minimized subject to the constraint that the dth component of g_o (which is the correlation at zero lag) is some constant. Formulating the MACE filter from this RKHS viewpoint only provides a new perspective, with no additional advantage. However, as explained next, it will help us derive a nonlinear extension of the MACE with a new similarity measure.
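The space-domain construction above can be made concrete: the matrix U of (4–13), the correlation kernel R = U^T U of (4–15), and the zero-lag constraint are all a few lines of NumPy. A sketch with a hypothetical length-3 image:

```python
import numpy as np

def build_U(x):
    """(2d-1) x d matrix of (4-13): column i is the reversed image x shifted down by i."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    U = np.zeros((2 * d - 1, d))
    for i in range(d):
        U[i:i + d, i] = x[::-1]          # x(d) at the top of each column, x(1) at the bottom
    return U

x = np.array([1.0, 2.0, 3.0])
U = build_U(x)
R = U.T @ U                              # correlation kernel R(i, j) = U_i^T U_j of (4-15)
g = U @ np.array([0.5, -1.0, 2.0])       # g = sum_i h_i U_i for some filter coordinates h
```

The dth component of g = Uh is h^T x, the correlation at zero lag, which is exactly the quantity constrained in the MACE optimization.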
CHAPTER 5
NONLINEAR VERSION OF THE MACE IN A NEW RKHS:
THE CORRENTROPY MACE (CMACE) FILTER
5.1 Correntropy Function
5.1.1 Definition
Correlation is one of the fundamental operations of statistics, machine learning and
signal processing because it quantifies similarity. However, correlation only exploits second
order statistics of the random variables or random processes, which limits its optimality to
Gaussian distributed data. Correntropy was introduced in [29] as a generalized measure of
similarity. Its name stresses the connection to correlation, but also indicates the fact that
its mean value across time or dimensions is associated with entropy, more precisely to the
argument of the log in Renyi’s quadratic entropy estimated with Parzen windows, which
is called the information potential. The information potential (IP) is the argument of Renyi's quadratic entropy of a random variable X with PDF f_X(x),

H_2(X) = − log ∫ f_X²(x) dx, (5–1)

where IP(X) = ∫ f_X²(x) dx.
A nonparametric estimator of the information potential using Parzen windows from N data samples is

ÎP(x) = (1/N²) Σ_{i=1}^N Σ_{j=1}^N κ_σ(x_i − x_j), (5–2)

where κ_σ is the Gaussian kernel in (5–4) [41].
This relation to entropy shows that the correntropy contains information beyond
second order moments, and can therefore generalize correlation without requiring moment
expansions.
Definition: Cross correntropy or simply correntropy is a generalized similarity
measure between two arbitrary vector random variables X and Y defined as
V (X, Y ) = E[κσ(X − Y )], (5–3)
where E is the mathematical expectation and κ_σ is the Gaussian kernel given by

κ_σ(X − Y) = (1/(√(2π)σ)) exp(−‖X − Y‖²/(2σ²)), (5–4)

where σ is the kernel size or bandwidth.
In practice, given a finite number of data samples {(x_i, y_i)}_{i=1}^d, the cross correntropy is estimated by

V̂(X, Y) = (1/d) Σ_{i=1}^d κ_σ(x_i − y_i). (5–5)
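The sample estimator (5–5) is one line of NumPy; a minimal sketch:

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """Sample cross-correntropy estimator of (5-5): the mean Gaussian kernel of x_i - y_i."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.mean(np.exp(-e ** 2 / (2 * sigma ** 2))) / (np.sqrt(2 * np.pi) * sigma)
```

Identical vectors attain the maximum value κ_σ(0) = 1/(√(2π)σ); the estimate decreases as the vectors move apart, and it is symmetric in its arguments.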
5.1.2 Some Properties
Correntropy has very nice properties that make it useful for machine learning and
nonlinear signal processing. First and foremost, it is a positive function also defining a
RKHS, but unlike the RKHS defined by the covariance function of the random variable
(process) it contains higher order statistical information. This new function quantifies
the average angular separation in the kernel feature space between the dimensions of the
random variable (or between temporal lags of the random process). Therefore, correntropy
can be the metric for similarity measurements in feature space. Several properties of
correntropy and their proofs are presented in [29][42][43]. Here we present, without proofs,
only the properties that are relevant to this dissertation.
Property 1: Correntropy is a similarity measure between X and Y incorporating higher order moments of the random variable X − Y [29].
Applying the Taylor series expansion to the Gaussian kernel, we can rewrite the correntropy function in (5–3) as

V(X, Y) = (1/(√(2π)σ)) Σ_{k=0}^∞ ((−1)^k/((2σ²)^k k!)) E[(X − Y)^{2k}], (5–6)
which contains all the even-order moments of the random variable X − Y . The kernel
size controls the emphasis of the higher order moments with respect to the second, since
the higher order terms of the expansion decay faster for larger σ. As σ increases, the
high-order moments decay and the second order moment tends to dominate. In fact, for
kernel size larger than 10 times the one chosen from density estimation considerations (e.g.
Silverman’s rule [44]), correntropy starts to approach correlation. The kernel size has to be
chosen according to the application, but here this issue will not be further addressed and
Silverman’s rule will be used by default.
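The moment expansion (5–6) can be checked numerically: truncating the series after a few terms and plugging in the sample even moments of X − Y reproduces the direct kernel estimate of correntropy. A sketch with hypothetical samples of the difference variable:

```python
import numpy as np
from math import factorial, sqrt, pi

# hypothetical samples of the difference variable X - Y
rng = np.random.default_rng(1)
e = rng.normal(scale=0.3, size=5000)
sigma = 1.0

# direct sample estimate of V(X, Y) = E[kappa_sigma(X - Y)], per (5-3) and (5-5)
direct = np.mean(np.exp(-e ** 2 / (2 * sigma ** 2))) / (sqrt(2 * pi) * sigma)

# truncated even-moment expansion of (5-6), using the sample moments of X - Y
series = sum((-1) ** k / ((2 * sigma ** 2) ** k * factorial(k)) * np.mean(e ** (2 * k))
             for k in range(8)) / (sqrt(2 * pi) * sigma)
```

For these small differences and σ = 1, eight terms already agree with the direct estimate to well below 1e-6, illustrating how quickly the higher-order terms decay for a moderate kernel size.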
Property 2: Let {x_i, i ∈ T} be a random vector (process) with T being an index set; the auto-correntropy function of the random vector (process), V(i, j) = E[κ_σ(x_i − x_j)], is a symmetric and positive definite function, and therefore it defines a new RKHS, called the VRKHS [29].

Since κ_σ(x_i − x_j) is symmetric, it is obvious that V(i, j) is also symmetric. Also, since κ_σ(x_i − x_j) is positive definite, for any set of n points {x_1, · · · , x_n} and nonzero real numbers {α_1, · · · , α_n}, we have

Σ_{i=1}^n Σ_{j=1}^n α_i α_j κ_σ(x_i − x_j) > 0. (5–7)
It is true that for any strictly positive function g(·, ·) of two random variables x and y, E[g(x, y)] > 0. Thus we have

E[Σ_{i=1}^n Σ_{j=1}^n α_i α_j κ_σ(x_i − x_j)] > 0,

which equals

Σ_{i=1}^n Σ_{j=1}^n α_i α_j E[κ_σ(x_i − x_j)] = Σ_{i=1}^n Σ_{j=1}^n α_i α_j V(i, j) > 0. (5–8)
Thus V (i, j) is both symmetric and positive definite. Now, the Moore-Aronszajn theorem
[35] proves that for every real symmetric positive definite function k, there exists a unique
RKHS with k as its reproducing kernel. Hence V (i, j) is a reproducing kernel.
As shown in property 1, VRKHS contains higher order statistical information, unlike
the RKHS defined by the covariance function of random processes.
Property 3: Assume the samples {(x_i, y_i)}_{i=1}^d are drawn from the joint PDF f_{X,Y}(x, y), and let f̂_{X,Y;σ}(x, y) be its Parzen estimator with kernel size σ. The correntropy estimator with kernel size σ′ = √2 σ is the integral of f̂_{X,Y;σ}(x, y) along the line x = y [42],

V̂_{√2σ}(X, Y) = ∫_{−∞}^{+∞} f̂_{X,Y;σ}(x, y)|_{x=y=u} du. (5–9)

Figure 5-1. Contours of CIM(X, 0) in 2D sample space (kernel size is set to 1).
Property 4: Correntropy, as a sample estimator, induces a metric in the sample space. Given two vectors X = [x_1, x_2, · · · , x_N]^T and Y = [y_1, y_2, · · · , y_N]^T in the sample space, the function CIM(X, Y) = (κ_σ(0) − V̂(X, Y))^{1/2}, where κ_σ is the Gaussian kernel in (5–4) with κ_σ(0) = 1/(√(2π)σ), defines a metric in the sample space, named the Correntropy Induced Metric (CIM) [42]. Therefore, correntropy can serve as the metric for similarity measurement in feature space.
Figure 5-1 shows the contours of distance from X to the origin in a two dimensional
space. The interesting observation from the figure is as follows: when X is close to zero,
CIM behaves like an L2 norm¹, which is clear from the Taylor expansion in (5–6); further out, CIM behaves like an L1 norm; eventually, as X departs from the origin, the metric saturates and becomes insensitive to distance (approaching an L0 norm²). Large differences saturate, so the metric is less sensitive to large deviations, which makes it more robust. This property inspired us to investigate the inherent robustness of the CIM to outliers. The kernel size controls this very interesting behavior of the metric across neighborhoods. A small kernel size leads to a tight linear (Euclidean) region and a large L0 region, while a larger kernel size enlarges the linear region. In this dissertation, we mathematically prove that 1) when the kernel size goes to infinity, the CIM norm is equivalent to the L2 norm, and 2) when the kernel size goes to zero (from the positive side), the CIM is equivalent to the L0 norm.
Let us define E = X − Y = [e_1, e_2, ..., e_N]^T; then

CIM(E) = [κ_σ(0) − (1/N) Σ_{i=1}^N κ_σ(e_i)]^{1/2}
       = {(1/(2√(2π)Nσ³)) [2σ² Σ_{i=1}^N (1 − exp(−e_i²/2σ²))]}^{1/2}. (5–10)
First, let us take a look at the following limit:

lim_{σ→∞} 2σ²(1 − exp(−e_i²/2σ²))
  = lim_{t→0} (1 − exp(−t e_i²))/t      (t := 1/2σ²)
  = lim_{t→0} exp(−t e_i²) e_i² / 1     (L'Hôpital)
  = e_i². (5–11)
¹ Given a vector X, the Lp norm of X is defined by ‖X‖_p = (Σ_{i=1}^N |x_i|^p)^{1/p}, where p is a real number with p ≥ 1.

² The L0 norm of X is defined as lim_{p→0} ‖X‖_p^p; that is, the zero norm of X is simply the number of non-zero elements of X. Despite its name, the zero norm is not a true norm; in particular, it is not positively homogeneous.
Therefore,

lim_{σ→∞} (2√(2π)Nσ³)^{1/2} CIM(E) = ‖E‖_2. (5–12)

Second, look at the following limit:

lim_{σ→0+} (1 − exp(−e_i²/2σ²)) = 0 if e_i = 0, 1 if e_i ≠ 0. (5–13)

Therefore,

lim_{σ→0+} √(2π)Nσ [CIM(E)]² = ‖E‖_0. (5–14)
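Both limits are easy to check numerically: for a fixed error vector E, the scaled CIM of (5–12) approaches ‖E‖2 as σ grows, and the scaled squared CIM of (5–14) approaches the number of nonzero entries as σ shrinks. A sketch with an arbitrary test vector:

```python
import numpy as np
from math import sqrt, pi

def cim(E, sigma):
    """CIM(E) = [kappa_sigma(0) - (1/N) sum_i kappa_sigma(e_i)]^(1/2), per (5-10)."""
    E = np.asarray(E, dtype=float)
    k0 = 1.0 / (sqrt(2 * pi) * sigma)
    v = np.mean(np.exp(-E ** 2 / (2 * sigma ** 2))) / (sqrt(2 * pi) * sigma)
    return sqrt(k0 - v)

E = np.array([0.0, 0.5, -1.0, 2.0])
N = len(E)

# large kernel size: (2 sqrt(2 pi) N sigma^3)^(1/2) CIM(E) -> ||E||_2, per (5-12)
big = 1e3
l2_like = sqrt(2 * sqrt(2 * pi) * N * big ** 3) * cim(E, big)

# small kernel size: sqrt(2 pi) N sigma [CIM(E)]^2 -> ||E||_0, per (5-14)
small = 1e-3
l0_like = sqrt(2 * pi) * N * small * cim(E, small) ** 2
```

With σ = 10³ the L2 approximation is accurate to a relative error of about 10⁻⁶, and with σ = 10⁻³ the L0 limit counts the three nonzero entries essentially exactly.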
Property 5: Given data samples {x_i}_{i=1}^d, the correntropy kernel creates another data set {f(x_i)}_{i=1}^d preserving the similarity measure as

V(i, j) = E[κ_σ(x_i − x_j)] = E[f(x_i)f(x_j)]. (5–15)

The proof of Property 5 is in Appendix B.

According to Property 5, there exists a scalar nonlinear mapping f which makes the correntropy of x_i the correlation of f(x_i). Thus (5–15) allows the computation of the correlation in feature space through the correntropy function in the input space [45][46].
5.2 The Correntropy MACE Filter
According to the RKHS perspective of the MACE filter in chapter 4, we can extend it immediately to the VRKHS³. Applying the correntropy concept to the MACE formulation of chapter 4, the definition of the correlation in (4–15) shall be substituted by

V(i, j) = (1/(2d − 1)) Σ_{n=1}^{2d−1} κ_σ(U_{in} − U_{jn}), i, j = 1, · · · , d, (5–16)

where U_{in} is the (i, n)th element of (4–13). This function is positive definite and thus induces the VRKHS. According to Mercer's theorem [35], there is a basis {η_i, i = 1, · · · , d} in this VRKHS such that

⟨η_i, η_j⟩ = V(i, j), i, j = 1, · · · , d. (5–17)

³ In this dissertation, we call the RKHS induced by correntropy the VRKHS.
Since it is a d-dimensional Hilbert space, it is isomorphic to any d-dimensional real vector space equipped with the standard inner product structure. After an appropriate choice of this isomorphism {η_i, i = 1, · · · , d}, which is nonlinearly related to the input space, a nonlinear extension of the MACE filter can be readily constructed in this VRKHS, namely, finding a vector v_0 = Σ_{i=1}^d f_{hi} η_i with f_h = [f_{h1} · · · f_{hd}]^T as coordinates such that v_0^T v_0 is minimized subject to the constraint that the dth component of v_0 is some pre-specified constant.
Let the ith image vector be x_i = [x_i(1) x_i(2) · · · x_i(d)]^T and the filter be h = [h(1) h(2) · · · h(d)]^T, where T denotes transpose. From Property 5, the CMACE filter can be formulated in feature space by applying a nonlinear mapping function f to the data as well as to the filter. We denote the transformed training image matrix and filter vector, whose sizes are d × N and d × 1, respectively, by

F_X = [f_{x_1}, f_{x_2}, · · · , f_{x_N}], (5–18)

f_h = [f(h(1)) f(h(2)) · · · f(h(d))]^T, (5–19)

where f_{x_i} = [f(x_i(1)) f(x_i(2)) · · · f(x_i(d))]^T for i = 1, · · · , N. Given data samples, the cross correntropy between the ith training image vector and the filter can be estimated as

v_{oi}[m] = (1/d) Σ_{n=1}^d f(h(n)) f(x_i(n − m)), (5–20)

for all the lags m = −d+1, · · · , d−1. Then the cross correntropy vector v_{oi} can be formed, including all the lags of v_{oi}[m], as

v_{oi} = S_i f_h, (5–21)

where S_i is the (2d − 1) × d matrix
S_i = [ f(x_i(d))      0             · · ·        0            0
        f(x_i(d−1))    f(x_i(d))     0            · · ·        0
        ⋮              ⋮                          ⋮            ⋮
        f(x_i(1))      f(x_i(2))     · · ·        · · ·        f(x_i(d))
        0              f(x_i(1))     f(x_i(2))    · · ·        f(x_i(d−1))
        ⋮              ⋮                          ⋮            ⋮
        0              0             0            f(x_i(1))    f(x_i(2))
        0              0             · · ·        0            f(x_i(1)) ].   (5–22)
Since the scale factor 1/d has no influence on the solution, it will be ignored throughout the dissertation. The correntropy energy of the ith image is given by

E_i = v_{oi}^T v_{oi} = f_h^T S_i^T S_i f_h. (5–23)

Denoting V_i = S_i^T S_i and using the definition of correntropy in (5–15), the d × d correntropy matrix V_i is

V_i = [ v_i(0)      v_i(1)    · · ·    v_i(d−1)
        v_i(1)      v_i(0)    · · ·    v_i(d−2)
        ⋮           ⋮         ⋱        ⋮
        v_i(d−1)    · · ·     v_i(1)   v_i(0) ],   (5–24)

where each element of the matrix is computed, without explicit knowledge of the mapping function f, by

v_i(l) = Σ_{n=1}^d κ_σ(x_i(n) − x_i(n + l)), (5–25)

for l = 0, · · · , d − 1. The average correntropy energy over all the training data can be written as

E_av = (1/N) Σ_{i=1}^N E_i = f_h^T V_X f_h, (5–26)

where

V_X = (1/N) Σ_{i=1}^N V_i. (5–27)
Since our objective is to minimize the average correntropy energy in feature space, the optimization problem is formulated as

min f_h^T V_X f_h subject to F_X^T f_h = c, (5–28)

where c is the desired vector for all the training images. The constraint in (5–28) means that we specify the correntropy values between the training input and the filter as the desired constants. Since the correntropy matrix V_X is positive definite, there exists an analytic solution to the optimization problem using the method of Lagrange multipliers in the new finite dimensional VRKHS. The CMACE filter in feature space then becomes

f_h = V_X^{-1} F_X (F_X^T V_X^{-1} F_X)^{-1} c. (5–29)
Unlike the KSDF in (3–11), which has ∞ × 1 dimensionality in the RKHS induced by the conventional kernel method, the CMACE filter is defined in the finite dimensional VRKHS, which has the same dimensionality as the input space, d × 1. In general, the kernel method creates an infinite dimensional feature space, so the solution often needs regularization to remain bounded; therefore, the KSDF may need additional regularization terms for better performance. In terms of computational complexity, the CMACE requires, compared to the KSDF, an additional O(d³) operation and O(d²) storage for V_X^{-1}. This motivates both a fast version of the CMACE and a dimensionality reduction method for practical applications.
5.3 Implications of the CMACE Filter in the VRKHS
5.3.1 Implication of Nonlinearity
From (5–17) and (5–22), we can say that the RKHS induced by correntropy (the VRKHS) is a Hilbert space spanned by the basis {η_i}_{i=1}^d of size (2d − 1) × 1 as
where N is the number of training images and L is the number of test input images.
In the CMACE, we denote F̃_X = V_X^{-1/2} F_X, and we can decompose f_h as

f_h = V_X^{-1/2} V_X^{-1/2} F_X (F_X^T V_X^{-1/2} V_X^{-1/2} F_X)^{-1} c
    = V_X^{-1/2} F̃_X (F̃_X^T F̃_X)^{-1} c. (5–35)
The main difference between the CMACE and the KCFA is the prewhitening process. In the KCFA, prewhitening is conducted in the input space using D; in the CMACE, on the other hand, (5–35) implies that the image is implicitly whitened in the feature space by the correntropy matrix V_X. In the space domain MACE filter, the autocorrelation matrix can be used as a preprocessor for prewhitening. Since the CMACE filter uses the same formulation in the feature space, we can also expect the correntropy matrix to act as a prewhitener. However, in practice, we cannot obtain the whitened data explicitly since the mapping function is not explicitly known. In addition, the solution of the KCFA, like that of the KSDF, is defined in an infinite dimensional feature space; therefore an additional regularization term may be needed for better performance.
CHAPTER 6
THE CORRENTROPY MACE IMPLEMENTATION
6.1 The Output of the CMACE Filter
Since the nonlinear mapping function f is not explicitly known, it is impossible to
directly use the CMACE filter fh in the feature space. However, the correntropy output
can be obtained by the inner product between the transformed input image and the
CMACE filter in the VRKHS. In order to test this filter, let Z be the matrix of L vector
testing images and FZ be the transformed matrix of Z, then the L × 1 output vector is
given by
y = F_Z^T V_X^{-1} F_X (F_X^T V_X^{-1} F_X)^{-1} c. (6–1)

Here, we denote T_ZX = F_Z^T V_X^{-1} F_X and T_XX = F_X^T V_X^{-1} F_X. Then the output becomes

y = T_ZX T_XX^{-1} c, (6–2)
where T_XX is an N × N symmetric matrix and T_ZX is an L × N matrix, whose (i, j)th elements are expressed by

(T_XX)_ij = Σ_{l=1}^d Σ_{k=1}^d w_{lk} f(x_i(k)) f(x_j(l)) ≅ Σ_{l=1}^d Σ_{k=1}^d w_{lk} κ_σ(x_i(k) − x_j(l)), i, j = 1, · · · , N, (6–3)

(T_ZX)_ij = Σ_{l=1}^d Σ_{k=1}^d w_{lk} f(z_i(k)) f(x_j(l)) ≅ Σ_{l=1}^d Σ_{k=1}^d w_{lk} κ_σ(z_i(k) − x_j(l)), i = 1, · · · , L, j = 1, · · · , N, (6–4)

where w_{lk} is the (l, k)th element of V_X^{-1}.
The final output expressions in (6–3) and (6–4) are obtained by approximating
f(xi(k))f(xj(l)) and f(zi(k))f(xj(l)) by κσ(xi(k) − xj(l)) and κσ(zi(k) − xj(l)),
respectively, which is similar to the kernel trick and holds on average because of Property 5. Unfortunately, (6–3) and (6–4) involve weighted versions of these functionals; therefore the error in the approximation requires further theoretical investigation.
The CMACE is formulated in the linear VRKHS but has nonlinear behavior, since the VRKHS is nonlinearly related to the input space. However, the CMACE preserves the shift-invariance property of the linear MACE; the proof is given in Appendix C. Although the output of the CMACE gives only one value, it is possible to construct the whole output plane by shifting the test input image, so the shift invariance property of correlation filters can be exploited at the expense of more computation. Applying an appropriate threshold to the output of (6–1), one can detect and recognize the testing data without generating the composite filter in feature space. As will be shown in the simulation results section, even with this approximation the CMACE outperforms the conventional MACE.
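The computable form (6–2)–(6–4) needs only Gaussian-kernel evaluations. Below is a toy sketch with hypothetical 1-D "images" that assembles V_X from (5–24)–(5–25) and then forms T_XX, T_ZX and the output y. One assumption is labeled explicitly: (5–25) as printed sums over n = 1, …, d even where n + l exceeds d, so the sketch sums only over the available lags.

```python
import numpy as np
from math import sqrt, pi

def gauss(a, b, sigma):
    # Gaussian kernel of (5-4) on all pairs: G[k, l] = kappa(a(k) - b(l))
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def correntropy_matrix(x, sigma):
    """d x d Toeplitz matrix V_i of (5-24); v_i(l) per (5-25), summed over available lags."""
    d = len(x)
    K = gauss(x, x, sigma)
    v = np.array([np.trace(K, offset=l) for l in range(d)])
    idx = np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    return v[idx]

def cmace_output(X, Z, c, sigma=1.0):
    """CMACE test output y = T_ZX T_XX^{-1} c of (6-2)-(6-4).
    X: d x N training matrix, Z: d x L test matrix, c: N-vector of constraints."""
    d, N = X.shape
    V_X = sum(correntropy_matrix(X[:, i], sigma) for i in range(N)) / N   # (5-27)
    W = np.linalg.inv(V_X)                                                # w_lk of (6-3)-(6-4)
    def T(A, B):
        # (T)_ij = sum_{l,k} w_lk kappa(a_i(k) - b_j(l)) = trace(W G_ij)
        return np.array([[np.trace(W @ gauss(A[:, i], B[:, j], sigma))
                          for j in range(B.shape[1])] for i in range(A.shape[1])])
    return T(Z, X) @ np.linalg.solve(T(X, X), c)

# toy sanity check: testing on the training images returns c, since then T_ZX = T_XX
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(12, 3))
y_toy = cmace_output(X_toy, X_toy, np.ones(3))
```

The double loop makes the O(Nd²) per-test cost discussed in section 6.3 visible: every matrix element is a d × d weighted kernel sum.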
6.2 Centering of the CMACE in Feature Space
With the Gaussian kernel, the correntropy value is always positive, which brings the
need to subtract the mean of the transformed data in feature space in order to suppress
the effect of the output DC bias. This centering of the correntropy should not be confused
with the spatial centering of the input images.
Given d data samples {x(i)}_{i=1}^d, let us denote the mean of the transformed data in feature space as E[f(x(i))] = m_f. Then the centered correntropy, which can properly be called the generalized covariance function, is given by

V_c(i, j) = E[{f(x(i)) − m_f}{f(x(j)) − m_f}]
          = E[f(x(i))f(x(j))] − m_f²
          = V(i, j) − m_f². (6–5)
The square of the mean of the transformed data f(·) coincides with the estimate of the information potential of the original data, that is,

m_f² = (1/d²) Σ_{i=1}^d Σ_{j=1}^d κ_σ(x(i) − x(j)). (6–6)

In order to show the validity of (6–6), let us consider the sample estimation of correntropy (ignoring the scale factor 1/d); then we have

Σ_{i=1}^d f(x(i)) f(x(i + t)) = Σ_{i=1}^d κ_σ(x(i) − x(i + t)). (6–7)
We arrange the double summation (6–6) as an array and sum along the diagonal direction, which yields exactly the autocorrelation function of the transformed data at different lags; thus it can be written in terms of the correntropy function of the input data at different lags:

(1/d²) Σ_{i=1}^d Σ_{j=1}^d f(x(i))f(x(j))
  = (1/d²) { Σ_{t=0}^{d−1} Σ_{i=1}^{d−t} f(x(i))f(x(i + t)) + Σ_{t=1}^{d−1} Σ_{i=1+t}^{d} f(x(i))f(x(i − t)) }
  ≈ (1/d²) ( Σ_{t=0}^{d−1} Σ_{i=1}^{d−t} κ_σ(x(i) − x(i + t)) + Σ_{t=1}^{d−1} Σ_{i=1+t}^{d} κ_σ(x(i) − x(i − t)) )
  = (1/d²) Σ_{i=1}^d Σ_{j=1}^d κ_σ(x(i) − x(j)). (6–8)
As we see in (6–8), when the summation is far from the main diagonal, smaller and
smaller data sizes are involved which leads to poor approximation. Notice that this is
exactly the same problem when the auto correlation function is estimated from windowed
data. However, when d is large, the approximation improves. Therefore, in the CMACE
output equation (6–1), we can use the centered correntropy matrix VXC by subtracting
the information potential from the correntropy matrix VX as
VXC = VX −m2favg · 1d×d, (6–9)
where, m2favg is the average estimated information potential over N training images and
and 1d×d is a d × d matrix with all the entries equal to 1. Using the centered correntropy
matrix VXC , a better rejection ability for out of class images is achieved since the offset of
the output can be removed except for the center value in case of training images.
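As an illustration, the centering of (6–5)–(6–9) can be sketched in a few lines of NumPy. The function and variable names below are ours, not from the dissertation, and a toy 1-D vector stands in for a vectorized image:

```python
import numpy as np

def gaussian_kernel(u, sigma):
    """Gaussian kernel kappa_sigma(u) = exp(-u^2 / (2 sigma^2))."""
    return np.exp(-u**2 / (2.0 * sigma**2))

def centered_correntropy_matrix(x, sigma):
    """Estimate V(i,j) = kappa_sigma(x(i) - x(j)) and subtract the
    information potential m_f^2 = (1/d^2) sum_ij kappa_sigma(x(i) - x(j)),
    following (6-5), (6-6), and (6-9) for a single data vector."""
    diffs = x[:, None] - x[None, :]          # d x d matrix of x(i) - x(j)
    V = gaussian_kernel(diffs, sigma)        # correntropy matrix
    m_f2 = V.mean()                          # information potential estimate
    return V - m_f2                          # centered correntropy matrix

x = np.random.default_rng(0).normal(size=64)
Vc = centered_correntropy_matrix(x, sigma=1.0)
# By construction, the mean of the centered matrix is (numerically) zero.
print(abs(Vc.mean()) < 1e-12)
```

The subtraction of the scalar information potential is exactly the DC-bias removal described above; only the diagonal (the center value for training images) retains an offset in the full CMACE construction.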
6.3 The Fast CMACE Filter
In practice, the drawback of the proposed CMACE filter is its computational
complexity. In the MACE, the correlation output can be obtained by multiplication in
the frequency domain, and the computation time can be drastically reduced by the FFT.
In the CMACE, however, the output is obtained by computing the product of two matrices
in (6–3) and (6–4), which depends on the image size and the number of training images.
Each element involves a double summation of weighted kernel functions, so each element
of the matrix requires O(d²) computations, where d is the number of image pixels. When
the number of training images is N, the total computational complexity for one test
output is O(Nd² + N²). A similar argument shows that the computation needed for training
is O(d²(N² + 1) + N²). On the other hand, the MACE only requires
O(4(d(2N² + N + 2) + N²) + Nd log₂(d)) for training and O(4d + d log₂(d)) for testing
one input image. Table 6-1 shows the computational complexity of the MACE and CMACE;
more details about the required computation costs are given in Appendix D. Constructing
the whole output plane would significantly increase the computational complexity of the
CMACE, which quickly becomes too demanding in practical settings. Therefore a method to
simplify the computation is necessary for practical implementations.
Here the Fast Gauss Transform (FGT) [47] is proposed to reduce the computation
time with a very small approximation error. The FGT belongs to a family of fast
evaluation algorithms developed over the past decades to enable rapid calculation of
weighted sums of Gaussian functions with arbitrary accuracy. In nonparametric
probability density estimation with a Gaussian kernel, the FGT can reduce the complexity
from O(dM) to O(d + M) for M evaluations with d sources.
6.3.1 The Fast Gauss Transform
In many problems in mathematics and engineering, the function of interest can be
decomposed into sums of pairwise interactions among a set of sources. In particular, this
type of problem is found in nonparametric probability density estimation as

G(z) = Σ_{j=1}^d q_j κσ(z − x(j)), (6–10)

where κσ is a kernel function centered at the source points x(j) and the q_j are scalar
weighting coefficients. With the Gaussian kernel, (6–10) can be interpreted as a
"Gaussian" potential field due to sources of strengths q_j at the points x(j), evaluated
at the target point z. Suppose that we have M evaluation target points; then the
computation of (6–10) requires O(dM) operations, which limits its use for large data
sets d and M in real-world applications. The FGT reduces the complexity of (6–10) to
O(d + M). The basic idea is to cluster the sources and target points using appropriate
data structures and the Hermite expansion, and then reduce the number of summations for
a given level of precision.
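To make the O(dM) baseline concrete, here is a direct evaluation of the weighted Gaussian sum (6–10); this is exactly the computation the FGT accelerates. The names and toy data are illustrative, not from the dissertation:

```python
import numpy as np

def direct_gauss_sum(targets, sources, weights, sigma):
    """Directly evaluate G(z) = sum_j q_j * exp(-(z - x_j)^2 / (2 sigma^2))
    at every target point. Cost is O(d * M) kernel evaluations."""
    # M x d matrix of pairwise differences between targets and sources
    diffs = targets[:, None] - sources[None, :]
    K = np.exp(-diffs**2 / (2.0 * sigma**2))
    return K @ weights                       # one weighted sum per target

rng = np.random.default_rng(1)
sources = rng.uniform(0, 1, size=200)        # d = 200 source points x(j)
weights = rng.uniform(0, 1, size=200)        # strengths q_j
targets = rng.uniform(0, 1, size=50)         # M = 50 evaluation points z
G = direct_gauss_sum(targets, sources, weights, sigma=0.1)
print(G.shape)  # (50,)
```

The FGT replaces the M × d kernel matrix with clustered truncated Hermite expansions, trading a controlled approximation error for the O(d + M) cost quoted above.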
6.3.2 The Fast Correntropy MACE Filter
The major part of the computational burden in the correntropy MACE filter is given by

T = Σ_{i=1}^d Σ_{j=1}^d w_ij e^{−(z(i)−x(j))²/2σ²}. (6–11)

This is very similar to the density estimation problem, evaluating at d targets z(i)
given d source samples x(j). However, the weighting factors w_ij in (6–11) depend on
both target and source, which differs from the original FGT applications, where the
weight vector is the same at every evaluation target point. In our case, the weight
vector w_i = [w_i1, · · · , w_id]^T varies with every evaluation point z(i). We can say
that (6–11) is a more general expression than the original FGT formulation, and it can
be written as

T = Σ_{i=1}^d G_i(z), (6–12)

where

G_i(z) = Σ_{j=1}^d w_ij κσ(z(i) − x(j)). (6–13)
This means that clustering and the Hermite expansion should be performed at every
target z(i) with a different weight vector w_i, which incurs extra computation for
clustering. However, since the sources are clustered in the FGT, if one expands the
clustered sources about their center in the Hermite expansion, then there is no need to
redo the clustering and the Hermite expansion at every evaluation. The only thing that
is necessary is to use a different weight vector at every evaluation point. This process
requires no additional complexity compared to the original FGT formulation, except that
more storage is needed to keep the weight vectors. Using the Hermite expansion around
the center s, the Gaussian centered at x(j) evaluated at z(i) can be obtained by

exp{−(z(i) − x(j))²/2σ²} = Σ_{n=0}^{p−1} (1/n!) ((x(j) − s)/(√2 σ))^n h_n((z(i) − s)/(√2 σ)) + ε(p), (6–14)
where the Hermite function h_n(x) is defined by

h_n(x) = (−1)^n (d^n/dx^n) exp(−x²). (6–15)
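As a sanity check on (6–14)–(6–15), the following sketch (our own helper names, not the dissertation's code) builds h_n via the physicists' Hermite recurrence and compares the truncated expansion to the exact Gaussian; the truncation error shrinks as the source x lies closer to the expansion center s:

```python
import math

def hermite_h(n_max, u):
    """h_n(u) = (-1)^n d^n/du^n exp(-u^2) = H_n(u) exp(-u^2), for n < n_max,
    using the physicists' Hermite recurrence H_{n+1} = 2u H_n - 2n H_{n-1}."""
    H = [1.0, 2.0 * u]
    for n in range(1, n_max):
        H.append(2.0 * u * H[n] - 2.0 * n * H[n - 1])
    return [Hn * math.exp(-u * u) for Hn in H[:n_max]]

def gaussian_via_hermite(z, x, s, sigma, p):
    """Truncated p-term expansion (6-14) of exp(-(z - x)^2 / (2 sigma^2))
    about an expansion center s."""
    u = (z - s) / (math.sqrt(2.0) * sigma)   # scaled target coordinate
    v = (x - s) / (math.sqrt(2.0) * sigma)   # scaled source coordinate
    h = hermite_h(p, u)
    return sum(v**n / math.factorial(n) * h[n] for n in range(p))

z, x, s, sigma = 0.3, 0.45, 0.5, 1.0
exact = math.exp(-(z - x)**2 / (2 * sigma**2))
approx = gaussian_via_hermite(z, x, s, sigma, p=8)
print(exact, approx)   # the two values agree closely for sources near s
```

The identity behind (6–14) is the Hermite generating function e^{2uv−v²} = Σ_n H_n(u) v^n/n!, multiplied through by e^{−u²}.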
Also, in this research we use a simple greedy algorithm for clustering [48], which computes
a data partition with a maximum radius at most twice the optimum. This clustering
method and the Hermite expansion with order p require O(pd) operations. In the case of
(6–3) and (6–4), since the number of sources and targets is the same, they can be
interchanged; that is, the test image can be the source, so that the clustering and Hermite expansion can
be done only once per test.

Table 6-1. Estimated computational complexity for training with N images and testing with one image. Matrix inversion and multiplication are considered (in this simulation, d = 4096, N = 60, p = 4, k_c = 4).

Thus T in (6–11) can be approximated by
T ≈ Σ_{i=1}^d Σ_B Σ_{n=0}^{p−1} (1/n!) h_n((x(i) − s_B)/(√2 σ)) C_n(B), (6–16)

where B represents a cluster with center s_B and C_n(B) is given by

C_n(B) = Σ_{z(j), w_ij ∈ B} w_ij ((z(j) − s_B)/(√2 σ))^n. (6–17)
From (6–16), we can see that evaluating k_c expansions at all the evaluation points
costs O(p k_c d), so the total number of operations is O(p d (k_c + 1)) per computation of
each element in (6–3) and (6–4). The final aim is to obtain the output of the CMACE
filter with N training images and L test images. In order to compute the output of one
test image, the original direct method requires O(d²N(N + 1)) operations to obtain T_XX
and T_ZX; this count is reduced to O(p d (k_c + 1) N(N + 1)) by applying this enhanced
FGT. Typically p and k_c are around 4 while d and N are 4,096 and around 100
respectively in our application, which results in a computational savings of roughly 100
times. Additionally, clustering with the test image is performed only once per test,
which reduces the computation time even more. However, from Table 6-1 we see that the
computational complexity of the CMACE for testing still depends on the number of
training images, resulting in more computations than the MACE. More work is necessary
to further reduce the computation time of the CMACE and its memory storage requirements,
but the proposed approach enables practical applications with present-day computers.
CHAPTER 7
APPLICATIONS OF THE CMACE TO IMAGE RECOGNITION
7.1 Face Recognition
7.1.1 Problem Description
In this section, we show the performance of the proposed correntropy MACE filter for
face image recognition. In the simulations, we used the same facial expression database
used in Chapter 3. We used only 5 images per person to compose the template (filter),
since the MACE filter shows reasonable recognition results with a small number of
training images in this database [5]. We picked and report the results of the two most
difficult cases, which produced the worst performance with the conventional MACE method.
We test with all the images of each person's data set, resulting in 75 outputs for each
class. The simulation results have been obtained by averaging (Monte-Carlo approach)
over 100 different training sets (each consisting of 5 randomly chosen images) to
minimize performance differences due to splitting the relatively small database into
training and testing sets. The kernel size σ is chosen to be 10 for the correntropy
matrix during training and 30 for the test output. In this data set, it has been
observed that a kernel size around 30%-50% of the standard deviation of the input data
is appropriate. Moreover, we can control the performance by choosing a different kernel
size during training for prewhitening.
7.1.2 Simulation Results
Figure 7-1 shows the average test output peak values for image recognition. The
desired output peak value should be close to one when the test image belongs to the
training image class (true class) and otherwise it should be close to zero. Figure 7-1 (Top)
shows that the correlation output peak values of the conventional MACE for false-class
images are close to zero, which means that the MACE has a good ability to reject the
false class. However, some outputs in the test image set, even in the true class, are
not recognized as the true class. Figure 7-1 (Bottom) shows the output values of the proposed correntropy
Figure 7-1. The averaged test output peak values (100 Monte-Carlo simulations with N = 5). (Top) MACE. (Bottom) CMACE.
Figure 7-2. The test output peak values with additive Gaussian noise (N = 5). (Top) MACE: circle, true class with SNR = 10 dB; cross, false class with SNR = 2 dB. (Bottom) CMACE: circle, true class with SNR = 10 dB; cross, false class with SNR = 2 dB.
Figure 7-3. The comparison of ROC curves with different SNRs: MACE (no noise), MACE (SNR: 2 dB), MACE (SNR: 0 dB), and the proposed method (no noise, SNR: 2 dB, 0 dB).
MACE, and we can see that the generalization and rejection performance are improved.
As a result, the two images can be recognized well even with a small number of training
images. One of the problems of the conventional MACE is that its performance can be
easily degraded by additive noise in the test image, since the MACE has no special
mechanism to account for input noise. Therefore, it has a poor ability to reject a
false-class image when noise is added to the false class. Figure 7-2 (Top) shows the
noise effect on the conventional MACE. When the class images are seriously distorted by
additive Gaussian noise (SNR = 2 dB), the correlation output peaks of some test images
from the false class become greater than those of the true class, and wrong recognition
occurs. The results in Figure 7-2 (Bottom) are obtained by the proposed method. The
correntropy MACE shows much better performance, especially in rejection, even in a very
low SNR environment. Figure 7-3 shows the comparison of ROC curves with different SNRs.
In the conventional MACE, we can see that the false alarm rate increases as the additive
noise power increases. However, in the proposed method, the probability of detection
Table 7-1. Comparison of standard deviations of all the Monte-Carlo simulation outputs (100 × 75 outputs). Columns: True (no noise), False (no noise), True (SNR: 0 dB), False (SNR: 0 dB).
Figure 7-7. Case A: ROC curves with different numbers of training images.
for the images of the same class not used in training, which is one of the known
drawbacks of the conventional MACE. For the confuser test images, most of the output
values are near zero but some are higher than those of target images, creating false
alarms. On the other hand, for the CMACE, most of the peak output values of test images
are above 0.5, which means that the CMACE generalizes better than the MACE. Also, the
rejection performance for a confuser is better than that of the MACE. As a result,
recognition performance between the two vehicles is improved by the CMACE, as best
quantified in the ROC curves of Figure 7-7. From the ROC curves we can see that the
detection ability of the proposed method is much better than both the MACE and the
KCFA. For the KCFA, prewhitened images are obtained by multiplying by D^{−0.5} in the
frequency domain, and the kernel trick is applied to the prewhitened images to compute
the output in (5–31). A Gaussian kernel with a kernel size of 5 is used. From the ROC
curves in Figure 7-7, we can also see that the CMACE outperforms the nonlinear kernel
correlation filter, in particular at high detection probability.
Figure 7-8 (a) shows the MACE filter output plane and (b) shows the CMACE filter
output plane, for a test image in the target class not present in the training set.
Figure 7-8 (c) and (d) show the case of a confuser (false class) test input. In
Figure 7-8 (a) and (b), we can see that both the MACE and the CMACE produce a sharp peak
in the output plane. However, the peak value at the origin of the CMACE is higher
(closer to the desired value) than that of the MACE. Moreover, the CMACE has fewer
sidelobes, and the sidelobes around the origin are lower than those of the MACE. This
indicates that the detection ability of the proposed method is better than that of the
MACE. On the other hand, for the confuser test input in Figure 7-8 (c) and (d), the
output values around the origin of the CMACE are lower than those of the MACE, which
means that the CMACE has better rejection ability than the MACE.
In order to demonstrate the shift-invariance property of the CMACE, we apply the
images of Figure 7-9. The test image was cropped so that the object is shifted by 13 pixels
Figure 7-8. Case A: The MACE output plane vs. the CMACE output plane. (a) True class in the MACE (annotated values: 0.74, 0.48). (b) True class in the CMACE (annotated values: 0.98, 0.44). (c) False class in the MACE (peak: 0.87, center: 0.22). (d) False class in the CMACE (peak: 0.62, center: 0.006).
in both x and y pixel positions. Figure 7-10 shows the output planes of the MACE and
CMACE when the shifted image is used as the test input while all the training images
are centered. In Figure 7-10, the maximum peak value should happen at the position
of (77,77) in the output plane since the object is shifted by 13 pixels in both x and y
directions. In the CMACE output plane, the maximum peak happens at (77,77) and the
value is 0.9585. However, in the MACE, the maximum peak happens at (74,93) with
0.9442, and the value at position (77,77) is 0.93. In this test, the CMACE shows a
better shift-invariance property than the MACE.
Figure 7-9. Sample images of BTR60 of size 64 × 64 pixels. (a) The cropped image of size 64 × 64 pixels at the center of the original of size 128 × 128 pixels. (b) The cropped image of size 64 × 64 pixels at (x − 13, y − 13) of the original of size 128 × 128 pixels.
Figure 7-10. Case A: Output planes with shifted true-class input image. (a) The MACE output plane (maximum at X: 74, Y: 93, Z: 0.9442). (b) The CMACE output plane (maximum at X: 77, Y: 77, Z: 0.9585).
The CMACE performance sensitivity to the kernel size is studied next. In order to
find an appropriate kernel size for the CMACE, the easiest approach is to apply
Silverman's rule of thumb, developed for the kernel density estimation problem, which is
given by σ_i = 1.06 s_i d^{−1/5}, where s_i is the standard deviation of the ith training
image and d is the number of samples [44]. A more principled alternative is to apply
cross validation to find the best kernel size. For cross validation, we use one image of
the training set that is not included in the filter design. Since we are considering
images as 1-dimensional vectors, we have N different training data sets. Therefore, we
obtain one overall kernel size σ by averaging the N individual kernel sizes,
σ = (1/N) Σ_{i=1}^N σ_i. In this simulation, with N = 60, the kernel size given by
Silverman's rule is 0.0185 and the best one from cross validation is 0.1. Figure 7-11
shows the ROC curves for the kernel size obtained by Silverman's rule and the one
obtained by cross validation. We see that the ROC performance from Silverman's rule is
very close to that of the optimal kernel size from cross validation. Also, when we
increase the kernel size to 10, the performance is similar to that of the MACE. As
expected from the properties of correntropy, correntropy approaches correlation for
large kernel sizes.
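For concreteness, Silverman's rule averaged over training images can be sketched as follows. The function name is ours and the random arrays only stand in for real training images:

```python
import numpy as np

def silverman_kernel_size(images):
    """Kernel size via Silverman's rule of thumb, sigma_i = 1.06 * s_i * d**(-1/5),
    averaged over the N training images as in the text."""
    sigmas = []
    for img in images:
        x = np.asarray(img, dtype=float).ravel()   # treat the image as a 1-D vector
        d = x.size                                 # number of pixels (samples)
        sigmas.append(1.06 * x.std() * d ** (-1.0 / 5.0))
    return float(np.mean(sigmas))                  # sigma = (1/N) sum_i sigma_i

rng = np.random.default_rng(0)
train = [rng.random((64, 64)) for _ in range(5)]   # toy 64 x 64 "images"
sigma = silverman_kernel_size(train)
print(sigma)   # small positive value; the factor d**(-1/5) shrinks it as d grows
```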
Table 7-3 shows the area under the ROC for different kernel sizes, and we conclude
that kernel sizes between 0.01 and 1 provide little change in detectability. This may be
surprising when contrasted with the problem of finding the optimal kernel size in density
estimation, but in correntropy the kernel size enters the argument of an expected value
and plays a different role in the final solution: namely, it controls the balance between
the effect of second-order moments and the higher-order moments (see property 1).
7.2.3 Depression Angle Distortion Case
In the second simulation, we selected the vehicle 2S1 (rocket launcher) as the
target and the T62 as a confuser. These two kinds of images look very similar in shape
and therefore represent a difficult object-recognition case, useful to test the
performance improvement of the proposed method. In order to show the effect of the depression angle
Figure 7-11. The ROC comparison with different kernel sizes: CMACE (σ = 0.0185, Silverman's rule), CMACE (σ = 0.1, the best from cross validation), MACE, and CMACE (σ = 10).
Table 7-3. Case A: Comparison of ROC areas with different kernel sizes
distortion, training data are selected from target images collected at a 30-degree
depression angle, and the MACE and CMACE are tested with data taken at a 17-degree
depression angle.
Figure 7-12 depicts some sample images. As we can see in Figure 7-12 (a) and (b),
due to the big change in depression angle (a 13-degree change is considered a huge
distortion), the test images have more shadows and the apparent size of the vehicles
also changes, making detection more difficult. In this simulation, we use all the images (120 images
Figure 7-12. Case B: Sample SAR images (64 × 64 pixels) of two vehicle types, a target chip (2S1) and a confuser (T62). (a) Training images (2S1) at aspect angles 0, 35, 124, and 159 degrees. (b) Test images (2S1) at aspect angles 3, 53, 104, and 137 degrees. (c) Test images from the confuser (T62) at aspect angles 2, 41, 103, and 137 degrees.
covering 180 degrees of pose) at a 30-degree depression angle for training, and test with
all 120 exemplar images at a 17-degree depression angle.

Figure 7-13 (Top) shows the correlation output peak values of the MACE and (Bottom)
shows the output peak values of the CMACE filter with target and confuser test data. We
see that the conventional MACE performs very poorly in this case, either under- or
overshooting the peak value of 1 for the target class, but the CMACE can improve the
recognition performance because of its better generalization. Figure 7-14 depicts the
ROC curve and summarizes the CMACE advantage over the MACE in this large depression
angle distortion case. More interestingly, the KCFA performance is close to that of the
linear MACE, due to the same input-space whitening, which is unable to cope with the
large distortion.
Figure 7-13. Case B: Peak output responses of testing images for a target chip (circle) and a confuser (cross), over test-image indices covering aspect angles 0 to 179 degrees. (Top) MACE. (Bottom) CMACE.
Figure 7-14. Case B: ROC curves for the MACE, correntropy MACE, and KCF.
Table 7-4. Comparison of computation time and error for one test image between the direct method (CMACE) and the FGT method (Fast CMACE) with p = 4 and k_c = 4.
Figure 8-3. ROC comparison with different dimensionality reduction methods for MACE and CMACE (reduced image size is 16 × 16).
The comparison among the four dimensionality reduction methods (subsampling,
pixel averaging, bilinear interpolation, and Gaussian (RP)) for images of size 16 × 16
(from 64 × 64) is shown in Figure 8-3.

For the CMACE, the Gaussian (RP) and pixel-averaging methods work very well, with
subsampling the worst, though still robust. Subsampling is the simplest technique, but
it can lose important detail information. In the MACE case, the Gaussian method is the
worst, with the pixel-averaging method still providing some discrimination, but at a
much reduced rate (compare with Figure 7-3). It is surprising that local pixel
averaging, the simplest method of dimensionality reduction, provides such robust
performance in this application for both the MACE and CMACE. It indicates that coarse
features are sufficient for discrimination up to a certain level of performance.
However, notice that pixel averaging loses with respect to CMACE-RP when the operating
point in the ROC is close to 100%, as can be expected (finer detail is needed to
discriminate between classes).
We have also applied PCA to the MACE and CMACE. There are different ways to
apply PCA to this task. One method of dimensionality reduction with PCA uses training
images from the whole data set, and then projects all the images onto the subspace
spanned by the principal components. With this method, when we choose 10 images (5 from
the true class and 5 from the false class) and project all true- and false-class images
onto this subspace, the performance of both the MACE and CMACE is perfect. However, the
training data must be sufficient to find principal directions that cover the whole test
data, and a large computation is required. Moreover, in practice it is impossible to use
out-of-class images as a training set for a MACE filter, which is designed only with
data from one class. In this more realistic case, the test image class does not belong
to the training set for PCA, and the discrimination performance will be very poor.
Figure 8-4 shows the ROC curves when only true-class images are used for PCA. Even for
this case, we had to use all the true-class images (75) to find 75 principal directions,
project all the images, and then choose the 5 projected images to compose the MACE and
CMACE filters. For testing, we also project the test image onto the subspace obtained
from the training set. Since the false-class test images are not used to determine the
PCA subspace, the projected data of the false class are not guaranteed to preserve the
information of the original images; therefore, the rejection performance becomes very
poor. The ROC area values for the MACE and CMACE are 0.4015 and 0.7283, respectively.
We could not obtain reasonable results for the MACE with the RP method, as shown
in Figure 8-3. We attribute this MACE behavior to the Gaussian dimensionality reduction
procedure, although it partially applies to the other methods as well. Although RP
preserves similarity in the reduced projections, it changes the statistics of the
original data classes. After random projection with the Gaussian ensemble method, all
the projected images display statistics very close to white noise with similar variance.
This result is shown in Figure 8-5, where sample images of size 16 × 16 from the two
classes after applying RP are depicted. The first row shows the training image set, while the second
Figure 8-4. ROC comparison with PCA for MACE and CMACE (reduced dimensionality is k = 75).
row displays the in class test set and the third row the out of class test images. We see
that the projected images in the true class and the false class, although slightly different in
detail, seem to have very similar statistics.
The MACE, which extracts only second-order information, is unable to distinguish
between the projected image sets, but the CMACE succeeds in this task. In order to
explain the effectiveness of the correntropy function, we compare correlation and
correntropy in the projected space. This result is shown in Figure 8-6; we consider the
2D images as long 1D vectors. Figure 8-6 (a) shows the autocorrelation of one original
image vector in the true class; (b) depicts the autocorrelation of one of the training
images after RP, which leads us to conclude that the projected image has been whitened
(the only peak occurs at zero lag); and (c) shows that the cross correlation between the
reduced training image vector and a false-class test image vector after RP is
practically the same as the autocorrelation of the reduced training image vector after
RP. Therefore, the covariance information of the images after RP is totally destroyed.
Since the conventional
Figure 8-5. Sample images of size 16 × 16 after RP. (a) Training images. (b) True class images. (c) False class images.
MACE filter utilizes only second-order information, it is unable to discriminate between
in-class and out-of-class images. However, in (e) and (f), we can see that the cross
correntropy between in-class and out-of-class images is still preserved after RP, due to
the fact that correntropy can extract higher-order information from the
reduced-dimensional data. Therefore, the CMACE filter seems very well posed to work with
images reduced by random projection, for this and other applications.
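A minimal sketch of the random-projection preprocessor with a Gaussian ensemble follows (names are ours, and toy random images stand in for SAR chips). By the Johnson-Lindenstrauss property, pairwise Euclidean distances are approximately preserved, which is the similarity preservation referred to above:

```python
import numpy as np

def gaussian_random_projection(images, k, seed=0):
    """Project flattened d-pixel images to k dimensions with a Gaussian
    ensemble R whose entries are drawn from N(0, 1/k)."""
    X = np.stack([np.asarray(im, dtype=float).ravel() for im in images])
    d = X.shape[1]
    R = np.random.default_rng(seed).normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
    return X @ R                                  # N x k reduced data

rng = np.random.default_rng(1)
imgs = [rng.random((64, 64)) for _ in range(6)]   # toy 64 x 64 images
Y = gaussian_random_projection(imgs, k=256)       # 16 x 16 = 256 dimensions
# Pairwise distances before and after projection have roughly similar magnitudes:
d01 = np.linalg.norm(imgs[0].ravel() - imgs[1].ravel())
p01 = np.linalg.norm(Y[0] - Y[1])
print(d01, p01)
```

The projected vectors look nearly white, which is exactly why the correlation-based MACE fails after RP while correntropy, drawing on higher-order statistics, does not.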
We can also see the overall detection and recognition performance of the CMACE-RP
through a further analysis of the output plane. Figure 8-7 shows correlation output
planes for the MACE and correntropy output planes (CMACE) after dimensionality reduction
with k = 64 random projections. Figure 8-7 (a) shows the desirable correlation output
plane of the MACE filter given the true-class test image; however, (b) shows the poor
rejection ability for the false-class test image. On the other hand, for the CMACE
filter, the true- and false-class output planes in Figure 8-7 (c) and (d) show the
expected responses even with such low-dimensional images.
Figure 8-6. Cross correlation vs. cross correntropy. (a) Autocorrelation of one of the original training image vectors. (b) Autocorrelation of one of the reduced training image vectors after RP. (c) Cross correlation between one of the reduced training image vectors and a false-class test image vector after RP. (d) Autocorrentropy of one of the original training image vectors. (e) Autocorrentropy of one of the reduced training image vectors after RP. (f) Cross correntropy between one of the reduced training image vectors and a false-class test image vector after RP.
The initial idea of using a preprocessor based on random projections was to alleviate
the storage and computational complexity of the CMACE. Table 8-1 presents comparisons
between the original CMACE and the CMACE with RP. The dominant component for storage is
the correntropy matrix (V_X). In single precision (32 bit), 64 Mbytes are needed to
store V_X for 64 × 64 pixel images, but only 256 Kbytes for 16 × 16 pixel images after
RP. We need an additional 4 Mbytes to perform random projection with the Gaussian
ensemble method; in the binary ensemble case, no additional storage for RP is needed.
The table also presents the computational complexity of (6–1) with one test image, given
N = 5 training images and clocked with MATLAB version 7.0 on a 2.8 GHz Pentium 4
processor with 2 Gbytes of RAM.
Figure 8-7. Correlation output planes vs. correntropy output planes after dimension reduction with random projection (reduced image size is 8 × 8). (a) With a true class test image in the MACE. (b) With a false class test image in the MACE. (c) With a true class test image in the CMACE. (d) With a false class test image in the CMACE.
Table 8-1. Comparison of the memory and computation time between the original CMACE (image size 64 × 64) and CMACE-RP (16 × 16, with Gaussian ensemble method) for one test image with N = 5.
and K = X_Φ^H X_Φ is an N × N Gram matrix whose entries are the dot products
κ(x_i, x_j) = 〈Φ(x_i), Φ(x_j)〉.

E.3 Nonlinear Beamformer using Correntropy
The correntropy beamformer is formulated in the RKHS induced by correntropy, and
the solution is obtained by solving a constrained optimization problem that minimizes
the average correntropy output energy. Let the transformed received-data matrix and
filter vector, whose sizes are M × N and M × 1 respectively, be

F_X = [f_x1, f_x2, · · · , f_xN], (E–15)

f_w = [f(w(1)) f(w(2)) · · · f(w(M))]^H, (E–16)
where

f_xk = [f(x_k(1)) f(x_k(2)) · · · f(x_k(M))]^H (E–17)

for k = 1, 2, · · · , N. Given data samples, the cross correntropy between the received
signal at the kth snapshot and the filter can be estimated as

v_ok[m] = (1/d) Σ_{n=1}^d f(w(n)) f(x_k(n − m)), (E–18)
for all lags m = −M + 1, · · · , M − 1.

The correntropy energy of the kth received signal output is given by

E_k = v_ok^T v_ok = f_w^H V_xk f_w, (E–19)

and the M × M correntropy matrix V_xk is the Toeplitz matrix

V_xk =
[ v_k(0)      v_k(1)   · · ·  v_k(M − 1) ]
[ v_k(1)      v_k(0)   · · ·  v_k(M − 2) ]
[   ⋮            ⋮       ⋱        ⋮      ]
[ v_k(M − 1)   · · ·   v_k(1)   v_k(0)   ], (E–20)

where each element of the matrix is computed without explicit knowledge of the mapping
function f by

v_k(l) = Σ_{n=1}^M κσ(x_k(n) − x_k(n + l)), (E–21)

for l = 0, · · · , M − 1.
The average correntropy energy over all the received data can be written as

E_av = (1/N) Σ_{k=1}^N E_k = f_w^H V_X f_w, (E–22)

where V_X = (1/N) Σ_{k=1}^N V_xk. Since our objective is to minimize the average
correntropy energy in the feature space, we can formulate the optimization problem as

min f_w^H V_X f_w, subject to f_w^H f_a(θ) = 1, (E–23)
where f_a(θ), of size M × 1, is the transformed steering vector. Then the solution in
feature space becomes

f_w = V_X^{−1} f_a(θ) / (f_a(θ)^H V_X^{−1} f_a(θ)). (E–24)
Then the output is given by

y_correntropy,k = (f_a(θ)^H V_X^{−1} f_xk) / (f_a(θ)^H V_X^{−1} f_a(θ)) = T_ax / T_a, (E–25)
where

T_a = Σ_{i=1}^M Σ_{j=1}^M w_ij f(a(j)) f(a(i)) ≅ Σ_{i=1}^M Σ_{j=1}^M w_ij κσ(a(j) − a(i)), (E–26)

T_ax = Σ_{i=1}^M Σ_{j=1}^M w_ij f(x_k(j)) f(a(i)) ≅ Σ_{i=1}^M Σ_{j=1}^M w_ij κσ(x_k(j) − a(i)), (E–27)

and w_ij is the (i, j)th element of V_X^{−1}, x_k(i) is the ith element of the received
signal at the kth snapshot, and a(i) is the ith element of the steering vector.
The final output expressions in (E–26) and (E–27) are obtained by approximating
f(a(j))f(a(i)) and f(xk(j))f(a(i)) by κσ(a(j) − a(i)) and κσ(xk(j) − a(i)), respectively,
which is similar to the kernel trick and holds on average because of property 5.
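The chain (E–21)–(E–27) can be sketched numerically as follows. This is a toy real-valued snapshot with our own function names; an actual array simulation would use complex baseband data and a true steering vector:

```python
import numpy as np

def correntropy_matrix(x, sigma):
    """Toeplitz correntropy matrix (E-20) with entries
    v(l) = sum_n kappa_sigma(x(n) - x(n+l)) as in (E-21)."""
    M = len(x)
    v = np.zeros(M)
    for l in range(M):
        diffs = x[: M - l] - x[l:]             # x(n) - x(n+l) for valid n
        v[l] = np.sum(np.exp(-diffs**2 / (2 * sigma**2)))
    idx = np.abs(np.arange(M)[:, None] - np.arange(M)[None, :])
    return v[idx]                               # Toeplitz: entry (i,j) = v(|i-j|)

def correntropy_beamformer_output(a, xk, V, sigma):
    """Output (E-25) as T_ax / T_a, with w_ij the entries of V^-1 and
    f(u) f(v) approximated by kappa_sigma(u - v), per (E-26)-(E-27)."""
    W = np.linalg.inv(V)
    Ka = np.exp(-(a[None, :] - a[:, None])**2 / (2 * sigma**2))     # kappa(a(j)-a(i))
    Kax = np.exp(-(xk[None, :] - a[:, None])**2 / (2 * sigma**2))   # kappa(xk(j)-a(i))
    T_a = np.sum(W * Ka)                        # (E-26)
    T_ax = np.sum(W * Kax)                      # (E-27)
    return T_ax / T_a

rng = np.random.default_rng(0)
M, sigma = 8, 1.0
a = rng.normal(size=M)                          # stand-in steering vector
xk = a + 0.1 * rng.normal(size=M)               # snapshot near the steering direction
V = correntropy_matrix(xk, sigma)
print(correntropy_beamformer_output(a, xk, V, sigma))
```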
E.4 Simulations
In these simulations, we present comparison results for the Capon, kernel, and
correntropy beamformers in wireless communications with multiple receiving antennas.
In all of the experiments, we assume a uniform linear array with M = 25 sensor elements
and half-wavelength array spacing. Note that as the number of elements increases, the
sidelobes become smaller; also, as the total width of the array increases, the central beam
becomes narrower. For the source scenario, we assume that narrowband signals arrive
from the far field and that the target of interest is located at angle θ = 45◦. We use BPSK
(binary phase shift keying) signaling, which has unit power and is uncorrelated. To
make the results independent of the particular input and noise realizations, we perform
Monte-Carlo simulations with 100 different inputs and noises.
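The simulation setup above can be sketched as follows; this is our own illustrative reconstruction (function names and the random-seed convention are assumptions), not the dissertation's code.

```python
import numpy as np

def steering_vector(theta_deg, M):
    """Steering vector of a half-wavelength-spaced uniform linear array.

    With element spacing d = lambda/2, the inter-element phase step for
    arrival angle theta is pi * sin(theta).
    """
    theta = np.deg2rad(theta_deg)
    return np.exp(1j * np.pi * np.sin(theta) * np.arange(M))

def bpsk_snapshots(M=25, N=100, theta_deg=45.0, snr_db=10.0, seed=0):
    """Generate N snapshots of a unit-power BPSK source at theta_deg in
    spatially white Gaussian noise, matching the M = 25 sensor setup."""
    rng = np.random.default_rng(seed)
    a = steering_vector(theta_deg, M)
    symbols = rng.choice([-1.0, 1.0], size=N)          # unit-power BPSK
    noise_var = 10.0 ** (-snr_db / 10.0)               # SNR relative to unit signal power
    noise = np.sqrt(noise_var / 2.0) * (rng.standard_normal((M, N))
                                        + 1j * rng.standard_normal((M, N)))
    X = np.outer(a, symbols) + noise                   # M x N received data matrix
    return X, symbols, a
```

Running Monte-Carlo trials then amounts to calling `bpsk_snapshots` with different seeds, which mirrors the 100 input/noise realizations described above.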
In the first experiment, we investigate the effect of the number of snapshots
(N) in the spatially white Gaussian noise case. Figure E-1 shows the beampatterns of the
Capon, kernel, and correntropy beamformers with N = 100 and 1000 when the
signal-to-noise ratio (SNR) is 10 dB. The Capon beamformer performs poorly,
i.e., it has higher sidelobes for small N, while the kernel and correntropy
beamformers show good beampatterns even with a small number of snapshots. It is well known that
one of the problems of the standard Capon beamformer is its poor performance with a small amount
of training data. In Figure E-2, we show the BER performance with N = 100 and
1000 for SNRs between 5 and 15 dB. Figure E-2(a) shows
that for N = 100 the Capon beamformer exhibits a high BER floor, whereas the proposed
beamformer has much better BER performance than the Capon and kernel beamformers.
For N = 1000 in Figure E-2(b), the Capon beamformer shows better BER performance
than the other two methods when the SNR is below 9 dB, but as the
SNR increases, the BER of the correntropy beamformer becomes the best.
Next, we test the robustness of the Capon, kernel, and correntropy beamformers to
impulsive noise with N = 1000. We select the dispersion γ such that the SNR is 10 dB for α-stable
noise with α = 2 and δ = 0 (Gaussian noise). Figure E-3 shows the BER performance of the
three beamformers at different α levels. The correntropy beamformer displays superior
performance for decreasing α, that is, for increasing strength of impulsiveness. From this
result, we can say that the proposed method is robust, in terms of BER, to impulsive
noise environments in wireless communications.
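The impulsive-noise experiment needs symmetric α-stable samples; a sketch using the Chambers-Mallows-Stuck generator (our choice of method; the dissertation does not specify how the noise was drawn, and this form assumes α ≠ 1):

```python
import numpy as np

def sas_noise(alpha, gamma, size, seed=0):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method.

    alpha in (0, 2] is the characteristic exponent (alpha = 2 recovers a
    Gaussian up to scale); gamma is the dispersion (scale) parameter.
    This form of the generator assumes alpha != 1.
    """
    rng = np.random.default_rng(seed)
    U = rng.uniform(-np.pi / 2.0, np.pi / 2.0, size)   # uniform phase
    W = rng.exponential(1.0, size)                     # unit exponential
    X = (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
         * (np.cos((1.0 - alpha) * U) / W) ** ((1.0 - alpha) / alpha))
    return gamma * X
```

Decreasing `alpha` below 2 produces the heavy-tailed, impulsive samples whose effect on BER is studied in Figure E-3.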
Figures E-4(a) and (b) show the beampatterns of the three beamformers at α = 1.5 and
α = 1.0, respectively. When α = 1.5 in Figure E-4(a), the beampattern of the Capon beamformer
is similar to that of the kernel beamformer, and the gain of its sidelobe is higher than that of the
correntropy beamformer by 2 dB. As α decreases, the gap in sidelobe gain between the Capon and
correntropy beamformers increases, as shown in Figure E-4(b).
One interesting result is that the BER performance of the kernel method is poor
in both the Gaussian and impulsive noise cases, even though it shows a good beampattern.
The output values of the kernel method are far from the original transmitted signals
±1; therefore, it yields poor BER performance. The kernel method shown in this
dissertation uses a constraint; however, the solution of the optimization problem lies in
an infinite-dimensional feature space, so additional regularization may be needed to keep
the output close to the original signal. One important difference from
conventional kernel methods, which normally yield an infinite-dimensional
feature space, is that the RKHS induced by correntropy (which we call VRKHS) has the same
dimension as the input space. In the beamforming problem, the weight vector w has M
degrees of freedom, and all the received data lie in the M-dimensional Euclidean space.
As derived above, all the transformed data belong to a different M-dimensional vector
space equipped with the inner product structure defined by correntropy. The goal
of the proposed beamformer is to find a template f_w in this VRKHS such that the cost
function is minimized subject to the constraint. Therefore, the number of degrees of freedom of this
optimization problem is still M, so the regularization that would be needed in traditional
kernel methods is not necessary here. Further work needs to be done regarding this point.
E.5 Conclusions
In this research, we have presented a correntropy-based nonlinear beamformer and
compared it with the Capon beamformer, which is a widely used linear technique, and the
kernel-based beamformer, which is a nonlinear technique. Simulation
results in BPSK wireless communications have shown that the correntropy
beamformer significantly outperforms the Capon and kernel beamformers in terms of
sidelobe suppression in beam shaping and a reduced bit error rate. The correntropy
beamformer also has a clear advantage over the Capon beamformer in those cases where small
data sets are available for training and where non-Gaussian noise is present. Compared to
the kernel beamformer, the correntropy beamformer is computationally much simpler: the
kernel method needs to compute the inverse of an N × N Gram matrix, whereas the
correntropy beamformer needs the inverse of an M × M correntropy matrix, where M ≪ N
(in this simulation, M = 25 and N = 1000). In addition, we hypothesize that in our
methodology, regularization is automatically achieved by the kernel through the
expected-value operator (which corresponds to a density matching step used to evaluate
correntropy).
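The complexity argument is easy to quantify: with cubic-cost matrix inversion, the M × M correntropy inverse is cheaper than the N × N Gram inverse by roughly (N/M)³, as this trivial sketch shows for the simulation values.

```python
# Rough complexity ratio from the conclusion: the kernel beamformer inverts
# an N x N Gram matrix while the correntropy beamformer inverts an M x M
# correntropy matrix.  With cubic-cost inversion and the simulation values
# M = 25, N = 1000, the costs differ by (N / M)^3.
M, N = 25, 1000
ratio = (N / M) ** 3
print(f"kernel / correntropy inversion cost ratio ~ {ratio:.0f}x")  # 64000x
```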
Figure E-1. Comparisons of the beampattern for three beamformers (Capon, Kernel, Correntropy; array beampattern in dB versus degree θ) in Gaussian noise with 10 dB of SNR. (a) N = 100. (b) N = 1000.
Figure E-2. Comparisons of BER for three beamformers (Capon, Kernel, Correntropy; BER versus SNR in dB) in Gaussian noise. (a) N = 100. (b) N = 1000.
Figure E-3. Comparisons of BER for three beamformers (Capon, Kernel, Correntropy; BER versus characteristic exponent α) with different characteristic exponent α levels.
Figure E-4. Comparisons of the beampattern for three beamformers (Capon, Kernel, Correntropy; array beampattern in dB versus degree θ) in non-Gaussian noise. (a) α = 1.5. (b) α = 1.0.
BIOGRAPHICAL SKETCH
Kyu-Hwa Jeong was born in June 1972 in Korea and received the M.S. degree
in electronics engineering from Yonsei University, Seoul, Korea, in 1997, where he
focused on adaptive filter theory and its applications to acoustic echo cancellation. From
1997 to 2003, he was a senior research engineer with the Digital Media Research Lab at LG
Electronics, Seoul, Korea, where he belonged to the optical storage group and mainly
participated in CD/DVD recorder projects. Since 2003, he has been pursuing the Ph.D. degree
with the Computational NeuroEngineering Lab in electrical and computer engineering,
University of Florida, Gainesville, FL. His research interests are in the fields of signal
processing, machine learning, and their applications to image pattern recognition.