
DuSK: A Dual Structure-preserving Kernel for Supervised Tensor Learning with Applications to Neuroimages

Lifang He∗  Xiangnan Kong†  Philip S. Yu‡  Ann B. Ragin§  Zhifeng Hao¶  Xiaowei Yang‖

Abstract

With advances in data collection technologies, tensor data is assuming increasing prominence in many applications, and the problem of supervised tensor learning has emerged as a topic of critical significance in the data mining and machine learning community. Conventional methods for supervised tensor learning mainly focus on learning kernels by flattening the tensor into vectors or matrices; however, structural information within the tensors is then lost. In this paper, we introduce a new scheme for designing structure-preserving kernels for supervised tensor learning. Specifically, we demonstrate how to leverage the naturally available structure within the tensorial representation to encode prior knowledge in the kernel. We propose a tensor kernel that can preserve tensor structures based upon a dual-tensorial mapping. The dual-tensorial mapping function maps each tensor instance in the input space to another tensor in the feature space while preserving the tensorial structure. Theoretically, our approach is an extension of conventional kernels in the vector space to the tensor space. We apply our novel kernel in conjunction with SVM to real-world tensor classification problems, including brain fMRI classification for three different diseases (i.e., Alzheimer's disease, ADHD, and brain damage by HIV). Extensive empirical studies demonstrate that our proposed approach can effectively boost tensor classification performance, particularly with small sample sizes.

∗ Computer Science and Engineering, South China University of Technology, China. [email protected]
† Computer Science Department, University of Illinois at Chicago, USA. [email protected]
‡ Computer Science Department, University of Illinois at Chicago, USA. [email protected]
§ Department of Radiology, Northwestern University, USA. [email protected]
¶ Faculty of Computer, Guangdong University of Technology, China. [email protected]
‖ School of Sciences, South China University of Technology, China. [email protected]

1 Introduction

Supervised learning is one of the most fundamental data mining tasks. Conventional approaches to supervised learning usually assume, explicitly or implicitly, that data instances are represented as feature vectors. However, in many real-world applications, data instances are more naturally represented as second-order (matrices) or higher-order tensors, where the order of a tensor corresponds to the number of modes or ways. For example, in computer vision, a grey-level image is inherently a 2-D object, which can be represented as a second-order tensor with the column and row modes [21]; in medical neuroimaging, an MRI (Magnetic Resonance Imaging) image is naturally a third-order tensor consisting of 3-D voxels [3]. Supervised learning on this type of data is called supervised tensor learning, where each instance in the input space is represented as a tensor. With the rapid proliferation of tensor data, supervised tensor learning has drawn significant attention in recent years in the machine learning and data mining communities.

A straightforward solution to supervised tensor learning is to convert the input tensors into feature vectors, and feed the feature vectors to a conventional supervised learning algorithm. However, tensor objects are commonly specified in a high-dimensional space. For example, a typical MRI image of size 256 × 256 × 256 voxels contains 16,777,216 features [23]. This makes traditional methods prone to overfitting, especially for small sample size problems [4]. On the other hand, tensorial representations retain the information about the structure of the high-dimensional space the data lie in, such as the spatial arrangement of the voxel-based features in a 3-D image. When converting tensors into vectors, such important structural information is lost. In particular, the entries of a tensor object are often highly correlated with surrounding entries. For example, in MRI image data, adjacent voxels usually exhibit similar patterns, which means that the source images contain redundant information at the voxel level. It is believed by many researchers that potentially more compact and useful representations can be extracted from the original tensor data, resulting in more accurate and interpretable models.



Therefore, supervised learning algorithms that operate directly on tensors rather than on their vectorized versions are highly desirable.

Formally, a major difficulty in supervised tensor learning is how to build predictive models that can leverage the naturally available structure of tensor data to facilitate the learning process. In the literature, several solutions have been proposed. Previous work on supervised tensor learning mainly focuses on linear models [1, 5, 6, 19, 23], which assume, explicitly or implicitly, that the data are linearly separable in the input space. However, in practice this assumption is often violated and linear decision boundaries do not adequately separate the classes. Recently, several approaches have tried to exploit the tensor structure with nonlinear kernel models [16, 17, 22], which first unfold the tensor along each of its modes, and then use these unfolded matrices to construct nonlinear kernels for supervised tensor learning, as shown in Figure 1(b). However, these methods can only capture the relationships within each single mode of the tensor data, because the structural information about inter-mode relationships of tensor data is lost in the unfolding procedures.

In this paper, we study the problem of supervised tensor learning with nonlinear kernels which can adequately preserve and utilize the structure of the tensor data. The major research challenges of supervised tensor learning with structure-preserving kernels can be summarized as follows:
• High-dimensional tensors: One fundamental problem in supervised tensor learning lies in the intrinsic high dimensionality of tensor objects. Traditional supervised learning algorithms assume that the instances are represented as vectors. However, in the context of tensors, each data object is usually not represented as a vector but as a high-dimensional multi-mode (also known as multi-way) array. If we reshape the tensor into a vector, the number of features is extremely high. Both the computability and the theoretical guarantees of traditional models are compromised by this ultra-high dimensionality.
• Complex tensor structure: Another fundamental problem in supervised tensor learning lies in the complex structure of tensors. Conventional tensor-based kernel approaches focus on unfolding tensor data into matrices [16, 17, 22], which can only preserve the one-way relationships within the tensor data. However, in many real-world applications, the tensor data have multi-way structures. Such prior knowledge about multi-way relationships among features should be incorporated to build more accurate and interpretable models, especially in the case of high-dimensional tensor data with small sample sizes.

Figure 1: Schematic view of the key differences among the three kernel learning schemes. The standard kernel (a) works on the vectorized representation; the conventional tensor-based kernel (b) applies tensor-to-matrix alignment first, which may lead to loss of structural information; our method (c) works on the tensor representation directly.

• Nonlinear separability: In real-world applications, the data is usually not linearly separable in the input space. Conventional supervised tensor learning methods that can preserve tensor structures are often based upon linear models. Thus these methods cannot efficiently solve nonlinear learning problems on tensor data.

In this paper, we propose a novel approach to supervised tensor learning, called DuSK (Dual Structure-preserving Kernels). Our framework is illustrated in Figure 1(c). Different from conventional methods, our approach is based upon kernel methods and tensor factorization techniques that can fully capture the multi-way structure of tensor data. We first extract a more compact and informative representation from the original data using a tensor factorization method, i.e., CANDECOMP/PARAFAC (CP) [10]. Then we define a structure-preserving feature mapping to derive the DuSK kernels in the tensor product feature space, which are used in conjunction with kernel machines to solve supervised tensor learning problems.


Empirical studies on real-world tasks (classifying fMRI images of different brain diseases, i.e., Alzheimer's disease, ADHD, and HIV) demonstrate that the proposed approach can significantly boost classification performance on tensor datasets.

2 PRELIMINARIES

Before presenting our approach, we introduce some related concepts and notation for tensors. Table 1 lists the basic symbols used in this study. We first give a formal mathematical definition of a tensor, which provides an intuitive understanding of its algebraic structure, namely that a tensor object has a tensor product structure.

Definition 1. (Tensor) An Nth-order tensor is an element of the tensor product of N vector spaces, each of which has its own coordinate system.

We use A = (a_{i1,i2,...,iN}) ∈ R^{I1×I2×···×IN} to denote an Nth-order tensor A. For n = 1, 2, ..., N, I_n is the dimension of A along the n-th mode. Based on the above definition, we define the inner product, tensor norm, tensor product, and rank of a tensor, and give the CP model as follows:

Definition 2. (Inner product) The inner product of two same-sized tensors A, B ∈ R^{I1×I2×···×IN} is defined as the sum of the products of their entries:

(2.1)   $\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} a_{i_1,i_2,\dots,i_N} \, b_{i_1,i_2,\dots,i_N}.$

Definition 3. (Tensor norm) The norm of a tensor A is defined to be the square root of the sum of all squared entries of the tensor, i.e.,

(2.2)   $\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle} = \sqrt{\sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} a_{i_1,i_2,\dots,i_N}^2}.$

As we can see, the norm of a tensor is a straightforward generalization of the usual Frobenius norm for matrices and of the ℓ2 norm for vectors.
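As a quick sanity check, the following minimal numpy sketch (illustrative only, not part of the paper) evaluates Eq. (2.1) and Eq. (2.2) on two random third-order tensors:

```python
import numpy as np

# Two same-sized third-order tensors, analogous to A and B above.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((3, 4, 5))

# Inner product (Eq. 2.1): sum of entrywise products over all modes.
inner = np.sum(A * B)

# Tensor norm (Eq. 2.2): square root of the inner product of A with itself.
norm_A = np.sqrt(np.sum(A * A))

# np.linalg.norm flattens the array, so it returns the same Frobenius-style value.
print(inner, norm_A, np.linalg.norm(A))
```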

Definition 4. (Tensor product) The tensor product A ⊗ B of tensors A ∈ R^{I1×I2×···×IN} and B ∈ R^{I'1×I'2×···×I'M} is defined by

(2.3)   $(\mathcal{A} \otimes \mathcal{B})_{i_1,i_2,\dots,i_N,i'_1,i'_2,\dots,i'_M} = a_{i_1,i_2,\dots,i_N} \, b_{i'_1,i'_2,\dots,i'_M}$

for all values of the indices.

Table 1: List of symbols

Symbol     Definition and Description
s          a lower-case letter represents a scalar
v          a boldface lowercase letter represents a vector
M          a boldface capital letter represents a matrix
T          a calligraphic letter represents a tensor
G          a Gothic letter represents a general set or space
⊗          denotes the tensor product
〈·, ·〉     denotes the inner product in some feature space
R          = Rank(A), the rank of tensor A
φ(·)       denotes the feature mapping
κ(·, ·)    represents a kernel function

It is worth mentioning that a rank-one tensor is, analogously to the matrix case, a tensor that is a tensor product of vectors (an Nth-order rank-one tensor requires N vectors). Additionally, notice that for rank-one tensors A = a^(1) ⊗ a^(2) ⊗ ··· ⊗ a^(N) and B = b^(1) ⊗ b^(2) ⊗ ··· ⊗ b^(N), it holds that

(2.4)   $\langle \mathcal{A}, \mathcal{B} \rangle = \langle \mathbf{a}^{(1)}, \mathbf{b}^{(1)} \rangle \langle \mathbf{a}^{(2)}, \mathbf{b}^{(2)} \rangle \cdots \langle \mathbf{a}^{(N)}, \mathbf{b}^{(N)} \rangle.$
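This factorization of the inner product is what DuSK later exploits. A small numpy sketch (illustrative, not from the paper) builds two rank-one tensors from factor vectors and checks Eq. (2.4) against the full inner product:

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(1)
# Factor vectors of two third-order rank-one tensors (modes of size 3, 4, 5).
a = [rng.standard_normal(d) for d in (3, 4, 5)]
b = [rng.standard_normal(d) for d in (3, 4, 5)]

def rank_one(vectors):
    """Build a^(1) (x) a^(2) (x) ... (x) a^(N) by successive outer products."""
    return reduce(lambda T, v: np.multiply.outer(T, v), vectors[1:], vectors[0])

A, B = rank_one(a), rank_one(b)

lhs = np.sum(A * B)                               # full inner product, Eq. (2.1)
rhs = np.prod([ai @ bi for ai, bi in zip(a, b)])  # product of vector inner products, Eq. (2.4)
assert np.isclose(lhs, rhs)
```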

Definition 5. (Tensor rank) The rank of a tensor A is the minimum number of rank-one tensors needed to express A exactly.

Definition 6. (CP factorization) Given a tensor A ∈ R^{I1×I2×···×IN} and an integer R, if it can be expressed as

(2.5)   $\mathcal{A} = \sum_{r=1}^{R} \mathbf{a}_r^{(1)} \otimes \mathbf{a}_r^{(2)} \otimes \cdots \otimes \mathbf{a}_r^{(N)},$

we call it a CP factorization (see Figure 2 for a graphical representation). For convenience, in the following we write $\prod_{n=1}^{N} \otimes\, \mathbf{a}^{(n)}$ for $\mathbf{a}^{(1)} \otimes \mathbf{a}^{(2)} \otimes \cdots \otimes \mathbf{a}^{(N)}$.
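In practice, an (approximate) CP factorization of a data tensor is computed numerically. A minimal sketch, assuming the third-party tensorly library (not used in the paper itself, which relies on the enhanced line search scheme of [13]):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# A small third-order tensor standing in for an fMRI-like volume, and a chosen rank R.
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 24, 20))
R = 5

# CP factorization (Eq. 2.5): X is approximated by a sum of R rank-one tensors;
# factors[n] has shape (I_n, R), and column r of factor n is the vector a_r^(n).
weights, factors = parafac(tl.tensor(X), rank=R)

# Reconstruct the rank-R approximation and inspect the relative error.
X_hat = tl.cp_to_tensor((weights, factors))
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```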

3 APPROACH

In this section, we first formulate the problem of tensor-based kernel learning and then elaborate on our DuSK. For the sake of brevity, hereafter we restrict our discussion to classification problems.

3.1 Problem statement Consider a training set of M pairs of samples {Xi, yi}, i = 1, ..., M, for a binary tensor classification problem, where Xi ∈ R^{I1×I2×···×IN} is the i-th input sample and yi ∈ {−1,+1} is the corresponding class label of Xi. In [6], it was noted that the problem of tensor classification can be stated as a convex quadratic optimization problem in the framework of the standard linear SVM. Based on this result, we show how it can be modeled as a kernel learning problem.



Figure 2: CP factorization of a third-order tensor

Suppose we are given the optimization problem of linear tensor classification as

(3.6)   $\min_{\mathcal{W}, b, \xi} \ \frac{1}{2}\|\mathcal{W}\|_F^2 + C \sum_{i=1}^{M} \xi_i,$

(3.7)   $\text{s.t.} \ \ y_i \left( \langle \mathcal{W}, \mathcal{X}_i \rangle + b \right) \ge 1 - \xi_i,$

(3.8)   $\xi_i \ge 0, \ \forall i = 1, \dots, M,$

where W is the weight tensor of the separating hyperplane, b is the bias, ξi is the error of the i-th training sample, and C is the trade-off between the classification margin and the misclassification error.

Obviously, the optimization problem in (3.6)-(3.8) is the generalization of the standard linear SVM to tensor patterns in tensor space. When the input samples Xi are vectors, it degenerates into the standard linear SVM. As such, based on the kernel method for extending the linear SVM to the nonlinear case, namely by introducing a nonlinear feature mapping φ : x → φ(x) ∈ H ⊂ R^H, we develop a nonlinear extension of (3.6)-(3.8) in the following, which is critical for deriving the model for tensor-based kernel learning.

Given a tensor X ∈ R^{I1×I2×···×IN}, we assume it is mapped into the Hilbert space H by

(3.9)   $\phi : \mathcal{X} \to \phi(\mathcal{X}) \in \mathbb{R}^{H_1 \times H_2 \times \cdots \times H_P}.$

Note that the projected tensor φ(X) in the space H may have a different order from X, and the dimension of each mode may be higher, or even infinite, depending on the feature mapping function φ(·). Such a Hilbert space is called a high-dimensional tensor feature space, or simply a tensor feature space. Following the same principle as the construction of the linear classification model in the original tensor space, we construct the following model in this space:

(3.10)   $\min_{\mathcal{W}, b, \xi} \ \frac{1}{2}\|\mathcal{W}\|_F^2 + C \sum_{i=1}^{M} \xi_i,$

(3.11)   $\text{s.t.} \ \ y_i \left( \langle \mathcal{W}, \phi(\mathcal{X}_i) \rangle + b \right) \ge 1 - \xi_i,$

(3.12)   $\xi_i \ge 0, \ \forall i = 1, \dots, M.$

From the viewpoint of the high-dimensional tensor feature space, this model is a linear model. However, from the viewpoint of the original tensor space, it is a nonlinear model. When the input samples Xi are vectors, it degenerates into the standard nonlinear SVM. When the feature mapping function φ(·) is the identity map, i.e., φ(X) = X, it is the same as (3.6)-(3.8). Thus, we say that the optimization model (3.10)-(3.12) is the nonlinear counterpart of (3.6)-(3.8).

Let us now show how this model can be exploited to obtain the tensor-based kernel optimization model. Using the Lagrangian relaxation method [2], it is easy to check that the dual problem of (3.10)-(3.12) is

(3.13)   $\max_{\alpha_1, \alpha_2, \dots, \alpha_M} \ \sum_{i=1}^{M} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j \langle \phi(\mathcal{X}_i), \phi(\mathcal{X}_j) \rangle,$

(3.14)   $\text{s.t.} \ \ \sum_{i=1}^{M} \alpha_i y_i = 0,$

(3.15)   $0 \le \alpha_i \le C, \ \forall i = 1, \dots, M,$

where αi are the Lagrange multipliers and 〈φ(Xi), φ(Xj)〉 is the inner product between the mapped tensors of Xi and Xj in the tensor feature space.

The advantage of formulation (3.13)-(3.15) over (3.10)-(3.12) is that the training data appear only in the form of inner products. Based on the fundamental principle of kernel methods, by substituting the inner product 〈φ(Xi), φ(Xj)〉 with a suitable tensor kernel function κ(Xi, Xj), we thus get the tensor-based kernel model. The resulting decision function is

(3.16)   $f(\mathcal{X}) = \operatorname{sign} \left( \sum_{i=1}^{M} \alpha_i y_i \, \kappa(\mathcal{X}_i, \mathcal{X}) + b \right).$
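Once a tensor kernel κ(·, ·) is available, it can be plugged into any standard kernel machine through a precomputed Gram matrix. A minimal sketch, assuming scikit-learn and a placeholder tensor_kernel function (hypothetical; to be replaced by DuSK below):

```python
import numpy as np
from sklearn.svm import SVC

def tensor_kernel(Xi, Xj):
    # Placeholder for a tensor kernel kappa(Xi, Xj); here simply the inner product (Eq. 2.1).
    return float(np.sum(Xi * Xj))

def gram_matrix(tensors_a, tensors_b):
    """Gram matrix K[i, j] = kappa(tensors_a[i], tensors_b[j])."""
    return np.array([[tensor_kernel(A, B) for B in tensors_b] for A in tensors_a])

# Toy data: lists of equally sized third-order tensors with +/-1 labels.
rng = np.random.default_rng(3)
X_train = [rng.standard_normal((6, 7, 6)) for _ in range(20)]
y_train = np.tile([-1, 1], 10)
X_test = [rng.standard_normal((6, 7, 6)) for _ in range(5)]

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(gram_matrix(X_train, X_train), y_train)      # (M x M) kernel between training tensors
y_pred = clf.predict(gram_matrix(X_test, X_train))   # (n_test x M) kernel against training tensors
```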

3.2 DuSK From the above statement, we can see that tensor-based kernel learning reduces to the study of the kernel function, and the success of kernel methods depends strongly on the data representation encoded into the kernel function. We now propose DuSK. Our target is to leverage the naturally available structure of the tensor to facilitate kernel learning.

Tensors provide a natural and efficient representation for multi-way data, but there is no guarantee that such a representation will be good for kernel learning, since learning will only be successful if the regularities that underlie the data can be discerned by the kernel. From the previous analysis of the characteristics of tensor objects, we know that the essential information in the tensor is embedded in its multi-way structure. Thus, one important aspect of kernel learning for such complex objects is to represent them by sets of key structural features that are easier to manipulate, and to design kernels on such sets.

According to the mathematical definition of a tensor, we can gain a further understanding of its structure, namely that a tensor object has a tensor product structure.


Figure 3: Dual-tensorial mapping

In previous work, it was found that CP factorization is particularly effective for extracting this structure. Motivated by these observations, we investigate how to exploit the benefits of CP factorization to learn a structure-preserving kernel in the tensor product feature space. More specifically, we represent each tensor object as a sum of rank-one tensors in the original space and map them into the tensor product feature space for our kernel learning. In the following, we illustrate how to design the feature mapping.

We start by defining the following mapping on a rank-one tensor $\prod_{n=1}^{N} \otimes\, \mathbf{x}^{(n)} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$:

(3.17)   $\phi : \prod_{n=1}^{N} \otimes\, \mathbf{x}^{(n)} \to \prod_{n=1}^{N} \otimes\, \phi\!\left(\mathbf{x}^{(n)}\right) \in \mathbb{R}^{H_1 \times H_2 \times \cdots \times H_N}.$

Let the CP factorizations of X, Y ∈ R^{I1×I2×···×IN} be $\mathcal{X} = \sum_{r=1}^{R} \prod_{n=1}^{N} \otimes\, \mathbf{x}_r^{(n)}$ and $\mathcal{Y} = \sum_{r=1}^{R} \prod_{n=1}^{N} \otimes\, \mathbf{y}_r^{(n)}$, respectively. By using the concept of the kernel function, we see that the kernel can be defined directly as an inner product in the feature space. Thus, when R = 1, based on the above mapping and Eq. (2.4), we can directly derive the naive tensor product kernel, i.e.,

(3.18)   $\kappa(\mathcal{X}, \mathcal{Y}) = \prod_{n=1}^{N} \kappa\!\left(\mathbf{x}^{(n)}, \mathbf{y}^{(n)}\right).$

Despite this, many authors have demonstrated that a simple rank-one tensor cannot provide a compact and informative representation of the original data [24]. The key point is how to design the feature mapping when the value of R is greater than one.

Based on the definition of the kernel function, it is easy to see that the feature space is a high-dimensional counterpart of the original space, equipped with the same operations. Thus, we can factorize tensor data directly in the feature space just as in the original space. This is formally equivalent to performing the following mapping:

(3.19)   $\phi : \sum_{r=1}^{R} \prod_{n=1}^{N} \otimes\, \mathbf{x}_r^{(n)} \to \sum_{r=1}^{R} \prod_{n=1}^{N} \otimes\, \phi\!\left(\mathbf{x}_r^{(n)}\right).$

In this sense, it corresponds to mapping tensors into high-dimensional tensors that retain the original structure. More precisely, it can be regarded as mapping the original data into the tensor feature space and then conducting the CP factorization in the feature space. We call it the dual-tensorial mapping function (see Figure 3).

After mapping the CP factorization of the data into the tensor product feature space, the kernel itself is just the standard inner product of tensors in that feature space. Thus, we derive our DuSK:

(3.20)   $\kappa\!\left( \sum_{r=1}^{R} \prod_{n=1}^{N} \otimes\, \mathbf{x}_r^{(n)}, \ \sum_{r=1}^{R} \prod_{n=1}^{N} \otimes\, \mathbf{y}_r^{(n)} \right) = \sum_{i=1}^{R} \sum_{j=1}^{R} \prod_{n=1}^{N} \kappa\!\left(\mathbf{x}_i^{(n)}, \mathbf{y}_j^{(n)}\right).$

From its derivation, we know that such a kernel can take the multi-way structure of the data flexibly into account. In general, DuSK is an extension of conventional kernels in the vector space to the tensor space, and each vector kernel can be used in this framework for supervised tensor learning in conjunction with kernel machines.
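A direct transcription of Eq. (3.20) into code, a sketch under the assumption that each tensor is supplied through its CP factor matrices (names chosen for illustration only):

```python
import numpy as np

def dusk_kernel(x_factors, y_factors, vec_kernel):
    """DuSK (Eq. 3.20) between two tensors given by their CP factors.

    x_factors, y_factors: lists of N factor matrices; factor n has shape (I_n, R)
    and its column r is the vector x_r^(n) (resp. y_r^(n)).
    vec_kernel: any vector kernel kappa(u, v) -> float.
    """
    R = x_factors[0].shape[1]
    value = 0.0
    for i in range(R):
        for j in range(R):
            prod = 1.0
            for Xn, Yn in zip(x_factors, y_factors):
                prod *= vec_kernel(Xn[:, i], Yn[:, j])   # kappa(x_i^(n), y_j^(n))
            value += prod
    return value

# Example with a plain linear vector kernel; for R = 1 this reduces to Eq. (3.18).
rng = np.random.default_rng(4)
xf = [rng.standard_normal((d, 3)) for d in (6, 7, 6)]
yf = [rng.standard_normal((d, 3)) for d in (6, 7, 6)]
print(dusk_kernel(xf, yf, vec_kernel=np.dot))
```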

3.3 Efficiency We consider the case of the Gaussian RBF kernel in our framework, which is one of the most popular kernels and has proven successful in many different contexts. Assume that a set of tensor data {(Xi, yi)}, i = 1, ..., M, is given, where Xi ∈ R^{I1×I2×···×IN}. The time complexity of computing a Gaussian RBF kernel matrix is $O\!\left(M^2 \prod_{n=1}^{N} I_n\right)$, while for our method DuSK it is $O\!\left(M^2 R^2 \sum_{n=1}^{N} I_n\right)$. Tensor data are typically very high dimensional while R is often very small, which indicates that our proposed method is significantly more efficient than its vector counterpart. It is also worth mentioning that our method depends on a CP factorization technique, but this is backed by a fast implementation [13]. The storage complexity is reduced from $O\!\left(M \prod_{n=1}^{N} I_n\right)$ to $O\!\left(M \sum_{n=1}^{N} I_n\right)$, where the data is compressed without quality loss and can be recovered quickly.
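As a rough worked example, for the 61 × 73 × 61 fMRI volumes used in Section 4 and a rank of R = 5 (a value assumed here purely for illustration), the time cost scales as

\[
M^2 \prod_{n=1}^{3} I_n = 271{,}633\, M^2
\qquad\text{versus}\qquad
M^2 R^2 \sum_{n=1}^{3} I_n = 25 \times 195 \times M^2 = 4{,}875\, M^2,
\]

i.e., roughly a 55-fold reduction in the cost of filling the kernel matrix.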


Figure 4: (a) An illustration of a third-order tensor (fMRI image); (b) a visualization of an fMRI image.

Furthermore, since the constituent kernels are Gaussian RBF kernels, we can reformulate Eq. (3.20) as

(3.21)   $\kappa(\mathcal{X}, \mathcal{Y}) = \sum_{i=1}^{R} \sum_{j=1}^{R} \prod_{n=1}^{N} \kappa\!\left(\mathbf{x}_i^{(n)}, \mathbf{y}_j^{(n)}\right) = \sum_{i=1}^{R} \sum_{j=1}^{R} \exp\!\left( -\sigma \sum_{n=1}^{N} \left\| \mathbf{x}_i^{(n)} - \mathbf{y}_j^{(n)} \right\|^2 \right),$

where σ is used to set an appropriate bandwidth. We denote this kernel as DuSKRBF.
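A sketch of this Gaussian RBF instantiation, using the same hypothetical factor-matrix convention as the dusk_kernel sketch above:

```python
import numpy as np

def dusk_rbf(x_factors, y_factors, sigma):
    """DuSK with Gaussian RBF constituent kernels (Eq. 3.21).

    The squared distances over the N modes are summed inside a single exponential,
    so each of the R^2 terms costs O(sum_n I_n) rather than O(prod_n I_n).
    """
    R = x_factors[0].shape[1]
    value = 0.0
    for i in range(R):
        for j in range(R):
            sq_dist = sum(np.sum((Xn[:, i] - Yn[:, j]) ** 2)
                          for Xn, Yn in zip(x_factors, y_factors))
            value += np.exp(-sigma * sq_dist)
    return value

# Factors of two rank-5 CP approximations of 61 x 73 x 61 volumes (random stand-ins).
rng = np.random.default_rng(5)
xf = [rng.standard_normal((d, 5)) for d in (61, 73, 61)]
yf = [rng.standard_normal((d, 5)) for d in (61, 73, 61)]
print(dusk_rbf(xf, yf, sigma=2.0 ** -4))
```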

4 Experimental Evaluation

In this study, we validate the effectiveness of the DuSKRBF kernel within the standard SVM framework for tensor classification, to which we refer simply as DuSKRBF. As an application, we consider an example of neuroimaging mining.

4.1 Data collection We use three real-world fMRI datasets in our experimental evaluation.
• Alzheimer's Disease (ADNI): The first dataset is collected from the Alzheimer's Disease Neuroimaging Initiative1. The dataset consists of records of patients with Alzheimer's Disease (AD) and Mild Cognitive Impairment (MCI). We downloaded all records of resting-state fMRI images and applied the SPM8 toolbox2 to preprocess the data. We deleted the first ten volumes for each individual; the functional images were realigned to the first volume, slice-timing corrected, normalized to the MNI template, and spatially smoothed with an 8-mm FWHM Gaussian kernel. The Resting-State fMRI Data Analysis Toolkit (REST3) was then used to remove the linear trend of the time series and to apply temporal band-pass filtering (0.01-0.08 Hz). The average value of each subject over the time series was calculated within each of those boxes, resulting in 33 samples with a total of 61 × 73 × 61 = 271,633 voxels (or features) each. We treat the normal brains as the negative class and AD+MCI as the positive class.

1 http://adni.loni.usc.edu/
2 http://www.fil.ion.ucl.ac.uk/spm/software/spm8/
3 http://resting-fmri.sourceforge.net

Each individual is linearly rescaled to [0, 1]. Feature normalization is an important procedure, since the brain of every individual is different.
• Attention Deficit Hyperactivity Disorder (ADHD): The second dataset is collected from the ADHD-200 global competition dataset4. The dataset contains records of resting-state fMRI images for 776 subjects with 58 × 49 × 47 = 133,574 voxels, which are labeled as real patients (positive) and normal controls (negative). Since the original dataset is unbalanced, we randomly sampled 100 ADHD patients and 100 normal controls from the dataset for performance evaluation, and averaging over the time series was conducted. This dataset is quite special in that all algorithms perform badly with normalization, so we use the non-normalized dataset.
• Human Immunodeficiency Virus Infection (HIV): The third dataset is collected from the Department of Radiology at Northwestern University [20]. The dataset contains fMRI brain images of patients with early HIV infection (positive) as well as normal controls (negative). The same preprocessing steps as for the ADNI dataset were applied. It contains 83 samples with 61 × 73 × 61 = 271,633 voxels.

4.2 Baselines and Metrics In order to establish a comparative study, we use seven state-of-the-art methods as baselines, each representing a different strategy. We focus on the SVM classifier, since it has been proven successful in many applications.
• Gaussian-RBF: a Gaussian-RBF kernel-based SVM, which is the most widely used vector-based method for classification. In the following methods, if not stated explicitly, we use SVM with a Gaussian RBF kernel as the classifier.
• Factor kernel: a matrix-unfolding-based tensor kernel, recently proposed in [16], whose constituent kernels belong to the class of Gaussian RBF kernels.
• K3rd kernel: a class of vector-based tensor kernels that aim to represent the tensor in each vector space to capture structural information; they have been applied to analyze fMRI images in conjunction with a Gaussian RBF kernel [14].
• Linear SHTM: a linear support higher-order tensor machine [6], which is one of the most effective methods for tensor classification; it generalizes the linear SVM to tensor patterns using CP factorization and can be regarded as a special case of DuSK in which the constituent kernels are linear kernels. This baseline is used to test the ability of our proposed method to cope with complex (possibly nonlinear) structured data.
• Linear kernel: the linear SVM has also been increasingly used to handle fMRI data; in some cases, it outperforms SVM with nonlinear kernels.

4 http://neurobureau.projects.nitrc.org/ADHD200/


Table 2: Average classification accuracy comparison: mean (standard deviation).

Dataset   DuSKRBF       Gaussian RBF   Factor kernel   K3rd kernel   linear SHTM   linear SVM    PCA+SVM       MPCA+SVM
ADNI      0.75 (0.18)   0.49 (0.23)    0.51 (0.21)     0.55 (0.14)   0.52 (0.31)   0.42 (0.27)   0.50 (0.02)   0.51 (0.02)
ADHD      0.65 (0.01)   0.58 (0.00)    0.50 (0.00)     0.55 (0.00)   0.51 (0.03)   0.51 (0.01)   0.63 (0.01)   0.64 (0.01)
HIV       0.74 (0.00)   0.70 (0.00)    0.70 (0.01)     0.75 (0.02)   0.70 (0.01)   0.74 (0.01)   0.73 (0.25)   0.72 (0.02)


Figure 5: Test accuracy vs. R on (a) ADNI, (b) ADHD, and (c) HIV, where the red triangles indicate the peak positions.

• PCA+SVM: Principal component analysis (PCA) is a vector-based subspace learning algorithm, commonly used for dealing with high-dimensional data, in particular fMRI data.
• MPCA+SVM: Multilinear principal component analysis (MPCA) [12] is a natural extension of PCA to tensors, used to handle high-dimensional tensor data.

The first three baselines are used to show the improvement of our proposed method over current kernel approaches to tensor classification. The last two baselines are used to test the effectiveness of our proposed method compared to unsupervised methods for tensor classification.

The effectiveness of an algorithm is always evaluated by its test accuracy, which we use as the metric in our experiments. For our proposed method and linear SHTM, we choose the popular and widely used enhanced line search method [13] as the CP factorization strategy. All of the related methods select the optimal trade-off parameter from C ∈ {2^-5, 2^-4, ..., 2^9} and the kernel width parameter from σ ∈ {2^-4, 2^-3, ..., 2^9}. Considering the fact that there is no known closed-form solution for determining the rank R of a tensor a priori [9], and that rank determination of a tensor is still an open problem [18], in our method and linear SHTM we use grid search to determine the optimal rank and the optimal trade-off parameter together, where the rank R ∈ {1, 2, ..., 12}. The influence of different rank parameters on the classification performance of our method is also reported.
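A compact sketch of this model selection loop (assuming a cp_factors helper that returns the rank-R factor matrices of a tensor, e.g. via a CP solver, and the dusk_rbf function sketched earlier; both names are placeholders, and the cross-validation layout is an illustrative choice, not the paper's exact protocol):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

def dusk_gram(factors, sigma):
    """Square DuSK-RBF Gram matrix over a list of per-sample CP factorizations."""
    M = len(factors)
    K = np.zeros((M, M))
    for i in range(M):
        for j in range(i, M):
            K[i, j] = K[j, i] = dusk_rbf(factors[i], factors[j], sigma)
    return K

def select_parameters(tensors, y, ranks, Cs, sigmas, n_splits=5):
    """Grid search over (R, C, sigma) scored by cross-validated accuracy."""
    best_score, best_params = -np.inf, None
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for R in ranks:
        factors = [cp_factors(X, R) for X in tensors]       # factorize once per rank
        for sigma in sigmas:
            K = dusk_gram(factors, sigma)                    # one Gram matrix per (R, sigma)
            for C in Cs:
                scores = []
                for tr, te in cv.split(np.zeros(len(y)), y):
                    clf = SVC(kernel="precomputed", C=C)
                    clf.fit(K[np.ix_(tr, tr)], y[tr])
                    scores.append(clf.score(K[np.ix_(te, tr)], y[te]))
                if np.mean(scores) > best_score:
                    best_score, best_params = np.mean(scores), (R, C, sigma)
    return best_params, best_score

# Example parameter grids matching the ranges quoted above.
ranks = range(1, 13)
Cs = [2.0 ** k for k in range(-5, 10)]
sigmas = [2.0 ** k for k in range(-4, 10)]
```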

All the experiments are conducted on a computer with an Intel Core2 1.8 GHz processor and 3.5 GB of RAM, running Microsoft Windows XP.

4.3 Classification Performance In our experiments, we first randomly sample 80% of the whole data as the training set, and use the remaining samples as the test set. This random sampling experiment was repeated 50 times for all methods, and the average performance of each method is reported. Table 2 shows the average classification accuracy and standard deviation of the eight methods on the three datasets.

From the experimental results in Table 2, we can observe that the classification accuracy of each method varies considerably across datasets. However, the method that outperforms the others across the datasets is DuSKRBF, especially on the ADNI dataset. It is worth noting that in this neuroimaging task it is very hard for classification algorithms to achieve even moderate classification accuracy on the ADNI dataset, since the data is extremely high dimensional with a small sample size, yet we observe a roughly 20% gain over the comparison methods. Based on this result, we can conclude that operating on tensors is much more effective than operating on matrices and vectors for high-dimensional tensor data analysis.

So far we have demonstrated that our proposed method is effective for tensor classification. However, it is still interesting to show how the tensor structure of the data is actually used in our method. We focus on the ADNI dataset to conduct this analysis. Figure 6 shows a visualization of an original ADNI object and the reconstruction result from our chosen CP factorization. As illustrated, the CP factorization can fully capture the multi-way structure of the data, and our method thus takes this structure into account in the learning process.

4.4 Parameter Sensitivity Although the optimal rank parameter R, trade-off parameter C, and kernel width parameter σ are found by grid search in DuSKRBF, it is still important to examine the sensitivity of DuSKRBF to the rank parameter R. For this purpose, we conduct a sensitivity study over different R ∈ {1, 2, ..., 12} in this section, where the optimal trade-off parameter and kernel width parameter are still selected from C ∈ {2^-5, 2^-4, ..., 2^9} and σ ∈ {2^-4, 2^-3, ..., 2^9}, respectively.


Figure 6: (a) Visualization of an original ADNI object (a cross section is shown on the left and a 3D plot on the right); (b) reconstruction result from our chosen CP factorization.

According to the aforementioned analysis, we know that the efficiency of DuSKRBF decreases as R increases, because a higher value of R implies that more terms are included in the kernel computations. Thus, we only report the variation in test accuracy over different R on the three datasets. As shown in Figure 5, we can observe that the rank parameter R has a significant effect on the test accuracy and that the optimal value of R depends on the data; nevertheless, the optimal value of R lies in the range 2 ≤ R ≤ 5, which may provide good guidance for selecting R in advance.

In summary, the parameter sensitivity study indicates that the classification performance of DuSKRBF+SVM relies on the parameter R, and it is difficult to specify an optimal value for R in advance. However, in most cases the optimal value of R lies in a small range of values, as demonstrated in [6], and it is not time-consuming to find it using the grid search strategy in practical applications.

5 Related Work

From a conceptual perspective, two topics can be seen as closely related to our DuSK approach: supervised tensor learning and tensor factorization. This section gives a short overview of these areas and distinguishes DuSK from existing solutions.

Tensor factorizations: Tensor factorizations are higher-order extensions of matrix factorization that elicit intrinsic multi-way structures and capture the underlying patterns in tensor data. These techniques have been widely used in diverse disciplines to analyze and process tensor data. A thorough survey of these techniques and their applications can be found in [10]. The two most commonly used factorizations are CP and Tucker. CP is a special case of the Tucker decomposition which forces the core array to a (super)diagonal form; it is thus more condensed than Tucker. In the supervised tensor learning setting, CP is more frequently applied to explore tensor data because of its properties of uniqueness and simplicity [6, 8, 19, 23]. However, in these applications, CP factorization is used either for exploratory analysis or to deal with linear tensor-based models. In this study, we employ the CP factorization to foster the use of kernel methods for supervised tensor learning.

Supervised tensor learning: Supervised tensor learning has been extensively studied in recent years [1, 5, 11, 19, 23]. Most of the previous work has concentrated on learning linear tensor-based models, whereas the problem of how to build nonlinear models directly on tensor data has not been well studied. A first attempt in this direction focused on second-order tensors and led to a non-convex optimization problem [15]. Subsequently, the authors claimed that it can be extended to deal with higher-order tensors at the cost of a higher computational complexity, and proposed a factor kernel for tensors of arbitrary order, except for square matrices, based upon matrix unfoldings [16]. In the context of this proposal, Signoretto et al. [17] introduced a cumulant-based kernel approach for classification of multichannel signals. Zhao et al. [22] presented a kernel tensor partial least squares method for regression of limb movements. A drawback of the approaches in [16, 17, 22] is that they can only capture the one-way relationships within the tensor data, because the tensors are unfolded into matrices; the multi-way structures within tensor data are already lost before the kernel construction process. Different from these methods, we aim to directly exploit the algebraic structure of the tensor to study structure-preserving kernels.

Another recent work by Hardoon et al. [7], although it does not directly perform supervised tensor learning, is worth mentioning in this context. They introduced so-called tensor kernels to analyze neuroimaging data from multiple sources, and demonstrated that the tensor product feature space is useful for modeling interactions between feature sets in different domains. In this study, we make use of the tensor product feature space to derive our kernels via the incorporation of the CP model. Those tensor kernels can be cast as a special case of our framework.

6 Conclusion and Future Work

In this paper we have introduced a new tensor-based kernel methodology that operates directly on tensors. We have applied our method to the problem of fMRI classification. The results indicate that the prior structural information can indeed improve the classification performance, particularly with small sample sizes. As previous work was limited to learning with matrices and vectors, this paper provides a new insight into the understanding of the principles and ideas underlying the concept of the tensor.

In the future, we will investigate reconstruction techniques for tensor data, so that our method can handle high-dimensional vector data more effectively. Another interesting topic would be to design a dedicated method to address the parameter selection problem. Further study on this topic will also include applications of DuSK kernels in real-world unsupervised learning with tensor representations.

Acknowledgements
This work is supported in part by NSF through grants CNS-1115234, DBI-0960443, and OISE-1129076, NIH through grant MH080636, US Department of Army through grant W911NF-12-1-0066, a Huawei Grant, the National Science Foundation of China (61273295, 61070033), the National Social Science Foundation of China (11&ZD156), the Science and Technology Plan Project of Guangzhou City (12C42111607, 201200000031), the Science and Technology Plan Project of Panyu District, Guangzhou (2012-Z-03-67), the Specialized Research Fund for the Doctoral Program of Higher Education (20134420110010), Discipline Construction and Quality Engineering of Higher Education in Guangdong Province (PT2011JSJ), and the China Scholarship Council.

References

[1] D. Cai, X. He, and J. Han. Learning with tensor representation. Computer Science Technical Report UIUCDCS-R-2006-2716, University of Illinois at Urbana-Champaign, 2006.

[2] E. Chong and S. Zak. An Introduction to Optimization. Wiley-Interscience, 2001.

[3] A. Cichocki. Tensor decompositions: New concepts for brain data analysis? Journal of Control, Measurement, and System Integration, 7:507-517, 2013.

[4] J. Davis and I. Dhillon. Structured metric learning for high dimensional problems. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 195-203, Las Vegas, Nevada, USA, 2008.

[5] W. Guo, I. Kotsia, and I. Patras. Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816-827, 2012.

[6] Z. Hao, L. He, B. Chen, and X. Yang. A linear support higher-order tensor machine for classification. IEEE Transactions on Image Processing, 22(7):2911-2920, 2013.

[7] D. Hardoon and J. Shawe-Taylor. Decomposing the tensor kernel support vector machine for neuroscience data with structured labels. Machine Learning, 79(1):1-18, 2010.

[8] A. Jukic, I. Kopriva, and A. Cichocki. Canonical polyadic decomposition for unsupervised linear feature extraction from protein profiles. In Proceedings of the European Signal Processing Conference, Marrakech, Morocco, 2013.

[9] M. Kilmer and C. Martin. Factorization strategies for third-order tensors. Linear Algebra and its Applications, 435(3):641-658, 2011.

[10] T. Kolda and B. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.

[11] I. Kotsia and I. Patras. Support Tucker machines. In Proceedings of Computer Vision and Pattern Recognition, pages 633-640, Providence, RI, 2011.

[12] H. Lu, K. Plataniotis, and A. Venetsanopoulos. MPCA: Multilinear principal component analysis of tensor objects. IEEE Transactions on Neural Networks and Learning Systems, 19(1):18-39, 2008.

[13] D. Nion and L. De Lathauwer. An enhanced line search scheme for complex-valued tensor decompositions. Application in DS-CDMA. Signal Processing, 21:749-755, 2008.

[14] S. Park. Multifactor analysis for fMRI brain image classification by subject and motor task. Electrical and Computer Engineering Technical Report, Carnegie Mellon University, 2011.

[15] M. Signoretto, L. De Lathauwer, and J. Suykens. Kernel-based learning from infinite dimensional 2-way tensors. In Proceedings of the 20th International Conference on Artificial Neural Networks, pages 59-69, Thessaloniki, Greece, 2010.

[16] M. Signoretto, L. De Lathauwer, and J. Suykens. A kernel-based framework to tensorial data analysis. Neural Networks, 24(8):861-874, 2011.

[17] M. Signoretto, E. Olivetti, L. De Lathauwer, and J. Suykens. Classification of multichannel signals with cumulant-based kernels. IEEE Transactions on Signal Processing, 45(12):2304-2314, 2012.

[18] V. de Silva and L.-H. Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications, 30(3):1084-1127, 2008.

[19] D. Tao, X. Li, X. Wu, W. Hu, and S. Maybank. Supervised tensor learning. Knowledge and Information Systems, 13(1):1-42, 2007.

[20] X. Wang, P. Foryt, R. Ochs, J. Chung, Y. Wu, T. Parrish, and A. Ragin. Abnormalities in resting-state functional connectivity in early human immunodeficiency virus infection. Brain Connectivity, 1(3):207, 2011.

[21] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, and H. Zhang. Multilinear discriminant analysis for face recognition. IEEE Transactions on Image Processing, 16(1):212-220, 2007.

[22] Q. Zhao, G. Zhou, T. Adali, L. Zhang, and A. Cichocki. Kernel-based tensor partial least squares for reconstruction of limb movements. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 633-640, 2013.

[23] H. Zhou, L. Li, and H. Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 2012.

[24] Y. Zhu, J. He, and R. Lawrence. Hierarchical modeling with tensor inputs. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, pages 1233-1239, Toronto, Canada, 2012.