Identifying Informative Imaging Biomarkers via Tree Structured Sparse Learning for AD Diagnosis
Manhua Liu, Department of Instrument Science and Engineering, SEIEE, Shanghai Jiao Tong University, Dong Chuan Rd #800, Shanghai, China. Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
Daoqiang Zhang, Department of Computer Science and Engineering, Nanjing University of Aeronautics & Astronautics, Nanjing, China. Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
Dinggang Shen, Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA. Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea
and the Alzheimer’s Disease Neuroimaging Initiative
Manhua Liu: [email protected]; Dinggang Shen: [email protected]
Abstract
Neuroimaging provides a powerful tool to characterize neurodegenerative progression and
therapeutic efficacy in Alzheimer’s disease (AD) and its prodromal stage—mild cognitive
impairment (MCI). However, since the disease pathology might cause different patterns of
structural degeneration that are not known in advance, it remains challenging to identify the
relevant imaging markers for facilitating disease interpretation and classification. Recently, sparse
learning methods have been investigated in neuroimaging studies for selecting the relevant
imaging biomarkers and have achieved very promising results on disease classification. However,
in the standard sparse learning method, the spatial structure is often ignored, although it is
important for identifying the informative biomarkers. In this paper, a sparse learning method with
tree-structured regularization is proposed to capture patterns of pathological degeneration from
fine to coarse scale, for helping identify the informative imaging biomarkers to guide the disease
classification and interpretation. Specifically, we first develop a new tree construction method
based on the hierarchical agglomerative clustering of voxel-wise imaging features in the whole
brain, by taking into account their spatial adjacency, feature similarity and discriminability. In this
way, the complexity of all possible multi-scale spatial configurations of imaging features can be
reduced to a single tree of nested regions. Second, we impose the tree-structured regularization on
the sparse learning to capture the imaging structures, and then use them for selecting the most
relevant biomarkers. Finally, we train a support vector machine (SVM) classifier with the selected
features to make the classification. We have evaluated our proposed method by using the baseline
MR images of 830 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI)
database, which includes 198 AD patients, 167 progressive MCI (pMCI), 236 stable MCI (sMCI),
and 229 normal controls (NC). Our experimental results show that our method can achieve
accuracies of 90.2 %, 87.2 %, and 70.7 % for classifications of AD vs. NC, pMCI vs. NC, and
pMCI vs. sMCI, respectively, demonstrating promising performance compared with other state-of-the-art methods.
© Springer Science+Business Media New York 2013
Correspondence to: Dinggang Shen, [email protected].
NIH Public Access Author Manuscript
Neuroinformatics. Author manuscript; available in PMC 2014 July 20.
Published in final edited form as: Neuroinformatics. 2014 July; 12(3): 381–394. doi:10.1007/s12021-013-9218-x.
Keywords
Alzheimer’s disease diagnosis; Tree-structured sparse learning; Biomarker identification; Mild cognitive impairment; Group sparse learning
Introduction
Alzheimer’s disease (AD) is a neurodegenerative disorder characterized by progressive
impairment of memory and other cognitive functions. Neuroimaging measurements,
including magnetic resonance image (MRI) and fluorodeoxyglucose positron emission
tomography (FDG-PET), provide a powerful in vivo tool for helping diagnosis and
longitudinal follow-up study of AD and MCI (Desikan et al. 2009; Klöppel et al. 2008;
Stonnington et al. 2010; Zhou et al. 2011; Oliveira et al. 2010; Leung et al. 2010;
Davatzikos et al. 2010; Querbes et al. 2009; Filipovych and Davatzikos 2011; Duchesne et
al. 2009; Fan et al. 2007a; Wee et al. 2011, 2012; Zhang and Shen et al. 2012a, b; Li et al.
2012). In recent years, a substantial research effort has been made to investigate many
machine learning and pattern recognition technologies, such as sparse learning and support
vector machines (SVM), in neuroimaging analysis to assist AD diagnosis (Davatzikos et al.
2008a; Magnin et al. 2009; Hinrichs et al. 2009; Cuingnet et al. 2011; Wolz et al. 2011; Liu
et al. 2012a, b). Various methods have been proposed for processing and analysis of
neuroimages to investigate the pathological changes related to brain diseases (Xue et al.
2006; Wu et al. 2006; Yang et al. 2008; Magnin et al. 2009; Shen et al. 1999; Jia et al. 2010;
Tang et al. 2009). One major challenge in neuroimaging analysis is the huge dimensionality
of the original imaging data. For example, a typical MRI scan of the brain includes several
million measurements on the respective image voxels. Direct classification with the original
data is not only computationally expensive but also inaccurate, since not all imaging
information in the brain is related to the disease. Thus, extraction and selection of
discriminative imaging biomarkers are necessary and important for brain disease diagnosis
and interpretation (Zhu et al. 2013; Chen et al. 2009; Fan et al. 2007b; Chu et al. 2012).
To achieve dimensionality reduction and also take into account the spatial structure of the
neuroimaging data, we can resort to feature agglomeration. One popular method is to group
voxels into multiple anatomical regions, i.e., regions of interest (ROIs), through the warping
of a pre-labeled atlas, and then extract regional features such as anatomical volumes for
classification (Lao et al. 2004; Magnin et al. 2009). However, this type of anatomical
parcellation may not adapt well to the disease-related pathology since the abnormal region
may be part of ROI or span over multiple ROIs, which could eventually affect the
classification result. To address this problem, Fan et al. (2007b) proposed to adaptively
partition the brain image into a number of the most discriminative brain regions according to
the similarity of local imaging features. Then, regional features were extracted for disease
classification. The advantage of all ROI-based feature extraction methods is that they can
significantly reduce the feature dimensionality and make the extracted features robust to
noise, inter-subject variability, and registration errors. However, the extracted regional
features are generally very coarse and not sensitive to small changes in the local brain
region, which will eventually degrade the classification performance.
The above-mentioned limitations with the ROI-based features could be potentially addressed
by the voxel-wise analysis method (Ishii et al. 2005), i.e., extracting voxel-wise features for
image classification. However, voxel-wise image analysis is sensitive to noise, registration
error, and inter-subject variability. Although these limitations could be alleviated by
smoothing the imaging features via a Gaussian filter, the smoothing is often performed on
the whole brain for all subjects uniformly and thus not adaptive to anatomical structures,
shapes and abnormal regions. More importantly, the number of voxel-wise imaging features
from the whole brain is always very large (i.e., in millions), while the number of training
samples is very small (i.e., in hundreds) in the neuroimaging study. This could easily cause
the classification model to over-fit the training set and thus generalize poorly to the test
set. Since only a small number of brain regions are relevant to the disease, it is important to
identify the predictive information for facilitating disease classification and interpretation.
Recently, the effectiveness of four feature selection methods has been investigated and
compared for AD diagnosis in (Chu et al. 2012). Their results show that feature selection
does improve the classification accuracy, but the improvement depends on the method used. The t-test
filtering is one of the commonly used methods to select the most discriminative biomarkers
according to their correlations to the class labels (Davatzikos et al. 2008a; Oliveira et al.
2010). However, this method evaluates each brain voxel independently and can handle
neither the spatial structure nor the multivariate nature of neuroimaging data, thus limiting
its ability to detect the complex population difference. Since the informative imaging
features may be distributed over the distant brain regions, the combinations of features over
these regions should also be taken into account for feature selection. Recently, L1-norm
sparse learning methods, e.g., Lasso, have been proposed to identify a subset of features that
best represents the outputs (Ghosh and Chinnaiyan 2005; Tibshirani 1996). Lasso
enforces sparsity on the individual features while ignoring their spatial structures in feature
selection. Since the disease-induced abnormal changes often happen in a few correlated
regions instead of isolated voxels, the structural relationship of imaging features is an
important information source for identification of informative regions.
The structural relationships among spatial features can be used to build better feature
selection methods (Hinrichs et al. 2009). Recently, group Lasso was extended from the L1-norm
Lasso to find solutions that are sparse at the group level of features (Yuan and Lin
2006). It assumes that some feature groups are inherently “good” and thus sparsity should be
enforced at the group level by an L1/L2 regularization, in which the L1-norm penalty is
applied over the groups of input features, while the L2 norm is applied to the input features
within the same group. The basic intuition is that some feature groups will have a higher
fraction of useful features than other groups, and thus features should be selected from those
“good” groups. It is worth noting that the group Lasso assumes the availability of prior
knowledge on how to group the features. However, as mentioned above, the morphological
changes of brain structures resulting from pathological processes usually do not occur in the
pre-identified regions with certain shapes, which makes it difficult to define the meaningful
feature groups for sparse learning. Recently, a sparse learning method with a tree-structured
regularization was proposed as an extension of group Lasso to consider the underlying
multi-level grouping structure among the inputs, or among the outputs, for feature selection
(Kim and Xing 2009; Liu and Ye 2010). The structured regularization with a predefined
hierarchical tree is defined based on the group penalty, where each tree node is for one
feature group and different tree levels represent different levels of groups.
On the other hand, the pathology of AD and MCI might cause subtle changes in specific
brain ROIs (e.g., atrophy of hippocampus) and/or the whole brain (e.g., atrophy of gray
matter). To obtain high classification performance, the selected informative features should
capture different patterns of structural degeneration, from local to global scales. In this
paper, we investigate a tree-structured sparse learning method to identify the
informative imaging features from multiple levels for guiding brain disease classification
and interpretation. Specifically, the spatial relationships of the imaging features are encoded
into the tree regularization of group sparse learning to guide the selection of relevant
features. Figure 1 shows the flowchart of the proposed method. First, the voxel-wise gray
matter (GM) density map is extracted from each MR brain image as the features to decode
the brain disease. Second, to capture the hierarchical spatial relationship of the imaging
features in the whole brain, a tree structure is constructed by gradually agglomerating the
adjacent and coherent voxels into a hierarchy of groups with a hierarchical agglomerative
clustering technique. Then, the constructed tree structure is imposed on the regularization of
group sparse learning to select the relevant features. Finally, an SVM classifier is trained to
perform classification using the selected features. We evaluate our proposed classification
method with the baseline MR brain images of the ADNI database. The results demonstrate that,
in addition to better classifying the neuroimaging data of AD and MCI subjects, our proposed method
can also identify the relevant biomarkers to facilitate the interpretation of classification
results.
The rest of this paper is organized as follows. The proposed method is presented in the
“Materials and Method” section. In the “Experimental Results” section, extensive experiments
and comparisons with other methods on the ADNI dataset are presented to demonstrate the
advantage of the proposed method. Finally, we conclude this paper and discuss the possible
future directions in the “Conclusion” section.
Materials and Method
Imaging Data and Preprocessing
The imaging data used for preparation of this paper were obtained from the Alzheimer’s
Disease Neuroimaging Initiative (ADNI) database (www.loni.ucla.edu/ADNI). The ADNI
was launched in 2003 by the National Institute on Aging (NIA) (http://www.nia.nih.gov/Alzheimers/ResearchInformation/ClinicalTrials/ADNI.htm), the National Institute of
Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration
(FDA), private pharmaceutical companies and non-profit organizations, as a $60 million, 5-
year public–private partnership. The primary goal of ADNI has been to test whether serial
magnetic resonance imaging (MRI), Positron Emission Tomography (PET), other biological
markers, and clinical and neuropsychological assessment can be combined to measure the
progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD).
Determination of sensitive and specific markers of very early AD progression is intended to
aid researchers and clinicians to develop new treatments and monitor their effectiveness, as
well as lessen the time and cost of clinical trials. The Principal Investigator of this initiative
is Michael W. Weiner, M.D., VA Medical Center and University of California, San
Francisco. ADNI was the result of efforts of many co-investigators from a broad range of
academic institutions and private corporations. The study subjects were recruited from over
50 sites across the U.S. and Canada and gave written informed consent at the time of
enrollment for imaging and genetic sample collection and completed questionnaires
approved by each participating site’s Institutional Review Board (IRB). The initial goal of
ADNI was to recruit 800 adults, aged from 55 to 90, to participate in the research:
approximately 200 cognitively normal older individuals to be followed for 3 years, 400
people with MCI to be followed for 3 years, and 200 people with early AD to be followed
for 2 years. For the up-to-date information, please see www.adni-info.org.
Although the proposed method makes no assumption on a specific neuroimaging modality,
MRI is widely available, non-invasive and often used as the first biomarker in the
differential diagnosis of brain diseases that cause memory problems. In this paper, T1-weighted
MRI data are used to demonstrate the performance of our proposed method. In
ADNI, the MRI datasets include the standard T1-weighted MR images acquired sagittally
using volumetric 3D MPRAGE with 1.25×1.25 mm in-plane spatial resolution and 1.2 mm
thick sagittal slices. Most of these images were obtained with 1.5 T scanners, while a few
were acquired using 3 T scanners. Detailed information about MR acquisition procedures is
available at the ADNI website. The current study in this paper involves the baseline MR
brain images from 229 normal control subjects, 403 MCI subjects (including 236 stable MCI
and 167 progressive MCI subjects), and 198 AD patients.
The MR brain images were first preprocessed according to previously validated and
published techniques (Liu et al. 2012a). Specifically, intensity inhomogeneity in the T1-weighted
MR images was corrected using the nonparametric nonuniform intensity
normalization (N3) algorithm (Sled et al. 1998). Then, a robust and automated skull
stripping method was applied for brain extraction and cerebellum removal (Wang et al.
2011). Each brain image was further segmented into three types of tissue volumes, i.e., gray
matter (GM), white matter (WM), and cerebrospinal fluid (CSF) volumes. All tissue
volumes were then spatially normalized together onto a standard space (also called the
stereotaxic space) by a mass-preserving deformable warping algorithm proposed in (Shen
and Davatzikos 2002, 2003). The warped mass-preserving tissue volumes reflect the spatial
distribution of tissues in the original brain. We refer to these warped mass-preserving tissue
volumes as the tissue density maps in this paper.
Since GM is more related to AD and MCI than WM and CSF, the voxel-wise GM densities
are used as the imaging features to demonstrate the classification performance of our
proposed method. Given M training images, with each represented by a feature vector and a
respective class label, the brain classification involves the step of selecting the most relevant
features and also the step of decoding the disease states, i.e., the class labels, as detailed
below.
Feature Selection
The voxel-wise GM densities of the whole brain are of huge dimensionality, in comparison
with the small number of subjects, which makes the disease classification and interpretation
difficult. In addition, the subtle changes caused by AD or MCI might reside in specific brain
regions about which little is known in advance. To identify the informative imaging biomarkers, the
feature learning model should capture different patterns of brain structural degeneration,
from local to global scales. Thus, we consider three aspects in developing the
feature learning model in this paper. First, since the informative imaging biomarkers may be
distributed over the distant brain regions, a multivariate model is learned to consider the
combinations of features over the distant brain regions for handling the long-range
interactions in feature selection. Second, the spatial structure of imaging features needs to be
specially considered for selecting more meaningful and informative biomarkers. Third, a
multi-scale approach is needed for identification of the predictive regions by respecting the
underlying multi-level structures of the imaging features. Accordingly, we propose a tree-
structured sparse learning method to identify informative imaging biomarkers. In the
following subsections, we will first introduce the L1-regularized coding and the group
sparse learning briefly, and then describe our proposed feature selection method.
L1-Regularized Sparse Learning and Group Sparse Learning—Let X denote a M
× Q feature matrix with the m-th row corresponding to the m-th image’s feature vector
$x_m = (x_{m,1}, \ldots, x_{m,q}, \ldots, x_{m,Q})$, and y be a class label (column) vector of M images with
$y_m$ denoting the class label of the m-th image. A linear regression model can be used to
decode the class outputs from a set of features as follows:
$$y = X\alpha + \varepsilon \qquad (1)$$
where $\alpha = (\alpha_1, \ldots, \alpha_q, \ldots, \alpha_Q)^T$ is a vector of coefficients assigned to the respective features,
and ε is an independent error term. The least square optimization is one of the popular
methods to solve the above problem. When Q is large while the number of features relevant
to the class labels is small, sparsity can be imposed on the coefficients of the least square
optimization via L1-norm regularization for feature selection (Tibshirani 1996; Ghosh and
Chinnaiyan 2005). The L1-norm least square problem, i.e., Lasso, can be formulated as:
$$\min_{\alpha} \; \frac{1}{2}\,\|y - X\alpha\|_2^2 + \lambda\,\|\alpha\|_1 \qquad (2)$$
where λ is a regularization parameter that controls the amount of sparsity on the coefficients.
The non-zero elements in α indicate that the corresponding features are relevant to the class
labels. The L1-norm sparse learning provides an effective multivariate regression model to
select a subset of relevant features by taking into account both the correlations of features to
the class labels and the combinations of individual features. However, this method imposes
the L1-norm sparsity on the individual features for feature selection, which completely
ignores the spatial structure of imaging features. In this situation, the associated features
should be jointly selected to identify the complex population difference, since the disease-
induced abnormal changes often happen in the contiguous brain regions, instead of isolated
voxels.
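To make the L1-norm selection step concrete, the following is a minimal sketch of Lasso solved by coordinate descent on a toy feature matrix. It is an illustration only and not the solver used in this paper: the function names `soft_threshold` and `lasso_cd`, the toy data, and the choice of scaling the squared loss by 1/2 are all assumptions made for the example.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise 0.5 * ||y - X a||^2 + lam * ||a||_1 by coordinate descent.

    X is a list of M rows (one per image), each with Q feature values.
    The 0.5 scaling of the loss is an assumed convention.
    """
    M, Q = len(X), len(X[0])
    alpha = [0.0] * Q
    col_sq = [sum(X[m][q] ** 2 for m in range(M)) for q in range(Q)]
    for _ in range(n_iter):
        for q in range(Q):
            if col_sq[q] == 0.0:
                continue
            # Correlation of feature q with the residual that excludes feature q.
            rho = 0.0
            for m in range(M):
                pred_wo_q = sum(X[m][k] * alpha[k] for k in range(Q) if k != q)
                rho += X[m][q] * (y[m] - pred_wo_q)
            alpha[q] = soft_threshold(rho, lam) / col_sq[q]
    return alpha


# Hypothetical toy data: 4 "images", 3 features; only feature 0 tracks the labels.
X = [[1.0, 0.2, -0.1],
     [0.9, -0.3, 0.2],
     [-1.0, 0.1, 0.1],
     [-1.1, -0.2, -0.2]]
y = [1.0, 1.0, -1.0, -1.0]

alpha = lasso_cd(X, y, lam=0.5)
# Features with non-zero coefficients are the selected ones.
selected = [q for q, a in enumerate(alpha) if a != 0.0]
print(alpha, selected)
```

In this toy case only the first feature keeps a non-zero coefficient. Each feature is kept or dropped on its own, which is the behavior that ignores spatial structure.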
To alleviate the above problem, the group Lasso has been proposed as an extension of L1-
norm sparse learning to use the groups of features, instead of individual features, as the units
of feature selection (Yuan and Lin 2006). In the regularization, group Lasso applies the L1-
norm penalty over the feature groups and the L2-norm penalty for the individual features
within the same group. It can be formulated as below:
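$$\min_{\alpha} \; \frac{1}{2}\,\|y - X\alpha\|_2^2 + \lambda \sum_{j=1}^{N} w_j\,\|\alpha_{G_j}\|_2 \qquad (3)$$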
where $\{G_j\}_{j=1}^{N}$ is a set of predefined non-overlapping feature groups and $w_j$ is a weight
assigned to the corresponding group. Specifically, the penalty acts as the L1-norm on the
vector of $\|\alpha_{G_j}\|_2$, to make some $\|\alpha_{G_j}\|_2$ zero. Thus, the sparsity is imposed at the level of
feature groups by the L1/L2-norm penalty to jointly select the features, so that the group
structures among the features can be considered. Obviously, group Lasso assumes that the
feature groups are available with certain prior knowledge. However, in practice, the
structural relationships of the imaging features are not always pre-known in the
neuroimaging study.
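The group-level effect of the L1/L2 penalty can be illustrated with the block soft-thresholding step that proximal solvers for group Lasso typically apply. This is a minimal sketch under stated assumptions rather than the implementation used here: the function name `group_soft_threshold`, the two toy groups, and the unit weights are hypothetical.

```python
import math


def group_soft_threshold(v, groups, weights, lam):
    """Proximal step for lam * sum_j w_j * ||v_Gj||_2 with non-overlapping groups."""
    out = list(v)
    for idx, w in zip(groups, weights):
        norm = math.sqrt(sum(v[i] ** 2 for i in idx))
        # A group whose norm is at most lam * w is set to zero as a whole;
        # otherwise all of its entries are shrunk by the same factor.
        scale = max(0.0, 1.0 - lam * w / norm) if norm > 0.0 else 0.0
        for i in idx:
            out[i] = scale * v[i]
    return out


# Hypothetical coefficient vector with one strong group and one weak group.
v = [3.0, 4.0, 0.3, -0.4]
groups = [[0, 1], [2, 3]]
weights = [1.0, 1.0]
shrunk = group_soft_threshold(v, groups, weights, lam=1.0)
print(shrunk)
```

The strong group survives with both entries scaled down, while the weak group is removed jointly. This is the sense in which features are selected as groups rather than individually.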
Tree-Structured Sparse Learning—Recently, in the literature, the L1/L2-norm penalty
in group sparse learning has been extended to a more general setting with various types of
complex structures on the sparsity patterns by allowing the overlapping of groups. In many
applications, the rich spatial relationships of features can be naturally represented using a
hierarchical tree structure, with the leaf nodes as individual features and the internal nodes
as groups of features. Specifically, a hierarchical feature selection has been proposed to
impose the tree structural relationships on the features, by defining groups with the
particular overlapping patterns (Zhao et al. 2009). Brain images exhibit spatial and
structural correlations between neighboring voxels, thus naturally forming a number of
brain regions with different sizes and shapes. Also, since generally only a few brain regions
over the whole brain are affected by the disease, the structured sparsity can be incorporated
into sparse learning for selection of informative biomarkers. Moreover, different
pathological processes might affect brain regions in different ways, and thus the disease-
affected brain regions might have irregular shapes and different sizes and are not known in
advance. Feature selection should capture different patterns of pathological degeneration
from voxel level to group level. The rich spatial relationships of the brain volume can be
naturally identified as a hierarchical tree structure where the sub-trees represent local brain
regions while different levels of the tree indicate multiple scales of the parcellated brain
regions. Then, L1/L2-norm regularization predefined by the tree structure can be imposed
on the sparse learning to encourage a joint selection of structured relevant features.
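As a small illustration of how a tree turns into nested, overlapping groups, the sketch below evaluates an L1/L2 penalty over every node of a toy tree. The four-leaf tree, the helper names `tree_penalty` and `active_groups`, and the weights (taken here as the square root of the group size) are assumptions chosen for the example, not the configuration used in the experiments.

```python
import math

# Hypothetical toy tree over 4 voxel features: each node is the index set of
# the leaves below it, so a child's set is always a subset of its parent's.
tree_groups = [
    [0], [1], [2], [3],      # leaves: single voxels
    [0, 1], [2, 3],          # internal nodes: small regions
    [0, 1, 2, 3],            # root: whole brain
]


def tree_penalty(alpha, groups):
    """Sum over nodes of w_j * ||alpha_Gj||_2 with w_j = sqrt(|G_j|) (assumed)."""
    total = 0.0
    for idx in groups:
        norm = math.sqrt(sum(alpha[i] ** 2 for i in idx))
        total += math.sqrt(len(idx)) * norm
    return total


def active_groups(alpha, groups):
    """Nodes that contain at least one non-zero coefficient."""
    return [idx for idx in groups if any(alpha[i] != 0.0 for i in idx)]


localized = [0.6, 0.8, 0.0, 0.0]   # weights confined to one sub-tree
scattered = [0.6, 0.0, 0.8, 0.0]   # same magnitudes spread over two sub-trees

print(tree_penalty(localized, tree_groups), tree_penalty(scattered, tree_groups))
```

With equal coefficient magnitudes, the pattern confined to one sub-tree activates fewer nodes and receives the smaller penalty, which is how a tree-defined regularizer favors spatially coherent selections.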
The most important problem for the tree structured sparse learning is how to construct a
meaningful hierarchical tree for grouping the imaging features. A simple method for tree
construction is to hierarchically partition the whole brain into cubic 3D patches of multiple
scales, with each level of the tree consisting of the non-overlapping patches of one scale (Liu et
al. 2012a, b). Although this method considers the spatial adjacency of the voxel-wise
features, the morphological changes of brain structures resulting from pathological processes
do not necessarily occur in regions with regular shapes. In fact, the disease-affected
brain regions might have irregular shapes and different sizes. To capture more
natural feature groups, a hierarchical agglomerative clustering was used to adaptively
produce a hierarchical representation of the features by feature agglomeration, with each
feature group exhibiting similar characteristics under the spatial constraint (Jenatton et al.
2011). The advantage of agglomeration is that the selected brain regions may be spatially
coherent. However, this method did not take into account the target information in forming
the feature groups and thus the grouped brain regions may not display similar characteristics
with respect to the classification label. In the following subsection, we will introduce a new
tree construction method based on the hierarchical agglomerative clustering of the voxel-
wise imaging features, by taking into account the spatial adjacency, feature similarity and
discriminability.
Tree Construction by Hierarchical Agglomerative Clustering—Assuming that the
M × Q feature matrix X consists of M training subjects represented by Q voxel-wise imaging
features, as mentioned above, we seek to group the neighboring voxels into a tree structure
for representation of voxel-wise imaging features in a bottom-up way. The hierarchical
agglomerative clustering is applied at this stage to encode the rich structure of voxel-wise
imaging features based on a criterion defined below. It first treats each voxel as a singleton
cluster and then iteratively agglomerates a pair of neighboring clusters until all clusters have
been merged into a single cluster. Thus a binary tree is produced to represent a hierarchy of
clusters, with each node associated with the cluster obtained by merging its child clusters.
The root of the tree gathers all the voxels, while the leaves are the clusters that each consist of a
single voxel. To recover the true spatial support of a discriminative pattern embedded in an
image, three items are taken into consideration in defining the clustering criterion.
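The bottom-up procedure can be sketched on a one-dimensional chain of voxels, where spatial adjacency reduces to being neighbors in the chain. This is a simplified illustration and not the proposed criterion: the merge cost used here is the squared Euclidean distance between cluster mean profiles (feature similarity only), whereas the criterion of this paper also accounts for discriminability. The function name `build_tree` and the toy profiles are hypothetical.

```python
def build_tree(profiles):
    """Agglomerate voxels on a 1-D chain into a binary tree.

    profiles[v] is the feature profile of voxel v across training subjects.
    Returns the list of nodes; each node is (members, left_child, right_child),
    with children given as node indices (None for leaves).
    """
    nodes = [((v,), None, None) for v in range(len(profiles))]
    # Active clusters are kept in spatial order, so neighbours in this list
    # are exactly the spatially adjacent clusters.
    active = list(range(len(profiles)))

    def mean_profile(members):
        n = len(members)
        dims = len(profiles[0])
        return [sum(profiles[v][d] for v in members) / n for d in range(dims)]

    def cost(a, b):
        # Stand-in merge cost: squared distance between cluster mean profiles.
        ma, mb = mean_profile(nodes[a][0]), mean_profile(nodes[b][0])
        return sum((x - z) ** 2 for x, z in zip(ma, mb))

    while len(active) > 1:
        # Only adjacent pairs are candidates (spatial adjacency constraint).
        pos = min(range(len(active) - 1),
                  key=lambda i: cost(active[i], active[i + 1]))
        left, right = active[pos], active[pos + 1]
        nodes.append((nodes[left][0] + nodes[right][0], left, right))
        active[pos:pos + 2] = [len(nodes) - 1]
    return nodes


# Six adjacent voxels, three subjects each: voxels 0-2 and 3-5 form two regions.
profiles = [[1.0, 1.1, 0.9], [1.0, 1.0, 1.0], [1.1, 1.0, 0.9],
            [3.0, 3.1, 2.9], [3.0, 3.0, 3.0], [2.9, 3.0, 3.1]]
tree = build_tree(profiles)
print(len(tree), tree[-1][0])
```

For six voxels the result has 6 leaves and 5 internal nodes, and the last merge joins the two homogeneous regions, so the root's children correspond to voxels 0 to 2 and voxels 3 to 5.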
The first item is the spatial adjacency which indicates that only the neighboring clusters can
be merged together. Since the spatially adjacent voxels of brain images are usually
correlated, it is reasonable to require that the two merged clusters should have neighboring
voxels. In addition, it is well-known that the grouped voxels should have similar
characteristics. Thus, the second item is the consistency of the local imaging features, which
is directly related to the feature similarity or the uniformity of the grouped voxels. Two
groups of features are spatially consistent if these features are similar to each other. This
characteristic can help form boundaries of clusters with similar properties and also identify
the brain regions with different behaviors. In this work, the similarity of two features is
simply estimated by the Euclidean distance between their feature vectors, with a smaller distance indicating higher similarity.
The third item is the discrimination of the imaging features with respect to the classification
task. The discriminative power of a feature can be quantitatively measured by its relevance
to the classification task, which is usually computed as the correlation between this feature
and the class labels in a training dataset (i.e., with the normal samples labeled as −1 and the
disease samples labeled as +1). For our continuous imaging features, linear correlation
measures are easier to compute and are robust to over-fitting, and thus are widely used for
feature selection in machine learning. We apply a t-test to measure the relevance of each
feature to the classification task using the p value, which lies in the range [0, 1]. A smaller
p value indicates greater relevance of the feature to the classification task. Thus, the
discriminative power of the i-th feature is measured by:
(4)
where p(i) is the p value of the t-test for the i-th feature. Usually, to enhance the feature
discrimination and also the robustness to noise, a parcellation (division) of the brain
should partition the strongly discriminative regions into spatially small
regions, while leaving the uninformative regions un-partitioned. Otherwise, grouping the
informative features with others will decrease the discriminative power of the combined
features.
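The voxel-wise relevance test can be sketched as follows. The example computes a two-sample t statistic for a single feature; the pooled-variance form, the function name `t_statistic`, and the toy density values are assumptions, since the exact test variant is not specified here. The p value would then follow from the t distribution with n0 + n1 − 2 degrees of freedom (for instance, `scipy.stats.ttest_ind` returns both the statistic and the p value).

```python
import math
from statistics import mean, variance


def t_statistic(values, labels):
    """Two-sample (pooled-variance) t statistic for one voxel feature.

    labels use -1 for normal controls and +1 for patients, as in the text.
    """
    neg = [v for v, c in zip(values, labels) if c == -1]
    pos = [v for v, c in zip(values, labels) if c == +1]
    n0, n1 = len(neg), len(pos)
    pooled = ((n0 - 1) * variance(neg) + (n1 - 1) * variance(pos)) / (n0 + n1 - 2)
    return (mean(pos) - mean(neg)) / math.sqrt(pooled * (1.0 / n0 + 1.0 / n1))


# Hypothetical GM densities of one voxel across six subjects.
labels = [-1, -1, -1, +1, +1, +1]
informative = [0.90, 1.00, 1.10, 0.50, 0.60, 0.70]    # lower density in patients
uninformative = [0.90, 0.60, 1.10, 1.00, 0.50, 0.70]  # little group difference

print(abs(t_statistic(informative, labels)), abs(t_statistic(uninformative, labels)))
```

A larger absolute t statistic corresponds to a smaller p value, so the first voxel would be treated as more discriminative than the second.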
By taking the above three items into account, two adjacent and spatially-consistent clusters
with less informative features would be first merged together to increase their discriminative
power and then compete with the informative parts in feature selection. Thus, at each step of
the hierarchical agglomerative clustering, we group two adjacent clusters Ck and Cl that will
minimize the measure defined below:
(5)
where |Ck| denotes the number of features in the cluster Ck. Different from the traditional
hierarchical agglomerative clustering method that is designed only based on the spatial
consistency, the above measure can incorporate the information of both feature consistency
and discrimination for forming groups. Thus, the less informative imaging features can be
adaptively merged into larger groups to enhance their robustness to noise and then compete
with other groups. Note that the informative features are less grouped in order to maintain their
discriminability. Figure 2 lists the detailed steps of our proposed tree construction method
by the hierarchical agglomerative clustering. Finally, a tree-structured hierarchy of the
imaging features can be constructed on top of all brain voxels. Figure 3 further shows a
sample hierarchical index tree consisting of 6 leaves and 5 internal nodes, which is constructed with
6 adjacent image voxels.
Feature Selection by Tree-Structured Sparse Learning—Based on the outputs of
the tree construction presented above, we can perform feature selection by combining the
hierarchical tree structure of imaging features with the sparsity-inducing penalty in the
group sparse learning. The hierarchical tree constructed by the above agglomerative
clustering generates a total of Q−1 overlapping feature groups, where Q is the number of
available voxels. The largest feature group lies in the highest level of the tree, i.e., the root
of tree that consists of all imaging features. However, the informative regions often occur in
the sub-regions of the whole brain. Thus, it may not be optimal to include all the nodes of
the constructed tree in the sparse learning. Accordingly, we can cut the tree at a given level
of hierarchy to produce a subset of tree nodes as a hierarchical parcellation of the brain
volume. The tree cut will be optimized by computing a cross-validated classification
performance within the training data set. Specifically, we select the tree cut that yields the
highest classification performance. Then, all the descendant tree nodes are used as the
overlapping groups for sparse learning.
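One possible reading of the tree cut is sketched below. It assumes that the merges are stored in the order in which they were made (so the last merge is the root), that a cut into `n_top` clusters discards the last `n_top − 1` merges, and that every remaining node, leaves included, becomes a candidate group. The function names and the `cv_score` callback are hypothetical; in practice `cv_score` would run the cross-validated classification on the training data.

```python
def groups_below_cut(merges, n_leaves, n_top):
    """Return the feature groups kept when the tree is cut into n_top clusters.

    merges : list of (left_id, right_id, new_id, members) in merge order,
             so the last entry is the root.
    The cut drops the last n_top - 1 merges (the coarsest nodes) and keeps
    every remaining node, leaves included, as a group.
    """
    if not 1 <= n_top <= n_leaves:
        raise ValueError("n_top must be between 1 and n_leaves")
    kept = merges[:len(merges) - (n_top - 1)]
    leaves = [[i] for i in range(n_leaves)]
    return leaves + [members for (_, _, _, members) in kept]


def select_cut(merges, n_leaves, candidates, cv_score):
    """Pick the cut whose groups obtain the highest cross-validated score."""
    return max(
        candidates,
        key=lambda n_top: cv_score(groups_below_cut(merges, n_leaves, n_top)),
    )


# Toy tree over four voxels: two pairs merged, then the root.
toy_merges = [(0, 1, 4, [0, 1]), (2, 3, 5, [2, 3]), (4, 5, 6, [0, 1, 2, 3])]
toy_groups = groups_below_cut(toy_merges, n_leaves=4, n_top=2)
```

Here `toy_groups` holds the four leaves plus the two pair-level nodes; the root is excluded because it lies above the cut.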
Assume that the tree nodes produced by the tree cut are denoted as T = {G1, …, Gj, …,
GNc}, where Gj consists of all descendant leaves included in the j-th tree node. The tree
leaves are the voxel-wise imaging features, which form the finest level of grouping. Different
levels of tree nodes correspond to different scales of feature groups: the higher the level of a
tree node, the coarser the scale of its feature group. The index set of a child node is a subset
of that of its parent node, i.e., a parent node overlaps with its child nodes. The traditional
L1-norm sparsity-inducing penalty, i.e., ||α||1 in Eq. (2), yields sparsity at the level of
individual imaging features, ignoring the potential structures existing between larger groups of
features. The group sparse learning in Eq. (3) can make use of the group structures of the
feature set, but these must be known in advance. To combine the above two methods, the tree-guided
group sparse learning method with the hierarchical structured penalty can be formulated as:
(6)
where αGj is the set of coefficients assigned to all features within the tree node Gj, and wj is
a predefined positive weight for node Gj and is usually set to be proportional to the square
root of the group size. The first penalty promotes sparsity at the level of individual features
while the second penalty promotes sparsity at the level of overlapping groups. Since each
node represents a subtree of T, if one node is selected, all its descendant child nodes in tree T
will also be selected. In this way, we can make use of the hierarchical structure of feature
groups to guide the selection of informative features from voxels to multi-scale groups.
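For reference, the description of Eq. (6) suggests an objective of the following form. The two penalty terms follow the text directly (an L1 term on individual features and a weighted sum of group norms over the tree nodes). The loss term \mathcal{L}(\alpha) is a placeholder for the data-fitting loss used in Eq. (2) and Eq. (3), which are not reproduced in this section, and the use of the L2 norm on each group is an assumption based on the usual group Lasso convention.

```latex
\min_{\alpha}\; \mathcal{L}(\alpha)
  \;+\; \lambda_1 \,\|\alpha\|_1
  \;+\; \lambda_2 \sum_{j=1}^{N_c} w_j \,\|\alpha_{G_j}\|_2 ,
\qquad w_j \propto \sqrt{|G_j|}
```

Setting \lambda_2 = 0 leaves only the voxel-level L1 penalty, while setting \lambda_1 = 0 leaves only the overlapping group penalty defined by the tree.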
The tree-penalized convex minimization problem in Eq. (6) is more challenging to solve
than both Lasso and group Lasso, since the tree-structured regularization is much more
complex. An efficient optimization method has been proposed for tree-structured group Lasso,
where the structure over the features is represented as a hierarchical tree that is composed of
multi-scale image patches (Liu and Ye 2010). In light of the results from (Liu and Ye 2010),
a large class of convex minimization problems penalized by the grouped tree structure, such as
the one defined in Eq. (6), can be solved efficiently. That work showed that the associated
Moreau-Yosida regularization admits an analytical solution, and proposed an efficient algorithm
for determining the effective interval for the regularization parameter λ. This
algorithm has a time complexity comparable to that of both Lasso and group Lasso and is employed
in this paper to solve the above convex minimization problem with a tree-guided sparse
regularization. Finally, the features with non-zero coefficients are selected for subsequent
classification.
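To give an intuition for why the tree-structured penalty stays tractable, the sketch below applies the well-known closed-form proximal step for tree-structured groups: voxel-wise soft-thresholding followed by group-wise shrinkage applied from the finest groups up to the root. This is not the SLEP code and is not taken from the paper; the function names are hypothetical, the weights are assumed to be the square root of the group size, and the groups are assumed to be supplied in child-before-parent order.

```python
import math


def soft_threshold(v, t):
    """Elementwise L1 shrinkage: sign(x) * max(|x| - t, 0)."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]


def group_shrink(v, group, t):
    """Shrink the entries of v indexed by `group` by t in Euclidean norm."""
    norm = math.sqrt(sum(v[i] ** 2 for i in group))
    scale = max(1.0 - t / norm, 0.0) if norm > 0 else 0.0
    out = list(v)
    for i in group:
        out[i] = v[i] * scale
    return out


def tree_prox(v, groups, lam1, lam2):
    """Proximal step for lam1*||a||_1 + lam2*sum_j w_j*||a_Gj||_2.

    groups must list every child group before its parent (leaf-to-root
    order); w_j is assumed to be sqrt(|G_j|).
    """
    u = soft_threshold(v, lam1)          # finest level: individual voxels
    for g in groups:                     # then coarser and coarser nodes
        u = group_shrink(u, g, lam2 * math.sqrt(len(g)))
    return u


# Two nested groups over three voxels: {0, 1} is a child of {0, 1, 2}.
shrunk = tree_prox([1.0, -2.0, 0.2], [[0, 1], [0, 1, 2]], lam1=0.5, lam2=0.0)
```

Each group is visited once, which is why the cost of one such step is linear in the total size of the groups.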
Classification
Based on the imaging features selected by the tree-guided sparse learning method, a
classifier model is trained to make the final classification. Various classifier
models have been investigated for the classification of brain images. Among them, SVM is one of
the most widely used classifiers because of its high classification performance (Fan et al. 2007b;
Davatzikos et al. 2008b; Zhang et al. 2011; Magnin et al. 2009). SVM constructs a maximal
margin classifier in a high-dimensional feature space by mapping the original features using
a kernel-induced mapping function. For simplicity, we choose the SVM model with a linear
kernel as the classifier, which is implemented by the SVM functions in MATLAB software
(Kecman 2001). The value of the box constraint C for the soft margin is optimized by
cross-validation on the training data set. (Note that this C is different from the parameter C
used in Fig. 2.)
Experimental Results
We evaluate the proposed method with the T1-weighted baseline MR brain images of 830
subjects from the ADNI database, which include 198 AD patients, 236 stable MCI (sMCI), 167
progressive MCI (pMCI), and 229 normal controls (NC). Table 1 provides a summary of the
demographic characteristics of the studied subjects (denoted as mean ± standard deviation).
Experimental Design
Before performing disease classification, image preprocessing was performed as
described in the “Imaging Data and Preprocessing” section. The spatially normalized tissue
volumes, referred to as tissue densities, are used for classification in this paper. To reduce the
effects of noise, registration inaccuracy, and inter-individual anatomical variations, the tissue
density maps were further smoothed using a Gaussian filter and then down-sampled by a
factor of 4 to save computational time and memory. In this
experiment, we use only the GM density map as the imaging features because it is more
relevant to AD and MCI.
The tree-structured sparse learning method is implemented using the SLEP package
available at http://www.public.asu.edu/~jye02/Software/SLEP. Since some recent
publications (Wolz et al. 2011; Cuingnet et al. 2011) were based on three pairs of
classifications, our proposed method is tested on three classification problems
related to AD and MCI, namely AD vs. NC, pMCI vs. NC, and sMCI vs.
pMCI. To statistically evaluate the classification performance, we conduct
standard 10-fold cross-validation to compute the classification accuracy, i.e.,
the proportion of correctly classified subjects in the test data set. In addition, we also
compute the sensitivity (SEN), i.e., the proportion of AD (or MCI) patients correctly
classified, and the specificity (SPE), i.e., the proportion of correctly classified normal
controls, for further evaluation. In each iteration, one fold of the data set is used for testing,
while the remaining 9 folds are used as training data. The construction of the
hierarchical tree is performed on the training data set, and then both the feature selection by
the tree-structured sparse learning and the SVM classification model are trained on the
training data. The training set can be further divided into 10 folds to fine-tune the parameters
in our method when needed. In the tree-structured sparse learning method, the parameters λ1
and λ2 in Eq. (6) are proportional to each other and can thus be controlled by a single parameter,
such as λ2, which is optimized in the range [0, 1] by 10-fold cross-validation on the training data set.
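The evaluation protocol above (an outer 10-fold split for testing, an inner split of the training folds for parameter tuning, and accuracy, sensitivity and specificity as measures) can be sketched as follows. This is an illustrative outline only; the function names, the seeded shuffling and the label coding (1 for patients, 0 for controls) are assumptions, not details given in the paper.

```python
import random


def kfold_indices(n, k=10, seed=0):
    """Split range(n) into k disjoint folds after a seeded shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_splits(n, k_outer=10, k_inner=10, seed=0):
    """Yield (train, test, inner_folds); inner folds partition train only."""
    for test in kfold_indices(n, k_outer, seed):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        inner = [
            [train[j] for j in fold]
            for fold in kfold_indices(len(train), k_inner, seed + 1)
        ]
        yield train, test, inner


def acc_sen_spe(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    acc = (tp + tn) / len(pairs)
    sen = tp / (tp + fn) if tp + fn else float("nan")
    spe = tn / (tn + fp) if tn + fp else float("nan")
    return acc, sen, spe


splits = list(nested_splits(50, k_outer=5, k_inner=4))
```

Because the inner folds are drawn from the training indices only, the test fold never influences the choice of λ2 or C.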
Results on Disease Classification
The first experiment tests the effect of structural constraints on feature selection and
classification. We compare our proposed method against two other feature selection
methods that do not use the feature structure information. The first feature selection method
is the t-test, which is one of the most commonly used approaches in the literature. The t-test is
performed at each voxel of the training data, and then we set different thresholds on the
p-value according to the number of voxels to be selected at each level. The
second feature selection method is based on the L1-norm sparse learning (Lasso), which
applies the L1-norm to the imaging features, i.e., the first regularization term of Eq. (6). In this
method, we can adjust the regularization parameter λ1 to control the sparsity and select
various numbers of features.
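The voxel-wise t-test baseline can be illustrated with the sketch below. It ranks voxels by the absolute value of a pooled-variance two-sample t statistic and keeps the top k; when every voxel is tested on the same subjects, the degrees of freedom are identical, so this ranking matches a ranking by p-value. The choice of the pooled-variance form, the function names and the toy data are assumptions made for illustration.

```python
import math


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic for two lists of numbers."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled = (ssa + ssb) / (na + nb - 2)
    se = math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return (ma - mb) / se if se > 0 else 0.0


def top_k_voxels(X, y, k):
    """Return the indices of the k voxels with the largest |t|.

    X : list of subjects, each a list of voxel values.
    y : list of 0/1 class labels, one per subject.
    """
    n_vox = len(X[0])
    scores = []
    for v in range(n_vox):
        pos = [row[v] for row, label in zip(X, y) if label == 1]
        neg = [row[v] for row, label in zip(X, y) if label == 0]
        scores.append(abs(t_statistic(pos, neg)))
    order = sorted(range(n_vox), key=lambda v: scores[v], reverse=True)
    return order[:k]


# Toy data: 4 subjects, 3 voxels; voxel 1 separates the classes most clearly.
X_toy = [[1.0, 5.0, 0.3], [1.2, 5.1, 0.9], [0.9, 1.0, 0.5], [1.1, 1.2, 0.7]]
y_toy = [1, 1, 0, 0]
chosen = top_k_voxels(X_toy, y_toy, k=2)
```

Varying k here plays the same role as varying the p-value threshold: both control how many voxels are passed to the classifier.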
For a fair comparison, we test the classification performance with respect to different numbers of
selected features in all three methods. In the proposed tree-structured sparse learning
method, we vary the regularization parameter λ2 in the range [0, 1] to adjust
the sparsity of the sparse learning and thus obtain different numbers of selected features. All
these methods are tested on the same 10-fold partition of the data set. Given a regularization
parameter, we obtain a classification accuracy and the corresponding average number
of selected features over the 10 folds. To present the results more clearly, we use polynomial
models to fit all the data obtained with different regularization parameters. Figure 4a, b and c
show the classification results (the fitted plots of classification accuracy vs. the
number of selected features) for AD vs. NC, pMCI vs. NC, and sMCI vs. pMCI,
respectively. From these results, we can see that the proposed tree-guided method
achieves better classification accuracy than both the t-test and Lasso based methods, especially
when the number of selected features is small (i.e., <1×10^4, less than half of the available
features). This shows that imposing the structural relationships of features on the
regularization of sparse learning can help select the discriminative features for better
classification. As the number of selected features increases further, the accuracy gain of the
proposed method over the other two methods diminishes. This is consistent with the role of the
tree-structured regularization: the more features are selected, the weaker the
effect of the structural constraint on the selection of discriminative features. All these results
show that the proposed tree-structured sparse learning method can exploit the
hierarchical structure of imaging features to select the informative biomarkers and thereby
improve the disease classification.
The second experiment tests the effectiveness of the proposed tree construction method
in the tree-structured sparse learning on the disease classification performance. In this
experiment, we compare the proposed method with four other methods. The first method is a
baseline method that uses all the imaging features of the whole brain, without feature
selection, for disease classification. (The other three methods correspond to different
strategies used to define the tree structure for sparse learning, referred to as the 2nd, 3rd,
and 4th methods below.) The 2nd method (namely ‘Patch-based’) is taken from our previous work
(Liu et al. 2012b), which defined the hierarchical tree by dividing the brain image into multi-
scale 3D patches and using the multi-scale patches as the multi-level tree nodes for guiding
feature selection. The 3rd method (namely ‘Cluster-based’) defines the tree nodes by using
agglomerative clustering that takes into account only their spatial adjacency and feature
similarity. On the other hand, specific regions of interest (ROIs) can also be generated by
grouping the voxels of brain images into anatomical regions through the registration of a
labeled atlas (Kabani et al. 1998). This anatomical parcellation can take the local features
and their similarity into account for feature grouping. Thus, in the 4th method (namely
‘ROI-based’), we define the feature groups, i.e., the tree nodes, with the 93 manual ROIs as
in (Kabani et al. 1998), instead of adaptively defining the feature groups using the
hierarchical clustering as described above. The optimization of these methods is the same as in
our proposed method. All the methods in this experiment are tested on the same 10 runs of 10-
fold cross-validation, and the averaged results are reported.
The classification results of the above 4 methods are listed together with those of our proposed
method in Table 2. From these results, we can see that the feature selection by the sparse
learning with the patch-based tree structure obtains a slight improvement (1%) for
classification of AD vs. NC when compared with the baseline method that uses the whole-
brain features. Although the patch-based tree structure takes the local adjacency of imaging
features into consideration through multi-scale patches, it cannot make full use of the information
in imaging features and class labels for describing both local feature similarity and
discrimination during the tree construction. This limits its ability to use the effective spatial
structures for identifying the informative features. On the other hand, the Cluster-based
method defines multi-level feature groups by taking into account the local similarities, but
uses no discrimination information for grouping features, and may thus produce non-
informative feature groups. Similarly, the feature selection by the ROI-based tree structure does
not obtain an obvious improvement for classification when compared with the baseline method
using the whole-brain features, except for a slight improvement for classification of AD vs.
NC. Although the feature groups defined by ROIs take the local similarities into
consideration, the respective tree structure consists of just one level of non-overlapping groups
and, more importantly, no discrimination information with respect to the classification task
is used to define the feature groups. In fact, the informative region of abnormality might be
part of an ROI or span multiple ROIs, thus potentially reducing the statistical power of
ROI-based feature selection. In contrast, our proposed tree construction by
hierarchical agglomerative clustering can adaptively generate the feature groups and takes
into account both the feature similarity and the discrimination of adjacent voxels. Thus, our
proposed method achieves better classification performance than the other methods.
Results on Biomarker Identification
The proposed method also aims to identify the informative biomarkers that are associated
with the AD status to improve the disease interpretation. We examine the selected imaging
features by the proposed tree-structured sparse learning method with the best regularization
parameters. It is worth noting that the feature selection is performed on the training data
only. Thus, the selected imaging biomarkers at each cross-validation fold may be different.
Indeed, we checked the selected features from all cross-validation folds and found that
some selected features do vary across different folds of different partitions. Therefore, we
compute the frequencies of the voxels included in the selected features over the cross-
validation folds of all partitions for disease classification. We then identify the
imaging features with frequency values (normalized to the range [0, 1]) larger than a
threshold (set to 0.4 in our experiments) as the selected imaging biomarkers. Similarly, we
also provide the features selected by Lasso for comparison. Figures 5, 6, and 7 show the
identified biomarkers from the GM density map by both Lasso and tree-structured sparse
learning methods for AD vs. NC, pMCI vs. NC, and pMCI vs. sMCI classifications,
respectively. It can be observed that the features selected by the proposed method are
usually grouped in the relevant regions, which helps the interpretation of the obtained results.
The regions identified by the proposed method include the hippocampus, parahippocampal
gyrus, entorhinal cortex, and amygdala, which are consistent with those reported in the
literature for AD and MCI studies (Cuingnet et al. 2011; Zhang et al. 2011; Hinrichs et al.
2009). These results verify the effectiveness of the proposed method for guiding the
identification of relevant biomarkers.
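The frequency-based identification of stable biomarkers can be sketched as below. The sketch assumes that the frequency of a voxel is the number of folds in which it was selected divided by the total number of folds, which is one natural way to normalize to [0, 1]; the paper does not spell out the normalization. The function name and the toy fold contents are hypothetical.

```python
def stable_voxels(selected_per_fold, n_voxels, threshold=0.4):
    """Return voxels whose selection frequency across folds exceeds threshold.

    selected_per_fold : list with one entry per cross-validation fold, each
                        a list of the voxel indices selected in that fold.
    """
    if not selected_per_fold:
        return []
    counts = [0] * n_voxels
    for selected in selected_per_fold:
        for v in set(selected):          # count each voxel once per fold
            counts[v] += 1
    n_folds = len(selected_per_fold)
    return [v for v in range(n_voxels) if counts[v] / n_folds > threshold]


# Five folds over five voxels: only voxel 1 is selected in more than 40%.
folds = [[0, 1], [1, 2], [1], [1, 3], [0, 1]]
stable = stable_voxels(folds, n_voxels=5)
```

Voxel 0 is selected in exactly 2 of 5 folds (frequency 0.4) and is therefore excluded, because the rule keeps only frequencies strictly larger than the threshold.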
Discussion
In this study, we have evaluated the classification performance of the proposed method with
830 baseline MR brain images acquired in the ADNI study. Our results demonstrate that
feature selection by the tree-structured sparse learning can improve the classification
performance when compared to both the t-test and the L1-norm Lasso methods, which do
not consider the data structure during feature selection. In addition, the proposed tree
construction method results in a more powerful classifier than other tree construction
methods, including the Patch-based, Cluster-based and ROI-based methods.
Furthermore, in Table 3, the results of the proposed classification method are compared with
five recent classification methods that also use the baseline T1-weighted MRI data of the ADNI
dataset, briefly described as follows:
• In (Hinrichs et al. 2009), the linear program (LP) boosting method with a novel
additional regularization was proposed to incorporate the spatial smoothness of MR
imaging space into the learning process and improve the classification accuracy.
Only classification results for AD vs. NC were provided in that paper.
• In (Zhang et al. 2011), 93 volumetric features were extracted from the 93 regions of
interest (ROI) in GM densities for both classifications of AD vs. NC and MCI vs.
NC. A single SVM classifier was constructed to make the final classification.
• In (Cuingnet et al. 2011), ten methods on different types of structural MRI-based
features, which included five voxel-wise imaging features based methods, three
cortical thickness based methods, and two hippocampus based methods, were
tested and compared with a linear SVM classifier. For classifications of AD vs. NC
and MCI vs. NC, the best classification results, which were obtained using voxel-
wise GM densities, are provided for comparison in our paper. For prediction of
MCI conversion, i.e., classification of pMCI vs. sMCI, the best results that used
hippocampal volume are provided for comparison in our paper.
• In (Wolz et al. 2011), instead of using a single MRI feature, four types of MRI-based
features, i.e., hippocampal volume, tensor-based morphometry, cortical thickness,
and manifold-learning based features, were combined to achieve improved
classification accuracies. Both linear discriminant analysis (LDA) and SVM
classification approaches were tested for classifications of AD vs. NC, pMCI vs. NC,
and pMCI vs. sMCI. For comparison, we present their best results, which were
obtained with the LDA classification approach.
• More recently, in (Chu et al. 2012), the impact of feature selection and sample size
on brain disease classification was extensively studied by using the GM features
and SVM classifier. In particular, they compared four different feature selection
methods, which are the pre-selected ROIs based on prior knowledge, univariate t-
test filtering, recursive feature elimination, and t-test filtering constrained by ROIs.
Their experimental results showed that the most accurate classification was
achieved by the feature selection using prior knowledge about the regions of brain
atrophy found in previous studies, i.e., using all GM voxels in the hippocampal and
parahippocampal masks. Therefore, their best results reported for classifications of
AD vs. NC, MCI vs. NC, and pMCI vs. sMCI are used here for comparison.
Table 3 summarizes the classification results of the above five methods, along with those of our
proposed method. It can be observed that our results compare favorably to all other existing
methods for brain disease classification. It is worth noting that the variations in the reported
results may be due to the use of different MRI feature extraction and classification methods,
and also the use of different ADNI subjects. All of these complicate the comparison of the
results, since it is difficult to implement all other methods under the same conditions for
a fair comparison. In addition, differences in the size of the test samples, the cross-
validation scheme, and the separation of training and testing sets also make a fair comparison
difficult to achieve. Nevertheless, our results were obtained using the largest data set,
consisting of almost all subjects in the ADNI database.
Conclusion
The pathology of AD and MCI might change brain regions in different ways,
and thus the disease-affected regions might have various sizes and irregular shapes, about
which little is known in advance. To identify the informative biomarkers, feature selection should
capture different patterns of pathological degeneration from fine to coarse scales. In this
paper, a tree-structured sparse learning method is proposed to identify the informative
biomarkers for classifications of AD and MCI. Specifically, a hierarchical tree is constructed
to capture the rich structural relationships among the imaging features by using
agglomerative clustering that takes into account their spatial adjacency, feature similarity
and discriminability, and then a tree-structured regularization is imposed on sparse learning
for feature selection. The tree-structured sparse learning provides an effective way to
identify more meaningful biomarkers to facilitate brain disease classification and
interpretation. Experimental results on the ADNI dataset show that the proposed method can
not only identify the grouped relevant biomarkers, but also improve the performance of
brain disease classification.
In the current paper, we validated our method using MRI data from the ADNI database.
However, our method can also be extended to other data modalities for AD or other
brain disease classification. In future work, we will evaluate our method on other
imaging data, e.g., PET. Moreover, since recent studies have shown that different modalities
of neuroimaging data can be combined to provide complementary information and achieve
better classification performance, we will extend our method to use multi-modality
biomarkers to further improve the accuracy of brain disease classification.
Information Sharing Statement
The MRI brain image dataset used in this paper was obtained from the Alzheimer’s Disease
Neuroimaging Initiative (ADNI), which is available at http://www.adni-info.org. In this
paper, the proposed method was implemented based on the SLEP package, which is also
publicly available at http://www.public.asu.edu/~jye02/Software/SLEP. Some other source
codes and binary programs used and developed in this paper are available on our website
(http://bric.unc.edu/ideagroup/).
Acknowledgments
This work was supported in part by NIH grants EB006733, EB008374, EB009634 and AG041721, MH100217, and AG042599, and by National Natural Science Foundation of China (NSFC) grants (No. 61375112, No. 61005024) and Medical and Engineering Foundation of Shanghai Jiao Tong University (No. YG2012MS12). This work was also partially supported by the National Research Foundation grant (No. 2012-005741) funded by the Korean government, and supported by the Open Project Program of the National Laboratory of Pattern Recognition (NLPR), and by Jiangsu Natural Science Foundation for Distinguished Young Scholar (No. BK20130034), and NUAA Fundamental Research Funds under grant (No. NE2013105). Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott, AstraZeneca AB, Bayer Schering Pharma AG, Bristol-Myers Squibb, Eisai Global Clinical Development, Elan Corporation, Genentech, GE Healthcare, GlaxoSmithKline, Innogenetics, Johnson and Johnson, Eli Lilly and Co., Medpace, Inc., Merck and Co., Inc., Novartis AG, Pfizer Inc., F. Hoffman-La Roche, Schering-Plough, Synarc, Inc., as well as non-profit partners the Alzheimer’s Association and Alzheimer’s Drug Discovery Foundation, with participation from the U.S. Food and Drug Administration. Private sector contributions to ADNI are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles.
References
Chen Y, An H, Zhu H, Stone T, Smith JK, Hall C, et al. White matter abnormalities revealed by diffusion tensor imaging in non-demented and demented HIV+ patients. NeuroImage. 2009; 47(4):1154–1162. [PubMed: 19376246]
Chu C, Hsu AL, Chou KH, Bandettini P, Lin C. for the Alzheimer’s Disease Neuroimaging Initiative. Does feature selection improve classification accuracy? Impact of sample size and feature selection on classification using anatomical magnetic resonance images. NeuroImage. 2012; 60(1):59–70. [PubMed: 22166797]
Cuingnet R, Gerardin E, Tessieras J, Auzias G, Lehericy S, Habert MO, et al. Automatic classification of patients with Alzheimer’s disease from structural MRI: a comparison of ten methods using the ADNI database. NeuroImage. 2011; 56(2):766–781. 10.1016/j.neuroimage.2010.06.013 [PubMed: 20542124]
Davatzikos C, Fan Y, Wu X, Shen D, Resnick SM. Detection of prodromal Alzheimer’s disease via pattern classification of magnetic resonance imaging. Neurobiology of Aging. 2008a; 29(4):514–523. 10.1016/j.neurobiolaging.2006.11.010 [PubMed: 17174012]
Davatzikos C, Resnick SM, Wu X, Parmpi P, Clark CM. Individual patient diagnosis of AD and FTD via high-dimensional pattern classification of MRI. NeuroImage. 2008b; 41(4):1220–1227. 10.1016/j.neuroimage.2008.03.050 [PubMed: 18474436]
Davatzikos C, Bhatt P, Shaw LM, Batmanghelich KN, Trojanowski JQ. Prediction of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification. Neurobiology of Aging. 2010; 32(12):2322.e2319–2322.e2327. [PubMed: 20594615]
Desikan RS, Cabral HJ, Hess CP, Dillon WP, Glastonbury CM, Weiner MW, et al. Automated MRI measures identify individuals with mild cognitive impairment and Alzheimer’s disease. Brain. 2009; 132(Pt 8):2048–2057. [PubMed: 19460794]
Duchesne S, Caroli A, Geroldi C, Collins DL, Frisoni GB. Relating one-year cognitive change in mild cognitive impairment to baseline MRI features. NeuroImage. 2009; 47(4):1363–1370. [PubMed: 19371783]
Fan Y, Rao H, Hurt H, Giannetta J, Korczykowski M, Shera D, et al. Multivariate examination of brain abnormality using both structural and functional MRI. NeuroImage. 2007a; 36(4):1189–1199. [PubMed: 17512218]
Fan Y, Shen D, Gur RC, Gur RE, Davatzikos C. COMPARE: Classification Of Morphological Patterns using Adaptive Regional Elements. IEEE Transactions on Medical Imaging. 2007b; 26(1):93–105. [PubMed: 17243588]
Filipovych R, Davatzikos C. Semi-supervised pattern classification of medical images: application to mild cognitive impairment (MCI). NeuroImage. 2011; 55(3):1109–1119. 10.1016/j.neuroimage.2010.12.066 [PubMed: 21195776]
Ghosh D, Chinnaiyan AM. Classification and selection of biomarkers in genomic data using LASSO. Journal of Biomedicine and Biotechnology. 2005; 2005(2):147–154. [PubMed: 16046820]
Hinrichs C, Singh V, Mukherjee L, Xu G, Chung MK, Johnson SC. Spatially augmented LPboosting for AD classification with evaluations on the ADNI dataset. NeuroImage. 2009; 48(1):138–149. [PubMed: 19481161]
Ishii K, Kawachi T, Sasaki H, Kono AK, Fukuda T, Kojima Y, et al. Voxel-based morphometric comparison between early- and late-onset mild Alzheimer’s disease and assessment of diagnostic performance of z score images. American Journal of Neuroradiology. 2005; 26(2):333–340. [PubMed: 15709131]
Jenatton, R.; Gramfort, A.; Michel, V.; Obozinski, G.; Bach, F.; Thirion, B. Multi-scale mining of fMRI data with hierarchical structured sparsity. IEEE International Workshop on Pattern Recognition in NeuroImaging; Seoul, Korea. May 16–May 18 2011; p. 69-72.
Jia H, Wu G, Wang Q, Shen D. ABSORB: Atlas building by self-organized registration and bundling. NeuroImage. 2010; 51(3):1057–1070. [PubMed: 20226255]
Kabani N, MacDonald D, Holmes CJ, Evans A. A 3D atlas of the human brain. NeuroImage. 1998; 7(4):S717.
Kecman, V. Learning and soft computing-support vector machines, neural networks, fuzzy logic systems. Cambridge: The MIT Press; 2001.
Kim, S.; Xing, EP. Tree-guided group lasso for multitask regression with structured sparsity. 2009. ArXiv preprint arXiv:0909.1373
Klöppel S, Stonnington CM, Chu C, Draganski B, Scahill RI, Rohrer JD, et al. Automatic classification of MR scans in Alzheimer’s disease. Brain. 2008; 131(3):681–689. [PubMed: 18202106]
Lao Z, Shen D, Xue Z, Karacali B, Resnick SM, Davatzikos C. Morphological classification of brains via high-dimensional shape transformations and machine learning methods. NeuroImage. 2004; 21(1):46–57. [PubMed: 14741641]
Leung K, Shen KK, Barnes J, Ridgway G, Clarkson M, Fripp J, et al. Increasing power to predict mild cognitive impairment conversion to Alzheimer’s disease using hippocampal atrophy rate and statistical shape models. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2010. 2010; 13:125–132.
Li Y, Wang Y, Wu G, Shi F, Zhou L, Lin W, Shen D. Discriminant analysis of longitudinal cortical thickness changes in Alzheimer’s disease using dynamic and network features. Neurobiology of Aging. 2012; 33(2):427.e15–427.e30. [PubMed: 21272960]
Liu J, Ye J. Moreau-Yosida regularization for grouped tree structure learning. Advances in Neural Information Processing Systems. 2010; 23:1459–1467.
Liu M, Zhang D, Shen D. Ensemble sparse classification of Alzheimer’s disease. NeuroImage. 2012a; 60(2):1106–1116. doi:10.1016/j.neuroimage.2012.01.055 [PubMed: 22270352]
Liu M, Zhang D, Yap P-T, Shen D. Tree-guided sparse coding for brain disease classification. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2012. Lecture Notes in Computer Science, Vol. 7512. Berlin Heidelberg: Springer; 2012b. p. 239–247.
Magnin B, Mesrob L, Kinkingnehun S, Pelegrini-Issac M, Colliot O, Sarazin M, et al. Support vector machine-based classification of Alzheimer’s disease from whole-brain anatomical MRI. Neuroradiology. 2009; 51(2):73–83. [PubMed: 18846369]
Oliveira PJ, Nitrini R, Busatto G, Buchpiguel C, Sato J, Amaro EJ. Use of SVM methods with surface-based cortical and volumetric subcortical measurements to detect Alzheimer’s disease. Journal of Alzheimer’s Disease. 2010; 19(4):1263–1272.
Querbes O, Aubry F, Pariente J, Lotterie JA, Demonet JF, Duret V, et al. Early diagnosis of Alzheimer’s disease using cortical thickness: impact of cognitive reserve. Brain. 2009; 132(Pt 8):2036–2047. [PubMed: 19439419]
Shen D, Davatzikos C. HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE Transactions on Medical Imaging. 2002; 21(11):1421–1439.
Shen D, Davatzikos C. Very high resolution morphometry using mass-preserving deformations and HAMMER elastic registration. NeuroImage. 2003; 18(1):28–41. [PubMed: 12507441]
Shen D, Wong W, Ip HHS. Affine-invariant image retrieval by correspondence matching of shapes. Image and Vision Computing. 1999; 17(7):489–499.
Sled JG, Zijdenbos AP, Evans AC. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Transactions on Medical Imaging. 1998; 17(1):87–97. doi:10.1109/42.668698
Stonnington CM, Chu C, Kloppel S, Jack CR Jr, Ashburner J, Frackowiak RS. Predicting clinical scores from magnetic resonance scans in Alzheimer’s disease. NeuroImage. 2010; 51(4):1405–1413. [PubMed: 20347044]
Tang S, Fan Y, Wu G, Kim M, Shen D. RABBIT: rapid alignment of brains by building intermediate templates. NeuroImage. 2009; 47(4):1277–1287. [PubMed: 19285145]
Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B: Methodological. 1996; 58(1):267–288.
Wang Y, Nie J, Yap P-T, Shi F, Guo L, Shen D. Robust deformable-surface-based skull-stripping for large-scale studies. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2011. Springer; 2011. p. 635–642.
Wee CY, Yap PT, Li W, Denny K, Browndyke JN, Potter GG, et al. Enriched white matter connectivity networks for accurate identification of MCI patients. NeuroImage. 2011; 54(3):1812–1822. [PubMed: 20970508]
Wee CY, Yap PT, Zhang D, Denny K, Browndyke JN, Potter GG, Welsh-Bohmer KA. Identification of MCI individuals using structural and functional connectivity networks. NeuroImage. 2012; 59(3):2045–2056. [PubMed: 22019883]
Wolz R, Julkunen V, Koikkalainen J, Niskanen E, Zhang DP, Rueckert D, et al. Multi-method analysis of MRI images in early diagnostics of Alzheimer’s disease. PLoS ONE. 2011; 6(10):e25446. [PubMed: 22022397]
Wu G, Qi F, Shen D. Learning-based deformable registration of MR brain images. IEEE Transactions on Medical Imaging. 2006; 25(9):1145–1157.
Xue Z, Shen D, Karacali B, Stern J, Rottenberg D, Davatzikos C. Simulating deformations of MR brain images for validation of atlas-based segmentation and registration algorithms. NeuroImage. 2006; 33(3):855–866. [PubMed: 16997578]
Yang J, Shen D, Davatzikos C, Verma R. Diffusion tensor image registration using tensor geometry and orientation features. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2008. Springer; 2008. p. 905–913.
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B: Statistical Methodology. 2006; 68(1):49–67. doi:10.1111/j.1467-9868.2005.00532.x
Zhang D, Shen D. Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage. 2012a; 59(2):895–907. [PubMed: 21992749]
Zhang D, Shen D. Predicting future clinical changes of MCI patients using longitudinal and multimodal biomarkers. PLoS ONE. 2012b; 7(3):e33182. [PubMed: 22457741]
Zhang D, Wang Y, Zhou L, Yuan H, Shen D. Multimodal classification of Alzheimer’s disease and mild cognitive impairment. NeuroImage. 2011; 55(3):856–867. [PubMed: 21236349]
Zhao P, Rocha G, Yu B. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics. 2009; 37(6A):3468–3497.
Zhou L, Wang Y, Li Y, Yap PT, Shen D. Hierarchical anatomical brain networks for MCI prediction: revisiting volumetric measures. PLoS ONE. 2011; 6(7):e21935. [PubMed: 21818280]
Zhu D, Li K, Guo L, Jiang X, Zhang T, Zhang D, et al. DICCCOL: dense individualized and common connectivity-based cortical landmarks. Cerebral Cortex. 2013; 23(4):786–800. [PubMed: 22490548]
Fig. 1. The flowchart of the proposed method
Fig. 2. The proposed tree construction method by hierarchical agglomerative clustering
Fig. 3. A sample tree (constructed with 6 adjacent image voxels) illustrating 6 leaves, {V1, V2, V3, V4, V5, V6}, and 5 nodes, {G1, G2, G3, G4, G5}
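To make the sample tree of Fig. 3 concrete, the following minimal Python sketch builds a binary tree over six adjacent voxels by repeatedly merging the two spatially adjacent clusters with the most similar mean value, and then evaluates a tree-structured penalty as a sum of L2 norms over all leaves and internal nodes. The voxel values, the weight vector, the 1-D adjacency, the mean-difference similarity, the unit node weights, and the function names `build_tree` and `tree_penalty` are illustrative assumptions; they are not taken from the paper, and the topology produced here need not match the tree drawn in Fig. 3.

```python
import math


def build_tree(values):
    """Agglomerate spatially adjacent clusters of a 1-D row of voxels.

    Returns the internal nodes as lists of leaf indices, in merge order.
    Similarity is assumed to be the absolute difference of cluster means.
    """
    clusters = [[i] for i in range(len(values))]
    nodes = []
    while len(clusters) > 1:
        means = [sum(values[i] for i in c) / len(c) for c in clusters]
        # Only neighbouring clusters may merge; pick the most similar pair.
        k = min(range(len(clusters) - 1),
                key=lambda j: abs(means[j] - means[j + 1]))
        merged = clusters[k] + clusters[k + 1]
        clusters[k:k + 2] = [merged]
        nodes.append(merged)
    return nodes


def tree_penalty(weights, nodes, node_weight=1.0):
    """Sum of L2 norms of the weights restricted to each leaf and each node."""
    groups = [[i] for i in range(len(weights))] + nodes
    return sum(node_weight * math.sqrt(sum(weights[i] ** 2 for i in g))
               for g in groups)


values = [0.90, 1.00, 0.20, 0.05, 0.50, 0.52]  # hypothetical features of V1..V6
nodes = build_tree(values)                      # internal nodes G1..G5
w = [0.0, 0.0, 3.0, 4.0, 0.0, 0.0]              # hypothetical sparse weights
penalty = tree_penalty(w, nodes)
```

With six leaves, pairwise merging always yields five internal nodes, which agrees with the node count in the caption. In this toy case the nonzero weights on V3 and V4 are charged once per leaf and once for every ancestor group that contains them, which is what encourages whole subtrees to be selected or discarded together.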
Fig. 4. Comparison of classification accuracy with respect to the number of selected features for three feature-selection methods (t-test, Lasso, and the proposed tree-guided method) in classification of (a) AD vs. NC, (b) pMCI vs. NC, and (c) sMCI vs. pMCI
Fig. 5. The biomarkers identified from the GM density map by (a) L1-norm Lasso and (b) our proposed tree-structured sparse learning method for AD vs. NC classification
Fig. 6. The biomarkers identified from the GM density map by (a) L1-norm Lasso and (b) our proposed tree-structured sparse learning method for pMCI vs. NC classification
Fig. 7. The biomarkers identified from the GM density map by (a) L1-norm Lasso and (b) our proposed tree-structured sparse learning method for pMCI vs. sMCI classification
Table 1
Demographic characteristics of the studied subjects from the ADNI database

Diagnosis  Number  Age       Gender (M/F)  MMSE
AD         198     75.7±7.7  103/95        23.3±2.0
pMCI       167     74.9±6.8  102/65        26.6±1.7
sMCI       236     74.9±7.7  158/78        27.3±1.8
NC         229     76.0±5.0  119/110       29.1±1.0

MMSE: Mini-Mental State Examination
Table 2
Comparison of classification results by different tree construction methods

Method                  Tree construction                                Classes    ACC (%)   SEN (%)   SPE (%)
Baseline (whole brain)  Without feature selection                        AD/NC      88.5±0.5  84.6±0.6  92.0±0.6
                                                                         pMCI/NC    86.5±0.7  80.0±1.5  91.2±1.0
                                                                         pMCI/sMCI  69.8±1.5  56.4±2.4  79.4±1.5
Patch-based             Defining tree nodes by multi-scale patches       AD/NC      89.6±0.5  84.9±0.9  93.5±0.8
                                                                         pMCI/NC    86.5±0.7  80.0±1.1  91.3±1.0
                                                                         pMCI/sMCI  69.6±1.0  56.5±1.5  78.8±1.4
Cluster-based           Hierarchical clustering on feature similarity    AD/NC      89.7±0.4  85.2±0.7  93.6±0.5
                                                                         pMCI/NC    86.3±0.6  79.4±1.2  91.3±0.8
                                                                         pMCI/sMCI  69.1±1.3  55.3±2.0  78.9±1.5
ROI-based               Defining tree nodes by 93 ROIs                   AD/NC      89.5±0.7  84.9±1.1  93.6±0.5
                                                                         pMCI/NC    86.1±0.6  79.5±1.1  90.9±0.9
                                                                         pMCI/sMCI  69.5±1.4  56.4±2.2  79.5±1.8
Our proposed method     Defining tree nodes by hierarchical clustering   AD/NC      90.2±0.5  85.3±0.6  94.3±0.4
                                                                         pMCI/NC    87.2±0.6  80.1±1.1  92.2±0.3
                                                                         pMCI/sMCI  70.7±0.5  56.2±1.1  80.9±0.8
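For readers unfamiliar with the column abbreviations, the short Python sketch below shows how accuracy (ACC), sensitivity (SEN), and specificity (SPE) are conventionally computed from the counts of a two-class confusion matrix. It assumes that the patient group is treated as the positive class; the function name `classification_metrics` and the example counts are hypothetical and do not come from the paper's experiments.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (ACC, SEN, SPE) in percent from confusion-matrix counts.

    tp/fn: positive (assumed: patient) subjects classified correctly/incorrectly.
    tn/fp: negative (assumed: control) subjects classified correctly/incorrectly.
    """
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)  # fraction of all subjects correct
    sen = 100.0 * tp / (tp + fn)                   # fraction of positives detected
    spe = 100.0 * tn / (tn + fp)                   # fraction of negatives rejected
    return acc, sen, spe


# Hypothetical counts: 100 positive and 100 negative test subjects.
acc, sen, spe = classification_metrics(tp=80, fn=20, tn=90, fp=10)
```

Because ACC is a size-weighted combination of SEN and SPE, a method can show high accuracy while its sensitivity stays low when the negative class is larger, which is worth keeping in mind when reading the pMCI/sMCI rows.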
Table 3
Comparison of classification results on the NC, sMCI, pMCI, and AD subjects, with the baseline MRI data in ADNI database

Methods                 Features                    Classifier     Subjects             ACC (%)  SEN (%)  SPE (%)
(Hinrichs et al. 2009)  Voxel-wise GM features      (LP) boosting  183 (AD + NC)        82.0     85.0     80.0
(Zhang et al. 2011)     GM volumes of 93 ROIs       SVM            51 AD + 52 NC        86.2     86.0     86.3
                                                                   99 MCI + 52 NC       72.0     78.5     59.6
(Cuingnet et al. 2011)  Voxel-wise GM features      SVM            137 AD + 162 NC      88.6     81.0     95.0
                                                                   76 pMCI + 162 NC     81.2     73.0     85.0
                                                                   76 pMCI + 134 sMCI   66.5     62.0     69.0
(Chu et al. 2012)       Voxel-wise GM features      SVM            131 AD + 188 NC      85.0     –        –
                                                                   261 MCI + 188 NC     70.0     –        –
                                                                   261 (pMCI + sMCI)    65.0     –        –
(Wolz et al. 2011)      4 types of MRI features (a) LDA            198 AD + 231 NC      89.0     85.0     93.0
                                                                   167 pMCI + 231 NC    84.0     82.0     86.0
                                                                   167 pMCI + 238 sMCI  68.0     67.0     69.0
Our proposed method     Voxel-wise GM features      SVM            198 AD + 229 NC      90.2     85.3     94.3
                                                                   167 pMCI + 229 NC    87.2     80.1     92.2
                                                                   167 pMCI + 236 sMCI  70.7     56.2     80.9

(a) Four types of MRI features include hippocampal volume, tensor-based morphometry, cortical thickness, and manifold-learning based features