A data mining approach to predict conversion from mild cognitive impairment to Alzheimer's Disease
Luís Jorge Matias de Lemos
Dissertation submitted to obtain the Master Degree in Information Systems and Computer Engineering
Jury
President: Doctor José Carlos Alves Pereira Monteiro
Supervisor: Doctor Sara Alexandra Cordeiro Madeira
Co-supervisor: Doctor Pedro Filipe Zeferino Tomás
Members: Doctor Cláudia Martins Antunes
Doctor Alexandre Valério de Mendonça
November 2012
Acknowledgments
First, I would like to thank my family for the support they gave me.
Secondly, I would like to thank my advisors, Sara Madeira and Pedro Tomás, for having a fundamental and difficult role in this work by guiding and motivating me. I would also like to thank all the members of the NEUROCLINOMICS team for the hours they spent listening to my presentations and providing valuable feedback.
My thanks to all my colleagues from room 128/425, and to André Silva for helping me with the English.
This work was partially supported by FCT - Fundação para a Ciência e a Tecnologia under
without class. This data was initially pre-processed to contain only 4 classes: Normal, Pre-MCI, MCI and Dementia. The new MCI class is composed of all MCI subtypes. The instances without classification have been removed. Many instances contain missing values for a set of neuropsychological tests. The normal and pre-MCI instances were also discarded.
2.2 Classification
To extract useful knowledge from this data, data mining techniques have to be used. Generally, these are divided into 7 steps [23]:
1. Data cleaning, to remove inconsistent data and outliers. In the case of our problem this is of great importance, and it has already been addressed by reporting errors to the medical doctors.
2. Data integration, to combine data from multiple sources. In this work there was no need to perform it, since the data was already delivered as a single table. This may however be important in the future if data from other databases is integrated (e.g. the ADNI database)1.
3. Data selection (sometimes referred to as feature selection), to discover relevant features. This step is of enormous importance to simplify the data and minimize the confusion presented to the classifier.
4. Data transformation, to transform the data so that it fits the classification process. For now this step is performed automatically by the WEKA software [22]. In the future it will be done explicitly if the new algorithms under study require it.
5. Data mining, the process where intelligent methods are applied in order to extract data patterns. For the diagnosis and prognosis problems, a comparative study of various algorithms was performed.
6. Pattern evaluation, with the purpose of recognizing interesting patterns that represent the knowledge. This step corresponds to the evaluation of the different classifiers and the different parameters used.
1http://adni.loni.ucla.edu/ last accessed on 13 October 2012
2. Background
7. Knowledge presentation, where visualization and knowledge representation techniques are used to show the acquired knowledge to the user. In our case, this will be performed when we show our results and rules to the medical doctors.
Classification can be described as a two-stage process [23]. In the first stage, a classifier describing a set of data classes is built. This stage is designated the learning step or training phase. In this stage the classification algorithm is "learning from" a training set composed of data instances, each made up of an n-dimensional attribute vector, X = (x_1, ..., x_n), and a class label. In this case, X is a set of attributes extracted from the neuropsychological data and the class label is the patient's mental health, given by a medical evaluation and categorized as MCI or AD. The attributes in vector X can be numerical or categorical. The instances used to train the classification algorithm compose the training set. This type of process is known as supervised learning, since the class label is provided for each X, in contrast to unsupervised learning algorithms, which do not know the class label or the number of classes to be learned in advance. In our case we could use unsupervised methods, for example, to obtain subsets of MCI. In the context of classification, the n-dimensional attribute vector representing an evaluation of the patient, together with the respective class label, is called an instance. In the second stage, the model obtained is used to classify the test set. The test set is a subset of data, independent from the training set, that is used to measure the accuracy of the classification model. It should be noted that in this work we only use supervised methods. However, as future work, unsupervised learning could be used; for example, to decrease the complexity presented to the classifier, clustering techniques can be applied to divide the MCI group into subgroups.
k-Nearest-Neighbour Classifiers
The k-nearest-neighbour (kNN) classifier learns by comparison. Suppose we define a metric to evaluate the distance between two instances, for example the Euclidean distance:

dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}   (2.1)

or the Manhattan distance:

dist(X_1, X_2) = \sum_{i=1}^{n} |x_{1i} - x_{2i}|   (2.2)
where X_1 = (x_{11}, x_{12}, ..., x_{1n}) and X_2 = (x_{21}, x_{22}, ..., x_{2n}). The kNN algorithm works as follows. For each instance X_i in the test set, find the K nearest instances in the training set (X_{i1}, X_{i2}, ..., X_{iK}). Then, classify the instance X_i as belonging to the most common class among the K nearest neighbours. Typically, the values of each attribute are normalized before using the Euclidean distance. This prevents the under-weighting of attributes with a smaller range relative to attributes with a larger range. To deal with missing values, the classifier assumes the
highest difference between the two attributes. This can cause mistakes in the classifier. However, adequate pre-processing can overcome this limitation of the algorithm, for example by replacing missing values with the mean value of the attribute [12].
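As an illustration, the kNN procedure above (min-max normalization followed by a majority vote among the K nearest neighbours) can be sketched in a few lines of Python. The thesis's experiments used WEKA [22]; the data values and labels below are made up for illustration only.

```python
import math
from collections import Counter

def normalize(train, test):
    """Min-max normalize each attribute using ranges from the training set,
    so attributes with larger ranges do not dominate the distance (eq. 2.1)."""
    n = len(train[0][0])
    lo = [min(x[i] for x, _ in train) for i in range(n)]
    hi = [max(x[i] for x, _ in train) for i in range(n)]
    def scale(x):
        return [(x[i] - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
                for i in range(n)]
    return [(scale(x), c) for x, c in train], [scale(x) for x in test]

def knn_predict(train, x, k=3):
    """Classify x as the majority class among its k nearest training
    instances, using the Euclidean distance of eq. (2.1)."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], x))[:k]
    return Counter(c for _, c in nearest).most_common(1)[0][0]

# Two attributes per instance (e.g. two test scores); labels are illustrative.
train = [([30, 1], "MCI"), ([28, 2], "MCI"), ([15, 6], "AD"), ([12, 7], "AD")]
train_n, test_n = normalize(train, [[27, 2], [13, 6]])
print([knn_predict(train_n, x, k=3) for x in test_n])  # ['MCI', 'AD']
```

Normalizing with the training set's ranges (rather than each set's own) keeps the test instances on the same scale as the model.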
Naïve Bayes
Bayesian classifiers are statistical methods that can forecast class membership probabilities using the Bayes theorem in (2.3):

P(H|X) = \frac{P(X|H) P(H)}{P(X)}   (2.3)
where H represents some hypothesis, such as belonging to a class C_i, and X is the instance. Studies comparing classification algorithms have found that a simple Bayesian classifier, such as Naïve Bayes, can in some cases be comparable in performance to decision tree and neural network classifiers [23]. Bayesian classifiers have also demonstrated high accuracy and speed when applied to large amounts of data. Naïve Bayes classifiers assume that attributes are independent and work as follows [23]:
1. Assume D to be a training set (n-dimensional attribute vector and respective class label).
2. Suppose that there are m classes, C_1, C_2, ..., C_m. Given a test instance, X, the classifier will predict the class X belongs to by choosing the class having the highest posterior probability, P(C_i|X):

P(C_i|X) > P(C_j|X) \quad \forall j \neq i   (2.4)

The class maximizing P(C_i|X) is called the maximum a posteriori hypothesis.
3. Since P (X) is constant for all classes we only need to maximize P (X|H)P (H). Replacing
hypothesis H by the Class Ci we have P (X|Ci)P (Ci). If the class prior probabilities are
unknown, then we assume that all classes are equally likely, and we maximize P (X|Ci).
4. Since Naïve Bayes classifiers assume that attributes are conditionally independent, it follows that

P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times ... \times P(x_n|C_i)   (2.5)
5. Estimation of P(x_k|C_i) is performed differently for categorical and continuous-valued attributes. For categorical attributes, P(x_k|C_i) is the number of instances of class C_i in the training set having the value x_k for attribute k, divided by the number of instances of class C_i in the training set. For continuous-valued attributes, the probability density function must be estimated. A simple approach is to assume that P(x_k|C_i) is normally distributed, in which case:

P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})   (2.6)
g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}   (2.7)

where \mu is the expected value of x_k and \sigma is the standard deviation.
6. To predict the class label of the test instance, X, we use (2.5) for each class and then, by
applying (2.4) we obtain the most probable class.
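The six steps above can be sketched as follows; a minimal Gaussian Naïve Bayes in Python under the assumption of eq. (2.6), not the WEKA implementation used in the thesis, and with made-up example data.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(train):
    """For each class Ci, estimate the prior P(Ci) and the per-attribute
    mean/std used in the Gaussian density g(x, mu, sigma) of eq. (2.7)."""
    by_class = defaultdict(list)
    for x, c in train:
        by_class[c].append(x)
    model = {}
    for c, rows in by_class.items():
        n = len(rows)
        stats = []
        for i in range(len(rows[0])):
            vals = [r[i] for r in rows]
            mu = sum(vals) / n
            var = sum((v - mu) ** 2 for v in vals) / n
            stats.append((mu, math.sqrt(var) or 1e-9))  # avoid sigma == 0
        model[c] = (n / len(train), stats)
    return model

def gaussian(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def nb_predict(model, x):
    """Choose the class maximizing P(Ci) * prod_k P(xk|Ci), eqs. (2.4)-(2.5)."""
    def score(c):
        prior, stats = model[c]
        p = prior
        for xi, (mu, sigma) in zip(x, stats):
            p *= gaussian(xi, mu, sigma)
        return p
    return max(model, key=score)

model = fit_gaussian_nb([([30.0], "MCI"), ([28.0], "MCI"),
                         ([14.0], "AD"), ([12.0], "AD")])
print(nb_predict(model, [27.0]))  # MCI
```

For many attributes the product in (2.5) can underflow; summing log-probabilities instead is the usual remedy.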
Decision Trees
A decision tree is a model structure where each non-leaf node has a test on an attribute, each
branch represents an outcome of the test and each leaf has a class label (see Figure 2.1). The
top node is the root node. In Figure 2.1 the root node is the node that tests Attribute 1.
Figure 2.1: A decision tree that uses 2 attributes: Attribute 1, which is ternary (Low, Normal and High), and Attribute 2, which is binary (True or False). Each internal or non-leaf node (in blue) represents a test on an attribute. Each leaf node (in orange) represents a class (Class 1 or Class 2).
When this classifier receives an instance with an unknown class label, the attributes of the instance are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class label for that instance. In the case of Figure 2.1, an instance X = {Attribute 1 = Normal, Attribute 2 = False} would first be tested at the Attribute 1 node (the root node). As Attribute 1 = Normal, it would then be tested against Attribute 2, and since Attribute 2 = False the tree would predict that the instance belongs to Class 1. If Attribute 1 were Low or High, only that first test would be necessary, and thus Attribute 2 would not be used.
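This traversal can be sketched in Python. The encoding of the tree of Figure 2.1 below is hypothetical (in particular, which class sits under the Low and High branches is assumed for illustration); only the Normal/False path matches the worked example in the text.

```python
# Hypothetical encoding of a tree like Figure 2.1: each internal node is a
# (attribute, {outcome: subtree}) pair; each leaf is a class label string.
# Leaf placement under Low/High is assumed, for illustration only.
tree = ("Attribute 1", {
    "Low": "Class 1",
    "High": "Class 2",
    "Normal": ("Attribute 2", {True: "Class 2", False: "Class 1"}),
})

def classify(node, instance):
    """Trace a path from the root to a leaf, testing one attribute per node."""
    while not isinstance(node, str):      # internal nodes are tuples
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

x = {"Attribute 1": "Normal", "Attribute 2": False}
print(classify(x and tree, x))  # Class 1
```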
A basic algorithm for the construction of a decision tree receives as input a set of instances, an attribute list and an attribute selection method. The set of instances is initially the training set but, since the algorithm is recursive, this set changes along the execution. The attribute list is a list of the attributes that describe the instances. Finally, the attribute selection method specifies a heuristic procedure for selecting the attribute that best discriminates the instances according
to the class. This procedure uses an attribute selection measure, such as information gain [48], gini index [48] or gain ratio [3]. The tree begins as a single node, N, representing the training set. If all the instances have the same class label, the node becomes a leaf, is annotated with that class, and the algorithm ends. If not, the attribute selection method is called to determine the splitting criterion. This method determines the best way to partition the instances into individual classes. The splitting criterion indicates which branches should be grown from N with respect to the outcomes of the chosen test. The node N is then labelled with the splitting criterion, which is the test at the node. A branch is grown from N for each of the outcomes of the splitting criterion, and the instances are divided accordingly. The splitting attribute falls into one of two possible scenarios: discrete-valued or continuous-valued. If the splitting attribute is discrete-valued, the outcomes of the test at N correspond directly to the known values of the attribute. A branch is created for each value of the attribute. In this case the attribute is removed from the attribute list, since it will not be considered in any further split. If the splitting attribute is continuous-valued, N has two outcomes: Attribute ≤ splitting point and Attribute > splitting point. The splitting point is returned by the attribute selection method, and two branches are grown from N with the outcomes as labels. The algorithm is recursive, repeating the process for each subset of the training set created. The possible stop conditions of the algorithm are:
• All the instances in the training set belong to the same class.
• There are no more attributes to split on. In this case N is converted to a leaf and labelled with the most common class in the current training set.
• There are no instances for a given branch. In this case, a leaf is created with the majority class in the current training set.
Finally, the decision tree is returned.
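The recursive construction above can be sketched as follows; a simplified Python version for discrete-valued attributes only, where the attribute selection method is passed in as a function (the trivial one used in the test data is purely illustrative).

```python
from collections import Counter

def build_tree(instances, attributes, select):
    """Basic recursive decision-tree construction over discrete-valued
    attributes, following the stop conditions listed above.
    `instances` is a list of (attribute_dict, class_label) pairs;
    `select` is the attribute selection method (e.g. information gain)."""
    classes = [c for _, c in instances]
    majority = Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:          # all instances share one class
        return classes[0]
    if not attributes:                  # no attributes left to split on
        return majority
    best = select(instances, attributes)
    branches = {}
    for value in {x[best] for x, _ in instances}:
        subset = [(x, c) for x, c in instances if x[best] == value]
        # discrete-valued split: remove the attribute from the list
        remaining = [a for a in attributes if a != best]
        branches[value] = build_tree(subset, remaining, select)
    # the stored majority class covers attribute values unseen in this subset
    return (best, branches, majority)

data = [({"a": "x", "b": "p"}, "C1"), ({"a": "y", "b": "p"}, "C2")]
tree = build_tree(data, ["a", "b"], lambda inst, attrs: attrs[0])
print(tree[0], tree[1]["x"], tree[1]["y"])  # a C1 C2
```

Since branches are only grown for values that occur in the subset, the "empty branch" stop condition is handled at classification time via the stored majority class.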
If the training set has noise or outliers, branches reflecting this problem will be generated. Tree pruning tries to identify and remove such branches. The most commonly used attribute selection measures are the following:
• Information Gain [23]
The information gain uses the value of the information content of messages. The attribute chosen to split is the one with the highest information gain. This attribute minimizes the information needed to classify the instances in the resulting partitions and maximizes the homogeneity of the class label in the resulting partitions. This approach produces simple trees and reduces the number of tests. Let D be the instance set. The information of D, called Gain(D), is defined as follows:

Gain(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)   (2.8)
where p_i is the probability that a random instance in D belongs to class C_i, estimated by |C_{i,D}|/|D|. A base-2 logarithm is used since the information is encoded in bits. Gain(D) is also known as the entropy of D.
If the attribute is discrete-valued with v distinct values, v branches will be grown. In the ideal case we want each partition to contain only instances from the same class, but that is rarely achieved. Thus, we need to know how much more information is needed in order to obtain an exact classification. We use the next expression for this purpose. Let D_j be the set of instances in partition j:

Gain_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} Gain(D_j)   (2.9)

The term |D_j|/|D| represents the weight of partition j. Gain_A(D) is the expected information required to classify an instance from D based on this partition. Smaller values mean more homogeneous partitions with respect to the class label.
The information gain is given by the difference between the original information, based only on the proportion of classes, and the information obtained after the partition:

Gain(A) = Gain(D) - Gain_A(D)   (2.10)
In the case of continuous-valued attributes, we have to determine the best splitting point, where the split point is a threshold on the attribute, after sorting the values of the attribute in increasing order. In general, we use the midpoint between each pair of adjacent values. The information gain attribute selection measure is used in the ID3 algorithm [47].
• Gain ratio [23]
C4.5 uses an extension of information gain, called gain ratio, which is computed as follows:

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}   (2.11)

where SplitInfo_A(D) represents the potential information generated by splitting the training set:

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)   (2.12)

In this setting, the attribute with the highest gain ratio is chosen. The C4.5 algorithm thus applies a kind of normalization to the information gain using the split information value.
• Gini index [23]
The Gini index measures the impurity of a set of instances using the following expression:

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2   (2.13)
where p_i represents the probability that an instance of D belongs to class C_i, estimated as |C_{i,D}|/|D|. CART uses the Gini index and considers a binary split for each attribute. In order to compute the attribute with the best binary split, in the case of a discrete-valued attribute, we have to analyse all of the possible subsets that can be formed using the known values of the attribute. Each subset can be considered as a binary test for the attribute.
In the case of two splits D_1 and D_2, Gini_A(D) is given by:

Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)   (2.14)
In the case of continuous-valued attributes, each possible binary split is analysed. This is similar to the approach used for information gain, where the midpoint of each sorted pair of values is taken as a possible split. The point giving the minimum Gini index value is the chosen split point.
The reduction in impurity obtained by performing a binary split on an attribute A is given by:

\Delta Gini(A) = Gini(D) - Gini_A(D)   (2.15)
CART uses the Gini index.
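The attribute selection measures above can be sketched in Python; a minimal illustration over lists of (attribute-dict, class) instances, with made-up data, not the thesis's WEKA implementation.

```python
import math
from collections import Counter

def entropy(instances):
    """Gain(D) in eq. (2.8): minus the sum of pi*log2(pi) over class proportions."""
    total = len(instances)
    counts = Counter(c for _, c in instances)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def information_gain(instances, attribute):
    """Gain(A) in eq. (2.10): entropy of D minus the weighted entropy of the
    partitions induced by the attribute's values (eq. 2.9)."""
    total = len(instances)
    weighted = 0.0
    for value in {x[attribute] for x, _ in instances}:
        subset = [(x, c) for x, c in instances if x[attribute] == value]
        weighted += len(subset) / total * entropy(subset)
    return entropy(instances) - weighted

def gini(instances):
    """Gini(D) in eq. (2.13): 1 minus the sum of squared class proportions."""
    if not instances:
        return 0.0
    total = len(instances)
    return 1.0 - sum((n / total) ** 2
                     for n in Counter(c for _, c in instances).values())

def gini_split(instances, predicate):
    """Gini_A(D) in eq. (2.14) for the binary split induced by `predicate`."""
    d1 = [(x, c) for x, c in instances if predicate(x)]
    d2 = [(x, c) for x, c in instances if not predicate(x)]
    total = len(instances)
    return len(d1) / total * gini(d1) + len(d2) / total * gini(d2)

data = [({"test": "low"}, "AD"), ({"test": "low"}, "AD"),
        ({"test": "high"}, "MCI"), ({"test": "high"}, "MCI")]
print(entropy(data))                                   # 1.0
print(information_gain(data, "test"))                  # 1.0
print(gini(data))                                      # 0.5
print(gini_split(data, lambda x: x["test"] == "low"))  # 0.0
```

In the example the attribute separates the two classes perfectly, so the information gain equals the full entropy and the Gini of the split drops to zero.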
Neural Networks
As in [23], a neural network is a set of connected input/output units, or neurodes, where each connection has an associated weight, as shown in Figure 2.2. Neural networks are computational analogues of biological neurons. Given a neurode j in a hidden or output layer, the net input I_j to the neurode is:

I_j = \sum_{i} w_{ij} O_i + \theta_j   (2.16)

where w_{ij} is the weight of the connection from neurode i to neurode j; O_i is the output of neurode i in the previous layer; and \theta_j is the bias of the neurode, which allows varying the neurode's activity. Each neurode in the hidden or output layers uses an activation function. This function is a non-linear, differentiable logistic function that allows for the classification of problems that are not linearly separable [23].
A neural network is simply a set of neurodes organized in layers. The neurodes in the input layer are called input neurodes. The inputs to the neural network correspond to the attributes measured in each training instance. The input instance is fed simultaneously into the neurodes of the input layer. These inputs pass through the input layer and are weighted and used as input to the second layer, or hidden layer. The outputs of one hidden layer can be the input of the next hidden layer, and so on. The weighted outputs of the last hidden layer are the inputs of the output layer neurodes, which emit the network's prediction for a given instance.
A network is called feed-forward if none of the weights cycles back to a hidden neurode or to an output neurode of a previous layer. The network is fully connected if each neurode provides
Figure 2.2: A neurode of a hidden or output layer. The inputs to the neurode are the outputs of the previous layer. These are multiplied by their respective weights to form a weighted sum, which is then added to the neurode's bias. A non-linear activation function is finally applied to produce the output. If the hidden layer is the first one, its inputs correspond to the input instance.
an input to all neurodes in the next layer. Such a network can model the class prediction as a non-linear combination of the inputs, that is, as a non-linear regression. Given enough hidden neurodes and enough training instances, a neural network can closely approximate any function [23].
The network topology is defined by the user, that is, the number of hidden layers, the number of neurodes in each hidden layer, and the number of input and output neurodes. The choice of these values is usually a trial-and-error process and may affect the accuracy of the final model. Neural networks can be used for classification (predicting the instance label) or for prediction of a continuous-valued output. For classification, one output neurode may be used to represent two classes (where 1 represents one class and 0 the other). If the problem has more than two classes, one output neurode is used per class [23].
Backpropagation is the most common neural network learning algorithm. Backpropagation learns by iteratively comparing the predicted output on the training set with the real value. This value may be a class label, in the case of classification, or a continuous value, in the case of prediction. Then, the weights (w_{ij}, \theta_j) of the network are adjusted to minimize the mean square error (MSE) between the network's prediction and the actual target value of the instance2. These adjustments are made by computing the derivative of the error with respect to each weight. The learning process stops when the weights converge. The backpropagation algorithm can, in a very simple way, be divided in two phases: propagation and weight update. After receiving the input parameters, the response of a unit is propagated as input to the neurodes in the next layer until the
2The MSE is the most common metric for the error but there are other metrics.
Figure 2.3: Multilayer feed-forward neural network: a set of layers comprising one input layer, one or more hidden layers and an output layer.
output layer, where the response of the network is obtained and the error is computed as:

Err_j = (O_j - T_j)^2   (2.17)

where O_j is the observed output of neurode j and T_j is the known target value for the given training instance. The error is then propagated backwards, from the output layer to the first hidden layer, and the synaptic weights are adjusted along the way.
A disadvantage of neural networks, besides the generally long training time, is their poor interpretability. It is difficult for humans to interpret the symbolic meaning behind the learned weights and the hidden units of the network. Advantages of neural networks include their high tolerance to noise in the data, their ability to classify patterns they have not been trained on, their ease of use when little is known about the relation between attributes and classes, and their suitability for continuous-valued inputs and outputs, in contrast to decision trees [23].
Support vector machines
Support vector machines (SVMs) are a method for linear classification of data, as shown in Figure 2.4. Non-linear classification can however be achieved by applying a non-linear kernel to the data; this transforms the data into a higher-dimensional space where linear classification can be applied. In the original space, this results in non-linear classification. In short, an SVM works as follows [23]: it uses a non-linear mapping φ() to transform the original training data into a higher dimension, Y = φ(X). In the new dimension it searches for the optimal linear hyperplane that separates the classes. In the SVM sense, the optimal hyperplane (W) is the one that maximizes the margin (distance) between the two classes (in the transformed space), as shown in Figure 2.4.
Once the optimal hyperplane is found, the classification between two classes can be achieved
Figure 2.4: Linearly separable data can be divided by a straight line (between the dashed lines). The straight line shown is the maximum-margin hyperplane.
by computing:
d(Y ) = sign(W · Y + b) (2.18)
To find the optimal hyperplane, a linear combination of training points can be used:

W = \sum_{i} \alpha_i C_i Y_i   (2.19)

where C_i \in \{1, -1\} indicates the true class of instance Y_i = φ(X_i), and \alpha_i is a coefficient indicating how difficult it is to classify instance X_i. Using (2.19), one can rewrite the decision function as:

d(Y) = sign\left(\sum_{i} \alpha_i C_i Y_i \cdot Y + b\right)   (2.20)
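The decision function of eq. (2.20) can be sketched in Python; with a kernel, the inner product Y_i · Y in the mapped space is replaced by K(X_i, X). The support vectors, coefficients and bias below are hypothetical values chosen for illustration, not the output of any training run.

```python
import math

def rbf_kernel(x1, x2, gamma=0.001):
    """Gaussian RBF kernel K(X1, X2) = exp(-gamma * ||X1 - X2||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))

def svm_decide(x, support_vectors, alphas, classes, b, kernel):
    """Eq. (2.20): sign(sum_i alpha_i * C_i * K(X_i, X) + b)."""
    total = sum(a * c * kernel(sv, x)
                for a, c, sv in zip(alphas, classes, support_vectors))
    return 1 if total + b >= 0 else -1

# Hypothetical trained values, for illustration only.
svs = [[1.0, 1.0], [3.0, 3.0]]
alphas = [0.8, 0.8]
classes = [1, -1]          # C_i in {1, -1}
print(svm_decide([1.2, 0.9], svs, alphas, classes, 0.0, rbf_kernel))  # 1
```

A test point lands on the side of whichever support vector it is closer to under the kernel, which is what the sign of the weighted sum captures.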
In the non-linear classification case one can define a non-linear transformation kernel K(X_1, X_2) =
The parameter grid searched for each classifier is:
Naïve Bayes: Gaussian, supervised discretization, or kernel estimation
RBF SVM: Complexity ∈ [1, 10] and γ ∈ [10^-5, 10^2]
Poly SVM: Complexity ∈ [1, 10] and Degree ∈ [0.5, 5.0]
C4.5 DT: Confidence ∈ [0.05, 0.5]
ANN: Time ∈ [1000, 2000], LearningRate ∈ [0.1, 0.4] and Momentum ∈ [0.1, 0.3]
kNN: k ∈ [1, 10]
Figure 3.2: Data flow used in the parameter grid search for finding the classifier parameters. The data goes through feature selection and SMOTE synthetic oversampling; 10 non-overlapping folds are generated for cross-validation, with the classifier (model) trained on the training set and evaluated on the testing set to produce the results. The SMOTE percentage is tested with 11 different values for each parameter combination.
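The SMOTE oversampling step in this flow can be sketched in Python; a simplified version of the technique (synthetic minority instances interpolated between a minority instance and one of its k nearest minority neighbours), with made-up data, not the WEKA filter used in the thesis.

```python
import random

def smote(minority, percentage, k=5, seed=0):
    """Generate synthetic minority-class instances by interpolating between
    each minority instance and one of its k nearest minority neighbours.
    `percentage` follows SMOTE's convention: 200 means 2 synthetic
    instances per original minority instance."""
    rng = random.Random(seed)
    per_instance = percentage // 100
    synthetic = []
    for x in minority:
        neighbours = sorted((y for y in minority if y is not x),
                            key=lambda y: sum((a - b) ** 2
                                              for a, b in zip(x, y)))[:k]
        for _ in range(per_instance):
            nb = rng.choice(neighbours)
            gap = rng.random()          # random point along the segment x..nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
new = smote(minority, 200)
print(len(new))  # 8
```

Because the synthetic points lie between real minority instances, the minority region is densified rather than merely duplicated.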
Testing Model
To test the obtained classification model, a different data set is used, which was obtained by splitting the original dataset into 75% of patients for training and 25% of patients for testing. For this we apply stratification based on: (i) number of evaluations; (ii) age; (iii) sex; (iv) schooling years and (v) class. The split of patients was therefore made such that the distribution of the above variables is kept constant in the training and testing datasets.
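A stratified split of this kind can be sketched in Python. For simplicity the sketch stratifies on discrete variables only (continuous ones such as age or schooling years would first need binning); the patient records and stratum choice below are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(patients, key, test_fraction=0.25, seed=0):
    """Split patients into train/test sets so that the joint distribution of
    the stratification variables (returned by `key`) is kept roughly
    constant. `patients` is a list of dicts; `key` maps a patient to a
    stratum tuple."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in patients:
        strata[key(p)].append(p)
    train, test = [], []
    for group in strata.values():
        rng.shuffle(group)                       # random choice within stratum
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

patients = [{"sex": s, "class": c, "id": i}
            for i, (s, c) in enumerate([("M", "MCI"), ("F", "MCI"),
                                        ("M", "AD"), ("F", "AD")] * 4)]
train, test = stratified_split(patients, key=lambda p: (p["sex"], p["class"]))
print(len(train), len(test))  # 12 4
```

Splitting inside each stratum guarantees that every (sex, class) combination is represented in both sets in the same proportion as in the full dataset.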
This allows the test set to be used in all problems: diagnosis, prognosis and any future problem tackled in the NEUROCLINOMICS project. It should be noted that the test set will not be used
3. Differentiating MCI from AD (Diagnosis)
Figure 3.3: Data flow used to simulate the real-world results. The training set goes through SMOTE synthetic oversampling, the classifier (model) is built with the chosen parameters, and the held-out test set produces the results.
to find the best parameter set. It is only used to evaluate the final models created using the training set. These models use only the training set to find the best features and the best parameters for that specific data set. This allows us to analyse the behaviour of the trained models in a "real world" simulation, since the model has never been in contact with any instance of the test patients. Note also that the features and the parameters are selected using only 75% of the data, avoiding overfitting in the feature selection and parameter grid search. Such overfitting would compromise the generalization of the model.
Table 3.4: Details of the train set. (Columns: Normal, Pre-MCI, MCI, AD.)
Tables 3.4 and 3.5 detail the obtained train and test sets. The stratification was done taking into consideration the number of evaluations of each patient in the raw dataset. As can be concluded from the tables, the distribution of instances in the two sets is very similar.
3.2 Missing values
In this section, the aim is to study the missing values in a more systematic way, to find out how they impact the classification results. For this we test a variety of strategies to deal with missing values, such as: use median/mode imputation, use median/mode imputation only from
the patient's previous instances, use linear regression over the patient's evolution for the imputation of missing values, and use a single value to represent a missing value.
Missing Minimization We use two strategies to reduce the number of missing values: in the first, we use the average value of the feature over the patient's other evaluations to determine a value; in the second, a linear regression over those evaluations. These strategies will not remove every missing value, but will reduce their number significantly.
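These two strategies can be sketched in Python; a simplified illustration where a patient is a chronological list of evaluations, a missing value is `None`, and the evaluation index stands in for time. Function names and the tiny example are made up, not the thesis's actual implementation.

```python
def impute_from_history(evaluations, strategy="average"):
    """Fill missing values (None) in a patient's sequence of evaluations
    using that patient's other evaluations of the same feature: either
    their average, or a least-squares line over the evaluation index."""
    n_features = len(evaluations[0])
    filled = [row[:] for row in evaluations]
    for j in range(n_features):
        known = [(i, row[j]) for i, row in enumerate(evaluations)
                 if row[j] is not None]
        if not known:
            continue  # no information for this feature; values stay missing
        if strategy == "average" or len(known) < 2:
            mean = sum(v for _, v in known) / len(known)
            estimate = lambda i, m=mean: m
        else:  # least-squares line v = a*i + b over the known evaluations
            n = len(known)
            sx = sum(i for i, _ in known); sy = sum(v for _, v in known)
            sxx = sum(i * i for i, _ in known)
            sxy = sum(i * v for i, v in known)
            denom = n * sxx - sx * sx
            a = (n * sxy - sx * sy) / denom if denom else 0.0
            b = (sy - a * sx) / n
            estimate = lambda i, a=a, b=b: a * i + b
        for i, row in enumerate(filled):
            if row[j] is None:
                row[j] = estimate(i)
    return filled

# Three evaluations of one patient, two features, two missing values.
evals = [[20.0, 5.0], [None, 6.0], [16.0, None]]
print(impute_from_history(evals, "linear"))  # [[20.0, 5.0], [18.0, 6.0], [16.0, 7.0]]
```

The regression variant captures a patient's decline over time, which a plain average would flatten out.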
Random Assumption In the majority of work done on missing value analysis [12], the assumption of random occurrence is made. In our data we know that this assumption is probably in part fallacious: a doctor may not perform some test if the patient has a low score on another test, if the patient is simply too tired, because of time restrictions, and so on. Nevertheless, the main techniques were tested, to observe whether this assumption can improve the overall classification for a set of classifiers.
Non-random Assumption Now we use the assumption that the missing values do not appear at random. In fact, the existence of a missing value may have discriminative power. The techniques to minimize missing values are not used, since the assumption now is that the missing values are purposeful. For this study all features are nominal and some experiments use a discretized dataset, created using a supervised discretization algorithm [12]. To study this assumption we use the value "MISSING", discretize the data, and then analyse whether there was some improvement in discriminative power.
3.2.1 Experimental Setup
Using the knowledge acquired previously, such as the best feature selection technique and the oversampling percentage of the minority class (SMOTE), this setup uses four classification techniques: a linear SVM, an RBF SVM, a C4.5 decision tree and Naïve Bayes.
The configuration is:
• C4.5 decision tree [46] with a 0.25 confidence factor.
• Naïve Bayes classifier [28], assuming that each feature's probability density function (pdf) follows a Gaussian.
• Support vector machine (SVM) [30] using either a linear or a radial basis function (RBF) kernel (γ = 0.001). The complexity parameter, which defines the maximum weight of the support vectors, is C = 1 for the linear kernel and C = 2 for the RBF kernel.
As feature selection, the subset covariance method [21] is used. In the random assumption test, we combine the datasets resulting from using the average or the linear regression over the patient's evaluations with two techniques to remove the remaining missing values. These techniques are implemented in WEKA: replacement of missing values with the median or mode, and expectation maximization imputation. In the non-random assumption, the missing values were replaced by the string "MISSING". The numerical features are now nominal; for this, a supervised discretization algorithm was also used to find the best discretization.
3.2.2 Results
[Bar chart, titled "Random Assumption": |TPR − FPR| (vertical axis, 0.3 to 0.7) for NB, SVM Linear, SVM RBF and C4.5 DT, comparing Original, Median/Mode, EM, AVG, AVG + Median/Mode, AVG + EM, LR, LR + Median/Mode and LR + EM.]
Figure 3.4: The influence of replacing missing values with different techniques, using the metric |TPR − FPR|. The algorithms used for classification are NB (Naïve Bayes), SVM Linear, SVM RBF (Gaussian) and a C4.5 DT (decision tree). The imputation techniques are median and mode (for numerical and categorical attributes, respectively) and expectation maximization. The missing minimization techniques are AVG (average) and LR (linear regression), both computed using the other evaluations of the patient.
A comparative study was made to assess the influence of this assumption on replacing the missing values. In Figures 3.4 and 3.5, the comparison uses the |TPR − FPR| metric, since this metric shows the trade-off between sensitivity and specificity. Analysing the results, the best technique to deal with the missing values is, as expected, dependent on the classification algorithm.
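Computing this metric from predictions is straightforward; a Python sketch for the binary case (one class taken as positive against the rest), with made-up labels.

```python
def tpr_fpr_gap(actual, predicted, positive="AD"):
    """Compute |TPR - FPR|, the metric used in Figures 3.4 and 3.5,
    from the confusion counts of a binary problem."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    tpr = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    fpr = fp / (fp + tn) if fp + tn else 0.0   # 1 - specificity
    return abs(tpr - fpr)

actual    = ["AD", "AD", "AD", "MCI", "MCI", "MCI"]
predicted = ["AD", "AD", "MCI", "MCI", "MCI", "AD"]
print(tpr_fpr_gap(actual, predicted))  # 0.3333333333333333
```

A classifier that labels everything as the positive class gets TPR = FPR = 1 and therefore a gap of 0, which is why this metric is more informative than plain accuracy on imbalanced data.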
The best results for each method are:
• In Naïve Bayes, with the non-random assumption, the discretized dataset shows a slightly better result than the original dataset (note that the original dataset still contains the missing values). But all datasets have satisfactory results.
• In the linear SVM, with the random assumption, using the original dataset also yields the best result. In this case, using linear regression to minimize the missing values has a similar
[Bar chart, titled "Non Random Assumption": |TPR − FPR| (vertical axis, 0.3 to 0.7) for NB, SVM Linear, SVM RBF and C4.5 DT, comparing Original, Unique value, Discretized and Discretized + Unique value.]
Figure 3.5: The influence of replacing missing values with different techniques, using the metric |TPR − FPR|. The algorithms used for classification are NB (Naïve Bayes), SVM Linear, SVM RBF (Gaussian) and a C4.5 DT (decision tree), using a unique value to represent missing values, and also a discretized dataset.
result.
• In RBF SVM, with random assumption, the use of linear regression with median/mode or
average with median/mode is marginality better that using the original data set.
• In the C4.5 decision tree, the effects of dealing with the missing values are more evident. The
best technique is median/mode imputation of all missing values, under the random assumption.
Note that for this classifier the datasets without missing minimization are by far the best.
• For kNN and Neural Networks the results are not shown, but in both cases the random
assumption with median/mode imputation gives the best results.
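The median/mode imputation referred to above can be sketched as follows. This is an illustrative Python re-implementation, not the WEKA filter actually used in this work, and the mini-dataset (an MMSE score and a gender attribute) is a made-up example:

```python
from statistics import median, mode

def impute_median_mode(rows, numeric_cols):
    """Replace None entries: column median for numeric columns, column mode otherwise."""
    cols = rows[0].keys()
    fill = {}
    for c in cols:
        observed = [r[c] for r in rows if r[c] is not None]
        fill[c] = median(observed) if c in numeric_cols else mode(observed)
    return [{c: (fill[c] if r[c] is None else r[c]) for c in cols} for r in rows]

# Hypothetical mini-dataset: MMSE (numeric) and gender (categorical), with missing entries.
patients = [
    {"MMSE": 28.0, "gender": "F"},
    {"MMSE": None, "gender": "M"},
    {"MMSE": 22.0, "gender": None},
    {"MMSE": 25.0, "gender": "F"},
]
filled = impute_median_mode(patients, numeric_cols={"MMSE"})
# The missing MMSE becomes the median 25.0; the missing gender becomes the mode "F".
```

Computing the fill values once per column, rather than per row, matches the batch nature of the WEKA-style filters compared in this section.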
3.2.3 Conclusions
The techniques that increase the discriminative power were obtained under the randomness
assumption. This does not imply that the missing values are random, only that, in the performed
experiments, the techniques based on the randomness assumption achieved better results. Perhaps
with other techniques, and by incorporating neuropsychological domain information into the
classifiers, the non-randomness assumption would yield the expected results. Nevertheless,
the gain in discriminative power is not significant: the results obtained using each classifier's
default way of dealing with missing values are nearly identical and in some cases even
better.
3. Differentiating MCI from AD (Diagnosis)
3.3 Results and Discussion
For the diagnosis problem, six triples have been selected for each feature set. Each triple
consists of the classifier, the parameter set and the SMOTE percentage. The triples are shown in Table 3.6.
These results were obtained with a grid search using only the training set. The box plots with the
classification results on the training set are shown in Figure 3.6.
Figure 3.6: Training results of the diagnosis for the 3 feature sets (box plots of |TPR − FPR| for kNN, Neural Network, C4.5 decision tree, Naïve Bayes, SVM RBF and SVM Poly).
Additionally, for each of the classifiers the following method is used to deal with missing values:
• In Naïve Bayes, the internal way of ignoring the missing data is used.
• In SVMs, the internal median/mode imputation is used.
• In kNN, median/mode imputation is used; the internal option of using the maximal
possible distance in missing cases is never used.
• In the C4.5 decision tree, median/mode imputation is used.
• In Neural Networks, median/mode imputation is used.
Table 3.6 presents the best set of parameters found by the grid search. It should be noted
that, to balance the two classes, a synthetic oversampling of 600% would have to be applied.
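SMOTE, the oversampling technique used throughout this work, generates synthetic minority-class instances by interpolating between a minority sample and one of its k nearest minority neighbours. A minimal NumPy sketch of the idea (not the exact implementation used in this work; the toy data and the 200% level are purely illustrative):

```python
import numpy as np

def smote(X_min, percent, k=5, seed=0):
    """Create percent/100 * len(X_min) synthetic minority samples."""
    rng = np.random.default_rng(seed)
    n_new = int(len(X_min) * percent / 100)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy minority class; 200% oversampling yields 8 synthetic samples.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote(X_minority, percent=200)
```

Because each synthetic point is a convex combination of two real minority samples, SMOTE densifies the minority region instead of merely duplicating instances, which is what lets it reshape the decision frontier as discussed below.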
Naïve Bayes
The Naïve Bayes is, by design, insensitive to class imbalance, since the probabilities of belonging
Table 3.6: Diagnosis parameters, using the training set. The number of features selected is 153 for All Features, 32 for Correlation and 22 for mRMR. In these sets, the percentage of missing values is 45% for All Features, 13% for Correlation and 29% for mRMR.
Classifier           Feature Selection   SMOTE   Parameters
Naïve Bayes          All Features        0%      Supervised discretization
                     Correlation         0%      Kernel
                     mRMR                1270%   Kernel
SVM RBF              All Features        635%    Compl = 2.5, γ = 0.01
                     Correlation         381%    Compl = 4.0, γ = 0.01
                     mRMR                508%    Compl = 2.0, γ = 0.01
SVM Poly             All Features        1143%   Compl = 1.5, Exp = 1
                     Correlation         1270%   Compl = 0.5, Exp = 4
                     mRMR                1143%   Compl = 0.5, Exp = 3
Neural Network       All Features        0%      l = 0.3, m = 0.2, time = 2000
                     Correlation         0%      l = 0.3, m = 0.1, time = 2000
                     mRMR                0%      l = 0.3, m = 0.1, time = 1000
Decision Tree C4.5   All Features        508%    Conf = 0.05
                     Correlation         1143%   Conf = 0.05
                     mRMR                1016%   Conf = 0.05
kNN                  All Features        508%    k = 9
                     Correlation         635%    k = 5
                     mRMR                1143%   k = 8
to a class are calculated using only that class, and the most probable class is then chosen.
In this case, oversampling is only used with the mRMR feature set. This allows the classifier to
overcome some data confusion near the decision frontier. For the full feature set, the numerical
values are handled by supervised discretization; for the other feature sets, the best way to
handle the numerical values is by applying kernel density estimation of the probability
density function.
SVM
The Gaussian SVM (SVM RBF) is considerably sensitive to class imbalance. For this reason, grid
search always chooses a synthetic oversampling percentage that nearly balances the classes.
Remarkably, for the polynomial SVM, the grid search leads to an inversion of the class distribution.
However, on average, considering only the training set, the results using a Gaussian kernel are
better than the results using a polynomial kernel.
In the polynomial SVM (SVM Poly), the applied SMOTE inverts the imbalance of the data in
all three cases. The AD class becomes overrepresented, which shows that this model prefers an
overrepresented AD class. As a side effect, this reduces the confusion at the class borders, since
they are now overpopulated with AD instances. Overpopulating the AD instances increases the
probability of correctly classifying the AD instances at the border; consequently, the MCI instances at
this border suffer higher misclassification. This increase in misclassification is more acceptable
than AD misclassification, since the dataset contains more MCI instances than AD instances.
The complexity found is relatively small, between 0.5 and 1.5. With more features (the original
case) a polynomial degree of 1 suffices, but the smaller feature sets use a higher
degree.
Neural Networks
For the Artificial Neural Networks (Neural Networks) case, it can be observed that SMOTE is
always kept at 0%. This shows that artificial neural networks, in all feature sets tested, are not
sensitive to the class imbalance. Nevertheless, the median |TPR − FPR| is generally worse in
all feature sets. For the diagnosis case, and taking into account the mean |TPR − FPR|, Neural
Networks are the worst model tested.
Decision Tree
In the C4.5 decision tree, the SMOTE percentage selected using all features almost leads to the
balanced state, but for correlation and mRMR the chosen oversampling inverts the balance
of the classes. This shows that for the reduced feature sets this classifier prefers to have the
AD class overrepresented. The chosen confidence is 0.05 in all feature sets. Lowering the
confidence increases the pruning of the tree, so the selected confidence shows that the model
places little confidence in the dataset.
kNN
k-Nearest Neighbour (kNN) is also sensitive to class imbalance. Thus, the selected SMOTE tends
towards the balanced state, except in the mRMR case, where the best SMOTE inverted the
data balance. Again, this can be explained by the need to define the classification frontiers by
overpopulating them with instances of the least represented class. The number of chosen neighbours
for the full set and the mRMR set is large, 9 and 8 respectively, which again indicates confusion
in the classification. For the correlation set, the number of neighbours is 5, indicating a less
confused dataset.
Statistically, we can compare the models that use the same feature set, using the training
set. For this analysis we use paired t-tests, applied in an all-vs-all fashion. The t-tests are only applied
if an ANOVA test with 95% confidence confirms the existence of a significant difference.
Using the paired all-vs-all approach, in each feature set:
• Original (All Features)
The SVM RBF shows a significant difference in all t-tests, with 95% confidence, being
the best model in all cases. The decision tree is in all cases the worst model, with 95%
confidence.
• Correlation
For the correlation-based feature set, the Naïve Bayes and SVM RBF are the best models,
with no statistical difference between them at 95% confidence. The Neural
Network is the worst model.
• mRMR
For the mRMR feature set, the SVM RBF has a significant difference to all other models
and is the best in all cases (at 95% confidence). Again, the Neural Network
is the worst model.
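The testing procedure used above, an ANOVA gate followed by all-vs-all paired t-tests, can be sketched with SciPy. The per-fold score arrays below are fabricated stand-ins for the real per-fold |TPR − FPR| values:

```python
from itertools import combinations
from scipy import stats

# Hypothetical per-fold |TPR - FPR| scores for three models evaluated on the same folds.
scores = {
    "NB":      [0.55, 0.60, 0.58, 0.62, 0.57, 0.59, 0.61, 0.56, 0.60, 0.58],
    "SVM_RBF": [0.60, 0.65, 0.63, 0.66, 0.62, 0.64, 0.65, 0.61, 0.64, 0.63],
    "C4.5":    [0.40, 0.45, 0.42, 0.44, 0.41, 0.43, 0.46, 0.39, 0.44, 0.42],
}

# Gate: only run the pairwise tests if ANOVA detects some difference at 95% confidence.
_, p_anova = stats.f_oneway(*scores.values())
if p_anova < 0.05:
    for a, b in combinations(scores, 2):
        # Paired t-test, since the models were scored on the same folds.
        _, p = stats.ttest_rel(scores[a], scores[b])
        verdict = "significant" if p < 0.05 else "not significant"
        print(f"{a} vs {b}: p = {p:.4f} ({verdict})")
```

Gating on ANOVA first limits the number of pairwise tests performed, reducing the risk of spurious "significant" pairs from repeated testing.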
Now, using the test set, which was never seen before, it is possible to evaluate the models'
behaviour in a "real world" environment. Using these results we can compare models that use different
feature sets, since the features were found without the help of the test set. Figure 3.7 shows the test results;
the scale of the plot is |TPR − FPR|.
Looking at the test results, we can see that 3 out of the 6 classification algorithms have, for
one feature set, a result |TPR − FPR| > 0.6. These classifiers are: Naïve Bayes with correlation,
Neural Networks with correlation and kNN with all features. In the SVMs, all results using the
different feature sets are nearly identical (around |TPR − FPR| ≈ 0.5). For the C4.5 decision tree,
the results with the different feature sets vary significantly; results with the original feature set are
considerably worse than for other models (|TPR − FPR| ≈ 0.2). For the kNN, the best results are
achieved using all features.
Analysing the results, we can see that the best features change from classifier to classifier,
which means that a single feature set is not always the best for all cases, but varies
from model to model. Table 3.7 shows, for the best classifiers, the most common metrics
and |TPR − FPR|.
Using the training set, the highest median value is |TPR − FPR| ≈ 0.6, obtained with the SVM RBF and
correlation-based feature selection (the results with the other feature sets are nearly identical).
However, when using the test set, this model drops to |TPR − FPR| ≈ 0.5. The maximum value
obtained over all algorithms is |TPR − FPR| ≈ 0.6; this maximum appears in 2 models that use
the correlation feature set, the Naïve Bayes and the Neural Network. In this case, we can compare the
results on the training set and on the test set (which simulates the real world with a fully independent
sample) to analyse the consequence of using only the training set to pick the best model. The
SVM RBF with correlation-based feature selection appears to have the best results; however,
its generalization is not as good, since its results drop when in contact with unknown instances.
Now, we compare the other metrics, such as accuracy, sensitivity, specificity and area under the
ROC curve, on the test set. Analysing the accuracy results, we can see two values above 90%:
the Naïve Bayes and the Neural Network. Note that these models have a high |TPR − FPR| score
of about 0.62. But we can also see that kNN, which likewise has 0.62 in |TPR − FPR|, has an accuracy
of only 78%. Thus, using only accuracy, this model would be considered inferior and dismissed.
The trade-off between sensitivity and specificity is what |TPR − FPR| captures. We can see that
the Neural Network and Naïve Bayes have higher sensitivity scores, and that kNN has the highest
specificity but one of the lowest sensitivities. Using the AUC (ROC area), we can see that the three
top scores are also the three top scores of |TPR − FPR|; the AUC metric also takes the
imbalance of the data into account. It is not easy to choose the best model; however, the model with the highest
|TPR − FPR| and area under the ROC is the Neural Network.
Table 3.8 shows the resulting confusion matrix of the Neural Network model; we can
see that the majority of both classes are correctly classified.
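The counts in Table 3.8 let us reproduce the metrics discussed in this section. Treating AD as the positive class (an assumption; the text does not state the polarity explicitly):

```python
# Counts taken from Table 3.8 (Neural Network, correlation-based feature set).
tn, fp = 134, 9      # real MCI: predicted MCI / predicted AD
fn, tp = 4, 9        # real AD:  predicted MCI / predicted AD

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)          # TPR
specificity = tn / (tn + fp)
fpr = 1 - specificity
score = abs(sensitivity - fpr)        # the |TPR - FPR| metric used throughout

print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, "
      f"specificity={specificity:.3f}, |TPR-FPR|={score:.3f}")
# accuracy ≈ 0.917 and |TPR - FPR| ≈ 0.63, in line with the values reported in the text
```

This also makes the accuracy pitfall concrete: with only 13 AD instances out of 156, a classifier that predicted MCI for everyone would already score about 92% accuracy while having |TPR − FPR| = 0.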
Figure 3.7: Results for the diagnosis using an independent test set; the scale is |TPR − FPR|, where higher is better.
Table 3.7: Best classifiers for the diagnosis problem, using the test set.
Classifier   FS   Accuracy   Sensitivity   Specificity   ROC Area   |TPR − FPR|
3.4 Summary

The missing values in the database have been taken into consideration. This problem was
analysed under two assumptions: a random distribution and a non-random distribution. A set of
Table 3.8: Confusion matrix for the diagnosis problem, using the test set. The algorithm used is the Artificial Neural Network with the correlation-based feature set.

            Predicted MCI   Predicted AD
Real MCI    134             9
Real AD     4               9
experiments has been performed: using linear regression, expectation maximization, median/mode
imputation, and a single value. The conclusion was that the missing values are random, and that the
results do not significantly improve by using those techniques. Other approaches could be taken
to deal with the missing data, such as using unsupervised learning to impute missing values based on
clusters, but in this work those approaches were not tested.
To tackle the need of knowing the behaviour of the created models in a real environment, that
is, their behaviour in contact with new instances, a completely independent test
set was created from the raw dataset. This approach is common in data mining contests, where
the training set is given to the competitors and the test set used to compare the submitted models is
provided later. The metric used to compare the results is |TPR − FPR|, which, by taking into
account the random classifier in ROC space, gives us an unbiased metric. Using
metrics like precision, F-measure, sensitivity, specificity or accuracy would give biased results
as a consequence of the imbalanced state of the data.
In this chapter, the diagnosis problem was addressed. For that, a methodology was defined
to build models using the clinical data, taking into account the missing data and the class
imbalance. In this problem we aim at differentiating MCI and AD in an unbalanced dataset, with
high dimensionality and a high percentage of missing data. For this we defined and applied
the described methodology. In the course of this work we found that, in contact with an independent
test set, the models that show the highest generalization are Naïve Bayes, Artificial Neural
Network and kNN. Furthermore, a single feature subset is not always the best one. This allowed
us to conclude that the best feature set depends on the algorithm used, and probably also on
the parameters used. The best models found have |TPR − FPR| ≈ 0.6. This result shows
that the diagnosis problem is indeed complex, but that we achieved a good discriminative model for
the MCI and AD classes using state-of-the-art techniques. The best models are the Naïve Bayes and
Neural Networks using correlation to select the features, and the kNN using all features. Other
metrics, in particular the area under the ROC, also indicate that the models obtained have a high
discriminative power.
4. Predicting conversion from MCI to AD (Prognosis)
The prognosis prediction of a patient is of great importance to medical doctors: it allows
for adequate medical care for the patient and support for the family. The prognosis prediction of
Alzheimer's Disease (or another cognitive impairment) also plays a role in the patient's decisions
about their future. For example, if the conversion to AD will occur within a year and the patient has a
high-responsibility job, e.g., as a company manager or a pilot, the patient can adjust his life to minimize
the impact of his disease on society.
4.1 Prognosis prediction approach
For prognosis prediction we use two different approaches. The first one, which is normally
used in similar problems [8] [19] [27] [25], consists in determining whether a patient will ever convert to AD.
This approach is referred to in this work as First and Last Evaluation, since it looks at the first
and last entries of the patient in the database to determine whether the patient ever evolves from MCI to
AD. In this approach, each patient has a single entry in the post-processed dataset.
The second approach looks at a given temporal window and tries to predict whether a patient converts
from MCI (at the beginning of the temporal window) to AD (at the end). For this, and according
to Figure 4.1, a new set of labels has been created: evolution (Evol) and no evolution (noEvol)
instances. The noEvol class is considered the positive class. Notice that any instance with
insufficient knowledge about the outcome is removed in the process, since the behaviour of the disease
is unknown inside the window.
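The window labelling just described can be sketched as follows. The evaluation format and the decision rule are simplified assumptions (the actual pre-processing also distinguishes the UNK-MCI-MCI and UNK-MCI-? labels shown in Figure 4.1):

```python
def label_patient(evaluations, window_days):
    """Return 'Evol', 'noEvol', or None (unknown outcome inside the window).

    `evaluations`: list of (day, diagnosis) pairs sorted by day; the first
    evaluation is the MCI baseline at the start of the window.
    """
    start_day, start_dx = evaluations[0]
    if start_dx != "MCI":
        return None
    for day, dx in evaluations:
        if day - start_day <= window_days and dx == "AD":
            return "Evol"                    # converted inside the window
    if any(day - start_day >= window_days and dx == "MCI"
           for day, dx in evaluations):
        return "noEvol"                      # still MCI at the end of the window
    return None                              # outcome unknown: instance removed

# Hypothetical patients, with days relative to baseline and a 3-year (1095-day) window.
assert label_patient([(0, "MCI"), (400, "MCI"), (900, "AD")], 1095) == "Evol"
assert label_patient([(0, "MCI"), (600, "MCI"), (1200, "MCI")], 1095) == "noEvol"
assert label_patient([(0, "MCI"), (500, "MCI")], 1095) is None
```

The third case illustrates the removal rule: a patient last seen as MCI before the window closes gives no information about conversion within the window, so the instance is dropped.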
To choose the temporal windows, two factors were considered: (i) the distribution of instances
between classes (Evol/noEvol) and (ii) the medical relevance, which was assessed by consulting
the medical partners of the NEUROCLINOMICS project. For the latter, a period of around
3 years was recommended. For the former, we extracted the class distribution as a function of
the temporal window size (see Figure 4.2). By analysing the evolution of the labels as a function of
the temporal window size, it can be observed that a temporal window of 3 years balances
the classes Evol and noEvol. Thus, three temporal windows have been created, one year
apart: 2 years, 3 years and 4 years. For the first and last evaluation approach, after data
pre-processing, the class distribution is 37% Evol and 63% noEvol in both the training and
testing sets, as presented in Table 4.1.
Figure 4.1: Graphical representation of the new class labels (Evol, noEvol, UNK-MCI-MCI, UNK-MCI-?) created for the temporal-window prognosis problem.
Figure 4.2: Variation of the number of instances per class label (Evol, noEvol, UNK-MCI-MCI, UNK-MCI-?) with the size of the temporal window, in days. These results were obtained using all data; only the Evolution (Evol) and no Evolution (noEvol) labels are used for classification.
4.2 Classification Model
The classification model used for the prognosis prediction is simpler than the one used for
diagnosis. Independent training and test sets have been created, and a grid search is applied to find
the best model parameters on the training set. As in the diagnosis, the |TPR − FPR| metric was
used to determine the best model, which balances specificity and sensitivity.
Three feature sets are again used: Original (i.e., all features), Correlation (obtained with
correlation-based feature selection) and mRMR (which uses the mRMR feature selection).
Figure 4.3 shows the model used in the grid search to find the best parameters using the training
set, and Figure 4.4 shows the model used for testing after the parameters have been found.
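Correlation-based feature selection (CFS, as implemented in WEKA) evaluates feature subsets by their correlation with the class and their inter-correlation. A much simpler univariate stand-in, ranking features only by absolute correlation with the class, can be sketched as:

```python
import numpy as np

def rank_by_class_correlation(X, y, top_k):
    """Return the indices of the top_k features most correlated (in absolute
    value) with the class labels. A univariate simplification of CFS: unlike
    the real algorithm, it ignores redundancy among the selected features."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k]

# Synthetic illustration: feature 2 is deliberately made informative.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 5))
X[:, 2] += 2.0 * y
selected = rank_by_class_correlation(X, y, top_k=2)
```

The redundancy term that real CFS adds matters in practice: two strongly correlated clinical scores would both rank highly here, while CFS would tend to keep only one of them.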
[Diagram: Data → Feature Selection → Training set and Testing set; the training set is split into 10 non-overlapping folds, SMOTE synthetic oversampling is applied, and the classifier (model) is trained and evaluated by cross-validation → Results.]
Figure 4.3: Data flow used in the parameter grid search for finding the classifiers' parameters. The SMOTE percentage is tested with 11 different values for each parameter combination.
[Diagram: Training set → SMOTE synthetic oversampling → Classifier (model) with the chosen Parameters → evaluated on the Test set → Results.]
Figure 4.4: Data flow used to simulate the real-world results.
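The grid search of Figure 4.3 scores each parameter combination by cross-validated |TPR − FPR|. A compact scikit-learn sketch of that loop (a stand-in for the WEKA pipeline actually used; the synthetic data and the parameter grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def tpr_minus_fpr(y_true, y_pred):
    """The |TPR - FPR| metric used to select models throughout this work."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return abs(tpr - fpr)

scorer = make_scorer(tpr_minus_fpr)
# Synthetic dataset mimicking the 63%/37% class imbalance.
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.63, 0.37], random_state=0)

# Manual grid over (complexity, gamma), scored by 10-fold cross-validation.
grid = [(c, g) for c in (0.5, 1.5, 4.0) for g in (0.01, 0.1)]
best = max(grid, key=lambda p: cross_val_score(
    SVC(C=p[0], gamma=p[1]), X, y, cv=10, scoring=scorer).mean())
print("best (Compl, gamma):", best)
```

In the thesis pipeline, SMOTE would additionally be applied inside each training fold (Figure 4.3) before fitting, so the oversampling percentage becomes one more grid dimension.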
To deal with the missing data problem, a study similar to the one made for the diagnosis
was performed. The following methods are used to deal with missing values:
• In Naïve Bayes, the internal way of ignoring the missing data is used, which simply excludes
them from the calculations.
• In SVMs, the internal median/mode imputation is used.
• In kNN, median/mode imputation is used, which gives better results than WEKA's
default of considering the maximum distance between instances.
• In the C4.5 decision tree, median/mode imputation is used, since the tests showed that this
method was better than the default statistical method.
• In Neural Networks, median/mode imputation is used, since the tests showed that this
method was better than turning off the input neuron in case of a missing value.
Table 4.2: First and Last Evaluation parameters, using the training set. The number of features selected is 153 for All Features, 32 for Correlation and 22 for mRMR. In these sets, the percentage of missing values is 43% for All Features, 18% for Correlation and 35% for mRMR.
Classifier           Feature Selection   SMOTE   Parameters
SVM RBF              All Features        50%     Compl = 4.5, γ = 0.001
                     Correlation         0%      Compl = 1.5, γ = 0.1
                     mRMR                50%     Compl = 4.0, γ = 0.01
SVM Poly             All Features        500%    Compl = 2.0, Exp = 1
                     Correlation         0%      Compl = 0.5, Exp = 1
                     mRMR                250%    Compl = 1.0, Exp = 1
Neural Network       All Features        0%      l = 0.3, m = 0.1, time = 1000
                     Correlation         0%      l = 0.3, m = 0.2, time = 1000
                     mRMR                0%      l = 0.3, m = 0.1, time = 1000
Decision Tree C4.5   All Features        400%    Conf = 0.35
                     Correlation         400%    Conf = 0.5
                     mRMR                50%     Conf = 0.15
kNN                  All Features        350%    k = 4
                     Correlation         100%    k = 9
                     mRMR                350%    k = 10
4.3 Results and Discussion

Table 4.2 presents the parameters found after performing the grid search, as described in Section
4.2. The processed dataset has a minor imbalance, 63% Evolution vs. 37% no Evolution (see
Table 4.1). Nevertheless, the oversampling technique, SMOTE, was applied, with the balanced state
being obtained by oversampling the minority class by approximately 70%. For the Naïve Bayes
classifier, oversampling has been used only with the mRMR features; in this case, for example,
oversampling inverts the class balance. Another, more extreme, example of this is observed
when using the original set of features with the SVM Poly classifier: an oversampling
of 500% was chosen, which transforms the minority class into a majority class. With the decision
trees and kNN, we note that in some cases the applied oversampling also completely inverts the
class balance.
Statistically, we can compare the models using 30 repetitions, obtained by running
the models 30 times with 30 different seeds. For this analysis we used paired t-tests, applied
in an all-vs-all procedure. The t-tests are only applied if an ANOVA test with a confidence
level of 95% confirms the existence of a significant difference. Using paired all-vs-all t-tests,
with 95% confidence, in each feature set we observed:
• Original (All Features)
The Naïve Bayes and SVM RBF classifiers do not have a significant difference, but both
have a significant difference from all other models, always with a greater mean result.
The kNN only has a statistically insignificant difference to the Neural Network classifier,
having in all other cases a worse mean result.
• Correlation
The Naïve Bayes has a significant difference to all other models, always with a greater
mean result. The Neural Network model has a significant difference, but in all cases a
worse mean result.
• mRMR
The Naïve Bayes and SVM RBF again have the best results, with a significant difference
from all other models. The model with the worst result is the kNN.
Table 4.3: Best classifiers for the First and Last problem, using the test set.
Classifier   FS   Accuracy   Sensitivity   Specificity   ROC Area   |TPR − FPR|
Table 4.5: Classification model parameters for the prognosis in a temporal window of 2 years. The number of features selected is 153 for All Features, 29 for Correlation and 15 for mRMR. In these sets, the percentage of missing values is 44% for All Features, 11% for Correlation and 43% for mRMR.
Classifier           Feature Selection   SMOTE   Parameters
SVM RBF              All Features        54%     Compl = 5.0, γ = 0.01
                     Correlation         54%     Compl = 2.0, γ = 0.1
                     mRMR                162%    Compl = 0.5, γ = 1.0
SVM Poly             All Features        432%    Compl = 1.0, Exp = 1
                     Correlation         486%    Compl = 1.0, Exp = 1
                     mRMR                162%    Compl = 0.5, Exp = 1
Neural Network       All Features        0%      l = 0.2, m = 0.1, time = 1000
                     Correlation         0%      l = 0.2, m = 0.1, time = 2000
                     mRMR                0%      l = 0.2, m = 0.2, time = 2000
Decision Tree C4.5   All Features        216%    Conf = 0.05
                     Correlation         378%    Conf = 0.15
                     mRMR                270%    Conf = 0.35
kNN                  All Features        54%     k = 10
                     Correlation         650%    k = 3
                     mRMR                216%    k = 8
For the 2-year dataset, the oversampling needed to balance the data is approximately 100%.
Analysing Table 4.5, we can see that, in most cases where oversampling is applied, the
percentage used inverts the balance of the classes. In those cases the oversampling helps to define the
decision boundaries. The Neural Network does not use any oversampling; this is consistent with
the majority of the results, which show that this algorithm is not sensitive to oversampling effects.
Statistically, we can compare the models using 30 repetitions. For this analysis we use paired
t-tests, applied in an all-vs-all strategy. The t-tests are only applied if an ANOVA test with a 95%
confidence level confirms the existence of a significant difference. The following conclusions are
obtained:
• Original (All Features)
The two best models are Naïve Bayes and SVM RBF. Both have a significant difference to
the other models and a higher mean. The decision tree model obtained the worst results,
with a significant difference and the worst mean, except against the kNN, where no
statistically significant difference was found.
• Correlation
The Naïve Bayes model shows a significant difference to all other models and in all cases
a higher mean result. The decision tree model shows the worst result, having in all
cases a significant difference with a lower mean result.
• mRMR
The SVM RBF has a significant difference in all cases, with a higher mean result. The
decision trees are the worst model and only obtained a significant difference, with a higher mean
result, against the Neural Networks.
Figure 4.6: Test results of the prognosis using a two-year temporal window.
Analysing the results of the test set presented in Figure 4.6, we can observe that 3 models
achieve good results on all datasets. Those models are: the Naïve Bayes and the SVMs (with a linear and
Table 4.6: Best classifiers for the prognosis problem with a 2-year temporal window, using the test set.
In the three-year dataset, the data balance is not a problem since the two classes have the
same representativity. As we can see, the oversampling is in most cases null or very
small. As we have already seen, oversampling has the capability of overpopulating the dataset
to better define the decision frontier; this side effect is the main reason why oversampling is
sometimes used. Nevertheless, there are cases where the oversampling percentage is higher,
such as the Naïve Bayes. We can see that in this problem the Naïve Bayes got
Table 4.8: Classification model parameters for the prognosis in a temporal window of 3 years. The number of features selected is 153 for All Features, 21 for Correlation and 17 for mRMR. In these sets, the percentage of missing values is 43% for All Features, 12% for Correlation and 42% for mRMR.
Classifier           Feature Selection   SMOTE   Parameters
SVM RBF              All Features        0%      Compl = 1.0, γ = 0.1
                     Correlation         76%     Compl = 1.5, γ = 0.1
                     mRMR                0%      Compl = 0.5, γ = 1.0
SVM Poly             All Features        38%     Compl = 1.5, Exp = 1
                     Correlation         0%      Compl = 4.0, Exp = 2
                     mRMR                0%      Compl = 1.5, Exp = 2
Neural Network       All Features        0%      l = 0.3, m = 0.1, time = 1000
                     Correlation         0%      l = 0.2, m = 0.1, time = 2000
                     mRMR                0%      l = 0.2, m = 0.2, time = 2000
Decision Tree C4.5   All Features        38%     Conf = 0.45
                     Correlation         152%    Conf = 0.45
                     mRMR                228%    Conf = 0.5
kNN                  All Features        0%      k = 5
                     Correlation         0%      k = 5
                     mRMR                0%      k = 9
some confusion at the decision borders, which is minimized by applying oversampling to the no
Evolution class. The decision trees also show the same problem in all feature sets.
Statistically, we can compare the models using 30 repetitions. For this analysis we use paired
t-tests, applied in an all-vs-all strategy. The t-tests are only applied if an ANOVA test with a 95%
confidence level confirms the existence of a significant difference. The following conclusions are
obtained:
• Original (All Features)
The Naïve Bayes and SVM RBF do not have a statistically significant difference, but both
have a significant difference from all other models, always with a higher mean result.
The decision tree has in all cases a significantly lower mean result, except versus the
Neural Network, where no significant difference was found.
• Correlation
The Naïve Bayes has a significant difference from all other models, with a greater mean
result. The decision tree has in all cases a significantly worse mean result.
• mRMR
The SVM RBF has a significant difference from all other models, always with a higher
mean result. The decision tree has in all cases a significant difference and a worse mean
result.
Figure 4.7: Test results of the prognosis using a three-year temporal window.
Table 4.9: Best classifiers for the prognosis problem with a 3-year temporal window, using the test set.
Table 4.11: Classification model parameters for the prognosis in a temporal window of 4 years. The number of features selected is 153 for All Features, 17 for Correlation and 19 for mRMR. In these sets, the percentage of missing values is 44% for All Features, 14% for Correlation and 43% for mRMR.
Classifier           Feature Selection   SMOTE   Parameters
SVM RBF              All Features        203%    Compl = 3.0, γ = 0.01
                     Correlation         58%     Compl = 3.5, γ = 0.1
                     mRMR                87%     Compl = 1.5, γ = 0.1
SVM Poly             All Features        290%    Compl = 1.0, Exp = 1
                     Correlation         87%     Compl = 1.5, Exp = 1
                     mRMR                29%     Compl = 1.0, Exp = 1
Neural Network       All Features        0%      l = 0.2, m = 0.2, time = 1000
                     Correlation         0%      l = 0.3, m = 0.1, time = 1000
                     mRMR                0%      l = 0.3, m = 0.1, time = 1000
Decision Tree C4.5   All Features        87%     Conf = 0.05
                     Correlation         145%    Conf = 0.35
                     mRMR                174%    Conf = 0.2
kNN                  All Features        0%      k = 8
                     Correlation         0%      k = 5
                     mRMR                29%     k = 10
Four Years Temporal Window
In the four-year time window, the balance of the data is inverted with respect to the two-year time
window: the class distribution is now 65% Evolution and 35% no Evolution. Table 4.11
shows the parameters after the grid search. It can be observed that oversampling is used in
almost all models, except in the Neural Networks; in kNN, oversampling was only used with
the mRMR feature set, and its value in this case is very low. Note that the SMOTE
levels used in the Naïve Bayes and SVMs invert the data balance; in this case SMOTE
helps the definition of the borders, as in the previously studied cases.
Statistically, we can compare the models using 30 repetitions. As before, we use paired
t-tests, applied in an all-vs-all strategy. The t-tests are only applied if an ANOVA test with a 95%
confidence level confirms the existence of a significant difference. The following conclusions are
obtained:
• Original (All Features)
The Naıve Bayes and SVM RBF, do not have a significant difference, but those two have
a significant difference from all others models, having always a higher mean result. The
Decision Tree have in all cases a significantly worse mean result.
• Correlation
The Naıve Bayes, SVM RBF and kNN do not have a significant difference between them
and have better mean result. The Decision Tree has in all cases a significantly worse mean
result.
• mRMR
The Naıve Bayes and SVM RBF, do not have a significant difference between themselves
but have a significantly difference from all others models, and a higher mean result. The
Decision Tree and Neural Network have no significant difference between them but in all
other cases have a significant difference and a worse mean result.
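This comparison procedure, an ANOVA gate followed by all-vs-all paired t-tests, can be sketched in Python. The score vectors are hypothetical, and `anova_f`, `paired_t` and the ≈3.1 critical value are illustrative stand-ins for a full statistics package:

```python
import random
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

def anova_f(groups):
    """One-way ANOVA F statistic for k groups of equal size n."""
    k, n = len(groups), len(groups[0])
    grand = mean(x for g in groups for x in g)
    ss_between = n * sum((mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (k * (n - 1)))

def paired_t(a, b):
    """Paired t statistic over matched repetitions of two classifiers."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# hypothetical |TPR - FPR| scores over 30 repetitions of three classifiers
rng = random.Random(42)
scores = {
    "Naive Bayes": [0.60 + rng.gauss(0, 0.02) for _ in range(30)],
    "SVM RBF":     [0.61 + rng.gauss(0, 0.02) for _ in range(30)],
    "DT C4.5":     [0.40 + rng.gauss(0, 0.02) for _ in range(30)],
}

# gate: run the all-vs-all paired t-tests only if ANOVA signals a difference
if anova_f(list(scores.values())) > 3.1:  # approx. 95% critical value of F(2, 87)
    for a, b in combinations(scores, 2):
        print(f"{a} vs {b}: t = {paired_t(scores[a], scores[b]):.2f}")
```

Pairing the t-test on the same repetitions removes the variance shared by the two classifiers on each resample, which is why it is preferred here over an unpaired test.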
Table 4.12: Best classifiers for the prognosis problem with a four years temporal window, using the test set.

Classifier | Feature Set | Accuracy | Sensitivity | Specificity | ROC Area | |TPR − FPR|
SVM RBF | mRMR | 81% | 83% | 80% | 0.82 | 0.63
NN | mRMR | 77% | 72% | 80% | 0.88 | 0.52
DT C4.5 | Original | 74% | 61% | 84% | 0.71 | 0.45
kNN | Correlation | 79% | 72% | 84% | 0.76 | 0.56
Using the four years temporal window, we also obtained an overall good performance on the test set. For this problem, mRMR is the best feature set in 3 of the 6 algorithms used. The best results are obtained with the SVM RBF using mRMR feature selection and with kNN using the correlated dataset; these results show a classifier closer to the perfect classifier than to the random one. Again, we found that the
[Six panels: Naive Bayes, SVM RBF, SVM Poly, kNN, Neural Network, Decision Tree C4.5; vertical axis |TPR − FPR|, from 0 to 1.]
Figure 4.8: Test results of Prognosis using four years temporal window
classification performance depends on the features and on the classification algorithm. The worst result is obtained when decision trees are used with the mRMR feature set, which shows that, with this feature set, the decision tree does not have good generalization power. The best model in this approach is the Gaussian SVM (SVM RBF) using mRMR feature selection. By analysing the other metrics (see Table 4.12), we conclude that this SVM achieves a well-balanced result, with 83% sensitivity, 80% specificity and a 0.82 ROC area.
4.4 Summary
In this chapter, we studied two approaches to process the data in order to predict conversion from MCI to AD (prognosis), using state-of-the-art methods. The standard method, named in this work First and Last Evaluation, was tested to define a baseline in discriminative power. Temporal windows were then presented as an alternative to this method and as a way to take into account the different profiles that a patient can go through in their evaluation history. With the temporal windows approach, we obtained a higher discriminative power on the test set (pre-processed for each problem); the results show that using temporal windows increases the prediction capability of the models. We also compared our First and Last Evaluation approach with the work of Maroco et al. [36], achieving slightly better results, and using temporal windows our results have an even greater discriminative power.
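As a concrete illustration of the idea, a k-year temporal window turns a patient's evaluation history into labelled learning examples roughly as follows. The data layout and function below are hypothetical sketches, not the exact preprocessing used in this work:

```python
from datetime import date

def label_with_window(evaluations, window_years):
    """For each MCI evaluation, emit (features, label): 'Evolution' if a later
    AD diagnosis occurs within window_years; 'NoEvolution' if the patient is
    still observed past the window without converting."""
    examples = []
    for i, (day, diag, feats) in enumerate(evaluations):
        if diag != "MCI":
            continue
        horizon = day.replace(year=day.year + window_years)
        later = [(d, dg) for d, dg, _ in evaluations[i + 1:]]
        if any(dg == "AD" and d <= horizon for d, dg in later):
            examples.append((feats, "Evolution"))
        elif any(d >= horizon for d, _dg in later):
            examples.append((feats, "NoEvolution"))
        # otherwise the follow-up is too short and the instance is discarded
    return examples

# hypothetical evaluation history of one patient
history = [
    (date(2004, 11, 22), "MCI", {"MMSE": 27}),
    (date(2006, 5, 10),  "MCI", {"MMSE": 25}),
    (date(2008, 1, 15),  "AD",  {"MMSE": 21}),
]
print(label_with_window(history, 4))  # both MCI evaluations labelled Evolution
print(label_with_window(history, 2))  # first NoEvolution, second Evolution
```

Note how the same evaluation can receive different labels under different window sizes, which is exactly what lets each window define its own learning problem.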
Regarding the size of the temporal window, we obtained the best test-set results using the three years window, as we expected from the medical feedback.
Table 4.13: Best models for each progression approach.

Approach | Classifier | Accuracy | Sensitivity | Specificity | ROC Area | |TPR − FPR|
First & Last | Naïve Bayes | 71% | 72% | 68% | 0.67 | 0.40
2 Years | SVM Poly | 73% | 69% | 86% | 0.78 | 0.55
3 Years | SVM RBF | 82% | 79% | 87% | 0.83 | 0.66
4 Years | SVM RBF | 81% | 83% | 80% | 0.82 | 0.63
5 Decision Support System
The models created in chapters 3 and 4 showed an overall good performance in the tasks of discriminating MCI patients (diagnosis) and predicting the progression from MCI to AD (prognosis). However, these models are not directly usable by medical doctors. To bridge this usability gap, a solution was designed and implemented in this thesis to facilitate the use of the system by third parties. By integrating the models in an information system, medical doctors can now evaluate them in a real work situation. Since this work was done in the context of the NEUROCLINOMICS project, the integration of the models into an application that can be used by healthcare professionals is of huge importance. With this in mind, a modular Decision Support System (DSS) with a web-services architecture was developed, which integrates with other tools developed in the project, in particular with the AD information system that is under development. The use of web services allows updating the models without altering any other part of the system.

The DSS was designed in a modular way and is composed of the following components: (i) a data input system, where healthcare professionals introduce data relative to new instances or update previous ones; (ii) a prediction system that computes the patient diagnosis and the prognosis (2, 3 and 4 years) based on previously trained models; (iii) an automatic training tool that updates the model parameters (feature set selection, oversampling percentage and the algorithm parameters) based on the completely known instances.
5.1 System
Using WEKA, six models have been created for each problem: diagnosis, and prognosis with a temporal window of two, three and four years. These models are parametrized using the tool created and explained previously. The user can choose any of the models, or even all of them; for example, the user can choose, among the 6 models for a specific problem, those with more confidence in their results, and present a confidence interval to the final user. The user can also test a patient evaluation: the system returns the diagnosis prediction and, if this diagnosis is MCI, it also shows the prognosis results. In this way the system integrates diagnosis and prognosis in a simple way.

Figure 5.1: DSS web service architecture.

The constant update of the database will decrease the amount of missing data, since in new evaluations the clinicians now use a larger number of assessment tests (features). This change will likely lead the system to choose new feature sets, since, for example, features with fewer missing values are likely more relevant for the classification. Database updates can also change the class balance. The models must therefore be updated and re-parametrized taking these factors into account. For this, the parametrization system described in chapter 3 is used to acquire the new parameters and then create new models that reflect the changes.
The implementation of this DSS was performed using web services. This technology allows us to deploy the services in an application server on the network, so that the services can be remotely accessed using a defined message scheme, e.g., XML. The DSS also allows easy integration of new services and updating of existing ones.
For this work, four web services have been created: one for the diagnosis and one for each prognosis temporal window. Each web service receives as arguments the patient evaluation and the model to be used, and returns the confidence of the prediction. The possible models are: Naïve Bayes, SVM (Gaussian or polynomial), kNN, C4.5 decision trees and artificial neural networks. Figure 5.1 presents the architecture of the proposed system: the client sends a request with the patient evaluation over the network to the web service, which interrogates the classification models; in the end, a response is returned to the client.

Figure 5.2: Prototype data input screen.
Figure 5.3: Prototype output screen.
Using these web services, we can build applications that allow a client to easily access the information produced by the models. Figure 5.2 shows a prototype data-insertion screen. In this prototype there are two boxes: one to select the classification model (Neural Networks, SVMs, Naïve Bayes, kNN and C4.5 Decision Tree) and another, called "Patient Evaluation", where the patient evaluation is inserted. After clicking "submit", the classification request is sent to the web service. When the response arrives, the application screen shown in Figure 5.3 appears.
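A minimal sketch of such a request/response exchange, with hypothetical model stubs standing in for the trained WEKA classifiers and XML as the message scheme:

```python
import xml.etree.ElementTree as ET

# hypothetical stand-ins for the trained WEKA models
MODELS = {
    "NaiveBayes": lambda ev: 0.95,   # probability that the patient is MCI
    "SVM_RBF":    lambda ev: 0.91,
}

def handle_request(xml_request):
    """Parse an XML patient-evaluation request, run the chosen model and
    return the prediction confidence as an XML response."""
    req = ET.fromstring(xml_request)
    model = req.get("model")
    evaluation = {f.get("name"): float(f.text) for f in req.iter("feature")}
    confidence = MODELS[model](evaluation)
    resp = ET.Element("response", model=model)
    ET.SubElement(resp, "confidence").text = f"{confidence:.2f}"
    return ET.tostring(resp, encoding="unicode")

request = (
    '<request model="NaiveBayes">'
    '<feature name="MMSE">27</feature>'
    '<feature name="Age">71</feature>'
    "</request>"
)
print(handle_request(request))
# e.g. <response model="NaiveBayes"><confidence>0.95</confidence></response>
```

Because the message scheme, not the model internals, defines the interface, a retrained model can replace a stub in `MODELS` without any change on the client side, which is the modularity argument made above.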
A system that uses the created web services already exists; it was created by a NEUROCLINOMICS team member in the scope of an AD information system. Figure 5.4 shows a screenshot of this system. The table in Figure 5.4 displays the predictions of all models; for the diagnosis, the values correspond to the probability of the patient being MCI and, for the prognosis, to the probability of the patient not evolving to AD. The graphic shows the maximum, average and minimum for each problem (diagnosis and the three temporal windows); these are confidence intervals created by applying multiple models to the same instance.

In Figure 5.4 a real instance is used: the user queries the system about an evaluation of patient 9, performed on 22/11/2004. The response indicates that the patient is MCI with an average model probability of 95%, and the probability of not converting to AD within 2, 3 or 4 years is high (a model average of 90%, 92% and 90%, respectively); this means that the probability of evolving to AD within 4 years is very low.
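The min/average/max aggregation behind these confidence intervals is straightforward; it is sketched here with hypothetical per-model outputs:

```python
# hypothetical per-model outputs for one patient evaluation (diagnosis: P(MCI))
probs = {"Naive Bayes": 0.95, "SVM RBF": 0.96, "SVM Poly": 0.93,
         "kNN": 0.97, "DT C4.5": 0.92, "NN": 0.97}

lo, hi = min(probs.values()), max(probs.values())
avg = sum(probs.values()) / len(probs)
print(f"MCI probability: min={lo:.2f} avg={avg:.2f} max={hi:.2f}")
```

A narrow min-max band signals that the six models agree on the instance, which is itself useful information for the clinician.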
Figure 5.4: DSS system screen. The diagnosis values represent the probability of the patient being MCI; the prognosis values represent the probability of not evolving to AD.
5.2 Summary
In this chapter, a decision support system using a web-services architecture was briefly described. The DSS was created to facilitate the use of the system by medical doctors and to integrate the work done in this thesis with the NEUROCLINOMICS project. A prototype was described and a functional application was shown. Since the created system is highly modular, upgrading the classification models is easy and can be done automatically.
6 Conclusions and Future Work
In this work, we studied the diagnosis and prognosis prediction for patients suffering from Alzheimer's disease. To perform this work we used a dataset consisting of numeric neuropsychological tests and the corresponding diagnoses. Because new neuropsychological tests were added to the dataset over time, it suffers from missing values. Furthermore, since there are many more MCI instances in the dataset than AD instances, it also suffers from class imbalance. Thus, this work has to deal with the combined influence of class imbalance, high dimensionality and missing values. To evaluate the influence of all these factors, a unified approach was designed to create and evaluate the models. This approach uses a grid search that combines oversampling with a multidimensional parameter search.
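The combined search can be sketched as a single loop over SMOTE levels and classifier parameters; the grids and the `evaluate` stub below are hypothetical placeholders for the WEKA training and cross-validation step:

```python
from itertools import product

def evaluate(smote_level, params):
    """Stub standing in for: apply SMOTE at smote_level, train the classifier
    with params, and return the cross-validated |TPR - FPR| (made-up formula)."""
    tpr = min(1.0, 0.55 + 0.001 * smote_level + 0.05 * params["C"])
    fpr = 0.20
    return abs(tpr - fpr)

smote_levels = [0, 29, 58, 87, 145, 203]  # SMOTE percentages, as in Table 4.11
param_grid = [{"C": c, "gamma": g} for c, g in product([1.0, 1.5, 3.0], [0.01, 0.1])]

best = max(
    ((lvl, p) for lvl in smote_levels for p in param_grid),
    key=lambda cfg: evaluate(*cfg),
)
print("best configuration:", best)
```

Searching the oversampling level jointly with the classifier parameters matters because the best parameters shift when the class balance changes; optimizing them separately can miss the overall optimum.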
To evaluate the results without bias in the case of an imbalanced dataset, the |TPR − FPR| metric is used for all models in this work. This metric is a trade-off between sensitivity and specificity.
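From a confusion matrix, this metric is computed as follows (a small sketch with illustrative counts):

```python
def tpr_fpr_margin(tp, fn, tn, fp):
    """|TPR - FPR|: trade-off between sensitivity (TPR) and the
    false-positive rate (1 - specificity)."""
    tpr = tp / (tp + fn)   # sensitivity
    fpr = fp / (fp + tn)   # 1 - specificity
    return abs(tpr - fpr)

# e.g. sensitivity 79%, specificity 87% -> |0.79 - 0.13| = 0.66
print(round(tpr_fpr_margin(tp=79, fn=21, tn=87, fp=13), 2))  # 0.66
```

Unlike accuracy, this value is 0 for any classifier that labels everything with the majority class, which is why it is robust to class imbalance.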
In this work we tackled the diagnosis problem and created models that discriminate MCI from AD cases. We analysed the behaviour of a set of supervised data mining algorithms and concluded that Naïve Bayes and Neural Networks perform better when confronted with an unknown test set. Those results were obtained using the original set of features (All Features) and a correlated set. We also concluded that 10-fold cross-validation provides an estimate of the goodness of the result that does not match the one obtained on the test set. This can be caused by overfitting to the training set, which leads to a low generalization power of the models. Similar results were obtained in all analysed problems. One of the best diagnostic models was obtained using the Naïve Bayes algorithm, with an accuracy of 91%, a sensitivity of 93%, a specificity of 69%,
a ROC Area of 0.85 and a |TPR− FPR| of 0.62.
For the prognosis problem, we presented a new approach to predicting the conversion from MCI to AD. The standard method is to use the first and last evaluation of the patient. In our opinion, this approach discards important information, such as the profiles that a patient can go through in their evaluation history: over those 10 years, a patient can pass through profiles that other patients also exhibit. By using a temporal window approach, we obtained better discriminative results. We concluded that the best models are obtained with the Naïve Bayes and SVM algorithms, and that the mRMR feature set showed very good results, generally better than those obtained with the original or the correlated set. The temporal window with the highest discriminative power is the 3 years window; using it, the best model was obtained with the radial SVM algorithm, with an accuracy of 82%, a sensitivity of 79%, a specificity of 86%, a ROC Area of 0.83 and a |TPR − FPR| of 0.64.
Finally, we created a decision support system that uses the diagnosis and prognosis models. This system can help medical doctors evaluate patients in a short time. It was implemented using web services in order to integrate this work in the NEUROCLINOMICS project; since web services use a simple communication protocol, this work can be integrated with other work developed in the scope of the project.
Future work
To improve the quality of the decision support system, including its prediction models, new approaches and techniques should be investigated. This includes the use of state-of-the-art supervised or unsupervised data discretization and feature selection techniques, as well as new classification techniques and models, including those using boosting approaches. Furthermore, in order to identify patient profiles, unsupervised clustering techniques can be applied. This would allow the development of specialized models for specific groups of patients and therefore improve the prediction accuracy. It should be noted that, while some of these techniques were applied in the course of this thesis, no significant results were achieved. Nonetheless, we believe that a careful application of these techniques should be able to identify groups of patients.
Finally, in order to enrich the information provided by the decision support system, techniques should be studied to tackle the time-to-conversion problem, in which one predicts how much time will pass until the patient converts from MCI to AD.
Bibliography
[1] Abdi, H. and Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary
Reviews: Computational Statistics.
[2] Bekris, L. M., Yu, C.-E., Bird, T. D., and Tsuang, D. W. (2010). Review article: Genetics of
alzheimer disease. Journal of Geriatric Psychiatry and Neurology.
[3] Boonchuay, K., Sinapiromsaran, K., and Lursinsap, C. (2011). Minority split and gain ratio for
a class imbalance.
[4] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
[5] Breiman, L., Friedman J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
regression trees.
[6] Garcia, C. (1984). Doença de Alzheimer, problemas do diagnóstico clínico. PhD thesis, Faculdade de Medicina de Lisboa.
[7] Chapman, R., Mapstone, M., McCrary, J., Gardner, M., Porsteinsson, A., Sandoval, T., Guillily,
M., DeGrush, E., and Reilly, L. (2011a). Predicting conversion from mild cognitive impairment
to alzheimer’s disease using neuropsychological tests and multivariate methods. Journal of
Clinical and Experimental Neuropsychology, 33(2):187–199.
[8] Chapman, R. M., Mapstone, M., McCrary, J. W., Gardner, M. N., Porsteinsson, A., Sandoval,
T. C., Guillily, M. D., DeGrush, E., and Reilly, L. A. (2011b). Predicting conversion from mild
cognitive impairment to alzheimer’s disease using neuropsychological tests and multivariate
methods. Journal of Clinical and Experimental Neuropsychology.
[9] Chawla, N. V. (2010). Data Mining for Imbalanced Datasets: An Overview, pages 875–886.
Number 40. Springer US.
[10] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16.
[11] D'Agostino, R. B., Lee, M.-L., Belanger, A. J., Cupples, L. A., Anderson, K., and Kannel, W. B. Relation of pooled logistic regression to time dependent cox regression analysis: The Framingham Heart Study.
[12] Data, M. (2007). Missing data in clinical. Group, pages 453–460.
[13] de Lemos, L. J. M., Silva, D., Guerreiro, M., Mendonca, A., Tomas, P., and Madeira, S. (2012a). Discriminating alzheimer's disease from mild cognitive impairment using neuropsychological data.
[14] de Lemos, L. J. M., Silva, D., Guerreiro, M., Mendonca, A., Tomas, P., and Madeira, S.
(2012b). Predicting conversion from mild cognitive impairment to alzheimer’s disease using
neuropsychological data: Preliminary results.
[15] Dietterich, T. G. and Bakiri, G. (1995). Solving multiclass learning problems via error-
correcting output codes. Journal of Artificial Intelligence Research 2.
[16] Duan, K.-B. and Keerthi, S. S. (2005). Which is the best multiclass svm method? an empirical
study. Springer-Verlag Berlin Heidelberg.
[17] Elkan, C. (2001). The foundations of cost-sensitive learning. International Joint Conference
on Artificial Intelligence, 17(1):973–978.
[18] Ewers, M., Walsh, C., Trojanowski, J., Shaw, L., Petersen, R., Jack Jr, C., Feldman, H.,
Bokde, A., Alexander, G., Scheltens, P., et al. (2010a). Prediction of conversion from mild
cognitive impairment to alzheimer’s disease dementia based upon biomarkers and neuropsy-
chological test performance. Neurobiology of Aging.
[19] Ewers, M., Walsh, C., Trojanowski, J. Q., Shaw, L. M., Petersen, R. C., Jr., C. R. J., Feldman,
H. H., Bokde, A. L., Alexander, G. E., Scheltens, P., Vellas, B., Duboisl, B., Weiner, M., and
Hampel, H. (2010b). Prediction of conversion from mild cognitive impairment to alzheimer’s
disease dementia based upon biomarkers and neuropsychological test performance. Elsevier.
[20] Guerreiro, M., Silva, A. P., Botelho, M. A., Leitão, O., Castro-Caldas, A., and Garcia, C. (1994). Adaptação à população portuguesa da tradução do Mini Mental State Examination (MMSE). Revista Portuguesa de Neurologia.
[21] Hall, M. (1999). Correlation-based feature selection for machine learning. PhD thesis, The
University of Waikato.
[22] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The
WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18.
[23] Han, J. and Kamber, M. (2006). Data Mining:Concepts and Techniques. Diane Cerra, 2
edition.
[24] He, H. and Garcia, E. (2009). Learning from imbalanced data. Knowledge and Data
Engineering, IEEE Transactions on, 21(9):1263 –1284.
[25] Hinrichs, C., Singh, V., Xu, G., and Johnson, S. (2011). Predictive markers for ad in a
multi-modality framework: An analysis of mci progression in the adni population. NeuroImage,
55(2):574–589.
[26] Hinrichs, C., Singh, V., Xu, G., and Johnson, S. C. (2010). Predictive markers for ad in a
multi-modality framework: An analysis of mci progression in the adni population. NeuroImage.
[27] Jack Jr, C., Wiste, H., Vemuri, P., Weigand, S., Senjem, M., Zeng, G., Bernstein, M., Gunter,
J., Pankratz, V., Aisen, P., et al. (2010). Brain beta-amyloid measures and magnetic resonance
imaging atrophy both predict time-to-progression from mild cognitive impairment to alzheimer’s
disease. Brain, 133(11):3336–3348.
[28] John, G. and Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers.
In Proceedings of the eleventh conference on uncertainty in artificial intelligence, pages 338–
345. Morgan Kaufmann Publishers Inc.
[29] Kass, G. (1980). An exploratory technique for investigating large quantities of categorical
data. Applied statistics, pages 119–127.
[30] Keerthi, S., Shevade, S., Bhattacharyya, C., and Murthy, K. (2001). Improvements to Platt’s
SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649.
[31] Kolibas, E., Korinkova, V., Novotny, V., Vajdickova, K., and Hunakova, D. (2000). Adas-cog
(alzheimer’s disease assessment scale-cognitive subscale)–validation of the slovak version.
PubMed.
[32] Liu, H. and Setiono, R. (1996). A probabilistic approach to feature selection-a filter solution.
In Proceedings of the 13th International Conference on Machine Learning, pages 319–327.
Morgan Kaufmann.
[33] Loewenstein, D., Greig, M., Schinka, J., Barker, W., Shen, Q., Potter, E., Raj, A., Brooks, L.,
Varon, D., Schoenberg, M., et al. (2012). An investigation of premci: Subtypes and longitudinal
outcomes. Alzheimer’s and Dementia, 8(3):172–179.
[34] Loh, W. and Shih, Y. (1997). Split selection methods for classification trees. Statistica sinica,
7:815–840.
[35] Maroco, J., Silva, D., Guerreiro, M., de Mendonca, A., and Santana, I. (2011a). Prediction of dementia patients: A comparative approach using parametric vs. non parametric classifiers. XIX Congresso Anual da Sociedade Portuguesa de Estatística.
[36] Maroco, J., Silva, D., Rodrigues, A., Guerreiro, M., Santana, I., and de Mendonca, A.
(2011b). Data mining methods in the prediction of dementia: A real-data comparison of the
accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural
networks, support vector machines, classification trees and random forests. BMC research
notes, 4(1):299.
[37] Maroco, J., Silva, D., Rodrigues, A., Guerreiro, M., Santana, I., and de Mendonca, A.
(2011c). Data mining methods in the prediction of dementia: A real-data comparison of the
accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural
networks, support vector machines, classification trees and random forests. BMC research
notes, 4(1):299.
[38] Folstein, M. F., Folstein, S. E., and McHugh, P. R. (1975). "Mini-mental state": A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12(3):189–198.
[39] Mika, S., Ratsch, G., Weston, J., Scholkopf, B., and Mullers, K. (1999). Fisher discriminant
analysis with kernels. In Proceedings of the 1999 IEEE Signal Processing Society Workshop
on Neural Networks for Signal Processing, pages 41–48. IEEE.
[40] Ikonomovic, M., Perez, S., and Scheff, S. (2012). Mild cognitive impairment: pathology and mechanisms. Acta Neuropathologica, 123(1):13–30.
[41] Neter, J., Kutner, M. H., Naschsheim, C., and Wasserman, W. (1996). Applied Linear
Regression Models. The McGraw-Hill Companies.
[42] Noorbakhsh, F., Overall, C. M., and Power, C. (2009). Deciphering complex mechanisms
in neurodegenerative diseases: the advent of systems biology. Trends in Neurosciences,
32(2):88–100.
[43] Peng, H. P. H., Long, F. L. F., and Ding, C. (2005). Feature selection based on mutual in-
formation criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 27(8):1226–1238.
[44] Platt, J. C., Cristianini, N., and Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. Advances in Neural Information Processing Systems 12.
[45] Powers, D. M. W. (2011). Evaluation : From precision , recall and f-measure to roc , informed-
ness , markedness and correlation. Journal of Machine Learning Technologies, 2(1):37–63.
[46] Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan kaufmann.
[47] Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1):81–106.
[48] Raileanu, L. E. and Stoffel, K. (2004). Theoretical comparison between the gini index and
information gain criteria. Annals of Mathematics and Artificial Intelligence, 41(1):77–93.
[49] Robert, P., Ferris, S., Gauthier, S., Ihl, R., Winblad, B., and Tennigkeit, F. (2010). Review of
alzheimer’s disease scales: is there a need for a new multi-domain scale for therapy evaluation
in medical practice? Alzheimer’s Research & Therapy.
[50] Samtani, M., Farnum, M., Lobanov, V., Yang, E., Raghavan, N., DiBernardo, A., Narayan,
V., et al. (2011). An improved model for disease progression in patients from the alzheimer’s
disease neuroimaging initiative. The Journal of Clinical Pharmacology.
[51] Silva, D., Guerreiro, M., Maroco, J., Santana, I., Rodrigues, A., Bravo Marques, J., and
de Mendonca, A. (2012). Comparison of four verbal memory tests for the diagnosis and pre-
dictive value of mild cognitive impairment. Dementia and Geriatric Cognitive Disorders Extra,
2(1):120–131.
[52] Silva, D., Santana, I., do Couto, F. S., J Maroco, M. G., and de Mendonca, A. (2008). Cogni-
tive deficits in middle-aged and older adults with bipolar disorder and cognitive complaints:
Comparison with mild cognitive impairment. INTERNATIONAL JOURNAL OF GERIATRIC
PSYCHIATRY.
Appendix A: Medical exams (in Portuguese)
Table A.1: Feature list (part 1).

Feature | Type | Description
Case number for this database | Numeric | Rank of cases
Age | Numeric | Age at evaluation
DiagNPS | String | Diagnosis from the psychologist
Diagnosis code | Numeric | Neuropsychological and clinical diagnosis
Disease duration | Numeric | Evolution of cognitive symptoms, in years
Date | Date | Date of the evaluation
School | Numeric | Years of formal education
Group | Numeric | Group in BLAD controls
Gender | Numeric |
Birth | Date | Date of birth
As cut | Numeric | Corte de As, cuts (Min=0; Max (best score)=16)
As time | Numeric | Corte de As, time (value in seconds; lower is better)
As tot | Numeric | Corte de As, total (cuts / time × 10)
DS forw | Numeric | Digit Span forward (Min=0; Max (best score)=9)
DS back | Numeric | Digit Span backward (Min=0; Max (best score)=8)
DS tot | Numeric | Digit Span total (Digit Span forward + Digit Span backward)
LM a Interf Cued | Numeric | Logical Memory (with Interference A), cued (Min=0; Max (best score)=23)
Table A.2: Feature list (part 2).

Feature | Type | Description
LM a Interf Cued | Numeric | Logical Memory (with Interference A), cued (Min=0; Max (best score)=23)
LM b Interf Cued | Numeric | Logical Memory (with Interference B), cued (Min=0; Max (best score)=22)
MVI Free | Numeric | Word Recall with Interference, free (Min=0; Max (best score)=15)
MVI Cued | Numeric | Word Recall with Interference, cued (Min=0; Max (best score)=10)
MVI Rec | Numeric | Word Recall with Interference, recognition (Min=0; Max (best score)=5)
MVI Tot | Numeric | Word Recall with Interference, total (free + cued + recognition)
Infor | Numeric | Test about general information (Min=0; Max (best score)=20)
VisualM A | Numeric | Visual Memory (image A) (Min=0; Max (best score)=3)
VisualM B | Numeric | Visual Memory (image B) (Min=0; Max (best score)=5)
Repetition | Numeric | Repetition (Min=0; Max (best score)=11)
Token Complete | Numeric | Complete version of Token (Min=0; Max (best score)=22)
Snodgrass missing | Numeric | Snodgrass and Vanderwart: number of missing words
Snodgrass end | String | Snodgrass and Vanderwart: total number of words presented
Public Faces missing | Numeric | Public Faces: number of missing words
Public Faces end | String | Public Faces: total number of words presented
Prxs | Numeric | Motor coordination (Min=0; Max (best score)=12)
Cube | Numeric | Drawing of a cube (Min=0; Max (best score)=3)
Clock | Numeric | Drawing of a clock (Min=0; Max (best score)=3)
Table A.3: Feature list (part 3).

Feature | Type | Description
TMT A temp | Numeric | Trail Making Test (Part A), time (value in seconds; the test is usually interrupted beyond 180 s; less time is better)
TMT A err | Numeric | Trail Making Test (Part A), errors (no maximum score; fewer is better)
TMT B temp | Numeric | Trail Making Test (Part B), time (value in seconds; the test is usually interrupted beyond 300 s; less time is better)
TMT B err | Numeric | Trail Making Test (Part B), errors (no maximum score; fewer is better)
TMT B incomplete | Numeric |
a1 | Numeric | CVLT, List A, 1st recall (Min=0; Max (best score)=16)
a2 | Numeric | CVLT, List A, 2nd recall (Min=0; Max (best score)=16)
a3 | Numeric | CVLT, List A, 3rd recall (Min=0; Max (best score)=16)
a4 | Numeric | CVLT, List A, 4th recall (Min=0; Max (best score)=16)
a5 | Numeric | CVLT, List A, 5th recall (Min=0; Max (best score)=16)
Table A.4: Feature list (part 4).

Feature | Type | Description
a1a5 | Numeric | CVLT, List A, total of recalls 1 to 5 (sum of the 5 recalls; Min=0; Max (best score)=80)
a pers | Numeric | CVLT, List A, perseverations (sum of repeated words across the 5 recalls; no maximum score, lower is better)
a intr | Numeric | CVLT, List A, intrusions (sum of new words added to the list across the 5 recalls; no maximum score, lower is better)
b tot | Numeric | CVLT, List B (Min=0; Max (best score)=16)
b pers | Numeric | CVLT, List B, perseverations
b intr | Numeric | CVLT, List B, intrusions
b cs | Numeric | CVLT, List B, semantic clusters (CS; number of groupings of words of the same category)
a cr int | Numeric | CVLT, List A, spontaneous recall after a short delay
a crint ajsem | Numeric | CVLT, recall after a short delay with semantic cueing (Min=0; Max (best score)=16)
a crint ajsem pers | Numeric | CVLT, short-delay recall with semantic cueing, perseverations
a crint ajsem intr | Numeric | CVLT, short-delay recall with semantic cueing, intrusions
a lg int | Numeric | CVLT, List A, recall after a long delay (Min=0; …)
QSM Total | Numeric | Subjective Memory Complaints scale (Min=0 (best score); Max=22)
BlessedAVD | Numeric | Blessed, total of Part 1, daily living activities (Min=0 (best score); Max=8)
BlessedHAB | Numeric | Blessed, total of Part 2, habits (Min=0; Max=9)
BlessedPERS | Numeric | Blessed, total of Part 3, personality (Min=0; Max=11)
Table A.5: Feature list (part 5).

Feature | Type | Description
BlessedTOT | Numeric | Blessed total (Part 1, daily living activities + Part 2, habits + Part 3, personality)

All remaining features in this part are numeric Z scores ("Nota Z"): CancellationTask Z, DigitSpan Z, DigitSpan forward Z, DigitSpan backward Z, SemanticFluency Z, MotorInitiative Z, GraphomotorInitiative Z, Comprehension Z, Identification Z, Token Z, Naming Z, Repetition Z, Writing Z, Orientation Z, WordRecall Z, GeneralInformation Z, VerbalPaired AssociateLearning Z, LogicalMemory Z, LogicalMemory A Z, LM DR Z, VisualMemory Z, Cube Z, Clock Z, CubesWAIS Z, Calculation Z, MPR Z, Proverbs Z, TP RT Z, TP ID Z, TMT A Z, TMT B Z, A1 Z, A5 Z, Atot Z, B Z, SDFR Z, SDCR Z, LDFR Z, LDCR Z, REC Z, Token Complete Z.
Appendix B: Diagnosis
Table B.1: Selected features, for diagnosis, using correlation and mRMR.

Correlation: As cut, As tot, LM a Cued, LM a Interf, Infor, Or Total, Orient P, Orient S, Orient T, Fluency Sem, Writing, Naming, Cube, Clock, MPR, BlessedAVD, BlessedTOT, CancellationTask Z, DigitSpan Z, DigitSpan backward Z, SemanticFluency Z, GraphomotorInitiative Z, Orientation Z, WordRecall Z, GeneralInformation Z, VerbalPaired AssociateLearning Z, LogicalMemory A Z, Clock Z, Calculation Z, MPR Z, Proverbs Z, Atot Z.

mRMR: rec a, Proverb, Writing, PA Inter Dif, Naming, a crint ajsem pers, a crint pers, DS back, BlessedHAB, MPR, Comp, Orient S, M Calc, Orient T, a lgint cs, Gm Initiative, BlessedAVD, MVI Free, Writing Z, Clock, Naming Z, Orient P.
Appendix C: Prognosis
C.1 First and Last Evaluation
[Six panels: Naive Bayes, SVM RBF, SVM Poly, kNN, Neural Network, Decision Tree C4.5; vertical axis |TPR − FPR|, from 0 to 1.]
Figure C.1: Train results of Prognosis using First and Last Evaluations
83
C. Appendix Prognosis
Table C.1: Selected Features, for the prognosis using the First and last evaluation. The usedtechniques are correlation[32] and mRMR [43]
Correlation mRMR
Age VisualM BAs time ClockAs tot As cutPA Dif Orient PLM a b csLM b LM b Interf CuedLM tot GenderLM a Interf a lg intMVI Free TMT B incompleteOr Total Gm InitiativeOrient T MVI FreeCube a lgint ajsem persMPR Repetitiona1a5 TMT B Za lg int Orient SLogicalMemory Z NamingLogicalMemory A Z CubeProverbs Z a lgint csTMT B Z Orient TAtot Z Writing
CalcBlessedHABa lgint persPA Inter DifCompToken TPrxsComprehension ZTP RT Z
C.2 Temporal window: Two years
C.3 Temporal window: Three years
C.4 Time window: Four years
84
C.4 Time window: Four years
Table C.2: Selected Features, for the prognosis using the two years temporal window. The usedtechniques are correlation [32] and mRMR [43]
Correlation mRMR
Age WritingPA Easy Orient PPA Tot DS backLM a TMT A errLM a Interf a lgint ajsemMVI Free NamingOr Total Or TotalFluency Sem CompCalc MVI FreeCancellationTask Z CalcOrientation Z a crint persGeneralInformation Z CubeVerbalPaired AssociateLearning Z TMT B incompleteCube Z b csMPR Z Orient SAtot Z PA Inter Dif
BlessedHABM CalcTMT B temp
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
KNN
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
NeuralNetwork
Decision Tree C4.5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Naive Bayes SVM RBF SVM Poly
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
Figure C.2: Train results of Prognosis using two years temporal window
85
C. Appendix Prognosis
Table C.3: Selected Features, for the prognosis using the three years temporal window. The usedtechniques are correlation [32] and mRMR [43].
Correlation mRMR
Age DS backDS back a crint ajsem intrPA Easy CompPA Tot a3LM a Cued NamingLM a Interf b csLM tot Interf Or TotalMVI Free CubeOrient T MVI FreeFluency Sem a crint persCalc M Calca2 BlessedHABCancellationTask Z PA Inter DifDigitSpan Z a crint csOrientation Z TMT B incompleteWordRecall Z WritingGeneralInformation Z TPIDVerbalPaired AssociateLearning ZLogicalMemory A ZCube ZMPR Z
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
KNN
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
NeuralNetwork
Decision Tree C4.5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Naive Bayes SVM RBF SVM Poly
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
Figure C.3: Train results of Prognosis using three years temporal window
86
C.4 Time window: Four years
Table C.4: Selected Features, for the prognosis using the Four years temporal window. The usedtechniques are correlation[32] and mRMR [43].
Correlation mRMR
PA Tot Orient PLM a Fluency PhonLM a Cued LDCR ZLM a Interf Writing ZLM tot Interf CompMVI Free b csInfor LM b Interf CuedOrient T MVI FreeFluency Sem TMT A errCube Orient TBlessedAVD NamingSemanticFluency Z CubeOrientation Z a crint csWordRecall Z M CalcVerbalPaired AssociateLearning Z a lgint persMPR Z PA Inter DifA5 Z TMT B incomplete
BlessedHABTPRT
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
KNN
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
NeuralNetwork
Decision Tree C4.5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Naive Bayes SVM RBF SVM Poly
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
|TP
R-F
PR
|
Figure C.4: Train results of Prognosis using four years temporal window