Biclustering-based imputation in longitudinal data
Inês de Almeida Nolasco
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor(s): Professor Alexandra Sofia Martins Carvalho, Professor Sara Alexandra Cordeiro Madeira
Examination Committee
Chairperson: Prof. João Fernando Cardoso Silva Sequeira
Supervisor: Prof. Alexandra Sofia Martins de Carvalho
Member of the Committee: Prof. Pedro Filipe Zeferino Tomás
May 2015
Resumo
Amyotrophic Lateral Sclerosis (ALS) is a neurodegenerative disease that affects motor capabilities. The degeneration of ALS patients progresses rapidly and they usually die within a period of a few years, mainly due to the impairment of respiratory functions. Timely identification of the need to start non-invasive ventilation is therefore vital. This problem is addressed through a longitudinal analysis of clinical data that follows the same patients over a period of time. However, these studies, and the data on which they are based, suffer greatly from missing values, i.e., values that for some reason are absent, which makes extracting knowledge from the data considerably difficult. In this work, we address the problem of missing values in longitudinal data by applying biclustering-based techniques, with the aim of finding groups of patients that share the same trend of evolution of the variables over time and imputing the missing values based on that information. This approach is applied, together with other baseline methods for handling missing values, to synthetic data and to the real-world case of predicting NIV in ALS patients. The results indicate that the use of bicluster-based imputation improves results in longitudinal data.
Keywords: Missing values, Longitudinal data, Bicluster-based imputation,
The F-measure incorporates both precision and recall, measuring performance from both aspects. It is defined as follows:

F-measure = 2 × (Precision × Recall) / (Precision + Recall)    (2.7)
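As a quick illustration, the metric can be computed as follows (a minimal Python sketch; the function name is ours):

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (Equation 2.7)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A classifier with precision 0.5 and recall 1.0 gets an F-measure of ~0.667,
# lower than the arithmetic mean: the harmonic mean penalizes imbalance.
print(round(f_measure(0.5, 1.0), 3))  # -> 0.667
```

The factor 2 makes the F-measure the harmonic mean of precision and recall, so it is high only when both are high.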
Chapter 3
Method
3.1 Biclustering-based imputation in longitudinal data
Following the idea of imputation inside classes or subgroups introduced in Section 2.2.1, this work further explores the use of biclustering techniques, applied to our longitudinal dataset, to create similarity groups and perform group-dependent imputation. The idea is that if the imputation is carried out within groups that share some similarity, the imputed values can be expected to be more accurate than if the imputation considered the whole dataset. Taking time-dependent aspects into account, grouping patients by similarity amounts to looking for local trends in the data, i.e., we want to group together patients that, for some feature, show the same evolution over time. Any missing value is then imputed taking into account the group trend for that feature. The biclustering strategy used in this work is CCC-Biclustering, which, as explained in Section 2.3, only finds biclusters with contiguous time points. However, a slightly different version of this algorithm is needed: e-CCC-Biclustering, which groups objects by approximate rather than exact similarity, i.e., the selected patients at the defined time points do not need to match exactly, but only up to a defined degree that may include mismatches or missing values. In the context of this work, allowing missing values inside biclusters is essential to group together subjects that have missing values at some time points and to generate biclusters such as the one shown in Figure 3.1. However, we are not interested in allowing other mismatches/errors in the bicluster computation, so the e-CCC-Biclustering algorithm was modified to allow only missing values as errors. Before applying the biclustering algorithm, the data needs to be processed to generate one one-feature matrix for each longitudinal feature in the original dataset. Each such matrix contains the observations of the considered feature for all patients at the different time points, and constitutes the base data matrix on which biclustering is performed. Figure 3.2 presents this biclustering strategy step by step.
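This preprocessing step can be sketched as follows (a minimal Python illustration with hypothetical names; the longitudinal dataset is assumed to be laid out as a 3-D array of patients × features × time points):

```python
import numpy as np

# Hypothetical longitudinal dataset: 4 patients, 2 features, 5 time points.
# NaN marks a missing observation.
rng = np.random.default_rng(0)
data = rng.integers(0, 21, size=(4, 2, 5)).astype(float)
data[1, 0, 2] = np.nan  # plant a missing value

def one_feature_matrices(dataset: np.ndarray) -> list:
    """Split a patients x features x time dataset into one matrix per
    longitudinal feature (patients x time points)."""
    return [dataset[:, f, :] for f in range(dataset.shape[1])]

matrices = one_feature_matrices(data)
print(len(matrices), matrices[0].shape)  # one matrix per feature
```

Each resulting matrix is then discretized and fed to the biclustering algorithm independently of the others.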
Figure 3.1: Illustration of an e-CCC-bicluster containing samples with missing values.
Figure 3.2: Bicluster computation workflow. First, construct the one-feature matrices by separating the dataset into sub-datasets of only one feature (but with all samples and all time points). Then, apply the modified e-CCC-Biclustering algorithm to find biclusters of samples that have the same discretized values, with missing values as the only possible differences.
After generating each one-feature matrix, as described in Section 2.3, the matrix needs to be discretized, since the algorithm applies biclustering over a discretized version of the data. There are several options regarding the discretization method, but the one that seems most convenient for our problem is discretization into n symbols with equal-width bins computed per subject.
This discretization approach looks at the values of each subject across time and creates n bins of equal width, each corresponding univocally to a symbol of the discretization alphabet. Alphabets of 3 symbols are usually used, in particular the following sequence of letters in increasing order: (D, N, U). Other alphabets are possible; in this work, in order to understand the effect of discretization on the results of biclustering-based imputation, a discretization with 5 symbols is also used, corresponding to the alphabet (A, B, C, D, E).
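A minimal sketch of this per-subject equal-width discretization (our own helper, not the thesis implementation; missing values are carried through undiscretized):

```python
import numpy as np

def discretize_by_subject(row: np.ndarray, alphabet=("D", "N", "U")) -> list:
    """Equal-width discretization of one subject's time series into
    len(alphabet) bins; NaNs stay as None."""
    observed = row[~np.isnan(row)]
    lo, hi = observed.min(), observed.max()
    n = len(alphabet)
    width = (hi - lo) / n or 1.0          # avoid zero width for flat series
    out = []
    for v in row:
        if np.isnan(v):
            out.append(None)
        else:
            idx = min(int((v - lo) / width), n - 1)  # clamp the maximum value
            out.append(alphabet[idx])
    return out

# Bins for this subject: [1, 4), [4, 7), [7, 10] -> D, N, U
print(discretize_by_subject(np.array([1.0, 5.0, np.nan, 9.0, 10.0])))
```

Because the bins are computed per subject, the same raw value can map to different symbols for different patients; what is compared across patients is the shape of the evolution, not its absolute level.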
A point to consider is that not all resulting biclusters are of interest: they may be too small or trivial1, they may be statistically insignificant, or they may simply not make much sense in the context of the problem. For instance, we are interested in finding local trends in six-time-point longitudinal data, and a bicluster of only two time points is not very consistent with that. To understand the implications of these aspects on the imputations performed, and what kind of metric would be interesting to apply, four different sets of biclusters are considered in the imputation process: (1) ALL: all non-trivial biclusters; (2) SIG: only significant biclusters, i.e., with an associated p-value lower than 0.05; (3) TP: all non-trivial biclusters with 3 or more time points, i.e., biclusters with at least 3 columns; and (4) SIGeTP: only significant biclusters with 3 or more time points.
1 Herein, trivial means a bicluster with only one time point or with only one sample/person.
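These four sets can be expressed as simple filters over the list of biclusters. The sketch below uses a hypothetical `Bicluster` record; in the thesis the input is the output of the modified e-CCC-Biclustering algorithm:

```python
from dataclasses import dataclass

@dataclass
class Bicluster:
    rows: list          # patient indices
    cols: list          # contiguous time-point indices
    p_value: float      # statistical significance of the bicluster

def categorize(biclusters, alpha=0.05, min_tp=3):
    """Split non-trivial biclusters into the four sets used for imputation."""
    non_trivial = [b for b in biclusters
                   if len(b.rows) > 1 and len(b.cols) > 1]
    return {
        "ALL": non_trivial,
        "SIG": [b for b in non_trivial if b.p_value < alpha],
        "TP": [b for b in non_trivial if len(b.cols) >= min_tp],
        "SIGeTP": [b for b in non_trivial
                   if b.p_value < alpha and len(b.cols) >= min_tp],
    }

bics = [Bicluster([0, 1], [0, 1, 2], 0.01),   # significant, 3 time points
        Bicluster([2, 3], [1, 2], 0.20),      # neither
        Bicluster([0], [0, 1, 2, 3], 0.01)]   # trivial: one patient only
sets = categorize(bics)
print({k: len(v) for k, v in sets.items()})
```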
Once the biclusters are formed, the imputation may be performed inside each bicluster exactly as it would be on a whole dataset, except that it operates on the smaller group of data that forms the bicluster. This process takes as input each one-feature matrix with missing values and a description of the biclusters found by running the e-CCC-Biclustering algorithm. It starts by matching each missing value to a single bicluster from which the imputation is to be performed. After this univocal relation between missing value and bicluster is computed, the local imputation takes place and a single value to impute is found. This value is then imputed in the one-feature matrix in place of the missing value. A schematic representation of this process is presented in Figure 3.3. Several biclusters may contain the same missing value, however, and a single bicluster must be selected; to solve this, in any of the four cases enumerated above, the biclusters are evaluated according to their statistical significance and the most significant one is selected. Furthermore, some missing values do not fall inside any bicluster. In these cases, the missing value either remains missing, i.e., it is not imputed, or is imputed with an additional method applied to the whole one-feature matrix.
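The selection and local imputation steps can be sketched as follows (hypothetical bicluster records with `rows`, `cols` and `p_value`; the median inside the chosen bicluster stands in for whichever local method is used):

```python
import numpy as np

def impute_in_biclusters(matrix: np.ndarray, biclusters) -> np.ndarray:
    """For each missing value, pick the most significant (lowest p-value)
    bicluster covering it and impute with that bicluster's column median.
    Values outside every bicluster stay missing."""
    out = matrix.copy()
    for i, j in zip(*np.where(np.isnan(matrix))):
        covering = [b for b in biclusters if i in b["rows"] and j in b["cols"]]
        if not covering:
            continue                      # left for an optional global method
        best = min(covering, key=lambda b: b["p_value"])
        col = matrix[best["rows"], j]
        out[i, j] = np.nanmedian(col)     # local imputation inside the bicluster
    return out

m = np.array([[1.0, 2.0, 3.0],
              [1.0, np.nan, 3.0],
              [9.0, 8.0, np.nan]])
bics = [{"rows": [0, 1], "cols": [0, 1, 2], "p_value": 0.01}]
print(impute_in_biclusters(m, bics))
```

In this toy run, the missing value at row 1 is covered by the bicluster and gets its column median, while the one at row 2 falls outside every bicluster and stays missing.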
Figure 3.3: Illustration of the biclustering-based imputation process. For each missing value, it finds a bicluster that contains it. Next, it takes the sub-matrix of the data contained in that bicluster and performs imputation on that missing value.
3.2 Imputation methods applied
This section presents a description of the imputation procedures that are applied and compared.
Expectation maximization Imputation using the EM approach is performed with the Matlab software, specifically the EM imputation implementation described in [16]. This approach receives as input a matrix with missing values encoded as Not a Number (NaN) and outputs the same matrix with the missing values imputed.
Median cross subjects This approach was implemented in the Matlab environment. The procedure computes, for each one-feature matrix, the median for each time point across all subjects, then imputes each missing value with the corresponding computed median.
Median longitudinal A variation of the previous implementation was also developed, where the median values to impute are computed separately not only for each feature but also for each subject, across all time points.
Bicluster-based imputation Following the strategy described in Section 3.1, three imputation methods applied inside the biclusters were explored: imputation with the median across persons, imputation with EM, and imputation by bicluster pattern. The first two are direct applications of the previously introduced implementations and differ only in that they are applied to a much more restricted group, the bicluster. The last method uses the information of the bicluster pattern and the local values from the same person to predict the value to impute. The strategy is: first, select the letter of the bicluster pattern corresponding to the missing value; then apply reverse discretization to it, i.e., based on the other values available for that person, compute the value interval that the letter represents and impute the mean value of that interval. An example of this process is presented in Figure 3.4.
Figure 3.4: Illustration of the biclustering-based imputation using the ”by pattern“ approach. First, determine which discretization letter corresponds to the missing value. Second, compute the mean value of the interval that letter represents for the given subject. The imputation is then performed with this mean value.
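Under the same per-subject equal-width assumption, the reverse discretization can be sketched as (a simplified illustration with our own names, not the thesis code):

```python
import numpy as np

def impute_by_pattern(row: np.ndarray, letter: str,
                      alphabet=("D", "N", "U")) -> float:
    """Impute a missing value for one subject given the bicluster-pattern
    letter: recover the equal-width bin that letter maps to from the
    subject's observed values and return the bin's mean (its midpoint)."""
    observed = row[~np.isnan(row)]
    lo, hi = observed.min(), observed.max()
    width = (hi - lo) / len(alphabet) or 1.0
    k = alphabet.index(letter)            # index of the bin for that letter
    return lo + width * (k + 0.5)         # midpoint of that bin

row = np.array([0.0, 3.0, np.nan, 9.0])  # subject's values, one missing
print(impute_by_pattern(row, "N"))       # middle bin of [0, 9] -> 4.5
```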
Chapter 4
Results
The goal of this chapter is two-fold: (1) evaluate the effectiveness of the imputation methods in longitu-
dinal data; (2) evaluate how general the conclusions are, by testing the imputation algorithms in several
datasets.
To properly evaluate the imputation methods, synthetic datasets were generated. Synthetic data is essential in this work because it makes it possible to compare the imputations against ground truth, allowing reasoning about aspects that are impossible with real data. For example, to evaluate imputation methods one should check whether the predicted values are close to the real ones, i.e., the values that should be there and went missing. Using synthetic data, the missing values are known a priori, and an assessment of this kind is possible. To tackle the second evaluation point, since the previous methodology cannot be applied to real-world data, a classification approach is used over the ALS dataset. If a dataset resulting from one imputation method yields better results in the classification problem than the other imputed datasets, and if the classifications are performed under exactly the same conditions, then this imputation method should be better than the others tested in this particular context. Following this idea, several imputations were applied to the real-world dataset and the resulting complete datasets were classified. This chapter provides detailed information on each of these procedures.
4.1 Synthetic data
The main advantage of testing methods on synthetic data is that the experiments may be performed in a controlled environment, since the data may be defined as needed and cleaned of other factors that obstruct and occlude conclusions. Such data should also be designed with a clear idea of the questions and hypotheses one wants to test, so that the parameters used to construct the dataset naturally expose the answers. As previously mentioned, we want to test whether imputation methods specially designed for longitudinal data perform better than baseline ones when applied to this kind of structured data; in particular, we want to test whether the developed bicluster-based imputation method performs better. With this in mind, two parameters should be tuned when designing the synthetic dataset: the number of biclusters present in the data and the total number of missing values. These aspects directly influence the performance of imputation methods and thus should be carefully defined.
In [6] the authors developed a generator of synthetic data for biclustering (BiGen), which is used in this work. BiGen creates a data matrix in which biclusters are planted according to a multitude of user-controlled parameters; the ones of interest here are the size of the dataset, the distribution and type of data values, the number of biclusters to plant, the coherence type and size of the biclusters, and the total number of missing values.
Regarding the bicluster settings, the coherence type is the most important, since it defines the type of similarity that objects will share and be grouped upon. As mentioned, the objective of applying biclustering algorithms is to group together objects that show the same trend in time. For this, the coherence type chosen is ”order preserving across rows“, which means that objects are grouped together if the values in each object follow the same trend across columns. Although a strictly longitudinal dataset is not designed, defining the bicluster coherence in this way approximates the planted biclusters to the ones we would find in a truly longitudinal dataset.
Another important aspect to consider is the strategy used to generate missing values. BiGen can insert a defined percentage of missing values at random positions. To understand the influence of the amount of missing values on the imputation process, several percentages were used. A version of the data without missing values is also available, which acts as the ground truth in the evaluation stage. A more complete description of the generated data is presented next, in Section 4.1.1.
After imputation, the resulting datasets are evaluated using two metrics: the number of missing values imputed and the mean imputation error, which is the mean difference between the ground truth values and the imputed values.
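The two evaluation metrics can be sketched as follows (our own helper; `ground_truth` is the complete version of the data before the missing values were planted):

```python
import numpy as np

def evaluate_imputation(imputed: np.ndarray, ground_truth: np.ndarray,
                        was_missing: np.ndarray):
    """Return (% of missing values imputed, mean absolute imputation error
    over the values that were actually imputed)."""
    filled = was_missing & ~np.isnan(imputed)
    pct_imputed = 100.0 * filled.sum() / was_missing.sum()
    mean_error = np.abs(imputed[filled] - ground_truth[filled]).mean()
    return pct_imputed, mean_error

truth = np.array([[4.0, 7.0], [2.0, 5.0]])
missing_mask = np.array([[True, False], [False, True]])
imputed = np.array([[5.0, 7.0], [2.0, np.nan]])  # one of two missings filled
print(evaluate_imputation(imputed, truth, missing_mask))
```

In this toy run, half of the planted missings were filled, with a mean absolute error of 1.0 on the filled entry.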
4.1.1 Data description
Using the BiGen generator, matrices of different sizes were generated: 1000 × 150, 2000 × 200 and 5000 × 200. Each generated matrix consists of integer values ranging from 0 to 20, planted with biclusters and missing values. The planted biclusters are set to be ”order preserving across rows“ and their size is drawn from a uniform distribution over both rows and columns, whose minimum and maximum values the user defines. The defined sizes were not extremely large, in order to simulate what happens with the real-world data, considering the dataset size. After the biclusters are planted, missing values are included in each matrix in different percentages: 10%, 20%, 30% and 50%. This generates the final datasets on which imputations are to be performed. Figure 4.1 presents a summary and description of each dataset.
For each one of these datasets, 9 imputation strategies are applied, resulting in 9 differently imputed
Figure 4.1: Description of the datasets generated by BiGen, before imputation is performed. Each dataset is described by its size, percentage of missing values, bicluster parameters, and number of missing values found in biclusters.
datasets. The applied imputation strategies follow the procedures mentioned in Section 3.2 and are described next.
BICmed Only missing values inside biclusters are imputed, with the median across rows; the remaining ones are not imputed with any method. The portion of missing values imputed with this method corresponds to the total number of missing values found in biclusters.
BICem Only missing values inside biclusters are imputed, with EM. This method imputes every missing value inside biclusters, leaving the missing values not grouped in biclusters unimputed. Refer to Figure 4.1 for the corresponding number of imputations.
BICmed MED Applies the median across rows inside biclusters; the remaining missing values are imputed with the median across rows applied to the whole dataset.
BICem MED Missing values inside biclusters are imputed with EM and the remaining are imputed with
median across rows.
BICem EM Missing values inside biclusters are imputed with EM and remaining are imputed with EM
applied to the whole one-feature matrix.
MED Imputation without biclustering; applies the median across rows to the whole matrix.
MEDL Applies longitudinal median to each line of the whole matrix. Note that if a whole line of the
matrix is missing, this method does not impute any value in that line.
MEDL MED First a longitudinal median is applied to each line of the matrix. Then the few lines that
were entirely missing are imputed with median across rows.
EM Imputation with EM applied to the whole matrix. All missing values are imputed.
For a visual aid on the description of each imputation, refer to Figure 4.2, where each imputation process is described by the imputation method it applies (in green).
Figure 4.2: Imputation strategies description. Each imputation method that is used appears in green; methods not used are shown in grey. As an example, BICem MED imputes missing values inside biclusters with the EM approach, followed by median imputation for the remaining missing values.
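As an example of how these strategies compose a main method with a fallback, MEDL MED can be sketched as follows (a simplified numpy illustration with our own names, not the thesis's Matlab code):

```python
import numpy as np

def medl_med(matrix: np.ndarray) -> np.ndarray:
    """MEDL MED: impute each row with its own (longitudinal) median;
    rows that are entirely missing fall back to the per-column median
    across rows."""
    out = matrix.copy()
    for i, row in enumerate(out):
        if np.isnan(row).all():
            continue                        # handled by the fallback below
        out[i] = np.where(np.isnan(row), np.nanmedian(row), row)
    col_median = np.nanmedian(out, axis=0)  # fallback: median across rows
    return np.where(np.isnan(out), col_median, out)

m = np.array([[1.0, np.nan, 3.0],
              [np.nan, np.nan, np.nan],
              [5.0, 6.0, 7.0]])
print(medl_med(m))
```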
In short, each one of the 10 original datasets is imputed by 9 different approaches, originating 90 different datasets that are evaluated and compared. The results of this evaluation are presented in the next section.
4.1.2 Evaluation results
As mentioned above, each imputation approach applied to each dataset is evaluated by two metrics: the percentage of missing values imputed and the mean imputation error. This section presents the results of that evaluation.
Among the several questions one may want to answer with these results, the most interesting is whether the bicluster-based imputation approach derives better imputed values than the alternative methods. To answer this, it is crucial to compare the approaches that use bicluster-based imputation on one portion of the data and an additional method on the remaining missing values against the approaches that use that same additional method to impute the whole dataset. This makes it possible to directly see whether the use of bicluster-based imputation, even on a small portion of the data, results in a better imputation than if it had not been used. The specific comparisons that fall in this category are MED versus BICmed MED or BICem MED, and EM versus BICem EM. These comparisons may be observed in Figure 4.3, where the approaches are compared with respect to their mean imputation error. The conclusion is straightforward: when bicluster-based imputation is used, whatever the imputation method used inside the bicluster, the mean error is smaller than if no bicluster imputation is applied. Moreover, these results are consistent across all nine synthetic datasets, independently of size and percentage of missing values, i.e., the relative ordering of the mean imputation errors for the methods under analysis is maintained, confirming that these conclusions are independent of the percentage of missing values and of the dataset size.
These imputation methods are also robust to the amount of missing values in the synthetic datasets. As can be seen in Figures 4.4 and 4.5, in general, for all data sizes, the mean imputation error barely increases even with a dramatic increase in missing values (from 10% to 50%).
Figure 4.3: Comparison between datasets imputed with bicluster-based imputation plus median or EM on the remaining missing values, against datasets imputed only with median or EM. All nine synthetic datasets present the same relative results: bicluster-based imputation enhances the imputation results.
Figure 4.4: Mean imputation error for the smallest dataset (1000 × 150) with differing amounts of missing values. All methods perform almost equally well for different amounts of missing values, even when this amount rises to 50%. This result is consistent across the datasets of different sizes.
Finally, it is also possible to identify which of the tested imputation approaches shows the most promising results in terms of mean imputation error. Figure 4.6 shows the mean imputation error for the smallest dataset (1000 × 150) for all the imputation approaches applied. As before, these results are consistent across all dataset sizes and amounts of missing values, so only the results for the datasets of size 1000 × 150 are shown here. Analyzing these results leads to the conclusion that the BICem MED and BICem EM methods consistently achieve the best imputations, i.e., their predicted values are closest to the real ones. The complete results on which these analyses are based may be consulted in Appendix A.
4.2 Real-World data
For the real-world data, the ALS dataset, the imputation methods were indirectly evaluated through classification results. The classification problem, as described in Section 1.2, consists of predicting, using all previous observations, whether or not the patient will need assisted ventilation (NIV) by the time of the sixth visit.
Figure 4.5: Mean imputation error for the datasets with size 1000 × 150 and percentages of missing values of 10% and 50%.
Figure 4.6: Mean imputation error in the smallest dataset (1000 × 150) for all methods and different amounts of missing values. For all datasets (sizes and amounts of missing values) the best method was BICem MED, followed closely by BICem EM. BICem MED: EM imputation inside each bicluster followed by median imputation across rows of the whole dataset for the rest of the missing values. BICem EM: EM imputation inside each bicluster followed by EM imputation in the whole dataset for the rest of the missing values. See the text above for the definition of all methods.
Classification is performed in a supervised way, but since the two classes, Evol and noEvol, are seriously unbalanced, it was necessary to apply the Synthetic Minority Oversampling Technique (SMOTE) (as described in Section 2.4) in order to achieve a better balance.
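The core SMOTE step interpolates each minority sample with one of its nearest minority-class neighbors. A minimal sketch (our own simplified version, not the WEKA implementation used in the thesis):

```python
import numpy as np

def smote(minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0):
    """Generate n_new synthetic minority samples by interpolating each chosen
    sample with one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # distances to the other minority samples
        d = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]       # skip the sample itself
        nn = minority[rng.choice(neighbors)]
        gap = rng.random()                       # position along the segment
        synthetic.append(x + gap * (nn - x))
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new = smote(minority, n_new=8, k=2)
print(new.shape)  # 8 synthetic samples in the minority region
```

Because the synthetic points lie on segments between real minority samples, they enlarge the minority class without stepping outside its region of the feature space.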
Concerning the classifiers, the ongoing work by Andre Carreiro [2] selects Naive Bayes, since it is the one that yields the best results. However, the Naive Bayes (NB) implementation in the WEKA data mining software is also known to work particularly well with missing values, so classification results are not expected to improve significantly just by using better imputation methods. For this reason, and in order to highlight which imputations really improve the classification process, different classifiers were also used, namely Decision Trees (DT), K-Nearest Neighbors (KNN) and a linear Support Vector Machine (LinearSVM).
NB was applied with a kernel estimator; regarding the default handling of missing values, the NB implementation in WEKA simply omits the conditional probabilities of features with missing values in test instances. KNN was applied with 1 neighbor; by default, missing values are assigned the maximum distance when comparing instances with missing values. DT was performed with a confidence factor of 0.25 and without Laplace smoothing; this implementation simply does not consider the values of attributes with missings when computing gain and entropy. LinearSVM was performed with a complexity of 1.0; missing values are treated by imputing global means/modes.
The classification process used a cross-validation setup, in which each dataset (SMOTE 300%, SMOTE 500% and Original) was divided into five folds, of which four were used for training and one for testing. These experiments were repeated 10 times for each classifier and dataset. Classifications are evaluated with the TP rate, TN rate, precision, K-statistic and F-measure. Conclusions from these results should always be drawn from an evaluation of the several metrics; however, since the F-measure balances the influence of each class and integrates both precision and recall in a single number, it was given chief importance.
4.2.1 Data description
This work was built upon clinical data containing information on ALS patient follow-ups collected by the Neuromuscular Unit at the Molecular Medicine Institute of Lisbon. As mentioned, this dataset is constructed in a longitudinal fashion, where each patient is observed at several moments through time. Although the observations do not follow a strict schedule, they tend to average 3 months between consecutive observations. The dataset contains demographic information, patient characteristics, neuropsychological analyses, motor evaluations, and respiratory tests, among which the NIV requirement is included. In short, each patient evaluation consists of the observation of 34 different features. A statistical description of the dataset is presented in Appendix B. The features may be differentiated according to their evolution through time as static or longitudinal: static features stay constant over time, while longitudinal ones show some trend. Of the 34 features, 22 are longitudinal and
so are the focus of this work. In the context of the presented problem, each patient's follow-up is labeled Evol or noEvol, according to whether an evolution in the NIV indicator exists or not. The higher the number of follow-ups, the easier it should be to perceive and exploit trends in the data; therefore, only patients with at least five follow-ups were considered. Of these, only the patients that did not evolve from not needing NIV to needing NIV before the fifth moment are of interest to the classification problem at hand. Although other setups could be considered, this was the best trade-off between the number of resulting patients and the number of follow-ups, since requiring more follow-ups results in fewer patients fulfilling the needed conditions. After filtering out the uninteresting cases, the resulting dataset consists of 159 patients observed on 34 different features at 5 different moments, which takes the form of a matrix of size 159×170, as depicted in Figure 3.2.
The resulting dataset is quite unbalanced: it contains 31 Evol samples and 128 noEvol samples, as may be observed in Figure 4.7, where patients labeled Evol account for only approximately 20% of the cases.
Figure 4.7: Number of instances per class, Evol or noEvol, in the ALS dataset.
Missing Values Analysis
Approximately 40% of the values in the present dataset are missing. As illustrated in Figure 4.8, these missing values occur in approximately 80% of the features, and there is no single patient without at least one missing value.
Figure 4.8: Amount of missing values per feature, per patient, and in the whole dataset. Missing values are represented in green.
These missing values are distributed unevenly between the two classes: of the total number of values belonging to class Evol, approximately 80% are missing, against 20% missing values in class noEvol. This aspect is depicted in Figure 4.9 and represents a problem for the classification.
Regarding the mechanisms behind the missing values, i.e., the reasons why data is missing, there is no known justification. In the longitudinal features, missing values occur in what seems to be a random fashion, without any identified pattern and without any prior knowledge indicating one. Data is simply missing because it was either not observed or not annotated, and no consistent mechanism creating these missing values was found. In the static features, however, some of the missing
Figure 4.9: Proportion of missing values in class Evol and noEvol. 80% of the values in the Evol samples are missing.
values may be considered ”false“ missings, since a value that is missing at some time point can readily be filled in with the values from other time points. The missing values that cannot be instantly filled are those for which no value is observed at any time point; these cases are pre-imputed with the median across rows. Appendix B presents a characterization of the number of missing values per feature and time point. For each time point, features 11 to 33 are the longitudinal ones and are also the ones presenting the greatest amount of missing data.
4.2.2 Biclustering results
As previously mentioned, the modified version of the biclustering algorithm, e-CCC-Biclustering, was applied to the longitudinal features previously transformed into one-feature matrices. The results of this procedure are presented here. The discretization described in Section 3.1, needed when applying this algorithm, was performed with two different numbers of symbols, 3 and 5, corresponding to the alphabets (U, N, D) and (A, B, C, D, E). The reason for using these two settings was to understand their influence on the biclustering, imputation and classification results. It is expected that a finer discretization, i.e., one where the error made when discretizing values is smaller, should lead to better imputed values; however, the biclustering results are expected to be worse, i.e., a lower number of interesting biclusters found.
To understand the ability of the e-CCC-Biclustering algorithm to find and group together patients that show the same trends, it is necessary to analyze the amount and importance of the biclusters found. Although the trivial biclusters have already been filtered out, not all resulting biclusters are of interest for the present problem, and thus a characterization relating the size and significance of the biclusters is needed. For this reason, the biclusters to consider are grouped into four different categories (ALL, SIG, TP, and TPeSIG), as introduced in Chapter 3.1. The distribution of the number of biclusters in each of these categories, by feature, is presented in Figure 4.10.
This analysis allows for a characterization of the biclusters found and leads to the general observation that, as expected, biclusters that are both significant and have three or more time-points are scarce. Another interesting observation, obtained by comparing these results with the number of missing values in each feature (presented in Appendix B), is that the higher the number of missing values, the lower the number of biclusters found, for both the 3- and 5-symbol discretizations. This was expected
Figure 4.10: Distribution of the biclusters through the bicluster categories. ALL: all biclusters after filtering out the trivial ones. SIG: significant biclusters. TP: biclusters with 3 or more time-points. TPeSIG: significant biclusters with three or more time-points.
since missing values lead to a loss of information: the higher the number of missing values, the more scattered the data becomes, increasing the difficulty of finding interesting biclusters.
An important aspect to analyze, since it is highly correlated with the imputation results, is the amount of missing values that are caught in biclusters. Figure 4.11 presents the percentage of missing values that fall in each bicluster category, for both the 3- and 5-symbol discretizations.
Figure 4.11: Representation of the number of missing values belonging to each set of biclusters.
As may be observed, the total number of missings grouped in biclusters does not exceed 30%, and that is considering all the non-trivial biclusters. If a more restricted but also more interesting group of biclusters is considered, then only approximately 5% of the missings are caught. Also, regarding the effect that the discretization options had on the capability of finding biclusters and on their quality, one may observe from all the previous analyses that using 3 or 5 symbols does not result in a concrete difference.
4.2.3 Datasets imputation results
As presented in Chapter 3.2, several imputation methods and combinations thereof were applied to these data, resulting in several imputed datasets. This section describes each dataset created in detail, stating the imputation method or strategy used as well as the specific settings applied. Each dataset is identified by the imputation method used for its creation.
ORI The original dataset; here, imputation with the proposed methods is not performed. Instead, the missings are left to be treated by the default method for dealing with missing values implemented in WEKA for each classifier. This dataset is the baseline for comparing the proposed imputation methods with the default one.
MED The missing values were imputed with the median of all values of the same feature across all patients in the dataset. This method is capable of imputing all missings in the dataset.
MEDL Missing values were imputed with the median of the values from the same patient and feature over all time-points. This may be seen as a median taken horizontally, in contrast with the MED method, which may be seen as a median taken vertically. In the specific cases where, for a single patient, the observations of a feature are missing at all time-points, this method is incapable of predicting any value to impute. The percentage of missings imputed is about 64%.
MEDL MED This dataset is imputed with the previous approach, and MED is additionally applied. This strategy allows the remaining 36% of the missings to be imputed.
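The MEDL-then-MED cascade can be sketched on a one-feature matrix (rows = patients, columns = time-points); representing missings as NaN is an assumption made for illustration.

```python
import numpy as np

def medl_then_med(X):
    # MEDL step: 'horizontal' median over the patient's own time-points.
    out = np.asarray(X, dtype=float).copy()
    for row in out:
        obs = row[~np.isnan(row)]
        if obs.size:
            row[np.isnan(row)] = np.median(obs)
    # MED fallback: 'vertical' median across all patients, used only for
    # patients with no observed value at any time-point.
    col_med = np.nanmedian(out, axis=0)
    return np.where(np.isnan(out), col_med, out)
```

The order matters: MEDL exploits each patient's own trajectory first, and MED only fills the roughly 36% of cells that MEDL cannot predict.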
EM The EM imputation method, as described in Chapter 3.2, is applied to the entire dataset. This approach is able to impute every missing value in the dataset.
EMfeature The same EM implementation is applied to each of the one-feature matrices whose structure is introduced in Chapter 3.1. In the few cases where an entire row or an entire column of values is missing in these matrices, the EM imputation algorithm cannot predict any value for the corresponding feature. These cases correspond to 4% of the missings, which are left missing.
EMfeature MED In the special cases where EMfeature is not capable of predicting values, the cross-patient median imputation (MED) is applied, allowing a complete dataset to be obtained.
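The EM family of imputations can be illustrated with a minimal Gaussian-EM sketch; the thesis's implementation follows Chapter 3.2, so the details below (conditional-mean update, ridge term, fixed iteration count) are assumptions of this sketch rather than the method actually used.

```python
import numpy as np

def em_impute(X, n_iter=25):
    # Iteratively: (1) estimate mean/covariance from the current filled-in
    # data, (2) replace each missing entry by its conditional expectation
    # given the row's observed entries under a multivariate Gaussian.
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    X[miss] = np.take(col_mean, np.where(miss)[1])   # initial fill
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # ridge
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any() or m.all():
                continue                              # nothing to update
            o = ~m
            coef = cov[np.ix_(m, o)] @ np.linalg.inv(cov[np.ix_(o, o)])
            X[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
    return X
```

Fully-missing rows stay at the column means, which mirrors the EMfeature limitation described above (an entirely missing row gives EM nothing to condition on).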
BIC3TPem This dataset is constructed by imputing missings with the bicluster-based imputation strategy using EM. The biclustering procedure is performed on a version of the data discretized with an alphabet of 3 symbols and, from the biclusters found, only those with 3 or more time-points are considered. This approach is able to impute all missing values that belong to this set of biclusters; the others remain missing. The missing values imputed amount to only approximately 6% of the missing values in the original dataset.
BIC3TPem MED This dataset is the same as the previous one, with the remaining missings imputed with MED.
BIC3TPem EMfeature This dataset is the version of BIC3TPem where the remaining missings are imputed with the EMfeature approach. As before, this approach is not able to handle every missing value, so in total only 68% of the missings are imputed.
BIC3TPem EM This dataset is also a version of BIC3TPem, where the additional method used to impute the remaining missings is EM. This strategy is able to impute every missing value in the dataset.
BIC3ALLmed Missings are imputed with bicluster-based imputation with median, using all non-trivial biclusters found on the version of the data discretized with 3 symbols. The missings belonging to these biclusters amount to about 30% of the total number of missing values and are all imputed.
BIC3ALLmed MED Here the remaining missings from BIC3ALLmed are imputed with MED, which
results in a complete dataset with no missing values.
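The biclustering-based imputation with median used by the BIC*med datasets can be sketched as follows; the (rows, columns) representation of a bicluster and the per-time-point median are assumptions made for illustration (the exact procedure is defined in Chapter 3.2).

```python
import numpy as np

def bicluster_median_impute(X, biclusters):
    # Each bicluster is given as (rows, cols) index lists -- an assumed
    # representation. A missing cell covered by a bicluster is filled with
    # the median that the bicluster's patients show at that time-point.
    out = np.asarray(X, dtype=float).copy()
    for rows, cols in biclusters:
        sub = out[np.ix_(rows, cols)]
        col_med = np.nanmedian(sub, axis=0)   # per-time-point median
        for ci, c in enumerate(cols):
            for r in rows:
                if np.isnan(out[r, c]):
                    out[r, c] = col_med[ci]
    return out
```

Missings outside every selected bicluster are left untouched, which is why these strategies only cover 5%-30% of the missing values and need a MED fallback.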
BIC3SIGmed Missings are imputed with bicluster-based imputation with median, using only the significant biclusters. The biclusters are found on a version of the data discretized with 3 symbols. The number of missings imputed corresponds to the number of missings existing in the selected biclusters, about 9% of the total.
BIC3SIGmed MED The remaining missings from BIC3SIGmed are imputed with MED; all missings are then imputed.
BIC3SIGTPmed The bicluster-based imputation is performed here using the median as the internal imputation method and using biclusters that are both significant and have 3 or more time-points. The discretization is also performed with 3 symbols. The missings that this strategy is able to impute correspond to the total number of missings found in the selected biclusters, about 5%.
BIC3SIGTPmed MED The remaining missings from BIC3SIGTPmed are imputed with MED, resulting
in a complete dataset.
BIC3TPmed Biclustering-based imputation is applied with the median imputation approach; the data are discretized with an alphabet of 3 symbols and the biclusters selected to perform the imputation are the ones with at least 3 time-points. The missings imputed with this strategy amount to about 6% of the total.
BIC3TPmed MED To deal with the remaining missings from BIC3TPmed, MED is applied. This allows every missing value in the dataset to be imputed.
BIC3TPpattern In this dataset, the biclustering-based imputation by pattern is applied to the missings belonging to the biclusters with at least 3 time-points. A discretization with an alphabet of 3 symbols is used. This strategy is able to impute every missing value belonging to the selected biclusters, about 6% of the total number of missings in the original dataset.
BIC3TPpattern MED The previous dataset is additionally imputed with the MED approach to deal with the remaining missings. This results in a complete dataset.
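The by-pattern variant can be sketched as follows; the thesis defines the exact procedure in Chapter 3, so mapping a symbol back to the midpoint of its discretization bin is only an illustrative assumption.

```python
import numpy as np

def impute_by_pattern(X, rows, cols, pattern, edges, alphabet='ABC'):
    # pattern: one discretization symbol per bicluster column; a missing
    # cell receives the midpoint of that symbol's bin (an assumed mapping
    # from the symbolic pattern back to a numeric value).
    out = np.asarray(X, dtype=float).copy()
    for ci, c in enumerate(cols):
        i = alphabet.index(pattern[ci])
        value = 0.5 * (edges[i] + edges[i + 1])
        for r in rows:
            if np.isnan(out[r, c]):
                out[r, c] = value
    return out
```

Unlike the median variant, the fill value here comes from the bicluster's shared symbolic pattern rather than from the observed numeric values at that time-point.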
BIC5TPem This dataset was imputed with biclustering-based imputation using the EM approach. The discretization was performed with an alphabet of 5 symbols and the selected biclusters are the ones with at least 3 time-points. This strategy is able to impute 6% of the missings.
BIC5TPem MED The remaining missings from the BIC5TPem dataset are imputed with the MED procedure. The resulting dataset has no missings.
BIC5TPem EMfeature The remaining missings from the BIC5TPem dataset are imputed with the EMfeature procedure. The resulting dataset still has about 2% missing values.
BIC5TPem EM The remaining missings from the BIC5TPem dataset are imputed with the EM procedure. This results in a complete dataset.
BIC5ALLmed Missings were imputed using the biclustering-based imputation with median. Biclustering is performed here on a version of the data discretized with 5 symbols, and the imputation process uses all non-trivial biclusters found. The missings that this strategy is able to impute correspond to those existing in the selected biclusters, about 30% of the total.
BIC5ALLmed MED The remaining missings in the previous dataset are imputed with MED, generating a completely imputed dataset.
BIC5SIGmed Missings are imputed with bicluster-based imputation with median, using only the significant biclusters. Here the biclusters are found on a version of the data discretized with 5 symbols. The number of missings imputed corresponds to the number existing in the selected biclusters, about 9%.
BIC5SIGmed MED The remaining missings from BIC5SIGmed are imputed with MED; all missings are then imputed.
BIC5SIGTPmed The bicluster-based imputation is performed here using the median as the internal imputation method and using biclusters that are both significant and have 3 or more time-points. The discretization is performed with 5 symbols. The missings that this strategy is able to impute correspond to the total number of missings found in the selected biclusters, about 5%.
BIC5SIGTPmed MED The remaining missings from BIC5SIGTPmed are imputed with MED, resulting in a complete dataset.
BIC5TPmed Biclustering-based imputation is applied with the median imputation approach; the data are discretized with an alphabet of 5 symbols and the biclusters selected to perform the imputation are the ones with at least 3 time-points. The missings imputed with this strategy amount to about 6% of the total.
BIC5TPmed MED To deal with the remaining missings from BIC5TPmed, MED is applied. This allows every missing value in the dataset to be imputed.
BIC5TPpattern In this dataset, the biclustering-based imputation by pattern is applied to the missings belonging to the biclusters with at least 3 time-points. A discretization with an alphabet of 5 symbols is used. This strategy is able to impute every missing value belonging to the selected biclusters, about 6% of the total number of missings in the original dataset.
BIC5TPpattern MED The previous dataset is additionally imputed with the MED approach to deal with the remaining missings. This results in a complete dataset.
An organized visual representation of these descriptions is presented in Figure 4.12. Therein, for each dataset, the methods and settings used appear in green, together with the number of missing values each method is able to impute.
Figure 4.12: For each dataset, the imputation approaches used (green), the order of their application, and the number of missing values each method is able to impute.
4.2.4 Classification results
Because of the extreme class imbalance in the data, applying SMOTE was imperative, and only with SMOTE at 300% was it possible to obtain a balanced dataset. As is the usual procedure, SMOTE at 500% was also applied in order to obtain the inversely imbalanced dataset, i.e., one where the class EVOL ends up with more instances than the class noEVOL had in the original dataset. These two procedures were applied to each dataset described above.
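SMOTE (applied here through WEKA) creates synthetic minority samples by interpolating between a minority instance and one of its k nearest minority neighbours, following Chawla et al.'s scheme in outline. A minimal sketch assuming purely numeric features, where percent=300 yields three synthetic samples per minority instance:

```python
import numpy as np

def smote(minority, percent=300, k=3, rng=None):
    # minority: array of minority-class instances (rows). For each one,
    # generate percent/100 synthetic samples on the segment toward a
    # randomly chosen one of its k nearest minority neighbours.
    if rng is None:
        rng = np.random.default_rng(0)
    X = np.asarray(minority, dtype=float)
    n_new = percent // 100
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # exclude self-distance
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours
    synth = []
    for i in range(len(X)):
        for _ in range(n_new):
            j = nn[i, rng.integers(k)]
            gap = rng.random()                  # interpolation factor in [0, 1)
            synth.append(X[i] + gap * (X[j] - X[i]))
    return np.vstack(synth)
```

At 300% this triples the minority class (balancing it here), while 500% overshoots it past the original majority, producing the inverted imbalance described above.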
After this stage, all datasets described, together with the ones created by SMOTE, are classified as explained before. The resulting classifications are evaluated through the metrics TP rate, TN rate, precision, recall, K-statistic and F-measure. The extensive results of these metrics for each dataset and classifier are presented in Appendix C.
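For a binary problem with EVOL as the positive class (an assumption of this sketch), the listed metrics reduce to confusion-matrix ratios:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    # 1 marks the positive (EVOL) class, 0 the negative (noEVOL) class.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tp_rate = tp / (tp + fn)          # sensitivity; identical to recall
    tn_rate = tn / (tn + fp)          # specificity
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return tp_rate, tn_rate, precision, f_measure
```

The F-measure, the harmonic mean of precision and recall, is the metric compared across imputation methods in Figures 4.13-4.16.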
Being unable to directly evaluate the imputation methods with these data, the focus here is on understanding which methods work better with which classifiers, in order to help improve the classification process addressed in other works. Figures 4.13, 4.14, 4.15 and 4.16 represent the F-measure for each classifier on the balanced datasets (with SMOTE 300%). From these it is possible not only to observe which method is the best for each classifier but, more importantly, to infer some aspects of the relation between the particularities of the imputation methods and the classifiers' performance. It is also important to keep in mind that the results concerning the original dataset (ORI) serve to evaluate the default handling of missings that each classifier implementation uses, and to compare it with the imputation methods under test. The default strategies are described in Chapter 4.2.
Figure 4.13: F-measure for all balanced datasets classified with NaiveBayes.
Figure 4.14: F-measure for all balanced datasets classified with linearSVM.
Using the Naive Bayes classifier, imputation with the median applied to the whole dataset (MED) and the biclustering-based imputation with median (for example, the datasets imputed with BIC3SIG and BIC5SIG) improve over WEKA's default method for dealing with missing values (ORI). Also, when using KNN, WEKA's default method performs worse than the biclustering-based imputation approaches with median (BIC3SIG and BIC5SIG), EM applied to the whole dataset (EM), and the median applied to the whole dataset (MED). Regarding Decision Trees (DT), it is noticeable that the biclustering-based imputation procedure with EM improves the results over the EM imputation applied feature by feature.
Figure 4.15: F-measure for all balanced datasets classified with K-nearest-neighbor.
Figure 4.16: F-measure for all balanced datasets classified with Decision Trees.
As to the linear SVM, the biclustering-based imputation methods using the by-pattern approach (for example, the datasets imputed with BIC3bypattern) and the median applied to the whole dataset (MED) improve results over WEKA's default missing-value treatment. The biclustering-based imputation with the by-pattern approach also improves results over EM applied feature by feature (EMbyfeature) and over the median applied to the whole dataset (MED).
These conclusions are supported by the results of the Wilcoxon signed-rank test, which compares the results of two experiments and determines whether their values are significantly different. The resulting p-values are presented, for each classifier, in Tables 4.1, 4.2, 4.3 and 4.4. A p-value lower than 0.05 indicates that the mean F-score values of the two experiments are significantly different, and thus conclusions about relative performance may be drawn.
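A minimal sketch of the paired test used here; it applies the normal approximation and omits tie and zero-difference corrections, so it is an illustration rather than the exact procedure behind Tables 4.1-4.4.

```python
import math
import numpy as np

def wilcoxon_signed_rank(a, b):
    # Paired samples a, b (e.g. per-fold F-scores of two experiments).
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    d = d[d != 0]                                     # drop zero differences
    n = d.size
    ranks = np.argsort(np.argsort(np.abs(d))) + 1.0   # 1-based ranks (no tie averaging)
    w_plus = ranks[d > 0].sum()                       # sum of positive-difference ranks
    mu = n * (n + 1) / 4.0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))              # two-sided p-value
    return w_plus, p
```

A p-value below 0.05 rejects the hypothesis that the paired results come from the same distribution, which is the criterion applied in the tables below.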
Regarding the use of 3 or 5 symbols in the discretization process, comparing the classification results of the several BIC3 and BIC5 methods suggests that, here too, these options do not introduce much difference. This comparison is presented in Table 4.5.
Table 4.1: P-values from applying the Wilcoxon signed-rank test to the F-score results of the experiments defined in the left column, together with the mean F-score values for each experiment.
EMbyfeature vs BIC3em EMbyfeature: means 0.8754 / 0.8701, p = 0.820018494
EMbyfeature vs BIC5em EMbyfeature: means 0.8754 / 0.8763, p = 0.979542991
Table 4.2: P-values from applying the Wilcoxon signed-rank test to the F-score results of the experiments defined in the left column, together with the mean F-score values for each experiment.
EMbyfeature vs BIC3em EMbyfeature: means 0.6798 / 0.6656, p = 0.2679884
EMbyfeature vs BIC5em EMbyfeature: means 0.6798 / 0.6858, p = 0.8431294
Table 4.3: P-values from applying the Wilcoxon signed-rank test to the F-score results of the experiments defined in the left column, together with the mean F-score values for each experiment.
EMbyfeature vs BIC3em EMbyfeature: means 0.8066 / 0.7836, p = 0.088
EMbyfeature vs BIC5em EMbyfeature: means 0.8066 / 0.8321, p = 0.0164
Table 4.4: P-values from applying the Wilcoxon signed-rank test to the F-score results of the experiments defined in the left column, together with the mean F-score values for each experiment.