University of Manchester
School of Computer Science
Final year project report 2016
BSc Computer Science with Industrial Experience
Evaluation of Mutual information
versus Gini index for stable feature
selection
Author:
Veneta Haralampieva
Supervisor:
Dr. Gavin Brown
May 02, 2016
Abstract
The selection of highly discriminatory features has been crucial in aiding further advancements in
domains such as biomedical sciences, high-energy physics and e-commerce. Therefore, evaluation of the
robustness of feature selection methods to small perturbations in the data, known as feature selection
stability, is of great importance to people in these respective fields. However, little research has been
focused on investigating the stability of feature selection algorithms independently from any learning
models. This project addresses the problem by providing an overview of several established stability
measures and reintroducing Pearson’s correlation coefficient as another. The coefficient has then been
employed in the empirical evaluation of four commonly used feature selection criteria, Mutual
information maximisation, Mutual information feature selection, Gini index and ReliefF. A high overall
stability of Mutual information maximisation and Gini index can be observed for small data samples,
with a slightly lower stability being seen for ReliefF. All criteria exhibit low stability when applied to
high-dimensional datasets consisting of a small number of samples, with Mutual information feature
selection performing poorly across all datasets.
Acknowledgements
Firstly, I would like to thank my supervisor Dr Gavin Brown for the support and guidance
throughout the course of this project.
Secondly, I would like to extend a thank you to my parents and my sister for always believing
in me and helping me through all difficult times.
Lastly, I would like to express my gratitude to all my colleagues and friends, in particular to my
flat-mate Gaby and my boyfriend Dan, for the useful feedback regarding the seminar and later
the presentation of results.
All your help and support has been greatly appreciated!
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
LIST OF FIGURES
Figure 1: Programming languages popularity in data science
Figure 23: Sampling with replacement
Chapter One
1 INTRODUCTION
1.1 CHAPTER OVERVIEW
The focus of this chapter is to provide a short overview of the main aspects related to this
project, such as the problem being investigated, the motivation behind it and its aims and
objectives. The final section will outline the report’s structure.
1.2 MOTIVATION
In recent years the term ‘Big Data’ has appeared prominently in both research literature and popular
media. Ward and Barker provide a unified definition of it which postulates that ‘Big data is a
term describing the storage and analysis of large and or complex data sets using a series of
techniques including, but not limited to: NoSQL, MapReduce and Machine learning.’ (Ward &
Barker, 2013). It might seem counter-intuitive as to why this could be problematic since one
would assume that the more information there is regarding a particular problem, the easier
finding a solution will be.
The high-energy physics domain can be used to illustrate this phenomenon; the Large Hadron
Collider in CERN, Geneva has been colliding particles travelling close to the speed of light, and
thus driving important scientific discoveries in the field of particle physics over the last few
decades. Interestingly, the research conducted there presents numerous unique challenges not
only to physicists but also to computer scientists. The data anticipated to be generated across all
four experiments is 25 GB/s, with a single collision producing around 1 MB of data (CERN, 2016). After a
two-stage filtering process, around 15 petabytes of data are stored annually. From this
astonishing amount scientists have estimated that only one collision in a million is of interest. It
becomes apparent that without proper filtering and storage, managing and analysing the data
becomes virtually impossible. For scientists in that domain, selecting only the relevant particle
collisions is of vital importance.
However, the Big Data phenomenon is present in other fields such as biomedical sciences. Since
the completion of the Human Genome Project, the price for full genome sequencing1 has
dropped rapidly, thus generating vast amounts of genomic data. Effective analysis of this data
1 The term full genome sequencing is used to describe the identification of ‘the genetic basis of traits and disease
susceptibilities using SNP microarrays that capture most of the common genetic variation in the human population’
(Ng & Kirkness, 2010). Single nucleotide polymorphism (SNP) microarray is the technique used to accomplish this.
promises to revolutionise medicine by uncovering causes for genetic diseases, as well as
developing innovative personalised drug treatments. It should be noted that the human
genome consists of around 30,000 protein-coding genes and selecting the ones associated with a
particular genetic disease such as cancer is no trivial task.
Outside the scientific domains commercial companies such as Google, Amazon, Facebook and
Apple are storing large amounts of user data in hopes of developing better user experience by
effectively analysing and using this information. One example area is digital advertising, where
Amazon, for example, employs previously stored user activity to generate recommendations for
products and services. One approach the company uses for offering personalised
advertisements is finding similarity between the products purchased and other available
products (Linden, Smith, & York, 2003). One should realise that a product on Amazon, such as a
book, is described by a number of characteristics such as genre, publisher, product dimensions,
author, price, average review score, etc. Evidently, publisher or product dimensions might not
be of vital importance to whether a person will purchase a book; however, genre and author
might be. Thus selecting a good set of characteristics is important.
All the above-mentioned examples are characterised by high dimensionality, and require
extraction of the most important attributes describing the target concept. In the field of Machine
Learning this is typically accomplished by the means of feature selection, which focuses on
identifying a subset of features which best achieve a certain goal, such as discovering the main
genes, in which mutations can lead to cancer.
It should be noted that obtaining a robust subset of features, also known as ‘feature preferences’
(Kalousis, Prados, & Hilario, 2007), is a necessity in the areas mentioned above. Since experts
will invest a considerable amount of time, research effort and money into further exploring the
selected set of ‘preferences’ it is paramount that small variations in the data do not drastically
change these. This links closely with the notion of stability of a feature selection algorithm,
defined as the “robustness of the feature preferences it produces to differences in training sets
drawn from the same generating distribution” (Kalousis, Prados, & Hilario, 2007).
Surprisingly, this area of Machine Learning has received little attention from the research
community. This has motivated the investigation of various stability measures and their
applications in the empirical evaluation of feature selection criteria.
1.3 AIMS AND OBJECTIVES
As identified above, the aim of this project is to research and evaluate the stability of different
feature selection criteria. In order to achieve this, detailed research has been conducted,
followed by experimental work and consolidation of results. The process involved critical
analysis of every criterion’s strengths and limitations, followed by careful examination of their
relationship with each other. To facilitate this a custom implementation of the criteria and
stability measures has been developed, followed by extensive testing and validation.
The project work has been supported by aiming to accomplish the following project objectives:
- Research and implement the following feature selection criteria:
  o Mutual information maximisation
  o Mutual information feature selection
  o Gini index
  o ReliefF
- Research, implement and perform critical analysis of the following similarity measures used in stability estimation:
  o Tanimoto distance
  o Kuncheva’s stability index
  o Hamming distance
  o Pearson’s correlation coefficient
- With the help of the above identified measures, evaluate the stability of the criteria
1.4 REPORT OUTLINE
The remainder of the report is organised as follows: Chapter Two provides an overview of the
feature selection criteria and stability measures explored in this project. Chapter Three describes
the project structure and technologies used. Chapter Four focuses on the details of the
implementation, followed by a thorough explanation of the testing performed. Chapter Five
outlines the experimental work undertaken, the characteristics of the datasets used, followed by
analysis of the results. Chapter Six serves as a conclusion and highlights achievements and
future work.
1.5 CHAPTER SUMMARY
The notions of feature selection and stability were introduced, along with their importance in relation to
Big Data and Machine Learning applications. The chapter then set out the aims and objectives
and presented a brief overview of the report’s structure.
Chapter Two
2 BACKGROUND
2.1 CHAPTER OVERVIEW
This chapter will provide a review of the research already conducted with respect to feature
selection and stability, which aims to introduce the reader to the vital concepts explored in
detail in this project. An outline of the strengths and weaknesses of each of the techniques shall
be presented.
2.2 STABILITY MEASURES
In Machine Learning, the task of identifying the category or class of an unseen instance based
on past data, also known as classification, plays a prominent role. It has been established that in
high-dimensional problems feature reduction could improve classification accuracy as
discussed by Das (Das, 2001) and Xing, Jordan and Karp (Xing, Jordan, & Karp, 2001). As a
result the notion of feature selection stability has traditionally been associated with it. However,
this provides no means of distinguishing the behaviour of the feature selection algorithm with
respect to changes in the data, and that of the classifier. One can easily infer that this distinction
is quite important, since a low stability score in that case presents no information as to whether
the feature selection algorithm produces a non-optimal feature subset or the classifier’s
sensitivity to the data is too high.
Kalousis was one of the first people to investigate feature selection stability separately
(Kalousis, Prados, & Hilario, 2007). He proposed several similarity measures, which can be
employed independently of any learning model and cater for the different outputs produced by
feature selection algorithms. These algorithms do one of three things; assign a weight or a score
to features, rank them or select a small subset. As a result Kalousis proposed three separate
measures: Pearson’s correlation coefficient for weight-scoring outputs, Spearman’s rank
correlation coefficient for ranking and Tanimoto distance for measuring set similarity. Measures
operating on subsets of features are the focus of this research and will be explored in detail
shortly.
Each of the similarity measures 𝑆𝑥 presented below operates on two sets alone. They can be
used to measure the total similarity 𝑆 of 𝑚 sequences 𝐴 = {𝐴1, 𝐴2, 𝐴3, … 𝐴𝑚} by averaging the
sum of the similarity scores for every two subsets.
This leads to a generalised approach for computing the overall stability regardless of the
similarity measure employed and is defined as:
S(A) = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S_x(A_i, A_j)    (1)
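To make the averaging in (1) concrete, the following is a minimal sketch in Python (the project itself was implemented in R; the code and names here are purely illustrative). It accepts any pairwise similarity function, so the same routine serves every measure defined below:

```python
from itertools import combinations

def overall_stability(subsets, pairwise_similarity):
    """Average a pairwise similarity measure over all m(m-1)/2
    pairs of feature subsets, as in equation (1)."""
    m = len(subsets)
    if m < 2:
        raise ValueError("need at least two subsets")
    total = sum(pairwise_similarity(a, b)
                for a, b in combinations(subsets, 2))
    return total / (m * (m - 1) / 2)
```

Passing, for example, a Tanimoto-style similarity for `pairwise_similarity` yields the overall Tanimoto stability of a collection of selected subsets.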
2.2.1 Tanimoto distance
An adaptation of Tanimoto distance has been proposed by Kalousis as a possible measure for
set similarity (Kalousis, Prados, & Hilario, 2007). It relies on computing the number of elements
in the intersection of two sets s and s’ divided by the number of elements in the union and can
be expressed as follows:
S_t(s, s') = \frac{|s \cap s'|}{|s \cup s'|}    (2)
The values are in the range of [0, 1], where 0 characterises two completely different subsets, and
1, two subsets which are identical. This measure has advantages, such as computational
efficiency and simplicity of implementation.
However, upon further investigation one apparent disadvantage emerges, as discussed by
Kuncheva (Kuncheva, 2007) and Lustgarten (Lustgarten, Gopalakrishnan, & Visweswaran,
2009), which is the inability of this measure to handle ‘chance’. As the number of selected
features approaches the total number of features, the Tanimoto distance produces values close
to 1.
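A minimal Python sketch of the measure (for exposition only; the project’s implementation is in R) shows how little code it requires:

```python
def tanimoto(s1, s2):
    """Tanimoto similarity of two feature subsets: |s ∩ s'| / |s ∪ s'|."""
    s1, s2 = set(s1), set(s2)
    if not s1 and not s2:
        return 1.0  # convention: two empty subsets are identical
    return len(s1 & s2) / len(s1 | s2)
```

The chance problem is easy to see: with n = 10 features, any two subsets of size 9 must share at least 8 features, so the score is at least 0.8 regardless of how the subsets were chosen.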
2.2.2 Hamming distance
When evaluating the stability of different feature selection algorithms, Dunne proposed another
similarity measure based on the Hamming distance (Dunne, Cunningham, & Azuaje, 2002). It
employs the masks generated by the feature selection algorithm, 1 denoting a feature has been
selected and 0 that it has not. Later Kuncheva expressed the Hamming distance in terms of two
sets in the following manner (Kuncheva, 2007), where ‘\’ denotes the set difference operation
and 𝑛 is the total number of features in the dataset.
S_h(s, s') = 1 - \frac{|s \setminus s'| + |s' \setminus s|}{n}    (3)
Much like the Tanimoto distance, this similarity measure does not correct for chance.
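Expressed as a Python sketch (illustrative only, mirroring equation (3)), the measure needs the total number of features n in addition to the two subsets:

```python
def hamming_similarity(s1, s2, n):
    """Hamming-based similarity: 1 - (|s \\ s'| + |s' \\ s|) / n."""
    s1, s2 = set(s1), set(s2)
    return 1 - (len(s1 - s2) + len(s2 - s1)) / n
```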
2.2.3 Kuncheva’s consistency index
In order to resolve this problem Kuncheva introduced several desirable properties of a
similarity measure, which are monotonicity, limits and correction of chance (Kuncheva, 2007).
Her research empirically proved that the previously used measures do not meet these
requirements, thus emphasising the need for a new one. She introduced the ‘consistency index’
which can be formulated as follows.
S_c(s, s') = \frac{rn - k^2}{k(n - k)}    (4)

where:
r = |s \cap s'| – the number of features common to both subsets
k – number of selected features
n – total number of features
s, s' – subsets of features of the same size
The values range between [-1, 1], where 1 is obtained for two identical sets (r = k). The
minimal value is produced when k = n/2 and r = 0. As noted by Kuncheva, the value of the index
is not defined for k = 0 and k = n. Nevertheless, for completeness these cases can be implemented as
0. By taking into account the total number of features the consistency index satisfies all three
requirements introduced above.
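An illustrative Python sketch of equation (4), including the k = 0 and k = n convention described above (again, the project’s own code is in R):

```python
def consistency_index(s1, s2, n):
    """Kuncheva's consistency index; both subsets must have the same size k."""
    k = len(s1)
    if len(s2) != k:
        raise ValueError("subsets must have the same size k")
    if k == 0 or k == n:
        return 0.0  # undefined cases, implemented as 0 as suggested in the text
    r = len(set(s1) & set(s2))
    return (r * n - k * k) / (k * (n - k))
```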
2.2.4 Pearson’s correlation coefficient
Pearson’s correlation coefficient has been widely used in statistics and economics as a measure
for correlation between two variables. In the context of feature selection stability Kalousis
proposed its use for measuring similarity of two separate weighting-scoring outputs produced
by feature selection algorithms. It can be formulated as follows, where 𝜇𝑤 and 𝜇𝑤′ are the
sample means of 𝑤 and 𝑤′ respectively.
S_p(w, w') = \frac{\sum_i (w_i - \mu_w)(w'_i - \mu_{w'})}{\sqrt{\sum_i (w_i - \mu_w)^2 \sum_i (w'_i - \mu_{w'})^2}}    (5)
Its values are in the range of [-1, 1], where values less than 0 denote low similarity. This project
will reintroduce Pearson’s correlation coefficient and demonstrate its use in stability estimation.
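A direct transcription of equation (5) into Python (illustrative sketch; the project’s implementation is in R):

```python
import math

def pearson(w1, w2):
    """Pearson's correlation coefficient of two weight vectors, eq (5)."""
    mu1 = sum(w1) / len(w1)
    mu2 = sum(w2) / len(w2)
    num = sum((a - mu1) * (b - mu2) for a, b in zip(w1, w2))
    den = math.sqrt(sum((a - mu1) ** 2 for a in w1) *
                    sum((b - mu2) ** 2 for b in w2))
    return num / den
```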
2.3 FEATURE SELECTION CRITERIA
Feature selection has been quite a prominent research area in Machine Learning due to the
increasing number of datasets, characterised by high-dimensionality, as highlighted previously.
Such datasets consist of a high number of features and a low number of samples, which greatly
reduces classification accuracy as well as hinders computational performance. To mitigate these
problems feature selection techniques, aimed at reducing the number of dimensions, have been
widely employed. Huang provides a detailed overview of different feature selection methods
and their distinction into filter, wrapper and embedded methods (Huang, 2015), each of which
will be explained in the following paragraphs.
Wrapper methods evaluate the feature quality by exploring the feature subset space and its
effect on a particular learning model’s accuracy. Their main drawbacks are computational
inefficiency and often being specific to the learning algorithm used.
Embedded methods incorporate feature selection whilst constructing the learning model.
Examples include classification and regression trees. A key advantage of these types of methods
is that they are more computationally efficient in comparison to wrapper approaches.
Nevertheless, they have considerable disadvantages as well; Huang discusses their difficulty in
finding the globally optimal solution, due to the attempt to estimate the model parameters
whilst selecting features (Huang, 2015).
Filters, on the other hand, rely on statistical measures, independent from a classifier, to estimate
the relevance of a feature to the target concept. As a result, they produce generalised subsets,
which might not in all cases yield the highest classification accuracy. They are characterised by
computational efficiency and less overfitting2. A majority of them are myopic3 measures, such as
Mutual information and Gini index. Additionally, there are also non-myopic approaches such
as Relief, firstly introduced by Kira (Kira & Rendel, 1992). A comparison of the different filter
based feature selection criteria is the focus of this research.
2.3.1 Mutual information maximization
A number of widely used feature selection approaches rely on the notion of mutual
information, first defined in the field of Information Theory. To understand mutual
information, one should be familiar with the term entropy, which attempts to quantify the
amount of uncertainty present in a distribution of events of a random variable 𝑋. (Shannon,
1948). If one event is highly likely to occur then the entropy is low, since there is little
uncertainty regarding the outcome. If on the other hand, all events in the distribution are
2 In Machine Learning, overfitting relates to a learning model being too complicated for the underlying data, thus capturing noise. This leads to a model which is too specific to the data.
3 Myopic approaches consider each feature independently from the others and do not detect feature interactions.
equally probable, then the entropy is maximal, since no prevalence of a particular event is
present. Shannon defined entropy as:
H(X) = -\sum_{x \in X} p(x) \log p(x)    (6)
Furthermore, Shannon demonstrated the means of conditioning entropy based on other events.
Suppose we have another distribution of events, 𝑌 then the conditional entropy of 𝑋 given 𝑌 is
regarded as ‘the amount of uncertainty remaining in 𝑋 after we learn the outcome of 𝑌’ (Brown,
Pocock, Zhao, & Luján, 2012). Its mathematical formulation is defined in the following manner:
H(X|Y) = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y)    (7)
Hence the mutual information shared between 𝑋 and 𝑌 can naturally be described as a
difference of the two entropies, the amount of uncertainty present in 𝑋 before the outcome of
event 𝑌 is known and after. It should be noted that mutual information is symmetric, hence the
name.
I(X; Y) = H(X) - H(X|Y) = \sum_{x \in X} \sum_{y \in Y} p(xy) \log \frac{p(xy)}{p(x)p(y)}    (8)
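For discrete samples, quantities (6)–(8) can be estimated directly from frequency counts. The following is an illustrative Python sketch (not the project’s R implementation) using base-2 logarithms, so entropies are in bits:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """H(X), equation (6), estimated from sample frequencies."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X|Y), equation (7): expected entropy of X within each value of Y."""
    n = len(ys)
    total = 0.0
    for yv, cy in Counter(ys).items():
        total += (cy / n) * entropy([x for x, y in zip(xs, ys) if y == yv])
    return total

def mutual_information(xs, ys):
    """I(X;Y) = H(X) - H(X|Y), equation (8)."""
    return entropy(xs) - conditional_entropy(xs, ys)
```

Because mutual information is symmetric, swapping the two arguments leaves the estimate unchanged.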
To aid understanding a practical example is presented. Suppose that 𝑋 denotes the question
whether a person has heart failure or not. We can assume that without any additional
information about the individual our uncertainty regarding the outcome is high. However, by
examining his/her family history, our uncertainty lowers. If the person’s relatives suffer from a
similar heart condition, then it is more likely that he/she will experience heart failure as well.
Similarly, if there is no evidence that any of his/her family members have had this diagnosis,
that person might be less likely to experience heart problems. Therefore, the outcome of 𝑌,
denoting whether cardiovascular diseases are present in the family, lowers the uncertainty. This
leads to a big difference between 𝐻(𝑋) and 𝐻(𝑋|𝑌), or high mutual information.
In the context of Machine Learning, one can infer that the higher the mutual information
between a feature 𝑋 and a class 𝑌 is, the more useful that feature will be in classification.
Therefore, computing the mutual information between all features with respect to the class, and
selecting the top ones, with the highest values, will result in a good feature selection criterion.
Brown refers to this as Mutual information maximisation criterion, which uses mutual
information to define the feature score (Brown, Pocock, Zhao, & Luján, 2012).
J_{mim}(X_k) = I(X_k; Y)    (9)
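As an illustrative sketch of the criterion in (9) (Python for exposition; the project’s implementation is in R), each feature is scored by its mutual information with the class, estimated here with a simple plug-in estimator, and the top k are kept:

```python
from collections import Counter
from math import log2

def mi(xs, ys):
    """Plug-in estimate of I(X;Y) for two discrete sequences, as in (8)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / (px[x] * py[y] / n ** 2))
               for (x, y), c in pxy.items())

def mim_select(features, y, k):
    """Mutual information maximisation: rank features by I(X_f; Y), keep top k."""
    ranked = sorted(range(len(features)), key=lambda f: -mi(features[f], y))
    return ranked[:k]
```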
It is important to note that this criterion does not consider inter-feature relationships, the fact
that when grouped together certain features may be highly discriminatory even if individually
they are not, making it a myopic measure. Another disadvantage is its inability to detect feature
redundancy, which as discussed by Huang is the view that ‘two perfectly correlated features are
redundant to each other because adding one feature on top of the other will not provide
additional information; and hence, will not improve model accuracy’ (Huang, 2015). A possible
solution to this problem has been proposed in the form of the Mutual information feature
selection criterion (Battiti, 1994) which will be explored in detail in the following section.
2.3.2 Mutual information feature selection
Mutual information feature selection attempts to resolve the problem of redundancy, by
introducing a penalty for highly correlated features. It is based on the assumption that
redundancy and feature correlation are equivalent and do not increase the predictive accuracy.
Using this method the feature score is calculated using both the feature’s mutual information
and the sum of the mutual information between the feature in question and all others selected
previously. A user-configurable parameter, 𝛽 is used to determine the relative importance of the
penalty. A value of 1 for 𝛽 has been found to yield satisfactory results by Battiti (Battiti, 1994).
J_{mifs}(X_k) = I(X_k; Y) - \beta \sum_{X_j} I(X_k; X_j)    (10)
It should be noted that Guyon argued that high correlation does not always equate to redundancy
(Guyon & Elisseeff, 2003). He empirically demonstrated that two highly correlated features can
improve classification accuracy, and reached the conclusion that only perfectly correlated
features are redundant to each other. This implies that the criterion operates under the
assumption that features are class-conditionally independent.
Moreover Estévez argued that the redundancy term will increase in magnitude with respect to
the relevancy term as the cardinality of the subset of selected features grows (Estévez, 2009). He
inferred that this may result in the early selection of irrelevant features.
As it can be observed, this criterion is more computationally expensive than the previous one
due to its need to estimate the mutual information between the feature in question and all
previously selected features.
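The greedy selection loop behind equation (10) can be sketched as follows (illustrative Python, not the project’s R code; the plug-in MI estimator is the same simple one used for MIM above). At each step the feature maximising the penalised score is added to the selected set:

```python
from collections import Counter
from math import log2

def mi(xs, ys):
    """Plug-in mutual-information estimate for discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / (px[x] * py[y] / n ** 2))
               for (x, y), c in pxy.items())

def mifs_select(features, y, k, beta=1.0):
    """Greedy MIFS (eq 10): repeatedly pick the feature maximising
    I(X_f;Y) - beta * sum of I(X_f;X_j) over already-selected X_j."""
    remaining = list(range(len(features)))
    selected = []
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda f: mi(features[f], y)
                   - beta * sum(mi(features[f], features[j]) for j in selected))
        remaining.remove(best)
        selected.append(best)
    return selected
```

Note the quadratic cost: every candidate must be compared against every feature already selected, which is the computational overhead discussed above.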
2.3.3 Gini index
Another correlation-based criterion which attempts to estimate a feature’s ability to distinguish
between classes is Gini index. It was first introduced by (Breiman, Friedman, Olshen, & Stone,
1984) as a splitting rule in decision tree generation. This feature selection method examines the
decrease of impurity when using a chosen feature. In this context, impurity relates to the ability
of a feature to distinguish between the possible classes. In a similar fashion to Mutual
information, the Gini index uses the difference of impurities before the value of a feature is
known and afterwards. For a feature 𝑋 the Gini index is defined as:
Gini(X) = \sum_j P(X_j) \sum_c P(Y_c \mid X_j)^2 - \sum_c P(Y_c)^2,    (11)
where:
𝑃(𝑋𝑗) – prior probability of feature 𝑋 having value 𝑋𝑗
𝑃(𝑌𝑐|𝑋𝑗) - probability that a random sample from the dataset, whose feature 𝑋 has value 𝑋𝑗,
belongs to class 𝑌𝑐
𝑃(𝑌𝑐) – prior probability that a random sample belongs to class 𝑌𝑐
Due to its myopic nature, the Gini index has low computational requirements, thus making it
attractive in high-dimensional data analysis. Nevertheless, as noted by Čehovin, when used on
numerical data it results in overestimation of the features’ quality (Čehovin & Bosnić, 2010). An
improvement can be obtained by discretising the data prior to analysis, which is the approach
undertaken in this project. Similarly to Mutual information maximisation, the Gini index is
unable to detect redundant features, as well as inter-feature relationships.
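Equation (11) translates into a few lines for discrete (pre-discretised) data; the Python below is an illustrative sketch, with probabilities estimated from frequency counts:

```python
from collections import Counter

def gini_index(x, y):
    """Gini index of feature x with respect to class y, following eq (11)."""
    n = len(x)
    gain = 0.0
    for xv, cx in Counter(x).items():
        # P(X = xv) times the sum of squared class probabilities given X = xv
        y_given_x = Counter(yy for xx, yy in zip(x, y) if xx == xv)
        gain += (cx / n) * sum((cy / cx) ** 2 for cy in y_given_x.values())
    # subtract the squared prior class probabilities
    gain -= sum((cy / n) ** 2 for cy in Counter(y).values())
    return gain
```

A feature that perfectly separates two balanced classes scores 0.5; a feature independent of the class scores 0.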
2.3.4 ReliefF
The biggest disadvantage to the above described criteria is their inability to detect feature
interactions. In domains such as protein folding, where amino-acids’4 interactions determine the
folded structure of a protein and its functionality, discovering inter-feature relationships is
paramount.
To cater for this type of problem Kira proposed a different approach to feature selection called
Relief (Kira & Rendel, 1992). What Relief does differently is that it performs both a global and a
local search before updating the feature relevance score. It starts by selecting a random instance
and finding its nearest neighbour from the same class, called the nearest-hit, and the nearest
neighbour from the other class, nearest-miss. It then examines how well each feature
4 An amino-acid can be treated as a feature in this example.
distinguishes between the classes. Using this information, it updates the feature weight. The
details of the original algorithms can be found below:
2.3.4.1 Algorithm: Relief

Input: 𝑆 – dataset, described by n features
       𝑚 – the number of examples the algorithm should examine
Output: 𝑊 – a set containing the weight score for each feature

𝑊 = [0, …, 0]
for 𝑖 = 1 to 𝑚
    pick a random instance 𝐼𝑅
    find the nearest-hit 𝐻𝐼 and nearest-miss 𝑀𝐼 of 𝐼𝑅
    for 𝑋 = 1 to 𝑛
        𝑊[𝑋] = 𝑊[𝑋] - diff(𝐼𝑅[𝑋], 𝐻𝐼[𝑋])^2 / 𝑚 + diff(𝐼𝑅[𝑋], 𝑀𝐼[𝑋])^2 / 𝑚
    end for
end for
return 𝑊
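A compact Python sketch of the algorithm for numerical features and two classes follows (illustrative only; the project’s implementation is in R, and the squared Euclidean distance is assumed for the neighbour search, as in Kira’s original formulation):

```python
import random

def diff(a, b, lo, hi):
    """Normalised numerical diff from equation (12); constant features give 0."""
    return abs(a - b) / (hi - lo) if hi > lo else 0.0

def relief(data, labels, m, rng=random):
    """Relief: reward features that separate the nearest miss and
    penalise features that differ on the nearest hit."""
    n = len(data[0])
    lo = [min(row[f] for row in data) for f in range(n)]
    hi = [max(row[f] for row in data) for f in range(n)]

    def dist(a, b):  # squared Euclidean distance on normalised diffs
        return sum(diff(a[f], b[f], lo[f], hi[f]) ** 2 for f in range(n))

    w = [0.0] * n
    for _ in range(m):
        i = rng.randrange(len(data))
        inst, lab = data[i], labels[i]
        hit = min((j for j in range(len(data)) if j != i and labels[j] == lab),
                  key=lambda j: dist(inst, data[j]))
        miss = min((j for j in range(len(data)) if labels[j] != lab),
                   key=lambda j: dist(inst, data[j]))
        for f in range(n):
            w[f] += (-diff(inst[f], data[hit][f], lo[f], hi[f]) ** 2
                     + diff(inst[f], data[miss][f], lo[f], hi[f]) ** 2) / m
    return w
```

Features whose values cluster within a class but differ across classes accumulate positive weight; uninformative or constant features stay near zero.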
The function diff was originally defined as:
diff(I_1[X], I_2[X]) = \begin{cases} \dfrac{|I_1[X] - I_2[X]|}{\max(X) - \min(X)} & X \text{ is numerical} \\ 0 & I_1[X] = I_2[X] \wedge X \text{ is nominal} \\ 1 & I_1[X] \neq I_2[X] \wedge X \text{ is nominal} \end{cases}    (12)
As discussed by Liu this particular function tends to underestimate numerical features (Liu &
Motoda, 2007). The example he used to illustrate this phenomenon is:
Let X_i \in \{1, 2, 3, 4, 5, 6, 7, 8\}.
If X_i were nominal, the value obtained by the difference function would be diff(2, 5) = 1.
However, if X_i were numerical, then the value would be diff(2, 5) = \frac{|2 - 5|}{8 - 1} \approx 0.43. Since this is
used to update the feature weights, Relief underestimates numerical features. As a solution Liu
suggests the use of the contextual merit function proposed by Hong. It provides a generalisation
of the 𝑑𝑖𝑓𝑓function for numerical attributes (Hong, 1997).
diff(I_1[X], I_2[X]) = \begin{cases} 0 & |I_1[X] - I_2[X]| \le t_{eq} \\ 1 & |I_1[X] - I_2[X]| > t_{diff} \\ \dfrac{|I_1[X] - I_2[X]| - t_{eq}}{t_{diff} - t_{eq}} & t_{eq} < |I_1[X] - I_2[X]| \le t_{diff} \end{cases}    (13)
Here 𝑡𝑒𝑞 and 𝑡𝑑𝑖𝑓𝑓 are user-defined parameters, where 𝑡𝑒𝑞 is the threshold below which two
values are considered equal and 𝑡𝑑𝑖𝑓𝑓 is the threshold above which two values are regarded as
distinct. Hong mentions the following default values for these parameters:
t_{eq} = 0.05(\max(X) - \min(X))    (14)
t_{diff} = 0.10(\max(X) - \min(X))    (15)
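Equations (13)–(15) combine into a small function; the Python below is an illustrative sketch with Hong’s defaults applied when no thresholds are supplied:

```python
def hong_diff(a, b, x_min, x_max, t_eq=None, t_diff=None):
    """Contextual-merit diff (eq 13) with Hong's default thresholds
    t_eq = 0.05 * (max - min) and t_diff = 0.10 * (max - min)."""
    span = x_max - x_min
    t_eq = 0.05 * span if t_eq is None else t_eq
    t_diff = 0.10 * span if t_diff is None else t_diff
    d = abs(a - b)
    if d <= t_eq:
        return 0.0          # values considered equal
    if d > t_diff:
        return 1.0          # values considered fully distinct
    return (d - t_eq) / (t_diff - t_eq)  # linear ramp between thresholds
```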
Kira used the Euclidean distance for the nearest neighbour search and the squared difference, as
it can be seen above. Interestingly, Kononenko reported that in his experiments using only the
difference seems to yield the same results (Kononenko, Šimec, & Robnik-Šikonja, 1997). He also
used the Manhattan distance in the nearest neighbour search as opposed to the Euclidean one,
though an explicit reason has not been provided. The author of this report has also
experimented with both approaches and found little difference in the results they produce.
An interesting observation about the relationship between the update rule in Relief and the Gini
index has been noticed by Kononenko. He demonstrated that the update rule can be expressed
as an approximation of a difference in class probabilities (Kononenko, Šimec, & Robnik-Šikonja, 1997).
Another advantage of using R is the RStudio development environment, which provides access
to debugging tools, instant graph visualisation and a simple command line interface. It allows
good integration with external packages and facilitates both version control and code
organisation. Moreover, it has good support from the community and is freely available. All
these advantages have made it the preferred choice over the terminal-based environment.
3.3.3 Packages
One of the main initial concerns has been related to the need for effective means of visualising
the results. An objective of the project is to investigate and compare the stability of different
feature selection criteria. In order to achieve this, the results needed to be presented in a form
that is easily understandable and enhances the analysis process, as opposed to hindering it.
This led to research in data visualisation with R and then to the ‘plotly’7 package. Plotly is an
open-source interactive graphic library for multiple programming languages, including R. It
provides a fast way of generating histograms, charts and line plots. The package has a steep
learning curve, however, and a novice user must study the documentation and examples before
being able to utilise its capabilities.
It should be noted that the speed of the experimental work has been quite important in such a
short project. Hence an application has been created to cater for the author’s needs. However,
since the focus of this project is not on application development the need for a package
supporting quick and effortless web development in R has become apparent. After careful
investigation, the ‘shiny’8 and ‘shiny dashboard’9 packages have been found to facilitate the
creation of a dashboard-style web interface with ease. This provided a way for producing a
general-purpose application, where datasets could be manipulated easily, the stability of
different feature selection criteria analysed and the results consolidated. Nevertheless, these two
packages have their limitations. For example, customising colours or buttons is quite difficult if
not impossible in certain cases. For the purposes of this project, however, they have sufficed to
create a clean interface facilitating the user’s needs.
Another package of vital importance to this project has been 'RUnit'10, which supports the creation of test suites. Since most computer science projects require extensive testing, this proved to be one of the most useful libraries. With the help of the test suite produced, the author
could safely introduce modifications and improvements in the code. The topic of testing and
validation will be explored in detail in the next chapter.
7 Plotly is available from https://plot.ly/r/, along with complimentary tutorials and start-up guides
8 Shiny is available from http://shiny.rstudio.com/, along with complimentary tutorials and start-up guides
9 Shiny dashboard can be obtained from the following GitHub repository: https://rstudio.github.io/shinydashboard/
10 RUnit is available from the CRAN repository: https://cran.r-project.org/web/packages/RUnit/index.html
3.4 ARCHITECTURE
A principle employed when designing the structure of the project is 'Separation of concerns', which, as Dijkstra defined it, 'even if not perfectly possible, is yet the only available technique for effective ordering of one's thoughts' (Dijkstra, 1982). Therefore, as previously established, a
web-based tool, which aids the research process, has been created. It consists of three main
components, a thin-client, a server and a main computational core.
3.4.1 The user interface
Shneiderman suggests several guidelines which need to be employed when designing a user interface, amongst which is that one should strive for consistency (Shneiderman & Plaisant, 2010).
To ensure this throughout the application all graphical representations have been visualised
using line plots along with a consistent colour scheme.
Another key principle, which Shneiderman presented, is offering informative feedback
(Shneiderman & Plaisant, 2010). This improves the user experience by providing information
regarding the status of an event, or the result of an action. For example, in this project the user
has the ability to view the results of the data upload or the data discretisation in real time,
which allows him/her to understand the implications of these actions.
The dashboard-style of all pages has been chosen deliberately since it serves as a metaphor for
physical dashboards which people rely on in their every day lives. This technique has been
employed by many companies, such as Apple, as a tool which allows new users to grasps
concepts quickly (Apple, 2016).
3.4.2 The server
The server’s responsibilities include graph generation, retrieval of results and communication
with the main computational core. The decision to split the server from the core has been
influenced by the need for abstraction to facilitate the independent use of the stability measures
and feature selection criteria. The server simply utilises the necessary components from the
core, passes the relevant information to the client and is responsible for generating the graphs.
3.4.3 The computational core
The computational core is the most fundamental component of the project. It comprises the
different feature selection criteria, all the similarity measures discussed, probability estimation
functions which support the criteria, data pre-processing capabilities and file manipulation.
Consequently, the core has been divided into five main subcomponents, illustrated in Figure 2,
whose implementation will be explored in detail in the next chapter.
The separation of the subcomponents provides flexibility and extensibility. This implies that if a reimplementation of the probability estimation functions is required, it can be done without affecting other parts of the code. Due to the generalised approach to stability estimation, the addition of a new criterion is a trivial task, requiring minimal changes to the existing codebase.
Figure 2: Computational core structure
3.5 EXPERIMENTAL DESIGN
To achieve the aim of this project several experiments have been conducted. However, since their output is used to formulate hypotheses about the behaviour of the different feature selection criteria, a careful design process has been required. By constructing a design protocol
one ensures the repeatability of the work, along with verifying that the experiment will indeed
answer the question of interest. Therefore, the following protocol has been devised. A
bootstrapping technique, detailed in Appendix G, is used to introduce small variations in the
data, and the number of repetitions, which is thirty, has been chosen to provide a more
representative stability score for each feature selection criterion.
3.5.1.1 Experimental protocol
1. Generate a similar dataset to the original one, using bootstrapping
2. For subset size 𝑘 = 1 to 𝑛:
3.     Perform feature selection using the four different criteria
4.     Store the feature preferences produced
5. Repeat Steps 1-4 thirty times
6. Compute the stability scores for every subset size 𝑘, using the preferences from the 30 sets
7. Compute the confidence intervals
8. Plot the results
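The loop structure of this protocol can be sketched as follows. This is an illustrative Python sketch, not the project's R implementation; `run_protocol`, `stability` and the `criteria` mapping are hypothetical names, with each criterion assumed to be a per-feature scoring function.

```python
import numpy as np

def run_protocol(data, labels, criteria, n_repeats=30, seed=0):
    """Steps 1-5: bootstrap the data n_repeats times and, for each
    criterion, store the resulting feature preference orderings."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = data.shape
    rankings = {name: [] for name in criteria}
    for _ in range(n_repeats):
        # Step 1: sample n_samples items with replacement (bootstrapping)
        idx = rng.integers(0, n_samples, size=n_samples)
        bx, by = data[idx], labels[idx]
        # Steps 3-4: score every feature and record the preference order
        for name, score in criteria.items():
            scores = [score(bx[:, f], by) for f in range(n_features)]
            rankings[name].append(np.argsort(scores)[::-1])  # best first
    return rankings

def stability(rankings, similarity, k):
    """Step 6: average pairwise similarity of the top-k subsets."""
    subsets = [set(r[:k].tolist()) for r in rankings]
    pairs = [(a, b) for i, a in enumerate(subsets) for b in subsets[i + 1:]]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```

With a Tanimoto-style set similarity, identical top-𝑘 subsets across all thirty repetitions yield a stability of 1, and the score decreases as the selected subsets diverge.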
3.6 CHAPTER SUMMARY
This chapter presented the main technologies and software engineering practices utilised in the
development of this project. It described the design choices, provided an overview of the main
architecture and discussed the project planning process.
Chapter 4
4 IMPLEMENTATION
4.1 CHAPTER OVERVIEW
This chapter will provide a detailed review of the project implementation and the challenges
faced. Due to the main focus of this project being research and not web application
development this chapter will focus largely on the computational core but will feature details
regarding the user interface and the server as well. This chapter will also explore the different
testing techniques employed for validation.
4.2 THE USER INTERFACE
The user interface has been developed using the ‘shiny’ and ‘shiny dashboard’ packages, which
are web application frameworks. They provide basic building blocks such as menu bars,
sidebars, headers, plot outputs, table outputs and many more, which facilitate the creation of a
web application for data analysis in an efficient way. The figures below demonstrate the
interface’s capabilities.
As can be observed in Figure 3, the user has the ability to upload data and labels, discretise them if necessary and preview them in a table format. Such a representation proves to be quite consistent with tools such as Microsoft Excel, thus exploiting the user's pre-existing knowledge to create a better user experience.
Figure 4 demonstrates the feature selection stability evaluation page where the user can select
the number of times the algorithms should be run and which criteria should be compared. The
graphs are interactive and facilitate the data analysis process. Furthermore, if the user wishes
he/she can exclude certain criteria on the graph directly and compare only two at a time.
Figure 5 illustrates the result consolidation capabilities. The user has the ability to examine the characteristics of the datasets, such as the number of features, classes or examples, and to display graphical representations of previously run results. This page serves as a storage place for all experimental work and is accessible from multiple locations. This has proven to be valuable
during the experimental analysis stage of the project.
Figure 6 presents the example recreation screen, which contains research paper reproductions
illustrating the stability measures’ performance. Lastly, Figure 7 provides an overview of the
References page which caters for the central storage of background reading.
Figure 3: User interface: data upload and preprocessing
Figure 4: User interface: stability evaluation
Figure 5: User interface: result consolidation
Figure 6: User interface: example reproductions
Figure 7: User interface: references
4.3 THE SERVER
Section 3.4.2 highlighted the server's main responsibilities, whose implementation shall be
discussed in detail. For retrieval and storage of results the server utilizes the file manipulation
subcomponent of the computational core. Similarly, when data discretisation is required the
server utilizes the data pre-processing subcomponent.
Graph generation is required in a couple of instances and has been implemented using the 'plotly' package. Firstly, upon conducting experimental work the result is presented in the form of
a line graph. The server obtains the results from the computational core. Afterwards, they are
passed to the client for visualisation. Their update is achieved with the help of ‘reactive
expressions’11. Additionally, if the user wishes to preserve the results for future use, a special R
file is created containing the data structure of the output.
The saved experimental work could then be read from the server’s filesystem upon start-up.
Similar techniques have been employed for the example reproduction page, with all
experimental results being stored as files.
4.4 THE COMPUTATIONAL CORE
As outlined previously the computational core has been divided into five subcomponents,
ensuring modularity and abstraction.
4.4.1 File manipulation
The first subcomponent is responsible for file manipulation, namely opening and saving files.
Those capabilities are utilized when the user uploads the data and if he/she wishes to save the
results obtained from the analysis. It is also used in the example sections for reading previously
stored data analysis results and dataset information.
4.4.2 Data pre-processing
The data pre-processing subcomponent provides data discretisation capabilities, which have become necessary for improving the performance of several feature selection criteria, as mentioned in section 2.3.3.
Discretisation is the process of converting continuous attributes into discrete ones. As explained
by Clarke it usually involves the process of splitting the continuous space into 𝐾 partitions
corresponding to discrete values (Clarke & Barton, 2000). This process introduces what is
11 Reactive expressions control which parts of the application need to be updated and when, aiming to avoid unnecessary work.
known as discretisation error due to the inability to accurately represent all the states of a
variable.
The number of partitions 𝐾 must therefore be chosen carefully to minimise this error. In 1926 Sturges proposed an approach for estimating the number of partitions given a sample size 𝑛, which later became known as 'Sturges' rule' (Sturges, 1926).
𝐾 = 1 + 𝑙𝑜𝑔2(𝑛) (18)
The R package 'infotheo'12 has been used to perform the actual discretisation, with the number of partitions provided by the above formula.
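For illustration, Sturges' rule and a minimal equal-width discretisation might be sketched as follows in Python. The function names are hypothetical and the project itself delegates this step to 'infotheo'; rounding 𝑙𝑜𝑔2(𝑛) up to a whole number of partitions is an assumption of this sketch.

```python
import math
import numpy as np

def sturges_bins(n_samples):
    # K = 1 + log2(n), rounded up to a whole number of partitions
    return 1 + math.ceil(math.log2(n_samples))

def discretise_equal_width(column, n_bins):
    """Equal-width discretisation of a continuous attribute into
    n_bins integer-coded partitions (1..n_bins)."""
    lo, hi = float(np.min(column)), float(np.max(column))
    edges = np.linspace(lo, hi, n_bins + 1)
    # np.digitize maps each value to the partition it falls into;
    # only the interior edges are passed, so codes run from 1 to n_bins
    return np.digitize(column, edges[1:-1], right=True) + 1
```

For a sample of 100 examples, Sturges' rule gives 1 + ⌈𝑙𝑜𝑔2(100)⌉ = 8 partitions.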
4.4.3 Probability estimation
Since many of the feature selection criteria rely on probability estimations, such as prior and
posterior probabilities, a separate subcomponent has been developed to facilitate reusability
and modularity. All methods in it rely on the same input parameters, namely the dataset and the column number of the data labels, and return matrices as output. The output type was chosen after investigating R's performance with profiling tools, which demonstrated the superiority of matrices over other R data types such as 'data frames'13.
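As an illustrative sketch of this matrix-based interface, prior and conditional probability estimation for discrete data could look as follows in Python (the function names are hypothetical and do not mirror the project's R code):

```python
import numpy as np

def prior_probabilities(labels):
    """Prior P(Y = c) for each class, as a 1-D array ordered by class label."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def conditional_probabilities(feature, labels):
    """Matrix of posterior-style estimates P(X = x | Y = c):
    one row per attribute value, one column per class label."""
    values, classes = np.unique(feature), np.unique(labels)
    matrix = np.zeros((len(values), len(classes)))
    for j, c in enumerate(classes):
        in_class = feature[labels == c]          # examples of class c
        for i, v in enumerate(values):
            matrix[i, j] = np.mean(in_class == v)  # relative frequency
    return matrix
```

Returning plain arrays rather than richer structures mirrors the design choice described above: the criteria only ever index into these matrices, so the cheapest container suffices.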
An issue often encountered in probability estimation is the ‘zero conditional probability
problem’, which refers to a dataset not containing an example of class 𝑌𝑐 with value 𝑋𝑖 for an
attribute 𝑋. This is explored in detail in Appendix D along with a possible solution.
4.4.4 Stability estimation
As per the definition of stability, its calculation involves introducing small perturbations in the data and applying the feature selection algorithm a number of times. In the field of Machine
learning, a technique for achieving this is bootstrapping, detailed in Appendix G. After a
dataset is generated using this approach a feature selection method is used to produce the
feature preferences. A user-defined parameter controls the number of times the process is
performed and afterwards the preferences are passed to the stability estimator, along with the
type of similarity measure to be used. A detailed pseudocode of the stability function is
supplied in Appendix C.
12 The package 'infotheo' is available from https://cran.r-project.org/web/packages/infotheo/index.html
13 Detailed information about data frames is available at https://stat.ethz.ch/R-manual/R-
4.4.5 Feature selection
The feature selection subcomponent is responsible for invoking the correct criteria, performing the necessary computations for subset generation and returning a set of feature scores. The first
and last parts are straightforward and are implemented as a separate interface. Each of the
criteria is separated into an individual function and file, to ensure low coupling. The Mutual
information based approaches along with Gini index utilise the probability estimation
subcomponent, which provides the prior and posterior probabilities necessary.
The following parameters have been used in the implementations:
• Mutual information maximisation: a custom implementation has been provided with no special parameters; the data has been discretised prior to analysis; detailed pseudocode is available in Appendix C.
• Mutual information feature selection: 𝛽 = 1 (Battiti, 1994); the data has been discretised prior to analysis; detailed pseudocode is available in Appendix C.
• Gini index: a custom implementation has been provided with no special parameters; the data has been discretised prior to analysis; detailed pseudocode is located in Appendix C.
• ReliefF: the number of instances considered, 𝑚, is equal to the total number of examples (Kononenko, Šimec, & Robnik-Šikonja, 1997); the number of nearest neighbours 𝑘 has been given the value 10 (Kononenko I., 1994), or fewer if there are fewer than 10 instances of that class. The default values for the thresholds 𝑡𝑒𝑞 and 𝑡𝑑𝑖𝑓𝑓 introduced in section 2.3.4 have been used; the data has been discretised prior to analysis to facilitate comparison with other criteria; detailed pseudocode is available in section 2.3.4.
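To make the simplest of the four criteria concrete, the following Python sketch ranks discrete features by their mutual information with the labels, in the spirit of Mutual information maximisation. The names are hypothetical; the project's own pseudocode is in Appendix C.

```python
import numpy as np

def mutual_information(x, y):
    """I(X; Y) in bits, estimated from the empirical joint
    distribution of two discrete variables."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:  # 0 * log 0 is taken as 0
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def mim_ranking(data, labels):
    """Rank features by I(X_f; Y), most informative first."""
    scores = [mutual_information(data[:, f], labels)
              for f in range(data.shape[1])]
    return np.argsort(scores)[::-1]
```

A feature that perfectly predicts a balanced binary label scores 1 bit, while a feature independent of the label scores 0, so the ranking places the predictive feature first.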
4.5 TESTING
In order to verify the correctness of the implementation, along with the author’s understanding
of the subject matter, several testing techniques have been employed. Together they formed the
test suite of the project.
Embury provided several guidelines for ‘a good quality test suite’ (Embury, 2012). Firstly, test
cases need to be independent, in other words always produce the same result regardless of the
execution context. Secondly, they should be repeatable, meaning that the same output should be
produced every time the test is run (assuming no code modifications). Thirdly, there should be
no redundant test cases, i.e. a test case which verifies the same behaviour as another one. The
last guideline is important because running test suites is computationally expensive, and redundant test cases consume resources and introduce waste, since they must nevertheless be maintained. The author
has taken the above considerations into account when designing the test suite.
4.5.1 Unit tests
In software development, unit tests are developed in such a manner so that they can verify the
smallest testable piece of the software. They are generally fast and useful in the debugging
process since they can pinpoint a bug’s location very accurately. When designing test cases the
author has decided to employ black box testing techniques such as boundary value analysis14
and equivalence partitioning15. This ensured all necessary scenarios have been covered with an
optimal set of test cases. Overall there are eighteen separate functions all of which have unit
tests.
4.5.2 Integration tests
This type of test is used to verify the correctness of multiple components working together. For example, the feature selection component relies on the probability estimation one for obtaining the necessary prior and posterior probabilities. To test the components' interactions a total of six
such tests have been employed. The output of these along with the one from the unit tests can
be found in Appendix F.
4.5.3 External tools validation
The above-mentioned approaches test the correctness of the implementation only against the author's individual understanding of the problem. As a counter-measure, validation of the results against external tools and libraries has been performed.
To test Mutual information based criteria the FEAST16 toolbox for Matlab has been utilized. The
implementation of the ReliefF algorithm has been verified using the ‘dprep’17 package for R and
certain similarity measures like Pearson’s correlation coefficient have been validated using
online tools such as the ‘Correlation coefficient calculator’18.
4.6 CHAPTER SUMMARY
This chapter presented in detail the architectural components and their use. It explored various
testing techniques and their practical application in the context of this project.
14 As defined in the testing standards glossary, boundary value analysis is: 'A test case design technique for a component in which test cases are designed which include representatives of boundary values' (Standards, 2016).
15 As defined in the testing standards glossary, equivalence partitioning is: 'A test case design technique for a component in which test cases are designed to execute representatives from equivalence classes' (Standards, 2016).
16 FEAST has been created by G. Brown, A. Pocock, M.-J. Zhao and M. Lujan and is available at http://www.cs.man.ac.uk/~gbrown/software/
17 'dprep' is an R package available from the CRAN repository: https://cran.r-project.org/web/packages/dprep/index.html
18 'Correlation coefficient calculator' is a tool available at http://www.socscistatistics.com/tests/pearson/Default2.aspx
end for
𝑝𝑟𝑒𝑓𝑒𝑟𝑒𝑛𝑐𝑒𝑠 = 𝑜𝑟𝑑𝑒𝑟(𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒𝑠)
return 𝑝𝑟𝑒𝑓𝑒𝑟𝑒𝑛𝑐𝑒𝑠
APPENDIX D
This appendix explores the ‘zero conditional probability problem’, which is often encountered
in probability estimation and refers to a dataset not containing an example of class 𝑌𝑐 with
value 𝑋𝑖 for an attribute 𝑋. This leads to 𝑃(𝑋𝑖|𝑌𝑐) = 0 which yields incorrect results. This is
particularly common for datasets consisting of a small number of samples but characterised by a high number of classes and attribute values. To mitigate this problem, an approach discussed by
Chen has been implemented. It relies on virtual examples in the conditional probability
estimation (Chen, 2016).
𝑃(𝑋𝑖|𝑌𝑐) = (𝑛𝑐 + 𝑚𝑝) / (𝑛 + 𝑚) (19)
where:
𝑛𝑐 – the number of training examples for which 𝑋 = 𝑋𝑖 and 𝑌 = 𝑌𝑐
𝑚 – the number of virtual examples, usually 𝑚 ≥ 1
𝑝 – the prior estimate, usually 𝑝 = 1/𝑡, where 𝑡 is the number of possible values for 𝑋
𝑛 – the number of training examples for which 𝑌 = 𝑌𝑐
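Equation (19) translates directly into code; a minimal Python sketch using the parameter names defined above:

```python
def m_estimate(n_c, n, m, t):
    """m-estimate of P(X_i | Y_c): n_c matching examples out of n
    examples in class Y_c, m virtual examples, and a uniform prior
    p = 1/t over the t possible values of X (equation 19)."""
    p = 1.0 / t
    return (n_c + m * p) / (n + m)
```

Even when no matching example exists (𝑛𝑐 = 0), the estimate remains strictly positive, which is precisely how the virtual examples avoid the zero conditional probability problem.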
APPENDIX E
This appendix provides calculations on example one reproduced from Kuncheva’s paper.
Let 𝑠 and 𝑠′ be two hypothetical sequences of features produced by two runs of a feature
selection algorithm on a dataset with 𝑛 = 10 features.
𝑠 = {𝑓9, 𝑓7, 𝑓2, 𝑓1, 𝑓3, 𝑓10, 𝑓8, 𝑓4, 𝑓5, 𝑓6}
𝑠′ = {𝑓3, 𝑓7, 𝑓9, 𝑓10, 𝑓2, 𝑓4, 𝑓8, 𝑓6, 𝑓1, 𝑓5}
Let 𝑘 = 3 be the subset size to be considered.
𝑠(3) = {𝑓9, 𝑓7, 𝑓2}
𝑠′(3) = {𝑓3, 𝑓7, 𝑓9}
Using this information we can compute the similarities according to the different measures.
Tanimoto distance:
The cardinality of intersection of the two subsets is:
|𝑠(3) ∩ 𝑠′(3)| = |{𝑓7, 𝑓9}| = 2
The cardinality of the union is:
|𝑠(3) ∪ 𝑠′(3)| = |{𝑓2, 𝑓3 , 𝑓7, 𝑓9}| = 4
∴ 𝑆𝑡(𝑠, 𝑠′) = |𝑠 ∩ 𝑠′| / |𝑠 ∪ 𝑠′| = 2/4 = 0.5
Hamming distance:
The cardinality of the difference between 𝑠 and 𝑠′ is:
|𝑠(3) − 𝑠′(3)| = |{𝑓2}| = 1
The cardinality of the difference between 𝑠′ and 𝑠 is:
|𝑠′(3) − 𝑠(3)| = |{𝑓3}| = 1
∴ 𝑆ℎ(𝑠, 𝑠′) = 1 − (|𝑠\𝑠′| + |𝑠′\𝑠|) / 𝑛 = 1 − (1 + 1)/10 = 0.8
Consistency index:
The cardinality of intersection, 𝑟 of those two subsets is:
𝑟 = |𝑠(3) ∩ 𝑠′(3)| = |{𝑓7, 𝑓9}| = 2
𝑆𝑐(𝑠, 𝑠′) = (𝑟𝑛 − 𝑘²) / (𝑘(𝑛 − 𝑘)) = (2 ∗ 10 − 3²) / (3(10 − 3)) = 11/21 ≈ 0.5238
Pearson’s correlation coefficient:
Considering this example, by assigning weight 1 to all features in the selected subset and 0 to all
features which have not been chosen, we obtain the following weights for subsets 𝑠 and 𝑠′:
𝑤 = {0,1,0,0,0,0,1,0,1,0}
𝑤′ = {0,0,1,0,0,0,1,0,1,0}
Consequently, the mean of 𝑤 and 𝑤′ can be computed.
𝜇𝑤 = 𝜇𝑤′ = 3/10
𝑆𝑝(𝑤, 𝑤′) = ∑𝑖(𝑤𝑖 − 𝜇𝑤)(𝑤𝑖′ − 𝜇𝑤′) / √(∑𝑖(𝑤𝑖 − 𝜇𝑤)² ∑𝑖(𝑤𝑖′ − 𝜇𝑤′)²) ≈ 0.5238
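The four values derived above can be verified with a short Python script that recomputes each measure directly from the subsets and 0/1 weight vectors of the example:

```python
import math

s_top = {"f9", "f7", "f2"}   # s(3)
s2_top = {"f3", "f7", "f9"}  # s'(3)
n, k = 10, 3

# Tanimoto: |intersection| / |union|
tanimoto = len(s_top & s2_top) / len(s_top | s2_top)          # 0.5

# Hamming-based similarity: 1 - (|s \ s'| + |s' \ s|) / n
hamming = 1 - (len(s_top - s2_top) + len(s2_top - s_top)) / n  # 0.8

# Consistency index: (r*n - k^2) / (k * (n - k))
r = len(s_top & s2_top)
consistency = (r * n - k ** 2) / (k * (n - k))                 # 11/21

# Pearson's correlation of the 0/1 weight vectors w and w'
features = [f"f{i}" for i in range(1, 11)]
w = [1 if f in s_top else 0 for f in features]
w2 = [1 if f in s2_top else 0 for f in features]
mu, mu2 = sum(w) / n, sum(w2) / n
num = sum((a - mu) * (b - mu2) for a, b in zip(w, w2))
den = math.sqrt(sum((a - mu) ** 2 for a in w)
                * sum((b - mu2) ** 2 for b in w2))
pearson = num / den                                            # 11/21
```

Note that the consistency index and Pearson's coefficient both come out as exactly 11/21 ≈ 0.5238 on this example, matching the calculations above.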
APPENDIX F
The following appendix contains the output of the test suite for this project.
RUNIT TEST PROTOCOL -- Sun May 01 14:19:12 2016
***********************************************
Number of test functions: 24
Number of errors: 0
Number of failures: 0

1 Test Suite:
final-year-project - 24 test functions, 0 errors, 0 failures

Details
***************************
Test Suite: final-year-project
Test function regexp: ^test.+
Test file regexp: ^\d+\.R
Involved directory: tests
---------------------------
Test file: tests/1.R
test.calculatePriorProbability: (2 checks) ... OK (0.03 seconds)
---------------------------
Test file: tests/10.R
test.estimateHammingDistance: (7 checks) ... OK (0 seconds)
---------------------------
Test file: tests/11.R
test.estimateNumBinSturgesRule: (5 checks) ... OK (0 seconds)
---------------------------
Test file: tests/12.R
test.estimateNumBinScottsRule: (6 checks) ... OK (0 seconds)
---------------------------
Test file: tests/13.R
test.calculateGiniIndexFeature: (4 checks) ... OK (0.01 seconds)
---------------------------
Test file: tests/14.R
test.differenceFuntion: (3 checks) ... OK (0 seconds)
---------------------------
Test file: tests/15.R
test.calculateMinMaxDiffValues: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/16.R
test.mutualInformationFeatureSelection: (4 checks) ... OK (0.02 seconds)
---------------------------
Test file: tests/17.R
test.mutualInformationMaximisation: (4 checks) ... OK (0.01 seconds)
---------------------------
Test file: tests/18.R
test.calculateMutualInformationSingleFeature: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/19.R
test.mutualInfoCriterionBetweenSelectedFeatures: (8 checks) ... OK (0 seconds)
---------------------------
Test file: tests/2.R
test.calculateNumFeatureValues: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/20.R
test.calculateGiniIndexSingleFeature: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/21.R
test.featureSelectionBasedOnGiniIndex: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/22.R
test.calculatePriorProbabilityFeatures: (16 checks) ... OK (0 seconds)
---------------------------
Test file: tests/23.R
test.calculateCondProbBetweenVariables: (30 checks) ... OK (0 seconds)
---------------------------
Test file: tests/24.R
test.findNearestImproved: (8 checks) ... OK (0.01 seconds)
---------------------------
Test file: tests/3.R
test.calculateCondProbWithM: (20 checks) ... OK (0 seconds)
---------------------------
Test file: tests/4.R
test.calculateMutualInformation: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/5.R
test.selectNumberOfFeatures: (3 checks) ... OK (0 seconds)
---------------------------
Test file: tests/6.R
test.estimateTanimotoDistance: (7 checks) ... OK (0 seconds)
---------------------------
Test file: tests/7.R
test.estimateConsistencyIndex: (10 checks) ... OK (0.03 seconds)
---------------------------
Test file: tests/8.R
test.estimatePearsonsCorrelationCoefficient: (8 checks) ... OK (0 seconds)
---------------------------
Test file: tests/9.R
test.weightsFromSet: (22 checks) ... OK (0 seconds)
APPENDIX G
Suppose there is a dataset containing 𝑁 items. ‘Bootstrapping’ performs sampling of 𝑁 items
with replacement from the original dataset. This method can be used to generate a new dataset
with similar characteristics to the original one. Therefore, it can be successfully used in evaluating the stability of different feature selection criteria.
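In code, bootstrapping reduces to a single sampling loop; a minimal Python sketch (the function name is hypothetical):

```python
import random

def bootstrap_sample(dataset, seed=None):
    """Sample N items with replacement from a dataset of N items,
    producing a new dataset with similar characteristics."""
    rng = random.Random(seed)
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]
```

Because sampling is with replacement, roughly 63.2% (1 - 1/e) of the original items appear in a given bootstrap sample on average, with some items duplicated and others omitted.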
Figure 23 illustrates this approach. As can be observed, the first bootstrap set contains samples
4 and 7 more than once, whilst samples 5 and 8 are missing. The second one contains a