A Review of Feature Selection and Feature Extraction ......“best” features are selected, feature elimination progresses graduallyandincludescross-validationsteps[26,44–46].A

Review ArticleA Review of Feature Selection and Feature Extraction MethodsApplied on Microarray Data

Zena M Hira and Duncan F Gillies

Department of Computing Imperial College London London SW7 2AZ UK

Correspondence should be addressed to Zena M Hira zenahiragmailcom

Received 25 March 2015 Accepted 18 May 2015

Academic Editor Huixiao Hong

Copyright copy 2015 Z M Hira and D F Gillies This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

We summarise various ways of performing dimensionality reduction on high-dimensional microarray data Many different featureselection and feature extraction methods exist and they are being widely used All these methods aim to remove redundant andirrelevant features so that classification of new instances will be more accurate A popular source of data is microarrays a biologicalplatform for gathering gene expressions Analysing microarrays can be difficult due to the size of the data they provide In additionthe complicated relations among the different genes make analysis more difficult and removing excess features can improve thequality of the results We present some of the most popular methods for selecting significant features and provide a comparisonbetween themTheir advantages and disadvantages are outlined in order to provide a clearer idea of when to use each one of themfor saving computational time and resources

1 Introduction

In machine learning as the dimensionality of the data risesthe amount of data required to provide a reliable analysisgrows exponentially Bellman referred to this phenomenonas the ldquocurse of dimensionalityrdquo when considering problemsin dynamic optimisation [1] A popular approach to thisproblem of high-dimensional datasets is to search for aprojection of the data onto a smaller number of variables (orfeatures) which preserves the information as much as pos-sible Microarray data is typical of this type of small sampleproblem Each data point (sample) can have up to 450000variables (gene probes) and processing a large number of datapoints involves high computational cost [2] When thedimensionality of a dataset grows significantly there is anincreasing difficulty in proving the result statistically signif-icant due to the sparsity of the meaningful data in the datasetin question Large datasets with the so-called ldquolarge 119901 small119899rdquo problem (where 119901 is the number of features and 119899 isthe number of samples) tend to be prone to overfitting Anoverfittedmodel canmistake small fluctuations for importantvariance in the data which can lead to classification errorsThis difficulty can also increase due to noisy features Noise in

a dataset is defined as ldquothe error in the variance of a measuredvariablerdquo which can result from errors in measurements ornatural variation [3] Machine learning algorithms tend tobe affected by noisy data Noise should be reduced as muchas possible in order to avoid unnecessary complexity in theinferred models and improve the efficiency of the algorithm[4] Common noise can be divided into two types [5]

(1) Attribute noise(2) Class noise

Attribute noise is caused by errors in the attribute values(wrongly measured variables missing values) while classnoise is caused by samples that are labelled to belong in morethan one class andor misclassifications

As the dimensionality increases the computational costalso increases usually exponentially To overcome this prob-lem it is necessary to find a way to reduce the number offeatures in consideration Two techniques are often used

(1) Feature subset selection(2) Feature extractionCancer is among the leading causes of death worldwide

accounting for more than 8 million deaths according to

Hindawi Publishing CorporationAdvances in BioinformaticsVolume 2015 Article ID 198363 13 pageshttpdxdoiorg1011552015198363

2 Advances in Bioinformatics

theWorld Health Organization It is expected that the deathsfrom cancer will rise to 14 million in the next two decadesCancer is not a single diseaseThere aremore than 100 knowndifferent types of cancer and probably many more The termcancer is used to describe the abnormal growth of cells thatcan for example form extra tissue called mass and thenattack other organs [6]

Microarray databases are a large source of genetic datawhich upon proper analysis could enhance our understand-ing of biology and medicine Many microarray experimentshave been designed to investigate the genetic mechanismsof cancer and analytical approaches have been applied inorder to classify different types of cancer or distinguishbetween cancerous and noncancerous tissue In the last tenyears machine learning techniques have been investigated inmicroarray data analysis Several approaches have been triedin order to (i) distinguish between cancerous and noncancer-ous samples (ii) classify different types of cancer and (iii)identify subtypes of cancer that may progress aggressivelyAll these investigations are seeking to generate biologicallymeaningful interpretations of complex datasets that aresufficiently interesting to drive follow-up experimentation

This review paper is structured as follows The next sec-tion is about feature selection methods (filters wrappers andembedded techniques) applied on microarray cancer dataThen we discuss feature extraction methods (linear and non-linear) inmicroarray cancer data and the final section is aboutusing prior knowledge in combination with a feature extrac-tion or feature selection method to improve classificationaccuracy and algorithmic complexity

2 Feature Subset Selection inMicroarray Cancer Data

Feature subset selection works by removing features that arenot relevant or are redundant The subset of features selectedshould follow the Occamrsquos Razor principle and also give thebest performance according to some objective function Inmany cases this is anNP-hard (nondeterministic polynomial-time hard) problem [7 8]The size of the data to be processedhas increased the past 5 years and therefore feature selectionhas become a requirement before any kind of classificationtakes place Unlike feature extraction methods feature selec-tion techniques do not alter the original representation of thedata [9] One objective for both feature subset selection andfeature extraction methods is to avoid overfitting the datain order to make further analysis possible The simplest isfeature selection in which the number of gene probes in anexperiment is reduced by selecting only the most significantaccording to some criterion such as high levels of activityFeature selection algorithms are separated into three cate-gories [10 11]

(i) The filters which extract features from the data with-out any learning involved

(ii) The wrappers that use learning techniques to evaluatewhich features are useful

(iii) The embedded techniques which combine the featureselection step and the classifier construction

21 Filters Filters workwithout taking the classifier into con-sideration This makes them very computationally efficientThey are divided into multivariate and univariate methodsMultivariate methods are able to find relationships amongthe features while univariate methods consider each featureseparately Gene ranking is a popular statistical method Thefollowing methods were proposed in order to rank the genesin a dataset based on their significance [12]

(i) (Univariate) Unconditional Mixture Modelling as-sumes two different states of the gene on and off andchecks whether the underlying binary state of thegene affects the classification using mixture overlapprobability

(ii) (Univariate) Information Gain Ranking approximatesthe conditional distribution 119875(119862 | 119865) where 119862 is theclass label and119865 is the feature vector Information gainis used as a surrogate for the conditional distribution

(iii) (Multivariate)Markov Blanket Filtering finds featuresthat are independent of the class label so that remov-ing them will not affect the accuracy

In multivariate methods pair 119905-scores are used for evaluatinggene pairs depending on how well they can separate twoclasses in an attempt to identify genes that work together toprovide a better classification [13] Their results for the genepair rankings were found to be ldquoat least as interesting as thesingle genes found by an independent evaluationrdquo

Methods based on correlation have also been suggested

(i) (Multivariate) Error-Weighted Uncorrelated ShrunkenCentroid (EWUSC) this method is based on theuncorrelated shrunken centroid (USC) and shrunkencentroid (SC) The shrunken centroid is found bydividing the average gene expression for each genein each class by the standard deviation for that genein the same class This way higher weight is given togenes whose expression is the same among differentsamples in the same class New samples are assignedto the label with the nearest average pattern (usingsquared distance) The uncorrelated shrunken cen-troid approach removes redundant features by findinggenes that are highly correlated in the set of genesalready found by SC The EWUSC uses both of thesesteps and in addition adds error-weights (based onwithin-class variability) so that noisy genes will bedowngraded and redundant genes are removed [14]A comparison is shown in Figure 1 where the threedifferent methods are tested on a relatively small(25000 genes and 78 samples) breast cancer datasetThe algorithms perform well when the number ofrelevant genes is less than 1000

(ii) (Multivariate)Minimum Redundancy Maximum Rel-evance (mRMR) mRMR is a method that maximisesthe relevancy of genes with the class label while itminimises the redundancy in each class To do so ituses several statistical measures Mutual Information(MI)measures the information a randomvariable cangive about another in particular the gene activity and

Advances in Bioinformatics 3

the class label The method can be applied to bothcategorical and continuous variables For categorical(discrete) variables MI is used to find genes that arenot redundant (minimise redundancy) 119882 and aremaximally relevant119881with a target label [15] as shownin (1) and (2) respectively

119882 =1|119878|

2 sum119894119895isin119878

119868 (119894 119895) (1)

119881 =1|119878|sum

119894isin119878

119868 (ℎ 119894) (2)

where 119868 is the MI 119894 and 119895 are genes |119878| is the numberof features in 119878 and ℎ is a class labelFor continuous variables the 119865-statistic (ANOVA testor regression analysis to check whether the means oftwo populations are significantly different) is used tofind the maximum relevance between a gene and aclass label and then the correlation of the gene pair inthat class is measured tominimise redundancy [15] asshown in (3) and (4) respectively

119881 =1|119878|sum

119894isin119878

119865 (119894 ℎ) (3)

119882 =1|119878|

2 sum119894119895isin119878

1003816100381610038161003816119888 (119894 119895)1003816100381610038161003816 (4)

where 119865 is the 119865-statistic 119894 and 119895 are genes ℎ is aclass label |119878| is the number of features in 119878 and 119888 isthe correlation mRMR can be used in combinationwith entropy Normalised mutual information is usedto measure the relevance and redundancy of clustersof genes Then the most relevant genes are com-bined and LOOCV (leave-one-out cross-validation)is performed to find the accuracy [16] For contin-uous variables linear relationships are used insteadof mutual information MRMR methods give lowererror accuracies for both categorical and discrete data

(iii) (Multivariate) Correlation-based feature selection(CFS) as stated by Hall [17] follows the principal thatldquoa good feature subset is one that contains featureshighly correlated with the class yet uncorrelated witheach otherrdquo CFS evaluates a subset by considering thepredictive ability of each one of its features individ-ually and also their degree of redundancy (or correla-tion)The difference between CFS and other methodsis that it provides a ldquoheuristic meritrdquo for a featuresubset instead of each feature independently [18]Thismeans that given a function (heuristic) the algorithmcan decide on its next moves by selecting the optionthat maximises the output of this function Heuristicfunctions can also be designed to minimise the costto the goal

ReliefF [19] is also widely used with cancer microarraydata It is a multivariate method that chooses the features

90

85

80

75

70

65

60

Pred

ictio

n ac

cura

cy (

)

101

102

103

104

Total number of genes (log scale)

EWUSC (1205880= 07)

USC (1205880= 06)

SC

Test data

Figure 1 Comparison between EWUSC USC and SC on breastcancer data [14]

that are the most distinguishable among the different classesIt repeatedly draws an instance (sample) and based on itsneighbours it gives most weight to the features that help dis-criminate it from the neighbours of a different class [20 21] Amethod using independent logistic regression with two stepswas also proposed [22] The first step is a univariate methodin which the genes are ranked according to their Pearsoncorrelation coefficients The top genes are considered in thesecond phase which is stepwise variable selection This is aconditionally univariate method based on the inclusion (orexclusion) of a single gene at a time conditioned on thevariables already included

A comparison of ReliefF Information Gain InformationGain Ratio and 119883

2 is shown in Figure 2 The methodsperform similarly across the number of genes selected Infor-mation Gain Ratio is defined as the information gain over theintrinsic information It performs normalisation to the infor-mation gain using split value information The Pearson 1198832

test evaluates the possibility of a value appearing by chanceStatistical methods often assume a Gaussian distribution

on the dataThe central limit theoremcan guarantee that largedatasets are always normally distributed Even though allthese methods can be highly accurate in classifying informa-tion there is no biological significance proven with the genesthat are identified by them None of the above methods haveindicatedwhether the results are actually biologically relevantor not In addition filter methods are generally faster thanwrappers but do not take into account the classifier which canbe a disadvantage Ignoring the specific heuristics and biasesof the classifier might lower the classification accuracy

22 Wrappers Wrappers tend to perform better in selectingfeatures since they take the model hypothesis into accountby training and testing in the feature space This leads to thebig disadvantage of wrappers the computational inefficiencywhich is more apparent as the feature space grows Unlikefilters they can detect feature dependencies Wrappers are


ALLAML using SVM MLL using SVM

10 20 30 40 50 60 70 80 90 100 110 120 130 140 15080

82

84

86

88

90

92

94

96

98

100

Number of genes

Accu

racy

ReliefFInformation gain

Gain ratioX2-statistic

10 20 30 40 50 60 70 80 90 100 110 120 130 140 15080

82

84

86

88

90

92

94

96

98

100

Number of genes

Accu

racy



Figure 2 Comparison between ReliefF Information Gain Information Gain Ratio and 1198832 test on ALL and MLL Leukaemia datasets [21]

Table 1 Deterministic versus randomised wrappers

Deterministic RandomisedSmall overfitting risk High overfitting riskProne to local optima Less prone to local optimaClassifier dependent Classifier dependentmdash Computationally intensiveComparison between deterministic and randomised wrappers

separated in 2 categories Randomised and Deterministic Acomparison is shown in Table 1

221 Deterministic Wrappers A number of deterministicinvestigations have been used to examine breast cancer suchas a combination of awrapper and sequential forward selection(SFS) SFS is a deterministic feature selection method thatworks by using hill-climbing search to add all possible single-attribute expansions to the current subset and evaluate themIt starts from an empty subset of genes and sequentiallyselects genes one at a time until no further improvement isachieved in the evaluation function The feature that leads tothe best score is added permanently [23] For classificationsupport vector machines (SVMs) 119896-nearest neighbours andprobabilistic neural networks were used in an attempt toclassify between cancerous and noncancerous breast tumours[24] Very accurate results were achieved using SVMs Threemethods based on SVMs are very widely used in microarraycancer datasets

(1) Gradient-based-leave-one-out gene selection (GLGS)[25ndash28] was originally introduced for selectingparameters for the SVMs It starts by applying PCAon the dataset A vectorwith scaling factors of the newlow-dimensional space is calculated and optimisedusing a gradient-based algorithmThe pseudo scaling

factors of the original genes are calculated Genes aresequentially selected based on a correlation factor

(2) Leave-one-out calculation sequential forward selection(LOOCSFS) is a very widely used feature selectionmethod for cancer data based on sequential forwardselection (SFS) It adds features in an initially emptyset and calculates the leave-one-out cross-validationerror [29] It is an almost unbiased estimator ofthe generalisation error using SVMs and C BoundC Bound is the decision boundary and it is usedas a supplementary criterion in the case where dif-ferent features in the subset have the same leave-one-out cross-validation error (LOOCVE) [26 3031] SFS can also add constraints [32] on the sizeof the subset to be selected It can be used incombination with a recursive support vector machine(R-SVM) algorithm that selects important genesor biomarkers [33] The contribution factor basedon minimal error of the support vector machineof each gene is calculated and ranked The topranked genes are chosen for the subset LOOCSFSis expected to be an accurate estimator of the gener-alization error while GLGS scales very well with high-dimensional datasets The number of the genes in thefeature subset for both LOOCSFS andGLGS has to begiven in advance which can be a disadvantage sincethe most important genes are not known in advanceGLGS is said to perform better than LOOCSFS

222 RandomisedWrappers Most randomisedwrappers usegenetic algorithms (GA) (Algorithm 1) and simulated anneal-ing (Algorithm 2) Best Incremental Ranked Subset (BIRS)[35] is an algorithm that scores genes based on their valueand class label and then uses incremental ranked usefulness


Encode DatasetRandomly Initialise PopulationDetermine Fitness Of Population Based On A Predefined Fitness Functionwhile Stop Condition Not Reach (Best individual Is Good Enough) doCreate Offspring by Crossover OR MutationCalculate Fitness

end while

Algorithm 1 Genetic algorithm

Initialise State s = S(0)Initialise Energy e = E(S(0))Set time to zero k = 0while k lt kmax And e lt emax doTemperature = temperature(kkmax)NewState = neighbour(119904)NewEnergy = E(NewState)if P(e NewEnergy Temperature) gt random() thens = NewStatee = NewEnergy

end ifif NewEnergy lt EnergyBest thenBestState = NewStateEnergyBest = NewEnergy

end ifk = k + 1

end while

Algorithm 2 Simulated annealing algorithm

(based on the Markov blanket) to identify redundant genesLinear discriminant analysis was used in combination withgenetic algorithms Subsets of genes are used as chromosomesand the best 10 of each generation is merged with theprevious ones Part of the chromosome is the discriminantcoefficientwhich indicates the importance of a gene for a classlabel [36] Genetic Algorithm-Support Vector Machine (GA-SVM) [37] creates a population of chromosomes as binarystrings that represent the subset of features that are evaluatedusing SVMs Simulated annealing works by assuming thatsome parts of the current solution belong to a better oneand therefore proceeds to explore the neighbours seekingfor solutions that minimise the objective function and there-fore avoid global optima Hybrid methods with simulatedannealing and genetic algorithms have also been used [38] Agenetic algorithm is run as a first step before the simulatedannealing in order to get the fittest individuals as inputs tothe simulated annealing algorithm Each solution is evaluatedusing Fuzzy 119862-Means (a clustering algorithm that usescoefficients to describe how relevant a feature is to a cluster[39 40])Theproblemwith genetic algorithms is that the timecomplexity becomes119874(119899 log(119899)+119899119898119901119892) where 119899 is the num-ber of samples 119898 is the dimension of the data sets 119901 repre-sents the population size and 119892 is the number of generationsIn order for the algorithm to be effective the number of

generations and the population size must be quite large Inaddition like all wrappers randomised algorithms take upmore CPU time and more memory to run

23 Embedded Techniques Embedded techniques tend to dobetter computationally thanwrappers but theymake classifierdependent selections that might not work with any otherclassifierThat is because the optimal set of genes is built whenthe classifier is constructed and the selection is affected bythe hypotheses the classifier makes A well-known embeddedtechnique is random forests A random forest is a collectionof classifiers New random forests are created iteratively bydiscarding a small fraction of genes that have the lowestimportance [41] The forest with the smallest amount of fea-tures and the lowest error is selected to be the feature subsetA method called block diagonal linear discriminant analysis(BDLDA) [42] assumes that only a small number of genes areassociated with a disease and therefore only a small numberare needed in order for the classification to be accurate Tolimit the number of features it imposes a block diagonalstructure on the covariance matrix In addition SVMs canbe used for both feature selection and classification Featuresthat do not contribute to classification are eliminated in eachround until no further improvement in the classification canbe achieved [43] Support vector machines-recursive featureelimination (SVM-RFE) starts with all the features and gradu-ally excludes the ones that do not identify separating samplesin different classes A feature is considered useful based onits weight resulting from training SVMs with the current setof features In order to increase the likelihood that only theldquobestrdquo features are selected feature elimination progressesgradually and includes cross-validation steps [26 44ndash46] Amajor advantage of SVM-RFE is that it can select high-qualityfeature subsets for a particular classifier It is however com-putationally expensive since it goes through all features oneby one and it does not take into account any correlation thefeatures might have [30] SVM-RFE was compared againsttwo wrappers leave-one-out calculation sequential forwardselection and gradient-based-leave-one-out All three ofthese methods have similar computational times whenrun against a Hepatocellular Carcinoma dataset (7129 genesand 60 samples) GLGS outperforms the others withLOOCSFS and SVM-RFE having similar performance errors[27]

The most commonly used methods on microarray dataanalysis are shown in Table 2


Table 2 Feature selection methods applied on microarray data

Method Type Supervised Linear Description

119905-test feature selection [49] Filter mdash Yes It finds features with a maximal difference of mean value betweengroups and a minimal variability within each group

Correlation-based featureselection (CFS) [50] Filter mdash Yes It finds features that are highly correlated with the class but are

uncorrelated with each other

Bayesian networks [51 52] Filter Yes No They determine the causal relationships among features and removethe ones that do not have any causal relationship with the class

Information gain (IG) [53] Filter No Yes It measures how common a feature is in a class compared to all otherclasses

Genetic algorithms (GA)[33 54] Wrapper Yes No They find the smaller set of features for which the optimization

criterion (classification accuracy) does not deteriorate

Sequential search [55] Wrapper mdash mdashHeuristic base search algorithm that finds the features with thehighest criterion value (classification accuracy) by adding one newfeature to the set every time

SVMmethod of recursivefeature elimination (RFE)[30]

Embedded Yes Yes It constructs the SVM classifier and eliminates the features based ontheir ldquoweightrdquo when constructing the classifier

Random forests [41 56] Embedded Yes YesThey create a number of decision trees using different samples of theoriginal data and use different averaging algorithms to improveaccuracy

Least absolute shrinkageand selection operator(LASSO) [57]

Embedded Yes Yes It constructs a linear model that sets many of the feature coefficientsto zero and uses the nonzero ones as the selected features

Different feature selection methods and their characteristics

Figure 3 Linear versus nonlinear classification problems

3 Feature Extraction inMicroarray Cancer Data

Early methods of machine learning applied to microarraydata included simple clustering methods [47] A widely usedmethod was hierarchical clustering Due to the flexibility ofthe clustering methods they became very popular among thebiologists As the technology advanced however the size ofthe data increased and a simple application of hierarchicalclustering became too inefficient The time complexity ofhierarchical clustering is119874(log(1198992)) where 119899 is the number offeatures Biclustering followedhierarchical clustering as away

of simultaneously clustering both samples and features of adataset leading to more meaningful clusters It was shownthat biclustering performs better than hierarchical clusteringwhen it comes to microarray data but it is still a computa-tionally demanding method [48] Many other methods havebeen implemented for extracting only the important infor-mation from themicroarrays thus reducing their size Featureextraction creates new variables as combinations of others toreduce the dimensionality of the selected features There aretwo broad categories for feature extraction algorithms linearand nonlinear The difference between linear and nonlinearproblems is shown is Figure 3


D

D

K

N X Z= N lowastK UT

Figure 4 Dimensionality reduction using linear matrix factoriza-tion projecting the data on a lower-dimensional linear subspace

31 Linear Linear feature extraction assumes that the datalies on a lower-dimensional linear subspace It projects themon this subspace using matrix factorization Given a dataset119883119873times119863 there exists a projectionmatrix119880119863times119870 and a pro-jection119885119873times119870 where119885 = 119883sdot119880 Using119880119880119879 = 119868 (orthogonalproperty of eigenvectors) we get 119883 = 119885 sdot 119880

119879 A graphicalrepresentation is shown in Figure 4

The most well-known dimensionality reduction algo-rithm is principal component analysis (PCA) Using thecovariance matrix and its eigenvalues and eigenvectors PCAfinds the ldquoprincipal componentsrdquo in the datawhich are uncor-related eigenvectors each representing some proportion ofvariance in the data PCA and many variations of it havebeen applied as a way of reducing the dimensionality of thedata in cancer microarray data [58ndash64] It has been argued[65 66] that when computing the principal components(PCs) of a dataset there is no guarantee that the PCs will berelated to the class variable Therefore supervised principalcomponent analysis (SPCA) was proposed which selects thePCs based on the class variables They named this extra stepthe gene screening step Even though the supervised versionof PCA performs better than the unsupervised PCA has animportant limitation it cannot capture nonlinear relation-ships that often exist in data especially in complex biologicalsystems SPCA works as follows

(1) Compute the relation measure between each genewith outcome using linear logistic or proportionalhazards models

(2) Select genes most associated with the outcome usingcross-validation of the models in step (1)

(3) Estimate principal component scores using only theselected genes

(4) Fit regression with outcome using model in step (1)

The method was highly effective in identifying importantgenes and in cross-validation tests was only outperformed bygene shaving a statistical method for clustering similar tohierarchical clustering The main difference is that the genescan be part of more than one cluster The term ldquoshavingrdquocomes from the removal or shaving of a percentage of thegenes (normally 10) that have the smallest absolute innerproduct with the leading principal component [67]

A similar linear approach is classical multidimensionalscaling (classical MDS) or Principal Coordinates Analysis[68] which calculates the matrix of dissimilarities for any

given matrix input It was used for large genomic datasetsbecause it is efficient in combination with Vector Quantiza-tion or 119870-Means [69] which assigns each observation to aclass out of a total of 119870 classes [70]

32 Nonlinear Nonlinear dimensionality reduction works indifferent ways For example a low-dimensional surface canbe mapped on a high-dimensional space so that a nonlinearrelationship among the features can be found In theory alifting function 119891(119909) can be used to map the features onto ahigher-dimensional space On a higher space the relationshipamong the features can be viewed as linear and thereforeis easily detected This is then mapped back on the lower-dimensional space and the relationship can be viewed asnonlinear In practice kernel functions can be designed tocreate the same effect without the need to explicitly computethe lifting function Another approach to nonlinear dimen-sionality reduction is by using manifolds It is based on theassumption that the data (genes of interest) lie on an embed-ded nonlinear manifold which has lower dimension than theraw data space and lies within it Several algorithms existworking in the manifold space and applied to microarrays Acommonly used method of finding an appropriate manifoldIsomap [71] constructs the manifold by joining each pointonly to its nearest neighbours Distances between points arethen taken as geodesic distances on the resulting graphManyvariants of Isomap have been used for example Balasub-ramanian and Schwartz proposed a tree connected versionwhich differs in the way the neighbourhood graph is con-structed [72] The 119896-nearest points are found by constructinga minimum spanning tree using an 120598-radius hypersphereThis method aims to overcome the drawbacks expressed byOrsenigo and Vercellis [73] regarding the robustness of theIsomap algorithm when it comes to noise and outliers Thesecould cause potential problems with the neighbouring graphespecially when the graph is not fully connected Isomap hasbeen applied onmicroarray data with some very good results[73 74] Compared to PCA Isomap was able to extract morestructural information about the data In addition othermanifold algorithms have been used with microarray datasuch as Locally Linear Embedding (LLE) [75] and LaplacianEigenmaps [76 77] PCA and similar manifold methods areused also for data visualisation as shown in Figure 5 Clusterscan often be better separated usingmanifold LLE and Isomapbut PCA is far faster than the other two

Another nonlinear method for classification is KernelPCA It has been widely used [78 79] since dimensionalityreduction helps with the interpretability of the results It doeshave an important limitation in terms of space complexitysince it stores all the dot products of the training set andtherefore the size of the matrix increases quadratically withthe number of data points [80]

Neural methods can also be used for dimensionalityreduction like Self Organizing Maps [81] (SOMs) or Kohonenmaps that create a lower-dimensional mapping of an input bypreserving its topological characteristics They are composedof nodes or neurons and each node is associated withits own weight vector SOMs training is considered to be


PCA

minus150 minus100 minus50 0 50 100 150minus100

minus50

0

50

100

150

AML t(15 17)AML t(8 21)

AML t(15 17)AML t(8 21)

minus3 minus2 minus1 0 1 2minus25

minus2

minus15

minus1

minus05

0

05

1

15

2

25LLE

AML t(15 17)AML t(8 21)

minus400 minus200 0 200 400minus400

minus300

minus200

minus100

0

100

200

300 IM

Figure 5 Visualisation of a Leukaemia dataset with PCA manifold LLE and manifold Isomap [34]

ldquocompetitiverdquo since when a training example is fed to thenetwork its Euclidean distance with all nodes is calculatedand it is assigned to that node with the smallest distance(Best Matching Unit (BMU)) The weight of that nodealong with its neighbouring nodes is adjusted to match theinput Another neural networks method for dimensionalityreduction (and dimensionality expansion) uses autoencodersAutoencoders are feed-forward neural networks which aretrained to approximate a function by which data can beclassified For every training input the difference between theinput and the output ismeasured (using square error) and it is

back-propagated through the neural network to perform theweight updates to the different layers In a paper that com-pares stacked autoencoders with PCAwithGaussian SVMon13 gene expression datasets it was shown that autoencodersperformbetter on themajority of datasets [82] Autoencodersuse fine-tuning a back-propagation method for adjustingtheir parameters Without back-propagation the autoen-coders get very low accuracies A general problem with thestacked autoencoders method is that a large number of inter-nal layers can easily ldquomemoriserdquo the training data and create amodel with zero error which will overfit the data and so be


unable to classify future test data SOMs have been used asa method of dimensionality reduction for gene expressiondata [77 83] but it was never broadly adopted for analysisbecause it needs just the right amount of data to performwell Insufficient or extraneous data can cause randomness tothe clusters Independent component analysis is also widelyused in microarrays [84 85] in combination with a clusteringmethod

Independent Components Analysis (ICA) finds the corre-lation among the data and decorrelates the data by maximiz-ing or minimizing the contrast information This is calledldquowhiteningrdquoThewhitenedmatrix is then rotated tominimisethe Gaussianity of the projection and in effect retrieve sta-tistically independent data It can be applied in combinationwith PCA It is said that ICA works better if the data has beenpreprocessed with PCA [86] This could merely be due to thedecrease in computational load caused by the high dimen-sion

The advantages and disadvantages of feature extractionand feature selection are shown in Table 3 and in (5)

Feature Selection and Feature Extraction Difference betweenFeature Selection (Top) and Feature Extraction (Bottom)Consider

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[[[

[

119883119894

119883119896

119883119899

]]]]]]

]

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[

[

1198841

119884119870

]]]]

]

= 119891(((

(

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

)))

)

(5)

4 Prior Knowledge

Prior knowledge has previously been used in microarraystudies with the objective of improving the classificationaccuracy One early method for adding prior knowledge ina machine learning algorithm was introduced by Segal et al[87] It first partitions the variables into modules which aregene sets that have the same statistical behaviour (share thesame parents in a probabilistic network) and then uses thisinformation to learn patternsThemodules were constructedusing Bayesian networks and a Bayesian scoring functionto decide how well a variable fits in a module The parentsfor each module were restricted to only some hundreds ofpossible genes since those genesweremost likely to play a reg-ulatory role for the other genes To learn themodule networksRegression Trees were used The gene expression data weretaken from yeast in order to investigate how it responds to

different stress conditions The results were then verifiedusing the Saccharomyces Genome Database Adding priorknowledge reduces the complexity of the model and thenumber of parametersmaking analysis easier A disadvantagehowever of this method is that it relies only on gene expres-sion data which is noisy Many sources of external biologicalinformation are available and can be integrated withmachinelearning andor dimensionality reduction methods This willhelp overcoming one of the limitations of machine learningclassification methods which is that they do not providethe necessary biological connection with the output Addingexternal information in microarray data can give an insighton the functional annotation of the genes and the role theyplay in a disease such as cancer

41 Gene Ontology GeneOntology (GO) terms are a popularsource of prior knowledge since they describe known func-tions of genes Protein information found in the genesrsquo GOindices has been combined with their expressions in order toidentifymoremeaningful relationships among the genes [88]A study infused GO information in a dissimilarity matrix[89] using Linrsquos similarity measure [90] GO terms were alsoused as a way of weighting the longest partial path sharedby two genes [91] This was used with expression data inorder to produce clusters using a pairwise similaritymatrix ofgene expressions and the weight of the GO paths GO termsinformation integrated with gene expression was used byChen and Wang [92] similar genes were clustered togetherand SPCA was used to find the PCs GO terms have beenused to derive information about the biological similarity of apair of genes This similarity was used as a modified distancemetric for clustering [93] Using a similar idea in a laterpublication similarity measures were used to assign priorprobabilities for genes to belong in specific clusters [94] usingan expectationmaximisationmodel Not all of thesemethodshave been compared to other forms of dimensionality reduc-tion such as PCA or manifold which is a serious limitation asto their actual performance It is however the case that in allof those papers an important problem regarding GO terms isdescribed Some genes do not belong in a functional groupand therefore cannot be used Additionally GO terms tendto be very general when it comes to the functional categoriesand this leads to bigger gene clusters that are not necessarilyrelevant in microarray experiments

42 Protein-Protein Interaction Other studies have usedprotein-protein interaction (PPI) networks for the same pur-pose [95] Subnetworks are identified using PPI informationIteratively more interactions are added to each subnetworkand scored usingmutual information between the expressioninformation and the class label in order to find the mostsignificant subnetwork The initial study showed that there ispotential for using PPI networks but there is a lot ofwork to bedone Prior knowledge methods tend to use prior knowledgein order to filter data out or even penalise features Thesefeatures are called outliers and normally are the ones that varyfrom the average The Statistical-Algorithmic Method forBicluster Analysis (SAMBA) algorithm [96] is a biclustering


Table 3 Advantages and disadvantages between feature selection and feature extraction

Method Advantages Disadvantages

Selection Preserving data characteristics for interpretabilityDiscriminative powerLower shorter training timesReducing overfitting

Extraction Higher discriminating power Loss of data interpretabilityControl overfitting when it is unsupervised Transformation maybe expensive

A comparison between feature selection and feature extraction methods

framework that combines PPI and DNA binding informa-tion It identifies subsets that jointly respond in a subset ofconditions It creates a bipartite graph that corresponds togenes and conditions A probabilistic model is created basedon weights assigned on the significant biclusters The resultsfor lymphoma microarray showed that the clusters producedwere highly relevant to the disease A positive feature of theSAMBA algorithms is that it can detect overlapping subsetsbut it has important limitations in the weighting process Allsources are assigned equal weights and they are not penalisedaccording to their importance or reliability of the source

43 Gene Pathways Themost promising results were shownwhen using pathway information as prior knowledge Manydatabases containing information on networks of molecularinteraction in different organisms exist (KEGG PathwayInteraction Database Reactome etc) It is widely believedthat these lower level interactions can be seen as the buildingblocks of genetic systems and can be used to understand high-level functions of the biological systems KEGG pathwayshave been quite popular in network constrained methodswhich use networks to identify gene relations to diseases Notmany methods used pathway knowledge but most of themtreat pathways as networks with directed edges A network-based penalty function for variable selection has beenintroduced [97] The framework used penalised regressionafter imposing a smoothness assumption on the regressioncoefficients based on their location on the gene network Thebiological motivation of this penalty is that the genes that arelinked on the networks are expected to have similar func-tions and therefore bigger coefficients The weights are alsopenalised using the sum of squares of the scaled difference ofthe coefficients between neighbour vertices in the network inorder to smooth the regression coefficients The results werepromising in terms of identifying networks and subnetworksof genes that are responsible for a disease However theauthors only used 33 networks and not the entire set of avail-able networks A similar approach also exists It is theoreticalmodel which according to the authors can be applied tocancermicroarray data but to date has not been explored [98]The proposed method was based on Fourier transformationand spectral graph analysisThe gene expression profiles werereconstructed using prior knowledge to modify the distancefrom gene networks They use the assumption that the infor-mation lies in the low frequency component of the expres-sion while the high frequency component is mostly noiseUsing spectral decomposition the smaller eigenvalues and

corresponding eigenvectors are kept (the smaller the eigen-value the smoother the graph) A linear classifier can beinferred by penalising the regression coefficients based onnetwork information The biological Pathway-Based FeatureSelection (BPFS) algorithm [99] also utilizes pathway infor-mation formicroarray classification It uses SVMs to calculatethe marginal classification power of the genes and puts thosegenes in a separate set Then the influence factor for each ofthe genes in the second set is calculated This is an indicationof the interaction of every gene in the second set with thealready selected genes If the influence factor is low the genesare added to the set of the selected genesThe influence factoris the sum of the shortest pathway distances that connect thegene to be added with each other gene in the set

5 Summary

This paper has presented different ways of reducing thedimensionality of high-dimensional microarray cancer dataThe increase in the amount of data to be analysed hasmade dimensionality reduction methods essential in orderto get meaningful results Different feature selection andfeature extraction methods were described and comparedTheir advantages and disadvantages were also discussed Inaddition we presented several methods that incorporate priorknowledge from various biological sources which is a wayof increasing the accuracy and reducing the computationalcomplexity of existing methods

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] R E Bellman Dynamic Programming Princeton UniversityPress Princeton NJ USA 1957

[2] S Y Kung andMW MakMachine Learning in BioinformaticsChapter 1 Feature Selection for Genomic and Proteomic DataMining John Wiley amp Sons Hoboken NJ USA 2009

[3] J Han Data Mining Concepts and Techniques Morgan Kauf-mann Publishers San Francisco Calif USA 2005

[4] D M Strong Y W Lee and R Y Wang ldquoData quality in con-textrdquo Communications of the ACM vol 40 no 5 pp 103ndash1101997


[5] X Zhu andXWu ldquoClass noise vs attribute noise a quantitativestudy of their impactsrdquo Artificial Intelligence Review vol 22 no3 pp 177ndash210 2004

[6] C de Martel J Ferlay S Franceschi et al ldquoGlobal burden ofcancers attributable to infections in 2008 a review and syntheticanalysisrdquoThe Lancet Oncology vol 13 no 6 pp 607ndash615 2012

[7] A L Blum and R L Rivest ldquoTraining a 3-node neural networkis NP-completerdquoNeural Networks vol 5 no 1 pp 117ndash127 1992

[8] T R Hancock On the Difficulty of Finding Small ConsistentDecision Trees 1989

[9] Y Saeys I Inza and P Larranaga ldquoA review of feature selectiontechniques in bioinformaticsrdquo Bioinformatics vol 23 no 19 pp2507ndash2517 2007

[10] A L Blum and P Langley ldquoSelection of relevant features andexamples inmachine learningrdquoArtificial Intelligence vol 97 no1-2 pp 245ndash271 1997

[11] S Das ldquoFilters wrappers and a boosting-based hybrid forfeature selectionrdquo in Proceedings of the 18th International Con-ference on Machine Learning (ICML rsquo01) pp 74ndash81 MorganKaufmann Publishers San Francisco Calif USA 2001

[12] E P Xing M I Jordan and R M Karp ldquoFeature selection forhigh-dimensional genomic microarray datardquo in Proceedings ofthe 18th International Conference onMachine Learning pp 601ndash608 Morgan Kaufmann 2001

[13] T Boslash and I Jonassen ldquoNew feature subset selection proceduresfor classification of expression profilesrdquo Genome biology vol 3no 4 2002

[14] K Yeung and R Bumgarner ldquoCorrection multiclass classifi-cation of microarray data with repeated measurements appli-cation to cancerrdquo Genome Biology vol 6 no 13 p 405 2005

[15] C Ding and H Peng ldquoMinimum redundancy feature selectionfrom microarray gene expression datardquo in Proceedings of theIEEE Bioinformatics Conference (CSB rsquo03) pp 523ndash528 IEEEComputer Society Washington DC USA August 2003

[16] X Liu A Krishnan and A Mondry ldquoAn entropy-based geneselection method for cancer classification using microarraydatardquo BMC Bioinformatics vol 6 article 76 2005

[17] M AHall ldquoCorrelation-based feature selection for discrete andnu- meric class machine learningrdquo in Proceedings of the 17thInternational Conference on Machine Learning (ICML rsquo00) pp359ndash366 Morgan Kaufmann San Francisco Calif USA 2000

[18] Y Wang I V Tetko M A Hall et al ldquoGene selection frommicroarray data for cancer classificationmdasha machine learningapproachrdquo Computational Biology and Chemistry vol 29 no 1pp 37ndash46 2005

[19] M A Hall and L A Smith ldquoPractical feature subset selectionfor machine learningrdquo in Proceedings of the 21st AustralasianComputer Science Conference (ACSC rsquo98) February 1998

[20] G Mercier N Berthault J Mary et al ldquoBiological detectionof low radiation doses by combining results of two microarrayanalysis methodsrdquo Nucleic Acids Research vol 32 no 1 articlee12 2004

[21] Y Wang and F Makedon ldquoApplication of relief-F featurefiltering algorithm to selecting informative genes for cancerclassification using microarray datardquo in Proceedings of IEEEComputational Systems Bioinformatics Conference (CSB rsquo04) pp497ndash498 IEEE Computer Society August 2004

[22] G Weber S Vinterbo and L Ohno-Machado ldquoMultivariateselection of genetic markers in diagnostic classificationrdquo Arti-ficial Intelligence in Medicine vol 31 no 2 pp 155ndash167 2004

[23] P Pudil J Novovicova and J Kittler ldquoFloating search methodsin feature selectionrdquo Pattern Recognition Letters vol 15 no 11pp 1119ndash1125 1994

[24] A Osareh and B Shadgar ldquoMachine learning techniques todiagnose breast cancerrdquo in Proceedings of the 5th InternationalSymposium on Health Informatics and Bioinformatics (HIBITrsquo10) pp 114ndash120 April 2010

[25] O Chapelle VVapnikO Bousquet and SMukherjee ldquoChoos-ing multiple parameters for support vector machinesrdquoMachineLearning vol 46 no 1ndash3 pp 131ndash159 2002

[26] Q Liu A H Sung Z Chen J Liu X Huang and Y Deng ldquoFea-ture selection and classification of MAQC-II breast cancer andmultiplemyelomamicroarray gene expression datardquoPLoSONEvol 4 no 12 Article ID e8250 2009

[27] E K Tang P N Suganthan and X Yao ldquoGene selectionalgorithms for microarray data based on least squares supportvector machinerdquo BMC Bioinformatics vol 7 article 95 2006

[28] X-L Xia H Xing and X Liu ldquoAnalyzing kernel matrices forthe identification of differentially expressed genesrdquo PLoS ONEvol 8 no 12 Article ID e81683 2013

[29] C Ambroise and G J McLachlan ldquoSelection bias in geneextraction on the basis of microarray gene-expression datardquoProceedings of the National Academy of Sciences of the UnitedStates of America vol 99 no 10 pp 6562ndash6566 2002

[30] I Guyon J Weston S Barnhill and V Vapnik ldquoGene selec-tion for cancer classification using support vector machinesrdquoMachine Learning vol 46 no 1ndash3 pp 389ndash422 2002

[31] Q Liu A H Sung Z Chen et al ldquoGene selection and classi-fication for cancer microarray data based on machine learningand similarity measuresrdquo BMCGenomics vol 12 supplement 5article S1 2011

[32] M Gutlein E Frank M Hall and A Karwath ldquoLarge-scaleattribute selection using wrappersrdquo in Proceedings of the IEEESymposium on Computational Intelligence and Data Mining(CIDM rsquo09) pp 332ndash339 April 2009

[33] T Jirapech-Umpai and S Aitken ldquoFeature selection and classi-fication for microarray data analysis evolutionary methods foridentifying predictive genesrdquo BMCBioinformatics vol 6 article148 2005

[34] C Bartenhagen H-U Klein C Ruckert X Jiang and MDugas ldquoComparative study of unsupervised dimension reduc-tion techniques for the visualization of microarray gene expres-sion datardquo BMC Bioinformatics vol 11 no 1 article 567 2010

[35] R Ruiz J C Riquelme and J S Aguilar-Ruiz ldquoIncrementalwrapper-based gene selection from microarray data for cancerclassificationrdquo Pattern Recognition vol 39 no 12 pp 2383ndash2392 2006

[36] E B Huerta B Duval and J-K Hao ldquoGene selection formicroarray data by a LDA-based genetic algorithmrdquo in PatternRecognition in Bioinformatics Proceedings of the 3rd IAPR Inter-national Conference PRIB 2008 Melbourne Australia October15ndash17 2008 M Chetty A Ngom and S Ahmad Eds vol 5265of Lecture Notes in Computer Science pp 250ndash261 SpringerBerlin Germany 2008

[37] M Perez and T Marwala ldquoMicroarray data feature selectionusing hybrid genetic algorithm simulated annealingrdquo in Pro-ceedings of the IEEE 27th Convention of Electrical and ElectronicsEngineers in Israel (IEEEI rsquo12) pp 1ndash5 November 2012

[38] N Revathy and R Balasubramanian ldquoGA-SVM wrapperapproach for gene ranking and classification using expressionsof very few genesrdquo Journal of Theoretical and Applied Informa-tion Technology vol 40 no 2 pp 113ndash119 2012


[39] J C Dunn ldquoA fuzzy relative of the ISODATA process and itsuse in detecting compact well-separated clustersrdquo Journal ofCybernetics vol 3 no 3 pp 32ndash57 1973

[40] J C Bezdek Pattern Recognition with Fuzzy Objective FunctionAlgorithms Kluwer Academic Publishers Norwell Mass USA1981

[41] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006

[42] L Sheng R Pique-Regi S Asgharzadeh and A OrtegaldquoMicroarray classification using block diagonal linear discrim-inant analysis with embedded feature selectionrdquo in Proceedingsof the IEEE International Conference on Acoustics Speech andSignal Processing (ICASSP rsquo09) pp 1757ndash1760 April 2009

[43] S Maldonado R Weber and J Basak ldquoSimultaneous featureselection and classification using kernel-penalized supportvector machinesrdquo Information Sciences vol 181 no 1 pp 115ndash128 2011

[44] E K Tang P N Suganthan and X Yao ldquoFeature selection formicroarray data using least squares SVM and particle swarmoptimizationrdquo in Proceedings of the IEEE Symposium on Com-putational Intelligence in Bioinformatics and ComputationalBiology (CIBCB rsquo05) pp 9ndash16 IEEE November 2005

[45] Y Tang Y-Q Zhang and Z Huang ldquoDevelopment of two-stage SVM-RFE gene selection strategy for microarray expres-sion data analysisrdquo IEEEACM Transactions on ComputationalBiology and Bioinformatics vol 4 no 3 pp 365ndash381 2007

[46] X Zhang X Lu Q Shi et al ldquoRecursive SVM feature selectionand sample classification for mass-spectrometry and microar-ray datardquo BMC Bioinformatics vol 7 article 197 2006

[47] M B Eisen P T Spellman P O Brown and D Botstein ldquoClus-ter analysis and display of genome-wide expression patternsrdquoProceedings of the National Academy of Sciences of the UnitedStates of America vol 95 no 25 pp 14863ndash14868 1998

[48] A Prelic S Bleuler P Zimmermann et al ldquoA systematic com-parison and evaluation of biclusteringmethods for gene expres-sion datardquo Bioinformatics vol 22 no 9 pp 1122ndash1129 2006

[49] P Jafari and F Azuaje ldquoAn assessment of recently publishedgene expression data analyses reporting experimental designand statistical factorsrdquo BMC Medical Informatics and DecisionMaking vol 6 no 1 article 27 2006

[50] M A Hall ldquoCorrelation-based feature selection for machinelearningrdquo Tech Rep 1998

[51] J Hruschka R Estevam E R Hruschka and N F F EbeckenldquoFeature selection by bayesian networksrdquo in Advances in Artifi-cial Intelligence A Y Tawfik and S D Goodwin Eds vol 3060of Lecture Notes in Computer Science pp 370ndash379 SpringerBerlin Germany 2004

[52] A Rau F Jaffrezic J-L Foulley and R W Doerge ldquoAn empir-ical bayesian method for estimating biological networks fromtemporal microarray datardquo Statistical Applications in Geneticsand Molecular Biology vol 9 article 9 2010

[53] P Yang B B Zhou Z Zhang and A Y Zomaya ldquoA multi-filterenhanced genetic ensemble system for gene selection and sam-ple classification of microarray datardquo BMC Bioinformatics vol11 supplement 1 article S5 2010

[54] C H Ooi and P Tan ldquoGenetic algorithms applied tomulti-classprediction for the analysis of gene expression datardquo Bioinfor-matics vol 19 no 1 pp 37ndash44 2003

[55] H Glass and L Cooper ldquoSequential search a method for solv-ing constrained optimization problemsrdquo Journal of the ACMvol 12 no 1 pp 71ndash82 1965

[56] H Jiang Y Deng H-S Chen et al ldquoJoint analysis of twomicroarray gene-expression data sets to select lung adenocar-cinoma marker genesrdquo BMC Bioinformatics vol 5 article 812004

[57] S Ma X Song and J Huang ldquoSupervised group Lasso withapplications to microarray data analysisrdquo BMC Bioinformaticsvol 8 article 60 2007

[58] P F Evangelista P Bonissone M J Embrechts and B K Szy-manski ldquoUnsupervised fuzzy ensembles and their use in intru-sion detectionrdquo in Proceedings of the European Symposium onArtificial Neural Networks pp 345ndash350 April 2005

[59] S Jonnalagadda and R Srinivasan ldquoPrincipal componentsanalysis based methodology to identify differentially expressedgenes in time-coursemicroarray datardquoBMCBioinformatics vol9 article 267 2008

[60] J Landgrebe W Wurst and G Welzl ldquoPermutation-validatedprincipal components analysis of microarray datardquo GenomeBiology vol 3 no 4 2002

[61] J MisraW Schmitt D Hwang et al ldquoInteractive exploration ofmicroarray gene expression patterns in a reduced dimensionalspacerdquo Genome Research vol 12 no 7 pp 1112ndash1120 2002

[62] V Nikulin and G J McLachlan ldquoPenalized principal compo-nent analysis of microarray datardquo in Computational IntelligenceMethods for Bioinformatics and Biostatistics F Masulli L EPeterson and R Tagliaferri Eds vol 6160 of Lecture Notes inComputer Science pp 82ndash96 Springer Berlin Germany 2009

[63] S Raychaudhuri J M Stuart R B Altman and R B Alt-man ldquoPrincipal components analysis to summarize microarrayexperiments application to sporulation time seriesrdquo in Proceed-ings of the Pacific Symposium on Biocomputing pp 452ndash4632000

[64] A Wang and E A Gehan ldquoGene selection for microarray dataanalysis using principal component analysisrdquo Statistics in Medi-cine vol 24 no 13 pp 2069ndash2087 2005

[65] E Bair T Hastie D Paul and R Tibshirani ldquoPrediction bysupervised principal componentsrdquo Journal of the AmericanStatistical Association vol 101 no 473 pp 119ndash137 2006

[66] E Bair and R Tibshirani ldquoSemi-supervised methods to predictpatient survival from gene expression datardquo PLoS Biology vol2 pp 511ndash522 2004

[67] T Hastie R Tibshirani M B Eisen et al ldquolsquoGene shavingrsquo asa method for identifying distinct sets of genes with similarexpression patternsrdquo Genome Biology vol 1 no 2 pp 1ndash212000

[68] I Borg and P J Groenen Modern Multidimensional ScalingTheory and Applications Springer Series in Statistics Springer2nd edition 2005

[69] J Tzeng H Lu and W-H Li ldquoMultidimensional scaling forlarge genomic data setsrdquo BMC Bioinformatics vol 9 article 1792008

[70] J A Hartigan and M A Wong ldquoAlgorithm AS 136 a K-meansclustering algorithmrdquo Journal of the Royal Statistical SocietySeries C Applied Statistics vol 28 no 1 pp 100ndash108 1979

[71] J B Tenenbaum V de Silva and J C Langford ldquoA globalgeometric framework for nonlinear dimensionality reductionrdquoScience vol 290 no 5500 pp 2319ndash2323 2000

[72] M Balasubramanian andE L Schwartz ldquoThe isomap algorithmand topological stabilityrdquo Science vol 295 no 5552 p 7 2002

[73] C Orsenigo and C Vercellis ldquoAn effective double-boundedtree-connected Isomap algorithm for microarray data classifi-cationrdquo Pattern Recognition Letters vol 33 no 1 pp 9ndash16 2012


[74] K Dawson R L Rodriguez and W Malyj ldquoSample phenotypeclusters in high-density oligonucleotidemicroarray data sets arerevealed using Isomap a nonlinear algorithmrdquo BMC Bioinfor-matics vol 6 article 195 2005

[75] C Shi and L Chen ldquoFeature dimension reduction for microar-ray data analysis using locally linear embeddingrdquo in Proceedingsof the 3rd Asia-Pacific Bioinformatics Conference (APBC rsquo05) pp211ndash217 January 2005

[76] M Ehler V N Rajapakse B R Zeeberg et al ldquoNonlinear genecluster analysis with labeling for microarray gene expressiondata in organ developmentrdquo BMC Proceedings vol 5 no 2article S3 2011

[77] M Kotani A Sugiyama and S Ozawa ldquoAnalysis of DNAmicroarray data using self-organizing map and kernel basedclusteringrdquo in Proceedings of the 9th International Conference onNeural Information Processing (ICONIP rsquo02) vol 2 pp 755ndash759Singapore November 2002

[78] Z Liu D Chen and H Bensmail ldquoGene expression data classi-fication with kernel principal component analysisrdquo Journal ofBiomedicine and Biotechnology vol 2005 no 2 pp 155ndash1592005

[79] F Reverter E Vegas and J M Oller ldquoKernel-PCA data integra-tion with enhanced interpretabilityrdquo BMC Systems Biology vol8 supplement 2 p S6 2014

[80] X Liu and C Yang ldquoGreedy kernel PCA for training datareduction and nonlinear feature extraction in classificationrdquo inMIPPR 2009 Automatic Target Recognition and Image Analysisvol 7495 of Proceedings of SPIE Yichang China October 2009

[81] T Kohonen ldquoSelf-organized formation of topologically correctfeature mapsrdquo in Neurocomputing Foundations of Research pp509ndash521 MIT Press Cambridge Mass USA 1988

[82] R Fakoor F Ladhak A Nazi andMHuber ldquoUsing deep learn-ing to enhance cancer diagnosis and classificationrdquo in Proceed-ings of the ICML Workshop on the Role of Machine Learning inTransforming Healthcare (WHEALTH rsquo13) ICML 2013

[83] S Kaski J Nikkil P Trnen E Castrn and G Wong ldquoAnalysisand visualization of gene expression data using self-organizingmapsrdquo in Proceedings of the IEEE-EURASIP Workshop onNonlinear Signal and Image Processing (NSIP rsquo01) p 24 2001

[84] J M Engreitz B J Daigle Jr J J Marshall and R B AltmanldquoIndependent component analysis mining microarray datafor fundamental human gene expression modulesrdquo Journal ofBiomedical Informatics vol 43 no 6 pp 932ndash944 2010

[85] S-I Lee and S Batzoglou ldquoApplication of independent com-ponent analysis to microarraysrdquo Genome Biology vol 4 no 11article R76 2003

[86] L J Cao K S Chua W K Chong H P Lee and Q M Gu ldquoAcomparison of PCA KPCA and ICA for dimensionality reduc-tion in support vector machinerdquo Neurocomputing vol 55 no1-2 pp 321ndash336 2003

[87] E Segal D Koller N Friedman and T Jaakkola ldquoLearningmodule networksrdquo Journal of Machine Learning Research vol27 pp 525ndash534 2005

[88] Y Chen and D Xu ldquoGlobal protein function annotationthrough mining genome-scale data in yeast SaccharomycescerevisiaerdquoNucleic Acids Research vol 32 no 21 pp 6414ndash64242004

[89] R Kustra andA Zagdanski ldquoData-fusion in clusteringmicroar-ray data balancing discovery and interpretabilityrdquo IEEEACMTransactions on Computational Biology and Bioinformatics vol7 no 1 pp 50ndash63 2010

[90] D Lin ldquoAn information-theoretic definition of similarityrdquo inProceedings of the 15th International Conference on MachineLearning (ICML rsquo98) Madison Wis USA 1998

[91] J Cheng M Cline J Martin et al ldquoA knowledge-basedclustering algorithm driven by gene ontologyrdquo Journal of Bio-pharmaceutical Statistics vol 14 no 3 pp 687ndash700 2004

[92] X Chen and L Wang ldquoIntegrating biological knowledge withgene expression profiles for survival prediction of cancerrdquo Jour-nal of Computational Biology vol 16 no 2 pp 265ndash278 2009

[93] D Huang and W Pan ldquoIncorporating biological knowledgeinto distance-based clustering analysis of microarray geneexpression datardquo Bioinformatics vol 22 no 10 pp 1259ndash12682006

[94] W Pan ldquoIncorporating gene functions as priors inmodel-basedclustering of microarray gene expression datardquo Bioinformaticsvol 22 no 7 pp 795ndash801 2006

[95] H-Y Chuang E Lee Y-T Liu D Lee and T Ideker ldquoNetwork-based classification of breast cancer metastasisrdquo MolecularSystems Biology vol 3 no 1 article 140 2007

[96] A Tanay R Sharan and R Shamir ldquoDiscovering statisticallysignificant biclusters in gene expression datardquo in Proceedingsof the 10th International Conference on Intelligent Systems forMolecular Biology (ISMB rsquo02) pp 136ndash144 Edmonton CanadaJuly 2002

[97] C Li and H Li ldquoNetwork-constrained regularization and vari-able selection for analysis of genomic datardquo Bioinformatics vol24 no 9 pp 1175ndash1182 2008

[98] F Rapaport A Zinovyev M Dutreix E Barillot and J-P VertldquoClassification of microarray data using gene networksrdquo BMCBioinformatics vol 8 article 35 2007

[99] N Bandyopadhyay T Kahveci S Goodison Y Sun and SRanka ldquoPathway-basedfeature selection algorithm for cancermicroarray datardquo Advances in Bioinformatics vol 2009 ArticleID 532989 16 pages 2009


theWorld Health Organization It is expected that the deathsfrom cancer will rise to 14 million in the next two decadesCancer is not a single diseaseThere aremore than 100 knowndifferent types of cancer and probably many more The termcancer is used to describe the abnormal growth of cells thatcan for example form extra tissue called mass and thenattack other organs [6]

Microarray databases are a large source of genetic datawhich upon proper analysis could enhance our understand-ing of biology and medicine Many microarray experimentshave been designed to investigate the genetic mechanismsof cancer and analytical approaches have been applied inorder to classify different types of cancer or distinguishbetween cancerous and noncancerous tissue In the last tenyears machine learning techniques have been investigated inmicroarray data analysis Several approaches have been triedin order to (i) distinguish between cancerous and noncancer-ous samples (ii) classify different types of cancer and (iii)identify subtypes of cancer that may progress aggressivelyAll these investigations are seeking to generate biologicallymeaningful interpretations of complex datasets that aresufficiently interesting to drive follow-up experimentation

This review paper is structured as follows The next sec-tion is about feature selection methods (filters wrappers andembedded techniques) applied on microarray cancer dataThen we discuss feature extraction methods (linear and non-linear) inmicroarray cancer data and the final section is aboutusing prior knowledge in combination with a feature extrac-tion or feature selection method to improve classificationaccuracy and algorithmic complexity

2 Feature Subset Selection inMicroarray Cancer Data

Feature subset selection works by removing features that arenot relevant or are redundant The subset of features selectedshould follow the Occamrsquos Razor principle and also give thebest performance according to some objective function Inmany cases this is anNP-hard (nondeterministic polynomial-time hard) problem [7 8]The size of the data to be processedhas increased the past 5 years and therefore feature selectionhas become a requirement before any kind of classificationtakes place Unlike feature extraction methods feature selec-tion techniques do not alter the original representation of thedata [9] One objective for both feature subset selection andfeature extraction methods is to avoid overfitting the datain order to make further analysis possible The simplest isfeature selection in which the number of gene probes in anexperiment is reduced by selecting only the most significantaccording to some criterion such as high levels of activityFeature selection algorithms are separated into three cate-gories [10 11]

(i) The filters which extract features from the data with-out any learning involved

(ii) The wrappers that use learning techniques to evaluatewhich features are useful

(iii) The embedded techniques which combine the featureselection step and the classifier construction

21 Filters Filters workwithout taking the classifier into con-sideration This makes them very computationally efficientThey are divided into multivariate and univariate methodsMultivariate methods are able to find relationships amongthe features while univariate methods consider each featureseparately Gene ranking is a popular statistical method Thefollowing methods were proposed in order to rank the genesin a dataset based on their significance [12]

(i) (Univariate) Unconditional Mixture Modelling as-sumes two different states of the gene on and off andchecks whether the underlying binary state of thegene affects the classification using mixture overlapprobability

(ii) (Univariate) Information Gain Ranking approximatesthe conditional distribution 119875(119862 | 119865) where 119862 is theclass label and119865 is the feature vector Information gainis used as a surrogate for the conditional distribution

(iii) (Multivariate)Markov Blanket Filtering finds featuresthat are independent of the class label so that remov-ing them will not affect the accuracy

In multivariate methods pair 119905-scores are used for evaluatinggene pairs depending on how well they can separate twoclasses in an attempt to identify genes that work together toprovide a better classification [13] Their results for the genepair rankings were found to be ldquoat least as interesting as thesingle genes found by an independent evaluationrdquo

Methods based on correlation have also been suggested

(i) (Multivariate) Error-Weighted Uncorrelated ShrunkenCentroid (EWUSC) this method is based on theuncorrelated shrunken centroid (USC) and shrunkencentroid (SC) The shrunken centroid is found bydividing the average gene expression for each genein each class by the standard deviation for that genein the same class This way higher weight is given togenes whose expression is the same among differentsamples in the same class New samples are assignedto the label with the nearest average pattern (usingsquared distance) The uncorrelated shrunken cen-troid approach removes redundant features by findinggenes that are highly correlated in the set of genesalready found by SC The EWUSC uses both of thesesteps and in addition adds error-weights (based onwithin-class variability) so that noisy genes will bedowngraded and redundant genes are removed [14]A comparison is shown in Figure 1 where the threedifferent methods are tested on a relatively small(25000 genes and 78 samples) breast cancer datasetThe algorithms perform well when the number ofrelevant genes is less than 1000

(ii) (Multivariate)Minimum Redundancy Maximum Rel-evance (mRMR) mRMR is a method that maximisesthe relevancy of genes with the class label while itminimises the redundancy in each class To do so ituses several statistical measures Mutual Information(MI)measures the information a randomvariable cangive about another in particular the gene activity and



119882 =1|119878|

2 sum119894119895isin119878

119868 (119894 119895) (1)

119881 =1|119878|sum

119894isin119878

119868 (ℎ 119894) (2)


119881 =1|119878|sum

119894isin119878

119865 (119894 ℎ) (3)

119882 =1|119878|

2 sum119894119895isin119878

1003816100381610038161003816119888 (119894 119895)1003816100381610038161003816 (4)




90

85

80

75

70

65

60

Pred

ictio

n ac

cura

cy (

)

101

102

103

104


EWUSC (1205880= 07)

USC (1205880= 06)

SC

Test data










10 20 30 40 50 60 70 80 90 100 110 120 130 140 15080

82

84

86

88

90

92

94

96

98

100

Number of genes

Accu

racy



10 20 30 40 50 60 70 80 90 100 110 120 130 140 15080

82

84

86

88

90

92

94

96

98

100

Number of genes

Accu

racy














end while




end ifk = k + 1

end while




























D

D

K

N X Z= N lowastK UT
















PCA


minus50

0

50

100

150

AML t(15 17)AML t(8 21)

AML t(15 17)AML t(8 21)


minus2

minus15

minus1

minus05

0

05

1

15

2

25LLE

AML t(15 17)AML t(8 21)


minus300

minus200

minus100

0

100

200

300 IM









[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[[[

[

119883119894

119883119896

119883119899

]]]]]]

]

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[

[

1198841

119884119870

]]]]

]

= 119891(((

(

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

)))

)

(5)

4 Prior Knowledge














5 Summary




References









































































































119882 =1|119878|

2 sum119894119895isin119878

119868 (119894 119895) (1)

119881 =1|119878|sum

119894isin119878

119868 (ℎ 119894) (2)


119881 =1|119878|sum

119894isin119878

119865 (119894 ℎ) (3)

119882 =1|119878|

2 sum119894119895isin119878

1003816100381610038161003816119888 (119894 119895)1003816100381610038161003816 (4)




90

85

80

75

70

65

60

Pred

ictio

n ac

cura

cy (

)

101

102

103

104


EWUSC (1205880= 07)

USC (1205880= 06)

SC

Test data










10 20 30 40 50 60 70 80 90 100 110 120 130 140 15080

82

84

86

88

90

92

94

96

98

100

Number of genes

Accu

racy



10 20 30 40 50 60 70 80 90 100 110 120 130 140 15080

82

84

86

88

90

92

94

96

98

100

Number of genes

Accu

racy














end while




end ifk = k + 1

end while




























D

D

K

N X Z= N lowastK UT
















PCA


minus50

0

50

100

150

AML t(15 17)AML t(8 21)

AML t(15 17)AML t(8 21)


minus2

minus15

minus1

minus05

0

05

1

15

2

25LLE

AML t(15 17)AML t(8 21)


minus300

minus200

minus100

0

100

200

300 IM









[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[[[

[

119883119894

119883119896

119883119899

]]]]]]

]

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[

[

1198841

119884119870

]]]]

]

= 119891(((

(

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

)))

)

(5)

4 Prior Knowledge














5 Summary




References









































































































10 20 30 40 50 60 70 80 90 100 110 120 130 140 15080

82

84

86

88

90

92

94

96

98

100

Number of genes

Accu

racy



10 20 30 40 50 60 70 80 90 100 110 120 130 140 15080

82

84

86

88

90

92

94

96

98

100

Number of genes

Accu

racy














end while




end ifk = k + 1

end while




























D

D

K

N X Z= N lowastK UT
















PCA


minus50

0

50

100

150

AML t(15 17)AML t(8 21)

AML t(15 17)AML t(8 21)


minus2

minus15

minus1

minus05

0

05

1

15

2

25LLE

AML t(15 17)AML t(8 21)


minus300

minus200

minus100

0

100

200

300 IM









[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[[[

[

119883119894

119883119896

119883119899

]]]]]]

]

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[

[

1198841

119884119870

]]]]

]

= 119891(((

(

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

)))

)

(5)

4 Prior Knowledge














5 Summary




References









































































































end while




end ifk = k + 1

end while




























D

D

K

N X Z= N lowastK UT
















PCA


minus50

0

50

100

150

AML t(15 17)AML t(8 21)

AML t(15 17)AML t(8 21)


minus2

minus15

minus1

minus05

0

05

1

15

2

25LLE

AML t(15 17)AML t(8 21)


minus300

minus200

minus100

0

100

200

300 IM









[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[[[

[

119883119894

119883119896

119883119899

]]]]]]

]

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[

[

1198841

119884119870

]]]]

]

= 119891(((

(

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

)))

)

(5)

4 Prior Knowledge














5 Summary




References





























































































































D

D

K

N X Z= N lowastK UT
















PCA


minus50

0

50

100

150

AML t(15 17)AML t(8 21)

AML t(15 17)AML t(8 21)


minus2

minus15

minus1

minus05

0

05

1

15

2

25LLE

AML t(15 17)AML t(8 21)


minus300

minus200

minus100

0

100

200

300 IM









[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[[[

[

119883119894

119883119896

119883119899

]]]]]]

]

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[

[

1198841

119884119870

]]]]

]

= 119891(((

(

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

)))

)

(5)

4 Prior Knowledge














5 Summary




References








































































































D

D

K

N X Z= N lowastK UT
















PCA


minus50

0

50

100

150

AML t(15 17)AML t(8 21)

AML t(15 17)AML t(8 21)


minus2

minus15

minus1

minus05

0

05

1

15

2

25LLE

AML t(15 17)AML t(8 21)


minus300

minus200

minus100

0

100

200

300 IM









[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[[[

[

119883119894

119883119896

119883119899

]]]]]]

]

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[

[

1198841

119884119870

]]]]

]

= 119891(((

(

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

)))

)

(5)

4 Prior Knowledge














5 Summary




References








































































































PCA


minus50

0

50

100

150

AML t(15 17)AML t(8 21)

AML t(15 17)AML t(8 21)


minus2

minus15

minus1

minus05

0

05

1

15

2

25LLE

AML t(15 17)AML t(8 21)


minus300

minus200

minus100

0

100

200

300 IM









[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[[[

[

119883119894

119883119896

119883119899

]]]]]]

]

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[

[

1198841

119884119870

]]]]

]

= 119891(((

(

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

)))

)

(5)

4 Prior Knowledge














5 Summary




References












































































































[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[[[

[

119883119894

119883119896

119883119899

]]]]]]

]

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

997888rarr

[[[[

[

1198841

119884119870

]]]]

]

= 119891(((

(

[[[[[[[[[

[

1198831

1198832

119883119873minus1

119883119873

]]]]]]]]]

]

)))

)

(5)

4 Prior Knowledge














5 Summary




References
















































































































5 Summary




References



































































































































































































































































































A Review of Feature Selection and Feature Extraction ......“best” features are selected, feature elimination progresses graduallyandincludescross-validationsteps[26,44–46].A

Documents