Research Article

Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data

Thanh-Tung Nguyen,1,2,3 Joshua Zhexue Huang,1,4 and Thuy Thi Nguyen5

1 Shenzhen Key Laboratory of High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 School of Computer Science and Engineering, Water Resources University, Hanoi 10000, Vietnam
4 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
5 Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi 10000, Vietnam

Correspondence should be addressed to Thanh-Tung Nguyen; [email protected]

Received 20 June 2014; Accepted 20 August 2014

Academic Editor: Shifei Ding

Copyright © 2015 Thanh-Tung Nguyen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The Scientific World Journal, Volume 2015, Article ID 471371, 18 pages, http://dx.doi.org/10.1155/2015/471371

Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.

1. Introduction

Random forests (RFs) [1] are a nonparametric method that builds an ensemble model of decision trees from random subsets of features and bagged samples of the training data. RFs have shown excellent performance for both classification and regression problems. The RF model works well even when predictive features contain irrelevant features (or noise); it can be used when the number of features is much larger than the number of samples. However, with the randomizing mechanism in both bagging samples and feature selection, RFs could give poor accuracy when applied to high-dimensional data. The main cause is that, in the process of growing a tree from the bagged sample data, the subspace of features randomly sampled from thousands of features to split a node of the tree is often dominated by uninformative features (or noise), and the tree grown from such a bagged subspace of features will have low accuracy in prediction, which affects the final prediction of the RFs. Furthermore, Breiman et al. noted that feature selection is biased in the classification and regression tree (CART) model because it is based on an information criterion, called the multivalue problem [2]. It tends to favor features containing more values, even if these features have lower importance than other ones or have no relationship with the response feature (i.e., containing fewer missing values, many categorical or distinct numerical values) [3, 4].

In this paper, we propose a new random forests algorithm using an unbiased feature sampling method to build a good subspace of unbiased features for growing trees.


We first use random forests to measure the importance of features and produce raw feature importance scores. Then we apply a statistical Wilcoxon rank-sum test to separate informative features from the uninformative ones. This is done by neglecting all uninformative features via a defined threshold θ, for instance θ = 0.05. Second, we use the Chi-square statistical test (χ²) to compute the relatedness score of each feature to the response feature. We then partition the set of the remaining informative features into two subsets, one containing highly informative features and the other one containing weakly informative features. We independently sample features from the two subsets and merge them together to get a new subspace of features, which is used for splitting the data at nodes. Since the subspace always contains highly informative features, which can guarantee a better split at a node, this feature sampling method avoids selecting biased features and generates trees from bagged sample data with higher accuracy. This sampling method is also used to reduce dimensionality and the amount of data needed for training the random forest model. Our experimental results have shown that random forests with this weighted feature selection technique outperformed recently proposed random forest models in prediction accuracy; we also applied the new approach to microarray and image data and achieved outstanding results.

The structure of this paper is organized as follows. In Section 2, we give a brief summary of related works. In Section 3, we give a brief summary of random forests and the measurement of feature importance scores. Section 4 describes our newly proposed algorithm using unbiased feature selection. Section 5 provides the experimental results, evaluations, and comparisons. Section 6 gives our conclusions.

2. Related Works

Random forests are an ensemble approach to make classification decisions by voting the results of individual decision trees. An ensemble learner with excellent generalization accuracy has two properties: high accuracy of each component learner and high diversity in component learners [5]. Unlike other ensemble methods such as bagging [1] and boosting [6, 7], which create basic classifiers from random samples of the training data, the random forest approach creates the basic classifiers from randomly selected subspaces of data [8, 9]. The randomly selected subspaces increase the diversity of basic classifiers learnt by a decision tree algorithm.

Feature importance is the importance measure of features in the feature selection process [1, 10–14]. In RF frameworks, the most commonly used score of importance of a given feature is the mean error of a tree in the forest when the observed values of this feature are randomly permuted in the out-of-bag samples. Feature selection is an important step to obtain good performance for an RF model, especially in dealing with high-dimensional data problems.

For feature weighting techniques, Xu et al. [13] recently proposed an improved RF method which uses a novel feature weighting method for subspace selection and therefore enhances classification performance on high-dimensional data. The weights of features were calculated by the information gain ratio or the χ²-test. Ye et al. [14] then used these weights to propose a stratified sampling method to select feature subspaces for RF in classification problems. Chen et al. [15] used a stratified idea to propose a new clustering method. However, the implementation of the random forest model suggested by Ye et al. is based on a binary classification setting, and it uses linear discriminant analysis as the splitting criterion. This stratified RF model is not efficient on high-dimensional datasets with multiple classes. In the same way, for solving the two-class problem, Amaratunga et al. [16] presented a feature weighting method for subspace sampling to deal with microarray data; the t-test of variance analysis is used to compute weights for the features. Genuer et al. [12] proposed a strategy involving a ranking of explanatory features using the RF importance score weights and a stepwise ascending feature introduction strategy. Deng and Runger [17] proposed a guided regularized RF (GRRF), in which weights of importance scores from an ordinary random forest (RF) are used to guide the feature selection process. They found that the feature subset selected by their GRRF with minimal regularization ensures better accuracy than the complete feature set. However, a regular RF was used as the classifier, due to the fact that the regularized RF may have higher variance than RF because the trees are correlated.

Several methods have been proposed to correct the bias of importance measures in the feature selection process in RFs to improve the prediction accuracy [18–21]. These methods intend to avoid selecting an uninformative feature for node splitting in decision trees. Although the methods of this kind were well investigated and can be used to address the high-dimensional problem, there are still some unsolved problems, such as the need to specify in advance the probability distributions, as well as the fact that they struggle when applied to large high-dimensional data.

In summary, in the reviewed approaches, the gain at higher levels of the tree is weighted differently than the gain at lower levels of the tree. In fact, at lower levels of the tree, the gain is reduced because of the effect of splits on different features at higher levels of the tree. That affects the final prediction performance of the RF model. To remedy this, in this paper, we propose a new method for unbiased feature subset selection in high-dimensional space to build RFs. Our approach differs from previous approaches in the techniques used to partition a subset of features. All uninformative features (considered as noise) are removed from the system, and the best feature set, which is highly related to the response feature, is found using a statistical method. The proposed sampling method always provides enough highly informative features for the feature subspace at any level of the decision trees. For the case of growing an RF model on data without noise, we used in-bag measures. This is a different importance score of features, which requires less computational time compared to the measures used by others. Our experimental results showed that our approach outperformed recently proposed RF methods.


Input: L = {(X_i, Y_i)}_{i=1}^N | X ∈ R^M, Y ∈ {1, 2, ..., c}: the training dataset; K: the number of trees; mtry: the size of the subspaces.
Output: A random forest RF.
(1) for k ← 1 to K do
(2)   Draw a bagged subset of samples L_k from L.
(4)   while (stopping criteria is not met) do
(5)     Select randomly mtry features.
(6)     for m ← 1 to mtry do
(7)       Compute the decrease in the node impurity.
(8)     Choose the feature which decreases the impurity the most, and divide the node into two children nodes.
(9) Combine the K trees to form a random forest.

Algorithm 1: Random forest algorithm.
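To make the procedure concrete, here is a minimal sketch of the bagging and random-subspace loop of Algorithm 1 together with the majority vote of equation (1), assuming scikit-learn's DecisionTreeClassifier as the CART base learner and integer class labels 0, ..., c−1; the function names are illustrative, not from the paper's implementation:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, K=100, mtry=None, seed=0):
    rng = np.random.RandomState(seed)
    N, M = X.shape
    mtry = mtry or max(1, int(np.sqrt(M)))
    forest = []
    for _ in range(K):
        # Step (2): draw a bagged (bootstrap) sample L_k from L
        idx = rng.randint(0, N, size=N)
        # Steps (4)-(8): at each node, mtry features are drawn at random and the
        # split giving the largest impurity decrease is chosen (max_features does this)
        tree = DecisionTreeClassifier(max_features=mtry,
                                      random_state=rng.randint(1 << 30))
        tree.fit(X[idx], y[idx])
        forest.append((tree, idx))          # keep the in-bag indices for OOB use later
    return forest

def predict_majority(forest, X_new):
    # Equation (1): majority vote over the K tree predictions
    votes = np.stack([tree.predict(X_new) for tree, _ in forest]).astype(int)
    return np.array([np.bincount(votes[:, i]).argmax() for i in range(X_new.shape[0])])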

3. Background

3.1. Random Forest Algorithm. Given a training dataset $L = \{(X_i, Y_i)\}_{i=1}^{N}$, $X_i \in R^M$, $Y \in \{1, 2, \ldots, c\}$, where the $X_i$ are features (also called predictor variables), Y is a class response feature, N is the number of training samples, and M is the number of features, and a random forest model RF described in Algorithm 1, let $\hat{Y}_k$ be the prediction of tree $T_k$ given input X. The prediction of the random forest with K trees is

$$\hat{Y} = \text{majority vote}\,\{\hat{Y}_k\}_{1}^{K}. \quad (1)$$

Since each tree is grown from a bagged sample set, it is grown with only about two-thirds of the samples in L, called in-bag samples. About one-third of the samples are left out, and these samples are called out-of-bag (OOB) samples, which are used to estimate the prediction error.

The OOB predicted value is $\hat{Y}_{\text{OOB}_i} = (1/|\mathcal{O}_{i'}|)\sum_{k \in \mathcal{O}_{i'}} \hat{Y}_k$, where $\mathcal{O}_{i'} = L \setminus \mathcal{O}_i$, $i$ and $i'$ are in-bag and out-of-bag sampled indices, and $|\mathcal{O}_{i'}|$ is the size of the OOB subdataset. The OOB prediction error is

$$\text{Err}_{\text{OOB}} = \frac{1}{N_{\text{OOB}}} \sum_{i=1}^{N_{\text{OOB}}} E\left(Y, \hat{Y}_{\text{OOB}}\right), \quad (2)$$

where $E(\cdot)$ is an error function and $N_{\text{OOB}}$ is the OOB sample size.
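A matching sketch of the OOB error in equation (2) follows, assuming the forest is the list of (tree, in-bag indices) pairs returned by the sketch above and that class labels are the integers 0, ..., c−1; a 0/1 loss plays the role of the error function E(·):

import numpy as np

def oob_error(forest, X, y, n_classes):
    N = len(y)
    votes = np.zeros((N, n_classes))
    for tree, in_bag in forest:
        oob = np.setdiff1d(np.arange(N), in_bag)   # samples this tree never saw
        if oob.size:
            pred = tree.predict(X[oob]).astype(int)
            votes[oob, pred] += 1
    covered = votes.sum(axis=1) > 0                # samples with at least one OOB vote
    y_hat_oob = votes[covered].argmax(axis=1)      # OOB majority prediction
    return np.mean(y_hat_oob != y[covered])        # 0/1 error averaged over N_OOB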

3.2. Measurement of Feature Importance Score from an RF. Breiman presented a permutation technique to measure the importance of features in the prediction [1], called an out-of-bag importance score. The basic idea for measuring this kind of importance score of features is to compute the difference between the original mean error and the randomly permuted mean error in OOB samples. The method stochastically rearranges all values of the jth feature in the OOB samples for each tree, uses the RF model to predict on this permuted feature, and gets the mean error. The aim of this permutation is to eliminate the existing association between the jth feature and the Y values and then to test the effect of this on the RF model. A feature is considered to be strongly associated with the response if the prediction degrades dramatically (i.e., the mean error increases) after this permutation.

The other kind of feature importance measure can be obtained while the random forest is growing. This is described as follows. At each node t in a decision tree, the split is determined by the decrease in node impurity ΔR(t). The node impurity R(t) is the Gini index. If a subdataset in node t contains samples from c classes, Gini(t) is defined as

$$R(t) = 1 - \sum_{j=1}^{c} p_j^2, \quad (3)$$

where $p_j$ is the relative frequency of class j in t. Gini(t) is minimized if the classes in t are skewed. After splitting t into two child nodes $t_1$ and $t_2$ with sample sizes $N_1(t)$ and $N_2(t)$, the Gini index of the split data is defined as

$$\text{Gini}_{split}(t) = \frac{N_1(t)}{N(t)}\,\text{Gini}(t_1) + \frac{N_2(t)}{N(t)}\,\text{Gini}(t_2). \quad (4)$$

The feature providing the smallest $\text{Gini}_{split}(t)$ is chosen to split the node. The importance score of feature $X_j$ in a single decision tree $T_k$ is

$$IS_k(X_j) = \sum_{t \in T_k} \Delta R(t), \quad (5)$$

and it is computed over all K trees in a random forest, defined as

$$IS(X_j) = \frac{1}{K} \sum_{k=1}^{K} IS_k(X_j). \quad (6)$$

It is worth noting that a random forest uses in-bag samples to produce a kind of importance measure, called an in-bag importance score. This is the main difference between the in-bag importance score and an out-of-bag measure, which is produced from the decrease of the prediction error of the RF on OOB samples. In other words, the in-bag importance score requires less computation time than the out-of-bag measure.
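The following small sketch restates equations (3)-(6): the Gini impurity of a node, the weighted impurity of a split, and the in-bag importance score obtained by averaging the per-tree impurity decreases over the K trees; the names are illustrative.

import numpy as np

def gini(labels):
    # Equation (3): R(t) = 1 - sum_j p_j^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(y_left, y_right):
    # Equation (4): sample-size-weighted Gini of the two child nodes
    n1, n2 = len(y_left), len(y_right)
    return (n1 * gini(y_left) + n2 * gini(y_right)) / (n1 + n2)

def in_bag_importance(per_tree_decreases):
    # per_tree_decreases[k][j]: sum over the nodes of tree k of the impurity
    # decreases Delta R(t) credited to feature j (equation (5));
    # equation (6) averages these sums over the K trees.
    return np.asarray(per_tree_decreases).mean(axis=0)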


4. Our Approach

4.1. Issues in Feature Selection on High-Dimensional Data. When Breiman et al. suggested the classification and regression tree (CART) model, they noted that feature selection is biased because it is based on an information gain criterion, called the multivalue problem [2]. Random forest methods are based on CART trees [1]; hence this bias is carried over to the random forest RF model. In particular, the importance scores can be biased when very high-dimensional data contains multiple data types. Several methods have been proposed to correct the bias of feature importance measures [18–21]. The conditional inference framework (referred to as cRF [22]) could be successfully applied for both the null and power cases [19, 20, 22]. The typical characteristic of the power case is that only one predictor feature is important, while the rest of the features are redundant with different cardinality. In contrast, in the null case, all features used for prediction are redundant with different cardinality. Although the methods of this kind were well investigated and can be used to address the multivalue problem, there are still some unsolved problems, such as the need to specify in advance the probability distributions, as well as the fact that they struggle when applied to high-dimensional data.

Another issue is that, in high-dimensional data, when the number of features is large, the fraction of important features remains very small. In this case, the original RF model, which uses simple random sampling, is likely to perform poorly with small m, and the trees are likely to select an uninformative feature as a split too frequently (m denotes the subspace size of features). At each node t of a tree, the probability of uninformative feature selection is too high.

To illustrate this issue, let G be the number of noisy features, denote by M the total number of predictor features, and let the remaining M − G features be important ones which have a high correlation with the Y values. Then, if we use simple random sampling when growing trees to select a subset of m features (m ≪ M), the total number of subsets consisting only of important features is $\binom{M-G}{m}$ and the total number of all feature subsets is $\binom{M}{m}$. The probability of selecting a subset of m (m > 1) important features is given by

$$\frac{\binom{M-G}{m}}{\binom{M}{m}} = \frac{(M-G)(M-G-1)\cdots(M-G-m+1)}{M(M-1)\cdots(M-m+1)} = \frac{(1 - G/M)\cdots(1 - G/M - (m-1)/M)}{(1 - 1/M)\cdots(1 - (m-1)/M)} \simeq \left(1 - \frac{G}{M}\right)^{m}. \quad (7)$$

Because the fraction of important features is too small, the probability in (7) tends to 0, which means that the important features are rarely selected by the simple sampling method in RF [1]. For example, with 5 informative and 5000 noisy or uninformative features, assuming $m = \sqrt{5 + 5000} \simeq 70$, the probability of an informative feature being selected at any split is 0.068.
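The quoted figure can be checked numerically; one way to reproduce 0.068 is as the probability that a random subspace of size m = 70 contains at least one of the 5 informative features (this reading is our assumption, since (7) itself gives the probability that all m selected features are important):

from math import comb, isqrt

M_total, informative = 5005, 5
m = isqrt(M_total)                                   # ~70, the sqrt(M) subspace size
p_all_noise = comb(M_total - informative, m) / comb(M_total, m)
print(round(1 - p_all_noise, 3))                     # prints 0.068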

4.2. Bias Correction for Feature Selection and Feature Weighting. The bias correction in feature selection is intended to enable the RF model to avoid selecting an uninformative feature. To correct this kind of bias in the feature selection stage, we generate shadow features to add to the original dataset. The shadow feature set contains the same values, possible cut-points, and distribution as the original features but has no association with the Y values. To create each shadow feature, we rearrange the values of the feature in the original dataset R times to create the corresponding shadow. This disturbance of features eliminates the correlations of the features with the response value but keeps their attributes. The shadow feature participates only in the competition for the best split and decreases the probability of selecting this kind of uninformative feature. For the feature weight computation, we first need to distinguish the important features from the less important ones. To do so, we run a defined number of random forests to obtain raw importance scores, each of which is obtained using (6). Then we use the Wilcoxon rank-sum test [23], which compares the importance score of a feature with the maximum importance score of the generated noisy features, called shadows. The shadow features are added to the original dataset, and they have no prediction power with respect to the response feature. Therefore, any feature whose importance score is smaller than the maximum importance score of the noisy features is considered less important; otherwise, it is considered important. Having computed the Wilcoxon rank-sum test, we can compute the p-value for the feature. The p-value of a feature in the Wilcoxon rank-sum test is assigned as a weight to the feature $X_j$, p-value ∈ [0, 1], and this weight indicates the importance of the feature in the prediction. The smaller the p-value of a feature, the more correlated the predictor feature is to the response feature, and therefore the more powerful the feature is in prediction. The feature weight computation is described as follows.

Let M be the number of features in the original dataset, and denote the feature set as $S_X = \{X_j,\; j = 1, 2, \ldots, M\}$. In each replicate r (r = 1, 2, ..., R), shadow features are generated from the features $X_j$ in $S_X$: we randomly permute all values of $X_j$ to get a corresponding shadow feature $A_j$; denote the shadow feature set as $S_A = \{A_j\}_{1}^{M}$. The extended feature set is denoted by $S_{X,A} = S_X \cup S_A$.

Let the importance score of $S_{X,A}$ at replicate r be $IS^{r}_{X,A} = \{IS^{r}_{X}, IS^{r}_{A}\}$, where $IS^{r}_{X_j}$ and $IS^{r}_{A_j}$ are the importance scores of $X_j$ and $A_j$ at the rth replicate, respectively. We built a random forest model RF from the $S_{X,A}$ dataset to compute 2M importance scores for the 2M features. We repeated the same process R times to compute R replicates, getting $IS_{X_j} = \{IS^{r}_{X_j}\}_{1}^{R}$ and $IS_{A_j} = \{IS^{r}_{A_j}\}_{1}^{R}$. From the replicates of the shadow features, we extracted the maximum value from the rth row of $IS_{A_j}$ and put it into the comparison sample, denoted by $IS^{\max}_{A}$. For each data feature $X_j$, we computed the Wilcoxon test and performed the hypothesis test on $IS_{X_j} > IS^{\max}_{A}$ to calculate the p-value for the feature. Given a statistical significance level, we can identify important features from less important ones. This test confirms that if a feature is important, it consistently scores higher than the shadow over multiple permutations. This method has been presented in [24, 25].

In each node of the trees, each shadow $A_j$ shares approximately the same properties as the corresponding $X_j$, but it is independent of Y and consequently has approximately the same probability of being selected as a splitting candidate. This feature permutation method can reduce bias due to different measurement levels of $X_j$ according to the p-value and can yield a correct ranking of features according to their importance.
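As an illustration of the weighting step, the sketch below generates one shadow per feature in each replicate, fits a forest on the extended data, and compares each feature's R importance scores against the per-replicate maxima of the shadow scores with a one-sided Wilcoxon rank-sum test. scikit-learn's feature_importances_ is used as a stand-in for the paper's in-bag importance score, so this is an approximation rather than the authors' exact procedure:

import numpy as np
from scipy.stats import ranksums
from sklearn.ensemble import RandomForestClassifier

def feature_pvalues(X, y, R=30, seed=0):
    rng = np.random.RandomState(seed)
    N, M = X.shape
    is_x = np.zeros((R, M))          # IS^r_{X_j}, importance of the real features
    is_max_a = np.zeros(R)           # IS^max_A, best shadow score in replicate r
    for r in range(R):
        shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(M)])
        rf = RandomForestClassifier(n_estimators=100, random_state=r)
        rf.fit(np.hstack([X, shadows]), y)
        scores = rf.feature_importances_
        is_x[r], is_max_a[r] = scores[:M], scores[M:].max()
    # One-sided test of IS_{X_j} > IS^max_A; a small p-value marks an informative feature
    return np.array([ranksums(is_x[:, j], is_max_a, alternative="greater").pvalue
                     for j in range(M)])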

4.3. Unbiased Feature Weighting for Subspace Selection. Given the p-values of all features, we first set a significance level as the threshold θ, for instance θ = 0.05. Any feature whose p-value is greater than θ is considered an uninformative feature and is removed from the system; otherwise, its relationship with Y is assessed. We now consider the set of features X obtained from L after neglecting all uninformative features.

Second, we find the best subset of features which is highly related to the response feature; a correlation measure, the function χ²(X, Y), is used to test the association between the categorical response feature and each feature $X_j$. Each observation is allocated to one cell of a two-dimensional array of cells (called a contingency table) according to the values of (X, Y). If there are r rows and c columns in the table and N is the total number of samples, the value of the test statistic is

$$\chi^{2} = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}. \quad (8)$$

For the test of independence, a chi-squared probability of less than or equal to 0.05 is commonly interpreted as justification for rejecting the hypothesis that the row variable is independent of the column feature.

Let $X_s$ be the best subset of features; we collect all features $X_j$ whose p-value from the χ² statistical test according to (8) is smaller than or equal to 0.05. The remaining features $X \setminus X_s$ are added to $X_w$. This approach is described in Algorithm 2. We independently sample features from the two subsets and put them together as the subspace features for splitting the data at any node, recursively. The two subsets partition the set of informative features in data without irrelevant features. Given $X_s$ and $X_w$, at each node, we randomly select mtry (mtry > 1) features from the two groups of features. For a given subspace size, we can choose proportions between highly informative features and weakly informative features that depend on the sizes of the two groups, that is, $mtry_s = \lceil mtry \times (|X_s|/|X|)\rceil$ and $mtry_w = \lfloor mtry \times (|X_w|/|X|)\rfloor$, where $|X_s|$ and $|X_w|$ are the numbers of features in the groups of highly informative features $X_s$ and weakly informative features $X_w$, respectively, and $|X|$ is the number of informative features in the input dataset. These are merged to form the feature subspace for splitting the node.
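A sketch of the chi-square partitioning follows; numeric features are binned to build the contingency table of equation (8), which is an assumption on our part since the paper does not spell out a discretisation rule, and scipy's chi2_contingency supplies the test:

import numpy as np
from scipy.stats import chi2_contingency

def partition_features(X, y, informative_idx, alpha=0.05, bins=10):
    classes, y_idx = np.unique(y, return_inverse=True)
    strong, weak = [], []
    for j in informative_idx:
        edges = np.histogram_bin_edges(X[:, j], bins=bins)
        binned = np.digitize(X[:, j], edges)
        table = np.zeros((bins + 2, len(classes)))
        for b, c in zip(binned, y_idx):
            table[b, c] += 1
        table = table[table.sum(axis=1) > 0]      # drop empty rows of the contingency table
        _, p, _, _ = chi2_contingency(table)      # chi-square statistic of equation (8)
        (strong if p <= alpha else weak).append(j)
    return strong, weak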

4.4. Our Proposed RF Algorithm. In this section, we present our new random forest algorithm, called xRF, which uses the new unbiased feature sampling method to generate splits at the nodes of CART trees [2]. The proposed algorithm includes the following main steps: (i) weighting the features using the feature permutation method, (ii) identifying all unbiased features and partitioning them into two groups $X_s$ and $X_w$, (iii) building the RF using subspaces containing features which are taken randomly and separately from $X_s$ and $X_w$, and (iv) classifying new data. The new algorithm is summarized as follows (a sketch of the per-node sampling in step (7)(a) is given after this list).

(1) Generate the extended dataset $S_{X,A}$ of 2M dimensions by permuting the corresponding predictor feature values for shadow features.
(2) Build a random forest model RF from $\{S_{X,A}, Y\}$ and compute R replicates of raw importance scores of all predictor features and shadows with RF. Extract the maximum importance score of each replicate to form the comparison sample $IS^{\max}_{A}$ of R elements.
(3) For each predictor feature, take the R importance scores and compute the Wilcoxon test to get the p-value, that is, the weight of the feature.
(4) Given a significance level threshold θ, neglect all uninformative features.
(5) Partition the remaining features into two subsets $X_s$ and $X_w$, as described in Algorithm 2.
(6) Sample the training set L with replacement to generate bagged samples $L_1, L_2, \ldots, L_K$.
(7) For each $L_k$, grow a CART tree $T_k$ as follows:
  (a) At each node, select a subspace of mtry (mtry > 1) features randomly and separately from $X_s$ and $X_w$, and use the subspace features as candidates for splitting the node.
  (b) Each tree is grown nondeterministically, without pruning, until the minimum node size $n_{\min}$ is reached.
(8) Given $X = x_{\text{new}}$, use (1) to predict the response value.
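The following is a minimal sketch of the per-node subspace sampling in step (7)(a), using the proportional sizes mtry_s = ⌈mtry · |X_s|/|X|⌉ and mtry_w = ⌊mtry · |X_w|/|X|⌋ from Section 4.3 (the helper name is ours):

import numpy as np
from math import ceil, floor

def sample_subspace(strong, weak, mtry, rng=None):
    # strong, weak: index lists for the highly and weakly informative groups X_s, X_w
    rng = rng or np.random.default_rng()
    n = len(strong) + len(weak)
    mtry_s = min(len(strong), ceil(mtry * len(strong) / n))
    mtry_w = min(len(weak), floor(mtry * len(weak) / n))
    picked = []
    if mtry_s:
        picked += list(rng.choice(strong, size=mtry_s, replace=False))
    if mtry_w:
        picked += list(rng.choice(weak, size=mtry_w, replace=False))
    return picked                      # candidate features for splitting this node

For example, with |X_s| = 30, |X_w| = 70, and mtry = 10, the node subspace would contain 3 strong and 7 weak features.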

5. Experiments

5.1. Datasets. Real-world datasets, including image datasets and microarray datasets, were used in our experiments. Image classification and object recognition are important problems in computer vision. We conducted experiments on four benchmark image datasets, including the Caltech categories dataset (http://www.vision.caltech.edu/html-files/archive.html), the Horse dataset (http://pascal.inrialpes.fr/data/horses), the extended YaleB database [26], and the AT&T ORL dataset [27].

For the Caltech dataset, we use a subset of 100 images from the Caltech face dataset and 100 images from the Caltech background dataset, following the setting in ICCV (http://people.csail.mit.edu/torralba/shortCourseRLOC). The extended YaleB database consists of 2414 face images of 38 individuals captured under various lighting conditions. Each image has been cropped to a size of 192 × 168 pixels and normalized.


Input: the training dataset L and a random forest RF; R, θ: the number of replicates and the threshold.
Output: $X_s$ and $X_w$.
(1) Let $S_X = L \setminus \{Y\}$, $M = |S_X|$.
(2) for r ← 1 to R do
(3)   $S_A$ ← permute($S_X$)
(4)   $S_{X,A} = S_X \cup S_A$
(5)   Build the RF model from $S_{X,A}$ to produce $\{IS^{r}_{X_j}\}$,
(6)   $\{IS^{r}_{A_j}\}$, and $IS^{\max}_{A}$ (j = 1, ..., M).
(7) Set X = ∅.
(8) for j ← 1 to M do
(9)   Compute the Wilcoxon rank-sum test with $IS_{X_j}$ and $IS^{\max}_{A}$.
(10)  Compute the $p_j$ value for feature $X_j$.
(11)  if $p_j \le \theta$ then
(12)    X = X ∪ {$X_j$} ($X_j \in S_X$).
(13) Set $X_s$ = ∅, $X_w$ = ∅.
(14) Compute the χ²(X, Y) statistic to get the $p_j$ values.
(15) for j ← 1 to |X| do
(16)   if ($p_j$ < 0.05) then
(17)     $X_s$ = $X_s$ ∪ {$X_j$} ($X_j \in$ X).
(18) $X_w$ = X \ $X_s$.
(19) return $X_s$, $X_w$.

Algorithm 2: Feature subspace selection.

The Horse dataset consists of 170 images containing horses for the positive class and 170 images of the background for the negative class. The AT&T ORL dataset includes 400 face images of 40 persons.

In the experiments, we use a bag-of-words representation of image features for the Caltech and the Horse datasets. To obtain feature vectors using the bag-of-words method, image patches (subwindows) are sampled from the training images at the detected interest points or on a dense grid. A visual descriptor is then applied to these patches to extract the local visual features. A clustering technique is then used to cluster these, and the cluster centers are used as visual code words to form the visual codebook. An image is then represented as a histogram of these visual words. A classifier is then learned from this feature set for classification.

In our experiments, traditional k-means quantization is used to produce the visual codebook. The number of cluster centers can be adjusted to produce different vocabularies, that is, dimensions of the feature vectors. For the Caltech and Horse datasets, nine codebook sizes were used in the experiments to create 18 datasets, as follows: CaltechM300, CaltechM500, CaltechM1000, CaltechM3000, CaltechM5000, CaltechM7000, CaltechM10000, CaltechM12000, CaltechM15000, and HorseM300, HorseM500, HorseM1000, HorseM3000, HorseM5000, HorseM7000, HorseM10000, HorseM12000, HorseM15000, where M denotes the codebook size.

For the face datasets, we use two types of features: eigenface [28] and random features (randomly sampled pixels from the images). We used four groups of datasets with four different numbers of dimensions, M30, M56, M120, and M504. In total, we created 16 subdatasets: YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504, YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, YaleBRandomfaceM504, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504, and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, ORLRandomM504.

Table 1: Description of the real-world datasets, sorted by the number of features and grouped into two groups, microarray data and real-world datasets, accordingly.

Dataset | No. of features | No. of training | No. of tests | No. of classes
Colon | 2000 | 62 | — | 2
Srbct | 2308 | 63 | — | 4
Leukemia | 3051 | 38 | — | 2
Lymphoma | 4026 | 62 | — | 3
breast2class | 4869 | 78 | — | 2
breast3class | 4869 | 96 | — | 3
nci | 5244 | 61 | — | 8
Brain | 5597 | 42 | — | 5
Prostate | 6033 | 102 | — | 2
adenocarcinoma | 9868 | 76 | — | 2
Fbis | 2000 | 1711 | 752 | 17
La2s | 12432 | 1855 | 845 | 6
La1s | 13195 | 1963 | 887 | 6

The properties of the remaining datasets are summarized in Table 1. The Fbis dataset was compiled from the archive of the Foreign Broadcast Information Service, and the La1s and La2s datasets were taken from the archive of the Los Angeles Times for TREC-5 (http://trec.nist.gov). The ten gene datasets used are described in [11, 17]; they are always high dimensional and fall within a category of classification problems which deal with a large number of features and small samples. Regarding the characteristics of the datasets given in Table 1, the Fbis, La1s, and La2s subdatasets each provide separate training and testing portions, which were used individually.

5.2. Evaluation Methods. We calculated measures such as the error bound (c/s2), strength (s), and correlation (ρ), according to the formulas given in Breiman's method [1]. The correlation measure indicates the dependence between trees in a forest, whereas the average strength corresponds to the accuracy of individual trees. Lower correlation and higher strength result in a reduction of the general error bound measured by (c/s2), which indicates a highly accurate RF model.
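For reference, these quantities enter Breiman's analysis as follows (restated from [1], with $\overline{\rho}$ the mean pairwise correlation of the trees, s the strength, and $PE^{*}$ the generalization error):

$$PE^{*} \le \frac{\overline{\rho}\,(1 - s^{2})}{s^{2}}, \qquad c/s2 = \frac{\overline{\rho}}{s^{2}},$$

so that a lower correlation or a higher strength tightens the bound.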

Two measures are also used to evaluate the accuracy of prediction on the test datasets: one is the area under the curve (AUC), and the other one is the test accuracy (Acc), defined as

$$\text{Acc} = \frac{1}{N}\sum_{i=1}^{N} I\left(Q(d_i, y_i) - \max_{j \ne y_i} Q(d_i, j) > 0\right), \quad (9)$$

where $I(\cdot)$ is the indicator function and $Q(d_i, j) = \sum_{k=1}^{K} I(h_k(d_i) = j)$ is the number of votes for $d_i \in \mathcal{D}_t$ on class j, $h_k$ is the kth tree classifier, N is the number of samples in the test data $\mathcal{D}_t$, and $y_i$ indicates the true class of $d_i$.
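A small sketch of equation (9) follows: a test sample counts as correctly classified only when its true class receives strictly more tree votes than any other class (function and variable names are ours, and integer class labels are assumed):

import numpy as np

def vote_accuracy(tree_predictions, y_true):
    # tree_predictions: shape (K, N), the class predicted by each of the K trees
    K, N = tree_predictions.shape
    n_classes = int(max(tree_predictions.max(), y_true.max())) + 1
    correct = 0
    for i in range(N):
        q = np.bincount(tree_predictions[:, i].astype(int), minlength=n_classes)
        q_true = q[y_true[i]]
        q[y_true[i]] = -1              # exclude the true class before taking the max
        correct += int(q_true - q.max() > 0)
    return correct / N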

5.3. Experimental Settings. The latest R packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model is available in the RRF R package. The wsRF model, which uses the weighted sampling method [13], was intended to solve classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. From each fold, we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold dataset. The mtry and n_min parameters were set to √M and 1, respectively. The experimental results were evaluated with two measures, AUC and the test accuracy, according to (9).

We compared, across a wide range, the performances of the models on the 10 gene datasets used in [11]. The results from the application of GRRF, varSelRF, and LASSO logistic regression on the ten gene datasets are presented in [17]. These three gene selection methods used the RF R package [30] as the classifier. For the comparison of the methods, we used the same settings as presented in [17]; for the coefficient γ we used a value of 0.1, because GRRF(0.1) has shown competitive accuracy [17] when applied to the 10 gene datasets. One hundred models were generated with different seeds from each training dataset, and each model contained 1000 trees. The mtry and n_min parameters had the same settings as on the image datasets. From each of the datasets, two-thirds of the data were randomly selected for training. The other one-third of the dataset was used to validate the models. For comparison, Breiman's RF method, the weighted sampling random forest wsRF model, and the xRF model were used in the experiments. The guided regularized random forest GRRF [17] and the two well-known feature selection methods using RF as a classifier, namely, varSelRF [31] and LASSO logistic regression [32], were also used to evaluate the accuracy of prediction on the high-dimensional datasets.

On the remaining datasets, the prediction performances of the ten random forest models were evaluated; each one was built with 500 trees. The number of feature candidates to split a node was mtry = ⌈log2(M) + 1⌉. The minimal node size n_min was 1. The xRF model with the new unbiased feature sampling method is a new implementation. We implemented the xRF model as multithread processes, while the other models were run as single-thread processes. We used R to call the corresponding C/C++ functions. All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel Xeon CPU E5620 2.40 GHz, 16 cores, 4MB cache, and 32GB main memory.

5.4. Results on Image Datasets. Figures 1 and 2 show the average recognition rates of the models on the different subdatasets of the YaleB and ORL datasets. The GRRF model produced slightly better results on the subdataset ORLRandomM120 and on the ORL dataset using eigenface, and it showed competitive accuracy with the xRF model in some cases in both the YaleB and ORL datasets, for example, YaleBEigenM120, ORLRandomM56, and ORLRandomM120. The reason could be that this kind of dataset contains many truly informative features. Therefore, when the informative feature set was large, the chance of selecting informative features in the subspace increased, which in turn increased the average recognition rates of the GRRF model. However, the xRF model produced the best results in the remaining cases. The effect of the new approach for feature subspace selection is clearly demonstrated in these results, although these datasets are not high dimensional.

Figures 3 and 5 present the box plots of the test accuracy (mean ± std-dev), and Figures 4 and 6 show the box plots of the AUC measures of the models on the 18 image subdatasets of Caltech and Horse, respectively. From these figures, we can observe that the accuracy and the AUC measures of the models GRRF, wsRF, and xRF increased on all high-dimensional subdatasets when the selected subspace mtry was not so large. This implies that when the number of features in the subspace is small, the proportion of informative features in the feature subspace is comparatively large in the three models. There is then a high chance that highly informative features are selected in the trees, so the overall performance of individual trees is increased. In Breiman's method, many randomly selected subspaces may not contain informative features, which affects the performance of trees grown from these subspaces. It can be seen that the xRF model outperformed the other random forest models on these subdatasets in increasing the test accuracy and the AUC measures. This was because the new unbiased feature sampling was used in generating trees in the xRF model; the feature subspace provided enough highly informative features at any level of the decision trees. The effect of the unbiased feature selection method is clearly demonstrated in these results.

Figure 1: Recognition rates of the models (RF, GRRF, wsRF, xRF) on the YaleB subdatasets, namely, YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504, and YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, and YaleBRandomfaceM504. (a) YaleB + eigenface; (b) YaleB + randomface. Axes: feature dimension of the subdatasets versus recognition rate (%).

Figure 2: Recognition rates of the models on the ORL subdatasets, namely, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504, and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, and ORLRandomM504. (a) ORL + eigenface; (b) ORL + randomface.

Table 2 shows the results of c/s2 against the number of codebook sizes on the Caltech and Horse datasets. In a random forest, each tree is grown from a bagged training set. Out-of-bag estimates were used to evaluate the strength, correlation, and c/s2. The GRRF model was not considered in this experiment because this method aims to find a small subset of features, and the same RF model in the R package [30] is used as the classifier. We compared the xRF model with two kinds of random forest models, RF and wsRF. From this table, we can observe that the lowest c/s2 values occurred when the wsRF model was applied to the Caltech dataset. However, the xRF model produced the lowest error bound on the Horse dataset. These results demonstrate that the new unbiased feature sampling method can reduce the upper bound of the generalization error in random forests.

Table 3 presents the prediction accuracies (mean ± std-dev) of the models on the subdatasets CaltechM3000, HorseM3000, YaleBEigenfaceM504, YaleBrandomfaceM504, ORLEigenfaceM504, and ORLrandomfaceM504. In these experiments, we used the four models to generate random forests with different sizes, from 20 trees to 200 trees. For each size, we used each model to generate 10 random forests for the 10-fold cross-validation and computed the average accuracy of the 10 results. The GRRF model showed slightly better results on YaleBEigenfaceM504 with different tree sizes.

Figure 3: Box plots of the test accuracy of the nine Caltech subdatasets (CaltechM300 to CaltechM15000; models RF, GRRF, wsRF, and xRF).

The wsRF model produced the best prediction performance in some cases when applied to the small subdatasets YaleBEigenfaceM504, ORLEigenfaceM504, and ORLrandomfaceM504. However, the xRF model produced, respectively, the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forest models in classification accuracy in most cases in all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy measures changed only slightly when varying the number of trees.

5.5. Results on Microarray Datasets. Table 5 shows the average test accuracy results of the 100 random forest models, computed according to (9), on the gene datasets. The average number of genes selected by the xRF model over 100 repetitions for each dataset is shown on the right of Table 5, divided into two groups, X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on the two gene datasets srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most cases of the remaining gene datasets.

Figure 4: Box plots of the AUC measures of the nine Caltech subdatasets (models RF, GRRF, wsRF, and xRF).

The detailed results, containing the median and the variance values, are presented in Figure 7 with box plots. Only the GRRF model was used for this comparison; the LASSO logistic regression and varSelRF methods for feature selection were not considered in this experiment because their accuracies are lower than that of the GRRF model, as shown in [17]. We can see that the xRF model achieved the highest average accuracy of prediction on nine datasets out of ten. Its result was significantly different on the prostate gene dataset, and the variance was also smaller than those of the other models.

Figure 8 shows the box plots of the (c/s2) error bound of the RF, wsRF, and xRF models on the ten gene datasets over 100 repetitions. The wsRF model obtained a lower error bound rate on five gene datasets out of 10. The xRF model produced a significantly different error bound rate on two gene datasets and obtained the lowest error rate on three datasets. This implies that when the optimal parameters, such as mtry = ⌈√M⌉ and n_min = 1, were used in growing trees, the number of genes in the subspace was not small, out-of-bag data was used in prediction, and the results were comparatively favorable to the xRF model.

5.6. Comparison of Prediction Performance for Various Numbers of Features and Trees. Table 6 shows the average c/s2 error bound and the test accuracy results of 10 repetitions of the random forest models on the three large datasets. The xRF model produced the lowest error bound c/s2 on the La1s dataset, while the wsRF model showed a lower error bound on the other two datasets, Fbis and La2s.

Figure 5: Box plots of the test accuracy of the nine Horse subdatasets (HorseM300 to HorseM15000; models RF, GRRF, wsRF, and xRF).

The RF model demonstrated the worst accuracy of prediction compared to the other models; this model also produced a large c/s2 error when the small subspace size mtry = ⌈log2(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest accuracy of prediction on all three large datasets.

Figure 9 shows the plots of the performance curves of the RF models when the number of trees and features increases. The number of trees was increased stepwise by 20 trees, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees increased stepwise by 50 trees, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features, each consisting of a random sum of five inputs, varied from 5 to 100, and for each, 200 trees were combined. The vertical line in each plot indicates the size of the feature subspace mtry = ⌈log2(M) + 1⌉. This subspace was suggested by Breiman [1] for the case of low-dimensional datasets. Three feature selection methods, namely, GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that, when the mtry value is large, the computational time required by the GRRF and varSelRF models to deal with large high-dimensional datasets is too long [17].

Figure 6: Box plots of the AUC measures of the nine Horse subdatasets (models RF, GRRF, wsRF, and xRF).

It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace mtry = ⌈log2(M) + 1⌉ was used. However, the xRF model is better than the wsRF model at increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve a higher accuracy of prediction, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve its best prediction performance. These empirical results indicate that, for applications on high-dimensional data, when the xRF model uses a small subspace, the achieved results can be satisfactory.

However, the RF model, using the simple sampling method for feature selection [1], could achieve good prediction performance only if it is provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested

Table 2: The (c/s2) error bound results of random forest models against the number of codebook sizes on the Caltech and Horse datasets. The bold value in each row indicates the best result.

Dataset | Model | 300 | 500 | 1000 | 3000 | 5000 | 7000 | 10000 | 12000 | 15000
Caltech | xRF | 0.312 | 0.271 | 0.280 | 0.287 | 0.357 | 0.440 | 0.650 | 0.742 | 0.789
Caltech | RF | 0.369 | 0.288 | 0.294 | 0.327 | 0.435 | 0.592 | 0.908 | 1.114 | 3.611
Caltech | wsRF | 0.413 | 0.297 | 0.268 | 0.221 | 0.265 | 0.333 | 0.461 | 0.456 | 0.789
Horse | xRF | 0.266 | 0.262 | 0.246 | 0.277 | 0.259 | 0.298 | 0.275 | 0.288 | 0.382
Horse | RF | 0.331 | 0.342 | 0.354 | 0.374 | 0.417 | 0.463 | 0.519 | 0.537 | 0.695
Horse | wsRF | 0.429 | 0.414 | 0.391 | 0.295 | 0.288 | 0.333 | 0.295 | 0.339 | 0.455

Figure 7: Box plots of the test accuracy of the models (RF, GRRF, wsRF, xRF) on the ten gene datasets (Colon, Srbct, Leukemia, Lymphoma, Breast2class, Breast3class, nci, Brain, Prostate, Adenocarcinoma).

Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.

Dataset | Model | K = 20 | K = 50 | K = 80 | K = 100 | K = 200
CaltechM3000 | xRF | 95.50 ± 2 | 96.50 ± 1 | 96.50 ± 2 | 97.00 ± 1 | 97.50 ± 2
CaltechM3000 | RF | 70.00 ± 7 | 76.00 ± 9 | 77.50 ± 12 | 82.50 ± 16 | 81.50 ± 2
CaltechM3000 | wsRF | 91.50 ± 4 | 91.00 ± 3 | 93.00 ± 2 | 94.50 ± 4 | 92.00 ± 9
CaltechM3000 | GRRF | 93.00 ± 2 | 96.00 ± 2 | 94.50 ± 2 | 95.00 ± 3 | 94.00 ± 2
HorseM3000 | xRF | 80.59 ± 4 | 81.76 ± 2 | 79.71 ± 6 | 80.29 ± 1 | 77.65 ± 5
HorseM3000 | RF | 50.59 ± 10 | 52.94 ± 8 | 56.18 ± 4 | 58.24 ± 5 | 57.35 ± 9
HorseM3000 | wsRF | 62.06 ± 4 | 68.82 ± 3 | 67.65 ± 3 | 67.65 ± 5 | 65.88 ± 7
HorseM3000 | GRRF | 65.00 ± 9 | 63.53 ± 3 | 68.53 ± 3 | 63.53 ± 9 | 71.18 ± 4
YaleBEigenfaceM504 | xRF | 75.68 ± 1 | 85.65 ± 1 | 88.08 ± 1 | 88.94 ± 0 | 91.22 ± 0
YaleBEigenfaceM504 | RF | 71.93 ± 1 | 79.48 ± 1 | 80.69 ± 1 | 81.67 ± 1 | 82.89 ± 1
YaleBEigenfaceM504 | wsRF | 77.60 ± 1 | 85.61 ± 0 | 88.11 ± 0 | 89.31 ± 0 | 90.68 ± 0
YaleBEigenfaceM504 | GRRF | 74.73 ± 0 | 84.70 ± 1 | 87.25 ± 0 | 89.61 ± 0 | 91.89 ± 0
YaleBrandomfaceM504 | xRF | 94.71 ± 0 | 97.64 ± 0 | 98.01 ± 0 | 98.22 ± 0 | 98.59 ± 0
YaleBrandomfaceM504 | RF | 88.00 ± 0 | 92.59 ± 0 | 94.13 ± 0 | 94.86 ± 0 | 96.06 ± 0
YaleBrandomfaceM504 | wsRF | 95.40 ± 0 | 97.90 ± 0 | 98.17 ± 0 | 98.14 ± 0 | 98.38 ± 0
YaleBrandomfaceM504 | GRRF | 95.66 ± 0 | 98.10 ± 0 | 98.42 ± 0 | 98.92 ± 0 | 98.84 ± 0
ORLEigenfaceM504 | xRF | 76.25 ± 6 | 87.25 ± 3 | 91.75 ± 2 | 93.25 ± 2 | 94.75 ± 2
ORLEigenfaceM504 | RF | 71.75 ± 2 | 78.75 ± 4 | 82.00 ± 3 | 82.75 ± 3 | 85.50 ± 5
ORLEigenfaceM504 | wsRF | 78.25 ± 4 | 88.75 ± 3 | 90.00 ± 1 | 91.25 ± 2 | 92.50 ± 2
ORLEigenfaceM504 | GRRF | 73.50 ± 6 | 85.00 ± 2 | 90.00 ± 1 | 90.75 ± 3 | 94.75 ± 1
ORLrandomfaceM504 | xRF | 87.75 ± 3 | 92.50 ± 2 | 95.50 ± 1 | 94.25 ± 1 | 96.00 ± 1
ORLrandomfaceM504 | RF | 77.50 ± 3 | 82.00 ± 7 | 84.50 ± 2 | 87.50 ± 2 | 86.00 ± 2
ORLrandomfaceM504 | wsRF | 87.00 ± 5 | 93.75 ± 2 | 93.75 ± 0 | 95.00 ± 1 | 95.50 ± 1
ORLrandomfaceM504 | GRRF | 87.25 ± 1 | 93.25 ± 1 | 94.50 ± 1 | 94.25 ± 1 | 95.50 ± 1

Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The best value in each row is the highest one.

Dataset       Model  K = 20      K = 50      K = 80      K = 100     K = 200
CaltechM3000  xRF    99.5 ± 0    99.9 ± 5    100 ± 2     100 ± 1     100 ± 1
              RF     85.1 ± 7    81.7 ± 4    82.6 ± 12   86.5 ± 6    86.4 ± 1
              wsRF   84.1 ± 1    84.5 ± 8    83.4 ± 7    85.0 ± 8    87.0 ± 9
              GRRF   84.6 ± 1    86.0 ± 2    86.2 ± 1    90.8 ± 1    92.3 ± 1
HorseM3000    xRF    84.9 ± 1    88.7 ± 0    89.5 ± 0    89.8 ± 0    89.7 ± 0
              RF     63.7 ± 4    66.4 ± 7    69.2 ± 15   69.6 ± 3    73.3 ± 9
              wsRF   63.5 ± 8    68.7 ± 4    67.9 ± 6    67.1 ± 4    71.8 ± 9
              GRRF   78.6 ± 3    77.8 ± 3    78.5 ± 8    69.9 ± 1    80.6 ± 4

Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The numbers of genes in the strong group X_s and the weak group X_w used in xRF are also given.

Dataset         xRF    RF     wsRF   GRRF   varSelRF  LASSO  |X_s|  |X_w|
colon           87.65  84.35  84.50  86.45  76.80     82.00  245    317
srbct           97.71  95.90  96.76  97.57  96.50     99.30  606    546
Leukemia        89.25  82.58  84.83  87.25  89.30     92.40  502    200
Lymphoma        99.30  97.15  98.10  99.10  97.80     99.10  1404   275
breast2class    78.84  62.72  63.40  71.32  61.40     63.40  194    631
breast3class    65.42  56.00  57.19  63.55  58.20     60.00  724    533
nci             74.15  58.85  59.40  63.05  58.20     60.40  247    1345
Brain           81.93  70.79  70.79  74.79  76.90     74.10  1270   1219
Prostate        92.56  88.71  90.79  92.85  91.50     91.20  601    323
Adenocarcinoma  90.88  84.04  84.12  85.52  78.80     81.10  108    669


Table 6: The prediction accuracy and the c/s² error bound of the models using a small subspace mtry = ⌈log2(M) + 1⌉; better values are lower error bounds and higher accuracies.

         c/s² error bound         Test accuracy (%)
Dataset  RF     wsRF   xRF        RF     GRRF   wsRF   xRF      |X_s|  |X_w|
Fbis     2.149  1.179  1.209      76.42  76.51  84.14  84.69    201    555
La2s     1.526  0.904  0.780      66.77  67.99  87.26  88.61    353    1136
La1s     4.08   0.892  1.499      77.76  80.49  86.03  87.21    220    1532

Figure 8: Box plots of the (c/s²) error bound of the RF, wsRF, and xRF models applied to the 10 gene datasets (Colon, Srbct, Leukemia, Lymphoma, Breast2class, Breast3class, nci, Brain, Prostate, and Adenocarcinoma).

by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered one of the contributions of our work.

6. Conclusions

We have presented a new method of feature subspace selection for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach for unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (or noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,


Figure 9: The prediction accuracy of the three random forest models (RF, wsRF, and xRF) against the number of trees (left panels) and the number of features in the subspace (right panels) on the three datasets: (a) Fbis, (b) La2s, and (c) La1s. The vertical line in each right panel marks the subspace size log(M) + 1.


especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.

For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 10201-201117.

References

[1] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.

[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.

[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.

[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.

[6] Y. Freund and R. E. Schapire, "A desicion-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.

[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.

[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.

[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.

[10] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.

[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.

[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.

[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.

[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.

[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.

[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.

[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.

[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.

[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.

[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.

[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.

[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.

[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.

[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.

[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.

[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.

[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.

[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.

[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.

[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.

[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: lasso and elastic-net regularized generalized linear models," R package, 2010, http://CRAN.R-project.org/package=glmnet.


We first use random forests to measure the importance of features and produce raw feature importance scores. Then we apply a statistical Wilcoxon rank-sum test to separate informative features from uninformative ones; all uninformative features are neglected by defining a threshold θ, for instance θ = 0.05. Second, we use the chi-square statistical test (χ²) to compute the score relating each feature to the response feature. We then partition the set of remaining informative features into two subsets, one containing highly informative features and the other containing weakly informative features. We independently sample features from the two subsets and merge them together to get a new subspace of features, which is used for splitting the data at nodes. Since the subspace always contains highly informative features, which can guarantee a better split at a node, this feature sampling method avoids selecting biased features and generates trees from bagged sample data with higher accuracy. This sampling method also reduces dimensionality and the amount of data needed for training the random forest model. Our experimental results have shown that random forests with this weighted feature selection technique outperformed recently proposed random forests in prediction accuracy; we also applied the new approach to microarray and image data and achieved outstanding results.

The structure of this paper is organized as follows. In Section 2 we give a brief summary of related works. In Section 3 we give a brief summary of random forests and the measurement of feature importance scores. Section 4 describes our newly proposed algorithm using unbiased feature selection. Section 5 provides the experimental results, evaluations, and comparisons. Section 6 gives our conclusions.

2. Related Works

Random forests are an ensemble approach to making classification decisions by voting the results of individual decision trees. An ensemble learner with excellent generalization accuracy has two properties: high accuracy of each component learner and high diversity among the component learners [5]. Unlike other ensemble methods, such as bagging [1] and boosting [6, 7], which create basic classifiers from random samples of the training data, the random forest approach creates the basic classifiers from randomly selected subspaces of the data [8, 9]. The randomly selected subspaces increase the diversity of the basic classifiers learnt by a decision tree algorithm.

Feature importance is the importance measure of features in the feature selection process [1, 10–14]. In RF frameworks, the most commonly used importance score of a given feature is the mean error of a tree in the forest when the observed values of this feature are randomly permuted in the out-of-bag samples. Feature selection is an important step in obtaining good performance for an RF model, especially when dealing with high-dimensional data problems.

Regarding feature weighting techniques, Xu et al. [13] recently proposed an improved RF method which uses a novel feature weighting method for subspace selection and therefore enhances classification performance on high-dimensional data. The weights of the features were calculated by the information gain ratio or the χ²-test. Ye et al. [14] then used these weights to propose a stratified sampling method to select feature subspaces for RF in classification problems. Chen et al. [15] used a stratification idea to propose a new clustering method. However, the implementation of the random forest model suggested by Ye et al. is based on a binary classification setting, and it uses linear discriminant analysis as the splitting criterion. This stratified RF model is not efficient on high-dimensional datasets with multiple classes. In a similar way, for solving the two-class problem, Amaratunga et al. [16] presented a feature weighting method for subspace sampling to deal with microarray data; the t-test of variance analysis is used to compute weights for the features. Genuer et al. [12] proposed a strategy involving a ranking of explanatory features using the RF importance scores and a stepwise ascending feature introduction strategy. Deng and Runger [17] proposed a guided regularized RF (GRRF), in which the importance scores from an ordinary random forest (RF) are used to guide the feature selection process. They found that the least regularized subset selected by their GRRF with minimal regularization ensures better accuracy than the complete feature set. However, a regular RF was used as the classifier, due to the fact that the regularized RF may have higher variance than RF because the trees are correlated.

Several methods have been proposed to correct the bias of importance measures in the feature selection process in RFs to improve the prediction accuracy [18–21]. These methods intend to avoid selecting uninformative features for node splitting in decision trees. Although methods of this kind have been well investigated and can be used to address the high-dimensional problem, there are still some unsolved problems, such as the need to specify the probability distributions in advance, as well as the fact that they struggle when applied to large high-dimensional data.

In summary, in the reviewed approaches, the gain at higher levels of the tree is weighted differently than the gain at lower levels of the tree. In fact, at lower levels of the tree, the gain is reduced because of the effect of splits on different features at higher levels of the tree. This affects the final prediction performance of the RF model. To remedy this, in this paper we propose a new method for unbiased feature subset selection in high-dimensional space to build RFs. Our approach differs from previous approaches in the techniques used to partition the subset of features. All uninformative features (considered as noise) are removed from the system, and the best feature set, which is highly related to the response feature, is found using a statistical method. The proposed sampling method always provides enough highly informative features for the feature subspace at any level of the decision trees. For the case of growing an RF model on data without noise, we used in-bag measures. This is a different importance score of features, which requires less computational time compared to the measures used by others. Our experimental results show that our approach outperformed recently proposed RF methods.


Input: L = {(X_i, Y_i)}_{i=1}^{N}, X ∈ R^M, Y ∈ {1, 2, ..., c}: the training dataset;
       K: the number of trees; mtry: the size of the subspaces.
Output: A random forest RF.
(1) for k ← 1 to K do
(2)   Draw a bagged subset of samples L_k from L.
(4)   while (stopping criteria is not met) do
(5)     Select randomly mtry features.
(6)     for m ← 1 to mtry do
(7)       Compute the decrease in the node impurity.
(8)     Choose the feature which decreases the impurity the most; the node is divided into two children nodes.
(9) Combine the K trees to form a random forest.

Algorithm 1: Random forest algorithm.

3. Background

3.1. Random Forest Algorithm. Given a training dataset L = {(X_i, Y_i)}_{i=1}^{N}, where X_i ∈ R^M are features (also called predictor variables), Y ∈ {1, 2, ..., c} is the class response feature, N is the number of training samples, and M is the number of features, and given a random forest model RF as described in Algorithm 1, let \hat{Y}_k be the prediction of tree T_k given input X. The prediction of the random forest with K trees is

\[ \hat{Y} = \text{majority vote}\,\{\hat{Y}_k\}_{1}^{K}. \quad (1) \]

Since each tree is grown from a bagged sample set, it is grown with only about two-thirds of the samples in L, called in-bag samples. About one-third of the samples is left out; these samples are called out-of-bag (OOB) samples and are used to estimate the prediction error.

The OOB predicted value is \(\hat{Y}_{\mathrm{OOB},i} = (1/|\mathcal{O}_{i'}|)\sum_{k \in \mathcal{O}_{i'}} \hat{Y}_k\), where \(\mathcal{O}_{i'} = \mathcal{L} \setminus \mathcal{O}_i\); here i and i' denote in-bag and out-of-bag sample indices and \(|\mathcal{O}_{i'}|\) is the size of the OOB subdataset. The OOB prediction error is

\[ \text{Err}_{\text{OOB}} = \frac{1}{N_{\text{OOB}}}\sum_{i=1}^{N_{\text{OOB}}} E\!\left(Y_i, \hat{Y}_{\mathrm{OOB},i}\right), \quad (2) \]

where E(·) is an error function and N_OOB is the OOB sample size.
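To make (1) and (2) concrete, the following R sketch (our own illustration, not code from the paper; `train` and `test` are assumed to be data frames whose class label is the factor column `Y`) fits a forest with the randomForest package, reads off the OOB error estimate, and predicts new samples by majority vote.

```r
# Minimal sketch of (1) and (2) with the randomForest package.
library(randomForest)

set.seed(1)
rf <- randomForest(Y ~ ., data = train, ntree = 500,
                   mtry = floor(sqrt(ncol(train) - 1)), nodesize = 1)

# OOB prediction error Err_OOB of (2): each sample is predicted only by the
# trees for which it was out-of-bag, and the misclassification rate is reported.
err_oob <- rf$err.rate[rf$ntree, "OOB"]

# Majority-vote prediction of (1) for unseen data.
y_hat <- predict(rf, newdata = test, type = "response")
```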

3.2. Measurement of Feature Importance Score from an RF. Breiman presented a permutation technique to measure the importance of features in the prediction [1], called an out-of-bag importance score. The basic idea for measuring this kind of importance score is to compute the difference between the original mean error and the randomly permuted mean error in the OOB samples. The method stochastically rearranges all values of the j-th feature in the OOB samples of each tree and uses the RF model to predict with this permuted feature and obtain the mean error. The aim of this permutation is to eliminate the existing association between the j-th feature and the Y values and then to test the effect of this on the RF model. A feature is considered to be strongly associated if this difference in mean error is large.

The other kind of feature importance measure can be obtained while the random forest is growing. It is described as follows. At each node t in a decision tree, the split is determined by the decrease in node impurity ΔR(t). The node impurity R(t) is the Gini index. If a subdataset in node t contains samples from c classes, Gini(t) is defined as

\[ R(t) = 1 - \sum_{j=1}^{c} p_j^2, \quad (3) \]

where p_j is the relative frequency of class j in t. Gini(t) is minimized if the classes in t are skewed. After splitting t into two child nodes t_1 and t_2 with sample sizes N_1(t) and N_2(t), the Gini index of the split data is defined as

\[ \text{Gini}_{\text{split}}(t) = \frac{N_1(t)}{N(t)}\,\text{Gini}(t_1) + \frac{N_2(t)}{N(t)}\,\text{Gini}(t_2). \quad (4) \]

The feature providing the smallest Gini_split(t) is chosen to split the node. The importance score of feature X_j in a single decision tree T_k is

\[ \text{IS}_k(X_j) = \sum_{t \in T_k} \Delta R(t), \quad (5) \]

and it is computed over all K trees in a random forest, defined as

\[ \text{IS}(X_j) = \frac{1}{K}\sum_{k=1}^{K} \text{IS}_k(X_j). \quad (6) \]

It is worth noting that a random forest uses in-bag samples to produce this kind of importance measure, called an in-bag importance score. This is the main difference between the in-bag importance score and the out-of-bag measure, which is produced from the decrease of the prediction error of the RF on OOB samples. In other words, the in-bag importance score requires less computation time than the out-of-bag measure.
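The two importance scores described above are both exposed by the randomForest package; the sketch below (an illustration under the assumption that `X` is the predictor data frame and `Y` the class factor) extracts the in-bag Gini-based score of (5)-(6) and, for comparison, the permutation-based out-of-bag score.

```r
library(randomForest)

rf <- randomForest(x = X, y = Y, ntree = 500, importance = TRUE)

# In-bag importance of (5)-(6): decrease in Gini impurity from splits on each
# feature, accumulated over the trees of the forest ("MeanDecreaseGini").
is_inbag <- importance(rf, type = 2)

# Out-of-bag permutation importance: mean decrease in accuracy when the values
# of a feature are permuted in the OOB samples ("MeanDecreaseAccuracy").
is_oob <- importance(rf, type = 1)
```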


4. Our Approach

4.1. Issues in Feature Selection on High-Dimensional Data. When Breiman et al. suggested the classification and regression tree (CART) model, they noted that feature selection is biased because it is based on an information gain criterion, called the multivalue problem [2]. Random forest methods are based on CART trees [1]; hence this bias is carried over to the random forest RF model. In particular, the importance scores can be biased when very high-dimensional data contain multiple data types. Several methods have been proposed to correct the bias of feature importance measures [18–21]. The conditional inference framework (referred to as cRF [22]) has been successfully applied to both the null and the power cases [19, 20, 22]. The typical characteristic of the power case is that only one predictor feature is important, while the rest of the features are redundant with different cardinality. In contrast, in the null case, all features used for prediction are redundant with different cardinality. Although methods of this kind are well investigated and can be used to address the multivalue problem, there are still some unsolved problems, such as the need to specify the probability distributions in advance, as well as the fact that they struggle when applied to high-dimensional data.

Another issue is that, in high-dimensional data, when the number of features is large, the fraction of informative features remains small. In this case the original RF model, which uses simple random sampling, is likely to perform poorly with a small m, and the trees are likely to select an uninformative feature as a split too frequently (m denotes the subspace size of features). At each node t of a tree, the probability of selecting an uninformative feature is too high.

To illustrate this issue, let G be the number of noisy features, denote by M the total number of predictor features, and let the remaining M - G features be important ones which have a high correlation with the Y values. If we use simple random sampling when growing trees to select a subset of m features (m ≪ M), the total number of subsets consisting only of important features is C^m_{M-G}, and the total number of all feature subsets is C^m_M. The probability of selecting a subset of m (m > 1) important features is given by

\[ \frac{C_{M-G}^{m}}{C_{M}^{m}} = \frac{(M-G)(M-G-1)\cdots(M-G-m+1)}{M(M-1)\cdots(M-m+1)} = \frac{(1 - G/M)\cdots(1 - G/M - m/M + 1/M)}{(1 - 1/M)\cdots(1 - m/M + 1/M)} \simeq \left(1 - \frac{G}{M}\right)^{m}. \quad (7) \]

Because the fraction of important features is too small, the probability in (7) tends to 0, which means that the important features are rarely selected by the simple sampling method in RF [1]. For example, with 5 informative and 5000 noisy or uninformative features, assuming m = sqrt(5 + 5000) ≃ 70, the probability of an informative feature being selected at any split is 0.068.
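The effect described by (7) is easy to verify numerically; the R sketch below reproduces the example above (5 informative features among 5005) and also computes the chance that at least one informative feature enters a subspace of size m, which is the 0.068 figure quoted in the text.

```r
M <- 5005                 # total number of features: 5 informative + 5000 noisy
G <- 5000                 # number of noisy (uninformative) features
m <- floor(sqrt(M))       # subspace size at each split, about 70

# Probability (7) that a simple random subspace of size m contains only
# informative features; it is zero here because m exceeds M - G = 5.
p_all_informative <- choose(M - G, m) / choose(M, m)

# Probability that at least one informative feature appears in the subspace.
p_at_least_one <- 1 - choose(G, m) / choose(M, m)   # about 0.068

print(c(p_all_informative, p_at_least_one))
```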

4.2. Bias Correction for Feature Selection and Feature Weighting. The bias correction in feature selection is intended to make the RF model avoid selecting uninformative features. To correct this kind of bias in the feature selection stage, we generate shadow features and add them to the original dataset. The shadow feature set contains the same values, possible cut-points, and distribution as the original features but has no association with the Y values. To create each shadow feature, we rearrange the values of the feature in the original dataset R times to create the corresponding shadow. This disturbance of the features eliminates their correlation with the response value but keeps their other attributes. The shadow features participate only in the competition for the best split, and thereby decrease the probability of selecting uninformative features. For the feature weight computation, we first need to distinguish the important features from the less important ones. To do so, we run a defined number of random forests to obtain raw importance scores, each of which is obtained using (6). Then we use the Wilcoxon rank-sum test [23], which compares the importance score of a feature with the maximum importance score of the generated noisy features, called shadows. The shadow features are added to the original dataset and have no prediction power for the response feature. Therefore, any feature whose importance score is smaller than the maximum importance score of the noisy features is considered less important; otherwise, it is considered important. Having computed the Wilcoxon rank-sum test, we can compute the p-value for the feature. The p-value of a feature in the Wilcoxon rank-sum test, p-value ∈ [0, 1], is assigned as a weight to the feature X_j, and this weight indicates the importance of the feature in the prediction. The smaller the p-value of a feature, the more correlated the predictor feature is to the response feature, and therefore the more powerful the feature is in prediction. The feature weight computation is described as follows.

Let M be the number of features in the original dataset and denote the feature set as S_X = {X_j, j = 1, 2, ..., M}. In each replicate r (r = 1, 2, ..., R), shadow features are generated from the features X_j in S_X: we randomly permute all values of X_j to obtain a corresponding shadow feature A_j, and we denote the shadow feature set as S_A = {A_j}_1^M. The extended feature set is denoted by S_XA = S_X ∪ S_A.

Let the importance score of S_XA at replicate r be IS^r_XA = {IS^r_X, IS^r_A}, where IS^r_{X_j} and IS^r_{A_j} are the importance scores of X_j and A_j at the r-th replicate, respectively. We build a random forest model RF from the S_XA dataset to compute 2M importance scores for the 2M features. We repeat the same process R times to obtain R replicates, getting IS_{X_j} = {IS^r_{X_j}}_1^R and IS_{A_j} = {IS^r_{A_j}}_1^R. From the replicates of the shadow features, we extract the maximum value of the r-th row of {IS^r_{A_j}} and put it into the comparison sample, denoted by IS^max_A. For each data feature X_j, we compute the Wilcoxon test and perform the hypothesis test IS_{X_j} > IS^max_A to calculate the p-value for the feature. Given a statistical significance level, we can separate important features from less important ones. This test confirms that, if a feature is important, it consistently


scores higher than the shadow over multiple permutations. This method has been presented in [24, 25].

In each node of the trees, each shadow A_j shares approximately the same properties as the corresponding X_j, but it is independent of Y and consequently has approximately the same probability of being selected as a splitting candidate. This feature permutation method can reduce the bias due to different measurement levels of X_j, according to the p-value, and can yield a correct ranking of features according to their importance.
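The shadow-feature weighting step can be sketched as follows, assuming `X` is a data frame of predictors and `Y` the class factor (the function and object names are ours, not from the paper's implementation). Each replicate permutes every column to build the shadows, refits a forest on the extended data, and records the raw scores; the weight of a feature is the p-value of a one-sided Wilcoxon rank-sum test of its R scores against the R maxima of the shadow scores.

```r
library(randomForest)

# Sketch of the feature-weighting step: p-values from shadow features.
shadow_pvalues <- function(X, Y, R = 30, ntree = 100) {
  M <- ncol(X)
  scores  <- matrix(NA, nrow = R, ncol = M)   # IS^r_{X_j}
  max_shd <- numeric(R)                       # IS^max_A, one value per replicate

  for (r in 1:R) {
    shadows <- as.data.frame(lapply(X, sample))        # permute each column
    names(shadows) <- paste0("shadow_", names(X))
    rf  <- randomForest(x = cbind(X, shadows), y = Y, ntree = ntree)
    imp <- importance(rf, type = 2)[, 1]               # in-bag (Gini) scores
    scores[r, ] <- imp[1:M]
    max_shd[r]  <- max(imp[(M + 1):(2 * M)])
  }

  # One-sided Wilcoxon rank-sum test: are a feature's scores shifted above
  # the maxima of the shadow scores? Small p-values indicate informative features.
  sapply(1:M, function(j)
    wilcox.test(scores[, j], max_shd, alternative = "greater")$p.value)
}
```

Features whose p-value exceeds the threshold θ (0.05 in the experiments) are then dropped as uninformative.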

4.3. Unbiased Feature Weighting for Subspace Selection. Given the p-values of all features, we first set a significance level as the threshold θ, for instance θ = 0.05. Any feature whose p-value is greater than θ is considered an uninformative feature and is removed from the system; otherwise, its relationship with Y is assessed. We now consider the set of features X obtained from L after neglecting all uninformative features.

Second, we find the best subset of features which is highly related to the response feature; a correlation measure function χ²(X, Y) is used to test the association between the categorical response feature and each feature X_j. Each observation is allocated to one cell of a two-dimensional array of cells (called a contingency table) according to the values of (X, Y). If there are r rows and c columns in the table and N is the total number of samples, the value of the test statistic is

\[ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}. \quad (8) \]

For the test of independence, a chi-squared probability of less than or equal to 0.05 is commonly interpreted as justification for rejecting the hypothesis that the row variable is independent of the column feature.

Let X_s be the best subset of features; we collect all features X_j whose p-value is smaller than or equal to 0.05 as a result of the χ² statistical test according to (8). The remaining features, X \ X_s, are added to X_w; this approach is described in Algorithm 2. We independently sample features from the two subsets and put them together as the subspace of features for splitting the data at any node, recursively. The two subsets partition the set of informative features in the data without irrelevant features. Given X_s and X_w, at each node we randomly select mtry (mtry > 1) features from the two groups of features. For a given subspace size, we choose the proportions of highly informative features and weakly informative features depending on the sizes of the two groups, that is, mtry_s = ⌈mtry × (|X_s|/|X|)⌉ and mtry_w = ⌊mtry × (|X_w|/|X|)⌋, where |X_s| and |X_w| are the numbers of features in the group of highly informative features X_s and the group of weakly informative features X_w, respectively, and |X| is the number of informative features in the input dataset. These are merged to form the feature subspace for splitting the node.
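A short sketch of the partition into the strong group X_s and the weak group X_w and of the per-group subspace sizes is given below; `informative` is assumed to hold the names of the features that survived the Wilcoxon filter, and for continuous features some discretization would be needed before tabulating them against Y.

```r
# Split the informative features by a chi-square test of independence with Y.
split_strong_weak <- function(X, Y, informative, alpha = 0.05) {
  pvals <- sapply(informative, function(f)
    suppressWarnings(chisq.test(table(X[[f]], Y))$p.value))
  list(Xs = informative[pvals <  alpha],   # highly informative (strong) group
       Xw = informative[pvals >= alpha])   # weakly informative group
}

# Per-group subspace sizes mtry_s and mtry_w for a total subspace of size mtry.
subspace_sizes <- function(n_strong, n_weak, mtry) {
  n <- n_strong + n_weak
  c(mtry_s = ceiling(mtry * n_strong / n),
    mtry_w = floor(mtry * n_weak / n))
}
```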

4.4. Our Proposed RF Algorithm. In this section, we present our new random forest algorithm, called xRF, which uses the new unbiased feature sampling method to generate splits at the nodes of CART trees [2]. The proposed algorithm includes the following main steps: (i) weighting the features using the feature permutation method, (ii) identifying all unbiased features and partitioning them into two groups X_s and X_w, (iii) building the RF using subspaces containing features which are taken randomly and separately from X_s and X_w, and (iv) classifying new data. The new algorithm is summarized as follows.

(1) Generate the extended dataset S_XA of 2M dimensions by permuting the corresponding predictor feature values for shadow features.

(2) Build a random forest model RF from {S_XA, Y} and compute R replicates of raw importance scores of all predictor features and shadows with RF. Extract the maximum importance score of each replicate to form the comparison sample IS^max_A of R elements.

(3) For each predictor feature, take the R importance scores and compute the Wilcoxon test to get the p-value, that is, the weight of the feature.

(4) Given a significance level threshold θ, neglect all uninformative features.

(5) Partition the remaining features into two subsets X_s and X_w, as described in Algorithm 2.

(6) Sample the training set L with replacement to generate bagged samples L_1, L_2, ..., L_K.

(7) For each L_k, grow a CART tree T_k as follows:
    (a) At each node, select a subspace of mtry (mtry > 1) features randomly and separately from X_s and X_w and use these subspace features as candidates for splitting the node.
    (b) Each tree is grown nondeterministically, without pruning, until the minimum node size n_min is reached.

(8) Given X = x_new, use (1) to predict the response value.
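Step (7)(a) amounts to drawing mtry_s feature names from X_s and mtry_w from X_w at every node and handing the union to the split search; the helper below shows that logic only (the actual xRF implementation performs this inside the C/C++ tree-growing code, so this R sketch is purely illustrative).

```r
# Stratified subspace draw used at every node (step (7)(a)); Xs and Xw are
# character vectors of feature names from the strong and weak groups.
sample_subspace <- function(Xs, Xw, mtry) {
  n      <- length(Xs) + length(Xw)
  mtry_s <- ceiling(mtry * length(Xs) / n)
  mtry_w <- floor(mtry * length(Xw) / n)
  c(sample(Xs, min(mtry_s, length(Xs))),
    sample(Xw, min(mtry_w, length(Xw))))
}

# Example: with |Xs| = 245 and |Xw| = 317 (the colon dataset in Table 5) and
# mtry = ceiling(sqrt(562)) = 24, every node sees 11 strong and 13 weak candidates.
```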

5. Experiments

5.1. Datasets. Real-world datasets, including image datasets and microarray datasets, were used in our experiments. Image classification and object recognition are important problems in computer vision. We conducted experiments on four benchmark image datasets: the Caltech categories dataset (http://www.vision.caltech.edu/html-files/archive.html), the Horse dataset (http://pascal.inrialpes.fr/data/horses/), the extended YaleB database [26], and the AT&T ORL dataset [27].

For the Caltech dataset, we use a subset of 100 images from the Caltech face dataset and 100 images from the Caltech background dataset, following the setting in ICCV (http://people.csail.mit.edu/torralba/shortCourseRLOC/). The extended YaleB database consists of 2414 face images of 38 individuals captured under various lighting conditions. Each image has been cropped to a size of 192 × 168 pixels


Input: The training dataset L and a random forest RF; R, θ: the number of replicates and the threshold.
Output: X_s and X_w.
(1) Let S_X = L \ {Y}; M = |S_X|.
(2) for r ← 1 to R do
(3)   S_A ← permute(S_X)
(4)   S_XA = S_X ∪ S_A
(5)   Build the RF model from S_XA to produce {IS^r_{X_j}},
(6)   {IS^r_{A_j}}, and IS^max_A (j = 1, ..., M)
(7) Set X = ∅
(8) for j ← 1 to M do
(9)   Compute the Wilcoxon rank-sum test with IS_{X_j} and IS^max_A
(10)  Compute the p_j value for feature X_j
(11)  if p_j ≤ θ then
(12)    X = X ∪ {X_j} (X_j ∈ S_X)
(13) Set X_s = ∅, X_w = ∅
(14) Compute the χ²(X, Y) statistic to get the p_j value
(15) for j ← 1 to |X| do
(16)   if (p_j < 0.05) then
(17)     X_s = X_s ∪ {X_j} (X_j ∈ X)
(18) X_w = X \ X_s
(19) return X_s, X_w

Algorithm 2: Feature subspace selection.

and normalized. The Horse dataset consists of 170 images containing horses for the positive class and 170 images of the background for the negative class. The AT&T ORL dataset includes 400 face images of 40 persons.

In the experiments we use a bag-of-words model for image feature representation for the Caltech and Horse datasets. To obtain feature vectors with the bag-of-words method, image patches (subwindows) are sampled from the training images at detected interest points or on a dense grid. A visual descriptor is then applied to these patches to extract the local visual features. A clustering technique is used to cluster these features, and the cluster centers are used as visual code words to form a visual codebook. An image is then represented as a histogram of these visual words, and a classifier is learned from this feature set for classification.

In our experiments, traditional k-means quantization is used to produce the visual codebook. The number of cluster centers can be adjusted to produce different vocabularies, that is, different dimensions of the feature vectors. For the Caltech and Horse datasets, nine codebook sizes were used in the experiments to create 18 datasets as follows: CaltechM300, CaltechM500, CaltechM1000, CaltechM3000, CaltechM5000, CaltechM7000, CaltechM10000, CaltechM12000, CaltechM15000, and HorseM300, HorseM500, HorseM1000, HorseM3000, HorseM5000, HorseM7000, HorseM10000, HorseM12000, HorseM15000, where M denotes the codebook size.
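A rough sketch of this bag-of-words step with ordinary k-means is given below, assuming `descriptors` is a matrix of local descriptors pooled from the training images and `img_desc` holds the descriptors of a single image (all names and sizes are illustrative only, not the paper's pipeline).

```r
set.seed(1)
codebook_size <- 1000                       # one of the nine sizes used (300 ... 15000)
km <- kmeans(descriptors, centers = codebook_size, iter.max = 50)

# Represent one image as a histogram of visual words: assign each local
# descriptor to its nearest cluster centre and count the assignments.
assign_words <- function(img_desc, centers) {
  d <- as.matrix(dist(rbind(centers, img_desc)))
  d <- d[-(1:nrow(centers)), 1:nrow(centers), drop = FALSE]
  nearest <- max.col(-d)                    # index of the closest centre per descriptor
  tabulate(nearest, nbins = nrow(centers))  # visual-word histogram (feature vector)
}

feature_vector <- assign_words(img_desc, km$centers)
```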

For the face datasets, we use two types of features: eigenfaces [28] and random features (randomly sampled pixels from the images). We used four groups of datasets with four different numbers of dimensions: M30, M56, M120, and M504. In total, we created 16 subdatasets: YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504, YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, YaleBRandomfaceM504, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504, and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, ORLRandomM504.

Table 1: Description of the real-world datasets, sorted by the number of features and grouped into two groups, microarray data and real-world datasets, accordingly.

Dataset         No. of features  No. of training samples  No. of test samples  No. of classes
Colon           2000             62                       —                    2
Srbct           2308             63                       —                    4
Leukemia        3051             38                       —                    2
Lymphoma        4026             62                       —                    3
breast2class    4869             78                       —                    2
breast3class    4869             96                       —                    3
nci             5244             61                       —                    8
Brain           5597             42                       —                    5
Prostate        6033             102                      —                    2
adenocarcinoma  9868             76                       —                    2
Fbis            2000             1711                     752                  17
La2s            12432            1855                     845                  6
La1s            13195            1963                     887                  6

The properties of the remaining datasets are summarized in Table 1. The Fbis dataset was compiled from the archive of the Foreign Broadcast Information Service, and the La1s, La2s


datasets were taken from the archive of the Los Angeles Times for TREC-5 (http://trec.nist.gov). The ten gene datasets used are described in [11, 17]; they are high dimensional and fall within the category of classification problems which deal with a large number of features and a small number of samples. Regarding the characteristics of the datasets given in Table 1, the subdatasets Fbis, La1s, and La2s each come with their own training and testing partitions.

5.2. Evaluation Methods. We calculated measures such as the error bound (c/s²), strength (s), and correlation (ρ) according to the formulas given in Breiman's method [1]. The correlation measure indicates the independence of the trees in a forest, whereas the average strength corresponds to the accuracy of individual trees. Lower correlation and higher strength result in a reduction of the general error bound measured by (c/s²), which indicates a highly accurate RF model.

Two measures are also used to evaluate the accuracy of prediction on the test datasets: one is the area under the curve (AUC), and the other is the test accuracy (Acc), defined as

\[ \text{Acc} = \frac{1}{N}\sum_{i=1}^{N} I\!\left(Q(d_i, y_i) - \max_{j \neq y_i} Q(d_i, j) > 0\right), \quad (9) \]

where I(·) is the indicator function and Q(d_i, j) = Σ_{k=1}^{K} I(h_k(d_i) = j) is the number of votes for d_i ∈ D_t on class j, h_k is the k-th tree classifier, N is the number of samples in the test data D_t, and y_i indicates the true class of d_i.
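In code, (9) simply checks whether the true class of each test sample receives strictly more votes than every other class. A sketch is given below, assuming `votes` is an N × c matrix of per-class vote counts (with the randomForest package it can be obtained via predict(..., type = "vote", norm.votes = FALSE)) and `y` gives the column index of the true class of each sample.

```r
# Test accuracy Acc of (9): a sample is correct only when its true class gets
# strictly more votes than any other class.
test_accuracy <- function(votes, y) {
  correct <- sapply(seq_len(nrow(votes)), function(i) {
    votes[i, y[i]] - max(votes[i, -y[i]]) > 0   # y[i] is the true-class column index
  })
  mean(correct)
}
```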

5.3. Experimental Settings. The latest R-packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model is available in the RRF R-package. The wsRF model, which uses the weighted sampling method [13], is intended to solve classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. From each fold we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold. The mtry and n_min parameters were set to sqrt(M) and 1, respectively. The experimental results were evaluated with two measures: AUC and the test accuracy according to (9).

We compared across a wide range the performances of the models on the 10 gene datasets used in [11]. The results from the application of GRRF, varSelRF, and LASSO logistic regression on the ten gene datasets are presented in [17]. These three gene selection methods used the RF R-package [30] as the classifier. For the comparison of the methods, we used the same settings as presented in [17]; for the coefficient γ we used the value 0.1, because GRRF(0.1) has shown competitive accuracy [17] when applied to the 10 gene datasets. One hundred models were generated with different seeds from each training dataset, and each model contained 1000 trees. The mtry and n_min parameters had the same settings as for the image datasets. From each of the datasets, two-thirds of the data were randomly selected for training; the other one-third was used to validate the models. For comparison, Breiman's RF method, the weighted sampling random forest wsRF model, and the xRF model were used in the experiments. The guided regularized random forest GRRF [17] and two well-known feature selection methods that use RF as a classifier, namely, varSelRF [31] and LASSO logistic regression [32], were also used to evaluate the accuracy of prediction on the high-dimensional datasets.

For the remaining datasets, the prediction performances of the random forest models were evaluated; each model was built with 500 trees. The number of feature candidates to split a node was mtry = ⌈log2(M) + 1⌉. The minimal node size n_min was 1. The xRF model with the new unbiased feature sampling method is a new implementation. We implemented the xRF model as multithread processes, while the other models were run as single-thread processes. We used R to call the corresponding C/C++ functions. All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel Xeon E5620 2.40 GHz CPU (16 cores), 4 MB cache, and 32 GB main memory.
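For reference, the baseline configuration used for the comparison models on the text datasets corresponds to a call of the following form (a sketch; `train_x`, `train_y`, and `test_x` are placeholder objects), with mtry = ⌈log2(M) + 1⌉ and a minimum node size of 1.

```r
library(randomForest)

M    <- ncol(train_x)               # number of features
mtry <- ceiling(log2(M) + 1)        # small subspace size used on Fbis, La1s, and La2s

rf   <- randomForest(x = train_x, y = train_y,
                     ntree = 500, mtry = mtry, nodesize = 1)
pred <- predict(rf, newdata = test_x)
```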

5.4. Results on Image Datasets. Figures 1 and 2 show the average recognition rates of the models on different subdatasets of the YaleB and ORL datasets. The GRRF model produced slightly better results on the subdataset ORLRandomM120 and on the ORL dataset using eigenfaces, and it showed accuracy competitive with the xRF model in some cases in both the YaleB and ORL datasets, for example, YaleBEigenM120, ORLRandomM56, and ORLRandomM120. The reason could be that there are many truly informative features in this kind of dataset. When the informative feature set is large, the chance of selecting informative features in the subspace increases, which in turn increases the average recognition rate of the GRRF model. However, the xRF model produced the best results in the remaining cases. The effect of the new approach for feature subspace selection is clearly demonstrated in these results, although these datasets are not high dimensional.

Figures 3 and 5 present the box plots of the test accuracy (mean ± std-dev), and Figures 4 and 6 show the box plots of the AUC measures of the models on the 18 image subdatasets of Caltech and Horse, respectively. From these figures, we can observe that the accuracy and the AUC measures of the models GRRF, wsRF, and xRF increased on all high-dimensional subdatasets when the selected subspace mtry was not so large. This implies that, when the number of features in the subspace is small, the proportion of informative features in the feature subspace is comparatively large in the three models. There is then a high chance that highly informative features are selected in the trees, so the overall performance of individual trees is increased. In Breiman's method, many randomly selected subspaces may not contain informative features, which affects the performance of the trees grown from these subspaces. It can be seen that the xRF model outperformed the other random forest models on these subdatasets in terms of the test accuracy and the AUC measures. This was because the new unbiased feature sampling was used in generating trees in the xRF model; the feature subspace provided enough highly informative


Figure 1: Recognition rates (%) of the models (RF, GRRF, wsRF, and xRF) on the YaleB subdatasets, namely, YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504 and YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, and YaleBRandomfaceM504, against the feature dimension of the subdatasets. Panels: (a) YaleB + eigenface; (b) YaleB + randomface.

Figure 2: Recognition rates (%) of the models on the ORL subdatasets, namely, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504 and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, and ORLRandomM504, against the feature dimension of the subdatasets. Panels: (a) ORL + eigenface; (b) ORL + randomface.

features at any level of the decision trees. The effect of the unbiased feature selection method is clearly demonstrated in these results.

Table 2 shows the results of c/s² against the codebook size on the Caltech and Horse datasets. In a random forest, each tree is grown from a bagged training set, and out-of-bag estimates were used to evaluate the strength, correlation, and c/s². The GRRF model was not considered in this experiment because this method aims to find a small subset of features, and the same RF model in the R-package [30] is used as its classifier. We compared the xRF model with two kinds of random forest models, RF and wsRF. From this table, we can observe that the lowest c/s² values occurred when the wsRF model was applied to the Caltech dataset. However, the xRF model produced the lowest error bound on the Horse dataset. These results demonstrate that the new unbiased feature sampling method can reduce the upper bound of the generalization error in random forests.

Table 3 presents the prediction accuracies (mean ± std-dev) of the models on the subdatasets CaltechM3000, HorseM3000, YaleBEigenfaceM504, YaleBrandomfaceM504, ORLEigenfaceM504, and ORLrandomfaceM504. In these experiments, we used the four models to generate random forests with different sizes, from 20 trees to 200 trees. For each size, we used each model to generate 10 random forests for the 10-fold cross-validation and computed the average accuracy of the 10 results. The GRRF model showed slightly better results on YaleBEigenfaceM504 with


Figure 3: Box plots of the test accuracy (%) of the models (RF, GRRF, wsRF, and xRF) on the nine Caltech subdatasets (CaltechM300 to CaltechM15000).

different tree sizes. The wsRF model produced the best prediction performance in some cases when applied to the small subdatasets YaleBEigenfaceM504, ORLEigenfaceM504, and ORLrandomfaceM504. However, the xRF model produced the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forest models in classification accuracy in most cases across all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy changed only slightly when the number of trees was varied.

5.5. Results on Microarray Datasets. Table 5 shows the average test accuracy of the 100 random forest models computed according to (9) on the gene datasets. The average number of genes selected by the xRF model over 100 repetitions for each dataset is shown on the right of Table 5, divided into two groups, X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on the two gene datasets srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most of the remaining gene datasets.


Figure 4: Box plots of the AUC measures of the models (RF, GRRF, wsRF, and xRF) on the nine Caltech subdatasets.

The detailed results, including the median and variance values, are presented in Figure 7 with box plots. Only the GRRF model was used for this comparison; the LASSO logistic regression and the varSelRF feature selection method were not considered in this experiment, because their accuracies are lower than that of the GRRF model, as shown in [17]. We can see that the xRF model achieved the highest average prediction accuracy on nine datasets out of ten. Its result was significantly different on the prostate gene dataset, and its variance was also smaller than those of the other models.

Figure 8 shows the box plots of the (c/s²) error bound of the RF, wsRF, and xRF models on the ten gene datasets over 100 repetitions. The wsRF model obtained a lower error bound on five of the 10 gene datasets. The xRF model produced a significantly different error bound on two gene datasets and obtained the lowest error bound on three datasets. This implies that, when optimal parameters such as mtry = ⌈sqrt(M)⌉ and n_min = 1 were used in growing trees, the number of genes in the subspace was not small, out-of-bag data were used in prediction, and the results were comparatively favorable to the xRF model.

5.6. Comparison of Prediction Performance for Various Numbers of Features and Trees. Table 6 shows the average c/s² error bound and test accuracy results of 10 repetitions of the random forest models on the three large datasets. The xRF model produced the lowest c/s² error on the dataset La1s,


Figure 5: Box plots of the test accuracy (%) of the models (RF, GRRF, wsRF, and xRF) on the nine Horse subdatasets (HorseM300 to HorseM15000).

while the wsRF model showed a lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy compared to the other models; this model also produced a large c/s² error when the small subspace mtry = ⌈log2(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest prediction accuracy on all three large datasets.

Figure 9 shows the performance curves of the RF models as the number of trees and the number of features increase. The number of trees was increased stepwise by 20 trees, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees was increased stepwise by 50 trees, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈sqrt(M)⌉. The number of features, each consisting of a random sum of five inputs, varied from 5 to 100, and for each setting 200 trees were combined. The vertical line in each plot indicates the subspace size mtry = ⌈log2(M) + 1⌉. This subspace was suggested by Breiman [1] for the case of low-dimensional datasets. Three feature selection methods, namely, GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that, when the mtry value is large, the computational time required by the GRRF and varSelRF models to deal with these large datasets is too long [17].


Figure 6: Box plots of the AUC measures of the models (RF, GRRF, wsRF, and xRF) on the nine Horse subdatasets.

It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace mtry = ⌈log₂(M) + 1⌉ was used. However, the xRF model is better than the wsRF model in increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve a higher accuracy of prediction, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve the best prediction performance. These empirical results indicate that, for application on high-dimensional data, when the xRF model uses a small subspace the achieved results can be satisfactory.

However, the RF model using the simple sampling method for feature selection [1] could achieve good prediction performance only if it is provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large, high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested


Table 2: The (c/s²) error bound results of random forest models against the codebook size on the Caltech and Horse datasets. The bold value in each row indicates the best result.

Dataset  Model  300    500    1000   3000   5000   7000   10000  12000  15000
Caltech  xRF    0.312  0.271  0.280  0.287  0.357  0.440  0.650  0.742  0.789
Caltech  RF     0.369  0.288  0.294  0.327  0.435  0.592  0.908  1.114  3.611
Caltech  wsRF   0.413  0.297  0.268  0.221  0.265  0.333  0.461  0.456  0.789
Horse    xRF    0.266  0.262  0.246  0.277  0.259  0.298  0.275  0.288  0.382
Horse    RF     0.331  0.342  0.354  0.374  0.417  0.463  0.519  0.537  0.695
Horse    wsRF   0.429  0.414  0.391  0.295  0.288  0.333  0.295  0.339  0.455

Figure 7: Box plots of the test accuracy (%) of the RF, GRRF, wsRF, and xRF models on the ten gene datasets (Colon, Srbct, Leukemia, Lymphoma, Breast2class, Breast3class, nci, Brain, Prostate, and Adenocarcinoma).


Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.

Dataset               Model  K = 20      K = 50      K = 80      K = 100     K = 200
CaltechM3000          xRF    95.50 ± 2   96.50 ± 1   96.50 ± 2   97.00 ± 1   97.50 ± 2
                      RF     70.00 ± 7   76.00 ± 9   77.50 ± 12  82.50 ± 16  81.50 ± 2
                      wsRF   91.50 ± 4   91.00 ± 3   93.00 ± 2   94.50 ± 4   92.00 ± 9
                      GRRF   93.00 ± 2   96.00 ± 2   94.50 ± 2   95.00 ± 3   94.00 ± 2
HorseM3000            xRF    80.59 ± 4   81.76 ± 2   79.71 ± 6   80.29 ± 1   77.65 ± 5
                      RF     50.59 ± 10  52.94 ± 8   56.18 ± 4   58.24 ± 5   57.35 ± 9
                      wsRF   62.06 ± 4   68.82 ± 3   67.65 ± 3   67.65 ± 5   65.88 ± 7
                      GRRF   65.00 ± 9   63.53 ± 3   68.53 ± 3   63.53 ± 9   71.18 ± 4
YaleBEigenfaceM504    xRF    75.68 ± 1   85.65 ± 1   88.08 ± 1   88.94 ± 0   91.22 ± 0
                      RF     71.93 ± 1   79.48 ± 1   80.69 ± 1   81.67 ± 1   82.89 ± 1
                      wsRF   77.60 ± 1   85.61 ± 0   88.11 ± 0   89.31 ± 0   90.68 ± 0
                      GRRF   74.73 ± 0   84.70 ± 1   87.25 ± 0   89.61 ± 0   91.89 ± 0
YaleBrandomfaceM504   xRF    94.71 ± 0   97.64 ± 0   98.01 ± 0   98.22 ± 0   98.59 ± 0
                      RF     88.00 ± 0   92.59 ± 0   94.13 ± 0   94.86 ± 0   96.06 ± 0
                      wsRF   95.40 ± 0   97.90 ± 0   98.17 ± 0   98.14 ± 0   98.38 ± 0
                      GRRF   95.66 ± 0   98.10 ± 0   98.42 ± 0   98.92 ± 0   98.84 ± 0
ORLEigenfaceM504      xRF    76.25 ± 6   87.25 ± 3   91.75 ± 2   93.25 ± 2   94.75 ± 2
                      RF     71.75 ± 2   78.75 ± 4   82.00 ± 3   82.75 ± 3   85.50 ± 5
                      wsRF   78.25 ± 4   88.75 ± 3   90.00 ± 1   91.25 ± 2   92.50 ± 2
                      GRRF   73.50 ± 6   85.00 ± 2   90.00 ± 1   90.75 ± 3   94.75 ± 1
ORLrandomfaceM504     xRF    87.75 ± 3   92.50 ± 2   95.50 ± 1   94.25 ± 1   96.00 ± 1
                      RF     77.50 ± 3   82.00 ± 7   84.50 ± 2   87.50 ± 2   86.00 ± 2
                      wsRF   87.00 ± 5   93.75 ± 2   93.75 ± 0   95.00 ± 1   95.50 ± 1
                      GRRF   87.25 ± 1   93.25 ± 1   94.50 ± 1   94.25 ± 1   95.50 ± 1

Table 4: AUC results (mean ± std-dev) of random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.

Dataset       Model  K = 20      K = 50      K = 80      K = 100     K = 200
CaltechM3000  xRF    0.995 ± 0   0.999 ± 5   1.00 ± 2    1.00 ± 1    1.00 ± 1
              RF     0.851 ± 7   0.817 ± 4   0.826 ± 12  0.865 ± 6   0.864 ± 1
              wsRF   0.841 ± 1   0.845 ± 8   0.834 ± 7   0.850 ± 8   0.870 ± 9
              GRRF   0.846 ± 1   0.860 ± 2   0.862 ± 1   0.908 ± 1   0.923 ± 1
HorseM3000    xRF    0.849 ± 1   0.887 ± 0   0.895 ± 0   0.898 ± 0   0.897 ± 0
              RF     0.637 ± 4   0.664 ± 7   0.692 ± 15  0.696 ± 3   0.733 ± 9
              wsRF   0.635 ± 8   0.687 ± 4   0.679 ± 6   0.671 ± 4   0.718 ± 9
              GRRF   0.786 ± 3   0.778 ± 3   0.785 ± 8   0.699 ± 1   0.806 ± 4

Table 5: Test accuracy results (%) of random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group Xs and the weak group Xw used in xRF is reported in the last two columns.

Dataset         xRF    RF     wsRF   GRRF   varSelRF  LASSO  |Xs|  |Xw|
colon           87.65  84.35  84.50  86.45  76.80     82.00  245   317
srbct           97.71  95.90  96.76  97.57  96.50     99.30  606   546
Leukemia        89.25  82.58  84.83  87.25  89.30     92.40  502   200
Lymphoma        99.30  97.15  98.10  99.10  97.80     99.10  1404  275
breast2class    78.84  62.72  63.40  71.32  61.40     63.40  194   631
breast3class    65.42  56.00  57.19  63.55  58.20     60.00  724   533
nci             74.15  58.85  59.40  63.05  58.20     60.40  247   1345
Brain           81.93  70.79  70.79  74.79  76.90     74.10  1270  1219
Prostate        92.56  88.71  90.79  92.85  91.50     91.20  601   323
Adenocarcinoma  90.88  84.04  84.12  85.52  78.80     81.10  108   669


Table 6: The accuracy of prediction and error bound c/s² of the models using a small subspace mtry = ⌈log₂(M) + 1⌉; better values are bold.

             c/s² error bound         Test accuracy (%)
Dataset      RF     wsRF   xRF       RF     GRRF   wsRF   xRF      |Xs|  |Xw|
Fbis         2.149  1.179  1.209     76.42  76.51  84.14  84.69    201   555
La2s         1.526  0.904  0.780     66.77  67.99  87.26  88.61    353   1136
La1s         4.08   0.892  1.499     77.76  80.49  86.03  87.21    220   1532

Figure 8: Box plots of the (c/s²) error bound of the RF, wsRF, and xRF models applied to the 10 gene datasets.

by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered to be one of the contributions of our work.
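To make the size difference concrete, the two subspace rules can be compared for the largest text dataset used here (La1s, with M = 13195 features, per Table 1); the R snippet below is a simple arithmetic illustration.

M <- 13195                              # number of features in La1s (Table 1)
c(sqrt_rule = ceiling(sqrt(M)),         # Breiman's suggestion: 115 candidates per node
  log_rule  = ceiling(log2(M) + 1))     # small subspace used here: 15 candidates per node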

6. Conclusions

We have presented a new method for feature subspace selection for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach for unbiased feature sampling, which selects the set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (or noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,


Figure 9: The accuracy of prediction of the three random forest models (RF, wsRF, xRF) against the number of trees (left) and the number of features (right) on the three datasets: (a) Fbis, (b) La2s, and (c) La1s. The vertical line in each right-hand plot marks the subspace size log₂(M) + 1.


especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.

For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 102.01-2011.17.

References

[1] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.

[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.

[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.

[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.

[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.

[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.

[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.

[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.

[10] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.

[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.

[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.

[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.

[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.

[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.

[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.

[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.

[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.

[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.

[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.

[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.

[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.

[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.

[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.

[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.

[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.

[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.

[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.

[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.

[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.

[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: Lasso and elastic-net regularized generalized linear models," R package version 1-1, 2010, http://CRAN.R-project.org/package=glmnet.



The detailed results containing the median and thevariance values are presented in Figure 7 with box plotsOnly the GRRF model was used for this comparison theLASSO logistic regression and varSelRF method for featureselection were not considered in this experiment becausetheir accuracies are lower than that of the GRRF model asshown in [17] We can see that the xRF model achieved thehighest average accuracy of prediction on nine datasets out often Its result was significantly different on the prostate genedataset and the variance was also smaller than those of theother models

Figure 8 shows the box plots of the (1198881199042) error bound ofthe RF wsRF and xRF models on the ten gene datasets from100 repetitionsThe wsRF model obtained lower error bound

rate on five gene datasets out of 10 The xRF model produceda significantly different error bound rate on two gene datasetsand obtained the lowest error rate on three datasets Thisimplies that when the optimal parameters such as 119898119905119903119910 =

lceilradic119872rceil and 119899min = 1 were used in growing trees the numberof genes in the subspace was not small and out-of-bag datawas used in prediction and the results were comparativelyfavored to the xRF model

56 Comparison of Prediction Performance for Various Num-bers of Features and Trees Table 6 shows the average 1198881199042error bound and accuracy test results of 10 repetitions ofrandom forest models on the three large datasets The xRFmodel produced the lowest error 1198881199042 on the dataset La1s

The Scientific World Journal 11

60

70

80

Accu

racy

()

60

70

80

Accu

racy

()

70

80

90

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM7000

RF GRRF wsRF xRFHorseM15000

RF GRRF wsRF xRFHorseM12000

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM5000

RF GRRF wsRF xRFHorseM3000

RF GRRF wsRF xRFHorseM500

RF GRRF wsRF xRFHorseM300

Accu

racy

()

60

70

80

Accu

racy

()

60

70

80

90

Accu

racy

()

60

70

80

90

Accu

racy

()

70

80

90

Accu

racy

()

60

70

80

Accu

racy

()

60

70

80

Accu

racy

()

Figure 5 Box plots of the test accuracy of the nine Horse subdatasets

while the wsRF model showed the lower error bound onother two datasets Fbis andLa2sTheRFmodel demonstratedthe worst accuracy of prediction compared to the othermodels this model also produced a large 1198881199042 error whenthe small subspace size 119898119905119903119910 = lceillog

2(119872) + 1rceil was used to

build trees on the La1s and La2s datasets The number offeatures in the X

119904and X

119908columns on the right of Table 6

was used in the xRF model We can see that the xRF modelachieved the highest accuracy of prediction on all three largedatasets

Figure 9 shows the plots of the performance curves of theRF models when the number of trees and features increasesThe number of trees was increased stepwise by 20 treesfrom 20 to 200 when the models were applied to the La1s

dataset For the remaining data sets the number of treesincreased stepwise by 50 trees from 50 to 500 The numberof random features in a subspace was set to 119898119905119903119910 = lceilradic119872rceilThe number of features each consisting of a random sumof five inputs varied from 5 to 100 and for each 200 treeswere combined The vertical line in each plot indicates thesize of a subspace of features 119898119905119903119910 = lceillog

2(119872) + 1rceil

This subspace was suggested by Breiman [1] for the case oflow-dimensional datasets Three feature selection methodsnamely GRRF varSelRF and LASSO were not considered inthis experimentThemain reason is that when the119898119905119903119910 valueis large the computational time of the GRRF and varSelRFmodels required to deal with large high datasets was too long[17]

12 The Scientific World Journal

06

07

08

09AU

C

065

070

075

080

085

090

AUC

070

075

080

085

090

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM7000

RF GRRF wsRF xRFHorseM15000

RF GRRF wsRF xRFHorseM12000

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM5000

RF GRRF wsRF xRFHorseM3000

RF GRRF wsRF xRFHorseM500

RF GRRF wsRF xRFHorseM300

AUC

06

07

08

09

AUC

07

08

09AU

C

06

07

08

09

AUC

07

08

09

AUC

05

06

07

08

09

AUC

065

070

075

080

085

AUC

Figure 6 Box plots of the AUC measures of the nine Horse subdatasets

It can be seen that the xRF and wsRF models alwaysprovided good results and achieved higher prediction accu-racies when the subspace 119898119905119903119910 = lceillog

2(119872) + 1rceil was used

However the xRF model is better than the wsRF model inincreasing the prediction accuracy on the three classificationdatasetsThe RFmodel requires the larger number of featuresto achieve the higher accuracy of prediction as shown in theright of Figures 9(a) and 9(b) When the number of treesin a forests was varied the xRF model produced the bestresults on the Fbis and La2s datasets In the La1s datasetwhere the xRF model did not obtain the best results asshown in Figure 9(c) (left) the differences from the bestresults were minor From the right of Figures 9(a) 9(b)and 9(c) we can observe that the xRF model does not need

many features in the selected subspace to achieve the bestprediction performanceThese empirical results indicate thatfor application on high-dimensional data when the xRFmodel uses the small subspace the achieved results can besatisfactory

However the RF model using the simple samplingmethod for feature selection [1] could achieve good predic-tion performance only if it is provided with a much largersubspace as shown in the right part of Figures 9(a) and 9(b)Breiman suggested to use a subspace of size 119898119905119903119910 = radic119872 inclassification problemWith this size the computational timefor building a random forest is still too high especially forlarge high datasets In general when the xRF model is usedwith a feature subspace of the same size as the one suggested


Table 2: The $c/s^{2}$ error bound results of the random forest models against the codebook size on the Caltech and Horse datasets. The bold value in each row indicates the best result.

Dataset   Model   300     500     1000    3000    5000    7000    10000   12000   15000
Caltech   xRF     0.312   0.271   0.280   0.287   0.357   0.440   0.650   0.742   0.789
Caltech   RF      0.369   0.288   0.294   0.327   0.435   0.592   0.908   1.114   3.611
Caltech   wsRF    0.413   0.297   0.268   0.221   0.265   0.333   0.461   0.456   0.789
Horse     xRF     0.266   0.262   0.246   0.277   0.259   0.298   0.275   0.288   0.382
Horse     RF      0.331   0.342   0.354   0.374   0.417   0.463   0.519   0.537   0.695
Horse     wsRF    0.429   0.414   0.391   0.295   0.288   0.333   0.295   0.339   0.455

Figure 7: Box plots of the test accuracy (%) of the models (RF, GRRF, wsRF, xRF) on the ten gene datasets (Colon, Srbct, Leukemia, Lymphoma, Breast2class, Breast3class, nci, Brain, Prostate, Adenocarcinoma).


Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.

Dataset               Model   K = 20       K = 50       K = 80       K = 100      K = 200
CaltechM3000          xRF     95.50 ± 2    96.50 ± 1    96.50 ± 2    97.00 ± 1    97.50 ± 2
CaltechM3000          RF      70.00 ± 7    76.00 ± 9    77.50 ± 12   82.50 ± 16   81.50 ± 2
CaltechM3000          wsRF    91.50 ± 4    91.00 ± 3    93.00 ± 2    94.50 ± 4    92.00 ± 9
CaltechM3000          GRRF    93.00 ± 2    96.00 ± 2    94.50 ± 2    95.00 ± 3    94.00 ± 2
HorseM3000            xRF     80.59 ± 4    81.76 ± 2    79.71 ± 6    80.29 ± 1    77.65 ± 5
HorseM3000            RF      50.59 ± 10   52.94 ± 8    56.18 ± 4    58.24 ± 5    57.35 ± 9
HorseM3000            wsRF    62.06 ± 4    68.82 ± 3    67.65 ± 3    67.65 ± 5    65.88 ± 7
HorseM3000            GRRF    65.00 ± 9    63.53 ± 3    68.53 ± 3    63.53 ± 9    71.18 ± 4
YaleBEigenfaceM504    xRF     75.68 ± 1    85.65 ± 1    88.08 ± 1    88.94 ± 0    91.22 ± 0
YaleBEigenfaceM504    RF      71.93 ± 1    79.48 ± 1    80.69 ± 1    81.67 ± 1    82.89 ± 1
YaleBEigenfaceM504    wsRF    77.60 ± 1    85.61 ± 0    88.11 ± 0    89.31 ± 0    90.68 ± 0
YaleBEigenfaceM504    GRRF    74.73 ± 0    84.70 ± 1    87.25 ± 0    89.61 ± 0    91.89 ± 0
YaleBrandomfaceM504   xRF     94.71 ± 0    97.64 ± 0    98.01 ± 0    98.22 ± 0    98.59 ± 0
YaleBrandomfaceM504   RF      88.00 ± 0    92.59 ± 0    94.13 ± 0    94.86 ± 0    96.06 ± 0
YaleBrandomfaceM504   wsRF    95.40 ± 0    97.90 ± 0    98.17 ± 0    98.14 ± 0    98.38 ± 0
YaleBrandomfaceM504   GRRF    95.66 ± 0    98.10 ± 0    98.42 ± 0    98.92 ± 0    98.84 ± 0
ORLEigenfaceM504      xRF     76.25 ± 6    87.25 ± 3    91.75 ± 2    93.25 ± 2    94.75 ± 2
ORLEigenfaceM504      RF      71.75 ± 2    78.75 ± 4    82.00 ± 3    82.75 ± 3    85.50 ± 5
ORLEigenfaceM504      wsRF    78.25 ± 4    88.75 ± 3    90.00 ± 1    91.25 ± 2    92.50 ± 2
ORLEigenfaceM504      GRRF    73.50 ± 6    85.00 ± 2    90.00 ± 1    90.75 ± 3    94.75 ± 1
ORLrandomfaceM504     xRF     87.75 ± 3    92.50 ± 2    95.50 ± 1    94.25 ± 1    96.00 ± 1
ORLrandomfaceM504     RF      77.50 ± 3    82.00 ± 7    84.50 ± 2    87.50 ± 2    86.00 ± 2
ORLrandomfaceM504     wsRF    87.00 ± 5    93.75 ± 2    93.75 ± 0    95.00 ± 1    95.50 ± 1
ORLrandomfaceM504     GRRF    87.25 ± 1    93.25 ± 1    94.50 ± 1    94.25 ± 1    95.50 ± 1

Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.

Dataset        Model   K = 20      K = 50      K = 80       K = 100     K = 200
CaltechM3000   xRF     99.5 ± 0    99.9 ± 5    100 ± 2      100 ± 1     100 ± 1
CaltechM3000   RF      85.1 ± 7    81.7 ± 4    82.6 ± 12    86.5 ± 6    86.4 ± 1
CaltechM3000   wsRF    84.1 ± 1    84.5 ± 8    83.4 ± 7     85.0 ± 8    87.0 ± 9
CaltechM3000   GRRF    84.6 ± 1    86.0 ± 2    86.2 ± 1     90.8 ± 1    92.3 ± 1
HorseM3000     xRF     84.9 ± 1    88.7 ± 0    89.5 ± 0     89.8 ± 0    89.7 ± 0
HorseM3000     RF      63.7 ± 4    66.4 ± 7    69.2 ± 15    69.6 ± 3    73.3 ± 9
HorseM3000     wsRF    63.5 ± 8    68.7 ± 4    67.9 ± 6     67.1 ± 4    71.8 ± 9
HorseM3000     GRRF    78.6 ± 3    77.8 ± 3    78.5 ± 8     69.9 ± 1    80.6 ± 4

Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group X_s and the weak group X_w is used in xRF.

Dataset          xRF     RF      wsRF    GRRF    varSelRF   LASSO   |X_s|   |X_w|
colon            87.65   84.35   84.50   86.45   76.80      82.00   245     317
srbct            97.71   95.90   96.76   97.57   96.50      99.30   606     546
Leukemia         89.25   82.58   84.83   87.25   89.30      92.40   502     200
Lymphoma         99.30   97.15   98.10   99.10   97.80      99.10   1404    275
breast2class     78.84   62.72   63.40   71.32   61.40      63.40   194     631
breast3class     65.42   56.00   57.19   63.55   58.20      60.00   724     533
nci              74.15   58.85   59.40   63.05   58.20      60.40   247     1345
Brain            81.93   70.79   70.79   74.79   76.90      74.10   1270    1219
Prostate         92.56   88.71   90.79   92.85   91.50      91.20   601     323
Adenocarcinoma   90.88   84.04   84.12   85.52   78.80      81.10   108     669


Table 6: The prediction accuracy and $c/s^{2}$ error bound of the models using a small subspace $mtry = \lceil\log_2(M) + 1\rceil$; better values are bold.

Dataset   c/s^2 error bound (RF / wsRF / xRF)   Test accuracy % (RF / GRRF / wsRF / xRF)   |X_s|   |X_w|
Fbis      2.149 / 1.179 / 1.209                 76.42 / 76.51 / 84.14 / 84.69              201     555
La2s      1.526 / 0.904 / 0.780                 66.77 / 67.99 / 87.26 / 88.61              353     1136
La1s      4.08  / 0.892 / 1.499                 77.76 / 80.49 / 86.03 / 87.21              220     1532
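As a usage note, the small-subspace baseline evaluated in Table 6 corresponds to the following call of the randomForest R package. This is only a sketch of the parameter setting for the plain RF baseline (X_train and y_train are placeholder objects); the xRF sampling scheme itself is not exposed as a package option.

```r
library(randomForest)

## Small-subspace baseline used in Table 6 (placeholder training objects).
M          <- ncol(X_train)
mtry_small <- ceiling(log2(M) + 1)   # e.g. 12 for Fbis, 15 for La1s and La2s
fit <- randomForest(x = X_train, y = as.factor(y_train),
                    ntree = 500, mtry = mtry_small, nodesize = 1)
```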

Figure 8: Box plots of the $c/s^{2}$ error bound for the models (RF, wsRF, xRF) applied to the 10 gene datasets.

by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered to be one of the contributions of our work.
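As a rough illustration of the gap between the two subspace sizes (feature counts M taken from Table 1; the arithmetic below is ours): for Fbis (M = 2000), $\lceil\sqrt{M}\rceil = 45$ versus $\lceil\log_2(M)+1\rceil = 12$; for La2s (M = 12432), $\lceil\sqrt{M}\rceil = 112$ versus $\lceil\log_2(M)+1\rceil = 15$; and for La1s (M = 13195), $\lceil\sqrt{M}\rceil = 115$ versus $\lceil\log_2(M)+1\rceil = 15$. The small subspace evaluated here is therefore roughly four to eight times narrower at every node split, which is where the reduction in computational time comes from.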

Figure 9: The prediction accuracy of the three random forest models (RF, wsRF, xRF) against the number of trees (left) and the number of features in the subspace (right) on the Fbis, La2s, and La1s datasets; the vertical line marks the subspace size $mtry = \lceil\log_2(M) + 1\rceil$.

6. Conclusions

We have presented a new feature subspace selection method for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach to unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality by using a defined threshold to remove uninformative features (or noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems, especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.

For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision", funded by the National Foundation for Science and Technology Development, Vietnam, under Grant no. 102.01-2011.17.

References

[1] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589-604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321-329, 1994.
[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139-157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23-37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819-823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278-282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, 1998.
[10] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[11] R. Diaz-Uriarte and S. Alvarez de Andres, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225-2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44-63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769-787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434-446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010-2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483-3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483-501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80-83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247-258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643-660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138-142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.


[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18-22, 2002.
[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: Lasso and elastic-net regularized generalized linear models," R package version 1-1, 2010, http://CRAN.R-project.org/package=glmnet.




[7] T-T Nguyen and T T Nguyen ldquoA real time license platedetection system based on boosting learning algorithmrdquo inProceedings of the 5th International Congress on Image and SignalProcessing (CISP rsquo12) pp 819ndash823 IEEE October 2012

[8] T K Ho ldquoRandom decision forestsrdquo in Proceedings of the 3rdInternational Conference on Document Analysis and Recogni-tion vol 1 pp 278ndash282 1995

[9] T K Ho ldquoThe random subspace method for constructingdecision forestsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 20 no 8 pp 832ndash844 1998

[10] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[11] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006

[12] RGenuer J-M Poggi andC Tuleau-Malot ldquoVariable selectionusing random forestsrdquoPattern Recognition Letters vol 31 no 14pp 2225ndash2236 2010

[13] B Xu J Z Huang GWilliams QWang and Y Ye ldquoClassifyingvery high-dimensional data with random forests built fromsmall subspacesrdquo International Journal ofDataWarehousing andMining vol 8 no 2 pp 44ndash63 2012

[14] Y Ye Q Wu J Zhexue Huang M K Ng and X Li ldquoStratifiedsampling for feature subspace selection in random forests forhigh dimensional datardquo Pattern Recognition vol 46 no 3 pp769ndash787 2013

[15] X Chen Y Ye X Xu and J Z Huang ldquoA feature groupweighting method for subspace clustering of high-dimensionaldatardquo Pattern Recognition vol 45 no 1 pp 434ndash446 2012

[16] D Amaratunga J Cabrera and Y-S Lee ldquoEnriched randomforestsrdquo Bioinformatics vol 240 no 18 pp 2010ndash2014 2008

[17] H Deng and G Runger ldquoGene selection with guided regular-ized random forestrdquo Pattern Recognition vol 46 no 12 pp3483ndash3489 2013

[18] C Strobl ldquoStatistical sources of variable selection bias inclassification trees based on the gini indexrdquo Tech Rep SFB 3862005 httpepububuni-muenchendearchive0000178901paper 420pdf

[19] C Strobl A-L Boulesteix and T Augustin ldquoUnbiased splitselection for classification trees based on the gini indexrdquoComputational Statistics amp Data Analysis vol 520 no 1 pp483ndash501 2007

[20] C Strobl A-L Boulesteix A Zeileis and T Hothorn ldquoBiasin random forest variable importance measures illustrationssources and a solutionrdquo BMC Bioinformatics vol 8 article 252007

[21] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008

[22] T Hothorn K Hornik and A Zeileis Party a laboratoryfor recursive partytioning r package version 09-9999 2011httpcranr-projectorgpackage=party

[23] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics vol 10 no 6 pp 80ndash83 1945

[24] T-TNguyen J ZHuang andT TNguyen ldquoTwo-level quantileregression forests for bias correction in range predictionrdquoMachine Learning 2014

[25] T-T Nguyen J Z Huang K Imran M J Li and GWilliams ldquoExtensions to quantile regression forests for veryhigh-dimensional datardquo in Advances in Knowledge Discoveryand Data Mining vol 8444 of Lecture Notes in ComputerScience pp 247ndash258 Springer Berlin Germany 2014

[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001

[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994

[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991

[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237

18 The Scientific World Journal

[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002

[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml

[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article Unbiased Feature Selection in …downloads.hindawi.com/journals/tswj/2015/471371.pdfResearch Article Unbiased Feature Selection in Learning Random Forests for High-Dimensional


scores higher than the shadow over multiple permutations. This method has been presented in [24, 25].

In each node of trees, each shadow A_j shares approximately the same properties as the corresponding X_j, but it is independent of Y and consequently has approximately the same probability of being selected as a splitting candidate. This feature permutation method can reduce bias due to different measurement levels of X_j, according to the p-value, and can yield a correct ranking of features according to their importance.
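To make the procedure concrete, the following sketch (in Python, not the implementation language used in the paper) mimics the shadow-feature weighting under stated assumptions: scikit-learn's impurity-based feature_importances_ stands in for the raw importance scores, the one-sided Wilcoxon rank-sum test requires SciPy >= 1.7, and the helper name feature_pvalues is illustrative rather than taken from the original code.

import numpy as np
from scipy.stats import ranksums
from sklearn.ensemble import RandomForestClassifier

def feature_pvalues(X, y, n_replicates=30, random_state=0):
    # One p-value per original feature; a small p-value means the feature's
    # importance scores sit consistently above the best shadow score.
    rng = np.random.default_rng(random_state)
    n, m = X.shape
    imp_real = np.zeros((n_replicates, m))
    imp_shadow_max = np.zeros(n_replicates)
    for r in range(n_replicates):
        # Shadow features: column-wise permutations of X, so each shadow keeps
        # the marginal distribution of its feature but is independent of y.
        shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(m)])
        X_ext = np.hstack([X, shadows])            # extended data of 2M dimensions
        rf = RandomForestClassifier(n_estimators=100, random_state=r)
        rf.fit(X_ext, y)
        imp = rf.feature_importances_
        imp_real[r] = imp[:m]
        imp_shadow_max[r] = imp[m:].max()          # comparison sample IS_A^max
    # Wilcoxon rank-sum test of each feature's R scores against IS_A^max.
    return np.array([
        ranksums(imp_real[:, j], imp_shadow_max, alternative="greater").pvalue
        for j in range(m)
    ])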

4.3. Unbiased Feature Weighting for Subspace Selection. Given the p-values of all features, we first set a significance level as the threshold θ, for instance θ = 0.05. Any feature whose p-value is greater than θ is considered an uninformative feature and is removed from the system; otherwise, its relationship with Y is assessed. We now consider the set of features X obtained from L after neglecting all uninformative features.

Second, we find the best subset of features which is highly related to the response feature; a correlation measure, the χ²(X, Y) statistic, is used to test the association between the categorical response feature and each feature X_j. Each observation is allocated to one cell of a two-dimensional array of cells (called a contingency table) according to the values of (X, Y). If there are r rows and c columns in the table and N is the total number of samples, the value of the test statistic is

\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}.   (8)

For the test of independence, a chi-squared probability of less than or equal to 0.05 is commonly interpreted as justification for rejecting the hypothesis that the row variable is independent of the column feature.
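As an illustration only (assuming categorical or pre-binned feature values so that a contingency table can be formed), the test in (8) can be evaluated with SciPy; chi2_contingency computes the expected counts E_ij internally, and the helper name chi2_pvalue is illustrative.

import numpy as np
from scipy.stats import chi2_contingency

def chi2_pvalue(x_j, y):
    # Build the r x c contingency table of observed counts O_ij for (X_j, Y).
    rows, cols = np.unique(x_j), np.unique(y)
    table = np.array([[np.sum((x_j == r) & (y == c)) for c in cols] for r in rows])
    chi2, p, dof, expected = chi2_contingency(table)
    return p  # p <= 0.05: reject independence, so X_j is strongly related to Y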

Let X_s be the best subset of features; we collect all features X_j whose p-value is smaller than or equal to 0.05 as a result of the χ² statistical test according to (8). The remaining features, X \ X_s, are added to X_w, and this approach is described in Algorithm 2. We independently sample features from the two subsets and put them together as the subspace features for splitting the data at any node, recursively. The two subsets partition the set of informative features in the data without irrelevant features. Given X_s and X_w, at each node we randomly select mtry (mtry > 1) features from each group of features. For a given subspace size, we can choose proportions between highly informative features and weakly informative features that depend on the sizes of the two groups; that is, mtry_s = ⌈mtry × (|X_s| / |X|)⌉ and mtry_w = ⌊mtry × (|X_w| / |X|)⌋, where |X_s| and |X_w| are the numbers of features in the group of highly informative features X_s and the group of weakly informative features X_w, respectively, and |X| is the number of informative features in the input dataset. These are merged to form the feature subspace for splitting the node.
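A small sketch of this proportional sampling follows; the names strong_idx and weak_idx are illustrative, and the two index sets are assumed disjoint and non-empty.

import math
import numpy as np

def stratified_subspace(strong_idx, weak_idx, mtry, rng):
    # mtry_s = ceil(mtry * |X_s| / |X|), mtry_w = floor(mtry * |X_w| / |X|)
    n_total = len(strong_idx) + len(weak_idx)
    mtry_s = math.ceil(mtry * len(strong_idx) / n_total)
    mtry_w = math.floor(mtry * len(weak_idx) / n_total)
    picked_s = rng.choice(strong_idx, size=min(mtry_s, len(strong_idx)), replace=False)
    picked_w = rng.choice(weak_idx, size=min(mtry_w, len(weak_idx)), replace=False)
    return np.concatenate([picked_s, picked_w])  # candidate features for one node

For example, with |X_s| = 20, |X_w| = 100, and mtry = 11, this draws ⌈11 × 20/120⌉ = 2 strong and ⌊11 × 100/120⌋ = 9 weak candidate features for the node.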

44 Our Proposed RF Algorithm In this section we presentour new random forest algorithm called xRF which usesthe new unbiased feature sampling method to generate splits

at the nodes of CART trees [2] The proposed algorithmincludes the following main steps (i) weighting the featuresusing the feature permutation method (ii) identifying allunbiased features and partitioning them into two groups X

119904

and X119908 (iii) building RF using the subspaces containing

features which are taken randomly and separately from X119904

X119908 and (iv) classifying a new data The new algorithm is

summarized as follows

(1) Generate the extended dataset SX119860 of 2119872 dimen-sions by permuting the corresponding predictor fea-ture values for shadow features

(2) Build a random forest model RF from SX119860 119884 andcompute 119877 replicates of raw importance scores of allpredictor features and shadows with RF Extract themaximum importance score of each replicate to formthe comparison sample ISmax

119860of 119877 elements

(3) For each predictor feature take 119877 importance scoresand computeWilcoxon test to get 119901-value that is theweight of each feature

(4) Given a significance level threshold 120579 neglect alluninformative features

(5) Partition the remaining features into two subsets X119904

and X119908described in Algorithm 2

(6) Sample the training set L with replacement to gener-ate bagged samples L

1L2 L

119870

(7) For each 119871119896 grow a CART tree 119879

119896as follows

(a) At each node select a subspace of119898119905119903119910 (119898119905119903119910 gt1) features randomly and separately fromX

119904and

X119908and use the subspace features as candidates

for splitting the node(b) Each tree is grown nondeterministically with-

out pruning until the minimum node size 119899minis reached

(8) Given a 119883 = 119909new use (1) to predict the responsevalue
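The following schematic (a simplification, not the authors' implementation) strings these steps together using the hypothetical helpers sketched earlier (feature_pvalues, chi2_pvalue, stratified_subspace). It deviates from step (7a) in one important way: to stay within scikit-learn's API it draws one stratified subspace per tree rather than a fresh subspace at every node, and it assumes discrete feature values for the chi-squared step and integer class labels.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_xrf_like(X, y, n_trees=500, theta=0.05, random_state=0):
    rng = np.random.default_rng(random_state)
    n, m = X.shape
    # Steps (1)-(4): permutation-based p-values, drop uninformative features.
    pvals = feature_pvalues(X, y)
    informative = np.where(pvals <= theta)[0]
    # Step (5): chi-squared partition into strong (X_s) and weak (X_w) groups.
    chi_p = np.array([chi2_pvalue(X[:, j], y) for j in informative])
    strong = informative[chi_p < 0.05]
    weak = np.setdiff1d(informative, strong)
    mtry = int(np.ceil(np.sqrt(len(informative))))
    forest = []
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)                      # step (6): bagged sample
        subspace = stratified_subspace(strong, weak, mtry, rng)
        tree = DecisionTreeClassifier(min_samples_leaf=1)      # unpruned CART, n_min = 1
        tree.fit(X[np.ix_(boot, subspace)], y[boot])
        forest.append((tree, subspace))
    return forest

def predict_xrf_like(forest, X_new):
    # Step (8): majority vote over the K trees (integer class labels assumed).
    votes = np.stack([t.predict(X_new[:, s]) for t, s in forest]).astype(int)
    n_classes = votes.max() + 1
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)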

5. Experiments

5.1. Datasets. Real-world datasets, including image datasets and microarray datasets, were used in our experiments. Image classification and object recognition are important problems in computer vision. We conducted experiments on four benchmark image datasets, including the Caltech categories dataset (http://www.vision.caltech.edu/html-files/archive.html), the Horse dataset (http://pascal.inrialpes.fr/data/horses), the extended YaleB database [26], and the AT&T ORL dataset [27].

For the Caltech dataset, we use a subset of 100 images from the Caltech face dataset and 100 images from the Caltech background dataset, following the setting in ICCV (http://people.csail.mit.edu/torralba/shortCourseRLOC). The extended YaleB database consists of 2414 face images of 38 individuals captured under various lighting conditions. Each image has been cropped to a size of 192 × 168 pixels


input: the training dataset L and a random forest RF; R, θ: the number of replicates and the threshold
output: X_s and X_w
(1) Let S_X = L \ Y, M = |S_X|
(2) for r ← 1 to R do
(3)   S_A ← permute(S_X)
(4)   S_XA = S_X ∪ S_A
(5)   Build an RF model from S_XA to produce IS_r(X_j),
(6)   IS_r(A_j), and IS_A^max (j = 1, ..., M)
(7) Set X = ∅
(8) for j ← 1 to M do
(9)   Compute the Wilcoxon rank-sum test with IS(X_j) and IS_A^max
(10)  Compute the p_j value for feature X_j
(11)  if p_j ≤ θ then
(12)    X = X ∪ X_j (X_j ∈ S_X)
(13) Set X_s = ∅, X_w = ∅
(14) Compute the χ²(X, Y) statistic to get the p_j values
(15) for j ← 1 to |X| do
(16)   if p_j < 0.05 then
(17)     X_s = X_s ∪ X_j (X_j ∈ X)
(18) X_w = X \ X_s
(19) return X_s, X_w

Algorithm 2: Feature subspace selection.

and normalized. The Horse dataset consists of 170 images containing horses for the positive class and 170 images of the background for the negative class. The AT&T ORL dataset includes 400 face images of 40 persons.

In the experiments, we use a bag-of-words representation of image features for the Caltech and the Horse datasets. To obtain feature vectors using the bag-of-words method, image patches (subwindows) are sampled from the training images at detected interest points or on a dense grid. A visual descriptor is then applied to these patches to extract the local visual features. A clustering technique is then used to cluster these, and the cluster centers are used as visual code words to form a visual codebook. An image is then represented as a histogram of these visual words. A classifier is then learned from this feature set for classification.

In our experiments, traditional k-means quantization is used to produce the visual codebook. The number of cluster centers can be adjusted to produce different vocabularies, that is, different dimensions of the feature vectors. For the Caltech and Horse datasets, nine codebook sizes were used in the experiments to create 18 datasets, as follows: CaltechM300, CaltechM500, CaltechM1000, CaltechM3000, CaltechM5000, CaltechM7000, CaltechM10000, CaltechM12000, CaltechM15000, and HorseM300, HorseM500, HorseM1000, HorseM3000, HorseM5000, HorseM7000, HorseM10000, HorseM12000, HorseM15000, where M denotes the codebook size.
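A condensed sketch of this codebook construction, assuming the local descriptors have already been extracted for each image; the function name bow_histograms and the default codebook_size are illustrative, and scikit-learn's KMeans stands in for the k-means quantizer.

import numpy as np
from sklearn.cluster import KMeans

def bow_histograms(descriptors_per_image, codebook_size=300, random_state=0):
    # Pool the local descriptors of all training images and cluster them;
    # the cluster centers act as the visual code words.
    all_desc = np.vstack(descriptors_per_image)
    kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=random_state)
    kmeans.fit(all_desc)
    # Represent each image as a normalized histogram of visual-word counts.
    hists = []
    for desc in descriptors_per_image:
        words = kmeans.predict(desc)
        hist = np.bincount(words, minlength=codebook_size).astype(float)
        hists.append(hist / max(hist.sum(), 1.0))
    return np.array(hists), kmeans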

For the face datasets, we use two types of features: eigenfaces [28] and random features (randomly sampled pixels from the images). We used four groups of datasets with four different numbers of dimensions: M30, M56, M120, and M504. In total, we created 16 subdatasets as

Table 1: Description of the real-world datasets, sorted by the number of features and grouped into two groups, microarray data and real-world datasets, accordingly.

Dataset         No. of features  No. of training  No. of tests  No. of classes
Colon           2000             62               —             2
Srbct           2308             63               —             4
Leukemia        3051             38               —             2
Lymphoma        4026             62               —             3
breast2class    4869             78               —             2
breast3class    4869             96               —             3
nci             5244             61               —             8
Brain           5597             42               —             5
Prostate        6033             102              —             2
adenocarcinoma  9868             76               —             2
Fbis            2000             1711             752           17
La2s            12432            1855             845           6
La1s            13195            1963             887           6

follows: YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504, YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, YaleBRandomfaceM504, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504, and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, ORLRandomM504.

The properties of the remaining datasets are summarized in Table 1. The Fbis dataset was compiled from the archive of the Foreign Broadcast Information Service, and the La1s and La2s


datasets were taken from the archive of the Los Angeles Times for TREC-5 (http://trec.nist.gov). The ten gene datasets used are described in [11, 17]; they are always high dimensional and fall within a category of classification problems which deal with a large number of features and small samples. Regarding the characteristics of the datasets given in Table 1, for the subdatasets Fbis, La1s, and La2s the given training and test portions were used individually for training and testing.

5.2. Evaluation Methods. We calculated measures such as the error bound (c/s²), the strength (s), and the correlation (ρ) according to the formulas given in Breiman's method [1]. The correlation measure indicates the independence of trees in a forest, whereas the average strength corresponds to the accuracy of individual trees. Lower correlation and higher strength result in a reduction of the general error bound measured by (c/s²), which indicates a highly accurate RF model.

Two measures are also used to evaluate the accuracy of prediction on the test datasets: one is the area under the curve (AUC) and the other is the test accuracy (Acc), defined as

\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} I\!\left( Q(d_i, y_i) - \max_{j \ne y_i} Q(d_i, j) > 0 \right),   (9)

where I(·) is the indicator function and Q(d_i, j) = \sum_{k=1}^{K} I(h_k(d_i) = j) is the number of votes for d_i ∈ D_t on class j, h_k is the kth tree classifier, N is the number of samples in the test data D_t, and y_i indicates the true class of d_i.
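Equation (9) translates directly into a vote-counting routine; the sketch below assumes per_tree_preds is a K × N array holding h_k(d_i) and that class labels are integers.

import numpy as np

def test_accuracy(per_tree_preds, y_true):
    # A test case counts as correct only when the votes for its true class
    # strictly exceed the votes for every other class (see (9)).
    per_tree_preds = np.asarray(per_tree_preds, dtype=int)   # shape (K, N)
    y_true = np.asarray(y_true, dtype=int)
    n_classes = int(max(per_tree_preds.max(), y_true.max())) + 1
    correct = 0
    for i in range(y_true.shape[0]):
        q = np.bincount(per_tree_preds[:, i], minlength=n_classes)  # Q(d_i, j)
        others = np.delete(q, y_true[i])
        correct += int(q[y_true[i]] - (others.max() if others.size else 0) > 0)
    return correct / y_true.shape[0]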

5.3. Experimental Settings. The latest R-packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model is available in the RRF R-package. The wsRF model, which uses the weighted sampling method [13], is intended for classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. For each fold, we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold dataset. The mtry and n_min parameters were set to √M and 1, respectively. The experimental results were evaluated by two measures: AUC and the test accuracy according to (9).
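A sketch of this protocol, reusing the hypothetical fit_xrf_like / predict_xrf_like helpers from the Section 4.4 sketch; the key detail is that the feature partition of Algorithm 2 is recomputed inside every training fold rather than on the full dataset.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated_accuracy(X, y, n_trees=500, random_state=0):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)
    accs = []
    for train_idx, test_idx in skf.split(X, y):
        # Feature weighting and the X_s / X_w partition are redone on this fold only.
        forest = fit_xrf_like(X[train_idx], y[train_idx], n_trees=n_trees)
        y_pred = predict_xrf_like(forest, X[test_idx])
        accs.append(float(np.mean(y_pred == y[test_idx])))
    return np.mean(accs), np.std(accs)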

We compared the performances across a wide range on the 10 gene datasets used in [11]. The results from the application of GRRF, varSelRF, and LASSO logistic regression to the ten gene datasets are presented in [17]. These three gene selection methods used the RF R-package [30] as the classifier. For the comparison of the methods, we used the same settings as presented in [17]; for the coefficient γ we used a value of 0.1, because GRRF(0.1) has shown competitive accuracy [17] when applied to the 10 gene datasets. One hundred models were generated with different seeds from each training dataset, and each model contained 1000 trees. The mtry and n_min parameters had the same settings as on the image datasets. From each of the datasets, two-thirds of the data were randomly selected for training; the other one-third of the dataset was used to validate the models. For comparison, Breiman's RF method, the weighted sampling random forest wsRF model, and the xRF model were used in the experiments. The guided regularized random forest GRRF [17] and the two well-known feature selection methods using RF as a classifier, namely, varSelRF [31] and LASSO logistic regression [32], are also used to evaluate the accuracy of prediction on high-dimensional datasets.

For the remaining datasets, the prediction performances of the ten random forest models were evaluated; each one was built with 500 trees. The number of feature candidates to split a node was mtry = ⌈log₂(M) + 1⌉. The minimal node size n_min was 1. The xRF model with the new unbiased feature sampling method is a new implementation. We implemented the xRF model as multithread processes, while the other models were run as single-thread processes. We used R to call the corresponding C/C++ functions. All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel(R) Xeon(R) CPU E5620 2.40GHz, 16 cores, 4MB cache, and 32GB main memory.

5.4. Results on Image Datasets. Figures 1 and 2 show the average recognition rates of the models on different subdatasets of the YaleB and ORL datasets. The GRRF model produced slightly better results on the subdataset ORLRandomM120 and on the ORL dataset using eigenfaces, and showed competitive accuracy with the xRF model in some cases in both the YaleB and ORL datasets, for example, YaleBEigenM120, ORLRandomM56, and ORLRandomM120. The reason could be that truly informative features in this kind of dataset are numerous; therefore, when the informative feature set was large, the chance of selecting informative features in the subspace increased, which in turn increased the average recognition rates of the GRRF model. However, the xRF model produced the best results in the remaining cases. The effect of the new approach for feature subspace selection is clearly demonstrated in these results, although these datasets are not high dimensional.

Figures 3 and 5 present the box plots of the test accuracy (mean ± std-dev), and Figures 4 and 6 show the box plots of the AUC measures of the models on the 18 image subdatasets of Caltech and Horse, respectively. From these figures, we can observe that the accuracy and the AUC measures of the models GRRF, wsRF, and xRF increased on all high-dimensional subdatasets when the selected subspace mtry was not too large. This implies that when the number of features in the subspace is small, the proportion of informative features in the feature subspace is comparatively large for the three models. There is then a high chance that highly informative features are selected in the trees, so the overall performance of individual trees is increased. In Breiman's method, many randomly selected subspaces may not contain informative features, which affects the performance of trees grown from these subspaces. It can be seen that the xRF model outperformed the other random forest models on these subdatasets in increasing the test accuracy and the AUC measures. This was because the new unbiased feature sampling was used in generating trees in the xRF model; the feature subspace provided enough highly informative


Figure 1: Recognition rates of the models on the YaleB subdatasets, namely, YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504, and YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, and YaleBRandomfaceM504. ((a) YaleB + eigenface; (b) YaleB + randomface; recognition rate (%) versus feature dimension of subdatasets, for RF, GRRF, wsRF, and xRF.)

Figure 2: Recognition rates of the models on the ORL subdatasets, namely, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504, and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, and ORLRandomM504. ((a) ORL + eigenface; (b) ORL + randomface; recognition rate (%) versus feature dimension of subdatasets, for RF, GRRF, wsRF, and xRF.)

features at all levels of the decision trees. The effect of the unbiased feature selection method is clearly demonstrated in these results.

Table 2 shows the results of c/s² against the number of codebook sizes on the Caltech and Horse datasets. In a random forest, each tree is grown from a bagged training set; out-of-bag estimates were used to evaluate the strength, correlation, and c/s². The GRRF model was not considered in this experiment because that method aims to find a small subset of features and uses the same RF model in the R-package [30] as its classifier. We compared the xRF model with two kinds of random forest models, RF and wsRF. From this table, we can observe that the lowest c/s² values occurred when the wsRF model was applied to the Caltech dataset. However, the xRF model produced the lowest error bound on the Horse dataset. These results demonstrate that the new unbiased feature sampling method can reduce the upper bound of the generalization error in random forests.

Table 3 presents the prediction accuracies (mean ± std-dev) of the models on the subdatasets CaltechM3000, HorseM3000, YaleBEigenfaceM504, YaleBrandomfaceM504, ORLEigenfaceM504, and ORLrandomfaceM504. In these experiments, we used the four models to generate random forests with different sizes, from 20 trees to 200 trees. For each size, we used each model to generate 10 random forests for the 10-fold cross-validation and computed the average accuracy of the 10 results. The GRRF model showed slightly better results on YaleBEigenfaceM504 with


Figure 3: Box plots of the test accuracy (%) of RF, GRRF, wsRF, and xRF on the nine Caltech subdatasets (CaltechM300 to CaltechM15000).

different tree sizes. The wsRF model produced the best prediction performance in some cases when applied to the small subdatasets YaleBEigenfaceM504, ORLEigenfaceM504, and ORLrandomfaceM504. However, the xRF model produced, respectively, the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forest models in classification accuracy in most cases across all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy measures changed little when varying the number of trees.

5.5. Results on Microarray Datasets. Table 5 shows the average test accuracy of the 100 random forest models, computed according to (9), on the gene datasets. The average number of genes selected by the xRF model over 100 repetitions for each dataset is shown on the right of Table 5, divided into two groups, X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on the two gene datasets srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most cases on the remaining gene datasets.


Figure 4: Box plots of the AUC measures of RF, GRRF, wsRF, and xRF on the nine Caltech subdatasets.

The detailed results, containing the median and the variance values, are presented in Figure 7 with box plots. Only the GRRF model was used for this comparison; the LASSO logistic regression and the varSelRF feature selection method were not considered in this experiment because their accuracies are lower than that of the GRRF model, as shown in [17]. We can see that the xRF model achieved the highest average prediction accuracy on nine datasets out of ten. Its result was significantly different on the prostate gene dataset, and the variance was also smaller than those of the other models.

Figure 8 shows the box plots of the (c/s²) error bound of the RF, wsRF, and xRF models on the ten gene datasets over 100 repetitions. The wsRF model obtained a lower error bound on five gene datasets out of 10. The xRF model produced a significantly different error bound on two gene datasets and obtained the lowest error bound on three datasets. This implies that when the optimal parameters, such as mtry = ⌈√M⌉ and n_min = 1, were used in growing trees, the number of genes in the subspace was not small, out-of-bag data was used in prediction, and the results were comparatively favorable to the xRF model.

5.6. Comparison of Prediction Performance for Various Numbers of Features and Trees. Table 6 shows the average c/s² error bound and the test accuracy results of 10 repetitions of the random forest models on the three large datasets. The xRF model produced the lowest c/s² error on the dataset La1s,


Figure 5: Box plots of the test accuracy (%) of RF, GRRF, wsRF, and xRF on the nine Horse subdatasets (HorseM300 to HorseM15000).

while the wsRF model showed the lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy compared to the other models; this model also produced a large c/s² error when the small subspace size mtry = ⌈log₂(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest prediction accuracy on all three large datasets.

Figure 9 shows the performance curves of the RF models as the number of trees and the number of features increase. The number of trees was increased stepwise by 20 trees, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees increased stepwise by 50 trees, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features, each consisting of a random sum of five inputs, varied from 5 to 100, and for each, 200 trees were combined. The vertical line in each plot indicates the size of the feature subspace mtry = ⌈log₂(M) + 1⌉. This subspace was suggested by Breiman [1] for the case of low-dimensional datasets. The three feature selection methods, namely, GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that when the mtry value is large, the computational time required by the GRRF and varSelRF models to deal with large high-dimensional datasets is too long [17].


Figure 6: Box plots of the AUC measures of RF, GRRF, wsRF, and xRF on the nine Horse subdatasets.

It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace mtry = ⌈log₂(M) + 1⌉ was used. However, the xRF model is better than the wsRF model in increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve a higher prediction accuracy, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve its best prediction performance. These empirical results indicate that, for applications on high-dimensional data, the results achieved by the xRF model with a small subspace can be satisfactory.

However, the RF model using the simple sampling method for feature selection [1] could achieve good prediction performance only if it is provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested


Table 2: The (c/s²) error bound results of random forest models against the codebook size on the Caltech and Horse datasets. The bold value in each row indicates the best result.

Dataset  Model  300    500    1000   3000   5000   7000   10000  12000  15000
Caltech  xRF    0.312  0.271  0.280  0.287  0.357  0.440  0.650  0.742  0.789
         RF     0.369  0.288  0.294  0.327  0.435  0.592  0.908  1.114  3.611
         wsRF   0.413  0.297  0.268  0.221  0.265  0.333  0.461  0.456  0.789
Horse    xRF    0.266  0.262  0.246  0.277  0.259  0.298  0.275  0.288  0.382
         RF     0.331  0.342  0.354  0.374  0.417  0.463  0.519  0.537  0.695
         wsRF   0.429  0.414  0.391  0.295  0.288  0.333  0.295  0.339  0.455

Figure 7: Box plots of the test accuracy (%) of RF, GRRF, wsRF, and xRF on the ten gene datasets (Colon, Srbct, Leukemia, Lymphoma, Breast2class, Breast3class, nci, Brain, Prostate, and Adenocarcinoma).


Table 3: The prediction test accuracy (mean ± std-dev, %) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.

Dataset              Model  K = 20      K = 50      K = 80      K = 100     K = 200
CaltechM3000         xRF    95.50 ± 2   96.50 ± 1   96.50 ± 2   97.00 ± 1   97.50 ± 2
                     RF     70.00 ± 7   76.00 ± 9   77.50 ± 12  82.50 ± 16  81.50 ± 2
                     wsRF   91.50 ± 4   91.00 ± 3   93.00 ± 2   94.50 ± 4   92.00 ± 9
                     GRRF   93.00 ± 2   96.00 ± 2   94.50 ± 2   95.00 ± 3   94.00 ± 2
HorseM3000           xRF    80.59 ± 4   81.76 ± 2   79.71 ± 6   80.29 ± 1   77.65 ± 5
                     RF     50.59 ± 10  52.94 ± 8   56.18 ± 4   58.24 ± 5   57.35 ± 9
                     wsRF   62.06 ± 4   68.82 ± 3   67.65 ± 3   67.65 ± 5   65.88 ± 7
                     GRRF   65.00 ± 9   63.53 ± 3   68.53 ± 3   63.53 ± 9   71.18 ± 4
YaleBEigenfaceM504   xRF    75.68 ± 1   85.65 ± 1   88.08 ± 1   88.94 ± 0   91.22 ± 0
                     RF     71.93 ± 1   79.48 ± 1   80.69 ± 1   81.67 ± 1   82.89 ± 1
                     wsRF   77.60 ± 1   85.61 ± 0   88.11 ± 0   89.31 ± 0   90.68 ± 0
                     GRRF   74.73 ± 0   84.70 ± 1   87.25 ± 0   89.61 ± 0   91.89 ± 0
YaleBrandomfaceM504  xRF    94.71 ± 0   97.64 ± 0   98.01 ± 0   98.22 ± 0   98.59 ± 0
                     RF     88.00 ± 0   92.59 ± 0   94.13 ± 0   94.86 ± 0   96.06 ± 0
                     wsRF   95.40 ± 0   97.90 ± 0   98.17 ± 0   98.14 ± 0   98.38 ± 0
                     GRRF   95.66 ± 0   98.10 ± 0   98.42 ± 0   98.92 ± 0   98.84 ± 0
ORLEigenfaceM504     xRF    76.25 ± 6   87.25 ± 3   91.75 ± 2   93.25 ± 2   94.75 ± 2
                     RF     71.75 ± 2   78.75 ± 4   82.00 ± 3   82.75 ± 3   85.50 ± 5
                     wsRF   78.25 ± 4   88.75 ± 3   90.00 ± 1   91.25 ± 2   92.50 ± 2
                     GRRF   73.50 ± 6   85.00 ± 2   90.00 ± 1   90.75 ± 3   94.75 ± 1
ORLrandomfaceM504    xRF    87.75 ± 3   92.50 ± 2   95.50 ± 1   94.25 ± 1   96.00 ± 1
                     RF     77.50 ± 3   82.00 ± 7   84.50 ± 2   87.50 ± 2   86.00 ± 2
                     wsRF   87.00 ± 5   93.75 ± 2   93.75 ± 0   95.00 ± 1   95.50 ± 1
                     GRRF   87.25 ± 1   93.25 ± 1   94.50 ± 1   94.25 ± 1   95.50 ± 1

Table 4: AUC results (mean ± std-dev) of random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.

Dataset       Model  K = 20      K = 50      K = 80      K = 100     K = 200
CaltechM3000  xRF    0.995 ± 0   0.999 ± 5   1.00 ± 2    1.00 ± 1    1.00 ± 1
              RF     0.851 ± 7   0.817 ± 4   0.826 ± 12  0.865 ± 6   0.864 ± 1
              wsRF   0.841 ± 1   0.845 ± 8   0.834 ± 7   0.850 ± 8   0.870 ± 9
              GRRF   0.846 ± 1   0.860 ± 2   0.862 ± 1   0.908 ± 1   0.923 ± 1
HorseM3000    xRF    0.849 ± 1   0.887 ± 0   0.895 ± 0   0.898 ± 0   0.897 ± 0
              RF     0.637 ± 4   0.664 ± 7   0.692 ± 15  0.696 ± 3   0.733 ± 9
              wsRF   0.635 ± 8   0.687 ± 4   0.679 ± 6   0.671 ± 4   0.718 ± 9
              GRRF   0.786 ± 3   0.778 ± 3   0.785 ± 8   0.699 ± 1   0.806 ± 4

Table 5: Test accuracy results (%) of random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group X_s and the weak group X_w is used in xRF.

Dataset          xRF    RF     wsRF   GRRF   varSelRF  LASSO  X_s   X_w
colon            87.65  84.35  84.50  86.45  76.80     82.00  245   317
srbct            97.71  95.90  96.76  97.57  96.50     99.30  606   546
Leukemia         89.25  82.58  84.83  87.25  89.30     92.40  502   200
Lymphoma         99.30  97.15  98.10  99.10  97.80     99.10  1404  275
breast2class     78.84  62.72  63.40  71.32  61.40     63.40  194   631
breast3class     65.42  56.00  57.19  63.55  58.20     60.00  724   533
nci              74.15  58.85  59.40  63.05  58.20     60.40  247   1345
Brain            81.93  70.79  70.79  74.79  76.90     74.10  1270  1219
Prostate         92.56  88.71  90.79  92.85  91.50     91.20  601   323
Adenocarcinoma   90.88  84.04  84.12  85.52  78.80     81.10  108   669


Table 6: The accuracy of prediction and the error bound c/s² of the models using a small subspace mtry = ⌈log₂(M) + 1⌉; better values are bold.

         c/s² error bound        Test accuracy (%)
Dataset  RF     wsRF   xRF       RF     GRRF   wsRF   xRF      X_s  X_w
Fbis     2.149  1.179  1.209     76.42  76.51  84.14  84.69    201  555
La2s     1.526  0.904  0.780     66.77  67.99  87.26  88.61    353  1136
La1s     4.08   0.892  1.499     77.76  80.49  86.03  87.21    220  1532

Figure 8: Box plots of the (c/s²) error bound of RF, wsRF, and xRF applied to the 10 gene datasets.

by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered to be one of the contributions of our work.

6. Conclusions

We have presented a new method of feature subspace selection for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach to unbiased feature sampling, which selects the set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (or noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,


Figure 9: The prediction accuracy (%) of the three random forest models (RF, wsRF, and xRF) against the number of trees (left) and the number of features (right) on the three datasets: (a) Fbis, (b) La2s, and (c) La1s. The vertical line in each feature plot marks the subspace size log(M) + 1.


especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.

For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 102.01-2011.17.

References

[1] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.

[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.

[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.

[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.

[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.

[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.

[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.

[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.

[10] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.

[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.

[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.

[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.

[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.

[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.

[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.

[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.

[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.

[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.

[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.

[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.

[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.

[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.

[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.

[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.

[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.

[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.

[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.

[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.

[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.

[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: Lasso and elastic-net regularized generalized linear models," R package version 1.1, 2010, http://CRAN.R-project.org/package=glmnet.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 6: Research Article Unbiased Feature Selection in …downloads.hindawi.com/journals/tswj/2015/471371.pdfResearch Article Unbiased Feature Selection in Learning Random Forests for High-Dimensional

6 The Scientific World Journal

input The training data set L and a random forest RF119877 120579 The number of replicates and the threshold

output X119904and X

119908

(1) Let S119883= L 119884119872 = S

119883

(2) for 119903 larr 1 to 119877 do(3) S

119860larr 119901119890119903119898119906119905119890(S

119883)

(4) S119883119860

= S119883cup S119860

(5) Build RF model from S119883119860

to produce IS119903119883119895

(6) IS119903119860119895 and ISmax

119860 (119895 = 1 119872)

(7) Set X = 0(8) for 119895 larr 1 to 119872 do(9) Compute Wilcoxon rank-sum test with IS

119883119895and ISmax

119860

(10) Compute 119901119895values for each feature119883

119895

(11) if 119901119895le 120579 then

(12) X = X cup 119883119895(119883119895isin S119883)

(13) Set X119904= 0 X

119908= 0

(14) Compute 1205942(X 119884) statistic to get 119901119895value

(15) for 119895 larr 1 to X do(16) if (119901

119895lt 005) then

(17) X119904= X119904cup 119883119895(119883119895isin X)

(18) X119908= X X

119904

(19) return X119904X119908

Algorithm 2 Feature subspace selection

and normalized The Horse dataset consists of 170 imagescontaining horses for the positive class and 170 images of thebackground for the negative class The ATampT ORL datasetincludes of 400 face images of 40 persons

In the experiments we use a bag of words for imagefeatures representation for theCaltech and theHorse datasetsTo obtain feature vectors using bag-of-words method imagepatches (subwindows) are sampled from the training imagesat the detected interest points or on a dense grid A visualdescriptor is then applied to these patches to extract the localvisual features A clustering technique is then used to clusterthese and the cluster centers are used as visual code wordsto form visual codebook An image is then represented as ahistogram of these visual words A classifier is then learnedfrom this feature set for classification

In our experiments traditional 119896-means quantization isused to produce the visual codebook The number of clustercenters can be adjusted to produce the different vocabulariesthat is dimensions of the feature vectors For the Caltechand Horse datasets nine codebook sizes were used in theexperiments to create 18 datasets as follows CaltechM300CaltechM500 CaltechM1000 CaltechM3000 CaltechM5000CaltechM7000 CaltechM1000 CaltechM12000 CaltechM-15000 and HorseM300 HorseM500 HorseM1000 Horse-M3000 HorseM5000 HorseM7000 HorseM1000 HorseM-12000HorseM15000 whereM denotes the number of code-book sizes

For the face datasets, we use two types of features: eigenface [28] and random features (randomly sampled pixels from the images). We used four groups of datasets with four different numbers of dimensions, M30, M56, M120, and M504. In total, we created 16 subdatasets: YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504, YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, YaleBRandomfaceM504, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504, and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, ORLRandomM504.

Table 1: Description of the real-world datasets, sorted by the number of features and grouped into two groups, microarray data and real-world datasets, accordingly.

Dataset          No. of features   No. of training   No. of tests   No. of classes
Colon            2000              62                —              2
Srbct            2308              63                —              4
Leukemia         3051              38                —              2
Lymphoma         4026              62                —              3
breast2class     4869              78                —              2
breast3class     4869              96                —              3
nci              5244              61                —              8
Brain            5597              42                —              5
Prostate         6033              102               —              2
adenocarcinoma   9868              76                —              2
Fbis             2000              1711              752            17
La2s             12432             1855              845            6
La1s             13195             1963              887            6


The properties of the remaining datasets are summarized in Table 1. The Fbis dataset was compiled from the archive of the Foreign Broadcast Information Service, and the La1s and La2s datasets were taken from the archive of the Los Angeles Times for TREC-5 (http://trec.nist.gov).


The ten gene datasets used are described in [11, 17]; they are always high dimensional and fall within a category of classification problems that deal with a large number of features and small samples. Regarding the characteristics of the datasets given in Table 1, the given split of the subdatasets Fbis, La1s, and La2s was used individually for the training and testing datasets.

5.2. Evaluation Methods. We calculated measures such as the error bound (c/s²), the strength (s), and the correlation (ρ) according to the formulas given in Breiman's method [1]. The correlation measures the independence of trees in a forest, whereas the average strength corresponds to the accuracy of the individual trees. Lower correlation and higher strength result in a reduction of the general error bound measured by c/s², which indicates a highly accurate RF model.
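For reference, these quantities are related through Breiman's bound [1], recalled here (it is not restated in the extracted text above); the c/s² statistic reported in this section is the ratio of the mean correlation to the squared strength:

```latex
% Breiman's upper bound on the generalization error PE* of a random forest,
% expressed with the mean correlation \bar{\rho} and the strength s.
\[
  \mathrm{PE}^{*} \;\le\; \frac{\bar{\rho}\,\bigl(1 - s^{2}\bigr)}{s^{2}},
  \qquad
  c/s^{2} \;=\; \frac{\bar{\rho}}{s^{2}}.
\]
```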

Two measures are also used to evaluate the accuracy of prediction on the test datasets: one is the area under the curve (AUC), and the other is the test accuracy (Acc), defined as

\[
  \mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} I\Bigl(Q(d_i, y_i) - \max_{j \neq y_i} Q(d_i, j) > 0\Bigr), \tag{9}
\]

where I(·) is the indicator function and Q(d_i, j) = \sum_{k=1}^{K} I(h_k(d_i) = j) is the number of votes for d_i ∈ D_t on class j, h_k is the kth tree classifier, N is the number of samples in the test data D_t, and y_i indicates the true class of d_i.
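For concreteness, a small helper that evaluates (9) from the individual tree votes could look as follows; tree_preds is assumed to be an N × K matrix whose (i, k) entry is the class label predicted by tree h_k for test sample d_i (an illustrative sketch, not the authors' implementation).

```r
# Majority-vote test accuracy as defined in (9): the true class must receive
# strictly more votes than any other class.
vote_accuracy <- function(tree_preds, y_true) {
  classes <- sort(unique(c(as.vector(tree_preds), as.vector(y_true))))
  # Q[i, j]: number of trees voting class j for sample i.
  Q <- t(apply(tree_preds, 1, function(v) tabulate(match(v, classes), nbins = length(classes))))
  correct <- vapply(seq_along(y_true), function(i) {
    yi <- match(y_true[i], classes)
    Q[i, yi] > max(Q[i, -yi])
  }, logical(1))
  mean(correct)
}
```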

5.3. Experimental Settings. The latest R packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model is available in the RRF R package. The wsRF model, which uses the weighted sampling method [13], was intended to solve classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. From each fold, we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold. The mtry and n_min parameters were set to √M and 1, respectively. The experimental results were evaluated with two measures, the AUC and the test accuracy according to (9).
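A minimal sketch of this per-fold protocol, using the standard randomForest interface rather than the authors' xRF code (the class column is assumed to be named Y):

```r
library(randomForest)

# Train a 500-tree forest on one training fold and report accuracy on the held-out fold.
run_fold <- function(train, test) {
  M  <- ncol(train) - 1                           # number of predictor features
  rf <- randomForest(Y ~ ., data = train,
                     ntree    = 500,
                     mtry     = ceiling(sqrt(M)), # mtry = sqrt(M), as in the settings above
                     nodesize = 1)                # minimal node size n_min = 1
  mean(predict(rf, newdata = test) == test$Y)
}
```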

We compared performances over a wide range of settings on the 10 gene datasets used in [11]. The results from the application of GRRF, varSelRF, and LASSO logistic regression to the ten gene datasets are presented in [17]. These three gene selection methods used the randomForest R package [30] as the classifier. For the comparison of the methods, we used the same settings as presented in [17]; for the coefficient γ we used a value of 0.1, because GRRF(0.1) has shown competitive accuracy [17] when applied to the 10 gene datasets. One hundred models were generated with different seeds from each training dataset, and each model contained 1000 trees. The mtry and n_min parameters had the same settings as for the image datasets. From each of the datasets, two-thirds of the data were randomly selected for training; the other one-third was used to validate the models. For comparison, Breiman's RF method, the weighted sampling random forest wsRF model, and the xRF model were used in the experiments. The guided regularized random forest GRRF [17] and two well-known feature selection methods using RF as a classifier, namely varSelRF [31] and LASSO logistic regression [32], are also used to evaluate the accuracy of prediction on high-dimensional datasets.

On the remaining datasets, the prediction performances of the ten random forest models were evaluated; each one was built with 500 trees. The number of feature candidates to split a node was mtry = ⌈log₂(M) + 1⌉. The minimal node size n_min was 1. The xRF model with the new unbiased feature sampling method is a new implementation. We implemented the xRF model as multithread processes, while the other models were run as single-thread processes. We used R to call the corresponding C/C++ functions. All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel® Xeon® CPU E5620 at 2.40 GHz, 16 cores, 4 MB cache, and 32 GB main memory.
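In the same spirit as the multithreaded implementation mentioned above, independent trees can be grown in parallel and merged afterwards; the sketch below, using parallel::mclapply and randomForest::combine, is only one common way to approximate this in R and is not the authors' C/C++ implementation.

```r
library(randomForest)
library(parallel)

grow_parallel_rf <- function(train, ntree = 500, threads = 4) {
  M     <- ncol(train) - 1
  chunk <- rep(ceiling(ntree / threads), threads)   # trees grown by each worker
  forests <- mclapply(chunk, function(nt)
    randomForest(Y ~ ., data = train, ntree = nt,
                 mtry = ceiling(log2(M) + 1), nodesize = 1),
    mc.cores = threads)
  do.call(randomForest::combine, forests)           # merge the sub-forests
}
```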

5.4. Results on Image Datasets. Figures 1 and 2 show the average recognition rates of the models on the different subdatasets of the YaleB and ORL datasets. The GRRF model produced slightly better results on the subdataset ORLRandomM120 and on the ORL dataset using eigenface, and it showed accuracy competitive with the xRF model in some cases in both the YaleB and ORL datasets, for example, YaleBEigenM120, ORLRandomM56, and ORLRandomM120. The reason could be that this kind of dataset contains many truly informative features. Therefore, when the informative feature set was large, the chance of selecting informative features in the subspace increased, which in turn increased the average recognition rates of the GRRF model. However, the xRF model produced the best results in the remaining cases. The effect of the new approach for feature subspace selection is clearly demonstrated in these results, although these datasets are not high dimensional.

Figures 3 and 5 present the box plots of the test accuracy (mean ± std-dev), and Figures 4 and 6 show the box plots of the AUC measures of the models on the 18 image subdatasets of Caltech and Horse, respectively. From these figures, we can observe that the accuracy and the AUC measures of the GRRF, wsRF, and xRF models increased on all high-dimensional subdatasets when the selected subspace mtry was not too large. This implies that when the number of features in the subspace is small, the proportion of informative features in the feature subspace is comparatively large in the three models. There is then a high chance that highly informative features are selected in the trees, so the overall performance of the individual trees is increased. In Breiman's method, many randomly selected subspaces may not contain informative features, which affects the performance of trees grown from these subspaces. It can be seen that the xRF model outperformed the other random forest models on these subdatasets in increasing the test accuracy and the AUC measures. This was because the new unbiased feature sampling was used in generating trees in the xRF model; the feature subspace provided enough highly informative features at all levels of the decision trees.


Figure 1: Recognition rates (%) of the models (RF, GRRF, wsRF, xRF) against the feature dimension of the subdatasets: (a) YaleB + eigenface (YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, and YaleBEigenfaceM504); (b) YaleB + randomface (YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, and YaleBRandomfaceM504).

Figure 2: Recognition rates (%) of the models (RF, GRRF, wsRF, xRF) against the feature dimension of the subdatasets: (a) ORL + eigenface (ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, and ORLEigenM504); (b) ORL + randomface (ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, and ORLRandomM504).

The effect of the unbiased feature selection method is clearly demonstrated in these results.

Table 2 shows the results of the c/s² error bound against the number of codebook sizes on the Caltech and Horse datasets. In a random forest, each tree is grown from a bagged training sample, and out-of-bag estimates were used to evaluate the strength, the correlation, and c/s². The GRRF model was not considered in this experiment, because this method aims to find a small subset of features and the same RF model in the R package [30] is used as its classifier. We compared the xRF model with two kinds of random forest models, RF and wsRF. From this table, we can observe that the lowest c/s² values occurred when the wsRF model was applied to the Caltech dataset.

However, the xRF model produced the lowest error bound on the Horse dataset. These results demonstrate that the new unbiased feature sampling method can reduce the upper bound of the generalization error in random forests.

Table 3 presents the prediction accuracies (mean ± std-dev) of the models on the subdatasets CaltechM3000, HorseM3000, YaleBEigenfaceM504, YaleBrandomfaceM504, ORLEigenfaceM504, and ORLrandomfaceM504. In these experiments, we used the four models to generate random forests with different sizes, from 20 trees to 200 trees. For each size, we used each model to generate 10 random forests for the 10-fold cross-validation and computed the average accuracy of the 10 results. The GRRF model showed slightly better results on YaleBEigenfaceM504 with different tree sizes.


Figure 3: Box plots of the test accuracy (%) of the models (RF, GRRF, wsRF, xRF) on the nine Caltech subdatasets (CaltechM300 to CaltechM15000).

The wsRF model produced the best prediction performance in some cases when applied to the smaller subdatasets YaleBEigenfaceM504, ORLEigenfaceM504, and ORLrandomfaceM504. However, the xRF model produced the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forest models in classification accuracy in most cases across all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy measures changed only slightly when the number of trees was varied.

5.5. Results on Microarray Datasets. Table 5 shows the average test accuracy of the 100 random forest models, computed according to (9), on the gene datasets. The average number of genes selected by the xRF model from the 100 repetitions for each dataset is shown on the right of Table 5, divided into the two groups X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on the two gene datasets srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most of the remaining gene datasets.


Figure 4: Box plots of the AUC measures of the models (RF, GRRF, wsRF, xRF) on the nine Caltech subdatasets.

The detailed results, containing the median and the variance values, are presented in Figure 7 with box plots. Only the GRRF model was used for this comparison; the LASSO logistic regression and varSelRF feature selection methods were not considered in this experiment, because their accuracies are lower than that of the GRRF model, as shown in [17]. We can see that the xRF model achieved the highest average prediction accuracy on nine datasets out of ten. Its result was significantly different on the prostate gene dataset, and its variance was also smaller than those of the other models.

Figure 8 shows the box plots of the c/s² error bound of the RF, wsRF, and xRF models on the ten gene datasets from 100 repetitions. The wsRF model obtained a lower error bound on five of the ten gene datasets. The xRF model produced a significantly different error bound on two gene datasets and obtained the lowest error bound on three datasets. This implies that when the optimal parameters, such as mtry = ⌈√M⌉ and n_min = 1, were used in growing trees, the number of genes in the subspace was not small, out-of-bag data was used in prediction, and the results were comparatively favorable to the xRF model.

5.6. Comparison of Prediction Performance for Various Numbers of Features and Trees. Table 6 shows the average c/s² error bound and test accuracy results of 10 repetitions of the random forest models on the three large datasets. The xRF model produced the lowest c/s² error bound on the La1s dataset.


Figure 5: Box plots of the test accuracy (%) of the models (RF, GRRF, wsRF, xRF) on the nine Horse subdatasets (HorseM300 to HorseM15000).

The wsRF model showed a lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy compared to the other models; this model also produced a large c/s² error when the small subspace size mtry = ⌈log₂(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest prediction accuracy on all three large datasets.

Figure 9 shows the performance curves of the RF models as the number of trees and features increases. The number of trees was increased stepwise by 20 trees, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees was increased stepwise by 50 trees, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features, each consisting of a random sum of five inputs, varied from 5 to 100, and for each, 200 trees were combined. The vertical line in each plot indicates the subspace size mtry = ⌈log₂(M) + 1⌉. This subspace was suggested by Breiman [1] for the case of low-dimensional datasets. The three feature selection methods, namely GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that, when the mtry value is large, the computational time required by the GRRF and varSelRF models to deal with large high-dimensional datasets is too long [17].
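The experiment behind Figure 9 can be imitated with a simple sweep over the forest size and the subspace size; the grid values and helper below are illustrative only (the class column is again assumed to be named Y).

```r
library(randomForest)

# Accuracy curves against the number of trees (fixed mtry = sqrt(M)) and
# against the subspace size mtry (fixed 200 trees).
sweep_curves <- function(train, test,
                         trees  = seq(50, 500, by = 50),
                         mtries = seq(5, 100, by = 5)) {
  M   <- ncol(train) - 1
  acc <- function(nt, mt)
    mean(predict(randomForest(Y ~ ., data = train, ntree = nt, mtry = mt),
                 newdata = test) == test$Y)
  list(by_trees = data.frame(trees,  accuracy = sapply(trees,  function(nt) acc(nt, ceiling(sqrt(M))))),
       by_mtry  = data.frame(mtries, accuracy = sapply(mtries, function(mt) acc(200, mt))))
}
```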


Figure 6: Box plots of the AUC measures of the models (RF, GRRF, wsRF, xRF) on the nine Horse subdatasets.

It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace mtry = ⌈log₂(M) + 1⌉ was used. However, the xRF model is better than the wsRF model at increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve a higher prediction accuracy, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve its best prediction performance. These empirical results indicate that, for applications on high-dimensional data, the results achieved when the xRF model uses a small subspace can be satisfactory.

However, the RF model, which uses the simple sampling method for feature selection [1], could achieve good prediction performance only if it is provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered to be one of the contributions of our work.


Table 2: The c/s² error bound results of the random forest models against the codebook size on the Caltech and Horse datasets. The bold value in each row indicates the best result.

Dataset   Model   300     500     1000    3000    5000    7000    10000   12000   15000
Caltech   xRF     0.312   0.271   0.280   0.287   0.357   0.440   0.650   0.742   0.789
          RF      0.369   0.288   0.294   0.327   0.435   0.592   0.908   1.114   3.611
          wsRF    0.413   0.297   0.268   0.221   0.265   0.333   0.461   0.456   0.789
Horse     xRF     0.266   0.262   0.246   0.277   0.259   0.298   0.275   0.288   0.382
          RF      0.331   0.342   0.354   0.374   0.417   0.463   0.519   0.537   0.695
          wsRF    0.429   0.414   0.391   0.295   0.288   0.333   0.295   0.339   0.455

Figure 7: Box plots of the test accuracy (%) of the models (RF, GRRF, wsRF, xRF) on the ten gene datasets.


Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.

Dataset               Model   K = 20       K = 50       K = 80       K = 100      K = 200
CaltechM3000          xRF     95.50 ± 2    96.50 ± 1    96.50 ± 2    97.00 ± 1    97.50 ± 2
                      RF      70.00 ± 7    76.00 ± 9    77.50 ± 12   82.50 ± 16   81.50 ± 2
                      wsRF    91.50 ± 4    91.00 ± 3    93.00 ± 2    94.50 ± 4    92.00 ± 9
                      GRRF    93.00 ± 2    96.00 ± 2    94.50 ± 2    95.00 ± 3    94.00 ± 2
HorseM3000            xRF     80.59 ± 4    81.76 ± 2    79.71 ± 6    80.29 ± 1    77.65 ± 5
                      RF      50.59 ± 10   52.94 ± 8    56.18 ± 4    58.24 ± 5    57.35 ± 9
                      wsRF    62.06 ± 4    68.82 ± 3    67.65 ± 3    67.65 ± 5    65.88 ± 7
                      GRRF    65.00 ± 9    63.53 ± 3    68.53 ± 3    63.53 ± 9    71.18 ± 4
YaleBEigenfaceM504    xRF     75.68 ± 1    85.65 ± 1    88.08 ± 1    88.94 ± 0    91.22 ± 0
                      RF      71.93 ± 1    79.48 ± 1    80.69 ± 1    81.67 ± 1    82.89 ± 1
                      wsRF    77.60 ± 1    85.61 ± 0    88.11 ± 0    89.31 ± 0    90.68 ± 0
                      GRRF    74.73 ± 0    84.70 ± 1    87.25 ± 0    89.61 ± 0    91.89 ± 0
YaleBrandomfaceM504   xRF     94.71 ± 0    97.64 ± 0    98.01 ± 0    98.22 ± 0    98.59 ± 0
                      RF      88.00 ± 0    92.59 ± 0    94.13 ± 0    94.86 ± 0    96.06 ± 0
                      wsRF    95.40 ± 0    97.90 ± 0    98.17 ± 0    98.14 ± 0    98.38 ± 0
                      GRRF    95.66 ± 0    98.10 ± 0    98.42 ± 0    98.92 ± 0    98.84 ± 0
ORLEigenfaceM504      xRF     76.25 ± 6    87.25 ± 3    91.75 ± 2    93.25 ± 2    94.75 ± 2
                      RF      71.75 ± 2    78.75 ± 4    82.00 ± 3    82.75 ± 3    85.50 ± 5
                      wsRF    78.25 ± 4    88.75 ± 3    90.00 ± 1    91.25 ± 2    92.50 ± 2
                      GRRF    73.50 ± 6    85.00 ± 2    90.00 ± 1    90.75 ± 3    94.75 ± 1
ORLrandomfaceM504     xRF     87.75 ± 3    92.50 ± 2    95.50 ± 1    94.25 ± 1    96.00 ± 1
                      RF      77.50 ± 3    82.00 ± 7    84.50 ± 2    87.50 ± 2    86.00 ± 2
                      wsRF    87.00 ± 5    93.75 ± 2    93.75 ± 0    95.00 ± 1    95.50 ± 1
                      GRRF    87.25 ± 1    93.25 ± 1    94.50 ± 1    94.25 ± 1    95.50 ± 1

Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.

Dataset        Model   K = 20      K = 50      K = 80       K = 100     K = 200
CaltechM3000   xRF     0.995 ± 0   0.999 ± 5   1.00 ± 2     1.00 ± 1    1.00 ± 1
               RF      0.851 ± 7   0.817 ± 4   0.826 ± 12   0.865 ± 6   0.864 ± 1
               wsRF    0.841 ± 1   0.845 ± 8   0.834 ± 7    0.850 ± 8   0.870 ± 9
               GRRF    0.846 ± 1   0.860 ± 2   0.862 ± 1    0.908 ± 1   0.923 ± 1
HorseM3000     xRF     0.849 ± 1   0.887 ± 0   0.895 ± 0    0.898 ± 0   0.897 ± 0
               RF      0.637 ± 4   0.664 ± 7   0.692 ± 15   0.696 ± 3   0.733 ± 9
               wsRF    0.635 ± 8   0.687 ± 4   0.679 ± 6    0.671 ± 4   0.718 ± 9
               GRRF    0.786 ± 3   0.778 ± 3   0.785 ± 8    0.699 ± 1   0.806 ± 4

Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group X_s and the weak group X_w is used in xRF.

Dataset          xRF     RF      wsRF    GRRF    varSelRF   LASSO   X_s    X_w
colon            87.65   84.35   84.50   86.45   76.80      82.00   245    317
srbct            97.71   95.90   96.76   97.57   96.50      99.30   606    546
Leukemia         89.25   82.58   84.83   87.25   89.30      92.40   502    200
Lymphoma         99.30   97.15   98.10   99.10   97.80      99.10   1404   275
breast2class     78.84   62.72   63.40   71.32   61.40      63.40   194    631
breast3class     65.42   56.00   57.19   63.55   58.20      60.00   724    533
nci              74.15   58.85   59.40   63.05   58.20      60.40   247    1345
Brain            81.93   70.79   70.79   74.79   76.90      74.10   1270   1219
Prostate         92.56   88.71   90.79   92.85   91.50      91.20   601    323
Adenocarcinoma   90.88   84.04   84.12   85.52   78.80      81.10   108    669


Table 6: The prediction accuracy and the c/s² error bound of the models using a small subspace mtry = ⌈log₂(M) + 1⌉; better values are in bold.

          c/s² error bound          Test accuracy (%)
Dataset   RF      wsRF    xRF       RF      GRRF    wsRF    xRF      X_s   X_w
Fbis      2.149   1.179   1.209     76.42   76.51   84.14   84.69    201   555
La2s      1.526   0.904   0.780     66.77   67.99   87.26   88.61    353   1136
La1s      4.08    0.892   1.499     77.76   80.49   86.03   87.21    220   1532

Figure 8: Box plots of the c/s² error bound of the RF, wsRF, and xRF models applied to the 10 gene datasets.


6. Conclusions

We have presented a new feature subspace selection method for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach for unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (or noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems, especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.

For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.


Figure 9: The accuracy of prediction of the three random forest models (RF, wsRF, xRF) against the number of trees (left) and the number of features in the subspace (right) on the three datasets: (a) Fbis, (b) La2s, (c) La1s. The vertical line in each right-hand plot marks the subspace size mtry = ⌈log₂(M) + 1⌉.




Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision", funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 10201-201117.

References

[1] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.

[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.

[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.

[5] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.

[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.

[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.

[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.

[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.

[10] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.

[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.

[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.

[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.

[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.

[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.

[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.

[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.

[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.

[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.

[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.

[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.

[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.

[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.

[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.

[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.

[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.

[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.

[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.

[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.

[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.

[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: lasso and elastic-net regularized generalized linear models," R package version 1-1, 2010, http://CRAN.R-project.org/package=glmnet.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 7: Research Article Unbiased Feature Selection in …downloads.hindawi.com/journals/tswj/2015/471371.pdfResearch Article Unbiased Feature Selection in Learning Random Forests for High-Dimensional

The Scientific World Journal 7

datasets were taken from the archive of the LosAngeles Timesfor TREC-5 (httptrecnistgov) The ten gene datasets areused and described in [11 17] they are always high dimen-sional and fall within a category of classification problemswhich deal with large number of features and small samplesRegarding the characteristics of the datasets given in Table 1the proportion of the subdatasets namely Fbis La1s La2swas used individually for a training and testing dataset

52 Evaluation Methods We calculated some measures suchas error bound (1198881199042) strength (119904) and correlation (120588)according to the formulas given in Breimanrsquos method [1]The correlation measures indicate the independence of treesin a forest whereas the average strength corresponds to theaccuracy of individual trees Lower correlation and higherstrength result in a reduction of general error bound mea-sured by (1198881199042) which indicates a high accuracy RF model

The twomeasures are also used to evaluate the accuracy ofprediction on the test datasets one is the area under the curve(AUC) and the other one is the test accuracy (Acc) definedas

Acc = 1

119873

119873

sum

119894=1

119868 (119876 (119889119894 119910119894) minusmax119895 =119910119894

119876 (119889119894 119895) gt 0) (9)

where 119868(sdot) is the indicator function and 119876(119889119894 119895) =

sum119870

119896=1119868(ℎ119896(119889119894) = 119895) is the number of votes for 119889

119894isin D119905on class

119895 ℎ119896is the 119896th tree classifier 119873 is the number of samples in

test data D119905 and 119910

119894indicates the true class of 119889

119894

53 Experimental Settings The latest 119877-packages randomForest and RRF [29 30] were used in 119877 environment toconduct these experimentsTheGRRFmodel was available inthe RRF 119877-package The wsRF model which used weightedsampling method [13] was intended to solve classificationproblems For the image datasets the 10-fold cross-validationwas used to evaluate the prediction performance of the mod-els From each fold we built the models with 500 trees andthe feature partition for subspace selection in Algorithm 2was recalculated on each training fold dataset The119898119905119903119910 and119899min parameters were set to radic119872 and 1 respectively Theexperimental results were evaluated in two measures AUCand the test accuracy according to (9)

We compared across awide range the performances of the10 gene datasets used in [11]The results from the applicationof GRRF varSelRF and LASSO logistic regression on theten gene datasets are presented in [17] These three geneselection methods used RF 119877-package [30] as the classifierFor the comparison of themethods we used the same settingswhich are presented in [17] for the coefficient 120574 we usedvalue of 01 because GR-RF(01) has shown competitiveaccuracy [17] when applied to the 10 gene datasets The100 models were generated with different seeds from eachtraining dataset and each model contained 1000 trees The119898119905119903119910 and 119899min parameters were of the same settings on theimage dataset From each of the datasets two-thirds of thedata were randomly selected for training The other one-third of the dataset was used to validate the models For

comparison Breimanrsquos RF method the weighted samplingrandom forest wsRF model and the xRF model were usedin the experiments The guided regularized random forestGRRF [17] and the twowell-known feature selectionmethodsusing RF as a classifier namely varSelRF [31] and LASSOlogistic regression [32] are also used to evaluate the accuracyof prediction on high-dimensional datasets

In the remaining datasets the prediction performancesof the ten random forest models were evaluated each onewas built with 500 trees The number of features candidatesto split a node was119898119905119903119910 = lceillog

2(119872) + 1rceil The minimal node

size 119899min was 1The xRFmodel with the new unbiased featuresampling method is a new implementationWe implementedthe xRF model as multithread processes while other modelswere run as single-thread processes We used 119877 to callthe corresponding CC++ functions All experiments wereconducted on the six 64-bit Linux machines with each onebeing equipped with Intel 119877Xeon 119877CPU E5620 240GHz 16cores 4MB cache and 32GB main memory

54 Results on Image Datasets Figures 1 and 2 show theaverage accuracy plots of recognition rates of the modelson different subdatasets of the datasets 119884119886119897119890119861 and 119874119877119871The GRRF model produced slightly better results on thesubdataset ORLRandomM120 and ORL dataset using eigen-face and showed competitive accuracy performance withthe xRF model on some cases in both 119884119886119897119890119861 and ORLdatasets for example YaleBEigenM120 ORLRandomM56andORLRandomM120 The reason could be that truly infor-mative features in this kind of datasets were manyThereforewhen the informative feature set was large the chance ofselecting informative features in the subspace increasedwhich in turn increased the average recognition rates of theGRRF model However the xRF model produced the bestresults in the remaining casesThe effect of the new approachfor feature subspace selection is clearly demonstrated in theseresults although these datasets are not high dimensional

Figures 3 and 5 present the box plots of the test accuracy(mean plusmn std-dev) Figures 4 and 6 show the box plots ofthe AUCmeasures of the models on the 18 image subdatasetsof the Caltech and Horse respectively From these figureswe can observe that the accuracy and the AUC measuresof the models GRRF wsRF and xRF were increased on allhigh-dimensional subdatasets when the selected subspace119898119905119903119910 was not so large This implies that when the numberof features in the subspace is small the proportion of theinformative features in the feature subspace is comparativelylarge in the three models There will be a high chance thathighly informative features are selected in the trees so theoverall performance of individual trees is increased In Brie-manrsquos method many randomly selected subspaces may notcontain informative features which affect the performanceof trees grown from these subspaces It can be seen thatthe xRF model outperformed other random forests modelson these subdatasets in increasing the test accuracy and theAUC measures This was because the new unbiased featuresampling was used in generating trees in the xRF modelthe feature subspace provided enough highly informative

8 The Scientific World Journal

825

850

875

900

925

100 200 300 400 500Feature dimension of subdatasets

Reco

gniti

on ra

te (

)

MethodsRFGRRF

wsRFxRF

YaleB + eigenface

(a)

MethodsRFGRRF

wsRFxRF

85

90

95

100 200 300 400 500Feature dimension of subdatasets

Reco

gniti

on ra

te (

)

YaleB + randomface

(b)

Figure 1 Recognition rates of themodels on the YaleB subdatasets namely YaleBEigenfaceM30 YaleBEigenfaceM56 YaleBEigenfaceM120YaleBEigenfaceM504 and YaleBRandomfaceM30 YaleBRandomfaceM56 YaleBRandomfaceM120 and YaleBRandomfaceM504

850

875

900

925

950

100 200 300 400 500Feature dimension of subdatasets

Reco

gniti

on ra

te (

)

ORL + eigenface

MethodsRFGRRF

wsRFxRF

(a)

850

875

900

925

950

100 200 300 400 500Feature dimension of subdatasets

Reco

gniti

on ra

te (

)

ORL + randomface

MethodsRFGRRF

wsRFxRF

(b)

Figure 2 Recognition rates of the models on the ORL subdatasets namely ORLEigenfaceM30 ORLEigenM56 ORLEigenM120ORLEigenM504 and ORLRandomfaceM30 ORLRandomM56 ORLRandomM120 and ORLRandomM504

features at any levels of the decision trees The effect of theunbiased feature selection method is clearly demonstrated inthese results

Table 2 shows the results of 1198881199042 against the numberof codebook sizes on the Caltech and Horse datasets In arandom forest the tree was grown from a bagging trainingdata Out-of-bag estimates were used to evaluate the strengthcorrelation and 1198881199042 The GRRF model was not consideredin this experiment because this method aims to find a smallsubset of features and the same RF model in 119877-package [30]is used as a classifier We compared the xRF model withtwo kinds of random forest models RF and wsRF From thistable we can observe that the lowest 1198881199042 values occurredwhen the wsRF model was applied to the Caltech dataset

However the xRFmodel produced the lowest error bound onthe119867119900119903119904119890 dataset These results demonstrate the reason thatthe new unbiased feature sampling method can reduce theupper bound of the generalization error in random forests

Table 3 presents the prediction accuracies (mean plusmn

std-dev) of the models on subdatasets CaltechM3000HorseM3000 YaleBEigenfaceM504 YaleBrandomfaceM504ORLEigenfaceM504 and ORLrandomfaceM504 In theseexperiments we used the four models to generate randomforests with different sizes from 20 trees to 200 trees Forthe same size we used each model to generate 10 ran-dom forests for the 10-fold cross-validation and computedthe average accuracy of the 10 results The GRRF modelshowed slightly better results on YaleBEigenfaceM504 with

The Scientific World Journal 9

70

80

90

100Ac

cura

cy (

)

70

80

90

100

Accu

racy

()

75

80

85

90

95

100

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM7000

RF GRRF wsRF xRFCaltechM15000

RF GRRF wsRF xRFCaltechM12000

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM5000

RF GRRF wsRF xRFCaltechM3000

RF GRRF wsRF xRFCaltechM500

RF GRRF wsRF xRFCaltechM300

Accu

racy

()

70

80

90

100

Accu

racy

()

75

80

85

90

95

100Ac

cura

cy (

)

70

80

90

100

Accu

racy

()

70

80

90

100

Accu

racy

()

60

70

80

90

100

Accu

racy

()

50

60

70

80

90Ac

cura

cy (

)

Figure 3 Box plots the test accuracy of the nine Caltech subdatasets

different tree sizes The wsRF model produced the bestprediction performance on some cases when applied to smallsubdatasets YaleBEigenfaceM504 ORLEigenfaceM504 andORLrandomfaceM504 However the xRF model producedrespectively the highest test accuracy on the remaining sub-datasets andAUCmeasures on high-dimensional subdatasetsCaltechM3000 and HorseM3000 as shown in Tables 3 and4 We can clearly see that the xRF model also outperformedother random forests models in classification accuracy onmost cases in all image datasets Another observation is thatthe new method is more stable in classification performancebecause the mean and variance of the test accuracy measureswere minor changed when varying the number of trees

55 Results on Microarray Datasets Table 5 shows the aver-age test results in terms of accuracy of the 100 random forestmodels computed according to (9) on the gene datasets Theaverage number of genes selected by the xRFmodel from 100repetitions for each dataset is shown on the right of Table 5divided into two groups X

119904(strong) and X

119908(weak) These

genes are used by the unbiased feature sampling method forgrowing trees in the xRF model LASSO logistic regressionwhich uses the RF model as a classifier showed fairly goodaccuracy on the two gene datasets srbct and leukemia TheGRRF model produced slightly better result on the prostategene dataset However the xRF model produced the bestaccuracy on most cases of the remaining gene datasets

10 The Scientific World Journal

085

090

095

100AU

C

075

080

085

090

095

100

AUC

085

090

095

100

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM7000

RF GRRF wsRF xRFCaltechM15000

RF GRRF wsRF xRFCaltechM12000

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM5000

RF GRRF wsRF xRFCaltechM3000

RF GRRF wsRF xRFCaltechM500

RF GRRF wsRF xRFCaltechM300

AUC

08

09

10

AUC

094

096

098

100AU

C

094

096

098

100

AUC

092

094

096

098

100

AUC

090

095

100

AUC

07

08

09

10AU

C

Figure 4 Box plots of the AUC measures of the nine Caltech subdatasets

The detailed results containing the median and thevariance values are presented in Figure 7 with box plotsOnly the GRRF model was used for this comparison theLASSO logistic regression and varSelRF method for featureselection were not considered in this experiment becausetheir accuracies are lower than that of the GRRF model asshown in [17] We can see that the xRF model achieved thehighest average accuracy of prediction on nine datasets out often Its result was significantly different on the prostate genedataset and the variance was also smaller than those of theother models

Figure 8 shows the box plots of the (1198881199042) error bound ofthe RF wsRF and xRF models on the ten gene datasets from100 repetitionsThe wsRF model obtained lower error bound

rate on five gene datasets out of 10 The xRF model produceda significantly different error bound rate on two gene datasetsand obtained the lowest error rate on three datasets Thisimplies that when the optimal parameters such as 119898119905119903119910 =

lceilradic119872rceil and 119899min = 1 were used in growing trees the numberof genes in the subspace was not small and out-of-bag datawas used in prediction and the results were comparativelyfavored to the xRF model

56 Comparison of Prediction Performance for Various Num-bers of Features and Trees Table 6 shows the average 1198881199042error bound and accuracy test results of 10 repetitions ofrandom forest models on the three large datasets The xRFmodel produced the lowest error 1198881199042 on the dataset La1s

The Scientific World Journal 11

60

70

80

Accu

racy

()

60

70

80

Accu

racy

()

70

80

90

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM7000

RF GRRF wsRF xRFHorseM15000

RF GRRF wsRF xRFHorseM12000

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM5000

RF GRRF wsRF xRFHorseM3000

RF GRRF wsRF xRFHorseM500

RF GRRF wsRF xRFHorseM300

Accu

racy

()

60

70

80

Accu

racy

()

60

70

80

90

Accu

racy

()

60

70

80

90

Accu

racy

()

70

80

90

Accu

racy

()

60

70

80

Accu

racy

()

60

70

80

Accu

racy

()

Figure 5 Box plots of the test accuracy of the nine Horse subdatasets

while the wsRF model showed the lower error bound onother two datasets Fbis andLa2sTheRFmodel demonstratedthe worst accuracy of prediction compared to the othermodels this model also produced a large 1198881199042 error whenthe small subspace size 119898119905119903119910 = lceillog

2(119872) + 1rceil was used to

build trees on the La1s and La2s datasets The number offeatures in the X

119904and X

119908columns on the right of Table 6

was used in the xRF model We can see that the xRF modelachieved the highest accuracy of prediction on all three largedatasets

Figure 9 shows the plots of the performance curves of theRF models when the number of trees and features increasesThe number of trees was increased stepwise by 20 treesfrom 20 to 200 when the models were applied to the La1s

dataset For the remaining data sets the number of treesincreased stepwise by 50 trees from 50 to 500 The numberof random features in a subspace was set to 119898119905119903119910 = lceilradic119872rceilThe number of features each consisting of a random sumof five inputs varied from 5 to 100 and for each 200 treeswere combined The vertical line in each plot indicates thesize of a subspace of features 119898119905119903119910 = lceillog

2(119872) + 1rceil

This subspace was suggested by Breiman [1] for the case oflow-dimensional datasets Three feature selection methodsnamely GRRF varSelRF and LASSO were not considered inthis experimentThemain reason is that when the119898119905119903119910 valueis large the computational time of the GRRF and varSelRFmodels required to deal with large high datasets was too long[17]

12 The Scientific World Journal

06

07

08

09AU

C

065

070

075

080

085

090

AUC

070

075

080

085

090

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM7000

RF GRRF wsRF xRFHorseM15000

RF GRRF wsRF xRFHorseM12000

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM5000

RF GRRF wsRF xRFHorseM3000

RF GRRF wsRF xRFHorseM500

RF GRRF wsRF xRFHorseM300

AUC

06

07

08

09

AUC

07

08

09AU

C

06

07

08

09

AUC

07

08

09

AUC

05

06

07

08

09

AUC

065

070

075

080

085

AUC

Figure 6 Box plots of the AUC measures of the nine Horse subdatasets

It can be seen that the xRF and wsRF models alwaysprovided good results and achieved higher prediction accu-racies when the subspace 119898119905119903119910 = lceillog

2(119872) + 1rceil was used

However the xRF model is better than the wsRF model inincreasing the prediction accuracy on the three classificationdatasetsThe RFmodel requires the larger number of featuresto achieve the higher accuracy of prediction as shown in theright of Figures 9(a) and 9(b) When the number of treesin a forests was varied the xRF model produced the bestresults on the Fbis and La2s datasets In the La1s datasetwhere the xRF model did not obtain the best results asshown in Figure 9(c) (left) the differences from the bestresults were minor From the right of Figures 9(a) 9(b)and 9(c) we can observe that the xRF model does not need

many features in the selected subspace to achieve the bestprediction performanceThese empirical results indicate thatfor application on high-dimensional data when the xRFmodel uses the small subspace the achieved results can besatisfactory

However the RF model using the simple samplingmethod for feature selection [1] could achieve good predic-tion performance only if it is provided with a much largersubspace as shown in the right part of Figures 9(a) and 9(b)Breiman suggested to use a subspace of size 119898119905119903119910 = radic119872 inclassification problemWith this size the computational timefor building a random forest is still too high especially forlarge high datasets In general when the xRF model is usedwith a feature subspace of the same size as the one suggested

The Scientific World Journal 13

Table 2 The (1198881199042) error bound results of random forest models against the number of codebook size on the Caltech and Horse datasetsThe bold value in each row indicates the best result

Dataset Model 300 500 1000 3000 5000 7000 10000 12000 15000

CaltechxRF 0312 0271 0280 0287 0357 0440 0650 0742 0789RF 0369 0288 0294 0327 0435 0592 0908 1114 3611

wsRF 0413 0297 0268 0221 0265 0333 0461 0456 0789

HorsexRF 0266 0262 0246 0277 0259 0298 0275 0288 0382RF 0331 0342 0354 0374 0417 0463 0519 0537 0695

wsRF 0429 0414 0391 0295 0288 0333 0295 0339 0455

70

80

90

100

RF GRRF wsRF xRFColon

Accu

racy

()

70

80

90

100

RF GRRF wsRF xRFSrbct

Accu

racy

()

50

60

70

80

90

100

RF GRRF wsRF xRFLeukemia

Accu

racy

()

75

80

85

90

95

100

RF GRRF wsRF xRFLymphoma

Accu

racy

()

50

60

70

80

90

RF GRRF wsRF xRFBreast2class

Accu

racy

()

40

50

60

70

80

RF GRRF wsRF xRFBreast3class

Accu

racy

()

40

60

80

100

RF GRRF wsRF xRFnci

Accu

racy

()

40

60

80

100

RF GRRF wsRF xRFBrain

Accu

racy

()

80

90

100

RF GRRF wsRF xRFProstate

Accu

racy

()

70

80

90

100

RF GRRF wsRF xRFAdenocarcinoma

Accu

racy

()

Figure 7 Box plots of test accuracy of the models on the ten gene datasets

14 The Scientific World Journal

Table 3 The prediction test accuracy (mean plusmn std-dev) of the models on the image datasets against the number of trees 119870 The numberof feature dimensions in each subdataset is fixed Numbers in bold are the best results

Dataset               Model   K = 20        K = 50        K = 80        K = 100       K = 200
CaltechM3000          xRF     95.50 ± 2     96.50 ± 1     96.50 ± 2     97.00 ± 1     97.50 ± 2
                      RF      70.00 ± 7     76.00 ± 9     77.50 ± 12    82.50 ± 16    81.50 ± 2
                      wsRF    91.50 ± 4     91.00 ± 3     93.00 ± 2     94.50 ± 4     92.00 ± 9
                      GRRF    93.00 ± 2     96.00 ± 2     94.50 ± 2     95.00 ± 3     94.00 ± 2
HorseM3000            xRF     80.59 ± 4     81.76 ± 2     79.71 ± 6     80.29 ± 1     77.65 ± 5
                      RF      50.59 ± 10    52.94 ± 8     56.18 ± 4     58.24 ± 5     57.35 ± 9
                      wsRF    62.06 ± 4     68.82 ± 3     67.65 ± 3     67.65 ± 5     65.88 ± 7
                      GRRF    65.00 ± 9     63.53 ± 3     68.53 ± 3     63.53 ± 9     71.18 ± 4
YaleBEigenfaceM504    xRF     75.68 ± 1     85.65 ± 1     88.08 ± 1     88.94 ± 0     91.22 ± 0
                      RF      71.93 ± 1     79.48 ± 1     80.69 ± 1     81.67 ± 1     82.89 ± 1
                      wsRF    77.60 ± 1     85.61 ± 0     88.11 ± 0     89.31 ± 0     90.68 ± 0
                      GRRF    74.73 ± 0     84.70 ± 1     87.25 ± 0     89.61 ± 0     91.89 ± 0
YaleBrandomfaceM504   xRF     94.71 ± 0     97.64 ± 0     98.01 ± 0     98.22 ± 0     98.59 ± 0
                      RF      88.00 ± 0     92.59 ± 0     94.13 ± 0     94.86 ± 0     96.06 ± 0
                      wsRF    95.40 ± 0     97.90 ± 0     98.17 ± 0     98.14 ± 0     98.38 ± 0
                      GRRF    95.66 ± 0     98.10 ± 0     98.42 ± 0     98.92 ± 0     98.84 ± 0
ORLEigenfaceM504      xRF     76.25 ± 6     87.25 ± 3     91.75 ± 2     93.25 ± 2     94.75 ± 2
                      RF      71.75 ± 2     78.75 ± 4     82.00 ± 3     82.75 ± 3     85.50 ± 5
                      wsRF    78.25 ± 4     88.75 ± 3     90.00 ± 1     91.25 ± 2     92.50 ± 2
                      GRRF    73.50 ± 6     85.00 ± 2     90.00 ± 1     90.75 ± 3     94.75 ± 1
ORLrandomfaceM504     xRF     87.75 ± 3     92.50 ± 2     95.50 ± 1     94.25 ± 1     96.00 ± 1
                      RF      77.50 ± 3     82.00 ± 7     84.50 ± 2     87.50 ± 2     86.00 ± 2
                      wsRF    87.00 ± 5     93.75 ± 2     93.75 ± 0     95.00 ± 1     95.50 ± 1
                      GRRF    87.25 ± 1     93.25 ± 1     94.50 ± 1     94.25 ± 1     95.50 ± 1

Table 4: AUC results (mean ± std-dev) of random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.

Dataset        Model   K = 20      K = 50      K = 80      K = 100     K = 200
CaltechM3000   xRF     99.5 ± 0    99.9 ± 5    100 ± 2     100 ± 1     100 ± 1
               RF      85.1 ± 7    81.7 ± 4    82.6 ± 12   86.5 ± 6    86.4 ± 1
               wsRF    84.1 ± 1    84.5 ± 8    83.4 ± 7    85.0 ± 8    87.0 ± 9
               GRRF    84.6 ± 1    86.0 ± 2    86.2 ± 1    90.8 ± 1    92.3 ± 1
HorseM3000     xRF     84.9 ± 1    88.7 ± 0    89.5 ± 0    89.8 ± 0    89.7 ± 0
               RF      63.7 ± 4    66.4 ± 7    69.2 ± 15   69.6 ± 3    73.3 ± 9
               wsRF    63.5 ± 8    68.7 ± 4    67.9 ± 6    67.1 ± 4    71.8 ± 9
               GRRF    78.6 ± 3    77.8 ± 3    78.5 ± 8    69.9 ± 1    80.6 ± 4

Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group Xs and the weak group Xw used in xRF is shown on the right.

Dataset          xRF     RF      wsRF    GRRF    varSelRF   LASSO   Xs     Xw
colon            87.65   84.35   84.50   86.45   76.80      82.00   245    317
srbct            97.71   95.90   96.76   97.57   96.50      99.30   606    546
Leukemia         89.25   82.58   84.83   87.25   89.30      92.40   502    200
Lymphoma         99.30   97.15   98.10   99.10   97.80      99.10   1404   275
breast2class     78.84   62.72   63.40   71.32   61.40      63.40   194    631
breast3class     65.42   56.00   57.19   63.55   58.20      60.00   724    533
nci              74.15   58.85   59.40   63.05   58.20      60.40   247    1345
Brain            81.93   70.79   70.79   74.79   76.90      74.10   1270   1219
Prostate         92.56   88.71   90.79   92.85   91.50      91.20   601    323
Adenocarcinoma   90.88   84.04   84.12   85.52   78.80      81.10   108    669


Table 6: The accuracy of prediction and the c/s² error bound of the models using a small subspace mtry = [log2(M) + 1]; better values are in bold.

Dataset   c/s² error bound           Test accuracy (%)                  Xs    Xw
          RF      wsRF    xRF        RF      GRRF    wsRF    xRF
Fbis      2.149   1.179   1.209      76.42   76.51   84.14   84.69      201   555
La2s      1.526   0.904   0.780      66.77   67.99   87.26   88.61      353   1136
La1s      4.08    0.892   1.499      77.76   80.49   86.03   87.21      220   1532

Figure 8: Box plots of the c/s² error bound for the RF, wsRF, and xRF models applied to the 10 gene datasets (Colon, Srbct, Leukemia, Lymphoma, Breast2class, Breast3class, nci, Brain, Prostate, and Adenocarcinoma); the y-axis shows the c/s² error bound.

by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered to be one of the contributions of our work.
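To make the subspace-size comparison concrete, the sketch below is a minimal illustration only: it uses scikit-learn's RandomForestClassifier on synthetic data as a stand-in for the RF baseline (it is not the authors' xRF implementation, and the data dimensions are assumptions), and it contrasts the √M subspace with the much smaller log2(M) + 1 subspace discussed above.

# Minimal sketch: compare two subspace sizes for a standard random forest.
# Assumptions: scikit-learn as a stand-in for the RF baseline; synthetic data
# in place of the paper's datasets. This is not the authors' xRF algorithm.
import math
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic high-dimensional data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=5000,
                           n_informative=50, random_state=42)
M = X.shape[1]

for label, mtry in [("sqrt(M)", int(math.ceil(math.sqrt(M)))),
                    ("log2(M)+1", int(math.floor(math.log2(M))) + 1)]:
    start = time.time()
    rf = RandomForestClassifier(n_estimators=200, max_features=mtry,
                                n_jobs=-1, random_state=42)
    acc = cross_val_score(rf, X, y, cv=5).mean()
    print(f"mtry = {label:>10} ({mtry:4d} features): "
          f"mean CV accuracy {acc:.3f}, elapsed {time.time() - start:.1f}s")

On data of this shape the log2(M) + 1 forest trains in a small fraction of the time; whether it also matches the larger subspace in accuracy depends on how the informative features are distributed, which is the gap the feature-weighting scheme in xRF is meant to close.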

6. Conclusions

We have presented a new method of feature subspace selection for building an efficient random forest xRF model for classifying high-dimensional data. Our main contribution is a new approach for unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality by using a defined threshold to remove uninformative features (or noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,


Figure 9: The accuracy of prediction of the three random forest models (RF, wsRF, xRF) against the number of trees (left panels) and the number of features in the subspace (right panels) on the three datasets: (a) Fbis, (b) La2s, and (c) La1s. The vertical line in each right panel marks the subspace size log2(M) + 1.


especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.
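As a rough illustration of the idea summarized above (and of the strong and weak gene groups Xs and Xw reported in Tables 5 and 6), the sketch below draws a node subspace mostly from a strong group and partly from a weak group after a p-value screen. The threshold, the 50/50 group split, and the 80/20 sampling ratio are placeholder assumptions for illustration only; they are not the authors' exact procedure or parameter values.

# Illustrative sketch of subspace sampling from strong (Xs) and weak (Xw)
# feature groups. All numeric choices below are assumptions, not the paper's.
import numpy as np

def sample_subspace(p_values, mtry, threshold=0.05, strong_share=0.8, seed=None):
    """Pick about mtry feature indices, drawn mostly from the strong group."""
    rng = np.random.default_rng(seed)
    informative = np.flatnonzero(p_values < threshold)    # drop likely noise
    if informative.size == 0:                             # fall back: keep all
        informative = np.arange(p_values.size)
    order = informative[np.argsort(p_values[informative])]
    n_strong = max(1, order.size // 2)                    # split into Xs / Xw
    strong, weak = order[:n_strong], order[n_strong:]
    k_strong = min(strong.size, max(1, int(round(mtry * strong_share))))
    k_weak = min(weak.size, mtry - k_strong)
    picked = [rng.choice(strong, size=k_strong, replace=False)]
    if k_weak > 0:
        picked.append(rng.choice(weak, size=k_weak, replace=False))
    return np.concatenate(picked)

# Example: 5000 features, subspace of size ceil(sqrt(M)) = 71.
p_vals = np.random.default_rng(0).uniform(size=5000)
print(sample_subspace(p_vals, mtry=71, seed=0))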

For future work, we think it would be desirable to increase the scalability of the proposed random forests algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 102.01-2011.17.

References

[1] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.

[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.

[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.

[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.

[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.

[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.

[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.

[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.

[10] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.

[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.

[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.

[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.

[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.

[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.

[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.

[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.

[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.

[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.

[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.

[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.

[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.

[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.

[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.

[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.

[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.

[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.

[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.

[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.

[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.

[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: Lasso and elastic-net regularized generalized linear models," R package version, pages 1-1, 2010, http://CRAN.R-project.org/package=glmnet.


Page 8: Research Article Unbiased Feature Selection in …downloads.hindawi.com/journals/tswj/2015/471371.pdfResearch Article Unbiased Feature Selection in Learning Random Forests for High-Dimensional

8 The Scientific World Journal

825

850

875

900

925

100 200 300 400 500Feature dimension of subdatasets

Reco

gniti

on ra

te (

)

MethodsRFGRRF

wsRFxRF

YaleB + eigenface

(a)

MethodsRFGRRF

wsRFxRF

85

90

95

100 200 300 400 500Feature dimension of subdatasets

Reco

gniti

on ra

te (

)

YaleB + randomface

(b)

Figure 1 Recognition rates of themodels on the YaleB subdatasets namely YaleBEigenfaceM30 YaleBEigenfaceM56 YaleBEigenfaceM120YaleBEigenfaceM504 and YaleBRandomfaceM30 YaleBRandomfaceM56 YaleBRandomfaceM120 and YaleBRandomfaceM504

850

875

900

925

950

100 200 300 400 500Feature dimension of subdatasets

Reco

gniti

on ra

te (

)

ORL + eigenface

MethodsRFGRRF

wsRFxRF

(a)

850

875

900

925

950

100 200 300 400 500Feature dimension of subdatasets

Reco

gniti

on ra

te (

)

ORL + randomface

MethodsRFGRRF

wsRFxRF

(b)

Figure 2 Recognition rates of the models on the ORL subdatasets namely ORLEigenfaceM30 ORLEigenM56 ORLEigenM120ORLEigenM504 and ORLRandomfaceM30 ORLRandomM56 ORLRandomM120 and ORLRandomM504

features at any levels of the decision trees The effect of theunbiased feature selection method is clearly demonstrated inthese results

Table 2 shows the results of 1198881199042 against the numberof codebook sizes on the Caltech and Horse datasets In arandom forest the tree was grown from a bagging trainingdata Out-of-bag estimates were used to evaluate the strengthcorrelation and 1198881199042 The GRRF model was not consideredin this experiment because this method aims to find a smallsubset of features and the same RF model in 119877-package [30]is used as a classifier We compared the xRF model withtwo kinds of random forest models RF and wsRF From thistable we can observe that the lowest 1198881199042 values occurredwhen the wsRF model was applied to the Caltech dataset

However the xRFmodel produced the lowest error bound onthe119867119900119903119904119890 dataset These results demonstrate the reason thatthe new unbiased feature sampling method can reduce theupper bound of the generalization error in random forests

Table 3 presents the prediction accuracies (mean plusmn

std-dev) of the models on subdatasets CaltechM3000HorseM3000 YaleBEigenfaceM504 YaleBrandomfaceM504ORLEigenfaceM504 and ORLrandomfaceM504 In theseexperiments we used the four models to generate randomforests with different sizes from 20 trees to 200 trees Forthe same size we used each model to generate 10 ran-dom forests for the 10-fold cross-validation and computedthe average accuracy of the 10 results The GRRF modelshowed slightly better results on YaleBEigenfaceM504 with

The Scientific World Journal 9

70

80

90

100Ac

cura

cy (

)

70

80

90

100

Accu

racy

()

75

80

85

90

95

100

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM7000

RF GRRF wsRF xRFCaltechM15000

RF GRRF wsRF xRFCaltechM12000

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM5000

RF GRRF wsRF xRFCaltechM3000

RF GRRF wsRF xRFCaltechM500

RF GRRF wsRF xRFCaltechM300

Accu

racy

()

70

80

90

100

Accu

racy

()

75

80

85

90

95

100Ac

cura

cy (

)

70

80

90

100

Accu

racy

()

70

80

90

100

Accu

racy

()

60

70

80

90

100

Accu

racy

()

50

60

70

80

90Ac

cura

cy (

)

Figure 3 Box plots the test accuracy of the nine Caltech subdatasets

different tree sizes The wsRF model produced the bestprediction performance on some cases when applied to smallsubdatasets YaleBEigenfaceM504 ORLEigenfaceM504 andORLrandomfaceM504 However the xRF model producedrespectively the highest test accuracy on the remaining sub-datasets andAUCmeasures on high-dimensional subdatasetsCaltechM3000 and HorseM3000 as shown in Tables 3 and4 We can clearly see that the xRF model also outperformedother random forests models in classification accuracy onmost cases in all image datasets Another observation is thatthe new method is more stable in classification performancebecause the mean and variance of the test accuracy measureswere minor changed when varying the number of trees

55 Results on Microarray Datasets Table 5 shows the aver-age test results in terms of accuracy of the 100 random forestmodels computed according to (9) on the gene datasets Theaverage number of genes selected by the xRFmodel from 100repetitions for each dataset is shown on the right of Table 5divided into two groups X

119904(strong) and X

119908(weak) These

genes are used by the unbiased feature sampling method forgrowing trees in the xRF model LASSO logistic regressionwhich uses the RF model as a classifier showed fairly goodaccuracy on the two gene datasets srbct and leukemia TheGRRF model produced slightly better result on the prostategene dataset However the xRF model produced the bestaccuracy on most cases of the remaining gene datasets

10 The Scientific World Journal

085

090

095

100AU

C

075

080

085

090

095

100

AUC

085

090

095

100

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM7000

RF GRRF wsRF xRFCaltechM15000

RF GRRF wsRF xRFCaltechM12000

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM5000

RF GRRF wsRF xRFCaltechM3000

RF GRRF wsRF xRFCaltechM500

RF GRRF wsRF xRFCaltechM300

AUC

08

09

10

AUC

094

096

098

100AU

C

094

096

098

100

AUC

092

094

096

098

100

AUC

090

095

100

AUC

07

08

09

10AU

C

Figure 4 Box plots of the AUC measures of the nine Caltech subdatasets

The detailed results containing the median and thevariance values are presented in Figure 7 with box plotsOnly the GRRF model was used for this comparison theLASSO logistic regression and varSelRF method for featureselection were not considered in this experiment becausetheir accuracies are lower than that of the GRRF model asshown in [17] We can see that the xRF model achieved thehighest average accuracy of prediction on nine datasets out often Its result was significantly different on the prostate genedataset and the variance was also smaller than those of theother models

Figure 8 shows the box plots of the (1198881199042) error bound ofthe RF wsRF and xRF models on the ten gene datasets from100 repetitionsThe wsRF model obtained lower error bound

rate on five gene datasets out of 10 The xRF model produceda significantly different error bound rate on two gene datasetsand obtained the lowest error rate on three datasets Thisimplies that when the optimal parameters such as 119898119905119903119910 =

lceilradic119872rceil and 119899min = 1 were used in growing trees the numberof genes in the subspace was not small and out-of-bag datawas used in prediction and the results were comparativelyfavored to the xRF model

56 Comparison of Prediction Performance for Various Num-bers of Features and Trees Table 6 shows the average 1198881199042error bound and accuracy test results of 10 repetitions ofrandom forest models on the three large datasets The xRFmodel produced the lowest error 1198881199042 on the dataset La1s

The Scientific World Journal 11

60

70

80

Accu

racy

()

60

70

80

Accu

racy

()

70

80

90

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM7000

RF GRRF wsRF xRFHorseM15000

RF GRRF wsRF xRFHorseM12000

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM5000

RF GRRF wsRF xRFHorseM3000

RF GRRF wsRF xRFHorseM500

RF GRRF wsRF xRFHorseM300

Accu

racy

()

60

70

80

Accu

racy

()

60

70

80

90

Accu

racy

()

60

70

80

90

Accu

racy

()

70

80

90

Accu

racy

()

60

70

80

Accu

racy

()

60

70

80

Accu

racy

()

Figure 5 Box plots of the test accuracy of the nine Horse subdatasets

while the wsRF model showed the lower error bound onother two datasets Fbis andLa2sTheRFmodel demonstratedthe worst accuracy of prediction compared to the othermodels this model also produced a large 1198881199042 error whenthe small subspace size 119898119905119903119910 = lceillog

2(119872) + 1rceil was used to

build trees on the La1s and La2s datasets The number offeatures in the X

119904and X

119908columns on the right of Table 6

was used in the xRF model We can see that the xRF modelachieved the highest accuracy of prediction on all three largedatasets

Figure 9 shows the plots of the performance curves of theRF models when the number of trees and features increasesThe number of trees was increased stepwise by 20 treesfrom 20 to 200 when the models were applied to the La1s

dataset For the remaining data sets the number of treesincreased stepwise by 50 trees from 50 to 500 The numberof random features in a subspace was set to 119898119905119903119910 = lceilradic119872rceilThe number of features each consisting of a random sumof five inputs varied from 5 to 100 and for each 200 treeswere combined The vertical line in each plot indicates thesize of a subspace of features 119898119905119903119910 = lceillog

2(119872) + 1rceil

This subspace was suggested by Breiman [1] for the case oflow-dimensional datasets Three feature selection methodsnamely GRRF varSelRF and LASSO were not considered inthis experimentThemain reason is that when the119898119905119903119910 valueis large the computational time of the GRRF and varSelRFmodels required to deal with large high datasets was too long[17]

12 The Scientific World Journal

06

07

08

09AU

C

065

070

075

080

085

090

AUC

070

075

080

085

090

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM7000

RF GRRF wsRF xRFHorseM15000

RF GRRF wsRF xRFHorseM12000

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM5000

RF GRRF wsRF xRFHorseM3000

RF GRRF wsRF xRFHorseM500

RF GRRF wsRF xRFHorseM300

AUC

06

07

08

09

AUC

07

08

09AU

C

06

07

08

09

AUC

07

08

09

AUC

05

06

07

08

09

AUC

065

070

075

080

085

AUC

Figure 6 Box plots of the AUC measures of the nine Horse subdatasets

It can be seen that the xRF and wsRF models alwaysprovided good results and achieved higher prediction accu-racies when the subspace 119898119905119903119910 = lceillog

2(119872) + 1rceil was used

However the xRF model is better than the wsRF model inincreasing the prediction accuracy on the three classificationdatasetsThe RFmodel requires the larger number of featuresto achieve the higher accuracy of prediction as shown in theright of Figures 9(a) and 9(b) When the number of treesin a forests was varied the xRF model produced the bestresults on the Fbis and La2s datasets In the La1s datasetwhere the xRF model did not obtain the best results asshown in Figure 9(c) (left) the differences from the bestresults were minor From the right of Figures 9(a) 9(b)and 9(c) we can observe that the xRF model does not need

many features in the selected subspace to achieve the bestprediction performanceThese empirical results indicate thatfor application on high-dimensional data when the xRFmodel uses the small subspace the achieved results can besatisfactory

However the RF model using the simple samplingmethod for feature selection [1] could achieve good predic-tion performance only if it is provided with a much largersubspace as shown in the right part of Figures 9(a) and 9(b)Breiman suggested to use a subspace of size 119898119905119903119910 = radic119872 inclassification problemWith this size the computational timefor building a random forest is still too high especially forlarge high datasets In general when the xRF model is usedwith a feature subspace of the same size as the one suggested

The Scientific World Journal 13

Table 2 The (1198881199042) error bound results of random forest models against the number of codebook size on the Caltech and Horse datasetsThe bold value in each row indicates the best result

Dataset Model 300 500 1000 3000 5000 7000 10000 12000 15000

CaltechxRF 0312 0271 0280 0287 0357 0440 0650 0742 0789RF 0369 0288 0294 0327 0435 0592 0908 1114 3611

wsRF 0413 0297 0268 0221 0265 0333 0461 0456 0789

HorsexRF 0266 0262 0246 0277 0259 0298 0275 0288 0382RF 0331 0342 0354 0374 0417 0463 0519 0537 0695

wsRF 0429 0414 0391 0295 0288 0333 0295 0339 0455

70

80

90

100

RF GRRF wsRF xRFColon

Accu

racy

()

70

80

90

100

RF GRRF wsRF xRFSrbct

Accu

racy

()

50

60

70

80

90

100

RF GRRF wsRF xRFLeukemia

Accu

racy

()

75

80

85

90

95

100

RF GRRF wsRF xRFLymphoma

Accu

racy

()

50

60

70

80

90

RF GRRF wsRF xRFBreast2class

Accu

racy

()

40

50

60

70

80

RF GRRF wsRF xRFBreast3class

Accu

racy

()

40

60

80

100

RF GRRF wsRF xRFnci

Accu

racy

()

40

60

80

100

RF GRRF wsRF xRFBrain

Accu

racy

()

80

90

100

RF GRRF wsRF xRFProstate

Accu

racy

()

70

80

90

100

RF GRRF wsRF xRFAdenocarcinoma

Accu

racy

()

Figure 7 Box plots of test accuracy of the models on the ten gene datasets

14 The Scientific World Journal

Table 3 The prediction test accuracy (mean plusmn std-dev) of the models on the image datasets against the number of trees 119870 The numberof feature dimensions in each subdataset is fixed Numbers in bold are the best results

Dataset Model 119870 = 20 119870 = 50 119870 = 80 119870 = 100 119870 = 200

CaltechM3000

xRF 9550 plusmn 2 9650 plusmn 1 9650 plusmn 2 9700 plusmn 1 9750 plusmn 2RF 7000 plusmn 7 7600 plusmn 9 7750 plusmn 12 8250 plusmn 16 8150 plusmn 2

wsRF 9150 plusmn 4 9100 plusmn 3 9300 plusmn 2 9450 plusmn 4 9200 plusmn 9GRRF 9300 plusmn 2 9600 plusmn 2 9450 plusmn 2 9500 plusmn 3 9400 plusmn 2

HorseM3000

xRF 8059 plusmn 4 8176 plusmn 2 7971 plusmn 6 8029 plusmn 1 7765 plusmn 5RF 5059 plusmn 10 5294 plusmn 8 5618 plusmn 4 5824 plusmn 5 5735 plusmn 9

wsRF 6206 plusmn 4 6882 plusmn 3 6765 plusmn 3 6765 plusmn 5 6588 plusmn 7GRRF 6500 plusmn 9 6353 plusmn 3 6853 plusmn 3 6353 plusmn 9 7118 plusmn 4

YaleBEigenfaceM504

xRF 7568 plusmn 1 8565 plusmn 1 8808 plusmn 1 8894 plusmn 0 9122 plusmn 0RF 7193 plusmn 1 7948 plusmn 1 8069 plusmn 1 8167 plusmn 1 8289 plusmn 1

wsRF 7760 plusmn 1 8561 plusmn 0 8811 plusmn 0 8931 plusmn 0 9068 plusmn 0GRRF 7473 plusmn 0 8470 plusmn 1 8725 plusmn 0 8961 plusmn 0 9189 plusmn 0

YaleBrandomfaceM504

xRF 9471 plusmn 0 9764 plusmn 0 9801 plusmn 0 9822 plusmn 0 9859 plusmn 0RF 8800 plusmn 0 9259 plusmn 0 9413 plusmn 0 9486 plusmn 0 9606 plusmn 0

wsRF 9540 plusmn 0 9790 plusmn 0 9817 plusmn 0 9814 plusmn 0 9838 plusmn 0GRRF 9566 plusmn 0 9810 plusmn 0 9842 plusmn 0 9892 plusmn 0 9884 plusmn 0

ORLEigenfaceM504

xRF 7625 plusmn 6 8725 plusmn 3 9175 plusmn 2 9325 plusmn 2 9475 plusmn 2RF 7175 plusmn 2 7875 plusmn 4 8200 plusmn 3 8275 plusmn 3 8550 plusmn 5

wsRF 7825 plusmn 4 8875 plusmn 3 9000 plusmn 1 9125 plusmn 2 9250 plusmn 2GRRF 7350 plusmn 6 8500 plusmn 2 9000 plusmn 1 9075 plusmn 3 9475 plusmn 1

ORLrandomfaceM504

xRF 8775 plusmn 3 9250 plusmn 2 9550 plusmn 1 9425 plusmn 1 9600 plusmn 1RF 7750 plusmn 3 8200 plusmn 7 8450 plusmn 2 8750 plusmn 2 8600 plusmn 2

wsRF 8700 plusmn 5 9375 plusmn 2 9375 plusmn 0 9500 plusmn 1 9550 plusmn 1GRRF 8725 plusmn 1 9325 plusmn 1 9450 plusmn 1 9425 plusmn 1 9550 plusmn 1

Table 4 AUC results (mean plusmn std-dev) of random forest models against the number of trees 119870 on the CaltechM3000 and HorseM3000subdatasets The bold value in each row indicates the best result

Dataset Model 119870 = 20 119870 = 50 119870 = 80 119870 = 100 119870 = 200

CaltechM3000

xRF 995 plusmn 0 999 plusmn 5 100 plusmn 2 100 plusmn 1 100 plusmn 1RF 851 plusmn 7 817 plusmn 4 826 plusmn 12 865 plusmn 6 864 plusmn 1

wsRF 841 plusmn 1 845 plusmn 8 834 plusmn 7 850 plusmn 8 870 plusmn 9GRRF 846 plusmn 1 860 plusmn 2 862 plusmn 1 908 plusmn 1 923 plusmn 1

HorseM3000

xRF 849 plusmn 1 887 plusmn 0 895 plusmn 0 898 plusmn 0 897 plusmn 0RF 637 plusmn 4 664 plusmn 7 692 plusmn 15 696 plusmn 3 733 plusmn 9

wsRF 635 plusmn 8 687 plusmn 4 679 plusmn 6 671 plusmn 4 718 plusmn 9GRRF 786 plusmn 3 778 plusmn 3 785 plusmn 8 699 plusmn 1 806 plusmn 4

Table 5 Test accuracy results () of random forest models GRRF(01) varSelRF and LASSO logistic regression applied to gene datasetsThe average results of 100 repetitions were computed higher values are better The number of genes in the strong group X

119904and the weak

group X119908is used in xRF

Dataset xRF RF wsRF GRRF varSelRF LASSO X119904

X119908

colon 8765 8435 8450 8645 7680 8200 245 317srbct 9771 9590 9676 9757 9650 9930 606 546Leukemia 8925 8258 8483 8725 8930 9240 502 200Lymphoma 9930 9715 9810 9910 9780 9910 1404 275breast2class 7884 6272 6340 7132 6140 6340 194 631breast3class 6542 5600 5719 6355 5820 6000 724 533nci 7415 5885 5940 6305 5820 6040 247 1345Brain 8193 7079 7079 7479 7690 7410 1270 1219Prostate 9256 8871 9079 9285 9150 9120 601 323Adenocarcinoma 9088 8404 8412 8552 7880 8110 108 669

The Scientific World Journal 15

Table 6The accuracy of prediction and error bound 1198881199042 of the models using a small subspace119898119905119903119910 = [log2(119872)+ 1] better values are bold

Dataset 1198881199042 Error bound Test accuracy () X119904

X119908RF wsRF xRF RF GRRF wsRF xRF

Fbis 2149 1179 1209 7642 7651 8414 8469 201 555La2s 1526 0904 0780 6677 6799 8726 8861 353 1136La1s 408 0892 1499 7776 8049 8603 8721 220 1532

002

004

006

008

RF wsRF xRFColon

cs2

erro

r bou

nd

001

002

003

RF wsRF xRFSrbct

cs2

erro

r bou

nd

002

004

006

RF wsRF xRFLeukemia

cs2

erro

r bou

nd

001

002

003

RF wsRF xRFLymphoma

cs2

erro

r bou

nd002

003

004

005

006

007

RF wsRF xRFBreast2class

cs2

erro

r bou

nd

004

006

008

010

012

RF wsRF xRFBreast3class

cs2

erro

r bou

nd

002

004

006

RF wsRF xRFnci

cs2

erro

r bou

nd

0025

0050

0075

RF wsRF xRFBrain

cs2

erro

r bou

nd

002

003

004

005

006

RF wsRF xRFProstate

cs2

erro

r bou

nd

002

004

006

008

010

RF wsRF xRFAdenocarcinoma

cs2

erro

r bou

nd

Figure 8 Box plots of (1198881199042) error bound for the models applied to the 10 gene datasets

by Breiman it demonstrates higher prediction accuracy andshorter computational time than those reported by BreimanThis achievement is considered to be one of the contributionsin our work

6 Conclusions

We have presented a new method for feature subspaceselection for building efficient random forest xRF model for

classification high-dimensional data Our main contributionis to make a new approach for unbiased feature samplingwhich selects the set of unbiased features for splitting anode when growing trees in the forests Furthermore thisnew unbiased feature selection method also reduces dimen-sionality using a defined threshold to remove uninformativefeatures (or noise) from the dataset Experimental resultshave demonstrated the improvements in increasing of the testaccuracy and the AUC measures for classification problems

16 The Scientific World Journal

70

75

80

85

50 100 150 200Number of trees

Accu

racy

()

70

75

80

85

25 50 75 100Number of features

Accu

racy

()

log(M) + 1

(a) Fbis

85

86

87

88

89

100 200 300 400 500Number of trees

Accu

racy

()

60

70

80

90

10 20 30 40 50Number of features

Accu

racy

()

log(M) + 1

(b) La2s

70

75

80

85

50 100 150 200Number of trees

Accu

racy

()

MethodsRFwsRFxRF

MethodsRFwsRFxRF

30

40

50

60

70

80

10 20 30 40 50Number of features

Accu

racy

() log(M) + 1

(c) La1s

Figure 9 The accuracy of prediction of the three random forests models against the number of trees and features on the three datasets

The Scientific World Journal 17

especially for image and microarray datasets in comparisonwith recent proposed random forests models including RFGRRF and wsRF

For futurework we think it would be desirable to increasethe scalability of the proposed random forests algorithm byparallelizing themon the cloud platform to deal with big datathat is hundreds of millions of samples and features

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research is supported in part by NSFC under Grantno 61203294 and Hanoi-DOST under the Grant no 01C-0701-2012-2 The author Thuy Thi Nguyen is supported bythe project ldquoSome Advanced Statistical Learning Techniquesfor Computer Visionrdquo funded by the National Foundation ofScience and Technology Development Vietnam under theGrant no 10201-201117

References

[1] L Breiman ldquoRandom forestsrdquo Machine Learning vol 450 no1 pp 5ndash32 2001

[2] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees CRC Press Boca Raton FlaUSA 1984

[3] H Kim and W-Y Loh ldquoClassification trees with unbiasedmultiway splitsrdquo Journal of the American Statistical Associationvol 96 no 454 pp 589ndash604 2001

[4] A PWhite andW Z Liu ldquoTechnical note bias in information-based measures in decision tree inductionrdquo Machine Learningvol 15 no 3 pp 321ndash329 1994

[5] T G Dietterich ldquoExperimental comparison of three methodsfor constructing ensembles of decision trees bagging boostingand randomizationrdquo Machine Learning vol 40 no 2 pp 139ndash157 2000

[6] Y Freund and R E Schapire ldquoA desicion-theoretic general-ization of on-line learning and an application to boostingrdquo inComputational Learning Theory pp 23ndash37 Springer 1995

[7] T-T Nguyen and T T Nguyen ldquoA real time license platedetection system based on boosting learning algorithmrdquo inProceedings of the 5th International Congress on Image and SignalProcessing (CISP rsquo12) pp 819ndash823 IEEE October 2012

[8] T K Ho ldquoRandom decision forestsrdquo in Proceedings of the 3rdInternational Conference on Document Analysis and Recogni-tion vol 1 pp 278ndash282 1995

[9] T K Ho ldquoThe random subspace method for constructingdecision forestsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 20 no 8 pp 832ndash844 1998

[10] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[11] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006

[12] RGenuer J-M Poggi andC Tuleau-Malot ldquoVariable selectionusing random forestsrdquoPattern Recognition Letters vol 31 no 14pp 2225ndash2236 2010

[13] B Xu J Z Huang GWilliams QWang and Y Ye ldquoClassifyingvery high-dimensional data with random forests built fromsmall subspacesrdquo International Journal ofDataWarehousing andMining vol 8 no 2 pp 44ndash63 2012

[14] Y Ye Q Wu J Zhexue Huang M K Ng and X Li ldquoStratifiedsampling for feature subspace selection in random forests forhigh dimensional datardquo Pattern Recognition vol 46 no 3 pp769ndash787 2013

[15] X Chen Y Ye X Xu and J Z Huang ldquoA feature groupweighting method for subspace clustering of high-dimensionaldatardquo Pattern Recognition vol 45 no 1 pp 434ndash446 2012

[16] D Amaratunga J Cabrera and Y-S Lee ldquoEnriched randomforestsrdquo Bioinformatics vol 240 no 18 pp 2010ndash2014 2008

[17] H Deng and G Runger ldquoGene selection with guided regular-ized random forestrdquo Pattern Recognition vol 46 no 12 pp3483ndash3489 2013

[18] C Strobl ldquoStatistical sources of variable selection bias inclassification trees based on the gini indexrdquo Tech Rep SFB 3862005 httpepububuni-muenchendearchive0000178901paper 420pdf

[19] C Strobl A-L Boulesteix and T Augustin ldquoUnbiased splitselection for classification trees based on the gini indexrdquoComputational Statistics amp Data Analysis vol 520 no 1 pp483ndash501 2007

[20] C Strobl A-L Boulesteix A Zeileis and T Hothorn ldquoBiasin random forest variable importance measures illustrationssources and a solutionrdquo BMC Bioinformatics vol 8 article 252007

[21] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008

[22] T Hothorn K Hornik and A Zeileis Party a laboratoryfor recursive partytioning r package version 09-9999 2011httpcranr-projectorgpackage=party

[23] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics vol 10 no 6 pp 80ndash83 1945

[24] T-TNguyen J ZHuang andT TNguyen ldquoTwo-level quantileregression forests for bias correction in range predictionrdquoMachine Learning 2014

[25] T-T Nguyen J Z Huang K Imran M J Li and GWilliams ldquoExtensions to quantile regression forests for veryhigh-dimensional datardquo in Advances in Knowledge Discoveryand Data Mining vol 8444 of Lecture Notes in ComputerScience pp 247ndash258 Springer Berlin Germany 2014

[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001

[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994

[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991

[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237

18 The Scientific World Journal

[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002

[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml

[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 9: Research Article Unbiased Feature Selection in …downloads.hindawi.com/journals/tswj/2015/471371.pdfResearch Article Unbiased Feature Selection in Learning Random Forests for High-Dimensional

The Scientific World Journal 9

70

80

90

100Ac

cura

cy (

)

70

80

90

100

Accu

racy

()

75

80

85

90

95

100

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM7000

RF GRRF wsRF xRFCaltechM15000

RF GRRF wsRF xRFCaltechM12000

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM5000

RF GRRF wsRF xRFCaltechM3000

RF GRRF wsRF xRFCaltechM500

RF GRRF wsRF xRFCaltechM300

Accu

racy

()

70

80

90

100

Accu

racy

()

75

80

85

90

95

100Ac

cura

cy (

)

70

80

90

100

Accu

racy

()

70

80

90

100

Accu

racy

()

60

70

80

90

100

Accu

racy

()

50

60

70

80

90Ac

cura

cy (

)

Figure 3 Box plots the test accuracy of the nine Caltech subdatasets

different tree sizes The wsRF model produced the bestprediction performance on some cases when applied to smallsubdatasets YaleBEigenfaceM504 ORLEigenfaceM504 andORLrandomfaceM504 However the xRF model producedrespectively the highest test accuracy on the remaining sub-datasets andAUCmeasures on high-dimensional subdatasetsCaltechM3000 and HorseM3000 as shown in Tables 3 and4 We can clearly see that the xRF model also outperformedother random forests models in classification accuracy onmost cases in all image datasets Another observation is thatthe new method is more stable in classification performancebecause the mean and variance of the test accuracy measureswere minor changed when varying the number of trees

55 Results on Microarray Datasets Table 5 shows the aver-age test results in terms of accuracy of the 100 random forestmodels computed according to (9) on the gene datasets Theaverage number of genes selected by the xRFmodel from 100repetitions for each dataset is shown on the right of Table 5divided into two groups X

119904(strong) and X

119908(weak) These

genes are used by the unbiased feature sampling method forgrowing trees in the xRF model LASSO logistic regressionwhich uses the RF model as a classifier showed fairly goodaccuracy on the two gene datasets srbct and leukemia TheGRRF model produced slightly better result on the prostategene dataset However the xRF model produced the bestaccuracy on most cases of the remaining gene datasets

10 The Scientific World Journal

085

090

095

100AU

C

075

080

085

090

095

100

AUC

085

090

095

100

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM7000

RF GRRF wsRF xRFCaltechM15000

RF GRRF wsRF xRFCaltechM12000

RF GRRF wsRF xRFCaltechM1000

RF GRRF wsRF xRFCaltechM5000

RF GRRF wsRF xRFCaltechM3000

RF GRRF wsRF xRFCaltechM500

RF GRRF wsRF xRFCaltechM300

AUC

08

09

10

AUC

094

096

098

100AU

C

094

096

098

100

AUC

092

094

096

098

100

AUC

090

095

100

AUC

07

08

09

10AU

C

Figure 4 Box plots of the AUC measures of the nine Caltech subdatasets

The detailed results containing the median and thevariance values are presented in Figure 7 with box plotsOnly the GRRF model was used for this comparison theLASSO logistic regression and varSelRF method for featureselection were not considered in this experiment becausetheir accuracies are lower than that of the GRRF model asshown in [17] We can see that the xRF model achieved thehighest average accuracy of prediction on nine datasets out often Its result was significantly different on the prostate genedataset and the variance was also smaller than those of theother models

Figure 8 shows the box plots of the (1198881199042) error bound ofthe RF wsRF and xRF models on the ten gene datasets from100 repetitionsThe wsRF model obtained lower error bound

rate on five gene datasets out of 10 The xRF model produceda significantly different error bound rate on two gene datasetsand obtained the lowest error rate on three datasets Thisimplies that when the optimal parameters such as 119898119905119903119910 =

lceilradic119872rceil and 119899min = 1 were used in growing trees the numberof genes in the subspace was not small and out-of-bag datawas used in prediction and the results were comparativelyfavored to the xRF model

56 Comparison of Prediction Performance for Various Num-bers of Features and Trees Table 6 shows the average 1198881199042error bound and accuracy test results of 10 repetitions ofrandom forest models on the three large datasets The xRFmodel produced the lowest error 1198881199042 on the dataset La1s

The Scientific World Journal 11

60

70

80

Accu

racy

()

60

70

80

Accu

racy

()

70

80

90

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM7000

RF GRRF wsRF xRFHorseM15000

RF GRRF wsRF xRFHorseM12000

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM5000

RF GRRF wsRF xRFHorseM3000

RF GRRF wsRF xRFHorseM500

RF GRRF wsRF xRFHorseM300

Accu

racy

()

60

70

80

Accu

racy

()

60

70

80

90

Accu

racy

()

60

70

80

90

Accu

racy

()

70

80

90

Accu

racy

()

60

70

80

Accu

racy

()

60

70

80

Accu

racy

()

Figure 5 Box plots of the test accuracy of the nine Horse subdatasets

while the wsRF model showed the lower error bound onother two datasets Fbis andLa2sTheRFmodel demonstratedthe worst accuracy of prediction compared to the othermodels this model also produced a large 1198881199042 error whenthe small subspace size 119898119905119903119910 = lceillog

2(119872) + 1rceil was used to

build trees on the La1s and La2s datasets The number offeatures in the X

119904and X

119908columns on the right of Table 6

was used in the xRF model We can see that the xRF modelachieved the highest accuracy of prediction on all three largedatasets

Figure 9 shows the plots of the performance curves of theRF models when the number of trees and features increasesThe number of trees was increased stepwise by 20 treesfrom 20 to 200 when the models were applied to the La1s

dataset For the remaining data sets the number of treesincreased stepwise by 50 trees from 50 to 500 The numberof random features in a subspace was set to 119898119905119903119910 = lceilradic119872rceilThe number of features each consisting of a random sumof five inputs varied from 5 to 100 and for each 200 treeswere combined The vertical line in each plot indicates thesize of a subspace of features 119898119905119903119910 = lceillog

2(119872) + 1rceil

This subspace was suggested by Breiman [1] for the case oflow-dimensional datasets Three feature selection methodsnamely GRRF varSelRF and LASSO were not considered inthis experimentThemain reason is that when the119898119905119903119910 valueis large the computational time of the GRRF and varSelRFmodels required to deal with large high datasets was too long[17]

12 The Scientific World Journal

06

07

08

09AU

C

065

070

075

080

085

090

AUC

070

075

080

085

090

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM7000

RF GRRF wsRF xRFHorseM15000

RF GRRF wsRF xRFHorseM12000

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM5000

RF GRRF wsRF xRFHorseM3000

RF GRRF wsRF xRFHorseM500

RF GRRF wsRF xRFHorseM300

AUC

06

07

08

09

AUC

07

08

09AU

C

06

07

08

09

AUC

07

08

09

AUC

05

06

07

08

09

AUC

065

070

075

080

085

AUC

Figure 6 Box plots of the AUC measures of the nine Horse subdatasets

It can be seen that the xRF and wsRF models alwaysprovided good results and achieved higher prediction accu-racies when the subspace 119898119905119903119910 = lceillog

2(119872) + 1rceil was used

However the xRF model is better than the wsRF model inincreasing the prediction accuracy on the three classificationdatasetsThe RFmodel requires the larger number of featuresto achieve the higher accuracy of prediction as shown in theright of Figures 9(a) and 9(b) When the number of treesin a forests was varied the xRF model produced the bestresults on the Fbis and La2s datasets In the La1s datasetwhere the xRF model did not obtain the best results asshown in Figure 9(c) (left) the differences from the bestresults were minor From the right of Figures 9(a) 9(b)and 9(c) we can observe that the xRF model does not need

many features in the selected subspace to achieve the bestprediction performanceThese empirical results indicate thatfor application on high-dimensional data when the xRFmodel uses the small subspace the achieved results can besatisfactory

However the RF model using the simple samplingmethod for feature selection [1] could achieve good predic-tion performance only if it is provided with a much largersubspace as shown in the right part of Figures 9(a) and 9(b)Breiman suggested to use a subspace of size 119898119905119903119910 = radic119872 inclassification problemWith this size the computational timefor building a random forest is still too high especially forlarge high datasets In general when the xRF model is usedwith a feature subspace of the same size as the one suggested

The Scientific World Journal 13

Table 2 The (1198881199042) error bound results of random forest models against the number of codebook size on the Caltech and Horse datasetsThe bold value in each row indicates the best result

Dataset Model 300 500 1000 3000 5000 7000 10000 12000 15000

CaltechxRF 0312 0271 0280 0287 0357 0440 0650 0742 0789RF 0369 0288 0294 0327 0435 0592 0908 1114 3611

wsRF 0413 0297 0268 0221 0265 0333 0461 0456 0789

HorsexRF 0266 0262 0246 0277 0259 0298 0275 0288 0382RF 0331 0342 0354 0374 0417 0463 0519 0537 0695

wsRF 0429 0414 0391 0295 0288 0333 0295 0339 0455

70

80

90

100

RF GRRF wsRF xRFColon

Accu

racy

()

70

80

90

100

RF GRRF wsRF xRFSrbct

Accu

racy

()

50

60

70

80

90

100

RF GRRF wsRF xRFLeukemia

Accu

racy

()

75

80

85

90

95

100

RF GRRF wsRF xRFLymphoma

Accu

racy

()

50

60

70

80

90

RF GRRF wsRF xRFBreast2class

Accu

racy

()

40

50

60

70

80

RF GRRF wsRF xRFBreast3class

Accu

racy

()

40

60

80

100

RF GRRF wsRF xRFnci

Accu

racy

()

40

60

80

100

RF GRRF wsRF xRFBrain

Accu

racy

()

80

90

100

RF GRRF wsRF xRFProstate

Accu

racy

()

70

80

90

100

RF GRRF wsRF xRFAdenocarcinoma

Accu

racy

()

Figure 7 Box plots of test accuracy of the models on the ten gene datasets

14 The Scientific World Journal

Table 3 The prediction test accuracy (mean plusmn std-dev) of the models on the image datasets against the number of trees 119870 The numberof feature dimensions in each subdataset is fixed Numbers in bold are the best results

Dataset Model 119870 = 20 119870 = 50 119870 = 80 119870 = 100 119870 = 200

CaltechM3000

xRF 9550 plusmn 2 9650 plusmn 1 9650 plusmn 2 9700 plusmn 1 9750 plusmn 2RF 7000 plusmn 7 7600 plusmn 9 7750 plusmn 12 8250 plusmn 16 8150 plusmn 2

wsRF 9150 plusmn 4 9100 plusmn 3 9300 plusmn 2 9450 plusmn 4 9200 plusmn 9GRRF 9300 plusmn 2 9600 plusmn 2 9450 plusmn 2 9500 plusmn 3 9400 plusmn 2

HorseM3000

xRF 8059 plusmn 4 8176 plusmn 2 7971 plusmn 6 8029 plusmn 1 7765 plusmn 5RF 5059 plusmn 10 5294 plusmn 8 5618 plusmn 4 5824 plusmn 5 5735 plusmn 9

wsRF 6206 plusmn 4 6882 plusmn 3 6765 plusmn 3 6765 plusmn 5 6588 plusmn 7GRRF 6500 plusmn 9 6353 plusmn 3 6853 plusmn 3 6353 plusmn 9 7118 plusmn 4

YaleBEigenfaceM504

xRF 7568 plusmn 1 8565 plusmn 1 8808 plusmn 1 8894 plusmn 0 9122 plusmn 0RF 7193 plusmn 1 7948 plusmn 1 8069 plusmn 1 8167 plusmn 1 8289 plusmn 1

wsRF 7760 plusmn 1 8561 plusmn 0 8811 plusmn 0 8931 plusmn 0 9068 plusmn 0GRRF 7473 plusmn 0 8470 plusmn 1 8725 plusmn 0 8961 plusmn 0 9189 plusmn 0

YaleBrandomfaceM504

xRF 9471 plusmn 0 9764 plusmn 0 9801 plusmn 0 9822 plusmn 0 9859 plusmn 0RF 8800 plusmn 0 9259 plusmn 0 9413 plusmn 0 9486 plusmn 0 9606 plusmn 0

wsRF 9540 plusmn 0 9790 plusmn 0 9817 plusmn 0 9814 plusmn 0 9838 plusmn 0GRRF 9566 plusmn 0 9810 plusmn 0 9842 plusmn 0 9892 plusmn 0 9884 plusmn 0

ORLEigenfaceM504

xRF 7625 plusmn 6 8725 plusmn 3 9175 plusmn 2 9325 plusmn 2 9475 plusmn 2RF 7175 plusmn 2 7875 plusmn 4 8200 plusmn 3 8275 plusmn 3 8550 plusmn 5

wsRF 7825 plusmn 4 8875 plusmn 3 9000 plusmn 1 9125 plusmn 2 9250 plusmn 2GRRF 7350 plusmn 6 8500 plusmn 2 9000 plusmn 1 9075 plusmn 3 9475 plusmn 1

ORLrandomfaceM504

xRF 8775 plusmn 3 9250 plusmn 2 9550 plusmn 1 9425 plusmn 1 9600 plusmn 1RF 7750 plusmn 3 8200 plusmn 7 8450 plusmn 2 8750 plusmn 2 8600 plusmn 2

wsRF 8700 plusmn 5 9375 plusmn 2 9375 plusmn 0 9500 plusmn 1 9550 plusmn 1GRRF 8725 plusmn 1 9325 plusmn 1 9450 plusmn 1 9425 plusmn 1 9550 plusmn 1

Table 4 AUC results (mean plusmn std-dev) of random forest models against the number of trees 119870 on the CaltechM3000 and HorseM3000subdatasets The bold value in each row indicates the best result

Dataset Model 119870 = 20 119870 = 50 119870 = 80 119870 = 100 119870 = 200

CaltechM3000

xRF 995 plusmn 0 999 plusmn 5 100 plusmn 2 100 plusmn 1 100 plusmn 1RF 851 plusmn 7 817 plusmn 4 826 plusmn 12 865 plusmn 6 864 plusmn 1

wsRF 841 plusmn 1 845 plusmn 8 834 plusmn 7 850 plusmn 8 870 plusmn 9GRRF 846 plusmn 1 860 plusmn 2 862 plusmn 1 908 plusmn 1 923 plusmn 1

HorseM3000

xRF 849 plusmn 1 887 plusmn 0 895 plusmn 0 898 plusmn 0 897 plusmn 0RF 637 plusmn 4 664 plusmn 7 692 plusmn 15 696 plusmn 3 733 plusmn 9

wsRF 635 plusmn 8 687 plusmn 4 679 plusmn 6 671 plusmn 4 718 plusmn 9GRRF 786 plusmn 3 778 plusmn 3 785 plusmn 8 699 plusmn 1 806 plusmn 4

Table 5 Test accuracy results () of random forest models GRRF(01) varSelRF and LASSO logistic regression applied to gene datasetsThe average results of 100 repetitions were computed higher values are better The number of genes in the strong group X

119904and the weak

group X119908is used in xRF

Dataset xRF RF wsRF GRRF varSelRF LASSO X119904

X119908

colon 8765 8435 8450 8645 7680 8200 245 317srbct 9771 9590 9676 9757 9650 9930 606 546Leukemia 8925 8258 8483 8725 8930 9240 502 200Lymphoma 9930 9715 9810 9910 9780 9910 1404 275breast2class 7884 6272 6340 7132 6140 6340 194 631breast3class 6542 5600 5719 6355 5820 6000 724 533nci 7415 5885 5940 6305 5820 6040 247 1345Brain 8193 7079 7079 7479 7690 7410 1270 1219Prostate 9256 8871 9079 9285 9150 9120 601 323Adenocarcinoma 9088 8404 8412 8552 7880 8110 108 669

The Scientific World Journal 15

Table 6The accuracy of prediction and error bound 1198881199042 of the models using a small subspace119898119905119903119910 = [log2(119872)+ 1] better values are bold

Dataset 1198881199042 Error bound Test accuracy () X119904

X119908RF wsRF xRF RF GRRF wsRF xRF

Fbis 2149 1179 1209 7642 7651 8414 8469 201 555La2s 1526 0904 0780 6677 6799 8726 8861 353 1136La1s 408 0892 1499 7776 8049 8603 8721 220 1532

002

004

006

008

RF wsRF xRFColon

cs2

erro

r bou

nd

001

002

003

RF wsRF xRFSrbct

cs2

erro

r bou

nd

002

004

006

RF wsRF xRFLeukemia

cs2

erro

r bou

nd

001

002

003

RF wsRF xRFLymphoma

cs2

erro

r bou

nd002

003

004

005

006

007

RF wsRF xRFBreast2class

cs2

erro

r bou

nd

004

006

008

010

012

RF wsRF xRFBreast3class

cs2

erro

r bou

nd

002

004

006

RF wsRF xRFnci

cs2

erro

r bou

nd

0025

0050

0075

RF wsRF xRFBrain

cs2

erro

r bou

nd

002

003

004

005

006

RF wsRF xRFProstate

cs2

erro

r bou

nd

002

004

006

008

010

RF wsRF xRFAdenocarcinoma

cs2

erro

r bou

nd

Figure 8 Box plots of (1198881199042) error bound for the models applied to the 10 gene datasets

by Breiman it demonstrates higher prediction accuracy andshorter computational time than those reported by BreimanThis achievement is considered to be one of the contributionsin our work

6 Conclusions

We have presented a new method for feature subspaceselection for building efficient random forest xRF model for

classification high-dimensional data Our main contributionis to make a new approach for unbiased feature samplingwhich selects the set of unbiased features for splitting anode when growing trees in the forests Furthermore thisnew unbiased feature selection method also reduces dimen-sionality using a defined threshold to remove uninformativefeatures (or noise) from the dataset Experimental resultshave demonstrated the improvements in increasing of the testaccuracy and the AUC measures for classification problems

Figure 9: The prediction accuracy of the three random forest models (RF, wsRF, and xRF) against the number of trees and the number of features on the three datasets: (a) Fbis, (b) La2s, (c) La1s. The vertical line in each feature plot marks the subspace size log2(M) + 1.

For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported in part by NSFC under Grant no. 61203294 and by Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 10201-201117.

References

[1] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[10] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.
[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: Lasso and elastic-net regularized generalized linear models," R package, 2010, http://CRAN.R-project.org/package=glmnet.




Page 12: Research Article Unbiased Feature Selection in …downloads.hindawi.com/journals/tswj/2015/471371.pdfResearch Article Unbiased Feature Selection in Learning Random Forests for High-Dimensional

12 The Scientific World Journal

06

07

08

09AU

C

065

070

075

080

085

090

AUC

070

075

080

085

090

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM7000

RF GRRF wsRF xRFHorseM15000

RF GRRF wsRF xRFHorseM12000

RF GRRF wsRF xRFHorseM1000

RF GRRF wsRF xRFHorseM5000

RF GRRF wsRF xRFHorseM3000

RF GRRF wsRF xRFHorseM500

RF GRRF wsRF xRFHorseM300

AUC

06

07

08

09

AUC

07

08

09AU

C

06

07

08

09

AUC

07

08

09

AUC

05

06

07

08

09

AUC

065

070

075

080

085

AUC

Figure 6 Box plots of the AUC measures of the nine Horse subdatasets

It can be seen that the xRF and wsRF models alwaysprovided good results and achieved higher prediction accu-racies when the subspace 119898119905119903119910 = lceillog

2(119872) + 1rceil was used

However the xRF model is better than the wsRF model inincreasing the prediction accuracy on the three classificationdatasetsThe RFmodel requires the larger number of featuresto achieve the higher accuracy of prediction as shown in theright of Figures 9(a) and 9(b) When the number of treesin a forests was varied the xRF model produced the bestresults on the Fbis and La2s datasets In the La1s datasetwhere the xRF model did not obtain the best results asshown in Figure 9(c) (left) the differences from the bestresults were minor From the right of Figures 9(a) 9(b)and 9(c) we can observe that the xRF model does not need

many features in the selected subspace to achieve the bestprediction performanceThese empirical results indicate thatfor application on high-dimensional data when the xRFmodel uses the small subspace the achieved results can besatisfactory

However the RF model using the simple samplingmethod for feature selection [1] could achieve good predic-tion performance only if it is provided with a much largersubspace as shown in the right part of Figures 9(a) and 9(b)Breiman suggested to use a subspace of size 119898119905119903119910 = radic119872 inclassification problemWith this size the computational timefor building a random forest is still too high especially forlarge high datasets In general when the xRF model is usedwith a feature subspace of the same size as the one suggested

The Scientific World Journal 13

Table 2 The (1198881199042) error bound results of random forest models against the number of codebook size on the Caltech and Horse datasetsThe bold value in each row indicates the best result

Dataset Model 300 500 1000 3000 5000 7000 10000 12000 15000

CaltechxRF 0312 0271 0280 0287 0357 0440 0650 0742 0789RF 0369 0288 0294 0327 0435 0592 0908 1114 3611

wsRF 0413 0297 0268 0221 0265 0333 0461 0456 0789

HorsexRF 0266 0262 0246 0277 0259 0298 0275 0288 0382RF 0331 0342 0354 0374 0417 0463 0519 0537 0695

wsRF 0429 0414 0391 0295 0288 0333 0295 0339 0455

70

80

90

100

RF GRRF wsRF xRFColon

Accu

racy

()

70

80

90

100

RF GRRF wsRF xRFSrbct

Accu

racy

()

50

60

70

80

90

100

RF GRRF wsRF xRFLeukemia

Accu

racy

()

75

80

85

90

95

100

RF GRRF wsRF xRFLymphoma

Accu

racy

()

50

60

70

80

90

RF GRRF wsRF xRFBreast2class

Accu

racy

()

40

50

60

70

80

RF GRRF wsRF xRFBreast3class

Accu

racy

()

40

60

80

100

RF GRRF wsRF xRFnci

Accu

racy

()

40

60

80

100

RF GRRF wsRF xRFBrain

Accu

racy

()

80

90

100

RF GRRF wsRF xRFProstate

Accu

racy

()

70

80

90

100

RF GRRF wsRF xRFAdenocarcinoma

Accu

racy

()

Figure 7 Box plots of test accuracy of the models on the ten gene datasets

14 The Scientific World Journal

Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.

Dataset               Model  K = 20      K = 50      K = 80      K = 100     K = 200
CaltechM3000          xRF    95.50 ± 2   96.50 ± 1   96.50 ± 2   97.00 ± 1   97.50 ± 2
                      RF     70.00 ± 7   76.00 ± 9   77.50 ± 12  82.50 ± 16  81.50 ± 2
                      wsRF   91.50 ± 4   91.00 ± 3   93.00 ± 2   94.50 ± 4   92.00 ± 9
                      GRRF   93.00 ± 2   96.00 ± 2   94.50 ± 2   95.00 ± 3   94.00 ± 2
HorseM3000            xRF    80.59 ± 4   81.76 ± 2   79.71 ± 6   80.29 ± 1   77.65 ± 5
                      RF     50.59 ± 10  52.94 ± 8   56.18 ± 4   58.24 ± 5   57.35 ± 9
                      wsRF   62.06 ± 4   68.82 ± 3   67.65 ± 3   67.65 ± 5   65.88 ± 7
                      GRRF   65.00 ± 9   63.53 ± 3   68.53 ± 3   63.53 ± 9   71.18 ± 4
YaleBEigenfaceM504    xRF    75.68 ± 1   85.65 ± 1   88.08 ± 1   88.94 ± 0   91.22 ± 0
                      RF     71.93 ± 1   79.48 ± 1   80.69 ± 1   81.67 ± 1   82.89 ± 1
                      wsRF   77.60 ± 1   85.61 ± 0   88.11 ± 0   89.31 ± 0   90.68 ± 0
                      GRRF   74.73 ± 0   84.70 ± 1   87.25 ± 0   89.61 ± 0   91.89 ± 0
YaleBrandomfaceM504   xRF    94.71 ± 0   97.64 ± 0   98.01 ± 0   98.22 ± 0   98.59 ± 0
                      RF     88.00 ± 0   92.59 ± 0   94.13 ± 0   94.86 ± 0   96.06 ± 0
                      wsRF   95.40 ± 0   97.90 ± 0   98.17 ± 0   98.14 ± 0   98.38 ± 0
                      GRRF   95.66 ± 0   98.10 ± 0   98.42 ± 0   98.92 ± 0   98.84 ± 0
ORLEigenfaceM504      xRF    76.25 ± 6   87.25 ± 3   91.75 ± 2   93.25 ± 2   94.75 ± 2
                      RF     71.75 ± 2   78.75 ± 4   82.00 ± 3   82.75 ± 3   85.50 ± 5
                      wsRF   78.25 ± 4   88.75 ± 3   90.00 ± 1   91.25 ± 2   92.50 ± 2
                      GRRF   73.50 ± 6   85.00 ± 2   90.00 ± 1   90.75 ± 3   94.75 ± 1
ORLrandomfaceM504     xRF    87.75 ± 3   92.50 ± 2   95.50 ± 1   94.25 ± 1   96.00 ± 1
                      RF     77.50 ± 3   82.00 ± 7   84.50 ± 2   87.50 ± 2   86.00 ± 2
                      wsRF   87.00 ± 5   93.75 ± 2   93.75 ± 0   95.00 ± 1   95.50 ± 1
                      GRRF   87.25 ± 1   93.25 ± 1   94.50 ± 1   94.25 ± 1   95.50 ± 1

Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.

Dataset        Model  K = 20     K = 50     K = 80     K = 100    K = 200
CaltechM3000   xRF    99.5 ± 0   99.9 ± 5   100 ± 2    100 ± 1    100 ± 1
               RF     85.1 ± 7   81.7 ± 4   82.6 ± 12  86.5 ± 6   86.4 ± 1
               wsRF   84.1 ± 1   84.5 ± 8   83.4 ± 7   85.0 ± 8   87.0 ± 9
               GRRF   84.6 ± 1   86.0 ± 2   86.2 ± 1   90.8 ± 1   92.3 ± 1
HorseM3000     xRF    84.9 ± 1   88.7 ± 0   89.5 ± 0   89.8 ± 0   89.7 ± 0
               RF     63.7 ± 4   66.4 ± 7   69.2 ± 15  69.6 ± 3   73.3 ± 9
               wsRF   63.5 ± 8   68.7 ± 4   67.9 ± 6   67.1 ± 4   71.8 ± 9
               GRRF   78.6 ± 3   77.8 ± 3   78.5 ± 8   69.9 ± 1   80.6 ± 4

Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group Xs and the weak group Xw used in xRF is also reported.

Dataset          xRF    RF     wsRF   GRRF   varSelRF  LASSO  Xs    Xw
colon            87.65  84.35  84.50  86.45  76.80     82.00  245   317
srbct            97.71  95.90  96.76  97.57  96.50     99.30  606   546
Leukemia         89.25  82.58  84.83  87.25  89.30     92.40  502   200
Lymphoma         99.30  97.15  98.10  99.10  97.80     99.10  1404  275
breast2class     78.84  62.72  63.40  71.32  61.40     63.40  194   631
breast3class     65.42  56.00  57.19  63.55  58.20     60.00  724   533
nci              74.15  58.85  59.40  63.05  58.20     60.40  247   1345
Brain            81.93  70.79  70.79  74.79  76.90     74.10  1270  1219
Prostate         92.56  88.71  90.79  92.85  91.50     91.20  601   323
Adenocarcinoma   90.88  84.04  84.12  85.52  78.80     81.10  108   669


Table 6: The accuracy of prediction and the (c/s2) error bound of the models using a small subspace mtry = [log2(M) + 1]; better values are in bold.

Dataset  (c/s2) error bound       Test accuracy (%)               Xs    Xw
         RF     wsRF   xRF        RF     GRRF   wsRF   xRF
Fbis     2.149  1.179  1.209      76.42  76.51  84.14  84.69      201   555
La2s     1.526  0.904  0.780      66.77  67.99  87.26  88.61      353   1136
La1s     4.08   0.892  1.499      77.76  80.49  86.03  87.21      220   1532

[Figure 8: Box plots of the (c/s2) error bound of the RF, wsRF, and xRF models applied to the 10 gene datasets: Colon, Srbct, Leukemia, Lymphoma, Breast2class, Breast3class, nci, Brain, Prostate, and Adenocarcinoma.]

by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered to be one of the contributions in our work.
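To make the size gap concrete, the short Python sketch below compares the two subspace sizes discussed here for a hypothetical dimensionality M = 10,000; the use of scikit-learn's RandomForestClassifier and the choice of M are illustrative assumptions, not the implementation or datasets used in our experiments.

```python
import math

from sklearn.ensemble import RandomForestClassifier

M = 10_000  # assumed number of features in a high-dimensional dataset

mtry_breiman = int(math.sqrt(M))       # Breiman's suggestion: sqrt(M) = 100
mtry_small = int(math.log2(M)) + 1     # small subspace: [log2(M) + 1] = 14

print(mtry_breiman, mtry_small)  # 100 vs. 14 candidate features per split

# A standard random forest can be grown with either subspace size by setting
# max_features; the smaller value reduces the per-split search cost roughly by
# a factor of mtry_breiman / mtry_small (about 7x in this example).
rf_sqrt = RandomForestClassifier(n_estimators=200, max_features=mtry_breiman)
rf_small = RandomForestClassifier(n_estimators=200, max_features=mtry_small)
```

With roughly seven times fewer candidate features evaluated at each node split, the per-node search cost drops accordingly, which is the main source of the computational saving discussed above.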

6. Conclusions

We have presented a new method of feature subspace selection for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach for unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality by using a defined threshold to remove uninformative features (or noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,


[Figure 9: The prediction accuracy of the three random forest models (RF, wsRF, xRF) on the (a) Fbis, (b) La2s, and (c) La1s datasets, plotted against the number of trees (left panels) and the number of features in the subspace (right panels); a vertical reference line marks log(M) + 1 features.]


especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.
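As a rough illustration of the two-group feature sampling idea summarized above, the following Python sketch draws one node-level candidate subspace from the strong group Xs and the weak group Xw. The 0.8/0.2 split of the subspace between the two groups, the NumPy-based sampling, and the helper name sample_node_subspace are assumptions made for this example only; they are not the exact weighting scheme used in xRF.

```python
import numpy as np


def sample_node_subspace(strong_idx, weak_idx, mtry, strong_share=0.8, rng=None):
    """Draw a candidate feature subspace for one node split.

    strong_idx / weak_idx: indices of the strong (Xs) and weak (Xw) feature
    groups obtained after removing uninformative features.
    strong_share: assumed fraction of the subspace drawn from the strong group.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_strong = min(len(strong_idx), max(1, int(round(mtry * strong_share))))
    n_weak = min(len(weak_idx), mtry - n_strong)
    return np.concatenate([
        rng.choice(strong_idx, size=n_strong, replace=False),
        rng.choice(weak_idx, size=n_weak, replace=False),
    ])


# Example: 201 strong and 555 weak features (the Fbis counts from Table 6),
# with an assumed small subspace of 14 candidate features per split.
rng = np.random.default_rng(0)
strong = np.arange(201)
weak = np.arange(201, 201 + 555)
subspace = sample_node_subspace(strong, weak, mtry=14, rng=rng)
print(subspace.shape)  # (14,)
```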

For future work, we think it would be desirable to increase the scalability of the proposed random forests algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 10201-201117.

References

[1] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[10] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.


[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: Lasso and elastic-net regularized generalized linear models," R package version 1-1, 2010, http://CRAN.R-project.org/package=glmnet.


Page 13: Research Article Unbiased Feature Selection in …downloads.hindawi.com/journals/tswj/2015/471371.pdfResearch Article Unbiased Feature Selection in Learning Random Forests for High-Dimensional

The Scientific World Journal 13

Table 2 The (1198881199042) error bound results of random forest models against the number of codebook size on the Caltech and Horse datasetsThe bold value in each row indicates the best result

Dataset Model 300 500 1000 3000 5000 7000 10000 12000 15000

CaltechxRF 0312 0271 0280 0287 0357 0440 0650 0742 0789RF 0369 0288 0294 0327 0435 0592 0908 1114 3611

wsRF 0413 0297 0268 0221 0265 0333 0461 0456 0789

HorsexRF 0266 0262 0246 0277 0259 0298 0275 0288 0382RF 0331 0342 0354 0374 0417 0463 0519 0537 0695

wsRF 0429 0414 0391 0295 0288 0333 0295 0339 0455

70

80

90

100

RF GRRF wsRF xRFColon

Accu

racy

()

70

80

90

100

RF GRRF wsRF xRFSrbct

Accu

racy

()

50

60

70

80

90

100

RF GRRF wsRF xRFLeukemia

Accu

racy

()

75

80

85

90

95

100

RF GRRF wsRF xRFLymphoma

Accu

racy

()

50

60

70

80

90

RF GRRF wsRF xRFBreast2class

Accu

racy

()

40

50

60

70

80

RF GRRF wsRF xRFBreast3class

Accu

racy

()

40

60

80

100

RF GRRF wsRF xRFnci

Accu

racy

()

40

60

80

100

RF GRRF wsRF xRFBrain

Accu

racy

()

80

90

100

RF GRRF wsRF xRFProstate

Accu

racy

()

70

80

90

100

RF GRRF wsRF xRFAdenocarcinoma

Accu

racy

()

Figure 7 Box plots of test accuracy of the models on the ten gene datasets

14 The Scientific World Journal

Table 3 The prediction test accuracy (mean plusmn std-dev) of the models on the image datasets against the number of trees 119870 The numberof feature dimensions in each subdataset is fixed Numbers in bold are the best results

Dataset Model 119870 = 20 119870 = 50 119870 = 80 119870 = 100 119870 = 200

CaltechM3000

xRF 9550 plusmn 2 9650 plusmn 1 9650 plusmn 2 9700 plusmn 1 9750 plusmn 2RF 7000 plusmn 7 7600 plusmn 9 7750 plusmn 12 8250 plusmn 16 8150 plusmn 2

wsRF 9150 plusmn 4 9100 plusmn 3 9300 plusmn 2 9450 plusmn 4 9200 plusmn 9GRRF 9300 plusmn 2 9600 plusmn 2 9450 plusmn 2 9500 plusmn 3 9400 plusmn 2

HorseM3000

xRF 8059 plusmn 4 8176 plusmn 2 7971 plusmn 6 8029 plusmn 1 7765 plusmn 5RF 5059 plusmn 10 5294 plusmn 8 5618 plusmn 4 5824 plusmn 5 5735 plusmn 9

wsRF 6206 plusmn 4 6882 plusmn 3 6765 plusmn 3 6765 plusmn 5 6588 plusmn 7GRRF 6500 plusmn 9 6353 plusmn 3 6853 plusmn 3 6353 plusmn 9 7118 plusmn 4

YaleBEigenfaceM504

xRF 7568 plusmn 1 8565 plusmn 1 8808 plusmn 1 8894 plusmn 0 9122 plusmn 0RF 7193 plusmn 1 7948 plusmn 1 8069 plusmn 1 8167 plusmn 1 8289 plusmn 1

wsRF 7760 plusmn 1 8561 plusmn 0 8811 plusmn 0 8931 plusmn 0 9068 plusmn 0GRRF 7473 plusmn 0 8470 plusmn 1 8725 plusmn 0 8961 plusmn 0 9189 plusmn 0

YaleBrandomfaceM504

xRF 9471 plusmn 0 9764 plusmn 0 9801 plusmn 0 9822 plusmn 0 9859 plusmn 0RF 8800 plusmn 0 9259 plusmn 0 9413 plusmn 0 9486 plusmn 0 9606 plusmn 0

wsRF 9540 plusmn 0 9790 plusmn 0 9817 plusmn 0 9814 plusmn 0 9838 plusmn 0GRRF 9566 plusmn 0 9810 plusmn 0 9842 plusmn 0 9892 plusmn 0 9884 plusmn 0

ORLEigenfaceM504

xRF 7625 plusmn 6 8725 plusmn 3 9175 plusmn 2 9325 plusmn 2 9475 plusmn 2RF 7175 plusmn 2 7875 plusmn 4 8200 plusmn 3 8275 plusmn 3 8550 plusmn 5

wsRF 7825 plusmn 4 8875 plusmn 3 9000 plusmn 1 9125 plusmn 2 9250 plusmn 2GRRF 7350 plusmn 6 8500 plusmn 2 9000 plusmn 1 9075 plusmn 3 9475 plusmn 1

ORLrandomfaceM504

xRF 8775 plusmn 3 9250 plusmn 2 9550 plusmn 1 9425 plusmn 1 9600 plusmn 1RF 7750 plusmn 3 8200 plusmn 7 8450 plusmn 2 8750 plusmn 2 8600 plusmn 2

wsRF 8700 plusmn 5 9375 plusmn 2 9375 plusmn 0 9500 plusmn 1 9550 plusmn 1GRRF 8725 plusmn 1 9325 plusmn 1 9450 plusmn 1 9425 plusmn 1 9550 plusmn 1

Table 4 AUC results (mean plusmn std-dev) of random forest models against the number of trees 119870 on the CaltechM3000 and HorseM3000subdatasets The bold value in each row indicates the best result

Dataset Model 119870 = 20 119870 = 50 119870 = 80 119870 = 100 119870 = 200

CaltechM3000

xRF 995 plusmn 0 999 plusmn 5 100 plusmn 2 100 plusmn 1 100 plusmn 1RF 851 plusmn 7 817 plusmn 4 826 plusmn 12 865 plusmn 6 864 plusmn 1

wsRF 841 plusmn 1 845 plusmn 8 834 plusmn 7 850 plusmn 8 870 plusmn 9GRRF 846 plusmn 1 860 plusmn 2 862 plusmn 1 908 plusmn 1 923 plusmn 1

HorseM3000

xRF 849 plusmn 1 887 plusmn 0 895 plusmn 0 898 plusmn 0 897 plusmn 0RF 637 plusmn 4 664 plusmn 7 692 plusmn 15 696 plusmn 3 733 plusmn 9

wsRF 635 plusmn 8 687 plusmn 4 679 plusmn 6 671 plusmn 4 718 plusmn 9GRRF 786 plusmn 3 778 plusmn 3 785 plusmn 8 699 plusmn 1 806 plusmn 4

Table 5 Test accuracy results () of random forest models GRRF(01) varSelRF and LASSO logistic regression applied to gene datasetsThe average results of 100 repetitions were computed higher values are better The number of genes in the strong group X

119904and the weak

group X119908is used in xRF

Dataset xRF RF wsRF GRRF varSelRF LASSO X119904

X119908

colon 8765 8435 8450 8645 7680 8200 245 317srbct 9771 9590 9676 9757 9650 9930 606 546Leukemia 8925 8258 8483 8725 8930 9240 502 200Lymphoma 9930 9715 9810 9910 9780 9910 1404 275breast2class 7884 6272 6340 7132 6140 6340 194 631breast3class 6542 5600 5719 6355 5820 6000 724 533nci 7415 5885 5940 6305 5820 6040 247 1345Brain 8193 7079 7079 7479 7690 7410 1270 1219Prostate 9256 8871 9079 9285 9150 9120 601 323Adenocarcinoma 9088 8404 8412 8552 7880 8110 108 669

The Scientific World Journal 15

Table 6The accuracy of prediction and error bound 1198881199042 of the models using a small subspace119898119905119903119910 = [log2(119872)+ 1] better values are bold

Dataset 1198881199042 Error bound Test accuracy () X119904

X119908RF wsRF xRF RF GRRF wsRF xRF

Fbis 2149 1179 1209 7642 7651 8414 8469 201 555La2s 1526 0904 0780 6677 6799 8726 8861 353 1136La1s 408 0892 1499 7776 8049 8603 8721 220 1532

002

004

006

008

RF wsRF xRFColon

cs2

erro

r bou

nd

001

002

003

RF wsRF xRFSrbct

cs2

erro

r bou

nd

002

004

006

RF wsRF xRFLeukemia

cs2

erro

r bou

nd

001

002

003

RF wsRF xRFLymphoma

cs2

erro

r bou

nd002

003

004

005

006

007

RF wsRF xRFBreast2class

cs2

erro

r bou

nd

004

006

008

010

012

RF wsRF xRFBreast3class

cs2

erro

r bou

nd

002

004

006

RF wsRF xRFnci

cs2

erro

r bou

nd

0025

0050

0075

RF wsRF xRFBrain

cs2

erro

r bou

nd

002

003

004

005

006

RF wsRF xRFProstate

cs2

erro

r bou

nd

002

004

006

008

010

RF wsRF xRFAdenocarcinoma

cs2

erro

r bou

nd

Figure 8 Box plots of (1198881199042) error bound for the models applied to the 10 gene datasets

by Breiman it demonstrates higher prediction accuracy andshorter computational time than those reported by BreimanThis achievement is considered to be one of the contributionsin our work

6 Conclusions

We have presented a new method for feature subspaceselection for building efficient random forest xRF model for

classification high-dimensional data Our main contributionis to make a new approach for unbiased feature samplingwhich selects the set of unbiased features for splitting anode when growing trees in the forests Furthermore thisnew unbiased feature selection method also reduces dimen-sionality using a defined threshold to remove uninformativefeatures (or noise) from the dataset Experimental resultshave demonstrated the improvements in increasing of the testaccuracy and the AUC measures for classification problems

16 The Scientific World Journal

70

75

80

85

50 100 150 200Number of trees

Accu

racy

()

70

75

80

85

25 50 75 100Number of features

Accu

racy

()

log(M) + 1

(a) Fbis

85

86

87

88

89

100 200 300 400 500Number of trees

Accu

racy

()

60

70

80

90

10 20 30 40 50Number of features

Accu

racy

()

log(M) + 1

(b) La2s

70

75

80

85

50 100 150 200Number of trees

Accu

racy

()

MethodsRFwsRFxRF

MethodsRFwsRFxRF

30

40

50

60

70

80

10 20 30 40 50Number of features

Accu

racy

() log(M) + 1

(c) La1s

Figure 9 The accuracy of prediction of the three random forests models against the number of trees and features on the three datasets

The Scientific World Journal 17

especially for image and microarray datasets in comparisonwith recent proposed random forests models including RFGRRF and wsRF

For futurework we think it would be desirable to increasethe scalability of the proposed random forests algorithm byparallelizing themon the cloud platform to deal with big datathat is hundreds of millions of samples and features

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research is supported in part by NSFC under Grantno 61203294 and Hanoi-DOST under the Grant no 01C-0701-2012-2 The author Thuy Thi Nguyen is supported bythe project ldquoSome Advanced Statistical Learning Techniquesfor Computer Visionrdquo funded by the National Foundation ofScience and Technology Development Vietnam under theGrant no 10201-201117

References

[1] L Breiman ldquoRandom forestsrdquo Machine Learning vol 450 no1 pp 5ndash32 2001

[2] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees CRC Press Boca Raton FlaUSA 1984

[3] H Kim and W-Y Loh ldquoClassification trees with unbiasedmultiway splitsrdquo Journal of the American Statistical Associationvol 96 no 454 pp 589ndash604 2001

[4] A PWhite andW Z Liu ldquoTechnical note bias in information-based measures in decision tree inductionrdquo Machine Learningvol 15 no 3 pp 321ndash329 1994

[5] T G Dietterich ldquoExperimental comparison of three methodsfor constructing ensembles of decision trees bagging boostingand randomizationrdquo Machine Learning vol 40 no 2 pp 139ndash157 2000

[6] Y Freund and R E Schapire ldquoA desicion-theoretic general-ization of on-line learning and an application to boostingrdquo inComputational Learning Theory pp 23ndash37 Springer 1995

[7] T-T Nguyen and T T Nguyen ldquoA real time license platedetection system based on boosting learning algorithmrdquo inProceedings of the 5th International Congress on Image and SignalProcessing (CISP rsquo12) pp 819ndash823 IEEE October 2012

[8] T K Ho ldquoRandom decision forestsrdquo in Proceedings of the 3rdInternational Conference on Document Analysis and Recogni-tion vol 1 pp 278ndash282 1995

[9] T K Ho ldquoThe random subspace method for constructingdecision forestsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 20 no 8 pp 832ndash844 1998

[10] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[11] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006

[12] RGenuer J-M Poggi andC Tuleau-Malot ldquoVariable selectionusing random forestsrdquoPattern Recognition Letters vol 31 no 14pp 2225ndash2236 2010

[13] B Xu J Z Huang GWilliams QWang and Y Ye ldquoClassifyingvery high-dimensional data with random forests built fromsmall subspacesrdquo International Journal ofDataWarehousing andMining vol 8 no 2 pp 44ndash63 2012

[14] Y Ye Q Wu J Zhexue Huang M K Ng and X Li ldquoStratifiedsampling for feature subspace selection in random forests forhigh dimensional datardquo Pattern Recognition vol 46 no 3 pp769ndash787 2013

[15] X Chen Y Ye X Xu and J Z Huang ldquoA feature groupweighting method for subspace clustering of high-dimensionaldatardquo Pattern Recognition vol 45 no 1 pp 434ndash446 2012

[16] D Amaratunga J Cabrera and Y-S Lee ldquoEnriched randomforestsrdquo Bioinformatics vol 240 no 18 pp 2010ndash2014 2008

[17] H Deng and G Runger ldquoGene selection with guided regular-ized random forestrdquo Pattern Recognition vol 46 no 12 pp3483ndash3489 2013

[18] C Strobl ldquoStatistical sources of variable selection bias inclassification trees based on the gini indexrdquo Tech Rep SFB 3862005 httpepububuni-muenchendearchive0000178901paper 420pdf

[19] C Strobl A-L Boulesteix and T Augustin ldquoUnbiased splitselection for classification trees based on the gini indexrdquoComputational Statistics amp Data Analysis vol 520 no 1 pp483ndash501 2007

[20] C Strobl A-L Boulesteix A Zeileis and T Hothorn ldquoBiasin random forest variable importance measures illustrationssources and a solutionrdquo BMC Bioinformatics vol 8 article 252007

[21] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008

[22] T Hothorn K Hornik and A Zeileis Party a laboratoryfor recursive partytioning r package version 09-9999 2011httpcranr-projectorgpackage=party

[23] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics vol 10 no 6 pp 80ndash83 1945

[24] T-TNguyen J ZHuang andT TNguyen ldquoTwo-level quantileregression forests for bias correction in range predictionrdquoMachine Learning 2014

[25] T-T Nguyen J Z Huang K Imran M J Li and GWilliams ldquoExtensions to quantile regression forests for veryhigh-dimensional datardquo in Advances in Knowledge Discoveryand Data Mining vol 8444 of Lecture Notes in ComputerScience pp 247ndash258 Springer Berlin Germany 2014

[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001

[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994

[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991

[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237

18 The Scientific World Journal

[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002

[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml

[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 14: Research Article Unbiased Feature Selection in …downloads.hindawi.com/journals/tswj/2015/471371.pdfResearch Article Unbiased Feature Selection in Learning Random Forests for High-Dimensional

14 The Scientific World Journal

Table 3 The prediction test accuracy (mean plusmn std-dev) of the models on the image datasets against the number of trees 119870 The numberof feature dimensions in each subdataset is fixed Numbers in bold are the best results

Dataset Model 119870 = 20 119870 = 50 119870 = 80 119870 = 100 119870 = 200

CaltechM3000

xRF 9550 plusmn 2 9650 plusmn 1 9650 plusmn 2 9700 plusmn 1 9750 plusmn 2RF 7000 plusmn 7 7600 plusmn 9 7750 plusmn 12 8250 plusmn 16 8150 plusmn 2

wsRF 9150 plusmn 4 9100 plusmn 3 9300 plusmn 2 9450 plusmn 4 9200 plusmn 9GRRF 9300 plusmn 2 9600 plusmn 2 9450 plusmn 2 9500 plusmn 3 9400 plusmn 2

HorseM3000

xRF 8059 plusmn 4 8176 plusmn 2 7971 plusmn 6 8029 plusmn 1 7765 plusmn 5RF 5059 plusmn 10 5294 plusmn 8 5618 plusmn 4 5824 plusmn 5 5735 plusmn 9

wsRF 6206 plusmn 4 6882 plusmn 3 6765 plusmn 3 6765 plusmn 5 6588 plusmn 7GRRF 6500 plusmn 9 6353 plusmn 3 6853 plusmn 3 6353 plusmn 9 7118 plusmn 4

YaleBEigenfaceM504

xRF 7568 plusmn 1 8565 plusmn 1 8808 plusmn 1 8894 plusmn 0 9122 plusmn 0RF 7193 plusmn 1 7948 plusmn 1 8069 plusmn 1 8167 plusmn 1 8289 plusmn 1

wsRF 7760 plusmn 1 8561 plusmn 0 8811 plusmn 0 8931 plusmn 0 9068 plusmn 0GRRF 7473 plusmn 0 8470 plusmn 1 8725 plusmn 0 8961 plusmn 0 9189 plusmn 0

YaleBrandomfaceM504

xRF 9471 plusmn 0 9764 plusmn 0 9801 plusmn 0 9822 plusmn 0 9859 plusmn 0RF 8800 plusmn 0 9259 plusmn 0 9413 plusmn 0 9486 plusmn 0 9606 plusmn 0

wsRF 9540 plusmn 0 9790 plusmn 0 9817 plusmn 0 9814 plusmn 0 9838 plusmn 0GRRF 9566 plusmn 0 9810 plusmn 0 9842 plusmn 0 9892 plusmn 0 9884 plusmn 0

ORLEigenfaceM504

xRF 7625 plusmn 6 8725 plusmn 3 9175 plusmn 2 9325 plusmn 2 9475 plusmn 2RF 7175 plusmn 2 7875 plusmn 4 8200 plusmn 3 8275 plusmn 3 8550 plusmn 5

wsRF 7825 plusmn 4 8875 plusmn 3 9000 plusmn 1 9125 plusmn 2 9250 plusmn 2GRRF 7350 plusmn 6 8500 plusmn 2 9000 plusmn 1 9075 plusmn 3 9475 plusmn 1

ORLrandomfaceM504

xRF 8775 plusmn 3 9250 plusmn 2 9550 plusmn 1 9425 plusmn 1 9600 plusmn 1RF 7750 plusmn 3 8200 plusmn 7 8450 plusmn 2 8750 plusmn 2 8600 plusmn 2

wsRF 8700 plusmn 5 9375 plusmn 2 9375 plusmn 0 9500 plusmn 1 9550 plusmn 1GRRF 8725 plusmn 1 9325 plusmn 1 9450 plusmn 1 9425 plusmn 1 9550 plusmn 1

Table 4 AUC results (mean plusmn std-dev) of random forest models against the number of trees 119870 on the CaltechM3000 and HorseM3000subdatasets The bold value in each row indicates the best result

Dataset Model 119870 = 20 119870 = 50 119870 = 80 119870 = 100 119870 = 200

CaltechM3000

xRF 995 plusmn 0 999 plusmn 5 100 plusmn 2 100 plusmn 1 100 plusmn 1RF 851 plusmn 7 817 plusmn 4 826 plusmn 12 865 plusmn 6 864 plusmn 1

wsRF 841 plusmn 1 845 plusmn 8 834 plusmn 7 850 plusmn 8 870 plusmn 9GRRF 846 plusmn 1 860 plusmn 2 862 plusmn 1 908 plusmn 1 923 plusmn 1

HorseM3000

xRF 849 plusmn 1 887 plusmn 0 895 plusmn 0 898 plusmn 0 897 plusmn 0RF 637 plusmn 4 664 plusmn 7 692 plusmn 15 696 plusmn 3 733 plusmn 9

wsRF 635 plusmn 8 687 plusmn 4 679 plusmn 6 671 plusmn 4 718 plusmn 9GRRF 786 plusmn 3 778 plusmn 3 785 plusmn 8 699 plusmn 1 806 plusmn 4

Table 5 Test accuracy results () of random forest models GRRF(01) varSelRF and LASSO logistic regression applied to gene datasetsThe average results of 100 repetitions were computed higher values are better The number of genes in the strong group X

119904and the weak

group X119908is used in xRF

Dataset xRF RF wsRF GRRF varSelRF LASSO X119904

X119908

colon 8765 8435 8450 8645 7680 8200 245 317srbct 9771 9590 9676 9757 9650 9930 606 546Leukemia 8925 8258 8483 8725 8930 9240 502 200Lymphoma 9930 9715 9810 9910 9780 9910 1404 275breast2class 7884 6272 6340 7132 6140 6340 194 631breast3class 6542 5600 5719 6355 5820 6000 724 533nci 7415 5885 5940 6305 5820 6040 247 1345Brain 8193 7079 7079 7479 7690 7410 1270 1219Prostate 9256 8871 9079 9285 9150 9120 601 323Adenocarcinoma 9088 8404 8412 8552 7880 8110 108 669

The Scientific World Journal 15

Table 6The accuracy of prediction and error bound 1198881199042 of the models using a small subspace119898119905119903119910 = [log2(119872)+ 1] better values are bold

Dataset 1198881199042 Error bound Test accuracy () X119904

X119908RF wsRF xRF RF GRRF wsRF xRF

Fbis 2149 1179 1209 7642 7651 8414 8469 201 555La2s 1526 0904 0780 6677 6799 8726 8861 353 1136La1s 408 0892 1499 7776 8049 8603 8721 220 1532

002

004

006

008

RF wsRF xRFColon

cs2

erro

r bou

nd

001

002

003

RF wsRF xRFSrbct

cs2

erro

r bou

nd

002

004

006

RF wsRF xRFLeukemia

cs2

erro

r bou

nd

001

002

003

RF wsRF xRFLymphoma

cs2

erro

r bou

nd002

003

004

005

006

007

RF wsRF xRFBreast2class

cs2

erro

r bou

nd

004

006

008

010

012

RF wsRF xRFBreast3class

cs2

erro

r bou

nd

002

004

006

RF wsRF xRFnci

cs2

erro

r bou

nd

0025

0050

0075

RF wsRF xRFBrain

cs2

erro

r bou

nd

002

003

004

005

006

RF wsRF xRFProstate

cs2

erro

r bou

nd

002

004

006

008

010

RF wsRF xRFAdenocarcinoma

cs2

erro

r bou

nd

Figure 8 Box plots of (1198881199042) error bound for the models applied to the 10 gene datasets

by Breiman it demonstrates higher prediction accuracy andshorter computational time than those reported by BreimanThis achievement is considered to be one of the contributionsin our work

6 Conclusions

We have presented a new method for feature subspaceselection for building efficient random forest xRF model for

classification high-dimensional data Our main contributionis to make a new approach for unbiased feature samplingwhich selects the set of unbiased features for splitting anode when growing trees in the forests Furthermore thisnew unbiased feature selection method also reduces dimen-sionality using a defined threshold to remove uninformativefeatures (or noise) from the dataset Experimental resultshave demonstrated the improvements in increasing of the testaccuracy and the AUC measures for classification problems

16 The Scientific World Journal

70

75

80

85

50 100 150 200Number of trees

Accu

racy

()

70

75

80

85

25 50 75 100Number of features

Accu

racy

()

log(M) + 1

(a) Fbis

85

86

87

88

89

100 200 300 400 500Number of trees

Accu

racy

()

60

70

80

90

10 20 30 40 50Number of features

Accu

racy

()

log(M) + 1

(b) La2s

70

75

80

85

50 100 150 200Number of trees

Accu

racy

()

MethodsRFwsRFxRF

MethodsRFwsRFxRF

30

40

50

60

70

80

10 20 30 40 50Number of features

Accu

racy

() log(M) + 1

(c) La1s

Figure 9 The accuracy of prediction of the three random forests models against the number of trees and features on the three datasets

The Scientific World Journal 17

especially for image and microarray datasets in comparisonwith recent proposed random forests models including RFGRRF and wsRF

For futurework we think it would be desirable to increasethe scalability of the proposed random forests algorithm byparallelizing themon the cloud platform to deal with big datathat is hundreds of millions of samples and features

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research is supported in part by NSFC under Grantno 61203294 and Hanoi-DOST under the Grant no 01C-0701-2012-2 The author Thuy Thi Nguyen is supported bythe project ldquoSome Advanced Statistical Learning Techniquesfor Computer Visionrdquo funded by the National Foundation ofScience and Technology Development Vietnam under theGrant no 10201-201117

References

[1] L Breiman ldquoRandom forestsrdquo Machine Learning vol 450 no1 pp 5ndash32 2001

[2] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees CRC Press Boca Raton FlaUSA 1984

[3] H Kim and W-Y Loh ldquoClassification trees with unbiasedmultiway splitsrdquo Journal of the American Statistical Associationvol 96 no 454 pp 589ndash604 2001

[4] A PWhite andW Z Liu ldquoTechnical note bias in information-based measures in decision tree inductionrdquo Machine Learningvol 15 no 3 pp 321ndash329 1994

[5] T G Dietterich ldquoExperimental comparison of three methodsfor constructing ensembles of decision trees bagging boostingand randomizationrdquo Machine Learning vol 40 no 2 pp 139ndash157 2000

[6] Y Freund and R E Schapire ldquoA desicion-theoretic general-ization of on-line learning and an application to boostingrdquo inComputational Learning Theory pp 23ndash37 Springer 1995

[7] T-T Nguyen and T T Nguyen ldquoA real time license platedetection system based on boosting learning algorithmrdquo inProceedings of the 5th International Congress on Image and SignalProcessing (CISP rsquo12) pp 819ndash823 IEEE October 2012

[8] T K Ho ldquoRandom decision forestsrdquo in Proceedings of the 3rdInternational Conference on Document Analysis and Recogni-tion vol 1 pp 278ndash282 1995

[9] T K Ho ldquoThe random subspace method for constructingdecision forestsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 20 no 8 pp 832ndash844 1998

[10] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[11] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006

[12] RGenuer J-M Poggi andC Tuleau-Malot ldquoVariable selectionusing random forestsrdquoPattern Recognition Letters vol 31 no 14pp 2225ndash2236 2010

[13] B Xu J Z Huang GWilliams QWang and Y Ye ldquoClassifyingvery high-dimensional data with random forests built fromsmall subspacesrdquo International Journal ofDataWarehousing andMining vol 8 no 2 pp 44ndash63 2012

[14] Y Ye Q Wu J Zhexue Huang M K Ng and X Li ldquoStratifiedsampling for feature subspace selection in random forests forhigh dimensional datardquo Pattern Recognition vol 46 no 3 pp769ndash787 2013

[15] X Chen Y Ye X Xu and J Z Huang ldquoA feature groupweighting method for subspace clustering of high-dimensionaldatardquo Pattern Recognition vol 45 no 1 pp 434ndash446 2012

[16] D Amaratunga J Cabrera and Y-S Lee ldquoEnriched randomforestsrdquo Bioinformatics vol 240 no 18 pp 2010ndash2014 2008

[17] H Deng and G Runger ldquoGene selection with guided regular-ized random forestrdquo Pattern Recognition vol 46 no 12 pp3483ndash3489 2013

[18] C Strobl ldquoStatistical sources of variable selection bias inclassification trees based on the gini indexrdquo Tech Rep SFB 3862005 httpepububuni-muenchendearchive0000178901paper 420pdf

[19] C Strobl A-L Boulesteix and T Augustin ldquoUnbiased splitselection for classification trees based on the gini indexrdquoComputational Statistics amp Data Analysis vol 520 no 1 pp483ndash501 2007

[20] C Strobl A-L Boulesteix A Zeileis and T Hothorn ldquoBiasin random forest variable importance measures illustrationssources and a solutionrdquo BMC Bioinformatics vol 8 article 252007

[21] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008

[22] T Hothorn K Hornik and A Zeileis Party a laboratoryfor recursive partytioning r package version 09-9999 2011httpcranr-projectorgpackage=party

[23] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics vol 10 no 6 pp 80ndash83 1945

[24] T-TNguyen J ZHuang andT TNguyen ldquoTwo-level quantileregression forests for bias correction in range predictionrdquoMachine Learning 2014

[25] T-T Nguyen J Z Huang K Imran M J Li and GWilliams ldquoExtensions to quantile regression forests for veryhigh-dimensional datardquo in Advances in Knowledge Discoveryand Data Mining vol 8444 of Lecture Notes in ComputerScience pp 247ndash258 Springer Berlin Germany 2014

[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001

[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994

[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991

[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237

18 The Scientific World Journal

[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002

[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml

[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 15: Research Article Unbiased Feature Selection in …downloads.hindawi.com/journals/tswj/2015/471371.pdfResearch Article Unbiased Feature Selection in Learning Random Forests for High-Dimensional

The Scientific World Journal 15

Table 6The accuracy of prediction and error bound 1198881199042 of the models using a small subspace119898119905119903119910 = [log2(119872)+ 1] better values are bold

Dataset 1198881199042 Error bound Test accuracy () X119904

X119908RF wsRF xRF RF GRRF wsRF xRF

Fbis 2149 1179 1209 7642 7651 8414 8469 201 555La2s 1526 0904 0780 6677 6799 8726 8861 353 1136La1s 408 0892 1499 7776 8049 8603 8721 220 1532

002

004

006

008

RF wsRF xRFColon

cs2

erro

r bou

nd

001

002

003

RF wsRF xRFSrbct

cs2

erro

r bou

nd

002

004

006

RF wsRF xRFLeukemia

cs2

erro

r bou

nd

001

002

003

RF wsRF xRFLymphoma

cs2

erro

r bou

nd002

003

004

005

006

007

RF wsRF xRFBreast2class

cs2

erro

r bou

nd

004

006

008

010

012

RF wsRF xRFBreast3class

cs2

erro

r bou

nd

002

004

006

RF wsRF xRFnci

cs2

erro

r bou

nd

0025

0050

0075

RF wsRF xRFBrain

cs2

erro

r bou

nd

002

003

004

005

006

RF wsRF xRFProstate

cs2

erro

r bou

nd

002

004

006

008

010

RF wsRF xRFAdenocarcinoma

cs2

erro

r bou

nd

Figure 8 Box plots of (1198881199042) error bound for the models applied to the 10 gene datasets

by Breiman it demonstrates higher prediction accuracy andshorter computational time than those reported by BreimanThis achievement is considered to be one of the contributionsin our work

6 Conclusions

We have presented a new method for feature subspaceselection for building efficient random forest xRF model for

classification high-dimensional data Our main contributionis to make a new approach for unbiased feature samplingwhich selects the set of unbiased features for splitting anode when growing trees in the forests Furthermore thisnew unbiased feature selection method also reduces dimen-sionality using a defined threshold to remove uninformativefeatures (or noise) from the dataset Experimental resultshave demonstrated the improvements in increasing of the testaccuracy and the AUC measures for classification problems

16 The Scientific World Journal

70

75

80

85

50 100 150 200Number of trees

Accu

racy

()

70

75

80

85

25 50 75 100Number of features

Accu

racy

()

log(M) + 1

(a) Fbis

85

86

87

88

89

100 200 300 400 500Number of trees

Accu

racy

()

60

70

80

90

10 20 30 40 50Number of features

Accu

racy

()

log(M) + 1

(b) La2s

70

75

80

85

50 100 150 200Number of trees

Accu

racy

()

MethodsRFwsRFxRF

MethodsRFwsRFxRF

30

40

50

60

70

80

10 20 30 40 50Number of features

Accu

racy

() log(M) + 1

(c) La1s

Figure 9 The accuracy of prediction of the three random forests models against the number of trees and features on the three datasets

The Scientific World Journal 17

especially for image and microarray datasets in comparisonwith recent proposed random forests models including RFGRRF and wsRF

For futurework we think it would be desirable to increasethe scalability of the proposed random forests algorithm byparallelizing themon the cloud platform to deal with big datathat is hundreds of millions of samples and features

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research is supported in part by NSFC under Grantno 61203294 and Hanoi-DOST under the Grant no 01C-0701-2012-2 The author Thuy Thi Nguyen is supported bythe project ldquoSome Advanced Statistical Learning Techniquesfor Computer Visionrdquo funded by the National Foundation ofScience and Technology Development Vietnam under theGrant no 10201-201117

References

[1] L Breiman ldquoRandom forestsrdquo Machine Learning vol 450 no1 pp 5ndash32 2001

[2] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees CRC Press Boca Raton FlaUSA 1984

[3] H Kim and W-Y Loh ldquoClassification trees with unbiasedmultiway splitsrdquo Journal of the American Statistical Associationvol 96 no 454 pp 589ndash604 2001

[4] A PWhite andW Z Liu ldquoTechnical note bias in information-based measures in decision tree inductionrdquo Machine Learningvol 15 no 3 pp 321ndash329 1994

[5] T G Dietterich ldquoExperimental comparison of three methodsfor constructing ensembles of decision trees bagging boostingand randomizationrdquo Machine Learning vol 40 no 2 pp 139ndash157 2000

[6] Y Freund and R E Schapire ldquoA desicion-theoretic general-ization of on-line learning and an application to boostingrdquo inComputational Learning Theory pp 23ndash37 Springer 1995

[7] T-T Nguyen and T T Nguyen ldquoA real time license platedetection system based on boosting learning algorithmrdquo inProceedings of the 5th International Congress on Image and SignalProcessing (CISP rsquo12) pp 819ndash823 IEEE October 2012

[8] T K Ho ldquoRandom decision forestsrdquo in Proceedings of the 3rdInternational Conference on Document Analysis and Recogni-tion vol 1 pp 278ndash282 1995

[9] T K Ho ldquoThe random subspace method for constructingdecision forestsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 20 no 8 pp 832ndash844 1998

[10] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[11] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006

[12] RGenuer J-M Poggi andC Tuleau-Malot ldquoVariable selectionusing random forestsrdquoPattern Recognition Letters vol 31 no 14pp 2225ndash2236 2010

[13] B Xu J Z Huang GWilliams QWang and Y Ye ldquoClassifyingvery high-dimensional data with random forests built fromsmall subspacesrdquo International Journal ofDataWarehousing andMining vol 8 no 2 pp 44ndash63 2012

[14] Y Ye Q Wu J Zhexue Huang M K Ng and X Li ldquoStratifiedsampling for feature subspace selection in random forests forhigh dimensional datardquo Pattern Recognition vol 46 no 3 pp769ndash787 2013

[15] X Chen Y Ye X Xu and J Z Huang ldquoA feature groupweighting method for subspace clustering of high-dimensionaldatardquo Pattern Recognition vol 45 no 1 pp 434ndash446 2012

[16] D Amaratunga J Cabrera and Y-S Lee ldquoEnriched randomforestsrdquo Bioinformatics vol 240 no 18 pp 2010ndash2014 2008

[17] H Deng and G Runger ldquoGene selection with guided regular-ized random forestrdquo Pattern Recognition vol 46 no 12 pp3483ndash3489 2013

[18] C Strobl ldquoStatistical sources of variable selection bias inclassification trees based on the gini indexrdquo Tech Rep SFB 3862005 httpepububuni-muenchendearchive0000178901paper 420pdf

[19] C Strobl A-L Boulesteix and T Augustin ldquoUnbiased splitselection for classification trees based on the gini indexrdquoComputational Statistics amp Data Analysis vol 520 no 1 pp483ndash501 2007

[20] C Strobl A-L Boulesteix A Zeileis and T Hothorn ldquoBiasin random forest variable importance measures illustrationssources and a solutionrdquo BMC Bioinformatics vol 8 article 252007

[21] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008

[22] T Hothorn K Hornik and A Zeileis Party a laboratoryfor recursive partytioning r package version 09-9999 2011httpcranr-projectorgpackage=party

[23] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics vol 10 no 6 pp 80ndash83 1945

[24] T-TNguyen J ZHuang andT TNguyen ldquoTwo-level quantileregression forests for bias correction in range predictionrdquoMachine Learning 2014

[25] T-T Nguyen J Z Huang K Imran M J Li and GWilliams ldquoExtensions to quantile regression forests for veryhigh-dimensional datardquo in Advances in Knowledge Discoveryand Data Mining vol 8444 of Lecture Notes in ComputerScience pp 247ndash258 Springer Berlin Germany 2014

[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001

[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994

[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991

[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237

18 The Scientific World Journal

[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002

[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml

[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 16: Research Article Unbiased Feature Selection in …downloads.hindawi.com/journals/tswj/2015/471371.pdfResearch Article Unbiased Feature Selection in Learning Random Forests for High-Dimensional


