Research Article
Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data
Thanh-Tung Nguyen,1,2,3 Joshua Zhexue Huang,1,4 and Thuy Thi Nguyen5
1 Shenzhen Key Laboratory of High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 School of Computer Science and Engineering, Water Resources University, Hanoi 10000, Vietnam
4 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
5 Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi 10000, Vietnam
Correspondence should be addressed to Thanh-Tung Nguyen; tungnt@wru.vn
Received 20 June 2014; Accepted 20 August 2014
Academic Editor: Shifei Ding
Copyright © 2015 Thanh-Tung Nguyen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process, where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.
1. Introduction
Random forests (RFs) [1] are a nonparametric method that builds an ensemble model of decision trees from random subsets of features and bagged samples of the training data.
RFs have shown excellent performance for both classification and regression problems. The RF model works well even when predictive features contain irrelevant features (or noise), and it can be used when the number of features is much larger than the number of samples. However, with the randomization mechanism in both bagging samples and feature selection, RFs can give poor accuracy when applied to high dimensional data. The main cause is that, in the process of growing a tree from the bagged sample data, the subspace of features randomly sampled from thousands of features to split a node of the tree is often dominated by uninformative features (or noise), and the tree grown from such a bagged subspace of features will have low prediction accuracy, which affects the final prediction of the RFs. Furthermore, Breiman et al. noted that feature selection is biased in the classification and regression tree (CART) model because it is based on an information gain criterion; this is known as the multivalue problem [2]. The criterion tends to favor features containing more values, even if these features have lower importance than other ones or have no relationship with the response feature (i.e., features containing fewer missing values or many categorical or distinct numerical values) [3, 4].
In this paper, we propose a new random forests algorithm using an unbiased feature sampling method to build a good subspace of unbiased features for growing trees.
Hindawi Publishing Corporation, The Scientific World Journal, Volume 2015, Article ID 471371, 18 pages. http://dx.doi.org/10.1155/2015/471371
We first use random forests to measure the importance of features and produce raw feature importance scores. Then we apply a statistical Wilcoxon rank-sum test to separate informative features from the uninformative ones. This is done by neglecting all uninformative features, as determined by a threshold θ, for instance θ = 0.05. Second, we use the Chi-square statistical test (χ²) to compute the relatedness score of each feature to the response feature. We then partition the set of remaining informative features into two subsets, one containing highly informative features and the other containing weakly informative features. We independently sample features from the two subsets and merge them together to get a new subspace of features, which is used for splitting the data at nodes. Since the subspace always contains highly informative features, which can guarantee a better split at a node, this feature sampling method avoids selecting biased features and generates trees from bagged sample data with higher accuracy. This sampling method is also used for dimensionality reduction, decreasing the amount of data needed for training the random forests model. Our experimental results have shown that random forests with this weighted feature selection technique outperformed recently proposed random forests in prediction accuracy; we also applied the new approach to microarray and image data and achieved outstanding results.
The structure of this paper is organized as follows. In Section 2, we give a brief summary of related work. In Section 3, we give a brief summary of random forests and the measurement of feature importance scores. Section 4 describes our new proposed algorithm using unbiased feature selection. Section 5 provides the experimental results, evaluations, and comparisons. Section 6 gives our conclusions.
2. Related Works
Random forests are an ensemble approach to making classification decisions by voting the results of individual decision trees. An ensemble learner with excellent generalization accuracy has two properties: high accuracy of each component learner and high diversity among component learners [5]. Unlike other ensemble methods, such as bagging [1] and boosting [6, 7], which create basic classifiers from random samples of the training data, the random forest approach creates the basic classifiers from randomly selected subspaces of data [8, 9]. The randomly selected subspaces increase the diversity of the basic classifiers learnt by a decision tree algorithm.
Feature importance is the importance measure of features in the feature selection process [1, 10–14]. In RF frameworks, the most commonly used importance score of a given feature is the mean increase in error of a tree in the forest when the observed values of this feature are randomly permuted in the out-of-bag samples. Feature selection is an important step in obtaining good performance for an RF model, especially when dealing with high dimensional data problems.
For feature weighting techniques, Xu et al. [13] recently proposed an improved RF method which uses a novel feature weighting method for subspace selection and thereby enhances classification performance on high dimensional data. The feature weights were calculated by the information gain ratio or the χ²-test. Ye et al. [14] then used these weights to propose a stratified sampling method to select feature subspaces for RF in classification problems. Chen et al. [15] used a stratification idea to propose a new clustering method. However, the implementation of the random forest model suggested by Ye et al. is based on a binary classification setting, and it uses linear discriminant analysis as the splitting criterion. This stratified RF model is not efficient on high dimensional datasets with multiple classes. Similarly, for the two-class problem, Amaratunga et al. [16] presented a feature weighting method for subspace sampling to deal with microarray data; the t-test of variance analysis is used to compute weights for the features. Genuer et al. [12] proposed a strategy involving a ranking of explanatory features using the RF importance score weights and a stepwise ascending feature introduction strategy. Deng and Runger [17] proposed a guided regularized RF (GRRF), in which importance scores from an ordinary random forest (RF) are used to guide the feature selection process. They found that the least regularized subset selected by their GRRF with minimal regularization ensures better accuracy than the complete feature set. However, a regular RF was used as the final classifier, because a regularized RF may have higher variance than RF since its trees are correlated.
Several methods have been proposed to correct the bias of importance measures in the feature selection process in RFs, in order to improve the prediction accuracy [18–21]. These methods aim to avoid selecting an uninformative feature for node splitting in decision trees. Although methods of this kind have been well investigated and can be used to address the high dimensional problem, there are still some unsolved problems, such as the need to specify the probability distributions in advance, as well as the fact that they struggle when applied to large high dimensional data.
In summary, in the reviewed approaches, the gain at higher levels of the tree is weighted differently from the gain at lower levels of the tree. In fact, at lower levels of the tree, the gain is reduced because of the effect of splits on different features at higher levels of the tree. That affects the final prediction performance of the RF model. To remedy this, in this paper we propose a new method for unbiased feature subset selection in high dimensional space to build RFs. Our approach differs from previous approaches in the techniques used to partition a subset of features. All uninformative features (considered as noise) are removed from the system, and the best feature set, which is highly related to the response feature, is found using a statistical method. The proposed sampling method always provides enough highly informative features for the feature subspace at any level of the decision trees. For the case of growing an RF model on data without noise, we used in-bag measures. This is a different importance score of features, which requires less computational time compared to the measures used by others. Our experimental results showed that our approach outperformed recently proposed RF methods.
input: L = {(X_i, Y_i)}_{i=1}^N | X ∈ R^M, Y ∈ {1, 2, ..., c}: the training dataset;
       K: the number of trees; mtry: the size of the subspaces.
output: A random forest RF.
(1) for k ← 1 to K do
(2)   Draw a bagged subset of samples L_k from L.
(3)   Grow a tree T_k from L_k:
(4)   while (stopping criteria is not met) do
(5)     Select randomly mtry features.
(6)     for m ← 1 to mtry do
(7)       Compute the decrease in the node impurity.
(8)     Choose the feature which decreases the impurity the most, and divide the node into two children nodes.
(9) Combine the K trees to form a random forest.

Algorithm 1: Random forest algorithm.
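To make the structure of Algorithm 1 concrete, a minimal Python sketch of the outer loop follows. This is illustrative only (the paper's experiments use R/C++ implementations); `train_tree` is a placeholder for a CART learner that evaluates mtry random features per node, and all names here are ours, not the authors'.

```python
import random
from collections import Counter

def train_random_forest(X, y, K, mtry, train_tree, seed=None):
    """Algorithm 1 in outline: grow K trees, each from a bagged sample.
    train_tree(Xb, yb, mtry) stands in for a CART learner that, at each
    node, evaluates mtry randomly chosen features and splits on the one
    giving the largest impurity decrease."""
    rng = random.Random(seed)
    n, forest = len(X), []
    for _ in range(K):
        idx = [rng.randrange(n) for _ in range(n)]  # bagged subset L_k
        forest.append(train_tree([X[i] for i in idx],
                                 [y[i] for i in idx], mtry))
    return forest

def rf_predict(forest, x):
    """Eq. (1): majority vote over the K tree predictions."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]
```

As a toy base learner for testing, `train_tree` could simply return a constant classifier predicting the bag's majority class; a real implementation would grow an unpruned CART tree.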
3. Background
3.1. Random Forest Algorithm. Given a training dataset L = {(X_i, Y_i)}_{i=1}^N | X_i ∈ R^M, Y ∈ {1, 2, ..., c}, where the X_i are features (also called predictor variables), Y is a class response feature, N is the number of training samples, and M is the number of features, and given a random forest model RF described in Algorithm 1, let ŷ_k be the prediction of tree T_k given input X. The prediction of the random forest with K trees is

  ŷ = majority vote {ŷ_k}_{1}^{K}. (1)
Since each tree is grown from a bagged sample set, it is grown with only about two-thirds of the samples in L, called in-bag samples. About one-third of the samples are left out, and these samples are called out-of-bag (OOB) samples, which are used to estimate the prediction error.
The OOB predicted value is Ŷ_i^OOB = (1/|O_i'|) Σ_{k ∈ O_i'} ŷ_k, where O_i' = L \ O_i, i and i' are in-bag and out-of-bag sample indices, and |O_i'| is the size of the OOB subdataset. The OOB prediction error is

  Err_OOB = (1/N_OOB) Σ_{i=1}^{N_OOB} E(Y_i, Ŷ_i^OOB), (2)

where E(·) is an error function and N_OOB is the OOB sample size.
3.2. Measurement of Feature Importance Score from an RF. Breiman presented a permutation technique to measure the importance of features in the prediction [1], called the out-of-bag importance score. The basic idea for measuring this kind of importance score is to compute the difference between the original mean error and the randomly permuted mean error on the OOB samples. The method stochastically rearranges all values of the jth feature in the OOB samples for each tree, uses the RF model to predict on this permuted data, and obtains the mean error. The aim of this permutation is to eliminate the existing association between the jth feature and the Y values and then to test the effect of this on the RF model. A feature is considered to be in strong association with the response if the mean error increases dramatically after permutation.
The other kind of feature importance measure can be obtained while the random forest is growing. This is described as follows. At each node t in a decision tree, the split is determined by the decrease in node impurity ΔR(t). The node impurity R(t) is the Gini index. If a subdataset in node t contains samples from c classes, Gini(t) is defined as

  R(t) = 1 − Σ_{j=1}^{c} p_j², (3)

where p_j is the relative frequency of class j in t. Gini(t) is minimized if the classes in t are skewed. After splitting t into two child nodes t_1 and t_2 with sample sizes N_1(t) and N_2(t), the Gini index of the split data is defined as

  Gini_split(t) = (N_1(t)/N(t)) Gini(t_1) + (N_2(t)/N(t)) Gini(t_2). (4)

The feature providing the smallest Gini_split(t) is chosen to split the node. The importance score of feature X_j in a single decision tree T_k is

  IS_k(X_j) = Σ_{t ∈ T_k} ΔR(t), (5)

and it is computed over all K trees in a random forest, defined as

  IS(X_j) = (1/K) Σ_{k=1}^{K} IS_k(X_j). (6)
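Equations (3)–(6) can be checked numerically with a short sketch (Python here for illustration; function names are ours):

```python
def gini(counts):
    """Eq. (3): node impurity 1 - sum_j p_j^2 from per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(left, right):
    """Eq. (4): child impurities weighted by child sample sizes."""
    nl, nr = sum(left), sum(right)
    return (nl * gini(left) + nr * gini(right)) / (nl + nr)

def importance_score(decreases_per_tree):
    """Eqs. (5)-(6): sum the impurity decreases Delta R(t) credited to a
    feature within each tree, then average over the K trees."""
    K = len(decreases_per_tree)
    return sum(sum(tree) for tree in decreases_per_tree) / K
```

For instance, a 50/50 two-class node has Gini impurity 0.5, and a split that separates the classes perfectly drives Gini_split to 0.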
It is worth noting that a random forest can use in-bag samples to produce a kind of importance measure called the in-bag importance score. This is the main difference between the in-bag importance score and the out-of-bag measure, which is produced from the decrease of prediction error using the RF on OOB samples. In other words, the in-bag importance score requires less computation time than the out-of-bag measure.
4. Our Approach
4.1. Issues in Feature Selection on High Dimensional Data. When Breiman et al. suggested the classification and regression tree (CART) model, they noted that feature selection is biased because it is based on an information gain criterion; this is known as the multivalue problem [2]. Random forest methods are based on CART trees [1]; hence this bias is carried over to the random forest (RF) model. In particular, the importance scores can be biased when very high dimensional data contains multiple data types. Several methods have been proposed to correct the bias of feature importance measures [18–21]. The conditional inference framework (referred to as cRF [22]) could be successfully applied in both the null and the power case [19, 20, 22]. The typical characteristic of the power case is that only one predictor feature is important, while the rest of the features are redundant, with different cardinality. In contrast, in the null case all features used for prediction are redundant, with different cardinality. Although methods of this kind have been well investigated and can be used to address the multivalue problem, there are still some unsolved problems, such as the need to specify the probability distributions in advance, as well as the fact that they struggle when applied to high dimensional data.
Another issue is that, in high dimensional data, when the number of features is large, the fraction of informative features remains very small. In this case, the original RF model, which uses simple random sampling, is likely to perform poorly with a small m, and the trees are likely to select an uninformative feature as a split too frequently (m denotes the subspace size of features). At each node t of a tree, the probability of selecting an uninformative feature is too high.
To illustrate this issue, let G be the number of noisy features, denote by M the total number of predictor features, and let the remaining M − G features be important ones, which have a high correlation with the Y values. Then, if we use simple random sampling when growing trees to select a subset of m features (m ≪ M), the total number of possible all-important subsets is C^m_{M−G}, and the total number of all feature subsets is C^m_M. The probability of selecting a subset of m (m > 1) important features is given by

  C^m_{M−G} / C^m_M = [(M − G)(M − G − 1) ⋯ (M − G − m + 1)] / [M(M − 1) ⋯ (M − m + 1)]
                    = [(1 − G/M) ⋯ (1 − G/M − (m − 1)/M)] / [(1 − 1/M) ⋯ (1 − (m − 1)/M)]
                    ≃ (1 − G/M)^m. (7)

Because the fraction of important features is too small, the probability in (7) tends to 0, which means that the important features are rarely selected by the simple sampling method in RF [1]. For example, with 5 informative and 5000 noisy or uninformative features, assuming m = √(5 + 5000) ≃ 70, the probability that an informative feature is selected at any split is 0.068.
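The closing example can be verified exactly with the hypergeometric counts behind (7); the short Python check below (ours, for illustration) uses m = ⌊√5005⌋ = 70:

```python
from math import comb, isqrt

M, G = 5005, 5000          # total features; uninformative (noisy) ones
m = isqrt(M)               # subspace size: floor(sqrt(5005)) = 70
p_no_informative = comb(G, m) / comb(M, m)   # every sampled feature is noise
p_informative = 1 - p_no_informative         # subspace has an informative one
print(round(p_informative, 3))               # 0.068, as in the text
```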
4.2. Bias Correction for Feature Selection and Feature Weighting. The bias correction in feature selection is intended to make the RF model avoid selecting uninformative features. To correct this kind of bias in the feature selection stage, we generate shadow features and add them to the original dataset. The shadow features have the same values, possible cut-points, and distribution as the original features but no association with the Y values. To create each shadow feature, we rearrange the values of the corresponding feature in the original dataset R times. This disturbance of features eliminates their correlations with the response value but keeps their other attributes. The shadow features participate only in the competition for the best split, thereby decreasing the probability of selecting uninformative features of this kind.

For the feature weight computation, we first need to distinguish the important features from the less important ones. To do so, we run a defined number of random forests to obtain raw importance scores, each of which is obtained using (6). Then we use the Wilcoxon rank-sum test [23], which compares the importance score of a feature with the maximum importance score of the generated noisy features, called shadows. The shadow features are added to the original dataset and have no prediction power with respect to the response feature. Therefore, any feature whose importance score is smaller than the maximum importance score of the noisy features is considered less important; otherwise, it is considered important. Having computed the Wilcoxon rank-sum test, we can compute the p-value for the feature. The p-value of a feature in the Wilcoxon rank-sum test is assigned as a weight to the feature X_j, with p-value ∈ [0, 1], and this weight indicates the importance of the feature in the prediction. The smaller the p-value of a feature, the more correlated the predictor feature is with the response feature, and therefore the more powerful the feature is in prediction. The feature weight computation is described as follows.

Let M be the number of features in the original dataset and denote the feature set as S_X = {X_j, j = 1, 2, ..., M}. In each replicate r (r = 1, 2, ..., R), shadow features are generated from the features X_j in S_X: we randomly permute all values of X_j to get a corresponding shadow feature A_j, and we denote the shadow feature set as S_A = {A_j}_{1}^{M}. The extended feature set is denoted by S_{XA} = {S_X, S_A}.

Let the importance score of S_{XA} at replicate r be IS^r_{XA} = {IS^r_X, IS^r_A}, where IS^r_{X_j} and IS^r_{A_j} are the importance scores of X_j and A_j at the rth replicate, respectively. We built a random forest model RF from the S_{XA} dataset to compute 2M importance scores for the 2M features. We repeated the same process R times to compute R replicates, getting IS_{X_j} = {IS^r_{X_j}}_{1}^{R} and IS_{A_j} = {IS^r_{A_j}}_{1}^{R}. From the replicates of the shadow features, we extracted the maximum value from the rth row of IS_{A_j} and put it into the comparison sample, denoted by IS^max_A. For each data feature X_j, we computed the Wilcoxon test and performed a hypothesis test on IS_{X_j} > IS^max_A to calculate the p-value for the feature. Given a statistical significance level, we can identify important features from less important ones. This test confirms that, if a feature is important, it consistently
scores higher than the shadow over multiple permutations. This method has been presented in [24, 25].
At each node of the trees, each shadow A_j shares approximately the same properties as the corresponding X_j, but it is independent of Y and consequently has approximately the same probability of being selected as a splitting candidate. This feature permutation method can reduce the bias due to different measurement levels of X_j, according to the p-value, and can yield a correct ranking of features according to their importance.
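A simplified sketch of the shadow/rank-sum machinery is given below. This is our illustration, not the authors' implementation: the function names are invented, and the one-sided Wilcoxon rank-sum p-value uses the normal approximation and assumes untied importance scores.

```python
import random
from math import erf, sqrt

def make_shadow(column, rng=random):
    """Permute a feature column: same values and distribution as the
    original, but any association with Y is destroyed."""
    shadow = list(column)
    rng.shuffle(shadow)
    return shadow

def ranksum_pvalue(a, b):
    """One-sided Wilcoxon rank-sum p-value (normal approximation, no
    ties): small p supports 'scores in a are larger than in b'."""
    pooled = sorted(a + b)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    W = sum(rank[v] for v in a)                    # rank sum of sample a
    na, nb = len(a), len(b)
    mean = na * (na + nb + 1) / 2
    sd = sqrt(na * nb * (na + nb + 1) / 12)
    z = (W - mean) / sd
    return 0.5 * (1 - erf(z / sqrt(2)))            # P(Z >= z)
```

A feature whose R importance replicates sit clearly above the R max-shadow scores gets a small p-value and is kept; one that does not beat its shadows gets a large p-value and is dropped.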
4.3. Unbiased Feature Weighting for Subspace Selection. Given the p-values for all features, we first set a significance level as the threshold θ, for instance θ = 0.05. Any feature whose p-value is greater than θ is considered an uninformative feature and is removed from the system; otherwise, its relationship with Y is assessed. We now consider the set of features X obtained from L after neglecting all uninformative features.

Second, we find the best subset of features which is highly related to the response feature; a correlation measure function χ²(X, Y) is used to test the association between the categorical response feature and each feature X_j. Each observation is allocated to one cell of a two-dimensional array of cells (called a contingency table) according to the values of (X, Y). If there are r rows and c columns in the table and N is the total number of samples, the value of the test statistic is

  χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_ij − E_ij)² / E_ij. (8)

For the test of independence, a chi-squared probability of less than or equal to 0.05 is commonly interpreted as justification for rejecting the hypothesis that the row variable is independent of the column feature.
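Equation (8) in code, as a sketch (ours, Python for illustration), with the expected counts E_ij taken from the usual independence model, row total × column total / N:

```python
def chi_square(table):
    """Eq. (8): sum over cells of (O_ij - E_ij)^2 / E_ij, with expected
    counts E_ij = (row total * column total) / N."""
    r, c = len(table), len(table[0])
    N = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    stat = 0.0
    for i in range(r):
        for j in range(c):
            E = row_tot[i] * col_tot[j] / N
            stat += (table[i][j] - E) ** 2 / E
    return stat
```

An evenly spread table (no association) scores 0, while a feature that separates the classes perfectly scores N, so larger values indicate a stronger association with the response.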
Let X_s be the best subset of features: we collect all features X_j whose p-value is smaller than or equal to 0.05 as a result of the χ² statistical test according to (8). The remaining features, X \ X_s, are added to X_w; this approach is described in Algorithm 2. We independently sample features from the two subsets and put them together as the subspace features for splitting the data at any node, recursively. The two subsets partition the set of informative features in data without irrelevant features. Given X_s and X_w, at each node we randomly select mtry (mtry > 1) features from each group of features. For a given subspace size, we can choose proportions between highly informative features and weakly informative features that depend on the sizes of the two groups; that is, mtry_s = ⌈mtry × (|X_s|/|X|)⌉ and mtry_w = ⌊mtry × (|X_w|/|X|)⌋, where |X_s| and |X_w| are the numbers of features in the groups of highly informative features X_s and weakly informative features X_w, respectively, and |X| is the number of informative features in the input dataset. These are merged to form the feature subspace for splitting the node.
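The proportional split of mtry between the two groups might be sketched as follows (a Python illustration with invented names; the ceil/floor shares follow the formulas above):

```python
import random
from math import ceil, floor

def sample_subspace(X_s, X_w, mtry, rng=random):
    """Draw the candidate features for one node: mtry is split between
    the strong group X_s (ceil share) and the weak group X_w (floor
    share) in proportion to the group sizes, as in Section 4.3."""
    total = len(X_s) + len(X_w)
    mtry_s = ceil(mtry * len(X_s) / total)
    mtry_w = floor(mtry * len(X_w) / total)
    return rng.sample(X_s, min(mtry_s, len(X_s))) + \
           rng.sample(X_w, min(mtry_w, len(X_w)))
```

Because mtry_s is rounded up, at least one highly informative feature is present in the subspace whenever X_s is nonempty, which is the property the method relies on.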
4.4. Our Proposed RF Algorithm. In this section, we present our new random forest algorithm, called xRF, which uses the new unbiased feature sampling method to generate splits at the nodes of CART trees [2]. The proposed algorithm includes the following main steps: (i) weighting the features using the feature permutation method, (ii) identifying all unbiased features and partitioning them into two groups X_s and X_w, (iii) building the RF using subspaces containing features which are taken randomly and separately from X_s and X_w, and (iv) classifying new data. The new algorithm is summarized as follows.

(1) Generate the extended dataset S_{XA} of 2M dimensions by permuting the corresponding predictor feature values for shadow features.
(2) Build a random forest model RF from S_{XA}, Y, and compute R replicates of raw importance scores of all predictor features and shadows with RF. Extract the maximum importance score of each replicate to form the comparison sample IS^max_A of R elements.
(3) For each predictor feature, take the R importance scores and compute the Wilcoxon test to get the p-value, that is, the weight of the feature.
(4) Given a significance level threshold θ, neglect all uninformative features.
(5) Partition the remaining features into the two subsets X_s and X_w described in Algorithm 2.
(6) Sample the training set L with replacement to generate bagged samples L_1, L_2, ..., L_K.
(7) For each L_k, grow a CART tree T_k as follows:
  (a) At each node, select a subspace of mtry (mtry > 1) features randomly and separately from X_s and X_w, and use the subspace features as candidates for splitting the node.
  (b) Each tree is grown nondeterministically, without pruning, until the minimum node size n_min is reached.
(8) Given X = x_new, use (1) to predict the response value.
5. Experiments
5.1. Datasets. Real-world datasets, including image datasets and microarray datasets, were used in our experiments. Image classification and object recognition are important problems in computer vision. We conducted experiments on four benchmark image datasets, including the Caltech categories dataset (http://www.vision.caltech.edu/html-files/archive.html), the Horse dataset (http://pascal.inrialpes.fr/data/horses/), the extended YaleB database [26], and the AT&T ORL dataset [27].
For the Caltech dataset, we use a subset of 100 images from the Caltech face dataset and 100 images from the Caltech background dataset, following the setting in ICCV (http://people.csail.mit.edu/torralba/shortCourseRLOC/). The extended YaleB database consists of 2414 face images of 38 individuals captured under various lighting conditions. Each image has been cropped to a size of 192 × 168 pixels
Algorithm 2 (input): the training dataset L and a random forest RF; R, θ: the number of replicates and the threshold.
and normalized. The Horse dataset consists of 170 images containing horses for the positive class and 170 images of the background for the negative class. The AT&T ORL dataset includes 400 face images of 40 persons.
In the experiments, we use a bag-of-words representation of image features for the Caltech and Horse datasets. To obtain feature vectors using the bag-of-words method, image patches (subwindows) are sampled from the training images at detected interest points or on a dense grid. A visual descriptor is then applied to these patches to extract the local visual features. A clustering technique is then used to cluster these, and the cluster centers are used as visual code words to form a visual codebook. An image is then represented as a histogram of these visual words. A classifier is then learned from this feature set for classification.
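The quantization step of this pipeline can be illustrated with a toy sketch (ours, in Python): in practice the codebook comes from k-means over training patches, as described below, but here it is given directly.

```python
def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word and
    return the normalized word-count histogram for the image."""
    hist = [0] * len(codebook)
    for d in descriptors:
        # nearest codeword by squared Euclidean distance
        j = min(range(len(codebook)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(d, codebook[k])))
        hist[j] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]
```

The length of the returned histogram equals the codebook size, which is exactly why varying the number of cluster centers below varies the dimensionality of the resulting datasets.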
In our experiments, traditional k-means quantization is used to produce the visual codebook. The number of cluster centers can be adjusted to produce different vocabularies, that is, different dimensions of the feature vectors. For the Caltech and Horse datasets, nine codebook sizes were used in the experiments to create 18 datasets, as follows: CaltechM300, CaltechM500, CaltechM1000, CaltechM3000, CaltechM5000, CaltechM7000, CaltechM10000, CaltechM12000, CaltechM15000, and HorseM300, HorseM500, HorseM1000, HorseM3000, HorseM5000, HorseM7000, HorseM10000, HorseM12000, HorseM15000, where M denotes the codebook size.
For the face datasets, we use two types of features: eigenfaces [28] and random features (randomly sampled pixels from the images). We used four groups of datasets with four different numbers of dimensions: M30, M56, M120, and M504. In total, we created 16 subdatasets.
Table 1: Description of the real-world datasets, sorted by the number of features and grouped into two groups: microarray data and real-world datasets.
The properties of the remaining datasets are summarized in Table 1. The Fbis dataset was compiled from the archive of the Foreign Broadcast Information Service, and the La1s and La2s
datasets were taken from the archive of the Los Angeles Times for TREC-5 (http://trec.nist.gov). The ten gene datasets used are described in [11, 17]; they are high dimensional and fall within the category of classification problems which deal with a large number of features and small samples. Regarding the characteristics of the datasets given in Table 1, a proportion of each of the subdatasets Fbis, La1s, and La2s was used individually for the training and testing datasets.
5.2. Evaluation Methods. We calculated measures such as the error bound (c/s²), strength (s), and correlation (ρ), according to the formulas given in Breiman's method [1]. The correlation measure indicates the independence of the trees in a forest, whereas the average strength corresponds to the accuracy of the individual trees. Lower correlation and higher strength result in a reduction of the general error bound, measured by c/s², which indicates a highly accurate RF model.
Two measures are also used to evaluate the accuracy of prediction on the test datasets: one is the area under the curve (AUC), and the other is the test accuracy (Acc), defined as
  Acc = (1/N) Σ_{i=1}^{N} I(Q(d_i, y_i) − max_{j≠y_i} Q(d_i, j) > 0), (9)

where I(·) is the indicator function, Q(d_i, j) = Σ_{k=1}^{K} I(h_k(d_i) = j) is the number of votes for d_i ∈ D_t on class j, h_k is the kth tree classifier, N is the number of samples in the test data D_t, and y_i indicates the true class of d_i.
5.3. Experimental Settings. The latest R packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model was available in the RRF R package. The wsRF model, which uses the weighted sampling method [13], is intended to solve classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. From each fold, we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold dataset. The mtry and n_min parameters were set to √M and 1, respectively. The experimental results were evaluated with two measures, AUC and the test accuracy according to (9).
We compared the performances across a wide range of the 10 gene datasets used in [11]. The results from the application of GRRF, varSelRF, and LASSO logistic regression on the ten gene datasets are presented in [17]. These three gene selection methods used the RF R-package [30] as the classifier. For the comparison of the methods, we used the same settings as presented in [17]; for the coefficient γ we used the value 0.1, because GRRF(0.1) has shown competitive accuracy [17] when applied to the 10 gene datasets. One hundred models were generated with different seeds from each training dataset, and each model contained 1000 trees. The mtry and n_min parameters had the same settings as on the image datasets. From each of the datasets, two-thirds of the data were randomly selected for training. The other one-third of the dataset was used to validate the models. For comparison, Breiman's RF method, the weighted sampling random forest wsRF model, and the xRF model were used in the experiments. The guided regularized random forest GRRF [17] and two well-known feature selection methods using RF as a classifier, namely, varSelRF [31] and LASSO logistic regression [32], were also used to evaluate the accuracy of prediction on the high-dimensional datasets.
On the remaining datasets, the prediction performances of the random forest models were evaluated; each model was built with 500 trees. The number of feature candidates to split a node was mtry = ⌈log2(M) + 1⌉. The minimal node size n_min was 1. The xRF model with the new unbiased feature sampling method is a new implementation. We implemented the xRF model as multithread processes, while the other models were run as single-thread processes. We used R to call the corresponding C/C++ functions. All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel Xeon CPU E5620 2.40 GHz, 16 cores, 4 MB cache, and 32 GB main memory.
5.4. Results on Image Datasets. Figures 1 and 2 show the average accuracy plots of the recognition rates of the models on the different subdatasets of YaleB and ORL. The GRRF model produced slightly better results on the subdataset ORLRandomM120 and the ORL dataset using eigenface, and it showed competitive accuracy with the xRF model in some cases on both the YaleB and ORL datasets, for example, YaleBEigenM120, ORLRandomM56, and ORLRandomM120. The reason could be that the truly informative features in this kind of dataset were many. Therefore, when the informative feature set was large, the chance of selecting informative features in the subspace increased, which in turn increased the average recognition rates of the GRRF model. However, the xRF model produced the best results in the remaining cases. The effect of the new approach to feature subspace selection is clearly demonstrated in these results, although these datasets are not high dimensional.
Figures 3 and 5 present the box plots of the test accuracy (mean ± std-dev), and Figures 4 and 6 show the box plots of the AUC measures of the models on the 18 image subdatasets of Caltech and Horse, respectively. From these figures, we can observe that the accuracy and the AUC measures of the GRRF, wsRF, and xRF models increased on all high-dimensional subdatasets when the selected subspace mtry was not too large. This implies that when the number of features in the subspace is small, the proportion of informative features in the feature subspace is comparatively large in the three models. There is then a high chance that highly informative features are selected in the trees, so the overall performance of the individual trees is increased. In Breiman's method, many randomly selected subspaces may not contain informative features, which affects the performance of the trees grown from these subspaces. It can be seen that the xRF model outperformed the other random forest models on these subdatasets in increasing the test accuracy and the AUC measures. This was because the new unbiased feature sampling was used in generating trees in the xRF model; the feature subspace provided enough highly informative
The Scientific World Journal

Figure 1: Recognition rates (%) of the models (RF, GRRF, wsRF, xRF) against the feature dimension of the subdatasets: (a) YaleB + eigenface; (b) YaleB + randomface. The subdatasets are YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504, and YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, and YaleBRandomfaceM504.
Figure 2: Recognition rates (%) of the models (RF, GRRF, wsRF, xRF) against the feature dimension of the subdatasets: (a) ORL + eigenface; (b) ORL + randomface. The subdatasets are ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504, and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, and ORLRandomM504.
features at all levels of the decision trees. The effect of the unbiased feature selection method is clearly demonstrated in these results.
Table 2 shows the results of (c/s2) against the codebook sizes on the Caltech and Horse datasets. In a random forest, each tree is grown from a bagged training set. Out-of-bag estimates were used to evaluate the strength, correlation, and (c/s2). The GRRF model was not considered in this experiment, because this method aims to find a small subset of features, and the same RF model in the R-package [30] is used as its classifier. We compared the xRF model with two kinds of random forest models, RF and wsRF. From this table, we can observe that the lowest (c/s2) values occurred when the wsRF model was applied to the Caltech dataset. However, the xRF model produced the lowest error bound on the Horse dataset. These results demonstrate that the new unbiased feature sampling method can reduce the upper bound of the generalization error in random forests.
Table 3 presents the prediction accuracies (mean ± std-dev) of the models on the subdatasets CaltechM3000, HorseM3000, YaleBEigenfaceM504, YaleBRandomfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. In these experiments, we used the four models to generate random forests of different sizes, from 20 trees to 200 trees. For each size, we used each model to generate 10 random forests for the 10-fold cross-validation and computed the average accuracy of the 10 results. The GRRF model showed slightly better results on YaleBEigenfaceM504 with
Figure 3: Box plots of the test accuracy (%) of the models (RF, GRRF, wsRF, xRF) on the nine Caltech subdatasets (CaltechM300 to CaltechM15000).
different tree sizes. The wsRF model produced the best prediction performance in some cases when applied to the small subdatasets YaleBEigenfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. However, the xRF model produced, respectively, the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forest models in classification accuracy in most cases across all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy measures changed only slightly when the number of trees was varied.
5.5. Results on Microarray Datasets. Table 5 shows the average test results, in terms of accuracy, of the 100 random forest models computed according to (9) on the gene datasets. The average number of genes selected by the xRF model from 100 repetitions for each dataset is shown on the right of Table 5, divided into two groups, X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on the two gene datasets srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most cases on the remaining gene datasets.
Figure 4: Box plots of the AUC measures of the models (RF, GRRF, wsRF, xRF) on the nine Caltech subdatasets.
The detailed results, containing the median and the variance values, are presented in Figure 7 with box plots. Only the GRRF model was used for this comparison; the LASSO logistic regression and varSelRF feature selection methods were not considered in this experiment, because their accuracies are lower than that of the GRRF model, as shown in [17]. We can see that the xRF model achieved the highest average accuracy of prediction on nine datasets out of ten. Its result was significantly different on the prostate gene dataset, and its variance was also smaller than those of the other models.
Figure 8 shows the box plots of the (c/s2) error bound of the RF, wsRF, and xRF models on the ten gene datasets from 100 repetitions. The wsRF model obtained a lower error bound on five gene datasets out of 10. The xRF model produced a significantly different error bound on two gene datasets and obtained the lowest error rate on three datasets. This implies that when the optimal parameters, such as mtry = ⌈√M⌉ and n_min = 1, were used in growing trees, the number of genes in the subspace was not small, out-of-bag data was used in prediction, and the results were comparatively favorable to the xRF model.
5.6. Comparison of Prediction Performance for Various Numbers of Features and Trees. Table 6 shows the average (c/s2) error bound and accuracy test results of 10 repetitions of the random forest models on the three large datasets. The xRF model produced the lowest error (c/s2) on the dataset La1s,
The Scientific World Journal 11
60
70
80
Accu
racy
()
60
70
80
Accu
racy
()
70
80
90
RF GRRF wsRF xRFHorseM1000
RF GRRF wsRF xRFHorseM7000
RF GRRF wsRF xRFHorseM15000
RF GRRF wsRF xRFHorseM12000
RF GRRF wsRF xRFHorseM1000
RF GRRF wsRF xRFHorseM5000
RF GRRF wsRF xRFHorseM3000
RF GRRF wsRF xRFHorseM500
RF GRRF wsRF xRFHorseM300
Accu
racy
()
60
70
80
Accu
racy
()
60
70
80
90
Accu
racy
()
60
70
80
90
Accu
racy
()
70
80
90
Accu
racy
()
60
70
80
Accu
racy
()
60
70
80
Accu
racy
()
Figure 5 Box plots of the test accuracy of the nine Horse subdatasets
while the wsRF model showed a lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy compared to the other models; this model also produced a large (c/s2) error when the small subspace size mtry = ⌈log2(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest accuracy of prediction on all three large datasets.
Figure 9 shows the plots of the performance curves of the RF models as the number of trees and features increases. The number of trees was increased stepwise by 20 trees, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees was increased stepwise by 50 trees, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features, each consisting of a random sum of five inputs, varied from 5 to 100, and for each, 200 trees were combined. The vertical line in each plot indicates the size of a feature subspace, mtry = ⌈log2(M) + 1⌉. This subspace was suggested by Breiman [1] for the case of low-dimensional datasets. The three feature selection methods, namely, GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that, when the mtry value is large, the computational time required by the GRRF and varSelRF models to deal with large high-dimensional datasets is too long [17].
Figure 6: Box plots of the AUC measures of the models (RF, GRRF, wsRF, xRF) on the nine Horse subdatasets.
It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace mtry = ⌈log2(M) + 1⌉ was used. However, the xRF model is better than the wsRF model in increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve higher prediction accuracy, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve the best prediction performance. These empirical results indicate that, for applications on high-dimensional data, when the xRF model uses a small subspace, the achieved results can be satisfactory.

However, the RF model using the simple sampling method for feature selection [1] could achieve good prediction performance only if it is provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested
Table 2: The (c/s2) error bound results of the random forest models against the codebook size on the Caltech and Horse datasets. The bold value in each row indicates the best result.

Figure 7: Box plots of the test accuracy of the models on the ten gene datasets.

Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.

Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.

Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group X_s

Table 6: The accuracy of prediction and error bound (c/s2) of the models using a small subspace mtry = ⌈log2(M) + 1⌉; better values are in bold.

Figure 8: Box plots of the (c/s2) error bound for the models applied to the 10 gene datasets.
by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered to be one of the contributions of our work.
6. Conclusions

We have presented a new method of feature subspace selection for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach for unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (or noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,
Figure 9: The prediction accuracy of the three random forest models (RF, wsRF, xRF) against the number of trees (left) and the number of features (right) on the three datasets: (a) Fbis, (b) La2s, and (c) La1s. The vertical line in each right-hand plot marks mtry = ⌈log2(M) + 1⌉.
especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.

For future work, we think it would be desirable to increase the scalability of the proposed random forests algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 102.01-2011.17.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.
[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: Lasso and elastic-net regularized generalized linear models," R package, 2010, http://CRAN.R-project.org/package=glmnet.
We first use random forests to measure the importance of features and produce raw feature importance scores. Then we apply the statistical Wilcoxon rank-sum test to separate informative features from uninformative ones. This is done by neglecting all uninformative features, using a defined threshold θ, for instance, θ = 0.05. Second, we use the Chi-square test statistic (χ²) to compute the score relating each feature to the response feature. We then partition the set of the remaining informative features into two subsets, one containing highly informative features and the other containing weakly informative features. We independently sample features from the two subsets and merge them together to get a new feature subspace, which is used for splitting the data at a node. Since the subspace always contains highly informative features, which can guarantee a better split at a node, this feature sampling method avoids selecting biased features and generates trees from bagged sample data with higher accuracy. This sampling method also reduces dimensionality and the amount of data needed for training the random forests model. Our experimental results have shown that random forests with this weighted feature selection technique outperformed the recently proposed random forests in increasing the prediction accuracy; we also applied the new approach to microarray and image data and achieved outstanding results.
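The two-stage scheme above can be sketched in Python. The snippet below is only an illustration under stated assumptions, not the paper's exact procedure: the p-values are assumed to come from the Wilcoxon rank-sum test on the raw importance scores, `chi2_scores` from the χ² association with the class, and the median cut point and the half-strong share of the subspace are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def partition_features(p_values, chi2_scores, theta=0.05):
    """Drop features whose importance is not significant (p >= theta),
    then split the survivors into a strong set Xs and a weak set Xw by
    their chi-square score, using the median as an illustrative cut."""
    informative = np.where(np.asarray(p_values) < theta)[0]
    scores = np.asarray(chi2_scores)[informative]
    cut = np.median(scores)
    Xs = informative[scores >= cut]   # highly informative features
    Xw = informative[scores < cut]    # weakly informative features
    return Xs, Xw

def sample_subspace(Xs, Xw, mtry, strong_frac=0.5):
    """Draw a subspace of size mtry, taking a fixed share from the
    strong group so every node split sees informative features."""
    n_s = min(len(Xs), max(1, round(mtry * strong_frac)))
    n_w = min(len(Xw), mtry - n_s)
    picked_s = rng.choice(Xs, size=n_s, replace=False)
    picked_w = rng.choice(Xw, size=n_w, replace=False)
    return np.concatenate([picked_s, picked_w])

# 8 features: indices 0-3 significant, 4-7 not.
p  = [0.001, 0.01, 0.02, 0.04, 0.3, 0.5, 0.7, 0.9]
c2 = [9.0,   7.0,  2.0,  1.0,  0.1, 0.1, 0.1, 0.1]
Xs, Xw = partition_features(p, c2)
sub = sample_subspace(Xs, Xw, mtry=3)
print(sorted(Xs.tolist()), sorted(Xw.tolist()), len(sub))
```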
The structure of this paper is organized as follows. In Section 2, we give a brief summary of related works. In Section 3, we give a brief summary of random forests and the measurement of feature importance scores. Section 4 describes our new proposed algorithm using unbiased feature selection. Section 5 provides the experimental results, evaluations, and comparisons. Section 6 gives our conclusions.
2. Related Works
Random forests are an ensemble approach that makes classification decisions by voting the results of individual decision trees. An ensemble learner with excellent generalization accuracy has two properties: high accuracy of each component learner and high diversity among component learners [5]. Unlike other ensemble methods, such as bagging [1] and boosting [6, 7], which create basic classifiers from random samples of the training data, the random forest approach creates the basic classifiers from randomly selected subspaces of the data [8, 9]. The randomly selected subspaces increase the diversity of the basic classifiers learnt by a decision tree algorithm.
Feature importance is the importance measure of features in the feature selection process [1, 10–14]. In RF frameworks, the most commonly used importance score of a given feature is the mean error of a tree in the forest when the observed values of this feature are randomly permuted in the out-of-bag samples. Feature selection is an important step in obtaining good performance for an RF model, especially in dealing with high-dimensional data problems.
Regarding feature weighting techniques, Xu et al. [13] recently proposed an improved RF method that uses a novel feature weighting method for subspace selection and therefore enhances classification performance on high-dimensional data. The weights of the features were calculated by the information gain ratio or the χ²-test. Ye et al. [14] then used these weights to propose a stratified sampling method to select feature subspaces for RF in classification problems. Chen et al. [15] used a stratification idea to propose a new clustering method. However, the implementation of the random forest model suggested by Ye et al. is based on a binary classification setting, and it uses linear discriminant analysis as the splitting criterion. This stratified RF model is not efficient on high-dimensional datasets with multiple classes. In the same way, for solving the two-class problem, Amaratunga et al. [16] presented a feature weighting method for subspace sampling to deal with microarray data; the t-test of variance analysis is used to compute weights for the features. Genuer et al. [12] proposed a strategy involving a ranking of explanatory features using the RF importance score weights and a stepwise ascending feature introduction strategy. Deng and Runger [17] proposed a guided regularized RF (GRRF), in which weights of the importance scores from an ordinary random forest (RF) are used to guide the feature selection process. They found that the least regularized subset selected by their GRRF with minimal regularization ensures better accuracy than the complete feature set. However, a regular RF was used as the classifier, due to the fact that a regularized RF may have higher variance than RF because the trees are correlated.
Several methods have been proposed to correct the bias of importance measures in the feature selection process in RFs to improve the prediction accuracy [18–21]. These methods intend to avoid selecting an uninformative feature for node splitting in decision trees. Although methods of this kind have been well investigated and can be used to address the high-dimensional problem, there are still some unsolved issues, such as the need to specify the probability distributions in advance, as well as the fact that they struggle when applied to large high-dimensional data.
In summary, in the reviewed approaches, the gain at higher levels of the tree is weighted differently from the gain at lower levels of the tree. In fact, at lower levels of the tree, the gain is reduced because of the effect of splits on different features at higher levels of the tree. That affects the final prediction performance of the RF model. To remedy this, in this paper we propose a new method for unbiased feature subset selection in high-dimensional space to build RFs. Our approach differs from previous approaches in the techniques used to partition the set of features. All uninformative features (considered as noise) are removed from the system, and the best feature set, which is highly related to the response feature, is found using a statistical method. The proposed sampling method always provides enough highly informative features for the feature subspace at any level of the decision trees. For the case of growing an RF model on data without noise, we used in-bag measures. This is a different importance score of features, which requires less computational time compared to the measures used by others. Our experimental results showed that our approach outperformed the recently proposed RF methods.
input: L = {(X_i, Y_i)}_{i=1}^{N}, X ∈ R^M, Y ∈ {1, 2, ..., c}: the training dataset;
       K: the number of trees;
       mtry: the size of the subspaces.
output: A random forest RF.
(1) for k ← 1 to K do
(2)   Draw a bagged subset of samples L_k from L.
(4)   while (stopping criteria is not met) do
(5)     Select randomly mtry features.
(6)     for m ← 1 to mtry do
(7)       Compute the decrease in the node impurity.
(8)     Choose the feature which decreases the impurity the most; the node is divided into two children nodes.
(9) Combine the K trees to form a random forest.

Algorithm 1: Random forest algorithm.
3. Background
3.1. Random Forest Algorithm. Given a training dataset L = {(X_i, Y_i)}_{i=1}^{N}, X_i ∈ R^M, Y ∈ {1, 2, ..., c}, where X_i are the features (also called predictor variables), Y is the class response feature, N is the number of training samples, and M is the number of features, and a random forest model RF described in Algorithm 1, let ŷ_k be the prediction of tree T_k given input X. The prediction of the random forest with K trees is

ŷ = majority vote {ŷ_k}_{k=1}^{K}. (1)
Since each tree is grown from a bagged sample set, it is grown with only about two-thirds of the samples in L, called the in-bag samples. About one-third of the samples are left out; these samples are called the out-of-bag (OOB) samples, and they are used to estimate the prediction error.
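The "about two-thirds" figure can be checked with a quick simulation: the probability that a given sample appears in a bagged set of N draws with replacement tends to 1 − 1/e ≈ 0.632. The snippet below is our own illustration of this fact.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000
# One bagged sample: N draws with replacement from {0, ..., N-1}.
bag = rng.integers(0, N, size=N)
in_bag_frac = len(np.unique(bag)) / N
print(round(in_bag_frac, 3))   # close to 1 - 1/e, i.e. roughly 0.632

# The limit follows from P(sample i never drawn) = (1 - 1/N)^N -> e^{-1}.
print(round(1 - (1 - 1 / N) ** N, 3))  # 0.632
```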
The OOB predicted value is Ŷ_i^OOB = (1/|O_{i'}|) Σ_{k ∈ O_{i'}} Ŷ_k, where O_{i'} = L \ O_i, i and i' are in-bag and out-of-bag sample indices, |O_{i'}| is the size of the OOB subdataset, and the OOB prediction error is

Err_OOB = (1/N_OOB) Σ_{i=1}^{N_OOB} E(Y, Ŷ^OOB), (2)

where E(·) is an error function and N_OOB is the OOB sample size.
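Given per-tree predictions and a record of which samples were in each tree's bag, the OOB error of (2) with 0/1 loss can be computed as in the following sketch (the matrix layout and function name are our own assumptions, not from the paper):

```python
import numpy as np

def oob_error(all_preds, inbag, y):
    """all_preds: K x N integer class predictions of each tree on every sample;
    inbag: K x N boolean, True where sample i was in tree k's bag;
    returns the OOB misclassification rate, i.e. eq. (2) with 0/1 loss E."""
    K, N = all_preds.shape
    errors, counted = 0, 0
    for i in range(N):
        oob_trees = np.where(~inbag[:, i])[0]   # trees for which i is OOB
        if oob_trees.size == 0:
            continue                            # sample never OOB: skip it
        votes = np.bincount(all_preds[oob_trees, i])
        yhat = votes.argmax()                   # OOB majority-vote prediction
        errors += int(yhat != y[i])
        counted += 1
    return errors / counted
```

Each sample is predicted only by the trees that did not see it during training, which is what makes the OOB error an almost-free estimate of test error.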
3.2. Measurement of Feature Importance Score from an RF. Breiman presented a permutation technique to measure the importance of features in the prediction [1], called an out-of-bag importance score. The basic idea for measuring this kind of importance score is to compute the difference between the original mean error and the randomly permuted mean error on the OOB samples. The method stochastically rearranges all values of the j-th feature in the OOB samples for each tree, uses the RF model to predict on this permuted data, and obtains the mean error. The aim of this permutation is to eliminate the existing association between the j-th feature and the Y values and then to test the effect of this on the RF model. A feature is considered to be strongly associated if the mean error increases dramatically under permutation.
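A compact sketch of this permutation idea follows; as a simplification, the error shift is measured on a supplied evaluation set with an already-fitted scikit-learn forest, rather than per tree on its own OOB samples as Breiman does. All names are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def permutation_error_shift(model, X, y, seed=0):
    """Permutation importance sketch: for each feature j, permute column j
    (breaking the X_j - Y association) and record the increase in error."""
    rng = np.random.default_rng(seed)
    base_err = 1.0 - model.score(X, y)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])          # destroy the association
        imp[j] = (1.0 - model.score(Xp, y)) - base_err  # error increase
    return imp
```

A large positive shift means the model relied on that feature; a shift near zero means the feature carried no usable signal.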
The other kind of feature importance measure can be obtained while the random forest is growing. This is described as follows. At each node t in a decision tree, the split is determined by the decrease in node impurity ΔR(t). The node impurity R(t) is the Gini index. If a subdataset in node t contains samples from c classes, Gini(t) is defined as

R(t) = 1 − Σ_{j=1}^{c} p_j², (3)

where p_j is the relative frequency of class j in t. Gini(t) is minimized if the classes in t are skewed. After splitting t into two child nodes t_1 and t_2 with sample sizes N_1(t) and N_2(t), the Gini index of the split data is defined as

Gini_split(t) = (N_1(t)/N(t)) Gini(t_1) + (N_2(t)/N(t)) Gini(t_2). (4)
The feature providing the smallest Gini_split(t) is chosen to split the node. The importance score of feature X_j in a single decision tree T_k is

IS_k(X_j) = Σ_{t ∈ T_k} ΔR(t), (5)

and it is computed over all K trees in a random forest, defined as

IS(X_j) = (1/K) Σ_{k=1}^{K} IS_k(X_j). (6)
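Equations (3) and (4) can be checked with a few lines; in this minimal sketch, label arrays stand in for the class labels of the samples reaching a node.

```python
import numpy as np

def gini(labels):
    """Gini index R(t) = 1 - sum_j p_j^2 of eq. (3)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(left, right):
    """Size-weighted Gini of a candidate split, eq. (4)."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return (n1 / n) * gini(left) + (n2 / n) * gini(right)
```

A pure node scores 0, a balanced two-class node scores 0.5, and a split is good when its weighted Gini is far below the parent's, which is exactly the decrease ΔR(t) summed in (5).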
It is worth noting that a random forest uses in-bag samples to produce a kind of importance measure, called an in-bag importance score. This is the main difference between the in-bag importance score and the out-of-bag measure, which is produced from the decrease of the prediction error of the RF on OOB samples. In other words, the in-bag importance score requires less computation time than the out-of-bag measure.
4. Our Approach
4.1. Issues in Feature Selection on High-Dimensional Data. When Breiman et al. suggested the classification and regression tree (CART) model, they noted that feature selection is biased because it is based on an information gain criterion, the so-called multivalue problem [2]. Random forest methods are based on CART trees [1]; hence this bias is carried over to the random forest RF model. In particular, the importance scores can be biased when very high-dimensional data contain multiple data types. Several methods have been proposed to correct the bias of feature importance measures [18–21]. The conditional inference framework (referred to as cRF [22]) could be successfully applied to both the null and power cases [19, 20, 22]. The typical characteristic of the power case is that only one predictor feature is important, while the rest of the features are redundant with different cardinality. In contrast, in the null case, all features used for prediction are redundant with different cardinality. Although methods of this kind have been well investigated and can be used to address the multivalue problem, there are still some unsolved issues, such as the need to specify the probability distributions in advance, as well as the fact that they struggle when applied to high-dimensional data.
Another issue is that, in high-dimensional data, when the number of features is large, the fraction of important features remains small. In this case, the original RF model, which uses simple random sampling, is likely to perform poorly with a small m, and the trees are likely to select an uninformative feature as a split too frequently (m denotes the subspace size of features). At each node t of a tree, the probability of selecting an uninformative feature is too high.
To illustrate this issue, let G be the number of noisy features, denote by M the total number of predictor features, and let the remaining M − G features be important ones, which have a high correlation with the Y values. Then, if we use simple random sampling when growing trees to select a subset of m features (m ≪ M), the total number of possible subsets consisting only of important features is C(M−G, m), and the total number of all feature subsets is C(M, m). The probability of selecting a subset of m (m > 1) important features is given by

C(M−G, m) / C(M, m)
= [(M−G)(M−G−1) ⋯ (M−G−m+1)] / [M(M−1) ⋯ (M−m+1)]
= [(1 − G/M) ⋯ (1 − G/M − m/M + 1/M)] / [(1 − 1/M) ⋯ (1 − m/M + 1/M)]
≃ (1 − G/M)^m. (7)
Because the fraction of important features is small, the probability in (7) tends to 0, which means that the important features are rarely selected by the simple sampling method in RF [1]. For example, with 5 informative and 5000 noisy, uninformative features, assuming m = √(5 + 5000) ≃ 70, the probability that an informative feature is selected at any split is 0.068.
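The 0.068 figure can be reproduced from the exact hypergeometric count, with the closed-form approximation of (7) alongside it (a quick sketch using the example's numbers):

```python
from math import comb

M, G, m = 5005, 5000, 70           # 5 informative features among 5005 total

# probability that a simple-random subspace of size m contains only noise
p_all_noise = comb(G, m) / comb(M, m)
# probability that at least one informative feature enters the subspace
p_informative = 1.0 - p_all_noise
# the (1 - G/M)^m style approximation from eq. (7), applied to the noise set
p_approx = 1.0 - (1.0 - (M - G) / M) ** m

print(round(p_informative, 3), round(p_approx, 3))
```

Both the exact and approximate values come out near 0.068, i.e. over 93% of node splits would see no informative feature at all.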
4.2. Bias Correction for Feature Selection and Feature Weighting. The bias correction in feature selection is intended to make the RF model avoid selecting uninformative features. To correct this kind of bias in the feature selection stage, we generate shadow features to add to the original dataset. The shadow features have the same values, possible cut-points, and distribution as the original features but have no association with the Y values. To create each shadow feature, we rearrange the values of the corresponding feature in the original dataset, once per replicate over R replicates. This disturbance of features eliminates the correlation of the features with the response value but keeps their other attributes. A shadow feature participates only in the competition for the best split and thereby decreases the probability of selecting an uninformative feature. For the feature weight computation, we first need to distinguish the important features from the less important ones. To do so, we run a defined number of random forests to obtain raw importance scores, each of which is obtained using (6). Then we use the Wilcoxon rank-sum test [23], which compares the importance score of a feature with the maximum importance score of the generated noisy features, called shadows. The shadow features are added to the original dataset, and they have no prediction power with respect to the response feature. Therefore, any feature whose importance score is smaller than the maximum importance score of the noisy features is considered less important; otherwise, it is considered important. Having computed the Wilcoxon rank-sum test, we can compute the p-value for each feature. The p-value of a feature X_j in the Wilcoxon rank-sum test is assigned as a weight, p-value ∈ [0, 1], and this weight indicates the importance of the feature in the prediction. The smaller the p-value of a feature, the more correlated the predictor feature is to the response feature, and therefore the more powerful the feature is in prediction. The feature weight computation is described as follows.
Let M be the number of features in the original dataset and denote the feature set as S_X = {X_j, j = 1, 2, ..., M}. In each replicate r (r = 1, 2, ..., R), shadow features are generated from the features X_j in S_X: we randomly permute all values of X_j to get a corresponding shadow feature A_j; denote the shadow feature set as S_A = {A_j}_{1}^{M}. The extended feature set is denoted by S_{XA} = {S_X, S_A}.

Let the importance score of S_{XA} at replicate r be IS^r_{XA} = {IS^r_X, IS^r_A}, where IS^r_{X_j} and IS^r_{A_j} are the importance scores of X_j and A_j at the r-th replicate, respectively. We built a random forest model RF from the S_{XA} dataset to compute 2M importance scores for the 2M features. We repeated the same process R times to compute R replicates, getting IS_{X_j} = {IS^r_{X_j}}_{1}^{R} and IS_{A_j} = {IS^r_{A_j}}_{1}^{R}. From the replicates of the shadow features, we extracted the maximum value from the r-th row of IS_{A_j} and put it into the comparison sample, denoted by IS^max_A. For each data feature X_j, we computed the Wilcoxon test and performed a hypothesis test on IS_{X_j} > IS^max_A to calculate the p-value for the feature. Given a statistical significance level, we can identify important features from less important ones. This test confirms that, if a feature is important, it consistently
scores higher than the shadow over multiple permutations. This method has been presented in [24, 25].
In each node of a tree, each shadow A_j shares approximately the same properties as the corresponding X_j, but it is independent of Y and consequently has approximately the same probability of being selected as a splitting candidate. This feature permutation method can reduce the bias due to different measurement levels of the X_j, according to the p-value, and can yield a correct ranking of features according to their importance.
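The weighting step above can be sketched as follows. This is not the authors' code: scipy's mannwhitneyu is used here as the one-sided Wilcoxon rank-sum test, and the score matrices IS_X and IS_shadow_max are assumed to have been collected from R forest replicates as described.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def feature_pvalues(IS_X, IS_shadow_max):
    """IS_X: R x M raw importance scores of the M real features over R
    replicates; IS_shadow_max: length-R vector of the per-replicate maximum
    importance among the shadow features. Returns one p-value per feature
    (small p-value => the feature consistently beats the shadows)."""
    R, M = IS_X.shape
    pvals = np.empty(M)
    for j in range(M):
        # H1: importance of X_j is stochastically greater than the shadow max
        _, pvals[j] = mannwhitneyu(IS_X[:, j], IS_shadow_max,
                                   alternative="greater")
    return pvals
```

Features whose scores are indistinguishable from the best shadow get p-values near 0.5 and are later filtered out by the threshold θ.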
4.3. Unbiased Feature Weighting for Subspace Selection. Given the p-values for all features, we first set a significance level as the threshold θ, for instance, θ = 0.05. Any feature whose p-value is greater than θ is considered an uninformative feature and is removed from the system; otherwise, its relationship with Y is assessed. We now consider the set of features X obtained from L after neglecting all uninformative features.
Second, we find the best subset of features which is highly related to the response feature; a correlation measure, the χ²(X, Y) statistic, is used to test the association between the categorical response feature and each feature X_j. Each observation is allocated to one cell of a two-dimensional array of cells (called a contingency table) according to the values of (X, Y). If there are r rows and c columns in the table and N is the total number of samples, the value of the test statistic is

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_{ij} − E_{ij})² / E_{ij}. (8)

For the test of independence, a chi-squared probability of less than or equal to 0.05 is commonly interpreted as justification for rejecting the hypothesis that the row variable is independent of the column feature.
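Equation (8) is straightforward to implement; in this sketch, scipy's chi2 distribution supplies the tail probability used for the 0.05 cut (the function names are ours).

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

def chi_square_statistic(table):
    """Pearson chi-square statistic of an r x c contingency table, eq. (8)."""
    O = np.asarray(table, dtype=float)
    row = O.sum(axis=1, keepdims=True)      # row totals
    col = O.sum(axis=0, keepdims=True)      # column totals
    E = row @ col / O.sum()                 # expected counts E_ij under H0
    return float(((O - E) ** 2 / E).sum())

def chi_square_pvalue(table):
    """p-value of the independence hypothesis for the table."""
    O = np.asarray(table, dtype=float)
    dof = (O.shape[0] - 1) * (O.shape[1] - 1)
    return float(chi2_dist.sf(chi_square_statistic(O), dof))
```

A feature whose table against Y yields p ≤ 0.05 would fall into the strong group described next; otherwise it goes to the weak group.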
Let X_s be the best subset of features: we collect all features X_j whose p-value from the χ² statistical test according to (8) is smaller than or equal to 0.05. The remaining features X \ X_s are added to X_w; this approach is described in Algorithm 2. We independently sample features from the two subsets and put them together as the subspace features for splitting the data at any node, recursively. The two subsets partition the set of informative features in the data without irrelevant features. Given X_s and X_w, at each node we randomly select mtry (mtry > 1) features from each group of features. For a given subspace size, we can choose proportions between highly informative features and weakly informative features that depend on the sizes of the two groups; that is, mtry_s = ⌈mtry × (|X_s|/|X|)⌉ and mtry_w = ⌊mtry × (|X_w|/|X|)⌋, where |X_s| and |X_w| are the numbers of features in the groups of highly informative features X_s and weakly informative features X_w, respectively, and |X| is the number of informative features in the input dataset. These are merged to form the feature subspace for splitting the node.
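The stratified subspace draw just described can be sketched as follows (the names sample_subspace and n_inf are ours; the groups are represented as lists of feature indices):

```python
import math
import random

def sample_subspace(X_s, X_w, mtry, rng=random):
    """Draw a node subspace: mtry_s indices from the strong group X_s and
    mtry_w from the weak group X_w, in proportion to the group sizes."""
    n_inf = len(X_s) + len(X_w)                     # |X|: informative features
    mtry_s = math.ceil(mtry * len(X_s) / n_inf)     # ceil for the strong part
    mtry_w = math.floor(mtry * len(X_w) / n_inf)    # floor for the weak part
    return (rng.sample(list(X_s), min(mtry_s, len(X_s))) +
            rng.sample(list(X_w), min(mtry_w, len(X_w))))
```

For example, with |X_s| = 10, |X_w| = 30, and mtry = 8, every node subspace contains 2 strong and 6 weak candidates, so at least one strongly informative feature is always in play.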
4.4. Our Proposed RF Algorithm. In this section, we present our new random forest algorithm, called xRF, which uses the new unbiased feature sampling method to generate splits at the nodes of CART trees [2]. The proposed algorithm includes the following main steps: (i) weighting the features using the feature permutation method, (ii) identifying all unbiased features and partitioning them into two groups X_s and X_w, (iii) building the RF using subspaces containing features which are taken randomly and separately from X_s and X_w, and (iv) classifying new data. The new algorithm is summarized as follows.

(1) Generate the extended dataset S_XA of 2M dimensions by permuting the corresponding predictor feature values for shadow features.
(2) Build a random forest model RF from S_XA, Y and compute R replicates of raw importance scores of all predictor features and shadows with RF. Extract the maximum importance score of each replicate to form the comparison sample IS^max_A of R elements.
(3) For each predictor feature, take the R importance scores and compute the Wilcoxon test to get the p-value, that is, the weight of the feature.
(4) Given a significance level threshold θ, neglect all uninformative features.
(5) Partition the remaining features into two subsets X_s and X_w, as described in Algorithm 2.
(6) Sample the training set L with replacement to generate bagged samples L_1, L_2, ..., L_K.
(7) For each L_k, grow a CART tree T_k as follows:
  (a) At each node, select a subspace of mtry (mtry > 1) features randomly and separately from X_s and X_w, and use the subspace features as candidates for splitting the node.
  (b) Each tree is grown nondeterministically, without pruning, until the minimum node size n_min is reached.
(8) Given X = x_new, use (1) to predict the response value.
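Steps (1)–(7) can be compressed into a short sketch. We stress the simplifications, which are ours and not the paper's: scikit-learn's Gini importances stand in for (6), the χ² partition of step (5) is skipped, and the final forest is grown on the retained features as a whole rather than with per-node stratified sampling.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.ensemble import RandomForestClassifier

def xrf_sketch(X, y, R=10, theta=0.05, K=50, seed=0):
    """Rough xRF pipeline sketch: shadow-based weighting, then a forest
    grown only on the features that beat the shadows."""
    rng = np.random.default_rng(seed)
    n, M = X.shape
    IS_X = np.empty((R, M))
    IS_Amax = np.empty(R)
    for r in range(R):                        # steps (1)-(2): R replicates
        shadows = rng.permuted(X, axis=0)     # column-wise permuted copies
        rf = RandomForestClassifier(n_estimators=50, random_state=r)
        rf.fit(np.hstack([X, shadows]), y)
        imp = rf.feature_importances_
        IS_X[r], IS_Amax[r] = imp[:M], imp[M:].max()
    # step (3): one-sided rank-sum test of each feature against the shadows
    pvals = np.array([mannwhitneyu(IS_X[:, j], IS_Amax,
                                   alternative="greater")[1]
                      for j in range(M)])
    keep = np.where(pvals <= theta)[0]        # step (4): drop uninformative
    # steps (6)-(7), simplified: grow the final forest on kept features only
    final = RandomForestClassifier(n_estimators=K, random_state=seed)
    final.fit(X[:, keep], y)
    return keep, final
```

Even this crude version illustrates the core effect: with many noisy columns, the shadow comparison discards most of them before any tree is grown.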
5. Experiments
5.1. Datasets. Real-world datasets, including image datasets and microarray datasets, were used in our experiments. Image classification and object recognition are important problems in computer vision. We conducted experiments on four benchmark image datasets: the Caltech categories dataset (http://www.vision.caltech.edu/html-files/archive.html), the Horse dataset (http://pascal.inrialpes.fr/data/horses), the extended YaleB database [26], and the AT&T ORL dataset [27].
For the Caltech dataset, we used a subset of 100 images from the Caltech face dataset and 100 images from the Caltech background dataset, following the setting in the ICCV short course (http://people.csail.mit.edu/torralba/shortCourseRLOC). The extended YaleB database consists of 2414 face images of 38 individuals captured under various lighting conditions. Each image has been cropped to a size of 192 × 168 pixels
(Algorithm 2) input: the training dataset L and a random forest RF; R, θ: the number of replicates and the threshold.
and normalized. The Horse dataset consists of 170 images containing horses for the positive class and 170 images of the background for the negative class. The AT&T ORL dataset includes 400 face images of 40 persons.
In the experiments, we use a bag-of-words representation of image features for the Caltech and the Horse datasets. To obtain feature vectors using the bag-of-words method, image patches (subwindows) are sampled from the training images at the detected interest points or on a dense grid. A visual descriptor is then applied to these patches to extract the local visual features. A clustering technique is then used to cluster these, and the cluster centers are used as visual code words to form a visual codebook. An image is then represented as a histogram of these visual words. A classifier is then learned from this feature set for classification.
In our experiments, traditional k-means quantization is used to produce the visual codebook. The number of cluster centers can be adjusted to produce different vocabularies, that is, different dimensions of the feature vectors. For the Caltech and Horse datasets, nine codebook sizes were used in the experiments to create 18 datasets, as follows: CaltechM300, CaltechM500, CaltechM1000, CaltechM3000, CaltechM5000, CaltechM7000, CaltechM10000, CaltechM12000, CaltechM15000, and HorseM300, HorseM500, HorseM1000, HorseM3000, HorseM5000, HorseM7000, HorseM10000, HorseM12000, HorseM15000, where M denotes the codebook size.
For the face datasets, we use two types of features: eigenfaces [28] and random features (randomly sampled pixels from the images). We used four groups of datasets with four different numbers of dimensions: M30, M56, M120, and M504. In total, we created 16 subdatasets.
Table 1: Description of the real-world datasets, sorted by the number of features and grouped into two groups, microarray data and real-world datasets, accordingly.
The properties of the remaining datasets are summarized in Table 1. The Fbis dataset was compiled from the archive of the Foreign Broadcast Information Service, and the La1s and La2s
datasets were taken from the archive of the Los Angeles Times for TREC-5 (http://trec.nist.gov). The ten gene datasets used are described in [11, 17]; they are high dimensional and fall within a category of classification problems which deal with a large number of features and a small number of samples. Regarding the characteristics of the datasets given in Table 1, the subdatasets Fbis, La1s, and La2s were each split individually into a training and a testing dataset.
5.2. Evaluation Methods. We calculated measures such as the error bound (c/s²), strength (s), and correlation (ρ), according to the formulas given in Breiman's method [1]. The correlation measure indicates the independence of trees in a forest, whereas the average strength corresponds to the accuracy of individual trees. Lower correlation and higher strength result in a reduction of the general error bound measured by c/s², which indicates a highly accurate RF model.
Two measures are also used to evaluate the accuracy of prediction on the test datasets: one is the area under the curve (AUC), and the other is the test accuracy (Acc), defined as

Acc = (1/N) Σ_{i=1}^{N} I(Q(d_i, y_i) − max_{j ≠ y_i} Q(d_i, j) > 0), (9)

where I(·) is the indicator function and Q(d_i, j) = Σ_{k=1}^{K} I(h_k(d_i) = j) is the number of votes for d_i ∈ D_t on class j, h_k is the k-th tree classifier, N is the number of samples in the test data D_t, and y_i indicates the true class of d_i.
5.3. Experimental Settings. The latest R-packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model was available in the RRF R-package. The wsRF model, which uses the weighted sampling method [13], was intended to solve classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. From each fold, we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold dataset. The mtry and n_min parameters were set to √M and 1, respectively. The experimental results were evaluated with two measures, AUC and the test accuracy, according to (9).
We compared across a wide range the performances on the 10 gene datasets used in [11]. The results from the application of GRRF, varSelRF, and LASSO logistic regression on the ten gene datasets are presented in [17]. These three gene selection methods used the RF R-package [30] as the classifier. For the comparison of the methods, we used the same settings as presented in [17]; for the coefficient γ we used a value of 0.1, because GRRF(0.1) has shown competitive accuracy [17] when applied to the 10 gene datasets. One hundred models were generated with different seeds from each training dataset, and each model contained 1000 trees. The mtry and n_min parameters had the same settings as on the image datasets. From each of the datasets, two-thirds of the data were randomly selected for training. The other one-third of the dataset was used to validate the models. For
comparison, Breiman's RF method, the weighted sampling random forest (wsRF) model, and the xRF model were used in the experiments. The guided regularized random forest (GRRF) [17] and two well-known feature selection methods using RF as a classifier, namely, varSelRF [31] and LASSO logistic regression [32], were also used to evaluate the accuracy of prediction on high-dimensional datasets.
On the remaining datasets, the prediction performances of the ten random forest models were evaluated, each one built with 500 trees. The number of feature candidates to split a node was mtry = ⌈log₂(M) + 1⌉. The minimal node size n_min was 1. The xRF model with the new unbiased feature sampling method is a new implementation. We implemented the xRF model as multithread processes, while the other models were run as single-thread processes. We used R to call the corresponding C/C++ functions. All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel Xeon CPU E5620 at 2.40 GHz, 16 cores, 4MB cache, and 32GB main memory.
5.4. Results on Image Datasets. Figures 1 and 2 show the average accuracy plots of the recognition rates of the models on different subdatasets of the YaleB and ORL datasets. The GRRF model produced slightly better results on the subdataset ORLRandomM120 and on the ORL dataset using eigenfaces, and showed competitive accuracy with the xRF model in some cases on both the YaleB and ORL datasets, for example, YaleBEigenM120, ORLRandomM56, and ORLRandomM120. The reason could be that the truly informative features in this kind of dataset are many. Therefore, when the informative feature set was large, the chance of selecting informative features in the subspace increased, which in turn increased the average recognition rates of the GRRF model. However, the xRF model produced the best results in the remaining cases. The effect of the new approach for feature subspace selection is clearly demonstrated in these results, although these datasets are not high dimensional.
Figures 3 and 5 present the box plots of the test accuracy (mean ± std-dev), and Figures 4 and 6 show the box plots of the AUC measures of the models on the 18 image subdatasets of Caltech and Horse, respectively. From these figures, we can observe that the accuracy and the AUC measures of the models GRRF, wsRF, and xRF increased on all high-dimensional subdatasets when the selected subspace mtry was not too large. This implies that, when the number of features in the subspace is small, the proportion of informative features in the feature subspace is comparatively large in the three models. There is then a high chance that highly informative features are selected in the trees, so the overall performance of individual trees is increased. In Breiman's method, many randomly selected subspaces may not contain informative features, which affects the performance of the trees grown from these subspaces. It can be seen that the xRF model outperformed the other random forest models on these subdatasets in increasing the test accuracy and the AUC measures. This was because the new unbiased feature sampling was used in generating trees in the xRF model; the feature subspace provided enough highly informative
Figure 1: Recognition rates of the models (RF, GRRF, wsRF, and xRF) on the YaleB subdatasets, namely, YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504, and YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, and YaleBRandomfaceM504. (a) YaleB + eigenface; (b) YaleB + randomface.

Figure 2: Recognition rates of the models (RF, GRRF, wsRF, and xRF) on the ORL subdatasets, namely, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504, and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, and ORLRandomM504. (a) ORL + eigenface; (b) ORL + randomface.
features at all levels of the decision trees. The effect of the unbiased feature selection method is clearly demonstrated in these results.
Table 2 shows the results of c/s² against the number of codebook sizes on the Caltech and Horse datasets. In a random forest, each tree is grown from a bagged training set. Out-of-bag estimates were used to evaluate the strength, correlation, and c/s². The GRRF model was not considered in this experiment, because this method aims to find a small subset of features and the same RF model in the R-package [30] is used as a classifier. We compared the xRF model with two kinds of random forest models, RF and wsRF. From this table, we can observe that the lowest c/s² values occurred when the wsRF model was applied to the Caltech dataset. However, the xRF model produced the lowest error bound on the Horse dataset. These results demonstrate that the new unbiased feature sampling method can reduce the upper bound of the generalization error in random forests.
Table 3 presents the prediction accuracies (mean ± std-dev) of the models on the subdatasets CaltechM3000, HorseM3000, YaleBEigenfaceM504, YaleBRandomfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. In these experiments, we used the four models to generate random forests of different sizes, from 20 trees to 200 trees. For each size, we used each model to generate 10 random forests for the 10-fold cross-validation and computed the average accuracy of the 10 results. The GRRF model showed slightly better results on YaleBEigenfaceM504 with
Figure 3: Box plots of the test accuracy of the nine Caltech subdatasets (RF, GRRF, wsRF, and xRF on CaltechM300 through CaltechM15000).
different tree sizes. The wsRF model produced the best prediction performance in some cases when applied to the small subdatasets YaleBEigenfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. However, the xRF model produced, respectively, the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forest models in classification accuracy in most cases across all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy measures changed little when the number of trees was varied.
5.5. Results on Microarray Datasets. Table 5 shows the average test results, in terms of accuracy, of the 100 random forest models computed according to (9) on the gene datasets. The average number of genes selected by the xRF model from 100 repetitions for each dataset is shown on the right of Table 5, divided into two groups, X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on the two gene datasets srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most cases on the remaining gene datasets.
Figure 4: Box plots of the AUC measures of the nine Caltech subdatasets (RF, GRRF, wsRF, and xRF on CaltechM300 through CaltechM15000).
The detailed results, containing the median and the variance values, are presented in Figure 7 with box plots. Only the GRRF model was used for this comparison; the LASSO logistic regression and the varSelRF feature selection method were not considered in this experiment, because their accuracies are lower than that of the GRRF model, as shown in [17]. We can see that the xRF model achieved the highest average accuracy of prediction on nine out of ten datasets. Its result was significantly different on the prostate gene dataset, and its variance was also smaller than those of the other models.
Figure 8 shows the box plots of the c/s² error bound of the RF, wsRF, and xRF models on the ten gene datasets from 100 repetitions. The wsRF model obtained a lower error bound on five of the 10 gene datasets. The xRF model produced a significantly different error bound on two gene datasets and obtained the lowest error rate on three datasets. This implies that, when the optimal parameters, such as mtry = ⌈√M⌉ and n_min = 1, were used in growing trees, the number of genes in the subspace was not small and out-of-bag data were used in prediction, and the results were comparatively favorable to the xRF model.
5.6. Comparison of Prediction Performance for Various Numbers of Features and Trees. Table 6 shows the average c/s² error bound and accuracy test results of 10 repetitions of the random forest models on the three large datasets. The xRF model produced the lowest error c/s² on the dataset La1s,
Figure 5: Box plots of the test accuracy of the nine Horse subdatasets (RF, GRRF, wsRF, and xRF on HorseM300 through HorseM15000).
while the wsRF model showed a lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy compared to the other models; this model also produced a large c/s² error when the small subspace size mtry = ⌈log₂(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest accuracy of prediction on all three large datasets.
Figure 9 shows the plots of the performance curves of the RF models as the number of trees and features increases. The number of trees was increased stepwise by 20 trees, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees increased stepwise by 50 trees, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features, each consisting of a random sum of five inputs, varied from 5 to 100, and for each, 200 trees were combined. The vertical line in each plot indicates the size of the subspace of features mtry = ⌈log₂(M) + 1⌉. This subspace was suggested by Breiman [1] for the case of low-dimensional datasets. The three feature selection methods, namely, GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that, when the mtry value is large, the computational time required by the GRRF and varSelRF models to deal with large high-dimensional datasets is too long [17].
Figure 6: Box plots of the AUC measures of the nine Horse subdatasets (HorseM300 to HorseM15000), comparing the RF, GRRF, wsRF, and xRF models.
It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace mtry = ⌈log₂(M) + 1⌉ was used. However, the xRF model was better than the wsRF model in increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve higher prediction accuracy, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve its best prediction performance. These empirical results indicate that, for applications on high-dimensional data, satisfactory results can be achieved even when the xRF model uses a small subspace.

However, the RF model, which uses the simple random sampling method for feature selection [1], could achieve good prediction performance only if it is provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested
Table 2: The c/s² error bound results of random forest models against the codebook size on the Caltech and Horse datasets. The bold value in each row indicates the best result.
Figure 7: Box plots of the test accuracy of the models on the ten gene datasets.
Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.
Table 4: AUC results (mean ± std-dev) of random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.
Table 5: Test accuracy results (%) of random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group X_s
Table 6: The prediction accuracy and c/s² error bound of the models using a small subspace mtry = ⌈log₂(M) + 1⌉; better values are in bold.
Dataset | c/s² error bound | Test accuracy (%) | X_s
Figure 8: Box plots of the c/s² error bound for the models applied to the ten gene datasets.
by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered one of the contributions of our work.
6 Conclusions
We have presented a new feature subspace selection method for building efficient random forest xRF models for the classification of high-dimensional data. Our main contribution is a new approach for unbiased feature sampling, which selects the set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,
Figure 9: The prediction accuracy of the three random forest models (RF, wsRF, xRF) against the number of trees (left) and the number of features (right) on (a) Fbis, (b) La2s, and (c) La1s. The vertical line in each right panel marks the subspace size log(M) + 1.
especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.

For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on cloud platforms to deal with big data, that is, hundreds of millions of samples and features.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 10201-201117.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
[5] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.
[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: lasso and elastic-net regularized generalized linear models," R package version 1.1, 2010, http://CRAN.R-project.org/package=glmnet.
input: L = {(X_i, Y_i)}_{i=1}^N | X ∈ R^M, Y ∈ {1, 2, ..., c}: the training dataset; K: the number of trees; mtry: the size of the subspaces.
output: A random forest RF.
(1) for k ← 1 to K do
(2)   Draw a bagged subset of samples L_k from L.
(3)   Grow a tree T_k from L_k:
(4)   while (the stopping criteria are not met) do
(5)     Randomly select mtry features.
(6)     for m ← 1 to mtry do
(7)       Compute the decrease in the node impurity.
(8)     Choose the feature which decreases the impurity the most; the node is divided into two children nodes.
(9) Combine the K trees to form a random forest.

Algorithm 1: Random forest algorithm.
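A minimal Python sketch may make the control flow of Algorithm 1 concrete. It is illustrative only, not the authors' implementation: for brevity, each "tree" is a single Gini-based split (a decision stump), and all function names are hypothetical.

```python
# Minimal sketch of Algorithm 1: bagging plus random subspace selection at a
# split. Each "tree" here is a single decision stump chosen by Gini decrease.
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_stump(X, y, mtry, rng):
    """Pick the best (feature, threshold) among mtry randomly chosen features."""
    n, M = len(X), len(X[0])
    best = None
    for j in rng.sample(range(M), mtry):            # step (5): random subspace
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            # step (7): decrease in node impurity for this candidate split
            score = gini(y) - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score > best[0]:
                best = (score, j, t)
    return best[1], best[2]

def fit_forest(X, y, K=25, mtry=1, seed=0):
    rng = random.Random(seed)
    forest, n = [], len(X)
    for _ in range(K):                              # step (1)
        idx = [rng.randrange(n) for _ in range(n)]  # step (2): bagged subset
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        j, t = best_stump(Xb, yb, mtry, rng)        # steps (4)-(8), one level deep
        left = Counter(yb[i] for i in range(n) if Xb[i][j] <= t).most_common(1)[0][0]
        right = Counter(yb[i] for i in range(n) if Xb[i][j] > t).most_common(1)[0][0]
        forest.append((j, t, left, right))
    return forest                                   # step (9)

def predict(forest, x):
    votes = Counter(l if x[j] <= t else r for j, t, l, r in forest)
    return votes.most_common(1)[0][0]               # majority vote, cf. (1)
```

On a one-dimensional, linearly separable toy set, every stump recovers the class boundary, so the majority vote is unanimous.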
3 Background
3.1. Random Forest Algorithm. Given a training dataset L = {(X_i, Y_i)}_{i=1}^N | X_i ∈ R^M, Y ∈ {1, 2, ..., c}, where the X_i are features (also called predictor variables), Y is a class response feature, N is the number of training samples, and M is the number of features, and a random forest model RF described in Algorithm 1, let ŷ_k be the prediction of tree T_k given input X. The prediction of the random forest with K trees is

ŷ = majority vote {ŷ_k}_{1}^{K}.  (1)
Since each tree is grown from a bagged sample set, it is grown with only about two-thirds of the samples in L, called in-bag samples. About one-third of the samples are left out; these samples are called out-of-bag (OOB) samples, and they are used to estimate the prediction error.
The OOB predicted value is ŷ_i^OOB = (1/|O_i'|) Σ_{k∈O_i'} ŷ_k, where O_i' = L \ O_i; i and i' are in-bag and out-of-bag sample indices, and |O_i'| is the size of the OOB subdataset. The OOB prediction error is

Err_OOB = (1/N_OOB) Σ_{i=1}^{N_OOB} E(Y, ŷ^OOB),  (2)

where E(·) is an error function and N_OOB is the OOB sample size.
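The "two-thirds" figure can be checked numerically: a bootstrap sample of size N contains on average 1 − (1 − 1/N)^N → 1 − 1/e ≈ 63.2% of the distinct samples. A small simulation (hypothetical helper, Python):

```python
# With bootstrap sampling, each tree sees roughly two-thirds of the samples
# (in-bag); the rest are out-of-bag (OOB). The expected unique in-bag
# fraction is 1 - (1 - 1/N)^N, which tends to 1 - 1/e (about 0.632).
import random

def in_bag_fraction(N, trials=200, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        bag = {rng.randrange(N) for _ in range(N)}  # indices drawn with replacement
        total += len(bag) / N                       # fraction of unique in-bag samples
    return total / trials

print(round(in_bag_fraction(1000), 2))  # close to 0.63
```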
3.2. Measurement of Feature Importance Score from an RF. Breiman presented a permutation technique to measure the importance of features in the prediction [1], called the out-of-bag importance score. The basic idea for measuring this kind of importance score is to compute the difference between the original mean error and the randomly permuted mean error on the OOB samples. The method stochastically rearranges all values of the j-th feature in the OOB samples for each tree and uses the RF model to predict on this permuted data to get the mean error. The aim of this permutation is to eliminate the existing association between the j-th feature and the Y values, and then to test the effect of this on the RF model. A feature is considered to be in a strong association if the mean error increases dramatically after permutation.
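The permutation importance described above can be sketched as follows; the `model` object with a `.predict(rows)` method and the toy stump are assumptions for illustration:

```python
# Sketch of Breiman's permutation importance: permute one feature's values
# and measure how much the classification error increases.
import random

def permutation_importance(model, X, y, j, seed=0):
    rng = random.Random(seed)
    err = lambda rows: sum(p != t for p, t in zip(model.predict(rows), y)) / len(y)
    base = err(X)                                   # original mean error
    col = [row[j] for row in X]
    rng.shuffle(col)                                # break the X_j -- Y association
    Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
    return err(Xp) - base                           # large increase => important feature

class Stump:                                        # toy model: thresholds feature 0
    def predict(self, rows):
        return [1 if r[0] > 0.5 else 0 for r in rows]

X = [[i / 9.0, 0.0] for i in range(10)]             # feature 0 informative, feature 1 constant
y = [0] * 5 + [1] * 5
print(permutation_importance(Stump(), X, y, 0) > 0)     # True: error rises when permuted
print(permutation_importance(Stump(), X, y, 1) == 0.0)  # True: constant feature, no effect
```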
The other kind of feature importance measure can be obtained while the random forest is growing. This is described as follows. At each node t in a decision tree, the split is determined by the decrease in node impurity ΔR(t). The node impurity R(t) is the gini index. If a subdataset in node t contains samples from c classes, gini(t) is defined as

R(t) = 1 − Σ_{j=1}^{c} p_j²,  (3)

where p_j is the relative frequency of class j in t. Gini(t) is minimized if the classes in t are skewed. After splitting t into two child nodes t_1 and t_2 with sample sizes N_1(t) and N_2(t), the gini index of the split data is defined as

Gini_split(t) = (N_1(t)/N(t)) Gini(t_1) + (N_2(t)/N(t)) Gini(t_2).  (4)

The feature providing the smallest Gini_split(t) is chosen to split the node. The importance score of feature X_j in a single decision tree T_k is

IS_k(X_j) = Σ_{t∈T_k} ΔR(t),  (5)

and it is computed over all K trees in a random forest, defined as

IS(X_j) = (1/K) Σ_{k=1}^{K} IS_k(X_j).  (6)
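Equations (3) and (4) can be verified on a toy node; the class labels below are illustrative:

```python
# Worked example of equations (3) and (4): Gini impurity of a node and the
# size-weighted impurity of a candidate split.
from collections import Counter

def gini(labels):                        # R(t) = 1 - sum_j p_j^2, eq. (3)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):             # eq. (4): weighted child impurity
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

node = [0, 0, 0, 1, 1, 1]
print(gini(node))                        # 0.5 for a balanced binary node
print(gini_split([0, 0, 0], [1, 1, 1]))  # 0.0: a pure split
print(gini_split([0, 0, 1], [0, 1, 1]))  # ~0.444: a poor split
```

The pure split attains the minimum Gini_split and would therefore be chosen at this node.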
It is worth noting that a random forest can also use in-bag samples to produce a kind of importance measure called the in-bag importance score. This is the main difference between the in-bag importance score and the out-of-bag measure, which is produced from the decrease of the RF prediction error on OOB samples. In other words, the in-bag importance score requires less computation time than the out-of-bag measure.
4 Our Approach
4.1. Issues in Feature Selection on High-Dimensional Data. When Breiman et al. suggested the classification and regression tree (CART) model, they noted that feature selection is biased because it is based on an information gain criterion, the so-called multivalue problem [2]. Random forest methods are based on CART trees [1]; hence, this bias is carried over to the random forest RF model. In particular, the importance scores can be biased when very high-dimensional data contain multiple data types. Several methods have been proposed to correct the bias of feature importance measures [18–21]. The conditional inference framework (referred to as cRF [22]) has been successfully applied to both the null and the power case [19, 20, 22]. The typical characteristic of the power case is that only one predictor feature is important, while the rest of the features are redundant, with different cardinality. In contrast, in the null case all features used for prediction are redundant, with different cardinality. Although methods of this kind are well investigated and can be used to address the multivalue problem, there are still some unsolved problems, such as the need to specify the probability distributions in advance, and the fact that they struggle when applied to high-dimensional data.
Another issue is that, in high-dimensional data, when the number of features is large, the fraction of important features remains small. In this case, the original RF model, which uses simple random sampling, is likely to perform poorly with a small subspace size m, and the trees are likely to select an uninformative feature for a split too frequently (m denotes the subspace size). At each node t of a tree, the probability of selecting an uninformative feature is too high.
To illustrate this issue, let G be the number of noisy features, denote by M the total number of predictor features, and let the M − G remaining features be important ones, which have a high correlation with the Y values. If we use simple random sampling when growing trees to select a subset of m features (m ≪ M), the total number of possible uninformative subsets is C(M−G, m) and the total number of all feature subsets is C(M, m). The probability of selecting a subset of m (m > 1) important features is given by

C(M−G, m) / C(M, m) = [(M−G)(M−G−1) ⋯ (M−G−m+1)] / [M(M−1) ⋯ (M−m+1)] ≃ (1 − G/M)^m.  (7)

Because the fraction of important features is small, the probability in (7) tends to 0, which means that the important features are rarely selected by the simple sampling method in RF [1]. For example, with 5 informative and 5000 noisy or uninformative features, assuming m = ⌈√(5 + 5000)⌉ ≃ 70, the probability that an informative feature is selected at any split is 0.068.
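The closing figure of 0.068 can be reproduced exactly with binomial coefficients: it is the complement of the probability that a random subspace of m = 70 features contains only noise.

```python
# Equation (7)'s worked example, computed exactly with binomial coefficients:
# M = 5005 features, G = 5000 of them noise, subspace size m = 70.
from math import comb

M, G, m = 5005, 5000, 70
p_all_noise = comb(G, m) / comb(M, m)   # subspace misses every informative feature
p_hit = 1 - p_all_noise                 # at least one informative feature selected
print(round(p_hit, 3))                  # 0.068, matching the text
```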
4.2. Bias Correction for Feature Selection and Feature Weighting. The bias correction in feature selection is intended to make the RF model avoid selecting uninformative features. To correct this kind of bias in the feature selection stage, we generate shadow features to add to the original dataset. The shadow features have the same values, possible cut-points, and distribution as the original features, but no association with the Y values. To create each shadow feature, we rearrange the values of the corresponding feature in the original dataset R times. This disturbance eliminates the correlation of the feature with the response value but keeps its other attributes. A shadow feature participates only in the competition for the best split, and so decreases the probability of selecting this kind of uninformative feature. For the feature weight computation, we first need to distinguish the important features from the less important ones. To do so, we run a defined number of random forests to obtain raw importance scores, each of which is obtained using (6). Then we use the Wilcoxon rank-sum test [23], which compares the importance score of a feature with the maximum importance score of the generated noisy features, called shadows. The shadow features are added to the original dataset, and they have no prediction power for the response feature. Therefore, any feature whose importance score is smaller than the maximum importance score of the noisy features is considered less important; otherwise, it is considered important. Having computed the Wilcoxon rank-sum test, we can compute the p-value for the feature. The p-value of a feature in the Wilcoxon rank-sum test is assigned as a weight to the feature X_j, with p-value ∈ [0, 1], and this weight indicates the importance of the feature in the prediction. The smaller the p-value of a feature, the more correlated the predictor feature is with the response feature, and therefore the more powerful the feature is in prediction. The feature weight computation is described as follows.
Let M be the number of features in the original dataset and denote the feature set as S_X = {X_j, j = 1, 2, ..., M}. In each replicate r (r = 1, 2, ..., R), shadow features are generated from the features X_j in S_X: we randomly permute all values of X_j to get a corresponding shadow feature A_j; denote the shadow feature set as S_A = {A_j}_{1}^{M}. The extended feature set is denoted by S_XA = {S_X, S_A}.

Let the importance score of S_XA at replicate r be IS^r_XA = {IS^r_X, IS^r_A}, where IS^r_{X_j} and IS^r_{A_j} are the importance scores of X_j and A_j at the r-th replicate, respectively. We built a random forest model RF from the S_XA dataset to compute 2M importance scores for the 2M features. We repeated the same process R times to compute R replicates, getting IS_{X_j} = {IS^r_{X_j}}_{1}^{R} and IS_{A_j} = {IS^r_{A_j}}_{1}^{R}. From the replicates of the shadow features, we extracted the maximum value of the r-th row of IS_{A_j} and put it into the comparison sample, denoted by IS^max_A. For each data feature X_j, we computed the Wilcoxon test and performed a hypothesis test on IS_{X_j} > IS^max_A to calculate the p-value for the feature. Given a statistical significance level, we can identify important features from less important ones. This test confirms that if a feature is important, it consistently scores higher than the shadow over multiple permutations. This method has been presented in [24, 25].
In each node of the trees, each shadow A_j shares approximately the same properties as the corresponding X_j, but it is independent of Y and consequently has approximately the same probability of being selected as a splitting candidate. This feature permutation method can reduce the bias due to different measurement levels of X_j, according to the p-value, and can yield a correct ranking of features according to their importance.
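The shadow construction itself is a column-wise permutation; a small sketch follows (the Wilcoxon rank-sum test over R replicates is omitted here, and the importance scores would come from a fitted RF as described above):

```python
# The shadow construction is a column-wise permutation of the original data:
# each A_j keeps the values and marginal distribution of X_j but loses any
# association with Y.
import random

def add_shadows(X, seed=0):
    """Extend rows from M to 2M columns: [X | A], A_j a permuted copy of X_j."""
    rng = random.Random(seed)
    M = len(X[0])
    shadows = []
    for j in range(M):
        col = [row[j] for row in X]
        rng.shuffle(col)                    # destroy the association with Y
        shadows.append(col)
    return [row + [shadows[j][i] for j in range(M)] for i, row in enumerate(X)]

X = [[1, 10], [2, 20], [3, 30]]
XA = add_shadows(X)
# Each shadow column is a permutation of its source column:
print(sorted(row[2] for row in XA))         # [1, 2, 3]
print(sorted(row[3] for row in XA))         # [10, 20, 30]
```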
4.3. Unbiased Feature Weighting for Subspace Selection. Given the p-values of all features, we first set a significance level as the threshold θ, for instance θ = 0.05. Any feature whose p-value is greater than θ is considered uninformative and is removed from the system; otherwise, its relationship with Y is assessed. We now consider the set of features X obtained from L after removing all uninformative features.
Second, we find the best subset of features which is highly related to the response feature; a correlation measure χ²(X, Y) is used to test the association between the categorical response feature and each feature X_j. Each observation is allocated to one cell of a two-dimensional array of cells (called a contingency table) according to the values of (X, Y). If there are r rows and c columns in the table and N is the number of total samples, the value of the test statistic is

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_ij − E_ij)² / E_ij.  (8)

For the test of independence, a chi-squared probability of less than or equal to 0.05 is commonly interpreted as grounds for rejecting the hypothesis that the row variable is independent of the column feature.
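Equation (8) on a toy contingency table (the counts are illustrative; expected counts E_ij follow from the row and column marginals under independence):

```python
# Equation (8) on a toy 2x2 contingency table: observed counts O_ij versus
# expected counts E_ij under independence of feature value and class.
def chi_square(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    N = sum(rows)
    chi2 = 0.0
    for i, r in enumerate(table):
        for j, O in enumerate(r):
            E = rows[i] * cols[j] / N      # expected count under independence
            chi2 += (O - E) ** 2 / E
    return chi2

# Feature value (rows) versus class label (columns):
print(chi_square([[20, 5], [5, 20]]))      # 18.0: strong association
print(chi_square([[10, 10], [10, 10]]))    # 0.0: independent
```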
Let X_s be the best subset of features; we collect all features X_j whose p-value is smaller than or equal to 0.05 as a result of the χ² statistical test according to (8). The remaining features, X \ X_s, are added to X_w. This approach is described in Algorithm 2. We independently sample features from the two subsets and put them together as the subspace features for splitting the data at any node, recursively. The two subsets partition the set of informative features in the data without irrelevant features. Given X_s and X_w, at each node we randomly select mtry (mtry > 1) features from each group of features. For a given subspace size, we can choose proportions between highly informative and weakly informative features that depend on the sizes of the two groups; that is, mtry_s = ⌈mtry × (|X_s|/|X|)⌉ and mtry_w = ⌊mtry × (|X_w|/|X|)⌋, where |X_s| and |X_w| are the numbers of features in the groups of highly informative features X_s and weakly informative features X_w, respectively, and |X| is the number of informative features in the input dataset. These are merged to form the feature subspace for splitting the node.
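The proportional split of mtry between X_s and X_w can be sketched as follows; the feature names are illustrative:

```python
# Sketch of the subspace composition: split mtry proportionally between the
# strong set X_s and the weak set X_w, then sample separately from each.
import math
import random

def subspace(Xs, Xw, mtry, seed=0):
    rng = random.Random(seed)
    n = len(Xs) + len(Xw)
    mtry_s = math.ceil(mtry * len(Xs) / n)    # share taken from the strong group
    mtry_w = math.floor(mtry * len(Xw) / n)   # share taken from the weak group
    return rng.sample(Xs, min(mtry_s, len(Xs))) + rng.sample(Xw, min(mtry_w, len(Xw)))

Xs = ["f1", "f2", "f3"]                        # highly informative features
Xw = ["f4", "f5", "f6", "f7", "f8", "f9"]      # weakly informative features
sub = subspace(Xs, Xw, mtry=3)                 # 1 strong + 2 weak features
print(len(sub))                                # 3
```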
4.4. Our Proposed RF Algorithm. In this section, we present our new random forest algorithm, called xRF, which uses the new unbiased feature sampling method to generate splits at the nodes of CART trees [2]. The proposed algorithm includes the following main steps: (i) weighting the features using the feature permutation method, (ii) identifying all unbiased features and partitioning them into two groups X_s and X_w, (iii) building the RF using subspaces containing features taken randomly and separately from X_s and X_w, and (iv) classifying new data. The new algorithm is summarized as follows.

(1) Generate the extended dataset S_XA of 2M dimensions by permuting the corresponding predictor feature values for the shadow features.
(2) Build a random forest model RF from S_XA, Y and compute R replicates of the raw importance scores of all predictor features and shadows with RF. Extract the maximum importance score of each replicate to form the comparison sample IS^max_A of R elements.
(3) For each predictor feature, take the R importance scores and compute the Wilcoxon test to get the p-value, that is, the weight of the feature.
(4) Given a significance level threshold θ, remove all uninformative features.
(5) Partition the remaining features into the two subsets X_s and X_w described in Algorithm 2.
(6) Sample the training set L with replacement to generate bagged samples L_1, L_2, ..., L_K.
(7) For each L_k, grow a CART tree T_k as follows:
(a) At each node, select a subspace of mtry (mtry > 1) features randomly and separately from X_s and X_w, and use the subspace features as candidates for splitting the node.
(b) Each tree is grown nondeterministically, without pruning, until the minimum node size n_min is reached.
(8) Given X = x_new, use (1) to predict the response value.
5 Experiments
5.1. Datasets. Real-world datasets, including image datasets and microarray datasets, were used in our experiments. Image classification and object recognition are important problems in computer vision. We conducted experiments on four benchmark image datasets, including the Caltech categories dataset (http://www.vision.caltech.edu/html-files/archive.html), the Horse dataset (http://pascal.inrialpes.fr/data/horses/), the extended YaleB database [26], and the AT&T ORL dataset [27].

For the Caltech dataset, we use a subset of 100 images from the Caltech face dataset and 100 images from the Caltech background dataset, following the setting in the ICCV short course (http://people.csail.mit.edu/torralba/shortCourseRLOC/). The extended YaleB database consists of 2414 face images of 38 individuals captured under various lighting conditions. Each image has been cropped to a size of 192 × 168 pixels
Algorithm 2 (input): the training dataset L and a random forest RF; R, θ: the number of replicates and the threshold.
and normalized. The Horse dataset consists of 170 images containing horses for the positive class and 170 background images for the negative class. The AT&T ORL dataset includes 400 face images of 40 persons.

In the experiments, we use a bag-of-words representation of image features for the Caltech and the Horse datasets. To obtain feature vectors using the bag-of-words method, image patches (subwindows) are sampled from the training images at detected interest points or on a dense grid. A visual descriptor is then applied to these patches to extract the local visual features. A clustering technique is used to cluster these, and the cluster centers are used as visual code words to form a visual codebook. An image is then represented as a histogram of these visual words, and a classifier is learned from this feature set for classification.

In our experiments, traditional k-means quantization is used to produce the visual codebook. The number of cluster centers can be adjusted to produce different vocabularies, that is, dimensions of the feature vectors. For the Caltech and Horse datasets, nine codebook sizes were used in the experiments to create 18 datasets: CaltechM300, CaltechM500, CaltechM1000, CaltechM3000, CaltechM5000, CaltechM7000, CaltechM10000, CaltechM12000, CaltechM15000, and HorseM300, HorseM500, HorseM1000, HorseM3000, HorseM5000, HorseM7000, HorseM10000, HorseM12000, HorseM15000, where M denotes the codebook size.
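Per image, the bag-of-words step described above reduces to nearest-center assignment plus a normalized histogram. A minimal sketch (the 2-D codebook and patch descriptors are illustrative; in the paper the codebook comes from k-means over training patches):

```python
# Bag-of-words sketch: assign each local descriptor to its nearest codebook
# center and represent the image as a normalized histogram of visual words.
def bow_histogram(descriptors, codebook):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))] += 1
    total = sum(counts)
    return [c / total for c in counts]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]          # 3 "visual words"
patches = [[0.1, 0.1], [0.9, 1.0], [0.1, 0.9], [0.0, 0.2]]
print(bow_histogram(patches, codebook))                   # [0.5, 0.25, 0.25]
```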
For the face datasets, we use two types of features: eigenfaces [28] and random features (randomly sampled pixels from the images). We used four groups of datasets with four different numbers of dimensions: M30, M56, M120, and M504. In total, we created 16 subdatasets.
Table 1: Description of the real-world datasets, sorted by the number of features and grouped into two groups, microarray data and real-world datasets, accordingly.
The properties of the remaining datasets are summarized in Table 1. The Fbis dataset was compiled from the archive of the Foreign Broadcast Information Service, and the La1s and La2s
datasets were taken from the archive of the Los Angeles Times for TREC-5 (http://trec.nist.gov). The ten gene datasets used are described in [11, 17]; they are high dimensional and fall within a category of classification problems which deal with a large number of features and small samples. Regarding the characteristics of the datasets given in Table 1, each of the subdatasets Fbis, La1s, and La2s was split individually into a training and a testing dataset.
5.2. Evaluation Methods. We calculated measures such as the error bound (c/s²), strength (s), and correlation (ρ), according to the formulas given in Breiman's method [1]. The correlation measure indicates the independence of trees in a forest, whereas the average strength corresponds to the accuracy of individual trees. Lower correlation and higher strength result in a reduction of the general error bound measured by c/s², which indicates a highly accurate RF model.
Two measures are also used to evaluate the accuracy of prediction on the test datasets: one is the area under the curve (AUC), and the other is the test accuracy (Acc), defined as

Acc = (1/N) Σ_{i=1}^{N} I(Q(d_i, y_i) − max_{j≠y_i} Q(d_i, j) > 0),  (9)

where I(·) is the indicator function and Q(d_i, j) = Σ_{k=1}^{K} I(h_k(d_i) = j) is the number of votes for d_i ∈ D_t on class j, h_k is the k-th tree classifier, N is the number of samples in the test data D_t, and y_i indicates the true class of d_i.
5.3. Experimental Settings. The latest R packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model is available in the RRF R package. The wsRF model, which uses the weighted sampling method [13], was intended to solve classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. From each fold, we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold. The mtry and n_min parameters were set to ⌈√M⌉ and 1, respectively. The experimental results were evaluated with two measures, AUC and test accuracy, according to (9).
We compared the performances across a wide range of settings on the 10 gene datasets used in [11]. The results from the application of GRRF, varSelRF, and LASSO logistic regression on the ten gene datasets are presented in [17]. These three gene selection methods used the RF R package [30] as the classifier. For the comparison of the methods, we used the same settings as presented in [17]; for the coefficient γ we used a value of 0.1, because GRRF(0.1) has shown competitive accuracy [17] when applied to the 10 gene datasets. One hundred models were generated with different seeds from each training dataset, and each model contained 1000 trees. The mtry and n_min parameters had the same settings as on the image datasets. From each of the datasets, two-thirds of the data were randomly selected for training; the other one-third of the dataset was used to validate the models. For comparison, Breiman's RF method, the weighted sampling random forest wsRF model, and the xRF model were used in the experiments. The guided regularized random forest GRRF [17] and two well-known feature selection methods using RF as a classifier, namely, varSelRF [31] and LASSO logistic regression [32], were also used to evaluate the accuracy of prediction on high-dimensional datasets.
For the remaining datasets, the prediction performances of the ten random forest models were evaluated; each one was built with 500 trees. The number of feature candidates to split a node was mtry = ⌈log₂(M) + 1⌉, and the minimal node size n_min was 1. The xRF model with the new unbiased feature sampling method is a new implementation. We implemented the xRF model as multithread processes, while the other models were run as single-thread processes. We used R to call the corresponding C/C++ functions. All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel Xeon CPU E5620 2.40 GHz, 16 cores, 4 MB cache, and 32 GB main memory.
5.4. Results on Image Datasets. Figures 1 and 2 show plots of the average recognition rates of the models on different subdatasets of the YaleB and ORL datasets. The GRRF model produced slightly better results on the subdataset ORLRandomM120 and on the ORL dataset using eigenfaces, and it showed competitive accuracy with the xRF model in some cases in both the YaleB and ORL datasets, for example, YaleBEigenM120, ORLRandomM56, and ORLRandomM120. The reason could be that the truly informative features in this kind of dataset were many. Therefore, when the informative feature set was large, the chance of selecting informative features in the subspace increased, which in turn increased the average recognition rates of the GRRF model. However, the xRF model produced the best results in the remaining cases. The effect of the new approach to feature subspace selection is clearly demonstrated in these results, although these datasets are not high dimensional.
Figures 3 and 5 present the box plots of the test accuracy (mean ± std-dev), and Figures 4 and 6 show the box plots of the AUC measures of the models on the 18 image subdatasets of Caltech and Horse, respectively. From these figures, we can observe that the accuracy and the AUC measures of the models GRRF, wsRF, and xRF increased on all high-dimensional subdatasets when the selected subspace mtry was not too large. This implies that when the number of features in the subspace is small, the proportion of informative features in the feature subspace is comparatively large in the three models. There is then a high chance that highly informative features are selected in the trees, so the overall performance of individual trees is increased. In Breiman's method, many randomly selected subspaces may not contain informative features, which affects the performance of trees grown from these subspaces. It can be seen that the xRF model outperformed the other random forest models on these subdatasets in both the test accuracy and the AUC measures. This was because the new unbiased feature sampling was used in generating trees in the xRF model; the feature subspace provided enough highly informative
8 The Scientific World Journal
[Plot panels omitted: recognition rate (%) versus feature dimension of the subdatasets (100–500) for the methods RF, GRRF, wsRF, and xRF; (a) YaleB + eigenface, (b) YaleB + randomface.]
Figure 1: Recognition rates of the models on the YaleB subdatasets, namely, YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504, and YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, and YaleBRandomfaceM504.
[Plot panels omitted: recognition rate (%) versus feature dimension of the subdatasets (100–500) for the methods RF, GRRF, wsRF, and xRF; (a) ORL + eigenface, (b) ORL + randomface.]
Figure 2: Recognition rates of the models on the ORL subdatasets, namely, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504, and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, and ORLRandomM504.
features at any level of the decision trees. The effect of the unbiased feature selection method is clearly demonstrated in these results.

Table 2 shows the results of c/s² against the number of codebook sizes on the Caltech and Horse datasets. In a random forest, each tree is grown from a bagged training set, and out-of-bag estimates were used to evaluate the strength, the correlation, and c/s². The GRRF model was not considered in this experiment because that method aims to find a small subset of features, and the same RF model in the R package [30] is used as its classifier. We compared the xRF model with two kinds of random forest models, RF and wsRF. From this table, we can observe that the lowest c/s² values occurred when the wsRF model was applied to the Caltech dataset. However, the xRF model produced the lowest error bound on the Horse dataset. These results demonstrate that the new unbiased feature sampling method can reduce the upper bound of the generalization error in random forests.
Table 3 presents the prediction accuracies (mean ± std-dev) of the models on the subdatasets CaltechM3000, HorseM3000, YaleBEigenfaceM504, YaleBRandomfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. In these experiments, we used the four models to generate random forests with different sizes, from 20 trees to 200 trees. For each size, we used each model to generate 10 random forests for the 10-fold cross-validation and computed the average accuracy of the 10 results. The GRRF model showed slightly better results on YaleBEigenfaceM504 with
[Plot panels omitted: box plots of the test accuracy (%) of RF, GRRF, wsRF, and xRF on the nine Caltech subdatasets, CaltechM300 through CaltechM15000.]
Figure 3: Box plots of the test accuracy on the nine Caltech subdatasets.
different tree sizes. The wsRF model produced the best prediction performance in some cases when applied to the small subdatasets YaleBEigenfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. However, the xRF model produced the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forest models in classification accuracy in most cases across all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy measures changed only slightly when the number of trees was varied.
5.5. Results on Microarray Datasets. Table 5 shows the average test results, in terms of accuracy, of the 100 random forest models computed according to (9) on the gene datasets. The average number of genes selected by the xRF model from 100 repetitions for each dataset is shown on the right of Table 5, divided into two groups, X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on the two gene datasets srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most cases on the remaining gene datasets.
[Plot panels omitted: box plots of the AUC measures of RF, GRRF, wsRF, and xRF on the nine Caltech subdatasets, CaltechM300 through CaltechM15000.]
Figure 4: Box plots of the AUC measures on the nine Caltech subdatasets.
The detailed results, containing the median and the variance values, are presented in Figure 7 with box plots. Only the GRRF model was used for this comparison; the LASSO logistic regression and varSelRF feature selection methods were not considered in this experiment because their accuracies are lower than that of the GRRF model, as shown in [17]. We can see that the xRF model achieved the highest average accuracy of prediction on nine datasets out of ten. Its result was significantly different on the prostate gene dataset, and its variance was also smaller than those of the other models.
Figure 8 shows the box plots of the c/s² error bound of the RF, wsRF, and xRF models on the ten gene datasets from 100 repetitions. The wsRF model obtained a lower error bound rate on five gene datasets out of 10. The xRF model produced a significantly different error bound rate on two gene datasets and obtained the lowest error rate on three datasets. This implies that when the optimal parameters, such as mtry = ⌈√M⌉ and n_min = 1, were used in growing trees, the number of genes in the subspace was not small, out-of-bag data was used in prediction, and the results were comparatively favorable to the xRF model.
5.6. Comparison of Prediction Performance for Various Numbers of Features and Trees. Table 6 shows the average c/s² error bound and test accuracy results of 10 repetitions of the random forest models on the three large datasets. The xRF model produced the lowest error c/s² on the dataset La1s,
[Plot panels omitted: box plots of the test accuracy (%) of RF, GRRF, wsRF, and xRF on the nine Horse subdatasets, HorseM300 through HorseM15000.]
Figure 5: Box plots of the test accuracy on the nine Horse subdatasets.
while the wsRF model showed a lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy compared to the other models; this model also produced a large c/s² error when the small subspace size mtry = ⌈log₂(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest accuracy of prediction on all three large datasets.
Figure 9 shows the plots of the performance curves of the RF models as the number of trees and features increases. The number of trees was increased stepwise by 20 trees, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees was increased stepwise by 50 trees, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features in the subspace varied from 5 to 100, and for each setting 200 trees were combined. The vertical line in each plot indicates the size of a subspace of features mtry = ⌈log₂(M) + 1⌉. This subspace was suggested by Breiman [1] for the case of low-dimensional datasets. Three feature selection methods, namely, GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that when the mtry value is large, the computational time required by the GRRF and varSelRF models to deal with large high-dimensional datasets was too long [17].
[Plot panels omitted: box plots of the AUC measures of RF, GRRF, wsRF, and xRF on the nine Horse subdatasets, HorseM300 through HorseM15000.]
Figure 6: Box plots of the AUC measures on the nine Horse subdatasets.
It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace mtry = ⌈log₂(M) + 1⌉ was used. However, the xRF model was better than the wsRF model in increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve a higher accuracy of prediction, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve its best prediction performance. These empirical results indicate that, for applications on high-dimensional data, when the xRF model uses a small subspace the achieved results can be satisfactory.
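The two subspace sizes compared here diverge sharply as M grows; a quick computation (the feature counts below are illustrative, not the exact dataset dimensions):

```python
import math

def mtry_log(m_features):
    """Small subspace used above: ceil(log2(M) + 1)."""
    return math.ceil(math.log2(m_features) + 1)

def mtry_sqrt(m_features):
    """Breiman-style subspace for classification: ceil(sqrt(M))."""
    return math.ceil(math.sqrt(m_features))

for m_features in (2000, 5005, 15000):
    print(m_features, mtry_log(m_features), mtry_sqrt(m_features))
# e.g. for M = 5005: a log-size subspace of 14 versus a sqrt-size subspace of 71
```

This gap is why the log-size subspace is so much cheaper to evaluate per node, and why a method that packs informative features into that small subspace can stay both fast and accurate.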
However, the RF model, using the simple sampling method for feature selection [1], could achieve good prediction performance only if it is provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested
Table 2: The c/s² error bound results of the random forest models against the number of codebook sizes on the Caltech and Horse datasets. The bold value in each row indicates the best result.
Figure 7: Box plots of the test accuracy of the models on the ten gene datasets.
Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.
Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.
Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group X_s
Table 6: The accuracy of prediction and the error bound c/s² of the models using a small subspace mtry = ⌈log₂(M) + 1⌉; better values are in bold. Columns: dataset, c/s² error bound, test accuracy (%), and X_s.
Figure 8: Box plots of the c/s² error bound for the models applied to the 10 gene datasets.
by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered to be one of the contributions of our work.
6. Conclusions

We have presented a new method of feature subspace selection for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach to unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (or noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,
[Plot panels omitted: prediction accuracy (%) of RF, wsRF, and xRF against the number of trees (left) and the number of features in the subspace (right), with a vertical line at mtry = ⌈log₂(M) + 1⌉; (a) Fbis, (b) La2s, (c) La1s.]
Figure 9: The prediction accuracy of the three random forest models against the number of trees and features on the three datasets.
especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.

For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported in part by NSFC under Grant no. 61203294 and by Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision," funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 102.01-2011.17.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.
[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[31] R. Díaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: lasso and elastic-net regularized generalized linear models," R package, 2010, http://CRAN.R-project.org/package=glmnet.
4.1. Issues in Feature Selection on High-Dimensional Data. When Breiman et al. suggested the classification and regression tree (CART) model, they noted that feature selection is biased because it is based on an information gain criterion, the so-called multivalue problem [2]. Random forest methods are based on CART trees [1]; hence this bias is carried over to the random forest (RF) model. In particular, the importance scores can be biased when very high-dimensional data contains multiple data types. Several methods have been proposed to correct the bias of feature importance measures [18–21]. The conditional inference framework (referred to as cRF [22]) has been successfully applied to both the null and power cases [19, 20, 22]. The typical characteristic of the power case is that only one predictor feature is important, while the rest of the features are redundant with different cardinalities. In contrast, in the null case all features used for prediction are redundant with different cardinalities. Although methods of this kind have been well investigated and can be used to address the multivalue problem, some problems remain unsolved, such as the need to specify the probability distributions in advance and the fact that these methods struggle when applied to high-dimensional data.
Another issue is that in high-dimensional data, when the number of features is large, the fraction of informative features remains very small. In this case, the original RF model, which uses simple random sampling, is likely to perform poorly with a small m, and the trees are likely to select an uninformative feature as a split too frequently (m denotes the subspace size of features). At each node t of a tree, the probability of selecting an uninformative feature is too high.
To illustrate this issue, let G be the number of noisy features, denote by M the total number of predictor features, and let the remaining M − G features be important ones, which have a high correlation with the Y values. Then, if we use simple random sampling when growing trees to select a subset of m features (m ≪ M), the total number of possible subsets drawn only from the important features is $C_{M-G}^{m}$, and the total number of all feature subsets is $C_{M}^{m}$. The probability of selecting a subset of m (m > 1) important features is given by

$$\frac{C_{M-G}^{m}}{C_{M}^{m}}
= \frac{(M-G)(M-G-1)\cdots(M-G-m+1)}{M(M-1)\cdots(M-m+1)}
= \frac{\left(1-\frac{G}{M}\right)\cdots\left(1-\frac{G}{M}-\frac{m}{M}+\frac{1}{M}\right)}{\left(1-\frac{1}{M}\right)\cdots\left(1-\frac{m}{M}+\frac{1}{M}\right)}
\simeq \left(1-\frac{G}{M}\right)^{m}. \tag{7}$$
Because the fraction of important features is so small, the probability in (7) tends to 0, which means that the important features are rarely selected by the simple sampling method in RF [1]. For example, with 5 informative and 5000 noisy or uninformative features, assuming m = √(5 + 5000) ≃ 70, the probability of an informative feature being selected at any split is 0.068.
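This calculation can be checked directly; a short sketch comparing the exact combinatorial count with the approximation in (7):

```python
from math import comb

M, G, m = 5005, 5000, 70   # 5 informative features among 5005, subspace size ~ sqrt(M)

# Probability that a simple random subspace of m features is all-informative,
# C(M-G, m) / C(M, m); math.comb returns 0 when m > M - G, so this is exactly 0 here.
p_all_informative = comb(M - G, m) / comb(M, m)

# Probability that at least one informative feature enters the subspace:
# 1 - P(all m features are drawn from the G uninformative ones)
p_any_informative = 1 - comb(G, m) / comb(M, m)

print(round(p_any_informative, 3))  # ~0.068, as in the text
print((1 - G / M) ** m)             # the (1 - G/M)^m approximation of (7): essentially 0
```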
4.2. Bias Correction for Feature Selection and Feature Weighting. The bias correction in feature selection is intended to prevent the RF model from selecting uninformative features. To correct this kind of bias in the feature selection stage, we generate shadow features to add to the original dataset. The shadow features have the same values, possible cut-points, and distribution as the original features but have no association with the Y values. To create each shadow feature, we rearrange the values of the feature in the original dataset R times to create the corresponding shadow. This disturbance of features eliminates the correlations of the features with the response value but keeps their attributes. A shadow feature participates only in the competition for the best split and decreases the probability of selecting this kind of uninformative feature. For the feature weight computation, we first need to distinguish the important features from the less important ones. To do so, we run a defined number of random forests to obtain raw importance scores, each of which is obtained using (6). Then we use the Wilcoxon rank-sum test [23], which compares the importance score of a feature with the maximum importance score of the generated noisy features, called shadows. The shadow features are added to the original dataset, and they have no prediction power with respect to the response feature. Therefore, any feature whose importance score is smaller than the maximum importance score of the noisy features is considered less important; otherwise, it is considered important. Having computed the Wilcoxon rank-sum test, we can compute the p-value for each feature. The p-value of a feature in the Wilcoxon rank-sum test assigns a weight to the feature X_j, with p-value ∈ [0, 1], and this weight indicates the importance of the feature in the prediction. The smaller the p-value of a feature, the more correlated the predictor feature is with the response feature, and therefore the more powerful the feature is in prediction. The feature weight computation is described as follows.
Let M be the number of features in the original dataset, and denote the feature set as S_X = {X_j, j = 1, 2, ..., M}. In each replicate r (r = 1, 2, ..., R), shadow features are generated from the features X_j in S_X: we randomly permute all values of X_j to get a corresponding shadow feature A_j; denote the shadow feature set as S_A = {A_j}_{j=1}^{M}. The extended feature set is denoted by S_XA = {S_X, S_A}.

Let the importance score of S_XA at replicate r be IS^r_XA = {IS^r_X, IS^r_A}, where IS^r_{X_j} and IS^r_{A_j} are the importance scores of X_j and A_j at the rth replicate, respectively. We built a random forest model RF from the S_XA dataset to compute 2M importance scores for the 2M features. We repeated the same process R times to compute R replicates, getting IS_{X_j} = {IS^r_{X_j}}_{r=1}^{R} and IS_{A_j} = {IS^r_{A_j}}_{r=1}^{R}. From the replicates of the shadow features, we extracted the maximum value from the rth row of IS_{A_j} and put it into the comparison sample, denoted by IS^max_A. For each data feature X_j, we computed the Wilcoxon test and performed a hypothesis test on IS_{X_j} > IS^max_A to calculate the p-value for the feature. Given a statistical significance level, we can identify important features from less important ones. This test confirms that if a feature is important, it consistently scores higher than the shadow over multiple permutations. This method has been presented in [24, 25].
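A compact sketch of the shadow/Wilcoxon idea described above (the importance scores below are synthetic stand-ins for values from (6), and the rank-sum p-value uses a normal approximation without tie correction, so it is an illustration rather than the paper's exact procedure):

```python
import math
import random

def shadow(values, rng):
    """A shadow feature: the same values as the original, randomly permuted,
    so it keeps the distribution but loses any association with Y."""
    s = list(values)
    rng.shuffle(s)
    return s

def rank_sum_pvalue(xs, ys):
    """One-sided Wilcoxon rank-sum test (normal approximation, no tie
    correction): a small p-value supports 'xs tend to be larger than ys'."""
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x, i, n = 0.0, 0, len(pooled)
    while i < n:                      # assign midranks to tied values
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(xs), len(ys)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)

rng = random.Random(1)
R = 30
# Synthetic replicate scores: an informative feature scoring around 0.8,
# and the per-replicate maximum over its shadows around 0.3.
is_x = [0.8 + rng.gauss(0, 0.05) for _ in range(R)]
is_shadow_max = [0.3 + rng.gauss(0, 0.05) for _ in range(R)]
p_value = rank_sum_pvalue(is_x, is_shadow_max)
print(p_value < 0.05)   # True: the feature consistently beats its shadows
```

A feature whose scores do not consistently exceed the shadow maxima would instead yield a large p-value and be treated as uninformative.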
In each node of the trees, each shadow A_j shares approximately the same properties as the corresponding X_j, but it is independent of Y and consequently has approximately the same probability of being selected as a splitting candidate. This feature permutation method can reduce the bias due to different measurement levels of X_j, according to the p-value, and can yield a correct ranking of features according to their importance.
4.3. Unbiased Feature Weighting for Subspace Selection. Given the p-values for all features, we first set a significance level as the threshold θ, for instance θ = 0.05. Any feature whose p-value is greater than θ is considered an uninformative feature and is removed from the system; otherwise, its relationship with Y is assessed. We now consider the set of features X obtained from L after removing all uninformative features.
Second we find the best subset of features which is highlyrelated to the response feature ameasure correlation function1205942(X 119884) is used to test the association between the categorical
response feature and each feature 119883119895 Each observation is
allocated to one cell of a two-dimensional array of cells (calleda contingency table) according to the values of (X 119884) If thereare 119903 rows and 119888 columns in the table and119873 is the number oftotal samples the value of the test statistic is
\chi^{2} = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}, \quad (8)

where O_ij is the observed count and E_ij the expected count in cell (i, j).
For the test of independence, a chi-squared probability of less than or equal to 0.05 is commonly interpreted as justification for rejecting the hypothesis that the row variable is independent of the column feature.
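A minimal sketch of computing the statistic in (8) from a contingency table (illustrative code, not the paper's implementation):

```python
def chi2_statistic(table):
    """table[i][j]: observed count O_ij, rows = categories of X_j,
    columns = classes of Y. Returns the chi-squared statistic of (8)."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            e = row_tot[i] * col_tot[j] / n  # expected count E_ij
            chi2 += (table[i][j] - e) ** 2 / e
    return chi2
```

A uniform table yields 0 (no association), while a diagonal table yields a large value, driving the p-value toward 0.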
Let X_s be the best subset of features; we collect all features X_j whose p-value is smaller than or equal to 0.05, as a result of the χ² statistical test according to (8). The remaining features, X \ X_s, are added to X_w; this approach is described in Algorithm 2. We independently sample features from the two subsets and put them together as the subspace features for splitting the data at any node, recursively. The two subsets partition the set of informative features in the data without irrelevant features. Given X_s and X_w, at each node we randomly select mtry (mtry > 1) features from each group of features. For a given subspace size, we can choose proportions between highly informative features and weakly informative features that depend on the sizes of the two groups, that is, mtry_s = ⌈mtry × (‖X_s‖/‖X‖)⌉ and mtry_w = ⌊mtry × (‖X_w‖/‖X‖)⌋, where ‖X_s‖ and ‖X_w‖ are the numbers of features in the groups of highly informative features X_s and weakly informative features X_w, respectively, and ‖X‖ is the number of informative features in the input dataset. These are merged to form the feature subspace for splitting the node.
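The partition into X_s and X_w and the proportional subspace sampling can be sketched as follows (hypothetical helper names; `pvalues` maps feature indices to their χ² p-values):

```python
import math
import random

def partition_features(pvalues, alpha=0.05):
    """Split informative features into the strong group X_s
    (chi-squared p-value <= alpha) and the weak group X_w (the rest)."""
    xs = [j for j, p in sorted(pvalues.items()) if p <= alpha]
    xw = [j for j, p in sorted(pvalues.items()) if p > alpha]
    return xs, xw

def sample_subspace(xs, xw, mtry, rng):
    """Draw mtry candidate features, proportionally from the two groups:
    mtry_s = ceil(mtry * |X_s|/|X|), mtry_w = floor(mtry * |X_w|/|X|)."""
    total = len(xs) + len(xw)
    mtry_s = math.ceil(mtry * len(xs) / total)
    mtry_w = math.floor(mtry * len(xw) / total)
    return rng.sample(xs, min(mtry_s, len(xs))) + \
           rng.sample(xw, min(mtry_w, len(xw)))
```

Because the two draws are independent, every candidate subspace contains at least one feature from the strong group whenever X_s is nonempty.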
4.4. Our Proposed RF Algorithm. In this section, we present our new random forest algorithm, called xRF, which uses the new unbiased feature sampling method to generate splits at the nodes of CART trees [2]. The proposed algorithm includes the following main steps: (i) weighting the features using the feature permutation method, (ii) identifying all unbiased features and partitioning them into two groups, X_s and X_w, (iii) building the RF using subspaces containing features taken randomly and separately from X_s and X_w, and (iv) classifying new data. The new algorithm is summarized as follows.
(1) Generate the extended dataset SX_A of 2M dimensions by permuting the corresponding predictor feature values to create shadow features.
(2) Build a random forest model RF from {SX_A, Y} and compute R replicates of raw importance scores of all predictor features and shadows with RF. Extract the maximum importance score of each replicate to form the comparison sample IS^max_A of R elements.
(3) For each predictor feature, take the R importance scores and compute a Wilcoxon test to get the p-value, which is the weight of the feature.
(4) Given a significance level threshold θ, neglect all uninformative features.
(5) Partition the remaining features into the two subsets X_s and X_w described in Algorithm 2.
(6) Sample the training set L with replacement to generate bagged samples L_1, L_2, ..., L_K.
(7) For each L_k, grow a CART tree T_k as follows:
(a) At each node, select a subspace of mtry (mtry > 1) features randomly and separately from X_s and X_w, and use the subspace features as candidates for splitting the node.
(b) Each tree is grown nondeterministically, without pruning, until the minimum node size n_min is reached.
(8) Given X = x_new, use (1) to predict the response value.
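Step (6) above, bagging, can be sketched as follows (illustrative helpers; the out-of-bag complement is included because it is used later for the strength and correlation estimates):

```python
import random

def bagged_samples(n_samples, k_trees, rng):
    """Draw K bootstrap samples of row indices, with replacement,
    each the same size as the training set."""
    return [[rng.randrange(n_samples) for _ in range(n_samples)]
            for _ in range(k_trees)]

def oob_indices(bag, n_samples):
    """Rows left out of one bootstrap sample; usable for
    out-of-bag estimates of strength, correlation, and error."""
    return sorted(set(range(n_samples)) - set(bag))
```

Each tree T_k then sees only its bag L_k, and its out-of-bag rows provide an unbiased test set for that tree.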
5. Experiments

5.1. Datasets. Real-world datasets, including image datasets and microarray datasets, were used in our experiments. Image classification and object recognition are important problems in computer vision. We conducted experiments on four benchmark image datasets: the Caltech categories dataset (http://www.vision.caltech.edu/html-files/archive.html), the Horse dataset (http://pascal.inrialpes.fr/data/horses/), the extended YaleB database [26], and the AT&T ORL dataset [27].
For the Caltech dataset, we used a subset of 100 images from the Caltech face dataset and 100 images from the Caltech background dataset, following the setting in the ICCV short course (http://people.csail.mit.edu/torralba/shortCourseRLOC/). The extended YaleB database consists of 2414 face images of 38 individuals captured under various lighting conditions. Each image has been cropped to a size of 192 × 168 pixels
(Algorithm: input — the training dataset L and a random forest RF; R and θ — the number of replicates and the threshold.)
and normalized. The Horse dataset consists of 170 images containing horses (the positive class) and 170 background images (the negative class). The AT&T ORL dataset includes 400 face images of 40 persons.
In the experiments, we use a bag-of-words representation for image features for the Caltech and Horse datasets. To obtain feature vectors with the bag-of-words method, image patches (subwindows) are sampled from the training images at detected interest points or on a dense grid. A visual descriptor is then applied to these patches to extract local visual features. A clustering technique is used to cluster these features, and the cluster centers are used as visual code words to form a visual codebook. An image is then represented as a histogram of these visual words, and a classifier is learned from this feature set for classification.
In our experiments, traditional k-means quantization is used to produce the visual codebook. The number of cluster centers can be adjusted to produce different vocabularies, that is, different dimensions of the feature vectors. For the Caltech and Horse datasets, nine codebook sizes were used in the experiments to create 18 datasets, as follows: CaltechM300, CaltechM500, CaltechM1000, CaltechM3000, CaltechM5000, CaltechM7000, CaltechM10000, CaltechM12000, CaltechM15000 and HorseM300, HorseM500, HorseM1000, HorseM3000, HorseM5000, HorseM7000, HorseM10000, HorseM12000, HorseM15000, where M denotes the codebook size.
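As a toy illustration of this quantization step, the sketch below clusters one-dimensional "descriptors" with a minimal k-means and builds the word histogram; real pipelines cluster multi-dimensional local descriptors, and all names here are ours:

```python
import random

def kmeans(points, k, iters, rng):
    """Toy 1-D k-means: returns k cluster centers (the visual code words)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # assign each descriptor to its nearest center
            groups[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def bow_histogram(descriptors, centers):
    """Represent an image as a histogram of nearest code words."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[min(range(len(centers)), key=lambda i: abs(d - centers[i]))] += 1
    return hist
```

Enlarging `k` lengthens the histogram, which is exactly how the M300 to M15000 subdatasets of increasing dimensionality were produced.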
For the face datasets, we use two types of features: eigenfaces [28] and random features (pixels randomly sampled from the images). We used four groups of datasets with four different numbers of dimensions: M30, M56, M120, and M504. In total, we created 16 subdatasets.
Table 1: Description of the real-world datasets, sorted by the number of features and grouped into two groups: microarray data and real-world datasets.
The properties of the remaining datasets are summarized in Table 1. The Fbis dataset was compiled from the archive of the Foreign Broadcast Information Service, and the La1s and La2s datasets were taken from the archive of the Los Angeles Times for TREC-5 (http://trec.nist.gov). The ten gene datasets used are described in [11, 17]; they are high dimensional and fall within a category of classification problems that deal with a large number of features and small samples. Regarding the characteristics of the datasets given in Table 1, the subdatasets Fbis, La1s, and La2s were each divided individually into training and testing datasets.
5.2. Evaluation Methods. We calculated measures such as the error bound (c/s²), strength (s), and correlation (ρ) according to the formulas given in Breiman's method [1]. The correlation measure indicates the independence of trees in a forest, whereas the average strength corresponds to the accuracy of individual trees. Lower correlation and higher strength result in a reduction of the general error bound measured by c/s², which indicates a highly accurate RF model.
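For reference, the bound from Breiman [1] that these quantities enter can be restated as follows (with s the average strength and ρ̄ the mean correlation; notation as in [1]):

```latex
% Breiman's upper bound on the generalization error PE*:
\mathrm{PE}^{*} \;\le\; \frac{\bar{\rho}\,\bigl(1 - s^{2}\bigr)}{s^{2}},
\qquad
c/s^{2} \;=\; \bar{\rho}/s^{2}.
```

Lowering ρ̄ or raising s shrinks both the bound and the reported c/s² ratio, which is why these two quantities are tracked in Tables 2 and 6.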
Two measures are also used to evaluate the accuracy of prediction on the test datasets: one is the area under the curve (AUC) and the other is the test accuracy (Acc), defined as

\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} I\Bigl(Q(d_i, y_i) - \max_{j \neq y_i} Q(d_i, j) > 0\Bigr), \quad (9)

where I(·) is the indicator function and Q(d_i, j) = \sum_{k=1}^{K} I(h_k(d_i) = j) is the number of votes for d_i ∈ D_t on class j, h_k is the k-th tree classifier, N is the number of samples in the test data D_t, and y_i indicates the true class of d_i.
5.3. Experimental Settings. The latest R packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model is available in the RRF R package. The wsRF model, which uses the weighted sampling method [13], is intended for classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. From each fold, we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold. The mtry and n_min parameters were set to √M and 1, respectively. The experimental results were evaluated with two measures, AUC and the test accuracy, according to (9).
We compared, across a wide range of settings, the performances on the 10 gene datasets used in [11]. The results from the application of GRRF, varSelRF, and LASSO logistic regression to the ten gene datasets are presented in [17]. These three gene selection methods used the RF R package [30] as the classifier. For the comparison of the methods, we used the same settings as presented in [17]; for the coefficient γ, we used the value 0.1, because GRRF(0.1) has shown competitive accuracy when applied to the 10 gene datasets [17]. One hundred models were generated with different seeds from each training dataset, and each model contained 1000 trees. The mtry and n_min parameters had the same settings as on the image datasets. From each of the datasets, two-thirds of the data were randomly selected for training; the other one-third was used to validate the models.
For comparison, Breiman's RF method, the weighted sampling random forest (wsRF) model, and the xRF model were used in the experiments. The guided regularized random forest (GRRF) [17] and two well-known feature selection methods using RF as a classifier, namely varSelRF [31] and LASSO logistic regression [32], were also used to evaluate the accuracy of prediction on high-dimensional datasets.
On the remaining datasets, the prediction performances of the ten random forest models were evaluated; each one was built with 500 trees. The number of candidate features to split a node was mtry = ⌈log2(M) + 1⌉. The minimal node size n_min was 1. The xRF model with the new unbiased feature sampling method is a new implementation. We implemented the xRF model as multithread processes, while the other models were run as single-thread processes. We used R to call the corresponding C/C++ functions. All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel Xeon CPU E5620 at 2.40 GHz, 16 cores, 4 MB cache, and 32 GB main memory.
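The two subspace-size settings used across the experiments, ⌈√M⌉ for the image folds and ⌈log2(M) + 1⌉ for the remaining datasets, can be compared directly (a hypothetical helper, not from the paper):

```python
import math

def subspace_sizes(m):
    """Return (ceil(sqrt(M)), ceil(log2(M) + 1)) for M input features."""
    return math.ceil(math.sqrt(m)), math.ceil(math.log2(m) + 1)
```

For high-dimensional data the logarithmic setting is far smaller, which is why building forests with it is much cheaper, and why it is a demanding test of a subspace-selection method.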
5.4. Results on Image Datasets. Figures 1 and 2 show the average recognition rates of the models on different subdatasets of the YaleB and ORL datasets. The GRRF model produced slightly better results on the subdataset ORLRandomM120 and on the ORL dataset using eigenfaces, and it showed competitive accuracy with the xRF model in some cases in both the YaleB and ORL datasets, for example, YaleBEigenM120, ORLRandomM56, and ORLRandomM120. The reason could be that truly informative features in this kind of dataset were many. Therefore, when the informative feature set was large, the chance of selecting informative features in the subspace increased, which in turn increased the average recognition rates of the GRRF model. However, the xRF model produced the best results in the remaining cases. The effect of the new approach to feature subspace selection is clearly demonstrated in these results, although these datasets are not high dimensional.
Figures 3 and 5 present the box plots of the test accuracy (mean ± std-dev), and Figures 4 and 6 show the box plots of the AUC measures of the models on the 18 image subdatasets of Caltech and Horse, respectively. From these figures, we can observe that the accuracy and AUC measures of the GRRF, wsRF, and xRF models increased on all high-dimensional subdatasets when the selected subspace mtry was not too large. This implies that, when the number of features in the subspace is small, the proportion of informative features in the feature subspace is comparatively large in the three models, so there is a high chance that highly informative features are selected in the trees and the overall performance of individual trees is increased. In Breiman's method, many randomly selected subspaces may not contain informative features, which affects the performance of trees grown from these subspaces. It can be seen that the xRF model outperformed the other random forest models on these subdatasets in terms of the test accuracy and AUC measures. This was because the new unbiased feature sampling was used in generating trees in the xRF model: the feature subspace provided enough highly informative
Figure 1: Recognition rates (%) of the models (RF, GRRF, wsRF, xRF) against the feature dimension of the YaleB subdatasets, namely, YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504 and YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, YaleBRandomfaceM504. (a) YaleB + eigenface; (b) YaleB + randomface.
Figure 2: Recognition rates (%) of the models (RF, GRRF, wsRF, xRF) against the feature dimension of the ORL subdatasets, namely, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504 and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, ORLRandomM504. (a) ORL + eigenface; (b) ORL + randomface.
features at all levels of the decision trees. The effect of the unbiased feature selection method is clearly demonstrated in these results.
Table 2 shows the results of c/s² against the number of codebook sizes on the Caltech and Horse datasets. In a random forest, each tree is grown from a bagged training sample, and out-of-bag estimates were used to evaluate the strength, correlation, and c/s². The GRRF model was not considered in this experiment, because that method aims to find a small subset of features and uses the same RF model in the R package [30] as its classifier. We compared the xRF model with two kinds of random forest models, RF and wsRF. From this table, we can observe that the lowest c/s² values occurred when the wsRF model was applied to the Caltech dataset. However, the xRF model produced the lowest error bound on the Horse dataset. These results demonstrate that the new unbiased feature sampling method can reduce the upper bound of the generalization error in random forests.
Table 3 presents the prediction accuracies (mean ± std-dev) of the models on the subdatasets CaltechM3000, HorseM3000, YaleBEigenfaceM504, YaleBRandomfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. In these experiments, we used the four models to generate random forests with different sizes, from 20 trees to 200 trees. For each size, we used each model to generate 10 random forests for the 10-fold cross-validation and computed the average accuracy of the 10 results. The GRRF model showed slightly better results on YaleBEigenfaceM504 with
Figure 3: Box plots of the test accuracy (%) of the models (RF, GRRF, wsRF, xRF) on the nine Caltech subdatasets.
different tree sizes. The wsRF model produced the best prediction performance in some cases when applied to the small subdatasets YaleBEigenfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. However, the xRF model produced, respectively, the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forest models in classification accuracy in most cases across all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy measures changed only slightly when the number of trees was varied.
5.5. Results on Microarray Datasets. Table 5 shows the average test accuracy of the 100 random forest models, computed according to (9), on the gene datasets. The average number of genes selected by the xRF model from 100 repetitions for each dataset is shown on the right of Table 5, divided into two groups, X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on the two gene datasets srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most of the remaining gene datasets.
Figure 4: Box plots of the AUC measures of the models (RF, GRRF, wsRF, xRF) on the nine Caltech subdatasets.
The detailed results, containing the median and variance values, are presented with box plots in Figure 7. Only the GRRF model was used for this comparison; the LASSO logistic regression and varSelRF feature selection methods were not considered in this experiment, because their accuracies are lower than that of the GRRF model, as shown in [17]. We can see that the xRF model achieved the highest average prediction accuracy on nine of the ten datasets. Its result was significantly different on the prostate gene dataset, and its variance was also smaller than those of the other models.
Figure 8 shows the box plots of the c/s² error bound of the RF, wsRF, and xRF models on the ten gene datasets from 100 repetitions. The wsRF model obtained a lower error bound on five of the 10 gene datasets. The xRF model produced a significantly different error bound on two gene datasets and obtained the lowest error rate on three datasets. This implies that, when the optimal parameters mtry = ⌈√M⌉ and n_min = 1 were used in growing trees, the number of genes in the subspace was not small, out-of-bag data was used in prediction, and the results were comparatively favorable to the xRF model.
5.6. Comparison of Prediction Performance for Various Numbers of Features and Trees. Table 6 shows the average c/s² error bound and test accuracy of 10 repetitions of the random forest models on the three large datasets. The xRF model produced the lowest error c/s² on the dataset La1s,
Figure 5: Box plots of the test accuracy (%) of the models (RF, GRRF, wsRF, xRF) on the nine Horse subdatasets.
while the wsRF model showed a lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy of the compared models; it also produced a large c/s² error when the small subspace size mtry = ⌈log2(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest prediction accuracy on all three large datasets.
Figure 9 shows the performance curves of the RF models as the number of trees and features increases. The number of trees was increased stepwise by 20, from 20 to 200, when the models were applied to the La1s dataset; for the remaining datasets, the number of trees was increased stepwise by 50, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features, each consisting of a random sum of five inputs, was varied from 5 to 100, and for each, 200 trees were combined. The vertical line in each plot indicates the subspace size mtry = ⌈log2(M) + 1⌉. This subspace size was suggested by Breiman [1] for the case of low-dimensional datasets. The three feature selection methods, namely, GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that, when the mtry value is large, the computational time required by the GRRF and varSelRF models to deal with large, high-dimensional datasets is too long [17].
Figure 6: Box plots of the AUC measures of the models (RF, GRRF, wsRF, xRF) on the nine Horse subdatasets.
It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace size mtry = ⌈log2(M) + 1⌉ was used. However, the xRF model is better than the wsRF model at increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve higher prediction accuracy, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve its best prediction performance. These empirical results indicate that, for applications on high-dimensional data, the results achieved with the xRF model using a small subspace can be satisfactory.
However, the RF model using the simple sampling method for feature selection [1] can achieve good prediction performance only if it is provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large, high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested
Table 2: The c/s² error bound results of the random forest models against the number of codebook sizes on the Caltech and Horse datasets. The bold value in each row indicates the best result.

Figure 7: Box plots of the test accuracy of the models on the ten gene datasets.

Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.

Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.

Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The numbers of genes in the strong group X_s and the weak group X_w are also reported.

Table 6: The accuracy of prediction and error bound c/s² of the models using a small subspace mtry = ⌈log2(M) + 1⌉; better values are in bold.

Figure 8: Box plots of the c/s² error bound for the models applied to the 10 gene datasets.
by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered one of the contributions of our work.
6. Conclusions
We have presented a new feature subspace selection method for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach to unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,
Figure 9: The accuracy of prediction (%) of the three random forest models (RF, wsRF, xRF) against the number of trees (left) and the number of features in the subspace (right) on the three datasets: (a) Fbis, (b) La2s, (c) La1s. The vertical line in each feature plot marks mtry = ⌈log2(M) + 1⌉.
especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.
For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on cloud platforms to deal with big data, that is, hundreds of millions of samples and features.
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision," funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 10201-201117.
[2] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees CRC Press Boca Raton FlaUSA 1984
[3] H Kim and W-Y Loh ldquoClassification trees with unbiasedmultiway splitsrdquo Journal of the American Statistical Associationvol 96 no 454 pp 589ndash604 2001
[4] A PWhite andW Z Liu ldquoTechnical note bias in information-based measures in decision tree inductionrdquo Machine Learningvol 15 no 3 pp 321ndash329 1994
[5] T G Dietterich ldquoExperimental comparison of three methodsfor constructing ensembles of decision trees bagging boostingand randomizationrdquo Machine Learning vol 40 no 2 pp 139ndash157 2000
[6] Y Freund and R E Schapire ldquoA desicion-theoretic general-ization of on-line learning and an application to boostingrdquo inComputational Learning Theory pp 23ndash37 Springer 1995
[7] T-T Nguyen and T T Nguyen ldquoA real time license platedetection system based on boosting learning algorithmrdquo inProceedings of the 5th International Congress on Image and SignalProcessing (CISP rsquo12) pp 819ndash823 IEEE October 2012
[8] T K Ho ldquoRandom decision forestsrdquo in Proceedings of the 3rdInternational Conference on Document Analysis and Recogni-tion vol 1 pp 278ndash282 1995
[9] T K Ho ldquoThe random subspace method for constructingdecision forestsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 20 no 8 pp 832ndash844 1998
[11] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006
[12] RGenuer J-M Poggi andC Tuleau-Malot ldquoVariable selectionusing random forestsrdquoPattern Recognition Letters vol 31 no 14pp 2225ndash2236 2010
[13] B Xu J Z Huang GWilliams QWang and Y Ye ldquoClassifyingvery high-dimensional data with random forests built fromsmall subspacesrdquo International Journal ofDataWarehousing andMining vol 8 no 2 pp 44ndash63 2012
[14] Y Ye Q Wu J Zhexue Huang M K Ng and X Li ldquoStratifiedsampling for feature subspace selection in random forests forhigh dimensional datardquo Pattern Recognition vol 46 no 3 pp769ndash787 2013
[15] X Chen Y Ye X Xu and J Z Huang ldquoA feature groupweighting method for subspace clustering of high-dimensionaldatardquo Pattern Recognition vol 45 no 1 pp 434ndash446 2012
[16] D Amaratunga J Cabrera and Y-S Lee ldquoEnriched randomforestsrdquo Bioinformatics vol 240 no 18 pp 2010ndash2014 2008
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.
18 The Scientific World Journal
[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: lasso and elastic-net regularized generalized linear models," R package version 1.1, 2010, http://CRAN.R-project.org/package=glmnet.
scores higher than the shadow over multiple permutations. This method has been presented in [24, 25].
In each node of the trees, each shadow A_j shares approximately the same properties as the corresponding X_j, but it is independent of Y and consequently has approximately the same probability of being selected as a splitting candidate. This feature permutation method can reduce the bias due to different measurement levels of X_j, according to the p-value, and can yield a correct ranking of features according to their importance.
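The shadow construction above can be sketched as follows. This is an illustrative Python sketch with a hypothetical function name, not the authors' implementation (the paper's experiments use R): each shadow column is a row-permuted copy of a predictor column, so it keeps the column's marginal distribution and measurement level while being independent of the response.

```python
import random

def extend_with_shadows(X, seed=42):
    """Build the extended dataset [X, A] of 2M dimensions: each shadow
    column A_j is a row-permuted copy of X_j, keeping X_j's marginal
    distribution but destroying any link with the response Y."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    shadow_cols = []
    for j in range(m):
        col = [X[i][j] for i in range(n)]
        rng.shuffle(col)  # permute the rows of this column only
        shadow_cols.append(col)
    return [X[i] + [shadow_cols[j][i] for j in range(m)] for i in range(n)]
```

Applied to an n × M matrix, this returns an n × 2M matrix whose last M columns are permuted copies of the first M.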
4.3. Unbiased Feature Weighting for Subspace Selection. Given the p-values of all features, we first set a significance level as the threshold θ, for instance θ = 0.05. Any feature whose p-value is greater than θ is considered an uninformative feature and is removed from the system; otherwise, its relationship with Y is assessed. We now consider the set of features X obtained from L after discarding all uninformative features.
Second, we find the best subset of features, which is highly related to the response feature; a correlation measure, the function χ²(X, Y), is used to test the association between the categorical response feature and each feature X_j. Each observation is allocated to one cell of a two-dimensional array of cells (called a contingency table) according to the values of (X, Y). If there are r rows and c columns in the table and N is the total number of samples, the value of the test statistic is
χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_{ij} − E_{ij})² / E_{ij}. (8)
For the test of independence, a chi-squared probability of less than or equal to 0.05 is commonly interpreted as grounds for rejecting the hypothesis that the row variable is independent of the column feature.
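For concreteness, the statistic in (8) can be computed from a contingency table as in the following sketch (illustrative only; in R, chisq.test performs the same computation):

```python
def chi_squared(table):
    """Pearson chi-squared statistic for an r x c contingency table.
    table[i][j] is the observed count O_ij; under independence the
    expected count is E_ij = (row_i total * column_j total) / N."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)  # total sample count N
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    stat = 0.0
    for i in range(r):
        for j in range(c):
            e = row_tot[i] * col_tot[j] / n
            stat += (table[i][j] - e) ** 2 / e
    return stat
```

For example, chi_squared([[10, 20], [20, 10]]) ≈ 6.67, which exceeds the 0.05 critical value of 3.84 for one degree of freedom, so independence would be rejected.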
Let X_s be the best subset of features: we collect all features X_j whose p-value is smaller than or equal to 0.05, as a result of the χ² statistical test according to (8). The remaining features, X \ X_s, are added to X_w; this approach is described in Algorithm 2. We independently sample features from the two subsets and put them together as the subspace features for splitting the data at any node, recursively. The two subsets partition the set of informative features in the data without irrelevant features. Given X_s and X_w, at each node we randomly select mtry (mtry > 1) features from the two groups of features. For a given subspace size, we can choose proportions between highly informative features and weakly informative features that depend on the sizes of the two groups, that is, mtry_s = ⌈mtry × (|X_s|/|X|)⌉ and mtry_w = ⌊mtry × (|X_w|/|X|)⌋, where |X_s| and |X_w| are the numbers of features in the groups of highly informative features X_s and weakly informative features X_w, respectively, and |X| is the number of informative features in the input dataset. These features are merged to form the feature subspace for splitting the node.
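The proportional split of the subspace between the two groups can be sketched as follows (hypothetical function and variable names; a sketch under the proportions just described, not the authors' code):

```python
import math
import random

def sample_subspace(strong, weak, mtry, seed=0):
    """Draw mtry candidate features for a node split: mtry_s from the
    highly informative group X_s and mtry_w from the weakly informative
    group X_w, in proportion to the group sizes."""
    rng = random.Random(seed)
    total = len(strong) + len(weak)
    mtry_s = math.ceil(mtry * len(strong) / total)   # ceil for X_s
    mtry_w = math.floor(mtry * len(weak) / total)    # floor for X_w
    return (rng.sample(strong, min(mtry_s, len(strong))) +
            rng.sample(weak, min(mtry_w, len(weak))))
```

With 30 strong and 70 weak features and mtry = 10, this draws 3 strong and 7 weak candidates.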
4.4. Our Proposed RF Algorithm. In this section, we present our new random forest algorithm, called xRF, which uses the new unbiased feature sampling method to generate splits at the nodes of CART trees [2]. The proposed algorithm includes the following main steps: (i) weighting the features using the feature permutation method, (ii) identifying all unbiased features and partitioning them into the two groups X_s and X_w, (iii) building the RF using subspaces containing features taken randomly and separately from X_s and X_w, and (iv) classifying new data. The new algorithm is summarized as follows.
(1) Generate the extended dataset SX_A of 2M dimensions by permuting the corresponding predictor feature values to create the shadow features.
(2) Build a random forest model RF from {SX_A, Y} and compute R replicates of the raw importance scores of all predictor features and shadows with RF. Extract the maximum importance score of each replicate to form the comparison sample IS^max_A of R elements.
(3) For each predictor feature, take the R importance scores and compute the Wilcoxon test to get the p-value, that is, the weight of the feature.
(4) Given a significance level threshold θ, discard all uninformative features.
(5) Partition the remaining features into the two subsets X_s and X_w, as described in Algorithm 2.
(6) Sample the training set L with replacement to generate the bagged samples L_1, L_2, ..., L_K.
(7) For each L_k, grow a CART tree T_k as follows:
(a) At each node, select a subspace of mtry (mtry > 1) features randomly and separately from X_s and X_w, and use the subspace features as candidates for splitting the node.
(b) Each tree is grown nondeterministically, without pruning, until the minimum node size n_min is reached.
(8) Given X = x_new, use (1) to predict the response value.
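Steps (4) and (5) amount to a two-stage filter on the p-values. A minimal sketch, assuming two hypothetical inputs that hold the permutation-test p-values (from the Wilcoxon comparison) and the χ²-test p-values per feature:

```python
def partition_features(p_perm, p_chi2, theta=0.05):
    """Drop uninformative features (permutation p-value > theta), then
    split the survivors into the strong group X_s (chi-squared p-value
    <= 0.05) and the weak group X_w (the rest).

    p_perm, p_chi2: dicts mapping feature name -> p-value."""
    informative = [j for j in p_perm if p_perm[j] <= theta]
    strong = [j for j in informative if p_chi2[j] <= 0.05]
    weak = [j for j in informative if p_chi2[j] > 0.05]
    return strong, weak
```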
5. Experiments
5.1. Datasets. Real-world datasets, including image datasets and microarray datasets, were used in our experiments. Image classification and object recognition are important problems in computer vision. We conducted experiments on four benchmark image datasets, including the Caltech categories dataset (http://www.vision.caltech.edu/html-files/archive.html), the Horse dataset (http://pascal.inrialpes.fr/data/horses), the extended YaleB database [26], and the AT&T ORL dataset [27].
For the Caltech dataset, we use a subset of 100 images from the Caltech face dataset and 100 images from the Caltech background dataset, following the setting in the ICCV short course (http://people.csail.mit.edu/torralba/shortCourseRLOC). The extended YaleB database consists of 2414 face images of 38 individuals captured under various lighting conditions. Each image has been cropped to a size of 192 × 168 pixels
and normalized. The Horse dataset consists of 170 images containing horses, for the positive class, and 170 images of the background, for the negative class. The AT&T ORL dataset includes 400 face images of 40 persons.
In the experiments, we use a bag-of-words representation of image features for the Caltech and the Horse datasets. To obtain feature vectors using the bag-of-words method, image patches (subwindows) are sampled from the training images at the detected interest points or on a dense grid. A visual descriptor is then applied to these patches to extract the local visual features. A clustering technique is used to cluster these features, and the cluster centers are used as visual code words to form the visual codebook. An image is then represented as a histogram of these visual words. A classifier is then learned from this feature set for classification.
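The final representation step above can be sketched in toy form. The sketch below assumes one-dimensional descriptors and a hypothetical two-word codebook purely for illustration; real descriptors (e.g., SIFT) are high dimensional, but the nearest-center assignment and histogram logic are the same:

```python
def bow_histogram(descriptors, codebook):
    """Map each local descriptor to its nearest visual word and return
    the normalized histogram of word counts (the image's feature vector)."""
    hist = [0] * len(codebook)
    for d in descriptors:
        # nearest codebook center by absolute distance (1-D toy case)
        nearest = min(range(len(codebook)), key=lambda k: abs(d - codebook[k]))
        hist[nearest] += 1
    total = sum(hist)
    return [h / total for h in hist]
```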
In our experiments, traditional k-means quantization is used to produce the visual codebook. The number of cluster centers can be adjusted to produce different vocabularies, that is, dimensions of the feature vectors. For the Caltech and Horse datasets, nine codebook sizes were used in the experiments to create 18 datasets, as follows: CaltechM300, CaltechM500, CaltechM1000, CaltechM3000, CaltechM5000, CaltechM7000, CaltechM10000, CaltechM12000, CaltechM15000 and HorseM300, HorseM500, HorseM1000, HorseM3000, HorseM5000, HorseM7000, HorseM10000, HorseM12000, HorseM15000, where the number following M denotes the codebook size.
For the face datasets, we use two types of features: eigenfaces [28] and random features (randomly sampled pixels from the images). We used four groups of datasets with four different numbers of dimensions: M30, M56, M120, and M504. In total, we created 16 subdatasets.
Table 1: Description of the real-world datasets, sorted by the number of features and grouped into two groups: microarray data and real-world datasets.
The properties of the remaining datasets are summarized in Table 1. The Fbis dataset was compiled from the archive of the Foreign Broadcast Information Service, and the La1s and La2s
datasets were taken from the archive of the Los Angeles Times for TREC-5 (http://trec.nist.gov). The ten gene datasets used are described in [11, 17]; they are high dimensional and fall within the category of classification problems that deal with a large number of features and a small number of samples. Regarding the characteristics of the datasets given in Table 1, each of the subdatasets Fbis, La1s, and La2s was split individually into a training and a testing dataset.
5.2. Evaluation Methods. We calculated measures such as the error bound (c/s²), the strength (s), and the correlation (ρ), according to the formulas given in Breiman's method [1]. The correlation measure indicates the independence of trees in a forest, whereas the average strength corresponds to the accuracy of individual trees. Lower correlation and higher strength result in a reduction of the general error bound measured by c/s², which indicates a highly accurate RF model.
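In Breiman's notation the ratio is simply the mean tree correlation divided by the squared strength; a one-line sketch (the function name is ours, and the quantities are assumed to be the out-of-bag estimates described above):

```python
def cs2_error_bound(mean_correlation, strength):
    """Breiman's c/s2 ratio: mean correlation rho_bar over squared
    strength s^2. It tracks the generalization-error bound
    PE* <= rho_bar * (1 - s^2) / s^2, so lower correlation and higher
    strength both shrink it."""
    return mean_correlation / strength ** 2
```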
Two measures are also used to evaluate the accuracy of prediction on the test datasets: one is the area under the curve (AUC), and the other is the test accuracy (Acc), defined as

Acc = (1/N) Σ_{i=1}^{N} I(Q(d_i, y_i) − max_{j≠y_i} Q(d_i, j) > 0), (9)

where I(·) is the indicator function and Q(d_i, j) = Σ_{k=1}^{K} I(h_k(d_i) = j) is the number of votes for d_i ∈ D_t on class j, h_k is the kth tree classifier, N is the number of samples in the test data D_t, and y_i indicates the true class of d_i.
5.3. Experimental Settings. The latest R packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model was available in the RRF R package. The wsRF model, which uses the weighted sampling method [13], was intended to solve classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. From each fold, we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold dataset. The mtry and n_min parameters were set to √M and 1, respectively. The experimental results were evaluated with two measures: the AUC and the test accuracy according to (9).
We compared, across a wide range, the performances on the 10 gene datasets used in [11]. The results from the application of GRRF, varSelRF, and LASSO logistic regression on the ten gene datasets are presented in [17]. These three gene selection methods used the RF R package [30] as the classifier. For the comparison of the methods, we used the same settings as presented in [17]; for the coefficient γ, we used a value of 0.1, because GRRF(0.1) has shown competitive accuracy [17] when applied to the 10 gene datasets. One hundred models were generated with different seeds from each training dataset, and each model contained 1000 trees. The mtry and n_min parameters had the same settings as on the image datasets. From each of the datasets, two-thirds of the data were randomly selected for training. The other one-third of the dataset was used to validate the models. For
comparison, Breiman's RF method, the weighted sampling random forest wsRF model, and the xRF model were used in the experiments. The guided regularized random forest GRRF [17] and two well-known feature selection methods that use RF as a classifier, namely, varSelRF [31] and LASSO logistic regression [32], were also used to evaluate the accuracy of prediction on high-dimensional datasets.
On the remaining datasets, the prediction performances of the ten random forest models were evaluated; each one was built with 500 trees. The number of feature candidates to split a node was mtry = ⌈log₂(M) + 1⌉. The minimal node size n_min was 1. The xRF model, with the new unbiased feature sampling method, is a new implementation. We implemented the xRF model as multithread processes, while the other models were run as single-thread processes. We used R to call the corresponding C/C++ functions. All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel Xeon CPU E5620 2.40 GHz (16 cores, 4 MB cache) and 32 GB main memory.
5.4. Results on Image Datasets. Figures 1 and 2 show the average accuracy plots of the recognition rates of the models on different subdatasets of the YaleB and ORL datasets. The GRRF model produced slightly better results on the subdataset ORLRandomM120 and on the ORL dataset using eigenfaces, and it showed competitive accuracy with the xRF model in some cases in both the YaleB and ORL datasets, for example, YaleBEigenM120, ORLRandomM56, and ORLRandomM120. The reason could be that the truly informative features in this kind of dataset were many. Therefore, when the informative feature set was large, the chance of selecting informative features in the subspace increased, which in turn increased the average recognition rates of the GRRF model. However, the xRF model produced the best results in the remaining cases. The effect of the new approach to feature subspace selection is clearly demonstrated in these results, although these datasets are not high dimensional.
Figures 3 and 5 present the box plots of the test accuracy (mean ± std-dev), and Figures 4 and 6 show the box plots of the AUC measures of the models on the 18 image subdatasets of Caltech and Horse, respectively. From these figures, we can observe that the accuracy and the AUC measures of the GRRF, wsRF, and xRF models increased on all high-dimensional subdatasets when the selected subspace mtry was not too large. This implies that when the number of features in the subspace is small, the proportion of informative features in the feature subspace is comparatively large in the three models. There is then a high chance that highly informative features are selected in the trees, so the overall performance of individual trees is increased. In Breiman's method, many randomly selected subspaces may not contain informative features, which affects the performance of the trees grown from these subspaces. It can be seen that the xRF model outperformed the other random forest models on these subdatasets in increasing the test accuracy and the AUC measures. This is because the new unbiased feature sampling was used in generating trees in the xRF model; the feature subspace provided enough highly informative
[Figure 1 (plots omitted): Recognition rates (%) of the models RF, GRRF, wsRF, and xRF against the feature dimension of the subdatasets. (a) YaleB + eigenface; (b) YaleB + randomface. Subdatasets: YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504 and YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, YaleBRandomfaceM504.]
[Figure 2 (plots omitted): Recognition rates (%) of the models RF, GRRF, wsRF, and xRF against the feature dimension of the subdatasets. (a) ORL + eigenface; (b) ORL + randomface. Subdatasets: ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504 and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, ORLRandomM504.]
features at any level of the decision trees. The effect of the unbiased feature selection method is clearly demonstrated in these results.
Table 2 shows the results of c/s² against the codebook size on the Caltech and Horse datasets. In a random forest, each tree is grown from a bagged training sample, and out-of-bag estimates were used to evaluate the strength, correlation, and c/s². The GRRF model was not considered in this experiment because this method aims to find a small subset of features, and the same RF model in the R package [30] is used as its classifier. We compared the xRF model with two kinds of random forest models, RF and wsRF. From this table, we can observe that the lowest c/s² values occurred when the wsRF model was applied to the Caltech dataset. However, the xRF model produced the lowest error bound on the Horse dataset. These results demonstrate that the new unbiased feature sampling method can reduce the upper bound of the generalization error in random forests.
Table 3 presents the prediction accuracies (mean ± std-dev) of the models on the subdatasets CaltechM3000, HorseM3000, YaleBEigenfaceM504, YaleBRandomfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. In these experiments, we used the four models to generate random forests with different sizes, from 20 trees to 200 trees. For each size, we used each model to generate 10 random forests for the 10-fold cross-validation and computed the average accuracy of the 10 results. The GRRF model showed slightly better results on YaleBEigenfaceM504 with
[Figure 3 (plots omitted): Box plots of the test accuracy (%) of the models RF, GRRF, wsRF, and xRF on the nine Caltech subdatasets, CaltechM300 to CaltechM15000.]
different tree sizes. The wsRF model produced the best prediction performance in some cases when applied to the small subdatasets YaleBEigenfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. However, the xRF model produced, respectively, the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forest models in classification accuracy in most cases across all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy measures changed little when the number of trees was varied.
5.5. Results on Microarray Datasets. Table 5 shows the average test results, in terms of accuracy, of the 100 random forest models computed according to (9) on the gene datasets. The average number of genes selected by the xRF model from 100 repetitions for each dataset is shown on the right of Table 5, divided into two groups: X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on two gene datasets, srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most cases on the remaining gene datasets.
[Figure 4 (plots omitted): Box plots of the AUC measures of the models RF, GRRF, wsRF, and xRF on the nine Caltech subdatasets.]
The detailed results, containing the median and the variance values, are presented in Figure 7 with box plots. Only the GRRF model was used for this comparison; the LASSO logistic regression and the varSelRF feature selection method were not considered in this experiment, because their accuracies are lower than that of the GRRF model, as shown in [17]. We can see that the xRF model achieved the highest average prediction accuracy on nine datasets out of ten. Its result was significantly different on the prostate gene dataset, and its variance was also smaller than those of the other models.
Figure 8 shows the box plots of the c/s² error bound of the RF, wsRF, and xRF models on the ten gene datasets from 100 repetitions. The wsRF model obtained a lower error bound rate on five of the ten gene datasets. The xRF model produced a significantly different error bound rate on two gene datasets and obtained the lowest error rate on three datasets. This implies that when the optimal parameters, such as mtry =
⌈√M⌉ and n_min = 1, were used in growing trees, the number of genes in the subspace was not small, out-of-bag data were used in prediction, and the results comparatively favored the xRF model.
5.6. Comparison of Prediction Performance for Various Numbers of Features and Trees. Table 6 shows the average c/s² error bound and test accuracy results of 10 repetitions of the random forest models on the three large datasets. The xRF model produced the lowest error c/s² on the La1s dataset,
[Figure 5 (plots omitted): Box plots of the test accuracy (%) of the models RF, GRRF, wsRF, and xRF on the nine Horse subdatasets, HorseM300 to HorseM15000.]
while the wsRF model showed a lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy compared to the other models; this model also produced a large c/s² error when the small subspace size mtry = ⌈log₂(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest prediction accuracy on all three large datasets.
Figure 9 shows the plots of the performance curves of the RF models as the number of trees and the number of features increase. The number of trees was increased stepwise by 20 trees, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees was increased stepwise by 50 trees, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features, each consisting of a random sum of five inputs, varied from 5 to 100, and for each, 200 trees were combined. The vertical line in each plot indicates the subspace size mtry = ⌈log₂(M) + 1⌉. This subspace was suggested by Breiman [1] for the case of low-dimensional datasets. Three feature selection methods, namely, GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that when the mtry value is large, the computational time that the GRRF and varSelRF models require to deal with large high-dimensional datasets is too long [17].
[Figure 6 (plots omitted): Box plots of the AUC measures of the models RF, GRRF, wsRF, and xRF on the nine Horse subdatasets.]
It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace mtry = ⌈log₂(M) + 1⌉ was used. However, the xRF model is better than the wsRF model in increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve higher prediction accuracy, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve the best prediction performance. These empirical results indicate that, for applications on high-dimensional data, when the xRF model uses a small subspace, the achieved results can be satisfactory.
However, the RF model, which uses the simple sampling method for feature selection [1], could achieve good prediction performance only if it is provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested
Table 2: The c/s² error bound results of the random forest models against the codebook size on the Caltech and Horse datasets. The bold value in each row indicates the best result.
[Figure 7 (plots omitted): Box plots of the test accuracy of the models on the ten gene datasets.]
Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.
Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.
Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group X_s
Table 6: The prediction accuracy and error bound c/s² of the models using a small subspace mtry = ⌈log₂(M) + 1⌉; better values are in bold.
[Figure 8 (plots omitted): Box plots of the c/s² error bound for the models applied to the 10 gene datasets.]
by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered one of the contributions of our work.
6. Conclusions
We have presented a new feature subspace selection method for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach to unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,
[Figure 9 (plots omitted): The prediction accuracy (%) of the three random forest models RF, wsRF, and xRF against the number of trees (left) and the number of features (right) on the three datasets: (a) Fbis, (b) La2s, (c) La1s. The vertical line in each right-hand plot marks the subspace size log₂(M) + 1.]
especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.
For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported in part by NSFC under Grant no. 61203294 and by Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision," funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 102.01-2011.17.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
[17] H Deng and G Runger ldquoGene selection with guided regular-ized random forestrdquo Pattern Recognition vol 46 no 12 pp3483ndash3489 2013
[18] C Strobl ldquoStatistical sources of variable selection bias inclassification trees based on the gini indexrdquo Tech Rep SFB 3862005 httpepububuni-muenchendearchive0000178901paper 420pdf
[19] C Strobl A-L Boulesteix and T Augustin ldquoUnbiased splitselection for classification trees based on the gini indexrdquoComputational Statistics amp Data Analysis vol 520 no 1 pp483ndash501 2007
[20] C Strobl A-L Boulesteix A Zeileis and T Hothorn ldquoBiasin random forest variable importance measures illustrationssources and a solutionrdquo BMC Bioinformatics vol 8 article 252007
[21] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008
[22] T Hothorn K Hornik and A Zeileis Party a laboratoryfor recursive partytioning r package version 09-9999 2011httpcranr-projectorgpackage=party
[23] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics vol 10 no 6 pp 80ndash83 1945
[24] T-TNguyen J ZHuang andT TNguyen ldquoTwo-level quantileregression forests for bias correction in range predictionrdquoMachine Learning 2014
[25] T-T Nguyen J Z Huang K Imran M J Li and GWilliams ldquoExtensions to quantile regression forests for veryhigh-dimensional datardquo in Advances in Knowledge Discoveryand Data Mining vol 8444 of Lecture Notes in ComputerScience pp 247ndash258 Springer Berlin Germany 2014
[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001
[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994
[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991
[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237
18 The Scientific World Journal
[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002
[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml
[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet
and normalized. The Horse dataset consists of 170 images containing horses for the positive class and 170 background images for the negative class. The AT&T ORL dataset includes 400 face images of 40 persons.
In the experiments, we use a bag-of-words representation of image features for the Caltech and the Horse datasets. To obtain feature vectors with the bag-of-words method, image patches (subwindows) are sampled from the training images at detected interest points or on a dense grid. A visual descriptor is then applied to these patches to extract the local visual features. A clustering technique is then used to cluster these descriptors, and the cluster centers are used as visual code words to form a visual codebook. An image is then represented as a histogram of these visual words, and a classifier is learned from this feature set for classification.
In our experiments, traditional k-means quantization is used to produce the visual codebook. The number of cluster centers can be adjusted to produce different vocabularies, that is, different dimensions of the feature vectors. For the Caltech and Horse datasets, nine codebook sizes were used in the experiments to create 18 datasets as follows: CaltechM300, CaltechM500, CaltechM1000, CaltechM3000, CaltechM5000, CaltechM7000, CaltechM10000, CaltechM12000, CaltechM15000 and HorseM300, HorseM500, HorseM1000, HorseM3000, HorseM5000, HorseM7000, HorseM10000, HorseM12000, HorseM15000, where M denotes the codebook size.
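The codebook pipeline described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the descriptors are random stand-ins for real patch descriptors, the descriptor dimension (64), codebook size (8), and patch counts are placeholders, and a few Lloyd iterations of plain k-means stand in for the full quantization step.

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Cluster local descriptors with plain k-means; the centers are the visual words."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center (squared Euclidean distance)
        dist = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):  # move each center to the mean of its members
            if (labels == j).any():
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Represent one image as a histogram of its patches' nearest visual words."""
    dist = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.bincount(dist.argmin(axis=1), minlength=len(centers))

rng = np.random.default_rng(1)
train_patches = rng.normal(size=(500, 64))  # stand-ins for real patch descriptors
codebook = build_codebook(train_patches, k=8)
image_patches = rng.normal(size=(120, 64))  # patches sampled from one image
hist = bow_histogram(image_patches, codebook)  # the image's 8-dimensional feature vector
```

Growing the codebook size k lengthens the histogram, which is exactly how the different feature dimensions (M300 through M15000) above are produced.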
For the face datasets, we use two types of features: eigenface [28] and random features (randomly sampled pixels from the images). We used four groups of datasets with four different numbers of dimensions: M30, M56, M120, and M504. In total, we created 16 subdatasets.
Table 1: Description of the real-world datasets, sorted by the number of features and grouped into two groups: microarray data and real-world datasets.
The properties of the remaining datasets are summarized in Table 1. The Fbis dataset was compiled from the archive of the Foreign Broadcast Information Service, and the La1s, La2s
The Scientific World Journal 7
datasets were taken from the archive of the Los Angeles Times for TREC-5 (http://trec.nist.gov). The ten gene datasets used are described in [11, 17]; they are high dimensional and fall within a category of classification problems that deal with a large number of features and small samples. Regarding the characteristics of the datasets given in Table 1, each of the subdatasets Fbis, La1s, and La2s was split individually into a training and a testing dataset.
5.2. Evaluation Methods. We calculated measures such as the error bound (c/s2), strength (s), and correlation (ρ) according to the formulas given in Breiman's method [1]. The correlation measures indicate the independence of trees in a forest, whereas the average strength corresponds to the accuracy of individual trees. Lower correlation and higher strength result in a reduction of the general error bound measured by (c/s2), which indicates a highly accurate RF model.
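Breiman's quantities can be made concrete with a small helper. The two formulas below are his c/s2 ratio and the generalization-error bound ρ̄(1 − s²)/s²; the numeric strength and correlation values are illustrative, not taken from the paper.

```python
def cs2_ratio(rho_bar, s):
    """Breiman's c/s2 ratio: mean tree correlation divided by squared strength."""
    return rho_bar / s ** 2

def error_bound(rho_bar, s):
    """Breiman's upper bound on the generalization error: rho_bar * (1 - s^2) / s^2."""
    return rho_bar * (1 - s ** 2) / s ** 2

# Illustrative values: lowering correlation and raising strength both shrink the bound.
weak_forest = error_bound(0.4, 0.5)    # 0.4 * 0.75 / 0.25 = 1.2
strong_forest = error_bound(0.2, 0.8)  # 0.2 * 0.36 / 0.64 = 0.1125
```

This is why the experiments report c/s2: a forest whose trees are individually strong yet mutually decorrelated gets a small ratio and hence a small error bound.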
Two measures are also used to evaluate the accuracy of prediction on the test datasets: one is the area under the curve (AUC), and the other is the test accuracy (Acc), defined as

Acc = (1/N) ∑_{i=1}^{N} I(Q(d_i, y_i) = max_j Q(d_i, j)),  (9)

where I(⋅) is the indicator function and Q(d_i, j) = ∑_{k=1}^{K} I(h_k(d_i) = j) is the number of votes for d_i ∈ D^t on class j, h_k is the kth tree classifier, N is the number of samples in the test data D^t, and y_i indicates the true class of d_i.
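A minimal sketch of computing the majority-vote test accuracy from the per-tree predictions follows; the three toy "trees" and five test samples are invented for illustration.

```python
import numpy as np

def vote_accuracy(tree_preds, y_true):
    """Test accuracy Acc: majority vote over the K tree predictions.

    tree_preds has shape (K, N): row k holds tree h_k's predicted class per sample.
    """
    K, N = tree_preds.shape
    hits = 0
    for i in range(N):
        votes = np.bincount(tree_preds[:, i])     # Q(d_i, j) for each class j
        hits += int(votes.argmax() == y_true[i])  # I(majority vote equals y_i)
    return hits / N

# three toy "trees" voting on five test samples (invented for illustration)
preds = np.array([[0, 1, 1, 0, 1],
                  [0, 1, 0, 0, 1],
                  [1, 1, 1, 0, 0]])
truth = np.array([0, 1, 1, 0, 1])
acc = vote_accuracy(preds, truth)  # every majority vote is correct here -> 1.0
```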
5.3. Experimental Settings. The latest R packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model was available in the RRF R package. The wsRF model, which uses the weighted sampling method [13], was intended to solve classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. From each fold, we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold dataset. The mtry and n_min parameters were set to √M and 1, respectively. The experimental results were evaluated with two measures, AUC and the test accuracy, according to (9).
We compared across a wide range the performances on the 10 gene datasets used in [11]. The results from the application of GRRF, varSelRF, and LASSO logistic regression on the ten gene datasets are presented in [17]. These three gene selection methods used the RF R package [30] as the classifier. For the comparison of the methods, we used the same settings as presented in [17]; for the coefficient γ we used the value 0.1, because GRRF(0.1) has shown competitive accuracy [17] when applied to the 10 gene datasets. One hundred models were generated with different seeds from each training dataset, and each model contained 1000 trees. The mtry and n_min parameters had the same settings as on the image datasets. From each of the datasets, two-thirds of the data were randomly selected for training, and the other one-third was used to validate the models. For comparison, Breiman's RF method, the weighted sampling random forest wsRF model, and the xRF model were used in the experiments. The guided regularized random forest GRRF [17] and two well-known feature selection methods using RF as a classifier, namely, varSelRF [31] and LASSO logistic regression [32], are also used to evaluate the accuracy of prediction on high-dimensional datasets.
For the remaining datasets, the prediction performances of the ten random forest models were evaluated; each one was built with 500 trees. The number of feature candidates to split a node was mtry = ⌈log2(M) + 1⌉, and the minimal node size n_min was 1. The xRF model with the new unbiased feature sampling method is a new implementation. We implemented the xRF model as multithread processes, while the other models were run as single-thread processes, and we used R to call the corresponding C/C++ functions. All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel Xeon CPU E5620 2.40GHz, 16 cores, 4MB cache, and 32GB main memory.
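The two subspace sizes used across the experiments can be computed directly; the sample M values below are examples drawn from the dataset dimensions mentioned in this section, not new results.

```python
import math

def mtry_sqrt(M):
    """ceil(sqrt(M)): the subspace size used for the image and gene experiments."""
    return math.ceil(math.sqrt(M))

def mtry_log(M):
    """ceil(log2(M) + 1): the small subspace used for the large text datasets."""
    return math.ceil(math.log2(M) + 1)

# Example dimensions taken from this section (504, 3000, and 15000 features).
sizes = {M: (mtry_sqrt(M), mtry_log(M)) for M in (504, 3000, 15000)}
```

The gap between the two settings grows quickly with M (123 versus 15 candidate features at M = 15000), which is why the log-sized subspace is so much cheaper on the large text datasets.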
5.4. Results on Image Datasets. Figures 1 and 2 show the average recognition rates of the models on different subdatasets of YaleB and ORL. The GRRF model produced slightly better results on the subdataset ORLRandomM120 and on the ORL dataset using eigenface, and it showed competitive accuracy with the xRF model in some cases in both the YaleB and ORL datasets, for example, YaleBEigenM120, ORLRandomM56, and ORLRandomM120. The reason could be that the truly informative features in this kind of dataset were many. Therefore, when the informative feature set was large, the chance of selecting informative features in the subspace increased, which in turn increased the average recognition rates of the GRRF model. However, the xRF model produced the best results in the remaining cases. The effect of the new approach for feature subspace selection is clearly demonstrated in these results, although these datasets are not high dimensional.
Figures 3 and 5 present the box plots of the test accuracy (mean ± std-dev), and Figures 4 and 6 show the box plots of the AUC measures of the models on the 18 image subdatasets of Caltech and Horse, respectively. From these figures, we can observe that the accuracy and the AUC measures of the models GRRF, wsRF, and xRF increased on all high-dimensional subdatasets when the selected subspace mtry was not too large. This implies that when the number of features in the subspace is small, the proportion of informative features in the feature subspace is comparatively large in the three models. There is then a high chance that highly informative features are selected in the trees, so the overall performance of individual trees is increased. In Breiman's method, many randomly selected subspaces may not contain informative features, which affects the performance of trees grown from these subspaces. It can be seen that the xRF model outperformed the other random forest models on these subdatasets in increasing the test accuracy and the AUC measures. This was because the new unbiased feature sampling was used in generating trees in the xRF model; the feature subspace provided enough highly informative
Figure 1: Recognition rates (%) of the models (RF, GRRF, wsRF, xRF) against the feature dimension of the YaleB subdatasets, namely, YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504 and YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, YaleBRandomfaceM504. (a) YaleB + eigenface; (b) YaleB + randomface.
Figure 2: Recognition rates (%) of the models (RF, GRRF, wsRF, xRF) against the feature dimension of the ORL subdatasets, namely, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504 and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, ORLRandomM504. (a) ORL + eigenface; (b) ORL + randomface.
features at all levels of the decision trees. The effect of the unbiased feature selection method is clearly demonstrated in these results.
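The argument about informative features in random subspaces can be quantified. If g of the M features are informative and a subspace of mtry features is drawn uniformly without replacement, the chance that it contains at least one informative feature is 1 − C(M − g, mtry)/C(M, mtry). The numbers below are illustrative, not measurements from the paper's datasets.

```python
from math import comb

def p_informative(M, g, mtry):
    """P(a uniformly drawn size-mtry subspace contains at least one of g informative features)."""
    return 1 - comb(M - g, mtry) / comb(M, mtry)

# Illustrative numbers: with few informative features in a high-dimensional
# dataset, a plain random subspace frequently misses all of them.
p_few = p_informative(10000, 20, 100)    # roughly 0.18
p_many = p_informative(10000, 500, 100)  # close to 1
```

This is the quantitative version of the observation above: plain random sampling works when informative features are plentiful, and an informed (weighted) sampling is needed when they are rare.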
Table 2 shows the results of c/s2 against the codebook size on the Caltech and Horse datasets. In a random forest, each tree is grown from a bagging training set, and out-of-bag estimates were used to evaluate the strength, correlation, and c/s2. The GRRF model was not considered in this experiment, because this method aims to find a small subset of features, and the same RF model in the R package [30] is used as a classifier. We compared the xRF model with two kinds of random forest models, RF and wsRF. From this table, we can observe that the lowest c/s2 values occurred when the wsRF model was applied to the Caltech dataset. However, the xRF model produced the lowest error bound on the Horse dataset. These results demonstrate that the new unbiased feature sampling method can reduce the upper bound of the generalization error in random forests.
Table 3 presents the prediction accuracies (mean ± std-dev) of the models on the subdatasets CaltechM3000, HorseM3000, YaleBEigenfaceM504, YaleBRandomfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. In these experiments, we used the four models to generate random forests of different sizes, from 20 trees to 200 trees. For each size, we used each model to generate 10 random forests for the 10-fold cross-validation and computed the average accuracy of the 10 results. The GRRF model showed slightly better results on YaleBEigenfaceM504 with
Figure 3: Box plots of the test accuracy of the models on the nine Caltech subdatasets.
different tree sizes. The wsRF model produced the best prediction performance in some cases when applied to the small subdatasets YaleBEigenfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. However, the xRF model produced, respectively, the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forest models in classification accuracy in most cases across all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy measures changed little when the number of trees was varied.
5.5. Results on Microarray Datasets. Table 5 shows the average test accuracy of the 100 random forest models, computed according to (9), on the gene datasets. The average number of genes selected by the xRF model from 100 repetitions for each dataset is shown on the right of Table 5, divided into two groups, X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on the two gene datasets srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most cases on the remaining gene datasets.
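Sampling a splitting subspace from the strong and weak groups can be sketched as below. The fixed 80/20 allocation and the uniform draws inside each group are simplifying assumptions made for illustration; the paper's Algorithm 2 instead weights features by their statistical measures.

```python
import numpy as np

def sample_subspace(strong, weak, mtry, frac_strong=0.8, rng=None):
    """Draw a node-splitting subspace from the strong (X_s) and weak (X_w) groups.

    frac_strong is a hypothetical knob, not a value from the paper: it fixes
    what share of the mtry candidates comes from the strong group.
    """
    rng = rng or np.random.default_rng()
    n_s = min(len(strong), max(1, round(mtry * frac_strong)))
    n_w = min(len(weak), mtry - n_s)
    return np.concatenate([
        rng.choice(strong, size=n_s, replace=False),
        rng.choice(weak, size=n_w, replace=False),
    ])

rng = np.random.default_rng(42)
X_s = np.arange(0, 40)    # indices of "strong" features
X_w = np.arange(40, 400)  # indices of "weak" features
subspace = sample_subspace(X_s, X_w, mtry=20, rng=rng)  # 16 strong + 4 weak indices
```

Biasing the draw toward X_s keeps informative features in almost every subspace while the X_w draws preserve some diversity among trees.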
Figure 4: Box plots of the AUC measures of the models on the nine Caltech subdatasets.
The detailed results, containing the median and variance values, are presented in Figure 7 with box plots. Only the GRRF model was used for this comparison; the LASSO logistic regression and varSelRF feature selection methods were not considered in this experiment, because their accuracies are lower than that of the GRRF model, as shown in [17]. We can see that the xRF model achieved the highest average prediction accuracy on nine datasets out of ten. Its result was significantly different on the prostate gene dataset, and the variance was also smaller than those of the other models.
Figure 8 shows the box plots of the (c/s2) error bound of the RF, wsRF, and xRF models on the ten gene datasets from 100 repetitions. The wsRF model obtained a lower error bound on five gene datasets out of 10. The xRF model produced a significantly different error bound on two gene datasets and obtained the lowest error rate on three datasets. This implies that when the optimal parameters, such as mtry = ⌈√M⌉ and n_min = 1, were used in growing trees, the number of genes in the subspace was not small, and since out-of-bag data was used in prediction, the results were comparatively favorable to the xRF model.
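The out-of-bag bookkeeping itself can be sketched as follows. Each "tree" here is only a majority-class stub trained on its bootstrap sample, a deliberate simplification; the point is the mechanics: a tree votes only on the samples its bagging draw left out, and those votes give an error estimate without a separate test set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 50
y = rng.integers(0, 2, size=n)        # toy binary labels
votes = np.zeros((n, 2))

for _ in range(K):
    boot = rng.integers(0, n, size=n)       # bagging sample, drawn with replacement
    oob = np.setdiff1d(np.arange(n), boot)  # samples this tree never saw
    # stub "tree": always predicts the majority class of its bootstrap sample
    pred = int(np.bincount(y[boot], minlength=2).argmax())
    votes[oob, pred] += 1                   # a tree votes only on its out-of-bag rows

covered = votes.sum(axis=1) > 0             # rows left out by at least one tree
oob_acc = float((votes[covered].argmax(axis=1) == y[covered]).mean())
```

Replacing the stub with real fitted trees yields exactly the out-of-bag estimates used for strength, correlation, and c/s2 above.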
5.6. Comparison of Prediction Performance for Various Numbers of Features and Trees. Table 6 shows the average c/s2 error bound and test accuracy of 10 repetitions of the random forest models on the three large datasets. The xRF model produced the lowest error bound c/s2 on the dataset La1s,
Figure 5: Box plots of the test accuracy of the models on the nine Horse subdatasets.
while the wsRF model showed a lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy compared to the other models; it also produced a large c/s2 error when the small subspace size mtry = ⌈log2(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest prediction accuracy on all three large datasets.
Figure 9 shows the performance curves of the RF models as the number of trees and features increases. The number of trees was increased stepwise by 20 trees, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees increased stepwise by 50 trees, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features, each consisting of a random sum of five inputs, varied from 5 to 100, and for each, 200 trees were combined. The vertical line in each plot indicates the subspace size mtry = ⌈log2(M) + 1⌉. This subspace was suggested by Breiman [1] for the case of low-dimensional datasets. The three feature selection methods, namely, GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that when the mtry value is large, the computational time required by the GRRF and varSelRF models to deal with large high-dimensional datasets was too long [17].
Figure 6: Box plots of the AUC measures of the models on the nine Horse subdatasets.
It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace mtry = ⌈log2(M) + 1⌉ was used. However, the xRF model was better than the wsRF model at increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve higher prediction accuracy, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve its best prediction performance. These empirical results indicate that, for applications on high-dimensional data, when the xRF model uses a small subspace, the achieved results can be satisfactory.

However, the RF model using the simple sampling method for feature selection [1] could achieve good prediction performance only if it is provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested
Table 2: The (c/s2) error bound results of the random forest models against the codebook size on the Caltech and Horse datasets. The bold value in each row indicates the best result.
Figure 7: Box plots of the test accuracy of the models on the ten gene datasets.
Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.
Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.
Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The numbers of genes in the strong group X_s and the weak group X_w are also shown.
Table 6: The prediction accuracy and error bound c/s2 of the models using a small subspace mtry = ⌈log2(M) + 1⌉; better values are in bold.
Figure 8: Box plots of the (c/s2) error bound for the models applied to the 10 gene datasets.
by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered one of the contributions of our work.
6. Conclusions

We have presented a new method of feature subspace selection for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach for unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (or noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and AUC measures for classification problems,
Figure 9: The prediction accuracy of the three random forest models (RF, wsRF, xRF) against the number of trees and features on the three datasets: (a) Fbis, (b) La2s, (c) La1s. The vertical line in each plot indicates the subspace size ⌈log2(M) + 1⌉.
especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.
For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on cloud platforms to deal with big data, that is, hundreds of millions of samples and features.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 102.01-2011.17.
References

[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589-604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321-329, 1994.
[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139-157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23-37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819-823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278-282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, 1998.
[11] R. Diaz-Uriarte and S. Alvarez de Andres, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225-2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44-63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769-787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434-446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010-2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483-3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483-501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80-83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247-258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643-660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138-142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.
[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18-22, 2002.
[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: lasso and elastic-net regularized generalized linear models," R package, 2010, http://CRAN.R-project.org/package=glmnet.
datasets were taken from the archive of the LosAngeles Timesfor TREC-5 (httptrecnistgov) The ten gene datasets areused and described in [11 17] they are always high dimen-sional and fall within a category of classification problemswhich deal with large number of features and small samplesRegarding the characteristics of the datasets given in Table 1the proportion of the subdatasets namely Fbis La1s La2swas used individually for a training and testing dataset
52 Evaluation Methods We calculated some measures suchas error bound (1198881199042) strength (119904) and correlation (120588)according to the formulas given in Breimanrsquos method [1]The correlation measures indicate the independence of treesin a forest whereas the average strength corresponds to theaccuracy of individual trees Lower correlation and higherstrength result in a reduction of general error bound mea-sured by (1198881199042) which indicates a high accuracy RF model
Two measures are also used to evaluate the accuracy of prediction on the test datasets: one is the area under the curve (AUC) and the other is the test accuracy (Acc), defined as

Acc = (1/N) ∑_{i=1}^{N} I(Q(d_i, y_i) − max_{j≠y_i} Q(d_i, j) > 0),   (9)

where I(·) is the indicator function and Q(d_i, j) = ∑_{k=1}^{K} I(h_k(d_i) = j) is the number of votes for d_i ∈ D_t on class j, h_k is the kth tree classifier, N is the number of samples in the test data D_t, and y_i indicates the true class of d_i.
5.3. Experimental Settings. The latest R packages randomForest and RRF [29, 30] were used in the R environment to conduct these experiments. The GRRF model was available in the RRF R package. The wsRF model, which used the weighted sampling method [13], was intended to solve classification problems. For the image datasets, 10-fold cross-validation was used to evaluate the prediction performance of the models. From each fold, we built the models with 500 trees, and the feature partition for subspace selection in Algorithm 2 was recalculated on each training fold dataset. The mtry and n_min parameters were set to √M and 1, respectively. The experimental results were evaluated with two measures: AUC and the test accuracy according to (9).
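The 10-fold protocol above can be sketched as an index split; this stdlib-only helper is an illustrative assumption, not code from the paper.

```python
import random

def kfold_indices(n_samples, k=10, seed=1):
    """Split sample indices into k roughly equal folds for k-fold
    cross-validation. The seed and helper name are illustrative."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    # each round: one fold is the test set, the rest form the training set
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        splits.append((train, test))
    return splits
```

Each of the k rounds trains a 500-tree forest on the train indices and evaluates AUC and Acc on the held-out fold.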
We compared the performances on the 10 gene datasets used in [11] across a wide range of settings. The results from applying GRRF, varSelRF, and LASSO logistic regression to the ten gene datasets are presented in [17]. These three gene selection methods used the RF R package [30] as the classifier. For the comparison of the methods, we used the same settings as presented in [17]; for the coefficient γ we used a value of 0.1, because GRRF(0.1) has shown competitive accuracy [17] when applied to the 10 gene datasets. One hundred models were generated with different seeds from each training dataset, and each model contained 1000 trees. The mtry and n_min parameters had the same settings as on the image datasets. From each of the datasets, two-thirds of the data were randomly selected for training; the other one-third was used to validate the models. For
comparison, Breiman's RF method, the weighted sampling random forest (wsRF) model, and the xRF model were used in the experiments. The guided regularized random forest (GRRF) [17] and two well-known feature selection methods that use RF as a classifier, namely, varSelRF [31] and LASSO logistic regression [32], were also used to evaluate the accuracy of prediction on high-dimensional datasets.
On the remaining datasets, the prediction performances of the ten random forest models were evaluated; each one was built with 500 trees. The number of feature candidates to split a node was mtry = ⌈log₂(M) + 1⌉, and the minimal node size n_min was 1. The xRF model with the new unbiased feature sampling method is a new implementation. We implemented the xRF model as multithread processes, while the other models were run as single-thread processes. We used R to call the corresponding C/C++ functions. All experiments were conducted on six 64-bit Linux machines, each equipped with an Intel® Xeon® CPU E5620 2.40 GHz, 16 cores, 4 MB cache, and 32 GB main memory.
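The two subspace-size rules used in the experiments, Breiman's √M and the much smaller ⌈log₂(M) + 1⌉, can be compared numerically; the helper names below are illustrative.

```python
import math

def mtry_sqrt(M):
    """Breiman's suggested subspace size for classification."""
    return math.ceil(math.sqrt(M))

def mtry_log(M):
    """The small subspace size used for high-dimensional data here."""
    return math.ceil(math.log2(M) + 1)

# e.g. at M = 15000 features, mtry_sqrt gives 123 candidates per split,
# while mtry_log gives only 15 -- a large saving in split evaluations.
```

This gap is why the log rule makes tree building far cheaper on the M = 15000 subdatasets, provided the sampled subspace still contains informative features.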
5.4. Results on Image Datasets. Figures 1 and 2 show the average recognition rates of the models on different subdatasets of the YaleB and ORL datasets. The GRRF model produced slightly better results on the subdataset ORLRandomM120 and on the ORL dataset using eigenface, and it showed competitive accuracy with the xRF model in some cases on both the YaleB and ORL datasets, for example, YaleBEigenM120, ORLRandomM56, and ORLRandomM120. The reason could be that truly informative features in this kind of dataset were many. Therefore, when the informative feature set was large, the chance of selecting informative features in the subspace increased, which in turn increased the average recognition rates of the GRRF model. However, the xRF model produced the best results in the remaining cases. The effect of the new approach for feature subspace selection is clearly demonstrated in these results, although these datasets are not high dimensional.
Figures 3 and 5 present the box plots of the test accuracy (mean ± std-dev), and Figures 4 and 6 show the box plots of the AUC measures of the models on the 18 image subdatasets of Caltech and Horse, respectively. From these figures, we can observe that the accuracy and the AUC measures of the models GRRF, wsRF, and xRF increased on all high-dimensional subdatasets when the selected subspace mtry was not too large. This implies that when the number of features in the subspace is small, the proportion of informative features in the feature subspace is comparatively large in the three models. There is then a high chance that highly informative features are selected in the trees, so the overall performance of individual trees is increased. In Breiman's method, many randomly selected subspaces may not contain informative features, which affects the performance of trees grown from these subspaces. It can be seen that the xRF model outperformed the other random forest models on these subdatasets in both the test accuracy and the AUC measures. This was because the new unbiased feature sampling was used in generating trees in the xRF model; the feature subspace provided enough highly informative
Figure 1: Recognition rates of the models on the YaleB subdatasets, namely, YaleBEigenfaceM30, YaleBEigenfaceM56, YaleBEigenfaceM120, YaleBEigenfaceM504 and YaleBRandomfaceM30, YaleBRandomfaceM56, YaleBRandomfaceM120, YaleBRandomfaceM504. (a) YaleB + eigenface; (b) YaleB + randomface.
Figure 2: Recognition rates of the models on the ORL subdatasets, namely, ORLEigenfaceM30, ORLEigenM56, ORLEigenM120, ORLEigenM504 and ORLRandomfaceM30, ORLRandomM56, ORLRandomM120, ORLRandomM504. (a) ORL + eigenface; (b) ORL + randomface.
features at all levels of the decision trees. The effect of the unbiased feature selection method is clearly demonstrated in these results.
Table 2 shows the results of c/s² against the number of codebook sizes on the Caltech and Horse datasets. In a random forest, each tree was grown from bagged training data, and out-of-bag estimates were used to evaluate the strength, correlation, and c/s². The GRRF model was not considered in this experiment because this method aims to find a small subset of features, and the same RF model in the R package [30] is used as a classifier. We compared the xRF model with two kinds of random forest models: RF and wsRF. From this table, we can observe that the lowest c/s² values occurred when the wsRF model was applied to the Caltech dataset. However, the xRF model produced the lowest error bound on the Horse dataset. These results demonstrate that the new unbiased feature sampling method can reduce the upper bound of the generalization error in random forests.
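The bagging/out-of-bag split behind these estimates can be sketched as follows; for large n, roughly e⁻¹ ≈ 37% of the samples fall out-of-bag for each tree. The helper is an illustrative assumption, not the paper's code.

```python
import random

def bootstrap_split(n_samples, seed=0):
    """Draw a bootstrap (bagging) sample of size n with replacement;
    samples never drawn form the out-of-bag (OOB) set used for the
    strength/correlation estimates. Name and seed are illustrative."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob
```

Each tree votes only on its own OOB samples, giving an internal estimate of strength, correlation, and c/s² without a separate test set.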
Table 3 presents the prediction accuracies (mean ± std-dev) of the models on the subdatasets CaltechM3000, HorseM3000, YaleBEigenfaceM504, YaleBRandomfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. In these experiments, we used the four models to generate random forests of different sizes, from 20 trees to 200 trees. For each size, we used each model to generate 10 random forests for the 10-fold cross-validation and computed the average accuracy of the 10 results. The GRRF model showed slightly better results on YaleBEigenfaceM504 with
Figure 3: Box plots of the test accuracy of the nine Caltech subdatasets.
different tree sizes. The wsRF model produced the best prediction performance in some cases when applied to the small subdatasets YaleBEigenfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. However, the xRF model produced, respectively, the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forest models in classification accuracy in most cases across all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy measures changed only slightly when the number of trees was varied.
5.5. Results on Microarray Datasets. Table 5 shows the average test accuracy of the 100 random forest models, computed according to (9), on the gene datasets. The average number of genes selected by the xRF model over 100 repetitions for each dataset is shown on the right of Table 5, divided into two groups: X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on the two gene datasets srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most of the remaining gene datasets.
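Drawing a subspace from the strong group X_s and the weak group X_w can be sketched as below. The paper states only that features are sampled from both groups; the 80/20 split, the helper name, and its parameters are assumptions for illustration.

```python
import random

def sample_subspace(strong, weak, mtry, p_strong=0.8, seed=None):
    """Draw an mtry-sized feature subspace from the strong group X_s and
    the weak group X_w. The p_strong proportion is an illustrative
    assumption, not the paper's exact rule."""
    rng = random.Random(seed)
    n_strong = min(len(strong), max(1, round(p_strong * mtry)))
    picked = rng.sample(strong, n_strong)          # informative features
    picked += rng.sample(weak, min(len(weak), mtry - n_strong))
    return picked
```

Biasing the draw toward X_s keeps at least some highly informative features in every split candidate set, which is the mechanism credited above for the accuracy gains.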
Figure 4: Box plots of the AUC measures of the nine Caltech subdatasets.
The detailed results, containing the median and the variance values, are presented in Figure 7 with box plots. Only the GRRF model was used for this comparison; the LASSO logistic regression and varSelRF feature selection methods were not considered in this experiment because their accuracies are lower than that of the GRRF model, as shown in [17]. We can see that the xRF model achieved the highest average prediction accuracy on nine datasets out of ten. Its result was significantly different on the prostate gene dataset, and the variance was also smaller than those of the other models.
Figure 8 shows the box plots of the c/s² error bound of the RF, wsRF, and xRF models on the ten gene datasets from 100 repetitions. The wsRF model obtained a lower error bound on five of the ten gene datasets. The xRF model produced a significantly different error bound on two gene datasets and obtained the lowest error rate on three datasets. This implies that when optimal parameters such as mtry = ⌈√M⌉ and n_min = 1 were used in growing trees, the number of genes in the subspace was not small, out-of-bag data was used in prediction, and the results comparatively favored the xRF model.
5.6. Comparison of Prediction Performance for Various Numbers of Features and Trees. Table 6 shows the average c/s² error bound and test accuracy results over 10 repetitions of the random forest models on the three large datasets. The xRF model produced the lowest c/s² error on the La1s dataset,
Figure 5: Box plots of the test accuracy of the nine Horse subdatasets.
while the wsRF model showed a lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy compared to the other models; it also produced a large c/s² error when the small subspace size mtry = ⌈log₂(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest prediction accuracy on all three large datasets.
Figure 9 shows the performance curves of the RF models as the number of trees and features increases. The number of trees was increased stepwise by 20 trees, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees was increased stepwise by 50 trees, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features, each consisting of a random sum of five inputs, varied from 5 to 100, and for each, 200 trees were combined. The vertical line in each plot indicates the size of a feature subspace mtry = ⌈log₂(M) + 1⌉. This subspace was suggested by Breiman [1] for the case of low-dimensional datasets. Three feature selection methods, namely, GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that, when the mtry value is large, the computational time required by the GRRF and varSelRF models to deal with large high-dimensional datasets was too long [17].
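The stepwise schedules described above can be written down explicitly; these helpers are illustrative assumptions mirroring the text.

```python
def tree_schedule(dataset):
    """Stepwise tree counts used in the Figure 9 experiments:
    20..200 by 20 for La1s, 50..500 by 50 for the other datasets."""
    if dataset == "La1s":
        return list(range(20, 201, 20))
    return list(range(50, 501, 50))

# feature counts swept on the right-hand panels: 5..100, 200 trees each
feature_schedule = list(range(5, 101, 5))
```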
Figure 6: Box plots of the AUC measures of the nine Horse subdatasets.
It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace mtry = ⌈log₂(M) + 1⌉ was used. However, the xRF model is better than the wsRF model at increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve higher prediction accuracy, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve its best prediction performance. These empirical results indicate that, for applications on high-dimensional data, the results achieved by the xRF model with a small subspace can be satisfactory.

However, the RF model, which uses the simple sampling method for feature selection [1], could achieve good prediction performance only if it was provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested
Table 2: The c/s² error bound results of the random forest models against the number of codebook sizes on the Caltech and Horse datasets. The bold value in each row indicates the best result.
Figure 7: Box plots of the test accuracy of the models on the ten gene datasets.
Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.
Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.
Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The numbers of genes in the strong group X_s and the weak group X_w are also shown.
Table 6: The prediction accuracy and error bound c/s² of the models using a small subspace mtry = ⌈log₂(M) + 1⌉; better values are in bold. (Columns: dataset, c/s² error bound, test accuracy (%), X_s.)
Figure 8: Box plots of the c/s² error bound for the models applied to the 10 gene datasets.
by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered one of the contributions of our work.
6. Conclusions
We have presented a new method of feature subspace selection for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach to unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality by using a defined threshold to remove uninformative features (noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,
Figure 9: The prediction accuracy of the three random forest models against the number of trees (left) and the number of features (right) on the three datasets: (a) Fbis, (b) La2s, (c) La1s. The vertical line in each right-hand panel marks mtry = log₂(M) + 1.
especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.
For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 10201-201117.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001
[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994
[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991
[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237
18 The Scientific World Journal
[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002
[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml
[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet
100 200 300 400 500Feature dimension of subdatasets
Reco
gniti
on ra
te (
)
MethodsRFGRRF
wsRFxRF
YaleB + eigenface
(a)
MethodsRFGRRF
wsRFxRF
85
90
95
100 200 300 400 500Feature dimension of subdatasets
Reco
gniti
on ra
te (
)
YaleB + randomface
(b)
Figure 1 Recognition rates of themodels on the YaleB subdatasets namely YaleBEigenfaceM30 YaleBEigenfaceM56 YaleBEigenfaceM120YaleBEigenfaceM504 and YaleBRandomfaceM30 YaleBRandomfaceM56 YaleBRandomfaceM120 and YaleBRandomfaceM504
850
875
900
925
950
100 200 300 400 500Feature dimension of subdatasets
Reco
gniti
on ra
te (
)
ORL + eigenface
MethodsRFGRRF
wsRFxRF
(a)
850
875
900
925
950
100 200 300 400 500Feature dimension of subdatasets
Reco
gniti
on ra
te (
)
ORL + randomface
MethodsRFGRRF
wsRFxRF
(b)
Figure 2 Recognition rates of the models on the ORL subdatasets namely ORLEigenfaceM30 ORLEigenM56 ORLEigenM120ORLEigenM504 and ORLRandomfaceM30 ORLRandomM56 ORLRandomM120 and ORLRandomM504
features at any levels of the decision trees The effect of theunbiased feature selection method is clearly demonstrated inthese results
Table 2 shows the results of 1198881199042 against the numberof codebook sizes on the Caltech and Horse datasets In arandom forest the tree was grown from a bagging trainingdata Out-of-bag estimates were used to evaluate the strengthcorrelation and 1198881199042 The GRRF model was not consideredin this experiment because this method aims to find a smallsubset of features and the same RF model in 119877-package [30]is used as a classifier We compared the xRF model withtwo kinds of random forest models RF and wsRF From thistable we can observe that the lowest 1198881199042 values occurredwhen the wsRF model was applied to the Caltech dataset
However the xRFmodel produced the lowest error bound onthe119867119900119903119904119890 dataset These results demonstrate the reason thatthe new unbiased feature sampling method can reduce theupper bound of the generalization error in random forests
Table 3 presents the prediction accuracies (mean plusmn
std-dev) of the models on subdatasets CaltechM3000HorseM3000 YaleBEigenfaceM504 YaleBrandomfaceM504ORLEigenfaceM504 and ORLrandomfaceM504 In theseexperiments we used the four models to generate randomforests with different sizes from 20 trees to 200 trees Forthe same size we used each model to generate 10 ran-dom forests for the 10-fold cross-validation and computedthe average accuracy of the 10 results The GRRF modelshowed slightly better results on YaleBEigenfaceM504 with
The Scientific World Journal 9
70
80
90
100Ac
cura
cy (
)
70
80
90
100
Accu
racy
()
75
80
85
90
95
100
RF GRRF wsRF xRFCaltechM1000
RF GRRF wsRF xRFCaltechM7000
RF GRRF wsRF xRFCaltechM15000
RF GRRF wsRF xRFCaltechM12000
RF GRRF wsRF xRFCaltechM1000
RF GRRF wsRF xRFCaltechM5000
RF GRRF wsRF xRFCaltechM3000
RF GRRF wsRF xRFCaltechM500
RF GRRF wsRF xRFCaltechM300
Accu
racy
()
70
80
90
100
Accu
racy
()
75
80
85
90
95
100Ac
cura
cy (
)
70
80
90
100
Accu
racy
()
70
80
90
100
Accu
racy
()
60
70
80
90
100
Accu
racy
()
50
60
70
80
90Ac
cura
cy (
)
Figure 3 Box plots the test accuracy of the nine Caltech subdatasets
different tree sizes The wsRF model produced the bestprediction performance on some cases when applied to smallsubdatasets YaleBEigenfaceM504 ORLEigenfaceM504 andORLrandomfaceM504 However the xRF model producedrespectively the highest test accuracy on the remaining sub-datasets andAUCmeasures on high-dimensional subdatasetsCaltechM3000 and HorseM3000 as shown in Tables 3 and4 We can clearly see that the xRF model also outperformedother random forests models in classification accuracy onmost cases in all image datasets Another observation is thatthe new method is more stable in classification performancebecause the mean and variance of the test accuracy measureswere minor changed when varying the number of trees
55 Results on Microarray Datasets Table 5 shows the aver-age test results in terms of accuracy of the 100 random forestmodels computed according to (9) on the gene datasets Theaverage number of genes selected by the xRFmodel from 100repetitions for each dataset is shown on the right of Table 5divided into two groups X
119904(strong) and X
119908(weak) These
genes are used by the unbiased feature sampling method forgrowing trees in the xRF model LASSO logistic regressionwhich uses the RF model as a classifier showed fairly goodaccuracy on the two gene datasets srbct and leukemia TheGRRF model produced slightly better result on the prostategene dataset However the xRF model produced the bestaccuracy on most cases of the remaining gene datasets
10 The Scientific World Journal
085
090
095
100AU
C
075
080
085
090
095
100
AUC
085
090
095
100
RF GRRF wsRF xRFCaltechM1000
RF GRRF wsRF xRFCaltechM7000
RF GRRF wsRF xRFCaltechM15000
RF GRRF wsRF xRFCaltechM12000
RF GRRF wsRF xRFCaltechM1000
RF GRRF wsRF xRFCaltechM5000
RF GRRF wsRF xRFCaltechM3000
RF GRRF wsRF xRFCaltechM500
RF GRRF wsRF xRFCaltechM300
AUC
08
09
10
AUC
094
096
098
100AU
C
094
096
098
100
AUC
092
094
096
098
100
AUC
090
095
100
AUC
07
08
09
10AU
C
Figure 4 Box plots of the AUC measures of the nine Caltech subdatasets
The detailed results containing the median and thevariance values are presented in Figure 7 with box plotsOnly the GRRF model was used for this comparison theLASSO logistic regression and varSelRF method for featureselection were not considered in this experiment becausetheir accuracies are lower than that of the GRRF model asshown in [17] We can see that the xRF model achieved thehighest average accuracy of prediction on nine datasets out often Its result was significantly different on the prostate genedataset and the variance was also smaller than those of theother models
Figure 8 shows the box plots of the (1198881199042) error bound ofthe RF wsRF and xRF models on the ten gene datasets from100 repetitionsThe wsRF model obtained lower error bound
rate on five gene datasets out of 10 The xRF model produceda significantly different error bound rate on two gene datasetsand obtained the lowest error rate on three datasets Thisimplies that when the optimal parameters such as 119898119905119903119910 =
lceilradic119872rceil and 119899min = 1 were used in growing trees the numberof genes in the subspace was not small and out-of-bag datawas used in prediction and the results were comparativelyfavored to the xRF model
56 Comparison of Prediction Performance for Various Num-bers of Features and Trees Table 6 shows the average 1198881199042error bound and accuracy test results of 10 repetitions ofrandom forest models on the three large datasets The xRFmodel produced the lowest error 1198881199042 on the dataset La1s
The Scientific World Journal 11
Figure 5: Box plots of the test accuracy of the nine Horse subdatasets (codebook sizes from 300 to 15000) for the RF, GRRF, wsRF, and xRF models.
while the wsRF model showed the lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy compared to the other models; it also produced a large c/s² error when the small subspace size mtry = ⌈log2(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the X_s and X_w columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest prediction accuracy on all three large datasets.
Figure 9 shows the performance curves of the RF models as the number of trees and features increases. The number of trees was increased stepwise by 20, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees was increased stepwise by 50, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features, each consisting of a random sum of five inputs, varied from 5 to 100, and for each setting 200 trees were combined. The vertical line in each plot indicates the subspace size mtry = ⌈log2(M) + 1⌉. This subspace size was suggested by Breiman [1] for low-dimensional datasets. Three feature selection methods, namely GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that, when the mtry value is large, the computational time required by the GRRF and varSelRF models on large high-dimensional datasets was too long [17].
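The two subspace sizes compared throughout this experiment, mtry = ⌈√M⌉ and mtry = ⌈log2(M) + 1⌉, diverge quickly as the number of features M grows, which is why the logarithmic subspace is so much cheaper on high-dimensional data. A quick sketch:

```python
import math

def mtry_sqrt(M):
    """Breiman's default subspace size for classification."""
    return math.ceil(math.sqrt(M))

def mtry_log(M):
    """The small logarithmic subspace size used in these experiments."""
    return math.ceil(math.log2(M) + 1)

# For feature counts on the order of the subdatasets used here,
# the logarithmic subspace is an order of magnitude smaller:
for M in (1000, 15000):
    print(M, mtry_sqrt(M), mtry_log(M))
```

At M = 15000 features, each node split examines 123 candidate features under the square-root rule but only 15 under the logarithmic rule.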
Figure 6: Box plots of the AUC measures of the nine Horse subdatasets for the RF, GRRF, wsRF, and xRF models.
It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace mtry = ⌈log2(M) + 1⌉ was used. However, the xRF model outperformed the wsRF model in increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve higher prediction accuracy, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve its best prediction performance. These empirical results indicate that, for applications on high-dimensional data, the xRF model can achieve satisfactory results even with a small subspace.
However, the RF model, using the simple sampling method for feature selection [1], could achieve good prediction performance only when provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested
Table 2: The c/s² error bound results of the random forest models against the codebook size on the Caltech and Horse datasets. The bold value in each row indicates the best result.
Figure 7: Box plots of the test accuracy of the models on the ten gene datasets.
Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.
Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.
Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The numbers of genes in the strong group X_s and the weak group X_w are also reported.
Table 6: The prediction accuracy and c/s² error bound of the models using a small subspace mtry = ⌈log2(M) + 1⌉; better values are in bold. (Columns: Dataset; c/s² error bound; test accuracy (%); X_s.)
Figure 8: Box plots of the c/s² error bound for the models applied to the 10 gene datasets.
by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered to be one of the contributions of our work.
6. Conclusions
We have presented a new method of feature subspace selection for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach to unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,
Figure 9: The prediction accuracy of the three random forest models (RF, wsRF, xRF) against the number of trees (left) and the number of features (right) on the three datasets: (a) Fbis, (b) La2s, (c) La1s. The vertical line in each right-hand plot marks the subspace size mtry = ⌈log2(M) + 1⌉.
especially for image and microarray datasets, in comparison with recently proposed random forests models, including RF, GRRF, and wsRF.
For future work, we think it would be desirable to increase the scalability of the proposed random forests algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments

This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 102.01-2011.17.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
[5] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.
[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: Lasso and elastic-net regularized generalized linear models," R package, 2010, http://CRAN.R-project.org/package=glmnet.
Figure 3: Box plots of the test accuracy of the nine Caltech subdatasets.
different tree sizes. The wsRF model produced the best prediction performance in some cases when applied to small subdatasets: YaleBEigenfaceM504, ORLEigenfaceM504, and ORLRandomfaceM504. However, the xRF model produced, respectively, the highest test accuracy on the remaining subdatasets and the highest AUC measures on the high-dimensional subdatasets CaltechM3000 and HorseM3000, as shown in Tables 3 and 4. We can clearly see that the xRF model also outperformed the other random forests models in classification accuracy in most cases across all image datasets. Another observation is that the new method is more stable in classification performance, because the mean and variance of the test accuracy measures changed only slightly when varying the number of trees.
5.5. Results on Microarray Datasets. Table 5 shows the average test results, in terms of accuracy, of the 100 random forest models computed according to (9) on the gene datasets. The average number of genes selected by the xRF model from 100 repetitions for each dataset is shown on the right of Table 5, divided into two groups: X_s (strong) and X_w (weak). These genes are used by the unbiased feature sampling method for growing trees in the xRF model. LASSO logistic regression, which uses the RF model as a classifier, showed fairly good accuracy on the two gene datasets srbct and leukemia. The GRRF model produced a slightly better result on the prostate gene dataset. However, the xRF model produced the best accuracy in most of the remaining gene datasets.
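The X_s / X_w partition drives xRF's subspace sampling: features that survive the p-value screen are split into a strong group and a weak group, and each tree's candidate subspace is drawn from both. The paper gives no code, so the sketch below is our own illustration of that idea; the 80/20 split between the two groups and all names are assumptions for illustration, not the paper's exact weighting scheme.

```python
import random

def sample_subspace(strong, weak, mtry, strong_frac=0.8, rng=random):
    """Draw a candidate feature subspace from the strong group X_s and
    the weak group X_w. strong_frac = 0.8 is an assumed illustrative
    weighting, not the paper's exact sampling probabilities."""
    n_strong = min(len(strong), round(mtry * strong_frac))
    n_weak = min(len(weak), mtry - n_strong)
    # sample without replacement from each group, then combine
    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)

# Hypothetical gene lists standing in for the screened feature groups:
X_s = ["g%d" % i for i in range(50)]          # strongly informative
X_w = ["g%d" % i for i in range(50, 200)]     # weakly informative
subspace = sample_subspace(X_s, X_w, mtry=10)
```

Mixing in a few weak features keeps tree diversity up while the strong features carry most of the split quality, which is the intuition behind weighted subspace sampling.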
10 The Scientific World Journal
085
090
095
100AU
C
075
080
085
090
095
100
AUC
085
090
095
100
RF GRRF wsRF xRFCaltechM1000
RF GRRF wsRF xRFCaltechM7000
RF GRRF wsRF xRFCaltechM15000
RF GRRF wsRF xRFCaltechM12000
RF GRRF wsRF xRFCaltechM1000
RF GRRF wsRF xRFCaltechM5000
RF GRRF wsRF xRFCaltechM3000
RF GRRF wsRF xRFCaltechM500
RF GRRF wsRF xRFCaltechM300
AUC
08
09
10
AUC
094
096
098
100AU
C
094
096
098
100
AUC
092
094
096
098
100
AUC
090
095
100
AUC
07
08
09
10AU
C
Figure 4 Box plots of the AUC measures of the nine Caltech subdatasets
The detailed results containing the median and thevariance values are presented in Figure 7 with box plotsOnly the GRRF model was used for this comparison theLASSO logistic regression and varSelRF method for featureselection were not considered in this experiment becausetheir accuracies are lower than that of the GRRF model asshown in [17] We can see that the xRF model achieved thehighest average accuracy of prediction on nine datasets out often Its result was significantly different on the prostate genedataset and the variance was also smaller than those of theother models
Figure 8 shows the box plots of the (1198881199042) error bound ofthe RF wsRF and xRF models on the ten gene datasets from100 repetitionsThe wsRF model obtained lower error bound
rate on five gene datasets out of 10 The xRF model produceda significantly different error bound rate on two gene datasetsand obtained the lowest error rate on three datasets Thisimplies that when the optimal parameters such as 119898119905119903119910 =
lceilradic119872rceil and 119899min = 1 were used in growing trees the numberof genes in the subspace was not small and out-of-bag datawas used in prediction and the results were comparativelyfavored to the xRF model
56 Comparison of Prediction Performance for Various Num-bers of Features and Trees Table 6 shows the average 1198881199042error bound and accuracy test results of 10 repetitions ofrandom forest models on the three large datasets The xRFmodel produced the lowest error 1198881199042 on the dataset La1s
The Scientific World Journal 11
60
70
80
Accu
racy
()
60
70
80
Accu
racy
()
70
80
90
RF GRRF wsRF xRFHorseM1000
RF GRRF wsRF xRFHorseM7000
RF GRRF wsRF xRFHorseM15000
RF GRRF wsRF xRFHorseM12000
RF GRRF wsRF xRFHorseM1000
RF GRRF wsRF xRFHorseM5000
RF GRRF wsRF xRFHorseM3000
RF GRRF wsRF xRFHorseM500
RF GRRF wsRF xRFHorseM300
Accu
racy
()
60
70
80
Accu
racy
()
60
70
80
90
Accu
racy
()
60
70
80
90
Accu
racy
()
70
80
90
Accu
racy
()
60
70
80
Accu
racy
()
60
70
80
Accu
racy
()
Figure 5 Box plots of the test accuracy of the nine Horse subdatasets
while the wsRF model showed the lower error bound onother two datasets Fbis andLa2sTheRFmodel demonstratedthe worst accuracy of prediction compared to the othermodels this model also produced a large 1198881199042 error whenthe small subspace size 119898119905119903119910 = lceillog
2(119872) + 1rceil was used to
build trees on the La1s and La2s datasets The number offeatures in the X
119904and X
119908columns on the right of Table 6
was used in the xRF model We can see that the xRF modelachieved the highest accuracy of prediction on all three largedatasets
Figure 9 shows the plots of the performance curves of theRF models when the number of trees and features increasesThe number of trees was increased stepwise by 20 treesfrom 20 to 200 when the models were applied to the La1s
dataset For the remaining data sets the number of treesincreased stepwise by 50 trees from 50 to 500 The numberof random features in a subspace was set to 119898119905119903119910 = lceilradic119872rceilThe number of features each consisting of a random sumof five inputs varied from 5 to 100 and for each 200 treeswere combined The vertical line in each plot indicates thesize of a subspace of features 119898119905119903119910 = lceillog
2(119872) + 1rceil
This subspace was suggested by Breiman [1] for the case oflow-dimensional datasets Three feature selection methodsnamely GRRF varSelRF and LASSO were not considered inthis experimentThemain reason is that when the119898119905119903119910 valueis large the computational time of the GRRF and varSelRFmodels required to deal with large high datasets was too long[17]
12 The Scientific World Journal
06
07
08
09AU
C
065
070
075
080
085
090
AUC
070
075
080
085
090
RF GRRF wsRF xRFHorseM1000
RF GRRF wsRF xRFHorseM7000
RF GRRF wsRF xRFHorseM15000
RF GRRF wsRF xRFHorseM12000
RF GRRF wsRF xRFHorseM1000
RF GRRF wsRF xRFHorseM5000
RF GRRF wsRF xRFHorseM3000
RF GRRF wsRF xRFHorseM500
RF GRRF wsRF xRFHorseM300
AUC
06
07
08
09
AUC
07
08
09AU
C
06
07
08
09
AUC
07
08
09
AUC
05
06
07
08
09
AUC
065
070
075
080
085
AUC
Figure 6 Box plots of the AUC measures of the nine Horse subdatasets
It can be seen that the xRF and wsRF models alwaysprovided good results and achieved higher prediction accu-racies when the subspace 119898119905119903119910 = lceillog
2(119872) + 1rceil was used
However the xRF model is better than the wsRF model inincreasing the prediction accuracy on the three classificationdatasetsThe RFmodel requires the larger number of featuresto achieve the higher accuracy of prediction as shown in theright of Figures 9(a) and 9(b) When the number of treesin a forests was varied the xRF model produced the bestresults on the Fbis and La2s datasets In the La1s datasetwhere the xRF model did not obtain the best results asshown in Figure 9(c) (left) the differences from the bestresults were minor From the right of Figures 9(a) 9(b)and 9(c) we can observe that the xRF model does not need
many features in the selected subspace to achieve the bestprediction performanceThese empirical results indicate thatfor application on high-dimensional data when the xRFmodel uses the small subspace the achieved results can besatisfactory
However the RF model using the simple samplingmethod for feature selection [1] could achieve good predic-tion performance only if it is provided with a much largersubspace as shown in the right part of Figures 9(a) and 9(b)Breiman suggested to use a subspace of size 119898119905119903119910 = radic119872 inclassification problemWith this size the computational timefor building a random forest is still too high especially forlarge high datasets In general when the xRF model is usedwith a feature subspace of the same size as the one suggested
The Scientific World Journal 13
Table 2 The (1198881199042) error bound results of random forest models against the number of codebook size on the Caltech and Horse datasetsThe bold value in each row indicates the best result
Figure 7 Box plots of test accuracy of the models on the ten gene datasets
14 The Scientific World Journal
Table 3 The prediction test accuracy (mean plusmn std-dev) of the models on the image datasets against the number of trees 119870 The numberof feature dimensions in each subdataset is fixed Numbers in bold are the best results
Table 4 AUC results (mean plusmn std-dev) of random forest models against the number of trees 119870 on the CaltechM3000 and HorseM3000subdatasets The bold value in each row indicates the best result
Table 5 Test accuracy results () of random forest models GRRF(01) varSelRF and LASSO logistic regression applied to gene datasetsThe average results of 100 repetitions were computed higher values are better The number of genes in the strong group X
Table 6The accuracy of prediction and error bound 1198881199042 of the models using a small subspace119898119905119903119910 = [log2(119872)+ 1] better values are bold
Dataset 1198881199042 Error bound Test accuracy () X119904
Figure 8 Box plots of (1198881199042) error bound for the models applied to the 10 gene datasets
by Breiman it demonstrates higher prediction accuracy andshorter computational time than those reported by BreimanThis achievement is considered to be one of the contributionsin our work
6 Conclusions
We have presented a new method for feature subspaceselection for building efficient random forest xRF model for
classification high-dimensional data Our main contributionis to make a new approach for unbiased feature samplingwhich selects the set of unbiased features for splitting anode when growing trees in the forests Furthermore thisnew unbiased feature selection method also reduces dimen-sionality using a defined threshold to remove uninformativefeatures (or noise) from the dataset Experimental resultshave demonstrated the improvements in increasing of the testaccuracy and the AUC measures for classification problems
16 The Scientific World Journal
70
75
80
85
50 100 150 200Number of trees
Accu
racy
()
70
75
80
85
25 50 75 100Number of features
Accu
racy
()
log(M) + 1
(a) Fbis
85
86
87
88
89
100 200 300 400 500Number of trees
Accu
racy
()
60
70
80
90
10 20 30 40 50Number of features
Accu
racy
()
log(M) + 1
(b) La2s
70
75
80
85
50 100 150 200Number of trees
Accu
racy
()
MethodsRFwsRFxRF
MethodsRFwsRFxRF
30
40
50
60
70
80
10 20 30 40 50Number of features
Accu
racy
() log(M) + 1
(c) La1s
Figure 9 The accuracy of prediction of the three random forests models against the number of trees and features on the three datasets
The Scientific World Journal 17
especially for image and microarray datasets in comparisonwith recent proposed random forests models including RFGRRF and wsRF
For futurework we think it would be desirable to increasethe scalability of the proposed random forests algorithm byparallelizing themon the cloud platform to deal with big datathat is hundreds of millions of samples and features
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
This research is supported in part by NSFC under Grantno 61203294 and Hanoi-DOST under the Grant no 01C-0701-2012-2 The author Thuy Thi Nguyen is supported bythe project ldquoSome Advanced Statistical Learning Techniquesfor Computer Visionrdquo funded by the National Foundation ofScience and Technology Development Vietnam under theGrant no 10201-201117
[2] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees CRC Press Boca Raton FlaUSA 1984
[3] H Kim and W-Y Loh ldquoClassification trees with unbiasedmultiway splitsrdquo Journal of the American Statistical Associationvol 96 no 454 pp 589ndash604 2001
[4] A PWhite andW Z Liu ldquoTechnical note bias in information-based measures in decision tree inductionrdquo Machine Learningvol 15 no 3 pp 321ndash329 1994
[5] T G Dietterich ldquoExperimental comparison of three methodsfor constructing ensembles of decision trees bagging boostingand randomizationrdquo Machine Learning vol 40 no 2 pp 139ndash157 2000
[6] Y Freund and R E Schapire ldquoA desicion-theoretic general-ization of on-line learning and an application to boostingrdquo inComputational Learning Theory pp 23ndash37 Springer 1995
[7] T-T Nguyen and T T Nguyen ldquoA real time license platedetection system based on boosting learning algorithmrdquo inProceedings of the 5th International Congress on Image and SignalProcessing (CISP rsquo12) pp 819ndash823 IEEE October 2012
[8] T K Ho ldquoRandom decision forestsrdquo in Proceedings of the 3rdInternational Conference on Document Analysis and Recogni-tion vol 1 pp 278ndash282 1995
[9] T K Ho ldquoThe random subspace method for constructingdecision forestsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 20 no 8 pp 832ndash844 1998
[11] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006
[12] RGenuer J-M Poggi andC Tuleau-Malot ldquoVariable selectionusing random forestsrdquoPattern Recognition Letters vol 31 no 14pp 2225ndash2236 2010
[13] B Xu J Z Huang GWilliams QWang and Y Ye ldquoClassifyingvery high-dimensional data with random forests built fromsmall subspacesrdquo International Journal ofDataWarehousing andMining vol 8 no 2 pp 44ndash63 2012
[14] Y Ye Q Wu J Zhexue Huang M K Ng and X Li ldquoStratifiedsampling for feature subspace selection in random forests forhigh dimensional datardquo Pattern Recognition vol 46 no 3 pp769ndash787 2013
[15] X Chen Y Ye X Xu and J Z Huang ldquoA feature groupweighting method for subspace clustering of high-dimensionaldatardquo Pattern Recognition vol 45 no 1 pp 434ndash446 2012
[16] D Amaratunga J Cabrera and Y-S Lee ldquoEnriched randomforestsrdquo Bioinformatics vol 240 no 18 pp 2010ndash2014 2008
[17] H Deng and G Runger ldquoGene selection with guided regular-ized random forestrdquo Pattern Recognition vol 46 no 12 pp3483ndash3489 2013
[18] C Strobl ldquoStatistical sources of variable selection bias inclassification trees based on the gini indexrdquo Tech Rep SFB 3862005 httpepububuni-muenchendearchive0000178901paper 420pdf
[19] C Strobl A-L Boulesteix and T Augustin ldquoUnbiased splitselection for classification trees based on the gini indexrdquoComputational Statistics amp Data Analysis vol 520 no 1 pp483ndash501 2007
[20] C Strobl A-L Boulesteix A Zeileis and T Hothorn ldquoBiasin random forest variable importance measures illustrationssources and a solutionrdquo BMC Bioinformatics vol 8 article 252007
[21] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008
[22] T Hothorn K Hornik and A Zeileis Party a laboratoryfor recursive partytioning r package version 09-9999 2011httpcranr-projectorgpackage=party
[23] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics vol 10 no 6 pp 80ndash83 1945
[24] T-TNguyen J ZHuang andT TNguyen ldquoTwo-level quantileregression forests for bias correction in range predictionrdquoMachine Learning 2014
[25] T-T Nguyen J Z Huang K Imran M J Li and GWilliams ldquoExtensions to quantile regression forests for veryhigh-dimensional datardquo in Advances in Knowledge Discoveryand Data Mining vol 8444 of Lecture Notes in ComputerScience pp 247ndash258 Springer Berlin Germany 2014
[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001
[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994
[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991
[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237
18 The Scientific World Journal
[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002
[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml
[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet
Figure 4 Box plots of the AUC measures of the nine Caltech subdatasets
The detailed results containing the median and thevariance values are presented in Figure 7 with box plotsOnly the GRRF model was used for this comparison theLASSO logistic regression and varSelRF method for featureselection were not considered in this experiment becausetheir accuracies are lower than that of the GRRF model asshown in [17] We can see that the xRF model achieved thehighest average accuracy of prediction on nine datasets out often Its result was significantly different on the prostate genedataset and the variance was also smaller than those of theother models
Figure 8 shows the box plots of the (1198881199042) error bound ofthe RF wsRF and xRF models on the ten gene datasets from100 repetitionsThe wsRF model obtained lower error bound
rate on five gene datasets out of 10 The xRF model produceda significantly different error bound rate on two gene datasetsand obtained the lowest error rate on three datasets Thisimplies that when the optimal parameters such as 119898119905119903119910 =
lceilradic119872rceil and 119899min = 1 were used in growing trees the numberof genes in the subspace was not small and out-of-bag datawas used in prediction and the results were comparativelyfavored to the xRF model
56 Comparison of Prediction Performance for Various Num-bers of Features and Trees Table 6 shows the average 1198881199042error bound and accuracy test results of 10 repetitions ofrandom forest models on the three large datasets The xRFmodel produced the lowest error 1198881199042 on the dataset La1s
The Scientific World Journal 11
60
70
80
Accu
racy
()
60
70
80
Accu
racy
()
70
80
90
RF GRRF wsRF xRFHorseM1000
RF GRRF wsRF xRFHorseM7000
RF GRRF wsRF xRFHorseM15000
RF GRRF wsRF xRFHorseM12000
RF GRRF wsRF xRFHorseM1000
RF GRRF wsRF xRFHorseM5000
RF GRRF wsRF xRFHorseM3000
RF GRRF wsRF xRFHorseM500
RF GRRF wsRF xRFHorseM300
Accu
racy
()
60
70
80
Accu
racy
()
60
70
80
90
Accu
racy
()
60
70
80
90
Accu
racy
()
70
80
90
Accu
racy
()
60
70
80
Accu
racy
()
60
70
80
Accu
racy
()
Figure 5 Box plots of the test accuracy of the nine Horse subdatasets
while the wsRF model showed a lower error bound on the other two datasets, Fbis and La2s. The RF model demonstrated the worst prediction accuracy compared to the other models; it also produced a large c/s2 error when the small subspace size mtry = ⌈log2(M) + 1⌉ was used to build trees on the La1s and La2s datasets. The numbers of features in the Xs and Xw columns on the right of Table 6 were used in the xRF model. We can see that the xRF model achieved the highest prediction accuracy on all three large datasets.
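The Xs and Xw columns denote the strong and weak feature groups from which xRF draws each node's candidate split features. A minimal sketch of such two-group weighted sampling follows; the 80/20 group preference and sampling with replacement are illustrative assumptions, not the paper's exact scheme:

```python
import random

def sample_subspace(strong, weak, mtry, p_strong=0.8, seed=None):
    """Draw mtry candidate features, preferring the strong group.

    Each draw comes from the strong group with probability p_strong,
    falling back to the weak group when the strong group is empty.
    Sampling is with replacement; both choices are illustrative
    simplifications of the paper's feature-weighting scheme."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(mtry):
        pool = strong if (strong and rng.random() < p_strong) else weak
        chosen.append(rng.choice(pool))
    return chosen

# e.g. feature indices 0-9 strong, 10-99 weak, subspace of size 5
print(sample_subspace(list(range(10)), list(range(10, 100)), 5, seed=1))
```

Because most draws land in the small strong group, the candidate set stays informative even when mtry is tiny relative to the full dimensionality.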
Figure 9 shows plots of the performance curves of the RF models as the number of trees and features increases. The number of trees was increased stepwise by 20 trees, from 20 to 200, when the models were applied to the La1s dataset. For the remaining datasets, the number of trees was increased stepwise by 50 trees, from 50 to 500. The number of random features in a subspace was set to mtry = ⌈√M⌉. The number of features, increasing in steps of five, varied from 5 to 100, and for each value 200 trees were combined. The vertical line in each plot indicates the subspace size mtry = ⌈log2(M) + 1⌉. This subspace size was suggested by Breiman [1] for the case of low-dimensional datasets. Three feature selection methods, namely GRRF, varSelRF, and LASSO, were not considered in this experiment. The main reason is that when the mtry value is large, the computational time required by the GRRF and varSelRF models to deal with large high-dimensional datasets is too long [17].
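The two sweeps just described (accuracy versus number of trees, and accuracy versus subspace size at a fixed 200-tree forest) can be organized with a simple grid driver. This is only a sketch of the experiment's structure; `fit_and_score` is a hypothetical stand-in for training one forest configuration and returning its test accuracy:

```python
def sweep(fit_and_score, tree_grid, feature_grid, fixed_trees=200):
    """Run the two experiment sweeps from this section.

    by_trees:    accuracy vs. forest size, mtry left at its default
                 (mtry=None stands for the library default here).
    by_features: accuracy vs. subspace size, forest size fixed."""
    by_trees = {k: fit_and_score(n_trees=k, mtry=None) for k in tree_grid}
    by_features = {m: fit_and_score(n_trees=fixed_trees, mtry=m)
                   for m in feature_grid}
    return by_trees, by_features

# usage with a dummy scorer standing in for real forest training
dummy = lambda n_trees, mtry: 0.8
acc_t, acc_f = sweep(dummy, range(50, 501, 50), range(5, 101, 5))
```

Plotting `acc_t` and `acc_f` against their keys reproduces the left and right panels of Figure 9, respectively.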
[Figure 6 spans this page: box plots of the AUC measures of the RF, GRRF, wsRF, and xRF models on each of the nine Horse subdatasets, HorseM300 through HorseM15000.]
Figure 6: Box plots of the AUC measures of the nine Horse subdatasets.
It can be seen that the xRF and wsRF models always provided good results and achieved higher prediction accuracies when the subspace size mtry = ⌈log2(M) + 1⌉ was used. However, the xRF model is better than the wsRF model at increasing the prediction accuracy on the three classification datasets. The RF model requires a larger number of features to achieve higher prediction accuracy, as shown on the right of Figures 9(a) and 9(b). When the number of trees in a forest was varied, the xRF model produced the best results on the Fbis and La2s datasets. On the La1s dataset, where the xRF model did not obtain the best results, as shown in Figure 9(c) (left), the differences from the best results were minor. From the right of Figures 9(a), 9(b), and 9(c), we can observe that the xRF model does not need many features in the selected subspace to achieve the best prediction performance. These empirical results indicate that, for applications on high-dimensional data, the results achieved when the xRF model uses a small subspace can be satisfactory.

However, the RF model, which uses the simple sampling method for feature selection [1], could achieve good prediction performance only if it is provided with a much larger subspace, as shown in the right part of Figures 9(a) and 9(b). Breiman suggested using a subspace of size mtry = √M in classification problems. With this size, the computational time for building a random forest is still too high, especially for large high-dimensional datasets. In general, when the xRF model is used with a feature subspace of the same size as the one suggested
Table 2: The c/s2 error bound results of the random forest models against the codebook size on the Caltech and Horse datasets. The bold value in each row indicates the best result.
Figure 7: Box plots of the test accuracy of the models on the ten gene datasets.
Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.
Table 4: AUC results (mean ± std-dev) of the random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.
Table 5: Test accuracy results (%) of the random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to the gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group Xs
Table 6: The prediction accuracy and c/s2 error bound of the models using a small subspace mtry = ⌈log2(M) + 1⌉; better values are bold.
Dataset | c/s2 error bound | Test accuracy (%) | Xs
Figure 8: Box plots of the c/s2 error bound for the models applied to the 10 gene datasets.
by Breiman, it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. This achievement is considered one of the contributions of our work.
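For reference, Breiman's bound behind the c/s2 figures relates the mean tree correlation ρ̄ and the strength s of the forest as PE* ≤ ρ̄(1 − s²)/s². Given estimates of the two quantities it can be evaluated directly (a sketch; estimating ρ̄ and s from out-of-bag predictions is omitted here):

```python
def cs2_error_bound(rho_bar, s):
    """Breiman's generalization-error bound for a random forest:
    PE* <= rho_bar * (1 - s**2) / s**2, where rho_bar is the mean
    pairwise correlation between trees and s the forest's strength."""
    if not 0 < s <= 1:
        raise ValueError("strength s must lie in (0, 1]")
    return rho_bar * (1 - s ** 2) / s ** 2

# lower correlation or higher strength tightens the bound
print(cs2_error_bound(0.2, 0.5))  # ≈ 0.6
```

This makes explicit why the comparisons above track both quantities: a model that grows less correlated yet individually strong trees achieves a smaller bound.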
6. Conclusions
We have presented a new method of feature subspace selection for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach for unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this new unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (or noise) from the dataset. Experimental results have demonstrated improvements in the test accuracy and the AUC measures for classification problems,
[Figure 9 spans this page: accuracy (%) of the RF, wsRF, and xRF models against the number of trees (left) and the number of features (right) for (a) Fbis, (b) La2s, and (c) La1s; the vertical line in each right panel marks mtry = ⌈log2(M) + 1⌉.]
Figure 9: The prediction accuracy of the three random forest models against the number of trees and features on the three datasets.
especially for image and microarray datasets, in comparison with recently proposed random forest models, including RF, GRRF, and wsRF.

For future work, we think it would be desirable to increase the scalability of the proposed random forests algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 102.01-2011.17.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
[5] T. G. Dietterich, "Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics, vol. 1, no. 6, pp. 80–83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.
[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: lasso and elastic-net regularized generalized linear models," R package version 1-1, 2010, http://CRAN.R-project.org/package=glmnet.
Figure 5 Box plots of the test accuracy of the nine Horse subdatasets
while the wsRF model showed the lower error bound onother two datasets Fbis andLa2sTheRFmodel demonstratedthe worst accuracy of prediction compared to the othermodels this model also produced a large 1198881199042 error whenthe small subspace size 119898119905119903119910 = lceillog
2(119872) + 1rceil was used to
build trees on the La1s and La2s datasets The number offeatures in the X
119904and X
119908columns on the right of Table 6
was used in the xRF model We can see that the xRF modelachieved the highest accuracy of prediction on all three largedatasets
Figure 9 shows the plots of the performance curves of theRF models when the number of trees and features increasesThe number of trees was increased stepwise by 20 treesfrom 20 to 200 when the models were applied to the La1s
dataset For the remaining data sets the number of treesincreased stepwise by 50 trees from 50 to 500 The numberof random features in a subspace was set to 119898119905119903119910 = lceilradic119872rceilThe number of features each consisting of a random sumof five inputs varied from 5 to 100 and for each 200 treeswere combined The vertical line in each plot indicates thesize of a subspace of features 119898119905119903119910 = lceillog
2(119872) + 1rceil
This subspace was suggested by Breiman [1] for the case oflow-dimensional datasets Three feature selection methodsnamely GRRF varSelRF and LASSO were not considered inthis experimentThemain reason is that when the119898119905119903119910 valueis large the computational time of the GRRF and varSelRFmodels required to deal with large high datasets was too long[17]
12 The Scientific World Journal
06
07
08
09AU
C
065
070
075
080
085
090
AUC
070
075
080
085
090
RF GRRF wsRF xRFHorseM1000
RF GRRF wsRF xRFHorseM7000
RF GRRF wsRF xRFHorseM15000
RF GRRF wsRF xRFHorseM12000
RF GRRF wsRF xRFHorseM1000
RF GRRF wsRF xRFHorseM5000
RF GRRF wsRF xRFHorseM3000
RF GRRF wsRF xRFHorseM500
RF GRRF wsRF xRFHorseM300
AUC
06
07
08
09
AUC
07
08
09AU
C
06
07
08
09
AUC
07
08
09
AUC
05
06
07
08
09
AUC
065
070
075
080
085
AUC
Figure 6 Box plots of the AUC measures of the nine Horse subdatasets
It can be seen that the xRF and wsRF models alwaysprovided good results and achieved higher prediction accu-racies when the subspace 119898119905119903119910 = lceillog
2(119872) + 1rceil was used
However the xRF model is better than the wsRF model inincreasing the prediction accuracy on the three classificationdatasetsThe RFmodel requires the larger number of featuresto achieve the higher accuracy of prediction as shown in theright of Figures 9(a) and 9(b) When the number of treesin a forests was varied the xRF model produced the bestresults on the Fbis and La2s datasets In the La1s datasetwhere the xRF model did not obtain the best results asshown in Figure 9(c) (left) the differences from the bestresults were minor From the right of Figures 9(a) 9(b)and 9(c) we can observe that the xRF model does not need
many features in the selected subspace to achieve the bestprediction performanceThese empirical results indicate thatfor application on high-dimensional data when the xRFmodel uses the small subspace the achieved results can besatisfactory
However the RF model using the simple samplingmethod for feature selection [1] could achieve good predic-tion performance only if it is provided with a much largersubspace as shown in the right part of Figures 9(a) and 9(b)Breiman suggested to use a subspace of size 119898119905119903119910 = radic119872 inclassification problemWith this size the computational timefor building a random forest is still too high especially forlarge high datasets In general when the xRF model is usedwith a feature subspace of the same size as the one suggested
The Scientific World Journal 13
Table 2 The (1198881199042) error bound results of random forest models against the number of codebook size on the Caltech and Horse datasetsThe bold value in each row indicates the best result
Figure 7 Box plots of test accuracy of the models on the ten gene datasets
14 The Scientific World Journal
Table 3 The prediction test accuracy (mean plusmn std-dev) of the models on the image datasets against the number of trees 119870 The numberof feature dimensions in each subdataset is fixed Numbers in bold are the best results
Table 4 AUC results (mean plusmn std-dev) of random forest models against the number of trees 119870 on the CaltechM3000 and HorseM3000subdatasets The bold value in each row indicates the best result
Table 5 Test accuracy results () of random forest models GRRF(01) varSelRF and LASSO logistic regression applied to gene datasetsThe average results of 100 repetitions were computed higher values are better The number of genes in the strong group X
Table 6The accuracy of prediction and error bound 1198881199042 of the models using a small subspace119898119905119903119910 = [log2(119872)+ 1] better values are bold
Dataset 1198881199042 Error bound Test accuracy () X119904
Figure 8 Box plots of (1198881199042) error bound for the models applied to the 10 gene datasets
by Breiman it demonstrates higher prediction accuracy andshorter computational time than those reported by BreimanThis achievement is considered to be one of the contributionsin our work
6 Conclusions
We have presented a new method for feature subspaceselection for building efficient random forest xRF model for
classification high-dimensional data Our main contributionis to make a new approach for unbiased feature samplingwhich selects the set of unbiased features for splitting anode when growing trees in the forests Furthermore thisnew unbiased feature selection method also reduces dimen-sionality using a defined threshold to remove uninformativefeatures (or noise) from the dataset Experimental resultshave demonstrated the improvements in increasing of the testaccuracy and the AUC measures for classification problems
16 The Scientific World Journal
70
75
80
85
50 100 150 200Number of trees
Accu
racy
()
70
75
80
85
25 50 75 100Number of features
Accu
racy
()
log(M) + 1
(a) Fbis
85
86
87
88
89
100 200 300 400 500Number of trees
Accu
racy
()
60
70
80
90
10 20 30 40 50Number of features
Accu
racy
()
log(M) + 1
(b) La2s
70
75
80
85
50 100 150 200Number of trees
Accu
racy
()
MethodsRFwsRFxRF
MethodsRFwsRFxRF
30
40
50
60
70
80
10 20 30 40 50Number of features
Accu
racy
() log(M) + 1
(c) La1s
Figure 9 The accuracy of prediction of the three random forests models against the number of trees and features on the three datasets
The Scientific World Journal 17
especially for image and microarray datasets in comparisonwith recent proposed random forests models including RFGRRF and wsRF
For futurework we think it would be desirable to increasethe scalability of the proposed random forests algorithm byparallelizing themon the cloud platform to deal with big datathat is hundreds of millions of samples and features
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
This research is supported in part by NSFC under Grantno 61203294 and Hanoi-DOST under the Grant no 01C-0701-2012-2 The author Thuy Thi Nguyen is supported bythe project ldquoSome Advanced Statistical Learning Techniquesfor Computer Visionrdquo funded by the National Foundation ofScience and Technology Development Vietnam under theGrant no 10201-201117
[2] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees CRC Press Boca Raton FlaUSA 1984
[3] H Kim and W-Y Loh ldquoClassification trees with unbiasedmultiway splitsrdquo Journal of the American Statistical Associationvol 96 no 454 pp 589ndash604 2001
[4] A PWhite andW Z Liu ldquoTechnical note bias in information-based measures in decision tree inductionrdquo Machine Learningvol 15 no 3 pp 321ndash329 1994
[5] T G Dietterich ldquoExperimental comparison of three methodsfor constructing ensembles of decision trees bagging boostingand randomizationrdquo Machine Learning vol 40 no 2 pp 139ndash157 2000
[6] Y Freund and R E Schapire ldquoA desicion-theoretic general-ization of on-line learning and an application to boostingrdquo inComputational Learning Theory pp 23ndash37 Springer 1995
[7] T-T Nguyen and T T Nguyen ldquoA real time license platedetection system based on boosting learning algorithmrdquo inProceedings of the 5th International Congress on Image and SignalProcessing (CISP rsquo12) pp 819ndash823 IEEE October 2012
[8] T K Ho ldquoRandom decision forestsrdquo in Proceedings of the 3rdInternational Conference on Document Analysis and Recogni-tion vol 1 pp 278ndash282 1995
[9] T K Ho ldquoThe random subspace method for constructingdecision forestsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 20 no 8 pp 832ndash844 1998
[11] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006
[12] RGenuer J-M Poggi andC Tuleau-Malot ldquoVariable selectionusing random forestsrdquoPattern Recognition Letters vol 31 no 14pp 2225ndash2236 2010
[13] B Xu J Z Huang GWilliams QWang and Y Ye ldquoClassifyingvery high-dimensional data with random forests built fromsmall subspacesrdquo International Journal ofDataWarehousing andMining vol 8 no 2 pp 44ndash63 2012
[14] Y Ye Q Wu J Zhexue Huang M K Ng and X Li ldquoStratifiedsampling for feature subspace selection in random forests forhigh dimensional datardquo Pattern Recognition vol 46 no 3 pp769ndash787 2013
[15] X Chen Y Ye X Xu and J Z Huang ldquoA feature groupweighting method for subspace clustering of high-dimensionaldatardquo Pattern Recognition vol 45 no 1 pp 434ndash446 2012
[16] D Amaratunga J Cabrera and Y-S Lee ldquoEnriched randomforestsrdquo Bioinformatics vol 240 no 18 pp 2010ndash2014 2008
[17] H Deng and G Runger ldquoGene selection with guided regular-ized random forestrdquo Pattern Recognition vol 46 no 12 pp3483ndash3489 2013
[18] C Strobl ldquoStatistical sources of variable selection bias inclassification trees based on the gini indexrdquo Tech Rep SFB 3862005 httpepububuni-muenchendearchive0000178901paper 420pdf
[19] C Strobl A-L Boulesteix and T Augustin ldquoUnbiased splitselection for classification trees based on the gini indexrdquoComputational Statistics amp Data Analysis vol 520 no 1 pp483ndash501 2007
[20] C Strobl A-L Boulesteix A Zeileis and T Hothorn ldquoBiasin random forest variable importance measures illustrationssources and a solutionrdquo BMC Bioinformatics vol 8 article 252007
[21] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008
[22] T Hothorn K Hornik and A Zeileis Party a laboratoryfor recursive partytioning r package version 09-9999 2011httpcranr-projectorgpackage=party
[23] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics vol 10 no 6 pp 80ndash83 1945
[24] T-TNguyen J ZHuang andT TNguyen ldquoTwo-level quantileregression forests for bias correction in range predictionrdquoMachine Learning 2014
[25] T-T Nguyen J Z Huang K Imran M J Li and GWilliams ldquoExtensions to quantile regression forests for veryhigh-dimensional datardquo in Advances in Knowledge Discoveryand Data Mining vol 8444 of Lecture Notes in ComputerScience pp 247ndash258 Springer Berlin Germany 2014
[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001
[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994
[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991
[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237
18 The Scientific World Journal
[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002
[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml
[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet
Figure 6 Box plots of the AUC measures of the nine Horse subdatasets
It can be seen that the xRF and wsRF models alwaysprovided good results and achieved higher prediction accu-racies when the subspace 119898119905119903119910 = lceillog
2(119872) + 1rceil was used
However the xRF model is better than the wsRF model inincreasing the prediction accuracy on the three classificationdatasetsThe RFmodel requires the larger number of featuresto achieve the higher accuracy of prediction as shown in theright of Figures 9(a) and 9(b) When the number of treesin a forests was varied the xRF model produced the bestresults on the Fbis and La2s datasets In the La1s datasetwhere the xRF model did not obtain the best results asshown in Figure 9(c) (left) the differences from the bestresults were minor From the right of Figures 9(a) 9(b)and 9(c) we can observe that the xRF model does not need
many features in the selected subspace to achieve the bestprediction performanceThese empirical results indicate thatfor application on high-dimensional data when the xRFmodel uses the small subspace the achieved results can besatisfactory
However the RF model using the simple samplingmethod for feature selection [1] could achieve good predic-tion performance only if it is provided with a much largersubspace as shown in the right part of Figures 9(a) and 9(b)Breiman suggested to use a subspace of size 119898119905119903119910 = radic119872 inclassification problemWith this size the computational timefor building a random forest is still too high especially forlarge high datasets In general when the xRF model is usedwith a feature subspace of the same size as the one suggested
The Scientific World Journal 13
Table 2 The (1198881199042) error bound results of random forest models against the number of codebook size on the Caltech and Horse datasetsThe bold value in each row indicates the best result
Figure 7 Box plots of test accuracy of the models on the ten gene datasets
14 The Scientific World Journal
Table 3 The prediction test accuracy (mean plusmn std-dev) of the models on the image datasets against the number of trees 119870 The numberof feature dimensions in each subdataset is fixed Numbers in bold are the best results
Table 4 AUC results (mean plusmn std-dev) of random forest models against the number of trees 119870 on the CaltechM3000 and HorseM3000subdatasets The bold value in each row indicates the best result
Table 5 Test accuracy results () of random forest models GRRF(01) varSelRF and LASSO logistic regression applied to gene datasetsThe average results of 100 repetitions were computed higher values are better The number of genes in the strong group X
Table 6The accuracy of prediction and error bound 1198881199042 of the models using a small subspace119898119905119903119910 = [log2(119872)+ 1] better values are bold
Dataset 1198881199042 Error bound Test accuracy () X119904
Figure 8 Box plots of (1198881199042) error bound for the models applied to the 10 gene datasets
by Breiman it demonstrates higher prediction accuracy andshorter computational time than those reported by BreimanThis achievement is considered to be one of the contributionsin our work
6 Conclusions
We have presented a new method for feature subspaceselection for building efficient random forest xRF model for
classification high-dimensional data Our main contributionis to make a new approach for unbiased feature samplingwhich selects the set of unbiased features for splitting anode when growing trees in the forests Furthermore thisnew unbiased feature selection method also reduces dimen-sionality using a defined threshold to remove uninformativefeatures (or noise) from the dataset Experimental resultshave demonstrated the improvements in increasing of the testaccuracy and the AUC measures for classification problems
16 The Scientific World Journal
70
75
80
85
50 100 150 200Number of trees
Accu
racy
()
70
75
80
85
25 50 75 100Number of features
Accu
racy
()
log(M) + 1
(a) Fbis
85
86
87
88
89
100 200 300 400 500Number of trees
Accu
racy
()
60
70
80
90
10 20 30 40 50Number of features
Accu
racy
()
log(M) + 1
(b) La2s
70
75
80
85
50 100 150 200Number of trees
Accu
racy
()
MethodsRFwsRFxRF
MethodsRFwsRFxRF
30
40
50
60
70
80
10 20 30 40 50Number of features
Accu
racy
() log(M) + 1
(c) La1s
Figure 9 The accuracy of prediction of the three random forests models against the number of trees and features on the three datasets
The Scientific World Journal 17
especially for image and microarray datasets in comparisonwith recent proposed random forests models including RFGRRF and wsRF
For futurework we think it would be desirable to increasethe scalability of the proposed random forests algorithm byparallelizing themon the cloud platform to deal with big datathat is hundreds of millions of samples and features
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
This research is supported in part by NSFC under Grantno 61203294 and Hanoi-DOST under the Grant no 01C-0701-2012-2 The author Thuy Thi Nguyen is supported bythe project ldquoSome Advanced Statistical Learning Techniquesfor Computer Visionrdquo funded by the National Foundation ofScience and Technology Development Vietnam under theGrant no 10201-201117
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
[5] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, "party: a laboratory for recursive partytioning," R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.
18 The Scientific World Journal
[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: lasso and elastic-net regularized generalized linear models," R package version 1-1, 2010, http://CRAN.R-project.org/package=glmnet.
Table 2: The c/s2 error bound results of random forest models against the codebook size on the Caltech and Horse datasets. The bold value in each row indicates the best result.
Figure 7: Box plots of test accuracy of the models on the ten gene datasets.
Table 3: The prediction test accuracy (mean ± std-dev) of the models on the image datasets against the number of trees K. The number of feature dimensions in each subdataset is fixed. Numbers in bold are the best results.
Table 4: AUC results (mean ± std-dev) of random forest models against the number of trees K on the CaltechM3000 and HorseM3000 subdatasets. The bold value in each row indicates the best result.
Table 5: Test accuracy results (%) of random forest models, GRRF(0.1), varSelRF, and LASSO logistic regression applied to gene datasets. The average results of 100 repetitions were computed; higher values are better. The number of genes in the strong group X
Table 6: The accuracy of prediction and error bound c/s2 of the models using a small subspace mtry = [log2(M) + 1]; better values are bold. (Columns: Dataset, c/s2 error bound, Test accuracy (%), X_s.)
Figure 8: Box plots of the c/s2 error bound for the models applied to the 10 gene datasets.
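The quantity c/s2 reported in Tables 2 and 6 is the ratio appearing in Breiman's upper bound on the generalization error of a random forest; restating that well-known result:

```latex
\mathrm{PE}^{*} \;\le\; \frac{\bar{\rho}\,\bigl(1 - s^{2}\bigr)}{s^{2}},
```

where s is the strength of the individual trees and \bar{\rho} is the mean correlation between them. Writing c = \bar{\rho}, a smaller c/s^2 indicates a better strength-to-correlation trade-off, which is why lower error-bound values are marked as better in the tables.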
by Breiman; it demonstrates higher prediction accuracy and shorter computational time than those reported by Breiman. We consider this achievement one of the contributions of our work.
6. Conclusions

We have presented a new method of feature subspace selection for building an efficient random forest model, xRF, for classifying high-dimensional data. Our main contribution is a new approach to unbiased feature sampling, which selects a set of unbiased features for splitting a node when growing trees in the forest. Furthermore, this unbiased feature selection method also reduces dimensionality, using a defined threshold to remove uninformative features (noise) from the dataset. Experimental results have demonstrated improvements in test accuracy and in the AUC measure for classification problems,
[Figure 9: Prediction accuracy (%) of the three random forest models (RF, wsRF, and xRF) on the (a) Fbis, (b) La2s, and (c) La1s datasets, plotted against the number of trees and the number of features; the subspace size log(M) + 1 is marked on each feature axis.]
especially for image and microarray datasets, in comparison with recently proposed random forest models including RF, GRRF, and wsRF.
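The two-step procedure described above (a p-value assessment that removes uninformative features, followed by weighted sampling of features when building each tree) can be sketched as follows. This is a minimal illustration, not the authors' xRF implementation: the quartile-bin discretization and the chi-square test of independence are assumptions standing in for the paper's statistical assessment, and `sample_subspace` only shows the feature-weighting sampling idea.

```python
import numpy as np
from scipy.stats import chi2_contingency

def select_unbiased_features(X, y, alpha=0.05):
    """Keep features whose association with the class label is significant.

    Each feature is discretized into quartile bins, cross-tabulated against
    the class label, and tested with a chi-square test of independence;
    features with p >= alpha are treated as uninformative and dropped.
    """
    classes = {v: i for i, v in enumerate(np.unique(y))}
    keep = []
    for j in range(X.shape[1]):
        cuts = np.quantile(X[:, j], [0.25, 0.5, 0.75])
        binned = np.digitize(X[:, j], cuts)          # 4 quartile bins
        table = np.zeros((4, len(classes)))
        for b, c in zip(binned, y):
            table[b, classes[c]] += 1
        # drop empty rows/columns so expected counts are all positive
        table = table[table.sum(axis=1) > 0][:, table.sum(axis=0) > 0]
        if min(table.shape) < 2:
            continue                                 # degenerate, cannot test
        _, p, _, _ = chi2_contingency(table)
        if p < alpha:
            keep.append(j)
    return keep

def sample_subspace(weights, mtry, rng):
    """Weighted sampling without replacement of mtry candidate features."""
    p = np.asarray(weights, dtype=float)
    return rng.choice(np.arange(len(p)), size=mtry, replace=False, p=p / p.sum())
```

At each node, a tree-growing routine would call `sample_subspace` with weights derived from the retained features' test statistics, so informative features are more likely to enter the candidate split set.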
For future work, we think it would be desirable to increase the scalability of the proposed random forest algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.
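The experiments summarized in Figure 9 and Table 6 sweep the number of trees K and the subspace size mtry and record test accuracy. A sketch of that experimental grid with an off-the-shelf random forest (scikit-learn's standard RF here, as a stand-in; the paper's own models are RF, wsRF, and xRF, which are not reproduced by this code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def accuracy_grid(X, y, tree_counts, mtry_values, seed=42):
    """Test accuracy of a standard RF over a grid of (n_trees, max_features)."""
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=seed)
    scores = {}
    for k in tree_counts:
        for m in mtry_values:
            rf = RandomForestClassifier(n_estimators=k, max_features=m,
                                        random_state=seed).fit(Xtr, ytr)
            scores[(k, m)] = rf.score(Xte, yte)      # accuracy on held-out data
    return scores

# Breiman's default subspace size used in Table 6: mtry = [log2(M)] + 1
M = 64                                   # hypothetical number of features
mtry_default = int(np.log2(M)) + 1       # = 7 for M = 64

# small synthetic run in the style of the paper's accuracy-vs-parameters plots
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
scores = accuracy_grid(X, y, tree_counts=[10], mtry_values=[4])
```

Varying `tree_counts` over, say, [50, 100, 150, 200] and `mtry_values` around `mtry_default` reproduces the shape of the accuracy curves in Figure 9 for any forest implementation plugged into the loop.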
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
[2] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees CRC Press Boca Raton FlaUSA 1984
[3] H Kim and W-Y Loh ldquoClassification trees with unbiasedmultiway splitsrdquo Journal of the American Statistical Associationvol 96 no 454 pp 589ndash604 2001
[4] A PWhite andW Z Liu ldquoTechnical note bias in information-based measures in decision tree inductionrdquo Machine Learningvol 15 no 3 pp 321ndash329 1994
[5] T G Dietterich ldquoExperimental comparison of three methodsfor constructing ensembles of decision trees bagging boostingand randomizationrdquo Machine Learning vol 40 no 2 pp 139ndash157 2000
[6] Y Freund and R E Schapire ldquoA desicion-theoretic general-ization of on-line learning and an application to boostingrdquo inComputational Learning Theory pp 23ndash37 Springer 1995
[7] T-T Nguyen and T T Nguyen ldquoA real time license platedetection system based on boosting learning algorithmrdquo inProceedings of the 5th International Congress on Image and SignalProcessing (CISP rsquo12) pp 819ndash823 IEEE October 2012
[8] T K Ho ldquoRandom decision forestsrdquo in Proceedings of the 3rdInternational Conference on Document Analysis and Recogni-tion vol 1 pp 278ndash282 1995
[9] T K Ho ldquoThe random subspace method for constructingdecision forestsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 20 no 8 pp 832ndash844 1998
[11] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006
[12] RGenuer J-M Poggi andC Tuleau-Malot ldquoVariable selectionusing random forestsrdquoPattern Recognition Letters vol 31 no 14pp 2225ndash2236 2010
[13] B Xu J Z Huang GWilliams QWang and Y Ye ldquoClassifyingvery high-dimensional data with random forests built fromsmall subspacesrdquo International Journal ofDataWarehousing andMining vol 8 no 2 pp 44ndash63 2012
[14] Y Ye Q Wu J Zhexue Huang M K Ng and X Li ldquoStratifiedsampling for feature subspace selection in random forests forhigh dimensional datardquo Pattern Recognition vol 46 no 3 pp769ndash787 2013
[15] X Chen Y Ye X Xu and J Z Huang ldquoA feature groupweighting method for subspace clustering of high-dimensionaldatardquo Pattern Recognition vol 45 no 1 pp 434ndash446 2012
[16] D Amaratunga J Cabrera and Y-S Lee ldquoEnriched randomforestsrdquo Bioinformatics vol 240 no 18 pp 2010ndash2014 2008
[17] H Deng and G Runger ldquoGene selection with guided regular-ized random forestrdquo Pattern Recognition vol 46 no 12 pp3483ndash3489 2013
[18] C Strobl ldquoStatistical sources of variable selection bias inclassification trees based on the gini indexrdquo Tech Rep SFB 3862005 httpepububuni-muenchendearchive0000178901paper 420pdf
[19] C Strobl A-L Boulesteix and T Augustin ldquoUnbiased splitselection for classification trees based on the gini indexrdquoComputational Statistics amp Data Analysis vol 520 no 1 pp483ndash501 2007
[20] C Strobl A-L Boulesteix A Zeileis and T Hothorn ldquoBiasin random forest variable importance measures illustrationssources and a solutionrdquo BMC Bioinformatics vol 8 article 252007
[21] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008
[22] T Hothorn K Hornik and A Zeileis Party a laboratoryfor recursive partytioning r package version 09-9999 2011httpcranr-projectorgpackage=party
[23] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics vol 10 no 6 pp 80ndash83 1945
[24] T-TNguyen J ZHuang andT TNguyen ldquoTwo-level quantileregression forests for bias correction in range predictionrdquoMachine Learning 2014
[25] T-T Nguyen J Z Huang K Imran M J Li and GWilliams ldquoExtensions to quantile regression forests for veryhigh-dimensional datardquo in Advances in Knowledge Discoveryand Data Mining vol 8444 of Lecture Notes in ComputerScience pp 247ndash258 Springer Berlin Germany 2014
[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001
[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994
[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991
[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237
18 The Scientific World Journal
[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002
[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml
[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet
Table 3 The prediction test accuracy (mean plusmn std-dev) of the models on the image datasets against the number of trees 119870 The numberof feature dimensions in each subdataset is fixed Numbers in bold are the best results
Table 4 AUC results (mean plusmn std-dev) of random forest models against the number of trees 119870 on the CaltechM3000 and HorseM3000subdatasets The bold value in each row indicates the best result
Table 5 Test accuracy results () of random forest models GRRF(01) varSelRF and LASSO logistic regression applied to gene datasetsThe average results of 100 repetitions were computed higher values are better The number of genes in the strong group X
Table 6The accuracy of prediction and error bound 1198881199042 of the models using a small subspace119898119905119903119910 = [log2(119872)+ 1] better values are bold
Dataset 1198881199042 Error bound Test accuracy () X119904
Figure 8 Box plots of (1198881199042) error bound for the models applied to the 10 gene datasets
by Breiman it demonstrates higher prediction accuracy andshorter computational time than those reported by BreimanThis achievement is considered to be one of the contributionsin our work
6 Conclusions
We have presented a new method for feature subspaceselection for building efficient random forest xRF model for
classification high-dimensional data Our main contributionis to make a new approach for unbiased feature samplingwhich selects the set of unbiased features for splitting anode when growing trees in the forests Furthermore thisnew unbiased feature selection method also reduces dimen-sionality using a defined threshold to remove uninformativefeatures (or noise) from the dataset Experimental resultshave demonstrated the improvements in increasing of the testaccuracy and the AUC measures for classification problems
16 The Scientific World Journal
70
75
80
85
50 100 150 200Number of trees
Accu
racy
()
70
75
80
85
25 50 75 100Number of features
Accu
racy
()
log(M) + 1
(a) Fbis
85
86
87
88
89
100 200 300 400 500Number of trees
Accu
racy
()
60
70
80
90
10 20 30 40 50Number of features
Accu
racy
()
log(M) + 1
(b) La2s
70
75
80
85
50 100 150 200Number of trees
Accu
racy
()
MethodsRFwsRFxRF
MethodsRFwsRFxRF
30
40
50
60
70
80
10 20 30 40 50Number of features
Accu
racy
() log(M) + 1
(c) La1s
Figure 9 The accuracy of prediction of the three random forests models against the number of trees and features on the three datasets
The Scientific World Journal 17
especially for image and microarray datasets in comparisonwith recent proposed random forests models including RFGRRF and wsRF
For futurework we think it would be desirable to increasethe scalability of the proposed random forests algorithm byparallelizing themon the cloud platform to deal with big datathat is hundreds of millions of samples and features
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
This research is supported in part by NSFC under Grantno 61203294 and Hanoi-DOST under the Grant no 01C-0701-2012-2 The author Thuy Thi Nguyen is supported bythe project ldquoSome Advanced Statistical Learning Techniquesfor Computer Visionrdquo funded by the National Foundation ofScience and Technology Development Vietnam under theGrant no 10201-201117
[2] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees CRC Press Boca Raton FlaUSA 1984
[3] H Kim and W-Y Loh ldquoClassification trees with unbiasedmultiway splitsrdquo Journal of the American Statistical Associationvol 96 no 454 pp 589ndash604 2001
[4] A PWhite andW Z Liu ldquoTechnical note bias in information-based measures in decision tree inductionrdquo Machine Learningvol 15 no 3 pp 321ndash329 1994
[5] T G Dietterich ldquoExperimental comparison of three methodsfor constructing ensembles of decision trees bagging boostingand randomizationrdquo Machine Learning vol 40 no 2 pp 139ndash157 2000
[6] Y Freund and R E Schapire ldquoA desicion-theoretic general-ization of on-line learning and an application to boostingrdquo inComputational Learning Theory pp 23ndash37 Springer 1995
[7] T-T Nguyen and T T Nguyen ldquoA real time license platedetection system based on boosting learning algorithmrdquo inProceedings of the 5th International Congress on Image and SignalProcessing (CISP rsquo12) pp 819ndash823 IEEE October 2012
[8] T K Ho ldquoRandom decision forestsrdquo in Proceedings of the 3rdInternational Conference on Document Analysis and Recogni-tion vol 1 pp 278ndash282 1995
[9] T K Ho ldquoThe random subspace method for constructingdecision forestsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 20 no 8 pp 832ndash844 1998
[11] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006
[12] RGenuer J-M Poggi andC Tuleau-Malot ldquoVariable selectionusing random forestsrdquoPattern Recognition Letters vol 31 no 14pp 2225ndash2236 2010
[13] B Xu J Z Huang GWilliams QWang and Y Ye ldquoClassifyingvery high-dimensional data with random forests built fromsmall subspacesrdquo International Journal ofDataWarehousing andMining vol 8 no 2 pp 44ndash63 2012
[14] Y Ye Q Wu J Zhexue Huang M K Ng and X Li ldquoStratifiedsampling for feature subspace selection in random forests forhigh dimensional datardquo Pattern Recognition vol 46 no 3 pp769ndash787 2013
[15] X Chen Y Ye X Xu and J Z Huang ldquoA feature groupweighting method for subspace clustering of high-dimensionaldatardquo Pattern Recognition vol 45 no 1 pp 434ndash446 2012
[16] D Amaratunga J Cabrera and Y-S Lee ldquoEnriched randomforestsrdquo Bioinformatics vol 240 no 18 pp 2010ndash2014 2008
[17] H Deng and G Runger ldquoGene selection with guided regular-ized random forestrdquo Pattern Recognition vol 46 no 12 pp3483ndash3489 2013
[18] C Strobl ldquoStatistical sources of variable selection bias inclassification trees based on the gini indexrdquo Tech Rep SFB 3862005 httpepububuni-muenchendearchive0000178901paper 420pdf
[19] C Strobl A-L Boulesteix and T Augustin ldquoUnbiased splitselection for classification trees based on the gini indexrdquoComputational Statistics amp Data Analysis vol 520 no 1 pp483ndash501 2007
[20] C Strobl A-L Boulesteix A Zeileis and T Hothorn ldquoBiasin random forest variable importance measures illustrationssources and a solutionrdquo BMC Bioinformatics vol 8 article 252007
[21] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008
[22] T Hothorn K Hornik and A Zeileis Party a laboratoryfor recursive partytioning r package version 09-9999 2011httpcranr-projectorgpackage=party
[23] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics vol 10 no 6 pp 80ndash83 1945
[24] T-TNguyen J ZHuang andT TNguyen ldquoTwo-level quantileregression forests for bias correction in range predictionrdquoMachine Learning 2014
[25] T-T Nguyen J Z Huang K Imran M J Li and GWilliams ldquoExtensions to quantile regression forests for veryhigh-dimensional datardquo in Advances in Knowledge Discoveryand Data Mining vol 8444 of Lecture Notes in ComputerScience pp 247ndash258 Springer Berlin Germany 2014
[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001
[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994
[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991
[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237
18 The Scientific World Journal
[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002
[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml
[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet
Table 6The accuracy of prediction and error bound 1198881199042 of the models using a small subspace119898119905119903119910 = [log2(119872)+ 1] better values are bold
Dataset 1198881199042 Error bound Test accuracy () X119904
Figure 8 Box plots of (1198881199042) error bound for the models applied to the 10 gene datasets
by Breiman it demonstrates higher prediction accuracy andshorter computational time than those reported by BreimanThis achievement is considered to be one of the contributionsin our work
6 Conclusions
We have presented a new method for feature subspaceselection for building efficient random forest xRF model for
classification high-dimensional data Our main contributionis to make a new approach for unbiased feature samplingwhich selects the set of unbiased features for splitting anode when growing trees in the forests Furthermore thisnew unbiased feature selection method also reduces dimen-sionality using a defined threshold to remove uninformativefeatures (or noise) from the dataset Experimental resultshave demonstrated the improvements in increasing of the testaccuracy and the AUC measures for classification problems
16 The Scientific World Journal
70
75
80
85
50 100 150 200Number of trees
Accu
racy
()
70
75
80
85
25 50 75 100Number of features
Accu
racy
()
log(M) + 1
(a) Fbis
85
86
87
88
89
100 200 300 400 500Number of trees
Accu
racy
()
60
70
80
90
10 20 30 40 50Number of features
Accu
racy
()
log(M) + 1
(b) La2s
70
75
80
85
50 100 150 200Number of trees
Accu
racy
()
MethodsRFwsRFxRF
MethodsRFwsRFxRF
30
40
50
60
70
80
10 20 30 40 50Number of features
Accu
racy
() log(M) + 1
(c) La1s
Figure 9 The accuracy of prediction of the three random forests models against the number of trees and features on the three datasets
The Scientific World Journal 17
especially for image and microarray datasets in comparisonwith recent proposed random forests models including RFGRRF and wsRF
For futurework we think it would be desirable to increasethe scalability of the proposed random forests algorithm byparallelizing themon the cloud platform to deal with big datathat is hundreds of millions of samples and features
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
This research is supported in part by NSFC under Grantno 61203294 and Hanoi-DOST under the Grant no 01C-0701-2012-2 The author Thuy Thi Nguyen is supported bythe project ldquoSome Advanced Statistical Learning Techniquesfor Computer Visionrdquo funded by the National Foundation ofScience and Technology Development Vietnam under theGrant no 10201-201117
[2] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees CRC Press Boca Raton FlaUSA 1984
[3] H Kim and W-Y Loh ldquoClassification trees with unbiasedmultiway splitsrdquo Journal of the American Statistical Associationvol 96 no 454 pp 589ndash604 2001
[4] A PWhite andW Z Liu ldquoTechnical note bias in information-based measures in decision tree inductionrdquo Machine Learningvol 15 no 3 pp 321ndash329 1994
[5] T G Dietterich ldquoExperimental comparison of three methodsfor constructing ensembles of decision trees bagging boostingand randomizationrdquo Machine Learning vol 40 no 2 pp 139ndash157 2000
[6] Y Freund and R E Schapire ldquoA desicion-theoretic general-ization of on-line learning and an application to boostingrdquo inComputational Learning Theory pp 23ndash37 Springer 1995
[7] T-T Nguyen and T T Nguyen ldquoA real time license platedetection system based on boosting learning algorithmrdquo inProceedings of the 5th International Congress on Image and SignalProcessing (CISP rsquo12) pp 819ndash823 IEEE October 2012
[8] T K Ho ldquoRandom decision forestsrdquo in Proceedings of the 3rdInternational Conference on Document Analysis and Recogni-tion vol 1 pp 278ndash282 1995
[9] T K Ho ldquoThe random subspace method for constructingdecision forestsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 20 no 8 pp 832ndash844 1998
[11] R Dıaz-Uriarte and S Alvarez de Andres ldquoGene selection andclassification of microarray data using random forestrdquo BMCBioinformatics vol 7 article 3 2006
[12] RGenuer J-M Poggi andC Tuleau-Malot ldquoVariable selectionusing random forestsrdquoPattern Recognition Letters vol 31 no 14pp 2225ndash2236 2010
[13] B Xu J Z Huang GWilliams QWang and Y Ye ldquoClassifyingvery high-dimensional data with random forests built fromsmall subspacesrdquo International Journal ofDataWarehousing andMining vol 8 no 2 pp 44ndash63 2012
[14] Y Ye Q Wu J Zhexue Huang M K Ng and X Li ldquoStratifiedsampling for feature subspace selection in random forests forhigh dimensional datardquo Pattern Recognition vol 46 no 3 pp769ndash787 2013
[15] X Chen Y Ye X Xu and J Z Huang ldquoA feature groupweighting method for subspace clustering of high-dimensionaldatardquo Pattern Recognition vol 45 no 1 pp 434ndash446 2012
[16] D Amaratunga J Cabrera and Y-S Lee ldquoEnriched randomforestsrdquo Bioinformatics vol 240 no 18 pp 2010ndash2014 2008
[17] H Deng and G Runger ldquoGene selection with guided regular-ized random forestrdquo Pattern Recognition vol 46 no 12 pp3483ndash3489 2013
[18] C Strobl ldquoStatistical sources of variable selection bias inclassification trees based on the gini indexrdquo Tech Rep SFB 3862005 httpepububuni-muenchendearchive0000178901paper 420pdf
[19] C Strobl A-L Boulesteix and T Augustin ldquoUnbiased splitselection for classification trees based on the gini indexrdquoComputational Statistics amp Data Analysis vol 520 no 1 pp483ndash501 2007
[20] C Strobl A-L Boulesteix A Zeileis and T Hothorn ldquoBiasin random forest variable importance measures illustrationssources and a solutionrdquo BMC Bioinformatics vol 8 article 252007
[21] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008
[22] T Hothorn K Hornik and A Zeileis Party a laboratoryfor recursive partytioning r package version 09-9999 2011httpcranr-projectorgpackage=party
[23] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics vol 10 no 6 pp 80ndash83 1945
[24] T-TNguyen J ZHuang andT TNguyen ldquoTwo-level quantileregression forests for bias correction in range predictionrdquoMachine Learning 2014
[25] T-T Nguyen J Z Huang K Imran M J Li and GWilliams ldquoExtensions to quantile regression forests for veryhigh-dimensional datardquo in Advances in Knowledge Discoveryand Data Mining vol 8444 of Lecture Notes in ComputerScience pp 247ndash258 Springer Berlin Germany 2014
[26] A S Georghiades P N Belhumeur and D J Kriegman ldquoFromfew to many illumination cone models for face recognitionunder variable lighting and poserdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 23 no 6 pp 643ndash6602001
[27] F S Samaria and A C Harter ldquoParameterisation of a stochasticmodel for human face identificationrdquo in Proceedings of the 2ndIEEEWorkshop onApplications of Computer Vision pp 138ndash142IEEE December 1994
[28] M Turk and A Pentland ldquoEigenfaces for recognitionrdquo Journalof Cognitive Neuroscience vol 3 no 1 pp 71ndash86 1991
[29] H Deng ldquoGuided random forest in the RRF packagerdquohttparxivorgabs13060237
18 The Scientific World Journal
[30] A Liaw and M Wiener ldquoClassification and regression byrandomforestrdquo R News vol 20 no 3 pp 18ndash22 2002
[31] R Diaz-Uriarte ldquovarselrf variable selection using randomforestsrdquo R package version 07-1 2009 httpligartoorgrdiazSoftwareSoftwarehtml
[32] J H Friedman T J Hastie and R J Tibshirani ldquoglmnetLasso and elastic-net regularized generalized linear modelsrdquo Rpackage version pages 1-1 2010 httpCRANR-projectorgpackage=glmnet
Figure 9 The accuracy of prediction of the three random forests models against the number of trees and features on the three datasets
The Scientific World Journal 17
especially for image and microarray datasets in comparisonwith recent proposed random forests models including RFGRRF and wsRF
For futurework we think it would be desirable to increasethe scalability of the proposed random forests algorithm byparallelizing themon the cloud platform to deal with big datathat is hundreds of millions of samples and features
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
This research is supported in part by NSFC under Grantno 61203294 and Hanoi-DOST under the Grant no 01C-0701-2012-2 The author Thuy Thi Nguyen is supported bythe project ldquoSome Advanced Statistical Learning Techniquesfor Computer Visionrdquo funded by the National Foundation ofScience and Technology Development Vietnam under theGrant no 10201-201117
[2] L Breiman J Friedman C J Stone and R A OlshenClassification and Regression Trees CRC Press Boca Raton FlaUSA 1984
[3] H Kim and W-Y Loh ldquoClassification trees with unbiasedmultiway splitsrdquo Journal of the American Statistical Associationvol 96 no 454 pp 589ndash604 2001
[4] A PWhite andW Z Liu ldquoTechnical note bias in information-based measures in decision tree inductionrdquo Machine Learningvol 15 no 3 pp 321ndash329 1994
[5] T G Dietterich ldquoExperimental comparison of three methodsfor constructing ensembles of decision trees bagging boostingand randomizationrdquo Machine Learning vol 40 no 2 pp 139ndash157 2000
18 The Scientific World Journal
especially for image and microarray datasets, in comparison with recently proposed random forest models including RF, GRRF, and wsRF.

For future work, we think it would be desirable to increase the scalability of the proposed random forests algorithm by parallelizing it on a cloud platform to deal with big data, that is, hundreds of millions of samples and features.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported in part by NSFC under Grant no. 61203294 and Hanoi-DOST under Grant no. 01C-0701-2012-2. The author Thuy Thi Nguyen is supported by the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation of Science and Technology Development, Vietnam, under Grant no. 102.01-2011.17.
References

[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
[3] H. Kim and W.-Y. Loh, "Classification trees with unbiased multiway splits," Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
[4] A. P. White and W. Z. Liu, "Technical note: bias in information-based measures in decision tree induction," Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
[5] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, pp. 23–37, Springer, 1995.
[7] T.-T. Nguyen and T. T. Nguyen, "A real time license plate detection system based on boosting learning algorithm," in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
[8] T. K. Ho, "Random decision forests," in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[9] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, vol. 7, article 3, 2006.
[12] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
[13] B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, "Classifying very high-dimensional data with random forests built from small subspaces," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
[14] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, "Stratified sampling for feature subspace selection in random forests for high dimensional data," Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
[15] X. Chen, Y. Ye, X. Xu, and J. Z. Huang, "A feature group weighting method for subspace clustering of high-dimensional data," Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
[16] D. Amaratunga, J. Cabrera, and Y.-S. Lee, "Enriched random forests," Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
[17] H. Deng and G. Runger, "Gene selection with guided regularized random forest," Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
[18] C. Strobl, "Statistical sources of variable selection bias in classification trees based on the Gini index," Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
[19] C. Strobl, A.-L. Boulesteix, and T. Augustin, "Unbiased split selection for classification trees based on the Gini index," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
[20] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, "Bias in random forest variable importance measures: illustrations, sources and a solution," BMC Bioinformatics, vol. 8, article 25, 2007.
[21] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for random forests," BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
[22] T. Hothorn, K. Hornik, and A. Zeileis, party: a laboratory for recursive partytioning, R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
[23] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[24] T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, "Two-level quantile regression forests for bias correction in range prediction," Machine Learning, 2014.
[25] T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, "Extensions to quantile regression forests for very high-dimensional data," in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[27] F. S. Samaria and A. C. Harter, "Parameterisation of a stochastic model for human face identification," in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.
[28] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[29] H. Deng, "Guided random forest in the RRF package," http://arxiv.org/abs/1306.0237.
[30] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[31] R. Diaz-Uriarte, "varSelRF: variable selection using random forests," R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
[32] J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, "glmnet: Lasso and elastic-net regularized generalized linear models," R package version 1.1, 2010, http://CRAN.R-project.org/package=glmnet.