© Copyright Salford Systems 2004

Data Mining with Random Forests™

A Brief Overview to RandomForests™

Salford Systems
http://www.salford-systems.com

dstein@salford-systems.com
Dan Steinberg, Mikhail Golovnya, N. Scott Cardell


What are Random Forests?

A powerful new approach to data exploration, data analysis, and predictive modeling
Developed by Leo Breiman (father of CART®) at the University of California, Berkeley
Has its roots in:

CART
Learning ensembles, committees of experts, combining models
Bootstrap Aggregation (Bagging)


What Random Forests Offer

Data visualization for high dimensional data (many columns)
Clustering
Anomaly, outlier, and error detection
Analysis of small sample data (small numbers in the group of interest)
Automated identification of important predictors
Explicit missing value imputation (for use by other tools)
Generation of strong (accurate) predictive models


What exactly is a Random Forest?

A random forest is a collection of CART-like trees following specific rules for

Tree growing
Tree combination
Self-testing
Post-processing


Random Forests Tree Growing

Trees are grown using binary partitioning (each parent node is split into no more than two children)
Each tree is grown at least partially at random

Randomness is injected by growing each tree on a different random subsample of the training data
Randomness is injected into the split selection process so that the splitter at any node is determined partly at random


RF Split Selection

First select a small subset of the available variables at random
Actually a bootstrap subsample
Typically we select about sqrt(K), where K is the total number of predictors available
If we have 500 columns of predictors we will select only about 23
We split the node with the best variable among the 23, not the best variable among the 500
Radically speeds up the tree growing process
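
A minimal sketch of this node-splitting idea, assuming a Gini criterion and NumPy; the function names and synthetic data below are illustrative assumptions, not the Salford implementation.

```python
# A minimal sketch (not the Salford implementation): at each node, pick a
# random subset of predictors and split on the best of that subset only.
import numpy as np

def gini_impurity(y):
    # Gini impurity of a vector of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_on_random_subset(X, y, n_eligible, rng):
    """Best (feature, threshold) among a random subset of the columns."""
    n_rows, n_cols = X.shape
    eligible = rng.choice(n_cols, size=n_eligible, replace=False)
    best_col, best_thr, best_score = None, None, np.inf
    for j in eligible:
        for thr in np.unique(X[:, j]):
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini_impurity(left) +
                     len(right) * gini_impurity(right)) / n_rows
            if score < best_score:
                best_col, best_thr, best_score = j, thr, score
    return best_col, best_thr, best_score

# With 500 predictors we search only about sqrt(500) ~ 23 of them per node
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = (X[:, 3] + X[:, 7] > 0).astype(int)
print(best_split_on_random_subset(X, y, n_eligible=23, rng=rng))
```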


RF Split Selection

The best splitter from the eligible random subset is used to split the node in question

The splitter might be the best overall, but probably is not
The splitter might still be a fairly good splitter
The splitter may also not be very helpful

If the splitter is not very good we just end up with two children that are essentially alike

E.g. we might split on GENDER when there is little difference between males and females
This means that the essential work has been deferred to lower levels of the tree


RF tree evolution

Once a node is split on the best eligible splitter, the process is repeated in its entirety on each child node
A new list of eligible predictors is selected at random for each node
With a large number of predictors the eligible predictor set will be quite different from node to node
Important variables will make it into the tree (eventually)
This explains in part why the trees must be grown out to absolute maximum full size
Aim for terminal nodes with one data record


Can a single random tree be any good?

Surprisingly the answer is yes!
This is true even if we split every node completely at random

By making the eligible set contain just one randomly selected splitter

We are not saying that anyone would want to use a single random tree for prediction
But these trees do have some weak predictive power


Why are single random trees predictive?

A single tree in an RF forest can be predictive because it is a form of nearest neighbor classifier
One of the oldest and most robust of the machine learning technologies, nearest neighbor prediction is completely model free
A good nearest neighbor system should achieve an error rate about twice that of the theoretically best model

If the best possible model is correct for 90% of all cases it is wrong for 10%
Nearest neighbor should then be correct for 80% and wrong for 20%


Why is a big tree like a near neighbor classifier?

Follow the path down a big tree
To reach a node in the tree a record must satisfy the condition at every parent

AGE>35, INCOME<60, EDUCATION>12, CITY=YES, etc.
Reaching a terminal node means being very similar to the training records that occupy that node

The near neighbor predictor mechanism: find the historical record that looks as similar as possible to the new record, then predict that the new record behaves just like that historical record


RF Prediction Mechanism

Grow many trees. Recommend 500, but for large data sets 150 or so may be sufficient
Each tree casts a vote at its terminal nodes. For a binary target the vote will be YES or NO
Count up the YES votes. This is the RF score, and the percent of YES votes received is the predicted probability
Very simple mechanism (see the sketch below)
Alternatives developed by Jerome Friedman

Optimally reweight the trees via regularized regression (lasso)
Combines the best trees and drops a large percentage
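
An illustrative sketch of the voting mechanism, using scikit-learn decision trees grown on bootstrap samples as stand-ins for the forest's trees; the synthetic data and parameter choices are assumptions for demonstration only.

```python
# Illustrative sketch of the voting mechanism, using scikit-learn decision
# trees grown on bootstrap samples as stand-ins for the forest's trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(100):                        # grow many trees
    rows = rng.integers(0, len(X), len(X))  # bootstrap sample of the rows
    trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[rows], y[rows]))

# Each tree casts one vote per case; the fraction of YES (class 1) votes is
# the RF score, i.e. the predicted probability of YES.
votes = np.stack([t.predict(X) for t in trees])   # shape (n_trees, n_cases)
prob_yes = votes.mean(axis=0)
predicted_class = (prob_yes >= 0.5).astype(int)
print(prob_yes[:5], predicted_class[:5])
```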


RF Self testing

Each tree is grown on about 63% of the original training data (due to the bootstrap sampling process)
The remaining 37% of the data is available to test any single tree

Will be a different 37% for each tree

Use this left-out data, named "Out of Bag" or OOB, to calibrate the performance of each tree
Also use the OOB data to keep a running tab on how often each record is classified correctly when it belongs to the OOB sample
All performance statistics reported by RF are based on OOB calculations
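
An illustrative sketch of the OOB bookkeeping just described, again using scikit-learn trees on bootstrap samples; variable names and data are assumptions, not the Salford implementation.

```python
# Illustrative sketch of the OOB bookkeeping: each record is scored only by
# the trees whose bootstrap sample did not contain it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)
n = len(X)
oob_votes = np.zeros(n)   # running YES votes from trees that held the case out
oob_counts = np.zeros(n)  # number of trees that held the case out

for _ in range(200):
    rows = rng.integers(0, n, n)            # bootstrap sample (~63% unique rows)
    oob = np.setdiff1d(np.arange(n), rows)  # the ~37% left out of this tree
    tree = DecisionTreeClassifier(max_features="sqrt").fit(X[rows], y[rows])
    oob_votes[oob] += tree.predict(X[oob])
    oob_counts[oob] += 1

scored = oob_counts > 0
oob_pred = (oob_votes[scored] / oob_counts[scored] >= 0.5).astype(int)
print("OOB error rate:", np.mean(oob_pred != y[scored]))
```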


RF self testing and small samples

RF self-testing is similar to a cross-validation that is repeated possibly hundreds of times, each time with a different random partitioning of the data
Allows self-tested models even on samples with small numbers of the record types of interest (e.g. fraud)


Introduction to Random Forests

New approach for many data analytical tasks developed by Leo Breiman of University of California, Berkeley

Co-author of CART® with Friedman, Olshen and Stone
Author of Bagging and Arcing approaches to combining trees

Good for classification and regression problems
Also for clustering, density estimation
Outlier and anomaly detection
Explicit missing value imputation

Builds on the notions of committees of experts but is substantially different in key implementation details


Benefits of Random Forests

High levels of predictive accuracy delivered automatically
Only a few control parameters to experiment with
Strong for both regression and classification

Resistant to overtraining (overfitting) – generalizes well to new data
Trains rapidly even with thousands of potential predictors

No need for prior feature (variable) selection

Diagnostics pinpoint multivariate outliers
Offers a revolutionary new approach to clustering using tree-based between-record distance measures
Built on CART®-inspired trees and thus results invariant to monotone transformations of variables


Random Forests: Key Innovations -1

Method intended to generate a large number of substantially different models

Randomness introduced in two simultaneous ways
By row: records selected for training at random with replacement (as in the bootstrap resampling of the bagger)
By column: candidate predictors at any node are chosen at random, and the best splitter is selected from the random subset
Each tree is grown out to maximal size and left unpruned

Trees are deliberately overfit, becoming a form of nearest neighbor predictor
Experiments convincingly show that pruning these trees hurts performance; the possibly overfit individual trees combine to yield properly fit ensembles


Random Forests: Key Innovations -2

Self-testing is possible even if all data is used for training
Only about 63% of the available training data will be used to grow any one tree
A 37% portion of the training data is always unused

The unused portion of the training data is known as Out-Of-Bag (OOB) data and provides an ongoing dynamic assessment of model performance

Allows fitting to small data sets without explicitly holding back any data for testing
All training data is used cumulatively in training, but only a 63% portion is used at any one time.

Similar to repeated cross-validation but unstructured.


Random Forests: Key Innovations -3

Intensive post-processing of the data to extract more insight

Most important is the introduction of a distance metric between any two data records:

The more similar two records are, the more often they will land in the same terminal node of a tree
With a large number of different trees, simply count the number of times they co-locate in the same leaf node
The distance metric can be used to construct a dissimilarity matrix as input to hierarchical clustering


Combining Trees

Ultimately in modeling our goal is to produce a single score, prediction, forecast, or class assignment
The motivation for generating multiple models is the hope that by somehow combining models, results will be better than if we relied on a single model
When multiple models are generated they are normally combined by:

Voting in classification problems, perhaps weighted
Averaging in regression problems, perhaps weighted


Random Forests and Uncorrelated Trees

Combining trees via averaging or voting will only be beneficial if the trees are different from each other
In the original bootstrap aggregation paper Breiman noted that bagging worked best for high variance (unstable) techniques

If the results of each model are near identical, there is little to be gained by averaging

The bagger's resampling from the training data is intended to induce differences in the trees

Accomplished essentially by varying the weight on any data record


Bootstrap and Random Sampling

A bootstrap sample is fairly similar to taking a 50% sample from the original training data
If you grow many trees, each based on a different 50% random sample of your data, you expect some variation in the trees produced
A bootstrap sample goes a bit further in ensuring that the new sample is of the same size as the original by allowing some records to be selected multiple times
In practice the different samples induce different trees, BUT the trees may not be all that different


Random Forests Key Insight: How to minimize inter-tree dependence

The bagger was limited by the fact that even with resampling trees are likely to be somewhat similar to each other, particularly with strong data structure

The more similar the trees the less advantage to combining

Random Forests induces vastly more between-tree difference by forcing splits to be based on different predictors

Accomplished by introducing randomness into split selection


Trade-Off: Individual tree strength vs. advantage of the ensemble

Breiman points out a tradeoff:
As R (the number of randomly selected candidate predictors per node) increases, the strength of an individual tree should increase
However, the correlation between trees also increases, reducing the advantage of combining

Want to select R to optimally balance the two effects
Can only be determined via experimentation

Breiman has suggested three values to test: R = ½ sqrt(M), R = sqrt(M), R = 2 sqrt(M)
For M = 100, test values for R: 5, 10, 20
For M = 400, test values for R: 10, 20, 40
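
An illustrative sketch of that experiment using scikit-learn, where the max_features parameter plays the role of R and the OOB score is used for comparison; the synthetic dataset (M = 100) is an assumption for demonstration.

```python
# Illustrative sketch: try R = 0.5*sqrt(M), sqrt(M), 2*sqrt(M) and compare OOB scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=100, random_state=0)
M = X.shape[1]

for R in (int(0.5 * np.sqrt(M)), int(np.sqrt(M)), int(2 * np.sqrt(M))):  # 5, 10, 20
    rf = RandomForestClassifier(n_estimators=500, max_features=R,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"R = {R:3d}   OOB accuracy = {rf.oob_score_:.3f}")
```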


Random Forests vs CART tree

Random Forests machinery is unlike CART in that:
Only one splitting rule: Gini
Class weight concept but no explicit priors or costs
No surrogates: missing values are imputed for the data first, automatically

Default fast imputation just uses means
A compute-intensive method bases the imputation on tree-based nearest neighbors (discussed later)

None of the display and reporting machinery or tree refinement services of CART

Does follow CART in that all splits are binary


Trees Can be Combined By Voting or Averaging

Trees are combined via voting (classification) or averaging (regression)
Classification trees "vote"

Recall that classification trees classify: they assign each case to ONE class only

With 50 trees, there are 50 class assignments for each case
The winner is the class with the most votes
Votes could be weighted – say by the accuracy of individual trees

Regression trees assign a real predicted value for each case
Predictions are combined via averaging
Results will be much smoother than from a single tree


Bootstrap Resampling Effectively Reweights Training Data (Randomly and Independently)

Probability of being omitted in a single draw is (1 − 1/n)
Probability of being omitted in all n draws is (1 − 1/n)^n

The limit of this series as n increases is 1/e ≈ 0.368, so approximately:
36.8% of the sample is excluded and contributes 0% of the resample
36.8% of the sample is included once, contributing 36.8% of the resample
18.4% of the sample is included twice, contributing 36.8% of the resample
6.1% of the sample is included three times, contributing 18.4% of the resample
1.9% of the sample is included four or more times, contributing about 8% of the resample
Total: 100% of the resample

Example: distribution of weights in a 2,000 record resample:

Times included:   0      1      2      3      4      5      6
Record count:   732    749    359    119     32      6      3
Proportion:     0.366  0.375  0.179  0.060  0.016  0.003  0.002
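
A quick numerical check of the arithmetic above (illustrative sketch): (1 − 1/n)^n approaches 1/e, and a simulated 2,000-record bootstrap gives roughly the weight table shown.

```python
# Verify (1 - 1/n)^n ~ 1/e and simulate the weight distribution of one resample.
import numpy as np

n = 2000
print("(1 - 1/n)^n =", (1 - 1/n) ** n, "   1/e =", 1 / np.e)

rng = np.random.default_rng(0)
draws = rng.integers(0, n, n)                   # draw n records with replacement
times_drawn = np.bincount(draws, minlength=n)   # weight of each original record
for k, c in enumerate(np.bincount(times_drawn)):
    print(f"included {k} times: {c:5d} records ({c / n:.3f})")
```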


Bootstrap Aggregation Performance Gains

Statlog Data Set Summary:

Data Set     # Training   # Variables   # Classes   # Test Set
Letters          15,000            16          26        5,000
Satellite         4,435            36           6        2,000
Shuttle          43,500             9           7       14,500
DNA               2,000            60           3        6,186

Test Set Misclassification Rate (%):

Data Set     1 Tree     Bag     Decrease
Letters        12.6     6.4          49%
Satellite      14.8    10.3          30%
Shuttle       0.062   0.014          77%
DNA             6.2     5.0          19%


A Working Example: Mass Spectra Analysis

Want to use mass spectrometer data to classify different types of prostate cancer
772 observations available

398 – healthy samples
178 – 1st type of cancer samples
196 – 2nd type of cancer samples

111 mass spectra measurements are recorded for each sample


Conventional CART – A Baseline

The above table shows the cross-validated prediction success results of a single CART tree for the prostate data
The run was conducted under PRIORS DATA to facilitate comparisons with the subsequent RF run
The relative error corresponds to an absolute error of 30.4%


Randomness in split selection

A topic discussed by several machine learning researchers. Possibilities:

Select the splitter, the split point, or both at random
Choose the splitter at random from the top K splitters

Random Forests: Suppose we have M available predictors
Select R eligible splitters at random and let the best of them split the node
If R = 1 this is just random splitter selection
If R = M this becomes Breiman's bagger
If R << M then we get Breiman's Random Forests

Breiman suggests R=sqrt(M) as a good rule of thumb


Strength of a Single Random Forest Tree

The performance of a single tree will be somewhat driven by the number of candidate predictors allowed at each node
Consider R = 1: the splitter is always chosen at random

Performance could be quite weak

As relevant splitters get into the tree and the tree is allowed to grow massively, a single tree can be predictive even if R = 1
As R is allowed to increase, the quality of splits can improve as there will be better (and more relevant) splitters


Performance of RF versus Single Tree as a function of Nvars

[Figure: error rate versus Nvars (number of variables searched at each split, 0–120) for a single tree ("1st Tree") and the 100-tree forest ("100 Trees")]

In this experiment, we ran RF with 100 trees on the prostate data using different values for the number of variables Nvars searched at each split


Comments

RF clearly outperforms a single tree for any number of Nvars

We saw above that a properly pruned tree gives a cross-validated absolute error of 30.4% (the very right end of the red curve)

The performance of a single tree varies substantially with the number of predictors allowed to be searched
The RF reaches a stable error rate of about 20% when only 10 variables are searched in each node (marked by the blue line)
With minor fluctuation, the error rate remains stable for Nvars above 10

This agrees with Breiman's suggestion to use the square root of the number of predictors (sqrt(111)) as a good value for Nvars

The performance for smaller values of Nvars can usually be improved by increasing the number of trees


Initial RF Run - GUI

Error rate of 19.4%


Conventional CART – Baseline Repeated

The absolute error is 30.4%
The single CART tree is a little better for class 2 but quite a bit worse for class 0
The CART tree is evaluated by CV whereas RF is evaluated by OOB, so there is a difference in how performance is evaluated


Initial RF Run - Classic

The above results correspond to a standard RF run with 500 trees, Nvars=15, and unit class weights
The overall error rate of 19.4% is about 2/3 of the baseline CART error rate of 30.4%


The OOB Error Estimate

RF does not require the use of a test dataset to report accuracy and does not use conventional cross-validation
For every tree grown, about 37% of the data are left out-of-bag (OOB)
These OOB cases are used as test data to evaluate the performance of the current tree
For any tree in RF, its own OOB sample is used: a true random sample
The final OOB estimate for the entire RF is obtained simply by cumulating the individual OOB results on a case-by-case basis

The error rate for a case is estimated over the subset of trees in which it is OOB

Error estimate is unbiased and behaves as if we had an independent test sample of the same size as the learn sample


Comparing Within-Class Performance: Error rates stabilize, but to quite different rates


Class Weights

The prostate dataset is somewhat unbalanced – class 1 contains fewer records than class 0 or class 2
Under the default RF settings, the minority classes will have higher misclassification rates than the dominant classes
Imbalance in the individual class error rates may also be caused by other data-specific issues
Class Weights are used in RF to boost the accuracy of the specified classes
General Rule of Thumb: to increase accuracy in the given class, one should increase the corresponding class weight
Similar to the PRIORS control used in CART for the same purpose


Manipulating Class Weights

Our next run sets the weight for class 1 to 2
Class 1 is classified with far better accuracy at the cost of reduced accuracy in the remaining classes
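
A comparable experiment can be sketched with scikit-learn, whose class_weight parameter plays the role of the RF class weights; the three-class synthetic data below is an assumption for illustration, not the prostate dataset.

```python
# Illustrative sketch: compare per-class OOB error with default weights and
# with the weight for class 1 doubled.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=800, n_classes=3, n_informative=6,
                           weights=[0.5, 0.2, 0.3], random_state=0)

for cw in (None, {0: 1, 1: 2, 2: 1}):       # default vs. doubled weight on class 1
    rf = RandomForestClassifier(n_estimators=300, class_weight=cw,
                                oob_score=True, random_state=0).fit(X, y)
    oob_pred = rf.oob_decision_function_.argmax(axis=1)   # OOB class assignments
    per_class_err = [np.mean(oob_pred[y == c] != c) for c in (0, 1, 2)]
    print("class_weight =", cw, " per-class OOB error:", np.round(per_class_err, 3))
```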


Class Vote Proportions and Margin

At the end of an RF run, for every record, the proportion of votes for each class represents the probability of class membership
The margin of a case is defined simply as the proportion of votes for the true class minus the maximum proportion of votes for the other classes
The larger the margin, the higher the confidence of the classification
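
A small sketch of the margin computation: the vote share for the true class minus the largest vote share among the other classes (the vote shares below are made-up numbers for illustration).

```python
# Margin = vote share of the true class minus the best competing vote share.
import numpy as np

def margin(vote_proportions, true_class):
    """vote_proportions: (n_cases, n_classes) array of per-class vote shares."""
    p = np.asarray(vote_proportions, dtype=float)
    rows = np.arange(len(p))
    true_share = p[rows, true_class]
    others = p.copy()
    others[rows, true_class] = -np.inf          # exclude the true class
    return true_share - others.max(axis=1)

votes = [[0.70, 0.20, 0.10],   # confidently correct  -> large positive margin
         [0.40, 0.35, 0.25],   # barely correct       -> small positive margin
         [0.25, 0.60, 0.15]]   # misclassified        -> negative margin
print(margin(votes, true_class=np.array([0, 0, 0])))   # [ 0.5   0.05 -0.35]
```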


Variable Importance

The concept of margin allows a new "unbiased" definition of variable importance
To estimate the importance of the m-th variable:

Take the OOB cases for the k-th tree; assume that we already know the margin M0 for those cases

Randomly permute all values of variable m
Apply the k-th tree to the OOB cases with the permuted values
Compute the new margin M
Compute the difference M0 − M

The variable importance is defined as the average lowering of the margin across all OOB cases and all trees in the RF
This procedure is fundamentally different from the intrinsic variable importance scores computed by CART – the latter are always based on the LEARN data and are subject to overfitting issues
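
An illustrative sketch of this permutation idea written against ordinary scikit-learn trees; as an assumption for brevity it measures the drop in OOB accuracy rather than the drop in margin used by RF.

```python
# Permutation importance over OOB cases: scramble one variable at a time and
# record how much OOB accuracy drops, averaged over trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           random_state=0)
rng = np.random.default_rng(0)
n, m = X.shape
importance = np.zeros(m)
n_trees = 100

for _ in range(n_trees):
    rows = rng.integers(0, n, n)
    oob = np.setdiff1d(np.arange(n), rows)
    tree = DecisionTreeClassifier(max_features="sqrt").fit(X[rows], y[rows])
    base_acc = np.mean(tree.predict(X[oob]) == y[oob])
    for j in range(m):
        X_perm = X[oob].copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])    # scramble variable j only
        importance[j] += base_acc - np.mean(tree.predict(X_perm) == y[oob])

print(np.round(importance / n_trees, 4))   # informative variables score highest
```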


Example

The top portion of the variable importance list for the prostate data is shown here
Analysis of the complete list reveals that all 111 variables contribute nearly equally strongly to the model predictions
This is in striking contrast with the single CART tree, which has no choice but to use a limited subset of variables by the tree's construction
This explains why the RF model has a significantly lower error rate (20%) compared to a single CART tree (30%)


Proximity Measure

RF introduces a novel way to define proximity between two observations:

Initialize all proximities to zeroes
For any given tree, apply the tree to all cases
If case i and case j both end up in the same node, increase the proximity prox(ij) between i and j by one
Accumulate over all trees in RF and normalize by twice the number of trees in RF

The resulting matrix of size NxN provides an intrinsic measure of proximity

The measure is invariant to monotone transformations
The measure is clearly defined for any type of independent variables, including categorical
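
An illustrative sketch of this computation using scikit-learn's apply(), which returns the terminal node reached by every case in every tree; as a simplifying assumption this version applies each tree to all cases and normalizes by the number of trees, so the diagonal is exactly 1, which differs slightly from the normalization described above.

```python
# Co-location counting: prox(i, j) = fraction of trees in which i and j share a leaf.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

leaves = rf.apply(X)                        # shape (n_cases, n_trees)
n_cases, n_trees = leaves.shape
prox = np.zeros((n_cases, n_cases))
for t in range(n_trees):
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    prox += same_leaf                       # +1 whenever cases i and j share a leaf
prox /= n_trees

print(prox[:5, :5].round(2))                # ones on the diagonal, as described
```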


Example

The above extract shows the proximity matrix for the top 10 records of the prostate dataset
Note the ones on the main diagonal – any case has "perfect" proximity to itself
Observations that are "alike" will have proximities close to one (these cells have a green background)
The closer the proximity is to 0, the more dissimilar cases i and j are (these cells have a pink background)


Using Proximities

Having the full intrinsic proximity matrix opens new horizons

Informative data views using metric scaling
Missing value imputation
Outlier detection

Unfortunately, things get out of control when the dataset size exceeds 5,000 observations (25,000,000+ cells are needed)
RF switches to a "compressed" form of the proximity matrix to handle large datasets – for any case, only the M closest cases are recorded, where M is usually less than 100


Scaling Coordinates

The values 1 − prox(ij) can be treated as Euclidean distances in a high dimensional space
The theory of metric scaling solves the problem of finding the most representative projections of the underlying data "cloud" onto a low dimensional space using the data proximities

The theory is similar in spirit to principal components analysis and discriminant analysis

The solution is given in the form of ordered "scaling coordinates"
Looking at the scatter plots of the top scaling coordinates provides informative views of the data
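
An illustrative sketch of this step, using scikit-learn's MDS on a precomputed dissimilarity matrix as a stand-in for the metric scaling described here; the forest, data, and two-dimensional projection are assumptions for demonstration.

```python
# Metric scaling on 1 - prox: embed the cases so the top coordinates can be plotted.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

leaves = rf.apply(X)                                       # terminal nodes per tree
prox = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(1.0 - prox)                     # top scaling coordinates
print(coords[:5])                                          # one 2-D point per case
```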


Outlier Detection

Outliers are defined as cases having small proximities to all other cases belonging to the same target class
The following algorithm is used:

For a case n, compute the sum of the squares of prox(nk) for all k in the same class as n
Take the inverse – it will be large if the case is "far away" from the rest
Standardize using the median and standard deviation
Look at the cases with the largest values – those are potential outliers

Generally, a value above 10 is reason to suspect the case of being an outlier
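
An illustrative sketch of that score; as assumptions, it standardizes with the median absolute deviation rather than the standard deviation, and the tiny proximity matrix is made up purely for demonstration.

```python
# Outlier score: inverse of the sum of squared within-class proximities,
# standardized within each class.
import numpy as np

def rf_outlier_scores(prox, y):
    n = len(y)
    raw = np.zeros(n)
    for i in range(n):
        same = (y == y[i]) & (np.arange(n) != i)
        raw[i] = 1.0 / np.sum(prox[i, same] ** 2)        # large if far from own class
    scores = np.zeros(n)
    for c in np.unique(y):
        idx = y == c
        med = np.median(raw[idx])
        spread = max(np.median(np.abs(raw[idx] - med)), 1e-12)
        scores[idx] = (raw[idx] - med) / spread          # large values = suspects
    return scores

rng = np.random.default_rng(0)
prox = rng.uniform(0.2, 1.0, size=(30, 30))
prox = (prox + prox.T) / 2
np.fill_diagonal(prox, 1.0)
prox[0, 1:] = prox[1:, 0] = 0.05        # make case 0 remote from every other case
y = np.repeat([0, 1, 2], 10)
print(rf_outlier_scores(prox, y).round(1))   # case 0 gets by far the largest score
```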


Missing Value Imputation

RF offers two ways of missing value imputation
The Cheap Way – conventional median imputation for continuous variables and mode imputation for categorical variables
A Better Way:

Suppose case n has the x coordinate missing:
1. Do the Cheap Way imputation for starters
2. Grow a full size RF
3. Re-estimate the missing value by a weighted average over all cases k with non-missing x, using weights prox(nk)
4. Repeat steps 2 and 3 several times to ensure convergence
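
An illustrative sketch of this loop for continuous predictors, built from scikit-learn pieces; the function name rf_impute, the proximity construction, and the synthetic data are assumptions, not the Salford implementation.

```python
# "Better Way" imputation sketch: median fill, grow a forest, then replace each
# missing value by the proximity-weighted average of the observed values.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_impute(X, y, missing_mask, n_iter=4, n_trees=200, seed=0):
    X = X.copy()
    col_medians = np.nanmedian(np.where(missing_mask, np.nan, X), axis=0)
    X[missing_mask] = np.take(col_medians, np.where(missing_mask)[1])  # cheap start

    for _ in range(n_iter):
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)
        leaves = rf.apply(X)
        prox = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)
        for i, j in zip(*np.where(missing_mask)):
            donors = ~missing_mask[:, j]               # cases with x_j observed
            weights = prox[i, donors]
            if weights.sum() > 0:
                X[i, j] = np.average(X[donors, j], weights=weights)
    return X

rng = np.random.default_rng(1)
X_full = rng.normal(size=(150, 5))
y = (X_full[:, 0] > 0).astype(int)
mask = rng.random(X_full.shape) < 0.10                 # knock out 10% of the values
X_imputed = rf_impute(np.where(mask, 0.0, X_full), y, mask)
print(np.abs(X_imputed[mask] - X_full[mask]).mean())   # mean absolute imputation error
```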


Parallel Coordinates

An alternative display to view how the target classes are different with respect to the individual predictors

Recall that at the end of an RF run all cases in the dataset obtain K separate vote counts for class membership (assuming K target classes)
Take any target class and sort all observations by the count of votes for this class, descending
Take the top 50 observations and the bottom 50 observations; these are correspondingly the most likely and the least likely members of the given target class
Parallel coordinate plots report uniformly (0,1)-scaled values of all predictors for the top 50 and bottom 50 sorted records, along with the 25th, 50th, and 75th percentiles within each predictor
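
An illustrative sketch of the data preparation behind these plots, using scikit-learn OOB vote proportions on synthetic data (an assumption, not the prostate set): sort cases by vote share for one class, take the top and bottom 50, and rescale every predictor to the (0, 1) range.

```python
# Prepare the top-50 / bottom-50 groups and per-predictor quartiles for plotting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            random_state=0).fit(X, y)

votes = rf.oob_decision_function_            # per-class OOB vote proportions
target_class = 0
order = np.argsort(votes[:, target_class])[::-1]
top50, bottom50 = order[:50], order[-50:]    # most / least likely class-0 members

X01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # (0, 1) scaling
quartiles = np.percentile(X01[top50], [25, 50, 75], axis=0)    # per-predictor quartiles
print(quartiles[:, :5].round(2))
```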


Example

This is a detailed display of the normalized values of the initial 20 predictors for the top-voted 50 records in each target class (this gives 50x3=150 graphs)
Class 0 generally has normalized values of the initial 20 predictors close to 0 (the left side of the variable range), except perhaps X8-X11


Example (continued)

It is easier to see this when looking at the quartile plots only
Note that class 2 tends to have the largest values of the corresponding predictors
The graph can be scrolled forward to view all of the 111 predictors


Example (continued)

The least likely plots lead to roughly similar conclusions: small predictor values are the least likely for class 2, etc.


The Road to Clustering

RF admits an interesting possibility of solving unsupervised learning problems, in particular clustering problems and missing value imputation in the general sense
Recall that in unsupervised learning the concept of a target is not defined
RF generates a synthetic target variable in order to proceed with a regular run:

Give class label 1 to the original data
Create a copy of the data such that each variable is sampled independently from the values available in the original dataset
Give class label 2 to the copy of the data
Note that the second copy has marginal distributions identical to the first copy, whereas any possible dependency among the predictors is completely destroyed
A necessary drawback is that the resulting dataset is twice as large as the original
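
An illustrative sketch of this synthetic-target construction; the function name, the synthetic data, and the use of scikit-learn's forest are assumptions for demonstration.

```python
# Class 1 = original data; class 2 = copy with independently shuffled columns,
# which keeps the marginals but destroys any dependence between predictors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_unsupervised_rf_data(X, seed=0):
    rng = np.random.default_rng(seed)
    X_synth = np.column_stack([rng.permutation(col) for col in X.T])
    X_all = np.vstack([X, X_synth])
    y_all = np.concatenate([np.ones(len(X)), 2 * np.ones(len(X))])   # labels 1 and 2
    return X_all, y_all

# If the forest separates the two copies easily (low OOB error), the original
# predictors carry strong joint structure worth clustering on.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=300)       # inject dependence
X_all, y_all = make_unsupervised_rf_data(X)
rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            random_state=0).fit(X_all, y_all)
print("OOB error:", round(1 - rf.oob_score_, 3))
```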


Analyzing the Synthetic Data

We now have a clear binary supervised learning problem
Running an RF on this dataset may provide the following insights:

When the resulting misclassification error is high (above 50%), the variables are basically independent – no interesting structure exists
Otherwise, the dependency structure can be further studied by looking at the scaling coordinates and exploiting the proximity matrix in other ways
For instance, the resulting proximity matrix can be used as an important starting point for a subsequent hierarchical clustering analysis

Recall that the proximity measures are invariant to monotone transformations and naturally support categorical variables
The same missing value imputation procedure as before can now be employed
These techniques work extremely well for small datasets


Example

We generated a synthetic dataset based on the prostate data
The resulting dataset still has 111 predictors but twice the number of records – the first half being an exact replica of the original data

The final error is only 0.2% which is an indication of a very strong dependency among the predictors


References

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Breiman, L. (1996). Arcing classifiers (Technical Report). Berkeley: Statistics Department, University of California.
Buntine, W. (1991). Learning classification trees. In D. J. Hand, ed., Artificial Intelligence Frontiers in Statistics, Chapman and Hall: London, 182-201.
Dietterich, T. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40, 139-158.
Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National Conference, Morgan Kaufmann, 148-156.
Friedman, J. H. (1999). RandomForests. Stanford: Statistics Department, Stanford University.
Friedman, J. H. (1999). Greedy function approximation: A gradient boosting machine. Stanford: Statistics Department, Stanford University.
Heath, D., Kasif, S., & Salzberg, S. (1993). k-dt: A multi-tree learning method. Proceedings of the Second International Workshop on Multistrategy Learning, 1002-1007, Morgan Kaufmann: Chambery, France.
Kwok, S., & Carter, C. (1990). Multiple decision trees. In Shachter, R., Levitt, T., Kanal, L., & Lemmer, J., eds., Uncertainty in Artificial Intelligence 4, North-Holland, 327-335.