Trees, Bagging, Random Forests and Boosting - MSRI
Boosting Trevor Hastie, Stanford University 1
Trees, Bagging, Random Forests and Boosting
• Classification Trees
• Bagging: Averaging Trees
• Random Forests: Cleverer Averaging of Trees
• Boosting: Cleverest Averaging of Trees
Methods for improving the performance of weak learners such as Trees. Classification trees are adaptive and robust, but do not generalize well. The techniques discussed here enhance their performance considerably.
Two-class Classification
• Observations are classified into two or more classes, coded by a response variable Y taking values 1, 2, . . . , K.
• We have a feature vector X = (X1, X2, . . . , Xp), and we hope to build a classification rule C(X) to assign a class label to an individual with feature X.
• We have a sample of pairs (yi, xi), i = 1, . . . , N . Note that each xi is a vector xi = (xi1, xi2, . . . , xip).
• Example: Y indicates whether an email is spam or not. X represents the relative frequency of a subset of specially chosen words in the email message.
• The technology described here estimates C(X) directly, or via the probability function P (C = k|X).
Classification Trees
• Represented by a series of binary splits.
• Each internal node represents a value query on one of the variables — e.g. “Is X3 > 0.4”. If the answer is “Yes”, go right, else go left.
• The terminal nodes are the decision nodes. Typically each terminal node is dominated by one of the classes.
• The tree is grown using training data, by recursive splitting.
• The tree is often pruned to an optimal size, evaluated by cross-validation.
• New observations are classified by passing their X down to a terminal node of the tree, and then using majority vote.
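As a minimal plain-Python sketch, classifying a new observation is just a sequence of split queries ending in a terminal-node label. The splits and class labels below follow the small example tree on the next slide; everything else (the dict-based feature encoding, the test points) is invented for illustration:

```python
# Minimal sketch: pass one observation down a fixed, already-grown tree.
# Splits (x2 < 0.39, then x3 < -1.575) and terminal-node labels follow the
# small example tree; a real tree would be grown by recursive splitting.

def classify(x):
    """x: dict of feature values; returns the majority class of the
    terminal node that x lands in."""
    if x["x2"] < 0.39:              # root query: "Is x2 < 0.39?" -> go left
        if x["x3"] < -1.575:        # second query on the left branch
            return 1                # terminal node dominated by class 1
        return 0                    # terminal node dominated by class 0
    return 1                        # right branch is a terminal node, class 1

print(classify({"x2": 0.10, "x3": -2.0}))  # -> 1
print(classify({"x2": 0.10, "x3": 0.0}))   # -> 0
```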
[Figure: a small classification tree. The root node (10/30, class 0) splits on x.2 < 0.39; the left child (3/21, class 0) splits on x.3 < −1.575 into terminal nodes 2/5 (class 1) and 0/16 (class 0); the right child is the terminal node 2/9 (class 1).]

Classification Tree
Properties of Trees
✔ Can handle huge datasets
✔ Can handle mixed predictors—quantitative and qualitative
✔ Easily ignore redundant variables
✔ Handle missing data elegantly
✔ Small trees are easy to interpret
✖ Large trees are hard to interpret
✖ Often prediction performance is poor
Example: Predicting e-mail spam
• data from 4601 email messages
• goal: predict whether an email message is spam (junk email) or good.
• input features: relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training email messages.
• for this problem not all errors are equal; we want to avoid filtering out good email, while letting spam get through is not desirable but less serious in its consequences.
• we coded spam as 1 and email as 0.
• A system like this would be trained for each user separately (e.g. their word lists would be different)
Predictors
• 48 quantitative predictors—the percentage of words in the email that match a given word. Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users.
• 6 quantitative predictors—the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.
• The average length of uninterrupted sequences of capital letters: CAPAVE.
• The length of the longest uninterrupted sequence of capital letters: CAPMAX.
• The sum of the lengths of uninterrupted sequences of capital letters: CAPTOT.
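A quick sketch of how the three capitalization features could be computed (plain Python; the sample message is made up):

```python
# Sketch: derive CAPAVE, CAPMAX and CAPTOT from the uninterrupted runs of
# capital letters in a message.
import re

def capital_run_features(text):
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]  # maximal capital runs
    if not runs:
        return 0.0, 0, 0
    return (sum(runs) / len(runs),  # CAPAVE: average run length
            max(runs),              # CAPMAX: longest run
            sum(runs))              # CAPTOT: total capitals over all runs

# Runs in this made-up message: "FREE" (4), "NOW" (3), "C" (1), "G" (1)
print(capital_run_features("FREE money NOW!! Call George"))  # -> (2.25, 4, 9)
```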
Details
• A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set.
• A full tree was grown on the training set, with splitting continuing until a minimum bucket size of 5 was reached.
• This bushy tree was pruned back using cost-complexity pruning, and the tree size was chosen by 10-fold cross-validation.
• We then compute the test error and ROC curve on the test data.
Some important features
39% of the training data were spam.
Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.
         george   you  your    hp  free   hpl
spam       0.00  2.26  1.38  0.02  0.52  0.01
email      1.27  1.27  0.44  0.90  0.07  0.43

              !   our    re   edu  remove
spam       0.51  0.51  0.13  0.01    0.28
email      0.11  0.18  0.42  0.29    0.01
[Figure: the pruned classification tree for the spam data. Each node shows its misclassification count (600/1536 at the root); terminal nodes are labeled spam or email. Splits include ch$ < 0.0555, remove < 0.06, ch! < 0.191, george < 0.005, hp < 0.03, CAPMAX < 10.5, receive < 0.125, edu < 0.045, our < 1.2, CAPAVE < 2.7505, free < 0.065, business < 0.145, george < 0.15, hp < 0.405, CAPAVE < 2.907, and 1999 < 0.58.]
[Figure: ROC curve (sensitivity vs specificity) for the pruned tree on the SPAM data; TREE error: 8.7%.]
SPAM Data
Overall error rate on the test data: 8.7%. The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if P (+1|X) > c0. Sensitivity: proportion of true spam identified. Specificity: proportion of true email identified.
We may want specificity to be high, and suffer some spam: Specificity: 95% =⇒ Sensitivity: 79%
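The trade-off above can be sketched in a few lines of plain Python (the scores and labels below are invented for illustration):

```python
# Sketch: sensitivity and specificity of the thresholded classifier
# C(X) = +1 (spam) if P(+1|X) > c0, at a single threshold c0.

def sens_spec(probs, labels, c0):
    """probs: estimated P(spam|x); labels: 1 = spam, 0 = email."""
    pred = [1 if p > c0 else 0 for p in probs]
    tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 0)
    sensitivity = tp / sum(labels)                   # true spam identified
    specificity = tn / (len(labels) - sum(labels))   # true email identified
    return sensitivity, specificity

probs  = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # made-up P(spam|x) values
labels = [1,   1,   0,   1,   0,   0]
print(sens_spec(probs, labels, 0.5))
```

Sweeping c0 from 0 to 1 and plotting the resulting (specificity, sensitivity) pairs traces out the ROC curve.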
[Figure: ROC curves (sensitivity vs specificity) for TREE vs SVM on the SPAM data; SVM error: 6.7%, TREE error: 8.7%.]
TREE vs SVM
Comparing ROC curves on the test data is a good way to compare classifiers. SVM dominates TREE here.
Toy Classification Problem
[Figure: scatter plot of the two classes, labeled 0 and 1, in (X1, X2); Bayes Error Rate: 0.25.]
• Data X and Y , with Y taking values +1 or −1.
• Here X = (X1, X2).
• The black boundary is the Bayes Decision Boundary - the best one can do.
• Goal: Given N training pairs (Xi, Yi), produce a classifier C(X) ∈ {−1, 1}.
• Also estimate the probability of the class labels P (Y = +1|X).
Toy Example - No Noise
[Figure: scatter plot of the two classes, labeled 0 and 1, in (X1, X2); Bayes Error Rate: 0.]
• Deterministic problem; noise comes from the sampling distribution of X.
• Use a training sample of size 200.
• Here the Bayes Error is 0%.
Classification Tree
[Figure: classification tree grown on the training sample of size 200. The root (94/200) splits at x.2 < −1.06711; further splits are at x.2 < 1.14988, x.1 < 1.13632, x.1 < −0.900735, x.1 < −1.1668, x.1 < −1.07831, and x.2 < −0.823968, ending in terminal nodes labeled by majority class.]
Decision Boundary: Tree
[Figure: decision boundary produced by the classification tree in (X1, X2); Error Rate: 0.073.]
When the nested spheres are in 10 dimensions, a classification tree produces a rather noisy and inaccurate rule C(X), with error rates around 30%.
Model Averaging
Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
• Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
• Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
• Random Forests (Breiman 1999): Fancier version of bagging.
In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.
Bagging
Bagging or bootstrap aggregation averages a given procedure over many samples, to reduce its variance — a poor man’s Bayes. See pp 246.
Suppose C(S, x) is a classifier, such as a tree, based on our training data S, producing a predicted class label at input point x.
To bag C, we draw bootstrap samples S∗1, . . . , S∗B, each of size N, with replacement from the training data. Then
Cbag(x) = Majority Vote {C(S∗b, x) : b = 1, . . . , B}.
Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However any simple structure in C (e.g. a tree) is lost.
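A toy sketch of the recipe (plain Python; the base learner is a one-variable stump rather than a full tree, and the six training pairs are invented):

```python
# Sketch of bagging: grow a classifier on each bootstrap sample S*b, then
# combine by majority vote. The base learner here is a depth-1 stump on a
# single feature x, with labels y in {0, 1}.
import random

def fit_stump(sample):
    """Pick the split 'x < t' (and left-side label) with fewest errors."""
    best = None
    for t, _ in sample:
        for left in (0, 1):
            errs = sum(1 for x, y in sample
                       if (left if x < t else 1 - left) != y)
            if best is None or errs < best[0]:
                best = (errs, t, left)
    _, t, left = best
    return lambda x: left if x < t else 1 - left

def bag(train, B=25, seed=0):
    rng = random.Random(seed)
    clfs = [fit_stump([rng.choice(train) for _ in train])  # N draws, with replacement
            for _ in range(B)]
    # C_bag(x) = majority vote over the B bootstrap classifiers
    return lambda x: 1 if sum(c(x) for c in clfs) * 2 > B else 0

train = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
C_bag = bag(train)
print(C_bag(0.15), C_bag(0.85))
```

Each bootstrap stump splits at a slightly different point; the vote averages away that instability.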
[Figure: the original tree (root split x.2 < 0.39, then x.3 < −1.575) alongside five trees grown on bootstrap samples. The bootstrap trees split at slightly different points (x.2 < 0.36, x.2 < 0.39, x.4 < 0.395, x.2 < 0.255, x.2 < 0.38) and differ in structure and terminal-node labels.]
Decision Boundary: Bagging
[Figure: decision boundary produced by bagging in (X1, X2); Error Rate: 0.032.]
Bagging averages many trees, and produces smoother decision boundaries.
Random forests
• refinement of bagged trees; quite popular
• at each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2 p, where p is the number of features.
• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the “out-of-bag” error rate.
• random forests try to improve on bagging by “de-correlating” the trees. Each tree has the same expectation.
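The de-correlating device can be sketched in isolation (plain Python; only the random draw of candidate features is shown, not the split search that would follow):

```python
# Sketch: at each split, draw m = sqrt(p) candidate features; only these are
# searched for the best split. A fresh draw at every split de-correlates
# the trees in the forest.
import math
import random

def candidate_features(p, rng):
    m = max(1, round(math.sqrt(p)))   # typical choice m = sqrt(p) (or log2 p)
    return rng.sample(range(p), m)    # m distinct feature indices

rng = random.Random(0)
p = 57                                # e.g. the 57 spam predictors
for split in range(3):                # three successive splits in one tree
    print(candidate_features(p, rng))
```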
[Figure: ROC curves (sensitivity vs specificity) for TREE, SVM and Random Forest on the SPAM data.]
• e−yF (x) is a monotone, smooth upper bound on misclassification loss at x.
• Leads to a simple reweighting scheme.
• Has the logit transform as population minimizer:

f∗(x) = (1/2) log [ Pr(Y = 1|x) / Pr(Y = −1|x) ]

• Other, more robust loss functions, such as the binomial deviance, can also be used.
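Both claims are easy to check numerically (plain Python; the value p = 0.8 is an arbitrary illustration):

```python
# Check 1: exp(-y*F) upper-bounds the misclassification loss 1[y*F <= 0]
# at every margin y*F(x).
import math

for yF in (-2.0, -0.5, 0.0, 0.5, 2.0):
    misclass = 1.0 if yF <= 0 else 0.0
    assert math.exp(-yF) >= misclass

# Check 2: the population minimizer of E[exp(-Y f)] at x is the half-logit
# f*(x) = (1/2) log(Pr(Y=1|x) / Pr(Y=-1|x)).
p = 0.8                                    # arbitrary Pr(Y = 1 | x)
f_star = 0.5 * math.log(p / (1 - p))
risk = lambda f: p * math.exp(-f) + (1 - p) * math.exp(f)
assert risk(f_star) < risk(f_star + 0.1)   # nudging f in either direction
assert risk(f_star) < risk(f_star - 0.1)   # increases the expected loss
print("exponential-loss bound and minimizer verified")
```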
General Stagewise Algorithm
We can do the same for more general loss functions, not only least squares.
1. Initialize f0(x) = 0.
2. For m = 1 to M :
(a) Compute (βm, γm) = arg min over (β, γ) of Σi=1..N L(yi, fm−1(xi) + β b(xi; γ)).
(b) Set fm(x) = fm−1(x) + βm b(x; γm).
Sometimes we replace step (b) in item 2 by
(b∗) Set fm(x) = fm−1(x) + ν βm b(x; γm)
Here ν is a shrinkage factor, and often ν < 0.1. Shrinkage slows the stagewise model-building even more, and typically leads to better performance.
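A self-contained sketch of the stagewise loop with shrinkage (plain Python; squared-error loss, simple stump basis functions b(x; γ) = 1[x ≥ γ], and invented one-dimensional data, so this is an illustration of the recipe, not the general algorithm):

```python
# Stagewise fitting with shrinkage: at each step, find the best basis
# function and coefficient for the current residuals, then take only a
# fraction nu of that step (rule (b*) above). Loss here is squared error.

def stagewise(xs, ys, M=500, nu=0.1):
    f = [0.0] * len(xs)                          # step 1: f_0(x) = 0
    for m in range(M):
        resid = [y - fi for y, fi in zip(ys, f)]
        best = None
        for gamma in xs:                         # candidate stumps 1[x >= gamma]
            b = [1.0 if x >= gamma else 0.0 for x in xs]
            sb = sum(bi * bi for bi in b)
            beta = sum(r * bi for r, bi in zip(resid, b)) / sb
            loss = sum((r - beta * bi) ** 2 for r, bi in zip(resid, b))
            if best is None or loss < best[0]:
                best = (loss, beta, gamma)
        _, beta, gamma = best                    # (beta_m, gamma_m)
        f = [fi + nu * beta * (1.0 if x >= gamma else 0.0)  # f_m = f_{m-1} + nu*beta*b
             for fi, x in zip(f, xs)]
    return f

xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [0.0, 0.1, 0.1, 0.9, 1.0, 1.1]              # a step-like target
fitted = stagewise(xs, ys)
print([round(v, 2) for v in fitted])
```

With ν = 0.1 each step moves only a tenth of the way toward the greedy fit, so many more steps are taken, but the final fit tracks the target closely.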
Gradient Boosting
• General boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling.
• Gradient Boosting builds additive tree models, for example, for representing the logits in logistic regression.
• Tree size is a parameter that determines the order of interaction (next slide).
• Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.
• Gradient Boosting is described in detail in section 10.10.
[Figure: test error vs number of terms (0 to 400) for boosted stumps, 10-node trees, 100-node trees, and AdaBoost.]
Tree Size
The tree size J determines the interaction order of the model:

η(X) = Σj ηj(Xj) + Σjk ηjk(Xj, Xk) + Σjkl ηjkl(Xj, Xk, Xl) + · · ·
Stumps win!
Since the true decision boundary is the surface of a sphere, the function that describes it has the form

f(X) = X1² + X2² + · · · + Xp² − c = 0.
Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions.
[Figure: Coordinate Functions for Additive Logistic Trees: the fitted functions f1(x1), f2(x2), . . . , f10(x10).]
Spam Example Results
With 3000 training and 1500 test observations, Gradient Boosting fits an additive logistic model

f(x) = log [ Pr(spam|x) / Pr(email|x) ]

using trees with J = 6 terminal nodes.
Gradient Boosting achieves a test error of 4%, compared to 5.3% for an additive GAM, 5.0% for Random Forests, and 8.7% for CART.
Spam: Variable Importance
[Figure: relative importance (0 to 100) of the 57 predictors. The most important are !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, and CAPTOT; importance tapers off through edu, you, our, money, will, 1999, business, and re, down to rarely useful predictors such as telnet, labs, addresses, and 3d.]
Spam: Partial Dependence
[Figure: partial dependence of the log-odds of spam on four predictors. Partial dependence rises from about −0.2 to 1.0 with increasing frequency of ! and remove, and falls from about 0.2 to −1.0 with increasing frequency of edu and hp.]
Comparison of Learning Methods
Some characteristics of different learning methods, each rated good, fair, or poor.

[Table: Neural Nets, SVM, CART, GAM, KNN/Kernel, and Gradient Boost compared on: natural handling of data of “mixed” type; handling of missing values; robustness to outliers in input space; insensitivity to monotone transformations of inputs; computational scalability (large N); ability to deal with irrelevant inputs; ability to extract linear combinations of features; interpretability; predictive power. The ratings are color-coded in the original and not recoverable here.]
Software
• R: free GPL statistical computing environment available from CRAN, implements the S language. Includes:
– randomForest: implementation of Leo Breiman’s algorithms.
– rpart: Terry Therneau’s implementation of classification and regression trees.
– gbm: Greg Ridgeway’s implementation of Friedman’s gradient boosting algorithm.
• Salford Systems: Commercial implementation of trees, random forests and gradient boosting.
• Splus (Insightful): Commercial version of S.
• Weka: GPL software from University of Waikato, New Zealand. Includes Trees, Random Forests and many other procedures.