Improve Your Regression: CART® and RandomForests®
Charles Harrison, Marketing Statistician
Outline
Applications of CART and Random Forests
Ordinary Least Squares Regression
– A review
– Common issues in standard linear regression
Data Description
Improving your regression with an applied example
– CART decision tree
– Random Forest
Conclusions
Applications
In this webinar we use CART® software and RandomForests® software to predict concrete strength, but as we will see these techniques can be applied to any field
Quantitative Targets: Number of Cavities, Blood Pressure, Income etc.
Qualitative Targets: Disease or No Disease; Buy or Not Buy; Lend or Do Not Lend; Buy Product A vs Product B vs. Product C vs. Product D
Examples: Credit Risk, Glaucoma Screening, Insurance Fraud, Customer Loyalty, Drug Discovery, Early Identification of Reading Disabilities, Biodiversity and Wildlife Conservation
Preview: CART and Random Forest Advantages
As we will see in this presentation both CART and Random Forests have desirable properties that allow you to build accurate predictive models with dirty data (i.e. missing values, lots of variables, nonlinear relationships, outliers etc.)
[Preview figures: geometry of a CART tree with 1 split and with 2 splits]
Preview: Model Performance
Method                                           Test MSE
Linear Regression                                109.04
Linear Regression with interactions               67.35
Min 1 SE CART (default settings)                  65.05
CART (default settings)                           55.99
RandomForests® (default settings)                 37.70
Improved RandomForests® using an SPM Automate     36.02

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$
What is OLS?
OLS: ordinary least squares regression
– Discovered by Legendre (1805) and Gauss (1809) to solve problems in astronomy using pen and paper
The model is of the form

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_p x_p$

– $\beta_0$: the intercept term
– $\beta_1, \beta_2, \beta_3, \dots$: coefficient estimates
– $x_1, x_2, x_3, \dots, x_p$: predictor variables (i.e. columns in the dataset)
Example: Income = 20,000 + 2,500*WorkExperience + 1,000*EducationYears
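To make the notation concrete, here is a minimal sketch (not part of the original webinar) of fitting an OLS model in Python with scikit-learn; the data are simulated from the hypothetical income equation above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulate data from the hypothetical income equation above
rng = np.random.default_rng(0)
work_exp = rng.uniform(0, 30, 200)
edu_years = rng.uniform(8, 20, 200)
income = 20_000 + 2_500 * work_exp + 1_000 * edu_years + rng.normal(0, 5_000, 200)

X = np.column_stack([work_exp, edu_years])
ols = LinearRegression().fit(X, income)
print("beta_0 (intercept):", round(ols.intercept_, 1))
print("beta_1, beta_2:", ols.coef_.round(1))  # roughly 2500 and 1000
```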
Common Issues in Regression
Missing values
– Requires imputation, OR
– Results in record deletion
Nonlinearities and local effects
– Example: $Y = 10 + 3x_1 + x_2 - 0.3x_1^2$
– Modeled via manual transformations, or terms are automatically added and then selected via forward, backward, stepwise, or regularized selection
– Local effects are ignored unless specified by the analyst, but this is very difficult or impossible in practice without subject-matter expertise or prior knowledge
Interactions
– Example: $Y = 10 + 3x_1 - 2x_2 + 0.25x_1 x_2$
– Manually added to the model (or through some automated procedure)
– Add interactions, then use variable selection (i.e. regularized regression or forward, backward, or stepwise selection)
Variable selection
– Usually accomplished manually or in combination with automated selection procedures
Solutions to OLS Problems
Two methods that do not suffer from the drawbacks of linear regression are CART and Random Forests
These methods automatically
– Handle missing values
– Model nonlinear relationships and local effects
– Select variables
– Model variable interactions
Concrete Strength
Target:
– STRENGTH: compressive strength of concrete in megapascals
Predictors:
– CEMENT
– BLAST_FURNACE_SLAG
– FLY_ASH
– WATER
– SUPERPLASTICIZER
– COARSE_AGGREGATE
– FINE_AGGREGATE
– AGE
I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998)
Why predict concrete strength?
Concrete is one of the most important materials in our society and is a key ingredient in important infrastructure projects like bridges, roads, buildings, and dams (MATSE)
Predicting the strength of concrete is important because concrete strength is a key component of the overall stability of these structures
Source: http://matse1.matse.illinois.edu/concrete/prin.html
Data Sample
Cement  BlastFurnaceSlag  FlyAsh  Water  Superplasticizer  CoarseAggregate  FineAggregate  Age  Strength
540 0 0 162 2.5 1040 676 28 79.98611076
540 0 0 162 2.5 1055 676 28 61.88736576
332.5 142.5 0 228 0 932 594 270 40.26953526
332.5 142.5 0 228 0 932 594 365 41.05277999
198.6 132.4 0 192 0 978.4 825.5 360 44.2960751
266 114 0 228 0 932 670 90 47.02984744
380 95 0 228 0 932 594 365 43.6982994
380 95 0 228 0 932 594 28 36.44776979
266 114 0 228 0 932 670 28 45.85429086
475 0 0 228 0 932 594 28 39.28978986
198.6 132.4 0 192 0 978.4 825.5 90 38.07424367
198.6 132.4 0 192 0 978.4 825.5 28 28.02168359
427.5 47.5 0 228 0 932 594 270 43.01296026
190 190 0 228 0 932 670 90 42.32693164
304 76 0 228 0 932 670 28 47.81378165
380 0 0 228 0 932 670 90 52.90831981
Regression Results
Method                                  Test MSE
Linear Regression                       109.04
Linear Regression with interactions      67.35

Fitted model:
Strength = -9.70 + 0.115*Cement + 0.01*BlastFurnaceSlag + 0.014*FlyAsh - 0.172*Water + 0.10*Superplasticizer + 0.01*CoarseAggregate + 0.01*FineAggregate + 0.11*Age

$\text{Test MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$

**Test sample: 20% of observations were randomly selected for the testing dataset
**This same test dataset was used to evaluate all models for the purpose of comparison
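For readers working outside SPM, a minimal Python sketch of the same evaluation protocol (an 80/20 random split and test MSE); here X and y stand for the eight concrete predictors and STRENGTH, and loading the data is omitted.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X: the eight predictors; y: STRENGTH (loading the dataset is omitted here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```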
Classification And Regression Trees
Authors: Breiman, Friedman, Olshen, and Stone (1984)
CART is a decision tree algorithm used for both regression and classification problems
1. Classification: tries to separate classes by choosing variables and points that best separate them
2. Regression: chooses the best variables and split points for reducing the squared or absolute error criterion
CART is available exclusively in the SPM® 8 Software Suite and was developed in close consultation with the original authors
CART: Introduction
Main Idea: divide the predictor variables (often people say "partition" instead of "divide") into different regions so that the dependent variable can be predicted more accurately.
The figure below shows the predicted values from a CART tree (the red horizontal bars) fit to the curve Y = x^2 + noise, where the noise is N(0, 1).
[Figure: CART step-function fit to Y = x^2 + noise; axes Y vs. x]
CART: Terminology
A tree split occurs when a variable is partitioned (an in-depth example starts after the next slide). This tree has two splits:
1. AGE_DAY <= 21
2. CEMENT_AMT <= 355.95
The node at the top of the tree is called the root node
A node that has no sub-branch is a terminal node
This tree has three terminal nodes (i.e. red boxes in the tree)
[Tree diagram: the root node (Node 1: N = 825, Avg = 36.036, STD = 16.661) splits at AGE_DAY <= 21.00; Terminal Node 1: N = 260, Avg = 23.944; the AGE_DAY > 21 branch (Node 2: N = 565, Avg = 41.600) splits at CEMENT_AMT <= 355.95 into Terminal Node 2 (N = 436, Avg = 37.036) and Terminal Node 3 (N = 129, Avg = 57.026).]
The predicted value in a CART regression model is the average of the target variable (i.e. "Y") for the records that fall into a given terminal node
Example: if Age = 26 days and the amount of cement is 400, then the predicted strength is 57.026 megapascals
CART: Algorithm
Step 1: Grow a large tree (this is done for you automatically)
– All variables are considered at each split in the tree
– Each split is made using one variable and a specific value or set of values
– Splits are chosen so as to minimize model error
– The tree is grown until either a user-specified criterion is met or until the tree cannot be grown further
Step 2: Prune the large tree (this is also done for you automatically)
– Use either a test sample or cross-validation to prune subtrees
CART: Splitting Procedure
Consider the following CART tree grown on this dataset
How exactly do we get this tree?

Y        X1    X2
79.9861 162 28
61.8874 162 28
40.2695 228 270
41.0528 228 365
44.2961 192 360
47.0298 228 90
43.6983 228 365
36.4478 228 28
45.8543 228 28
39.2898 228 28
38.0742 192 90
28.0217 192 28
43.013 228 270
42.3269 228 90
47.8138 228 28
52.9083 228 90
[Tree diagram: Node 1 (N = 16, Avg = 45.748, STD = 11.371) splits at X1 <= 177.00; Terminal Node 1: N = 2, Avg = 70.937; the X1 > 177 branch (Node 2: N = 14, Avg = 42.150) splits at X1 <= 210.00 into Terminal Node 2 (N = 3, Avg = 36.797) and Terminal Node 3 (N = 11, Avg = 43.609).]
CART: Splitting Procedure
Step 1: Find the best split point for the variable X1
– Sort the variable X1
– Compute the split improvement for each split point
– Best split for X1: X1 <= 177
Note: the split point 177 is the midpoint between X1 = 162 and X1 = 192
Original data:
Y       X1    X2
79.99   162   28
61.89   162   28
40.27   228   270
41.05   228   365
44.30   192   360
47.03   228   90
43.70   228   365
36.45   228   28
45.85   228   28
39.29   228   28
38.07   192   90
28.02   192   28
43.01   228   270
42.33   228   90
47.81   228   28
52.91   228   90

Sorted by X1:
Y       X1    X2
79.99   162   28
61.89   162   28
28.02   192   28
38.07   192   90
44.30   192   360
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
41.05   228   365
43.70   228   365
Left partition (X1 <= 177):
Y       X1    X2
79.99   162   28
61.89   162   28

Right partition (X1 > 177):
Y       X1    X2
28.02   192   28
38.07   192   90
44.30   192   360
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
41.05   228   365
43.70   228   365
[Tree diagram: Node 1 (N = 16, Avg = 45.748, STD = 11.371) splits at X1 <= 177.00 into Terminal Node 1 (N = 2, Avg = 70.937, STD = 9.049) and Terminal Node 2 (N = 14, Avg = 42.150, STD = 5.700).]
Split Improvement: $\Delta R(s,t) = R(t) - R(t_L) - R(t_R)$, where $R(t) = \frac{1}{N}\sum_{x_n \in t} \left(y_n - \bar{y}(t)\right)^2$ (least squares)
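A minimal Python sketch of this computation (an illustration, not SPM code): it evaluates the split improvement for the candidate split X1 <= 177 on the 16-record toy dataset and reproduces the improvement value of roughly 90.6 reported a few slides below.

```python
import numpy as np

def r(y_node, n_total):
    # R(t) = (1/N) * sum over the node of (y_n - node average)^2
    y_node = np.asarray(y_node, dtype=float)
    return np.sum((y_node - y_node.mean()) ** 2) / n_total

def split_improvement(y, x, threshold):
    # Delta R(s, t) = R(t) - R(t_L) - R(t_R) for the split x <= threshold
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    left = x <= threshold
    n = len(y)
    return r(y, n) - r(y[left], n) - r(y[~left], n)

# The toy dataset from the slides, sorted by X1
y  = [79.99, 61.89, 28.02, 38.07, 44.30, 36.45, 45.85, 39.29,
      47.81, 47.03, 42.33, 52.91, 40.27, 43.01, 41.05, 43.70]
x1 = [162, 162, 192, 192, 192, 228, 228, 228,
      228, 228, 228, 228, 228, 228, 228, 228]

print(split_improvement(y, x1, 177))  # about 90.6, matching the slides
```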
CART: Splitting Procedure
Step 2: Find the best split point for the variable X2
– Sort the variable X2
– Compute the split improvement for each split point
– Best split for X2: X2 <= 59
Note: the split point 59 is the midpoint between X2 = 28 and X2 = 90
Sorted by X2:
Y       X1    X2
79.99   162   28
61.89   162   28
28.02   192   28
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
38.07   192   90
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
44.30   192   360
41.05   228   365
43.70   228   365
[Tree diagram: Node 1 (N = 16, Avg = 45.748, STD = 11.371) splits at X2 <= 59.00 into Terminal Node 1 (N = 7, Avg = 48.472, STD = 16.158) and Terminal Node 2 (N = 9, Avg = 43.630, STD = 4.069).]

Left partition (X2 <= 59):
Y       X1    X2
79.99   162   28
61.89   162   28
28.02   192   28
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28

Right partition (X2 > 59):
Y       X1    X2
38.07   192   90
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
44.30   192   360
41.05   228   365
43.70   228   365
Split Improvement: $\Delta R(s,t) = R(t) - R(t_L) - R(t_R)$, where $R(t) = \frac{1}{N}\sum_{x_n \in t} \left(y_n - \bar{y}(t)\right)^2$ (least squares)
CART: Splitting Procedure
At this point CART has evaluated all possible split points for our two variables, X1 and X2, and determined the optimal split point for each.
Splitting on X1 or on X2 yields a different tree, so which split is best? The one with the largest split improvement.
– Best split for X1: X1 <= 177; improvement value: 90.64
– Best split for X2: X2 <= 59; improvement value: 5.77
[Tree diagrams: the candidate X1 <= 177.00 split (Terminal Node 1: N = 2, Avg = 70.937; Terminal Node 2: N = 14, Avg = 42.150) and the candidate X2 <= 59.00 split (Terminal Node 1: N = 7, Avg = 48.472; Terminal Node 2: N = 9, Avg = 43.630), each grown from the root node (N = 16, Avg = 45.748).]
Split Improvement: $\Delta R(s,t) = R(t) - R(t_L) - R(t_R)$, where $R(t) = \frac{1}{N}\sum_{x_n \in t} \left(y_n - \bar{y}(t)\right)^2$ (least squares)
CART: 1st Split
Our best first split in the tree is X1 <= 177, which leads to the following tree and partitioned dataset:
Terminal Node 1 (X1 <= 177):
Y       X1    X2
79.99   162   28
61.89   162   28

Terminal Node 2 (X1 > 177):
Y       X1    X2
28.02   192   28
38.07   192   90
44.30   192   360
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
41.05   228   365
43.70   228   365
[Tree diagram: Node 1 (N = 16, Avg = 45.748) splits at X1 <= 177.00 into Terminal Node 1 (N = 2, Avg = 70.937) and Terminal Node 2 (N = 14, Avg = 42.150).]
Note: the predicted values for this tree are the respective averages in each terminal node. For example, the Terminal Node 1 predicted value is (79.99 + 61.89) / 2 ≈ 70.94.
CART Geometry
[Figures: the one-split tree drawn as a step function in the (X1, Y) plane and as a partition of the (X1, X2) predictor space, with the partition boundary at X1 = 177 and predicted values 70.937 and 42.150.]
CART: Splitting Procedure
So how do we get to our final tree?
[Tree diagrams: the one-split tree (X1 <= 177.00), and the final two-split tree in which the X1 > 177 branch is split again at X1 <= 210.00, giving Terminal Node 1 (N = 2, Avg = 70.937), Terminal Node 2 (N = 3, Avg = 36.797), and Terminal Node 3 (N = 11, Avg = 43.609).]
CART: Splitting Procedure
We now perform the same procedure again, but this time for each partition of the data (we can only split one partition at a time)
Partition 1 (X1 <= 177):
Y       X1    X2
79.99   162   28
61.89   162   28

Partition 2 (X1 > 177):
Y       X1    X2
28.02   192   28
38.07   192   90
44.30   192   360
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
41.05   228   365
43.70   228   365

Splitting Partition 2 at X1 <= 210 gives:

Left (177 < X1 <= 210):
Y       X1    X2
28.02   192   28
38.07   192   90
44.30   192   360

Right (X1 > 210):
Y       X1    X2
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
41.05   228   365
43.70   228   365
Best split: split Partition 2 at X1 <= 210
[Tree diagrams: the one-split tree, and the resulting two-split tree in which the X1 > 177 branch is split at X1 <= 210.00, giving Terminal Node 2 (N = 3, Avg = 36.797) and Terminal Node 3 (N = 11, Avg = 43.609).]
CART Geometry
[Figures: the two-split tree drawn as a partition of the (X1, X2) predictor space and as a step function over X1, with boundaries at X1 = 177 and X1 = 210 and predicted values 70.937, 36.797, and 43.609.]
Where are we?
CART Splitting Process
CART Pruning
Advantages of CART
Interpreting CART Output
Applied Example using CART
Random Forest Section
CART: Algorithm
Step 1: Grow a large tree
Step 2: Prune the large tree (this is also done for you automatically)
– Use either a test sample or cross-validation to prune subtrees
CART: Pruning with a Test Sample
Test sample: randomly select a certain percentage of the data (often ~20%-30%) to be used to assess the model error
Prune the CART tree
1. Run the test data down the large tree and the smaller trees (the smaller trees are called “subtrees”)
2. Compute the test error for each tree
3. The final tree shown to the user is the tree with the smallest test error
Subtree   Test Error
1         200
2         125
3         100
4          83   <- smallest test error, so subtree 4 is chosen
5         113
6         137
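SPM performs this pruning automatically. For intuition, here is a rough scikit-learn analogue (an assumption on my part, not the SPM implementation): minimal cost-complexity pruning generates the nested subtree sequence, and the subtree with the smallest test-sample error is kept. X_train, y_train, X_test, y_test are assumed from an earlier split.

```python
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Grow a large tree, then enumerate its nested subtree sequence
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_err = 0.0, float("inf")
for alpha in path.ccp_alphas:  # one alpha per nested subtree
    sub = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    err = mean_squared_error(y_test, sub.fit(X_train, y_train).predict(X_test))
    if err < best_err:
        best_alpha, best_err = alpha, err

print("chosen alpha:", best_alpha, "test MSE:", round(best_err, 2))
```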
Where are we?
CART Splitting Process
CART Pruning
Advantages of CART (this is what allows you to build models with dirty data)
Interpreting CART Output
Applied Example using CART
Random Forest Section
CART Advantages
In practice, you can build CART models with dirty data (i.e. missing values, lots of variables, nonlinear relationships, outliers, and numerous local effects)
This is due to CART’s desirable properties:
1. Easy to interpret
2. Automatic handling of the following:
   a) Variable selection
   b) Variable interaction modeling
   c) Local effect modeling
   d) Nonlinear relationship modeling
   e) Missing values
   f) Outliers
3. Not affected by monotonic transformations of variables
CART: Interpretation and Automatic Variable Selection
Interpretation: CART trees have a simple interpretation and only require that someone ask themselves a series of “yes or no” questions like “Is Age_Day <= 21?” etc.
Variable Selection: all variables will be considered for each split, but not all variables will be used. Some variables will be used more than others.
Only one variable is used for each split
The variables chosen are those that reduce the error the most
[Tree diagram: Node 1 (N = 825, Avg = 36.036) splits at AGE_DAY <= 21.00; Terminal Node 1: N = 260, Avg = 23.944; the AGE_DAY > 21 branch splits at CEMENT_AMT <= 355.95 into Terminal Node 2 (N = 436, Avg = 37.036) and Terminal Node 3 (N = 129, Avg = 57.026).]
CART: Automatic Variable Interactions and Local Effects
In regression, interaction terms are modeled globally in the form x1*x2 or x1*x2*x3 (global means that the interaction is present everywhere).
In CART, interactions are automatically modeled over certain regions of the data (i.e. locally), so you do not have to worry about adding interaction terms or local terms to your model.
Example: notice how the prediction changes for different amounts of cement given that the age is over 21 days (this is the interaction):
1. If Age > 21 and Cement Amount <= 355.95, then the average strength is 37 megapascals
2. If Age > 21 and Cement Amount > 355.95, then the average strength is 57 megapascals
[Tree diagram: the same AGE_DAY / CEMENT_AMT tree shown earlier, with terminal-node averages 23.944, 37.036, and 57.026.]
CART: Automatic Nonlinear Modeling
Nonlinear (and linear) functions are approximated via step functions, so in practice you do not need to worry about adding terms like x^2 or ln(x) to capture nonlinear relationships. The figure below is the CART fit to Y = x^2 + noise. CART modeled this data automatically. No data pre-processing. Just CART.
[Figure: CART step-function fit to Y = x^2 + noise; axes Y vs. x]
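You can reproduce this behavior with any CART-style tree; here is a minimal sketch using scikit-learn's DecisionTreeRegressor as an analogue (not SPM itself):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 300).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 1, 300)  # Y = x^2 + N(0, 1) noise

tree = DecisionTreeRegressor(max_depth=4).fit(x, y)

# The fitted step function tracks the parabola with no x^2 term supplied
grid = np.linspace(-3, 3, 7).reshape(-1, 1)
print(np.c_[grid.ravel(), tree.predict(grid)].round(2))
```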
CART: Automatic Missing Value Handling
CART automatically handles missing values while building the model, so you do not need to impute missing values yourself
The missing values are handled using a surrogate split.
Surrogate split: find another variable whose split is "similar" to the split on the variable with the missing values, and split on the variable that does not have missing values.
Reference: see Section 5.3 in Breiman, Friedman, Olshen, and Stone for more information
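The idea behind finding a surrogate can be sketched in a few lines (a simplified illustration of my own, not the exact Breiman et al. measure, which weights agreement by node probabilities):

```python
import numpy as np

def best_surrogate(x_primary, threshold, candidates):
    """Find the candidate variable/threshold whose split best mimics the
    primary split's left/right assignment of the records in the node."""
    primary_left = x_primary <= threshold
    best = (None, None, 0.0)
    for name, x in candidates.items():
        for thr in np.unique(x)[:-1]:
            agree = np.mean((x <= thr) == primary_left)
            agree = max(agree, 1.0 - agree)  # a reversed split also counts
            if agree > best[2]:
                best = (name, thr, agree)
    return best  # (variable name, threshold, agreement rate)
```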
CART: Outliers in the Target Variable
Two types of outliers are:
1. Outliers in the target variable (i.e. "Y")
2. Outliers in the predictor variables (i.e. "x")
CART is more sensitive to outliers with respect to the target variable
1. More severe in a regression context than a classification context
2. CART may treat target variable outliers by isolating them in small terminal nodes which can limit their effect
Reference: Pages 197-200 and 253 in Breiman, Friedman, Olshen, and Stone (1984)
Here the target outliers are isolated in Terminal Node 1:
[Figure and tree diagram: in the (X1, X2, Y) plot, the two records with unusually large Y land together in Terminal Node 1 (N = 2, Avg = 70.937), limiting their effect on the X1 > 177 side of the tree (N = 14, Avg = 42.150).]
CART: Outliers in the Predictor Variables
CART is more robust to outliers in the predictor variables, partly due to the nature of the splitting process
Reference: Pages 197-200 and 253 in Breiman, Friedman, Olshen, and Stone (1984)
[Tree diagrams: after adding a record with an extreme X1 value, the root still splits at X1 <= 177.00; Terminal Node 1 absorbs the outlier (N = 3, Avg = 73.957) while Terminal Node 2 is essentially unchanged (N = 14, Avg = 42.149).]
CART: Monotonic Transformations of Variables
Monotonic transformation: a transformation that does not change the order of a variable.
– CART, unlike linear regression, is not affected by monotonic transformations, so if a transformation does not change the order of a variable then you do not need to worry about adding it to a CART model.
– Example: our best first split in the example tree was X1 <= 177. What happens if we square X1? The split point value changes, but nothing else does, including the predicted values. This happens because the same Y values fall into the same partition (i.e. their order has not changed after we squared and sorted X1). A small demonstration in code follows the tables below.
Original data and tree:
[Tables: the 16-record dataset, shown partitioned at X1 <= 177 (2 records) and X1 > 177 (14 records).]
[Tree diagram: Node 1 splits at X1 <= 177.00; Terminal Node 1: N = 2, Avg = 70.937; Terminal Node 2: N = 14, Avg = 42.150.]
Squared data and tree:
[Tables: the same dataset with X1 squared (162^2 = 26,244; 192^2 = 36,864; 228^2 = 51,984), partitioned identically.]
[Tree diagram: Node 1 splits at X1_SQ <= 31554.00; the node statistics and predictions are identical to the original tree (Terminal Node 1: N = 2, Avg = 70.937; Terminal Node 2: N = 14, Avg = 42.150).]
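The demonstration promised above, using scikit-learn as a stand-in for SPM: a one-split tree fit to X1 and a one-split tree fit to X1 squared give identical predictions on the 16-record toy data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

y  = np.array([79.99, 61.89, 28.02, 38.07, 44.30, 36.45, 45.85, 39.29,
               47.81, 47.03, 42.33, 52.91, 40.27, 43.01, 41.05, 43.70])
x1 = np.array([162, 162, 192, 192, 192] + [228] * 11, dtype=float)

t_raw = DecisionTreeRegressor(max_depth=1).fit(x1.reshape(-1, 1), y)
t_sq  = DecisionTreeRegressor(max_depth=1).fit((x1 ** 2).reshape(-1, 1), y)

# Squaring preserves the order of X1, so the partitions (and therefore
# the terminal-node averages) are identical
same = np.allclose(t_raw.predict(x1.reshape(-1, 1)),
                   t_sq.predict((x1 ** 2).reshape(-1, 1)))
print(same)  # True
```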
Where are we?
CART Splitting Process
CART Pruning
Advantages of CART (this is what allows you to build models with dirty data)
Interpreting CART Output
Applied Example using CART
Random Forest Section
CART: Relative Error
Relative error: used to determine the optimal model complexity for CART models
GOOD: relative error values close to zero mean that CART is doing a much better job than simply predicting the overall average (or median) for all records in the data
BAD: a relative error value equal to one means that CART is no better than predicting the overall average (or median) of the target variable for every record. Note: the relative error can be greater than one, which is especially bad.
The relative error can be computed for both least squares, $LS = \sum_i (y_i - \hat{y}_i)^2$, and least absolute deviation, $LAD = \sum_i |y_i - \hat{y}_i|$:

$\text{Relative Error} = \dfrac{\text{CART model error using either LS or LAD}}{\text{error from predicting the overall average for all records}}$

Example from the SPM output: LAD relative error = 0.129
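A minimal sketch of the LS version of this ratio (my own helper function, not SPM output):

```python
import numpy as np

def relative_error_ls(y_true, y_pred):
    """Model sum of squared errors divided by the SSE of always
    predicting the overall average."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    model_sse = np.sum((y_true - y_pred) ** 2)
    baseline_sse = np.sum((y_true - y_true.mean()) ** 2)
    return model_sse / baseline_sse
```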
CART: Variable Importance
CART Variable Importance: sum each variable's split improvement score across the splits in the tree. The importance score for a variable is increased in two ways:
1) When the variable is actually used to split a node
2) When the variable is used as the surrogate split (i.e. the backup splitting variable when the primary splitting variable has a missing value)
“Consider Only Primary Splitters” (green rectangle on the right) removes the surrogate splitting variables from the variable importance calculation
“Discount Surrogates” allows you to discount surrogates in a more specific manner
Applied Example using CART
CART Performance
Method                                           Test MSE
Linear Regression                                109.04
Linear Regression with interactions               67.35
Min 1 SE CART (default settings)                  65.05
CART (default settings)                           55.99
Random Forest (default settings)                  –
Improved Random Forest using an SPM Automate      –

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$
***More information about the 1 Standard Error Rule for CART can be found in the appendix
Where are we?
CART Splitting Process
CART Pruning
Advantages of CART (this is what allows you to build models with dirty data)
Interpreting CART Output
Applied Example using CART
Random Forest Section
Introduction to Random Forests
Main Idea: fit multiple CART trees to independent “bootstrap” samples of the data and then combine the predictions
Leo Breiman, one of the co-creators of CART, also created Random Forests and published a paper on this method in 2001
Our RandomForests® software was developed in close consultation with Breiman himself
What is a bootstrap sample?
A bootstrap sample is a random sample drawn with replacement. Steps:
1. Randomly select an observation from the original data
2. "Write it down"
3. "Put it back" (i.e. any observation can be selected more than once)
Repeat steps 1-3 N times, where N is the number of observations in the original sample
FINAL RESULT: one "bootstrap sample" with N observations
[Figure: an original dataset (columns Y, X1, X2) next to one bootstrap sample of it; some rows, such as (0, 37, 1), appear several times in the bootstrap sample while others do not appear at all.]
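A minimal sketch of drawing one bootstrap sample in Python:

```python
import numpy as np

rng = np.random.default_rng(0)
original = np.arange(10)  # stand-in for the N original observations

# Draw N times with replacement: some rows repeat, others never appear
bootstrap = rng.choice(original, size=len(original), replace=True)
print(sorted(bootstrap))
```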
Original Data → Bootstrap 1, Bootstrap 2, …, Bootstrap 199, Bootstrap 200
1. Draw a bootstrap sample
2. Fit a large, unpruned CART tree to this bootstrap sample
   – At each split in the tree, consider only k randomly selected variables instead of all of them
3. Repeat steps 1-2 at least 200 times
Predict a new record: run the record down each tree, each time computing a prediction (e.g. Tree 1: 10.5, Tree 2: 9.8, …, Tree 199: 10.73, Tree 200: 12)
Final prediction for a new record: take the average of the 200 individual predictions, e.g. (10.5 + 9.8 + … + 10.73 + 12) / 200
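A compact sketch of this whole recipe (bootstrap samples, unpruned trees, k random variables per split, averaged predictions), using scikit-learn trees as the CART stand-in; fit_forest and forest_predict are my own illustrative helpers, and X, y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=200, k=3, seed=0):
    """Unpruned trees on bootstrap samples; max_features=k mimics
    choosing k random variables at each split."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap sample
        tree = DecisionTreeRegressor(max_features=k,
                                     random_state=int(rng.integers(10**9)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X_new):
    # Final prediction: average of the individual tree predictions
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```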
CART and Random Forests
When you build a Random Forest model just keep this picture in the back of your mind:
This is because a Random Forest is really just an average of CART trees constructed on bootstrap samples of the original data:

$\hat{f}_{\text{forest}}(x) = \frac{1}{B}\left[\hat{f}_1(x) + \hat{f}_2(x) + \hat{f}_3(x) + \dots + \hat{f}_B(x)\right]$
CART and Random Forests
Random Forests generally have superior predictive performance versus CART trees because Random Forests have lower variance than a single CART tree
Since Random Forests are a combination of CART trees they inherit many of CART’s properties:
Automatic:
– Variable selection
– Variable interaction detection
– Nonlinear relationship detection
– Missing value handling
– Outlier handling
– Modeling of local effects
Invariant to monotone transformations of predictors
One drawback is that a Random Forest is not as interpretable as a single CART tree
Random Forests: Tuning Parameters
The performance of a Random Forest is dependent upon the values of certain model parameters
Two of these parameters are
1. Number of trees
2. Random number of variables chosen at each split
Random Forests: Number of Trees
Number of trees
– Default: 200
– The number of trees should be large enough that the model error no longer meaningfully declines as the number of trees increases (experimentation will be required)
– In Random Forests the optimal number of trees tends to be the maximum value allotted (due to the Law of Large Numbers)
[Figures: error curves for the default setting (200 trees) and for 400 trees]
There is not much difference between the error for a forest with 200 trees and one with 400 trees, so, at least for this dataset, a larger number of trees will not improve the model meaningfully. A quick way to run this check is sketched below.
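The check referenced above, sketched with scikit-learn's RandomForestRegressor and assuming the earlier train/test split:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

for n in (50, 100, 200, 400):
    rf = RandomForestRegressor(n_estimators=n, random_state=0)
    rf.fit(X_train, y_train)
    print(n, "trees -> test MSE:",
          round(mean_squared_error(y_test, rf.predict(X_test)), 2))
```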
Random Forest Parameters: Random variable subset size
Random number of variables k chosen at each split in each tree in the forest
Default: k=3
Experimentation will be required to find the optimal value and this can be done using Automate RFNPREDS
Automate RFNPREDS- automatically build multiple Random Forests: each time the forest is the same except that the number of randomly selected variables at each split in each CART tree changes
This allows us to conveniently determine the optimal number of variables to randomly select at each split in each tree
This output tells you that the optimal number of randomly selected variables at each split in each tree in the forest is 5.
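Outside SPM, the experiment Automate RFNPREDS runs can be approximated by looping over max_features (scikit-learn's name for k); this is an analogue of my own, not the SPM automate itself, and assumes the earlier train/test split.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

for k in range(1, 9):  # the concrete data has 8 predictors
    rf = RandomForestRegressor(n_estimators=200, max_features=k,
                               random_state=0).fit(X_train, y_train)
    print("k =", k, "test MSE:",
          round(mean_squared_error(y_test, rf.predict(X_test)), 2))
```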
Interpreting a Random Forest
Since a Random Forest is a collection of hundreds or even thousands of CART trees, the simple interpretation is lost because we now have hundreds of trees and are averaging the predictions
One method used to interpret a Random Forest is variable importance
Random Forest for Regression: Variable Importance
CART Variable Importance: sum each variable's split improvement score across the splits in the tree.
The importance score for a variable is increased in two ways: 1) when the variable is actually used to split a node and 2) when the variable is used as the surrogate split (i.e. the backup splitting variable when the primary splitting variable has a missing value).
Random Forest Variable Importance for Regression:
1. Compute a score for every split the variable generates and sum the scores across all splits made (across the trees in the forest)
Relative Importance: divide all variable importance scores by the maximum variable importance score (i.e. the most important variable has a relative importance value of 100)
Note: For classification models, the preferred method is the random permutation method (see appendix for more details)
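For illustration, scikit-learn's impurity-based feature_importances_ follows the same split-improvement idea (though without surrogates, so the numbers will differ from SPM's); rescaling to the maximum gives relative importances like those above. X_train, y_train, and the list predictor_names (the eight predictor column names) are assumed to be defined.

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Rescale so the most important variable scores 100
relative = 100 * rf.feature_importances_ / rf.feature_importances_.max()
for name, score in sorted(zip(predictor_names, relative),
                          key=lambda pair: -pair[1]):
    print(f"{name:18s}{score:7.1f}")  # predictor_names: assumed defined
```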
Random Forest Demonstration in SPM
Random Forests: Model Performance
Method                                           Test MSE
Linear Regression                                109.04
Linear Regression with interactions               67.35
Min 1 SE CART (default settings)                  65.05
CART (default settings)                           55.99
Random Forest (default settings)                  37.70
Improved Random Forest using an SPM Automate      36.02

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$
Conclusion
CART produces an interpretable model that is more resistant to outliers, predicts future data well, and automatically handles:
1. Variable interactions
2. Missing values
3. Nonlinear relationships
4. Local effects
Random Forests are fundamentally a combination of individual CART trees and thus inherit all of the advantages of CART above (except the nice interpretation)
*A Random Forest is generally superior to a single CART tree in terms of predictive accuracy
Next in the series…
Improve Your Regression with TreeNet® Gradient Boosting
Try CART and Random Forest
Download SPM® 8 now to start building CART and Random Forest models on your data
We will be more than happy to personally help you if you have any questions or need assistance
My Email: [email protected]
Email: [email protected]
The appendix follows this slide
Appendix
1 Standard Error Rule for CART trees
1SE Rule in SPM
[Figures: left, the optimal tree; right, the 1 Standard Error Rule tree]
1SE Rule Tree: the smallest tree whose error is within one standard error of the minimum error. In the figures, the optimal tree has 188 terminal nodes and a relative error of 0.177, while the 1SE tree has 85 terminal nodes and a relative error of 0.209.
Smaller trees are preferred because they are less likely to overfit the data (i.e. the 1SE tree in this case is competitive in terms of accuracy and much less complex) and because they are easier to interpret.

$\text{Relative Error} = \dfrac{\text{CART model error using either LS or LAD}}{\text{error from predicting the overall average (or median if using LAD) for all records}}$