Improve Your Regression: CART® and RandomForests®
Charles Harrison, Marketing Statistician
Outline
Applications of CART and Random Forests
Ordinary Least Squares Regression
– A review
– Common issues in standard linear regression
Data Description
Improving your regression with an applied example
– CART decision tree
– Random Forest
Conclusions
Applications
In this webinar we use CART® software and RandomForests® software to predict concrete strength, but as we will see these techniques can be applied to any field
Quantitative Targets: Number of Cavities, Blood Pressure, Income etc.
Qualitative Targets: Disease or No Disease; Buy or Not Buy; Lend or Do Not Lend; Buy Product A vs Product B vs. Product C vs. Product D
Examples: Credit Risk, Glaucoma Screening, Insurance Fraud, Customer Loyalty, Drug Discovery, Early Identification of Reading Disabilities, Biodiversity and Wildlife Conservation
Preview: CART and Random Forest Advantages
As we will see in this presentation both CART and Random Forests have desirable properties that allow you to build accurate predictive models with dirty data (i.e. missing values, lots of variables, nonlinear relationships, outliers etc.)
[Preview figures: geometry of a CART tree with 1 split and with 2 splits]
Preview: Model Performance
Method                                           Test MSE
Linear Regression                                109.04
Linear Regression with interactions               67.35
Min 1 SE CART (default settings)                  65.05
CART (default settings)                           55.99
RandomForests® (default settings)                 37.70
Improved RandomForests® using an SPM Automate     36.02

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$
What is OLS?
OLS: ordinary least squares regression
– Discovered by Legendre (1805) and Gauss (1809) to solve problems in astronomy using pen and paper
The model is of the form

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_p x_p$

– $\beta_0$: the intercept term
– $\beta_1, \beta_2, \beta_3, \dots$: coefficient estimates
– $x_1, x_2, x_3, \dots, x_p$: predictor variables (i.e. columns in the dataset)
Example: Income = 20,000 + 2,500*WorkExperience + 1,000*EducationYears
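To make the notation concrete, here is a minimal sketch (not part of the original webinar) of fitting an OLS model in Python with scikit-learn; the data are simulated from the hypothetical income equation above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulate data from the hypothetical income equation above
rng = np.random.default_rng(0)
work_exp = rng.uniform(0, 30, 200)
edu_years = rng.uniform(8, 20, 200)
income = 20_000 + 2_500 * work_exp + 1_000 * edu_years + rng.normal(0, 5_000, 200)

X = np.column_stack([work_exp, edu_years])
ols = LinearRegression().fit(X, income)
print("beta_0 (intercept):", round(ols.intercept_, 1))
print("beta_1, beta_2:", ols.coef_.round(1))  # roughly 2500 and 1000
```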
Common Issues in Regression
Missing values
– Requires imputation, OR
– Results in record deletion
Nonlinearities and local effects
– Example: $Y = 10 + 3x_1 + x_2 - 0.3x_1^2$
– Modeled via manual transformations, or terms are automatically added and then selected via forward, backward, stepwise, or regularized selection
– Local effects are ignored unless specified by the analyst, but this is very difficult or impossible in practice without subject-matter expertise or prior knowledge
Interactions
– Example: $Y = 10 + 3x_1 - 2x_2 + 0.25x_1 x_2$
– Manually added to the model (or through some automated procedure)
– Add interactions, then use variable selection (i.e. regularized regression or forward, backward, or stepwise selection)
Variable selection
– Usually accomplished manually or in combination with automated selection procedures
Solutions to OLS Problems
Two methods that do not suffer from the drawbacks of linear regression are CART and Random Forests
These methods automatically
– Handle missing values
– Model nonlinear relationships and local effects
– Select variables
– Model variable interactions
Concrete Strength
Target:
– STRENGTH: compressive strength of concrete in megapascals
Predictors:
– CEMENT
– BLAST_FURNACE_SLAG
– FLY_ASH
– WATER
– SUPERPLASTICIZER
– COARSE_AGGREGATE
– FINE_AGGREGATE
– AGE
I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998)
Why predict concrete strength?
Concrete is one of the most important materials in our society and is a key ingredient in important infrastructure projects like bridges, roads, buildings, and dams (MATSE)
Predicting the strength of concrete is important because concrete strength is a key component of the overall stability of these structures
Source: http://matse1.matse.illinois.edu/concrete/prin.html
Data Sample
Cement  BlastFurnaceSlag  FlyAsh  Water  Superplasticizer  CoarseAggregate  FineAggregate  Age  Strength
540 0 0 162 2.5 1040 676 28 79.98611076
540 0 0 162 2.5 1055 676 28 61.88736576
332.5 142.5 0 228 0 932 594 270 40.26953526
332.5 142.5 0 228 0 932 594 365 41.05277999
198.6 132.4 0 192 0 978.4 825.5 360 44.2960751
266 114 0 228 0 932 670 90 47.02984744
380 95 0 228 0 932 594 365 43.6982994
380 95 0 228 0 932 594 28 36.44776979
266 114 0 228 0 932 670 28 45.85429086
475 0 0 228 0 932 594 28 39.28978986
198.6 132.4 0 192 0 978.4 825.5 90 38.07424367
198.6 132.4 0 192 0 978.4 825.5 28 28.02168359
427.5 47.5 0 228 0 932 594 270 43.01296026
190 190 0 228 0 932 670 90 42.32693164
304 76 0 228 0 932 670 28 47.81378165
380 0 0 228 0 932 670 90 52.90831981
Regression Results
Method                                  Test MSE
Linear Regression                       109.04
Linear Regression with interactions      67.35

Fitted model:
Strength = -9.70 + 0.115*Cement + 0.01*BlastFurnaceSlag + 0.014*FlyAsh - 0.172*Water + 0.10*Superplasticizer + 0.01*CoarseAggregate + 0.01*FineAggregate + 0.11*Age

$\text{Test MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$

**Test sample: 20% of observations were randomly selected for the testing dataset
**This same test dataset was used to evaluate all models for the purpose of comparison
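For readers working outside SPM, a minimal Python sketch of the same evaluation protocol (an 80/20 random split and test MSE); here X and y stand for the eight concrete predictors and STRENGTH, and loading the data is omitted.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X: the eight predictors; y: STRENGTH (loading the dataset is omitted here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```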
Classification And Regression Trees
Authors: Breiman, Friedman, Olshen, and Stone (1984)
CART is a decision tree algorithm used for both regression and classification problems
1. Classification: tries to separate classes by choosing variables and points that best separate them
2. Regression: chooses the best variables and split points for reducing the squared or absolute error criterion
CART is available exclusively in the SPM® 8 Software Suite and was developed in close consultation with the original authors
CART: Introduction
Main Idea: divide the predictor variables (often people say "partition" instead of "divide") into different regions so that the dependent variable can be predicted more accurately.
The figure below shows the predicted values from a CART tree (the red horizontal bars) fit to the curve Y = x^2 + noise, where the noise is N(0, 1).
[Figure: CART step-function fit to Y = x^2 + noise; axes Y vs. x]
CART: Terminology
A tree split occurs when a variable is partitioned (an in-depth example starts after the next slide). This tree has two splits:
1. AGE_DAY <= 21
2. CEMENT_AMT <= 355.95
The node at the top of the tree is called the root node
A node that has no sub-branch is a terminal node
This tree has three terminal nodes (i.e. red boxes in the tree)
[Tree diagram: the root node (Node 1: N = 825, Avg = 36.036, STD = 16.661) splits at AGE_DAY <= 21.00; Terminal Node 1: N = 260, Avg = 23.944; the AGE_DAY > 21 branch (Node 2: N = 565, Avg = 41.600) splits at CEMENT_AMT <= 355.95 into Terminal Node 2 (N = 436, Avg = 37.036) and Terminal Node 3 (N = 129, Avg = 57.026).]
The predicted value in a CART regression model is the average of the target variable (i.e. "Y") for the records that fall into a given terminal node
Example: if Age = 26 days and the amount of cement is 400, then the predicted strength is 57.026 megapascals
CART: Algorithm
Step 1: Grow a large tree (this is done for you automatically)
– All variables are considered at each split in the tree
– Each split is made using one variable and a specific value or set of values
– Splits are chosen so as to minimize model error
– The tree is grown until either a user-specified criterion is met or until the tree cannot be grown further
Step 2: Prune the large tree (this is also done for you automatically)
– Use either a test sample or cross-validation to prune subtrees
CART: Splitting Procedure
Consider the following CART tree grown on this dataset
How exactly do we get this tree?

Y        X1    X2
79.9861 162 28
61.8874 162 28
40.2695 228 270
41.0528 228 365
44.2961 192 360
47.0298 228 90
43.6983 228 365
36.4478 228 28
45.8543 228 28
39.2898 228 28
38.0742 192 90
28.0217 192 28
43.013 228 270
42.3269 228 90
47.8138 228 28
52.9083 228 90
[Tree diagram: Node 1 (N = 16, Avg = 45.748, STD = 11.371) splits at X1 <= 177.00; Terminal Node 1: N = 2, Avg = 70.937; the X1 > 177 branch (Node 2: N = 14, Avg = 42.150) splits at X1 <= 210.00 into Terminal Node 2 (N = 3, Avg = 36.797) and Terminal Node 3 (N = 11, Avg = 43.609).]
CART: Splitting Procedure
Step 1: Find the best split point for the variable X1
– Sort the variable X1
– Compute the split improvement for each split point
– Best split for X1: X1 <= 177
Note: the split point 177 is the midpoint between X1 = 162 and X1 = 192
Original data:
Y       X1    X2
79.99   162   28
61.89   162   28
40.27   228   270
41.05   228   365
44.30   192   360
47.03   228   90
43.70   228   365
36.45   228   28
45.85   228   28
39.29   228   28
38.07   192   90
28.02   192   28
43.01   228   270
42.33   228   90
47.81   228   28
52.91   228   90

Sorted by X1:
Y       X1    X2
79.99   162   28
61.89   162   28
28.02   192   28
38.07   192   90
44.30   192   360
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
41.05   228   365
43.70   228   365
Left partition (X1 <= 177):
Y       X1    X2
79.99   162   28
61.89   162   28

Right partition (X1 > 177):
Y       X1    X2
28.02   192   28
38.07   192   90
44.30   192   360
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
41.05   228   365
43.70   228   365
[Tree diagram: Node 1 (N = 16, Avg = 45.748, STD = 11.371) splits at X1 <= 177.00 into Terminal Node 1 (N = 2, Avg = 70.937, STD = 9.049) and Terminal Node 2 (N = 14, Avg = 42.150, STD = 5.700).]
Split Improvement: $\Delta R(s,t) = R(t) - R(t_L) - R(t_R)$, where $R(t) = \frac{1}{N}\sum_{x_n \in t} \left(y_n - \bar{y}(t)\right)^2$ (least squares)
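A minimal Python sketch of this computation (an illustration, not SPM code): it evaluates the split improvement for the candidate split X1 <= 177 on the 16-record toy dataset and reproduces the improvement value of roughly 90.6 reported a few slides below.

```python
import numpy as np

def r(y_node, n_total):
    # R(t) = (1/N) * sum over the node of (y_n - node average)^2
    y_node = np.asarray(y_node, dtype=float)
    return np.sum((y_node - y_node.mean()) ** 2) / n_total

def split_improvement(y, x, threshold):
    # Delta R(s, t) = R(t) - R(t_L) - R(t_R) for the split x <= threshold
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    left = x <= threshold
    n = len(y)
    return r(y, n) - r(y[left], n) - r(y[~left], n)

# The toy dataset from the slides, sorted by X1
y  = [79.99, 61.89, 28.02, 38.07, 44.30, 36.45, 45.85, 39.29,
      47.81, 47.03, 42.33, 52.91, 40.27, 43.01, 41.05, 43.70]
x1 = [162, 162, 192, 192, 192, 228, 228, 228,
      228, 228, 228, 228, 228, 228, 228, 228]

print(split_improvement(y, x1, 177))  # about 90.6, matching the slides
```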
CART: Splitting Procedure
Step 2: Find the best split point for the variable X2
– Sort the variable X2
– Compute the split improvement for each split point
– Best split for X2: X2 <= 59
Note: the split point 59 is the midpoint between X2 = 28 and X2 = 90
Sorted by X2:
Y       X1    X2
79.99   162   28
61.89   162   28
28.02   192   28
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
38.07   192   90
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
44.30   192   360
41.05   228   365
43.70   228   365
[Tree diagram: Node 1 (N = 16, Avg = 45.748, STD = 11.371) splits at X2 <= 59.00 into Terminal Node 1 (N = 7, Avg = 48.472, STD = 16.158) and Terminal Node 2 (N = 9, Avg = 43.630, STD = 4.069).]

Left partition (X2 <= 59):
Y       X1    X2
79.99   162   28
61.89   162   28
28.02   192   28
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28

Right partition (X2 > 59):
Y       X1    X2
38.07   192   90
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
44.30   192   360
41.05   228   365
43.70   228   365
Split Improvement: $\Delta R(s,t) = R(t) - R(t_L) - R(t_R)$, where $R(t) = \frac{1}{N}\sum_{x_n \in t} \left(y_n - \bar{y}(t)\right)^2$ (least squares)
CART: Splitting Procedure
At this point CART has evaluated all possible split points for our two variables, X1 and X2, and determined the optimal split point for each.
Splitting on X1 or on X2 yields a different tree, so which split is best? The one with the largest split improvement.
– Best split for X1: X1 <= 177; improvement value: 90.64
– Best split for X2: X2 <= 59; improvement value: 5.77
[Tree diagrams: the candidate X1 <= 177.00 split (Terminal Node 1: N = 2, Avg = 70.937; Terminal Node 2: N = 14, Avg = 42.150) and the candidate X2 <= 59.00 split (Terminal Node 1: N = 7, Avg = 48.472; Terminal Node 2: N = 9, Avg = 43.630), each grown from the root node (N = 16, Avg = 45.748).]
Split Improvement: $\Delta R(s,t) = R(t) - R(t_L) - R(t_R)$, where $R(t) = \frac{1}{N}\sum_{x_n \in t} \left(y_n - \bar{y}(t)\right)^2$ (least squares)
CART: 1st Split
Our best first split in the tree is X1 <= 177, which leads to the following tree and partitioned dataset:
Terminal Node 1 (X1 <= 177):
Y       X1    X2
79.99   162   28
61.89   162   28

Terminal Node 2 (X1 > 177):
Y       X1    X2
28.02   192   28
38.07   192   90
44.30   192   360
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
41.05   228   365
43.70   228   365
[Tree diagram: Node 1 (N = 16, Avg = 45.748) splits at X1 <= 177.00 into Terminal Node 1 (N = 2, Avg = 70.937) and Terminal Node 2 (N = 14, Avg = 42.150).]
Note: the predicted values for this tree are the respective averages in each terminal node. For example, the Terminal Node 1 predicted value is (79.99 + 61.89) / 2 ≈ 70.94.
CART Geometry
[Figures: the one-split tree drawn as a step function in the (X1, Y) plane and as a partition of the (X1, X2) predictor space, with the partition boundary at X1 = 177 and predicted values 70.937 and 42.150.]
CART: Splitting Procedure
So how do we get to our final tree?
[Tree diagrams: the one-split tree (X1 <= 177.00), and the final two-split tree in which the X1 > 177 branch is split again at X1 <= 210.00, giving Terminal Node 1 (N = 2, Avg = 70.937), Terminal Node 2 (N = 3, Avg = 36.797), and Terminal Node 3 (N = 11, Avg = 43.609).]
CART: Splitting Procedure
We now perform the same procedure again, but this time for each partition of the data (we can only split one partition at a time)
Partition 1 (X1 <= 177):
Y       X1    X2
79.99   162   28
61.89   162   28

Partition 2 (X1 > 177):
Y       X1    X2
28.02   192   28
38.07   192   90
44.30   192   360
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
41.05   228   365
43.70   228   365

Splitting Partition 2 at X1 <= 210 gives:

Left (177 < X1 <= 210):
Y       X1    X2
28.02   192   28
38.07   192   90
44.30   192   360

Right (X1 > 210):
Y       X1    X2
36.45   228   28
45.85   228   28
39.29   228   28
47.81   228   28
47.03   228   90
42.33   228   90
52.91   228   90
40.27   228   270
43.01   228   270
41.05   228   365
43.70   228   365
Best split: split Partition 2 at X1 <= 210
[Tree diagrams: the one-split tree, and the resulting two-split tree in which the X1 > 177 branch is split at X1 <= 210.00, giving Terminal Node 2 (N = 3, Avg = 36.797) and Terminal Node 3 (N = 11, Avg = 43.609).]
CART Geometry
[Figures: the two-split tree drawn as a partition of the (X1, X2) predictor space and as a step function over X1, with boundaries at X1 = 177 and X1 = 210 and predicted values 70.937, 36.797, and 43.609.]
Where are we?
CART Splitting Process
CART Pruning
Advantages of CART
Interpreting CART Output
Applied Example using CART
Random Forest Section
CART: Algorithm
Step 1: Grow a large tree
Step 2: Prune the large tree (this is also done for you automatically)
– Use either a test sample or cross-validation to prune subtrees
CART: Pruning with a Test Sample
Test sample: randomly select a certain percentage of the data (often ~20%-30%) to be used to assess the model error
Prune the CART tree
1. Run the test data down the large tree and the smaller trees (the smaller trees are called “subtrees”)
2. Compute the test error for each tree
3. The final tree shown to the user is the tree with the smallest test error
Subtree   Test Error
1         200
2         125
3         100
4          83   <- smallest test error, so subtree 4 is chosen
5         113
6         137
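SPM performs this pruning automatically. For intuition, here is a rough scikit-learn analogue (an assumption on my part, not the SPM implementation): minimal cost-complexity pruning generates the nested subtree sequence, and the subtree with the smallest test-sample error is kept. X_train, y_train, X_test, y_test are assumed from an earlier split.

```python
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Grow a large tree, then enumerate its nested subtree sequence
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_err = 0.0, float("inf")
for alpha in path.ccp_alphas:  # one alpha per nested subtree
    sub = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    err = mean_squared_error(y_test, sub.fit(X_train, y_train).predict(X_test))
    if err < best_err:
        best_alpha, best_err = alpha, err

print("chosen alpha:", best_alpha, "test MSE:", round(best_err, 2))
```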
Where are we?
CART Splitting Process
CART Pruning
Advantages of CART (this is what allows you to build models with dirty data)
Interpreting CART Output
Applied Example using CART
Random Forest Section
CART Advantages
In practice, you can build CART models with dirty data (i.e. missing values, lots of variables, nonlinear relationships, outliers, and numerous local effects)
This is due to CART’s desirable properties:
1. Easy to interpret
2. Automatic handling of the following:
   a) Variable selection
   b) Variable interaction modeling
   c) Local effect modeling
   d) Nonlinear relationship modeling
   e) Missing values
   f) Outliers
3. Not affected by monotonic transformations of variables
CART: Interpretation and Automatic Variable Selection
Interpretation: CART trees have a simple interpretation and only require that someone ask themselves a series of “yes or no” questions like “Is Age_Day <= 21?” etc.
Variable Selection: all variables will be considered for each split, but not all variables will be used. Some variables will be used more than others.
Only one variable is used for each split
The variables chosen are those that reduce the error the most
[Tree diagram: Node 1 (N = 825, Avg = 36.036) splits at AGE_DAY <= 21.00; Terminal Node 1: N = 260, Avg = 23.944; the AGE_DAY > 21 branch splits at CEMENT_AMT <= 355.95 into Terminal Node 2 (N = 436, Avg = 37.036) and Terminal Node 3 (N = 129, Avg = 57.026).]
CART: Automatic Variable Interactions and Local Effects
In regression, interaction terms are modeled globally in the form x1*x2 or x1*x2*x3 (global means that the interaction is present everywhere).
In CART, interactions are automatically modeled over certain regions of the data (i.e. locally), so you do not have to worry about adding interaction terms or local terms to your model.
Example: notice how the prediction changes for different amounts of cement given that the age is over 21 days (this is the interaction):
1. If Age > 21 and Cement Amount <= 355.95, then the average strength is 37 megapascals
2. If Age > 21 and Cement Amount > 355.95, then the average strength is 57 megapascals
[Tree diagram: the same AGE_DAY / CEMENT_AMT tree shown earlier, with terminal-node averages 23.944, 37.036, and 57.026.]
CART: Automatic Nonlinear Modeling
Nonlinear (and linear) functions are approximated via step functions, so in practice you do not need to worry about adding terms like x^2 or ln(x) to capture nonlinear relationships. The figure below is the CART fit to Y = x^2 + noise. CART modeled this data automatically. No data pre-processing. Just CART.
[Figure: CART step-function fit to Y = x^2 + noise; axes Y vs. x]
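You can reproduce this behavior with any CART-style tree; here is a minimal sketch using scikit-learn's DecisionTreeRegressor as an analogue (not SPM itself):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 300).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 1, 300)  # Y = x^2 + N(0, 1) noise

tree = DecisionTreeRegressor(max_depth=4).fit(x, y)

# The fitted step function tracks the parabola with no x^2 term supplied
grid = np.linspace(-3, 3, 7).reshape(-1, 1)
print(np.c_[grid.ravel(), tree.predict(grid)].round(2))
```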
CART: Automatic Missing Value Handling
CART automatically handles missing values while building the model, so you do not need to impute missing values yourself
The missing values are handled using a surrogate split.
Surrogate split: find another variable whose split is "similar" to the split on the variable with the missing values, and split on the variable that does not have missing values.
Reference: see Section 5.3 in Breiman, Friedman, Olshen, and Stone for more information
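The idea behind finding a surrogate can be sketched in a few lines (a simplified illustration of my own, not the exact Breiman et al. measure, which weights agreement by node probabilities):

```python
import numpy as np

def best_surrogate(x_primary, threshold, candidates):
    """Find the candidate variable/threshold whose split best mimics the
    primary split's left/right assignment of the records in the node."""
    primary_left = x_primary <= threshold
    best = (None, None, 0.0)
    for name, x in candidates.items():
        for thr in np.unique(x)[:-1]:
            agree = np.mean((x <= thr) == primary_left)
            agree = max(agree, 1.0 - agree)  # a reversed split also counts
            if agree > best[2]:
                best = (name, thr, agree)
    return best  # (variable name, threshold, agreement rate)
```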
CART: Outliers in the Target Variable
Two types of outliers are:
1. Outliers in the target variable (i.e. "Y")
2. Outliers in the predictor variables (i.e. "x")
CART is more sensitive to outliers with respect to the target variable
1. More severe in a regression context than a classification context
2. CART may treat target variable outliers by isolating them in small terminal nodes which can limit their effect
Reference: Pages 197-200 and 253 in Breiman, Friedman, Olshen, and Stone (1984)
Here the target outliers are isolated in Terminal Node 1:
[Figure and tree diagram: in the (X1, X2, Y) plot, the two records with unusually large Y land together in Terminal Node 1 (N = 2, Avg = 70.937), limiting their effect on the X1 > 177 side of the tree (N = 14, Avg = 42.150).]
CART: Outliers in the Predictor Variables
CART is more robust to outliers in the predictor variables, partly due to the nature of the splitting process
Reference: Pages 197-200 and 253 in Breiman, Friedman, Olshen, and Stone (1984)
[Tree diagrams: after adding a record with an extreme X1 value, the root still splits at X1 <= 177.00; Terminal Node 1 absorbs the outlier (N = 3, Avg = 73.957) while Terminal Node 2 is essentially unchanged (N = 14, Avg = 42.149).]
CART: Monotonic Transformations of Variables
Monotonic transformation: a transformation that does not change the order of a variable.
– CART, unlike linear regression, is not affected by monotonic transformations, so if a transformation does not change the order of a variable then you do not need to worry about adding it to a CART model.
– Example: our best first split in the example tree was X1 <= 177. What happens if we square X1? The split point value changes, but nothing else does, including the predicted values. This happens because the same Y values fall into the same partition (i.e. their order has not changed after we squared and sorted X1). A small demonstration in code follows the tables below.
Original data and tree:
[Tables: the 16-record dataset, shown partitioned at X1 <= 177 (2 records) and X1 > 177 (14 records).]
[Tree diagram: Node 1 splits at X1 <= 177.00; Terminal Node 1: N = 2, Avg = 70.937; Terminal Node 2: N = 14, Avg = 42.150.]
Squared data and tree:
[Tables: the same dataset with X1 squared (162^2 = 26,244; 192^2 = 36,864; 228^2 = 51,984), partitioned identically.]
[Tree diagram: Node 1 splits at X1_SQ <= 31554.00; the node statistics and predictions are identical to the original tree (Terminal Node 1: N = 2, Avg = 70.937; Terminal Node 2: N = 14, Avg = 42.150).]
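The demonstration promised above, using scikit-learn as a stand-in for SPM: a one-split tree fit to X1 and a one-split tree fit to X1 squared give identical predictions on the 16-record toy data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

y  = np.array([79.99, 61.89, 28.02, 38.07, 44.30, 36.45, 45.85, 39.29,
               47.81, 47.03, 42.33, 52.91, 40.27, 43.01, 41.05, 43.70])
x1 = np.array([162, 162, 192, 192, 192] + [228] * 11, dtype=float)

t_raw = DecisionTreeRegressor(max_depth=1).fit(x1.reshape(-1, 1), y)
t_sq  = DecisionTreeRegressor(max_depth=1).fit((x1 ** 2).reshape(-1, 1), y)

# Squaring preserves the order of X1, so the partitions (and therefore
# the terminal-node averages) are identical
same = np.allclose(t_raw.predict(x1.reshape(-1, 1)),
                   t_sq.predict((x1 ** 2).reshape(-1, 1)))
print(same)  # True
```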
Where are we?
CART Splitting Process
CART Pruning
Advantages of CART (this is what allows you to build models with dirty data)
Interpreting CART Output
Applied Example using CART
Random Forest Section
CART: Relative Error
Relative error: used to determine the optimal model complexity for CART models
GOOD: relative error values close to zero mean that CART is doing a much better job than simply predicting the overall average (or median) for all records in the data
BAD: a relative error value equal to one means that CART is no better than predicting the overall average (or median) of the target variable for every record. Note: the relative error can be greater than one, which is especially bad.
The relative error can be computed for both least squares, $LS = \sum_i (y_i - \hat{y}_i)^2$, and least absolute deviation, $LAD = \sum_i |y_i - \hat{y}_i|$:

$\text{Relative Error} = \dfrac{\text{CART model error using either LS or LAD}}{\text{error from predicting the overall average for all records}}$

Example from the SPM output: LAD relative error = 0.129
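A minimal sketch of the LS version of this ratio (my own helper function, not SPM output):

```python
import numpy as np

def relative_error_ls(y_true, y_pred):
    """Model sum of squared errors divided by the SSE of always
    predicting the overall average."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    model_sse = np.sum((y_true - y_pred) ** 2)
    baseline_sse = np.sum((y_true - y_true.mean()) ** 2)
    return model_sse / baseline_sse
```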
CART: Variable Importance
CART Variable Importance: sum each variable's split improvement score across the splits in the tree. The importance score for a variable is increased in two ways:
1) When the variable is actually used to split a node
2) When the variable is used as the surrogate split (i.e. the backup splitting variable when the primary splitting variable has a missing value)
“Consider Only Primary Splitters” (green rectangle on the right) removes the surrogate splitting variables from the variable importance calculation
“Discount Surrogates” allows you to discount surrogates in a more specific manner
Applied Example using CART
CART Performance
Method                                           Test MSE
Linear Regression                                109.04
Linear Regression with interactions               67.35
Min 1 SE CART (default settings)                  65.05
CART (default settings)                           55.99
Random Forest (default settings)                  –
Improved Random Forest using an SPM Automate      –

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$
***More information about the 1 Standard Error Rule for CART can be found in the appendix
Where are we?
CART Splitting Process
CART Pruning
Advantages of CART (this is what allows you to build models with dirty data)
Interpreting CART Output
Applied Example using CART
Random Forest Section
Introduction to Random Forests
Main Idea: fit multiple CART trees to independent “bootstrap” samples of the data and then combine the predictions
Leo Breiman, one of the co-creators of CART, also created Random Forests and published a paper on this method in 2001
Our RandomForests® software was developed in close consultation with Breiman himself
What is a bootstrap sample?
A bootstrap sample is a random sample drawn with replacement. Steps:
1. Randomly select an observation from the original data
2. "Write it down"
3. "Put it back" (i.e. any observation can be selected more than once)
Repeat steps 1-3 N times, where N is the number of observations in the original sample
FINAL RESULT: one "bootstrap sample" with N observations
[Figure: an original dataset (columns Y, X1, X2) next to one bootstrap sample of it; some rows, such as (0, 37, 1), appear several times in the bootstrap sample while others do not appear at all.]
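A minimal sketch of drawing one bootstrap sample in Python:

```python
import numpy as np

rng = np.random.default_rng(0)
original = np.arange(10)  # stand-in for the N original observations

# Draw N times with replacement: some rows repeat, others never appear
bootstrap = rng.choice(original, size=len(original), replace=True)
print(sorted(bootstrap))
```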
Original Data → Bootstrap 1, Bootstrap 2, …, Bootstrap 199, Bootstrap 200
1. Draw a bootstrap sample
2. Fit a large, unpruned CART tree to this bootstrap sample
   – At each split in the tree, consider only k randomly selected variables instead of all of them
3. Repeat steps 1-2 at least 200 times
Predict a new record: run the record down each tree, each time computing a prediction (e.g. Tree 1: 10.5, Tree 2: 9.8, …, Tree 199: 10.73, Tree 200: 12)
Final prediction for a new record: take the average of the 200 individual predictions, e.g. (10.5 + 9.8 + … + 10.73 + 12) / 200
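A compact sketch of this whole recipe (bootstrap samples, unpruned trees, k random variables per split, averaged predictions), using scikit-learn trees as the CART stand-in; fit_forest and forest_predict are my own illustrative helpers, and X, y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=200, k=3, seed=0):
    """Unpruned trees on bootstrap samples; max_features=k mimics
    choosing k random variables at each split."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap sample
        tree = DecisionTreeRegressor(max_features=k,
                                     random_state=int(rng.integers(10**9)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X_new):
    # Final prediction: average of the individual tree predictions
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```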
CART and Random Forests
When you build a Random Forest model just keep this picture in the back of your mind:
This is because a Random Forest is really just an average of CART trees constructed on bootstrap samples of the original data:

$\hat{f}_{\text{forest}}(x) = \frac{1}{B}\left[\hat{f}_1(x) + \hat{f}_2(x) + \hat{f}_3(x) + \dots + \hat{f}_B(x)\right]$
CART and Random Forests
Random Forests generally have superior predictive performance versus CART trees because Random Forests have lower variance than a single CART tree
Since Random Forests are a combination of CART trees they inherit many of CART’s properties:
Automatic:
– Variable selection
– Variable interaction detection
– Nonlinear relationship detection
– Missing value handling
– Outlier handling
– Modeling of local effects
Invariant to monotone transformations of predictors
One drawback is that a Random Forest is not as interpretable as a single CART tree
Random Forests: Tuning Parameters
The performance of a Random Forest is dependent upon the values of certain model parameters
Two of these parameters are
1. Number of trees
2. Random number of variables chosen at each split
Random Forests: Number of Trees
Number of trees
– Default: 200
– The number of trees should be large enough that the model error no longer meaningfully declines as the number of trees increases (experimentation will be required)
– In Random Forests the optimal number of trees tends to be the maximum value allotted (due to the Law of Large Numbers)
[Figures: error curves for the default setting (200 trees) and for 400 trees]
There is not much difference between the error for a forest with 200 trees and one with 400 trees, so, at least for this dataset, a larger number of trees will not improve the model meaningfully. A quick way to run this check is sketched below.
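The check referenced above, sketched with scikit-learn's RandomForestRegressor and assuming the earlier train/test split:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

for n in (50, 100, 200, 400):
    rf = RandomForestRegressor(n_estimators=n, random_state=0)
    rf.fit(X_train, y_train)
    print(n, "trees -> test MSE:",
          round(mean_squared_error(y_test, rf.predict(X_test)), 2))
```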
Random Forest Parameters: Random variable subset size
Random number of variables k chosen at each split in each tree in the forest
Default: k=3
Experimentation will be required to find the optimal value and this can be done using Automate RFNPREDS
Automate RFNPREDS- automatically build multiple Random Forests: each time the forest is the same except that the number of randomly selected variables at each split in each CART tree changes
This allows us to conveniently determine the optimal number of variables to randomly select at each split in each tree
This output tells you that the optimal number of randomly selected variables at each split in each tree in the forest is 5.
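Outside SPM, the experiment Automate RFNPREDS runs can be approximated by looping over max_features (scikit-learn's name for k); this is an analogue of my own, not the SPM automate itself, and assumes the earlier train/test split.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

for k in range(1, 9):  # the concrete data has 8 predictors
    rf = RandomForestRegressor(n_estimators=200, max_features=k,
                               random_state=0).fit(X_train, y_train)
    print("k =", k, "test MSE:",
          round(mean_squared_error(y_test, rf.predict(X_test)), 2))
```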
Interpreting a Random Forest
Since a Random Forest is a collection of hundreds or even thousands of CART trees, the simple interpretation is lost because we now have hundreds of trees and are averaging the predictions
One method used to interpret a Random Forest is variable importance
Random Forest for Regression: Variable Importance
CART Variable Importance: sum each variable's split improvement score across the splits in the tree.
The importance score for a variable is increased in two ways: 1) when the variable is actually used to split a node and 2) when the variable is used as the surrogate split (i.e. the backup splitting variable when the primary splitting variable has a missing value).
Random Forest Variable Importance for Regression:
1. Compute a score for every split the variable generates and sum the scores across all splits made (across the trees in the forest)
Relative Importance: divide all variable importance scores by the maximum variable importance score (i.e. the most important variable has a relative importance value of 100)
Note: For classification models, the preferred method is the random permutation method (see appendix for more details)
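For illustration, scikit-learn's impurity-based feature_importances_ follows the same split-improvement idea (though without surrogates, so the numbers will differ from SPM's); rescaling to the maximum gives relative importances like those above. X_train, y_train, and the list predictor_names (the eight predictor column names) are assumed to be defined.

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Rescale so the most important variable scores 100
relative = 100 * rf.feature_importances_ / rf.feature_importances_.max()
for name, score in sorted(zip(predictor_names, relative),
                          key=lambda pair: -pair[1]):
    print(f"{name:18s}{score:7.1f}")  # predictor_names: assumed defined
```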
Random Forest Demonstration in SPM
Random Forests: Model Performance
Method                                           Test MSE
Linear Regression                                109.04
Linear Regression with interactions               67.35
Min 1 SE CART (default settings)                  65.05
CART (default settings)                           55.99
Random Forest (default settings)                  37.70
Improved Random Forest using an SPM Automate      36.02

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$
Conclusion
CART produces an interpretable model that is more resistant to outliers, predicts future data well, and automatically handles:
1. Variable interactions
2. Missing values
3. Nonlinear relationships
4. Local effects
Random Forests are fundamentally a combination of individual CART trees and thus inherit all of the advantages of CART above (except the nice interpretation)
*A Random Forest is generally superior to a single CART tree in terms of predictive accuracy
Next in the series…
Improve Your Regression with TreeNet® Gradient Boosting
Try CART and Random Forest
Download SPM® 8 now to start building CART and Random Forest models on your data
We will be more than happy to personally help you if you have any questions or need assistance
My Email: [email protected]
Email: [email protected]
The appendix follows this slide
Appendix
1 Standard Error Rule for CART trees
1SE Rule in SPM
[Figures: left, the optimal tree; right, the 1 Standard Error Rule tree]
1SE Rule Tree: the smallest tree whose error is within one standard error of the minimum error. In the figures, the optimal tree has 188 terminal nodes and a relative error of 0.177, while the 1SE tree has 85 terminal nodes and a relative error of 0.209.
Smaller trees are preferred because they are less likely to overfit the data (i.e. the 1SE tree in this case is competitive in terms of accuracy and much less complex) and because they are easier to interpret.

$\text{Relative Error} = \dfrac{\text{CART model error using either LS or LAD}}{\text{error from predicting the overall average (or median if using LAD) for all records}}$