ECS171: Machine Learning
Lecture 15: Tree-based Algorithms
Cho-Jui Hsieh, UC Davis
March 7, 2018

Transcript
Page 1:

ECS171: Machine Learning
Lecture 15: Tree-based Algorithms

Cho-Jui Hsieh, UC Davis

March 7, 2018

Page 2:

Outline

Decision Tree

Random Forest

Gradient Boosted Decision Tree (GBDT)

Page 3:

Decision Tree

Each node checks one feature x_i:

Go left if x_i < threshold
Go right if x_i ≥ threshold
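
To make the left/right test concrete, here is a minimal sketch (not from the slides) of prediction by walking such a tree; the Node fields and names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: int = 0                 # index i of the feature x_i tested at this node
    threshold: float = 0.0           # split threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None    # leaf prediction (None for internal nodes)

def predict(node: Node, x) -> float:
    """Walk the tree: go left if x_i < threshold, right otherwise, until a leaf."""
    while node.value is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value
```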

Page 4:

A real example

Page 5:

Decision Tree

Strength:

It’s a nonlinear classifier
Better interpretability
Can naturally handle categorical features

Computation:

Training: slow
Prediction: fast, about h operations (h: depth of the tree, usually ≤ 15)

Page 7:

Splitting the node

Classification tree: split the node to maximize the reduction in entropy (i.e., the information gain)

Let S be the set of data points in a node and c = 1, · · · , C the class labels:

Entropy: H(S) = −∑_{c=1}^{C} p(c) log p(c),

where p(c) is the proportion of the data in S belonging to class c.
Entropy = 0 if all samples are in the same class
Entropy is largest if p(1) = · · · = p(C)
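
As a concrete illustration of the entropy formula, here is a minimal NumPy sketch (the function name is my own, not from the slides):

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_c p(c) log p(c), where p(c) is the class proportion in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())   # equals 0 when all samples share one class
```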

Page 8:

Information Gain

The averaged entropy of a split S → S1, S2:

(|S1|/|S|) H(S1) + (|S2|/|S|) H(S2)

Information gain: measures how good the split is:

H(S) − ( (|S1|/|S|) H(S1) + (|S2|/|S|) H(S2) )
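
Building on the entropy helper above, a sketch of the information-gain computation for a candidate split (again an illustrative helper, not code from the slides):

```python
def information_gain(parent_labels, left_labels, right_labels):
    """H(S) - ( |S1|/|S| * H(S1) + |S2|/|S| * H(S2) )."""
    n, n1, n2 = len(parent_labels), len(left_labels), len(right_labels)
    averaged = (n1 / n) * entropy(left_labels) + (n2 / n) * entropy(right_labels)
    return entropy(parent_labels) - averaged
```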

Page 9:

Information Gain

Page 11:

Splitting the node

Given the current node, how do we find the best split?

For each feature and each candidate threshold:

Compute the information gain after the split

Choose the best one (maximal information gain)

For n samples and d features: need O(nd) time
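
A minimal sketch of this exhaustive search, reusing the entropy/information_gain helpers above (my own illustration; candidate thresholds are taken to be the observed feature values):

```python
import numpy as np

def best_split(X, y):
    """Try every feature and threshold; return the split with maximal information gain."""
    best_feature, best_threshold, best_gain = None, None, -np.inf
    for j in range(X.shape[1]):                     # all features
        for t in np.unique(X[:, j]):                # all candidate thresholds
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = information_gain(y, left, right)
            if gain > best_gain:
                best_feature, best_threshold, best_gain = j, t, gain
    return best_feature, best_threshold, best_gain
```

This straightforward double loop costs O(n) per threshold; the O(nd) figure on the slide assumes the usual optimization of sorting each feature once and updating class counts incrementally as the threshold sweeps through the sorted values.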

Page 14:

Regression Tree

Assign a real number to each leaf

Usually the average of the y values in that leaf

(this minimizes the squared error)

Page 15:

Regression Tree

Objective function:

min_F (1/n) ∑_{i=1}^{n} (y_i − F(x_i))² + (regularization)

The quality of a partition S = S1 ∪ S2 can be computed by the objective function:

∑_{i∈S1} (y_i − ȳ^(1))² + ∑_{i∈S2} (y_i − ȳ^(2))²,

where ȳ^(1) = (1/|S1|) ∑_{i∈S1} y_i and ȳ^(2) = (1/|S2|) ∑_{i∈S2} y_i

Find the best split:

Try all the features & thresholds and find the one with the minimal objective function value
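
To make the split-quality criterion concrete, a small sketch (the helper name is mine) that scores a candidate partition by the summed squared error around the leaf means:

```python
import numpy as np

def split_sse(y_left, y_right):
    """Sum of squared errors of a split; each side is predicted by its mean y value."""
    def sse(y):
        y = np.asarray(y, dtype=float)
        return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0
    return sse(y_left) + sse(y_right)
```

The best split is then the feature/threshold pair minimizing this value, exactly as in the classification case but with squared error in place of entropy.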

Page 17:

Parameters

Maximum depth: (usually ∼ 10)

Minimum number of samples in each node: (10, 50, 100)

A single decision tree is not very powerful · · ·
Can we build multiple decision trees and ensemble them together?
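
For reference, these two hyperparameters map directly onto scikit-learn's tree estimators; a minimal usage sketch with illustrative values (not prescribed by the slides):

```python
from sklearn.tree import DecisionTreeClassifier

# max_depth caps the tree depth; min_samples_leaf sets the minimum samples per leaf node
clf = DecisionTreeClassifier(max_depth=10, min_samples_leaf=50)
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```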

Page 19:

Random Forest

Page 20:

Random Forest

Random Forest (Bootstrap ensemble for decision trees):

Create T trees
Learn each tree using a subsampled dataset Si and a subsampled feature set Di

Prediction: Average the results from all the T trees

Benefit:

Avoid over-fitting
Improve stability and accuracy

Good software available:

R: “randomForest” package
Python: sklearn
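
A minimal scikit-learn usage sketch (the parameter values are illustrative assumptions, not settings from the slides):

```python
from sklearn.ensemble import RandomForestClassifier

# T trees, each grown on a bootstrap sample with a random subset of features per split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
# rf.fit(X_train, y_train); y_pred = rf.predict(X_test)   # prediction aggregates all T trees
```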

Page 22:

Gradient Boosted Decision Tree

Page 23:

Boosted Decision Tree

Minimize a loss ℓ(y, F(x)) with F(·) being an ensemble of trees:

F* = argmin_F ∑_{i=1}^{n} ℓ(y_i, F(x_i)) with F(x) = ∑_{m=1}^{T} f_m(x)

(each f_m is a decision tree)

Direct loss minimization: at each stage m, find the best function to minimize the loss:

solve f_m = argmin_{f_m} ∑_{i=1}^{N} ℓ(y_i, F_{m−1}(x_i) + f_m(x_i))
update F_m ← F_{m−1} + f_m

F_m(x) = ∑_{j=1}^{m} f_j(x) is the prediction for x after m iterations.

Two problems:

Hard to implement for a general loss
Tends to overfit the training data
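
For the special case of squared loss, this stage-wise minimization amounts to fitting each new tree to the current residuals; a hedged sketch (my own function, using scikit-learn regression trees, not code from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_squared_loss(X, y, T=100, max_depth=3):
    """Stage-wise boosting for squared loss: each tree f_m fits the residual y - F_{m-1}(x)."""
    y = np.asarray(y, dtype=float)
    trees, F = [], np.zeros(len(y))
    for _ in range(T):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, y - F)
        F += tree.predict(X)              # F_m = F_{m-1} + f_m
        trees.append(tree)
    return trees                          # predict with sum(t.predict(X_new) for t in trees)
```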

Page 26:

Gradient Boosted Decision Tree (GBDT)

Approximate the current loss function by a quadratic (second-order Taylor) approximation:

∑_{i=1}^{n} ℓ_i(y_i + f_m(x_i)) ≈ ∑_{i=1}^{n} ( ℓ_i(y_i) + g_i f_m(x_i) + (1/2) h_i f_m(x_i)² )
= ∑_{i=1}^{n} (h_i/2) ( f_m(x_i) − (−g_i/h_i) )² + constant,

where g_i = ∂_{y_i} ℓ_i(y_i) is the gradient, h_i = ∂²_{y_i} ℓ_i(y_i) is the second-order derivative, and y_i here denotes the current prediction F_{m−1}(x_i).
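
As an illustration (my own example, not part of the slides), the g_i and h_i for two common losses, evaluated at the current raw prediction:

```python
import numpy as np

def grad_hess_squared(y_true, y_pred):
    """Squared loss 1/2 (y_pred - y_true)^2:  g = y_pred - y_true,  h = 1."""
    return y_pred - y_true, np.ones_like(y_pred)

def grad_hess_logistic(y_true, y_pred):
    """Logistic loss on raw scores, labels in {0, 1}:  g = sigmoid(y_pred) - y,  h = p (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-y_pred))
    return p - y_true, p * (1.0 - p)
```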

Page 27:

Gradient Boosted Decision Tree

Find f_m(x, θ_m) by minimizing the (approximated) loss function:

argmin_{f_m} ∑_{i=1}^{N} [ f_m(x_i, θ) − (−g_i/h_i) ]² + R(f_m)

This reduces training with any loss function to fitting a regression tree (we just need to compute g_i for different loss functions).
h_i = α (a fixed step size) in the original GBDT.
XGBoost shows that computing the second-order derivative yields better performance.

Algorithm:

Compute the current gradient g_i for each y_i.
Build a base learner (regression tree) to fit the negative gradient.
Update the current prediction y_i = F_m(x_i) for all i.
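
Putting the three steps together, a simplified GBDT loop that plugs in the grad/hess helpers sketched earlier (the shrinkage rate lr and the omission of per-leaf reweighting by h_i are simplifying assumptions of this sketch, not details from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt(X, y, grad_hess, T=100, max_depth=3, lr=0.1):
    """At each stage, fit a regression tree to -g_i / h_i and add it to the ensemble."""
    y = np.asarray(y, dtype=float)
    trees, F = [], np.zeros(len(y))              # F holds the current predictions F_m(x_i)
    for _ in range(T):
        g, h = grad_hess(y, F)                   # 1. current gradient (and second derivative)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, -g / h)   # 2. fit base learner
        F += lr * tree.predict(X)                # 3. update current predictions
        trees.append(tree)
    return trees
```

For example, gbdt(X, y, grad_hess_logistic) trains a binary classifier on raw scores, while gbdt(X, y, grad_hess_squared) recovers boosting with squared loss.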

Page 29:

Gradient Boosted Decision Trees (GBDT)

Key idea:

Each base learner is a decision tree
Each regression tree approximates the functional gradient ∂ℓ/∂F

Page 34:

Conclusions

Next class: Matrix factorization, word embedding

Questions?