2014-09-21
1
Hanyang University Quest Lab.
Chapter 9
Classification and regression trees
Fall 2014, Department of IME, Hanyang University
9.2 Classification trees
• Data-driven method for classification (classification tree) and prediction
(regression tree)
• Tree method developed by Breiman et al. → CART
• Observations are separated into subgroups based on predictors
• Goal: Classify or predict an outcome based on a set of predictors
• The output is a set of rules
• Recursive partitioning and pruning with consideration of the
homogeneity
9.1 Introduction
Example:
• Goal: classify a record as “will accept credit card offer” or “will not accept”
• Rule might be “IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor)”
• Also called CART, Decision Trees, or just Trees
• Rules are represented by tree diagrams
9.1 Introduction
Tree diagram legend:
• Number in the circle node: splitting value of the predictor
• Terminal (square) node: acceptor (1) or nonacceptor (0)
• Number on the fork: number of records
9.1 Introduction
“IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor)”
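The IF-THEN rule on this slide can be expressed directly as code. This is a minimal sketch, assuming a record with Income, Education, and Family attributes; the function name and the None fallback (standing in for the other branches of the tree) are illustrative choices, not part of the original example.

```python
def classify(income, education, family):
    """Apply the example rule from the tree diagram."""
    # IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5)
    # THEN Class = 0 (nonacceptor)
    if income > 92.5 and education < 1.5 and family <= 2.5:
        return 0  # nonacceptor
    return None  # in the full tree, other branches would decide

print(classify(100, 1, 2))  # rule fires -> 0
```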
9.2 Classification trees
• Divide the p-dimensional predictor space into two parts by a split value s of a predictor x_j, such that one part contains the records {i = 1, …, n : x_ij ≤ s} and the other the records {i = 1, …, n : x_ij > s}
• Measure how “pure” or homogeneous each of the resulting portions is (“pure” = containing records of mostly one class)
• The algorithm tries different values of x_j and s so as to maximize the purity of the split
• After you get a “maximum purity” split, repeat the process for a second split, and so on
• Repeat until we get “pure” classes (each partition belonging to one class)

Stopping tree growth: CHAID
• Splitting stops when the purity improvement is not statistically significant
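The purity-driven split search described above can be sketched in Python. This is a simplified illustration for a single predictor using Gini impurity; the function names are hypothetical, and CART would scan all predictors, not just one.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(x, y):
    """Try each candidate split value s of one predictor x and return
    the s minimizing the weighted impurity of the two partitions."""
    best_s, best_imp = None, float("inf")
    for s in sorted(set(x))[:-1]:          # candidate split points
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_s, best_imp = s, imp
    return best_s, best_imp

print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))  # -> (2, 0.0): a pure split
```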
9.5 Avoiding overfitting
Pruning the tree
- Let the tree grow to its full extent, then prune it back
- In the riding-mower example, the rectangles created by the last few splits contain only a single record each → these can be regarded as containing noise
- Pruning means converting a decision node into a leaf
- The goal is to find the point at which the error rate on unseen data starts to increase
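The pruning step of converting a decision node into a leaf can be sketched as follows; the dict representation of a tree node and the key names are hypothetical choices for illustration.

```python
def prune_to_leaf(node):
    """Pruning step: replace a decision node by a leaf labeled with
    the majority class of the training records that reach it."""
    labels = node["labels"]                      # records reaching this node
    majority = max(set(labels), key=labels.count)
    return {"type": "leaf", "class": majority, "labels": labels}

subtree = {"type": "decision", "split": ("Income", 92.5),
           "labels": [0, 0, 0, 1]}
print(prune_to_leaf(subtree)["class"])  # majority of the four records -> 0
```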
9.5 Avoiding overfitting
Pruning the tree (CART)
• Use the cost complexity (CC) of a tree T: CC(T) = err(T) + α · L(T),
where
err(T) = cost of the misclassifications made by tree T
L(T) = number of leaves in T
α = penalty factor (set by user)
(As the tree grows, the error decreases but the penalty grows)
(The larger α is, the higher the CC of a large tree)
• Among trees of a given size (number of decision nodes), choose the one with the lowest CC
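The cost-complexity criterion CC(T) = err(T) + α·L(T) is easy to sketch as a function; the error rates and leaf counts below are illustrative numbers, not results from a real tree.

```python
def cost_complexity(err, n_leaves, alpha):
    """CC(T) = err(T) + alpha * L(T): misclassification cost
    plus a per-leaf penalty."""
    return err + alpha * n_leaves

# A larger tree has lower error but more leaves; for this alpha,
# the smaller tree has the lower cost complexity.
big = cost_complexity(0.02, 20, 0.005)    # 0.02 + 0.005*20 = 0.12
small = cost_complexity(0.05, 5, 0.005)   # 0.05 + 0.005*5  = 0.075
print(small < big)  # -> True
```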
9.5 Avoiding overfitting
• Do this for each size of tree
→ for each size, this yields the tree with the minimum CC
• Among these trees, either choose the one whose misclassification error on the validation set is minimum (the “minimum error tree”), or
• choose the smallest tree whose validation error is within one standard error of the minimum error tree (the “best pruned tree”)
• This is a correction that accounts for sampling error (the validation data were also used in the pruning process, and a different sample could have given a different result)

The standard error of a validation error rate p is sqrt(p(1 − p)/n), where n is the number of samples in the validation set.
9.5 Avoiding overfitting
Example (figure shows the full-grown tree, the minimum error tree, and the best pruned tree). With a minimum validation error rate of 0.0147 and a validation set of n = 1500:
0.0147 + sqrt(0.0147 × (1 − 0.0147) / 1500) ≈ 0.0178 = 1.78%
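The standard-error arithmetic on this slide can be checked numerically (p = 0.0147 is the minimum validation error rate, n = 1500 the validation set size):

```python
import math

def one_se_threshold(p, n):
    """Upper bound for the best pruned tree: p + sqrt(p(1-p)/n)."""
    return p + math.sqrt(p * (1 - p) / n)

t = one_se_threshold(0.0147, 1500)
print(round(t, 4))  # -> 0.0178, i.e. about 1.78%
```

Any tree whose validation error falls below this threshold is a candidate; the smallest such tree is the best pruned tree.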
9.6 Classification rules from trees
Classification is easily done based on the tree.
Ex. Rule might be “IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor)”
• Can be extended to the case of more than two classes
• Each leaf node is labeled with one of the m classes
9.8 Regression trees
• Regression tree: used with a continuous outcome variable
• Procedure similar to a classification tree
• Many splits are attempted; choose the one that minimizes impurity
Ex. Predicting Toyota Corolla prices (see Ch. 6): 10 predictors, 600-record training set
Best tree: only two predictors are useful, Age and Horse Power.
9.8 Regression trees
Prediction
• Cf. classification tree: the class of a leaf is decided by a “vote” of the records in that leaf
• Regression tree: the value of a leaf node is the average of the records in that leaf

Measuring impurity
• Typical measure: sum of squared deviations from the mean of the leaf

Evaluating performance
• Same way as for other predictive methods: RMSE (root mean squared error), lift charts, etc.
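The regression-tree leaf prediction and impurity measures described above can be sketched in a few lines; the function names are illustrative.

```python
def leaf_value(y):
    """Regression-tree prediction: the mean of the outcomes in the leaf."""
    return sum(y) / len(y)

def leaf_impurity(y):
    """Typical impurity: sum of squared deviations from the leaf mean."""
    m = leaf_value(y)
    return sum((yi - m) ** 2 for yi in y)

print(leaf_value([2, 4, 6]), leaf_impurity([2, 4, 6]))  # -> 4.0 8.0
```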
9.9 Advantages, weaknesses, and extensions
• Good classifier, useful for variable selection
• No need to transform variables
• Automatic selection of variables by means of splits (pruning may leave only a subset of the variables selected)
• Robust to outliers (a split depends only on the order of the records’ values, not on their magnitudes)
• But sensitive to changes in the data (a slight change can cause a very different split)
• No particular relationship between predictors and response is assumed in advance (cf. linear regression assumes a linear relationship)
• Since each split is made on a single predictor, interactions between predictors cannot be taken into account
• Performance degrades when horizontal and vertical splits are not appropriate for the data