Nov 28, 2014
Machine Learning Workshop [email protected]
Machine learning introduction · Logistic regression · Feature selection
Additive Model and Boosting Tree
See more machine learning posts: http://dongguo.me
Machine learning problem
• Goal of a machine learning problem
– Based on observed samples, find a prediction function (mapping the input variable space to the response value space) that can predict well on unseen samples
• Minimize the expected risk
R_{exp}(f) = E_P[L(Y, f(X))] = \int L(y, f(x)) \, P(x, y) \, dx \, dy
• In practice, minimize the empirical risk over the observed samples
R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))
Components of machine learning ‘algorithm’
• ML = Representation + Strategy + Optimization
– Representation: turn the function optimization problem into a parameter optimization problem by choosing a family (hypothesis space) for the prediction function
– Strategy: define a loss function to evaluate the error between the predicted value and the response value
– Optimization: search for an optimal prediction function by minimizing the loss
Representation
• Determine the hypothesis space of the prediction function by choosing a 'model'
– E.g. linear model, multi-level linear model, trees, Bayesian network, additive model, and so on
– Need to balance expressiveness and generalization ability
• Choose the model with the following factors considered
– About the learning problem
• Difficulty of the learning problem
• Which models have been used successfully on similar learning problems
– About the data
• Number of samples that can be observed; number of features; interactions between features; outliers in the data
– Specific requirements
• Interpretability; computational/storage cost
Strategy
• Distinguish good classifiers from bad ones in the hypothesis space by defining a loss function
• Typical loss functions
– For classification: 0-1 LF, logarithmic LF, binomial deviance LF, exponential LF, hinge LF
– For regression: quadratic LF, absolute LF, Huber LF
R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \text{regularization}
Logarithmic loss function
• Loss function
L(Y, P(Y|X)) = -\log P(Y|X)
– Binomial logarithmic loss function
L(y, P(y|X)) = -y \log P(y=1|X) - (1-y) \log P(y=0|X)
• Minimizing the logarithmic loss = maximizing the likelihood
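The equivalence takes one step: for N i.i.d. samples the likelihood factorizes into a product, so taking the log turns it into a sum of per-sample logarithmic losses (a short LaTeX sketch):

    \arg\max_f \prod_{i=1}^{N} P(y_i \mid x_i)
      = \arg\max_f \sum_{i=1}^{N} \log P(y_i \mid x_i)
      = \arg\min_f \sum_{i=1}^{N} -\log P(y_i \mid x_i)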
3 typical loss functions for classification
• Binomial deviance loss function
L(y, f(x)) = \log[\,1 + \exp(-y f(x))\,]
• Exponential loss function
L(y, f(x)) = \exp(-y f(x))
• Hinge loss function
L(y, f(x)) = [\,1 - y f(x)\,]_+
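All three losses depend on the data only through the margin y·f(x). A minimal sketch (assuming NumPy; the grid of margins is illustrative) that evaluates them side by side:

    import numpy as np

    # All three classification losses are functions of the margin m = y * f(x).
    margins = np.linspace(-2.0, 2.0, 9)

    binomial_deviance = np.log(1.0 + np.exp(-margins))  # log[1 + exp(-y f(x))]
    exponential = np.exp(-margins)                      # exp(-y f(x))
    hinge = np.maximum(0.0, 1.0 - margins)              # [1 - y f(x)]_+

    for m, d, e, h in zip(margins, binomial_deviance, exponential, hinge):
        print(f"margin={m:+.1f}  deviance={d:.3f}  exponential={e:.3f}  hinge={h:.3f}")

Note how the exponential loss punishes large negative margins far harder than the other two, which is why it is more sensitive to noisy labels.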
Loss functions for classification
[Figure: comparison of classification loss functions, from "Elements of Statistical Learning"]
Loss functions for regression
[Figure: comparison of regression loss functions, from "Elements of Statistical Learning"]
Optimization
• Nothing to share this time
Components of typical algorithms
Model                 | Representation                            | Strategy             | Optimization
Polynomial regression | polynomial function                       | usually squared loss | has closed-form solution
Linear regression     | linear model of the variables             | usually squared loss | usually has closed-form solution
LR                    | linear function + logit link              | logarithmic loss     | gradient descent, Newton's method
ANN                   | multi-level linear functions + logit link | usually squared loss | gradient descent
SVM                   | linear function                           | hinge loss           | quadratic programming (SMO)
HMM                   | Bayes network                             | logarithmic loss     | EM
Adaboost              | additive model                            | exponential loss     | stagewise + optimize base learner
Boosting Tree
• Additive model and forward stagewise algorithm
• Boosting tree
• Adaboost
• Gradient boosting tree
Additive model
• Linear combination of base predictors
f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)
• Determine f(x) by solving
\min_{\beta_m, \gamma_m} \sum_{i=1}^{N} L\Big(y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m)\Big)
– which is difficult to solve directly for a general loss function and base learner
Forward Stagewise Additive Modeling
• Idea: solve approximately by learning one base function at a time
(1) Initialize f_0(x) = 0
(2) For m = 1, 2, ..., M:
(a) (\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) + \beta \, b(x_i; \gamma)\big)
(b) f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)
(3) f(x) = f_M(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)
Boosting tree
• Boosting tree = forward stagewise additive modeling with a decision tree as the base learner
• Different loss functions give different implementations of the boosting tree
• Can be used for both regression and classification
f_m(x) = f_{m-1}(x) + T(x; \Theta_m)
\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\big)
Boosting tree for regression
• When the quadratic loss function is chosen:
Input: training set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, x_i \in R^n, y_i \in R
Output: boosting tree for regression f_M(x)
1. Initialize f_0(x) = 0
2. For m = 1 to M:
(a) compute residuals r_{mi} = y_i - f_{m-1}(x_i), i = 1, 2, ..., N
(b) learn a regression tree T(x; \Theta_m) by fitting the residuals r_{mi}
(c) update f_m(x) = f_{m-1}(x) + T(x; \Theta_m)
3. Get the final regression boosting tree f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)
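A minimal runnable sketch of this procedure, assuming scikit-learn's DecisionTreeRegressor as the base learner (the tree depth, tree count, and toy data are illustrative choices, not from the slides):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_boosting_tree(X, y, n_trees=50, max_depth=2):
        """Boosting tree for regression with quadratic loss: each tree fits residuals."""
        trees = []
        f = np.zeros(len(y))                      # step 1: f_0(x) = 0
        for _ in range(n_trees):                  # step 2: for m = 1..M
            residual = y - f                      # (a) r_mi = y_i - f_{m-1}(x_i)
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)  # (b)
            f += tree.predict(X)                  # (c) f_m = f_{m-1} + T(x; Theta_m)
            trees.append(tree)
        return trees

    def boosting_tree_predict(trees, X):
        return sum(tree.predict(X) for tree in trees)  # f_M(x) = sum_m T(x; Theta_m)

    # Toy usage on synthetic data
    rng = np.random.RandomState(0)
    X = rng.rand(200, 3)
    y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.randn(200)
    trees = fit_boosting_tree(X, y)
    print("train MSE:", np.mean((boosting_tree_predict(trees, X) - y) ** 2))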
Boosting tree for classification
• When the exponential loss function is chosen: Adaboost + classification tree
L(y, f(x)) = \exp(-y f(x))
• When the binomial deviance loss function is chosen: LogitBoost + classification tree
L(y, f(x)) = \log[\,1 + \exp(-y f(x))\,]
Adaboost review
Input: training set {(x_i, y_i)}_{i=1}^{N}, y_i \in \{-1, +1\}; number of iterations M
1. Initialize the weights of the training samples:
W_1 = (w_{11}, ..., w_{1i}, ..., w_{1N}), \quad w_{1i} = 1/N, \; i = 1, 2, ..., N
2. For m = 1 to M:
1) fit a base learner G_m(x): \chi \to \{-1, +1\} using the dataset with weights W_m
2) calculate the classification error on the training dataset: e_m = \sum_{i=1}^{N} w_{mi} I(G_m(x_i) \ne y_i)
3) calculate the coefficient of G_m(x) using the classification error: a_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}
4) update the weight of each training sample:
W_{m+1} = (w_{m+1,1}, ..., w_{m+1,i}, ..., w_{m+1,N}), \quad w_{m+1,i} \leftarrow w_{mi} \exp(-a_m y_i G_m(x_i))
3. Get the final classifier:
G(x) = \mathrm{sign}(f(x)) = \mathrm{sign}\Big(\sum_{m=1}^{M} a_m G_m(x)\Big)
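A minimal sketch of the same loop, assuming depth-1 scikit-learn trees (decision stumps) as base learners; labels must be coded as -1/+1, and M is an illustrative choice:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, M=20):
        """Adaboost sketch; y must take values in {-1, +1}."""
        N = len(y)
        w = np.full(N, 1.0 / N)                          # 1. uniform initial weights
        learners, coefficients = [], []
        for _ in range(M):                               # 2. for m = 1..M
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)             # 1) fit G_m under weights W_m
            pred = stump.predict(X)
            e = np.sum(w * (pred != y))                  # 2) weighted classification error e_m
            a = 0.5 * np.log((1.0 - e) / max(e, 1e-12))  # 3) coefficient a_m
            w = w * np.exp(-a * y * pred)                # 4) reweight the samples
            w /= w.sum()                                 #    normalize so weights sum to 1
            learners.append(stump)
            coefficients.append(a)
        return learners, coefficients

    def adaboost_predict(learners, coefficients, X):
        # G(x) = sign(sum_m a_m G_m(x))
        F = sum(a * g.predict(X) for g, a in zip(learners, coefficients))
        return np.sign(F)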
Adaboost: forward stagewise additive modeling with exponential loss
• Exponential loss function
L(y, f(x)) = \exp[-y f(x)]
• Forward stagewise additive modeling
f_m(x) = f_{m-1}(x) + a_m G_m(x)
(a_m, G_m(x)) = \arg\min_{a, G} \sum_{i=1}^{N} \exp[-y_i (f_{m-1}(x_i) + a G(x_i))]
= \arg\min_{a, G} \sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i a G(x_i)], \quad where \bar{w}_{mi} = \exp[-y_i f_{m-1}(x_i)]
• Infer a_m and G_m(x)
Adaboost: forward stagewise additive modeling with exponential loss (2)
• Continued
• Inference of G_m(x): for any a > 0, we have
G_m^*(x) = \arg\min_G \sum_{i=1}^{N} \bar{w}_{mi} I(y_i \ne G(x_i))
• Inference of a_m:
\sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i a G(x_i)] = e^{-a} \sum_{y_i = G(x_i)} \bar{w}_{mi} + e^{a} \sum_{y_i \ne G(x_i)} \bar{w}_{mi}
= (e^{a} - e^{-a}) \sum_{i=1}^{N} \bar{w}_{mi} I(y_i \ne G(x_i)) + e^{-a} \sum_{i=1}^{N} \bar{w}_{mi}
\Rightarrow a_m^* = \frac{1}{2} \log \frac{1 - e_m}{e_m}, \quad where e_m = \frac{\sum_{i=1}^{N} \bar{w}_{mi} I(y_i \ne G_m(x_i))}{\sum_{i=1}^{N} \bar{w}_{mi}}
Adaboost: forward stagewise additive modeling with exponential loss (3)
• Weight update for each sample
f_m(x) = f_{m-1}(x) + a_m G_m(x)
\bar{w}_{m+1,i} = \exp[-y_i f_m(x_i)]
\Rightarrow \bar{w}_{m+1,i} = \bar{w}_{m,i} \exp(-y_i a_m G_m(x_i))
CART review
• Select the split variable according to the Gini index
• Can be used for both regression and classification
• Grow the tree as large as possible first, then prune via validation
• Parameters
– Height; stop-split condition
Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2), \quad D_1 = \{(x, y) \in D \mid A(x) = a\}, \; D_2 = D - D_1
Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2
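A short sketch of the two formulas, assuming NumPy; the toy labels are illustrative:

    import numpy as np

    def gini(labels):
        """Gini(p) = 1 - sum_k p_k^2 over the class proportions in a node."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def gini_split(y_left, y_right):
        """Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
        n1, n2 = len(y_left), len(y_right)
        n = n1 + n2
        return (n1 / n) * gini(y_left) + (n2 / n) * gini(y_right)

    print(gini([1, 1, 1, 1]))                # pure node: 0.0
    print(gini([1, 1, 0, 0]))                # 50/50 node: 0.5
    print(gini_split([1, 1, 1], [0, 0, 1]))  # weighted impurity of a candidate split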
Experiment
• Goal: evaluate the performance of the boosting tree
• Algorithms
– Logistic regression
– CART
– Boosting tree (Adaboost + CART)
• Hulu internal datasets
– Ad intelligence
Experiment (2)
• Task: predict whether the recall is high or low (binary classification)
• Dataset: Ad intelligence
– 718 samples; 93 features
– 5-fold cross validation (protocol sketched below)
• AUC with logistic regression: 0.89
• Parameters for the boosting tree
– Tree height, number of base learners, and stop-split conditions
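The Ad-intelligence dataset is Hulu-internal, but the evaluation protocol is easy to reproduce; a sketch with scikit-learn on synthetic stand-in data of the same shape (718 samples, 93 features; the stand-in data is an assumption, not the real set):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in with the same shape as the Ad-intelligence set;
    # the real data is not reproduced here.
    X, y = make_classification(n_samples=718, n_features=93, random_state=0)

    # 5-fold cross validation, scored by AUC, as in the experiment above
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=5, scoring="roc_auc")
    print("AUC per fold:", np.round(auc, 3), " mean:", auc.mean().round(3))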
Experiment (3)
• Test result with the boosting tree: AUC 0.96
– 0.79 for a single CART (height 6)
[Figure: AUC on the test dataset (5-fold cross validation) vs. number of base learners (1 to 50), for tree heights H=2 through H=6]
Gradient boosting
• Allows optimization of an arbitrary differentiable loss function
• Uses the gradient descent idea to approximate the residual
– With the quadratic loss function, it is exactly the ordinary residual
pseudo-residual: -\left[ \frac{\partial L(y, f(x))}{\partial f(x)} \right]_{f(x) = f_{m-1}(x)}
For the quadratic loss L(y, f(x)) = \frac{1}{2}\,(y - f(x))^2, this is y - f(x).
Gradient boosting: pseudo code
Input: training set {(x_i, y_i)}_{i=1}^{n}; a differentiable loss function L(y, F(x)); number of iterations M
1. Initialize the model with a constant value:
F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
2. For m = 1 to M:
1) compute pseudo-residuals:
r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}, \quad for i = 1, ..., n
2) fit a base learner h_m(x) to the pseudo-residuals (train using the dataset {(x_i, r_{im})}_{i=1}^{n})
3) compute the multiplier \gamma_m by solving the following optimization problem:
\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))
4) update the model: F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
3. Output F_M(x)
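A minimal sketch of this pseudo code, assuming scikit-learn regression trees as base learners and SciPy for the line search; it is wired here for the absolute loss L(y, F) = |y - F| to show a non-quadratic case (depth and M are illustrative):

    import numpy as np
    from scipy.optimize import minimize_scalar
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, M=50, max_depth=3):
        """Gradient boosting sketch for the absolute loss L(y, F) = |y - F|."""
        loss = lambda pred: np.abs(y - pred).sum()
        f0 = np.median(y)                       # 1. F_0 = argmin_gamma sum |y_i - gamma|
        F = np.full(len(y), f0)
        models, gammas = [], []
        for _ in range(M):                      # 2. for m = 1..M
            r = np.sign(y - F)                  # 1) pseudo-residuals: -dL/dF = sign(y - F)
            h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)  # 2) fit h_m to r
            hx = h.predict(X)
            gamma = minimize_scalar(lambda g: loss(F + g * hx)).x     # 3) line search
            F = F + gamma * hx                  # 4) F_m = F_{m-1} + gamma_m h_m
            models.append(h)
            gammas.append(gamma)
        return f0, models, gammas

    def gradient_boost_predict(f0, models, gammas, X):
        return f0 + sum(g * h.predict(X) for h, g in zip(models, gammas))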
Gradient tree boosting
• Use a decision tree as the base learner
• Stagewise learning; choose \gamma by line search
• Friedman proposes choosing a separate optimal value \gamma_{jm} for each of the tree's regions
h_m(x) = \sum_{j=1}^{J} b_{jm} I(x \in R_{jm})
F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \quad \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))
F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}), \quad \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)
Parameter choices and tricks
• Parameter choices
– Terminal nodes J: [4, 8] is recommended
– Iterations M: selected by evaluation on test/validation data
• Tricks for improvement (see the sketch after this list)
– Shrinkage: F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x), \quad 0 < \nu \le 1
– Stochastic gradient boosting
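Both tricks are exposed as parameters in common libraries; for example, a sketch using scikit-learn's GradientBoostingClassifier (the concrete values are illustrative, not recommendations from the slides):

    from sklearn.ensemble import GradientBoostingClassifier

    model = GradientBoostingClassifier(
        n_estimators=200,    # iterations M
        max_leaf_nodes=8,    # terminal nodes J, in the recommended [4, 8] range
        learning_rate=0.1,   # shrinkage factor nu, 0 < nu <= 1
        subsample=0.8,       # < 1.0 turns on stochastic gradient boosting
    )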
Boosting Tree Summary
• Forward stagewise additive model with trees
• Pros
– Usually performs well
– Works for both regression and classification
– No need to transform/normalize the data
– Few parameters; easy to tune
• Tips
– Try loss functions other than the exponential loss, especially when the data is noisy
– Bumping is usually good
Resources
• Implementation/Tools
– MART (Multiple Additive Regression Trees)
– Will share my implementation later
• More on boosting trees
– "Elements of Statistical Learning"
– "Statistical Learning Methods" (《统计学习方法》)
– Parallelization: "Scaling Up Machine Learning"