Understanding variable importances in forests of randomized trees
Posted on 18-Jun-2015
Transcript
Understanding variable importances in forests of randomized trees
Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts
Dept. of EE & CS & GIGA-R, Université de Liège, Belgium
June 6, 2014
1 / 12
Ensembles of randomized trees
...
[Figure: an ensemble of randomized trees whose outputs are combined by majority vote or prediction average]
I Improve standard classification and regression trees by reducing their variance.
I Many examples: Bagging (Breiman, 1996), Random Forests (Breiman, 2001), Extremely randomized trees (Geurts et al., 2006).
I Standard Random Forests: bootstrap sampling + random selection of K features at each node.
2 / 12
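As a concrete sketch of the three variants just listed, here is a minimal example assuming scikit-learn (the slides do not name a library) and a synthetic dataset; `max_features` plays the role of the K mentioned on the slide:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    # Bagging (Breiman, 1996): bootstrap samples, no feature subsampling.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                 random_state=0),
    # Random Forest (Breiman, 2001): bootstrap + K random features per node.
    "random forest": RandomForestClassifier(n_estimators=100, max_features=3,
                                            random_state=0),
    # Extremely randomized trees (Geurts et al., 2006): random thresholds too.
    "extra trees": ExtraTreesClassifier(n_estimators=100, max_features=3,
                                        random_state=0),
}
for name, model in models.items():
    print(name, model.fit(X, y).score(X, y))
```

All three aggregate many randomized trees; they differ only in how the randomization (bootstrap samples, feature subsets, split thresholds) is injected.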
Strengths and weaknesses
I Universal approximation
I Robustness to outliers
I Robustness to irrelevant attributes (to some extent)
I Invariance to scaling of inputs
I Good computational efficiency and scalability
I Very good accuracy
I Loss of interpretability w.r.t. single decision trees
3 / 12
Variable importances
I Interpretability can be recovered through variable importances
I Two main importance measures:
• The mean decrease of impurity (MDI): summing the total impurity reductions at all tree nodes where the variable appears (Breiman et al., 1984);
• The mean decrease of accuracy (MDA): measuring the accuracy reduction on out-of-bag samples when the values of the variable are randomly permuted (Breiman, 2001).
I We focus here on MDI because:
• it is faster to compute;
• it does not require bootstrap sampling;
• in practice, it correlates well with the MDA measure (except under specific conditions).
4 / 12
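The two measures can be computed side by side; a minimal sketch assuming scikit-learn (not referenced in the slides), where `feature_importances_` is the MDI score and `permutation_importance` is the MDA-style permutation measure evaluated on held-out samples:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# MDI: accumulated impurity reductions, a free by-product of training.
mdi = forest.feature_importances_
# MDA-style: accuracy drop when each column is randomly permuted.
mda = permutation_importance(forest, X_te, y_te, n_repeats=10,
                             random_state=0).importances_mean

print("MDI:", np.round(mdi, 3))
print("MDA:", np.round(mda, 3))
```

Note the slide's point about cost: MDI comes for free from the fitted trees, while the permutation measure requires repeated re-scoring of the model.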
Mean decrease of impurity
I Importance of variable X_j for an ensemble of M trees \varphi_m is:

Imp(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} 1(j_t = j) \left[ p(t) \, \Delta i(t) \right],   (1)

where j_t denotes the variable used at node t, p(t) = N_t / N, and \Delta i(t) is the impurity reduction at node t:

\Delta i(t) = i(t) - \frac{N_{t_L}}{N_t} i(t_L) - \frac{N_{t_R}}{N_t} i(t_R)   (2)
I Impurity i(t) can be the Shannon entropy, the Gini index, the variance, ...
5 / 12
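Equations (1) and (2) can be recomputed by hand from fitted trees. The sketch below walks scikit-learn's tree internals (an assumption; the slide only gives the math) and checks the result against the library's own MDI scores, which normalize each tree's importances to sum to one:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

p = X.shape[1]
imp = np.zeros(p)
for est in forest.estimators_:
    tree = est.tree_
    N = tree.weighted_n_node_samples[0]      # samples reaching the root
    tree_imp = np.zeros(p)
    for t in range(tree.node_count):
        left, right = tree.children_left[t], tree.children_right[t]
        if left == -1:                       # leaf: no split at node t
            continue
        Nt = tree.weighted_n_node_samples[t]
        # Eq. (2): impurity reduction at node t
        delta = (tree.impurity[t]
                 - tree.weighted_n_node_samples[left] / Nt * tree.impurity[left]
                 - tree.weighted_n_node_samples[right] / Nt * tree.impurity[right])
        # Eq. (1): accumulate p(t) * delta_i(t) on the split variable j_t
        tree_imp[tree.feature[t]] += (Nt / N) * delta
    imp += tree_imp / tree_imp.sum()         # per-tree normalization
imp /= len(forest.estimators_)

print(np.round(imp, 4))
```

Up to that per-tree normalization convention, this is exactly the average of p(t) Δi(t) over the nodes where each variable is used.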
Motivation and assumptions
Imp(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in \varphi_m} 1(j_t = j) \left[ p(t) \, \Delta i(t) \right]

I MDI works well, but it is not well understood theoretically;
I We would like to better characterize it and derive its main properties from this characterization.
I Working assumptions:
• All variables are discrete;
• Multi-way splits à la C4.5 (i.e., one branch per value);
• Shannon entropy as impurity measure: i(t) = -\sum_c \frac{N_{t,c}}{N_t} \log \frac{N_{t,c}}{N_t}
• Totally randomized trees (RF with K = 1);
• Asymptotic conditions: N → ∞, M → ∞.
6 / 12
Result 1: Three-level decomposition

Variable importances provide a three-level decomposition of the information jointly provided by all the input variables about the output, accounting for all interaction terms in a fair and exhaustive way.

I(X_1, \ldots, X_p; Y) = \sum_{j=1}^{p} Imp(X_j)

The left-hand side is the information jointly provided by all input variables about the output; the right-hand side is (i) its decomposition in terms of the MDI importance of each input variable. Each importance further decomposes as

Imp(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in P_k(V^{-j})} I(X_j; Y | B)

that is, (ii) along the degrees k of interaction with the other variables and (iii) along all interaction terms B of a given degree k.

E.g., for p = 3: Imp(X_1) = \frac{1}{3} I(X_1;Y) + \frac{1}{6}\left( I(X_1;Y|X_2) + I(X_1;Y|X_3) \right) + \frac{1}{3} I(X_1;Y|X_2,X_3)
7 / 12
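The decomposition can be verified numerically by brute force. Below is a sketch on a small hypothetical discrete example (Y = X1 XOR X2, X3 pure noise), using empirical plug-in entropies; since the identity holds for any joint distribution, it holds exactly for the empirical one too:

```python
from itertools import combinations
from math import comb
import numpy as np

rng = np.random.default_rng(0)
p = 3
X = rng.integers(0, 2, size=(2000, p))
y = X[:, 0] ^ X[:, 1]                      # Y = X1 XOR X2; X3 is pure noise

def entropy(*cols):
    # Empirical Shannon entropy (in bits) of the joint distribution of cols.
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    q = counts / counts.sum()
    return float(-(q * np.log2(q)).sum())

def cond_mi(j, B):
    # I(X_j; Y | X_B) via the identity H(Xj,B) + H(Y,B) - H(Xj,Y,B) - H(B).
    if not B:
        return entropy(X[:, j]) + entropy(y) - entropy(X[:, j], y)
    cols = [X[:, b] for b in B]
    return (entropy(X[:, j], *cols) + entropy(y, *cols)
            - entropy(X[:, j], y, *cols) - entropy(*cols))

def importance(j):
    # Imp(X_j) = sum_k 1/(C_p^k (p-k)) sum_{B in P_k(V^-j)} I(X_j; Y | B)
    others = [i for i in range(p) if i != j]
    return sum(cond_mi(j, B) / (comb(p, k) * (p - k))
               for k in range(p) for B in combinations(others, k))

imps = [importance(j) for j in range(p)]
joint_mi = (entropy(*(X[:, j] for j in range(p))) + entropy(y)
            - entropy(y, *(X[:, j] for j in range(p))))
print([round(v, 3) for v in imps], round(joint_mi, 3))
```

On this example, X1 and X2 each receive about half of the joint information (their contribution comes entirely from interaction terms, since I(X1;Y) = I(X2;Y) = 0), and the importances sum to I(X1,X2,X3;Y).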
Illustration (Breiman et al., 1984)

[Figure: a seven-segment LED display whose segments correspond to the binary variables X1, ..., X7.]

y | x1 x2 x3 x4 x5 x6 x7
0 |  1  1  1  0  1  1  1
1 |  0  0  1  0  0  1  0
2 |  1  0  1  1  1  0  1
3 |  1  0  1  1  0  1  1
4 |  0  1  1  1  0  1  0
5 |  1  1  0  1  0  1  1
6 |  1  1  0  1  1  1  1
7 |  1  0  1  0  0  1  0
8 |  1  1  1  1  1  1  1
9 |  1  1  1  1  0  1  1
8 / 12
Illustration (Breiman et al., 1984)

Imp(X_j) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in P_k(V^{-j})} I(X_j; Y | B)

Var | Imp
X1  | 0.412
X2  | 0.581
X3  | 0.531
X4  | 0.542
X5  | 0.656
X6  | 0.225
X7  | 0.372
Σ   | 3.321

[Figure: decomposition of each importance Imp(X1), ..., Imp(X7) along the interaction degrees k = 0, ..., 6.]
8 / 12
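The table above can be approximated empirically. A sketch assuming scikit-learn (not mentioned on the slide): `ExtraTreesClassifier` with `max_features=1` and entropy impurity behaves as totally randomized trees on these binary variables, and replicating the ten patterns approximates the asymptotic N → ∞ setting:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# The ten seven-segment patterns from the slide's table (rows = digits 0-9).
patterns = np.array([
    [1, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1, 1],
])
X = np.tile(patterns, (100, 1))            # replicate to mimic N -> infinity
y = np.tile(np.arange(10), 100)

# K = 1 and random thresholds: totally randomized trees on binary inputs.
forest = ExtraTreesClassifier(n_estimators=500, max_features=1,
                              criterion="entropy", random_state=0).fit(X, y)
print(np.round(forest.feature_importances_, 3))
```

Because scikit-learn normalizes importances to sum to one, the values are comparable to the table after dividing its column by 3.321; in particular X5 should come out largest and X6 smallest.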
Result 2 : Irrelevant variables
Variable importances depend only on the relevant variables.
Definition (Kohavi & John, 1997): A variable X is irrelevant (to Y with respect to V) if, for all B ⊆ V, I(X; Y | B) = 0. A variable is relevant if it is not irrelevant.

A variable X_j is irrelevant if and only if Imp(X_j) = 0.

The importance of a relevant variable is insensitive to the addition or the removal of irrelevant variables.
9 / 12
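A quick empirical sketch of the invariance claim, again assuming scikit-learn with totally randomized trees (K = 1): appending a pure-noise column should leave the relevant importances nearly unchanged and give the noise column near-zero importance:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 3))
y = X[:, 0] | X[:, 1]                      # X3 is irrelevant by construction

def importances(X):
    # Totally randomized trees: K = 1, random thresholds, entropy impurity.
    forest = ExtraTreesClassifier(n_estimators=300, max_features=1,
                                  criterion="entropy", random_state=0).fit(X, y)
    return forest.feature_importances_

before = importances(X)
# Append a fresh irrelevant column and recompute.
after = importances(np.column_stack([X, rng.integers(0, 2, size=2000)]))
print(np.round(before, 3), np.round(after, 3))
```

In a finite sample the irrelevant columns pick up a small positive bias, but the importances of the relevant variables barely move.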
Non-totally randomized trees
Most properties are lost as soon as K > 1.
⇒ There can be relevant variables with zero importance (due to masking effects).

Example: I(X_1;Y) = H(Y), I(X_1;Y) ≈ I(X_2;Y), I(X_1;Y|X_2) = ε, and I(X_2;Y|X_1) = 0
I K = 1 → Imp_{K=1}(X_1) ≈ \frac{1}{2} I(X_1;Y) + ε and Imp_{K=1}(X_2) ≈ \frac{1}{2} I(X_2;Y)
I K = 2 → Imp_{K=2}(X_1) = I(X_1;Y) and Imp_{K=2}(X_2) = 0

⇒ The importance of relevant variables can be influenced by the number of irrelevant variables.
I With K = 2, adding a new irrelevant variable X_3 makes Imp_{K=2}(X_2) > 0.
10 / 12
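The masking effect can be sketched on synthetic data (assuming scikit-learn; K = 1 via ExtraTrees with `max_features=1`, K = p via a deterministic forest with `max_features=None`). X2 is a slightly noisy copy of X1 and Y = X1, so X2 is relevant but strictly worse than X1 at every node:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=5000)
flip = (rng.random(5000) < 0.05).astype(int)
x2 = x1 ^ flip                             # noisy copy: I(X2;Y|X1) = 0
x3 = rng.integers(0, 2, size=5000)         # irrelevant
X, y = np.column_stack([x1, x2, x3]), x1

# K = 1: splits chosen at random, both copies get substantial importance.
k1 = ExtraTreesClassifier(n_estimators=300, max_features=1,
                          criterion="entropy", random_state=0).fit(X, y)
# K = p: the locally optimal split always picks X1, masking X2 entirely.
kp = RandomForestClassifier(n_estimators=300, max_features=None,
                            bootstrap=False, criterion="entropy",
                            random_state=0).fit(X, y)
print("K=1:", np.round(k1.feature_importances_, 3))
print("K=p:", np.round(kp.feature_importances_, 3))
```

With K = p every tree splits on X1 at the root, its children are pure, and X2 never appears in any tree: its importance drops to exactly zero despite being relevant.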
Illustration (continued)
[Figure: importance of each variable X1, ..., X7 on the LED example, plotted as a function of K = 1, ..., 7; importance values range from about 0.1 to 0.9.]
11 / 12
Conclusions
I We propose a theoretical characterization of MDI importance scores in asymptotic conditions;
I In the case of totally randomized trees, variable importances actually show sound and desirable theoretical properties that are lost when using non-totally randomized trees;
I Main results remain valid for a large range of impurity measures (Gini entropy, variance, kernelized variance, etc.);
I Future work:
• More formal treatment of the non-totally random case (K > 1);
• Derive distributions of importances in the finite setting;
• Use these results to design better importance score estimators.
12 / 12