
Gradient Boosting Survival Tree

with Applications in Credit Scoring

Miaojun Bai, Yan Zheng, Yun Shen

360 Finance Inc. (Nasdaq: QFIN)

Credit Scoring and Credit Control XVI, Edinburgh, 29.08.2019

Yun Shen | Gradient Boosting Survival Tree 1/21


Outline

1 Motivation

2 Gradient boosting survival tree

3 Applications in credit scoring

4 Conclusion


Chinese consumer finance market

[Figure: market size ($ billion), 01.2010 to 10.2018; labeled values 97.1 and 1207.7]

Rapid growth

Heterogeneous data

PBC report: 1/3 has credit ratings

personal info.

device info.

third-party rating agencies

Changing market conditions

regulation

macroeconomic factor


Motivation

Pros of tree ensemble methods (e.g., XGB, LightGBM)

robust for heterogeneous data

fast modeling for credit scoring

utilize numerous “weak” attributes

Pros of survival analysis

predict the probability distribution of the default time

take long-term behavior into consideration

Idea: survival analysis + tree ensemble methods?


Survival analysis

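Survival function: $S(t) = P(T > t)$

Discrete time periods: $(\tau_{j-1}, \tau_j]$, $j = 1, 2, \dots$

Hazard function:

$$h(\tau_j) := P(\tau_{j-1} < T \le \tau_j \mid T > \tau_{j-1}), \quad j = 1, 2, \dots$$

Hence,

$$S(\tau_j) = \prod_{l=1}^{j} \bigl(1 - h(\tau_l)\bigr)$$

Likelihood

$$P(\tau_{j-1} < T \le \tau_j) = h(\tau_j)\, S(\tau_{j-1}) = h(\tau_j) \prod_{l=1}^{j-1} \bigl(1 - h(\tau_l)\bigr)$$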
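The relations on this slide can be checked numerically. The following is a minimal sketch, assuming the hazards are given as a plain list indexed by period and that S(τ0) = 1; the function names and the hazard values are illustrative and do not come from the slides.

```python
from itertools import accumulate
from operator import mul


def survival_from_hazards(hazards):
    """Return [S(tau_1), ..., S(tau_J)] given [h(tau_1), ..., h(tau_J)]."""
    # S(tau_j) is the running product of (1 - h(tau_l)) for l = 1..j.
    return list(accumulate((1.0 - h for h in hazards), mul))


def default_probabilities(hazards):
    """Return P(tau_{j-1} < T <= tau_j) = h(tau_j) * S(tau_{j-1}) for each j."""
    survival = [1.0] + survival_from_hazards(hazards)  # assumes S(tau_0) = 1
    # zip stops at the shorter list, pairing h(tau_j) with S(tau_{j-1}).
    return [h * s_prev for h, s_prev in zip(hazards, survival)]


hazards = [0.10, 0.20, 0.50]  # illustrative values, not from the slides
print(survival_from_hazards(hazards))
print(default_probabilities(hazards))
```

The per-period default probabilities plus the final survival probability sum to one, which is a quick sanity check on any estimated hazard curve.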


Likelihood

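Log hazard function (the log-odds of the hazard)

$$f(t) := \log \frac{h(t)}{1 - h(t)}$$

Likelihood

$$P(T = t) = \prod_{j=1}^{J(t) \wedge J} \frac{1}{1 + e^{-y_j(t) f(\tau_j)}},$$

where

$$J(t) := \begin{cases} j, & \text{if } t \in (\tau_{j-1}, \tau_j] \\ J + 1, & \text{if } t > \tau_J \end{cases}
\qquad
y_j(t) := \begin{cases} -1, & \text{if } t > \tau_j \\ 1, & \text{if } t \le \tau_j \end{cases}$$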
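To see that the logistic form reproduces the hazard-based likelihood, the definitions of J(t) and yj(t) can be coded directly. This is a sketch under stated assumptions: the period boundaries τ1, ..., τJ are a sorted list `taus`, `f` holds the log-odds per period for one individual, and all names are hypothetical.

```python
import math
from bisect import bisect_left


def period_index(t, taus):
    """J(t): j if t lies in (tau_{j-1}, tau_j], and J + 1 if t > tau_J (1-based)."""
    return bisect_left(taus, t) + 1


def label(t, tau_j):
    """y_j(t): +1 if t <= tau_j (defaulted by tau_j), -1 if t > tau_j (survived)."""
    return 1 if t <= tau_j else -1


def likelihood(t, taus, f):
    """Product over j = 1..min(J(t), J) of 1 / (1 + exp(-y_j(t) * f(tau_j)))."""
    last = min(period_index(t, taus), len(taus))
    return math.prod(
        1.0 / (1.0 + math.exp(-label(t, taus[j]) * f[j])) for j in range(last)
    )


taus = [1, 2, 3]                               # illustrative period boundaries
hazards = [0.10, 0.20, 0.50]                   # illustrative hazards
f = [math.log(h / (1 - h)) for h in hazards]   # log-odds of each hazard
print(likelihood(2, taus, f))   # default in period 2: (1 - h1) * h2
print(likelihood(5, taus, f))   # no default by tau_J: product of all (1 - h_j)
```

Each factor equals h(τj) when yj(t) = 1 and 1 − h(τj) when yj(t) = −1, so the product matches h(τj) S(τj−1) for a default in period j.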


Learning objective

For each individual x, f is approximated by a survival tree ensemble

$$f(t; x) \cong \hat f(t; x) := \sum_{k=1}^{K} f_k(t; x)$$

[Figure: two example survival trees with splits on age, sex (male/female), education (low/high), and salary (low/high)]


Learning objective

To minimize the negative log-likelihood

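$$L = \sum_{i=1}^{N} \sum_{j=1}^{J(t_i) \wedge J} \log\left(1 + \exp\left\{-y_j(t_i)\, \hat f(\tau_j; x_i)\right\}\right) + \frac{\lambda}{2} \|w\|^2$$

$$\phantom{L} = \sum_{j=1}^{J} \sum_{i \in N_j} \log\left(1 + \exp\left\{-y_j(t_i)\, \hat f(\tau_j; x_i)\right\}\right) + \frac{\lambda}{2} \|w\|^2$$

where $N_j := \{ i \in \{1, 2, \dots, N\} \mid J(t_i) \ge j \}$ is the set of samples surviving longer than $\tau_{j-1}$.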

Regularization term

punish model complexity

avoid over-fitting

overcome numerical problems
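The regularized objective on this slide can be sketched as a double loop over samples and periods. The code assumes that `scores[i][j]` holds the ensemble output for sample i in period j+1 and that `weights` is a flat list of all leaf weights; both layouts, and every name, are assumptions made for illustration.

```python
import math


def negative_log_likelihood(times, scores, taus, weights, lam):
    """Sum of log(1 + exp(-y_j(t_i) * f_hat(tau_j; x_i))) over i and j <= J(t_i),
    plus (lam / 2) * ||w||^2."""
    loss = 0.0
    for t_i, f_i in zip(times, scores):
        for tau_j, f_ij in zip(taus, f_i):
            y = 1 if t_i <= tau_j else -1
            loss += math.log1p(math.exp(-y * f_ij))
            if y == 1:
                # t_i <= tau_j means j = J(t_i); sample i is not in N_j for later j.
                break
    return loss + 0.5 * lam * sum(w * w for w in weights)


taus = [1, 2, 3]
times = [2, 5]                              # one default in period 2, one survivor
scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # all log-odds zero, i.e. hazard 0.5
print(negative_log_likelihood(times, scores, taus, weights=[1.0, 2.0], lam=0.1))
```

The early `break` is what restricts the inner sum to j ≤ J(ti), which is the same restriction the second form of the objective expresses through the sets Nj.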


Gradient tree boosting

Boosting algorithm:

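At the mth iteration, given $\hat f^{(m-1)}$,

$$\min_f L^{(m)} = \sum_{j,i} \log\left(1 + \exp\left\{-y_j(t_i) \left(\hat f^{(m-1)}(\tau_j; x_i) + f(\tau_j; x_i)\right)\right\}\right) + \frac{\lambda}{2} \|w\|^2 \;\Rightarrow\; f_m$$

update $\hat f^{(m)}(t; x) = \hat f^{(m-1)}(t; x) + f_m(t; x)$

Approximate by Taylor expansion up to the 2nd order

$$L^{(m)}(f) \cong \sum_{j,i} \left( r^{(m-1)}_{i,j}\, f(\tau_j; x_i) + \frac{1}{2}\, \sigma^{(m-1)}_{i,j}\, f^2(\tau_j; x_i) \right) + \frac{\lambda}{2} \|w\|^2$$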
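The slides do not spell out the coefficients r and σ. A natural reading, assumed here, is that they are the first and second derivatives of the per-term loss log(1 + exp(−y f)) evaluated at the previous ensemble output. The sketch below computes them under that assumption; the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess(y, f_prev):
    """First and second derivatives of log(1 + exp(-y * f)) at f = f_prev, y in {-1, +1}."""
    p = sigmoid(f_prev)             # current hazard estimate, since f is the log-odds
    target = 1.0 if y == 1 else 0.0
    r = p - target                  # equals -y / (1 + exp(y * f_prev))
    sigma = p * (1.0 - p)           # strictly positive, so the quadratic is convex
    return r, sigma


# A default (y = +1) that the model currently rates as unlikely gets a large
# negative gradient, pushing the log-odds for that period upwards.
print(grad_hess(+1, -2.0))
print(grad_hess(-1, -2.0))
```

Because σ is always positive, the second-order approximation is convex in each leaf weight, which is what makes a closed-form leaf solution possible.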


Gradient tree boosting

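Survival tree with L nodes: $f(\tau_j; x_i) = \sum_{l=1}^{L} w_l(\tau_j)\, \mathbf{1}(i \in I_l)$

The objective function is strictly convex with optimal solution

$$w^{(m)}_l(\tau_j) = - \frac{\sum_{i \in N_j \cap I_l} r^{(m-1)}_{i,j}}{\sum_{i \in N_j \cap I_l} \sigma^{(m-1)}_{i,j} + \lambda}$$

Split rule: $I = I_L \cup I_R$

$$\tilde L_{\text{split}} = \frac{1}{2} \sum_j \left[ \frac{\left(\sum_{i \in N_j \cap I_L} r^{(m-1)}_{i,j}\right)^2}{\sum_{i \in N_j \cap I_L} \sigma^{(m-1)}_{i,j} + \lambda} + \frac{\left(\sum_{i \in N_j \cap I_R} r^{(m-1)}_{i,j}\right)^2}{\sum_{i \in N_j \cap I_R} \sigma^{(m-1)}_{i,j} + \lambda} - \frac{\left(\sum_{i \in N_j \cap I} r^{(m-1)}_{i,j}\right)^2}{\sum_{i \in N_j \cap I} \sigma^{(m-1)}_{i,j} + \lambda} \right].$$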
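The leaf weight and the split criterion only need per-period sums of r and σ over the samples in a node. The following sketch assumes those sums have already been aggregated into lists of `(r_sum, sigma_sum)` pairs, one pair per period j; this data layout and the function names are illustrative and not taken from the slides.

```python
def leaf_weight(r_sum, sigma_sum, lam):
    """Optimal w_l(tau_j) = -sum(r) / (sum(sigma) + lam) for one leaf and one period."""
    return -r_sum / (sigma_sum + lam)


def score(stats, lam):
    """Sum over periods j of (sum r)^2 / (sum sigma + lam) for one node.

    stats is a list of (r_sum, sigma_sum) pairs, one per period j, already
    restricted to the samples in N_j that fall into the node.
    """
    return sum(r * r / (s + lam) for r, s in stats)


def split_gain(left, right, lam):
    """Loss reduction from splitting I into I_L and I_R."""
    # The parent's sums are the element-wise sums of the two children.
    parent = [(rl + rr, sl + sr) for (rl, sl), (rr, sr) in zip(left, right)]
    return 0.5 * (score(left, lam) + score(right, lam) - score(parent, lam))


left = [(-2.0, 1.0)]    # illustrative sums for a single period
right = [(2.0, 1.0)]
print(leaf_weight(-2.0, 1.0, lam=1.0))
print(split_gain(left, right, lam=1.0))
```

A split whose children have gradients of opposite sign yields a large gain, while a split into two statistically identical children yields none.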


Summary

Log hazard function is approximated by a survival tree ensemble

maximum likelihood as the objective function

boosting algorithm

at each step, a gradient method optimizes the 2nd-order approximation of the objective


Datasets

Installment loans with a 12-month term

Definition of default: the borrower is at least 10 days overdue on any scheduled repayment due date

Early repayments: treated as “repaying on time” for the remaining periods

Training and testing datasets

dataset        time           sample size
training set   January 2018   200,000
testing set    March 2018     120,000

Default rate

$$\text{default rate}(t) = \frac{\#\,\text{default accounts up to month } t}{\#\,\text{total accounts}}$$
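The cumulative default rate defined above is straightforward to compute once each account carries the month of its first default. The sketch below assumes that month is stored as an integer, with `None` for accounts that never defaulted; this encoding and the sample data are illustrative only.

```python
def default_rate(default_months, t):
    """Share of accounts whose first default occurred in month t or earlier.

    default_months holds the month of first default per account,
    or None if the account never defaulted (assumed encoding).
    """
    defaulted = sum(1 for m in default_months if m is not None and m <= t)
    return defaulted / len(default_months)


months = [1, None, 3, None, 12, None, None, 3]  # made-up accounts
for t in (1, 3, 12):
    print(t, default_rate(months, t))
```

By construction the rate is non-decreasing in t, since the denominator is fixed and defaults accumulate.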


Default rates on datasets

[Figure: default rate by month (1 to 12) for the training data and the testing data; vertical axis in multiples of the base rate b, from 0 to 1.2b]


Dataset and preprocessing

Over 400 original attributes are collected

exclude attributes with missing rate higher than 80%

one-hot encoding for categorical attributes

50 features are selected by xgboost

source                    feature
PBC report                income score
                          credit score
                          overdue information of credit cards
personal information      age
                          sex
                          education level
device information        location
third-party rate agency   no. of loans in other lending platforms
                          travel intensity
other information         whether possessing a car
                          application channel
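The filtering and encoding steps listed on this slide can be sketched in plain Python. The slides do not show the actual pipeline, so the row format (a list of dicts with `None` for missing values), the function names, and the sample records are all assumptions; the xgboost-based feature selection is not reproduced here.

```python
def drop_sparse_columns(rows, max_missing_rate=0.8):
    """Keep only columns whose share of missing (None) values is <= max_missing_rate."""
    columns = rows[0].keys()
    keep = [
        c for c in columns
        if sum(row[c] is None for row in rows) / len(rows) <= max_missing_rate
    ]
    return [{c: row[c] for c in keep} for row in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per observed level."""
    levels = sorted({row[column] for row in rows if row[column] is not None})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}={level}"] = int(row[column] == level)
        encoded.append(new_row)
    return encoded


rows = [  # made-up applicants
    {"age": 31, "sex": "male", "car": None},
    {"age": 45, "sex": "female", "car": None},
    {"age": 27, "sex": "female", "car": None},
]
print(one_hot(drop_sparse_columns(rows), "sex"))
```

Here the `car` column is missing for every row, so it is dropped before `sex` is expanded into two indicator columns.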


Convergence

1000 runs with λ = 0.001 and a maximum tree depth of 6

[Figure: loss (0 to 16) versus iterations (0 to 30)]


Performance

[Figure: default rate by survival group (1 to 20), one curve per month (Month 1 to Month 12)]


Comparison with existing models: C-Index

[Figure: C-index by month (1 to 12) for GBST, COX, RSF, and XGB; vertical axis from 0.77 to 0.81]
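The slides do not state how the C-index was computed. For orientation, the sketch below implements the common pairwise (Harrell-style) definition: among pairs where the earlier time is an observed default, count how often the model assigns the higher risk to the earlier defaulter. The function name, the tie handling, and the sample data are assumptions, not the authors' evaluation code.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs ordered correctly by predicted risk.

    A pair (i, j) is comparable when sample i has an observed event and
    times[i] < times[j]. Ties in predicted risk count as half concordant.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored sample cannot be the earlier event of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


times = [1, 2, 3, 4]           # made-up default or censoring months
events = [1, 1, 0, 1]          # 1 = default observed, 0 = censored
risks = [0.9, 0.4, 0.5, 0.1]   # made-up predicted risks
print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, which is the scale against which the plotted range should be read.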


Comparison with existing models: AUC

[Figure: AUC by month (1 to 12) for GBST, COX, RSF, and XGB; vertical axis from 0.77 to 0.81]


Comparison with existing models

[Figure: default rate by survival group (1 to 20) for GBST, COX, RSF, and XGB; vertical axis in multiples of the base rate b, from 0 to 4b]


Conclusion

Propose the gradient boosting survival tree (GBST) model

Confirm the convergence of GBST with a real dataset

GBST outperforms existing survival analysis and machine learning models

Thank you!
