RISK BASED APPROVAL FRAMEWORK - Auto Loans, Dec 2013
May 26, 2015
CONTENTS
● Business Problem
● Methodology & Process
● How does the model get deployed? A 30K feet view
● Where else will the lender use the models?
● Do other industries use this framework too?
● References for reading materials
Intended for Knowledge Sharing only 2
BUSINESS PROBLEM
Risk based Approval/Pricing Framework
How does the business see it?
1 What are the chances of non-repayment*?
2 If it happens, how much money will go bad?
3 How much will I ultimately recover if I repossess and sell off the vehicle?
Note: * Non-repayment is defined as a payment delayed by more than 180 days past its due date.
How do statisticians see it?
[Slide graphic for items 1 to 3 is not recoverable from the text.]
How do analysts see it?
1 Probability of non-repayment (PD)
2 Estimated $ of non-repayment (EAD)
3 Loss Post Recovery (LGD)
HOW IS IT DONE?
The first step is to convert the business problem into an analytical framework (Label & Inputs), followed by:
● Data Preparation
  ● Hypotheses - Important drivers and expected relationship
  ● Data preparation - Missing & Capping Treatment
● Dimensionality Reduction
  ● Bivariate - Type and strength of the relationship
  ● Multivariate - VIF & CI (similar to PCA)
● Modeling & Analysis
  ● Model building on Development Sample - Identification of statistically significant drivers, overall fit & accuracy
● Validation
  ● Model rebuilding on Validation Sample - Stability of drivers, fit of model & accuracy
● Recommendations & Implementation Strategy
  ● Framing of actionable recommendations and impact analysis
HOWEVER, IT SHOULD BE PRECEDED BY SEGMENTATION
Customers need to be bucketed into homogeneous groups to normalize for the inherent variation between different types of customers, products, etc.
[Segmentation grid: rows are Loan Term (1 year, 2 year) crossed with Credit Score Bands (Least, Mid, High Score Range); columns are Low End Models, Mid Range Models, Luxury Brands. Segment numbers shown on the slide: 1, 3, 5 and 4 in the 1 year block; 2, 3 and 4 in the 2 year block. The cell-level placement is not recoverable from the text.]
TRANSLATE INTO ANALYTICAL FRAMEWORK
A model is a mathematical relationship between a "Target/Label" variable and the "Predictor/Input" variables. Here "Non-repayment" is the Target/Label and the application information supplies the Predictors/Input Variables.
Non-repayment = f {application data like Credit Score, %Monthly Payment to Income, etc.}

Predictors/Input Variables: customer info at the time of application
Appl_ID  Crd_sc  %Pymt_Inc
1        750     10%
2        500     70%
3        650     25%

We build models on a historical sample, i.e., one where we have both the application data and what happened with that application later on over the loan term.

Target/Labels: non-repayment info over the loan term
Appl_ID  NP_Flag  When
1        No       -
2        Yes      5th Month
3        No       -

Modeling Data = Predictors/Input Variables + Target/Labels
Appl_ID  Crd_sc  %Pymt_Inc  NP_flag  When
1        750     10%        No       -
2        500     70%        Yes      5th Month
3        650     25%        No       -
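The join of predictors and labels into modeling data can be sketched as follows. This is a minimal illustration, assuming both tables are keyed by Appl_ID; the dictionary layout and the helper name `build_modeling_data` are hypothetical and do not come from the deck.

```python
# Predictors captured at application time, keyed by Appl_ID (values from the slide).
predictors = {
    1: {"Crd_sc": 750, "Pymt_Inc": 0.10},
    2: {"Crd_sc": 500, "Pymt_Inc": 0.70},
    3: {"Crd_sc": 650, "Pymt_Inc": 0.25},
}

# Labels observed later over the loan term, keyed by Appl_ID (values from the slide).
labels = {
    1: {"NP_Flag": 0, "When": None},
    2: {"NP_Flag": 1, "When": 5},  # non-repayment in the 5th month
    3: {"NP_Flag": 0, "When": None},
}


def build_modeling_data(predictors, labels):
    """Join predictors and labels on Appl_ID, keeping applications present in both."""
    rows = []
    for appl_id in sorted(predictors.keys() & labels.keys()):
        row = {"Appl_ID": appl_id}
        row.update(predictors[appl_id])
        row.update(labels[appl_id])
        rows.append(row)
    return rows


modeling_data = build_modeling_data(predictors, labels)
for row in modeling_data:
    print(row)
```

Applications without an observed outcome are dropped by the key intersection, which mirrors the point that models are built only where both application data and later behaviour are known.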
DATA CREATION - PREDICTOR VARIABLES & HYPOTHESES
DATA TYPE | VARIABLES | EXPECTED RELATIONSHIP
BUREAU DATA, Absolute values | Credit Score | -ve
BUREAU DATA, Absolute values | Payment to Income Ratio | +ve
BUREAU DATA, Absolute values | Debt to Income Ratio | +ve
BUREAU DATA, Absolute values | #Inquiries in last qtr, 12 months | +ve
BUREAU DATA, Absolute values | Total Outstanding Loan | +ve
BUREAU DATA, Absolute values | Bankruptcy, Non-repayments, Charge offs, etc. | +ve
BUREAU DATA, Deviations in Slope and Level | Trend, Shocks, etc. | -ve/+ve
LOAN DETAILS, Absolute values | Total Loan Requested | Depends
LOAN DETAILS, Absolute values | Term of the loan | -ve/+ve
LOAN DETAILS, Absolute values | Make/Model/Model Year of the Car | Depends on market demand for the Make/Model
LOAN DETAILS, Absolute values | Past relationship with the Lender | -ve
LOAN DETAILS, Absolute values | New/Used Car | New = -ve
DEMOGRAPHIC DETAILS, Absolute values | Home Owner/Renter, #Dependents, Gender, Marital Status, Age, Occupation, Education, Profession | Depends on the variable
MACROECONOMIC DATA, Absolute values | GDP, Household Savings Ratio, Fuel Prices, Unemployment Rate, Interest Rates, etc. | Depends on the variable
MACROECONOMIC DATA, Deviation | Trend, Shocks, etc. | Depends on the variable
GEO DATA, Absolute values | City, State, Region Cluster, Local Competition Data, Dealership level factors, etc. | Depends on the variable
TRANSACTIONS DATA, Absolute values | Monthly Payments, #Payments made, #Non-repayments, Time to CO, Amount of Non-repayment, Recovery Rate, etc. | Depends on the variable
TRANSACTIONS DATA, Deviation | Trend, Shocks, etc. | Depends on the variable
DATA PREPARATION: CAPPING & MISSING VALUE TREATMENT
Capping treatment is necessary to remove the effect of extreme or nonsensical values that are very different from the rest of the population.
Missing treatment is the imputation of missing values for certain variables, and it is mandatory: if a missing value is left untreated, the entire record is excluded from modeling.

No.  Pyoffflg  Prin0105  Loanamt  Term  Fixed  Agnsttr  Bbctrad  Nummortt  Rvoptbal  Numminq  Numminq3
1    0         2324.9    19900    360   1      21       282      1         282       0        0
2    0         3796.5    22100    240   0      6        6911     1         33978     1        1
3    1         12523.2   42000    360   1      1        36350    .         36732     1        1
4    0         5190.9    21760    349   1      42       885      1         911       0        0
5    1         53.6      18000    360   1      5        8851     1         9506      0        0
6    0         1256.9    15500    360   .      13       409      1         760       0        0
7    0         4403.3    25150    900   1      3        21417    5         23579     3        1
8    0         3137.2    17800    240   1      4        4528     2         5967      1        0
9    0         4256.5    9999999  360   1      9        18179    47        130683    4        1
10   0         6442.4    31200    360   1      34       33177    1         0         2        0

Missing observations are shown as "." (for example, Nummortt in record 3 and Fixed in record 6). Unrealistic values include entries such as Loanamt = 9999999 in record 9.
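A minimal sketch of both treatments on two columns of the sample table. The deck does not say which capping bounds or imputation rule were used, so the 10th/90th percentile caps and the median imputation below are assumptions chosen only for illustration, and the helper names are hypothetical.

```python
from statistics import median


def percentile(values, p):
    """Nearest-rank percentile (p between 1 and 100) of the non-missing values."""
    ordered = sorted(v for v in values if v is not None)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def cap(values, lower, upper):
    """Clamp each non-missing value into [lower, upper]; missing stays missing."""
    return [v if v is None else min(max(v, lower), upper) for v in values]


def impute_median(values):
    """Replace missing values (None) with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


# Loanamt and Nummortt columns from the sample table; None marks a "." entry.
loanamt = [19900, 22100, 42000, 21760, 18000, 15500, 25150, 17800, 9999999, 31200]
nummortt = [1, 1, None, 1, 1, 1, 5, 2, 47, 1]

# Assumed bounds: 10th and 90th percentiles of the observed values.
loanamt_capped = cap(loanamt, percentile(loanamt, 10), percentile(loanamt, 90))
nummortt_filled = impute_median(nummortt)

print(loanamt_capped)
print(nummortt_filled)
```

With only ten records the 90th percentile is a coarse bound; on a real portfolio the cut-offs would normally be set after inspecting the full distribution.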
DIMENSIONALITY REDUCTION: BIVARIATE ANALYSIS
Bivariate analysis explores the nature and degree of the relationship between the independent and dependent variables. It not only helps in finding related predictors and predictor transformations, it also helps in dimensionality reduction.
• Rank Plots: Check whether the predictor variables correlate with the Target variable. Steps:
  • Sort the population by predictor variable values
  • Split into groups with an equal number of observations, generally ten groups or deciles
  • Get the average of the Target variable in each group
  • Check if there is a trend in the average value of the Target variable from the top group to the bottom
[Figure: two rank plots of Avg Target (y-axis) against Predictor Deciles (x-axis). Annotations on the slide: "Dummy = (predictor value<=2)" and "No relationship".]
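The rank-plot steps can be sketched in a few lines. This is an illustrative helper, not the deck's own code; the function name `rank_plot` and the synthetic data are assumptions.

```python
def rank_plot(predictor, target, groups=10):
    """Average target per equal-sized group after sorting by the predictor."""
    pairs = sorted(zip(predictor, target), key=lambda pair: pair[0])
    n = len(pairs)
    averages = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        averages.append(sum(t for _, t in chunk) / len(chunk))
    return averages


# Synthetic example: the target fires only for low predictor values.
predictor = list(range(100))
target = [1 if x < 30 else 0 for x in predictor]

for decile, avg in enumerate(rank_plot(predictor, target), start=1):
    print(decile, avg)
```

A step pattern like the one in this synthetic data suggests replacing the raw predictor with a dummy variable, while a flat profile across groups suggests dropping the predictor.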
DIMENSIONALITY REDUCTION: MULTIVARIATE ANALYSIS
The two metrics predominantly used are the Variance Inflation Factor (VIF) and the Condition Index (CI).
Variance Inflation Factor (VIF)
VIF is obtained by regressing each independent variable, say X, on the remaining independent variables (say x1 and x2) and checking how much of X is explained by these variables: VIF = 1 / (1 - R²), where R² is taken from that regression.
-> Cut-offs used vary from 2 to 10
Condition Index (CI)
The condition index is the square root of the ratio of the highest eigenvalue (λmax) to an individual eigenvalue (λ): CI = sqrt(λmax / λ).
-> Cut-offs used vary from 13 to 30
Very similar to Principal Component Analysis (PCA)
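Both metrics reduce to one-line formulas once the auxiliary R² or the eigenvalues are available. The sketch below assumes those inputs are already computed elsewhere; the function names are hypothetical, and the eigenvalues are the first three from the sample collinearity output in this deck.

```python
import math


def vif(r_squared):
    """Variance inflation factor from the R-square of regressing X on the other predictors."""
    return 1.0 / (1.0 - r_squared)


def condition_indices(eigenvalues):
    """Square root of (largest eigenvalue / each eigenvalue)."""
    lam_max = max(eigenvalues)
    return [math.sqrt(lam_max / lam) for lam in eigenvalues]


print(vif(0.5))   # half of X explained by the other predictors
print(vif(0.9))   # 90% explained, at the upper end of the usual cut-offs
print(condition_indices([8.3631, 1.01345, 0.96895]))
```

The second and third condition indices come out close to the 2.87 and 2.94 reported in the sample output, which is a quick way to confirm the formula.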
GENERALIZED LINEAR MODELS: SAMPLE VIF/CI OUTPUT
The REG Procedure
Model: MODEL1
Dependent Variable: NP_Flag
Number of Observations Read  40162
Number of Observations Used  40162

Analysis of Variance
Source           DF     Sum of Squares  Mean Square  F Value  Pr > F
Model            12     610.91533       50.90961     219.02   <.0001
Error            40149  9332.36401      0.23244
Corrected Total  40161  9943.27934

Root MSE        0.48212   R-Square  0.0614
Dependent Mean  0.5492    Adj R-Sq  0.0612
Coeff Var       87.78642

Parameter Estimates
Variable              DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept             1   1.24953             0.20693         6.04     <.0001    0
Credit_Score          1   -0.000216           0.00028377      -0.76    0.4465    1.0205
%Down_Pymt_to_Loan    1   -0.1166             0.0117          -9.96    <.0001    1.09417
%Mnthly_Pymt_to_Loan  1   0.01966             0.00517         -3.8     0.0001    1.17587

Collinearity Diagnostics
                                     Proportion of Variation
Number  Eigenvalue  Condition Index  Intercept   Credit_Score  %Down_Pymt_to_Loan  %Mnthly_Pymt_to_Loan
1       8.3631      1                0.00000188  0.00000202    0.00002708          0.00057815
2       1.01345     2.87264          8.65E-09    8.73E-09      1.04E-07            5.68E-06
3       0.96895     2.93787          2.42E-11    5.60E-14      1.68E-09            0.0000019
8       0.22138     6.14626          0.00000754  0.00000817    0.00009252          0.00396
9       0.20341     6.41212          0.00001611  0.00001745    0.00020511          0.01911
10      0.05087     12.82208         0.00000322  0.00000279    0.00011988          0.26143
11      0.02578     18.01153         0.00082432  0.00088072    0.00992             0.68574
12      0.00137     78.10783         0.01375     0.01859       0.96941             0.02085
13      0.00007104  343.097          0.98539     0.98048       0.02008             0.00000173
MODELING DETAILS
1 Probability of Non-repayment (PD)
  Business question: What are the chances of Non-repayment?
  Technique: Logistic Model
2 Predict the $ amount at risk of Non-repayment (EAD)
  Business question: If it happens, how much money will go bad?
  Technique: OLS Model
3 Estimate the % of Amount at risk that cannot be recovered** (LGD)
  Business question: How much will I ultimately recover if I repossess and sell off the vehicle?
  Technique: Average by Risk Deciles
SAMPLING
Full applications data from the analysis time window is split into two samples:
● Modeling Sample (50%): model development
● Testing Sample (50%): model validation
The model is additionally validated on data from another time window.
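A 50/50 split of this kind can be sketched as below. The deck does not state how the split was drawn, so simple random assignment with a fixed seed is an assumption, and `split_sample` is a hypothetical helper name.

```python
import random


def split_sample(application_ids, dev_share=0.5, seed=2013):
    """Randomly split application IDs into development and testing samples."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(application_ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_share)
    return shuffled[:cut], shuffled[cut:]


development, testing = split_sample(range(1, 101))
print(len(development), len(testing))
```

The out-of-time validation sample is not produced by this split; it would come from applications booked in a different time window.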
LOGISTIC REGRESSION
What is a Logistic Model?
-> Predicts the log odds (event/non-event)
-> The predictive model is a mathematical relationship between the predictors and the Target
Log (odds) = α + β1X1 + β2X2
SAS procedure: Proc Logistic (with various link functions)
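To turn the log odds into a probability of non-repayment, the logit is inverted: p = 1 / (1 + exp(-log odds)). The sketch below illustrates this with made-up coefficients; the values of α, β1, β2 and the two inputs are hypothetical and are not taken from the sample output.

```python
import math


def log_odds(alpha, betas, xs):
    """Linear predictor: alpha + beta1*x1 + beta2*x2 + ..."""
    return alpha + sum(b * x for b, x in zip(betas, xs))


def probability(z):
    """Invert the logit: convert log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for credit score and payment-to-income ratio.
alpha, betas = 1.0, [-0.005, 2.0]
applicant = [700, 0.25]  # credit score 700, payment-to-income 25%

z = log_odds(alpha, betas, applicant)
print(z, probability(z))
```

A negative coefficient on credit score lowers the log odds as the score rises, which matches the expected "-ve" relationship listed in the hypotheses.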
HOW TO FIND IF A METHOD WORKS?
For Logistic Models, the following metrics are used as performance diagnostics:
• Concordance/Discordance: Overall indicator of the model's prediction accuracy.
  • Pair every "bad" observation with every "good" observation
  • Check the % of pairs where the "bad" guy is given a higher probability than the "good" guy
• Rank Order: A similar test to the one above, but in a more structured format. Steps:
  • Sorting: Sort the population by predicted probability
  • Deciling: Bucket them into ten groups, each having 10% of the population in the sorted order
  • Check the % of Non-repayment guys in each decile
  • Capturing: Ideally the % of bad guys should be highest in the top deciles and lowest in the bottom deciles. The top deciles should capture most of the Non-repayment guys.
• Gains Chart: Graphical representation of capturing by the model and its performance against random bucketing.
• Akaike Information Criterion (AIC): Helps in selecting the most "parsimonious" regression model, i.e., maximum information capture with the least number of predictors.
...apart from the usual checks on signs, statistical significance, and whether the model also holds in the validation samples.
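The concordance check can be sketched by brute force over all bad/good pairs. This is an illustrative implementation with hypothetical function names; it also derives Somers' D and the c statistic from the pair counts, using the standard definitions D = concordant - discordant and c = concordant + 0.5 * tied (as fractions of all pairs).

```python
def concordance(probabilities, labels):
    """Share of (bad, good) pairs that are concordant, discordant and tied."""
    bad = [p for p, y in zip(probabilities, labels) if y == 1]
    good = [p for p, y in zip(probabilities, labels) if y == 0]
    pairs = len(bad) * len(good)
    concordant = sum(1 for b in bad for g in good if b > g)
    discordant = sum(1 for b in bad for g in good if b < g)
    tied = pairs - concordant - discordant
    return concordant / pairs, discordant / pairs, tied / pairs


def somers_d(concordant, discordant):
    return concordant - discordant


def c_statistic(concordant, tied):
    return concordant + 0.5 * tied


# Tiny hypothetical sample: 1 = non-repayment ("bad"), 0 = repaid ("good").
probs = [0.9, 0.8, 0.3, 0.2, 0.3]
flags = [1, 0, 1, 0, 0]

conc, disc, tied = concordance(probs, flags)
print(conc, disc, tied, somers_d(conc, disc), c_statistic(conc, tied))
```

The pairwise loop is quadratic in sample size, so it suits small illustrations only; statistical packages compute the same counts more efficiently.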
SAMPLE MODEL OUTPUT
Type 3 Analysis of Effects
Effect                DF  Wald Chi-Square  Pr > ChiSq
APPLICATION_PRIM_CB_  2   14.5230          0.0007
%Down_Pymt_to_Loan    2   126.6605         <.0001
%Mnthly_Pymt_to_Loan  2   83.5880          <.0001

Analysis of Maximum Likelihood Estimates
Parameter             DF  Development Model Estimate  Validation Model Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept             1   1.1321                      -0.4085                    0.8909          0.2102           0.6466
APPLICATION_PRIM_CB_  1   -0.00349                    -0.00220                   0.00122         3.2494           0.0715
%Down_Pymt_to_Loan    1   -0.3934                     -0.2839                    0.0485          34.2834          <.0001
%Mnthly_Pymt_to_Loan  1   0.1206                      -0.0900                    0.0221          16.5920          <.0001

Odds Ratio Estimates
Effect                outcome  Point Estimate  95% Wald Confidence Limits
APPLICATION_PRIM_CB_  1        0.998           0.995  1.000
%Down_Pymt_to_Loan    1        0.753           0.685  0.828
%Mnthly_Pymt_to_Loan  1        0.914           0.875  0.954

Percent Concordant  65.9        Somers' D  0.338
Percent Discordant  32.1        Gamma      0.345
Percent Tied        2.0         Tau-a      0.074
Pairs               1806529536  c          0.669

The higher the percent concordant, the better the model.
SAMPLE GAINS CHART
[Figure: gains chart of Responders captured (y-axis) against Population (%) (x-axis), with two curves: Model capturing and Random capturing.]
The higher the capturing in the initial deciles, the better the model performance.
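The numbers behind a gains chart can be produced with a short routine: sort by predicted probability, cut into deciles, and accumulate the share of responders captured. The sketch below uses a hypothetical function name and synthetic data in which the model ranks perfectly.

```python
def gains_table(probabilities, labels, groups=10):
    """Cumulative % of responders captured after each group, highest scores first."""
    ranked = sorted(zip(probabilities, labels), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_responders = sum(labels)
    captured = 0
    rows = []
    for g in range(groups):
        chunk = ranked[g * n // groups:(g + 1) * n // groups]
        captured += sum(y for _, y in chunk)
        population_pct = (g + 1) * 100 // groups
        rows.append((population_pct, 100.0 * captured / total_responders))
    return rows


# Synthetic scores: the 20 highest-scored records are the responders.
scores = [i / 100 for i in range(100)]
flags = [1 if i >= 80 else 0 for i in range(100)]

for population_pct, captured_pct in gains_table(scores, flags):
    print(population_pct, captured_pct)
```

Random bucketing would capture roughly the same percentage of responders as the percentage of population covered, which is the diagonal the model curve is compared against.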
OLS MODELS
What is a Linear Model?
-> Predicts the value of the Target variable
-> The predictive model is a mathematical relationship between the predictors and the Target
y = α + β1X1 + β2X2
* These models are developed only on the "bad" population, since including the "good" population would skew the model.
SAS procedure: Proc Reg
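As a minimal sketch of the idea, the block below fits a least-squares line on the "bad" records only. It is simplified to a single predictor so the closed-form solution fits in a few lines (the slide's equation has two); the field names and the numbers are hypothetical.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares for y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical records: loan amount, non-repayment flag, $ amount that went bad.
records = [
    {"loan_amt": 10000, "np_flag": 1, "bad_amt": 4000},
    {"loan_amt": 15000, "np_flag": 0, "bad_amt": 0},
    {"loan_amt": 20000, "np_flag": 1, "bad_amt": 9000},
    {"loan_amt": 25000, "np_flag": 0, "bad_amt": 0},
    {"loan_amt": 30000, "np_flag": 1, "bad_amt": 14000},
]

# Keep only the "bad" population before fitting, as the slide prescribes.
bad = [r for r in records if r["np_flag"] == 1]
alpha, beta = fit_simple_ols([r["loan_amt"] for r in bad], [r["bad_amt"] for r in bad])
print(alpha, beta)
```

Leaving the zero-loss "good" records in the fit would pull the line toward zero, which is the skew the footnote warns about.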
HOW TO FIND IF A METHOD WORKS?
For Linear Models, the following metrics are used as performance diagnostics:
• R-square: Tells how much of the variance in the "Target" variable is captured by the model.
• Error rate (%): Tells how large the error is relative to the actual values of the Target variable.
  Error rate (%) = average of square(actual - predicted) / average of actuals
• Rank Order: Checks if the predicted values correlate with the actual values. Steps:
  • Sorting: Sort the population by predicted values
  • Deciling: Bucket them into ten groups, each having 10% of the population in the sorted order
  • Check the average value of the prediction and the average value of the actuals in each decile
  • Check if both averages decrease gradually from the top group to the bottom
• Akaike Information Criterion (AIC): Helps in selecting the most "parsimonious" regression model, i.e., maximum information capture with the least number of predictors.
...apart from the usual checks on signs, statistical significance, and whether the model also holds in the validation samples.
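The three data-driven diagnostics can be sketched as plain functions. The error rate follows the formula exactly as written on the slide; the function names and the four-record example are hypothetical.

```python
def r_square(actual, predicted):
    """1 - residual sum of squares / total sum of squares."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def error_rate(actual, predicted):
    """Average squared error divided by the average actual, per the slide's formula."""
    n = len(actual)
    mean_sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    return mean_sq_err / (sum(actual) / n)


def rank_order(actual, predicted, groups=10):
    """(avg predicted, avg actual) per group, sorted from highest prediction down."""
    ranked = sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    rows = []
    for g in range(groups):
        chunk = ranked[g * n // groups:(g + 1) * n // groups]
        rows.append((sum(p for p, _ in chunk) / len(chunk),
                     sum(a for _, a in chunk) / len(chunk)))
    return rows


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
print(r_square(actual, predicted))
print(error_rate(actual, predicted))
print(rank_order(actual, predicted, groups=2))
```

A well-ordered model shows both averages falling together from the first group to the last, as in the two-group example here.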
SAMPLE MODEL OUTPUT
The REG Procedure
Model: MODEL1
Dependent Variable: NP_Flag
Number of Observations Read  40162
Number of Observations Used  40162

Analysis of Variance
Source           DF     Sum of Squares  Mean Square  F Value  Pr > F
Model            12     610.91533       50.90961     219.02   <.0001
Error            40149  9332.36401      0.23244
Corrected Total  40161  9943.27934

Root MSE        0.48212   R-Square  0.0614
Dependent Mean  0.5492    Adj R-Sq  0.0612
Coeff Var       87.78642

Parameter Estimates
Variable              DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept             1   1.24953             0.20693         6.04     <.0001    0
Credit_Score          1   -0.000216           0.00028377      -0.76    0.4465    1.0205
%Down_Pymt_to_Loan    1   -0.1166             0.0117          -9.96    <.0001    1.09417
%Mnthly_Pymt_to_Loan  1   0.01966             0.00517         -3.8     0.0001    1.17587
GENERALIZED LINEAR MODELS: SAMPLE RANK ORDERING FOR LINEAR MODELS
[Figure: rank-ordering chart of Avg Value in a Decile (y-axis) against Decile 1 to 10 (x-axis), with two series: Avg Predicted in this Decile and Avg Actual in this Decile.]
LOSS POST RECOVERY - NON RECOVERY RATE (%): SAMPLE CALCULATION BY DECILES
Deciles of $ Model  Avg Non Recovery Rate (%)
1                   50%
2                   47%
3                   40%
4                   38%
5                   37%
6                   27%
7                   15%
8                   12%
9                   10%
10                  5%
RECOMMENDATIONS
Based on simulations and business needs, score buckets are created with ranges for High, Mid, and Low Risk:
● HIGH RISK: Decline, or price at a premium to recover the maximum amount before Non-repayment.
● MID RISK: Approve but charge high interest at the beginning, which can then be negotiated down to a floor value.
● LOW RISK: Approve, and proactively offer interest rate reductions/cross-sell efforts with the aim of making them come back.
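The bucketing itself is a simple threshold rule on the model score. In the sketch below the cut-offs (5% and 20% probability of non-repayment) are hypothetical placeholders; in practice they would come out of the simulations and business needs mentioned above.

```python
# Recommended action per bucket, paraphrased from the slide.
ACTIONS = {
    "HIGH RISK": "Decline or price at a premium",
    "MID RISK": "Approve with high initial interest, negotiable to a floor value",
    "LOW RISK": "Approve; proactive rate reduction and cross-sell",
}


def risk_bucket(pd_score, low_cutoff=0.05, high_cutoff=0.20):
    """Map a predicted probability of non-repayment to a risk bucket."""
    if pd_score >= high_cutoff:
        return "HIGH RISK"
    if pd_score >= low_cutoff:
        return "MID RISK"
    return "LOW RISK"


for score in (0.01, 0.10, 0.30):
    bucket = risk_bucket(score)
    print(score, bucket, "->", ACTIONS[bucket])
```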
DEPLOYMENT AT A 30K FEET LEVEL
Typical steps:
• At the dealership level: negative-list verification using driving license details.
• The finance officer at the dealer then enters all PII, including the Social Security number, into the "Approval" system; the engine runs the model with the bureau data and other model details.
• The system recommends a decision (yes/no) and a guidance price, which can then be negotiated with the credit executive based on the scenarios and the sales/risk guidance he has.
OTHER APPLICATIONS OF THE MODEL
Some other areas within the institution where the model outputs are leveraged:
• Portfolio P&L estimation
Net Income from this business = Sum (all Monthly Payments) - (Probability of Non-repayment * Estimated $ of Non-repayment * Loss Post Recovery)
* In the accounting world:
1. Monthly Payments figures are "discounted" for inflation over the loan time window
2. The net income is then compared against the returns the firm would have earned had it invested the same amount at US Government Treasury rates, to justify running this business
• Regulatory risk reporting: BASEL norms
• Customer bucketing for Upselling/Cross-selling/Retention programs
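The P&L formula is a direct product of the three model outputs. The sketch below applies it to one hypothetical loan; the payment schedule, PD and EAD values are made up, the LGD comes from the sample non-recovery table in this deck, and discounting is deliberately left out.

```python
# Average non-recovery rate by decile, from the sample table in this deck.
NON_RECOVERY_BY_DECILE = {1: 0.50, 2: 0.47, 3: 0.40, 4: 0.38, 5: 0.37,
                          6: 0.27, 7: 0.15, 8: 0.12, 9: 0.10, 10: 0.05}


def expected_loss(pd, ead, lgd):
    """Probability of non-repayment * $ at risk * share that is not recovered."""
    return pd * ead * lgd


def net_income(monthly_payments, pd, ead, lgd):
    """Sum of all monthly payments minus the expected loss (no discounting)."""
    return sum(monthly_payments) - expected_loss(pd, ead, lgd)


# Hypothetical loan: 12 payments of $500, PD of 10%, $10,000 at risk, decile 3.
payments = [500.0] * 12
lgd = NON_RECOVERY_BY_DECILE[3]

print(expected_loss(0.10, 10000.0, lgd))
print(net_income(payments, 0.10, 10000.0, lgd))
```

Summing this figure over all loans gives the portfolio view; the discounting and Treasury-rate comparison described in the note would be layered on top.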
SIMILAR FRAMEWORK IN OTHER INDUSTRIES
A similar framework is used in other industries to solve various business problems:
• Marketing Campaigns: e.g., find out which customers are more likely to respond to campaigns and, if they do, how much $ they would spend with us
  • How many will use Friend finder on Facebook and, if they do, how many invites will they send?
  • How many will see the promoted news feed? How many times will they re-share it?
• Loyalty Models (ecommerce): e.g., will a customer get engaged (repeat purchases) and, if he does, how much $ will he spend with us
• Attrition Models (Telecom): e.g., are we going to lose a customer and, if yes, how much revenue impact will it have
APPENDIX
GOOD INFO ON LINEAR & LOGISTIC REGRESSION AT…
Linear Regression: http://faculty.chass.ncsu.edu/garson/PA765/regress.htm
Logistic Regression: http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm