Risk Based Loan Approval Framework

RISK BASED APPROVAL FRAMEWORK-Auto Loans

Dec 2013

CONTENTS

Business Problem

Methodology & Process

How does the model get deployed - 30K feet view

Where else will the lender use the models?

Do other industries use this framework too?

References for reading materials

Intended for Knowledge Sharing only 2




BUSINESS PROBLEM

Risk based Approval/Pricing Framework

How Does the Business See It?

1 What are the chances of non-repayment?

2 If it happens, how much money will go bad?

3 How much will I ultimately recover if I repossess and sell off the vehicle?

Note: Non-repayment is defined as payments delayed by over 180 days since the due date.


How Do Statisticians See It?

[Figure: three numbered panels, 1 to 3]


How Do Analysts See It?

1 Probability of non-repayment (PD)

2 Estimated $ of non-repayment (EAD)

3 Loss post recovery (LGD)


HOW IS IT DONE?

The first step is to convert the business problem into an analytical framework (label & inputs), followed by:

Data Preparation

● Hypotheses - Important drivers and expected relationship

● Data preparation - Missing & capping treatment

Dimensionality Reduction

● Bivariate - Type and strength of the relationship

● Multivariate - VIF & CI (similar to PCA)

Modeling & Analysis

● Model building on the development sample - Identification of statistically significant drivers, overall fit & accuracy

Validation

● Model rebuilding on the validation sample - Stability of drivers, fit of model & accuracy

Recommendations & Implementation (Strategy)

● Framing of actionable recommendations and impact analysis



HOWEVER, IT SHOULD BE PRECEDED BY SEGMENTATION

Customers need to be bucketed into homogeneous groups, to normalize for inherent variation between various types of customers, products, etc.

Segmentation grid:

Rows: Loan Term (1 year, 2 year), each split by Credit Score Bands (Least Score Range, Mid Score Range, High Score Range)

Columns: Low End Models, Mid Range Models, Luxury Brands

Segment labels as extracted from the grid cells: 13, 5, 4 (1 year); 23, 4 (2 year)



TRANSLATE INTO ANALYTICAL FRAMEWORK

A model is a mathematical relationship between a “Target/Label” variable and the “Predictor/Input” variables. Here “Non-repayment” is the “Target/Label” and the application information items are the “Predictors/Input Variables”.

Non-repayment = f {application data like Credit Score, %Monthly Payment to Income, etc.}

Predictors/Input Variables: customer info at the time of application

Appl_ID  Crd_sc  %Pymt_Inc
1        750     10%
2        500     70%
3        650     25%

We build models on a historical sample, i.e., where we have both the application data and what happened with that application later on over the loan term.

Target/Labels: non-repayment info over the loan term

Appl_ID  NP_Flag  When
1        No       -
2        Yes      5th Month
3        No       -

Modeling Data: Predictors/Input Variables + Target/Labels

Appl_ID  Crd_sc  %Pymt_Inc  NP_flag  When
1        750     10%        No       -
2        500     70%        Yes      5th Month
3        650     25%        No       -
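As a minimal sketch of how the modeling data above could be assembled, the snippet below joins the application-time predictors to the observed outcomes on Appl_ID. The function name `build_modeling_data` and the key `Pct_Pymt_Inc` are hypothetical names chosen for illustration; the values are the three rows from the example tables.

```python
# Predictors captured at application time (toy values from the tables above).
applications = [
    {"Appl_ID": 1, "Crd_sc": 750, "Pct_Pymt_Inc": 0.10},
    {"Appl_ID": 2, "Crd_sc": 500, "Pct_Pymt_Inc": 0.70},
    {"Appl_ID": 3, "Crd_sc": 650, "Pct_Pymt_Inc": 0.25},
]

# Outcomes observed later, over the loan term.
outcomes = [
    {"Appl_ID": 1, "NP_Flag": "No", "When": None},
    {"Appl_ID": 2, "NP_Flag": "Yes", "When": "5th Month"},
    {"Appl_ID": 3, "NP_Flag": "No", "When": None},
]


def build_modeling_data(applications, outcomes):
    """Join predictors and target on Appl_ID (hypothetical helper)."""
    outcome_by_id = {row["Appl_ID"]: row for row in outcomes}
    modeling = []
    for app in applications:
        outcome = outcome_by_id.get(app["Appl_ID"])
        if outcome is None:
            # No observed outcome: the record cannot be used for model building.
            continue
        modeling.append({**app, "NP_Flag": outcome["NP_Flag"], "When": outcome["When"]})
    return modeling


modeling_data = build_modeling_data(applications, outcomes)
for row in modeling_data:
    print(row)
```

Applications without an observed outcome are dropped here because a historical sample needs both sides of the relationship; a production pipeline may treat them differently.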



DATA CREATION - PREDICTOR VARIABLES & HYPOTHESES

Each entry reads: VARIABLES: EXPECTED RELATIONSHIP (with non-repayment), grouped by DATA TYPE.

BUREAU DATA
  Absolute values
    • Credit Score: -ve
    • Payment to Income Ratio: +ve
    • Debt to Income Ratio: +ve
    • #Inquiries in last qtr, 12 months: +ve
    • Total Outstanding Loan: +ve
    • Bankruptcy, Non-repayments, Charge offs, etc.: +ve
  Deviations in Slope and Level
    • Trend, Shocks, etc.: -ve/+ve

LOAN DETAILS
  Absolute values
    • Total Loan Requested: Depends
    • Term of the loan: -ve/+ve
    • Make/Model/Model Year of the Car: Depends on market demand for the Make/Model
    • Past relationship with the Lender: -ve
    • New/Used Car: New = -ve

DEMOGRAPHIC DETAILS
  Absolute values
    • Home Owner/Renter, #Dependents, Gender, Marital Status, Age, Occupation, Education, Profession: Depends on the variable

MACROECONOMIC DATA
  Absolute values
    • GDP, Household Savings Ratio, Fuel Prices, Unemployment Rate, Interest Rates, etc.: Depends on the variable
  Deviation
    • Trend, Shocks, etc.: Depends on the variable

GEO DATA
  Absolute values
    • City, State, Region Cluster, Local Competition Data, Dealership level factors, etc.: Depends on the variable

TRANSACTIONS DATA
  Absolute values
    • Monthly Payments, #Payments made, #Non-repayments, Time to CO, Amount of Non-repayment, Recovery Rate, etc.: Depends on the variable
  Deviation
    • Trend, Shocks, etc.: Depends on the variable











DATA PREPARATION: CAPPING & MISSING VALUE TREATMENT

Capping treatment is necessary to remove the effect of extreme or nonsensical values that are very different from the rest of the population.

No.  Pyoffflg  Prin0105  Loanamt  Term  Fixed  Agnsttr  Bbctrad  Nummortt  Rvoptbal  Numminq  Numminq3
1    0         2324.9    19900    360   1      21       282      1         282       0        0
2    0         3796.5    22100    240   0      6        6911     1         33978     1        1
3    1         12523.2   42000    360   1      1        36350    .         36732     1        1
4    0         5190.9    21760    349   1      42       885      1         911       0        0
5    1         53.6      18000    360   1      5        8851     1         9506      0        0
6    0         1256.9    15500    360   .      13       409      1         760       0        0
7    0         4403.3    25150    900   1      3        21417    5         23579     3        1
8    0         3137.2    17800    240   1      4        4528     2         5967      1        0
9    0         4256.5    9999999  360   1      9        18179    47        130683    4        1
10   0         6442.4    31200    360   1      34       33177    1         0         2        0

Missing observations: shown as "." (Nummortt in row 3, Fixed in row 6)

Unrealistic values: e.g., Loanamt of 9999999 in row 9 and Term of 900 in row 7

Missing value treatment is the imputation of missing values for certain variables, and it is mandatory: if a missing value is left unattended, the entire record is excluded from modeling.
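A minimal sketch of both treatments, applied to the Loanamt and Nummortt columns of the table above. The slides do not state which cut-offs or imputation rule were used, so the 10th/90th percentile caps and the median fill below are assumptions, and all function names are hypothetical.

```python
from statistics import median


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def cap(values, lower, upper):
    """Pull values outside [lower, upper] back to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Columns taken from the sample table; None stands for the "." entries.
loanamt = [19900, 22100, 42000, 21760, 18000, 15500, 25150, 17800, 9999999, 31200]
nummortt = [1, 1, None, 1, 1, 1, 5, 2, 47, 1]

low = nearest_rank_percentile(loanamt, 10)   # assumed lower cap
high = nearest_rank_percentile(loanamt, 90)  # assumed upper cap
loanamt_capped = cap(loanamt, low, high)
nummortt_filled = impute_median(nummortt)

print(loanamt_capped)
print(nummortt_filled)
```

With only ten rows the percentile caps coincide with the smallest and largest plausible loan amounts; on a real sample the cut-offs would normally be chosen after inspecting the distribution.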










DIMENSIONALITY REDUCTION: BIVARIATE ANALYSIS

Bivariate analysis explores the nature and degree of the relationship between the independent and dependent variables. It not only helps in finding related predictors and predictor transformations, it also helps in dimensionality reduction.

• Rank Plots: Check if the predictor variables correlate with the Target variable. Steps:

  • Sort the population by predictor variable values
  • Split into groups with an equal number of observations, generally ten groups or deciles
  • Get the average of the Target variable in each group
  • Check if there is a trend in the average value of the Target variable from the top group to the bottom

[Figure: two rank plots of Avg Target (y-axis) against Predictor Deciles (x-axis). Annotations: "Dummy = (predictor value<=2)" and "No relationship".]
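The rank plot steps above can be sketched as follows. This is an illustrative implementation only; `decile_averages` and `is_monotonic_decreasing` are hypothetical names, and the toy data (20 applicants, with the six lowest scores all non-repaying) is made up to show a clean trend.

```python
def decile_averages(predictor, target, n_groups=10):
    """Average target per group after sorting by the predictor."""
    pairs = sorted(zip(predictor, target), key=lambda pair: pair[0])
    n = len(pairs)
    averages = []
    for g in range(n_groups):
        # Integer boundaries keep the groups as equal in size as possible.
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if group:
            averages.append(sum(t for _, t in group) / len(group))
    return averages


def is_monotonic_decreasing(values):
    """True when each group average is no higher than the previous one."""
    return all(a >= b for a, b in zip(values, values[1:]))


# Toy data: credit-score-like predictor, non-repayment flag as target.
predictor = [600 + 10 * i for i in range(20)]
target = [1] * 6 + [0] * 14

averages = decile_averages(predictor, target)
print(averages)
print("Trend present:", is_monotonic_decreasing(averages))
```

A flat set of averages across the groups would correspond to the "No relationship" case, and a sharp step at one point suggests replacing the predictor with a dummy variable at that threshold.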



DIMENSIONALITY REDUCTION: MULTIVARIATE ANALYSIS

Two metrics that are predominantly used are the Variance Inflation Factor (VIF) and the Condition Index (CI).

Variance Inflation Factor (VIF)

VIF is obtained by regressing each independent variable, say X, on the remaining independent variables (say x1 and x2) and checking how much of X is explained by these variables: VIF = 1 / (1 - R²), where R² comes from that regression.

->Cut-offs used vary from 2 to 10

Condition Index (CI)

The Condition Index is the square root of the ratio of the highest eigenvalue (λmax) to the individual eigenvalue (λ): CI = sqrt(λmax / λ).

->Cut-offs used vary from 13 to 30

Very similar to Principal Component Analysis (PCA)
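Both definitions are short enough to sketch directly. To stay dependency-free, the VIF part below covers only the simplest case of one other predictor, where R² is the squared Pearson correlation; with several predictors R² would come from a multiple regression. The function names are hypothetical, and the three eigenvalues are the first three rows of the sample SAS collinearity output in this deck.

```python
import math


def r_squared_single(x, z):
    """R² from regressing x on one other predictor z (squared correlation)."""
    n = len(x)
    mean_x, mean_z = sum(x) / n, sum(z) / n
    sxz = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z))
    sxx = sum((a - mean_x) ** 2 for a in x)
    szz = sum((b - mean_z) ** 2 for b in z)
    return sxz ** 2 / (sxx * szz)


def vif(r_squared):
    """Variance Inflation Factor: 1 / (1 - R²)."""
    return 1.0 / (1.0 - r_squared)


def condition_indices(eigenvalues):
    """Condition Index per eigenvalue: sqrt(lambda_max / lambda)."""
    lam_max = max(eigenvalues)
    return [math.sqrt(lam_max / lam) for lam in eigenvalues]


# Two uncorrelated toy predictors give the minimum possible VIF of 1.
x = [1, 2, 3, 4]
z = [1, -1, -1, 1]
vif_uncorrelated = vif(r_squared_single(x, z))

# Eigenvalues copied from the first three rows of the sample output.
indices = condition_indices([8.3631, 1.01345, 0.96895])

print(vif_uncorrelated)
print([round(ci, 3) for ci in indices])
```

The computed indices agree with the Condition Index column of the sample output to about three decimal places.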


GENERALIZED LINEAR MODELS: SAMPLE VIF/CI OUTPUT

The REG Procedure
Model: MODEL1
Dependent Variable: NP_Flag

Number of Observations Read  40162
Number of Observations Used  40162

Analysis of Variance

Source           DF     Sum of Squares  Mean Square  F Value  Pr > F
Model            12     610.91533       50.90961     219.02   <.0001
Error            40149  9332.36401      0.23244
Corrected Total  40161  9943.27934

Root MSE        0.48212   R-Square  0.0614
Dependent Mean  0.5492    Adj R-Sq  0.0612
Coeff Var       87.78642

Parameter Estimates

Variable              DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept             1   1.24953             0.20693         6.04     <.0001    0
Credit_Score          1   -0.000216           0.00028377      -0.76    0.4465    1.0205
%Down_Pymt_to_Loan    1   -0.1166             0.0117          -9.96    <.0001    1.09417
%Mnthly_Pymt_to_Loan  1   0.01966             0.00517         -3.8     0.0001    1.17587

Collinearity Diagnostics (Proportion of Variation in the last four columns)

Number  Eigenvalue  Condition Index  Intercept   Credit_Score  %Down_Pymt_to_Loan  %Mnthly_Pymt_to_Loan
1       8.3631      1                0.00000188  0.00000202    0.00002708          0.00057815
2       1.01345     2.87264          8.65E-09    8.73E-09      1.04E-07            5.68E-06
3       0.96895     2.93787          2.42E-11    5.60E-14      1.68E-09            0.0000019
...
8       0.22138     6.14626          0.00000754  0.00000817    0.00009252          0.00396
9       0.20341     6.41212          0.00001611  0.00001745    0.00020511          0.01911
10      0.05087     12.82208         0.00000322  0.00000279    0.00011988          0.26143
11      0.02578     18.01153         0.00082432  0.00088072    0.00992             0.68574
12      0.00137     78.10783         0.01375     0.01859       0.96941             0.02085
13      0.00007104  343.097          0.98539     0.98048       0.02008             0.00000173











MODELING DETAILS

1 What are the chances of Non-repayment?
  Quantity: Probability of Non-repayment (PD)
  Technique: Logistic Model

2 If it happens, how much money will go bad?
  Quantity: Predict the $ amount at risk of Non-repayment (EAD)
  Technique: OLS Model

3 How much will I ultimately recover if I repossess and sell off the vehicle?
  Quantity: Estimate the % of the amount at risk that cannot be recovered (LGD)
  Technique: Average by Risk Deciles



SAMPLING

Full applications data from the analysis time window is split into two samples:

• Modeling Sample (50%): model development

• Testing Sample (50%): model validation

In addition, the model is validated on data from another time window.
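A minimal sketch of the 50/50 split, assuming a simple random split by application ID. The slides do not say how the split was drawn, so the random shuffle, the seed value and the name `split_sample` are assumptions for illustration.

```python
import random


def split_sample(application_ids, modeling_share=0.5, seed=2013):
    """Randomly split IDs into a modeling sample and a testing sample."""
    ids = list(application_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    cut = int(len(ids) * modeling_share)
    return ids[:cut], ids[cut:]


modeling_ids, testing_ids = split_sample(range(1, 101))
print(len(modeling_ids), len(testing_ids))
```

The out-of-time validation sample is not produced by this split; it would be drawn separately from applications in a different time window.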










LOGISTIC REGRESSION

What is a Logistic Model?

->Predicts the log odds (event/non-event)

->The predictive model is a mathematical relationship between the predictors and the Target

Log (odds) = α + β1X1 + β2X2

SAS procedure: Proc Logistic (with various link functions)
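To show how the log-odds equation turns into a probability of non-repayment, here is a small scoring sketch. The coefficients and input values are hypothetical and chosen only for illustration; they are not the fitted estimates from any model in this deck.

```python
import math


def probability_from_log_odds(log_odds):
    """Invert the logit link: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def predict_pd(alpha, betas, x):
    """Log(odds) = alpha + beta1*X1 + beta2*X2 + ..., then convert to p."""
    log_odds = alpha + sum(b * v for b, v in zip(betas, x))
    return probability_from_log_odds(log_odds)


# Hypothetical coefficients: X1 = credit score, X2 = payment-to-income ratio.
pd_estimate = predict_pd(alpha=-1.0, betas=[-0.004, 2.0], x=[700, 0.25])
print(round(pd_estimate, 4))
```

A negative coefficient on the credit score and a positive one on the payment-to-income ratio match the expected relationships listed in the hypotheses section.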



HOW TO FIND IF A METHOD WORKS?

For Logistic Models, the following metrics are used as performance diagnostics, apart from the usual checks on signs, statistical significance, and whether the model also holds in the validation samples:

• Concordance/Discordance: Overall indicator of the model's prediction accuracy

  • Pair every “bad” (non-repayment) observation with every “good” observation
  • Check the % of pairs where the “bad” guy is given a higher probability than the “good” guy

• Rank Order: A similar test to the above, but in a more structured format. Steps:

  • Sorting: Sort the population by predicted probability
  • Deciling: Bucket them into ten groups, each having 10% of the population in the sorted order
  • Check the % of Non-repayment guys in each decile
  • Capturing: Ideally the % of bad guys should be highest in the top deciles and lowest in the bottom deciles. The top deciles should capture most of the Non-repayment guys.

• Gains Chart: Graphical representation of capturing by the model and its performance against random bucketing.

• Akaike Information Criterion (AIC): Helps in selecting the most “parsimonious” regression model, i.e., maximum information capture with the least number of predictors.
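The concordance check can be sketched with a brute-force comparison of every bad/good pair. This is an illustrative implementation with hypothetical names and toy data; the `c` statistic counts tied pairs as half, which is the usual convention and is consistent with the sample output in this deck (65.9% concordant plus half of 2.0% tied gives c = 0.669).

```python
def concordance(scores, labels):
    """Compare predicted scores across all (bad, good) pairs; label 1 = bad."""
    bads = [s for s, y in zip(scores, labels) if y == 1]
    goods = [s for s, y in zip(scores, labels) if y == 0]
    concordant = discordant = tied = 0
    for bad_score in bads:
        for good_score in goods:
            if bad_score > good_score:
                concordant += 1
            elif bad_score < good_score:
                discordant += 1
            else:
                tied += 1
    pairs = len(bads) * len(goods)
    return {
        "pairs": pairs,
        "pct_concordant": 100.0 * concordant / pairs,
        "pct_discordant": 100.0 * discordant / pairs,
        "pct_tied": 100.0 * tied / pairs,
        "c": (concordant + 0.5 * tied) / pairs,
    }


# Toy example: two bads, three goods, one tie.
result = concordance(scores=[0.9, 0.8, 0.4, 0.4, 0.2], labels=[1, 0, 1, 0, 0])
print(result)
```

The double loop is quadratic in sample size, so it is only meant to make the definition concrete; statistical packages compute the same quantities far more efficiently.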



SAMPLE MODEL OUTPUT

Type 3 Analysis of Effects

Effect                DF  Wald Chi-Square  Pr > ChiSq
APPLICATION_PRIM_CB_  2   14.5230          0.0007
%Down_Pymt_to_Loan    2   126.6605         <.0001
%Mnthly_Pymt_to_Loan  2   83.5880          <.0001

Analysis of Maximum Likelihood Estimates

Parameter             DF  Development Model Estimate  Validation Model Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept             1   1.1321                      -0.4085                    0.8909          0.2102           0.6466
APPLICATION_PRIM_CB_  1   -0.00349                    -0.00220                   0.00122         3.2494           0.0715
%Down_Pymt_to_Loan    1   -0.3934                     -0.2839                    0.0485          34.2834          <.0001
%Mnthly_Pymt_to_Loan  1   0.1206                      -0.0900                    0.0221          16.5920          <.0001

Odds Ratio Estimates

Effect                outcome  Point Estimate  95% Wald Confidence Limits
APPLICATION_PRIM_CB_  1        0.998           0.995  1.000
%Down_Pymt_to_Loan    1        0.753           0.685  0.828
%Mnthly_Pymt_to_Loan  1        0.914           0.875  0.954

Association of Predicted Probabilities and Observed Responses

Percent Concordant  65.9        Somers' D  0.338
Percent Discordant  32.1        Gamma      0.345
Percent Tied        2.0         Tau-a      0.074
Pairs               1806529536  c          0.669

The higher the percent concordant, the better the model.



SAMPLE GAINS CHART

[Figure: gains chart. x-axis: Population (%); y-axis: Responders captured; series: Model capturing, Random capturing.]

The higher the capturing in the initial deciles, the better the model performance.
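The numbers behind a gains chart can be sketched as cumulative capture rates by decile. The function name and the ten-row toy data are hypothetical; the random baseline is simply the cumulative share of the population.

```python
def cumulative_gains(scores, labels, n_groups=10):
    """Cumulative % of bads captured after each group, highest scores first."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)]
    total_bad = sum(ranked)
    n = len(ranked)
    gains = []
    for g in range(1, n_groups + 1):
        captured = sum(ranked[: g * n // n_groups])
        gains.append(100.0 * captured / total_bad)
    return gains


# Toy data: ten applicants sorted by score, three of them bad (label 1).
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

model_capture = cumulative_gains(scores, labels)
random_capture = [10.0 * g for g in range(1, 11)]  # baseline: no model

for decile, (m, r) in enumerate(zip(model_capture, random_capture), start=1):
    print(decile, round(m, 1), r)
```

In this toy case the model captures two thirds of the bads in the top 20% of the population, against 20% for random bucketing.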










OLS MODELS

What is a Linear Model?

->Predicts the value of the Target variable

->The predictive model is a mathematical relationship between the predictors and the Target

y = α + β1X1 + β2X2

* These models are developed only on the “bad” population, since including the “good” population will skew the model.

SAS procedure: Proc Reg



HOW TO FIND IF A METHOD WORKS?

For Linear Models, the following metrics are used as performance diagnostics, apart from the usual checks on signs, statistical significance, and whether the model also holds in the validation samples:

• R-square: Tells how much of the variance in the “Target” variable is captured by the model.

• Error rate (%): Tells what the error is relative to the actual values of the Target variable. Error rate (%) = average of square(actual - predicted) / average of actuals

• Rank Order: Checks if the predicted values correlate with the actual values. Steps:

  • Sorting: Sort the population by predicted values
  • Deciling: Bucket them into ten groups, each having 10% of the population in the sorted order
  • Check the average value of the predictions and the average value of the actuals in each decile
  • Check if both averages are gradually decreasing from the top group to the bottom

• Akaike Information Criterion (AIC): Helps in selecting the most “parsimonious” regression model, i.e., maximum information capture with the least number of predictors.
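The first two diagnostics can be sketched as below. The error rate follows the formula on the slide literally (mean squared error divided by the mean of the actuals, expressed as a percentage), which is an assumption about how the formula is meant to be read; the function names and the four data points are hypothetical.

```python
def r_square(actual, predicted):
    """Share of the variance in the target captured by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


def error_rate_pct(actual, predicted):
    """average of square(actual - predicted) / average of actuals, in %."""
    mean_sq_error = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    mean_actual = sum(actual) / len(actual)
    return 100.0 * mean_sq_error / mean_actual


# Toy values for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

r2 = r_square(actual, predicted)
err = error_rate_pct(actual, predicted)
print(round(r2, 3), round(err, 1))
```

Here the squared errors sum to 26 against a total sum of squares of 500, so R-square is 0.948, and the error rate works out to 26.0 under the slide's formula.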



SAMPLE MODEL OUTPUT

The REG Procedure
Model: MODEL1
Dependent Variable: NP_Flag

Number of Observations Read  40162
Number of Observations Used  40162

Analysis of Variance

Source           DF     Sum of Squares  Mean Square  F Value  Pr > F
Model            12     610.91533       50.90961     219.02   <.0001
Error            40149  9332.36401      0.23244
Corrected Total  40161  9943.27934

Root MSE        0.48212   R-Square  0.0614
Dependent Mean  0.5492    Adj R-Sq  0.0612
Coeff Var       87.78642

Parameter Estimates

Variable              DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept             1   1.24953             0.20693         6.04     <.0001    0
Credit_Score          1   -0.000216           0.00028377      -0.76    0.4465    1.0205
%Down_Pymt_to_Loan    1   -0.1166             0.0117          -9.96    <.0001    1.09417
%Mnthly_Pymt_to_Loan  1   0.01966             0.00517         -3.8     0.0001    1.17587



GENERALIZED LINEAR MODELS: SAMPLE RANK ORDERING FOR LINEAR MODELS

[Figure: "Rankordering" chart. x-axis: Decile (1 to 10); y-axis: Avg Value in a Decile; series: Avg Predicted in this Decile, Avg Actual in this Decile.]










LOSS POST RECOVERY - NON RECOVERY RATE (%): SAMPLE CALCULATION BY DECILES

Deciles of $ Model  Avg Non Recovery Rate (%)
1                   50%
2                   47%
3                   40%
4                   38%
5                   37%
6                   27%
7                   15%
8                   12%
9                   10%
10                  5%
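A table like the one above can be produced by averaging the observed non-recovery rate within each decile of the $ model. The sketch below assumes each defaulted loan already carries its decile and its observed rate; the record layout, the function name and the sample values are hypothetical.

```python
from collections import defaultdict


def average_rate_by_decile(records):
    """records: iterable of (decile, non_recovery_rate) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for decile, rate in records:
        totals[decile] += rate
        counts[decile] += 1
    return {decile: totals[decile] / counts[decile] for decile in sorted(totals)}


# Hypothetical loan-level observations: (decile of $ model, non-recovery rate).
records = [(1, 0.55), (1, 0.45), (2, 0.47), (10, 0.04), (10, 0.06)]

lgd_by_decile = average_rate_by_decile(records)
print(lgd_by_decile)
```

Once the lookup exists, a new application is assigned the average rate of the decile its predicted $ amount falls into.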





RECOMMENDATIONS

Based on simulations/business needs, score buckets are created with ranges for High/Mid/Low Risk:

HIGH RISK: Decline, or price at a premium to recover the maximum amount before Non-repayment

MID RISK: Approve but charge high interest at the beginning, which can then be negotiated to a floor value

LOW RISK: Approve, with proactive interest rate reduction/cross-sell efforts aimed at making them come back.
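As a sketch of how score buckets might drive the recommendation, the snippet below maps a predicted probability of non-repayment to a risk bucket and its action. The cut-offs of 5% and 20% are purely hypothetical; in practice they would come from the simulations and business needs mentioned above.

```python
# Actions paraphrased from the recommendations above.
ACTIONS = {
    "HIGH RISK": "Decline or price at a premium",
    "MID RISK": "Approve at a high initial rate, negotiable down to a floor",
    "LOW RISK": "Approve; proactive rate reduction / cross-sell",
}


def risk_bucket(pd_estimate, low_cutoff=0.05, high_cutoff=0.20):
    """Assign a bucket from the predicted PD (cut-offs are assumptions)."""
    if pd_estimate >= high_cutoff:
        return "HIGH RISK"
    if pd_estimate >= low_cutoff:
        return "MID RISK"
    return "LOW RISK"


for pd_estimate in (0.02, 0.10, 0.35):
    bucket = risk_bucket(pd_estimate)
    print(pd_estimate, bucket, "->", ACTIONS[bucket])
```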



DEPLOYMENT AT A 30K FEET LEVEL

Typical steps:

• At the dealership level: negative list verification from Driving License details

• The finance person at the dealer then inputs all PII, including the Social Security number, into the “Approval” system; the engine runs the model with the Bureau data/other model details

• The system recommends a decision (yes/no) and a guidance price, which can then be negotiated with the Credit executive based on the scenarios/sales/risk guidance he has.



OTHER APPLICATIONS OF THE MODEL

Some other areas within the institution where the model outputs are leveraged:

• Portfolio P&L estimation

Net Income from this business = Sum (all Monthly Payments) - (Probability of Non-repayment * Estimated $ of Non-repayment * Loss Post Recovery)

In the accounting world: 1. Monthly Payments figures are “discounted” for inflation over the loan time window; 2. the net income is then compared against the returns that the firm would have gotten if it had invested the same amount at US Government Treasury rates, to justify running this business.

• Regulatory risk reporting - BASEL norms

• Customer bucketing for Upselling/Cross selling/Retention programs.
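The P&L formula above can be sketched for a single loan as follows. The discounting uses a simple monthly compounding of a hypothetical annual rate, and every figure in the example is made up for illustration; a portfolio estimate would sum this over all loans.

```python
def expected_loss(pd_, ead, lgd):
    """Probability of Non-repayment * Estimated $ of Non-repayment * Loss Post Recovery."""
    return pd_ * ead * lgd


def discounted_payments(monthly_payment, n_months, annual_rate=0.0):
    """Sum of monthly payments, each discounted back to today."""
    monthly_rate = annual_rate / 12.0
    return sum(monthly_payment / (1.0 + monthly_rate) ** t for t in range(1, n_months + 1))


def net_income(monthly_payment, n_months, pd_, ead, lgd, annual_rate=0.0):
    """Sum (all Monthly Payments) minus the expected loss, per the formula above."""
    return discounted_payments(monthly_payment, n_months, annual_rate) - expected_loss(pd_, ead, lgd)


# Hypothetical loan: $500 a month for 12 months, PD 5%, EAD $4,000, LGD 40%.
undiscounted = net_income(500.0, 12, pd_=0.05, ead=4000.0, lgd=0.40)
discounted = net_income(500.0, 12, pd_=0.05, ead=4000.0, lgd=0.40, annual_rate=0.03)

print(round(undiscounted, 2), round(discounted, 2))
```

Without discounting, the example gives 6,000 in payments less an expected loss of 80; any positive discount rate lowers the result.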



SIMILAR FRAMEWORK IN OTHER INDUSTRIES

A similar framework is used in other industries for solving various business problems:

• Marketing Campaigns: e.g., find out which customers are more likely to respond to campaigns and, if they do, how much $ they would spend with us

• How many will use Friend finder on Facebook and, if they do, how many invites will they send?

• How many will see the promoted news feed? How many times will they re-share it?

• Loyalty Models (ecommerce): e.g., will a customer get engaged (repeat purchases) and, if he does, how much $ will he spend with us

• Attrition Models (Telecom): e.g., are we going to lose a customer and, if yes, how much revenue impact is it going to have



APPENDIX

GOOD INFO ON LINEAR & LOGISTIC REGRESSION AT…

Linear Regression: http://faculty.chass.ncsu.edu/garson/PA765/regress.htm

Logistic Regression: http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm


Data & Analytics
