Page 1: Credit risk and survival analysis: Estimation of ... · 1 Survival analysis Survival analysis is a branch of statistics which is specialized in the distribution of life-times. The

Stochastics and Financial Mathematics

Master Thesis

Credit risk and survival analysis: Estimation of Conditional Cure Rate

Author: Just Bajzelj
Supervisor: dr. A.J. van Es
Daily supervisor: R. Man MSc
Examination date: August 30, 2018

Korteweg-de Vries Institute for Mathematics

Rabobank


Abstract

Rabobank currently uses a non-parametric estimator for the computation of the Conditional Cure Rate (CCR), and this method has several shortcomings. The goal of this thesis is to find a better estimator than the one currently in use.

This master thesis looks into three CCR estimators. The first one is the currently used method. We analyze its performance with the bootstrap and then develop a method with better performance. Since neither the newly developed nor the currently used estimator is theoretically correct with respect to the data, a third method is introduced. However, according to the bootstrap, this third method exhibits the worst performance. For the modeling and data analysis the programming language Python is used.

Title: Credit risk and survival analysis: Estimation of Conditional Cure Rate
Author: Just Bajzelj, [email protected], 11406690
Supervisor: dhr. dr. A.J. van Es
Second Examiner: dhr. dr. A.V. den Boer
Examination date: August 30, 2018

Korteweg-de Vries Institute for Mathematics
University of Amsterdam
Science Park 105-107, 1098 XG Amsterdam
http://kdvi.uva.nl

Rabobank
Croeselaan 18, 3521 CB Utrecht
https://www.rabobank.nl


Acknowledgments

I would like to thank my parents, who made it possible for me to finish the two-year Master's in Stochastics and Financial Mathematics in Amsterdam, which helped me become the person I am today.

I would also like to thank the people from Rabobank and the department of Risk Analytics, thanks to whom I wrote this thesis during my six-month internship, and who showed me that work can be more than enjoyable.

In particular, I would like to acknowledge all my mentors: Bert van Es, who always had time to answer all my questions; Viktor Tchistiakov, who gave me challenging questions and ideas, which represent the core of this thesis; and Ramon Man, who always showed me support and made sure that this thesis was done on schedule.


Contents

Introduction
0.1 Background
0.2 Research objective and approach

1 Survival analysis
1.1 Censoring
1.2 Definitions
1.3 Estimators
1.4 Competing risk setting

2 Current CCR Model
2.1 Model implementation
2.1.1 Stratification
2.2 Results
2.3 Performance of the method
2.3.1 The bootstrap

3 Cox proportional hazards model
3.1 Parameter estimation
3.1.1 Estimation of β
3.1.2 Baseline hazard estimation
3.2 Estimators
3.3 Model implementation
3.4 Results
3.5 Performance of the method
3.6 Discussion

4 Proportional hazards model in an interval censored setting
4.1 Theoretical background
4.1.1 Generalized Linear Models
4.2 Model implementation
4.2.1 Binning
4.3 Results
4.4 Performance of the method
4.5 Discussion

Popular summary


Introduction

Under Basel II banks are allowed to build their internal models for the estimation of risk parameters. This is known as the Internal Rating Based approach (IRB). Risk parameters are used by banks in order to calculate their own regulatory capital. In Rabobank, the Loss Given Default (LGD), Probability of Default (PD) and Exposure at Default (EAD) are calculated with IRB.

Loss given default describes the loss of a bank in case the client defaults. Default happens when the client is unable to make the monthly mortgage payments for some time, or when one of the other default events, usually connected with the client's financial difficulties, happens. A missed payment is also known as an arrear. After the client defaults his portfolio is non-performing and two events can happen. The event in which the client's portfolio returns to a performing one is known as cure. Cure happens if the customer pays his arrears and has no missed payments during a three-month period, or if he pays the arrears after the loan restructuring and has no arrears in a twelve-month period. The event in which the bank needs to sell the client's security in order to cover the loss is called liquidation.

Two approaches are available for LGD modeling. The non-structural approach consists of estimating the LGD by observing the historical loss and recovery data. Rabobank uses another approach, the so-called structural approach. When the bank deals with LGD in a structural way, different probabilities and outcomes of default are considered. The model is split into several components that are developed separately and later combined in order to produce the final LGD estimate, as can be seen from Figure 0.1. In order to calculate the loss given default we first need to calculate the probability of cure, the loss given liquidation, the loss given cure and the indirect costs. Rabobank assumes that loss given cure equals zero, since cure is offered to the client only if there is zero loss to the bank, and any indirect costs that are suffered during the cure process are taken into account in the indirect costs component. Therefore, in this thesis the term loss is sometimes used instead of the term liquidation.

Indirect costs are usually costs that are made internally by the departments of the bank that are involved in the processing of defaults, e.g., salaries paid to employees and administrative costs. The equation used for the LGD calculation can be seen in Figure 0.1.
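The equation from Figure 0.1, LGD = PCure · LGC + IC + (1 − PCure) · LGL, can be sketched as a small Python function. The function name, the argument names and the numeric inputs below are illustrative assumptions and do not come from Rabobank's implementation; following the text, loss given cure is set to zero in the example.

```python
def loss_given_default(p_cure, lgc, lgl, indirect_costs):
    """Combine the model components as in Figure 0.1.

    LGD = P_cure * LGC + IC + (1 - P_cure) * LGL
    """
    if not 0.0 <= p_cure <= 1.0:
        raise ValueError("p_cure must be a probability in [0, 1]")
    return p_cure * lgc + indirect_costs + (1.0 - p_cure) * lgl


# Hypothetical component values, chosen only for illustration.
example_lgd = loss_given_default(p_cure=0.7, lgc=0.0, lgl=0.25, indirect_costs=0.01)
```

With loss given cure equal to zero, the result is driven by the liquidation branch and the indirect costs, which is why the quality of the probability of cure estimate matters for the final LGD.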

The parameter probability of cure (PCure) is the probability that the client will cure before his security is liquidated. All model components, including PCure, depend on covariates of each client.

A big proportion of the cases used to estimate PCure are unresolved cases. A client that defaulted and was not yet liquidated nor cured is called unresolved. The easiest way to imagine the unresolved state is to note that every client needs some time after default in order to cure or to sell their property. The non-absorbing state before cure or liquidation is called the unresolved state.

Figure 0.1: $LGD = P_{Cure} \cdot LGC + IC + (1 - P_{Cure}) \cdot LGL$.

Unresolved cases can be treated in two ways:

• Exclusion: the cases can be excluded from the PCure estimation

• Inclusion with expected outcome: we can include unresolved cases in the PCure estimation by assigning them an expected outcome, the Conditional Cure Rate.

If unresolved cases are excluded from the PCure estimation, the parameter estimator can be biased. In other words, PCure will be estimated on a sample in which clients are cured after a short time. Consequently, clients who would need more time to be cured would get a smaller value of PCure than they deserve. Since PCure is the probability that a client will be cured after default, and such a probability is not time dependent, treating unresolved cases by exclusion would be wrong.

One approach within Rabobank is to treat unresolved cases by assigning them a value called the Conditional Cure Rate. The Conditional Cure Rate (CCR) can be estimated with a non-parametric technique. This thesis is about developing a new model able to eliminate the shortcomings of the currently employed model.

0.1 Background

CCR tells us the probability that a client will cure after a certain time point, conditioned on the event that the client is still unresolved at that point. Rabobank's current model estimates CCR with survival analysis, which is a branch of statistics that is specialized in the distribution of lifetimes. The lifetime is the time to the occurrence of an event. In our case the lifetime is the time between the default of the client and cure or liquidation. The current CCR is a combination of the Kaplan-Meier and Nelson-Aalen estimators, which are two of the most recognized and simple survival distribution estimators. The current CCR model has some shortcomings. The goal of the present research is to develop a new CCR model which will be able to outperform the current model.


0.2 Research objective and approach

The objective of this thesis is to find new ways to estimate CCR that improve on the current estimation. New techniques will be compared with the currently used CCR estimator. The research goal of this thesis is:

Develop an alternative CCR model which is better than the current one.

In order to reach this research goal, we will need to answer the first research question:

What type of model is natural for the problem?

To answer this question, we first study the currently used model. Second, we will get familiar with basic and more advanced concepts of survival analysis. Once this is done, we will be able to derive more advanced and probably better estimators. In particular we will get some basic ideas about the techniques that can be used for the CCR modeling. In order to find a better model than the one which is currently used, we have to know what better means. In other words, we will need to be able to answer the second research question:

What are criteria for comparing the models?


1 Survival analysis

Survival analysis is a branch of statistics which is specialized in the distribution of lifetimes. The random variable which denotes a lifetime will be denoted by $T$. Survival analysis has its roots in medical studies, as they look into the lifetime of patients or the time to cure of a disease. In our case the lifetime of interest is the time which a client spends in the unresolved state, and the events of interest are cure and liquidation. One of the most attractive features of survival analysis is that it can cope with censored observations.

The estimator which is currently used for CCR estimation is a combination of two estimators of two different quantities which are specific to survival analysis. In order to understand and model CCR as it is currently modeled by Rabobank, we first have to know which quantities are modeled by Rabobank, how these quantities are estimated, and how to model them when a client can become resolved due to two reasons. Censoring is presented in Section 1.1. The most basic concepts of survival analysis and its specific quantities are introduced in Section 1.2. In Section 1.3 we will look into basic estimators of survival analysis. In Section 1.4 the theory and estimators for the case of two possible outcomes are presented, since our model assumes that a client can become resolved due to two reasons, cure and liquidation.

1.1 Censoring

When lifetime distributions are modeled with a non-survival analysis approach, only observations where the event of interest took place are used. For simplicity we will look into an example from medical studies. For instance, if a study on 10 patients has been made and only 5 of them died, while the other 5 are still alive, only 5 observations can be used to model the distribution of lifetimes. Survival analysis is able to get some information about the distribution from the other patients as well. Patients who are still alive at the end of the study are called censored observations, and survival analysis is able to cope with censored data. A visualization of censored observations can be seen in Figure 1.1. The time variable on the x axis represents the month of the study, while the y axis shows the number of each patient. The line on the right side represents the end of the study. The circle represents the month of the study in which the patient got sick, while the arrow represents the death of a patient. It can be seen that the deaths of patients 6 and 4 are not observed, because the study ended before the event of interest happened. The event of interest is not observed for patient 2 either, because the patient left the study before its end. Such phenomena are called censored observations. Despite the fact that the event of interest did not happen, such observations tell us that the event of interest happened after the last time we saw the patient alive, and consequently they have an impact on the output distribution. Types of censoring where we know the time when the patient got sick, but we do not know the time of the event of interest, are called right censoring (Lawless 2002).

Figure 1.1: Censored observations and uncensored observation

For now we will assume that we have just one type of censoring, right censoring. In our case this means that a client who has defaulted is observed, but the observation period finished before the client was cured or liquidated. To understand why such a phenomenon happens, it has to be taken into consideration that in Rabobank's data there are observations where a client is cured after 48 months (4 years). Consequently, if clients who defaulted in 2016 are observed, some of them can still be unresolved, despite the fact that they will be cured or liquidated sometime in the future.
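Right censoring as described above can be sketched in Python: each client is turned into an observed time and a censoring indicator, given the end of the observation window. The function name, the month-based encoding and the three clients are hypothetical and serve only as an illustration.

```python
def observed_time_and_indicator(default_month, resolution_month, study_end_month):
    """Return (t, delta) for one client.

    t     : months spent in default that we actually observe
    delta : 1 if cure/liquidation was observed, 0 if right censored
    resolution_month is None for a client that is still unresolved.
    """
    resolved_in_window = (
        resolution_month is not None and resolution_month <= study_end_month
    )
    if resolved_in_window:
        return resolution_month - default_month, 1
    # Right censored: we only know the event happens after the study end.
    return study_end_month - default_month, 0


# Hypothetical clients: (default month, resolution month or None).
clients = [(0, 7), (3, None), (10, 40)]
sample = [observed_time_and_indicator(d, r, study_end_month=24) for d, r in clients]
```

The third client is resolved only after the window closes, so within the study it contributes a censored time of 14 months instead of its true lifetime of 30 months.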


1.2 Definitions

In order to use techniques from survival analysis, some quantities need to be defined; they are going to be modeled directly or indirectly. For simplicity it will be assumed that a client can become resolved due to one reason only, cure. Later this assumption will be relaxed. All definitions and theory from this section can be found in any book about survival analysis. For a review of survival analysis see Lawless (2002), Kalbfleisch and Prentice (2002) and Klein and Moeschberger (2003).

Definition 1.2.1. Distribution function: Since $T$ is a random variable, it has a distribution function and a density function, which are denoted by $F$ and $f$, with

\[
F(t) = P(T \le t) = \int_0^t f(s)\,ds.
\]

The distribution function tells us the probability that the event of interest, cure, will happen before time $t$.

In survival analysis we can also be interested in the probability that the client is still unresolved at time $t$.

Definition 1.2.2. The survival function represents the probability that the event of interest did not happen up to time $t$ and it is denoted by $S$, with

\[
S(t) = P(T > t) = 1 - F(t). \tag{1.1}
\]

The survival function is a non-increasing right continuous function of $t$ with $S(0) = 1$ and $\lim_{t \to \infty} S(t) = 0$.

Since we can also be interested in the rate of events in a small time step after time $t$, we define the following quantity.

Definition 1.2.3. The hazard rate function is a function that tells us the rate of cures in an infinitely small step after $t$, conditioned on the event that the client is still unresolved at time $t$. The hazard rate function is denoted by $\lambda$, with

\[
\lambda(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}.
\]

It is important to note that the hazard rate function is a rate and not a probability, so it can take a value bigger than 1. A large value of $\lambda(t)$ means that the event is likely to happen shortly after time $t$. In order to interconnect all the defined quantities we need the cumulative hazard function, which is defined as

\[
\Lambda(t) = \int_0^t \lambda(s)\,ds.
\]

Now we can look at how the hazard rate function and the survival function are connected. We know that $\{t \le T < t + \Delta t\} \subseteq \{t \le T\}$. Consequently we get

\[
\begin{aligned}
\lambda(t) &= \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
= \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t \, P(t \le T)} \\
&= \frac{f(t)}{S(t)}
= \frac{\frac{d}{dt}\bigl(1 - S(t)\bigr)}{S(t)}
= \frac{-\frac{d}{dt} S(t)}{S(t)}
= -\frac{d}{dt} \log S(t).
\end{aligned}
\tag{1.2}
\]

From the last equality, together with $S(0) = 1$, it follows that

\[
S(t) = \exp(-\Lambda(t)). \tag{1.3}
\]

We can see that each of the hazard rate function, the survival function, the density function and the distribution function uniquely defines the distribution of $T$.
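Relation (1.3) can be checked numerically. The sketch below assumes, purely for illustration, a Weibull hazard with shape 2 and scale 2, for which the cumulative hazard is known in closed form; the function names and the midpoint integration rule are choices made here and are not part of the thesis.

```python
import math


def weibull_hazard(t, shape=2.0, scale=2.0):
    """Hazard rate of a Weibull lifetime (an assumed example distribution)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def cumulative_hazard(hazard, t, steps=10_000):
    """Midpoint-rule approximation of the integral of the hazard over [0, t]."""
    h = t / steps
    return sum(hazard((i + 0.5) * h) for i in range(steps)) * h


def survival_from_hazard(hazard, t):
    """S(t) = exp(-Lambda(t)), relation (1.3)."""
    return math.exp(-cumulative_hazard(hazard, t))


t = 3.0
numeric = survival_from_hazard(weibull_hazard, t)
closed_form = math.exp(-((t / 2.0) ** 2.0))  # exp(-(t/scale)^shape)
hazard_at_t = weibull_hazard(t)              # 1.5, a hazard rate above 1
```

The hazard at $t = 3$ equals 1.5 in this example, which illustrates that a hazard rate is not a probability, while the survival function obtained from it stays between 0 and 1.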

1.3 Estimators

Until now we looked into quantities that define the distribution of $T$. In this section simple estimators of these quantities will be presented. Later those estimators will be used for the modeling of CCR.

The assumption that a client can become resolved due to one reason only is still valid. In the data there are $n$ defaults. The observed times will be denoted by $t_i$, $i \in \{1, 2, \dots, n\}$. Each of those times can represent either the time between default and resolution (cure or liquidation) or a censoring time. The variable $\delta_i$ is called the censoring indicator and it takes the value 1 if client $i$ was not censored and 0 otherwise. Consequently, in cases where there is only one event of interest the data is given as $\{(t_i, \delta_i) \mid i \in \{1, 2, \dots, n\}\}$. Once the observed times are obtained, we order the distinct times in ascending order, $t_{(1)} < t_{(2)} < \dots < t_{(k)}$, $k \le n$. With $D(t)$ we denote the set of individuals that experienced the event of interest at time $t$, or

\[
D(t) = \{ j \mid T_j = t,\ j \in \{1, 2, \dots, n\} \}.
\]

With $d_i$ the size of $D(t_{(i)})$ will be denoted. The set of individuals that are at risk at time $t$ is denoted by $N(t)$. The individuals in $N(t)$ are those that experienced the event of interest at time $t$ or are still in our study at time $t$, or

\[
N(t) = \{ j \mid T_j \ge t,\ j \in \{1, 2, \dots, n\} \}.
\]

With $n_i$ we will denote the size of $N(t_{(i)})$.

The most basic estimator of a survival function is known as the Kaplan-Meier estimator and it is also used in the current Rabobank model. The Kaplan-Meier estimator is given by the following formula,

\[
\hat{S}(t) = \prod_{j : t_{(j)} \le t} \left(1 - \frac{d_j}{n_j}\right). \tag{1.4}
\]

The estimator which is used for the estimation of the cumulative hazard rate in the current CCR model is called the Nelson-Aalen estimator and it is given by

\[
\hat{\Lambda}(t) = \sum_{j : t_{(j)} \le t} \frac{d_j}{n_j}. \tag{1.5}
\]

Using the Nelson-Aalen estimator we can also model the hazard rate as a point process which takes the value $d_i / n_i$ at time $t_{(i)}$.

Let us look into the following example. We have a sample which consists of 10 clients. The censoring indicator and observed time can be seen in Figure 1.2. The times which each client needs in order to be cured after default are observed. In order to compare traditional statistical techniques and survival techniques, the survival curve will be estimated with the empirical distribution function and with the Kaplan-Meier estimator.

Figure 1.2: Example data

The survival curve estimated with the Kaplan-Meier estimator is shown in Figure 1.3. In the figure it can be seen that jumps happen only at the times when an event of interest happens. Observed censored times are denoted with small vertical lines. They have no influence on the timing of the jumps, only on their size, since censored observations only influence the risk set, which is in the denominator in equation (1.4). The more censored observations occur before an observed event time, the smaller the risk set will be and the bigger the jump of the survival curve will be.
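The Kaplan-Meier estimator (1.4) and the Nelson-Aalen estimator (1.5) can be sketched in a few lines of Python. The helper names and the eight observations below are hypothetical and are not the data of Figure 1.2; exact fractions are used so that the products can be compared with hand calculations.

```python
from fractions import Fraction


def risk_table(times, events):
    """Return [(t_j, d_j, n_j)] for the distinct event times in ascending order.

    Times with only censored observations are skipped: there d_j = 0, so the
    Kaplan-Meier factor is 1 and the Nelson-Aalen increment is 0.
    """
    table = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        n = sum(1 for u in times if u >= t)  # size of the risk set N(t)
        table.append((t, d, n))
    return table


def kaplan_meier(times, events, t):
    """Estimate S(t) with the product in (1.4)."""
    s = Fraction(1)
    for tj, d, n in risk_table(times, events):
        if tj <= t:
            s *= 1 - Fraction(d, n)
    return s


def nelson_aalen(times, events, t):
    """Estimate the cumulative hazard with the sum in (1.5)."""
    return sum(
        (Fraction(d, n) for tj, d, n in risk_table(times, events) if tj <= t),
        Fraction(0),
    )


# Hypothetical sample: observed times and censoring indicators (1 = event).
times = [1, 2, 2, 3, 4, 4, 5, 6]
events = [1, 0, 1, 1, 0, 1, 0, 0]
km_at_4 = kaplan_meier(times, events, 4)
na_at_4 = nelson_aalen(times, events, 4)
```

In this toy sample the censored observation at time 2 lowers the risk set at time 3 from 6 to 5, which makes the jump at time 3 larger, in line with the remark about the size of the jumps.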

Figure 1.3: Survival curve estimated with Kaplan-Meier estimator

We will now calculate the survival curve with the empirical distribution function, which is formulated as

\[
F_n(t) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{T_i \le t\}},
\]

and obtain the survival curve from the identity (1.1). Doing so we obtain an estimator for the survival function which is equal to

\[
S_n(t) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{T_i > t\}}.
\]

It is important to add that if we use the empirical distribution function we can only use observations where the event of interest did happen. In our case we can only take the observations where cure happened, $i = 4, 6, 7$. The survival curve estimated with the empirical distribution is shown in Figure 1.4.

Figure 1.4: Survival curve estimated with the empirical distribution

From Figure 1.4 it can be observed that bigger jumps occur in comparison with the Kaplan-Meier estimator. An intuitive explanation for this phenomenon is that censored observations also carry important information for the estimator. For instance, if we want to estimate the survival curve at time $t$ and we have only censored observations which happened after time $t$, then the probability of an event happening before time $t$ is probably low.

At this point note that if we use the Kaplan-Meier estimator with non-censored observations only, we obtain the same results as with the empirical distribution. If we have times $t_1 < t_2 < \dots < t_k$ and at time $t_i$ there are $d_i$ events of interest, then the size of the risk set at time $t_i$ is $n - d_1 - \dots - d_{i-1}$. For $t \in [t_i, t_{i+1})$ it follows that

\[
\begin{aligned}
\hat{S}(t) &= \left(1 - \frac{d_1}{n}\right)\left(1 - \frac{d_2}{n - d_1}\right) \cdots \left(1 - \frac{d_i}{n - d_1 - \dots - d_{i-1}}\right) \\
&= \frac{(n - d_1)(n - d_1 - d_2) \cdots (n - d_1 - \dots - d_i)}{n (n - d_1)(n - d_1 - d_2) \cdots (n - d_1 - \dots - d_{i-1})} \\
&= \frac{n - d_1 - \dots - d_i}{n} = \frac{1}{n} \sum_{l=1}^{n} 1_{\{T_l > t\}}.
\end{aligned}
\tag{1.6}
\]
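The identity (1.6) can be checked on a small sample without censoring. The following sketch uses hypothetical cure times and helper names chosen here for illustration; it compares the empirical survival function with the Kaplan-Meier product at every integer time.

```python
from fractions import Fraction


def empirical_survival(times, t):
    """S_n(t): fraction of observations with a lifetime strictly above t."""
    return Fraction(sum(1 for u in times if u > t), len(times))


def kaplan_meier_uncensored(times, t):
    """Kaplan-Meier product when every observation is an event (no censoring)."""
    s = Fraction(1)
    for tj in sorted(set(times)):
        if tj <= t:
            d = times.count(tj)                   # events at t_j
            n = sum(1 for u in times if u >= tj)  # risk set at t_j
            s *= 1 - Fraction(d, n)
    return s


# Hypothetical cure times in months, all uncensored.
cure_times = [2, 3, 3, 5, 8]
agree = all(
    empirical_survival(cure_times, t) == kaplan_meier_uncensored(cure_times, t)
    for t in range(0, 10)
)
```

The product telescopes exactly as in (1.6), so the two estimators agree at every time point as long as no observation is censored.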

1.4 Competing risk setting

Up to this point we assumed that clients can become resolved only by being cured, but in reality that is not the case. It is known that a client can become resolved due to two reasons, cure and liquidation. The modeling approach in the CCR model is that the client becomes resolved due to the reason which happens first. In other words, there are two types of events which influence the survival function. If all liquidated events were modeled as censored, the survival function would be overestimated. Since one of the building blocks of CCR is the survival function, CCR estimates in which liquidated events are considered as censored would give us wrong results. Modeling in a setting where two outcomes are possible is known in survival analysis as the competing risk setting. In this section we will look into quantities in the competing risk setting. For a review of the competing risk model see M.-J. Zhang, X. Zhang, and Scheike (2008).


In order to model CCR(t) we have to introduce more notation. The variable $e_i$ represents the reason due to which subject $i$ failed. In our case $e_i = 1$ if subject $i$ was cured and $e_i = 2$ if subject $i$ was liquidated. Consequently, the data is given as $(T_i, \delta_i, e_i)$, $i \in \{1, 2, \dots, n\}$. Since a subject can fail due to two reasons, a cause specific hazard rate needs to be used. By $D_i(t)$ we denote the set of individuals that failed due to reason $i$ at time $t$, or

\[
D_i(t) = \{ k \mid T_k = t \wedge e_k = i,\ k \in \{1, 2, \dots, n\} \}.
\]

With $d^{i}_{j}$ we denote the size of the set $D_i(t_{(j)})$.

Definition 1.4.1. The cause specific hazard rate is a function that tells us the rate of event $e = i$ in an infinitely small step after $t$, conditioned on the event that a subject did not fail up to time $t$. The cause specific hazard rate function is denoted by $\lambda_i(t)$, with

\[
\lambda_i(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t,\ e = i \mid T \ge t)}{\Delta t}.
\]

In order to calculate the survival function in a competing risk setting we need to define the cause specific cumulative hazard function, which is given as

\[
\Lambda_i(t) = \int_0^t \lambda_i(s)\,ds.
\]

Then the analog of Formula (1.3) in the competing risk setting is

\[
S(t) = \exp(-\Lambda_1(t) - \Lambda_2(t)). \tag{1.7}
\]

It is important to add that when a subject can fail due to 2 reasons the Kaplan-Meier estimator of the survival function takes the following form

\[
\hat{S}(t) = \prod_{j : t_{(j)} \le t} \left(1 - \frac{d^{1}_{j} + d^{2}_{j}}{n_j}\right).
\]

The Nelson-Aalen estimator is the estimator of the cause specific cumulative hazard rate, and it is given as

\[
\hat{\Lambda}_i(t) = \sum_{j : t_{(j)} \le t} \frac{d^{i}_{j}}{n_j}. \tag{1.8}
\]

The value $d\hat{\Lambda}_i(t_j) = d^{i}_{j} / n_j$ tells us the probability that a subject will experience event $i$ at time $t_j$, conditioned on the event that he is still alive at time $t_j$. Once we obtain all the cause specific hazard rates we can calculate the probability of failing due to a specific reason.

Definition 1.4.2. The cumulative incidence function (CIF) tells us the probability of failing due to cause $i$ before time $t$ and it is denoted by $F_i$, with

\[
F_i(t) = P(T \le t,\ e = i).
\]

The cumulative incidence function is expressed mathematically as

\[
F_i(t) = \int_0^t \lambda_i(s) S(s)\,ds
= \int_0^t \lambda_i(s) \exp\left(-\sum_{l=1}^{2} \Lambda_l(s)\right) ds.
\tag{1.9}
\]

Using the Nelson-Aalen and Kaplan-Meier estimators we can obtain a non-parametric estimator for the CIF, which is given as

\[
\hat{F}_i(t) = \sum_{j : t_j \le t} \hat{S}(t_{j-1})\, d\hat{\Lambda}_i(t_j). \tag{1.10}
\]

An intuitive explanation of Formula (1.10) is that for a subject to fail due to reason $i$ at time $t_j$, it has to be unresolved up to time $t_{j-1}$ and then fail due to reason $i$ at the next time instance.

We will continue with the example from Section 1.3, but this time we will assume that some observations which were censored were actually liquidated. We will take all the steps needed to calculate $F_1(t)$ from equation (1.10), which will later be needed in order to estimate CCR.

Figure 1.5: Survival curve estimated with empirical distribution

In order to estimate the survival curve, the following steps need to be taken. No event of interest happens on the interval $[0, 1)$; consequently for $t \in [0, 1)$ it holds that

\[
\hat{S}(t) = 1.
\]

At time 1 one event of interest happens and the size of the risk set is 10. Since no events of interest happen until time 2, for $t \in [1, 2)$ it holds that

\[
\hat{S}(t) = 1 \cdot \left(1 - \frac{1}{10}\right) = \frac{9}{10}.
\]

At time 2 two events of interest happen and the size of the risk set is 9, so for $t \in [2, 3)$ it holds that

\[
\hat{S}(t) = 1 \cdot \left(1 - \frac{1}{10}\right)\left(1 - \frac{2}{9}\right) = \frac{7}{10}.
\]

In a similar fashion we obtain the following values for $\hat{S}(t)$:

\[
\hat{S}(t) =
\begin{cases}
\frac{7}{10}\left(1 - \frac{2}{7}\right) = \frac{1}{2}, & t \in [3, 4), \\
\frac{1}{2}\left(1 - \frac{2}{5}\right) = \frac{3}{10}, & t \in [4, 5), \\
\frac{3}{10}\left(1 - \frac{0}{3}\right) = \frac{3}{10}, & t \in [5, 6), \\
\frac{3}{10}\left(1 - \frac{0}{1}\right) = \frac{3}{10}, & t \in [6, \infty).
\end{cases}
\]

Figure 1.6: Survival curve in competing risk setting

The Kaplan-Meier curve for the competing risk setting is shown in Figure 1.6. Compared with Figure 1.3 it can be seen that it has more jumps, which makes sense, since we have two events of interest. At the same time we can see that if we consider liquidated observations as censored, the survival curve will overestimate the survival probability. Consequently, the CCR will be modeled in a competing risk setting.

In order to calculate an estimator of $F_1(t)$ we need to calculate $d\hat{\Lambda}_1(t_j)$. The value of $d\hat{\Lambda}_1(1)$ is simply the number of individuals that were cured at time 1 divided by the number of all individuals that are at risk at time 1, or

\[
d\hat{\Lambda}_1(1) = \frac{0}{10}.
\]

In a similar fashion we obtain the other values of $d\hat{\Lambda}_1(t_j)$:

\[
d\hat{\Lambda}_1(t) =
\begin{cases}
\frac{1}{9}, & t = 2, \\
\frac{2}{7}, & t = 3, \\
\frac{0}{5}, & t = 4, \\
\frac{0}{3}, & t = 5, \\
\frac{0}{1}, & t = 6, \\
0, & \text{else.}
\end{cases}
\]

Figure 1.7: cause specific hazard rate for cure estimated with dΛ1

Now we can finally calculate $\hat{F}_1(t)$. Since we do not have any cures in the interval $[0, 1)$, it holds that $\hat{F}_1(t) = 0$ for $t \in [0, 1)$. For $t \in [1, 2)$ it holds that

\[
\hat{F}_1(t) = \hat{S}(0)\, d\hat{\Lambda}_1(1) = 1 \cdot \frac{0}{10} = 0.
\]

For $t \in [2, 3)$ it holds that

\[
\begin{aligned}
\hat{F}_1(t) &= \hat{S}(0)\, d\hat{\Lambda}_1(1) + \hat{S}(1)\, d\hat{\Lambda}_1(2) \\
&= \hat{F}_1(1) + \hat{S}(1)\, d\hat{\Lambda}_1(2) \\
&= 0 + \frac{9}{10} \cdot \frac{1}{9} = \frac{1}{10}.
\end{aligned}
\tag{1.11}
\]

Figure 1.8: Estimated cumulative incidence function

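In a similar fashion, using the values of $\hat{S}(t_{j-1})$ computed above, we obtain the values of $\hat{F}_1(t)$ for other $t$,

\[
\hat{F}_1(t) =
\begin{cases}
\hat{F}_1(2) + \frac{7}{10} \cdot \frac{2}{7} = \frac{3}{10}, & t \in [3, 4), \\
\hat{F}_1(3) + \frac{1}{2} \cdot \frac{0}{5} = \frac{3}{10}, & t \in [4, 5), \\
\hat{F}_1(4) + \frac{3}{10} \cdot \frac{0}{3} = \frac{3}{10}, & t \in [5, 6), \\
\hat{F}_1(5) + \frac{3}{10} \cdot \frac{0}{1} = \frac{3}{10}, & t \in [6, \infty).
\end{cases}
\]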
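The estimator (1.10) can be sketched in Python for an arbitrary table of counts. The table layout, the function names and the four rows below are assumptions made for this illustration and are not the example data of this section; exact fractions are used so that the identity $F_1 + F_2 + S = 1$ can be verified without rounding.

```python
from fractions import Fraction


def cumulative_incidence(rows, cause, t):
    """Estimate F_cause(t) as the sum of S(t_{j-1}) * d_j^cause / n_j over t_j <= t.

    rows: list of (t_j, d1_j, d2_j, n_j) sorted by t_j, where d1_j are cures,
    d2_j are liquidations and n_j is the size of the risk set at t_j.
    """
    s_prev = Fraction(1)  # Kaplan-Meier value just before the current time
    cif = Fraction(0)
    for tj, d1, d2, n in rows:
        if tj > t:
            break
        d_cause = d1 if cause == 1 else d2
        cif += s_prev * Fraction(d_cause, n)
        s_prev *= 1 - Fraction(d1 + d2, n)  # both causes reduce survival
    return cif


def overall_survival(rows, t):
    """Kaplan-Meier estimate in the competing risk setting."""
    s = Fraction(1)
    for tj, d1, d2, n in rows:
        if tj <= t:
            s *= 1 - Fraction(d1 + d2, n)
    return s


# Hypothetical counts: (time, cures, liquidations, size of risk set).
rows = [(1, 1, 0, 8), (2, 1, 1, 6), (3, 0, 1, 3), (4, 1, 0, 1)]
f1 = cumulative_incidence(rows, cause=1, t=4)
f2 = cumulative_incidence(rows, cause=2, t=4)
s = overall_survival(rows, t=4)
```

At every time point the two cumulative incidence functions and the survival function add up to one, which is a convenient sanity check for hand calculations of this kind.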


2 Current CCR Model

The CCR for time $t$ tells us the probability that a client will be cured after time $t$, conditioned on the event that he is still unresolved at time $t$. The unconditional probability of being cured after time $t$ can be expressed as $F_1(\infty) - F_1(t)$. It is important to understand that some cases will never be cured. Consequently the probability of cure is not necessarily equal to 1 as $t \to \infty$, or $F_1(\infty) \le 1$. Since CCR is conditioned on the event of being unresolved up to time $t$, it follows that

\[
CCR(t) = \frac{F_1(\infty) - F_1(t)}{S(t)}. \tag{2.1}
\]

Using the estimator (1.10) we can estimate the probability of curing after time $t \in [t_i, t_{i+1})$ as

\[
\hat{F}_1(\infty) - \hat{F}_1(t_i). \tag{2.2}
\]

We assume that the probability of cure after the 58th month is equal to zero. Consequently, it holds that F1(∞) = F1(58) and CCR(58) = 0. Another assumption is that every case will be resolved as t → ∞, which corresponds to requiring that the survival function satisfies S(∞) = 0. Consequently, we will consider every case with an observed time larger than 58 as liquidated. Since CCR is conditioned on the event of being unresolved up to time t ∈ [ti, ti+1), the estimator of CCR takes the following form

\[
CCR(t) = \frac{F_1(58) - F_1(t_i)}{S(t_i)}
= \frac{d^1_{i+1}}{n_{i+1}} + \left(1 - \frac{d^1_{i+1} + d^2_{i+1}}{n_{i+1}}\right) CCR(t_{i+1}).
\tag{2.3}
\]

If we define F1(0) = 0, since the probability of being cured before time zero is equal to zero, CCR(0) is equal to F1(58). This gives us the probability of being cured, i.e. the value of PCure in Figure 0.1.
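The recursive form of the CCR estimator can be checked numerically against the direct ratio. The sketch below is illustrative only. It reads n_i as the number of individuals at risk and d^1_i, d^2_i as the numbers of cures and liquidations at time t_i, and the counts are hypothetical values chosen to be consistent with the small worked example of Chapter 1; they are not Rabobank data.

```python
from fractions import Fraction as Fr

# Hypothetical counts per time point t_1, ..., t_6 (assumed for illustration).
n_at_risk = [10, 9, 7, 5, 3, 1]
cures = [0, 1, 2, 0, 0, 0]
liquidations = [1, 0, 0, 2, 0, 1]

def survival_and_incidence(n, d1, d2):
    """S(t_i) and F1(t_i) for i = 0..k, starting from S(t_0) = 1 and F1(t_0) = 0."""
    S, F1 = [Fr(1)], [Fr(0)]
    for ni, c, l in zip(n, d1, d2):
        F1.append(F1[-1] + S[-1] * Fr(c, ni))
        S.append(S[-1] * (1 - Fr(c + l, ni)))
    return S, F1

def ccr_direct(n, d1, d2):
    """CCR(t_i) = (F1(t_k) - F1(t_i)) / S(t_i); set to 0 once nobody is unresolved."""
    S, F1 = survival_and_incidence(n, d1, d2)
    return [(F1[-1] - f) / s if s else Fr(0) for s, f in zip(S, F1)]

def ccr_recursive(n, d1, d2):
    """Backward recursion, starting from CCR(t_k) = 0 at the last time point."""
    ccr = [Fr(0)]
    for ni, c, l in zip(reversed(n), reversed(d1), reversed(d2)):
        ccr.append(Fr(c, ni) + (1 - Fr(c + l, ni)) * ccr[-1])
    return ccr[::-1]

direct = ccr_direct(n_at_risk, cures, liquidations)
recursive = ccr_recursive(n_at_risk, cures, liquidations)
print(direct == recursive, direct[0])
```

The recursion only needs the counts at each time point, which is why irregular values of d^1/n at late times feed directly into the CCR estimates at earlier times.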

2.1 Model implementation

In order to estimate the CCR for each time point, Rabobank uses data which consists of mortgage defaults from the bank Lokaal Bank Bedrijf (LBB). Each default observation consists of the time the client spent in default and the status after the last month in default, which can be equal to cured, liquidated or unresolved. In the data we can also find the following variables:

• High LTV indicator,

• NHG indicator,

• Bridge loan indicator.

The variable LTV represents the Loan To Value ratio and it is calculated as

\[
LTV = \frac{\text{Mortgage amount}}{\text{Appraised value of the property}}.
\]

If LTV is higher than 100%, the part of the mortgage which has not been paid back by the client is larger than the value of the security. The indicator High LTV takes value 1 if LTV is high. NHG is an abbreviation for the National Mortgage Guarantee, in Dutch "Nationale Hypotheek Garantie". If a client with an NHG-backed mortgage cannot pay the mortgage due to specific circumstances, NHG will support both the bank and the client. If the client sells the house for less than the mortgage, NHG will cover the difference and consequently neither the client nor the bank will suffer the loss. Since both sides get support in case of liquidation, we can expect that the cause specific hazard rate for liquidation will be higher and the cause specific hazard rate for cure lower, and consequently the CCR lower.

Bridge loans are short-term loans that last between 2 weeks and 3 years. A client usually uses them until he finds larger, longer-term financing. This kind of loan provides immediate cash flow to a client at a relatively high interest rate. Such loans are usually riskier for a bank and consequently have higher interest rates.

2.1.1 Stratification

It can be seen that Rabobank has data from different clients with different variables, but it uses an estimator for CCR which is unable to incorporate those variables. Rabobank solves this problem with a method called stratification or segmentation. This method separates the original data frame into smaller data frames based on the variables and then calculates the CCR on each one of them. For instance, if segmentation is based on the variable "Bridge loan indicator", the original data frame will be separated into two data frames: the first contains only clients with bridge loans and the other only clients with non-bridge loans. Once this step is made, the CCR is calculated for each segment and two CCR estimates are obtained, one for clients with and the other for clients without a bridge loan. Rabobank separates the original data frame into four buckets, as can be seen in Figure 2.1. The segment with the most observations is Non-Bridge-Low LTV, which contains almost all observations from the original data. It is followed by the segment Non-Bridge-High LTV-non-NHG, which has about ten times fewer observations than the segment with Low LTV. The smallest segments are the segments with Bridge loans and Non-Bridge-High LTV-NHG, which have about 600 observations.
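The four-bucket segmentation described above can be sketched in a few lines. This is a minimal illustration, not Rabobank's implementation: the record layout and the field names `bridge`, `high_ltv` and `nhg` are hypothetical, and plain dictionaries stand in for data frame rows.

```python
from collections import defaultdict

def segment(obs):
    """Map one default observation to one of the four buckets of the segmentation."""
    if obs["bridge"]:
        return "Bridge"
    if not obs["high_ltv"]:
        return "Non-Bridge-Low LTV"
    if obs["nhg"]:
        return "Non-Bridge-High LTV-NHG"
    return "Non-Bridge-High LTV-non-NHG"

def stratify(observations):
    """Split the observations into one list per segment."""
    buckets = defaultdict(list)
    for obs in observations:
        buckets[segment(obs)].append(obs)
    return dict(buckets)

# Hypothetical observations (months in default, final status, indicators).
sample = [
    {"months": 5, "status": "cure", "bridge": False, "high_ltv": False, "nhg": False},
    {"months": 12, "status": "liquidated", "bridge": True, "high_ltv": True, "nhg": False},
    {"months": 7, "status": "unresolved", "bridge": False, "high_ltv": True, "nhg": True},
    {"months": 9, "status": "cure", "bridge": False, "high_ltv": True, "nhg": False},
    {"months": 3, "status": "cure", "bridge": False, "high_ltv": False, "nhg": True},
]
buckets = stratify(sample)
print({name: len(rows) for name, rows in buckets.items()})
```

A CCR estimator would then be applied to each list separately, which is why small buckets lead to noisy estimates.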


Figure 2.1: Segmentation of original data frame

2.2 Results

In this chapter the estimation of the survival quantities which are needed in order to model CCR will be looked into. In order to estimate the nonparametric cumulative incidence function we need the cause specific hazard rate for cure, which follows from equation (1.10). Hazard rates estimated with the dΛ(t) estimator are point processes, as can be seen in Figure 2.2. The two biggest segments behave regularly, while segments with fewer observations behave irregularly, with jumps at the end of the period. Big jumps from zero to above 0.02 occur because only a small number of individuals is at risk. For instance, at time 53 in the segment high LTV NHG the cause specific hazard rate takes the value 0.029. At that time only one cure happens, but there are only 34 individuals at risk, so the estimate is 1/34 ≈ 0.029.

The next step is the estimation of the survival function with the Kaplan-Meier estimator. Results can be found in Figure 2.3.

From the figure it is visible that the segment which becomes resolved at the lowest rate is the segment of clients with high LTV and without NHG, while clients with bridge loans become resolved at the highest rate.

Once we obtain the survival functions and cause specific hazard rates for cure we can model the cause specific incidence functions for cure, which can be found in Figure 2.4.

Results of the nonparametric estimator can be found in Figure 2.5. The curves are most irregular, with the most jumps, for the segments with the fewest observations. The explanation for this phenomenon can be found in the recursive part of equation (2.3): the big jumps happen because the cause specific hazard rates for cure are irregular.

Figure 2.2: Cause specific hazard rate for cure estimated with dΛ1(t)

Figure 2.3: Survival function estimated with Kaplan-Meier estimator

From Figure 2.5 it can be seen that the clients with low LTV have the highest CCR estimates. Clients without NHG have larger CCR estimates than clients with NHG, since the bank and the clients are more motivated to cure their defaults. From the figure it is not completely clear whether the clients with bridge loans or the clients with high LTV and without NHG have higher CCR estimates.

2.3 Performance of the method

In this chapter we will look into the variance, bias and confidence intervals of the nonparametric CCR estimator. Since the derivation of the asymptotic variance of the CCR estimator is out of the scope of this thesis, these quantities will be estimated with the method known as the bootstrap.


Figure 2.4: Survival functions estimated with Kaplan-Meier estimator

Figure 2.5: CCR estimated with non parametric estimators

2.3.1 The bootstrap

The bootstrap was introduced by Efron in 1979. The method is used to estimate the bias, variance and confidence intervals of estimators by resampling. In this chapter we will look at how this method is used for the estimation of bias, variance and confidence intervals. Later the method will be applied to mortgage data in order to estimate the previously mentioned quantities for each segment. For a review of bootstrap estimators and techniques see Fox (2016).

In order to bootstrap, data frames need to be resampled from the original data frame. These data frames are created by selecting random default observations from the original data frame. Let us assume that n random data frames will be simulated. Simulated data frames need to be of the same size as the original data frame, and the same observation may appear in a simulated data frame more than once. Once the simulated data frames are obtained, segmentation is done as in Figure 2.1. Finally, the estimates CCRi(t), i ∈ {1, 2, . . . , n}, are calculated for each segment and each time, as can be seen in Figure 2.6.

Figure 2.6: Estimation of quantities with bootstrap.

Once the estimates CCRi(t) for each data frame are obtained, the bootstrap mean $\overline{CCR}(t)$, t ∈ {0, 1, . . . , 58}, for each segment can be calculated as

\[
\overline{CCR}(t) = \frac{1}{n}\sum_{i=1}^{n} CCR_i(t).
\]

The bootstrap estimate for the variance of CCR(t) is equal to

\[
\frac{1}{n-1}\sum_{i=1}^{n}\left(CCR_i(t) - \overline{CCR}(t)\right)^2.
\]

An estimator $\hat\theta$ of θ is called unbiased if $E(\hat\theta) = \theta$; if it is biased, the bias of the estimator is defined as

\[
B_\theta = E(\hat\theta) - \theta.
\]

Bias of estimators is undesired. With the bootstrap the bias can be estimated by

\[
\overline{CCR}(t) - \widehat{CCR}(t),
\]

where $\widehat{CCR}(t)$ is the estimate of CCR from the original data frame.

For the estimation of confidence intervals a method called the bootstrap percentile confidence interval will be used. In order to obtain a 100(1 − α)% interval for fixed time t we take $CCR(t)_{\alpha/2,L}$, which denotes a value such that a fraction α/2 of the CCRi(t) lies below it. In a similar fashion $CCR(t)_{\alpha/2,R}$ denotes a value such that a fraction α/2 of the CCRi(t) lies above it. For a 100(1 − α)% confidence interval the following values are taken

\[
\left[CCR(t)_{\alpha/2,L},\; CCR(t)_{\alpha/2,R}\right].
\]
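The bootstrap quantities defined above (mean, variance, bias and percentile interval) can be sketched generically. The code below is an illustration under stated assumptions, not the implementation used for the thesis: the estimator is a simple cure share on hypothetical 0/1 data that stands in for the CCR at a fixed time t, and the percentile bounds use one simple index convention among several possible ones.

```python
import random
import statistics

def bootstrap(data, estimator, n_boot=500, alpha=0.05, seed=0):
    """Bootstrap mean, variance, bias and percentile interval of `estimator`."""
    rng = random.Random(seed)
    original = estimator(data)
    # Each replicate is a resample of the same size, drawn with replacement.
    reps = sorted(estimator(rng.choices(data, k=len(data))) for _ in range(n_boot))
    mean = statistics.mean(reps)
    cut = int(alpha / 2 * n_boot)  # number of replicates cut off on each side
    return {
        "mean": mean,
        "variance": statistics.variance(reps),  # divides by n_boot - 1
        "bias": mean - original,
        "interval": (reps[cut], reps[n_boot - 1 - cut]),
    }

# Toy data: 1 = cured, 0 = not cured (hypothetical observations).
toy = [1] * 12 + [0] * 28
result = bootstrap(toy, lambda sample: sum(sample) / len(sample))
print(result)
```

In the setting of this chapter the whole data frame would be resampled first and segmented afterwards, so `estimator` would include the segmentation step.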

Rabobank decides whether it makes sense to make a segmentation or not based on the size of the confidence intervals. If two curves lie in each other's 95% confidence intervals, then the CCR curves are pragmatically treated as the same. If we look at the CCR curves for each segment, as can be seen in Figure 2.7, we can see that the segments LBB-Low LTV and LBB-High LTV-non-NHG are treated as significantly different, while LBB-Low LTV and Bridge are not. Furthermore, segments with a bigger population have narrow confidence intervals, while segments with a small population have wide confidence intervals. It follows that nonparametric CCR estimators are not a good choice when the CCR has to be estimated on a small population.

Figure 2.7: Estimation of confidence intervals with the bootstrap.

From Figure 2.8 it can be seen that the biggest variance is obtained for the segments with the smallest population and that the variance grows with time. This happens because of the variability of the term $d^1_{i+1}/n_{i+1}$ from the recursive part of equation (2.3), as can be seen from Figure 2.2. Since there are no cures at the end of the observation period in the Bridge and LBB-High LTV-NHG segments, CCRi(t) always takes the value zero there. The same holds for the bootstrap mean of the CCRi(t). Consequently, the variance estimated with the bootstrap is equal to zero for the CCR estimates at the end of the observation period.

Figure 2.8: Estimation of variance with bootstrap.

In Figure 2.9 the bias of the nonparametric CCR estimation can be found. If we compare the bias with the size of the CCR estimates from Figure 2.5, it can be concluded that it is more than 100 times smaller than the CCR itself and that the estimator is not problematically biased.

Figure 2.9: Estimation of bias with the bootstrap.


3 Cox proportional hazards model

Until now we have been looking only into nonparametric estimators of survival and hazard rate functions, which are unable to incorporate additional variables. Rabobank uses a method called stratification in order to estimate the CCR curve of individuals with different covariates, which has certain shortcomings. Since a large part of the data consists of unresolved (censored) cases, we will look for a suitable regression method from survival analysis.

At the beginning of this chapter we review the theory behind the so-called Cox model. In Section 3.1 we will see how to estimate the coefficients of the explanatory variables and the baseline function. In Section 3.2 the formula for the cumulative incidence function will be derived. In Section 3.3 it will be explained how to estimate the coefficients in order to get results comparable with segmentation. In Section 3.4 we will look into the quantities estimated with the Cox model that are needed in order to get CCR estimates. In Section 3.5 the performance of the method will be analyzed with the bootstrap. In the last section of this chapter the method will be compared with the nonparametric estimator.

In order to model CCR(t) we need a model which is able to model the cause specific hazard rate. Since we expect different behavior from clients with different covariates, we will look into one of the most popular regression models in survival analysis. The Cox proportional hazards model was presented in 1972 by Sir David Cox. For a review of the Cox model see Kalbfleisch and Prentice (2002), Lawless (2002) and Weng (2007). A hazard rate modeled with the Cox model takes the following form

\[
\lambda(t|X) = \lambda_0(t)\exp(\beta^T X), \tag{3.1}
\]

where β = (β1, β2, . . . , βp) is a vector of coefficients which represents the influence of the covariates X = (X1, . . . , Xp) on the hazard rate function λ(t|X). We denote the covariates of individual i by Xi. The baseline hazard function is denoted by λ0(t) and it can take any form, Weng (2007). The baseline hazard function can be interpreted as the hazard rate of an individual whose covariates are all equal to zero. In a similar way the baseline survival function can be defined as

\[
S_0(t) = \exp\left(-\int_0^t \lambda_0(s)\,ds\right), \tag{3.2}
\]

according to equation (1.3). The survival function of individual j then takes the following form

\[
S(t|X_j) = \left[S_0(t)\right]^{\exp(\beta^T X_j)}, \tag{3.3}
\]

Corrente, Chalita, and Moreira (2003). Since the baseline function can take on any form, the Cox model is a semi-parametric estimator of the hazard rate. In a similar way the cause specific hazard rate for cause i,

\[
\lambda_i(t|X) = \lambda^0_i(t)\exp(\beta_i^T X), \quad i = 1, 2,
\]

can be modeled. Here βi represents the effect of the covariate vector X on the cause specific hazard rate. The function $\lambda^0_i(t)$ represents a cause specific baseline function.

In the following subsections we will look at how to estimate parameters in the Cox model.
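The link between the Cox hazard and the survival function raised to the power exp(β^T X) can be verified on a toy case. The sketch below assumes a constant baseline hazard and made-up coefficients, purely for illustration; the function names are not taken from any library.

```python
import math

def linear_predictor(beta, x):
    """beta^T X for plain Python sequences."""
    return sum(b * xi for b, xi in zip(beta, x))

def cox_hazard(baseline_hazard, beta, x):
    """lambda(t|X) = lambda_0(t) * exp(beta^T X)."""
    return baseline_hazard * math.exp(linear_predictor(beta, x))

def cox_survival(baseline_survival, beta, x):
    """S(t|X) = S_0(t) ** exp(beta^T X)."""
    return baseline_survival ** math.exp(linear_predictor(beta, x))

# Hypothetical numbers: constant baseline hazard of 0.1 per month, two covariates.
beta, x, t = [0.2, -0.1], [1, 1], 3.0
s0 = math.exp(-0.1 * t)                                # baseline survival
via_hazard = math.exp(-cox_hazard(0.1, beta, x) * t)   # integrate the scaled hazard
via_power = cox_survival(s0, beta, x)                  # raise S_0 to the power
print(via_hazard, via_power)
```

Both routes give the same survival probability, which is the proportional hazards property in its simplest form.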

3.1 Parameter estimation

Since the proportional hazards model is a semi-parametric model, the β coefficients and the baseline hazard function need to be estimated. In Section 3.1.1 we will look into the partial likelihood and how to estimate the coefficients. In Section 3.1.2 we will introduce the Breslow estimator of the baseline function. For a review of baseline and β estimation see Weng (2007).

3.1.1 Estimation of β

Firstly, it will be assumed that the data consist of n individuals and that each individual has a different observation time ti, i.e. there are no ties in the data. These observation times are ordered in ascending order, so t1 < t2 < · · · < tn. In 1972 Cox proposed to estimate β using the partial likelihood, Weng (2007). The partial likelihood of individual i, Li, is simply the hazard rate of individual i divided by the sum of the hazard rates of all individuals that are at risk at time ti, i.e. of all j ∈ N(ti):

\begin{align}
L_i &= \frac{\lambda(t_i|X_i)}{\sum_{j\in N(t_i)}\lambda(t_i|X_j)} \tag{3.4} \\
&= \frac{\exp(\beta^T X_i)}{\sum_{j\in N(t_i)}\exp(\beta^T X_j)}. \tag{3.5}
\end{align}

The partial likelihood of individuals that were censored is equal to 1. It follows that the partial likelihood function of the data is equal to

\[
PL(\beta) = \prod_{i=1}^{n}\left(\frac{\exp(\beta^T X_i)}{\sum_{j\in N(t_i)}\exp(\beta^T X_j)}\right)^{\delta_i}. \tag{3.7}
\]

Instead of maximizing PL we maximize log(PL) = pl,

\[
pl(\beta) = \sum_{i:\delta_i=1}\left[\beta^T X_i - \ln\left(\sum_{j\in N(t_i)}\exp(\beta^T X_j)\right)\right].
\]
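The log partial likelihood can be evaluated directly for a small data set. The sketch below is a minimal illustration with one covariate, no tied event times and hypothetical data. The ternary search is a simple stand-in optimiser that relies on pl being concave in β; it is not the routine used in the thesis, which relied on the Statsmodels package.

```python
import math

def log_partial_likelihood(beta, times, events, x):
    """pl(beta) for a single covariate and no tied event times."""
    pl = 0.0
    for i, (t_i, delta_i) in enumerate(zip(times, events)):
        if delta_i == 1:
            # Risk set N(t_i): everyone whose observed time is at least t_i.
            risk = sum(math.exp(beta * x[j]) for j, t_j in enumerate(times) if t_j >= t_i)
            pl += beta * x[i] - math.log(risk)
    return pl

def maximize_concave(f, lo=-5.0, hi=5.0, iterations=100):
    """Ternary search for the maximiser of a concave function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Hypothetical data: observed times, event indicators (1 = event, 0 = censored),
# and a binary covariate.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 0, 1, 1, 0]
x = [1, 0, 1, 0, 1, 0]

beta_hat = maximize_concave(lambda b: log_partial_likelihood(b, times, events, x))
print(round(beta_hat, 3))
```

Censored individuals never contribute a term of their own, but they do appear in the risk sets of earlier event times.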


3.1.2 Baseline hazard estimation

Breslow proposed the following estimator for the baseline hazard function. We still assume that we do not have ties in our data, so t1 < t2 < · · · < tn. The estimator is constant on subintervals where no event happened, Weng (2007). Rabobank's data shows whether an event of interest happened or not at the end of each month. That means that if an observation has observed time ti, the customer was cured, liquidated or censored in the interval (ti − 1, ti]. In the case of an event of interest, the Breslow baseline function will be constant on the interval (ti − 1, ti], where it will take the value $\lambda^0_i$. As in equation (1.2) and by using the equality f(t) = λ(t)S(t), we get that

\[
L(\beta, \lambda_0(t)) \propto \prod_{i=1}^{n}\lambda(t_i|X_i)^{\delta_i}\,S(t_i|X_i).
\]

Using equation (1.3) we get that

\[
L(\beta, \lambda_0(t)) = \prod_{i=1}^{k}\left(\lambda_0(t_i)\exp(\beta X_i)\right)^{\delta_i}\exp\left(-\int_0^{t_i}\lambda_0(s)\exp(\beta X_i)\,ds\right).
\]

If we take ti + 1 = ti+1, so equidistant ti, we get that

\[
\int_0^{t_i}\lambda_0(s)\,ds = \sum_{j=1}^{i}\lambda^0_j.
\]

Taking the logarithm of L gives us

\[
l(\beta, \lambda_0(t)) = \sum_{i=1}^{k}\delta_i\left(\ln(\lambda^0_i) + \beta X_i\right) - \sum_{i=1}^{k}\lambda^0_i\sum_{j\in N(t_i)}\exp(\beta X_j). \tag{3.8}
\]

Once we obtain β from the partial likelihood, we insert it into l. From the second term in (3.8) we see that a $\lambda^0_i$ with δi = 0 can only decrease l, so only the $\lambda^0_i$ with δi = 1 can contribute positively. Consequently, we take $\lambda^0_i = 0$ for all i with δi = 0. Going through the steps above we get

\[
l(\lambda_0(t)) = \sum_{i:\delta_i=1}\left(\ln(\lambda^0_i) + \beta X_i\right) - \sum_{i:\delta_i=1}\lambda^0_i\sum_{j\in N(t_i)}\exp(\beta X_j).
\]

Differentiation with respect to $\lambda^0_i$ gives us that l(λ0(t)) for t ∈ (ti−1, ti] is maximized by

\[
\lambda_0(t_i) = \lambda^0_{t_i} = \frac{1}{\sum_{j\in N(t_i)}\exp(\beta X_j)},
\]

Weng (2007). In continuous time it is impossible for two individuals to have the same observed time, but in practice this will most likely happen, since we usually check the state of individuals at monthly intervals. Consequently, many individuals will have the same observed times. It follows that a different estimator for the baseline function and a different partial likelihood function have to be used. In 1974 Breslow proposed the following partial likelihood

\[
L(\beta) = \prod_{i=1}^{k}\frac{\exp(\beta X_i^+)}{\left(\sum_{j\in N(t_i)}\exp(\beta X_j)\right)^{d_i}},
\]

where di is the number of events at time ti and $X_i^+$ is the sum of the covariates Xj of the individuals who experience the event at time ti. Using the same methodology as when there are no ties in the data, we get that the Breslow baseline for ties in the data is equal to

\[
\lambda_0(t_i) = \frac{d_i}{\sum_{j\in N(t_i)}\exp(\beta X_j)}. \tag{3.9}
\]
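The Breslow baseline with ties is a ratio of an event count and a weighted risk set, which is easy to sketch. The code below is illustrative only: it handles a single covariate, the data are hypothetical, and the function name is made up rather than taken from a library.

```python
import math

def breslow_baseline(times, events, x, beta):
    """Return {t_i: d_i / sum over the risk set of exp(beta * x_j)} per event time."""
    hazard = {}
    for t in sorted({tj for tj, e in zip(times, events) if e == 1}):
        d = sum(1 for tj, e in zip(times, events) if tj == t and e == 1)
        # Risk set: everyone whose observed time is at least t.
        risk = sum(math.exp(beta * xj) for tj, xj in zip(times, x) if tj >= t)
        hazard[t] = d / risk
    return hazard

# Hypothetical monthly data with tied observation times.
times = [1, 2, 2, 3, 3, 3]
events = [0, 1, 1, 1, 0, 1]   # 1 = event of interest, 0 = censored
x = [0, 1, 0, 1, 0, 0]

print(breslow_baseline(times, events, x, beta=0.0))
print(breslow_baseline(times, events, x, beta=0.5))
```

With β = 0 every weight equals one and the estimate reduces to the number of events divided by the number at risk, the same ratio used for the nonparametric hazard increments.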

3.2 Estimators

From Chapter 2 we know that we have to be able to model the cause specific hazard rates, the survival function and the cumulative incidence function in order to estimate CCR. For the estimation of the survival function the identity (1.7) will be used. In order to model the cumulative incidence function we have to integrate equation (1.9), which is possible by using the fact that the baseline hazard rate is a step function. For t ∈ (ti−1, ti] we get

\[
\begin{aligned}
F_1(t) &= \int_0^t \lambda_1(s)S(s)\,ds \\
&= \int_0^t \lambda_1(s)\exp\left(-\int_0^s(\lambda_1(u) + \lambda_2(u))\,du\right)ds \\
&= F_1(t_{i-1}) + \int_{t_{i-1}}^{t}\lambda_1(s)\exp\left(-\int_0^s(\lambda_1(u) + \lambda_2(u))\,du\right)ds.
\end{aligned}
\tag{3.10}
\]

Firstly let us look into the integral $\int_0^s(\lambda_1(u) + \lambda_2(u))\,du$. We know that each λk(u), k = 1, 2, is a step function, which takes the value $\lambda_{k,t_i}$ on the interval (ti−1, ti]. For s ∈ (ti−1, ti] it follows that

\[
\begin{aligned}
\int_0^s(\lambda_1(u) + \lambda_2(u))\,du &= \sum_{j=1}^{i-1}(\lambda_{1,t_j} + \lambda_{2,t_j}) + (s - t_{i-1})(\lambda_{1,t_i} + \lambda_{2,t_i}) \\
&= \Lambda_1(t_{i-1}) + \Lambda_2(t_{i-1}) + (s - t_{i-1})(\lambda_{1,t_i} + \lambda_{2,t_i}).
\end{aligned}
\tag{3.11}
\]

From equations (3.10) and (3.11) it follows that

\[
\begin{aligned}
F_1(t) &= F_1(t_{i-1}) + \lambda_{1,t_i}\int_{t_{i-1}}^{t}\exp\left(-\left(\Lambda_1(t_{i-1}) + \Lambda_2(t_{i-1}) + (s - t_{i-1})(\lambda_{1,t_i} + \lambda_{2,t_i})\right)\right)ds \\
&= F_1(t_{i-1}) + \frac{\lambda_{1,t_i}}{\exp(\Lambda_1(t_{i-1}) + \Lambda_2(t_{i-1}))}\int_{t_{i-1}}^{t}\exp\left((t_{i-1} - s)(\lambda_{1,t_i} + \lambda_{2,t_i})\right)ds \\
&= F_1(t_{i-1}) + \frac{\lambda_{1,t_i}}{\exp(\Lambda_1(t_{i-1}) + \Lambda_2(t_{i-1}))(\lambda_{1,t_i} + \lambda_{2,t_i})}\left[-\exp\left((t_{i-1} - s)(\lambda_{1,t_i} + \lambda_{2,t_i})\right)\right]_{s=t_{i-1}}^{s=t} \\
&= F_1(t_{i-1}) + \frac{\lambda_{1,t_i}\left(1 - \exp\left((t_{i-1} - t)(\lambda_{1,t_i} + \lambda_{2,t_i})\right)\right)}{\exp(\Lambda_1(t_{i-1}) + \Lambda_2(t_{i-1}))(\lambda_{1,t_i} + \lambda_{2,t_i})}.
\end{aligned}
\tag{3.12}
\]
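The closed form for one step of the cumulative incidence function can be checked against a numerical integral of λ1(s)S(s) over the same interval. The sketch below is only a sanity check: the hazard values are illustrative numbers, the midpoint rule is a deliberately simple integration scheme, and the function names are made up.

```python
import math

def cif_step(f_prev, lam1, lam2, cum_prev, dt):
    """Closed-form value of F1 after moving dt into an interval with constant hazards.

    f_prev   : F1 at the left end of the interval
    cum_prev : Lambda_1 + Lambda_2 at the left end of the interval
    """
    total = lam1 + lam2
    if total == 0.0:
        return f_prev  # no hazard on this interval, so F1 does not move
    return f_prev + lam1 * (1.0 - math.exp(-dt * total)) / (math.exp(cum_prev) * total)

def cif_step_numeric(f_prev, lam1, lam2, cum_prev, dt, steps=100_000):
    """Midpoint-rule integral of lam1 * S(s) over the same interval."""
    h = dt / steps
    acc = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        acc += lam1 * math.exp(-(cum_prev + u * (lam1 + lam2))) * h
    return f_prev + acc

# Illustrative values only.
closed = cif_step(0.1, 0.268, 0.115, 0.321, 1.0)
numeric = cif_step_numeric(0.1, 0.268, 0.115, 0.321, 1.0)
print(closed, numeric)
```

The guard for a zero total hazard avoids a division by zero on intervals without events, where the cumulative incidence function stays flat.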

Let us continue the example from Figure 1.5 and assume that every individual also has a variable LTV. We will define a new variable "HighLTV", which takes value 1 if LTV is high, as shown in Figure 3.1. In this example the variable HighLTV will be included in the regression.

Figure 3.1: Example data

In order to estimate the coefficient β1, which explains the influence of the variable HighLTV on the cause specific hazard rate for cure, we model cured individuals as individuals who experienced the event of interest. Other individuals are considered as censored. In the same way the coefficient for the cause specific hazard rate for liquidation, β2, is estimated. For the estimation of the parameters β1 and β2 we used the Python Statsmodels package. This gave the following result for cure,

\[
\beta_1 = 0.205,
\]

and for liquidation,

\[
\beta_2 = -0.110.
\]

Since no cures occurred in the interval (0, 1], the Breslow estimator for the cause specific baseline function for cure gives us the following value for t ∈ (0, 1]

\[
\lambda^0_1(t) = \frac{0}{\sum_{j\in\{1,2,3,4,5,6,7,8,9,10\}}\exp(\beta_1 X_j)} = 0.
\]

Since one event happened on the interval (1, 2], the cause specific baseline function for t ∈ (1, 2] is equal to

\[
\lambda^0_1(t) = \frac{1}{\sum_{j\in\{1,2,4,5,6,7,8,9,10\}}\exp(\beta_1 X_j)} = 0.103.
\]

In a similar fashion we get

\[
\lambda^0_1(t) =
\begin{cases}
\frac{2}{7.455} = 0.268 & t \in (2, 3] \\
0 & t \in (3, \infty).
\end{cases}
\]

For the baseline function for liquidation we get the following values

\[
\lambda^0_2(t) =
\begin{cases}
0.103 & t \in (0, 1] \\
0.115 & t \in (1, 2] \\
0 & t \in (2, 3] \\
0.408 & t \in (3, 4] \\
0 & t \in (4, \infty).
\end{cases}
\]

Once the cause specific baseline hazard rates for cure are calculated, we can multiply them by exp(β1 HighLTV), HighLTV ∈ {0, 1}, in order to get the cause specific hazard rates for cure of individuals with LTV higher and lower than 1.2. For individuals with HighLTV = 1 we get the following values

\[
\lambda_1(t|1) =
\begin{cases}
0 & t \in (0, 1] \\
0.127 & t \in (1, 2] \\
0.329 & t \in (2, 3] \\
0 & t \in (3, 4] \\
0 & t \in (4, \infty).
\end{cases}
\]

Since exp(β1 · 0) = 1 the following identity holds

\[
\lambda_1(t|0) = \lambda^0_1(t).
\]

Figure 3.2: Cause specific hazard rate for cure

By taking the same steps as above the cause specific hazard rate for loss can be calculated and the following results are obtained:

\[
\lambda_2(t|1) =
\begin{cases}
0.093 & t \in (0, 1] \\
0.103 & t \in (1, 2] \\
0 & t \in (2, 3] \\
0.366 & t \in (3, 4] \\
0 & t \in (4, \infty)
\end{cases}
\]

and

\[
\lambda_2(t|0) = \lambda^0_2(t).
\]
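The step from baseline hazards to hazards for HighLTV = 1 is a single multiplication by exp(β HighLTV). The sketch below applies it to the baseline values quoted in the example. Because those baseline values are already rounded to three decimals, the results should be expected to agree with the quoted hazards only up to roughly 0.001; the variable names are illustrative.

```python
import math

beta_cure, beta_liq = 0.205, -0.110

# Baseline step hazards on (0,1], (1,2], (2,3], (3,4], as quoted (rounded) in the text.
baseline_cure = [0.0, 0.103, 0.268, 0.0]
baseline_liq = [0.103, 0.115, 0.0, 0.408]

def scale(baseline, beta, high_ltv):
    """lambda_i(t | HighLTV) = lambda_i^0(t) * exp(beta_i * HighLTV)."""
    return [lam * math.exp(beta * high_ltv) for lam in baseline]

cure_high = scale(baseline_cure, beta_cure, 1)
liq_high = scale(baseline_liq, beta_liq, 1)
print([round(v, 3) for v in cure_high])
print([round(v, 3) for v in liq_high])
```

Setting `high_ltv` to 0 returns the baseline unchanged, which is the identity stated for individuals with low LTV.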

The hazard rate for liquidation for HighLTV = 1 is almost equal to the one for HighLTV = 0, because β2 is small in absolute value. In order to calculate the survival function, equation (1.7) will be used. Since the time steps are of length 1 and the cause specific hazard rate is a step function, the following identity holds for t ∈ (tk−1, tk]

\[
\Lambda_i(t|X) = \Lambda_i(t_{k-1}|X) + (t - t_{k-1})\lambda_i(t_k|X).
\]

The identity above gives us the following functions of t for the cause specific cumulative hazard functions for cure

\[
\Lambda_1(t|0) =
\begin{cases}
0 & t \in (0, 1] \\
(t - 1)\cdot 0.103 & t \in (1, 2] \\
0.103 + (t - 2)\cdot 0.268 & t \in (2, 3] \\
0.371 & t \in (3, \infty)
\end{cases}
\]

and for HighLTV=1,

\[
\Lambda_1(t|1) =
\begin{cases}
0 & t \in (0, 1] \\
(t - 1)\cdot 0.127 & t \in (1, 2] \\
0.127 + (t - 2)\cdot 0.329 & t \in (2, 3] \\
0.456 & t \in (3, \infty).
\end{cases}
\]

Figure 3.3: Cause specific cumulative function for cure

Figure 3.4: Cause specific cumulative hazard rate function for liquidation


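The cause specific cumulative hazard functions for loss are

\[
\Lambda_2(t|0) =
\begin{cases}
t\cdot 0.103 & t \in (0, 1] \\
0.103 + (t - 1)\cdot 0.115 & t \in (1, 2] \\
0.218 & t \in (2, 3] \\
0.218 + (t - 3)\cdot 0.408 & t \in (3, 4] \\
0.626 & t \in (4, \infty)
\end{cases}
\]

and for HighLTV=1,

\[
\Lambda_2(t|1) =
\begin{cases}
t\cdot 0.093 & t \in (0, 1] \\
0.093 + (t - 1)\cdot 0.103 & t \in (1, 2] \\
0.196 & t \in (2, 3] \\
0.196 + (t - 3)\cdot 0.366 & t \in (3, 4] \\
0.562 & t \in (4, \infty).
\end{cases}
\]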

The cause specific cumulative hazard functions for cure and loss can be seen in Figures 3.3 and 3.4.

After the cause specific cumulative hazard functions for cure and liquidation are determined, the identity (1.7) can be used for the estimation of the survival function. We get

\[
S(t|0) =
\begin{cases}
\exp(-t\cdot 0.103) & t \in (0, 1] \\
\exp(-(0.103 + (t - 1)(0.115 + 0.103))) & t \in (1, 2] \\
\exp(-(0.321 + (t - 2)\cdot 0.268)) & t \in (2, 3] \\
\exp(-(0.589 + (t - 3)\cdot 0.408)) & t \in (3, 4] \\
\exp(-0.997) & t \in (4, \infty)
\end{cases}
\]

and for HighLTV=1 we get

\[
S(t|1) =
\begin{cases}
\exp(-t\cdot 0.093) & t \in (0, 1] \\
\exp(-(0.093 + (t - 1)(0.127 + 0.103))) & t \in (1, 2] \\
\exp(-(0.323 + (t - 2)\cdot 0.329)) & t \in (2, 3] \\
\exp(-(0.652 + (t - 3)\cdot 0.366)) & t \in (3, 4] \\
\exp(-1.018) & t \in (4, \infty).
\end{cases}
\]
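The piecewise survival curves follow mechanically from the step hazards: integrate each hazard up to t and exponentiate minus the sum of the two cumulative hazards. The sketch below does this for the hazard values of the example, assuming unit-length intervals starting at zero; the function names are illustrative.

```python
import math

def cumulative_hazard(t, step_hazards):
    """Integral of a step hazard; step_hazards[k] applies on (k, k+1], zero afterwards."""
    total = 0.0
    for k, lam in enumerate(step_hazards):
        if t <= k:
            break
        total += lam * (min(t, k + 1) - k)
    return total

def survival(t, cure_hazards, liq_hazards):
    """S(t) = exp(-(Lambda_1(t) + Lambda_2(t)))."""
    return math.exp(-(cumulative_hazard(t, cure_hazards)
                      + cumulative_hazard(t, liq_hazards)))

# Step hazards on (0,1], (1,2], (2,3], (3,4] taken from the example.
cure_low, liq_low = [0.0, 0.103, 0.268, 0.0], [0.103, 0.115, 0.0, 0.408]
cure_high, liq_high = [0.0, 0.127, 0.329, 0.0], [0.093, 0.103, 0.0, 0.366]

for t in (1, 2, 3, 4, 5):
    print(t, round(survival(t, cure_low, liq_low), 4), round(survival(t, cure_high, liq_high), 4))
```

Evaluating the function at any t beyond the last interval gives the constant tail of each curve, since both hazards are zero there.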

The survival curve estimates can be seen in Figure 3.5.

After the survival curves and hazard rates are calculated, we can estimate the cumulative incidence functions using equation (3.12). The calculated cumulative incidence functions can be seen in Figure 3.6.


Figure 3.5: Survival function calculated with the Cox model

Figure 3.6: Cumulative incidence function calculated with the Cox proportional model

3.3 Model implementation

For the estimation of CCR using the Cox model, the same data will be used as in the previous chapter. Rabobank is still interested in the behavior of four groups: clients with Bridge loans; clients with non-Bridge loans and low LTV; clients with non-Bridge loans, high LTV and no NHG; and clients with non-Bridge mortgages, high LTV and NHG.

Firstly, it needs to be understood that the coefficient estimate of one risk factor depends on all variables that are included in the regression. For instance, if we compare the parameter $\beta^{hLTV}_{HighLTV}$, obtained when Cox regression is performed with the variable HighLTV only, to $\beta^{hLTV,Br}_{HighLTV}$, obtained when both HighLTV and the variable Bridge are included in the regression, it can be seen that $\beta^{hLTV}_{HighLTV} \neq \beta^{hLTV,Br}_{HighLTV}$.

Secondly, if one is interested in the behavior of clients with bridge loans, then a regression with only the variable Bridge needs to be made. If we included the variable HighLTV as well, then we would not be able to calculate a CCR curve for Bridge loans as a whole. We would only be able to calculate CCR curves for clients which have Bridge loans and high LTV, or for clients which have Bridge loans and low LTV, since XhighLTV can take only the values 0 and 1.

It follows that Cox regression needs to be done three times with three different variable combinations, as can be seen in Figure 3.7.

Figure 3.7: Input variables and output coefficients

Once the coefficients are obtained, a matching covariate needs to be used for the hazard rate calculation of the desired segment, as can be observed from Figure 3.8.

Figure 3.8: Output coefficients multiplied with covariates


3.4 Results

In this chapter all quantities that are needed in order to obtain CCR estimates for each segment will be looked into. Since we want to obtain CCR estimates for each segment, the first step is coefficient estimation for all combinations of variables. The coefficients of the cause specific hazard rates for cure estimated with Cox regression can be found in the table in Figure 3.9, and those for loss in Figure 3.10.

Figure 3.9: Output coefficients for cure

Figure 3.10: Output coefficients for liquidation

From the figures it can be seen that the indicator variable for bridge loan and the indicator variable for high LTV have a negative effect on the cause specific hazard rates for cure. In other words, individuals which have an LTV of more than 1.2 or a Bridge loan have a smaller hazard rate for cure and consequently a smaller probability of being cured. Individuals with NHG will be cured slightly faster than individuals without it. On the other hand, it can be seen that individuals which have high LTV, non-Bridge loans and NHG will be liquidated faster than individuals without NHG. Once the estimates of the coefficients are obtained we are able to model the cause specific hazard rates for cure and liquidation, which can be seen in Figures 3.11 and 3.12.

It is seen that the segment which is cured at the fastest rate is the segment of clients with low LTV, while the other three segments have similar hazard rates, which become almost the same after the 30th month in default. These segments behave differently when the cause specific hazard rate for liquidation is modeled. We can conclude that liquidation happens at the lowest rate for clients with low LTV. This probably happens because clients with low LTV want to keep their properties. The segment with the highest rate of liquidation is the segment with bridge loans.

Figure 3.11: Cause specific hazard rate for cure

Figure 3.12: Cause specific hazard rate for liquidation

In order to calculate the survival function for the segments with Cox regression, equation (1.7) is used. Since higher cause specific hazard rates lead to a lower survival function, the segment with the highest survival function is the one of clients which have high LTV and do not have NHG. In Figure 3.12 it is visible that this segment has the second smallest cause specific hazard rate for liquidation, while Figure 3.11 shows that its cause specific hazard rates for cure are almost the same as the smallest hazard rates. With the same reasoning the behaviour of the other groups can also be explained.

An intuitive explanation for the shapes of the survival curves is that clients with high cause specific hazard rates will be resolved faster and consequently the probability of being unresolved at time t becomes smaller.

Figure 3.13: Survival functions calculated with the Cox model

Before CCR is modeled we need to look into the estimation of the cumulative incidence function, for which equation (3.12) is needed. From Figure 3.14 it can be concluded that the highest probabilities of being cured are achieved by individuals with low LTV. Since the other segments have similar estimates for the cause specific hazard rates for cure, their cumulative incidence functions are ordered in the same way as the estimates of the cause specific hazard rates for liquidation.

Figure 3.14: Cumulative incidence functions for cure calculated with the Cox model

Finally all estimates from above can be used to estimate the CCRs, which can be seen in Figure 3.15.

3.5 Performance of the method

In Section 2.3 it was seen that the nonparametric CCR estimator has large variance and wide confidence intervals at the end of the observation period. That happens because of the jumps of the hazard rates that are estimated with the Nelson-Aalen estimator. In this section the methodology described in Section 2.3.1 will be used again in order to estimate the confidence intervals, variance and bias of CCR estimated with Cox regression.
From Figure 3.16 it can be seen that Cox regression estimators have narrower confidence


Figure 3.15: CCR estimated with the Cox model

intervals than the nonparametric estimator but, on the other hand, the estimates for each segment are the same at the end of the observation period and consequently not significantly different.
In Figure 3.17 it is visible that the variance is almost 10 times smaller than the variance

Figure 3.16: Confidence intervals estimated with the bootstrap

of the nonparametric estimator and that the small number of observations at the end of the observation period has almost no effect on the variance. In Figure 3.18 it is shown that the bias is smaller and that time also has almost no influence on the bias.

3.6 Discussion

From Section 3.5 it can be concluded that the Cox model gives us lower variance, less bias and narrower confidence intervals than nonparametric estimators. All these properties are definitely desirable features of an estimator.


Figure 3.17: Variance of CCR estimated with the Cox regression

Figure 3.18: Estimation of bias with the bootstrap

At the same time there exist systematic techniques for variable selection in the Cox model. It can be decided much more easily whether a variable should be included in or excluded from a segmentation than with the method based on 95% confidence intervals, which is used in the case of nonparametric estimators. Also the β's which are output from the regression have some explanatory power. When the coefficient of a boolean variable is negative, it is clear from equation (3.1) that a client with that property will have a smaller hazard rate than a client without it, and consequently a higher survival function and a smaller probability of resolution. The opposite happens when β is positive. Consequently, the tables which can be found in Figures 3.9 and 3.10 could be a helpful instrument for deciding whether a client should get a mortgage or what the interest rate of a client should be. Such a tool is not available when we operate with nonparametric estimators.
One of the desired properties of an estimator in Risk Management is understandability, which is not a property of the Cox model. All nonparametric


estimators which are part of the CCR estimator are intuitively easier to understand than the estimators used in the Cox model, especially if we compare the explanations of the cumulative incidence functions. Hazard rates estimated with the Nelson-Aalen estimator are closely related to an empirical probability, a simple counting process, while formula (3.1) is closely related to the exponential distribution. The latter makes sense only to a person who is better educated in probability.
If time and computational power are important factors in deciding which estimator to use, then we would have to choose the nonparametric estimators, since they work faster and use less memory.


4 Proportional hazards model in an interval censored setting

Up until now CCR was modeled assuming that we have continuous time data and that the events of interest happen exactly at the observed time. In reality, however, an event with observed time $t_i$ actually happened in the interval $(t_{i-1}, t_i]$. Consequently, we are dealing with another type of censoring, interval censoring. In order to model CCR in an interval censoring setting we have to use a different approach and a different likelihood function. For a review of the proportional hazards model in an interval censored setting see Corrente, Chalita, and Moreira (2003).
In Section 4.1 we will get familiar with the theory behind survival analysis in the interval censoring setting. In Section 4.1.1 generalized linear models, which are needed for the modeling of CCR, will be presented. In Section 4.2 we will learn how to use binning in the construction of time intervals and make computations with the generalized linear models computationally feasible. In Section 4.3 we review the estimated quantities in the interval censoring setting. In Section 4.4 the performance of the method will be analyzed with the bootstrap. Which of the three estimators is the best will be discussed in Section 4.5.

4.1 Theoretical background

When we have interval censored data, the time axis is divided into smaller subintervals $I_i = [a_{i-1}, a_i)$, where $0 = a_0 < a_1 < \cdots < a_k = \infty$. The set of subjects that defaulted in interval $I_i$ will be denoted by $D_i$ and $N_i$ will denote the set of subjects which are at risk at the beginning of the interval $I_i$. The indicator variable $\delta_{ji}$, $j \in \{1, 2, \ldots, n\}$, $i \in \{1, 2, \ldots, k\}$, takes value 1 if subject $j$ failed in interval $I_i$ and value 0 if subject $j$ was still alive at the end of interval $I_i$ or was censored in interval $I_i$. For instance, if subject $j$ died in the interval $I_3$, the following equality will hold

$$(\delta_{j1}, \delta_{j2}, \delta_{j3}) = (0, 0, 1).$$

The value $p(a_i \mid X_j)$ equals the conditional probability that the subject with covariate $X_j$ has already experienced an event at time $a_i$, given that the individual has not experienced the event of interest at $a_{i-1}$. In order to derive the likelihood function in an


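interval censoring setting the following two identities are needed

$$
\begin{aligned}
P(T_j \in I_i \mid X_j) &= P(a_{i-1} \le T_j < a_i \mid X_j) \\
&= S(a_{i-1} \mid X_j) - S(a_i \mid X_j) \\
&= [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_{i-1} \mid X_j))] - [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_i \mid X_j))] \\
&= [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_{i-1} \mid X_j))] \, (1 - (1 - p(a_i \mid X_j))) \\
&= [(1 - p(a_1 \mid X_j)) \cdots (1 - p(a_{i-1} \mid X_j))] \, p(a_i \mid X_j)
\end{aligned}
\tag{4.1}
$$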

and

$$
\begin{aligned}
P(T_j > I_i \mid X_j) &= P(T_j > a_i \mid X_j) \\
&= S(a_i \mid X_j) \\
&= (1 - p(a_1 \mid X_j)) \cdots (1 - p(a_i \mid X_j)).
\end{aligned}
\tag{4.2}
$$
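The telescoping step in (4.1) and the product form in (4.2) can be checked numerically. The following is a minimal sketch, assuming a hypothetical list `p` of conditional interval probabilities; the function names are illustrative and do not come from the thesis code.

```python
import math

def survival(p, i):
    """S(a_i) = prod_{l=1..i} (1 - p(a_l)), with S(a_0) = 1.
    p[l-1] holds the conditional probability p(a_l)."""
    return math.prod(1.0 - q for q in p[:i])

def interval_probability(p, i):
    """P(a_{i-1} <= T < a_i) written as the last line of (4.1)."""
    return survival(p, i - 1) * p[i - 1]

# Hypothetical conditional probabilities for three intervals.
p = [0.10, 0.20, 0.30]
for i in range(1, len(p) + 1):
    difference = survival(p, i - 1) - survival(p, i)  # second line of (4.1)
    product = interval_probability(p, i)              # last line of (4.1)
    print(i, round(difference, 6), round(product, 6))
```

Both columns should agree for every interval, and the interval probabilities together with the final survival probability should sum to one.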

Combining equations (4.1) and (4.2), the likelihood function for interval censored data becomes

$$
L = \prod_{i=1}^{k} \prod_{j \in N_i} p(a_i \mid X_j)^{\delta_{ji}} \, (1 - p(a_i \mid X_j))^{1 - \delta_{ji}}, \tag{4.3}
$$

see Corrente, Chalita, and Moreira (2003).
Equation (4.3) is a likelihood function for observations with a Bernoulli distribution, where $\delta_{ji}$ is a binary response variable with probability $p(a_i \mid X_j)$.
When a variable has a Bernoulli distribution, generalized linear models can be used for the modeling.
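To make the Bernoulli structure of (4.3) concrete, the logarithm of the likelihood can be written as a sum over all subject-interval pairs at risk. This is only a sketch under the assumption that the pairs are available as a list of hypothetical `(delta, p)` tuples; it is not the implementation used for the results in this thesis.

```python
import math

def log_likelihood(rows):
    """Logarithm of (4.3). Each row is (delta, p), where delta is the 0/1
    indicator delta_ji and p is p(a_i | X_j) for a subject j at risk in interval i."""
    total = 0.0
    for delta, p in rows:
        total += delta * math.log(p) + (1 - delta) * math.log(1.0 - p)
    return total

# Hypothetical subject who survives two intervals and fails in the third.
rows = [(0, 0.10), (0, 0.20), (1, 0.30)]
likelihood = math.exp(log_likelihood(rows))
print(likelihood)  # product (1 - 0.10) * (1 - 0.20) * 0.30, up to rounding
```

For a single subject the likelihood contribution coincides with the interval probability in (4.1), which is why maximizing (4.3) is the same as fitting a Bernoulli model to the indicators.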

4.1.1 Generalized Linear Models

In the first part of this section generalized linear models will be presented. In the second part we review the modeling of $p(a_i \mid X_j)$ with a GLM. For a review of Generalized Linear Models see Fox (2016).
Generalized Linear Models (GLM) are a tool for the estimation of a number of distinct statistical models for which we would usually need distinct regression models, for instance logit and probit. GLM were first presented by John Nelder and R.W.M. Wedderburn in 1972.
In order to use a GLM three components are needed:

• Firstly we need a random component $Y_i$ for the i-th observation, which is conditioned on the explanatory variables of the model. In the original formulation of GLM, $Y_i$ had to be a member of an exponential family. One of the members of the exponential family is the binomial distribution and consequently the Bernoulli distribution.

• Secondly, a linear predictor is needed. That is a function of regressors

$$\eta_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}.$$


• Thirdly, we need a smooth invertible linearizing link function, $g(\cdot)$. The link function transforms the expectation of the response variable, $\mu_i = E(Y_i)$, into the linear predictor

$$g(\mu_i) = \eta_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}.$$

One of the functions that can be used in combination with the binomial distribution is the complementary log-log function (clog-log), where

$$\eta_i = g(\mu_i) = \ln(-\ln(1 - \mu_i))$$

or

$$\mu_i = g^{-1}(\eta_i) = 1 - \exp(-\exp(\eta_i)).$$
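The complementary log-log link and its inverse can be written down directly from the two formulas above. The sketch below is illustrative; the function names are hypothetical, and `math.log1p` and `math.expm1` are used only to keep the evaluation accurate for very small probabilities.

```python
import math

def cloglog(mu):
    """Link function: eta = ln(-ln(1 - mu)), for 0 < mu < 1."""
    return math.log(-math.log1p(-mu))

def cloglog_inverse(eta):
    """Inverse link: mu = 1 - exp(-exp(eta))."""
    return -math.expm1(-math.exp(eta))

# Round trip on a few hypothetical mean values.
for mu in (0.05, 0.5, 0.95):
    eta = cloglog(mu)
    print(mu, eta, cloglog_inverse(eta))
```

The inverse maps any real linear predictor into the interval (0, 1), which is what is required of a mean of a Bernoulli variable.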

From the identity (4.3) it is seen that $\delta_{ji}$ will be the response variable with a Bernoulli distribution with parameter $p(a_i \mid X_j)$ and consequently $E(\delta_{ji}) = p(a_i \mid X_j) = \mu_i$.
Once we know $\mu_i$, we have to find the link function $g$ for which $\eta_i = g(p(a_i \mid X_j))$ holds.
Using the Cox proportional hazards model in order to model $p(a_i \mid X_j)$, equations (3.3) and (3.2) give us

$$
p(a_i \mid X_j) = 1 - \left( \frac{S_0(a_i)}{S_0(a_{i-1})} \right)^{\exp(\beta^T X_j)}. \tag{4.4}
$$

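When the complementary log-log transformation is applied to equation (4.4) we get

$$
\ln(-\ln(1 - p(a_i \mid X_j))) = \beta^T X_j + \ln\left( -\ln\left( \frac{S_0(a_i)}{S_0(a_{i-1})} \right) \right) = \beta^T X_j + \gamma_i = \eta_i, \tag{4.5}
$$

where $\gamma_i = \ln\left( -\ln\left( \frac{S_0(a_i)}{S_0(a_{i-1})} \right) \right)$.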
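The step from (4.4) to (4.5) can be verified numerically. The sketch below assumes hypothetical baseline survival values and a hypothetical coefficient vector; none of these numbers are estimates from the Rabobank data.

```python
import math

def interval_probability(s0_prev, s0_curr, beta, x):
    """Equation (4.4): 1 - (S0(a_i) / S0(a_{i-1})) ** exp(beta^T x)."""
    linear = sum(b * v for b, v in zip(beta, x))
    return 1.0 - (s0_curr / s0_prev) ** math.exp(linear)

# Hypothetical inputs, chosen only for illustration.
s0_prev, s0_curr = 0.9, 0.7        # baseline survival at a_{i-1} and a_i
beta, x = [0.4, -0.2], [1.0, 1.0]  # coefficients and covariates

p = interval_probability(s0_prev, s0_curr, beta, x)
left_side = math.log(-math.log(1.0 - p))         # clog-log of p
gamma_i = math.log(-math.log(s0_curr / s0_prev)) # interval effect
right_side = sum(b * v for b, v in zip(beta, x)) + gamma_i
print(left_side, right_side)
```

Both sides should coincide up to floating point error, which illustrates that the interval effect $\gamma_i$ depends only on the baseline survival function and not on the covariates.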

From the above it follows that we can use a GLM with a binomial distribution and complementary log-log link function in order to model $\delta_{ji}$; in other words, we can fit the Cox proportional hazards model in the interval censoring setting with a GLM.
After the values $p(a_i \mid X_j)$ are obtained, the survival function at each time point can be calculated using equation (4.2).
In the competing risk setting a GLM can be used in order to model the probabilities of failing due to reason $k$ in the interval $[a_{i-1}, a_i)$, denoted by $p_k(a_i \mid X_j)$, $k = 1, 2$. In a competing risk setting the indicator variable $\delta^k_{ji}$ tells us whether individual $j$ failed in interval $i$ due to reason $k = 1, 2$. As soon as the estimators $p_k(a_i \mid X_j)$ are calculated, the following identity can be used in order to estimate the survival function

$$
S(t \mid X_j) = \prod_{i : a_i \le t} \left( 1 - (p_1(a_i \mid X_j) + p_2(a_i \mid X_j)) \right). \tag{4.6}
$$

In order to estimate the cumulative incidence function the following estimator will be used

$$
F_1(t \mid X_j) = \sum_{i : a_i \le t} S(a_{i-1} \mid X_j) \, p_1(a_i \mid X_j). \tag{4.7}
$$
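Equations (4.6) and (4.7) can be evaluated in one pass over the intervals. The following sketch uses hypothetical interval probabilities for cure and liquidation; the function name and the input values are assumptions made for illustration.

```python
def survival_and_incidence(p_cause, p_other):
    """Return lists S and F with S[i] = S(a_i) and F[i] = F(a_i) for the
    cause whose interval probabilities are given first; index 0 is time a_0."""
    S = [1.0]
    F = [0.0]
    for q_cause, q_other in zip(p_cause, p_other):
        F.append(F[-1] + S[-1] * q_cause)            # (4.7): S(a_{i-1}) * p(a_i)
        S.append(S[-1] * (1.0 - (q_cause + q_other)))  # (4.6)
    return S, F

# Hypothetical interval probabilities for cure (1) and liquidation (2).
p1 = [0.10, 0.20, 0.05]
p2 = [0.05, 0.10, 0.15]

S, F1 = survival_and_incidence(p1, p2)
_, F2 = survival_and_incidence(p2, p1)
print(S)
print(F1)
print(F2)
```

At every time point the probability of being unresolved and the two cumulative incidences should add up to one, which is a convenient consistency check on an implementation.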

Once the cumulative incidence functions are estimated, equation (2.1) is used in order to estimate CCR.

As can be seen from equation (4.7), a GLM can be used for the modeling of proportional hazards in the interval censoring setting. Since the indicator variable $\delta^k_{ji}$ is defined for individual $j$ for each time interval until he is censored or one of the events $k$ occurs to him, we need to input a different type of data frame into the regression than the one used for the nonparametric and Cox estimators. The easiest way to understand how the data needs to be transformed is by continuing the example from Figure 3.1.
From the figure it can be seen that we are following individual 1 until time 6, when he is censored. For every time interval we have to define $\delta^k_{1i}$, $k = 1, 2$, $i = 1, 2, \ldots, 6$, which tells us whether individual 1 experienced cure or loss in the interval $(i-1, i]$. In similar fashion the other rows have to be duplicated, but for observations which are not censored $\delta^1_{ji}$ takes value 1 if cure happened, as can be seen for individual 4 in Figure 4.1.

After we have the data in the right format, we need to make dummy variables out of the variable observed time. How to create the dummy variables for individual 1 can be seen in Figure 4.2. Dummies have to be created for each time from 1 to the largest time that can be found in the column observed time.
Once the dummies are created, we can start modeling $p_k(a_i \mid X)$. In the GLM we choose the clog-log as the link function and the binomial distribution. For the response variable we choose $\delta^k_{ji}$ and for the independent variables we have to choose the dummy variables and the variables we want to include in the regression. In our case it is the variable HighLTV. The coefficients which we get as output for the dummies $i$ represent the values $\gamma_i$, while the coefficient $\beta_{HighLTV}$ explains the influence of the variable HighLTV on $p_k(a_i \mid \mathrm{HighLTV})$, as can be seen in equation (4.4).
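The data transformation described above can be sketched in plain Python. The records below are hypothetical and do not reproduce Figure 3.1: only the fact that individual 1 is censored at time 6 and that individual 4 is cured is taken from the text, while the event coding, the cure time and the HighLTV values are assumptions.

```python
def expand_person_period(subjects, max_time):
    """Create one row per subject and time interval up to the observed time.
    Assumed event coding: 0 = censored, 1 = cure, 2 = liquidation."""
    rows = []
    for s in subjects:
        for i in range(1, s["time"] + 1):
            is_last = i == s["time"]
            row = {
                "id": s["id"],
                "interval": i,
                "HighLTV": s["HighLTV"],
                # The indicators are 1 only in the interval where the event happened.
                "delta1": int(is_last and s["event"] == 1),
                "delta2": int(is_last and s["event"] == 2),
            }
            # One dummy variable per time interval 1..max_time.
            for m in range(1, max_time + 1):
                row[f"d{m}"] = int(m == i)
            rows.append(row)
    return rows

# Hypothetical records for two individuals.
subjects = [
    {"id": 1, "time": 6, "event": 0, "HighLTV": 1},  # censored at time 6
    {"id": 4, "time": 3, "event": 1, "HighLTV": 0},  # cured (time assumed)
]
rows = expand_person_period(subjects, max_time=6)
for row in rows:
    print(row)
```

The resulting rows, with `delta1` or `delta2` as the response and the dummies plus HighLTV as regressors, have the shape that a binomial GLM with clog-log link expects.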

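The output coefficients for cure are

Variable (coefficient)            Estimate
$\beta^1_{\mathrm{HighLTV}}$         0.243
1 ($\gamma^1_1$)                   -23.553
2 ($\gamma^1_2$)                    -2.232
3 ($\gamma^1_3$)                    -1.154
4 ($\gamma^1_4$)                   -23.528
5 ($\gamma^1_5$)                   -23.562
6 ($\gamma^1_6$)                   -23.476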

and for liquidation


Figure 4.1: Transformed data frame for GLM

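Variable (coefficient)            Estimate
$\beta^2_{\mathrm{HighLTV}}$        -0.156
1 ($\gamma^2_1$)                    -2.204
2 ($\gamma^2_2$)                    -2.096
3 ($\gamma^2_3$)                   -22.430
4 ($\gamma^2_4$)                    -0.633
5 ($\gamma^2_5$)                   -22.423
6 ($\gamma^2_6$)                   -22.476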


Figure 4.2: Dummies for GLM

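In order to estimate $p_k(a_i \mid \mathrm{HighLTV})$ we have to use the inverse of equation (4.5), or

$$
p_k(a_i \mid \mathrm{HighLTV}) = 1 - \exp\left( -\exp\left( \gamma^k_i + \beta^k_{\mathrm{HighLTV}} \cdot \mathrm{HighLTV} \right) \right). \tag{4.8}
$$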
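As an illustration of equation (4.8), the sketch below plugs in the cure coefficients listed above for a client with HighLTV = 1. The variable and function names are hypothetical; `math.expm1` is used so that the very small probabilities are not lost to rounding.

```python
import math

# Cure coefficients as listed in the output above.
beta_cure = 0.243
gamma_cure = [-23.553, -2.232, -1.154, -23.528, -23.562, -23.476]

def interval_probability(gamma_i, beta, high_ltv):
    """Equation (4.8): 1 - exp(-exp(gamma_i + beta * HighLTV))."""
    eta = gamma_i + beta * high_ltv
    return -math.expm1(-math.exp(eta))

p1_high = [interval_probability(g, beta_cure, 1) for g in gamma_cure]
for i, p in enumerate(p1_high, start=1):
    print(i, f"{p:.3e}")
```

The printed values should be close to the cure probabilities reported for HighLTV = 1; intervals with a strongly negative $\gamma_i$ correspond to intervals without an observed cure in the example data.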

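After doing so we get the following values for the probabilities for cure

$$
p_1(a_i \mid 1) =
\begin{cases}
7.524 \cdot 10^{-11}, & i = 1 \\
0.128, & i = 2 \\
0.331, & i = 3 \\
7.717 \cdot 10^{-11}, & i = 4 \\
7.462 \cdot 10^{-11}, & i = 5 \\
8.125 \cdot 10^{-11}, & i = 6
\end{cases}
$$

and

$$
p_1(a_i \mid 0) =
\begin{cases}
0.105, & i = 1 \\
0.116, & i = 2 \\
1.813 \cdot 10^{-10}, & i = 3 \\
0.411, & i = 4 \\
1.827 \cdot 10^{-10}, & i = 5 \\
1.733 \cdot 10^{-10}, & i = 6.
\end{cases}
$$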

For the liquidation probabilities $p_2$ we get the following values


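$$
p_2(a_i \mid 0) =
\begin{cases}
0.143, & i = 1 \\
1.082 \cdot 10^{-10}, & i = 2 \\
9.915 \cdot 10^{-11}, & i = 3 \\
0.500, & i = 4 \\
1.082 \cdot 10^{-10}, & i = 5 \\
6.374 \cdot 10^{-11}, & i = 6
\end{cases}
$$

and

$$
p_2(a_i \mid 1) =
\begin{cases}
0.009, & i = 1 \\
0.010, & i = 2 \\
1.552 \cdot 10^{-10}, & i = 3 \\
0.365, & i = 4 \\
1.564 \cdot 10^{-10}, & i = 5 \\
1.483 \cdot 10^{-10}, & i = 6.
\end{cases}
$$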

Applying equation (4.6) to the results above gives us

$$
S(a_i \mid 0) =
\begin{cases}
1.000, & i = 1 \\
0.895, & i = 2 \\
0.511, & i = 3 \\
0.301, & i = 4 \\
0.301, & i = 5 \\
0.301, & i = 6
\end{cases}
$$

and

$$
S(a_i \mid 1) =
\begin{cases}
0.909, & i = 1 \\
0.703, & i = 2 \\
0.470, & i = 3 \\
0.298, & i = 4 \\
0.298, & i = 5 \\
0.298, & i = 6.
\end{cases}
$$


Now equation (4.7) can be used for the estimation of the cumulative incidence function

$$
F(a_i \mid 0) =
\begin{cases}
5.902 \cdot 10^{-11}, & i = 1 \\
0.009, & i = 2 \\
0.872, & i = 3 \\
0.294, & i = 4 \\
0.294, & i = 5 \\
0.294, & i = 6
\end{cases}
$$

and

$$
F(a_i \mid 1) =
\begin{cases}
7.525 \cdot 10^{-11}, & i = 1 \\
0.016, & i = 2 \\
0.349, & i = 3 \\
0.349, & i = 4 \\
0.349, & i = 5 \\
0.349, & i = 6.
\end{cases}
$$

4.2 Model implementation

Rabobank assumes that the probability of cure is equal to zero after the 58th month and that the data is collected at the end of each month. From Section 4.1 it follows that we would need a regression with at least 58 coefficients in order to obtain $\gamma_i$, $i = 1, \ldots, 58$, which are needed for the estimation of $p_k(a_i \mid X)$. Such a regression is computationally too expensive. At the same time we would need to transform the data with 31,000 observations into data with more than 2 million observations, which is expensive as well. In order to avoid these problems a method called binning or bucketing will be used.

4.2.1 Binning

In order to make the method computationally feasible the number of observed intervals will be reduced. Firstly, the cause specific hazard rate of each interval will be estimated with the following estimator

$$
\frac{\text{number of deaths in the interval}}{\text{number of individuals at the beginning of the interval} \cdot \text{size of the interval}}.
$$

Once these hazard rates are estimated, the two neighboring intervals whose hazard rates have the smallest absolute difference will be joined into a new interval. If we join the intervals $I_i = (a_{i-1}, a_i]$ and $I_{i+1} = (a_i, a_{i+1}]$, the new interval $I'_i = (a_{i-1}, a_{i+1}]$ is obtained. These two steps are repeated until the number of intervals is small enough. The GLM regression will be made with 10 intervals. The results can be seen in Figures 4.3 and 4.4.
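The two binning steps can be sketched as a greedy merging loop. The text does not specify how the hazard rate of a joined interval is recomputed; the sketch assumes that the deaths are summed and that the number at risk at the start of the first interval is kept. The input data is hypothetical.

```python
def interval_hazard(deaths, at_risk, width):
    """Deaths divided by (number at risk at the start * interval size)."""
    return deaths / (at_risk * width) if at_risk > 0 else 0.0

def bin_intervals(intervals, target):
    """Greedily join neighboring intervals with the closest hazard rates.
    intervals: list of (start, end, deaths, at_risk_at_start) tuples."""
    bins = list(intervals)
    while len(bins) > target:
        hazards = [interval_hazard(d, n, b - a) for a, b, d, n in bins]
        diffs = [abs(hazards[i + 1] - hazards[i]) for i in range(len(bins) - 1)]
        i = diffs.index(min(diffs))  # pair with the smallest absolute difference
        start, _, deaths_left, at_risk = bins[i]
        _, end, deaths_right, _ = bins[i + 1]
        # Assumption: sum the deaths, keep the at-risk count of the left interval.
        bins[i:i + 2] = [(start, end, deaths_left + deaths_right, at_risk)]
    return bins

# Hypothetical monthly data: (start, end, deaths, at risk at start).
monthly = [(0, 1, 10, 100), (1, 2, 9, 90), (2, 3, 2, 81), (3, 4, 2, 79)]
binned = bin_intervals(monthly, target=2)
print(binned)
```

In this toy example the first two months have identical hazard rates and are joined first, after which the last two months are joined.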


Figure 4.3: Hazard rates estimated before binning

Figure 4.4: Interval hazard rates estimated after binning

As soon as binning is performed, we use the output intervals $I'_i = (a'_{i-1}, a'_i]$ in order to model the probabilities $p_k(a_i)$. For the values $a'_0 < a'_1 < \cdots < a'_{10} < a'_{11} = \infty$, for which we will model the probabilities, we get $0 < 3 < 4 < 5 < 8 < 12 < 13 < 14 < 20 < 46 < 58 < \infty$. It will still be assumed that the probability of cure after time 58 is equal to 0, and consequently $p_1(a_{11} \mid X) = 0$.
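With the interval boundaries above, every month in default can be assigned to its bin with a binary search. This is a small sketch; the function name is hypothetical, and the returned index 11 stands for the open interval after month 58.

```python
from bisect import bisect_left

# Boundaries a'_0, ..., a'_10 obtained from the binning.
EDGES = [0, 3, 4, 5, 8, 12, 13, 14, 20, 46, 58]

def bin_index(month):
    """Return i such that month lies in (a'_{i-1}, a'_i]; 11 means after month 58."""
    if month <= 0:
        raise ValueError("month must be positive")
    return bisect_left(EDGES, month)

for month in (1, 3, 4, 20, 21, 58, 59):
    print(month, bin_index(month))
```

Because the intervals are closed on the right, a boundary month such as 3 or 58 belongs to the bin that ends at that month.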

4.3 Results

In this section all the quantities which are needed for the estimation of CCR in the interval censoring setting will be modeled and compared with the results from the previous chapters. Since a regression with 58 coefficients is not computationally feasible, the intervals which were obtained in Section 4.2.1 will be used. All graphs in this section should be histograms, since we are operating with discrete time, but the visualization of four segments in one figure would be nearly impossible; consequently the quantities are represented as step functions.

After the intervals are determined we can estimate the parameters. As independent variables we have to use a combination of variables for each segment, as described in Section 3.3, and dummy variables for the intervals. Once the regression is done we get the coefficients for the risk parameters for cure and loss, which can be seen in Figures 4.5 and 4.6.

Figure 4.5: Output coefficients for cure

Figure 4.6: Output coefficients for loss

Once $\gamma_i$, $i = 1, 2, \ldots, 10$, are estimated we can start computing $p_1(a_i \mid X_j)$ and $p_2(a_i \mid X_j)$. The interval probabilities for cure and loss can be seen in Figures 4.7 and 4.8. Interval probabilities cannot be compared with the modeled hazard rates from the previous chapters because they are not normalized by the interval length. In the figures it is also visible that the interval probabilities are higher for wider intervals, which makes sense: the longer the interval, the higher the probability of cure or liquidation.
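The effect of the interval width on the interval probabilities can be illustrated with a simple calculation. The sketch assumes a hazard rate that is constant within an interval, which is an assumption made only for this illustration and is not part of the model above; the numbers are hypothetical.

```python
import math

def interval_probability(hazard, width):
    """Probability of an event in an interval of the given width,
    assuming a constant hazard rate within the interval."""
    return -math.expm1(-hazard * width)

def normalized_rate(p, width):
    """Rate per unit of time recovered from an interval probability."""
    return -math.log1p(-p) / width

hazard = 0.02  # hypothetical monthly hazard rate
p_narrow = interval_probability(hazard, 1)   # interval of one month
p_wide = interval_probability(hazard, 26)    # interval of 26 months, e.g. (20, 46]
print(p_narrow, p_wide)
print(normalized_rate(p_narrow, 1), normalized_rate(p_wide, 26))
```

Under this assumption the wider interval has the higher interval probability even though the underlying rate per month is the same, which is why the interval probabilities and the hazard rates are not directly comparable.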

From Figures 4.7 and 4.8 it is seen that the segments behave in a similar way as the hazard rates estimated with the Cox model, which can be seen in Figures 3.11 and 3.12. Consequently, the survival curves of the segments, which can be seen in Figure 4.9, take a similar shape as in Section 3.4.

Figure 4.7: $p_1(a_i \mid X_j)$

Figure 4.8: $p_2(a_i \mid X_j)$

Once the survival functions and interval probabilities are estimated, equation (4.7) can be used for the estimation of the cumulative incidence functions of each segment. The results can be seen in Figure 4.10.

Again we can observe the same behaviour as when we model the cumulative incidence function with Cox regression.
Finally equation (2.3) can be used for CCR estimation; the results can be seen in Figure 4.11.

The CCR curves are in reality step functions, but they are plotted as smooth functions so that the resemblance with the previous chapters can be seen. CCR estimates as step functions can be found in Figure 4.12.


Figure 4.9: Survival function estimated with GLM

Figure 4.10: Cumulative incidence functions estimated with GLM

4.4 Performance of the method

In order to validate the proportional hazards model in the interval censoring setting the same method as described in Section 2.3.1 will be used. This time, however, we make one extra step: before the parameter estimation, binning is done.

From Figures 4.12 and 4.13 it can be seen that binning has a negative effect on the confidence intervals and variance. If we compare the confidence intervals and variance with those of the nonparametric and Cox estimators, it can be concluded that this method has the widest confidence intervals and the highest variance. Also the segment with the most representatives has the widest confidence interval. This happens because at every $CCR_i$ estimation different time intervals are generated. The reason why the lower bound of the confidence interval after the 20th month is equal to zero is that in at least 5% of the simulations the last, 10th, interval is one wide interval which starts at the 20th month.


Figure 4.11: CCR estimated in interval censoring with the Cox proportional hazards model

Figure 4.12: Confidence intervals estimated with the bootstrap

As we have done in continuous time, where $F(\infty) = F(58)$, in the interval censoring setting we also set $F(\infty) = F(10)$. As in continuous time, where it follows from equation (2.1) that $CCR(58) = 0$, in the interval censoring setting we get $CCR(10) = 0$. The other reason for the wide intervals is that the months from the 20th month on frequently change buckets; consequently the $CCR_i(t)$ are not always calculated for the same bucket and we get completely different values than at the original binning in Figure 4.4. We also have to take into consideration that the binning algorithm joins the neighboring intervals with the most similar hazard rates, and consequently the CCR estimates of neighboring buckets will differ or have big jumps, as can be seen in Figure 4.12. From the same figure we can also see that higher CCR estimates have larger jumps and consequently the segments with the highest CCR estimates have the highest variance. In our case the segment with the most representatives has the highest CCR estimates and consequently the highest variance.


Figure 4.13: Variance estimated with the bootstrap

From Figure 4.14 it can be seen that, since time points in the simulated buckets fall into other bins than the bins of the originally generated intervals, this estimator has a bigger bias. At time points where CCR is bigger than $\overline{CCR}$ the bias is positive. At these time points the time instant falls into a bin which is the right neighbor of the original one, where $CCR(t)$ is lower. When the bias is negative the opposite phenomenon happens.

Figure 4.14: Bias estimated with the bootstrap

4.5 Discussion

From Section 4.4 it can be seen that CCR estimated with the combination of binning and GLM has a high variance, wide confidence intervals and a high bias due to binning. At the same time it is seen that at some time points at the beginning of the observation period, which do not change bins frequently, the variance is close to zero. It can be expected


that the results would be at least comparable to the Cox model if we had more computational power and consequently binning would not be needed.
On the other hand the data collected by Rabobank is collected at the end of each month, so in reality we have interval censored observations. Consequently, using the GLM in order to obtain CCR estimates is mathematically more correct than approximating all observations which happen in one month by the last day of the month, as is done in the case of the nonparametric and Cox estimators.
At this point three estimators have been presented. The nonparametric estimator is definitely the easiest to understand. If Rabobank collects more data, this estimator would give us results comparable to the Cox estimator, which is the estimator with the lowest variance, lowest bias and tightest confidence intervals. If we can expect that regulators understand more advanced concepts from probability theory and that our computational power will not change, the Cox estimator would be the best choice. At the given computational power the mathematical correctness of interval censoring can be ignored but definitely not forgotten, because at some time points CCR estimated with GLM gives us good estimates.


Popular summary

In the master thesis Credit risk and survival analysis: Estimation of CCR we looked into three estimators of CCR.
Since CCR is modeled with survival analysis, the basic concepts of survival analysis were introduced at the beginning of the thesis. From the introduction it is known that every client which defaulted can be cured or liquidated, and consequently the competing risk setting was introduced next.
After we got familiar with the concepts from survival analysis which are needed for CCR, the CCR model which is currently used by Rabobank was introduced. The performance of the method was analyzed with the bootstrap. It was discovered that the method has a high variance and wide confidence intervals at the end of the observation period, which is not a desirable attribute of any estimator. Next, the estimation of CCR with the Cox model was presented. The results and performance were compared to the nonparametric method which is currently used by Rabobank. It was discovered that the Cox method has tighter confidence intervals and a lower variance and consequently performs better. Since the data is collected at the end of each month, the approach of the Cox and nonparametric models, which assumes that the event of interest happened on the last day of the month, is incorrect. Consequently, CCR needs to be modeled in the interval censoring setting. In order to do so, a GLM in combination with binning was used. Binning caused a high variance, high bias and wide confidence intervals. Consequently, this method has the worst performance.


Bibliography

Corrente, Jose E., Liciana V. A. S. Chalita, and Jeanete Alves Moreira (2003). “Choosing between Cox proportional hazards and logistic models for interval-censored data via bootstrap”. In: Journal of Applied Statistics 30.1, pp. 37–47.

Fox, John (2016). Applied Regression Analysis and Generalized Linear Models. 3rd ed. Sage Publishing.

Kalbfleisch, John D. and Ross L. Prentice (2002). The Statistical Analysis of Failure Time Data. 2nd ed. Wiley-Interscience.

Klein, John P. and Melvin L. Moeschberger (2003). Survival Analysis: Techniques for Censored and Truncated Data. 2nd ed. Springer.

Lawless, Jerald F. (2002). Statistical Models and Methods for Lifetime Data. 2nd ed. Wiley-Interscience.

Weng, Yi-Ping (2007). Baseline Survival Function Estimators under Proportional Hazards Assumption.

Zhang, Mei-Jie, Xu Zhang, and Thomas H. Scheike (2008). “Modeling cumulative incidence function for competing risks data”. In: Expert Review of Clinical Pharmacology 1.3, pp. 391–400. doi: 10.1586/17512433.1.3.391.
