
Journal of Agricultural and Applied Economics, 31,1 (April 1999):109–122
© 1999 Southern Agricultural Economics Association

Application of Recursive Partitioning to Agricultural Credit Scoring

Michael P. Novak and Eddy LaDue

ABSTRACT

Recursive Partitioning Algorithm (RPA) is introduced as a technique for credit scoring analysis that allows direct incorporation of misclassification costs. This study corroborates nonagricultural credit studies, which indicate that RPA outperforms logistic regression based on within-sample observations. However, validation based on more appropriate out-of-sample observations indicates that logistic regression is superior under some conditions. Incorporation of misclassification costs can influence the creditworthiness decision.

Key Words: finance, credit scoring, misclassification, recursive partitioning algorithm

Many agricultural banks and lending institutions are beginning to recognize the advantages of credit scoring in conjunction with human analysis. Several institutions are currently using such models on at least a subset of their portfolio. Credit scoring models hold the promise of reducing the variability of credit decisions, adding efficiencies to credit risk assessment, establishing better loan pricing policies, and improving the safety and soundness of agricultural lending. Improved financial information systems have allowed agricultural lenders to more readily collect and retain data regarding the creditworthiness of borrowers. As such databases are populated, the ability to monitor changes in creditworthiness over time improves, and the need to explore new methods to estimate credit-scoring models increases.

Within the agricultural financial literature various nonparametric and parametric methods have been used to estimate credit-scoring models, such as experience-based algorithms (Alcott; Splett et al.), mathematical programming (Hardy and Adrian; Ziari, Leatham, and Turvey), logistic regression (Mortensen, Watt, and Leistritz), probit regression (Lufburrow, Barry, and Dixon; Miller et al.), discriminant analysis (Hardy and Weed; Dunn and Frey; Johnson and Hagan), and linear probability regression (Turvey). There is not unanimous agreement as to the best method for estimating credit-scoring models, and new methods continue to be researched.

Michael P. Novak is Manager of Agricultural Finance, Federal Agricultural Mortgage Corporation, Washington, DC. Eddy LaDue is W.I. Myers Professor of Agricultural Finance, Department of Agricultural, Resource, and Managerial Economics, Cornell University, Ithaca, NY.

Most recently, logistic regression has dominated the agricultural credit-scoring literature (Miller and LaDue; Turvey and Brown; Novak and LaDue; Splett et al.). Logistic regression succeeded discriminant analysis as the parametric method of choice, primarily based on its more favorable statistical properties (McFadden). Turvey reviews and empirically compares agricultural credit-scoring models using four parametric methods with a single data set. He recommends logistic regression over probit regression, discriminant analysis, and linear probability regression based on predictive accuracy and ease of use, in addition to the favorable statistical properties previously mentioned. Logistic regression improves on some of the statistical properties of discriminant analysis and linear probability regression; however, it still possesses numerous statistical problems common to most parametric methods. These problems include (1) the need to pre-select the exact explanatory variables without well-developed theory, (2) inability to identify an individual variable's relative importance, (3) reduction of the information space's dimensionality, and (4) limited ability to incorporate relative misclassification costs.

Non-agricultural studies have used the Recursive Partitioning Algorithm (RPA) to classify financially stressed firms. RPA is a computerized, nonparametric classification method that does not impose any a priori distributional assumptions. The essence of RPA is to develop a classification tree that partitions the observations based on binary splits of characteristic variables. The selection and partitioning process occurs repeatedly until no further selection or division of a characteristic variable is possible, or the process is stopped by some predetermined criteria. Ultimately the observations in the terminal nodes of the classification tree are assigned to classification groups. Friedman originally developed RPA. A thorough theoretical exposition of RPA is presented in Breiman et al. A more practical exposition of the computational aspects of RPA and a comprehensive bibliography of research using RPA are presented in the CART software documentation (Steinberg and Colla). RPA has been applied to many areas of research, such as behavioral economics (Carson, Hanemann, and Steinberg), wildlife management (Grubb and King), and livestock management (Tronstad and Gum), but it has not been applied to agricultural credit scoring.

Several non-agricultural financial stress classification studies indicate RPA outperforms other parametric and judgmental models based on predictive accuracy. Marais, Patell, and Wolfson compare RPA with a polytomous probit regression to classify commercial loans for publicly and privately held banking firms. Frydman, Altman, and Kao compare RPA with discriminant analysis to classify firms according to their degree of financial stress. Srinivasan and Kim compare RPA with discriminant analysis, logistic regression, goal programming, and a judgmental model (the Analytic Hierarchy Process) to evaluate the corporate credit granting process. Each of these studies uses cross-validation and the associated expected cost of misclassification to evaluate the RPA models. A shortcoming of these studies is that they do not use intertemporal (ex ante) predictions to compare and evaluate the models. Prediction is the basic objective of credit-scoring models (Joy and Tollefson). Credit-scoring models should not be limited to classifying borrowers in the same time period. The "true" test is their ability to classify borrowers in the future.

The primary purpose of this study is to introduce RPA as a method for classifying creditworthy and less creditworthy agricultural borrowers and to compare RPA to logistic regression. This study also challenges RPA's superior prediction accuracy, as purported in the financial stress classification literature. In this study, RPA models are evaluated based on minimizing the expected cost of misclassification for creditworthy and less creditworthy borrowers in out-of-sample periods.

The remainder of the paper is divided into six sections. The first section presents the specifics of the RPA. The second section discusses the advantages and disadvantages of, and the differences between, the RPA and logistic regression. The third section describes the data. The fourth and fifth sections present the creditworthiness models and empirical results, respectively. The final section summarizes the paper's results.

Recursive Partitioning Algorithm

In this section, a hypothetical RPA tree-growing process is presented and the terminology is introduced. To understand the tree-growing process, a hypothetical tree is illustrated in Figure 1. It is constructed using classification groups i and j, and characteristic variables A and B.¹

Figure 1. Hypothetical Recursive Partitioning Algorithm Tree (subtrees T0 through T3, with splits on variable A at a1 and a2 and on variable B at b1)

Throughout the paper the classification groups are limited to two, but in general classification groups can be greater than two. To start the tree-growing process, all the observations in the original sample, denoted by N, are contained in the parent node, which constitutes the first subtree, denoted T0 (not really a tree, but we will call it one anyway). T0 possesses no binary splits and can be referred to as the naive classification tree. All observations in the original sample are assigned to group j or i, based on an assignment rule. The assignment of T0 to either group i or j depends on misclassification costs and prior probabilities. When misclassification costs are equal to each other and prior probabilities are equal to the sample proportions of the groups, T0 is assigned to the group with the greatest proportion of observations, minimizing the number of observations misclassified. When misclassification costs are not equal and prior probabilities are not equal to the sample proportions of the groups, T0 is assigned to the group that minimizes the observed expected cost of misclassification.²

¹ Characteristic variables are analogous to independent variables in a parametric regression.

² The observed expected cost of misclassification = c_ij π_i n_ij(T)/N_i + c_ji π_j n_ji(T)/N_j, where c_ij (c_ji) is the cost of misclassifying a group i (j) observation as a group j (i) observation; π_i (π_j) is the prior probability of an observation belonging to group i (j); n_ij(T) (n_ji(T)) is the total number of group i (j) observations misclassified as j (i) in the entire tree T; and N_i (N_j) is the number of original observations from group i (j).
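For concreteness, the observed expected cost of misclassification defined in footnote 2 can be computed directly. This is our own illustrative sketch; all of the numbers below are invented.

```python
def expected_misclassification_cost(c_ij, c_ji, pi_i, pi_j,
                                    n_ij, n_ji, N_i, N_j):
    """Observed expected cost of misclassification for a tree T.

    c_ij: cost of misclassifying a group i observation as group j
    pi_i: prior probability of group i
    n_ij: group i observations the tree misclassifies as group j
    N_i : group i observations in the original sample (j terms likewise)
    """
    return c_ij * pi_i * n_ij / N_i + c_ji * pi_j * n_ji / N_j

# Example: misclassifying a less creditworthy (group i) borrower as
# creditworthy costs five times the reverse error; priors equal the
# sample proportions (14 of 70 vs. 56 of 70 observations).
cost = expected_misclassification_cost(c_ij=5.0, c_ji=1.0,
                                       pi_i=0.2, pi_j=0.8,
                                       n_ij=3, n_ji=10, N_i=14, N_j=56)
```

Changing the cost ratio c_ij/c_ji shifts which node assignments minimize this quantity, which is how unequal costs alter the tree.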

To begin the tree-growing process, RPA methodically searches each individual characteristic variable and each split value of that variable. The computer algorithm then selects a characteristic variable, in this case A, and a split value of the characteristic variable A, in this case a1, based on the optimal univariate splitting rule.³ The optimal splitting rule implies that no other characteristic variable and split value can further decrease the impurity (in other words, the misclassified observations), taking into account misclassification costs and prior probabilities in the two resulting descendent nodes. In this particular illustration, A is the characteristic variable selected and a1 is the "optimal" split value selected by the computer algorithm. Observations with a value of characteristic variable A less than or equal to a1 "fall" into the left node and observations with a value of characteristic variable A greater than a1 "fall" into the right node. The resulting subtree, denoted by T1, consists of a parent node and a left and right terminal node. The right terminal node is labeled Sub-Node 1 in Figure 1 because the tree continues from that node. The terminal nodes in the subtree are then assigned to groups, i or j, based on the assignment rule of minimizing the observed expected cost of misclassification. T0 and T1 are the beginning of a sequence of trees that ultimately concludes with Tmax. However, in some cases T1 may also be Tmax, depending on the predetermined penalty parameters specified. If T1 is not Tmax, then the recursive partitioning algorithm continues.
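The exhaustive search over characteristic variables and split values can be sketched in a few lines. The following is our own simplified illustration using Gini impurity with equal costs and priors equal to sample proportions (the paper's criterion additionally weights costs and priors); the toy data are invented.

```python
def best_univariate_split(X, y):
    """Return (variable index, split value) minimizing the weighted
    impurity of the two descendent nodes. X: list of rows; y: 0/1 labels."""
    def gini(labels):
        if not labels:
            return 0.0
        p = sum(labels) / len(labels)
        return 2 * p * (1 - p)

    n = len(y)
    best = (None, None, float("inf"))
    for j in range(len(X[0])):                      # each characteristic variable
        for split in sorted({row[j] for row in X}):  # each observed split value
            left = [y[i] for i in range(n) if X[i][j] <= split]
            right = [y[i] for i in range(n) if X[i][j] > split]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, split, score)
    return best[0], best[1]

# Toy data: variable 0 separates the two groups perfectly at 0.8.
X = [[0.5, 9], [0.8, 2], [1.4, 7], [2.0, 1]]
y = [0, 0, 1, 1]
var, split = best_univariate_split(X, y)
```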

In this illustration, T1 is not Tmax, so the partitioning process continues. Now B is the characteristic variable selected and b1 is the "optimal" split value selected by the computer algorithm. The right node becomes an internal node and the observations within it are partitioned. Observations with a value of characteristic variable B less than or equal to b1 "fall" into a new left node and observations with a value of characteristic variable B greater than b1 "fall" into a new right node. The new left (labeled Sub-Node 2 in Figure 1) and right nodes become terminal nodes in T2, and the left node in T1 still remains a terminal node in T2. All three terminal nodes in T2 are then assigned to classification groups, i and j, based on the assignment rule of minimum observed expected cost of misclassification.

Here again, T2 does not minimize the observed expected cost of misclassification of the original sample; therefore the partitioning process continues. Variable A is selected again to develop T3. When the recursive partitioning process is finished, the resulting classification tree is known as Tmax. In this illustration, T3 = Tmax. Tmax is the tree that minimizes the observed expected cost of misclassification of the original sample. Obviously the development method will overfit the tree; therefore, a method is needed to prune back the tree. Some suggested methods are v-fold cross-validation, the jackknife, expert judgment, bootstrapping, and holdout samples. Once the classification tree is developed and pruned back, it can be used to classify observations from outside the original sample.

³ The univariate splitting rule implies splitting an axis of one variable at one point. This study is limited to univariate splitting rules; however, CART has the capability to split variables using linear combinations of variables. The resulting classification trees are usually very cumbersome and difficult to interpret when linear combination splitting rules are used.
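As a rough analogue of the grow-then-prune procedure above, the following sketch uses scikit-learn's CART implementation; the paper itself used the CART package of Steinberg and Colla, so this illustrates the idea, not the authors' procedure. The data, cost weights, and pruning penalty are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the borrower data: 4 characteristic variables,
# class 1 = creditworthy, class 0 = less creditworthy.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Unequal misclassification costs enter through class weights: here a
# less creditworthy borrower mistaken for creditworthy costs 5x as much.
costs = {0: 5.0, 1: 1.0}

full = DecisionTreeClassifier(class_weight=costs, random_state=0)
full.fit(X, y)  # grows the overfit tree, analogous to Tmax

# Prune back via minimal cost-complexity pruning: pick a penalty from
# the pruning path (a deliberately heavy one here, for illustration).
path = full.cost_complexity_pruning_path(X, y)
pruned = DecisionTreeClassifier(class_weight=costs,
                                ccp_alpha=path.ccp_alphas[-2],
                                random_state=0).fit(X, y)
```

Note the design choice: cost-complexity pruning with ccp_alpha stands in for the v-fold cross-validation pruning the paper mentions; both trade tree size against within-sample fit.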

RPA and Logistic Regression Comparison

In this section the advantages and disadvantages of, and the differences between, RPA and logistic regression are discussed. One basic difference between RPA and logistic regression is the way RPA selects variables. A credit-scoring model developed using RPA does not require the variables to be selected in advance. The computer algorithm can select variables from the predetermined group of variables, without subjective influences or violating parametric assumptions.

Other differences are that RPA places no limit on the number of times a variable can be selected; the same variable can be selected numerous times and appear in different parts of the tree. All selected variables are predicated on the preceding variables. RPA never looks ahead to see where it is going, nor does it try to assess the overall performance of the tree during the splitting process. The tree-growing process is intentionally myopic. Furthermore, outlier values do not significantly influence RPA: all splits occur on non-outlier values. Once the optimal split value for a variable is selected, the outlier observation is assigned to a node and the RPA procedure continues. In contrast, logistic regression allows each variable to appear only once in the model and can be severely affected by outlier values.

An advantage of RPA over logistic regression methods is that RPA analyzes the univariate attributes of individual variables. RPA selects the optimal split value of the characteristic variables, and reports surrogate and competitive variables, along with their optimal split values, listed in order of importance. The lists of surrogate and competitive variables provide additional insight and understanding into the predictive structure of the individual variables. Surrogate variables mimic the selected variable's ability to replicate the size and composition of the descendent nodes. Competitive variables are defined as alternative variables to the selected variables with slightly less ability to reduce impurity in the descendent nodes.

While lacking in variable selection and insight, logistic regression does have advantages. Logistic regression provides an overall summary statistic, which can be used to evaluate and compare models. Logistic regression also assigns a predicted probability of creditworthiness to each individual borrower. Often lenders want a quantitative assessment of the borrower's creditworthiness, not just a method of classifying borrowers as creditworthy or less creditworthy. RPA can classify observations into creditworthy or less creditworthy groups, but cannot estimate a credit score for each individual borrower.

The two methods differ in the way they divide the information space into classification regions. RPA repeatedly partitions the information space as the tree is formed. A graphical illustration is presented in Figure 2; it is based on the hypothetical RPA tree in Figure 1. RPA partitions the information space into four rectangular regions according to characteristic variables A and B and their respective optimal split values, a1 and b1. Observations falling in regions 1 and 2 are classified as group i and those falling in regions 3 and 4 are classified as group j. Logistic regression, if implemented as a binary qualitative choice model, partitions the information space into two regions based on a prior probability, say c. The example line f(Zm) = c divides the information space. Zm is a linear function of variables A and B corresponding to observation m, and f(·) is the cumulative logistic probability function. The observations are assigned to group i if f(Zm) ≥ c or to group j if f(Zm) < c.

Figure 2. Observation Space (the information space over A and B, divided into four rectangular regions by the RPA split values, with the logistic boundary f(Zm) = c overlaid)

The two methods also differ in the manner in which they incorporate misclassification costs and prior probabilities. RPA uses misclassification costs and prior probabilities to simultaneously determine variable selection, optimal split values, and terminal node assignments. Changes in the misclassification costs and prior probabilities can change the selected variables and the optimal split values and, in turn, change the structure of the classification tree. In contrast, logistic regression is usually estimated without incorporating misclassification costs and prior probabilities. However, after the logistic regression is estimated, a prior probability can be used to classify borrowers as creditworthy/less creditworthy.

Despite the differences in the two methods, the RPA and logistic regression methods can be integrated. RPA can select the relevant variables from a predetermined set of variables. The variables then can be employed in the logistic regression. In addition, the predicted probabilities from the logistic regression can be used as a variable in the predetermined group of variables from which the RPA model selects. Whether, and at what level, RPA selects the predicted probability variable to be part of the classification tree can provide evidence for or against logistic regression.
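One way to sketch this integration (with synthetic data and invented names, not the paper's models) is to append the logistic model's predicted probability to the predetermined variable set and let a tree choose among the augmented variables:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))                # 5 characteristic variables
y = (X[:, 0] - X[:, 2] > 0).astype(int)      # 1 = creditworthy

logit = LogisticRegression().fit(X, y)
p_hat = logit.predict_proba(X)[:, 1]         # predicted creditworthiness

# Add the predicted probability as a sixth candidate variable and grow
# a tree over the augmented set.
X_aug = np.column_stack([X, p_hat])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_aug, y)

# If the tree leans heavily on the probability column, that is evidence
# in favor of the logistic specification.
importance_of_p_hat = tree.feature_importances_[-1]
```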

Data

The data for this study were collected from New York State dairy farms in a program jointly sponsored by Cornell Cooperative Extension and the Department of Agricultural, Resource, and Managerial Economics at the New York State College of Agriculture and Life Sciences, Cornell University. Seventy farms were Dairy Farm Business Summary (DFBS) cooperators from 1985 through 1993. Data for these seventy farms are analyzed in this study. Such a data set is critical in studying the dynamic effects of farm creditworthiness.⁴ The farms represent a segment of New York State dairy farms which value consistent, annual financial and management information. The financial information collected includes the essential components for deriving a complete set of the sixteen financial ratios and measures recommended by the Farm Financial Standards Council (FFSC).⁵ Additional farm productivity, cost management, and profitability statistics for these farms are summarized in Smith, Knoblauch, and Putnam.

⁴ Two types of estimation biases typically plague credit evaluation models: choice bias and selection bias. Choice bias occurs when the researcher first observes the dependent variable and then draws the sample based on that knowledge. This process of sample selection typically causes an "oversampling" of financially distressed firms. To overcome choice bias, this study selects the sample first and then calculates the dependent variable. The other type of bias plaguing credit evaluation models is selection bias. Selection bias is a function of the nonrandomness of the data and can asymptotically bias the model's parameters and probabilities (Heckman). Selection bias typically can affect credit evaluation models in two ways. First, financially distressed borrowers are less likely to keep accurate records; therefore, these borrowers would tend not to be included in the sample (Zmijewski). Second, when panel data are employed there may be attrition of borrowers from the sample. In this study, some borrowers probably participated in the DFBS program during the earlier years of the sample period, but exited the industry or stopped submitting records to the database before the end of the sample period. In analyzing financial distress models, Zmijewski found selection bias causes no significant changes in the overall classification and prediction rate. Given Zmijewski's results, the study does not correct for selection bias and proceeds to estimate the credit evaluation models with the data presented.

Creditworthiness Measures

A key value available in this data set was the planned/scheduled principal and interest payment on total debt. This variable reflects the borrower's expectations of debt obligations for the upcoming year. Having this component facilitates the calculation of the coverage ratio,⁶ an essential element of this study. The coverage ratio approximates whether the borrower generates enough income to meet all expected payments and is an indicator of creditworthiness. The coverage ratio is based on actual financial statements and has been introduced to credit-scoring models as a measure of creditworthiness, an alternative to loan classification and loan default models⁷ (Novak and LaDue, 1994; Khoju and Barry). This indicator of creditworthiness is aligned with cash-flow or performance-based lending, as opposed to the more traditional collateral-based lending, and its use has been facilitated by improvements in farm records and computerized loan analysis systems.

The coverage ratio, a quantitative indicator of creditworthiness, needs to be converted to a binary variable in order to assist the lender in making a decision to grant or deny a credit request. Therefore, in this study an a priori cutoff level of 1 is used. A coverage ratio greater (less) than 1 indicates that the borrower did (did not) generate enough income to meet all expected debt obligations. Thus, a coverage ratio greater (less) than 1 indicates a creditworthy (less creditworthy) borrower.⁸

In addition to the standard annual coverage ratio, two-year and three-year average coverage ratios are employed in this study. The two-year and three-year average coverage ratios were found to provide a more stable, extended indicator of creditworthiness (Novak and LaDue, 1997). Using the annual, two-year average, and three-year average measures of creditworthiness and an a priori cutoff value of one, the seventy farms are classified as creditworthy or less creditworthy. The number found to be creditworthy in any one year varied from 50 to 66 based on annual data. Using two-year averages, the number of creditworthy farms varied from 57 to 66 depending on the two-year period chosen. For three-year periods the number of creditworthy farms was 68, 65, and 57 for 1985–87, 1988–90, and 1991–93, respectively. The number of borrowers considered creditworthy decreases over time. Identifying a borrower with diminishing debt repayment ability prior to any serious financial problems exemplifies the usefulness of the creditworthiness indicator and should be of value to lenders when evaluating a borrower's credit risk or monitoring his/her overall loan portfolio.⁹

⁵ Some of the borrowers reported zero liabilities; therefore, their current ratio and coverage ratio could not be calculated. To retain these borrowers in the sample and avoid values of infinity, the current ratios were given a value of 7, indicating strong liquidity, and the coverage ratio value was bounded to the –4 to 15 interval. The bounded interval of the coverage ratio indicates both extremes of debt repayment capacity.

⁶ If not specified otherwise, the coverage ratio refers to the term debt and capital lease coverage ratio as defined by the FFSC.

⁷ Historically, agricultural credit evaluation models have been predicated on predicting bank examiners' or credit reviewers' loan classification schemes (Johnson and Hagan; Dunn and Frey; Hardy and Weed; Lufburrow, Barry, and Dixon; Hardy and Adrian; Hardy et al.; Turvey and Brown; Oltman). These studies have assessed the ability of statistical, mathematical, or judgmental methods to replicate expert judgment. However, these models present some problems when credit evaluation is concerned. It is difficult to determine whether the error is due to the model or to bank examiners' or credit reviewers' loan classification. These problems are not limited to agricultural credit scoring models (Marais et al.; Dietrich and Kaplan). Some agricultural credit scoring studies have used default (Miller and LaDue; Mortensen, Watt, and Leistritz). Default is inherently a more objective measure. However, lenders and borrowers can influence default classifications by decisions to forbear, restructure, or grant additional credit to repay a delinquent loan. Borrowers can influence or delay default by selling assets, depleting credit reserves, seeking off-farm employment, and other similar activities. Default is based on a single lender's criteria. Borrowers with split credit can be current with one lender and delinquent or in arrears with another lender. Additionally, the severity of some types of default, such as loan losses, makes it less than adequate. A lender would be better served to identify these borrowers before such action occurs. Because of these ambiguities surrounding default, an alternative cash-flow measure of creditworthiness is used.

⁸ The terminology "less creditworthy" is used instead of "not creditworthy" because it is recognized that the farms in the data sample have been in operation over a nine-year period and most of them have utilized some form of debt over this period. The sample represents borrowers from the Farm Service Agency, Farm Credit, and various private banks. The various lending institutions can be translated into varying degrees of creditworthiness among the borrowers in the sample. Creditworthiness to one lender may be less creditworthy to another. The data can be viewed as a compilation of lenders' portfolios.
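The classification rule described above (coverage ratio bounded to the –4 to 15 interval, zero-liability borrowers pushed to the upper bound per our reading of footnote 5, and an a priori cutoff of 1) can be sketched as a small helper; the function is our own illustration, not the authors' code.

```python
def creditworthiness(income_available, scheduled_payments):
    """Return (bounded coverage ratio, creditworthy flag).

    Bounds follow the paper: the coverage ratio is restricted to
    [-4, 15]; a borrower with no scheduled payments is assigned the
    upper bound (an assumption standing in for footnote 5's handling
    of zero-liability borrowers).
    """
    if scheduled_payments == 0:
        ratio = 15.0
    else:
        ratio = income_available / scheduled_payments
        ratio = max(-4.0, min(15.0, ratio))   # bound to [-4, 15]
    return ratio, ratio > 1.0                 # cutoff of 1

ratio, is_creditworthy = creditworthiness(36_000, 30_000)  # ratio 1.2
```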

Development of the Creditworthiness Model

In this section the annual, two-year average, and three-year average credit-scoring models are discussed. The annual model uses lagged characteristic values to classify creditworthy and less creditworthy borrowers. That is, the annual model is developed with pooled data using characteristic values for each year from 1985–89 to classify creditworthy and less creditworthy borrowers for the following years of 1986–90, respectively. The models are evaluated using 1990, 1991, and 1992 characteristic values to predict 1991, 1992, and 1993 creditworthy and less creditworthy borrowers' classifications, respectively. Finally, the predicted creditworthy classifications for 1991, 1992, and 1993 are compared to the actual classifications for the same time period to determine the intertemporal efficacy of the model.

The two-year average model is developed using 1985–86 and 1987–88 averages of the characteristic values to classify creditworthy borrowers in the average periods 1987–88 and 1989–90, respectively. The evaluation process then uses 1989–90 average characteristic values to predict 1991–92 average creditworthy and less creditworthy borrowers' classifications. The three-year average model is developed using 1985–87 average characteristic variables to classify 1988–90 average creditworthy and less creditworthy borrowers. The three-year average model is evaluated using 1988–90 average characteristic values to predict 1991–93 average creditworthy and less creditworthy borrowers. In both the two-year and three-year average models, the predicted classifications are compared to actual classifications for the same time period to determine the intertemporal efficacy of the models.

⁹ Granted, other factors (such as collateral offered and a borrower's credit history, personal attributes, and management ability) also influence credit risk. Many of the other factors listed have to be evaluated, in conjunction with the model, by the loan officer. Creditworthiness models are designed to assist, not replace, the loan officer in lending decisions.

RPA does not require individual characteristic variables to be selected in advance. It does, however, require selecting a predetermined group of variables. In this study, the 16 FFSC recommended ratios and measures were selected as the predetermined group of variables.10 Many of the variables in this predetermined group represent similar financial concepts, but are still included in the population set, allowing RPA to select the appropriate variables. In addition, the predicted probability of creditworthiness from the logistic regression model and the lagged classification variables were included in the predetermined group of variables.
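RPA's split selection over a predetermined variable pool can be illustrated with a minimal single-split (stump) search, which is all the annual tree in this study ultimately required. This is a simplified sketch, not the CART implementation the authors used: it considers only one split direction (larger values taken as creditworthy), and the function `best_split` and its cost arguments are hypothetical names.

```python
# Simplified, hypothetical sketch of cost-sensitive split selection over a
# predetermined pool of candidate ratios (stump search; values above the
# threshold are called creditworthy). Not the authors' code.

def best_split(rows, labels, cost_type1=3.0, cost_type2=1.0):
    """rows: dicts of ratio values per farm; labels: 1 = creditworthy.

    Returns (variable, threshold, cost) minimizing
    cost_type1 * (Type I errors) + cost_type2 * (Type II errors).
    """
    best = None
    for var in rows[0]:
        values = sorted({r[var] for r in rows})
        # Candidate thresholds midway between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            cost = 0.0
            for r, y in zip(rows, labels):
                predicted = 1 if r[var] > threshold else 0
                if predicted == 1 and y == 0:
                    cost += cost_type1  # Type I: less creditworthy called creditworthy
                elif predicted == 0 and y == 1:
                    cost += cost_type2  # Type II: creditworthy called less creditworthy
            if best is None or cost < best[2]:
                best = (var, threshold, cost)
    return best
```

Applied recursively to each child node, with pruning, the same search yields a full CART-style tree; with a 3:1 cost ratio and the lagged coverage ratio in the pool, a single application would reproduce a split of the kind shown in Figure 3.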

The logistic regression model requires that the characteristic or explanatory variables be selected in advance. As a result, this study follows previous studies and specifies a parsimonious credit-scoring model where a borrower's creditworthiness is a function of solvency, liquidity, and lagged debt repayment capacity (Miller and LaDue; Miller et al.; Novak and LaDue, 1997). The specific variables used in the model are the debt-to-asset ratio, the current ratio, and a lagged dependent variable.11

10 All 16 FFSC recommended ratios and measures were included in the analysis even though two of the variables, the debt/asset ratio and the equity/asset ratio, carry identical information. The choice to include all 16 ratios and measures was based on consistency and completeness.

Both estimation methods require the specification of a prior probability. In this study, the proportion of creditworthy borrowers in the total sample determines the prior probability. The values are 0.852, 0.896, and 0.905 for the annual, two-year average, and three-year average periods, respectively. The prior probabilities for the average periods demonstrate that the percentage of creditworthy borrowers in the sample data set increases as the average period lengthens.

In addition to prior probabilities, misclassification costs also need to be specified. Previous agricultural credit-scoring models, except for Ziari, Leatham, and Turvey,12 either ignore misclassification costs or assume they are equal. It is not reasonable to assume that the misclassification costs are equal for all types of decisions. The cost of granting, or renewing, a loan to a less creditworthy borrower is typically greater than the cost of denying, or not renewing, a loan to a creditworthy borrower. Estimating these misclassification costs is beyond the scope of this study and the data, but the study does illustrate the classification sensitivity of these costs. The relative costs of Type I and Type II misclassification errors are accordingly varied over 1:1, 2:1, 3:1, 4:1, and 5:1, with the relatively higher misclassification cost placed on the Type I error.13 While the less creditworthy measure used in this model may not be as serious as actual loan losses or bankruptcy of a borrower, there is still a higher cost associated with loan servicing and payment collection for less creditworthy borrowers.

11 Two other logistic regression models, a stepwise and an "eight variable" model (the latter was presented in Novak and LaDue, 1994), were also estimated for annual, two-year, and three-year average periods. The results are not reported because the parameters did not always have the expected signs, and the within-sample and out-of-sample prediction rates were lower than RPA's and the parsimonious (three-variable) logit model's prediction rates for all the comparable time periods.

12 Ziari, Leatham, and Turvey assume the misclassification cost for a noncurrent loan is twice as much as that for a current loan.

13 A Type I error is a less creditworthy borrower classified as a creditworthy borrower; a Type II error is a creditworthy borrower classified as a less creditworthy borrower.

[Figure 3. RPA Tree Using Annual Data. Root node: 350 observations, split on the coverage ratio at 1.50; 83 observations (coverage ratio < 1.50) classified less creditworthy, 267 observations (coverage ratio > 1.50) classified creditworthy. Surrogate variables (split values): 1. capital replacement and term debt repayment margin ($18,552); 2. net farm income from operations ratio (0.151); 3. binary lagged dependent variable (0.500); 4. predicted probability of creditworthiness (0.837); 5. operating expense ratio (0.747). Competitor variables (split values): 1. capital replacement and term debt repayment margin ($18,419); 2. debt/equity ratio (0.408); 3. debt/asset ratio (0.290); 4. operating expense ratio (0.640); 5. operating profit margin ratio (0.152).]

Comparison of RPA and Logit Model Results

Figure 3 presents the classification tree generated from the RPA for the annual time period when the misclassification cost of a Type I error is three times greater than that of a Type II error (i.e., 3:1). The model is simple: it consists of the coverage ratio lagged one period. Borrowers with a coverage ratio greater than 1.50 a year prior are classified as creditworthy, and borrowers with a coverage ratio less than 1.50 a year prior are classified as less creditworthy. Put differently, to ensure all payments will be made by the borrower in the next year, the current coverage ratio needs to be greater than 1.50.

In the same figure, below the classification tree, five surrogate variables are listed. These variables were selected on their ability to mimic the selected variable, the coverage ratio, and its optimal split value of 1.50. The repayment margin, net farm income from operations, binary lagged dependent variable, predicted probability of creditworthiness, and operating expense ratio were identified as surrogate variables. The selection of the predicted probability of creditworthiness from the logistic regression adds some additional validity to the use of this variable as a credit score. Also noteworthy is that the split value of the predicted probability of creditworthiness is very similar to the prior probability for the annual sample period.

A list of competitor variables is also presented in the same figure. The repayment margin was listed as the first competitor variable. A competitor variable indicates that, if the selected variable (i.e., the coverage ratio) were restricted or eliminated from the sample, the repayment margin (the first competitor variable) would have been chosen as the selected variable in the classification tree. The other competitor variables selected were the debt-to-equity ratio, debt-to-asset ratio, operating expense ratio, and operating profit margin ratio.

[Figure 4. RPA Tree Using Two-Year Average Data. Parent node: 140 observations, split on the repayment margin at $17,759; 26 observations (repayment margin < $17,759) classified less creditworthy, 114 observations (repayment margin > $17,759) classified creditworthy. Surrogate variables (split values): 1. term debt and capital lease coverage ratio (1.405); 2. predicted probability of creditworthiness (0.818); 3. binary lagged dependent variable (0.500); 4. net farm income ($22,922); 5. interest expense ratio (0.158). Competitor variables (split values): 1. term debt and capital lease coverage ratio (1.698); 2. operating expense ratio (0.749); 3. predicted probability of creditworthiness (0.853); 4. rate of return on equity (0.013); 5. net farm income ($69,172).]

[Figure 5. RPA Tree Using Three-Year Average Data. Parent node: 70 observations, split on the repayment margin at $21,568; 11 observations (repayment margin < $21,568) classified less creditworthy, 59 observations (repayment margin > $21,568) classified creditworthy. Surrogate variables (split values): 1. term debt and capital lease coverage ratio (1.429); 2. operating expense ratio (0.748); 3. net farm income ($22,265); 4. rate of return on assets (0.046); 5. current ratio (0.856). Competitor variables (split values): 1. term debt and capital lease coverage ratio (1.663); 2. operating expense ratio (0.748); 3. rate of return on assets (0.046); 4. interest expense ratio (0.277); 5. operating profit margin ratio (0.158).]

Figure 4 presents the two-year average classification tree, again using a 3:1 relative misclassification cost ratio, with the higher misclassification cost attributed to a Type I error. In this classification tree the repayment margin was selected as the characteristic variable and the coverage ratio was selected as a competitor and surrogate variable. Similar to the annual model, the binary lagged dependent variable and predicted probability of creditworthiness were selected as surrogate variables. The other surrogate and competitive variables selected were net farm income, interest expense ratio, operating expense ratio, and return on equity.

Figure 5 presents the classification tree for the three-year average period. Similar to the previous two trees, a 3:1 relative misclassification cost ratio is used. The repayment margin was selected as the primary characteristic variable and the coverage ratio was selected as the first surrogate and competitor variable. In this average time period, the binary lagged dependent variable and predicted probability were not selected as either competitor or surrogate variables. The selected surrogate and competitor variables were the operating expense ratio, net farm income, rate of return on assets, current ratio, interest expense ratio, and operating profit margin ratio.

Table 1. Logistic Parameter Estimates of Creditworthiness Models

Variables              Annual          Two-Year        Three-Year
Intercept               2.02 (0.01)a    0.70 (0.59)     0.39 (0.09)
Debt/Asset Ratio       -1.90 (0.03)    -1.72 (0.26)    -0.92 (0.73)
Current Ratio           0.03 (0.78)     0.15 (0.51)     0.13 (0.72)
Lagged Dep. Var.        0.96 (0.05)     2.26 (0.01)     2.36 (0.21)
Model X2               14.26           18.71            6.16
Prior Probabilities     0.852           0.896           0.905

a p-values are reported in parentheses.

The results are consistent with expectations. In general, most of the surrogate or competitive variables, especially in the two-year and three-year time periods, represent a borrower's repayment capacity, financial efficiency, or profitability. The best indicator of creditworthiness is repayment capacity, and repayment capacity is predicated on operating profits and losses, hence on profitability and financial efficiency.

The actual classification trees may at first appear to be a concern. The classification trees have a low number of characteristic variables, and in some cases the naive model is selected when relative misclassification costs are low.14 However, this is consistent with other studies. Frydman, Altman, and Kao found the naive model also did best in classifying their data when misclassification costs were assumed equal, and found that the cross-validation classification trees had considerably fewer splits than the non-cross-validation classification trees. The largest cross-validation classification tree they estimated had a maximum of three splits. In their study, the non-cross-validation trees were presented for exposition purposes. These trees are aesthetically more appealing: they are not pruned, have considerably more characteristic values, and classify more observations, but of course generalize less well outside the sample data.

14 RPA selects the naive model when the annual data are used and misclassification costs are 1:1 and 2:1, and when the two-year average data are used and misclassification costs are 1:1.

The parameters of the logistic regression models are presented in Table 1. All the parameters for each of the models have the expected sign. In the annual model the debt-to-asset ratio and the lagged dependent variable parameters are significant at the 95% level. In the two-year average model the lagged dependent variable is significant at the 99% level. None of the variables is statistically significant in the three-year average model.
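Taken at face value, the annual estimates in Table 1 imply a simple scoring rule: a borrower's predicted probability of creditworthiness is the logistic transform of the linear index formed from the reported coefficients. The helper name and the cutoff suggestion below are illustrative, not from the paper.

```python
import math

# Logit score using the annual-model coefficients reported in Table 1:
# intercept 2.02, debt/asset ratio -1.90, current ratio 0.03, and lagged
# dependent variable 0.96 (1 if classified creditworthy the prior year).
# Function name and inputs are illustrative, not the authors' notation.

def predicted_probability(debt_asset, current_ratio, lagged_class):
    index = 2.02 - 1.90 * debt_asset + 0.03 * current_ratio + 0.96 * lagged_class
    return 1.0 / (1.0 + math.exp(-index))
```

A lender could classify a borrower as creditworthy whenever this score exceeds a chosen cutoff; the RPA split value of roughly 0.84 on this variable, close to the 0.852 prior probability, suggests one natural choice.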

Table 2 presents the expected costs of misclassification for each model and level of relative misclassification cost. The RPA model, not surprisingly, does best at minimizing the expected misclassification cost for the within-sample time periods for all relative misclassification cost scenarios. The objective of RPA is to minimize the expected cost of misclassification, while the objective of the logistic regression is to maximize the likelihood function for the specific data set, regardless of misclassification costs. Based on the RPA objective, the nonagricultural financial stress studies have concluded that RPA is a better model than other models. If this study were to conclude here, it would also conclude RPA is a better method of classification. However, this study continues by comparing intertemporal, out-of-sample observations.

Table 2. Expected Cost of Misclassification(a) for the RPA and Logistic Regression Models

Cost Based on Within-Sample Observations (1985–1990)

Relative          RPA                             Logistic Regression(c)
Costs(d)    1-Year    2-Year    3-Year      1-Year    2-Year    3-Year
1:1         0.150(b)  0.100(b)  0.014       0.198     0.134     0.110
2:1         0.300(b)  0.122     0.014       0.303     0.184     0.164
3:1         0.314     0.131     0.014       0.408     0.234     0.218
4:1         0.364     0.139     0.014       0.512     0.284     0.272
5:1         0.414     0.147     0.014       0.617     0.334     0.326

Cost Based on Out-of-Sample Observations (1991–1993)

Relative          RPA                             Logistic Regression(c)
Costs(d)    1-Year    2-Year    3-Year      1-Year    2-Year    3-Year
1:1         0.150(b)  0.100(b)  0.080       0.207     0.117     0.087
2:1         0.300(b)  0.234     0.129       0.332     0.171     0.143
3:1         0.314     0.295     0.177       0.457     0.225     0.198
4:1         0.364     0.357     0.226       0.582     0.279     0.254
5:1         0.414     0.418     0.274       0.707     0.332     0.309

Annual Model, 1992 Out-of-Sample Observations

Relative
Costs(d)    RPA       Logistic Regression(c)
1:1         0.150(b)  0.189
2:1         0.300(b)  0.235
3:1         0.338     0.282
4:1         0.366     0.329
5:1         0.395     0.376

Annual Model, 1993 Out-of-Sample Observations

Relative
Costs(d)    RPA       Logistic Regression(c)
1:1         0.150(b)  0.151
2:1         0.300(b)  0.233
3:1         0.356     0.316
4:1         0.401     0.398
5:1         0.446     0.481

(a) See endnote 2 for the cost of misclassification calculation.
(b) Represents the naive model.
(c) The logistic regression does not explicitly account for cost of misclassification during the development of the model. For comparison purposes, the expected cost of misclassification is calculated by keeping the number of misclassified borrowers constant and varying the relative misclassification cost scenarios for each model.
(d) Relative cost of Type I and Type II misclassification errors (cost of granting credit to a less creditworthy borrower : cost of not granting credit to a creditworthy borrower).
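The expected-cost calculation behind Table 2 is defined in the paper's endnote 2, which lies outside this excerpt. As a hedged illustration only, a formulation in the spirit of Frydman, Altman, and Kao weights Type I errors by the relative cost ratio:

```python
# Hedged illustration only: expected cost of misclassification when a Type I
# error (granting credit to a less creditworthy borrower) costs `ratio` times
# a Type II error (denying credit to a creditworthy borrower). The paper's
# exact formula is in its endnote 2, not reproduced in this excerpt.

def expected_cost(n_type1, n_type2, n_total, ratio):
    return (ratio * n_type1 + n_type2) / n_total
```

At a 1:1 ratio this collapses to the overall misclassification rate, which is why studies that assume equal costs are effectively just counting errors, and why raising the cost ratio can reverse a model ranking even when the error counts are fixed.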

Using the annual time period data, the RPA model performs best in 1991 for all relative misclassification cost scenarios, and in 1992 and 1993 when the misclassification costs are equal. The annual RPA model with equal misclassification costs is also the naive model. It is interesting to note that previous agricultural credit-scoring studies typically have assumed equal misclassification costs, but did not always compare the estimated model's results with the naive model. In this case, the naive model outperforms the logistic regression model. Nevertheless, the assumption that misclassification costs are equal is not very realistic in credit screening models.

Using the same annual data, the logistic regression model does best at minimizing expected cost of misclassification when misclassification costs are not assumed to be equal. Logistic regression also does best at minimizing the expected cost of misclassification using the two-year average out-of-sample data for each relative misclassification cost scenario, except when misclassification costs are equal. When misclassification costs are equal, then RPA, represented by the naive model, does better. Finally, RPA does best at minimizing the expected cost of misclassification using the three-year average out-of-sample data for each of the relative misclassification cost scenarios. From these results we cannot conclude that either model is superior using this data set. A different data set may yield different results and would warrant exploration.

Conclusion

This study introduces RPA to agricultural credit-scoring. The study also demonstrates RPA's advantages and disadvantages in relation to logistic regression. The advantages of RPA include not requiring pre-selected variables, provision of the univariate attributes of individual variables, not being affected by outliers, provision of surrogate and competitive variable summary lists, and explicit incorporation of misclassification costs. On the other hand, logistic regression possesses some desirable advantages over RPA, such as the availability of overall summary statistics and an individual quantitative credit score for each observation.

More significantly, the study only partially corroborates the results of the non-agricultural credit classification studies. RPA outperforms logistic regression when the RPA models are selected and compared using cross-validation methods and expected cost of misclassification and the evaluation is based on within-sample observations. However, when the validation process is taken one step further and uses intertemporal (out-of-sample) minimization of expected cost of misclassification as the evaluation method, the same results are not achieved. In some cases RPA outperforms logistic regression and, in other cases, logistic regression outperforms the RPA model. Given the normal use of credit-scoring models, out-of-sample evaluation is most appropriate. These findings suggest that cross-validation may not be sufficiently effective to surmount potential overfitting of the sample data, which limits RPA's intertemporal predictive ability.

This study also considers relative misclassification costs. Previously, agricultural credit-scoring research has generally (except for Ziari, Leatham, and Turvey) evaluated models based on the number of misclassified observations, and has not considered minimizing expected costs of misclassification. The results of this study indicate that misclassification costs can affect the development of the RPA model. Future agricultural credit-scoring research should consider minimizing expected costs of misclassification, instead of minimizing misclassified observations, to evaluate models. Similarly, effort should be made towards calculating actual misclassification costs, instead of using relative misclassification costs.

Finally, while the study has taken strides in introducing RPA to agricultural credit-scoring, the conclusion of RPA's superior performance is not as convincing as the non-agricultural financial stress literature's results. However, RPA does appear to be superior in some situations. Further testing and model refinements are suggested. From a practical standpoint, RPA presents several attractive features and can be employed in conjunction with other existing methods.

References

Alcott, K.W. "An Agricultural Loan Rating System." The Journal of Commercial Bank Lending, February 1985.

Betubiza, E. and D.J. Leatham. "A Review of Agricultural Credit Assessment Research and Annotated Bibliography." Texas Experiment Station, Texas A&M University System, College Station, TX, June 1990.

Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Belmont, CA: Wadsworth International Group, 1984.

Carson, R., M. Hanemann, and D. Steinberg. "A Discrete Choice Contingent Valuation Estimate of the Value of Kenai King Salmon." The Journal of Behavioral Economics 19(1990):53–68.

Dietrich, J.R. and R.S. Kaplan. "Empirical Analysis of the Commercial Loan Classification Decision." The Accounting Review 57(1982):18–38.

Dunn, D.J. and T.L. Frey. "Discriminant Analysis of Loans for Cash Grain Farms." Agricultural Finance Review 36(1976):60–66.

Farm Financial Standards Council. Financial Guidelines for Agricultural Producers: Recommendations of the Farm Financial Standards Council, (Revised) 1995.

Friedman, J.H. "A Recursive Partitioning Decision Rule for Nonparametric Classification." IEEE Transactions on Computers, April (1977):404–409.

Frydman, H., E.I. Altman, and D. Kao. "Introducing Recursive Partitioning for Financial Classification: The Case of Financial Distress." The Journal of Finance 40(1985):269–291.

Grubb, T.G. and R.M. King. "Assessing Human Disturbance of Breeding Bald Eagles with Classification Tree Models." The Journal of Wildlife Management 55(1991):500–511.

Hardy, W.E., Jr. and J.L. Adrian, Jr. "A Linear Programming Alternative to Discriminant Analysis in Credit Scoring." Agribusiness 1(1985):285–292.

Hardy, W.E., Jr., S.R. Spurlock, D.R. Parrish, and L.A. Benoist. "An Analysis of Factors that Affect the Quality of Federal Land Bank Loans." Southern Journal of Agricultural Economics 19(1987):175–182.

Hardy, W.E. and J.B. Weed. "Objective Evaluation for Agricultural Lending." Southern Journal of Agricultural Economics 12(1980):159–164.

Heckman, J.J. "Sample Selection Bias as a Specification Error." Econometrica 47(1979):153–162.

Johnson, R.B. and A.R. Hagan. "Agricultural Loan Evaluation with Discriminant Analysis." Southern Journal of Agricultural Economics 5(1973):57–62.

Joy, O.M. and J.O. Tollefson. "On the Financial Applications of Discriminant Analysis." Journal of Financial and Quantitative Analysis 10(1975):723–740.

Khoju, M.R. and P.J. Barry. "Business Performance Based Credit Scoring Models: A New Approach to Credit Evaluation." Proceedings, North Central Region Project NC-207, "Regulatory Efficiency and Management Issues Affecting Rural Financial Markets," Federal Reserve Bank of Chicago, Chicago, IL, October 4–5, 1993.

LaDue, Eddy L., Warren F. Lee, Steven D. Hanson, and David Kohl. "Credit Evaluation Procedures at Agricultural Banks in the Northeast and Eastern Cornbelt." Agricultural Economics Resources 92-3, Cornell University, Department of Agricultural Economics, February 1992.

Lufburrow, J., P.J. Barry, and B.L. Dixon. "Credit Scoring for Farm Loan Pricing." Agricultural Finance Review 44(1984):8–14.

Maddala, G.S. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press, 1983.

Maddala, G.S. "Econometric Issues in the Empirical Analysis of Thrift Institutions' Insolvency and Failure." Federal Home Loan Bank Board, Invited Research Working Paper 56, October 1986.

Marais, M.L., J.M. Patell, and M.A. Wolfson. "The Experimental Design of Classification Models: An Application of Recursive Partitioning and Bootstrapping to Commercial Bank Loan Classifications." Journal of Accounting Research Supplement 22(1984):87–114.

McFadden, D. "A Comment on Discriminant Analysis versus Logit Analysis." Annals of Economics and Social Measurement 5(1976):511–523.

Miller, L.H., P.J. Barry, C. DeVuyst, D.A. Lins, and B.J. Sherrick. "Farmer Mac Credit Risk and Capital Adequacy." Agricultural Finance Review 54(1994):66–79.

Miller, L.H. and E.L. LaDue. "Credit Assessment Models for Farm Borrowers: A Logit Analysis." Agricultural Finance Review 49(1989):22–36.

Mortensen, T.D., L. Watt, and F.L. Leistritz. "Predicting Probability of Loan Default." Agricultural Finance Review 48(1988):60–76.

Novak, M.P. and E.L. LaDue. "An Analysis of Multiperiod Agricultural Credit Evaluation Models for New York Dairy Farms." Agricultural Finance Review 54(1994):47–57.

Novak, M.P. and E.L. LaDue. "Stabilizing and Extending Qualitative and Quantitative Measures in Multiperiod Agricultural Credit Evaluation Models." Agricultural Finance Review 57(1997):39–52.

Oltman, A.W. "Aggregate Loan Quality Assessment in the Search for a Related Credit-Scoring Model." Agricultural Finance Review 54(1994):94–107.

Smith, S.F., W.A. Knoblauch, and L.D. Putnam. "Dairy Farm Management Business Summary, New York State, 1993." R.B. 94-07, Department of Agricultural, Resource, and Managerial Economics, Cornell University, Ithaca, NY, September 1994.

Splett, N.S., P.J. Barry, B.L. Dixon, and P.N. Ellinger. "A Joint Experience and Statistical Approach to Credit Scoring." Agricultural Finance Review 54(1994):39–54.

Srinivasan, V. and Y.H. Kim. "Credit Granting: A Comparative Analysis of Classification Procedures." Journal of Finance 42(1987):665–681.

Steinberg, D. and P. Colla. CART: Tree-Structured Non-Parametric Data Analysis. San Diego, CA: Salford Systems, 1995.

Tronstad, R. and R. Gum. "Cow Culling Decisions Adapted for Management with CART." American Journal of Agricultural Economics 76(1994):237–249.

Turvey, C.G. "Credit Scoring for Agricultural Loans: A Review with Application." Agricultural Finance Review 51(1991):43–54.

Turvey, C.G. and R. Brown. "Credit Scoring for Federal Lending Institutions: The Case of Canada's Farm Credit Corporation." Agricultural Finance Review 50(1990):47–57.

Ziari, H.A., D.J. Leatham, and C.G. Turvey. "Application of Mathematical Programming Techniques in Credit Scoring of Agricultural Loans." Agricultural Finance Review 55(1995):74–88.

Zmijewski, M.E. "Methodological Issues Related to the Estimation of Financial Distress Prediction Models." Journal of Accounting Research Supplement 22(1984):59–86.