481 - Report - Eric Spamer

Predict 422 – Eric Spamer

Direct Marketing Classification and Predictive Modeling Eric Spamer

Introduction

We have received approximately 6,000 full records from a charitable organization that

wishes to improve its response rate for its direct marketing campaign. We have all of the

relevant cost information to assist the client using predictive modeling to perform a profit

maximization. The client has provided 2,007 potential donors to us. The client sent all the

relevant attributes, and the organization wants to know who to send mail to.

In order to determine the best individuals to reach out to, we will conduct classification

modeling using the 6,000 records that note whether or not someone donated. This will enable

us to determine the likelihood of whether someone in the test set of 2,007 is willing to donate.

This will also enable us to calculate the total number of mailings that maximizes net profit.

Assessing a model in the way it is going to be used is integral, and unfortunately analysts do not

always frame the question this way.

In order for the charitable organization to know how much they should expect in

donations, we will also create a predictive regression model using the data from only the

donors. For both the classification modeling and predictive regression modeling, we tried

many different modeling techniques. We believe we have created models that will serve the

charitable organization very well.

Exploratory Data Analysis:

We begin this task with Exploratory Data Analysis (“EDA”). We want to learn as much

about the data as possible. Missing values and outliers are very impactful to a regression

analysis, so we want to think about these issues throughout the EDA process. There are several

updates (or “imputations”) we make based on our data exploration in order to maximize

the usefulness of the data. We also create interaction groups based on decision trees in order to

find additional predictive data.

Prior to starting our EDA, we want the data to be split into training data, testing data,

and validation data. We create a split with 3,984 training records, 2,018 validation records, and

2,007 test records. The training, validation, and test data all begin with 21 variables.

Imputation (Fixing Missing Values)

The data is very clean, so we do not have to perform imputation on missing values.

Data Transformations

We analyze every variable in the dataset by creating histograms. When analyzing

histograms, we want to look at the shapes of the variables to see if transformations should be

applicable. We run histograms on all of the independent variables to analyze where a

transformation may help improve the fit of the variables. We see approximately

eight cases where a natural log transformation could make the independent variables have a

more normal distribution. This is important, because some statistical methods can struggle

when predictors are highly skewed. Log transformations may improve the performance of the


variables, however we also want to try Box Cox transformations. The Box Cox will find the

optimal power of transformation for the variable. We do not want to use the log transformation

and Box Cox transformation of a variable at the same time – we would want to choose one or

the other. For the variables in this analysis, we see the results are very similar, so we choose the

log transformations. The plots below illustrate that the relationship between the Box Cox

transformation and the log transformation is the same when compared to the original incm (we

provide three additional comparison plots within the Appendix):

Transformations of variables can be very beneficial to improving a model’s predictive

capability. Derived variables can often show more of the true relationship between variables.

We run decision trees to find meaningful interactions between the variables and we include

these. During several model selection runs, we also try polynomials for many of the variables,

since these transformations can provide an improved fit.
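The similarity between the log and Box Cox results noted above can be illustrated with a minimal sketch. It uses the standard Box Cox formula and shows that the transform approaches the natural log as the power (lambda) approaches zero. The input values are hypothetical and are not taken from the client data, and the function name box_cox is our own.

```python
import math

def box_cox(x, lam):
    """Box Cox power transform of a positive value x with power lam."""
    if x <= 0:
        raise ValueError("Box Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

# Hypothetical right-skewed values (illustration only, not client data).
values = [12.0, 25.0, 43.0, 80.0, 150.0, 287.0]

# As lam shrinks toward zero, the gap to the natural log shrinks as well.
for lam in (1.0, 0.5, 0.1, 0.01):
    gap = max(abs(box_cox(v, lam) - math.log(v)) for v in values)
    print(f"lambda={lam:<5} max |box_cox - log| = {gap:.4f}")
```

When the estimated optimal power is close to zero, the two transformations are nearly interchangeable, which is consistent with choosing the simpler log transformation.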

Interaction Variables

We create decision trees on the training data to help determine if there are interaction

variables that we should be including in the model. Please find the tree for DONR below.

When we run on damt instead of donr, we see a similar decision tree, so we are satisfied with

the original tree. Please see the Appendix for a comparison between the trees.

We see the first split is on whether there is a

child or not. The next split is whether the individual is

a homeowner or not. We will include a variable that

interacts owning a home and having a child. The next

split is on region2 with a child, so we will utilize an

interaction on that as well. We also see a split on hinc

of 2.5, which is the household income category

variable. We also create a variable that notes whether

hinc is above or below 2.5. We see 6,466 cases where

hinc is above 2.5.
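The derived fields described above can be sketched as follows. The field names chld, home, reg2, and hinc come from the data, and the INT_ names match those used later in this report. We assume here that each interaction is a simple product of the two fields, and the indicator name hinc_high is hypothetical.

```python
def add_interactions(record):
    """Return a copy of one record with the derived fields added."""
    out = dict(record)
    # Interactions are assumed to be simple products of the two fields.
    out["INT_chld_home"] = record["chld"] * record["home"]
    out["INT_chld_reg2"] = record["chld"] * record["reg2"]
    out["INT_chld_hinc"] = record["chld"] * record["hinc"]
    # Indicator for the tree split at hinc = 2.5 (hypothetical field name).
    out["hinc_high"] = 1 if record["hinc"] > 2.5 else 0
    return out

# Hypothetical record for illustration only.
example = {"chld": 2, "home": 1, "reg2": 0, "hinc": 4}
print(add_interactions(example))
```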

If we are performing modeling for inference,

we want to stay away from interaction terms during

logistic regression. However we include these


interaction terms, because we will also be building a predictive model using linear regression

and prediction is our main goal, not inference. We do not need to include our interaction

variables for decision trees and random forests, since those algorithms will include interactions.

Data Standardization

We split into test/training set, compute parameters for standardization/decomposition

on training set (e.g. mean and variance of training set for standardization) and apply the same

parameters on test set. For standardization this could mean that the test set does not have zero

mean / unit variance, but that is fine. Looking at the test set to transform the training set is

usually considered bad practice.
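A minimal sketch of this standardization approach is shown below. The mean and standard deviation are computed on the training values only and then reused for the test values. The numbers and function names are hypothetical.

```python
import statistics

def fit_scaler(train_values):
    """Compute standardization parameters from the training set only."""
    return statistics.fmean(train_values), statistics.stdev(train_values)

def apply_scaler(values, mean, sd):
    """Standardize any set of values with previously fitted parameters."""
    return [(v - mean) / sd for v in values]

# Hypothetical values for one variable (illustration only).
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [4.0, 6.0]

mean, sd = fit_scaler(train)
train_scaled = apply_scaler(train, mean, sd)
test_scaled = apply_scaler(test, mean, sd)

print(round(statistics.fmean(train_scaled), 6))  # training mean is zero
print(round(statistics.fmean(test_scaled), 6))   # test mean need not be zero
```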

Results ‐ Classification Modeling

We begin by performing classification modeling. The data contains variables that can

model whether or not someone will donate. For this modeling, weighted sampling has been

implemented, so more weight has been given to donors. We correct for that before determining

the number of mailings that will maximize net profits.

Logistic regression is often the most popular choice when there are two classes (K = 2).

We will run LDA first and then logistic regression. LDA is often better than logistic regression

under certain conditions. These include when n is small, when there are more than two classes,

and when the classes are well separated and have Gaussian distributions. When p is very

high, Naïve Bayes may outperform the other techniques, however p is not very large here, so

we do not run Naïve Bayes.

After creating these models, we plug in the average donation amount ($14.50), the

mailing cost ($2.00), and the likelihood of response based on our classification model. It is not

wise to send mail to everyone since the average response rate is only 10%, which results in a net

loss of $0.55 per mailing. We will attempt to maximize net profits by creating models using the

training data, calculating the posterior probability of donation using the validation data, sorting

by the highest likelihood donors, and then reaching a stopping point once the costs outweigh

the gains.
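The profit logic described above can be sketched as follows, using the $14.50 average donation and $2.00 mailing cost. The ranking function is a simplified illustration under our own naming, and the probabilities and outcomes in the example are hypothetical.

```python
DONATION = 14.50  # average donation amount
MAIL_COST = 2.00  # cost per mailing

def expected_profit_per_mailing(response_rate):
    """Expected net profit of one mailing at a given response rate."""
    return response_rate * DONATION - MAIL_COST

def best_mailing_count(probabilities, donated):
    """Rank records by predicted probability and return the number of
    mailings (and profit) where cumulative net profit peaks."""
    ranked = sorted(zip(probabilities, donated), key=lambda p: p[0], reverse=True)
    best_n, best_profit, running = 0, 0.0, 0.0
    for n, (_, is_donor) in enumerate(ranked, start=1):
        running += DONATION * is_donor - MAIL_COST
        if running > best_profit:
            best_n, best_profit = n, running
    return best_n, best_profit

# Mailing everyone at a 10% response rate loses money on average.
print(round(expected_profit_per_mailing(0.10), 2))  # -0.55

# Hypothetical validation scores and outcomes (illustration only).
probs = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
actual = [1, 1, 0, 1, 0, 0]
print(best_mailing_count(probs, actual))  # (4, 35.5)
```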

Classification ‐ LDA:

It is our goal to find the best possible predictive model. LDA is not designed to be used

with qualitative predictors, such as whether someone is a homeowner or not, however we overlook

that here in search of the best model.

We run several LDA models. The first includes the log transformations we performed

and the second supplements that with the three interaction variables and the high income

indicator variable we created (four total additional variables). The third model includes all of

the aforementioned fields plus 12 additional polynomials on the independent variables.

We see the third model – the most complex – performed the best. It finds that the

number of mailings that maximizes profits is 1,255, which is the lowest of the three models.

However it is the best at finding the correct true positives, as 978 of the 1,255 mailings were

actual donors. This precision, or positive predictive value, is 78%.


The plot of the third LDA model (right) shows how

profits change as more mailings are sent. Sending too

many mailings eventually starts decreasing the net profit:

Classification ‐ Logistic Regression:

The LDA results look like they will provide a

large amount of profit, however logistic regression is a

very popular technique especially when there are two classes, such as there are in this case

(donor or not donor). We run nine different logistic regression models. This process includes

backwards stepwise selection, which helps eliminate unimportant independent variables that are adding

more noise than signal to the predictive process.

The first model includes all fields with some logarithmic transformations. The second

model includes all fields plus the three interaction terms and the one indicator variable on

household income level. The third model includes a backwards selection process based on

Akaike information criterion (AIC). The third model removes variables that are not important,

however it leaves in two variables that are still not statistically significant at a 5% level. The

fourth model removes these two variables. The fifth and sixth models include backward

stepwise with the addition of polynomials up to the power two for 12 additional variables.

The seventh and eighth models include backward stepwise with the addition of polynomials

up to the power three for 12 more variables. The sixth and eighth models take their

predecessor model and remove variables that are not statistically‐significant at the 5% level.

A small value of AIC is what we are searching for. The calculation penalizes models

with lots of parameters and penalizes models for poor fit as well. The full model has AIC:

2167.9 and the backwards selection leads to a 15 variable model with an AIC of 2153.23. As

noted above, two of these 15 variables have a p‐value of 0.08 (tgif and genf), so we run a fourth

model excluding these. This fourth model leads to a higher AIC of 2156. This is expected, since

the backwards selection process is attempting to minimize AIC.
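The trade-off between fit and model size can be illustrated with the standard definition, AIC = 2k - 2 ln(L), where k is the number of parameters and L is the maximized likelihood. The log-likelihood values below are hypothetical and are chosen only to show the mechanics.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical log-likelihoods (illustration only, not the report's models).
full = aic(log_likelihood=-1000.0, n_params=10)
drop_weak = aic(log_likelihood=-1000.5, n_params=9)    # tiny loss of fit
drop_useful = aic(log_likelihood=-1003.0, n_params=8)  # larger loss of fit

print(full, drop_weak, drop_useful)  # 2020.0 2019.0 2022.0
```

Removing a variable lowers AIC only when the loss in fit (the increase in -2 ln(L)) is smaller than the penalty of 2 saved by dropping one parameter. A variable kept by AIC-based selection therefore raises AIC when it is removed afterwards, even if its p-value is above 0.05.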


Model #5 is backward stepwise with 12 more polynomials. Instead of adding a

polynomial simply for hinc, we add polynomials for 12 additional variables that have a

distribution other than 0 and 1. This leads to an AIC of 2002.1 with 21 variables. Model #6

removes four variables from Model #5 that are not statistically significant at a 5% level (p‐values

above 0.05). The variables removed are genf, npro^2, tgif^2, and

rgif, which leads to a 17 variable model with an AIC of 2011.

Model #7 and Model #8 are similar to Models #5 and #6, however they involve taking 12

variables to the power of three before running the stepwise regression. This leads to 49

potential variables, of which 26 are selected to be included in the model. This results in an AIC

of 1,986.1. Model #8 removes three variables from Model #7 that are not statistically significant:

genf, plow^3, and agif^3. This leads to an AIC slightly higher than Model #7: 1,987. Having three

fewer variables may prove to be beneficial on a validation data set, however.

Finally for the ninth model, we utilize the variables from Model #8 but we fill in

underlying variables for polynomials – meaning if the backwards selection said only the third

degree polynomial of a variable was significant, we add in the first and second degree

polynomial since that is traditionally more appropriate.

When we run these nine logistic models against the validation data, we get a better idea

of how these models would perform on new records. A summary of all nine logistic regression

models can be found in the Appendix. Please find a summary of the best models below:

We see that the two models with the lowest AIC prove to be the two best models on the validation data’s

maximum profit calculation. Models #7 and #8 have the highest profit and lowest AIC. The

precision or positive predictive value is true positives / predicted positives, which is 996 / 1335 =

74.6% for Model #7 and 75.9% for Model #8.

Two potential problems for fitting models are multicollinearity and autocorrelation of

errors. For multicollinearity, we perform a Variance Inflation Factors (VIF) check on the best

logistic regression models. For logistic model #7, we see a few variables with potential issues:

INT_chld_home, INT_chld_reg2, INT_chld_hinc and chld. We see similar issues for model #8.

Multicollinearity is more important for inference than prediction, so it is not a large issue here.

For checking the independence of errors, we perform the Durbin‐Watson test. This will help us

understand if we have autocorrelation in our errors. For our two best logistic regression

models, we see Durbin‐Watson values of 1.98, which shows no evidence of serial correlation.
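For reference, the Durbin‐Watson statistic can be computed directly from a sequence of residuals, as in the minimal sketch below. Values near 2 indicate no first-order autocorrelation, values near 0 indicate positive autocorrelation, and values near 4 indicate negative autocorrelation. The residuals are hypothetical.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squares."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residual patterns (illustration only).
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0, negative autocorrelation
print(durbin_watson([1.0, 2.0, 3.0, 4.0]))    # 0.1, positive autocorrelation
```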


Classification ‐ QDA:

Quadratic discriminant analysis (QDA) is similar to LDA. QDA and LDA both assume

that the observations from both of the classes (donor and not donor) have a Gaussian

distribution. Both also input estimates for the parameters into Bayes’ theorem to predict

observations. However QDA differs from LDA, because LDA assumes that both classes (or

each class if K > 2) share a covariance matrix but QDA does not assume this.

We run five models for QDA:

The first model includes all 21 fields with some logarithmic transformations

The second model includes all fields plus the three interaction terms and indicator

variable for high income

The third model includes variables from the model found using backwards selection

process during logistic regression (Model #3)

The fourth model includes variables from the model found using backwards

selection process with polynomials up to level #3 during logistic regression (Model

#7). This was the best performing model for logistic regression.

The fifth model includes variables from the model found using backwards selection

process with polynomials up to level #3 during logistic regression minus three non‐

statistically significant variables (Model #8). This was the second best performing

model for logistic regression.

The first two models were also run for LDA, so it will be interesting to compare the

results from QDA. We see that the first model suggests that 1,353 mailings should go out for a

profit of $11,199.50 while the same variables in LDA led to 1,411 mailings for a profit of

$11,605.50, which is $406 more. The second model for QDA also performs $177.50 worse relative to the same

variables in LDA.

QDA does not perform well compared to logistic regression with this data either. We

see the first two models compared to the LDA results and they both perform poorly on the

QDA model as noted above. The final three models are compared to the logistic regression run

with the same variables, and the QDA performs significantly worse when trying to predict

maximum profit on the validation set. A comparison table is shown in the Appendix.

Having too many variables hurts the QDA model, as shown by the poor performance

of models #4 and #5, even though those were the models that performed the best for logistic

regression. The best performing model was #3 with the lowest number of variables (15), which

has a precision or positive predictive value of 72.5% (true positives / predicted positives = 977 /

1347). The performance of the best models is shown below (all within the Appendix):


Classification ‐ K‐nearest neighbors:

Unlike LDA, Logistic Regression, and QDA, K‐Nearest Neighbors (KNN) is non‐

parametric. Therefore there are no assumptions regarding the decision boundary shape. KNN

tends to be the best model when the decision boundary is highly non‐linear. The value of K controls

the flexibility of KNN.
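A minimal sketch of how KNN produces a donor probability is shown below. The estimate is simply the share of donors among the k closest training records. The points and labels are hypothetical, and Euclidean distance on standardized variables is assumed.

```python
import math

def knn_donor_probability(train_x, train_y, query, k):
    """Share of donors (label 1) among the k nearest training records."""
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(label for _, label in nearest) / k

# Hypothetical standardized records and donor labels (illustration only).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
train_y = [1, 1, 0, 0, 0, 1]

# A small k follows the local neighborhood; a large k smooths toward
# the overall donor rate.
print(knn_donor_probability(train_x, train_y, (0.2, 0.2), k=3))
print(knn_donor_probability(train_x, train_y, (0.2, 0.2), k=6))
```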

We run several different groupings of k‐nearest neighbors. We utilize different

numbers of variables as inputs as well as different values of k to define different

“neighborhoods”. We begin by using all available variables, but we reduce this greatly to avoid

the curse of dimensionality that can easily impact k‐nearest neighbors.

The full model performs very poorly, but the second grouping of variables – reduced

from 24 variables to five variables – performs substantially better. The variables utilized are

based on the decision tree run earlier. This decision tree is very helpful for variable selection,

and in this case, the results are greatly improved over the model with far too many variables.

The third set of variables removes genf to see if further reduced dimensionality can help.

We find that the five variable model outperforms the four variable model with several different

values of k. We keep increasing k by 10 until we see a reduction in performance. For the five

variable model we see the performance increasing at k=20 over k=10, and then again at k=30 and

once again at k=40, and then we see a reduction in performance at k=50. Therefore k=40 and the

five variable model is deemed the best.

This results in a profit of $10,836.50 based on 913 correct predictions out of 1,201 total

predictions of donor, which is a precision of 76.0%.
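As a quick arithmetic check, the sketch below assumes net profit equals the average donation ($14.50) times the number of correctly identified donors, minus the mailing cost ($2.00) times the number of mailings. Under that assumption it reproduces the KNN profit and precision figures above.

```python
def net_profit(true_positives, mailings, donation=14.50, cost=2.00):
    """Donations received from actual donors minus total mailing cost."""
    return donation * true_positives - cost * mailings

def precision(true_positives, mailings):
    """Share of mailed records that were actual donors."""
    return true_positives / mailings

print(net_profit(913, 1201))                  # 10836.5
print(round(100 * precision(913, 1201), 1))   # 76.0
```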

Classification ‐ GAM:

Now we move on to Generalized Additive Models (GAMs). These extend standard

linear models by allowing the variables to be represented by a non‐linear function. The ability

to add non‐linearity makes more accurate predictions possible. Despite the variables being


functions, the additivity of the model is maintained, so the effect of individual variables can still

be investigated. GAMs work on both quantitative and qualitative problems. We apply GAMs

on the qualitative variable of whether or not someone will donate, however we will also use

GAMs later on a quantitative dependent variable.

In order to determine variable selection for our GAM model, we first begin with a small

number of quantitative variables. Decision trees can often be a good guide for variable

selection. As we saw with KNN, the decision tree provided a better model than the KNN run

that included all variables. Therefore we begin by reducing the variables down to the ones

found to be significant via the decision tree. We default all variables to four degrees of freedom,

but chld is updated to two degrees of freedom, because it takes a limited number of values.

In addition to the variables found via the decision tree, we add in the variables with the

largest variance (prior to standardizing). We do this in hope of finding variables that would be

adding a lot of variability and this would be captured with the splines. This means we also

include tgif, lgif, and npro.

Based on the summary of the natural spline model, we see chld, hinc, and tgif have

statistically significant values based on the p‐value. For the smoothing spline model, we see

chld, wrat, hinc, and tgif have statistical significance. When we remove the two non‐statistically

significant variables (lgif and npro) we see improvement in AIC for both the natural spline and

smoothing spline models. We then go through a few more iterations to improve on the model.

We can also use local regression fits as building blocks in a GAM, and through an

iterative process, we find meaningful variables to include. Finally, we add in the variables that

we found were significant via backwards selection during logistic regression. We do not use

smoothers or local regression with these variables. Variables where we previously utilized

polynomials are now being replaced by smoothing splines, with up to 20 degrees of freedom.

The best model results are shown above. For the summary of results of all models,

please see the Appendix.


Classification – Random Forests and Bagging:

Bagging is an expansion of Decision Trees that constructs many trees utilizing bootstrap

sampling. Random Forests takes that a step further and includes a

limited number of predictors available to split each node. That value

is typically the square root of all variables available. We start with all

20 original variables, which is bagging since no random selection of

variables takes place. We also try random forests with four and five

variables available at random per node. For random forests, we do

not need to include interaction variables.
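The square root rule mentioned above gives a non-integer value for 20 predictors, which is one plausible reason (our assumption) for trying both four and five variables per node:

```python
import math

n_predictors = 20
sqrt_rule = math.sqrt(n_predictors)

# Whole-number candidates on either side of the square root.
candidates = sorted({math.floor(sqrt_rule), math.ceil(sqrt_rule)})

print(round(sqrt_rule, 2), candidates)  # 4.47 [4, 5]
```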

The best model on the validation data is one that utilizes five variables

at each node and does not include the three worst performing

variables based on the variable importance (shown to the right). This

results in 1,229 mailings and a net profit of $11,752. As shown in the

section below, this is outperformed by the GAM and Logistic Regression models in terms of

maximum net profit, however it does rank near the top with a precision of 80%.

Classification – Boosting:

Boosting is a way to improve the results of decision trees. Boosting can avoid

overfitting by learning slowly from the errors of previous trees created. This sequential

growth of trees can be slowed or sped up by adjusting the shrinkage. Ultimately we find the optimal

boosted result has a depth of 5, shrinkage of 0.01 and worst performing variables removed.

We find the worst performing variables via the relative influence plot that is shown in the

Appendix along with the partial dependence plots for the best performing

variables. These show the marginal effect after integrating out the other variables.

This results in a model that sends 1,197 mailings and has a profit of $11,932. It also

results in the highest precision of any model with 82.5%.

Classification – SVMs:

Support Vector Machines are another option for classification. We utilize a radial

kernel to try and predict the DONR indicator. We try various levels for the gamma and

cost parameters, and determine that a value of 0.1 for both leads to the best result. This results in

1,375 mailings for a net profit of $11,532.50.

Comparison of all Classification Models:

We wish to select the best classification model for determining whether an individual

was a donor or not (DONR indicator) based on the maximum profit. We utilize GAM Model

#10, which includes smoothing splines with very high degrees of freedom, as it leads to profits of

$11,960. We attempt to combine the three best models into different ensemble models,

however no combination outperforms the GAM Model on its own. We apply this model to the

test data set, and we correct for the weighted sampling. The validation data response rate is 0.5,


but the typical response rate is 0.1, so the mailing rate needs to be adjusted before being applied

to the test data. When we send mail to everyone above the threshold given from the model

results, we see that we will mail to 302 records, the ones with the 302 highest posterior

probabilities. We will not mail to the remaining 1705 records. When we look at the best model

for all of the tests, we see the following results:
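The correction for weighted sampling described above can be sketched as follows. This is one common way to rescale a mailing rate from an oversampled validation set (response rate 0.5) to the typical population (response rate 0.1). The exact formula used for this analysis is not stated here, so the function below is an assumption, and the validation mailing count in the example is hypothetical.

```python
def adjusted_mail_count(n_mail_valid, n_valid, n_test,
                        valid_rate=0.5, true_rate=0.1):
    """Rescale the validation mailing share to the test population.

    Assumed approach: deflate the mailed share by the oversampling of
    donors, inflate the unmailed share by the undersampling of
    non-donors, then renormalize.
    """
    mail_share = n_mail_valid / n_valid
    adj_mail = mail_share / (valid_rate / true_rate)
    adj_skip = (1 - mail_share) / ((1 - valid_rate) / (1 - true_rate))
    test_share = adj_mail / (adj_mail + adj_skip)
    return round(n_test * test_share)

# Hypothetical validation mailing count of 1,200 (illustration only),
# with the 2,018 validation and 2,007 test records from this report.
print(adjusted_mail_count(n_mail_valid=1200, n_valid=2018, n_test=2007))
```

Because donors are oversampled in the validation data, the adjusted share of test records to mail is much smaller than the share mailed in the validation set.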

Results ‐ Prediction Regression Modeling

Now that we have completed the classification modeling, we want to move on to the

prediction model for the donation amount (DAMT) variable. For this purpose, we utilize data

of the records for donors only. We utilize several different candidate modeling techniques

including least squares regression, best subset with k‐fold cross validation, principal

components regression, partial least squares, ridge regression, and lasso. We will evaluate these

different results using the mean prediction error. Once we find the best model, we can create

predictions for the test data.

We now build a prediction model to predict expected gift amounts from donors. The

data used for this will only consist of people who donated, so our population size decreases.

The training set now contains 1995 records and the validation set now contains 999 records.
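The evaluation metric can be sketched as follows. We assume here that the mean prediction error is the mean of the squared differences between actual and predicted donation amounts on the validation set, with its standard error taken as the standard deviation of those squared differences divided by the square root of the number of records. The amounts below are hypothetical.

```python
import math
import statistics

def mean_prediction_error(actual, predicted):
    """Return (mean squared prediction error, its standard error)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mpe = statistics.fmean(squared)
    std_error = statistics.stdev(squared) / math.sqrt(len(squared))
    return mpe, std_error

# Hypothetical donation amounts and predictions (illustration only).
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 12.0, 13.0, 21.0]

mpe, se = mean_prediction_error(actual, predicted)
print(mpe, round(se, 3))  # 1.5 0.866
```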

Least Squares Regression:

We begin the prediction modeling on a quantitative variable with least squares

regression. Our goal once again is to find the best possible predictive model. We run several

least squares regression models. The first includes all variables (with the log transformations

we performed), and the second supplements that with the three interaction variables and the


high income indicator variable we created (four total additional variables). The third model

includes all of the aforementioned fields plus 13 additional polynomials on the independent

variables. We then perform stepwise with backwards elimination to reduce the variables.

Finally, we remove a few additional variables from the backwards stepwise result – these

variables have p‐values above 0.05, so they would not be considered statistically significant in

some contexts.

We see the third model – the most complex with 37 variables – performed the best as it

has the lowest Mean Prediction Error with 1.408. This model also shares the lowest standard

error and is very close to having the best adjusted r‐square value. We show the results of the

best models on the right and the results of all five least squares models in the Appendix.

When we run diagnostics on the best regression models, we see potential

multicollinearity issues based on the Variance Inflation Factors (VIF) check. For model #3, we

see many variables with potential multicollinearity issues. For model #4, many of these

variables are eliminated so the issues are alleviated. Multicollinearity is more important for

inference than prediction, so it is not a large issue here. For checking the autocorrelation of

errors, we perform the Durbin‐Watson test. For our two best regression models, we see

Durbin‐Watson values of 2.01, which shows no evidence of serial correlation.

Best Subset with K‐Fold Cross Validation:

We now create models using best subset with K‐Fold Cross Validation. We start with

the biggest (and best) set of variables from the least squares regression ‐ Model #3 with the 37

variables. We run a test without cross‐validation to see what the best model would be with subset

selection. We see that a 31 variable model would be the best. This has a mean error of 1.4053.

We then perform cross‐validation with k = 10 folds.

When we plot the mean cross validation errors by model

size as shown on the right, we see that a model size of 12

produces a mean error of 1.366. This is the first dip or

elbow in the plot, so it is noteworthy. This is the

minimum error up to that point. A model size of 14

produces a mean error of 1.360, and then a model size of

16 is slightly less than that. The overall minimum is a

model with size 25 (1.3241 mean error). We test the models of size 12, 14 and 25 against the

validation data set. We show the full model results comparison in the Appendix.
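The cross validation procedure can be sketched generically as below. For best subset selection, this calculation would be repeated for the best model of each size and the mean errors compared. The intercept-only model and the records are placeholders for illustration, and the function names are our own.

```python
import random
import statistics

def k_fold_indices(n_records, k, seed=1):
    """Shuffle record indices and deal them into k roughly equal folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_mse(records, k, fit, predict, seed=1):
    """Average held-out squared error over k folds for one candidate model."""
    fold_errors = []
    for held_out in k_fold_indices(len(records), k, seed):
        held = set(held_out)
        train = [r for i, r in enumerate(records) if i not in held]
        model = fit(train)
        errors = [
            (records[i][1] - predict(model, records[i][0])) ** 2
            for i in held_out
        ]
        fold_errors.append(statistics.fmean(errors))
    return statistics.fmean(fold_errors)

# Placeholder intercept-only model on hypothetical (x, y) records.
records = [(float(i), 10.0 + (i % 5)) for i in range(50)]
fit_mean = lambda train: statistics.fmean(y for _, y in train)
predict_mean = lambda model, x: model

print(round(cross_validated_mse(records, 10, fit_mean, predict_mean), 3))
```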


Principal Components Regression:

Now we apply Principal Components Regression (PCR) to the data. We want to ensure

that there are no missing values for PCR. Scale also plays an important role in PCR, and we

have already standardized the variables. We can run ten‐fold cross validation on the PCR to see

the root mean squared error for each value of M principal components.

Just like best subset with K‐Fold Cross Validation, we

start with the biggest (and best) set of variables from the least

squares regression performed above ‐ Model #3 with the 37

variables. We remove three of the interaction variables since

they cause issues, likely due to collinearity. We see that 8, 15,

and 25 components have an elbow in the plot, which is an

indicator of a good amount of components to use.

We see that 8 components explain 55% of the variance in the predictors and 49% of the

variance in the dependent variable. We see that 15 components explain 78% of the variance in

the predictors and 63% of the variance in the dependent variable. We see that 25 components

explain 96% of the variance in the predictors and 65% of the variance in the dependent variable.

We also want to see the impact of starting with a lower

amount of variables than the 34 we do above. We utilize the 25

variables found via the best subset with k‐fold cross validation

to run PCR as well. We see the resulting plot on the right:

We see that using 6 components has a significant elbow

in the plot, which is an indicator of a good amount of

components to use. We see that 6 components explain 52% of

the variance in the predictors and 31% of the variance in the dependent variable. We see that

using 13 components also has a substantial elbow in the plot. This level explains 63% of the

variance in the dependent variable. Using either 6 or 13 components would be an improvement

due to dimension reduction compared to the starting count of 25 variables.

With the PCR run on the training data, we can check the results on the validation data.

Since we input significant variables to start, we do not see an improvement in the results by

using PCR. If we had included non‐statistically significant variables, then PCR might

have been helpful. As shown in the table below, as the number of components grows, the mean

prediction error decreases. Model #7 shows the exact same mean prediction error as shown in

the best subset results (since it contains the same starting fields and it uses all 25 fields, so no

dimension reduction is occurring). All results are shown in the Appendix.

Partial Least Squares:

Now we apply Partial Least Squares (PLS) to the data. Just like PCR, we run ten‐fold

cross validation to see the error for each value of M partial least squares directions. We utilize the

same process for deciding the variables as we did for PCR. We see

that 2 components have an elbow in the plot, which is an

indicator of the number of components to use.

We also want to see the impact on Partial Least Squares

of starting with a lower amount of variables than the 34 we do

above. We utilize the 25 variables found via the best subset

with k‐fold cross validation to run PLS as well. We see a

similar plot to the one shown on the right.

We ran two different groups of variables using PLS on the training data. When we

utilize the validation data, we do not see an improvement in the results by using PLS. As

shown in the table below, as the number of components grows, the mean prediction error

decreases. Just as we see with PCR, Model #7 shows the exact same mean prediction error as

shown in the best subset results (since it contains the same starting fields and it uses all 25

fields, so no dimension reduction is occurring). All results are shown in the Appendix.

For PCR and PLS, it is not worth adding the complexity of interpretation and

explanation if it does not lead to improved results. Therefore, it is better to stick with the best

linear regression model or best subset with k‐fold cross validation model up to this point.

Ridge Regression:

Now we will apply two shrinkage or regularization techniques: Ridge Regression and

Lasso. These methods shrink the coefficients towards zero and reduce variance in the process.

We want to find out if performing ridge regression will perform better than simply doing least

squares. We will utilize the best least squares set of variables that we found in order to make

this comparison most useful.

Ridge regression can scale variables automatically, and we have already scaled the

variables. We implement the lambda (tuning parameter) over a full range of values which

results in fitting the null model through to the full least squares fit. In the Appendix, we see the

comparison between the full least squares coefficient and three different values of lambda.

When we plot the ridge regression results, we see that as lambda

increases, more of the coefficients become very close to zero. We now show

the results of ridge regression using ten‐fold cross validation to choose the

optimum level of lambda. Through the cross validation, we see the optimal

lambda is 0.112. We utilize this minimum lambda. We also run a model

with the cross‐validation one standard error higher than this minimum,

which is 0.948. We try a few other values of lambda as well.
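
The two lambda choices described above can be sketched as follows. This is a minimal illustration of the common "one standard error" rule, assuming we already have the mean cross-validation error and its standard error for each candidate lambda. The function name and all numbers are hypothetical; they are not the report's cross-validation output.

```python
def choose_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries."""
    # Index of the lambda with the lowest mean cross-validation error.
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    # Largest lambda whose mean error is within one standard error of the minimum.
    lam_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)
    return lambdas[i_min], lam_1se


# Hypothetical cross-validation summaries (illustration only).
lambdas = [0.01, 0.1, 0.5, 1.0, 5.0]
cv_mean = [1.92, 1.85, 1.88, 1.95, 2.60]
cv_se = [0.05, 0.05, 0.05, 0.06, 0.08]

lam_min, lam_1se = choose_lambdas(lambdas, cv_mean, cv_se)
print(f"lambda at minimum error: {lam_min}")
print(f"lambda under the one standard error rule: {lam_1se}")
```

The one standard error choice trades a small amount of cross-validation error for a more heavily regularized, simpler model.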

When we run the Ridge Regression model with a different set of

variables, we see that Ridge Regression improves neither the 25 variable model nor

the 37 variable model. All Ridge Regression results are shown in the Appendix.


Lasso:

The Lasso procedure is similar to Ridge Regression;

however, Lasso can perform variable selection by shrinking the

coefficients of some variables exactly to zero. This can lead to a more

accurate and/or more interpretable model than ridge regression.

If we utilize the lambda recommended by cross‐validation,

0.0046, we see the full model of 37 variables is reduced to

31. If we use the lambda tuning parameter one standard

error higher than the minimum (0.09), then the 37 variables are

reduced further to 16. However, this 16 variable model has a higher

error on the validation set. We also run a model starting

with the 25 variable model and see similar results. Just as with

Ridge Regression, we see that the Lasso does not improve the

models compared to the least squares. All Lasso results are shown in the Appendix.

A major advantage of Lasso modeling is the variable selection aspect. We see that the

tuning parameter value of 0.5 removes all but four variables from the model. This does not

perform well on mean prediction error, yet it does showcase four of the most important

variables: whether someone lives in geographic region #4, the dollar amount of a recent gift, the

average dollar amount of gifts to date, and, with a negative coefficient, the number of children.
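
The way a larger tuning parameter removes variables can be sketched with the soft-thresholding rule. This is a minimal illustration that assumes an orthonormal design, in which case each Lasso coefficient is the least squares coefficient pulled towards zero by lambda and set exactly to zero once it falls below lambda in absolute value. The variable names `x1` to `x4`, the coefficient values, and the function names are hypothetical. The report's predictors are not orthonormal, so this does not reproduce its results.

```python
def soft_threshold(beta, lam):
    """Lasso solution for one coefficient under an orthonormal design."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0


def selected_variables(ols_coefficients, lam):
    """Names of the variables whose coefficient is still nonzero at this lambda."""
    return [name for name, beta in ols_coefficients.items()
            if soft_threshold(beta, lam) != 0.0]


# Hypothetical least squares coefficients (illustration only).
ols = {"x1": 0.64, "x2": -0.60, "x3": 0.20, "x4": -0.05}

for lam in (0.0, 0.1, 0.5):
    kept = selected_variables(ols, lam)
    print(f"lambda = {lam}: {len(kept)} variables kept -> {kept}")
```

Variables with the largest least squares coefficients survive the longest, which is why a high lambda leaves only a handful of strong predictors.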

Linear Regression GAM:

We now run a Linear Regression Generalized Additive Model (GAM) to see if we can

extend the linear model, as GAMs are more flexible than linear regression. The relationship

between each predictor and the dependent variable is modeled with a curve rather than a

straight line, which hurts the interpretability of GAMs relative to linear regression. We utilize

the 37 variable model that was the best result from linear regression for our GAM testing. We

apply natural splines and smoothing splines and then iterate to determine the best number of

degrees of freedom for these variables. Ultimately we see results that improve upon the linear

regression results.

Results ‐ Summary of Prediction Regression Modeling

For Principal Components Regression and Partial Least Squares, we do not see

improved results. When we apply the two shrinkage or regularization techniques, Ridge

Regression and Lasso, we see that these methods shrink the coefficients towards zero and reduce

variance in the process; however, this does not outperform the best linear regression or best

subset here.

We find that the GAM models do improve upon the performance of least squares and

best subset regression. When we compare the output of the best GAM Model, the best Least

Squares Model, and the best Subset Model, we see that less than one percent of the predictions

differ by more than one unit of DAMT, which indicates that all three models are producing similar

results. To summarize the best models, we use the following table:


Results ‐ Diagnostics of Prediction Regression Modeling

There are several potential problems when we fit a linear regression model. These

include Collinearity, Outliers, High‐leverage points, Correlation of error terms, Non‐constant

variance of error terms, and Non‐Linearity of the response‐predictor relationships. We take

several steps to avoid these pitfalls. One example is multicollinearity, for which we perform a

variance inflation factor (VIF) check. The least squares model of 37 variables has a significant

amount of collinearity, particularly among the interaction variables we create. This is mostly

corrected in the 25 variable model, particularly after we remove the interaction variables.
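
The reason interaction variables tend to fail a VIF check can be sketched with a small example. It assumes only two predictors, in which case the VIF of either one reduces to 1 / (1 - r^2), where r is their Pearson correlation. The function names are hypothetical, and the `chld` and `home` values below are invented for illustration; they are not the report's data.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def vif_two_predictors(a, b):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)


# Hypothetical values (illustration only).
chld = [0, 1, 2, 3, 0, 2, 1, 4]
home = [1, 1, 1, 0, 1, 1, 0, 1]
interaction = [c * h for c, h in zip(chld, home)]

vif_main = vif_two_predictors(chld, home)
vif_interaction = vif_two_predictors(chld, interaction)
print(f"VIF for chld paired with home:        {vif_main:.2f}")
print(f"VIF for chld paired with chld * home: {vif_interaction:.2f}")
```

Because an interaction term is built from its parent variables, it is correlated with them by construction, so its VIF is typically higher than that of two unrelated predictors.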

Conclusion

We set out to help a charitable organization use machine learning to improve its

direct marketing campaign. We believe our results can help improve its donations and

reduce the costs of the program. We built two models: a classification model and a regression

model. The classification model noted whether someone would be likely to donate. The best

modeling technique was Generalized Additive Models when we use maximum profit as the

model selection criterion. In particular, a model using smoothing splines with high degrees of

freedom led to the maximum profit, a value of $11,960. When applied to the test data, we see

that we should send mail to 302 records and no mail to the 1,705 records where the model

outputs a lower probability. This is a 15% mailing rate. We know that typical response rates

are 10%, so we hope that we have isolated most of that 10% positive group within our 15%

mailing group.
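
The arithmetic behind the mailing rate can be checked with a short sketch. The record counts and the 10% typical response rate come from the text above; applying that 10% rate to the test set to get an expected number of donors is an assumption made here for illustration.

```python
mailed = 302        # records the model recommends mailing
not_mailed = 1705   # records with a lower predicted probability
total = mailed + not_mailed

mailing_rate = mailed / total

# Assumption: the typical 10% response rate also holds for the test set.
typical_response_rate = 0.10
expected_donors = typical_response_rate * total

print(f"test records: {total}")
print(f"mailing rate: {mailing_rate:.1%}")
print(f"expected donors at a 10% response rate: {expected_donors:.0f}")
```

Under that assumption, the 302 mailings exceed the expected number of donors in the test set, so the mailing group is large enough to contain most of them if the model ranks donors well.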

With our classification model finalized, we built a prediction model to model gift

amounts using the data for donors only. We utilized many different modeling techniques, and

least squares regression and best subset with k‐fold cross validation outperformed almost every

option, until we implemented the highly flexible Generalized Additive Modeling techniques.

We did not see improved modeling from regularization techniques like Ridge

Regression and Lasso; however, the Lasso did provide some insight. When the tuning

parameter is high, the Lasso model reduces the variables to just four, which showcases some of

the more important factors in determining donation amount: living in geographic region #4,

the dollar amount of a recent gift, and the average dollar amount of gifts to date have

positive effects, and the number of children has a negative effect.


Appendix

Box Cox versus Log Transformation

We see several cases where a log transformation could make the independent variables

have a more normal distribution. This could help improve predictive accuracy of the model.

We also test these cases using a Box Cox transformation and we see the results of the Box Cox

and Log Transformation are very similar. We see this for avhv, incm, rgif, and tdon below:


Decision Trees for Interaction Variables Comparison

We create decision trees on the training data to help determine if there are interaction

variables that we should be including in the model. We ran a tree for both donr and damt (the

two dependent variables used in this study), and we see a similar decision tree for both.

Decision Tree for DONR

Decision Tree for DAMT


Classification ‐ Logistic Regression Model Results:

When we run these nine logistic models against the validation data, we get a better idea

of how these models would perform on new records. A summary of all nine logistic regression

models can be found below:


Classification – QDA Model Results:

QDA does not perform well compared to logistic regression. The performance of all QDA

models is shown below:


Classification – GAM Model Results:

The best model results are shown above in the report. The summary of results of all

models is shown below.

Classification – Boosting Plots:

We identify the worst performing variables via the relative

influence plot. The plot shown below is produced after removing

the worst performing variables from the first run.

Below we show plots revealing the partial dependence for

the best performing variables. These show the marginal effect after

integrating out the other variables.


Partial Dependence Plots:

Least Squares Regression – Model Results:

We show the results of the best models above and the results of all five least squares

models below.


Best Subset with K‐Fold Cross Validation – Model Results:

We did not have space to showcase these results in the report, so we show the

full model results comparison below:

Principal Components Regression – Model Results:

We did not have space to showcase these results in the report, so we show the full model

results comparison below:


Partial Least Squares – Model Results:

We did not have space to showcase these results in the report, so we show the full model

results comparison below:

Ridge Regression – Model Results:

We did not have space to showcase these results in the report, so we show all of the

Ridge Regression model results below:


Lasso – Model Results:


Lasso model results below:

Linear Regression GAM – Model Results:


GAM model results below:


Ridge Regression Tuning Parameter Comparison:

The tuning parameter lambda for ridge regression shrinks the coefficients towards zero

as it increases. Below we see the comparison between the full least squares coefficients and three

different values of lambda: low (close to zero at 0.01), medium (lambda of 11,498), and high

(lambda of over 800 million).

                          Coefficient ‐
                          Full Least       Low        Medium       High
Number  Variable Name     Squares          Lambda     Lambda       Lambda

1       (Intercept)       14.56            14.60      14.49804     14.49825
2       reg1              (0.01)           (0.01)     (0.00002)    (0.00000)
3       reg2              (0.06)           (0.06)     (0.00007)    (0.00000)
4       reg3              0.33             0.32       0.00006      0.00000
5       reg4              0.64             0.64       0.00012      0.00000
6       home              0.23             0.23       0.00004      0.00000
7       chld              (0.60)           (0.26)     (0.00011)    (0.00000)
8       hinc              0.57             0.55       0.00009      0.00000
9       I(hinc^2)         (0.11)           (0.10)     (0.00000)    (0.00000)
10      genf              (0.05)           (0.05)     (0.00000)    (0.00000)
11      wrat              (0.30)           (0.29)     (0.00001)    (0.00000)
12      I(wrat^2)         (0.43)           (0.42)     (0.00003)    (0.00000)
13      avhv              (0.05)           (0.05)     (0.00000)    (0.00000)
14      I(avhv^2)         (0.00)           (0.00)     0.00001      0.00000
15      incm              0.33             0.32       0.00001      0.00000
16      I(incm^2)         (0.06)           (0.05)     0.00002      0.00000
17      inca              0.03             0.03       0.00000      0.00000
18      I(inca^2)         0.07             0.07       0.00002      0.00000
19      plow              0.15             0.14       0.00001      0.00000
20      I(plow^2)         0.12             0.12       0.00002      0.00000
21      npro              0.06             0.06       0.00002      0.00000
22      I(npro^2)         (0.06)           (0.06)     (0.00000)    (0.00000)
23      tgif              0.20             0.20       0.00003      0.00000
24      I(tgif^2)         (0.01)           (0.01)     0.00000      0.00000
25      lgif              0.35             0.34       0.00011      0.00000
26      I(lgif^2)         (0.02)           (0.02)     0.00000      0.00000
27      rgif              0.59             0.59       0.00019      0.00000
28      I(rgif^2)         (0.04)           (0.04)     0.00001      0.00000
29      tdon              0.06             0.06       0.00002      0.00000
30      I(tdon^2)         0.02             0.02       0.00001      0.00000
31      tlag              0.02             0.02       0.00001      0.00000
32      I(tlag^2)         0.01             0.01       0.00001      0.00000
33      agif              0.47             0.47       0.00018      0.00000
34      I(agif^2)         (0.07)           (0.06)     0.00001      0.00000
35      INT_chld_home     0.09             (0.22)     (0.00011)    (0.00000)
36      INT_chld_reg2     0.06             0.05       (0.00009)    (0.00000)
37      hinc_high         (0.05)           (0.03)     0.00005      0.00000
38      INT_chld_hinc     (0.20)           (0.22)     (0.00010)    (0.00000)
