Predict 422 – Eric Spamer
Direct Marketing Classification and Predictive Modeling Eric Spamer
Introduction
We have received approximately 6,000 full records from a charitable organization that
wishes to improve its response rate for its direct marketing campaign. We have all of the
relevant cost information to assist the client using predictive modeling to perform a profit
maximization. The client has provided 2,007 potential donors to us. The client sent all the
relevant attributes, and the organization wants to know who to send mail to.
In order to determine the best individuals to reach out to, we will conduct classification
modeling using the 6,000 records that note whether or not someone donated. This will enable
us to determine the likelihood of whether someone in the test set of 2,007 is willing to donate.
This will also enable us to calculate the net profit maximizing number of mailings total.
Assessing a model in the way it is going to be used is integral, and unfortunately analysts do not
always frame the question this way.
In order for the charitable organization to know how much they should expect in
donations, we will also create a predictive regression model using the data from only the
donors. For both the classification modeling and predictive regression modeling, we tried
many different modeling techniques. We believe we have created models that will serve the
charitable organization very well.
Exploratory Data Analysis:
We begin this task with Exploratory Data Analysis (“EDA”). We want to learn as much
about the data as possible. Missing values and outliers are very impactful to a regression
analysis, so we want to think about these issues throughout the EDA process. There are several
updates (or “imputations”) we make based on our data exploration in order to maximize
the usefulness of the data. We also create interaction groups based on decision trees in order to
find additional predictive data.
Prior to starting our EDA, we want the data to be split into training data, testing data,
and validation data. We create a split with 3,984 training records, 2,018 validation records, and
2,007 test records. The training, validation, and test data all begin with 21 variables.
Imputation (Fixing Missing Values)
The data is very clean, so we do not have to perform imputation on missing values.
Data Transformations
We analyze every variable in the dataset by creating histograms. When analyzing
histograms, we want to look at the shapes of the variables to see if transformations should be
applied. We run histograms on all of the independent variables to see where a
transformation may help improve the fit of the variables. We see approximately
eight cases where a natural log transformation could make the independent variables have a
more normal distribution. This is important, because some statistical methods can struggle
when predictors are highly skewed. Log transformations may improve the performance of the
variables; however, we also want to try Box Cox transformations. The Box Cox procedure finds the
optimal power transformation for the variable. We do not want to use the log transformation
and the Box Cox transformation of a variable at the same time; we would choose one or
the other. For the variables in this analysis, we see the results are very similar, so we choose the
log transformations. The plots below illustrate that the Box Cox transformation and the log
transformation have the same relationship to the original incm (we
provide three additional comparison plots within the Appendix):
Transformations of variables can be very beneficial to improving a model’s predictive
capability. Derived variables can often show more of the true relationship between variables.
We run decision trees to find meaningful interactions between the variables and we include
these. During several model selection runs, we also try polynomials for many of the variables,
since these transformations can provide an improved fit.
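One reason the two transformations can look so similar is that the Box Cox transform reduces to the natural log as its power approaches zero. The sketch below illustrates this with made-up, right-skewed values (not the incm data); the helper name `box_cox` is introduced here for illustration only.

```python
import math

def box_cox(x, lam):
    """Box Cox transform of a positive value x with power lam."""
    if x <= 0:
        raise ValueError("Box Cox requires positive values")
    if lam == 0:
        return math.log(x)  # limiting case as lam approaches 0
    return (x ** lam - 1.0) / lam

# Made-up, right-skewed values standing in for an income-like variable.
values = [12.0, 25.0, 40.0, 43.0, 90.0, 250.0]

log_values = [math.log(v) for v in values]
# With a power very close to zero, Box Cox is nearly identical to the log.
near_log = [box_cox(v, 0.001) for v in values]
```

When the estimated optimal power for a variable is close to zero, choosing the simpler log transformation loses very little.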
Interaction Variables
We create decision trees on the training data to help determine if there are interaction
variables that we should be including in the model. Please find the tree for DONR below.
When we run on damt instead of donr, we see a similar decision tree, so we are satisfied with
the original tree. Please see the Appendix for a comparison between the trees.
We see the first split is on whether there is a
child or not. The next split is whether the individual is
a homeowner or not. We will include a variable that
interacts owning a home and having a child. The next
split is on region2 with a child, so we will utilize an
interaction on that as well. We also see a split on hinc
of 2.5, which is the household income category
variable. We also create a variable that notes whether
hinc is above or below 2.5. We see 6,466 cases where
hinc is above 2.5.
If we are performing modeling for inference,
we want to stay away from interaction terms during
logistic regression. However we include these
interaction terms, because we will also be building a predictive model using linear regression
and prediction is our main goal, not inference. We do not need to include our interaction
variables for decision trees and random forests, since those algorithms will include interactions.
Data Standardization
We split into test/training set, compute parameters for standardization/decomposition
on training set (e.g. mean and variance of training set for standardization) and apply the same
parameters on test set. For standardization this could mean that the test set does not have zero
mean / unit variance, but that is fine. Looking at the test set to transform the training set is
usually considered bad practice.
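The standardization approach described above can be sketched as follows. This is a minimal illustration with made-up numbers; the function names `fit_standardizer` and `apply_standardizer` are assumptions introduced here, not taken from the analysis.

```python
import statistics

def fit_standardizer(train_column):
    """Estimate the mean and standard deviation from training data only."""
    mean = statistics.fmean(train_column)
    sd = statistics.stdev(train_column)  # sample standard deviation
    return mean, sd

def apply_standardizer(column, mean, sd):
    """Standardize any data set with parameters learned on the training set."""
    return [(x - mean) / sd for x in column]

# Made-up values for a single predictor.
train = [2.0, 4.0, 4.0, 6.0, 9.0]
test = [3.0, 8.0]

mean, sd = fit_standardizer(train)
train_z = apply_standardizer(train, mean, sd)
test_z = apply_standardizer(test, mean, sd)  # reuses the training parameters
```

The standardized training column has zero mean and unit variance by construction, while the test column generally does not, which is the expected behavior.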
Results ‐ Classification Modeling
We begin by performing classification modeling. The data contains variables that can
model whether or not someone will donate. For this modeling, weighted sampling has been
implemented, so more weight has been given to donors. We correct for that before determining
the number of mailings that will maximize net profits.
Logistic regression is often the most popular choice when there are two classes (K = 2).
We will run LDA first and then logistic regression. LDA is often better than logistic regression
under certain conditions. These include when n is small, when there are more than two classes,
and the classes are distinct and well separated and have Gaussian distributions. When p is very
high, Naïve Bayes may outperform the other techniques, however p is not very large here, so
we do not run Naïve Bayes.
After creating these models, we plug in the average donation amount ($14.50), the
mailing cost ($2.00), and the likelihood of response based on our classification model. It is not
wise to send mail to everyone since the average response rate is only 10%, which results in a net
loss of $0.55 per mailing. We will attempt to maximize net profits by creating models using the
training data, calculating the posterior probability of donation using the validation data, sorting
by the highest likelihood donors, and then stopping once the costs outweigh
the gains.
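The profit calculation described above can be sketched as follows. This is a minimal illustration, assuming the stopping point is taken as the mailing count with the highest cumulative profit; the function name and the toy probabilities are made up for the example.

```python
AVG_DONATION = 14.50  # average donation amount from the text
MAIL_COST = 2.00      # cost per mailing from the text

def max_profit_mailings(posteriors, donated):
    """Sort by posterior probability and find the mailing count that
    maximizes cumulative net profit on data with known outcomes."""
    order = sorted(range(len(posteriors)),
                   key=lambda i: posteriors[i], reverse=True)
    best_n, best_profit = 0, 0.0
    running = 0.0
    for n, i in enumerate(order, start=1):
        running += AVG_DONATION * donated[i] - MAIL_COST
        if running > best_profit:
            best_n, best_profit = n, running
    return best_n, best_profit

# Expected net result per mailing if everyone is mailed at a 10% response rate.
mail_everyone = 0.10 * AVG_DONATION - MAIL_COST

# Toy validation data: posterior probabilities and actual donor flags.
posteriors = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
donated = [1, 1, 0, 1, 0, 0]
n_mail, profit = max_profit_mailings(posteriors, donated)
```

On the toy data the cumulative profit peaks after four mailings; mailing the last two records only adds cost.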
Classification ‐ LDA:
It is our goal to find the best possible predictive model. LDA is not designed to be used
with qualitative predictors, such as whether someone is a homeowner or not; however, we overlook
that here in search of the best model.
We run several LDA models. The first includes the log transformations we performed
and the second supplements that with the three interaction variables and the high income
indicator variable we created (four total additional variables). The third model includes all of
the aforementioned fields plus 12 additional polynomials on the independent variables.
We see the third model – the most complex – performed the best. It finds that the
number of mailings that maximizes profits is 1,255, which is the lowest of the three models.
However it is the best at finding the correct true positives, as 978 of the 1,255 mailings were
actual donors. This precision, or positive predictive value, is 78%.
The plot of the third LDA model (right) shows how
profits change as more mailings are sent. Sending too
many mailings eventually starts decreasing the net profit:
Classification ‐ Logistic Regression:
The LDA results look like they will provide a
large amount of profit; however, logistic regression is a
very popular technique, especially when there are two classes, as there are in this case
(donor or not donor). We run nine different logistic regression models. This process includes
backwards stepwise selection, which helps eliminate unimportant independent variables that add
more noise than signal to the predictive process.
The first model includes all fields with some logarithmic transformations. The second
model includes all fields plus the three interaction terms and the one indicator variable on
household income level. The third model includes a backwards selection process based on
the Akaike information criterion (AIC). The third model removes variables that are not important;
however, it leaves in two variables that are still not statistically significant at a 5% level. The
fourth model removes these two variables. The fifth and sixth models include backward
stepwise selection with the addition of polynomials up to the power of two for 12 additional variables.
The seventh and eighth models include backward stepwise selection with the addition of polynomials
up to the power of three for 12 variables. The sixth and eighth models take their
predecessor model and remove variables that are not statistically significant at the 5% level.
A small value of AIC is what we are searching for. The calculation penalizes models
with lots of parameters and penalizes models for poor fit as well. The full model has AIC:
2167.9 and the backwards selection leads to a 15 variable model with an AIC of 2153.23. As
noted above, two of these 15 variables have a p‐value of 0.08 (tgif and genf), so we run a fourth
model excluding these. This fourth model leads to a higher AIC of 2156. This is expected, since
the backwards selection process is attempting to minimize AIC.
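For reference, the standard definition of AIC is given below. The symbols are notation introduced here rather than taken from the report: k is the number of estimated parameters and \hat{L} is the maximized likelihood of the model.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

The first term grows with the number of parameters, and the second term grows as the fit worsens, which is why removing a weakly useful variable can raise AIC slightly even though the model becomes simpler.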
Model #5 is backward stepwise with 12 more polynomials. Instead of adding a
polynomial simply for hinc, we add polynomials for 12 additional variables that have a
distribution other than 0 and 1. This leads to an AIC of 2002.1 with 21 variables. Model #6
removes four variables from Model #5 that are not statistically significant at a 5% level (p‐value
above 0.05). The variables removed are genf, npro^2, tgif^2, and
rgif, which leads to a 17 variable model with an AIC of 2011.
Model #7 and Model #8 are similar to Models #5 and #6, however they involve taking 12
variables to the power of three before running the stepwise regression. This leads to 49
potential variables, of which 26 are selected to be included in the model. This results in an AIC
of 1,986.1. Model #8 removes three variables from Model #7 that are not statistically significant:
genf, plow^3, and agif^3. This leads to an AIC slightly higher than Model #7: 1,987. Having three
fewer variables may prove to be beneficial on a validation data set, however.
Finally for the ninth model, we utilize the variables from Model #8 but we fill in
underlying variables for polynomials – meaning if the backwards selection said only the third
degree polynomial of a variable was significant, we add in the first and second degree
polynomial since that is traditionally more appropriate.
When we run these nine logistic models against the validation data, we get a better idea
of how these models would perform on new records. A summary of all nine logistic regression
models can be found in the Appendix. Please find a summary of the best models below:
We see that the two models with the best AIC prove to be the two best models on the validation data’s
maximum profit calculation. Models #7 and #8 have the highest profit and lowest AIC. The
precision, or positive predictive value, is true positives / predicted positives, which is 996 / 1335 =
74.6% for Model #7 and 75.9% for Model #8.
Two potential problems for fitting models are multicollinearity and autocorrelation of
errors. For multicollinearity, we perform a Variance Inflation Factors (VIF) check on the best
logistic regression models. For logistic model #7, we see a few variables with potential issues:
INT_chld_home, INT_chld_reg2, INT_chld_hinc and chld. We see similar issues for model #8.
Multicollinearity is more important for inference than prediction, so it is not a large issue here.
For checking the independence of errors, we perform the Durbin‐Watson test. This will help us
understand if we have autocorrelation in our errors. For our two best logistic regression
models, we see Durbin‐Watson values of 1.98, which shows no evidence of serial correlation.
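The Durbin‐Watson statistic referenced above is the sum of squared successive differences of the residuals divided by the sum of squared residuals. The sketch below computes it on made-up residual series (not the model residuals from this analysis) to show why values near 2 indicate no first‐order serial correlation.

```python
import random

def durbin_watson(residuals):
    """Durbin-Watson statistic for an ordered sequence of residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Independent noise: the statistic should land close to 2.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(5000)]
dw_noise = durbin_watson(noise)

# Perfectly alternating residuals (strong negative autocorrelation).
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Identical residuals (strong positive autocorrelation).
dw_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
```

Values well below 2 point to positive autocorrelation and values well above 2 point to negative autocorrelation.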
Classification ‐ QDA:
Quadratic discriminant analysis (QDA) is similar to LDA. QDA and LDA both assume
that the observations from both of the classes (donor and not donor) have a Gaussian
distribution. Both also input estimates for the parameters into Bayes’ theorem to predict
observations. However QDA differs from LDA, because LDA assumes that both classes (or
each class if K > 2) share a covariance matrix but QDA does not assume this.
We run five models for QDA:
The first model includes all 21 fields with some logarithmic transformations.
The second model includes all fields plus the three interaction terms and the indicator variable for high income.
The third model includes the variables from the model found using the backwards selection process during logistic regression (Model #3).
The fourth model includes the variables from the model found using the backwards selection process with polynomials up to degree three during logistic regression (Model #7). This was the best performing model for logistic regression.
The fifth model includes the variables from the model found using the backwards selection process with polynomials up to degree three during logistic regression, minus three variables that are not statistically significant (Model #8). This was the second best performing model for logistic regression.
The first two models were also run for LDA, so it will be interesting to compare the
results from QDA. We see that the first model suggests that 1,353 mailings should go out for a
profit of $11,199.50, while the same variables in LDA led to 1,411 mailings for a profit of
$11,605.50, which is $406 more. The second model for QDA also performs $177.50 worse relative to the same
variables in LDA.
QDA does not perform well compared to logistic regression with this data either. As
noted above, the first two QDA models both perform worse than LDA with the same
variables. The final three models are compared to the logistic regression run
with the same variables, and QDA performs significantly worse when trying to predict
maximum profit on the validation set. A comparison table is shown in the Appendix.
Having too many variables hurts the QDA model, as shown by the poor performance
of models #4 and #5, even though those were the models that performed the best for logistic
regression. The best performing model was #3 with the lowest number of variables (15), which
has a precision, or positive predictive value, of 72.5% (true positives / predicted positives = 977 /
1347). The performance of the best models is shown below (all within the Appendix):
Classification ‐ K‐nearest neighbors:
Unlike LDA, Logistic Regression, and QDA, K‐Nearest Neighbors (KNN) is non‐parametric.
Therefore there are no assumptions regarding the decision boundary shape. KNN
tends to perform best when the decision boundary is very non‐linear. The value of K controls
the flexibility of KNN: a smaller K gives a more flexible fit.
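To make the role of K concrete, the sketch below estimates the probability of donation for a query point as the share of donors among its K nearest training points. The data and the function name `knn_posterior` are made up for illustration, and Euclidean distance is assumed.

```python
import math

def knn_posterior(train_x, train_y, query, k):
    """Share of the k nearest training points (Euclidean distance)
    whose label is 1."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k

# Toy data: three non-donors near the origin, three donors near (5, 5).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
query = (4, 4)

p_k1 = knn_posterior(train_x, train_y, query, 1)  # only the closest point
p_k5 = knn_posterior(train_x, train_y, query, 5)  # wider neighborhood
```

With K = 1 the estimate follows the single closest point, while K = 5 pulls in more distant non-donors and smooths the estimate toward the overall mix.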
We run various different groupings of k‐nearest neighbors. We utilize different
amounts of variables as inputs as well as different values of k to define different
“neighborhoods”. We begin by using all available variables, but we reduce this greatly to avoid
the curse of dimensionality that can easily impact k‐nearest neighbors.
The full model performs very poorly, but the second grouping of variables – reduced
from 24 variables to five variables – performs substantially better. The variables utilized are
based on the decision tree run earlier. This decision tree is very helpful for variable selection,
and in this case, the results are greatly improved over the model with far too many variables.
The third set of variables removes genf to see if further reduced dimensionality can help.
We find that the five variable model outperforms the four variable model with several different
values of k. We keep increasing k by 10 until we see a reduction in performance. For the five
variable model we see the performance increasing at k=20 over k=10, and then again at k=30 and
once again at k=40, and then we see a reduction in performance at k=50. Therefore k=40 and the
five variable model is deemed the best.
This results in a profit of $10,836.50 based on 913 correct predictions out of 1,201 total
predictions of donor, which is a precision of 76.0%.
Classification ‐ GAM:
Now we move on to Generalized Additive Models (GAMs). These extend standard
linear models by allowing the variables to be represented by a non‐linear function. The ability
to add non‐linearity makes more accurate predictions possible. Despite the variables being
functions, the additivity of the model is maintained, so the effect of individual variables can still
be investigated. GAMs work on both quantitative and qualitative problems. We apply GAMs
on the qualitative variable of whether or not someone will donate, however we will also use
GAMs later on a quantitative dependent variable.
In order to determine variable selection for our GAM model, we first begin with a small
number of quantitative variables. Decision trees can often be a good guide for variable
selection. As we saw with KNN, the decision tree provided a better model than the KNN run
that included all variables. Therefore we begin by reducing the variables down to the ones
found to be significant via the decision tree. We default all variables to four degrees of freedom,
but chld is set to two degrees of freedom, because it takes a limited number of values.
In addition to the variables found via the decision tree, we add in the variables with the
largest variance (prior to standardizing). We do this in hope of finding variables that would be
adding a lot of variability and this would be captured with the splines. This means we also
include tgif, lgif, and npro.
Based on the summary of the natural spline model, we see chld, hinc, and tgif have
statistically significant values based on the p‐value. For the smoothing spline model, we see
chld, wrat, hinc, and tgif have statistical significance. When we remove the two non‐statistically
significant variables (lgif and npro) we see improvement in AIC for both the natural spline and
smoothing spline models. We then go through a few more iterations to improve on the model.
We can also use local regression fits as building blocks in a GAM, and through an
iterative process, we find meaningful variables to include. Finally, we add in the variables that
we found were significant via backwards selection during logistic regression. We do not use
smoothers or local regression with these variables. Variables where we previously utilized
polynomials are now being replaced by smoothing splines, with up to 20 degrees of freedom.
The best model results are shown above. For the summary of results of all models,
please see the Appendix.
Classification – Random Forests and Bagging:
Bagging is an expansion of Decision Trees that constructs many trees utilizing bootstrap
sampling. Random Forests take that a step further and make only a limited number of
predictors available to split each node. That number is typically the square root of the number
of variables available. We start with all 20 original variables available at each split, which is
bagging since no random selection of variables takes place. We also try random forests with four
and five variables available at random per node. For random forests, we do not need to include
interaction variables.
The best model on the validation data is one that utilizes five variables
at each node and does not include the three worst performing
variables based on the variable importance (shown to the right). This
results in 1,229 mailings and a net profit of $11,752. As shown in the
section below, this is outperformed by the GAM and Logistic Regression models in terms of
maximum net profit, however it does rank near the top with a precision of 80%.
Classification – Boosting:
Boosting is a way to improve the results of decision trees. Boosting can avoid
overfitting by learning slowly from the errors of previously created trees. The pace of this sequential
growth of trees can be changed by adjusting the shrinkage. Ultimately we find the optimal
boosted result has a depth of 5, a shrinkage of 0.01, and the worst performing variables removed.
We find the worst performing variables via the relative influence plot that is shown in the
Appendix along with partial dependence plots for the best performing
variables. These show the marginal effect of a variable after integrating out the other variables.
This results in a model that sends 1,197 mailings and has a profit of $11,932. It also
results in the highest precision of any model with 82.5%.
Classification – SVMs:
Support Vector Machines are another option for classification. We utilize a radial
kernel to try to predict the DONR indicator. We try various levels for the gamma and
cost parameters, and determine that values of 0.1 for both lead to the best result. This results in
1,375 mailings for a net profit of $11,532.50.
Comparison of all Classification Models:
We wish to select the best classification model for determining whether an individual
was a donor or not (DONR indicator) based on the maximum profit. We utilize GAM Model
#10, which includes smoothing splines with very high degrees of freedom, as it leads to profits of
$11,960. We attempt to combine the three best models into different ensemble models,
however no combination outperforms the GAM Model on its own. We apply this model to the
test data set, and we correct for the weighted sampling. The validation data response rate is 0.5,
but the typical response rate is 0.1, so the mailing rate needs to be adjusted before being applied
to the test data. When we send mail to everyone above the threshold given from the model
results, we see that we will mail to 302 records, the ones with the 302 highest posterior
probabilities. We will not mail to the remaining 1,705 records. When we look at the best model
for all of the tests, we see the following results:
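The report does not show how the weighted sampling correction was computed. One common adjustment, assumed here, rescales the mailed and not-mailed shares of the validation set by how much donors and non-donors were over- or under-sampled, then applies the resulting mailing rate to the test set. The function name and the validation mailing count of 1,240 are hypothetical and chosen only for illustration.

```python
def adjusted_mail_count(n_mail_valid, n_valid, n_test,
                        valid_rate=0.5, true_rate=0.1):
    """Translate a mailing count chosen on an oversampled validation set
    into a mailing count for a test set with the typical response rate."""
    mailed_share = n_mail_valid / n_valid
    # Donors are over-represented by valid_rate / true_rate.
    adj_mail = mailed_share / (valid_rate / true_rate)
    # Non-donors are under-represented by (1 - valid_rate) / (1 - true_rate).
    adj_skip = (1 - mailed_share) / ((1 - valid_rate) / (1 - true_rate))
    adj_rate = adj_mail / (adj_mail + adj_skip)
    return round(n_test * adj_rate)

# 2,018 validation and 2,007 test records are from the text;
# 1,240 validation mailings is a hypothetical input.
n_mail_test = adjusted_mail_count(1240, 2018, 2007)
```

When the validation and typical response rates are equal, the adjustment leaves the mailing share unchanged, which is a quick sanity check on the formula.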
Results ‐ Prediction Regression Modeling
Now that we have completed the classification modeling, we want to move on to the
prediction model for the donation amount (DAMT) variable. For this purpose, we utilize data
of the records for donors only. We utilize several different candidate modeling techniques
including least squares regression, best subset with k‐fold cross validation, principal
components regression, partial least squares, ridge regression, and lasso. We will evaluate these
different results using the mean prediction error. Once we find the best model, we can create
predictions for the test data.
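As a sketch of the evaluation metric, the code below assumes the mean prediction error is the mean squared difference between actual and predicted donation amounts on the validation data, with its standard error computed from the squared errors. The function name and the sample amounts are made up for illustration.

```python
import math
import statistics

def mean_prediction_error(actual, predicted):
    """Mean squared prediction error and its standard error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mpe = statistics.fmean(squared_errors)
    std_error = statistics.stdev(squared_errors) / math.sqrt(len(squared_errors))
    return mpe, std_error

# Made-up donation amounts and predictions.
actual = [14.0, 15.0, 16.0, 13.0]
predicted = [13.0, 15.0, 18.0, 13.0]
mpe, std_error = mean_prediction_error(actual, predicted)
```

Reporting the standard error alongside the mean helps judge whether the difference between two candidate models is larger than the sampling noise.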
We now build a prediction model to predict expected gift amounts from donors. The
data used for this will only consist of people who donated, so our population size decreases.
The training set now contains 1,995 records and the validation set now contains 999 records.
Least Squares Regression:
We begin the prediction modeling on a quantitative variable with least squares
regression. Our goal once again is to find the best possible predictive model. We run several
least squares regression models. The first includes all variables (with the log transformations
we performed), and the second supplements that with the three interaction variables and the
high income indicator variable we created (four total additional variables). The third model
includes all of the aforementioned fields plus 13 additional polynomials on the independent
variables. We then perform stepwise with backwards elimination to reduce the variables.
Finally, we remove a few additional variables from the backwards stepwise result – these
variables have p‐values above 0.05, so they would not be considered statistically significant in
some contexts.
We see the third model, the most complex with 37 variables, performed the best as it
has the lowest Mean Prediction Error with 1.408. This model also shares the lowest standard
error and is very close to having the best adjusted r‐square value. We show the results of the
best models on the right and the results of all five least squares models in the Appendix.
When we run diagnostics on the best regression models, we see potential
multicollinearity issues based on the Variance Inflation Factors (VIF) check. For model #3, we
see many variables with potential multicollinearity issues. For model #4, many of these
variables are eliminated, so the issues are alleviated. Multicollinearity is more important for
inference than prediction, so it is not a large issue here. For checking the autocorrelation of
errors, we perform the Durbin‐Watson test. For our two best regression models, we see
Durbin‐Watson values of 2.01, which shows no evidence of serial correlation.
Best Subset with K‐Fold Cross Validation:
We now create models using best subset with K‐Fold Cross Validation. We start with
the biggest (and best) set of variables from the least squares regression ‐ Model #3 with the 37
variables. We run a test without cross‐validation to see what the best model would be with subset
selection. We see that a 31 variable model would be the best. This has a mean error of 1.4053.
We then perform cross‐validation with k = 10 folds.
When we plot the mean cross validation errors by model
size (shown on the right), we see that a model size of 12
produces a mean error of 1.366. This is the first dip or
elbow in the plot, so it is noteworthy. This is the
minimum error up to that point. A model size of 14
produces a mean error of 1.360, and then a model size of
16 is slightly less than that. The overall minimum is a
model with size 25 (1.3241 mean error). We test the models of size 12, 14 and 25 against the
validation data set. We show the full model results comparison in the Appendix.
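The cross‐validation step can be sketched as follows. To keep the example short, a single-predictor least squares line stands in for each candidate subset model; in the analysis the same error would be computed for each model size. All names and data are made up for illustration.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least squares intercept and slope for one predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def k_fold_cv_error(xs, ys, k=10, seed=1):
    """Mean squared error averaged over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out]
        fold_errors.append(statistics.fmean(errors))
    return statistics.fmean(fold_errors)

# Made-up data: an exact line, and the same line with added noise.
xs = [float(x) for x in range(50)]
ys_clean = [2.0 + 3.0 * x for x in xs]
rng = random.Random(0)
ys_noisy = [y + rng.gauss(0.0, 1.0) for y in ys_clean]

cv_clean = k_fold_cv_error(xs, ys_clean)
cv_noisy = k_fold_cv_error(xs, ys_noisy)
```

Because every record is held out exactly once, the averaged error estimates out-of-sample performance without touching the validation set.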
Principal Components Regression:
Now we apply Principal Components Regression (PCR) to the data. We want to ensure
that there are no missing values for PCR. Scale also plays an important role in PCR, and we
have already standardized the variables. We can run ten‐fold cross validation on the PCR to see
the root mean squared error for each value of M principal components.
Just like best subset with K‐Fold Cross Validation, we
start with the biggest (and best) set of variables from the least
squares regression performed above ‐ Model #3 with the 37
variables. We remove three of the interaction variables since
they cause issues, likely due to collinearity. We see that 8, 15,
and 25 components have an elbow in the plot, which is an
indicator of a good amount of components to use.
We see that 8 components explain 55% of the variance in the predictors and 49% of the
variance in the dependent variable. We see that 15 components explain 78% of the variance in
the predictors and 63% of the variance in the dependent variable. We see that 25 components
explain 96% of the variance in the predictors and 65% of the variance in the dependent variable.
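The "variance explained in the predictors" figures are cumulative shares of the total component variance. The sketch below shows the calculation on made-up component variances (not the values from this analysis); the function names are introduced here for illustration.

```python
from itertools import accumulate

def cumulative_variance_explained(component_variances):
    """Cumulative share of total variance after each component,
    with components ordered from largest to smallest variance."""
    total = sum(component_variances)
    return [running / total for running in accumulate(component_variances)]

def components_needed(component_variances, target):
    """Smallest number of components whose cumulative share meets target."""
    shares = cumulative_variance_explained(component_variances)
    for m, share in enumerate(shares, start=1):
        if share >= target:
            return m
    return len(component_variances)

# Made-up variances for four principal components.
variances = [4.0, 2.0, 1.0, 1.0]
shares = cumulative_variance_explained(variances)
m_for_75 = components_needed(variances, 0.75)
```

The share of variance explained in the dependent variable is a separate quantity and comes from the regression on the retained components, which is why it can level off while the predictor share keeps rising.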
We also want to see the impact of starting with a smaller
number of variables than the 34 we use above. We utilize the 25
variables found via the best subset with k‐fold cross validation
to run PCR as well. We see the resulting plot on the right:
We see that using 6 components has a significant elbow
in the plot, which is an indicator of a good amount of
components to use. We see that 6 components explain 52% of
the variance in the predictors and 31% of the variance in the dependent variable. We see that
using 13 components also has a substantial elbow in the plot. This level explains 63% of the
variance in the dependent variable. Using either 6 or 13 components would be an improvement
due to dimension reduction compared to the starting count of 25 variables.
With the PCR run on the training data, we can check the results on the validation data.
Since we input significant variables to start, we do not see an improvement in the results by
using PCR. Had we included non‐statistically significant variables, then PCR would
have been helpful. As shown in the table below, as the number of components grows, the mean
prediction error decreases. Model #7 shows exactly the same mean prediction error as the
best subset results, since it starts from the same fields and uses all 25 components, so no
dimension reduction is occurring. All results are shown in the Appendix.
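As a point of reference, the mean prediction error on a validation set can be computed as sketched below. We assume here that it is the mean of the squared differences between actual and predicted values, with a standard error based on the spread of those squared differences; the function name and the numbers are hypothetical and are not taken from our results.

```python
from math import sqrt

def mean_prediction_error(actual, predicted):
    """Return (mean squared prediction error, its standard error)."""
    sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    n = len(sq_errors)
    mpe = sum(sq_errors) / n
    # Sample variance of the squared errors, then standard error of their mean.
    variance = sum((s - mpe) ** 2 for s in sq_errors) / (n - 1)
    return mpe, sqrt(variance / n)

# Hypothetical gift amounts and predictions, for illustration only.
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 12.0, 13.0, 21.0]
mpe, se = mean_prediction_error(actual, predicted)
```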
Partial Least Squares:
Now we apply Partial Least Squares (PLS) to the data. Just like PCR, we run ten‐fold
cross validation to see the error for each value of M partial least squares directions. We utilize the
same process for deciding the variables as we did for PCR. We see
an elbow in the plot at 2 components, which is an
indicator of the number of components to use.
We also want to see the impact on Partial Least Squares
of starting with a smaller number of variables than the 34 we use
above. We utilize the 25 variables found via the best subset
with k‐fold cross validation to run PLS as well. We see a
similar plot to the one shown on the right.
We ran two different groups of variables using PLS on the training data. When we
utilize the validation data, we do not see an improvement in the results by using PLS. As
shown in the table below, as the number of components grows, the mean prediction error
decreases. Just as we see with PCR, Model #7 shows the exact same mean prediction error as
shown in the best subset results (since it contains the same starting fields and it uses all 25
fields, so no dimension reduction is occurring). All results are shown in the Appendix.
For PCR and PLS, it is not worth adding the complexity of interpretation and
explanation if it does not lead to improved results. Therefore, it is better to stick with the best
linear regression model or best subset with k‐fold cross validation model up to this point.
Ridge Regression:
Now we will apply two shrinkage or regularization techniques: Ridge Regression and
Lasso. These methods shrink the coefficients towards zero and reduce variance in the process.
We want to find out if performing ridge regression will perform better than simply doing least
squares. We will utilize the best least squares set of variables that we found in order to make
this comparison most useful.
Ridge regression is sensitive to the scale of the predictors. The fitting routine can scale
variables automatically, and we have already scaled the variables. We fit the model over a full
range of values of lambda (the tuning parameter), which results in fits ranging from the null
model through to the full least squares fit. In the Appendix, we see the
comparison between the full least squares coefficients and three different values of lambda.
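To make the effect of lambda concrete, consider the simplest case of a single centered predictor, where the ridge coefficient has the closed form Sxy / (Sxx + lambda). The sketch below uses hypothetical data and a hypothetical helper `ridge_slope`; it is not our fitted model, only an illustration of how the coefficient moves from the least squares value towards zero.

```python
def ridge_slope(x, y, lam):
    """Ridge coefficient for one predictor, with an unpenalized intercept."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # lam = 0 gives least squares; a very large lam approaches the null model.
    return sxy / (sxx + lam)

# Hypothetical data, for illustration only.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-3.9, -2.1, 0.1, 1.9, 4.0]
path = {lam: ridge_slope(x, y, lam) for lam in [0.0, 1.0, 10.0, 1e6]}
```

With lambda at zero the slope equals the least squares estimate, and it shrinks steadily as lambda grows without ever reaching exactly zero.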
When we plot the ridge regression results, we see that as lambda
increases, more of the coefficients shrink very close to zero. We now show
the results of ridge regression using ten‐fold cross validation to choose the
optimal value of lambda. Through the cross validation, we see that the lambda
with the minimum cross‐validation error is 0.112, and we fit a model with this
value. We also run a model with the lambda whose cross‐validation error is one
standard error above this minimum, which is 0.948. We try a few other values of lambda as well.
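The two choices of lambda can be sketched as below. We assume the usual one‐standard‐error rule here: take the largest lambda whose cross‐validation error is within one standard error of the minimum. The function `choose_lambdas` and the error values are hypothetical and are not our cross validation output.

```python
def choose_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda at minimum CV error, lambda from the one-standard-error rule)."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    # Largest (most regularized) lambda whose CV error is within the threshold.
    lambda_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)
    return lambdas[i_min], lambda_1se

# Hypothetical cross validation results, for illustration only.
lambdas = [0.01, 0.1, 1.0, 10.0]
cv_mean = [1.62, 1.60, 1.65, 2.40]
cv_se = [0.06, 0.06, 0.07, 0.10]
lam_min, lam_1se = choose_lambdas(lambdas, cv_mean, cv_se)
```

The one‐standard‐error choice trades a small increase in estimated error for a more heavily regularized model.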
When we run the Ridge Regression model with different sets of
variables, we see that Ridge Regression does not improve the 25 variable model nor does it
improve the 37 variable model. All Ridge Regression results are shown in the Appendix.
Lasso:
The Lasso procedure is similar to Ridge Regression;
however, the Lasso can perform variable selection by shrinking the
coefficients of some variables exactly to zero. This can lead to a more accurate
and/or more interpretable model than ridge regression.
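The difference between the two penalties is easiest to see in the textbook special case of an orthonormal design, where each least squares coefficient is shrunk independently. In that case ridge divides each coefficient by (1 + lambda), while the Lasso subtracts lambda / 2 and sets anything smaller than that to zero. The sketch below uses hypothetical coefficients and is not derived from our data.

```python
def ridge_shrink(b, lam):
    """Ridge estimate for one coefficient under an orthonormal design."""
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    """Lasso estimate (soft thresholding at lam / 2) under an orthonormal design."""
    half = lam / 2.0
    if b > half:
        return b - half
    if b < -half:
        return b + half
    return 0.0

# Hypothetical least squares coefficients, for illustration only.
ols = [3.0, -1.5, 0.4, -0.2]
lam = 1.0
ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]
```

Here ridge keeps all four coefficients non‐zero, while the Lasso drops the two smallest, which is the variable selection behavior described above.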
If we utilize the lambda recommended by cross‐validation,
0.0046, we see the full model of 37 variables is reduced to
31. If we use the lambda tuning parameter one standard
error higher than the minimum (0.09), then the 37 variables are
reduced further to 16. This 16 variable model has a higher error
on the validation set, however. We also run a model starting
with the 25 variable model and see similar results. Just as with
Ridge Regression, we see that the Lasso does not improve the
models compared to least squares. All Lasso results are shown in the Appendix.
A major advantage of Lasso modeling is its variable selection. We see that a
tuning parameter value of 0.5 removes all but four variables from the model. This model does not
perform well on mean prediction error, yet it does showcase four of the most important
variables: whether someone lives in geographic region #4, the dollar amount of a recent gift, the
average dollar amount of gifts to date, and, with a negative coefficient, the number of children.
Linear Regression GAM:
We now run a Linear Regression Generalized Additive Model (GAM) to see if we can
extend the linear model, as GAMs are more flexible than linear regression. A GAM models the
relationship between each predictor and the dependent variable with a curve, which hurts the
interpretability of GAMs relative to linear regression. We utilize the 37 variable model that
was the best result from linear regression for our GAM testing. We apply natural splines and
smoothing splines and then iterate to determine the best number of degrees of freedom
for these variables. Ultimately we see results that improve upon the linear regression results.
Results ‐ Summary of Prediction Regression Modeling
For Principal Components Regression and Partial Least Squares, we do not see
improved results. When we apply the two shrinkage or regularization techniques (Ridge
Regression and Lasso), we see these methods shrink the coefficients towards zero and reduce
variance in the process; however, this does not outperform the best linear regression or best subset here.
We find that the GAM models do improve upon the performance of least squares and
best subset regression. When we compare the output of the best GAM Model, the best Least
Squares Model, and the best Subset Model, we see that fewer than one percent of the predictions
differ by more than one for DAMT, which indicates that all three models are producing similar
results. To summarize the best models, we use the following table:
Results ‐ Diagnostics of Prediction Regression Modeling
There are several potential problems when we fit a linear regression model. These
include Collinearity, Outliers, High‐leverage points, Correlation of error terms, Non‐constant
variance of error terms, and Non‐Linearity of the response‐predictor relationships. We take
several steps to avoid these pitfalls. For example, to check for multicollinearity we compute the
variance inflation factor (VIF) for each predictor.
The least squares model of 37 variables has a significant amount of collinearity, particularly among the
interaction variables we create. This is mostly corrected in the 25 variable model,
particularly after we remove the interaction variables.
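A VIF check of this kind can be sketched as follows. For each predictor we regress it on all the other predictors and compute 1 / (1 - R²); large values flag collinearity. The function `vif`, the synthetic predictors, and the use of NumPy are assumptions made for this example and do not reproduce our actual check.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    n, p = X.shape
    factors = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r_squared = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors[j] = 1.0 / (1.0 - r_squared)
    return factors

# Hypothetical predictors: a, b, their interaction a*b, and an unrelated c.
rng = np.random.default_rng(0)
a = rng.normal(loc=5.0, size=500)
b = rng.normal(loc=5.0, size=500)
c = rng.normal(size=500)
X = np.column_stack([a, b, a * b, c])
vifs = vif(X)
```

In this synthetic example the interaction column gets a large VIF because it is nearly a linear combination of its parent variables, while the unrelated column stays close to 1.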
Conclusion
We set out to help a charitable organization use machine learning to improve its
direct marketing campaign. We believe our results can help improve their donations and
reduce costs of the program. We built two models: a classification model and a regression
model. The classification model noted whether someone would be likely to donate. The best
modeling technique was Generalized Additive Models when we use maximum profit as the
model selection criterion. In particular, a model using smoothing splines with high degrees of
freedom led to the maximum profit, a value of $11,960. When applied to the test data, we see
that we should send mail to 302 records and no mail to the 1705 records where the model
outputs a lower probability. This is a 15% mailing rate. We know that typical response rates
are 10%, so we hope that we have isolated most of that 10% positive group within our 15%
mailing group.
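The mailing rate arithmetic can be checked directly. The counts below come from the paragraph above; the expected number of donors is only what the typical 10% response rate would imply if it held for the test records, which is an assumption rather than a result.

```python
mailed = 302
not_mailed = 1705

total = mailed + not_mailed            # all test records
mailing_rate = mailed / total          # share of records we mail to

typical_response_rate = 0.10           # typical rate quoted in the text
expected_donors = typical_response_rate * total  # implied donors under that assumption
```

This gives a mailing rate of roughly 15% and about 200 expected donors among the 2,007 test records.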
With our classification model finalized, we built a prediction model to model gift
amounts using the data for donors only. We utilized many different modeling techniques, and
least squares regression and best subset with k‐fold cross validation outperformed almost every
option, until we implemented the highly flexible Generalized Additive Modeling techniques.
We did not see improved modeling from regularization techniques like Ridge
Regression and Lasso; however, we did gain some insight from the Lasso. When the tuning
parameter is high, the Lasso model reduces the variables to just four, which showcases some of
the more important factors in determining donation amount: whether someone lives in
geographic region #4, the dollar amount of a recent gift, and the average dollar amount of gifts
to date have positive effects, and the number of children has a negative effect.
Appendix
Box Cox versus Log Transformation
We see several cases where a log transformation could make the independent variables
have a more normal distribution. This could help improve predictive accuracy of the model.
We also test these cases using a Box Cox transformation and we see the results of the Box Cox
and Log Transformation are very similar. We see this for avhv, incm, rgif, and tdon below:
Decision Trees for Interaction Variables Comparison
We create decision trees on the training data to help determine if there are interaction
variables that we should be including in the model. We ran a tree for both donr and damt (the
two dependent variables used in this study), and we see a similar decision tree for both.
Decision Tree for DONR
Decision Tree for DAMT
Classification ‐ Logistic Regression Model Results:
When we run these nine logistic models against the validation data, we get a better idea
of how these models would perform on new records. A summary of all nine logistic regression
models can be found below:
Classification – QDA Model Results:
QDA does not perform well compared to logistic regression. The performance of all QDA
models is shown below:
Classification – GAM Model Results:
The best model results are shown above in the report. A summary of the results of all
models is shown below.
Classification – Boosting Plots:
We find the worst performing variables via the relative
influence plot. The plot shown below is the result after removing
the worst performing variables found on the first run.
Below we show plots revealing the partial dependence for
the best performing variables. These show the marginal effect after
integrating out the other variables.
Partial Dependence Plots:
Least Squares Regression – Model Results:
We show the results of the best models above and the results of all five least squares
models below.
Best Subset with K‐Fold Cross Validation – Model Results:
We did not have space to showcase these results in the report, so we show the
full model results comparison below:
Principal Components Regression – Model Results:
We did not have space to showcase these results in the report, so we show the full model
results comparison below:
Partial Least Squares – Model Results:
We did not have space to showcase these results in the report, so we show the full model
results comparison below:
Ridge Regression – Model Results:
We did not have space to showcase these results in the report, so we show all of the
Ridge Regression model results below:
Lasso – Model Results:
We did not have space to showcase these results in the report, so we show all of the
Lasso model results below:
Linear Regression GAM – Model Results:
We did not have space to showcase these results in the report, so we show all of the
GAM model results below:
Ridge Regression Tuning Parameter Comparison:
The tuning parameter lambda for ridge regression shrinks the coefficients towards zero
as it increases. Below we see the comparison between the full least squares coefficient and three
different values of lambda: low (close to zero at 0.01), medium (lambda of 11,498), and high