John Rabovich STA 9850 Project 2 “Analyzing home loan trends in the five boroughs across ethnicity and gender” I wanted to analyze home loan data in New York City. Specifically, I wanted to see if there were differences in the origination of home loans in the five boroughs by ethnicity. It seems that race plays a factor in many facets of our society, and I wanted to see if there were differences in home loan origination by ethnicity. The question that I am posing is “Are there differences in the average value of the loans originated due to different ethnicities?” The second question I want to ask is ”Are there are differences in the average value of loans originated between genders?” It seems that while our country has come a long way in terms of gender relationships and gender discrimination, it doesn’t appear that we are at a place of complete gender equality yet. To answer these questions, I went to the consumerfinance.gov website and from there went to the Consumer Financial Protection Bureau section. I filtered home loan origination in 2014 by the five boroughs, types of housing, whether the person who took out the loan occupied the home as their personal dwelling, whether the loan was for a home purchase or refinancing, as well as if the ethnicity was “Hispanic or latino” or “not Hispanic or Latino.” This gave me a large dataset of 33,407 records from which I could analyze number of loans given out, the amount of the loans, and I could break it down by race and gender. Here is the link to the site: http://www.consumerfinance.gov/hmda/explore#filters Finding out if there are differences in the amount of loans given out by race using bootstrap resampling techniques Before I used resampling techniques I wanted to get some descriptive statistics about my dataset as well as load the data, etc... Below is my initial code to load the data: proc import out=WORK.loan_data datafile="C:\Users\jr167238\Desktop\home_loan_data_2014.xlsx" dbms=excel replace ; SHEET="hmda_lar"; getnames=yes; run; proc print data=loan_data(obs=15); run; Data loan_data1; Set loan_data; Where applicant_race_name_1 NE .; run; Proc contents data = loan_data varnum; run;
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
John Rabovich STA 9850 Project 2 “Analyzing home loan trends in the five boroughs across ethnicity and gender”
I wanted to analyze home loan data in New York City. Specifically, I wanted to see if there were differences in the origination of home loans in the five boroughs by ethnicity. It seems that race plays a factor in many facets of our society, and I wanted to see if there were differences in home loan origination by ethnicity. The question that I am posing is “Are there differences in the average value of the loans originated due to different ethnicities?” The second question I want to ask is ”Are there are differences in the average value of loans originated between genders?” It seems that while our country has come a long way in terms of gender relationships and gender discrimination, it doesn’t appear that we are at a place of complete gender equality yet.
To answer these questions, I went to the consumerfinance.gov website and from there went to the Consumer Financial Protection Bureau section. I filtered home loan origination in 2014 by the five boroughs, types of housing, whether the person who took out the loan occupied the home as their personal dwelling, whether the loan was for a home purchase or refinancing, as well as if the ethnicity was “Hispanic or latino” or “not Hispanic or Latino.” This gave me a large dataset of 33,407 records from which I could analyze number of loans given out, the amount of the loans, and I could break it down by race and gender. Here is the link to the site: http://www.consumerfinance.gov/hmda/explore#filters
Finding out if there are differences in the amount of loans given out by
race using bootstrap resampling techniques
Before I used resampling techniques I wanted to get some descriptive statistics about my
dataset as well as load the data, etc... Below is my initial code to load the data:
Proc means data = loan_data n nmiss mean std min max range skew kurtosis
maxdec=2 ;
var _numeric_;
run;
proc freq data = loan_data;
tables applicant_race_name_1 * loan_amount_000s
;
Run;
The data below is a screenshot of the proc print just to doublecheck that the data loaded into
SAS matches the raw data in the Excel file. I cut it off after the first fifteen observations.
Below is the output from proc means. As you can see, the mean for loan amount(variable name
loan_amount_000s) is 466.14 or $466,140. This will be useful as a baseline amount from which to
analyze data later on.
When I ran proc freq for loan amount by race, the data is pretty staggering. 20.9% of the total
amount of the loans were given to Asian people, black people or African Americans had 12.4% of the total amount, and white people had 64.15%
To analyze the difference in average value of loans by race, I used the bootstrap method of
resampling using proc surveyselect and then I looked at the descriptive statistics of this resampled data
using proc univariate. I compared the resampled data against the original data by using proc univariate on the original data and the resampled data. Below is the code I used:
proc surveyselect data=loan_data out=outboot
seed=3045956
method=urs
samprate=0.2
outhits
rep=100;
run;
proc sort data=outboot;
by applicant_race_name_1;
run;
proc univariate data=outboot;
var loan_amount_000s;
by applicant_race_name_1;
output out=outall mean=sampmean;
run;
proc sort data=loan_data;
by applicant_race_name_1;
run;
proc univariate data=loan_data;
var loan_amount_000s;
by applicant_race_name_1;
output out=outall mean=sampmean;
run;
Using bootstrap resampling techniques to determine loan amount by race output
Native Americans(American Indian or Alaskan Native)
Comparing the Native American resampled data to the original data, we see that the means of the
amount of loans given out to Native Americans are slightly different to each other. The resampled mean
is 388.60 and the mean for the original data is 380.28. However compared to the means for the total
data of the value of loans given out of 466.14, we see that Native Americans average loan values were
significantly smaller than the mean loan value . We also see that according to Kolomogorov-Smirnov p-
values and the graphs for Normality that the data does not seem to be normally distributed.
Asian
Comparing the Asian resampled data to the original, we see that the means of the value of loans given
out to Asians are remarkably consistent. The resampled mean is 410.41 and the mean for the original
data is 409.24. However compared to the means for the total data of the value of loans given out of
466.14, we see that Asians were given smaller loan amounts than the average. We also see that
according to Kolomogorov-Smirnov p-values and the graphs for Normality that the data does not seem
to be normally distributed.
Black or African American
Comparing the African American resampled data to the original, we see that the means of loan amounts
given to African Ameicans are remarkably consistent. The resampled mean is 326.99 and the mean for
the original data is 327.24. However compared to the means for the total data for the loan values out of
466.14, we see that African Ameicans were given much smaller loan amounts than the average. We also
see that according to Kolomogorov-Smirnov p-values and the graphs for Normality that the data does not seem to be normally distributed.
Native Hawaiian or Pacific Islander
Comparing the Native Hawaiian or Pacific Islander resampled data to the original, we see that the means
of the amount of loans given out to Native Hawaiian or Pacific Islanders are a little different. The
resampled mean is 330.66 and the mean for the original data is 333.95. However compared to the
means for the total data of loan values of 466.14, we see that Native Hawaiian or Pacific Islander were
given much smaller loan amounts than the average. We also see that according to Kolomogorov-Smirnov p-values and the graphs for Normality that the data does not seem to be normally distributed.
White
Comparing the Whites resampled data to the original, we see that the means of the amount of loans
given out to Whites are remarkably consistent. The resampled mean and the mean for the original data
is 514.65. Compared to the means for the total data for the amount of loans given out of 466.14, we see
that Whites had much larger loan values than average. We also see that according to Kolomogorov-Smirnov p-values and the graphs for Normality that the data does not seem to be normally distributed.
Conclusions
According to the data, white people’s loans were much larger on average than every other ethnicity. Not
just that, but much more white people were given loans than any other ethnicity as well. Excluding
Native Americans due to their smaller sample size, it appears that average loan amounts given to other
races was in the $300,000s. The exception to this were people of Asian descent, who were given average
loan amounts of $410,000. While this is significantly higher than African Americans, Native Americans,
and Pacific Islanders on average, it still pales in comparison to the average loan amount of $514,000 given to white people. You can see the distribution by ethnicity graphically below:
Finding out if there are differences in the value of loans given out by
gender using cross-validation with resampling macro techniques
For the question as to whether there were differences in the amount of loans given out due to gender, I
wanted to get some initial baseline statistics using proc freq and proc means. Using proc freq, you can
see from the table below that 36.14% of the total amount of the loans originated were granted to
women while 63.86% of the total amount of the loans originated were granted to men. We also see that the average amount for each loan regardless of gender is $466,140.
I also used the procnpar1way procedure to see what the mean for the amount of the loans originated
were for males and females. The mean loan amount for males is $516,542 while the mean loan amount
for females is $377,075.
I also wanted to use the proc univariate procedure on the sample data to compare descriptive statistics before I used resampling techniques.
The means for the proc univariate procedure for loan amount by gender correspond to the means used
in the procnpar1way procedure. Specifically, the mean for loan amount for females is 377.07 for both
procedures while the mean for loan amount by males is 516.54
Now I want to use cross validation techniques to resample the data in SAS. Please see the code I used below: