UNEMPLOYMENT IN AMERICA
Analysis of the Census Planning Database, 2010 Census and 2008-2012 American Community Survey
Joseph Reiter – Villanova University Regression Methods – MAT8406
Overview
• Explanation of Dataset
• Treatment and Selection of Variables
• Model Assumptions
• Model Comparison
• Interpretation of Results
Dataset
• Tract-level Planning Database with 2010 Census and 2008-2012 American Community Survey data
• Tracts ≈ 4,000 people, same general geographic area for each census
• 73,068 total observations
• 559 variables
  • Geography
  • Population
  • Households
  • Housing Units
  • Operational
• Response variable = % Unemployment
Additional Variables
• Added indicator variables for 9 regions of the country
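
A minimal sketch of how nine regions could be coded as indicator variables. It assumes reference-level coding (one region is the baseline and gets no indicator), which would match the "559 + 8" count on the Variable Reduction slide. The region labels and the function name are hypothetical; the slides do not list the region names.

```python
# Hypothetical region labels; the slides do not name the nine regions.
REGIONS = [f"region_{i}" for i in range(1, 10)]


def region_indicators(region, regions=REGIONS):
    """Return 0/1 indicators for every region except the first,
    which serves as the reference level (all indicators zero)."""
    if region not in regions:
        raise ValueError(f"unknown region: {region!r}")
    return {f"is_{r}": int(r == region) for r in regions[1:]}


row = region_indicators("region_3")
print(len(row), row["is_region_3"])  # 8 1
```

Dropping one indicator avoids a column that is an exact linear combination of the intercept and the other eight indicators.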
Variable Reduction
• Start with 559 + 8 (567 variables)
• Only use percent variables (removes about ½)
• Drop margin of error variables (about ¼)
• Remove repeats between census data and ACS data
• Drop variables related to specific languages spoken (only keep if English was spoken as primary language or English spoken very well)
• Drop operational variables (specifics about census returns)
• Drop variables which are linear combinations of other variables (e.g., pct_Male is a linear combination of pct_Female)
• 58 independent variables remain
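
The name-based steps above could be sketched as a simple column filter. The column names below are hypothetical examples written in the style of the Planning Database, and the `pct_` prefix and `ACSMOE` marker are assumptions about its naming; the real file may differ. Which member of a linearly dependent pair gets dropped is an arbitrary choice here.

```python
# Hypothetical column names; the real dataset has 559 columns.
columns = [
    "Tot_Population_CEN_2010",
    "pct_Males_CEN_2010",
    "pct_Females_CEN_2010",
    "pct_Prs_Blw_Pov_Lev_ACS_08_12",
    "pct_Prs_Blw_Pov_Lev_ACSMOE_08_12",
    "pct_College_ACS_08_12",
    "Mail_Return_Rate_CEN_2010",
]

# Assumed list of columns that are linear combinations of columns we keep.
REDUNDANT = {"pct_Females_CEN_2010"}


def keep_column(name):
    """Keep percent variables that are neither margins of error nor redundant."""
    if not name.startswith("pct_"):  # only percent variables
        return False
    if "ACSMOE" in name:             # drop margin-of-error variables
        return False
    return name not in REDUNDANT     # drop linear combinations


kept = [c for c in columns if keep_column(c)]
print(kept)
```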
Variable Selection
• LASSO: selected 33 variables
• Forward Selection: 48 variables
• Backward Elimination: 47 variables
• Any variable that was not selected by more than one method was eliminated (48 left)
• Removed non-significant variables and addressed some multicollinearity issues (31 left)
• Using all possible regressions, variables with large individual R² values (> 0.1) were chosen
• RESULT: 5-variable model
  • % Persons Below Poverty Level
  • % Non-Hispanic Black
  • % On Public Assistance
  • % Not High School Graduates
  • % College Degrees
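
Two of the screening rules above can be illustrated with a short sketch: keeping variables chosen by more than one selection method, and computing the R² of a one-predictor regression for the > 0.1 screen. The variable names and toy selections are hypothetical, and the function names are illustrative rather than the author's code.

```python
from collections import Counter


def selected_by_multiple(selections, min_votes=2):
    """Return variables chosen by at least `min_votes` selection methods."""
    votes = Counter(name for chosen in selections.values() for name in set(chosen))
    return sorted(name for name, n in votes.items() if n >= min_votes)


def r_squared(x, y):
    """R-squared of a simple linear regression of y on x
    (equal to the squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical toy selections, not the 33/48/47 variables from the slides.
selections = {
    "lasso": {"poverty", "college"},
    "forward": {"poverty", "college", "renter"},
    "backward": {"poverty", "vacant"},
}
print(selected_by_multiple(selections))  # ['college', 'poverty']
```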
Evaluating Variance of Residuals
[Figure: grid of residual plots. Columns: No Transform, Arcsin(sqrt), Y^(1/3). Rows: with zeros, no zeros, high influence points removed.]
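
The two response transformations compared here, and their inverses for returning to the percent scale, can be sketched as follows. This assumes the percent is rescaled to a proportion (divided by 100) before the arcsin-square-root transform; the slides do not state how the rescaling was done.

```python
import math


def arcsin_sqrt(pct):
    """Arcsin-square-root transform of a percent in [0, 100]."""
    return math.asin(math.sqrt(pct / 100.0))


def arcsin_sqrt_inverse(value):
    """Back-transform to a percent; valid for values in [0, pi/2]."""
    return 100.0 * math.sin(value) ** 2


def cube_root(pct):
    """Cube-root transform Y^(1/3) of a non-negative percent."""
    return pct ** (1.0 / 3.0)


def cube_root_inverse(value):
    """Back-transform a cube-root value to a percent."""
    return value ** 3


y = 8.85  # an unemployment percent from the test-point table
print(arcsin_sqrt_inverse(arcsin_sqrt(y)), cube_root_inverse(cube_root(y)))
```

Both round trips return the original percent up to floating-point error, which is what allows fitted values from the transformed models to be reported in original units.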
Evaluating Normality of Residuals
[Figure: grid of QQ plots. Columns: No Transform, Arcsin(sqrt), Y^(1/3). Rows: with zeros, no zeros, high influence points removed.]
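
For reference, the points of a normal QQ plot can be computed as below. This is a sketch using one common plotting-position convention, (i - 0.5)/n, which is an assumption; the software used for the slides may use a different one. The residual values are made up for illustration.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Pair standard normal quantiles with sorted standardized residuals."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    standardized = sorted((r - m) / s for r in residuals)
    # Plotting positions (i - 0.5) / n for i = 1..n.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, standardized))


# Made-up residuals; points near the line y = x suggest approximate normality.
for q, r in qq_points([-2.0, -1.0, 0.0, 1.0, 2.0]):
    print(round(q, 3), round(r, 3))
```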
Model Comparison
Parameter Estimates (transformed to original percents)

                  No Transform {A}   Arcsin(sqrt) {B}   Y^(1/3) {C}
R²                0.4686             0.4524             0.4330
Intercept         6.99               7.16               7.21
Prs Blw Pov Lev   0.141              0.109              0.0956
NH Blk alone      0.0692             0.0515             0.0446
PUB ASST INC      0.325              0.243              0.212
Not HS Grad       0.0159             0.0167             0.0174
College           -0.0526            -0.0525            -0.0531

With High Influence Points Removed

                        No Transform {D}   Arcsin(sqrt) {E}   Y^(1/3) {F}
R²                      0.4810             0.4591             0.4374
Number Points Removed   281                159                148
Intercept               6.99               7.13               7.20
Prs Blw Pov Lev         0.132              0.105              0.0954
NH Blk alone            0.0670             0.0512             0.0445
PUB ASST INC            0.333              0.252              0.216
Not HS Grad             0.0192             0.0176             0.0175
College                 -0.0513            -0.0519            -0.0529
Test Points

Prs Blw Pov Lev  NH Blk alone  PUB ASST INC  Not HS Grad  College  Model A  Model B  Model C  Model D  Model E  Model F  Actual
16               13            3             15           28       9.89     9.17     8.82     9.82     9.16     8.83     -
9                0             0             1            54       5.44     5.42     5.39     5.42     5.40     5.39     -
0                16            0             13           25       6.99     6.88     6.83     7.03     6.88     6.82     -
11               6             6             21           14       10.50    9.93     9.64     10.52    9.95     9.66     -
29               46            0             18           81       10.30    8.78     8.05     10.08    8.72     8.05     -
35               10            4             34           16       13.62    12.92    12.51    13.43    12.87    12.51    -
5.23             3.33          5.54          3.75         38.97    7.77     7.25     7.02     7.82     7.28     7.03     8.85
28.05            78.19         4.73          17.99        18.21    17.23    16.41    15.83    16.91    16.35    15.85    17.65
5.86             2.57          2.07          7.33         35.76    6.90     6.67     6.57     6.93     6.67     6.56     9.65
31.82            86.12         0.00          33.37        0.61     17.94    17.66    17.45    17.56    17.52    17.42    12.19
13.57            5.00          1.89          14.95        9.16     9.62     9.23     9.03     9.56     9.20     9.03     7.84
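
As a check on how the Model A column is obtained, the sketch below applies the untransformed Model A estimates from the Model Comparison slide as a plain linear predictor. It reproduces the Model A predictions in the table to within rounding of the published coefficients. The dictionary keys and function names are illustrative, and the mean absolute error is computed here only for the five tracts that have an actual value.

```python
# Model A (no transform) estimates copied from the Model Comparison slide.
MODEL_A = {
    "intercept": 6.99,
    "poverty": 0.141,       # Prs Blw Pov Lev
    "nh_black": 0.0692,     # NH Blk alone
    "pub_asst": 0.325,      # PUB ASST INC
    "not_hs_grad": 0.0159,  # Not HS Grad
    "college": -0.0526,     # College
}


def predict_model_a(poverty, nh_black, pub_asst, not_hs_grad, college):
    """Predicted % unemployment from the five predictors (all in percent)."""
    m = MODEL_A
    return (m["intercept"] + m["poverty"] * poverty + m["nh_black"] * nh_black
            + m["pub_asst"] * pub_asst + m["not_hs_grad"] * not_hs_grad
            + m["college"] * college)


# The five test tracts with a known actual value: (predictors, actual).
tracts = [
    ((5.23, 3.33, 5.54, 3.75, 38.97), 8.85),
    ((28.05, 78.19, 4.73, 17.99, 18.21), 17.65),
    ((5.86, 2.57, 2.07, 7.33, 35.76), 9.65),
    ((31.82, 86.12, 0.00, 33.37, 0.61), 12.19),
    ((13.57, 5.00, 1.89, 14.95, 9.16), 7.84),
]

print(round(predict_model_a(16, 13, 3, 15, 28), 2))  # 9.89, first table row

mae = sum(abs(predict_model_a(*x) - actual) for x, actual in tracts) / len(tracts)
print(round(mae, 2))
```

Models B through F are not reproduced this way, since the slides do not spell out how their estimates were converted back to original percents.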
Conclusions
• Somewhat different models suggested by selection methods and random forest
• Model suggested is fairly robust to outliers and influential points
• Geographic region appeared to be less critical than expected, but still significant (although not included in the fully reduced models)
• Relationships are correlational and may not be causal, or causation may run in the opposite direction (e.g., percent below poverty level)
• Size of dataset presented issues with selecting variables; the selected model is not necessarily the best model
• More models should be developed and considered before concluding on the best model