Unit 8: Categorical predictors, I: Dichotomies

"There are two kinds of people in the world: Those who believe there are two kinds of people in the world and those who don't." –Robert Benchley, American Humorist (1889-1945)

The S-030 roadmap: Where’s this unit in the big picture?
What changes & what remains the same when we partial out LMiles?

• The two outcomes are no longer highly correlated
• SeatBelt states no longer have more fatalities (we knew this from the regression results)
• Warmer states still have more fatalities
• More urban states now have fewer occupant fatalities!
• Warmer states still are more likely to have SeatBelt Laws (but the partial is now n.s.)
• States with a greater %age of urban roads still have more non-occupant fatalities, but population density, by itself, no longer seems to matter
• Urbanicity is now uncorrelated with either SeatBelt laws or temperature
• The urbanicity variables are still highly correlated

• In general, the inter-correlations between predictors are smaller after we control for LMILES...
• But also, some of these correlations have changed sign! Hmmm...

• Really need to include LTEMP, don’t we!
• So we can probably add at least one urbanicity variable (but need to check about both)
• But this still does NOT mean they are necessarily collinear!
• We probably want to include PctUrban, but we’re now unsure about LPopDen, need to see what
Results of fitting a series of multiple regression models predicting occupant and non-occupant fatalities examining the effects of the presence of a primary seat belt law (n=50 states)
Log_e(number of fatalities), for car occupants and non-occupants, by presence of a primary seat belt law, overall and adjusted for vehicle miles, average temperature, percentage of urban roads and population density

                   Occupants                 Non-Occupants
                   Unadjusted   Adjusted     Unadjusted   Adjusted
Law (n=14)         6.21         5.77         5.15         4.63
No Law (n=36)      5.64         5.87         4.28         4.53
Diff in means      0.57         -0.10        0.86         0.10
t (for diff)       1.89         -2.05        2.61         1.10
p (for diff)       0.0643       0.0465       0.0120       0.2772

The difference between the means of the dichotomous predictor’s two categories is equal to the dichotomous predictor’s slope coefficient in a particular model. For example, for occupant fatalities:
• Unadjusted means: 6.21 - 5.64 = 0.57 in the uncontrolled model
• Adjusted means: 5.77 - 5.87 = -0.10 in the controlled model
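Why the difference in means equals the slope can be read off the fitted equations. A short worked version for occupant fatalities follows, writing D for the 0/1 SeatBeltLaw indicator; the symbols b_0, ..., b_5 are notation introduced here for illustration, not taken from the slides.

```latex
\begin{align*}
\text{Uncontrolled:}\quad \hat{Y} &= b_0 + b_1 D \\
\hat{Y}\big|_{D=1} - \hat{Y}\big|_{D=0} &= (b_0 + b_1) - b_0 = b_1 = 6.21 - 5.64 = 0.57 \\[6pt]
\text{Controlled:}\quad \hat{Y} &= b_0 + b_1 D + b_2\,\mathit{LMiles} + b_3\,\mathit{LTemp}
      + b_4\,\mathit{PctUrban} + b_5\,\mathit{LPopDen} \\
\hat{Y}\big|_{D=1} - \hat{Y}\big|_{D=0} &= b_1 = 5.77 - 5.87 = -0.10
      \quad \text{(covariates held at the same values for both groups)}
\end{align*}
```

In the controlled model the covariate terms cancel in the subtraction, which is why the adjusted means differ by exactly the SeatBeltLaw coefficient whatever common covariate values are chosen.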
Towards a graphic display of the regression findings: Which predictors would we want to highlight in a graph?
• Regression models can easily include dichotomous predictors
  – All assumptions are about Y at particular values of X (or X’s); there are no assumptions about the distribution of the predictors
  – The same toolkit we’ve developed for continuous predictors can be used for dichotomous predictors (including hypothesis tests, correlations and plots)
• Controlled effects are often different from uncontrolled effects
  – One of the major reasons we use multiple regression is that we have several predictors that affect the outcome and for which we want to control statistically
  – Not only can we control for a single covariate, we can control for many covariates simultaneously (in this example, we had 4 covariates in addition to our question variable)
• Results of complex analyses can be displayed more simply using tables and graphs
  – As your models become more complex, the need for simpler numerical and graphical displays remains
  – Always important to think about how you will communicate your results to colleagues and broader audiences
  – Adjusted means and prototypical trajectories are powerful tools
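One way the uncontrolled and controlled comparisons summarized in this unit might be produced in SAS is sketched below. It assumes the dataset one and the variable names that appear in the appendix code (LDPFat, SeatBeltLaw, lmiles, ltemp, PctUrban, lpopden). The slides do not say which procedure generated the adjusted means, so using proc glm with an lsmeans statement is an assumption, not the author's documented method.

```sas
* Uncontrolled and controlled models for occupant fatalities.
  The SeatBeltLaw slope in each model is the difference in means;
proc reg data=one;
  Uncontrolled: model LDPFat = SeatBeltLaw;
  Controlled:   model LDPFat = SeatBeltLaw lmiles ltemp PctUrban lpopden;
run;
quit;

* Assumed approach to adjusted means: least-squares means of LDPFat
  for each SeatBeltLaw group, evaluated at the covariate sample means;
proc glm data=one;
  class SeatBeltLaw;
  model LDPFat = SeatBeltLaw lmiles ltemp PctUrban lpopden / solution;
  lsmeans SeatBeltLaw / pdiff;
run;
quit;
```

Because the class statement makes the last level (SeatBeltLaw=1) the reference category, the coefficient that proc glm prints for SeatBeltLaw=0 has the opposite sign from the proc reg slope; the difference between the two least-squares means is unaffected.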
Appendix: Annotated PC-SAS Code for Using Dichotomous Predictors

Note that this is just an excerpt from the full program.

*------------------------------------------------------------------*
 Creating boxplots of PDFAT & NOFAT distributions
 for SEATBELTLAW=0 and SEATBELTLAW=1
*------------------------------------------------------------------*;
proc boxplot data=one;
  title2 "Fatalities by Presence/Absence of SeatBelt Laws";
  plot (PDFat NOFat)*SeatBeltLaw;
run;

proc boxplot, when used for dichotomous predictors, creates pairs of boxplots comparing the outcome variables' values across the two categories of the dichotomous predictor. The plot statement specifies the outcome variables to be used and the dichotomous predictor. Its syntax is outcome*predictor (note the use of parentheses because of the two outcome variables).

*-------------------------------------------------------------------*
 Display PDFAT & NOFAT univariate summary information in tables
 for SEATBELTLAW=0 & SEATBELTLAW=1
*------------------------------------------------------------------*;
proc means data=one;
  by SeatBeltLaw;
  var PDFat NOFat;
run;

proc means is a very useful tool for creating table summaries of descriptive statistics, especially for categorical predictors. The by statement specifies the categorical predictor used to group the data (BY-group processing expects the data to be sorted by that variable, for example with proc sort). The var statement specifies the variables for which you require descriptive statistics.

*-------------------------------------------------------------------*
 Comparing mean values of PDFAT & NOFAT
 for SEATBELTLAW=0 and SEATBELTLAW=1
*------------------------------------------------------------------*;
proc ttest data=one;
  class SeatBeltLaw;
  var PDFat NOFat;
run;

proc ttest runs a two-sample t-test comparing the means of two groups. The class statement specifies the categorical predictor used to differentiate the two groups.
*-------------------------------------------------------------------*
 For pedagogic purposes only: What happens if we change the reference
 category? Creating new dichotomous predictor NOSEATBELTLAW
*------------------------------------------------------------------*;
data one;
  set one;
  NoSeatBeltLaw = 1 - SeatBeltLaw;
run;

Use a data step in the middle of the program to add new variables to an existing dataset. The set statement names the dataset to read, so data one; set one; rewrites dataset one with the new variable added. You can then run new PROCs on the same data, using the new variables.
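To see what changing the reference category does, the same uncontrolled model can be fit with each coding of the predictor. This is a minimal sketch assuming the dataset one now contains both SeatBeltLaw and NoSeatBeltLaw, and using LDPFat as the outcome; the model labels LawCoded1 and NoLawCoded1 are hypothetical names chosen here.

```sas
* Same uncontrolled model, fit with the two codings of the predictor;
proc reg data=one;
  LawCoded1:   model LDPFat = SeatBeltLaw;
  NoLawCoded1: model LDPFat = NoSeatBeltLaw;
run;
quit;
```

The intercept in each model is the mean of the group coded 0, and the slope is the difference between the two group means. Going by the unadjusted occupant means reported in the table (6.21 and 5.64), the first model should show an intercept near 5.64 and a slope near 0.57, and the second an intercept near 6.21 and a slope near -0.57. The t statistic changes sign but not magnitude, so the p-value and the fit of the model are unchanged.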
*-------------------------------------------------------------------*
 Controlling for vehicle miles
 Inspect bivariate scatterplots LDPFAT vs MILES, LDPFAT vs LMILES,
 LNOFAT vs MILES, LNOFAT vs LMILES
 Inspect same plots showing SEATBELTLAW=0 and SEATBELTLAW=1
*-----------------------------------------------------------------*;
proc gplot data=one;
  title2 "Examining the effect of vehicle miles";
  plot (LDPFat LNOFat)*(miles lmiles);
  plot (LDPFat LNOFat)*(miles lmiles)=SeatBeltLaw;
run;
quit;

proc gplot can also be used to draw a three-way plot, with plotting symbols denoting the third (here categorical) predictor. The plot statement syntax is outcome*predictor=categorical predictor. If you use a symbol statement in the program, SAS will use dots ● of different colors for each category of the predictor. Note that you can have multiple plot statements in a single GPLOT.
*-------------------------------------------------------------------*
 Estimating partial correlations controlling for LMILES
*------------------------------------------------------------------*;
proc corr data=one;
  title2 "Partial correlation matrix controlling for Lmiles";
  var LDPFat LNOFat SeatBeltLaw ltemp PctUrban lpopden;
  partial lmiles;
run;

proc corr estimates bivariate correlations between the variables you specify. If you add a partial statement, it instead estimates partial correlations, controlling for the variable named in the partial statement.
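A partial correlation can also be understood as the ordinary correlation between two sets of residuals, each obtained by regressing a variable on the control variable. The sketch below illustrates this for LDPFat and SeatBeltLaw, controlling for lmiles; the output dataset resid and the residual names rLDPFat and rSeatBeltLaw are hypothetical names introduced here.

```sas
* Remove the linear effect of lmiles from each variable and
  save the residuals in a new dataset;
proc reg data=one noprint;
  model LDPFat SeatBeltLaw = lmiles;
  output out=resid r=rLDPFat rSeatBeltLaw;
run;
quit;

* The correlation between the two residuals is the partial
  correlation of LDPFat and SeatBeltLaw controlling for lmiles;
proc corr data=resid;
  var rLDPFat rSeatBeltLaw;
run;
```

The correlation coefficient should match the corresponding entry from the partial statement. The p-value may differ slightly, because proc corr run on residuals does not know that one degree of freedom was already used to control for lmiles.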