Summary: Lesson 1: Introduction to Statistics
This summary contains topic summaries, syntax, and sample programs.
Topic Summaries

Basic Statistical Concepts

Descriptive statistics organizes, describes, and summarizes data using numbers and graphical techniques. Inferential statistics is concerned with drawing conclusions about a population from the analysis of a random sample drawn from that population, and with the precision and reliability of those inferences.

A population is the complete set of observations or the entire group of objects that you are researching. A sample is a subset of the population. The sample should be representative of the population; you can obtain a representative sample by collecting a simple random sample.

Parameters are numerical values that summarize characteristics of a population. Parameter values are typically unknown and are represented by Greek letters. Statistics summarize characteristics of a sample and are represented by letters from the English alphabet. You can measure characteristics of your sample and provide numerical values that summarize those characteristics. You use statistics to estimate parameters.

Variables are characteristics or properties of data that take on different values or amounts. A variable can be independent or dependent. In some contexts, you select the value of an independent variable in order to determine its relationship to the dependent variable. In other contexts, the independent variable's values are simply taken as given.

Variables are also classified according to their characteristics. They can be quantitative or categorical. Data that consists of counts or measurements is quantitative. Quantitative data can be further distinguished by two types: discrete and continuous. Discrete data takes on only a finite, or countable, number of values. Continuous data has an infinite number of values and no breaks or jumps. Categorical, or attribute, data consists of variables that denote groupings or labels. There are two main types: nominal and ordinal. A nominal categorical variable exhibits no ordering within its groups or categories. With ordinal categorical variables, the observed levels of the variable can be ordered in a meaningful way that implies differences due to magnitude.

A variable's classification is its scale of measurement. There are two scales of measurement for categorical variables: nominal and ordinal. There are two scales of measurement for continuous variables: interval and ratio. Data from an interval scale can be rank-ordered and has a sensible spacing of observations, such that differences between measurements are meaningful. However, interval scales lack a true zero point, so ratios between values on the scale are not meaningful. Data on a ratio scale includes a true zero point and can therefore accurately represent the ratio between two values on the measurement scale.

The appropriate statistical method for your data also depends on the number of variables involved. Univariate analysis provides techniques for analyzing and describing a single variable at a time. Bivariate analysis describes and explains the relationship between two variables and how they change, or covary, together. Multivariate analysis examines two or more variables at the same time, in order to understand the relationships among them.
Descriptive Statistics

A data's distribution tells you what values your data takes and how often it takes those values.

You can calculate descriptive statistics that measure locations in your data. Statistics that locate the center of the data are measures of central tendency; these include the mean, median, and mode. Percentiles are descriptive statistics that give you reference points in your data. A percentile is the value of a variable below which a certain percentage of observations fall. The most commonly reported percentiles are quartiles, which break the data into quarters.

Several descriptive statistics measure the variability of your data: range, interquartile range (IQR), variance, standard deviation, and coefficient of variation (C.V.).

To summarize and generate descriptive statistics, you use the MEANS procedure. PROC MEANS calculates a standard set of statistics, including the minimum, maximum, and mean data values, as well as the standard deviation and n. The PRINTALLTYPES option displays statistics for all requested combinations of class variables.
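As a concrete illustration of these variability measures, here is a minimal sketch in Python with numpy and made-up scores (an illustration only, not part of the SAS course materials):

```python
import numpy as np

scores = np.array([1050, 1120, 1190, 1230, 1280, 1340, 1400, 1480])

mean = scores.mean()                    # measure of central tendency
median = np.median(scores)              # 50th percentile
rng = scores.max() - scores.min()       # range
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1                           # interquartile range
var = scores.var(ddof=1)                # sample variance (n - 1 divisor)
std = scores.std(ddof=1)                # sample standard deviation
cv = 100 * std / mean                   # coefficient of variation, in percent

print(mean, median, rng, iqr, round(cv, 1))
```

The coefficient of variation expresses the standard deviation as a percentage of the mean, which makes variability comparable across variables measured on different scales.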
Picturing Your Data

A histogram is a visual representation of the frequency distribution of your data. The frequencies are represented by bars.

The normal distribution is a common theoretical distribution in statistics. It is bell-shaped, with values concentrated near the mean, and it is symmetric around the mean. The standard deviation (σ) determines how variable the distribution is. Underlying the normal distribution is a mathematical function named the probability density function.

To check the assumption that your random sample has a normal distribution, you can plot a histogram. You can also look at statistical summaries of your data. The closer skewness and kurtosis are to 0, the more closely your data resembles the normal distribution.

Skewness measures the tendency of your data to be more spread out on one side of the mean than on the other; it measures the asymmetry of the distribution. The direction of skewness is the direction in which the data trails off. The closer the skewness is to 0, the more normal, or symmetric, the data.

Kurtosis measures the tendency of data to be concentrated toward the center or toward the tails of the distribution. The closer kurtosis is to 0, the more closely the tails of the data resemble the tail thickness of the normal distribution. Kurtosis can be difficult to assess visually. A negative kurtosis statistic means that the data has lighter tails than a normal distribution and is less heavily concentrated about the mean; this is a platykurtic distribution. A positive kurtosis statistic means that the data has heavier tails and is more concentrated about the mean than a normal distribution; this is a leptokurtic distribution, which is often referred to as heavy-tailed and also as an outlier-prone distribution.

A normal probability plot is another way to visualize and assess the distribution of your data. The vertical axis represents the actual data values. The horizontal axis displays the expected percentiles from a standard normal distribution. The normal reference line along the diagonal indicates where the data would fall if it were perfectly normal.

A box plot makes it easy to see how spread out the data is and whether there are any outliers.

You can use PROC UNIVARIATE to generate descriptive statistics, histograms, and normal probability plots. In the ID statement, you list the variable or variables that SAS should label in the table of extreme observations and identify as outliers in the graphs. You can add options to the HISTOGRAM and PROBPLOT statements: the NORMAL option uses estimates of the population mean and standard deviation to add a normal curve overlay to the histogram and a diagonal reference line to the normal probability plot. You can use the INSET statement to place a box of summary statistics directly on the graphs.

In addition to the statistical graphics available with PROC UNIVARIATE, you might want to use PROC SGSCATTER, PROC SGPLOT, PROC SGPANEL, and PROC SGRENDER to produce a wide variety of additional plot types. You can use PROC SGPLOT to generate dot plots, horizontal and vertical bar charts, histograms, box plots, density curves, scatter plots, series plots, band plots, needle plots, and vector plots. The REG statement generates a fitted regression line or curve. You use a REFLINE statement to create a horizontal or vertical reference line on the plot.

ODS Graphics is an extension of the SAS Output Delivery System. With ODS Graphics, statistical procedures produce graphs as automatically as they produce tables, and graphs are integrated with tables in the ODS output. You can find a list of the graphs available for each SAS procedure in the SAS documentation.
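As a quick numeric check of the skewness and kurtosis definitions above, here is a sketch in Python with scipy (an illustration outside SAS, using simulated data): for normally distributed data, both statistics should be near 0, and data that trails off to the right should show clearly positive skewness.

```python
import numpy as np
from scipy.stats import skew, kurtosis

gen = np.random.default_rng(42)
normal_data = gen.normal(loc=100, scale=15, size=100_000)

# Right-skewed data for contrast: an exponential trails off to the right
skewed_data = gen.exponential(scale=1.0, size=100_000)

print(skew(normal_data))      # near 0: symmetric
print(kurtosis(normal_data))  # near 0: excess kurtosis, normal-like tails
print(skew(skewed_data))      # clearly positive: trails off to the right
```

Note that scipy's kurtosis reports excess kurtosis (normal = 0) by default, matching the convention used in this lesson.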
Confidence Intervals for the Mean

A point estimator is a sample statistic used to estimate a population parameter. A statistic that measures the variability of your estimator is the standard error. The standard error of the mean measures the variability of your sample mean: it's an estimate of how much you can expect the sample mean to vary from sample to sample.

The distribution of sample means is the distribution of all possible sample means from the population. The distribution of the mean is always less variable than the data.

An interval estimator is another way to estimate a population parameter. It incorporates the uncertainty that arises from random variability. Confidence intervals are a type of interval estimator used to estimate the population mean while taking into account the variability of the sample mean.

The central limit theorem states that the distribution of sample means is approximately normal, regardless of the population distribution's shape, if the sample size is large enough.

You can use the MEANS procedure to generate a 95% confidence interval for the mean. You use the CLM option in the PROC MEANS statement to calculate the confidence limits for the mean. You can add the ALPHA= option to the PROC MEANS statement in order to construct confidence intervals with a different confidence level.
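The mechanics behind the CLM output can be sketched directly: the 95% interval is the sample mean plus or minus a t quantile times the standard error. A minimal Python illustration (not SAS, with made-up data):

```python
import numpy as np
from scipy import stats

data = np.array([1050, 1120, 1190, 1230, 1280, 1340, 1400, 1480])
n = len(data)
xbar = data.mean()
stderr = data.std(ddof=1) / np.sqrt(n)        # standard error of the mean

alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t quantile for 95% confidence
lower, upper = xbar - tcrit * stderr, xbar + tcrit * stderr
print(lower, upper)
```

Changing alpha here plays the same role as the ALPHA= option: a smaller alpha gives a wider, more confident interval.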
Hypothesis Testing

A hypothesis test uses sample data to evaluate a question about a population. It provides a way to make inferences about a population, based on sample data.

There are four steps in conducting a hypothesis test. The first step is to identify the population of interest and determine the null and alternative hypotheses. The null hypothesis, H0, is what you assume to be true unless proven otherwise; it is usually a hypothesis of equality. The alternative hypothesis, Ha or H1, is typically what you suspect, or are attempting to demonstrate; it is usually a hypothesis of inequality. The second step is to select the significance level: the amount of evidence needed to reject the null hypothesis. A common significance level is 0.05 (1 chance in 20). The third step is to collect the data. The fourth step is to use a decision rule to evaluate the data: you decide whether or not there is enough evidence to reject the null hypothesis.

If you reject the null hypothesis when it's actually true, you've made a Type I error. The probability of committing a Type I error is α, the significance level of the test. If you fail to reject the null hypothesis when it's actually false, you've made a Type II error. The probability of committing a Type II error is β. Type I and Type II errors are inversely related. The power of a statistical test is equal to 1 minus beta (1 − β).

The difference between the observed statistic and the hypothesized value is the effect size. A p-value measures the probability of observing a value as extreme as, or more extreme than, the one observed. A p-value is affected not only by the effect size but also by the sample size.

The t statistic measures how far the sample mean, x̄, is from the hypothesized mean, μ0. If the t statistic is much higher or lower than 0 and has a small corresponding p-value, the sample mean is quite different from the hypothesized mean, and you reject the null hypothesis.

You can use PROC UNIVARIATE to perform a statistical hypothesis test. You use the MU0= option to specify the value of the hypothesized mean, μ0. You can use the ALPHA= option to change the significance level.
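The same one-sample test that PROC UNIVARIATE performs with MU0= can be sketched in Python with scipy (made-up scores; the hypothesized mean 1200 mirrors the sample program later in this lesson):

```python
import numpy as np
from scipy import stats

scores = np.array([1240, 1180, 1310, 1270, 1190, 1350, 1220, 1300])

# H0: population mean = 1200; Ha: population mean != 1200
t_stat, p_value = stats.ttest_1samp(scores, popmean=1200)
print(t_stat, p_value)

# Decision rule at significance level alpha = 0.05
reject_h0 = p_value < 0.05
```

A large t statistic with a small p-value indicates the sample mean is far from the hypothesized mean relative to its standard error, so H0 is rejected.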
Syntax
PROC MEANS DATA=SAS-data-set <options>;
   CLASS variables;
   VAR variables;
RUN;

PROC UNIVARIATE DATA=SAS-data-set <options>;
   VAR variables;
   ID variables;
   HISTOGRAM variables </options>;
   PROBPLOT variables </options>;
   INSET keywords </options>;
RUN;

PROC SGPLOT DATA=SAS-data-set;
   DOT category-variable </options>;
   HBAR category-variable </options>;
   VBAR category-variable </options>;
   HBOX response-variable </options>;
   VBOX response-variable </options>;
   HISTOGRAM response-variable </options>;
   SCATTER X=variable Y=variable </options>;
   NEEDLE X=variable Y=numeric-variable </options>;
   REG X=numeric-variable Y=numeric-variable </options>;
   REFLINE variable | value-1 </options>;
RUN;

ODS GRAPHICS ON;
   statistical procedure code
ODS GRAPHICS OFF;
Sample Programs
Using PROC MEANS to Generate Descriptive Statistics

proc means data=statdata.testscores maxdec=2 fw=10 printalltypes
           n mean median std var q1 q3;
   class Gender;
   var SATScore;
   title 'Selected Descriptive Statistics for SAT Scores';
run;
title;
Using SAS to Picture Your Data

proc univariate data=statdata.testscores;
   var SATScore;
   id idnumber;
   histogram SATScore / normal(mu=est sigma=est);
   inset skewness kurtosis / position=ne;
   probplot SATScore / normal(mu=est sigma=est);
   inset skewness kurtosis;
   title 'Descriptive Statistics Using PROC UNIVARIATE';
run;
title;

proc sgplot data=statdata.testscores;
   refline 1200 / axis=y lineattrs=(color=blue);
   vbox SATScore / datalabel=IDNumber;
   format IDNumber 8.;
   title "Box Plots of SAT Scores";
run;
title;
Calculating a 95% Confidence Interval

proc means data=statdata.testscores maxdec=4 n mean stderr clm;
   var SATScore;
   title '95% Confidence Interval for SAT';
run;
title;
Using PROC UNIVARIATE to Perform a Hypothesis Test

ods select testsforlocation;
proc univariate data=statdata.testscores mu0=1200;
   var SATScore;
   title 'Testing Whether the Mean of SAT Scores = 1200';
run;
title;
Statistics I: Introduction to ANOVA, Regression, and Logistic Regression
Copyright 2014 SAS Institute Inc., Cary, NC, USA. All rights reserved.
Summary: Lesson 2: Analysis of Variance (ANOVA)
This summary contains topic summaries, syntax, and sample programs.
Topic Summaries

Two-Sample t-Tests

The two-sample t-test is a hypothesis test for answering questions about the means of two populations. You can examine the differences between populations for one or more continuous variables and assess whether the means of the two populations are statistically different from each other.

The null hypothesis for the two-sample t-test is that the means for the two groups are equal. The alternative hypothesis is the logical opposite of the null and is typically what you suspect or are trying to show; it is usually a hypothesis of inequality. The alternative hypothesis for the two-sample t-test is that the means for the two groups are not equal.

The three assumptions for the two-sample t-test are independence, normality, and equal variances.

You use the F-test for equality of variances to evaluate the assumption of equal variances in the two populations. You calculate the F statistic, which is the ratio of the maximum sample variance of the two groups to the minimum sample variance of the two groups. If the p-value of the F-test is greater than your alpha, you fail to reject the null hypothesis and can proceed as if the variances are equal between the groups. If the p-value of the F-test is less than your alpha, you reject the null hypothesis and proceed as if the variances are not equal.

With one-sided tests, you look for a difference in one direction. For instance, you can test to determine whether the mean of one population is greater than or less than the mean of another population. An advantage of one-sided tests is that they can increase the power of a statistical test.

To perform the two-sample t-test and the one-sided test, you can use PROC TTEST. You add the PLOTS option to the PROC TTEST statement to control the plots that ODS produces. You add the SIDES=U or SIDES=L option to specify an upper or lower one-sided test.
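The sequence described above can be sketched outside SAS; a minimal Python illustration with scipy and made-up scores (the folded F statistic and the one-sided alternative are spelled out explicitly):

```python
import numpy as np
from scipy import stats

girls = np.array([1260, 1310, 1190, 1350, 1270, 1220])
boys  = np.array([1180, 1240, 1150, 1300, 1210, 1170])

# Folded F-test idea: larger sample variance over smaller sample variance
f_stat = max(girls.var(ddof=1), boys.var(ddof=1)) / \
         min(girls.var(ddof=1), boys.var(ddof=1))

# Pooled (equal-variance) two-sample t-test
t_stat, p_value = stats.ttest_ind(girls, boys, equal_var=True)

# Upper one-sided version (Ha: mean(girls) > mean(boys)), like SIDES=U
t_u, p_u = stats.ttest_ind(girls, boys, equal_var=True,
                           alternative='greater')
print(f_stat, t_stat, p_value, p_u)
```

Because the observed t statistic is positive here, the upper one-sided p-value is half the two-sided p-value, which is how a one-sided test gains power in the suspected direction.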
One-Way ANOVA

You can use ANOVA to determine whether there are significant differences between the means of two or more populations. In this model, you have a continuous dependent, or response, variable and a categorical independent, or predictor, variable. With ANOVA, the null hypothesis is that all of the population means are equal. The alternative hypothesis is that not all of the population means are equal; in other words, at least one mean is different from the rest.

One way to represent the relationship between the response and predictor variables in ANOVA is with a mathematical ANOVA model. ANOVA analyzes the variances of the data to determine whether there is a difference between the group means: you determine whether the variation of the means is large enough relative to the variation of observations within the groups. To do this, you calculate three types of sums of squares: between-group variation (SSM), within-group variation (SSE), and total variation (SST). The SSM and SSE represent pieces of the total variability. If the SSM is large relative to the SSE, as judged by the F statistic and its p-value, you reject the null hypothesis that all of the group means are equal.

Before you perform the hypothesis test, you need to verify the three ANOVA assumptions: the observations are independent, the error terms are normally distributed, and the error terms have equal variances across groups. The residuals that come from your data are estimates of the error term in the model. You calculate the residuals from ANOVA by taking each observation and subtracting its group mean. Then you verify the two assumptions regarding normality and equal variances of the errors.

To verify the ANOVA assumptions and perform the ANOVA test, you use PROC GLM. In the MODEL statement, you specify the dependent and independent variables for the analysis. The MEANS statement computes unadjusted means of the dependent variable for each value of the specified effect. You can add the HOVTEST option to the MEANS statement to perform Levene's test for homogeneity of variances. If the resulting p-value of Levene's test is greater than 0.05 (typically), then you fail to reject the null hypothesis of equal variances.
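The same workflow, checking equal variances with Levene's test and then testing the group means, can be sketched in Python with scipy (an illustration with made-up bulb weights, not the course data):

```python
import numpy as np
from scipy import stats

# Bulb weights under three fertilizers (made-up numbers for illustration)
f1 = np.array([0.21, 0.25, 0.22, 0.28, 0.24])
f2 = np.array([0.31, 0.29, 0.33, 0.27, 0.30])
f3 = np.array([0.22, 0.26, 0.24, 0.23, 0.27])

# Levene's test for homogeneity of variances (the HOVTEST idea)
_, p_levene = stats.levene(f1, f2, f3)

# One-way ANOVA; H0: all group means are equal
f_stat, p_anova = stats.f_oneway(f1, f2, f3)
print(p_levene, f_stat, p_anova)
```

Here the Levene p-value is large (proceed as if variances are equal), while the ANOVA p-value is small (at least one fertilizer mean differs).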
ANOVA with Data from a Randomized Block Design

In a controlled experiment, you can design the analysis prospectively and control for other factors, called nuisance factors, that affect the outcome you're measuring. Nuisance factors can affect the outcome of your experiment but are not of interest in the experiment. In a randomized block design, you can use a blocking variable to control for the nuisance factors and reduce or eliminate their contribution to the experimental error.

One way to represent the relationship between the response and predictor variables in ANOVA is with a mathematical ANOVA model; you can also include a blocking variable in the model. Along with the three original ANOVA assumptions of independent observations, normally distributed errors, and equal variances across treatments, you make two more assumptions when you include a blocking factor in the model: you assume that the treatments are randomly assigned within each block, and you assume that the effects of the treatment factor are constant across levels of the blocking factor.

You use PROC GLM to perform ANOVA with a blocking variable. You list the blocking variable in the CLASS statement and in the MODEL statement.
ANOVA Post Hoc Tests

A pairwise comparison examines the difference between two treatment means. If your ANOVA results suggest that you reject the null hypothesis that the means are equal across groups, you can conduct multiple pairwise comparisons in a post hoc analysis to learn which means differ.

The chance that you make a Type I error increases each time you conduct a statistical test. The comparisonwise error rate, or CER, is the probability of a Type I error on a single pairwise test. The experimentwise error rate, or EER, is the probability of making at least one Type I error when performing all of the pairwise comparisons. The EER increases as the number of pairwise comparisons increases.

You can use the Tukey method to control the EER. This test compares all possible pairs of means, so it can only be used when you make pairwise comparisons. Dunnett's method is a specialized multiple comparison test that enables you to compare a single control group to all other groups.

You request all of the multiple comparison methods with options in the LSMEANS statement in PROC GLM. You use the PDIFF=ALL option to request p-values for the differences between all of the means; with this option, SAS produces a diffogram. You use the ADJUST= option to specify the adjustment method for multiple comparisons. When you specify the ADJUST=Dunnett option, SAS produces multiple comparisons using Dunnett's method and a control plot.
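The growth of the EER can be computed directly. Treating the k pairwise tests as independent (an approximation; the Tukey adjustment controls the exact rate), EER = 1 − (1 − CER)^k:

```python
from math import comb

def experimentwise_error_rate(cer: float, n_groups: int) -> float:
    """Approximate EER for all pairwise comparisons among n_groups means,
    treating the tests as independent."""
    k = comb(n_groups, 2)          # number of pairwise comparisons
    return 1 - (1 - cer) ** k

# With 4 groups there are 6 comparisons; at CER = 0.05 the EER is ~0.265
print(round(experimentwise_error_rate(0.05, 4), 3))
```

This is why unadjusted pairwise t-tests become unreliable as the number of groups grows, and why adjusted methods such as Tukey's are used instead.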
Two-Way ANOVA with Interactions

When you have two categorical predictor variables and a continuous response variable, you can analyze your data using two-way ANOVA. With two-way ANOVA, you can examine the effects of the two predictor variables concurrently. You can also determine whether they interact with respect to their effect on the response variable. An interaction means that the effect of one variable depends on the value of another variable. If there is no interaction, you can interpret the tests for the individual factor effects to determine their significance. If an interaction exists between any factors, the tests for the individual factor effects might be misleading, because the interaction can mask those effects.

You can include more than one predictor variable and interactions in the ANOVA model. You can graphically explore the relationship between the response variable and the effect of the interaction between the two predictor variables using PROC SGPLOT. You can use PROC GLM to determine whether the effects of the predictor variables and the interaction between them are statistically significant.
Syntax
PROC TTEST DATA=SAS-data-set <options>;
   CLASS variable;
   VAR variable(s);
RUN;

Selected Options in PROC TTEST
Statement    Option
PROC TTEST   PLOTS(SHOWNULL)=INTERVAL, SIDES=U, SIDES=L

PROC GLM DATA=SAS-data-set <options>;
   CLASS variable(s);
   MODEL dependents=independents </options>;
   MEANS effects </options>;
   LSMEANS effects </options>;
RUN;
QUIT;

Selected Options in PROC GLM
Statement    Option
PROC GLM     PLOTS(ONLY)=DIAGNOSTICS(UNPACK)
MEANS        HOVTEST
LSMEANS      PDIFF=ALL, ADJUST=
Sample Programs
Running PROC TTEST in SAS

proc ttest data=statdata.testscores plots(shownull)=interval;
   class Gender;
   var SATScore;
   title 'Two-Sample t-Test Comparing Girls to Boys';
run;
title;
Performing a One-Sided t-Test

proc ttest data=statdata.testscores plots(shownull)=interval
           h0=0 sides=U;
   class Gender;
   var SATScore;
   title 'One-Sided t-Test Comparing Girls to Boys';
run;
title;
Examining Descriptive Statistics across Groups

proc means data=statdata.mggarlic printalltypes maxdec=3;
   var BulbWt;
   class Fertilizer;
   title 'Descriptive Statistics of Garlic Weight';
run;

proc sgplot data=statdata.mggarlic;
   vbox BulbWt / category=Fertilizer datalabel=BedID;
   format BedID 5.;
   title 'Box Plots of Garlic Weight';
run;
title;
Using the GLM Procedure

proc glm data=statdata.mggarlic plots(only)=diagnostics(unpack);
   class Fertilizer;
   model BulbWt=Fertilizer;
   means Fertilizer / hovtest;
   title 'Testing for Equality of Means with PROC GLM';
run;
quit;
title;
Performing ANOVA with Blocking

proc glm data=statdata.mggarlic_block plots(only)=diagnostics(unpack);
   class Fertilizer Sector;
   model BulbWt=Fertilizer Sector;
   title 'ANOVA for Randomized Block Design';
run;
quit;
title;
Performing a Post Hoc Pairwise Comparison

ods select lsmeans diff meanplot diffplot controlplot;
proc glm data=statdata.mggarlic_block;
   class Fertilizer Sector;
   model BulbWt=Fertilizer Sector;
   lsmeans Fertilizer / pdiff=all adjust=tukey;
   lsmeans Fertilizer / pdiff=controlu('4') adjust=dunnett;
   lsmeans Fertilizer / pdiff=all adjust=t;
   title 'Garlic Data: Multiple Comparisons';
run;
quit;
title;
Examining Your Data with PROC MEANS

proc format;
   value dosef 1="Placebo" 2="100mg" 3="200mg" 4="500mg";
run;

proc means data=statdata.drug mean var std printalltypes;
   class Disease DrugDose;
   var BloodP;
   output out=means mean=BloodP_Mean;
   format DrugDose dosef.;
   title 'Selected Descriptive Statistics for Drug Data Set';
run;
title;
Examining Your Data with PROC SGPLOT

proc sgplot data=means;
   where _TYPE_=3;
   scatter x=DrugDose y=BloodP_Mean / group=Disease markerattrs=(size=10);
   series x=DrugDose y=BloodP_Mean / group=Disease lineattrs=(thickness=2);
   xaxis integer;
   format DrugDose dosef.;
   title 'Plot of Stratified Means in Drug Data Set';
run;
title;
Performing Two-Way ANOVA with Interactions

proc glm data=statdata.drug;
   class DrugDose Disease;
   model Bloodp=DrugDose Disease DrugDose*Disease;
   format DrugDose dosef.;
   title1 'Analyze the Effects of DrugDose and Disease';
   title2 'including Interactions';
run;
quit;
title;

Performing a Post Hoc Pairwise Comparison

proc format;
   value dosef 1="Placebo" 2="100mg" 3="200mg" 4="500mg";
run;

ods select meanplot lsmeans slicedanova;
proc glm data=statdata.drug;
   class DrugDose Disease;
   model Bloodp=DrugDose Disease DrugDose*Disease;
   lsmeans DrugDose*Disease / slice=Disease;
   format DrugDose dosef.;
   title 'Analyze the Effects of DrugDose at Each Level of Disease';
run;
quit;
title;
Summary: Lesson 3: Regression
This summary contains topic summaries, syntax, and sample programs.
Topic Summaries

To analyze continuous variables, you can use linear regression. To investigate your data before performing linear regression, you can use techniques for exploratory data analysis, including scatter plots and correlation analysis. In exploratory data analysis, you're simply trying to explore the relationships between variables and to screen for outliers.

Scatter plots are an important tool for describing the relationship between continuous variables. Plot your data! You can use scatter plots to examine the relationship between two continuous variables, to detect outliers, to identify trends in your data, to identify the range of X and Y values, and to communicate the results of a data analysis.

You can also use correlation analysis to quantify the relationship between two variables. Correlation statistics measure the strength of the linear relationship between two continuous variables; two variables are correlated if there is a linear association between them. A common correlation statistic used for continuous variables is the Pearson correlation coefficient, which ranges from −1 to +1. The population parameter that represents a correlation is ρ. The null hypothesis for a test of a correlation coefficient is that ρ equals 0, and the alternative hypothesis is that ρ is not 0. Rejecting the null hypothesis means only that you can be confident that the true population correlation is not exactly 0. You need to avoid common mistakes when interpreting the correlation between variables.

To produce correlation statistics and scatter plots for your data, you use PROC CORR. To rank-order the absolute values of the correlations from highest to lowest, you add the RANK option to the PROC CORR statement. To produce scatter plots, you add the PLOTS= option in the PROC CORR statement. You can also add context-specific options in parentheses following the main option keyword, such as PLOTS or SCATTER. To examine the correlations between the potential predictor variables, you produce a correlation matrix and scatter plot matrix by using the NOSIMPLE, PLOTS=MATRIX, and HISTOGRAM options. To specify tooltips for hovering over data points and seeing detailed information about the observations, you use the IMAGEMAP=ON option in the ODS GRAPHICS statement and an ID statement in the PROC CORR step.
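The Pearson coefficient and its test can be computed directly; a minimal Python sketch with scipy on made-up, nearly linear data:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0])  # roughly y = 2x

# r measures linear association; p tests H0: rho = 0
r, p_value = stats.pearsonr(x, y)
print(r, p_value)  # r close to +1: strong positive linear association
```

Keep in mind the caution above: a small p-value says only that ρ is not exactly 0, and a large r says nothing about causation or about nonlinear relationships.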
In correlation analysis, you determine the strength of the linear relationships between continuous response variables. In simple linear regression, you use the simple linear regression model to determine the equation for the straight line that defines the linear relationship between the response variable and the predictor variable.

To determine how much better a model that takes the predictor variable into account is than a model that ignores the predictor variable, you can compare the simple linear regression model to a baseline model. For your comparison, you calculate the explained, unexplained, and total variability in the simple linear regression model. The null hypothesis for linear regression is that the regression model does not fit the data better than the baseline model. The alternative hypothesis is that the regression model does fit the data better than the baseline model; in other words, the slope of the regression line is not equal to 0, or the parameter estimate of the predictor variable is not equal to 0.

Before performing simple linear regression, you need to verify the four assumptions for linear regression: the mean of the response variable is linearly related to the value of the predictor variable, the error terms are normally distributed, the error terms have equal variances, and the error terms are independent at each value of the predictor variable.

To fit regression models to your data, you use PROC REG. The MODEL statement specifies the response variable and the predictor variable. To evaluate your model, you typically examine the p-value for the overall model, the R-square value, and the parameter estimates.

To assess the level of precision around the mean estimates of the response variable, you can produce confidence intervals around the means and construct prediction intervals for a single observation. To display confidence and prediction intervals, you can specify the CLM and CLI options in the MODEL statement.

To produce predicted values for small data sets using PROC REG, you create a new data set containing the values of the independent variable for which you want to make predictions, concatenate the new data set with the original data set, and fit a simple linear regression model to the combined data set. To produce predicted values for large data sets, using PROC REG and PROC SCORE is more efficient. You can use the NOPRINT and OUTEST= options in the PROC REG statement to write the parameter estimates from PROC REG to an output data set. Then you score the new observations using PROC SCORE, with the SCORE= option specifying the data set containing the parameter estimates, the OUT= option specifying the data set that PROC SCORE creates, and the TYPE= option specifying what type of data the SCORE= data set contains.
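The fit-then-score workflow can be sketched outside SAS; a minimal Python illustration with scipy and made-up data, where the saved slope and intercept play the role of the OUTEST= estimates:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

fit = stats.linregress(x, y)  # slope, intercept, r-value, p-value, stderr

# Score a new observation with the saved estimates, as PROC SCORE would
x_new = 6.0
y_pred = fit.intercept + fit.slope * x_new
print(fit.slope, fit.intercept, y_pred)
```

The small p-value on the slope corresponds to rejecting the null hypothesis that the model fits no better than the baseline (intercept-only) model.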
In multiple regression, you can model the relationship between the response variable and more than one predictor variable. In a model with two predictor variables, you can model the relationship of the three variables (three dimensions) with a two-dimensional plane. Multiple linear regression has advantages and disadvantages. Its biggest advantage is that it's more powerful than simple linear regression: you can determine whether a relationship exists between the response variable and several predictor variables at the same time. The disadvantages of multiple linear regression are that you have to decide which model to use, and that when you have more predictors, interpreting the model becomes more complicated. You can use multiple regression in two ways: for analytical or explanatory analysis and for prediction. If you specify many terms, the model for multiple regression can become very complex.

The hypotheses for multiple regression are similar to those for simple linear regression. The null hypothesis is that the multiple regression model does not fit the data better than the baseline model (all the slopes or parameter estimates are equal to 0). The alternative hypothesis is that the regression model does fit the data better than the baseline model. For multiple linear regression, the same four assumptions as for simple linear regression apply: that the mean of the response variable is linearly related to the values of the predictor variables, that the error terms are normally distributed, that the error terms have equal variances, and that the error terms are independent at each value of the predictor variables.

To compare multiple linear regression models, you typically examine the p-value for the overall models, the adjusted R-square values, and the parameter estimates. The adjusted R-square value takes into account the number of terms in the model and increases only if new terms significantly improve the model.
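The adjusted R-square follows a standard formula, which a minimal Python illustration (with made-up R-square values) can make concrete: adding a weak predictor raises ordinary R-square slightly but can lower the adjusted value.

```python
# Standard adjusted R-square formula, where n is the number of
# observations and k the number of predictor terms (excluding the
# intercept):
#   R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)

def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Made-up values: a weak third predictor nudges R2 from 0.800 to 0.801,
# but the penalty for the extra term lowers the adjusted value.
print(round(adjusted_r2(0.800, 31, 2), 4))
print(round(adjusted_r2(0.801, 31, 3), 4))
```

This is why the adjusted R-square, unlike plain R-square, is useful for comparing models with different numbers of terms.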
Your first decision in model selection is whether to use a manual or automated approach. Automated model selection techniques in SAS fall into two general categories: the all-possible regressions method and stepwise selection methods. For a large number of potential predictor variables, the stepwise regression methods might be a better option. The all-possible regressions method produces more candidate models, which requires you to use your expertise to select a model.

In the all-possible regressions method, SAS calculates all possible regression models. To describe your model, you can add an optional label to the MODEL statement. You can reduce the number of models in the output by specifying the BEST= option in the MODEL statement. To help evaluate the models you produce, you can request Mallows' Cp statistic in the PLOTS= option in the PROC REG statement and in the SELECTION= option in the MODEL statement. To request statistics for each model, you can specify them in the SELECTION= option. To select the best model for prediction, you should use Mallows' criterion for Cp. To select the best model for parameter estimation, you should use Hocking's criterion for Cp.

Stepwise selection methods are another, less computer-intensive way to find good candidate models without having to generate all possible models. You can specify the forward, backward, and stepwise methods in the SELECTION= option in the MODEL statement. Each method selects variables based on their p-values. To change the default p-values that PROC REG uses to select variables, you can use the SLENTRY= and SLSTAY= options in the MODEL statement. It's a good idea to always run all three stepwise selection methods and look for commonalities among the final models for all three methods.
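The Cp statistic used above can be sketched numerically. Assuming you already have the candidate model's error sum of squares and the full model's mean squared error, the standard formula is Cp = SSE_p/MSE_full − (n − 2p); Mallows' criterion for prediction looks for Cp ≤ p, and Hocking's criterion for parameter estimation is often stated as Cp ≤ 2p − p_full + 1. The numbers below are made up for illustration.

```python
# Mallows' Cp: p counts the parameters in the candidate model
# (including the intercept), sse_p is that model's error sum of
# squares, mse_full is the full model's mean squared error.

def mallows_cp(sse_p, mse_full, n, p):
    return sse_p / mse_full - (n - 2 * p)

# Hypothetical values for a 5-parameter candidate model:
cp = mallows_cp(sse_p=128.8, mse_full=5.0, n=31, p=5)
print(round(cp, 2))
print(cp <= 5)  # does it satisfy Mallows' criterion Cp <= p?
```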
Syntax

To go to the movie where you learned a statement or option, select a link.

PROC CORR DATA=SAS-data-set;
   VAR variable(s);
   WITH variable(s);
RUN;

Selected Options in PROC CORR
   PROC CORR statement: RANK, PLOTS=, NOSIMPLE

Selected ODS Option
   ODS GRAPHICS statement: IMAGEMAP=ON

PROC REG DATA=SAS-data-set;
   MODEL dependent=regressor(s);
   ID variable(s);
RUN;

Selected Options in PROC REG
   PROC REG statement: NOPRINT, OUTEST=
   MODEL statement: CLM, CLI, P

PROC SCORE DATA=SAS-data-set SCORE=SAS-data-set
           OUT=SAS-data-set TYPE=name;
   VAR variable(s);
RUN;
Sample Programs

Producing Correlation Statistics and Scatter Plots

proc corr data=statdata.fitness rank
          plots(only)=scatter(nvar=all ellipse=none);
   var RunTime Age Weight Run_Pulse Rest_Pulse
       Maximum_Pulse Performance;
   with Oxygen_Consumption;
   title "Correlations and Scatter Plots with Oxygen_Consumption";
run;
title;

Examining Correlations between Predictor Variables

ods graphics on / imagemap=on;
proc corr data=statdata.fitness nosimple
          plots=matrix(nvar=all histogram);
   var RunTime Age Weight Run_Pulse Rest_Pulse
       Maximum_Pulse Performance;
   id name;
   title "Correlations with Oxygen_Consumption";
run;
title;

Performing Simple Linear Regression

proc reg data=statdata.fitness;
   model Oxygen_Consumption=RunTime;
   title 'Predicting Oxygen_Consumption from RunTime';
run;
quit;
title;

Viewing and Printing Confidence Intervals and Prediction Intervals

proc reg data=statdata.fitness;
   model Oxygen_Consumption=RunTime / clm cli;
   id name runtime;
   title 'Predicting Oxygen_Consumption from RunTime';
run;
quit;
title;

Producing Predicted Values of the Response Variable

data need_predictions;
   input RunTime @@;
   datalines;
9 10 11 12 13
;
run;

data predoxy;
   set need_predictions statdata.fitness;
run;

proc reg data=predoxy;
   model Oxygen_Consumption=RunTime / p;
   id RunTime;
   title 'Oxygen_Consumption=RunTime with Predicted Values';
run;
quit;
title;

Storing Parameter Estimates and Scoring

proc reg data=statdata.fitness noprint outest=estimates;
   model Oxygen_Consumption=RunTime;
run;
quit;

proc print data=estimates;
   title "OUTEST= Data Set from PROC REG";
run;
title;

proc score data=need_predictions score=estimates
           out=scored type=parms;
   var RunTime;
run;

proc print data=Scored;
   title "Scored New Observations";
run;
title;

Performing Multiple Linear Regression

proc reg data=statdata.fitness;
   model Oxygen_Consumption=Performance RunTime;
   title 'Multiple Linear Regression for Fitness Data';
run;
quit;
title;

Using Automatic Model Selection

ods graphics / imagemap=on;
proc reg data=statdata.fitness plots(only)=(cp);
   ALL_REG: model Oxygen_Consumption=
                  Performance RunTime Age Weight Run_Pulse
                  Rest_Pulse Maximum_Pulse
                  / selection=cp rsquare adjrsq best=20;
   title 'Best Models Using All-Regression Option';
run;
quit;
title;

Estimating and Testing Coefficients for Selected Models

proc reg data=statdata.fitness;
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_Pulse;
   EXPLAIN: model Oxygen_Consumption=
                  RunTime Age Weight Run_Pulse Maximum_Pulse;
   title 'Check "Best" Two Candidate Models';
run;
quit;
title;

Performing Stepwise Regression

proc reg data=statdata.fitness plots(only)=adjrsq;
   FORWARD:  model Oxygen_Consumption=
                   Performance RunTime Age Weight Run_Pulse
                   Rest_Pulse Maximum_Pulse / selection=forward;
   BACKWARD: model Oxygen_Consumption=
                   Performance RunTime Age Weight Run_Pulse
                   Rest_Pulse Maximum_Pulse / selection=backward;
   STEPWISE: model Oxygen_Consumption=
                   Performance RunTime Age Weight Run_Pulse
                   Rest_Pulse Maximum_Pulse / selection=stepwise;
   title 'Best Models Using Stepwise Selection';
run;
quit;
title;
Statistics I: Introduction to ANOVA, Regression, and Logistic
RegressionCopyright 2014 SAS Institute Inc., Cary, NC, USA. All
rights reserved.
Summary: Lesson 4: Regression Diagnostics
This summary contains topic summaries, syntax, and sample programs.
Topic Summaries

To go to the movie where you learned a task or concept, select a link.

Verifying the first assumption of linear regression, that the linear model fits the data adequately, is critical. You should always plot your data before producing a model. The remaining three assumptions of linear regression relate to error terms, so you check these assumptions in terms of errors, not in terms of the values of the response variable. To verify these assumptions, you can use several different residual plots to check your regression assumptions. You can plot the residuals versus the predicted values, plot the residuals versus the values of the independent variables, and produce a histogram or a normal probability plot of the residuals. To verify that model assumptions are valid, you can analyze the shape of the residual values to ensure that they display a random scatter above and below the reference line at 0. If you see patterns or trends in the residual values, the assumptions might not be valid and the models might have problems. You can also use residual plots to detect outliers. To create residual plots and other diagnostic plots, you use PROC REG, which creates a number of default plots. Specifying an identifier variable in the ID statement shows you that information when you hover your cursor over the data points in the graph. You can also request specific plots with the PLOTS= option in the PROC REG statement.
You should also identify any influential observations that strongly affect the linear model's fit to the data. To identify outliers and influential observations in your data, you can use several diagnostic statistics in PROC REG. To detect outliers, you can use STUDENT residuals. To detect influential observations, you can use Cook's D statistics, RSTUDENT residuals, and DFFITS statistics. Cook's D statistic is most useful for explanatory or analytic models, and DFFITS is most useful for predictive models. If you detect an influential observation, you can identify which parameter the observation is influencing most by using DFBETAS.

To detect influential observations in your model using PROC REG, you can produce diagnostic statistics as well as diagnostic plots. To control which plots are produced, you can use the PLOTS= option in the PROC REG statement. To request the diagnostic statistics used in creating the plots without producing the plots themselves, you can use the R and INFLUENCE options in the MODEL statement. When you use these options, PROC REG creates an ODS output object called OutputStatistics, which contains the residuals and influence statistics from the R and INFLUENCE model options. To add variables in the model to the OutputStatistics data object, you specify them in the ID statement. To save the statistics in an output data set, you use the ODS OUTPUT statement.

For very large data sets, viewing or printing all residuals and influence statistics quickly becomes unwieldy. To reduce the amount of output, you can use the cutoff values for each of the diagnostic criteria to detect influential observations. To do so, you can use macro variables and the DATA step to create a program that you can reuse.

You can handle influential observations in several ways. You can recheck for data entry errors, determine whether you have an adequate model, and determine whether the observation is valid but unusual. In your analysis, you should report the results of your model with and without the influential observation.
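The cutoff rules described here can be sketched outside SAS as well. This small Python illustration applies the same suggested cutoffs that the lesson's reusable DATA step program uses (|RSTUDENT| > 3, |DFFITS| > 2*sqrt(numparms/numobs), Cook's D > 4/numobs); the diagnostic values for the observation are hypothetical.

```python
import math

# Flag one observation against the suggested cutoffs, producing a
# three-character summary like the DATA step's Summary_i variable:
# "000" means no cutoff exceeded.
def flag(obs, numparms, numobs):
    cut_dffits = 2 * math.sqrt(numparms / numobs)  # DFFITS cutoff
    cut_cooksd = 4 / numobs                        # Cook's D cutoff
    return "".join([
        "1" if abs(obs["rstudent"]) > 3 else "0",
        "1" if abs(obs["dffits"]) > cut_dffits else "0",
        "1" if obs["cooksd"] > cut_cooksd else "0",
    ])

# Hypothetical diagnostics for one observation, with 4 predictors + 1
# intercept (numparms=5) and 31 observations:
print(flag({"rstudent": 1.2, "dffits": 0.95, "cooksd": 0.18},
           numparms=5, numobs=31))
```

An observation whose summary is anything other than "000" exceeds at least one cutoff and deserves a closer look.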
Collinearity, also called multicollinearity, occurs in multiple regression when two or more predictor variables are highly correlated with each other. Collinearity doesn't violate the assumptions of multiple regression, but it leads to instability in the regression model. To detect collinearity, you can check your PROC REG output. To measure the magnitude of collinearity in a model, you can use the VIF option in the MODEL statement. If you detect collinearity, you can determine how to proceed and which model to select.

To review, effective modeling includes performing preliminary analyses, selecting candidate models, validating assumptions, detecting influential observations and collinearity, revising your model, and performing prediction testing.
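The variance inflation factor that the VIF option reports has a simple definition, illustrated here in Python with made-up R-square values: VIF_j = 1/(1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors.

```python
# Variance inflation factor for one predictor. A common rule of thumb
# flags VIF > 10 as a sign of serious collinearity.
def vif(r2_j):
    return 1 / (1 - r2_j)

# Made-up examples: a predictor only mildly related to the others,
# and one almost fully explained by the others.
print(round(vif(0.50), 2))
print(round(vif(0.95), 2))
```

The second predictor's variance estimate is inflated twentyfold, which is the instability that collinearity causes in the parameter estimates.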
Syntax

To go to the movie where you learned a statement or option, select a link.

LIBNAME libref 'SAS-library';

ODS OUTPUT output-object-specification=data-set;

PROC REG DATA=SAS-data-set;
   MODEL dependent=regressor(s);
   ID variable(s);
RUN;

Selected Options in PROC REG
   PROC REG statement: PLOTS=
   MODEL statement: R, INFLUENCE, VIF

%LET variable=value;

DATA SAS-data-set;
   SET SAS-data-set;
   variable=value;
   IF expression;
RUN;

PROC PRINT DATA=SAS-data-set;
   VAR variable(s);
RUN;
Sample Programs

Producing Default Diagnostic Plots

ods graphics / imagemap=on;

proc reg data=statdata.fitness;
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_Pulse;
   id Name;
   title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;
title;

Requesting Specific Diagnostic Plots

ods graphics / imagemap=on;

proc reg data=statdata.fitness
         plots(only)=(QQ RESIDUALBYPREDICTED RESIDUALS);
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_Pulse;
   id Name;
   title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;
title;

Using Diagnostic Plots to Identify Influential Observations

ods graphics / imagemap=on;

proc reg data=statdata.fitness
         plots(only)=(RSTUDENTBYPREDICTED(LABEL) COOKSD(LABEL)
                      DFFITS(LABEL) DFBETAS(LABEL));
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_Pulse;
   id Name;
   title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;
title;

Writing Diagnostic Statistics to an Output Data Set

ods output outputstatistics=Check4Outliers;

proc reg data=statdata.fitness;
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_pulse / r influence;
   id Name Oxygen_Consumption RunTime Age Run_Pulse Maximum_pulse;
   title 'PREDICT Model - Plots of Diagnostic Statistics';
run;
quit;
title;

Detecting Influential Observations Programmatically

%let dsname=check4outliers;  /* data set name */
%let numparms=5;             /* # of predictor variables + 1 */
%let numobs=31;              /* # of observations */
%let idvars=Name Oxygen_Consumption RunTime DFB_RunTime Age DFB_Age
            Run_Pulse DFB_Run_Pulse Maximum_pulse DFB_Maximum_Pulse;
                             /* relevant variable(s) */

data influential;
   set &dsname;
   CutDFFits=2*(sqrt(&numparms/&numobs));
   CutCooksD=4/&numobs;
   RStud_i=(abs(RStudent)>3);
   DFits_i=(abs(DFFits)>CutDFFits);
   CookD_i=(CooksD>CutCooksD);
   Summary_i=compress(RStud_i||DFits_i||CookD_i);
   if Summary_i ne '000';
run;

proc print data=influential;
   var Summary_i &IDVars PredictedValue RStudent DFFits
       CutDFFits CooksD CutCooksD;
   title 'Observations Exceeding Suggested Cutoffs';
run;
title;

Detecting Collinearity

proc reg data=statdata.fitness;
   PREDICT: model Oxygen_Consumption=
                  RunTime Age Run_Pulse Maximum_Pulse;
   FULL: model Oxygen_Consumption=
               Performance RunTime Age Weight Run_Pulse
               Rest_Pulse Maximum_Pulse;
   title 'Collinearity: Full Model';
run;
quit;
title;

Calculating Collinearity Diagnostics

proc reg data=statdata.fitness;
   FULL: model Oxygen_Consumption=
               Performance RunTime Age Weight Run_Pulse
               Rest_Pulse Maximum_Pulse / vif;
   title 'Collinearity: Full Model with VIF';
run;
quit;
title;

Dealing with Collinearity

proc reg data=statdata.fitness;
   NOPERF: model Oxygen_Consumption=
                 RunTime Age Weight Run_Pulse
                 Rest_Pulse Maximum_Pulse / vif;
   title 'Dealing with Collinearity';
run;
quit;
title;
Summary: Lesson 5: Categorical Data Analysis
This summary contains topic summaries, syntax, and sample programs.
Topic Summaries

To go to the movie where you learned a task or concept, select a link.

A one-way frequency table displays frequency statistics for a categorical variable. An association exists between two variables if the distribution of one variable changes when the value of the other variable changes. If there's no association, the distribution of the first variable is the same regardless of the level of the other variable. To look for a possible association between two or more categorical variables, you can create a crosstabulation table. A crosstabulation table shows frequency statistics for each combination of values (or levels) of two or more variables. To create frequency and crosstabulation tables in SAS, and request associated statistics and plots, you use the TABLES statement in the FREQ procedure. You can use the PLOTS= option in the TABLES statement to request specific plots for frequency and crosstabulation tables.

When ordinal values are ordered logically, you can use more powerful statistical tests that can detect linear (ordinal) associations instead of only general associations. To logically order the values of a variable for calculations and output, you can create a new variable or you can apply a temporary format to an existing variable. The ORDER=FORMATTED option in the PROC FREQ statement tells PROC FREQ to perform calculations and display output by using the formatted values instead of the stored values.
To perform a formal test of association between two categorical variables, you use the chi-square test. The Pearson chi-square test is the most commonly used of several chi-square tests. The chi-square statistic indicates the difference between observed frequencies and expected frequencies. Neither the chi-square statistic nor its p-value indicates the magnitude of an association. Cramer's V statistic is one measure of the strength of an association between two categorical variables. Cramer's V statistic is derived from the Pearson chi-square statistic.

To measure the strength of the association between a binary predictor variable and a binary outcome variable, you can use an odds ratio. An odds ratio indicates how much more likely it is, with respect to odds, that a certain event, or outcome, occurs in one group relative to its occurrence in another group. To perform a Pearson chi-square test of association and generate related measures of association, you specify the CHISQ option and other options in the TABLES statement in PROC FREQ.

For ordinal associations, the Mantel-Haenszel chi-square test is a more powerful test than the Pearson chi-square test. The Mantel-Haenszel chi-square statistic and its p-value indicate whether an association exists but not the magnitude of the association. To measure the strength of the linear association between two ordinal variables, you can use the Spearman correlation statistic. The Spearman correlation is considered to be a rank correlation because it provides a degree of linearity between the ordinal variables. To perform a Mantel-Haenszel chi-square test of association and generate related measures of association, you specify the CHISQ option and other options in the TABLES statement in PROC FREQ.
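The odds ratio described above is easy to compute by hand from a 2x2 crosstabulation: each group's odds are its event count divided by its non-event count, and the odds ratio is the ratio of the two odds. The counts below are hypothetical.

```python
# Hypothetical 2x2 table of group by binary outcome:
#                 outcome=yes   outcome=no
#   group A            40           60
#   group B            20           80

def odds_ratio(a_yes, a_no, b_yes, b_no):
    odds_a = a_yes / a_no   # odds of the outcome in group A
    odds_b = b_yes / b_no   # odds of the outcome in group B
    return odds_a / odds_b

print(round(odds_ratio(40, 60, 20, 80), 2))
```

Here the odds of the outcome in group A are about 2.67 times the odds in group B; an odds ratio of 1 would mean no association between group and outcome.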
Logistic regression is a type of statistical model that you can use to predict a categorical response, or outcome, on the basis of one or more continuous or categorical predictor variables. You select one of three types of logistic regression (binary, nominal, or ordinal) based on your response variable. Although linear and logistic regression models have the same structure, you can't use linear regression with a binary response variable. Binary logistic regression uses a predictor variable to estimate the probability of a specific outcome. To directly model the relationship between a continuous predictor and the probability of an event or outcome, you must use a nonlinear function: the inverse logit function.

To model categorical data, you use the LOGISTIC procedure. The two required statements are the PROC LOGISTIC statement and the MODEL statement. Depending on the complexity of your analysis, you can use additional statements in PROC LOGISTIC. If your model has one or more categorical predictor variables, you must specify them in the CLASS statement. The MODEL statement specifies the response variable and the predictor variables, and can specify options as well. In the MODEL statement, the EVENT= option specifies the event category for a binary response model. To specify the type of confidence intervals you want to use, you add the CLODDS= option to the MODEL statement. PROC LOGISTIC computes Wald confidence intervals by default. You can use the PLOTS= option in the PROC LOGISTIC statement to request specific plots.

Instead of working directly with the categorical predictor variables in the CLASS statement, PROC LOGISTIC first parameterizes each predictor variable. The CLASS statement creates a set of one or more design variables that represent the information in each specified classification variable. PROC LOGISTIC uses the design variables, and not the original variables, in model calculations. Two common parameterization methods are effect coding (the method that PROC LOGISTIC uses by default) and reference cell coding. To specify a parameterization method other than the default, you use the PARAM= option in the CLASS statement. If you want to specify a reference level other than the default for a classification variable, you use the REF= variable option in the CLASS statement.

Akaike's information criterion (AIC) and the Schwarz criterion (SC) are goodness-of-fit measures that you can use to compare models. -2 Log L is a goodness-of-fit measure that is not commonly used to compare models. Comparing pairs is another goodness-of-fit measure that you can use to compare models. PROC LOGISTIC uses a 0.05 significance level and a 95% confidence interval by default. If you want to specify a different significance level for the confidence interval, you can use the ALPHA= option in the MODEL statement. For a continuous predictor variable, the odds ratio measures the increase or decrease in odds associated with a one-unit difference of the predictor variable by default.

A multiple logistic regression model characterizes the relationship between a categorical response variable and multiple predictor variables. One method of selecting a subset of predictor variables for a multiple logistic regression model is the backward elimination method. To specify the variable selection method in PROC LOGISTIC, you add the SELECTION= option to the MODEL statement. By default, for the backward elimination method, PROC LOGISTIC uses a 0.05 significance level to determine which variables remain in the model. If you want to change the significance level, you can use the SLSTAY= (or SLS=) option in the MODEL statement. Multiple logistic regression uses adjusted odds ratios, which measure the effect of a single predictor variable on a response variable while holding all the other predictor variables constant. In PROC LOGISTIC, the UNITS statement enables you to obtain customized odds ratio estimates for a specified unit of change in one or more continuous predictor variables. In the CLASS statement, when you use the REF= option with a variable that has either a temporary or a permanent format assigned to it, you must specify the formatted value of the level instead of the stored value.

When you fit a multiple logistic regression model, the simplest approach is to consider only the main effects (the effect of each predictor individually) on the response. If you suspect that there are interactions between predictor variables, you can fit a more complex logistic regression model that includes interactions. When you use the backward elimination method with interactions in the model, PROC LOGISTIC must preserve the model hierarchy when eliminating main effects. You specify interactions in the MODEL statement. By default, PROC LOGISTIC produces the odds ratio only for variables that are not involved in an interaction. To tell PROC LOGISTIC to produce the odds ratios for each value of a variable that is involved in an interaction, you can use the ODDSRATIO statement. To specify whether PROC LOGISTIC computes the odds ratios for a categorical variable against the reference level or against all of its levels, you can use the DIFF= option. The AT option specifies fixed levels of one or more interacting variables (also called covariates). PROC LOGISTIC computes odds ratios at each of the specified levels. To visualize the interaction between two categorical variables, you can produce an interaction plot.
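The inverse logit function and the coefficient-to-odds-ratio conversion mentioned above can be sketched in a few lines of Python. The coefficient value for Age is hypothetical, chosen only to illustrate the arithmetic.

```python
import math

# The inverse logit link used in binary logistic regression:
# probability of the event for a given linear predictor xb.
def inverse_logit(xb):
    return 1 / (1 + math.exp(-xb))

beta_age = 0.05                      # hypothetical slope for Age
prob_at_zero = inverse_logit(0.0)    # linear predictor 0 -> probability 0.5
or_age_10 = math.exp(beta_age * 10)  # odds ratio for a 10-unit change in
                                     # Age, the kind of customized estimate
                                     # that UNITS Age=10 requests
print(prob_at_zero, round(or_age_10, 3))
```

exp(beta) is the odds ratio for a one-unit change in the predictor, so exp(beta * units) gives the customized odds ratio for any other unit of change.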
Syntax

To go to the movie where you learned a statement or option, select a link.

PROC FREQ DATA=SAS-data-set;
   TABLES table-request(s);
   additional statements;
RUN;

Selected Options in PROC FREQ
   PROC FREQ statement: ORDER=
   TABLES statement: CELLCHI2, CHISQ (Pearson and Mantel-Haenszel),
                     CL, EXPECTED, MEASURES, NOCOL, NOPERCENT,
                     PLOTS=, RELRISK

PROC LOGISTIC DATA=SAS-data-set;
   CLASS variable(s);
   MODEL response=predictors;
   UNITS independent1=list ...;
   ODDSRATIO variable;
RUN;

Selected Options in PROC LOGISTIC
   PROC LOGISTIC statement: PLOTS=
   CLASS statement: PARAM=, REF= (general usage and usage with a
                    formatted variable)
   MODEL statement: ALPHA=, CLODDS=, EVENT=, SELECTION=, SLSTAY= | SLS=
   ODDSRATIO statement: AT, CL=, DIFF=
Sample Programs

Examining the Distribution of Variables

proc freq data=statdata.sales;
   tables Purchase Gender Income Gender*Purchase
          Income*Purchase / plots=(freqplot);
   format Purchase purfmt.;
   title1 'Frequency Tables for Sales Data';
run;

ods select histogram probplot;
proc univariate data=statdata.sales;
   var Age;
   histogram Age / normal (mu=est sigma=est);
   probplot Age / normal (mu=est sigma=est);
   title1 'Distribution of Age';
run;
title;

Ordering the Values of a Variable by Creating a New Variable

data statdata.sales_inc;
   set statdata.sales;
   if Income='Low' then IncLevel=1;
   else if Income='Medium' then IncLevel=2;
   else if Income='High' then IncLevel=3;
run;

proc freq data=statdata.sales_inc;
   tables IncLevel*Purchase / plots=freq;
   format IncLevel incfmt. Purchase purfmt.;
   title1 'Create variable IncLevel to correct Income';
run;
title;

Performing a Pearson Chi-Square Test of Association

proc freq data=statdata.sales_inc;
   tables Gender*Purchase / chisq expected cellchi2
          nocol nopercent relrisk;
   format Purchase purfmt.;
   title1 'Association between Gender and Purchase';
run;
title;

Performing a Mantel-Haenszel Chi-Square Test

proc freq data=statdata.sales_inc;
   tables IncLevel*Purchase / chisq measures cl;
   format IncLevel incfmt. Purchase purfmt.;
   title1 'Ordinal Association between IncLevel and Purchase?';
run;
title;

Fitting a Binary Logistic Regression Model

proc logistic data=statdata.sales_inc plots(only)=(effect);
   class Gender (param=ref ref='Male');
   model Purchase(event='1')=Gender;
   title1 'LOGISTIC MODEL (1): Purchase=Gender';
run;
title;

Fitting a Multiple Logistic Regression Model

proc logistic data=statdata.sales_inc
              plots(only)=(effect oddsratio);
   class Gender (param=ref ref='Male')
         IncLevel (param=ref ref='1');
   units Age=10;
   model Purchase(event='1')=Gender Age IncLevel
         / selection=backward clodds=pl;
   title1 'LOGISTIC MODEL (2): Purchase=Gender Age IncLevel';
run;
title;

Fitting a Multiple Logistic Regression Model with Interactions

proc logistic data=statdata.sales_inc
              plots(only)=(effect oddsratio);
   class Gender (param=ref ref='Male')
         IncLevel (param=ref ref='1');
   units Age=10;
   model Purchase(event='1')=Gender | Age | IncLevel @2
         / selection=backward clodds=pl;
   title1 'LOGISTIC MODEL (3): Main Effects and 2-Way Interactions';
   title2 '/ sel=backward';
run;
title;

Fitting a Multiple Logistic Regression Model with All Odds Ratios

ods select OddsRatiosPL ORPlot;
proc logistic data=statdata.sales_inc plots(only)=(oddsratio);
   class Gender (param=ref ref='Male')
         IncLevel (param=ref ref='1');
   units Age=10;
   model Purchase(event='1')=Gender | IncLevel Age;
   oddsratio Age / cl=pl;
   oddsratio Gender / diff=ref at (IncLevel=all) cl=pl;
   oddsratio IncLevel / diff=ref at (Gender=all) cl=pl;
   title1 'LOGISTIC MODEL (3a): Significant Terms and All Odds Ratios';
   title2 '/ sel=backward';
run;
title;