-
Strategy for Complete Regression Analysis
Additional issues in regression analysis
Assumption of independence of errors
Influential cases
Multicollinearity
Adjusted R²
Strategy for solving problems
Sample problems
Complete regression analysis
Social Work Statistics
-
Assumption of independence of errors - 1
Multiple regression
assumes that the errors are independent and there is no serial
correlation. Errors are the residuals or differences between the
actual score for a case and the score estimated using the
regression equation. No serial correlation implies that the size of
the residual for one case has no impact on the size of the residual
for the next case.
The Durbin-Watson statistic is used to test for the presence of
serial correlation among the residuals. The value of the
Durbin-Watson statistic ranges from 0 to 4. As a general rule of
thumb, the residuals are not correlated if the Durbin-Watson
statistic is approximately 2, and an acceptable range is 1.50 -
2.50.
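As a rough illustration of the statistic described above, the sketch below computes Durbin-Watson as the sum of squared differences between successive residuals divided by the sum of squared residuals, then applies the 1.50 - 2.50 rule of thumb. The function names and the example residuals are made up for illustration; SPSS reports the statistic directly.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in case order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Numerator: squared differences between successive residuals.
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    # Denominator: sum of squared residuals.
    den = sum(e ** 2 for e in residuals)
    return num / den


def errors_look_independent(dw, low=1.50, high=2.50):
    """Rule-of-thumb acceptable range quoted in the text."""
    return low <= dw <= high


# Hypothetical residuals, for illustration only.
example = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2]
dw = durbin_watson(example)
print(round(dw, 3), errors_look_independent(dw))
```

Values near 0 suggest positive serial correlation and values near 4 suggest negative serial correlation, which is why the acceptable range is centered on 2.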
-
Assumption of independence of errors - 2
Serial correlation is more of a concern in analyses that involve
time series.
If it does occur in relationship analyses, its presence can
usually be understood by changing the sequence of cases and running
the analysis again.
If the problem with serial correlation disappears, it may be
ignored.
If the problem with serial correlation remains, it can usually
be handled by using the difference between successive data points
as the dependent variable.
-
Multicollinearity - 1
Multicollinearity is a problem in
regression analysis that occurs when two independent variables are
highly correlated, e.g. r = 0.90, or higher.
The relationship between the independent variables and the
dependent variables is distorted by the very strong relationship
between the independent variables, leading to the likelihood that
our interpretation of relationships will be incorrect.
In the worst case, if the variables are perfectly correlated,
the regression cannot be computed.
SPSS guards against the failure to compute a regression solution
by arbitrarily omitting the collinear variable from the
analysis.
-
Multicollinearity - 2
Multicollinearity is detected by examining
the tolerance for each independent variable. Tolerance is the
amount of variability in one independent variable that is not
explained by the other independent variables.
Tolerance values less than 0.10 indicate collinearity.
If we discover collinearity in the regression output, we should
reject the interpretation of the relationships as false until the
issue is resolved.
Multicollinearity can be resolved by combining the highly
correlated variables through principal component analysis, or
omitting a variable from the analysis.
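To make the idea of tolerance concrete, here is a minimal sketch for the special case of exactly two independent variables, where the variability in one that is explained by the other is simply the squared correlation, so tolerance = 1 - r². With more than two independent variables, each one would have to be regressed on all the others, which this sketch does not attempt. The function names and data are hypothetical; SPSS reports tolerance itself when Collinearity diagnostics is requested.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def tolerance_two_predictors(x1, x2):
    """Tolerance when the model has only two independent variables."""
    r = pearson_r(x1, x2)
    return 1.0 - r * r


def is_collinear(tolerance, cutoff=0.10):
    """Cutoff of 0.10 taken from the text."""
    return tolerance < cutoff


# Hypothetical data: the second variable is nearly a copy of the first.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]
tol = tolerance_two_predictors(x1, x2)
print(round(tol, 4), is_collinear(tol))
```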
-
Adjusted R²
The coefficient of determination, R², which measures the
strength of a relationship, can usually be increased simply by
adding more variables. In fact, if the number of variables equals
the number of cases, it is often possible to produce a perfect R² of
1.00.
Adjusted R² is a measure which attempts to reduce the inflation
in R² by taking into account the number of independent variables and
the number of cases.
If there is a large discrepancy between R² and Adjusted R²,
extraneous variables should be removed from the analysis and R²
recomputed.
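The adjustment can be sketched with the usual formula, adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of cases and k the number of independent variables. The slides do not print this formula, so treat it as an assumption about what the software computes. The example numbers (R² = 0.187, n = 254, k = 3) are borrowed from the worked problem later in these notes.

```python
def adjusted_r2(r2, n_cases, n_predictors):
    """Adjusted R-squared using the common (n - 1) / (n - k - 1) correction."""
    if n_cases - n_predictors - 1 <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1 - (1 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)


# Baseline model from the worked problem: R² = 0.187, 254 cases, 3 predictors.
print(round(adjusted_r2(0.187, 254, 3), 3))

# With few cases per variable the penalty becomes much larger.
print(round(adjusted_r2(0.187, 10, 3), 3))
```

With many cases per variable the adjustment is small; it grows as the ratio of cases to variables shrinks.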
-
Influential cases
The degree to which outliers affect the
regression solution depends upon where the outlier is located
relative to the other cases in the analysis. Outliers whose
location has a large effect on the regression solution are called
influential cases.
Whether or not a case is influential is measured by Cook's
distance.
Cook's distance is an index measure; it is compared to a critical
value based on the formula: 4 / (n - k - 1), where n is the number
of cases and k is the number of independent variables.
If a case has a Cook's distance greater than the critical value,
it should be examined for exclusion.
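The critical value 4 / (n - k - 1) is simple arithmetic; a small sketch follows. The function names are illustrative, and n = 254 with k = 3 are the figures used in the worked problem later in these notes.

```python
def cooks_critical_value(n_cases, n_predictors):
    """Critical value for Cook's distance: 4 / (n - k - 1)."""
    return 4 / (n_cases - n_predictors - 1)


def is_influential(cooks_distance, critical_value):
    """Flag a case whose Cook's distance exceeds the critical value."""
    return cooks_distance > critical_value


crit = cooks_critical_value(254, 3)   # 4 / 250
print(f"{crit:.4f}")
print(is_influential(0.0320, crit), is_influential(0.0040, crit))
```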
-
Overall strategy for solving problems
Run a baseline regression using the method for including variables
implied by the problem statement to find the initial strength of the
relationship, baseline R².
Test for useful transformations to improve normality, linearity, and
homoscedasticity.
Substitute transformed variables and check for outliers and
influential cases.
If R² from the regression model using transformed variables and
omitting outliers is at least 2% better than baseline R², select it
for interpretation; otherwise select the baseline model.
Validate and interpret the selected regression model.
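The model-selection step in this strategy reduces to one comparison. The sketch below reads "at least 2% better" as a gain of at least 0.02 in the proportion of variance explained, which is an interpretation of the rule rather than something the slides spell out. The function name and labels are hypothetical.

```python
def select_model(baseline_r2, revised_r2, min_gain=0.02):
    """Pick the revised model only if it explains at least min_gain more variance."""
    if revised_r2 - baseline_r2 >= min_gain:
        return "revised"    # transformed variables, outliers omitted
    return "baseline"       # original variables, all cases


# Figures from the worked problem: 18.7% baseline versus 38.4% revised.
print(select_model(0.187, 0.384))
```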
-
Problem 1
In the dataset GSS2000.sav, is the following statement
true, false, or an incorrect application of a statistic? Assume
that there is no problem with missing data. Use a level of
significance of 0.05 for the regression analysis. Use a level of
significance of 0.01 for evaluating assumptions. Use 0.0160 as the
criterion for identifying influential cases. Validate the results of
your regression analysis by splitting the sample in two, using
788035 as the random number seed.
The variables "age" [age], "sex" [sex], and "respondent's
socioeconomic index" [sei] have a strong relationship to the
variable "how many in family earned money" [earnrs].
Survey respondents who were older had fewer family members
earning money. The variables sex and respondent's socioeconomic
index did not have a relationship to how many in family earned
money.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
-
Dissecting problem 1 - 1
The problem may give us different levels of significance for the
analysis. In this problem, we are told to use 0.05 as alpha for the
regression, but 0.01 for testing assumptions.
When we test for influential cases using Cook's distance, we need to
compute a critical value for comparison using the formula:
4 / (n - k - 1), where n is the number of cases and k is the number
of independent variables. The correct value (0.0160) is provided in
the problem.
The random number seed (788035) for the split sample validation is
provided.
After evaluating assumptions, outliers, and influential cases, we
will decide whether we should use the model with transformations and
excluding outliers, or the model with the original form of the
variables and all cases.
-
Dissecting problem 1 - 2
When a problem states that there is a relationship between some
independent variables and a dependent variable, we do standard
multiple regression.
The variables listed first in the problem statement are the
independent variables (IVs): "age" [age], "sex" [sex], and
"respondent's socioeconomic index" [sei].
The variable that is the target of the relationship is the dependent
variable (DV): "how many in family earned money" [earnrs].
-
Dissecting problem 1 - 3
In order for a problem to be true, we will have to find that there
is a statistically significant relationship between the set of IVs
and the DV, and the strength of the relationship stated in the
problem must be correct. In addition, the relationship or lack of
relationship between the individual IVs and the DV must be
identified correctly, and must be characterized correctly.
-
LEVEL OF MEASUREMENT
Multiple regression requires that the dependent variable be metric
and the independent variables be metric or dichotomous. "How many in
family earned money" [earnrs] is an interval level variable, which
satisfies the level of measurement requirement.
"Age" [age] and "respondent's socioeconomic index" [sei] are
interval level variables, which satisfy the level of measurement
requirements for multiple regression analysis.
"Sex" [sex] is a dichotomous or dummy-coded nominal variable
which may be included in multiple regression analysis.
-
The baseline regression
We begin our analysis by running a standard multiple regression
analysis with earnrs as the dependent variable and age, sex, and sei
as the independent variables.
Select Enter as the Method for including variables to produce a
standard multiple regression.
Click on the Statistics button to select the statistics we will need
for the analysis.
-
The baseline regression
Retain the default checkboxes for Estimates and Model fit to obtain
the baseline R², which will be used to determine whether we should
use the model with transformations and excluding outliers, or the
model with the original form of the variables and all cases.
Mark the Descriptives checkbox to get the number of cases available
for the analysis.
Mark the checkbox for the Durbin-Watson statistic, which will be
used to test the assumption of independence of errors.
-
Initial sample size
The initial sample size before excluding outliers and influential
cases is 254. With 3 independent variables, the ratio of cases to
variables is 84.7 to 1, satisfying both the minimum and preferred
sample size for multiple regression.
If the sample size does not initially satisfy the minimum
requirement, regression analysis is not appropriate.
-
R² before transformations or removing outliers
Prior to any transformations of variables to satisfy the assumptions
of multiple regression or removal of outliers, the proportion of
variance in the dependent variable explained by the independent
variables (R²) was 18.7%.
The relationship is statistically significant, though we would
not stop if it were not significant because the lack of
significance may be a consequence of violation of assumptions or
the inclusion of outliers and influential cases.
The R² of 0.187 is the benchmark that we will use to evaluate the
utility of transformations and the elimination of
outliers/influential cases.
-
Assumption of independence of errors: the Durbin-Watson statistic
The Durbin-Watson statistic is used to test for the
presence of serial correlation among the residuals, i.e., the
assumption of independence of errors, which requires that the
residuals or errors in prediction do not follow a pattern from case
to case.
The value of the Durbin-Watson statistic ranges from 0 to 4. As
a general rule of thumb, the residuals are not correlated if the
Durbin-Watson statistic is approximately 2, and an acceptable range
is 1.50 - 2.50.
The Durbin-Watson statistic for this problem is 1.849 which
falls within the acceptable range.
If the Durbin-Watson statistic was not in the acceptable range,
we would add a caution to the findings for a violation of
regression assumptions.
-
Normality of dependent variable: how many in family earned money
We examine the normality of each metric variable and the linearity
of its relationship with the dependent variable, beginning with the
dependent variable itself.
To test the normality of number of earners in family, run the
script: NormalityAssumptionAndTransformations.SBS
First, move the dependent variable EARNRS to the list box of
variables to test.
Second, click on the OK button to produce the output.
-
Normality of dependent variable: how many in family earned money
The dependent variable "how many in family earned money"
[earnrs] does not satisfy the criteria for a normal
distribution.
The skewness (0.742) fell between -1.0 and +1.0, but the
kurtosis (1.324) fell outside the range from -1.0 to +1.0.
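The normality criterion used throughout these notes (both skewness and kurtosis between -1.0 and +1.0) can be written as a small check. The function name is hypothetical, and the skewness and kurtosis values themselves come from the SPSS output rather than from this code.

```python
def within_normal_range(skewness, kurtosis, limit=1.0):
    """True when both skewness and kurtosis fall between -limit and +limit."""
    return -limit <= skewness <= limit and -limit <= kurtosis <= limit


# Untransformed earnrs: kurtosis of 1.324 is outside the range.
print(within_normal_range(0.742, 1.324))
```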
-
Normality of dependent variable: how many in family earned money
The logarithmic transformation improves the normality of "how
many in family earned money" [earnrs]. In evaluating normality, the
skewness (-0.483) and kurtosis (-0.309) were both within the range
of acceptable values from -1.0 to +1.0. The square root
transformation also has values of skewness and kurtosis in the
acceptable range.
However, by our order of preference for which transformation to
use, the logarithm is preferred to the square root or inverse.
-
Transformation for how many in family earned money
The
logarithmic transformation improves the normality of "how many in
family earned money" [earnrs].
We will substitute the logarithmic transformation of how many in
family earned money as the dependent variable in the regression
analysis.
-
Adding a transformed variable
Before testing the assumptions for the independent variables, we
need to add the transformation of the dependent variable to the data
set.
First, move the variable that we want to transform to the list box
of variables to test.
Second, mark the checkbox for the transformation we want to add to
the data set, and clear the other checkboxes.
Third, clear the checkbox for Delete transformed variables from the
data. This will save the transformed variable.
Fourth, click on the OK button to produce the output.
-
The transformed variable in the data editor
If we scroll to the extreme right in the data editor, we see that
the transformed variable has been added to the data set.
Whenever we add transformed variables to the data set, we should be
sure to delete them before starting another analysis.
-
Normality/linearity of independent variable: age
After evaluating the dependent variable, we examine the normality of
each metric variable and linearity of its relationship with the
dependent variable.
To test the normality of age, run the script:
NormalityAssumptionAndTransformations.SBS
First, move the independent variable AGE to the list box of
variables to test.
Second, click on the OK button to produce the output.
-
Normality/linearity of independent variable: age
In evaluating
normality, the skewness (0.595) and kurtosis (-0.351) were both
within the range of acceptable values from -1.0 to +1.0.
-
Normality/linearity of independent variable: age
To evaluate the linearity of age and the log transformation of
number of earners in the family, run the script for the assumption
of linearity: LinearityAssumptionAndTransformations.SBS
First, move the transformed dependent variable LOGEARN to the text
box for the dependent variable.
Second, move the independent variable, AGE, to the list box for
independent variables.
Third, click on the OK button to produce the output.
-
Normality/linearity of independent variable: age
The evidence of
linearity in the relationship between the independent variable
"age" [age] and the dependent variable "log transformation of how
many in family earned money" [logearn] was the statistical
significance of the correlation coefficient (r = -0.493). The
probability for the correlation coefficient was
-
Normality/linearity of independent variable: respondent's socioeconomic index
To test the normality of respondent's socioeconomic index, run the
script: NormalityAssumptionAndTransformations.SBS
First, move the independent variable SEI to the list box of
variables to test.
Second, click on the OK button to produce the output.
-
Normality/linearity of independent variable: respondent's socioeconomic index
The independent variable "respondent's
socioeconomic index" [sei] satisfies the criteria for the
assumption of normality, but does not satisfy the assumption of
linearity with the dependent variable "log transformation of how
many in family earned money" [logearn].
In evaluating normality, the skewness (0.585) and kurtosis
(-0.862) were both within the range of acceptable values from -1.0
to +1.0.
-
Normality/linearity of independent variable: respondent's socioeconomic index
To evaluate the linearity of the relationship between respondent's
socioeconomic index and the log transformation of how many in family
earned money, run the script for the assumption of linearity:
LinearityAssumptionAndTransformations.SBS
First, move the transformed dependent variable LOGEARN to the text
box for the dependent variable.
Second, move the independent variable, SEI, to the list box for
independent variables.
Third, click on the OK button to produce the output.
-
Normality/linearity of independent variable: respondent's socioeconomic index
The probability for the correlation coefficient
was 0.385, greater than the level of significance of 0.01. We
cannot reject the null hypothesis that r = 0, and cannot conclude
that there is a linear relationship between the variables.
Since none of the transformations to improve linearity were
successful, it is an indication that the problem may be a weak
relationship, rather than a curvilinear relationship correctable by
using a transformation. A weak relationship is not a violation of
the assumption of linearity, and does not require a caution.
-
Homoscedasticity of independent variable: Sex
To evaluate the homoscedasticity of the relationship between sex and
the log transformation of how many in family earned money, run the
script for the assumption of homogeneity of variance:
HomoscedasticityAssumptionAndTransformations.SBS
First, move the transformed dependent variable LOGEARN to the text
box for the dependent variable.
Second, move the independent variable, SEX, to the list box for
independent variables.
Third, click on the OK button to produce the output.
-
Homoscedasticity of independent variable: Sex
Based on the Levene
Test, the variance in "log transformation of how many in family
earned money" [logearn] is homogeneous for the categories of "sex"
[sex].
The probability associated with the Levene Statistic (0.767) is
greater than the level of significance, so we fail to reject the
null hypothesis and conclude that the homoscedasticity assumption
is satisfied.
-
The regression to identify outliers and influential cases
We use the regression procedure to identify univariate outliers,
multivariate outliers, and influential cases.
To run the regression again, select the Regression | Linear command
from the Analyze menu.
We start with the same dialog we used for the baseline analysis
and substitute the transformed variables which we think will
improve the analysis.
-
The regression to identify outliers and influential cases
First, we substitute the logarithmic transformation of earnrs,
logearn, as the dependent variable.
Second, we keep the method of entry set to Enter so that all
variables will be included in the detection of outliers.
NOTE: we should always use Enter when testing for outliers and
influential cases to make sure all variables are included in the
determination.
Third, we want to save the calculated values of the outlier
statistics to the data set. Click on the Save button to specify what
we want to save.
-
Saving the measures of outliers/influential cases
First, mark the checkbox for Studentized residuals in the Residuals
panel. Studentized residuals are z-scores computed for a case based
on the data for all other cases in the data set.
Second, mark the checkbox for Mahalanobis in the Distances panel.
This will compute Mahalanobis distances for the set of independent
variables.
Third, mark the checkbox for Cook's in the Distances panel. This
will compute Cook's distances to identify influential cases.
Fourth, click on the OK button to complete the specifications.
-
The variables for identifying outliers/influential cases
The variable for identifying univariate outliers for the dependent
variable is in a column which SPSS has named sre_1. These are the
studentized residuals for the log transformed variables.
The variable for identifying multivariate outliers for the
independent variables is in a column which SPSS has named mah_1.
The variable containing Cook's distances for identifying influential
cases has been named coo_1 by SPSS.
-
Computing the probability for Mahalanobis D
To compute the probability of D, we will use an SPSS function in a
Compute command.
First, select the Compute command from the Transform menu.
-
Formula for probability for Mahalanobis D
First, in the target variable text box, type the name "p_mah_1" as
an acronym for the probability of mah_1, the Mahalanobis D score.
Second, to complete the specifications for the CDF.CHISQ function,
type the name of the variable containing the D scores, mah_1,
followed by a comma, followed by the number of variables used in the
calculations, 3.
Since the CDF function (cumulative distribution function) computes
the cumulative probability from the left end of the distribution up
through a given value, we subtract it from 1 to obtain the
probability in the upper tail of the distribution.
Third, click on the OK button to signal completion of the Compute
Variable dialog.
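The computation described here amounts to an expression of the form 1 - CDF.CHISQ(mah_1, 3). As a rough cross-check outside SPSS, the sketch below evaluates the chi-square cumulative probability with a power series for the regularized incomplete gamma function and subtracts it from 1. The function names are made up for illustration, and the series is only intended for the moderate distance values that arise in this kind of screening.

```python
import math


def chi2_cdf(x, df):
    """Cumulative chi-square probability up to x (series expansion)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    # Power series for the regularized lower incomplete gamma P(a, z).
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def mahalanobis_p(d_squared, n_predictors):
    """Upper-tail probability, mirroring 1 - CDF.CHISQ(mah_1, k)."""
    return 1.0 - chi2_cdf(d_squared, n_predictors)


# Hypothetical Mahalanobis distances for three cases, 3 predictors.
for d in (1.2, 7.8, 18.5):
    p = mahalanobis_p(d, 3)
    print(d, round(p, 4), "outlier" if p <= 0.001 else "ok")
```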
-
Univariate outliers
A score on the dependent variable is considered unusual if the
absolute value of its studentized residual is larger than 3.0.
-
Multivariate outliers
The combination of scores for the
independent variables is an outlier if the probability of the
Mahalanobis D distance score is less than or equal to 0.001.
-
Influential cases
In addition, a case may have a large influence on the regression
analysis, resulting in an analysis that is less representative of
the population represented by the sample. The criterion for
identifying an influential case is a Cook's distance score with a
value of 0.0160 or greater.
The criterion for Cook's distance is:
4 / (n - k - 1) = 4 / (254 - 3 - 1) = 0.0160
-
Omitting the outliers and influential cases
To omit the outliers and influential cases from the analysis, we
select only the cases that are not outliers and are not influential
cases.
First, select the Select Cases command from the Data menu.
-
Specifying the condition to omit outliers
First, mark the If condition is satisfied option button to indicate
that we will enter a specific condition for including cases.
Second, click on the If button to specify the criteria for inclusion
in the analysis.
-
The formula for omitting outliers
To eliminate the outliers and influential cases, we request the
cases that are not outliers or influential cases.
The formula specifies that we should include cases if the
studentized residual (regardless of sign) is less than 3, the
probability for Mahalanobis D is higher than the level of
significance of 0.001, and the Cook's distance value is less than
the critical value of 0.0160.
After typing in the formula, click on the Continue button to close
the dialog box.
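The inclusion rule described in words above can be sketched as a single boolean test per case. The function and parameter names are hypothetical stand-ins for the saved SPSS variables (sre_1, p_mah_1, coo_1), and the Mahalanobis probabilities in the example rows are invented, since the slides do not list them.

```python
def keep_case(studentized_residual, p_mahalanobis, cooks_distance,
              resid_limit=3.0, p_limit=0.001, cooks_limit=0.0160):
    """True when a case is neither an outlier nor an influential case."""
    return (abs(studentized_residual) < resid_limit   # not a univariate outlier
            and p_mahalanobis > p_limit               # not a multivariate outlier
            and cooks_distance < cooks_limit)         # not influential


# (studentized residual, p for Mahalanobis D, Cook's distance)
cases = [
    (3.13, 0.40, 0.0320),   # large residual and influential
    (0.85, 0.40, 0.0167),   # influential only
    (-1.10, 0.40, 0.0040),  # retained
]
print([keep_case(*c) for c in cases])
```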
-
Completing the request for the selection
To complete the request, we click on the OK button.
-
An omitted outlier and influential case
SPSS identifies the excluded cases by drawing a slash mark through
the case number.
This omitted case has a large studentized residual, greater than
3.0, as well as a Cook's distance value that is greater than the
critical value, 0.0160.
-
The outliers and influential cases
Case 20000159 is an influential case (Cook's distance=0.0320) as
well as an outlier on the dependent variable (studentized
residual=3.13).
Case 20000915 is an influential case (Cook's distance=0.0239).
Case 20001016 is an influential case (Cook's distance=0.0598) as
well as an outlier on the dependent variable (studentized
residual=-3.12).
Case 20001761 is an influential case (Cook's distance=0.0167).
Case 20002587 is an influential case (Cook's distance=0.0264).
Case 20002597 is an influential case (Cook's distance=0.0293).
There are 6 cases that have a Cook's distance score that is large
enough to be considered influential cases.
-
Running the regression omitting outliers
We run the regression again, without the outliers which we selected
out with the Select If command.
Select the Regression | Linear command from the Analyze menu.
-
Opening the save options dialog
We specify the dependent and independent variables, continuing to
substitute any transformed variables required by assumptions.
On our last run, we instructed SPSS to save studentized residuals,
Mahalanobis distance, and Cook's distance. To prevent these values
from being calculated again, click on the Save button.
-
Clearing the request to save outlier data
First, clear the checkbox for Studentized residuals.
Second, clear the checkbox for Mahalanobis distance.
Third, clear the checkbox for Cook's distance.
Fourth, click on the OK button to complete the specifications.
-
Opening the statistics options dialog
Once we have removed outliers, we need to check the sample size
requirement for regression.
Since we will need the descriptive statistics for this, click on
the Statistics button.
-
Requesting descriptive statistics
First, mark the checkbox for Descriptives.
Second, mark the checkbox for Collinearity diagnostics to obtain the
tolerance values for each independent variable in order to assess
multicollinearity.
Third, click on the Continue button to complete the specifications.
-
Requesting the output
Having specified the output needed for the analysis, we click on the
OK button to obtain the regression output.
-
SELECTION OF MODEL FOR INTERPRETATION
Prior to any transformations of variables to satisfy the assumptions
of multiple regression and the removal of outliers and influential
cases, the proportion of variance in the dependent variable
explained by the independent variables (R²) was 18.7%.
After substituting transformed variables and removing outliers
and influential cases, the proportion of variance in the dependent
variable explained by the independent variables (R²) was 38.4%.
Since the regression analysis using transformations and omitting
outliers and influential cases explained at least two percent more
variance than the regression analysis with all cases and no
transformations, the regression analysis with transformed variables
omitting outliers and influential cases was interpreted.
-
SAMPLE SIZE
The minimum ratio of valid cases to independent
variables for multiple regression is 5 to 1. After removing 6
influential cases or outliers, there are 248 valid cases and 3
independent variables.
The ratio of cases to independent variables for this analysis is
82.67 to 1, which satisfies the minimum requirement. In addition,
the ratio of 82.67 to 1 satisfies the preferred ratio of 15 to
1.
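The sample-size check is a ratio compared against the two thresholds named above (5 to 1 minimum, 15 to 1 preferred). A minimal sketch with a hypothetical function name:

```python
def sample_size_check(n_cases, n_predictors, minimum=5, preferred=15):
    """Return the cases-to-variables ratio and whether each threshold is met."""
    ratio = n_cases / n_predictors
    return ratio, ratio >= minimum, ratio >= preferred


# 248 valid cases remain after removing 6 cases; 3 independent variables.
ratio, meets_minimum, meets_preferred = sample_size_check(248, 3)
print(f"{ratio:.2f} to 1", meets_minimum, meets_preferred)
```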
-
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES
The probability of the F statistic (50.759) for the
overall regression relationship is
-
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES
The Multiple R for the relationship between the set of
independent variables and the dependent variable is 0.620, which
would be characterized as strong using the rule of thumb that a
correlation less than or equal to 0.20 is characterized as very
weak; greater than 0.20 and less than or equal to 0.40 is weak;
greater than 0.40 and less than or equal to 0.60 is moderate;
greater than 0.60 and less than or equal to 0.80 is strong; and
greater than 0.80 is very strong.
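The rule of thumb above maps directly onto a chain of comparisons. A small sketch follows; the function name is hypothetical, and the absolute value is taken so that the same labels apply to negative correlations.

```python
def describe_strength(r):
    """Label a correlation using the rule-of-thumb cutoffs in the text."""
    r = abs(r)
    if r <= 0.20:
        return "very weak"
    if r <= 0.40:
        return "weak"
    if r <= 0.60:
        return "moderate"
    if r <= 0.80:
        return "strong"
    return "very strong"


print(describe_strength(0.620))
```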
-
MULTICOLLINEARITY
Multicollinearity occurs when one independent
variable is so strongly correlated with one or more other variables
that its relationship to the dependent variable is likely to be
misinterpreted. Its potential unique contribution to explaining the
dependent variable is minimized by its strong relationship to other
independent variables. Multicollinearity is indicated when the
tolerance value for an independent variable is less than 0.10.
The tolerance values for all of the independent variables are
larger than 0.10. Multicollinearity is not a problem in this
regression analysis.
-
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT
VARIABLE - 1
For the independent variable age, the probability of the t
statistic (-12.237) for the b coefficient is
-
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT
VARIABLE - 2
For the independent variable sex, the probability of
the t statistic (1.284) for the b coefficient is 0.200 which is
greater than the level of significance of 0.05.
We fail to reject the null hypothesis that the slope associated
with sex is equal to zero (b = 0) and conclude that there is not a
statistically significant relationship between sex and log
transformation of how many in family earned money.
-
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT
VARIABLE - 3
For the independent variable respondent's socioeconomic
index, the probability of the t statistic (0.354) for the b
coefficient is 0.724 which is greater than the level of
significance of 0.05.
We fail to reject the null hypothesis that the slope associated
with respondent's socioeconomic index is equal to zero (b = 0) and
conclude that there is not a statistically significant relationship
between respondent's socioeconomic index and log transformation of
how many in family earned money.
-
Validation analysis: set the random number seed
To set the random
number seed, select the Random Number Seed command from the
Transform menu.
-
Set the random number seed
First, click on the Set seed to option button to activate the text box.
Second, type in the random seed stated in the problem.
Third, click on the OK button to complete the dialog box.
Note that SPSS does not provide you with any feedback about the
change.
-
Validation analysis: compute the split variable
To enter the
formula for the variable that will split the sample in two parts,
click on the Compute command.
-
The formula for the split variable
First, type the name for the new variable, split, into the Target
Variable text box.
Second, the formula for the value of split is shown in the text box.
The uniform(1) function generates a random decimal number
between 0 and 1. The random number is compared to the value
0.50.
If the random number is less than or equal to 0.50, the value of
the formula will be 1, the SPSS numeric equivalent to true. If the
random number is larger than 0.50, the formula will return a 0, the
SPSS numeric equivalent to false.
Third, click on the OK button to complete the dialog box.
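The logic of the split formula can be mirrored outside SPSS. The following Python sketch is illustrative only: it assumes the formula compares uniform(1) to 0.50 as described above, n_cases is a hypothetical sample size, and Python's random number generator will not reproduce the values SPSS generates even with the same seed.

```python
import random

# Seed quoted in the sample problem; Python's random stream differs from SPSS's.
random.seed(788035)

n_cases = 10  # hypothetical number of cases, for illustration only

# 1 when the random draw is <= 0.50 (true), 0 otherwise (false)
split = [1 if random.random() <= 0.50 else 0 for _ in range(n_cases)]
print(split)
```

Setting the seed first is what makes the pattern of zeros and ones repeatable from one run to the next.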
-
The split variable in the data editor
In the data editor, the
split variable shows a random pattern of zeros and ones.
To select half of the sample for each validation analysis, we
will first select the cases where split = 0, then select the cases
where split = 1.
-
Repeat the regression with first validation sample
To repeat the
multiple regression analysis for the first validation sample,
select Linear Regression from the Dialog Recall tool button.
-
Using "split" as the selection variable
First, scroll down the list of variables and highlight the variable split.
Second, click on the right arrow button to move the split variable to the
Selection Variable text box.
-
Setting the value of split to select cases
When the variable named split is moved to the Selection Variable text box,
SPSS adds "=?" after the name to prompt us to enter a specific value for
split.
Click on the Rule button to enter a value for split.
-
Completing the value selection
First, type the value for the first half of the sample, 0, into the Value
text box.
Second, click on the Continue button to complete the value entry.
-
Requesting output for the first validation sample
When the value entry dialog box is closed, SPSS adds the value we entered
after the equal sign. This specification now tells SPSS to include in the
analysis only those cases that have a value of 0 for the split
variable.
Click on the OK button to request the output.
Since the validation analysis requires us to compare the results of the
analysis using the two split samples, we will request the output for
the second sample before doing any comparison.
-
Repeat the regression with second validation sample
To repeat the
multiple regression analysis for the second validation sample,
select Linear Regression from the Dialog Recall tool button.
-
Setting the value of split to select cases
Since the split
variable is already in the Selection Variable text box, we only
need to change its value.
Click on the Rule button to enter a different value for
split.
-
Completing the value selection
First, type the value for the second half of the sample, 1, into the Value
text box.
Second, click on the Continue button to complete the value entry.
-
Requesting output for the second validation sample
When the value entry dialog box is closed, SPSS adds the value we entered
after the equal sign. This specification now tells SPSS to include in the
analysis only those cases that have a value of 1 for the split
variable.
Click on the OK button to request the output.
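In code terms, using split as a selection variable amounts to filtering the cases before fitting the model. The Python sketch below is a simplified illustration under loud assumptions: the data are synthetic (not GSS2000), the slope of -0.01 and the noise level are arbitrary, only one predictor is used instead of three, and fit_line is a hypothetical helper.

```python
import random

random.seed(1)  # arbitrary seed for this illustration, not the one in the problem

# Synthetic cases (NOT the GSS2000 data): age and a made-up log-scale outcome.
cases = []
for _ in range(200):
    age = random.randint(18, 89)
    logearn = 0.5 - 0.01 * age + random.gauss(0, 0.1)  # arbitrary relationship
    split = 1 if random.random() <= 0.50 else 0
    cases.append({"age": age, "logearn": logearn, "split": split})


def fit_line(rows):
    """Least-squares intercept and slope of logearn on age for the given rows."""
    xs = [r["age"] for r in rows]
    ys = [r["logearn"] for r in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


results = {}
for value in (0, 1):  # first validation sample, then second
    selected = [r for r in cases if r["split"] == value]  # the "selection rule"
    results[value] = fit_line(selected)
    print(f"split = {value}: n = {len(selected)}, slope = {results[value][1]:.4f}")
```

Both fits are produced before any comparison is made, which matches the order of work in the SPSS steps.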
-
SPLIT-SAMPLE VALIDATION - 1
In both of the split-sample
validation analyses, the relationship between the independent
variables and the dependent variable was statistically
significant.
In the first validation, the probability for the F
statistic testing overall relationship was
-
SPLIT-SAMPLE VALIDATION - 2
The total proportion of variance explained in
the model using the full data set was 38.4%, compared to
40.0% for the first split-sample validation and 36.5% for the
second split-sample validation.
In both of the split-sample validation analyses, the total
proportion of variance in the dependent variable explained by the
independent variables was within 5% of the variance explained in
the model using the full data set (38.4%).
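The comparison above is simple arithmetic, sketched here in Python with the three percentages reported on this slide. One assumption is labeled in the code: "within 5%" is read as a difference of no more than 5 percentage points.

```python
full_r2 = 38.4                 # percent of variance explained, full data set
validation_r2 = [40.0, 36.5]   # first and second split-sample validations

# Assumption: "within 5%" means a gap of at most 5 percentage points.
differences = [round(abs(v - full_r2), 1) for v in validation_r2]
within_5 = all(d <= 5.0 for d in differences)

print(differences, within_5)
```

The gaps are 1.6 and 1.9 percentage points, so both validation samples satisfy the criterion.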
-
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT
VARIABLE - 1
The relationship between "age" [age] and "log
transformation of how many in family earned money" [logearn] was
statistically significant for the model using the full data set
(p
-
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT
VARIABLE - 2
In the second validation analysis, the probability for
the test of relationship between "age" [age] and "log
transformation of how many in family earned money" [logearn]
was
-
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT
VARIABLE - 3
The relationship between "respondent's socioeconomic
index" [sei] and "log transformation of how many in family earned
money" [logearn] was not statistically significant for the model
using the full data set (p=0.724). Similarly, the relationships in
both of the validation analyses were not statistically
significant.
In the first validation analysis, the probability for
the test of relationship between "respondent's socioeconomic index"
[sei] and "log transformation of how many in family earned money"
[logearn] was 0.601, which was greater than the level of
significance of 0.05 and not statistically significant.
-
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT
VARIABLE - 4
In the second validation analysis, the probability for
the test of relationship between "respondent's socioeconomic index"
[sei] and "log transformation of how many in family earned money"
[logearn] was 0.274, which was greater than the level of
significance of 0.05 and not statistically significant.
-
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT
VARIABLE - 5
The relationship between "sex" [sex] and "log
transformation of how many in family earned money" [logearn] was
not statistically significant for the model using the full data set
(p=0.200). Similarly, the relationships in both of the validation
analyses were not statistically significant.
In the first validation
analysis, the probability for the test of relationship between
"sex" [sex] and "log transformation of how many in family earned
money" [logearn] was 0.410, which was greater than the level of
significance of 0.05 and not statistically significant.
-
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT
VARIABLE - 6
In the second validation analysis, the probability for
the test of relationship between "sex" [sex] and "log
transformation of how many in family earned money" [logearn] was
0.366, which was greater than the level of significance of 0.05 and
not statistically significant.
The split sample validation supports
the findings of the regression analysis using the full data
set.
The answer to the original question is true.
-
Table of validation results: standard regression
It may be
helpful to create a table for our validation results and fill in
its cells as we complete the analysis. The split sample validation
supports the findings of the regression analysis using the full
data set.
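One way to keep such a table is sketched below in Python. It uses only the probabilities reported on the preceding slides for sex and sei; age is left out because its validation probabilities are not shown in this text. The helper name pattern and the dictionary layout are hypothetical.

```python
ALPHA = 0.05  # level of significance for the regression analysis

# p-values reported on the slides: (full data set, split = 0, split = 1)
p_values = {
    "sex": (0.200, 0.410, 0.366),
    "sei": (0.724, 0.601, 0.274),
}


def pattern(ps, alpha=ALPHA):
    """True where a relationship is statistically significant."""
    return tuple(p <= alpha for p in ps)


matches_by_var = {}
for name, ps in p_values.items():
    full, *splits = pattern(ps)
    # Validation supports the finding when both splits agree with the full data set.
    matches_by_var[name] = all(s == full for s in splits)
    print(f"{name:4s} full={ps[0]:.3f} split0={ps[1]:.3f} "
          f"split1={ps[2]:.3f} matches={matches_by_var[name]}")
```

For both variables the pattern is "not significant" in all three analyses, so the validation matches the full-sample result.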
-
Answering the problem question - 1
In the dataset GSS2000.sav, is
the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data. Use
a level of significance of 0.05 for the regression analysis. Use a
level of significance of 0.01 for evaluating assumptions. Use
0.0160 as the criterion for identifying influential cases. Validate
the results of your regression analysis by splitting the sample in
two, using 788035 as the random number seed.
The variables "age" [age], "sex" [sex], and "respondent's
socioeconomic index" [sei] have a strong relationship to the
variable "how many in family earned money" [earnrs].
Survey respondents who were older had fewer family members
earning money. The variables sex and respondent's socioeconomic
index did not have a relationship to how many in family earned
money.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
We have found that there is a
statistically significant relationship between the set of IVs and
the DV (p
-
Answering the problem question - 2
The b coefficient associated with age was
statistically significant (p < 0.001), so there was an individual
relationship to interpret.
The b coefficient (-0.007) was negative, indicating an inverse
relationship in which higher numeric values for age are associated
with lower numeric values for log transformation of how many in
family earned money.
Therefore, the negative value of b implies that survey
respondents who were older had fewer family members earning
money.
-
Answering the problem question - 3
For the independent variable sex, the
probability of the t statistic (1.284) for the b coefficient is
0.200 which is greater than the level of significance of 0.05. Sex
did not have a relationship to the number of persons in the family
earning money.
For the independent variable respondent's socioeconomic index,
the probability of the t statistic (0.354) for the b coefficient is
0.724 which is greater than the level of significance of 0.05.
Socioeconomic status did not have a relationship to the number of
persons in the family earning money.
The answer to the question is true.
-
Steps in regression analysis: running the baseline model
The following is a guide to the decision process for answering problems
about the complete regression analysis:
Dependent variable metric? Independent variables metric or dichotomous?
    No: Inappropriate application of a statistic
    Yes: Run baseline regression, using method for including variables
    identified in the research question.
Record R² for evaluation of transformations and removal of outliers and
influential cases.
Record Durbin-Watson statistic for assumption of independence of errors.
Ratio of cases to independent variables at least 5 to 1?
    No: Inappropriate application of a statistic
-
Steps in regression analysis: evaluating assumptions - 1
Is the dependent variable normally distributed?
Try:
1. Logarithmic transformation
2. Square root transformation
3. Inverse transformation
If unsuccessful, add caution for violation of regression assumptions
Metric IVs normally distributed and linearly related to DV?
Try:
1. Logarithmic transformation
2. Square root transformation
(3. Square transformation)
4. Inverse transformation
If unsuccessful, add caution for violation of regression assumptions
-
Steps in regression analysis: evaluating assumptions - 2
DV is homoscedastic for categories of dichotomous IVs?
If not, add caution for violation of regression assumptions
Residuals are independent, Durbin-Watson between 1.5 and 2.5?
If not, add caution for violation of regression assumptions
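For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The Python sketch below shows that calculation and the 1.5 to 2.5 check; the residual values are hypothetical and chosen only for illustration, and in practice SPSS reports the statistic in its regression output.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals, listed in case order
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.1]

dw = durbin_watson(residuals)
acceptable = 1.5 <= dw <= 2.5  # rule-of-thumb range used in this guide
print(f"Durbin-Watson = {dw:.3f}, within 1.5 to 2.5: {acceptable}")
```

Because the statistic depends on the order of the residuals, re-sorting the cases changes its value.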
-
Steps in regression analysis: evaluating outliers
Univariate outliers (DV), multivariate outliers (IVs), or influential
cases?
Request statistics for detecting outliers and influential
cases by running standard multiple regression using Enter method to
include all variables and substituting transformed variables.
Ratio of cases to independent variables at least 5 to 1?
Yes
Remove outliers and influential cases from data set
Restore outliers and influential cases to data set, add caution
to findings
-
Steps in regression analysis: picking regression model for interpretation
R² for evaluated regression greater than R² for
baseline regression by 2% or more?
Pick baseline regression for interpretation
Were transformed variables substituted, or outliers and
influential cases omitted?
Evaluate impact of transformations and
removal of outliers by running regression again, using method for
including variables identified in the research question.
Pick regression with transformations and omitting outliers for
interpretation
-
Steps in regression analysis: overall relationship is interpretable
Probability of ANOVA test of regression less than/equal to level of
significance?
If not: False
Tolerance for all IVs greater than 0.10, indicating no multicollinearity?
If not: False
-
Steps in regression analysis: validation - 1
Enough valid cases to split sample and keep 5 to 1 ratio of cases/variables?
Yes
Set the random seed and compute the split variable
Re-run regression with split = 0
Re-run regression with split = 1
No
Set the first random seed and compute the split1 variable
Re-run regression with split1 = 1
Set the second random seed and compute the split2 variable
Re-run regression with split2 = 1
Probability of ANOVA test
-
Steps in regression analysis: validation - 2
Pattern of significance for independent variables in both validations
matches pattern for full data set?
If not: False
Change in R² statistically significant in both validation analyses?
(Hierarchical only)
If not: False
R² for both validations within 5% of R² for analysis of full data set?
If not: False
-
Steps in regression analysis: answering the question
Assumptions not violated? Outliers/influential cases excluded from
interpretation?
True
Satisfies ratio for preferred sample size: 15 to 1 (stepwise: 50 to 1)
True with caution
DV is interval level and IVs are interval level or dichotomous?
True with caution
True with caution
-
Interpreting the coefficients when the dependent variable is
transformed
-
Interpreting b coefficients when the dependent variable is
transformed - 1
For the independent variable age, the probability of the t
statistic (-12.237) for the b coefficient is
-
Interpreting b coefficients when the dependent variable is
transformed - 2
If we want to interpret a specific change in number
of earners for some amount of change in age, we will need to find
the answer and convert it from log units to decimal units. We can
use Microsoft Excel to calculate the answer.
In the worksheet, I have entered the b coefficient from the SPSS
output (-0.007) in row 1.
In row 2, I have entered different ages, e.g. 20, 30, 40, and
50.
-
Interpreting b coefficients when the dependent variable is
transformed - 3
On row 3, we multiply the value for the independent
variable age by the b coefficient, which measures the contribution
to the dependent variable in log units.
On row 4, we reverse the log transform back to decimal units by
raising the number 10 to the power of the value on row 3.
The caret symbol is used by Excel for raising to a power, so
10^-0.14 is 10 raised to the -0.14 power, or 0.7244.
-
Interpreting b coefficients when the dependent variable is
transformed - 4
Based on our table, a respondent who is age 20
contributes 0.7244 to the number of earners in the family. If a
respondent were 30, rather than 20, the contribution to the number of
earners would be 0.6166, a decrease of 0.1078. Thus, increasing age
has a negative effect on the number of earners.
Note that as we go
up in increments of 10, the difference between increments is
decreasing. The logarithmic scale is not linear, requiring us to
compute the change for any specific interval of interest.
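The worksheet calculation can also be reproduced in a few lines of Python. This sketch uses the b coefficient (-0.007) and the ages from the slides, and assumes, as the Excel steps do, that the transformation was a base-10 logarithm; the variable names are illustrative.

```python
b = -0.007                 # b coefficient for age from the SPSS output
ages = [20, 30, 40, 50]    # ages entered in row 2 of the worksheet

log_units = [b * age for age in ages]          # row 3: contribution in log units
decimal_units = [10 ** v for v in log_units]   # row 4: reverse the base-10 log

# Change in decimal units for each 10-year step
changes = [later - earlier
           for earlier, later in zip(decimal_units, decimal_units[1:])]

for age, logged, value in zip(ages, log_units, decimal_units):
    print(f"age {age}: {logged:.2f} log units -> {value:.4f}")
print([round(c, 4) for c in changes])
```

The first two values, 0.7244 and 0.6166, match the worksheet, and the step-to-step changes shrink in size as age increases.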