Correlations & Linear Regressions Block 3
Question
• You can ask questions other than ‘Are two conditions different?’
• What relationship or association exists between two or more variables?
  – Positively related: as x goes, so goes y.
  – Negatively related: whatever x does, y does the opposite.
  – No relationship.
Example of linear correlation
[Figure: scatter plot of Record Sales (thousands) against Advertising Budget (thousands of pounds).]
Covariance
• An association is indexed by covariance.
  – Are changes in one variable met with a similar or opposite change in another variable?
• Variance: $s^2 = SS / (N - 1)$
• Sum of squares: $SS = \sum_i (x_i - \bar{x})^2$
  – We squared the error scores when looking for variance within one variable.
  – If interested in the association between two variables, we multiply the error scores together.
Calculating Covariance
• If deviations from the mean go in the same direction for both variables, you’ll get a positive number.
• If deviations from the mean go in opposite directions (one negative, one positive), you’ll get a negative number.
$$\mathrm{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
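The covariance formula can be sketched in a few lines of plain Python. The function name `covariance` and the two small data lists are made up for illustration; they are not from the slides.

```python
def covariance(x, y):
    """Sample covariance: sum of cross-products of deviations, over N - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Multiply each pair of deviations (error scores) and add them up
    cross_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross_products / (n - 1)

# Hypothetical data: deviations mostly go in the same direction
adverts = [5, 4, 4, 6, 8]
packets = [8, 9, 10, 13, 15]

print(covariance(adverts, packets))                 # positive covariance
print(covariance(adverts, [-v for v in packets]))   # sign flips when y is mirrored
```

Because covariance depends on the units of both variables, its size is hard to interpret on its own; only the sign is directly readable here.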
Interpreting linear relations
• Correlation coefficient [r] = linear relationship between two variables.
• r² = proportion of common variation in the two variables (strength or magnitude of the relationship).
• Outliers? A single outlier can greatly influence the strength of a correlation.
Effect of outliers
One approach to dealing with outliers is to see if they are non-representative (i.e., at the far end of the normal distribution). If so, they should be removed.
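To see how much a single outlier can matter, here is a minimal sketch that computes Pearson's r with and without one extreme point. The helper `pearson_r` and all data values are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's r: sum of cross-products scaled by both sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data with a clear positive trend
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
r_clean = pearson_r(x, y)

# Same data plus one extreme, made-up observation at (20, 0)
r_with_outlier = pearson_r(x + [20], y + [0])

print(f"r without outlier: {r_clean:.3f}")
print(f"r with outlier:    {r_with_outlier:.3f}")
```

In this toy case the single added point is enough to turn a strong positive correlation (about .83) into a negative one (about -.42), which is why the scatter plot should be checked before trusting r.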
Conducting the analysis
• Each variable gets a separate column.
• Create a scatter plot to get a visual impression of the data.
  – Direction of relationship
  – Strength of relationship
  – Extreme values (outliers), which can greatly influence the correlation coefficient.
• When doing t-tests or ANOVAs, especially repeated measures, each row was data from one person.
• Here, each observation is its own data row.
• No collapsing of data.
Types of correlations
• Bivariate correlation: between two variables.
  – Pearson’s correlation coefficient for parametric data (interval or ratio data)
• Partial correlation: relationship between two variables while ‘controlling’ for the effect of one or more additional variables.
Partial Correlations
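For a single control variable z, a partial correlation can be computed from the three pairwise Pearson correlations using the standard first-order formula. The sketch below assumes those three r values are already known; the function name and the example values are hypothetical.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical pairwise correlations
r_xy, r_xz, r_yz = 0.5, 0.4, 0.6
print(f"r_xy = {r_xy}, partial r_xy.z = {partial_r(r_xy, r_xz, r_yz):.3f}")
```

If z is unrelated to both x and y (both correlations are 0), the partial correlation equals the ordinary r; the more z overlaps with both variables, the more the partial correlation shrinks.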
Drawing conclusions
• Correlations only inform us about a relationship between two or more variables.
• Not able to talk about directionality or causality.
  – An increase in X does not necessarily CAUSE an increase in Y, or vice versa. The cause could be an unmeasured third variable.
  – We don’t know which variable is influencing and which is being influenced.
R²
• By squaring the correlation coefficient, we can tell how much of the total variance in variable x is accounted for by the relationship with variable y.
• R² = .238² = .057 = 5.7% of variance. (94% of variability still unaccounted for!)
• For height x age: .763² = 0.582 = 58%
Non-parametric correlations
• Spearman’s rho
  – Ranks the data and then applies Pearson’s equation to the ranks.
• Kendall’s tau
  – Preferred for small data sets with many tied rankings.
• Biserial correlations
  – When one variable is dichotomous
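Spearman's rho, as described above, is Pearson's equation applied to ranks. A minimal sketch follows; the function names are made up, and giving tied values the average of their ranks is an assumption (a common convention, not stated on the slide).

```python
import math

def average_ranks(values):
    """Rank from 1..n; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Pearson's r computed on the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up data: y always increases with x, but not in a straight line
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]
print(f"Pearson r:    {pearson_r(x, y):.3f}")
print(f"Spearman rho: {spearman_rho(x, y):.3f}")
```

Because only the ordering matters, rho is 1 for any strictly increasing relationship, while Pearson's r on the raw values is pulled below 1 by the non-linearity.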
Regressions
• Correlations detect associations between two variables.
  – Say nothing of causal relationships or directionality
  – Can’t predict behavior on one variable given a value of another variable
• With regression models we can predict variable Y based on variable X.
Simple Linear Regressions
• A line is fit to the data (similar to the correlation line).
  – The best line is the one that produces the smallest sum of squared distances from the regression line to the data points.
• Evaluation is based on improvement in prediction relative to using the mean or some other model.
Hypothetical Data
[Figure: scatter plot; axes labeled “outcome variable” and “Predictor variable”; the mean is marked.]
Error from Mean
[Figure: same plot and axes; the mean is marked.]
Error from regression line
[Figure: same plot and axes; the mean and the regression line are marked.]
Regression Results
• The best regression line has the lowest sum of squared errors.
• Evaluation of the regression model is achieved via:
  – R² = tells you the % of variance accounted for by the regression line (as with correlations)
  – F = evaluates the improvement of the regression line compared to the mean as a model of the data.
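The two evaluation statistics can be sketched for the one-predictor case as follows. The function names and data are hypothetical, and the F ratio assumes 1 model degree of freedom and n - 2 residual degrees of freedom, which holds for simple linear regression.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def evaluate_line(x, y, intercept, slope):
    """Return (R squared, F) comparing the line with the mean as a model."""
    n = len(x)
    mean_y = sum(y) / n
    # Error when the mean is used as the model
    ss_total = sum((b - mean_y) ** 2 for b in y)
    # Error left over when the regression line is used instead
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_model = ss_total - ss_residual          # improvement due to the line
    r_squared = ss_model / ss_total
    f_ratio = (ss_model / 1) / (ss_residual / (n - 2))
    return r_squared, f_ratio

# Made-up data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
r_squared, f_ratio = evaluate_line(x, y, b0, b1)
print(f"intercept={b0:.2f}, slope={b1:.2f}, R^2={r_squared:.2f}, F={f_ratio:.2f}")
```

Both statistics are built from the same comparison: how much smaller the squared error is around the regression line than around the mean.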
Simple Linear Regression in SPSS
• Data input in SPSS as for correlation.
• Only one predictor (IV) and one outcome (DV) allowed.
• Coefficient table allows you to predict the DV for new values of the IV.
R² = proportion of variance accounted for by the regression, biased
Adjusted R² = adjusts for bias
Is the model an improvement over the mean or over a prior model?
β = change in outcome resulting from a change in predictor
Tests null hypothesis for relationship between IV and DV
Correlation between expected y and y
Predicting New Values
• Equation for line:
  – Y = outcome value
  – X = predictor value
  – β₀ = intercept (constant in table; value of Y without predictors)
  – β₁ = slope of line (value for predictor)
  – ε = residual (error)
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
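Once the intercept and slope have been read off the coefficient table, predicting a new value is just the line equation without the error term. The coefficient values below are hypothetical placeholders, not SPSS output from the lecture.

```python
def predict(intercept, slope, x_new):
    """Predicted outcome for a new predictor value: b0 + b1 * x."""
    return intercept + slope * x_new

# Hypothetical coefficients, as if copied from a coefficient table
b0 = 2.2   # "constant" row
b1 = 0.6   # row for the predictor

for x_new in (0, 5, 10):
    print(f"x = {x_new:>2}  ->  predicted y = {predict(b0, b1, x_new):.1f}")
```

The residual ε does not appear in the prediction: it is the part of each observed score that the line fails to capture, so it is unknown for a new case.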
Multiple Regression
• Extends the principles of simple linear regression to situations with multiple predictor variables.
• We seek the linear combination of predictors that correlates maximally with the outcome variable.
$$Y_i = \beta_0 + \underbrace{\beta_1 X_{1i}}_{\text{Predictor 1}} + \underbrace{\beta_2 X_{2i}}_{\text{Predictor 2}} + \dots + \beta_n X_{ni} + \varepsilon_i$$
Multiple Regression, cont’d
• R² gives the % of variance accounted for by the model consisting of the multiple predictors.
• t-tests tell you the independent contribution of each predictor in capturing the data.
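As a rough illustration of what fitting several predictors involves, the sketch below solves the least-squares normal equations in plain Python. This is one textbook way to get the coefficients, not necessarily how SPSS computes them; the function names and the noise-free data (built from y = 1 + 2·x1 + 3·x2) are invented.

```python
def solve_linear_system(a, b):
    """Solve a * coef = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_multiple_regression(predictors, y):
    """predictors: one row of predictor values per observation.
    Returns [b0, b1, ..., bn] minimizing the sum of squared errors."""
    design = [[1.0] + list(row) for row in predictors]   # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Made-up data generated without noise from y = 1 + 2*x1 + 3*x2
predictors = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y = [9, 8, 19, 18, 29]
coefficients = fit_multiple_regression(predictors, y)
print([round(c, 3) for c in coefficients])
```

With real, noisy data the recovered coefficients would only approximate the underlying values, and each one would come with a standard error and t-test.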
Descriptives and other stats from here
Evaluation of model as a whole
Linear relationship of each factor to the dependent variable
R² was .33, now .44
Logistic Regression
• If you are instead interested in predicting class membership, or seeing how well your variables predict class membership, then you can do a logistic regression.
• You can use multiple factors as predictors, just like in multiple regression.
  – You can also enter interactions between factors.
  – Categorical and continuous factors can be combined, but you must tell SPSS which factors are continuous.
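To make "predicting class membership" concrete, here is a minimal sketch of logistic regression with a single continuous predictor, fit by plain gradient descent. The fitting method, learning rate, iteration count, and data are all assumptions chosen for illustration; statistical packages typically use a different fitting algorithm and report more output.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, labels, learning_rate=0.1, n_iter=5000):
    """Fit P(class = 1) = sigmoid(b0 + b1 * x) by gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        grad0 = grad1 = 0.0
        for xi, yi in zip(x, labels):
            error = sigmoid(b0 + b1 * xi) - yi   # predicted minus observed
            grad0 += error
            grad1 += error * xi
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1

def predict_probability(b0, b1, x_new):
    return sigmoid(b0 + b1 * x_new)

# Made-up data: hours of practice and whether the person passed (1) or not (0)
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(hours, passed)
for h in (0.5, 4.0):
    print(f"{h} hours -> P(pass) = {predict_probability(b0, b1, h):.2f}")
```

A case is usually assigned to the class whose predicted probability exceeds .5; comparing those assignments with the actual classes shows how well the predictors capture group membership.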
Summary
• Correlations tell you about the relationship between 2 variables.
• Regressions allow you to predict an outcome variable; causal inferences additionally require an appropriate study design.
• Logistic regressions are good for predicting group membership.
• Next, I’ll tell you about some new developments in my world of stats, including more sophisticated regression techniques and a method to compare changes to data over time.
Informative Websites
• http://faculty.vassar.edu/lowry/VassarStats.html
• http://www.uwsp.edu/psych/stat/
• http://www.richland.edu/james/lecture/m170/
• http://calcnet.mth.cmich.edu/org/spss/index.htm for SPSS videos!
• Or, just Google the test you are interested in.