
Relationships

Jan 04, 2016


leilani-snyder

Page 2: Relationships

Relationships

• If we are doing a study which involves more than one variable, how can we tell if there is a relationship between two (or more) of the variables?

• Association Between Variables : Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the second variable than with other values of that variable.

• Response Variable : A response variable measures an outcome of a study.

• Explanatory Variable : An explanatory variable explains or causes changes in the response variable.

Page 3: Relationships

§2.1: Scatterplots

• A scatterplot shows the relationship between two variables.

• The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis.

• Always plot the explanatory variable on the horizontal axis, and the response variable on the vertical axis.

Example: If we are going to try to predict someone’s weight from their height, then the height is the explanatory variable, and the weight is the response variable.

• The explanatory variable is often denoted by the variable x, and is sometimes called the independent variable.

• The response variable is often denoted by the variable y, and is sometimes called the dependent variable.

Page 4: Relationships

Scatterplots

Example: Do you think that a father’s height would affect a son’s height?

That is, given a father’s height, can we make any determinations about the son’s height?

The explanatory variable is : The father’s height

The response variable is : The son’s height

Data Set (heights in inches) :

Father’s Height    Son’s Height
64                 65
68                 67
68                 70
70                 72
72                 75
74                 70
75                 73
75                 76
76                 77
77                 76

Page 5: Relationships

[Scatterplot of the data set: Explanatory Variable (Father’s Height) on the horizontal axis, Response Variable (Son’s Height) on the vertical axis; both axes run from 64 to 76.]


Page 7: Relationships

Examining A Scatterplot

• In any graph of data, look for the overall pattern and for striking deviations from that pattern.

• You can describe the overall pattern of a scatterplot by the form, direction, and strength of the relationship.

• An important kind of deviation is an outlier, an individual that falls outside the overall pattern of the relationship.

• Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.

• Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa.

• Strength : How closely the points follow a clear form.
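The definition of positive and negative association above can be checked by counting, for each individual, whether both values fall on the same side of their means. The following is a minimal sketch using the father and son heights from this transcript; the helper name `count_sides` is an illustrative assumption, not something from the slides.

```python
# Father/son heights in inches, taken from the data set in this transcript.
fathers = [64, 68, 68, 70, 72, 74, 75, 75, 76, 77]
sons = [65, 67, 70, 72, 75, 70, 73, 76, 77, 76]


def count_sides(xs, ys):
    """Count pairs lying on the same side of both means vs. opposite sides."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    same = opposite = 0
    for x, y in zip(xs, ys):
        product = (x - mean_x) * (y - mean_y)
        if product > 0:      # both above average or both below average
            same += 1
        elif product < 0:    # one above average, the other below
            opposite += 1
    return same, opposite


same, opposite = count_sides(fathers, sons)
print(same, opposite)  # 9 1
```

Nine of the ten pairs sit on the same side of both means, which is consistent with a positive association.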

Page 8: Relationships

Direction

Types of association between X and Y.

1. Two variables are positively associated if small values of X are associated with small values of Y, and if large values of X are associated with large values of Y.

There is an upward trend from left to right.

Page 9: Relationships

Positive Association

[Sketch: Y against X, with points rising from lower left to upper right.]

Page 10: Relationships

Direction

Types of association between X and Y.

2. Two variables are negatively associated if small values of one variable are associated with large values of the other variable, and vice versa.

There is a downward trend from left to right.

Page 11: Relationships

Negative Association

[Sketch: Y against X, with points falling from upper left to lower right.]

Page 12: Relationships

Form

Describe the type of trend between X and Y.

1. Linear - points fall close to a straight line.

Page 13: Relationships

Linear Association

[Sketch: Y against X, with points falling close to a straight line.]

Page 14: Relationships

Form

2. Quadratic - points follow a parabolic pattern.

Page 15: Relationships

Quadratic Association

[Sketch: Y against X, with points following a parabolic curve.]

Page 16: Relationships

Form

3. Exponential - points follow an exponential growth or decay curve.

Page 17: Relationships

Exponential Growth

[Sketch: Y against X, with points following an exponential growth curve.]

Page 18: Relationships

Strength

Measures the amount of scatter around the general trend.

Linear - The closer the points fall to a straight line, the stronger the relationship between the two variables.

Page 19: Relationships

Strong Association

[Sketch: Y against X, with points tightly clustered around the trend.]

Page 20: Relationships

Moderate Association

[Sketch: Y against X, with points moderately scattered around the trend.]

Page 21: Relationships

Weak Association

[Sketch: Y against X, with points widely scattered around the trend.]

Page 22: Relationships

Examining A Scatterplot

Consider the previous scatterplot :

[Scatterplot: Son (vertical axis) against Father (horizontal axis); both axes run from 64 to 76.]

Direction : Going up

Form : Linear

Association : Positive

Strength : Strong

Outliers : None

Page 23: Relationships

Example : The following is a scatterplot of data collected from states about students taking the SAT. The question is whether the percentage of students from a state that takes the test will influence the state’s average scores.

For instance, in California, 45% of high school graduates took the SAT and the mean verbal score was 495.

Direction : Downward

Form : Curved

Association : Negative

Strength : Strong

Outliers : Maybe

Page 24: Relationships

§2.2: Correlation

Recall that a scatterplot displays the form, direction, and strength of the relationship between two quantitative variables.

Linear relationships are important because they are the easiest to model, and are fairly common.

We say a linear relationship is strong if the points lie close to a straight line, and weak if the points are widely scattered around the line.

Correlation (r) measures the direction and the strength of the linear relationship between two quantitative variables.

The + / - sign denotes a positive or negative association.

The numeric value shows the strength. If the relationship is strong, then r will be close to 1 or -1. If it is weak, then r will be close to 0.

Page 25: Relationships

Correlation Examples

[Six scatterplot panels illustrating correlations of -0.99, 0.9, -0.7, 0.5, -0.3, and 0.]

Page 26: Relationships

Which has the better correlation ?

Page 27: Relationships

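Correlation

So, how do we find the correlation?

Suppose we have data on variables x and y for n individuals.

The means and standard deviations of the two variables are $\bar{x}$ and $s_x$ for the x-values, and $\bar{y}$ and $s_y$ for the y-values.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

Equivalently, in a form that is convenient for hand calculation:

$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}}$$

Question : Will outliers affect the correlation?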

Page 28: Relationships

Example: Recall the scatterplot data for the heights of fathers and their sons.

Father’s Height    Son’s Height
64                 65
68                 67
68                 70
70                 72
72                 75
74                 70
75                 73
75                 76
76                 77
77                 76

We decided that the father’s height was the explanatory variable and the son’s height was the response variable.

The average of the x terms is 71.9 and the standard deviation is 4.25.

The average of the y terms is 72.1 and the standard deviation is 4.07.

Page 29: Relationships

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

x_i    x_i - x̄    (x_i - x̄)/s_x    y_i    y_i - ȳ    (y_i - ȳ)/s_y
64     -7.9       -1.86            65     -7.1       -1.75
68     -3.9       -0.92            67     -5.1       -1.25
68     -3.9       -0.92            70     -2.1       -0.52
70     -1.9       -0.45            72     -0.1       -0.02
72      0.1        0.02            75      2.9        0.71
74      2.1        0.49            70     -2.1       -0.52
75      3.1        0.73            73      0.9        0.22
75      3.1        0.73            76      3.9        0.95
76      4.1        0.96            77      4.9        1.20
77      5.1        1.20            76      3.9        0.95

Page 30: Relationships

Multiplying the standardized x and y values for each father and son pair and summing:

$$r = \frac{1}{10-1}\left[(-1.86)(-1.75) + (-0.92)(-1.25) + \cdots + (1.20)(0.95)\right]$$

$$= \frac{1}{9}\left[3.24 + 1.14 + \cdots + 1.14\right] = 0.87$$
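The hand calculation above can be reproduced in a few lines. This is a minimal sketch of the standardized-value formula for r, assuming the ten father and son heights from this transcript; the function names are illustrative.

```python
import math

# Father/son heights in inches, taken from the data set in this transcript.
fathers = [64, 68, 68, 70, 72, 74, 75, 75, 76, 77]
sons = [65, 67, 70, 72, 75, 70, 73, 76, 77, 76]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


r = correlation(fathers, sons)
print(round(sample_sd(fathers), 2), round(sample_sd(sons), 2))  # 4.25 4.07
print(round(r, 2))  # 0.87
```

Because the code keeps full precision in the standardized values, individual products can differ in the second decimal from the rounded table entries, but the final r still rounds to 0.87.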

Page 31: Relationships

Shortcut Calculations

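$$\sum_{i=1}^{10} x_i = 719, \quad \sum_{i=1}^{10} y_i = 721, \quad \sum_{i=1}^{10} x_i^2 = 51859$$

$$\sum_{i=1}^{10} y_i^2 = 52133, \quad \sum_{i=1}^{10} x_i y_i = 51975$$

$$r = \frac{51975 - 10(71.9)(72.1)}{\sqrt{\left(51859 - 10(71.9)^2\right)\left(52133 - 10(72.1)^2\right)}} = 0.87$$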
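The shortcut calculation can be verified directly from the raw data. The sketch below assumes the same ten father and son heights and simply accumulates the five sums before applying the shortcut formula; variable names are illustrative.

```python
import math

fathers = [64, 68, 68, 70, 72, 74, 75, 75, 76, 77]
sons = [65, 67, 70, 72, 75, 70, 73, 76, 77, 76]

n = len(fathers)
sum_x = sum(fathers)
sum_y = sum(sons)
sum_x2 = sum(x * x for x in fathers)
sum_y2 = sum(y * y for y in sons)
sum_xy = sum(x * y for x, y in zip(fathers, sons))

mean_x = sum_x / n
mean_y = sum_y / n

# Shortcut form: numerator and denominator use only the five sums.
numerator = sum_xy - n * mean_x * mean_y
denominator = math.sqrt((sum_x2 - n * mean_x ** 2) * (sum_y2 - n * mean_y ** 2))
r_shortcut = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 719 721 51859 52133 51975
print(round(r_shortcut, 2))  # 0.87
```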

Page 32: Relationships

Facts about Correlation

Correlation makes no distinction between explanatory and response variables. The correlation between x and y is the same as the correlation between y and x.

Correlation requires that both variables be quantitative. We cannot compute a correlation between a categorical variable and a quantitative variable or between two categorical variables.

r does not change when we change the units of measurement of x, y, or both. The correlation between height and weight is the same whether height is measured in feet or centimeters and whether weight is measured in kilograms or pounds. This happens because all the observations are standardized in the calculation of correlation.

The correlation r itself has no unit of measurement; it is just a number.
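Two of these facts are easy to check numerically. The sketch below reuses the father and son heights purely as example numbers; converting the fathers' heights to centimeters (1 inch = 2.54 cm) and swapping the roles of x and y are illustrative choices, not part of the slides.

```python
import math

fathers = [64, 68, 68, 70, 72, 74, 75, 75, 76, 77]
sons = [65, 67, 70, 72, 75, 70, 73, 76, 77, 76]


def correlation(xs, ys):
    """Pearson correlation from sums of deviations about the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r_xy = correlation(fathers, sons)
r_yx = correlation(sons, fathers)                 # roles swapped
fathers_cm = [2.54 * x for x in fathers]          # inches -> centimeters
r_cm = correlation(fathers_cm, sons)

print(math.isclose(r_xy, r_yx))  # True
print(math.isclose(r_xy, r_cm))  # True
```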

Page 33: Relationships

Exercise

What’s wrong with these statements?

1. At AU there is no correlation between the ethnicity of students and their GPA.

2. The correlation between height and weight of stat202 students

(a) is 2.61

(b) is 0.61 inches per pound

(c) is 0.61, so the correlation between weight and height is -0.61

(d) is 0.61 using inches and pounds, but converting inches to centimeters would make r > 0.61 (since an inch equals about 2.54 centimeters).

Page 34: Relationships

§2.3: Least-Squares Regression

• Correlation measures the direction and strength of a straight-line (linear) relationship between two quantitative variables.

• We have tried to summarize the data by drawing a straight line through the data.

• A regression line summarizes the relationship between two variables.

• A regression line can be used in only one setting : when one variable helps explain or predict the other variable.

Page 35: Relationships

Regression Line

• A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x. Regression, unlike correlation, requires that we have an explanatory variable and a response variable.

• If a scatterplot displays a linear pattern, we can describe the overall pattern by drawing a straight line through the points.

• This is called fitting a line to the data.

• This is a mathematical model which we can use to make predictions based on the given data.

Page 36: Relationships

Example: Recall the data we were using before where we compared the heights of fathers and sons.

Father’s Height    Son’s Height
64                 65
68                 67
68                 70
70                 72
72                 75
74                 70
75                 73
75                 76
76                 77
77                 76

The first thing we did was to plot the points.

Page 37: Relationships

[Scatterplot: Son (vertical axis) against Father (horizontal axis), both axes from 64 to 76, with a line drawn through the points.]

The line which is closest to all the points is the regression line.

Page 38: Relationships

Least-Squares Regression Line

• The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
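The least-squares criterion can be illustrated by comparing the sum of squared vertical distances for the fitted line against a few slightly shifted lines. This is a sketch only: it uses the standard closed-form slope and intercept, and the nudges of 0.5 in the intercept and 0.05 in the slope are arbitrary choices for the comparison.

```python
fathers = [64, 68, 68, 70, 72, 74, 75, 75, 76, 77]
sons = [65, 67, 70, 72, 75, 70, 73, 76, 77, 76]


def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Closed-form least-squares intercept and slope."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


a, b = least_squares(fathers, sons)
best = sum_squared_errors(a, b, fathers, sons)

# Any other line should do worse; try a few nudged intercepts and slopes.
others = [sum_squared_errors(a + da, b + db, fathers, sons)
          for da in (-0.5, 0.5) for db in (-0.05, 0.05)]

print(round(a, 2), round(b, 4))        # 12.47 0.8293
print(all(best < o for o in others))   # True
```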

Page 39: Relationships

Equation of the Least-Squares Regression Line

• Imagine we have data on an explanatory variable x and a response y for n individuals.

• Assume the mean for the explanatory variable is $\bar{x}$ and the standard deviation is $s_x$.

• Assume the mean for the response variable is $\bar{y}$ and the standard deviation is $s_y$.

• Assume the correlation between x and y is r.

Page 40: Relationships

Equation of the Least-Squares Regression Line

• The equation of the least-squares regression line of y on x is :

$$\hat{y} = a + bx$$

where a is the intercept and b is the slope.

The slope is b : $b = r \frac{s_y}{s_x}$

The intercept is a : $a = \bar{y} - b\bar{x}$

Page 41: Relationships

Interpretation of Regression Coefficients

The Y-Intercept a is the value of the response variable, y, when the explanatory variable, x, is zero.

The Slope, b is the change in the response variable, y, for a unit increase in the explanatory variable, x.

Page 42: Relationships

Example: What if we want to find the least-squares regression line that predicts the son’s height from the father’s height?

Note: The father’s height is the explanatory variable, and the son’s height is the response variable.

We need the means and the standard deviations :

The average of the x terms is 71.9 and the standard deviation is 4.25

The average of the y terms is 72.1 and the standard deviation is 4.07

The correlation between the two variables is 0.87

Page 43: Relationships

The equation for the regression line is : $\hat{y} = a + bx$

$\bar{x} = 71.9$, $s_x = 4.25$, $\bar{y} = 72.1$, $s_y = 4.07$, $r = 0.87$

We need to find the slope :

$$b = r \frac{s_y}{s_x} = 0.87 \left(\frac{4.07}{4.25}\right) = 0.8331529$$

Next, find the intercept :

$$a = \bar{y} - b\bar{x} = 72.1 - (0.8331529)(71.9) \approx 12.1963$$

So, the equation for the regression line is :

$$\hat{y} = 12.2 + 0.83x$$
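The slope and intercept arithmetic can be checked with a short script. The sketch below plugs in the rounded summary statistics quoted on the slide (r = 0.87, means 71.9 and 72.1, standard deviations 4.25 and 4.07); the variable names are illustrative.

```python
# Rounded summary statistics from the slide.
r = 0.87
mean_x, s_x = 71.9, 4.25   # fathers (explanatory variable)
mean_y, s_y = 72.1, 4.07   # sons (response variable)

b = r * s_y / s_x          # slope
a = mean_y - b * mean_x    # intercept

print(round(b, 7))  # 0.8331529
print(round(a, 1), round(b, 2))  # 12.2 0.83
```

Because the inputs are already rounded, these coefficients differ slightly from a fit computed on the raw data.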

Page 44: Relationships

Making Predictions

We can use the regression line to make some predictions.

Example : Based on the previous data, we can predict the son’s height from the father’s height.

$\hat{y} = 12.2 + 0.83x$

Q: If the father’s height is 70 inches, what is our prediction for the son’s height?

A: $\hat{y} = 12.2 + 0.83(70) = 70.3$

Note: These predictions are only good on relevant data, for example father’s heights similar to those in the data set!
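A prediction from the fitted line is a one-line function. In this sketch the function name and the range check are assumptions added for illustration: the check reads "relevant data" as father's heights between the smallest and largest values in the data set (64 and 77 inches), which is one plausible interpretation of the note above.

```python
# Observed range of father's heights in the data set (inches).
MIN_FATHER, MAX_FATHER = 64, 77


def predict_son_height(father_height, intercept=12.2, slope=0.83):
    """Return (prediction, within_observed_range) for y-hat = a + b*x."""
    prediction = intercept + slope * father_height
    within_range = MIN_FATHER <= father_height <= MAX_FATHER
    return prediction, within_range


prediction, within_range = predict_son_height(70)
print(round(prediction, 1), within_range)  # 70.3 True

# A father's height far outside the data is flagged rather than trusted.
print(predict_son_height(40)[1])  # False
```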

Page 45: Relationships

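[Scatterplot: Son (vertical axis) against Father (horizontal axis), both axes from 64 to 76, with the regression line y = 12.2 + .83x drawn through the points.]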

Page 46: Relationships

Notes On Regression

$$b = r \frac{s_y}{s_x}$$

• This equation says that a change of one standard deviation in x corresponds to a change of r standard deviations in y.

• The point $(\bar{x}, \bar{y})$ is always on the regression line.

• The square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

Example: For the fathers’ and sons’ heights, $r^2 = (0.87)^2 = 0.7569$, so the straight-line relationship with the father’s height explains about 76% of the variation in the sons’ heights.
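The "fraction of variation explained" reading of r squared can be checked by comparing it with one minus the ratio of residual variation to total variation. This is a sketch on the same ten father and son pairs; with unrounded arithmetic the value comes out as 0.7525 rather than the 0.7569 obtained from the rounded r = 0.87.

```python
import math

fathers = [64, 68, 68, 70, 72, 74, 75, 75, 76, 77]
sons = [65, 67, 70, 72, 75, 70, 73, 76, 77, 76]

mean_x = sum(fathers) / len(fathers)
mean_y = sum(sons) / len(sons)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fathers, sons))
s_xx = sum((x - mean_x) ** 2 for x in fathers)
s_yy = sum((y - mean_y) ** 2 for y in sons)   # total variation in y

b = s_xy / s_xx
a = mean_y - b * mean_x
r = s_xy / math.sqrt(s_xx * s_yy)

# Variation left over after fitting the line (sum of squared residuals).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(fathers, sons))
explained_fraction = 1 - sse / s_yy

print(round(r ** 2, 4), round(explained_fraction, 4))  # 0.7525 0.7525
```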

Page 47: Relationships

Residuals Analysis

• A regression line is a mathematical model for the overall pattern of a linear relationship between an explanatory variable, and a response variable.

• The regression line is chosen so that the sum of the squares of the vertical distances from the points to the line is as small as possible.

• A residual is the difference between an observed value of the response variable, and the value predicted by the regression line.

Residual = observed y - predicted y

Residual = $y - \hat{y}$

Page 48: Relationships

Example of Residuals

Go back to our favorite example :

Father’s Height    Son’s Height
64                 65
68                 67
68                 70
70                 72
72                 75
74                 70
75                 73
75                 76
76                 77
77                 76

[Chart: scatterplot of the data (Series1) with its fitted linear trend line, y = 0.8293x + 12.47, R2 = 0.7525; horizontal axis from 60 to 80, vertical axis from 64 to 78.]

Page 49: Relationships

Example of Residuals

Go back to our favorite example, the heights of fathers and sons.

We found the regression line to be: $\hat{y} = 12.47 + 0.8293x$

(This line is fitted from the unrounded data, so its coefficients differ slightly from the hand calculation $\hat{y} = 12.2 + 0.83x$, which used rounded values of r, $s_x$, and $s_y$.)

So, when the father’s height is 64 inches, we expect the son to be how tall?

$\hat{y} = 12.47 + (0.8293)(64) = 65.5452$

However, the actual height of the son is 65 inches, so the residual is : 65 - 65.5452 = -0.5452

This tells us the point is 0.5452 units below the regression line.

Page 50: Relationships

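Predicted heights from $\hat{y} = 12.47 + 0.8293x$, with Residual = $y - \hat{y}$ :

Son’s Height    Predicted Height    Residual
65              65.5452             -0.5452
67              68.8624             -1.8624
70              68.8624              1.1376
72              70.521               1.479
75              72.1796              2.8204
70              73.8382             -3.8382
73              74.6675             -1.6675
76              74.6675              1.3325
77              75.4968              1.5032
76              76.3261             -0.3261

Average = 0.0037678 (not exactly zero only because the slope and intercept were rounded)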

• Again, we could have drawn our line anywhere on the graph

• The least squares regression line has the property that the mean of the least-squares residuals is always zero!
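The zero-mean property can be confirmed numerically. The sketch below fits the line from the raw father and son heights without rounding the coefficients and averages the residuals; it also recomputes two residuals with the rounded coefficients 12.47 and 0.8293 to match the table. Function and variable names are illustrative.

```python
fathers = [64, 68, 68, 70, 72, 74, 75, 75, 76, 77]
sons = [65, 67, 70, 72, 75, 70, 73, 76, 77, 76]


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    return mean_y - b * mean_x, b


a, b = fit_line(fathers, sons)

# Residual = observed y - predicted y, using unrounded coefficients.
residuals = [y - (a + b * x) for x, y in zip(fathers, sons)]
mean_residual = sum(residuals) / len(residuals)
print(abs(mean_residual) < 1e-9)  # True

# With the rounded coefficients from the slide, residuals match the table.
rounded_residuals = [round(y - (12.47 + 0.8293 * x), 4)
                     for x, y in zip(fathers, sons)]
print(rounded_residuals[0], rounded_residuals[5])  # -0.5452 -3.8382
```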

Page 51: Relationships

Residual Plot

• A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of the regression line.

• The regression line shows the overall pattern of the data. So, the residual plot should *not* have a pattern.

Page 52: Relationships

Residual Plot

Example : What does this residual plot show us?

This indicates the relationship between the explanatory variable and the response variable is curved, and not linear.

Regression should not be used in this example.

Page 53: Relationships

Residual Plot

Example : What does this residual plot show us?

This shows that the variation of the response variable about the line increases as the explanatory variable increases.

The predictions for y will be better on the less variable part than on the more variable part.

Page 54: Relationships

Residual Plot

Here is what our residual plot would look like :

[Residual plot: residuals on the vertical axis (from -4 to 4) against father’s height on the horizontal axis (from 64 to 76).]

Page 55: Relationships

Outliers and Influential Observations

Consider our favorite example :

Father’s Height    Son’s Height
64                 65
68                 67
68                 70
70                 72
72                 75
74                 70
75                 73
75                 76
76                 77
77                 76

What happens if we add in an outlier?

Father’s Height    Son’s Height
64                 82

How does this change our results?

Page 56: Relationships

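[Chart: scatterplot of the data including the added point (Series1) with its fitted linear trend line, y = 0.2914x + 52.258, R2 = 0.0784; horizontal axis from 63 to 78, vertical axis from 64 to 82.]

Correlation & Scatterplot

• The correlation drops from 0.87 to 0.28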
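The effect of the added point on the correlation can be reproduced directly. This sketch computes r for the original ten pairs and again after appending the outlier (64, 82) from the slide; the `correlation` helper is an illustrative implementation.

```python
import math

fathers = [64, 68, 68, 70, 72, 74, 75, 75, 76, 77]
sons = [65, 67, 70, 72, 75, 70, 73, 76, 77, 76]


def correlation(xs, ys):
    """Pearson correlation from sums of deviations about the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r_before = correlation(fathers, sons)
# Append the outlier: a 64-inch father with an 82-inch son.
r_after = correlation(fathers + [64], sons + [82])

print(round(r_before, 2), round(r_after, 2))  # 0.87 0.28
```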

Page 57: Relationships

Outliers and Influential Observations

Consider the following scatterplot :

There is one outlier :

Page 58: Relationships

Outliers and Influential Observations

• An outlier has a large residual from the regression line.

• This could be called an outlier in the y direction.

• If you look at the previous picture, there is an “outlier” in the x direction.

This is called an influential observation.

Page 59: Relationships

Outliers and Influential Observations

• An influential observation is a score which is far from the other data points in the x direction.

• Note that it should still be close to the regression line, otherwise we would label it an outlier.

• An influential observation is a score that is extreme in the x direction with no other points around it.

• Such a value will pull the regression line toward itself.

Page 60: Relationships

Outliers and Influential Observations

Page 61: Relationships

Outliers and Influential Observations

So, an observation is influential if removing it would markedly change the result of the calculation.

Page 62: Relationships

Influential Observations

Q: How can we check data for influential observations?

A1 : Residuals?

A2 : Scatterplot? Sort of.

A3 : Remove the point and see what happens?
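Answer A3 can be turned into a simple leave-one-out check: refit the line with each point removed in turn and see which removal changes the slope the most. The sketch below is one possible way to do this, using the eleven points from the outlier example (the ten original pairs plus (64, 82)); using the change in slope as the measure of influence is an assumption made for illustration.

```python
# Ten original father/son pairs plus the added point (64, 82).
fathers = [64, 68, 68, 70, 72, 74, 75, 75, 76, 77, 64]
sons = [65, 67, 70, 72, 75, 70, 73, 76, 77, 76, 82]


def slope(xs, ys):
    """Least-squares slope of y on x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return s_xy / s_xx


full_slope = slope(fathers, sons)

# Remove each point in turn and record how much the slope moves.
changes = []
for i in range(len(fathers)):
    xs = fathers[:i] + fathers[i + 1:]
    ys = sons[:i] + sons[i + 1:]
    changes.append((abs(slope(xs, ys) - full_slope), fathers[i], sons[i]))

largest_change, x_out, y_out = max(changes)
print(round(full_slope, 4))                 # 0.2914
print(x_out, y_out, round(largest_change, 2))  # 64 82 0.54
```

Removing the added point restores the slope to about 0.83, a far larger change than removing any of the original ten pairs.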