Jan 26, 2015
DATA ANALYSIS – TESTING FOR ASSOCIATION
Relationship:
A consistent and systematic link between two or more variables
When interpreting the relationship between variables, the following aspects are taken into account:
1. Whether two or more variables are related at all, i.e. whether a relationship is present, judged by the concept of statistical significance
2. If a relationship is present, its direction, which can be either positive or negative
3. The strength of the association
4. The type of relationship
Difference between Univariate and Bivariate

Univariate data:
• involves a single variable
• does not deal with causes or relationships
• the major purpose of univariate analysis is to describe
• central tendency: mean, mode, median
• dispersion: range, variance, max, min, quartiles, standard deviation
• frequency distributions
• bar graph, histogram, pie chart, line graph, box-and-whisker plot
• Sample question: How many of the students in the freshman class are female?

Bivariate data:
• involves two variables
• deals with causes or relationships
• the major purpose of bivariate analysis is to explain
• analysis of two variables simultaneously
• correlations
• comparisons, relationships, causes, explanations
• tables where one variable is contingent on the values of the other variable
• independent and dependent variables
• Sample question: Is there a relationship between the number of females in Computer Programming and their scores in Mathematics?
1) Whether a relationship is present (the concept of statistical significance)
The first question is whether a relationship exists between two or more variables.
If we test for statistical significance and find that it exists, then a relationship is said to be present.
Stated another way, knowledge about the behavior of one variable allows us to make a useful prediction about the behavior of another.
For example:
If we found a statistically significant relationship between perceptions of the quality of Santa Fe Grill food and satisfaction, we would say a relationship is present and that perceptions of the quality of the food tell us what the perceptions of satisfaction are likely to be.
2) If a relationship is present, its direction, which can be either positive or negative
Establishing the presence of a relationship precedes determining its direction.
The direction of a relationship can be either positive or negative.
For example:
Using the Santa Fe Grill example, a positive relationship exists if respondents who rate the quality of food high are also highly satisfied. Similarly, a negative relationship exists if respondents say the speed of service is slow (low rating) but they are still satisfied (high rating).
3) Understanding strength of association
In general, the strength of association is categorized as:
a. Nonexistent
b. Weak
c. Moderate
d. Strong
If a consistent and systematic relationship is not present, then the strength of association is nonexistent.
A weak association means there is a low probability that the variables have a consistent and systematic relationship.
A strong association means there is a high probability that a consistent and systematic relationship exists.
4) Type of relationship
If two variables can be described as related, the next question is "What is the nature of the relationship?", that is, how can the link between variables Y and X best be described?
There are a number of different ways in which two variables (X and Y) can share a relationship.
To answer the above questions, the following statistical methodologies are applied:
a. Covariation
b. Chi-square test
c. Correlation coefficient
   1. Pearson correlation coefficient
   2. Coefficient of determination
   3. Spearman rank order correlation coefficient
d. Regression analysis
COVARIATION:
Covariation is defined as the amount of change in one variable that is consistently related to the change in another variable of interest, or the degree of association between two items/variables.
For example:
If we know DVD purchases are related to age, then we want to know the extent to which younger persons purchase more DVDs and, ultimately, which types of DVDs.
If two variables are found to change together on a reliable or consistent basis, then we can use that information to make predictions as well as decisions on advertising and marketing strategies.
For example:
Change in attitude towards a Starbucks coffee advertising campaign as it varies between light, medium and heavy consumers of Starbucks coffee.
SCATTER PLOTS AND CORRELATION
A scatter plot (or scatter diagram) is used to show the relationship between two variables.
[Figure: scatter plot examples (y against x) showing linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]
[Figure: scatter plot of lung capacity (y-axis, 20 to 50) against smoking (x-axis, -10 to 30)]
One easy way to visually describe covariation between two variables is a SCATTER DIAGRAM, which is a graphic plot of the relative position of two variables, using a horizontal and a vertical axis to represent the values of the respective variables.
• We can see easily from the graph that as smoking goes up, lung capacity tends to go down.
• The two variables covary in opposite directions.
• We now examine two statistics, covariance and correlation, for quantifying how variables covary.
Smoking and Lung Capacity
Cigarettes (X) Lung Capacity (Y)
0 45
5 42
10 33
15 31
20 29
The formula for calculating the covariance of sample data is as follows:
$$ \mathrm{COV}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$
where:
x = the independent variable
y = the dependent variable
n = number of data points in the sample
\bar{x} = the mean of the independent variable x
\bar{y} = the mean of the dependent variable y
Example: To understand how covariance is used, consider four paired observations of the rate of economic growth (xi) and the rate of return on the S&P 500 (yi).
Using the covariance formula, you can determine whether economic growth and S&P 500 returns have a positive or inverse relationship.
Before you compute the covariance, calculate the mean of x and y
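A) Now you can identify the variables for the covariance formula as follows:
x = 2.1, 2.5, 4.0, and 3.6 (economic growth)
y = 8, 12, 14, and 10 (S&P 500 returns)
\bar{x} = 3.1
\bar{y} = 11
B) Substitute these values into the covariance formula to determine the relationship between economic growth and S&P 500 returns:
$$ \mathrm{COV}(x, y) = \frac{(2.1 - 3.1)(8 - 11) + (2.5 - 3.1)(12 - 11) + (4.0 - 3.1)(14 - 11) + (3.6 - 3.1)(10 - 11)}{4 - 1} = \frac{4.6}{3} \approx 1.53 $$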
Interpretation :
The covariance between the returns of the S&P 500 and economic growth is 1.53.
Since the covariance is positive, the variables are positively related—they move together in the same direction
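The calculation above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name sample_covariance is an illustrative choice, and it assumes the sample (n - 1) form of the formula, which is the form that reproduces the 1.53 quoted here. The smoking and lung capacity table is included as a second data set to show a negative covariance.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


# Economic growth (x) and S&P 500 returns (y) from the example
growth = [2.1, 2.5, 4.0, 3.6]
sp500 = [8, 12, 14, 10]

# Cigarettes (x) and lung capacity (y) from the smoking table
cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

cov_growth = sample_covariance(growth, sp500)
cov_smoking = sample_covariance(cigarettes, lung_capacity)

print(round(cov_growth, 2))   # 1.53  (positive: the variables move together)
print(round(cov_smoking, 2))  # -53.75 (negative: they move in opposite directions)
```

The sign gives the direction of the relationship, but the magnitude depends on the units of x and y, which is why the correlation coefficient is introduced next.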
Correlation:
Correlation is another way to determine how two variables are related.
In addition to telling you whether variables are positively or inversely related, correlation also tells you the degree to which the variables tend to move together.
Correlation standardizes the measure of interdependence between two variables and, consequently, tells you how closely the two variables move.
The correlation measurement, called the Pearson correlation coefficient, always takes a value between –1 and 1.
A) If the correlation coefficient is 1
The variables have a perfect positive correlation.
This means that if one variable moves a given amount, the second moves proportionally in the same direction.
A positive correlation coefficient less than 1 indicates a less than perfect positive correlation, with the strength of the correlation growing as the number approaches 1.
B) If the correlation coefficient is zero
No linear relationship exists between the variables.
If one variable moves, you can make no linear prediction about the movement of the other variable; they are uncorrelated.
C) If the correlation coefficient is –1
The variables are perfectly negatively correlated (or inversely correlated) and move in opposition to each other.
If one variable increases, the other variable decreases proportionally.
A negative correlation coefficient greater than –1 indicates a less than perfect negative correlation, with the strength of the correlation growing as the number approaches –1.
To calculate the correlation coefficient for two variables, you would use the correlation formula:
$$ r(x, y) = \frac{\mathrm{COV}(x, y)}{s_x \, s_y} $$
where:
r(x, y) = correlation of the variables x and y
COV(x, y) = covariance of the variables x and y
s_x = sample standard deviation of the random variable x
s_y = sample standard deviation of the random variable y
To calculate correlation, you must know the covariance for the two variables and the standard deviations of each variable
From the earlier example, you know that the covariance of S&P 500 returns and economic growth was calculated to be 1.53
Now you need to determine the standard deviation of each of the variables
You would calculate the standard deviation of the S&P 500 returns and the economic growth
Using the information from above, you know that
COV(x, y) = 1.53
s_x = 0.90
s_y = 2.58
Now calculate the correlation coefficient by substituting the numbers above into the correlation formula:
$$ r(x, y) = \frac{1.53}{0.90 \times 2.58} \approx 0.66 $$
A correlation coefficient of .66 tells you two important things:
• Because the correlation coefficient is a positive number, returns on the S&P 500 and economic growth are positively related.
• Because .66 is relatively far from zero (the value indicating no correlation), the correlation between returns on the S&P 500 and economic growth is fairly strong.
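The same result can be reproduced from the raw data rather than from the rounded intermediate values. The sketch below is illustrative only (the helper names sample_covariance, sample_std and pearson_r are not from the slides) and assumes sample statistics with n - 1 in the denominator, consistent with the covariance of 1.53.

```python
import math


def sample_covariance(x, y):
    """Sample covariance with n - 1 in the denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


def sample_std(values):
    """Sample standard deviation with n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    return sample_covariance(x, y) / (sample_std(x) * sample_std(y))


growth = [2.1, 2.5, 4.0, 3.6]   # economic growth (x)
sp500 = [8, 12, 14, 10]         # S&P 500 returns (y)

s_x = sample_std(growth)
s_y = sample_std(sp500)
r = pearson_r(growth, sp500)

print(f"s_x={s_x:.2f} s_y={s_y:.2f} r={r:.2f}")  # s_x=0.90 s_y=2.58 r=0.66
```

Dividing by the two standard deviations removes the units, so r always falls between -1 and 1 regardless of how x and y are measured.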
Coefficient of determination:
The coefficient of determination is the proportion of variability in one measure that is explained by the other measure.
It is the square of the Pearson correlation coefficient, written r^2.
For example, if the correlation coefficient between two variables is r = 0.90, the coefficient of determination is (0.90)^2 = 0.81.
This number ranges from 0.00 to 1.00 and shows the proportion of variation in one variable that is explained or accounted for by the other.
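As a quick illustration, the coefficient of determination can be computed for the smoking and lung capacity table shown earlier. This is a sketch under the assumption that the five rows of that table are the whole sample; the function name pearson_r is illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from sums of squared deviations (the n - 1 factors cancel)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

r = pearson_r(cigarettes, lung_capacity)
r_squared = r ** 2  # coefficient of determination

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")  # r = -0.96, r^2 = 0.92
```

The sign of r is lost when squaring: r^2 says how much of the variation is accounted for, not the direction of the relationship.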
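Spearman rank order correlation coefficient:
A statistical measure of association between two variables where both have been measured using ordinal (rank order) scales. It is the Pearson correlation computed on the ranks, so it captures monotonic rather than strictly linear association.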
Example :
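One way to sketch the calculation is to rank each variable and then apply the Pearson formula to the ranks. The code below is an illustration only: it reuses the smoking and lung capacity table from earlier (which is not ordinal data) simply because those values are available, and the helper names average_ranks and spearman_rho are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation from sums of squared deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

rho = spearman_rho(cigarettes, lung_capacity)
print(rho)  # -1.0
```

Here the result is exactly -1 because lung capacity falls every time smoking rises, even though the points do not lie on a perfectly straight line.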
INTRODUCTION TO REGRESSION ANALYSIS
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
SIMPLE LINEAR REGRESSION MODEL
Only one independent variable, x
Relationship between x and y is described by a linear function
Changes in y are assumed to be caused by changes in x
TYPES OF REGRESSION MODELS
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
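POPULATION LINEAR REGRESSION
The population regression model:
$$ y = \beta_0 + \beta_1 x + \varepsilon $$
where:
y = dependent variable
\beta_0 = population y intercept
\beta_1 = population slope coefficient
x = independent variable
\varepsilon = random error term, or residual
\beta_0 + \beta_1 x is the linear component and \varepsilon is the random error component.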
LINEAR REGRESSION ASSUMPTIONS
• Error values (ε) are statistically independent
• Error values are normally distributed for any given value of x
• The probability distribution of the errors is normal
• The probability distribution of the errors has constant variance
• The underlying relationship between the x variable and the y variable is linear
POPULATION LINEAR REGRESSION (continued)
[Figure: the population regression line y = β0 + β1x + ε plotted as y against x, with intercept β0 and slope β1, marking the observed value of y for xi, the predicted value of y for xi, and the random error εi for this x value]
ESTIMATED REGRESSION MODEL
The sample regression line provides an estimate of the population regression line:
$$ \hat{y}_i = b_0 + b_1 x $$
where:
\hat{y}_i = estimated (or predicted) y value
b_0 = estimate of the regression intercept
b_1 = estimate of the regression slope
x = independent variable
The individual random error terms ei have a mean of zero
LEAST SQUARES CRITERION
b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:
$$ \sum e^2 = \sum (y - \hat{y})^2 = \sum \bigl(y - (b_0 + b_1 x)\bigr)^2 $$
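THE LEAST SQUARES EQUATION
The formulas for b1 and b0 are:
$$ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$
algebraic equivalent:
$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} $$
and
$$ b_0 = \bar{y} - b_1 \bar{x} $$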
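The two forms of the slope formula can be compared numerically. The sketch below is not from the slides: it borrows the smoking and lung capacity table as a small data set, treating cigarettes as x and lung capacity as y purely for illustration, and the function names fit_line and slope_algebraic are hypothetical.

```python
def fit_line(x, y):
    """Least squares intercept b0 and slope b1 using the deviation form."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    return b0, b1


def slope_algebraic(x, y):
    """Slope b1 using the algebraically equivalent sums-of-products form."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    return (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)


cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

b0, b1 = fit_line(cigarettes, lung_capacity)
b1_alt = slope_algebraic(cigarettes, lung_capacity)

print(round(b0, 2), round(b1, 2), round(b1_alt, 2))  # 44.6 -0.86 -0.86
```

Both forms give the same slope; the deviation form is usually the numerically safer choice when the x values are large, because the algebraic form subtracts two large, nearly equal sums.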
INTERPRETATION OF THE SLOPE AND THE INTERCEPT
b0 is the estimated average value of y when the value of x is zero
b1 is the estimated change in the average value of y as a result of a one-unit change in x
FINDING THE LEAST SQUARES EQUATION
The coefficients b0 and b1 will usually be found using computer software, such as Excel or Minitab
Other regression measures will also be computed as part of computer-based regression analysis
SIMPLE LINEAR REGRESSION EXAMPLE
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
A random sample of 10 houses is selected
Dependent variable (y) = house price in $1000s
Independent variable (x) = square feet
SAMPLE DATA FOR HOUSE PRICE MODEL
House Price in $1000s (y)    Square Feet (x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
REGRESSION USING EXCEL
Tools / Data Analysis / Regression
EXCEL OUTPUT
Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
The regression equation is:
house price = 98.24833 + 0.10977 (square feet)
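The coefficients reported by Excel can be reproduced directly from the ten sample houses. The following is a minimal sketch using the least squares formulas; the function names fit_line and predict_price are illustrative, and the 2,000 square foot house in the usage line is a hypothetical input, not a value from the sample.

```python
# Sample data for the house price model
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]   # x


def fit_line(x, y):
    """Least squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    return b0, b1


b0, b1 = fit_line(sqft, price)
print(f"house price = {b0:.5f} + {b1:.5f} (square feet)")
# house price = 98.24833 + 0.10977 (square feet)


def predict_price(square_feet):
    """Predicted price in $1000s; only meaningful within the observed size range."""
    return b0 + b1 * square_feet


# Hypothetical house of 2,000 square feet (inside the observed 1,100 to 2,450 range)
predicted = predict_price(2000)
print(f"{predicted:.2f}")  # 317.78, i.e. about $317,780
```

Predictions should stay within the range of sizes observed in the sample, since the fitted line says nothing reliable about houses far smaller or larger than those ten.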
GRAPHICAL PRESENTATION
House price model: scatter plot and regression line
[Figure: house price ($1000s, y-axis from 0 to 450) against square feet (x-axis from 0 to 3000), with the fitted line house price = 98.24833 + 0.10977 (square feet)]
Slope = 0.10977
Intercept = 98.248
INTERPRETATION OF THE INTERCEPT, b0
b0 is the estimated average value of y when the value of x is zero (if x = 0 is in the range of observed x values).
Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet.
INTERPRETATION OF THE SLOPE COEFFICIENT, b1
b1 measures the estimated change in the average value of y as a result of a one-unit change in x.
Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 ($1000) = $109.77 for each additional square foot of size.
LEAST SQUARES REGRESSION PROPERTIES
• The sum of the residuals from the least squares regression line is 0: $\sum (y - \hat{y}) = 0$
• The sum of the squared residuals is a minimum: $\sum (y - \hat{y})^2$ is minimized
• The simple regression line always passes through the mean of the y variable and the mean of the x variable
• The least squares coefficients are unbiased estimates of β0 and β1
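The first three properties can be checked numerically on any small data set. The sketch below again borrows the smoking and lung capacity table as x and y for illustration only; the comparison lines with a shifted intercept and a changed slope are arbitrary perturbations chosen to show that the fitted line has the smallest sum of squared residuals among the lines tried. Unbiasedness is a statement about repeated sampling and is not something a single sample can demonstrate.

```python
cigarettes = [0, 5, 10, 15, 20]        # x
lung_capacity = [45, 42, 33, 31, 29]   # y

n = len(cigarettes)
mean_x = sum(cigarettes) / n
mean_y = sum(lung_capacity) / n

# Least squares slope and intercept
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(cigarettes, lung_capacity))
      / sum((x - mean_x) ** 2 for x in cigarettes))
b0 = mean_y - b1 * mean_x


def sum_squared_residuals(intercept, slope):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(cigarettes, lung_capacity))


# Property 1: the residuals sum to zero (up to floating-point rounding)
residuals = [y - (b0 + b1 * x) for x, y in zip(cigarettes, lung_capacity)]
residual_sum = sum(residuals)

# Property 2: any other line tried here has a larger sum of squared residuals
sse_fit = sum_squared_residuals(b0, b1)
sse_steeper = sum_squared_residuals(b0, b1 - 0.05)
sse_shifted = sum_squared_residuals(b0 + 0.5, b1)

# Property 3: the fitted line passes through (mean of x, mean of y)
fitted_at_mean_x = b0 + b1 * mean_x

print(abs(residual_sum) < 1e-9)                         # True
print(sse_fit < sse_steeper and sse_fit < sse_shifted)  # True
print(abs(fitted_at_mean_x - mean_y) < 1e-9)            # True
```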
EXPLAINED AND UNEXPLAINED VARIATION
Total variation is made up of two parts:
$$ SST = SSE + SSR $$
(total sum of squares = sum of squares error + sum of squares regression)
$$ SST = \sum (y - \bar{y})^2 \qquad SSE = \sum (y - \hat{y})^2 \qquad SSR = \sum (\hat{y} - \bar{y})^2 $$
where:
\bar{y} = average value of the dependent variable
y = observed values of the dependent variable
\hat{y} = estimated value of y for the given x value
SST = total sum of squares: measures the variation of the yi values around their mean \bar{y}
SSE = error sum of squares: variation attributable to factors other than the relationship between x and y
SSR = regression sum of squares: explained variation attributable to the relationship between x and y
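These three sums of squares can be computed for the house price model and compared with the ANOVA section of the Excel output. The sketch below is illustrative (the variable names are not from the slides); it also derives R Square as SSR / SST and the standard error as the square root of SSE / (n - 2), which are the standard definitions for simple linear regression.

```python
import math

price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]   # x

n = len(price)
mean_x = sum(sqft) / n
mean_y = sum(price) / n

# Least squares fit
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
      / sum((x - mean_x) ** 2 for x in sqft))
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * x for x in sqft]

# Decomposition of the total variation
sst = sum((y - mean_y) ** 2 for y in price)                 # total
sse = sum((y - yhat) ** 2 for y, yhat in zip(price, fitted))  # unexplained
ssr = sum((yhat - mean_y) ** 2 for yhat in fitted)          # explained

r_square = ssr / sst
standard_error = math.sqrt(sse / (n - 2))

print(f"SST = {sst:.4f}")   # compare with Total SS in the Excel output
print(f"SSR = {ssr:.4f}")   # compare with Regression SS
print(f"SSE = {sse:.4f}")   # compare with Residual SS
print(f"R Square = {r_square:.5f}")
print(f"Standard Error = {standard_error:.5f}")
```

In this model about 58% of the variation in house prices is explained by square feet, which matches the R Square of 0.58082 in the Excel output.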
[Figure: explained and unexplained variation at a point (xi, yi), marking SST = Σ(yi − ȳ)², SSE = Σ(yi − ŷi)², and SSR = Σ(ŷi − ȳ)² relative to the regression line and the mean ȳ]
THANKS……