Data analysis test for association BY Prof Sachin Udepurkar

DATA ANALYSIS – TESTING FOR ASSOCIATION

Relationship :

A consistent and systematic link between two or more variables

When interpreting the relationship between variables, the following aspects are taken into account :

1. Whether two or more variables are related at all, i.e. whether a relationship is present, assessed via the concept of statistical significance

2. If a relationship is present, its direction, which can be either positive or negative

3. The strength of the association

4. The type of relationship

Difference between Univariate and Bivariate

Univariate Data

• involving a single variable
• does not deal with causes or relationships
• the major purpose of univariate analysis is to describe
• central tendency - mean, mode, median
• dispersion - range, variance, max, min, quartiles, standard deviation
• frequency distributions
• bar graph, histogram, pie chart, line graph, box-and-whisker plot
• Sample question: How many of the students in the freshman class are female?

Bivariate Data

• involving two variables
• deals with causes or relationships
• the major purpose of bivariate analysis is to explain
• analysis of two variables simultaneously
• correlations
• comparisons, relationships, causes, explanations
• tables where one variable is contingent on the values of the other variable
• independent and dependent variables
• Sample question: Is there a relationship between the number of females in Computer Programming and their scores in Mathematics?

1) To measure whether a relationship is present via the concept of statistical significance -

The question is whether a relationship exists between two or more variables

If we test for statistical significance and find that it exists, we say that a relationship is present

Stated another way, knowledge about the behavior of one variable allows us to make a useful prediction about the behavior of another

For example :

If we found a statistically significant relationship between perceptions of the quality of Santa Fe Grill food and satisfaction, we would say a relationship is present and that perceptions of the quality of food will tell us what the perceptions of satisfaction are likely to be

2) If a relationship is present, it is important to know its direction, which can be either positive or negative

Establishing the presence of a relationship precedes determining its direction

The direction of a relationship can be either positive or negative

For example :

Using the Santa Fe Grill example, we could say that a positive relationship exists if respondents who rate the quality of food high are also highly satisfied. Similarly, a negative relationship exists if respondents say the speed of service is slow (low rating) but they are still satisfied (high rating)

3) Understanding strength of association

In general, the strength of association is categorized as

a. Nonexistent
b. Weak
c. Moderate
d. Strong

If a consistent and systematic relationship is not present, then the strength of association is nonexistent

A weak association means there is a low probability that the variables have a consistent and systematic relationship

A strong association means there is a high probability that a consistent and systematic relationship exists

4) Type of relationship

If two variables can be described as related, we then ask: "What is the nature of the relationship?" How can the link between variables Y and X best be described?

There are a number of different ways in which two variables (X & Y) can share a relationship

To answer the above questions, the following statistical methods will be applied

a. Covariation

b. Chi Square Test

c. Correlation Coefficient
   1. Pearson correlation coefficient
   2. Coefficient of determination
   3. Spearman rank order correlation coefficient

d. Regression Analysis

COVARIATION :

Covariation is defined as the amount of change in one variable that is consistently related to the change in another variable of interest, i.e. the degree of association between two items/variables

For example :

If we know DVD purchases are related to age, then we want to know the extent to which younger persons purchase more DVDs and, ultimately, which types of DVDs

If two variables are found to change together on a reliable or consistent basis, then we can use that information to make predictions as well as decisions on advertising and marketing strategies

For example :

Change in attitude towards a Starbucks coffee advertising campaign as it varies between light, medium and heavy consumers of Starbucks coffee

SCATTER PLOTS AND CORRELATION

A scatter plot (or scatter diagram) is used to show the relationship between two variables

SCATTER PLOT EXAMPLES

[Figure: scatter plot panels (y against x) illustrating linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]

[Figure: scatter plot of Lung Capacity (vertical axis) against Smoking (horizontal axis)]

One easy way to visually describe covariation between two variables is a SCATTER DIAGRAM, a graphic plot of the relative position of two variables using a horizontal and a vertical axis to represent the values of the respective variables

• We can see easily from the graph that as smoking goes up, lung capacity tends to go down.

• The two variables covary in opposite directions.

• We now examine two statistics, covariance and correlation, for quantifying how variables covary.

Smoking and Lung Capacity

Cigarettes (X) Lung Capacity (Y)

0 45

5 42

10 33

15 31

20 29

The formula for calculating the covariance of sample data is as follows :

COV(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

x = the independent variable
y = the dependent variable
n = number of data points in the sample
\bar{x} = the mean of the independent variable x
\bar{y} = the mean of the dependent variable y
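As a minimal sketch of the sample covariance calculation (deviations from each mean, multiplied pairwise, summed, and divided by n - 1), the snippet below applies it to the Smoking and Lung Capacity table above. The function name `sample_covariance` is an illustrative choice, not something from the slides.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of paired deviations from the means, divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, with at least two points")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Smoking and lung capacity data from the table above
cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

cov_smoking = sample_covariance(cigarettes, lung_capacity)
print(cov_smoking)  # -53.75
```

The negative sign agrees with the scatter diagram: the two variables covary in opposite directions.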

Example : To understand how covariance is used, consider the following data, which describe the rate of economic growth (xi) and the rate of return on the S&P 500 (yi)

Using the covariance formula, you can determine whether economic growth and S&P 500 returns have a positive or inverse relationship.

Before you compute the covariance, calculate the mean of x and y

A) Now you can identify the variables for the covariance formula as follows

x = 2.1, 2.5, 4.0, and 3.6 (economic growth)
y = 8, 12, 14, and 10 (S&P 500 returns)
\bar{x} = 3.1
\bar{y} = 11

B) Substitute these values into the covariance formula to determine the relationship between economic growth and S&P 500 returns.

Interpretation :

The covariance between the returns of the S&P 500 and economic growth is 1.53.

Since the covariance is positive, the variables are positively related—they move together in the same direction
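The substitution can be written out as a worked sketch, assuming the x and y values are paired in the order listed and using the sample (n - 1) form of the covariance with the rounded mean 3.1:

```latex
COV(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
          = \frac{(2.1 - 3.1)(8 - 11) + (2.5 - 3.1)(12 - 11) + (4.0 - 3.1)(14 - 11) + (3.6 - 3.1)(10 - 11)}{4 - 1}
          = \frac{3.0 - 0.6 + 2.7 - 0.5}{3}
          = \frac{4.6}{3}
          \approx 1.53
```

Using the unrounded mean of x (3.05) gives the same numerator, because the deviations of y from its mean sum to zero.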

Correlation :

Correlation is another way to determine how two variables are related.

In addition to telling you whether variables are positively or inversely related, correlation also tells you the degree to which the variables tend to move together

Correlation standardizes the measure of interdependence between two variables and, consequently, tells you how closely the two variables move.

The correlation measurement, called the correlation coefficient (here the Pearson correlation coefficient), always takes a value between –1 and 1 :

A) If the correlation coefficient is one

The variables have a perfect positive correlation.

This means that if one variable moves a given amount, the second moves proportionally in the same direction.

A positive correlation coefficient less than one indicates a less than perfect positive correlation, with the strength of the correlation growing as the number approaches one.

B) If the correlation coefficient is zero

No linear relationship exists between the variables

If one variable moves, you can make no linear prediction about the movement of the other variable; they are uncorrelated.

C) If correlation coefficient is –1

The variables are perfectly negatively correlated (or inversely correlated) and move in opposition to each other

If one variable increases, the other variable decreases proportionally

A negative correlation coefficient greater than –1 indicates a less than perfect negative correlation, with the strength of the correlation growing as the number approaches –1

To calculate the correlation coefficient for two variables, you would use the correlation formula :

r(x, y) = \frac{COV(x, y)}{s_x s_y}

r(x, y) = correlation of the variables x and y
COV(x, y) = covariance of the variables x and y
s_x = sample standard deviation of the random variable x
s_y = sample standard deviation of the random variable y

To calculate correlation, you must know the covariance for the two variables and the standard deviations of each variable

From the earlier example, you know that the covariance of S&P 500 returns and economic growth was calculated to be 1.53

Now you need to determine the standard deviation of each of the variables

You would calculate the standard deviation of the S&P 500 returns and the economic growth

Using the information from above, you know that

COV(x, y) = 1.53
s_x = 0.90
s_y = 2.58

Now calculate the correlation coefficient by substituting the numbers above into the correlation formula :

r(x, y) = \frac{1.53}{(0.90)(2.58)} \approx 0.66

A correlation coefficient of .66 tells you two important things:

• Because the correlation coefficient is a positive number, returns on the S&P 500 and economic growth are positively related.

• Because .66 is relatively far from zero (the value indicating no correlation), the correlation between returns on the S&P 500 and economic growth is strong
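The whole calculation, from raw data to the coefficient, can be sketched in a few lines. This is an illustrative helper (the name `pearson_r` is an assumption, not from the slides) that divides the sample covariance by the product of the two sample standard deviations, using the economic growth and S&P 500 figures from the covariance example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: sample covariance divided by the product of sample standard deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)


growth = [2.1, 2.5, 4.0, 3.6]   # economic growth (x)
sp500 = [8, 12, 14, 10]         # S&P 500 returns (y)

r = pearson_r(growth, sp500)
print(round(r, 2))  # 0.66
```

Because the covariance is divided by both standard deviations, the result is unit-free and always lies between -1 and 1.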

The coefficient of determination is the amount of variability in one measure that is explained by the other measure

It is the square of the Pearson correlation coefficient, written r^2

For example, if the correlation coefficient between two variables is r = 0.90, the coefficient of determination is (0.90)^2 = 0.81

This number ranges from .00 to 1.0 and shows the proportion of variation in one variable that is explained or accounted for by another
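Applied to the S&P 500 example, where the correlation coefficient was calculated as .66, the same squaring step gives the following (a worked illustration using the rounded value of r):

```latex
r^2 = (0.66)^2 = 0.4356 \approx 0.44
```

Read this way, roughly 44% of the variation in S&P 500 returns in that small sample is accounted for by economic growth, and the remainder is left to other factors.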

Spearman Rank Order correlation coefficient :

A statistical measure of monotonic association between two variables where both have been measured using ordinal (rank order) scales

Example :
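As an illustrative sketch (not taken from the original slides), the snippet below ranks the Smoking and Lung Capacity data from earlier and applies the standard no-ties formula r_s = 1 - 6 Σd² / (n(n² - 1)), where d is the difference between the paired ranks. The helper names `ranks` and `spearman_rho` are assumptions, and the shortcut formula is only valid when no values are tied.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank order correlation via the no-ties shortcut formula."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

rho = spearman_rho(cigarettes, lung_capacity)
print(rho)  # -1.0
```

The result is exactly -1 because the rank orders are perfectly reversed: every increase in smoking comes with a decrease in lung capacity, even though the points do not lie on a perfectly straight line.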

INTRODUCTION TO REGRESSION ANALYSIS

Regression analysis is used to:

• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain

Independent variable: the variable used to explain the dependent variable

SIMPLE LINEAR REGRESSION MODEL

Only one independent variable, x

Relationship between x and y is described by a linear function

Changes in y are assumed to be caused by changes in x

TYPES OF REGRESSION MODELS

Positive Linear Relationship

Negative Linear Relationship

Relationship NOT Linear

No Relationship

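POPULATION LINEAR REGRESSION

The population regression model:

y = \beta_0 + \beta_1 x + \varepsilon

y = dependent variable
\beta_0 = population y intercept
\beta_1 = population slope coefficient
x = independent variable
\varepsilon = random error term, or residual

\beta_0 + \beta_1 x is the linear component and \varepsilon is the random error component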

LINEAR REGRESSION ASSUMPTIONS

• Error values (ε) are statistically independent
• Error values are normally distributed for any given value of x
• The probability distribution of the errors is normal
• The probability distribution of the errors has constant variance
• The underlying relationship between the x variable and the y variable is linear

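POPULATION LINEAR REGRESSION (continued)

[Figure: the population regression line y = \beta_0 + \beta_1 x + \varepsilon plotted as y against x, with slope \beta_1 and intercept \beta_0; at a given xi, the random error \varepsilon_i is the vertical distance between the observed value of y and the predicted value of y]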

ESTIMATED REGRESSION MODEL

The sample regression line provides an estimate of the population regression line

\hat{y}_i = b_0 + b_1 x

\hat{y}_i = estimated (or predicted) y value
b_0 = estimate of the regression intercept
b_1 = estimate of the regression slope
x = independent variable

The individual random error terms ei have a mean of zero

LEAST SQUARES CRITERION

b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals

\sum e^2 = \sum (y - \hat{y})^2 = \sum (y - (b_0 + b_1 x))^2

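THE LEAST SQUARES EQUATION

The formulas for b1 and b0 are:

b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}

algebraic equivalent:

b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}

and

b_0 = \bar{y} - b_1 \bar{x}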
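A short sketch can confirm that the deviation form of the slope, b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², and its algebraic equivalent in raw sums give the same value, with b0 = ȳ - b1 x̄. It reuses the Smoking and Lung Capacity data from earlier; the function names are illustrative assumptions.

```python
def least_squares(x, y):
    """Return (b0, b1) using the deviation form of the least squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return b0, b1


def slope_from_sums(x, y):
    """Slope b1 from the algebraically equivalent raw-sums form."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    return (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)


cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

b0, b1 = least_squares(cigarettes, lung_capacity)
print(round(b0, 2), round(b1, 2))                      # 44.6 -0.86
print(round(slope_from_sums(cigarettes, lung_capacity), 2))  # -0.86
```

The raw-sums form is convenient for hand calculation from a table of x, y, xy and x² columns; the deviation form makes the link to covariance more visible.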

INTERPRETATION OF THE SLOPE AND THE INTERCEPT

b0 is the estimated average value of y when the value of x is zero

b1 is the estimated change in the average value of y as a result of a one-unit change in x

FINDING THE LEAST SQUARES EQUATION

The coefficients b0 and b1 will usually be found using computer software, such as Excel or Minitab

Other regression measures will also be computed as part of computer-based regression analysis

SIMPLE LINEAR REGRESSION EXAMPLE

A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)

A random sample of 10 houses is selected

Dependent variable (y) = house price in $1000s

Independent variable (x) = square feet

SAMPLE DATA FOR HOUSE PRICE MODEL

House Price in $1000s (y)    Square Feet (x)

245 1400

312 1600

279 1700

308 1875

199 1100

219 1550

405 2350

324 2450

319 1425

255 1700

REGRESSION USING EXCEL

Tools / Data Analysis / Regression

EXCEL OUTPUT

Regression Statistics

Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

ANOVA         df    SS            MS            F          Significance F
Regression    1     18934.9348    18934.9348    11.0848    0.01039
Residual      8     13665.5652    1708.1957
Total         9     32600.5000

               Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%
Intercept      98.24833        58.03348          1.69296    0.12892    -35.57720    232.07386
Square Feet    0.10977         0.03297           3.32938    0.01039    0.03374      0.18580

The regression equation is: house price = 98.24833 + 0.10977 (square feet)

GRAPHICAL PRESENTATION

House price model: scatter plot and regression line

[Figure: scatter plot of House Price ($1000s) against Square Feet, with the fitted regression line]

house price = 98.24833 + 0.10977 (square feet)

Slope = 0.10977

Intercept = 98.248

INTERPRETATION OF THE INTERCEPT, b0

b0 is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values)

Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet


INTERPRETATION OF THE SLOPE COEFFICIENT, b1

b1 measures the estimated change in the average value of Y as a result of a one-unit change in X

Here, b1 = .10977 tells us that the value of a house increases by .10977($1000) = $109.77, on average, for each additional one square foot of size
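To make the two interpretations concrete, the sketch below plugs a size into the fitted equation. The 2,000 square foot house is a hypothetical example (chosen to lie inside the observed range of 1,100 to 2,450 square feet), and `predict_price` is an illustrative name.

```python
B0 = 98.24833   # estimated intercept, in $1000s
B1 = 0.10977    # estimated slope, in $1000s per square foot


def predict_price(square_feet):
    """Predicted house price in $1000s from the fitted regression equation."""
    return B0 + B1 * square_feet


price_2000 = predict_price(2000)
print(round(price_2000, 2))  # 317.79, i.e. about $317,790

# One extra square foot changes the prediction by the slope, about $109.77
extra_foot = predict_price(2001) - predict_price(2000)
print(round(extra_foot * 1000, 2))  # 109.77
```

Predictions for sizes far outside the observed range would rest on an untested assumption that the same line continues to hold.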


LEAST SQUARES REGRESSION PROPERTIES

• The sum of the residuals from the least squares regression line is 0: \sum (y - \hat{y}) = 0
• The sum of the squared residuals, \sum (y - \hat{y})^2, is a minimum
• The simple regression line always passes through the mean of the y variable and the mean of the x variable
• The least squares coefficients are unbiased estimates of β0 and β1

EXPLAINED AND UNEXPLAINED VARIATION

Total variation is made up of two parts:

SST = SSE + SSR

Total Sum of Squares:        SST = \sum (y - \bar{y})^2
Sum of Squares Error:        SSE = \sum (y - \hat{y})^2
Sum of Squares Regression:   SSR = \sum (\hat{y} - \bar{y})^2

where:
\bar{y} = average value of the dependent variable
y = observed values of the dependent variable
\hat{y} = estimated value of y for the given x value


SST = total sum of squares
Measures the variation of the yi values around their mean \bar{y}

SSE = error sum of squares
Variation attributable to factors other than the relationship between x and y

SSR = regression sum of squares
Explained variation attributable to the relationship between x and y
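The decomposition can be checked numerically on the house price data. The sketch below fits the line by least squares and then computes SST = Σ(y - ȳ)², SSE = Σ(y - ŷ)² and SSR = Σ(ŷ - ȳ)²; variable names are illustrative, and the ratio SSR / SST is shown because it corresponds to the R Square value in the Excel output.

```python
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(price)
x_bar = sum(square_feet) / n
y_bar = sum(price) / n

# Least squares slope and intercept
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price))
      / sum((x - x_bar) ** 2 for x in square_feet))
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in square_feet]

sst = sum((y - y_bar) ** 2 for y in price)                        # total variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(price, predicted))  # unexplained
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)            # explained
r_squared = ssr / sst

# Approximately 32600.5, 18934.9348, 13665.5652 and 0.58082
print(round(sst, 4), round(ssr, 4), round(sse, 4), round(r_squared, 5))
```

These agree with the Total, Regression and Residual rows of the ANOVA table, and the explained and unexplained parts add up to the total.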

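[Figure: at a given Xi, the total deviation of the observed yi from the mean \bar{y} splits into an unexplained part (yi - \hat{y}_i) and an explained part (\hat{y}_i - \bar{y}); correspondingly SST = \sum (y_i - \bar{y})^2, SSE = \sum (y_i - \hat{y}_i)^2, SSR = \sum (\hat{y}_i - \bar{y})^2]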

THANKS……
