Reading Material #9 (Correlation Analysis)
Gabino P. Petilos, Ph.D.
FIC, EDRE 231, 2nd Sem 11-12
CORRELATION ANALYSIS
INTRODUCTION
It has been said that research is conducted in order to find relationships between or among variables. When factors or variables are related in some systematic pattern, so that a change in the value of one is associated with a concurrent change in the value of the other, we say that they are correlated. Thus, we know that ability level is correlated with academic performance based on our common observation that students belonging to a high ability level tend to show better academic performance while those belonging to a low ability level tend to show poor academic performance.
In statistics, we not only establish the existence of certain correlations but also measure the direction and the degree of correlation. Ideally, we want to know the correlation between two variables X and Y in a given population (Figure 1). The correlation between these variables is denoted by the symbol ρ (rho), called the population correlation coefficient. Since it is not always feasible to study the entire population, we attempt to describe the correlation between X and Y by drawing a random sample from the population. We denote the estimate of the parameter ρ by the sample correlation coefficient r.
If a sample is used to estimate the amount of correlation between two variables,
significance testing is called for to find out if the variables in the actual population are indeed
significantly related. For this reason, correlation analysis employs both descriptive statistics
as well as inferential statistics.
Correlation analysis is concerned with the linear relationship between two variables.
It aims to determine the direction (whether positive or negative) as well as the strength
(whether weak, moderate, or strong) of linear association between two variables. When
two variables vary in the same direction, we say that the variables are positively correlated.
For example, it has been shown that IQ and academic performance are positively correlated.
This means that a person who has high IQ would tend to have a good academic performance
in school and in turn a person's good academic performance is usually associated with his
high IQ. Other examples of variables which are positively correlated are:
Grade in Mathematics and Grade in Physics;
Work performance and Level of morale;
Number of hours spent in studying and Grades in mathematics.
On the other hand, when two variables vary in the opposite direction, the variables are said to be negatively correlated. Examples of variables which exhibit negative correlation are:
Academic achievement and Hours per week of watching TV
Time spent in typing practice and Number of typing errors
Absenteeism and Job satisfaction
Variables that are not linearly correlated have zero correlation. For instance, height of students and their ability level have a zero correlation. In this example, it does not make sense to associate a particular value of height with a particular ability level. As another example, there is zero correlation between size of shoes and level of income of bank managers!
The direction and strength of linear correlation between variables may be described using a statistical device called a “scatter plot” or “scatter diagram.” Examples of scatter plots are given in Figure 1. Here, the scatter plots from (a) to (c) illustrate a positive correlation between the two variables in varying strengths while (d) to (f) illustrate a negative correlation, also in varying strengths. The scatter plots in (a) and (d) illustrate a perfect correlation between the two variables while those of (g) and (h) illustrate a zero correlation.
A strong correlation between two variables can occur even when not all the points fall on the line of relationship, provided they are close to it. If the points are far from the line, the correlation is said to be weak (or low) [see graphs (c) and (f)].
When the points do not tend to follow the path of a straight line, the correlation is said to be zero. This is illustrated by the scatter plots in (g) and (h). Note that zero correlation between two variables does not necessarily mean that the variables are not related. In (g), for instance, there is zero linear correlation between the variables, yet they are related in a quadratic sense.
THE CORRELATION COEFFICIENT
As mentioned earlier, the scatter diagram is a visual device which is useful in
characterizing the direction and strength of linear correlation between two variables. The
direction of relationship is perhaps easy to discern in a scatter diagram. However,
interpretation of the strength of linear correlation using a scatter diagram is not easy since it
is open to various interpretations when viewed by different persons.
The correlation coefficient is another tool by which the direction and strength of linear correlation between two variables may be described. As a measure of correlation, the correlation coefficient ranges in value from -1.0 to +1.0. Thus if ρ (rho) represents the population correlation coefficient, then

-1.0 ≤ ρ ≤ +1.0

If ρ = 1.0, the variables are said to be perfectly correlated in a positive sense. If the value is -1.0, the variables are perfectly correlated in a negative sense. A value of ρ = 0 indicates a zero linear correlation between the two variables. Figure 2 illustrates the descriptive interpretation of the correlation coefficient.
The interpretation of a correlation coefficient also depends on the context in which it is used. A correlation coefficient of 0.72 between political affiliation and religion in social science research may be interpreted as high. The same value, however, may be interpreted as low when used as a measure of reliability or validity of standardized tests. Also, it is often tempting to say that an r value of 0.80 is twice as strong as an r value of 0.40. Such an interpretation is incorrect since the correlation scale is not ratio or interval but rather an ordinal one.
Another consideration in interpreting a correlation coefficient is when the value is 0. In general, a value of r = 0 does not mean that the variables are not related. As shown in Figure 1 (g), a value of 0 merely implies that there is no linear association between the two variables. Moreover, values of r that are different from 0 cannot be construed to mean that one variable causes the other; even if two variables are correlated, it does not follow that one of them causes the other.
One meaningful interpretation of r involves the concept of the coefficient of determination, which is denoted by r². This value gives us a measure of the amount of variation in one variable which can be attributed to the variation of the other variable and vice versa. Thus, if r = 0.91, then r² = 0.8281 or 82.81%, which means that 82.81% of the variation in one variable is accounted for by the variation of the other variable and vice versa. The coefficient of determination is a very important and useful concept in regression analysis.
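To make these ideas concrete, the sketch below computes Pearson's r and the coefficient of determination r² from first principles for a small set of paired scores. The data are invented purely for illustration and do not come from the reading material:

```python
import math

def pearson_r(x, y):
    """Pearson's product moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # covariation of X and Y
    sxx = sum((a - mx) ** 2 for a in x)                   # variation of X
    syy = sum((b - my) ** 2 for b in y)                   # variation of Y
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired scores (not from the reading material)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r2 = r ** 2  # coefficient of determination: share of variation accounted for
print(round(r, 4), round(r2, 4))  # r ≈ 0.7746, r² ≈ 0.6
```

Here r² ≈ 0.6 would be read as: about 60% of the variation in one variable is accounted for by the variation of the other.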
Testing the Significance of r
If the value of the correlation coefficient is obtained from sample data, the researcher would often want to know whether the variables are in fact related in the actual population from which the sample was drawn. The hypothesis of interest is about whether the population correlation coefficient is zero or not. Thus the following null hypothesis must be tested using the obtained sample correlation coefficient.

Null Hypothesis : Ho: ρ = 0 (There is no correlation between X and Y)
Alternative Hypothesis : Ha: ρ ≠ 0 (For a Non-directional Test)
 : Ha: ρ < 0 or ρ > 0 (For a Directional Test)
To test whether the obtained Pearson's r is significantly different from zero, a t-test could be used if N < 30 or a z-test if N ≥ 30. The test statistics are given below:

t = r√(N − 2) / √(1 − r²), with d.f. = N − 2 (Equation 1)

z = r√(N − 1) (Equation 2)

The computed statistic is then compared with the appropriate critical value; for instance, a computed z exceeding 1.96 in absolute value is significant at the α = .05 level of significance (two-tailed).
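Equations 1 and 2 can be sketched in code as follows. The r values and sample sizes below are hypothetical, and the critical values used in the comments (2.101 for d.f. = 18, 1.96 for the z-test) are the usual two-tailed values at α = .05:

```python
import math

def t_stat(r, n):
    """t statistic for testing Ho: rho = 0 when N < 30 (Equation 1); d.f. = N - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def z_stat(r, n):
    """z statistic for testing Ho: rho = 0 when N >= 30 (Equation 2)."""
    return r * math.sqrt(n - 1)

# Small sample: r = 0.60 with N = 20 (hypothetical)
t = t_stat(0.60, 20)
print(round(t, 3))  # ≈ 3.182; exceeds 2.101 (t critical, d.f. = 18), so significant

# Large sample: r = 0.35 with N = 50 (hypothetical)
z = z_stat(0.35, 50)
print(round(z, 3))  # ≈ 2.45; exceeds 1.96, so significant at α = .05 (two-tailed)
```

In both hypothetical cases the null hypothesis Ho: ρ = 0 would be rejected at the .05 level.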
The Pearson's product moment correlation is the most popular measure of correlation. However, as was pointed out earlier, this measure is appropriate only when both variables are measured in at least the interval scale. When the assumptions on the use of r are not met, it is not advisable to use the Pearson's r. Instead, we estimate the population correlation using other measures of correlation. The succeeding discussion considers other measures of correlation for cases where the scale of measurement is not interval or where one of the assumptions (normality and linearity) is violated.
OTHER MEASURES OF CORRELATION
The Spearman's Rank Order Correlation (r_s)

The Spearman's rho (r_s)ᵃ is a measure of correlation based on the differences between the ranks of the values of two variables X and Y. It is used when both variables are measured in at least the ordinal scale. The Spearman's rho is the nonparametric counterpart of the Pearson's r. Unlike the Pearson's r, this measure does not make assumptions about the normality of the distribution of the paired data.
The formula for computing the Spearman's r_s is given by

r_s = 1 − (6 Σd²) / [N(N − 1)(N + 1)] (Equation 3)

where d is the difference between the ranks of paired values of X and Y, and N is the total number of cases.
When ranking the data, “1” is usually treated as the lowest rank corresponding to the lowest score value of the variable, followed by “2” for the next higher score, etc. Thus, higher ranks correspond to higher scores while lower ranks correspond to lower scores. You have to adopt this rule of ranking because this is the convention used in analyzing ordinal data using nonparametric statistics. (Note: The same value of Σd² in Equation 3 is obtained if we assign rank 1 to the highest score instead of rank 1 to the lowest score. Check this!)
Another important rule that you should remember is the assignment of ranks for tied
scores. The rule is very simple: think of the scores as if they were distinct, get their ranks,
and assign the average of their ranks as the ranks of the tied scores. Let us illustrate these
rules by considering an example.
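The ranking rules above (rank 1 for the lowest score, average ranks for tied scores) and Equation 3 can be sketched as follows; the score lists are hypothetical:

```python
def ranks(scores):
    """Rank from lowest (rank 1) to highest, giving tied scores the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    r = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with the score at position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2  # average of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rs(x, y):
    """Spearman's r_s via Equation 3: 1 - 6*sum(d^2) / (N(N-1)(N+1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n - 1) * (n + 1))

print(ranks([3, 1, 4, 1, 5]))             # the tied 1s share rank (1 + 2) / 2 = 1.5
print(spearman_rs([10, 20, 30, 40, 50],
                  [12, 25, 33, 41, 58]))  # ranks agree perfectly, so r_s = 1.0
```

Note that when there are many ties, Equation 3 only approximates the Pearson correlation of the ranks.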
a We don’t use the symbol for rho since this is our symbol for the population correlation coefficient.
Note: When a statistical test IS NOT SIGNIFICANT, we accept the null hypothesis. Accepting the null hypothesis, however, does not mean that it (the null hypothesis) is true, because we only considered one sample out of the many possible samples from the population.
In our discussion of the Chi-square test, we mentioned that the test can be used to
establish independence of variables. When the null hypothesis is rejected using the Chi-
Square test, we conclude that the variables ARE NOT independent, which means that the
variables are correlated. The Chi-square value, however, does not give information as to the
strength of correlation between the variables.
We can estimate the strength of correlation by using the computed value of the Chi-square statistic (hence the term Chi-square based). These measures are described as crude measures because they are not as accurate or reliable as the Pearson-based measures (r, r_s, and r_pb).
Also, the technique employed here differs from that of the Pearson-based measures in the sense that one first establishes whether the variables are significantly correlated or not using the Chi-square statistic. The strength of correlation is computed only when the Chi-square test is significant. We outline below the computation of some Chi-square based measures of correlation.
A. Contingency Coefficient:

C = √( χ² / (χ² + N) )

where χ² is the computed Chi-square value and N is the grand total in the contingency table. This measure is used when the contingency table is a square table with at least three categories for each variable.
Example 4. The Chi-square value based on the contingency table below is 27.160 which is
significant at α = 0.05. Estimate the strength of correlation between interest in
sports and social class using the contingency coefficient.
Interest in     Social Class
Sports          Working   Middle   Upper   TOTAL
High               12       45       7       64
Moderate           24       40      21       85
Low                21       14      23       58
Total              57       99      51      207
Solution: The contingency table is a 3x3 square table. Hence the contingency coefficient is an appropriate measure of correlation. Using the values χ² = 27.160 and N = 207, we have

C = √( 27.160 / (27.160 + 207) ) = √( 27.160 / 234.160 ) ≈ 0.34

Thus the strength of correlation between interest in sports and social class is estimated to be about 0.34.
ᶜ Siegel, S. & Castellan, J. (1988). Nonparametric Statistics (2nd ed.). New York: McGraw-Hill Book Company.
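The χ² value and contingency coefficient of Example 4 can be reproduced with a short sketch; the observed frequencies are those of the table above:

```python
import math

def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)  # grand total
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (obs - exp) ** 2 / exp
    return chi2, n

# Observed frequencies from Example 4 (rows: High, Moderate, Low interest)
table = [[12, 45, 7],
         [24, 40, 21],
         [21, 14, 23]]

chi2, n = chi_square(table)
C = math.sqrt(chi2 / (chi2 + n))    # contingency coefficient
print(round(chi2, 3), round(C, 2))  # ≈ 27.160 and ≈ 0.34
```

The computed χ² agrees with the 27.160 quoted in Example 4 (up to rounding).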
B. Phi Coefficient:

φ = √( χ² / N )

where χ² is the computed Chi-square value and N is the grand total in the contingency table. The Phi-coefficient is used only for 2×2 contingency tables.
Example 6. A survey of 300 undergraduate and 100 graduate students from a large
university was conducted to determine their opinions on autonomous status of
colleges. The following contingency table was generated from the survey.
Opinion        Level of Education
               Undergrad   Graduate   Total
Favor             100         70       170
Not Favor         200         30       230
Total             300        100       400
Find out if there is a significant correlation between opinion and level of education at α = 0.05. If significant, estimate the strength of correlation.
Solution: We are given a 2×2 table, thus we can compute the χ² value using the formula

χ² = N(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)]

with d.f. = 1. Note that the expected frequencies are all greater than 5 (check this). Thus,

χ² = 400[(100)(30) − (200)(70)]² / [(170)(230)(300)(100)] = 41.262.
The critical value of χ² at α = 0.05 and d.f. = 1 is 3.84. Since the computed χ² is greater than the critical value, the null hypothesis is rejected, which means that the opinions of the students regarding the autonomous status of colleges are dependent on their level of education.
Using the Phi-coefficient, the estimated strength of correlation is

φ = √( 41.262 / 400 ) ≈ 0.32
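The 2×2 shortcut formula and the Phi-coefficient for Example 6 can be checked with this sketch; the cell labels a, b, c, d follow the usual row-by-row convention for a 2×2 table:

```python
import math

def chi_square_2x2(a, b, c, d):
    """Shortcut chi-square for a 2x2 table with cells a, b (top row) and c, d (bottom row)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Example 6: Favor (100 undergrad, 70 graduate), Not Favor (200 undergrad, 30 graduate)
a, b, c, d = 100, 70, 200, 30
chi2 = chi_square_2x2(a, b, c, d)
phi = math.sqrt(chi2 / (a + b + c + d))  # Phi-coefficient, valid only for 2x2 tables
print(round(chi2, 3), round(phi, 2))     # ≈ 41.262 and ≈ 0.32
```

Since 41.262 exceeds the critical value 3.84 (d.f. = 1, α = 0.05), the correlation is significant, and φ ≈ 0.32 estimates its strength.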