Biostatistics, statistical software VI.
Relationship between two continuous variables: correlation, linear regression, transformations.
Relationship between two discrete variables: contingency tables, test for independence.
Krisztina Boda PhD
Department of Medical Informatics, University of Szeged

Slide 2 Krisztina Boda INTERREG 2
Relationship between two continuous variables: correlation, linear regression, transformations.

Slide 3
Imagine that 6 students are given a battery of tests by a vocational guidance counsellor, with the results shown in the following table:
[Table: test scores of the 6 students]
Variables measured on the same individuals are often related to each other.

Slide 4
Let us draw a graph called a scattergram to investigate relationships. Scatterplots show the relationship between two quantitative variables measured on the same cases. In a scatterplot, we look for the direction, form, and strength of the relationship between the variables. The simplest relationship is linear in form and reasonably strong. Scatterplots also reveal deviations from the overall pattern.

Slide 5
Creating a scatterplot
When one variable in a scatterplot explains or predicts the other, place it on the x-axis.
Place the variable that responds to the predictor on the y-axis. If neither variable explains or responds to the other, it does not matter which axis you assign each one to.

Slide 6
Possible relationships: positive correlation, negative correlation, no correlation.

Slide 7
Describing a linear relationship with a number: the coefficient of correlation
Correlation is a numerical measure of the strength of a linear association. The formula for the coefficient of correlation treats x and y identically: there is no distinction between explanatory and response variable. Denoting the two samples by $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$, the coefficient of correlation can be computed according to the following formula:

$r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$

where $\bar{x}$ and $\bar{y}$ are the sample means.

Slide 8
Properties of r
The value of r is always between -1 and +1: $-1 \le r \le 1$. Either extreme indicates a perfect linear association.
a) If r is near +1 or -1, we say that we have high correlation.
b) If r = 1, we say that there is perfect positive correlation. If r = -1, we say that there is perfect negative correlation.
c) A correlation of zero indicates the absence of linear association. When there is no tendency for the points to lie on a straight line, we say that there is no correlation (r = 0) or that we have low correlation (r is near 0).

Slide 9
Effect of outliers
Even a single outlier can change the correlation substantially. Outliers can create an apparently strong correlation where none would be found otherwise, or hide a strong correlation by making it appear weak.
[Figure: scatterplots with and without outliers; r = -0.21, r = 0.74, r = 0.998, r = -0.26]

Slide 10
Two variables may be closely related and still have a small correlation if the form of the relationship is not linear.
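The properties of r described so far can be checked numerically. The following Python code is a minimal sketch, assuming the standard Pearson product-moment formula; the function name `pearson_r` and all data values are invented for illustration and do not come from the slides.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient from the product-moment formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, invented for illustration only.

# 1) Points lying exactly on a rising straight line: perfect positive correlation.
x = [1, 2, 3, 4, 5, 6]
y_line = [2 * xi + 1 for xi in x]
r_line = pearson_r(x, y_line)

# 2) Uncorrelated points, then the same points plus one distant outlier.
x_flat = [1, 2, 3, 4, 5]
y_flat = [3, 1, 2, 1, 3]
r_flat = pearson_r(x_flat, y_flat)
r_outlier = pearson_r(x_flat + [20], y_flat + [20])

# 3) A perfect but non-linear relationship (y = x^2 on a symmetric range).
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_sq = [xi ** 2 for xi in x_sym]
r_sq = pearson_r(x_sym, y_sq)

print(f"straight line:      r = {r_line:.3f}")
print(f"no outlier:         r = {r_flat:.3f}")
print(f"with one outlier:   r = {r_outlier:.3f}")
print(f"y = x^2, symmetric: r = {r_sq:.3f}")
```

In this sketch a single added point turns a correlation of zero into a strong one, and the quadratic data give r = 0 even though y is completely determined by x.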
[Figure: two scatterplots of non-linear relationships; r = 2.8E-15, r = 0.157]

Slide 11
Correlation and causation
A correlation between two variables does not show that one causes the other. Causation is a subtle concept, best demonstrated statistically by designed experiments.

Slide 12
Correlation by eye
http://onlinestatbook.com/stat_sim/reg_by_eye/index.html
This applet lets you estimate the regression line and guess the value of Pearson's correlation. Five possible values of Pearson's correlation are listed. One of them is the correlation for the data displayed in the scatterplot. Guess which one it is. To see the correct value, click on the "Show r" button.

Slide 13
When is a correlation high?
What is considered to be high correlation varies with the field of application. The statistician must decide when a sample value of r is sufficiently far from zero to reflect a correlation in the population.

Slide 14
Testing the significance of the coefficient of correlation
The statistician must decide when a sample value of r is far enough from zero to be significant, that is, when it is sufficiently far from zero to reflect a correlation in the population.
$H_0: \rho = 0$ (the correlation coefficient in the population is 0)
$H_a: \rho \ne 0$ (the correlation coefficient in the population is different from 0)
This test can be carried out by expressing the t statistic in terms of r. The following t statistic has n-2 degrees of freedom:

$t = \dfrac{r \sqrt{n-2}}{\sqrt{1-r^2}}$

Decision using a statistical table: if $|t| > t_{\alpha,n-2}$, the correlation is significant at level $\alpha$; we reject $H_0$ and state that the population correlation coefficient is different from 0. If $|t| < t_{\alpha,n-2}$, we do not reject $H_0$.

Slide 15
Example 1.
The correlation coefficient between math skill and language skill was found to be r = 0.9989. Is it significantly different from 0?
$H_0$: the correlation coefficient in the population is 0, $\rho = 0$.
$H_a$: the correlation coefficient in the population is different from 0.
Let's compute the test statistic:

$t = \dfrac{0.9989 \sqrt{6-2}}{\sqrt{1-0.9989^2}} = 42.6$

Degrees of freedom: df = 6-2 = 4. The critical value in the table is $t_{0.05,4} = 2.776$.
Because 42.6 > 2.776, we reject $H_0$ and claim that there is a significant linear correlation between the two variables at the 5% level.

Slide 16
Example 1, cont.

Slide 17
Example 2.
The correlation coefficient between math skill and retailing skill was found to be r = -0.9993. Is it significantly different from 0?
$H_0$: the correlation coefficient in the population is 0, $\rho = 0$.
$H_a$: the correlation coefficient in the population is different from 0.
Let's compute the test statistic:

$t = \dfrac{-0.9993 \sqrt{6-2}}{\sqrt{1-(-0.9993)^2}} = -53.42$

Degrees of freedom: df = 6-2 = 4. The critical value in the table is $t_{0.05,4} = 2.776$. Because |-53.42| = 53.42 > 2.776, we reject $H_0$ and claim that there is a significant linear correlation between the two variables at the 5% level.

Slide 18
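The test used in Examples 1 and 2 can be sketched in a few lines of Python. This is a minimal sketch, assuming the standard t statistic for a correlation coefficient, t = r·sqrt(n-2)/sqrt(1-r²), which reproduces the values 42.6 and 53.42 quoted in the examples. The function names are hypothetical, and the critical value 2.776 is taken from the slides rather than computed.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0; it has n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    if abs(r) >= 1:
        raise ValueError("the statistic is undefined for |r| = 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def is_significant(r, n, critical_value):
    """Two-sided decision: reject H0 when |t| exceeds the table value."""
    return abs(correlation_t_statistic(r, n)) > critical_value


# Critical value t(0.05, df=4) as quoted on the slides.
T_CRIT_005_DF4 = 2.776

t_example1 = correlation_t_statistic(0.9989, 6)   # math vs. language skill
t_example2 = correlation_t_statistic(-0.9993, 6)  # math vs. retailing skill

print(f"Example 1: t = {t_example1:.2f}, "
      f"reject H0: {is_significant(0.9989, 6, T_CRIT_005_DF4)}")
print(f"Example 2: t = {t_example2:.2f}, "
      f"reject H0: {is_significant(-0.9993, 6, T_CRIT_005_DF4)}")
```

The same two functions apply to any sample correlation and sample size, provided the matching critical value is looked up for n-2 degrees of freedom.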
Example 2., cont.

Slide 19
Example 3.
The correlation coefficient between math skill and theater skill was found to be r = -0.2157. Is it significantly different from 0?
$H_0$: the correlation coefficient in the population is 0, $\rho = 0$.
$H_a$: the correlation coefficient in the population is different from 0.
Let's compute the test statistic:

$t = \dfrac{-0.2157 \sqrt{6-2}}{\sqrt{1-(-0.2157)^2}} = -0.4418$

Degrees of freedom: df = 6-2 = 4. The critical value in the table is $t_{0.05,4} = 2.776$. Because |-0.4418| = 0.4418 < 2.776, we do not reject $H_0$: the linear correlation between the two variables is not significant at the 5% level.

Slide 20
Example 3., cont.

Slide 21
Prediction based on linear correlation: the linear regression
When the form of the relationship in a scatterplot is linear, we usually want to describe that linear form more precisely with numbers. We can rarely hope to find data values lined up perfectly, so we fit lines to scatterplots with a method that compromises among the data values. This method is called the method of least squares. The key to finding, understanding, and using least squares lines is an understanding of their failures to fit the data: the residuals.
A straight line that best fits the data, y = bx + a, is called the regression line.
Geometrical meaning of a and b:
b is called the regression coefficient; it is the slope of the best-fitting line (the regression line).
a is the y-intercept of the regression line.

Slide 22
Equation of the regression line for the data of Example 1: y = 1.016x + 15.5. The slope of the line is 1.016.
Prediction based on the equation: what is the predicted score for language for a student having 400 points in math?
$y_{\text{predicted}} = 1.016 \cdot 400 + 15.5 = 421.9$

Slide 23
How to get the formula for the line which is used to get the best point estimates

Slide 24
Computation of the correlation coefficient from the regression coefficient
There is a relationship between the correlation and the regression coefficient:

$r = b \, \dfrac{s_x}{s_y}$

where $s_x$ and $s_y$ are the standard deviations of the samples. From this relationship it can be seen that r and b have the same sign: if there is a negative correlation between the variables, the slope of the regression line is also negative. It can be shown that the same t-test can be used to test the significance of r and the significance of b.

Slide 25
Coefficient of determination
The square of the correlation coefficient multiplied by 100 is called the coefficient of determination. It shows the percentage of the total variation explained by the linear regression.
Example. The correlation between math aptitude and language aptitude was found to be r = 0.9989. The coefficient of determination is $r^2 = 0.9978$, so 99.78% of the total variation of Y is explained by its linear relationship with X.
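The links between the least squares line, the correlation coefficient and the coefficient of determination can be illustrated with a short Python sketch. It assumes the usual least squares formulas for slope and intercept. The function names and the data are hypothetical (the students' scores are not reproduced in the text); only the final prediction uses the regression line y = 1.016x + 15.5 quoted for Example 1.

```python
import math


def least_squares_line(x, y):
    """Return slope b and intercept a of the least squares line y = b*x + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return b, a


def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

b, a = least_squares_line(x, y)

# Correlation coefficient obtained from the regression coefficient.
r = b * sample_sd(x) / sample_sd(y)

# Residuals: the failures of the line to fit the data.
residuals = [yi - (b * xi + a) for xi, yi in zip(x, y)]

# Fraction of the total variation of y explained by the line.
mean_y = sum(y) / len(y)
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
explained_fraction = 1 - ss_res / ss_tot

print(f"y = {b:.3f}x + {a:.3f}")
print(f"r = {r:.4f}, r^2 = {r * r:.4f}, explained = {explained_fraction:.4f}")

# Prediction with the regression line quoted for Example 1.
predicted_language = 1.016 * 400 + 15.5
print(f"predicted language score for 400 math points: {predicted_language:.1f}")
```

In this sketch the explained fraction of variation equals r² up to rounding error, and the slope and r share the same sign.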
Slide 26
Regression using transformations
Sometimes useful models are not linear in the parameters. Examining the scatterplot of the data shows a functional, but not linear, relationship between the variables.

Slide 27
Example
A fast food chain opened in 1974. Each year from 1974 to 1988 the number of steakhouses in operation was recorded. The scatterplot of the original data suggests an exponential relationship between x (year) and y (number of steakhouses) (first plot). Taking the logarithm of y, we get a linear relationship (plot at the bottom).

Slide 28
Performing the linear regression procedure on x and log(y) (natural logarithm), we get the equation
log y = 2.327 + 0.2569x,
that is,
$y = e^{2.327 + 0.2569x} = e^{2.327} e^{0.2569x} = 10.25\, e^{0.2569x} = 10.25 \cdot 1.293^x$
is the equation of the best-fitting curve to the original data.

Slide 29
[Figure: fitted line log y = 2.327 + 0.2569x and fitted curve $y = 10.25\, e^{0.2569x}$]

Slide 30
Types of transformations
Some non-linear models can be transformed into a linear model by taking logarithms on either or both sides.
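As a minimal sketch of such a transformation, the following Python code fits an exponential curve by ordinary least squares on (x, ln y) and then back-transforms the intercept. The function name `fit_exponential` and the synthetic data are assumptions made for illustration. The last lines back-transform the coefficients 2.327 and 0.2569 quoted in the steakhouse example, assuming that log there denotes the natural logarithm.

```python
import math


def fit_exponential(x, y):
    """Fit y = c * exp(k * x) by least squares on (x, ln y).

    Requires every y value to be positive.
    """
    n = len(x)
    log_y = [math.log(v) for v in y]
    mean_x = sum(x) / n
    mean_ly = sum(log_y) / n
    sxy = sum((xi - mean_x) * (li - mean_ly) for xi, li in zip(x, log_y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    k = sxy / sxx                     # slope of ln y = intercept + k * x
    intercept = mean_ly - k * mean_x
    return math.exp(intercept), k     # back-transform the intercept to c


# Synthetic, noise-free data generated from known (hypothetical) parameters.
true_c, true_k = 5.0, 0.3
x = list(range(10))
y = [true_c * math.exp(true_k * xi) for xi in x]

c_hat, k_hat = fit_exponential(x, y)
print(f"recovered c = {c_hat:.4f}, k = {k_hat:.4f}")

# Back-transforming the coefficients quoted in the steakhouse example.
prefactor = math.exp(2.327)       # constant multiplying e^(0.2569 x)
yearly_factor = math.exp(0.2569)  # growth factor per unit of x
print(f"y = {prefactor:.2f} * e^(0.2569 x) = {prefactor:.2f} * {yearly_factor:.3f}^x")
```

Note that least squares on the log scale minimises relative rather than absolute deviations, so the back-transformed curve is generally not the least squares curve on the original scale.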
Either the base-10 logarithm (denoted log or lg) or the natural (base e) logarithm (denoted ln) can be used. If a > 0 and b > 0, applying a logarithmic transformation to the model

Slide 31
Exponential relationship -> take log y
Model: $y = a \cdot 10^{bx}$
Take the logarithm of both sides: $\lg y = \lg a + bx$, so lg y is linear in x.

Slide 32
Logarithm relationship -> take log x
Model: $y = a + b \lg x$, so y is linear in lg x.

Slide 33
Power relationship -> take log x and log y
Model: $y = a x^b$
Take the logarithm of both sides: $\lg y = \lg a + b \lg x$, so lg y is linear in lg x.

Slide 34
Reciprocal relationship -> take reciprocal of x
Model: $y = a + b/x = a + b \cdot \frac{1}{x}$, so y is linear in 1/x.

Slide 35
Example from the literature

Slide 36
Slide 37
Relationship between two discrete variables: contingency tables, test for independence

Slide 38
Comparison of categorical variables (percentages): $\chi^2$ tests (chi-square)
Example: rates of diabetes in three groups: 31%, 27% and 25%*. Frequencies can be arranged into contingency tables.
$H_0$: the occurrence of diabetes is independent of groups (the rates are the same in the population)
DIAB     Treatment 1   Treatment 2   Treatment 3   Total
yes           31            27            25          83
no            69            73            75         217
Total        100           100           100         300

Slide 39
$\chi^2$ tests, assumptions
If $H_0$ is true, the expected frequencies can be computed: $E_i$ = row total * column total / total.
$\chi^2$ statistic: $\chi^2 = \sum_i \dfrac{(O_i - E_i)^2}{E_i}$
DF (degrees of freedom): (number of rows - 1) * (number of columns - 1)
Decision based on table: reject $H_0$ if $\chi^2 > \chi^2_{\text{table},\alpha,\text{df}}$
Assumption: cells with expected frequencies
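The computation of expected frequencies, the chi-square statistic and the degrees of freedom can be sketched in Python for the diabetes table. This is a minimal sketch: the function name is hypothetical, and the critical value 5.991 for a 5% level with 2 degrees of freedom is an assumption taken from a standard chi-square table, not from the slides.

```python
def chi_square_independence(table):
    """Chi-square statistic, degrees of freedom and expected frequencies
    for a contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            # Expected count under H0: row total * column total / total.
            e = row_totals[i] * col_totals[j] / total
            expected_row.append(e)
            chi2 += (observed - e) ** 2 / e
        expected.append(expected_row)

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected


# Observed frequencies from the diabetes example (rows: yes, no).
observed = [
    [31, 27, 25],
    [69, 73, 75],
]

chi2, df, expected = chi_square_independence(observed)

# Assumption: table value for alpha = 0.05 and df = 2 from a standard table.
CRITICAL_005_DF2 = 5.991
reject_h0 = chi2 > CRITICAL_005_DF2

print(f"chi-square = {chi2:.3f}, df = {df}, reject H0: {reject_h0}")
```

With these counts the statistic stays well below the table value, so the sketch does not reject the hypothesis that the diabetes rates are the same in the three groups.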