BIVARIATE DATA - Laboratory of Software Science

BIVARIATE DATA

David M. Lane et al., Introduction to Statistics, pp. 172–194

ICY0006: Lecture 3

Descriptive statistics

Descriptive statistics quantitatively describes the main features of a collection of information.

It provides simple summaries of the observations that have been made. Such summaries may be either quantitative (summary statistics) or visual (simple-to-understand graphs).

It involves two kinds of analysis:

Univariate analysis: describing the distribution of a single variable, including
- central tendency (mean, median, and mode)
- dispersion (range and quantiles of the data set, and measures of spread such as the variance and standard deviation)
- shape of the distribution (skewness and kurtosis)

Bivariate analysis: describing the relationship between pairs of variables when more than one variable is involved. In this case, descriptive statistics include:
- cross-tabulations and contingency tables
- graphical representation via scatter plots
- quantitative measures of dependence
- descriptions of conditional distributions



Contents

1 Introduction to Bivariate Data

2 Pearson product-moment correlation

3 Variance Sum Law II

4 Computing r by R


1 Introduction to Bivariate Data

Bivariate Data: more than one variable

Often, more than one variable is collected on each individual.

In health studies, variables such as age, sex, height, weight, blood pressure, and total cholesterol are often measured on each individual.

Economic studies may be interested in, among other things, personal income and years of education.

Bivariate data consists of data on two variables.

Usually we are interested in the relationship between the variables.



Example: "Do people tend to marry other people of about the same age?"

Our experience tells us "yes," but how good is the correspondence?

One way to address the question is to look at pairs of ages for a sample of married couples (an excerpt from a dataset consisting of 282 pairs of spousal ages):

[Table: excerpt of paired spousal ages]

We see that, yes, husbands and wives tend to be of about the same age, with men having a tendency to be slightly older than their wives.



Example: Histograms and means of spousal ages

Each distribution is fairly skewed with a long right tail.

Not all husbands are older than their wives; this fact is lost when we separate the variables.

The pairing within each couple is also lost when the variables are separated.

For example, based on the means alone, we cannot say what percentage of couples have a husband younger than the wife.

Another example of information not available from the separate descriptions: what is the average age of husbands with 45-year-old wives?

Finally, we do not know the relationship between the husband's age and the wife's age.
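The questions above can only be answered from the paired data. The following R sketch illustrates this with a small made-up set of couples; the vectors `husband` and `wife` and their values are illustrative assumptions, not the 282-pair dataset from the lecture.

```r
# Hypothetical paired ages; position i in both vectors is the same couple.
# These values are invented for illustration (not the lecture's dataset).
husband <- c(25, 31, 45, 52, 38, 47, 60, 29)
wife    <- c(27, 30, 45, 50, 35, 45, 58, 30)

# Share of couples in which the husband is younger than the wife.
# This needs the pairing; the two separate means cannot provide it.
share_younger_husband <- mean(husband < wife)

# Average age of husbands whose wives are 45 years old.
mean_husband_wife45 <- mean(husband[wife == 45])

print(share_younger_husband)
print(mean_husband_wife45)
```

Both quantities are computed row by row across the two vectors, which is exactly the information that disappears when each variable is summarised on its own.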



Visualization of Bivariate Data

A scatter plot displays the bivariate data in a graphical form that maintains the pairing.

Scatter plots that show linear relationships between variables can differ in several ways, including the slope of the line about which they cluster and how tightly the points cluster about the line.

This is a scatter plot of the paired ages (all 282 pairs):

[Figure: scatter plot of the paired spousal ages]
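A scatter plot of paired data can be drawn in R with the base `plot()` function. The sketch below uses a few invented couples (the data and the choice of axes are assumptions made for illustration, not the lecture's dataset or figure).

```r
# Hypothetical paired ages (illustrative values only).
husband <- c(25, 31, 45, 52, 38, 47, 60, 29)
wife    <- c(27, 30, 45, 50, 35, 45, 58, 30)

# One point per couple, so the pairing is preserved in the picture.
plot(wife, husband,
     xlab = "Wife's age", ylab = "Husband's age",
     main = "Spousal ages (hypothetical data)")

# Dashed reference line where husband and wife have the same age.
abline(a = 0, b = 1, lty = 2)
```

Points above the dashed line are couples with an older husband; points below it are couples with an older wife.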



Scatter plot

Two observations:

1 There is a strong relationship between the husband's age and the wife's age: the older the husband, the older the wife.
  - When one variable (Y) increases with the second variable (X), we say that X and Y have a positive association.
  - When Y decreases as X increases, we say that they have a negative association.

2 The points cluster along a straight line. When this occurs, the relationship is called a linear relationship.
  - There is a perfect linear relationship between two variables if all the points of the scatter plot fall on a straight line.
  - The relationship is linear even if the points diverge from the line, as long as the divergence is random rather than systematic.


Linear and non-linear relationships

[Figure panels: Perfect negative relationship; Non-linear relationship]

Weak relationship and no relationship

[Figure panels: Weak positive relationship; No relationship]


2 Pearson product-moment correlation

Pearson's correlation coefficient

Pearson's correlation coefficient is a statistical measure of the strength of a linear relationship between paired data.

The symbol for Pearson's correlation is ρ when it is measured in the population and r when it is measured in a sample. (From here on, we deal with samples and use r.)

Definition

Let X = {x_1, ..., x_N} and Y = {y_1, ..., y_N} be two datasets (two samples) with means M_X and M_Y and standard deviations σ_X and σ_Y respectively. Then the sample Pearson correlation coefficient (or simply correlation coefficient) is defined by the formula

$$ r = \frac{\sum (X - M_X)(Y - M_Y)}{N \, \sigma_X \sigma_Y}, $$

where $\sigma_X = \sqrt{\sum (X - M_X)^2 / N}$ and likewise for $\sigma_Y$. (If the standard deviations are computed with the divisor N − 1, the factor N is replaced by N − 1 and r is unchanged.)

Substituting the formula for the standard deviation gives the formula used for computation:

$$ r = \frac{\sum (X - M_X)(Y - M_Y)}{\sqrt{\sum (X - M_X)^2 \, \sum (Y - M_Y)^2}} = \frac{\sum XY - N M_X M_Y}{\sqrt{\left(\sum X^2 - N M_X^2\right)\left(\sum Y^2 - N M_Y^2\right)}} $$
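The two forms of the computing formula can be checked against each other in R. The following is a minimal sketch: the function name `pearson_r` and the sample vectors are assumptions introduced for illustration, and the built-in `cor()` is used only as a reference value.

```r
# Pearson's r computed two ways; `pearson_r` is a hypothetical helper name.
pearson_r <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  n  <- length(x)
  mx <- mean(x)
  my <- mean(y)

  # Deviation form: sum of cross-products over the root of the
  # product of the sums of squared deviations.
  r_dev <- sum((x - mx) * (y - my)) /
    sqrt(sum((x - mx)^2) * sum((y - my)^2))

  # Computational form: uses only sum(XY), sum(X^2), sum(Y^2), N and the means.
  r_sum <- (sum(x * y) - n * mx * my) /
    sqrt((sum(x^2) - n * mx^2) * (sum(y^2) - n * my^2))

  c(deviation = r_dev, computational = r_sum)
}

# Illustrative data (made up for this sketch).
x <- c(2, 4, 4, 7, 9)
y <- c(1, 3, 6, 6, 10)

res <- pearson_r(x, y)
print(res)
print(cor(x, y))  # built-in reference value
```

The computational form is convenient for hand calculation, while the deviation form is numerically safer when the values are large relative to their spread.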



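Computing Pearson's r

Example

$$ r = \frac{\sum XY - N M_X M_Y}{\sqrt{\left(\sum X^2 - N M_X^2\right)\left(\sum Y^2 - N M_Y^2\right)}} $$

With N = 5, M_X = 4, M_Y = 9, ∑XY = 210, ∑X² = 96 and ∑Y² = 465:

$$ r = \frac{210 - 5 \cdot 4 \cdot 9}{\sqrt{(96 - 5 \cdot 4^2)(465 - 5 \cdot 9^2)}} = \frac{210 - 180}{\sqrt{16 \cdot 60}} = \frac{30}{\sqrt{960}} = 0.9682458 $$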
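The arithmetic of this example can be replayed in R. The data table for the example is not preserved in these notes, so the vectors below are an assumption: they were chosen because they reproduce exactly the sums used in the calculation (N = 5, ∑XY = 210, ∑X² = 96, ∑Y² = 465, means 4 and 9).

```r
# Assumed data: chosen to reproduce the sums in the worked example;
# the original table is not available here.
x <- c(1, 3, 5, 5, 6)
y <- c(4, 6, 10, 12, 13)

n      <- length(x)   # 5
mx     <- mean(x)     # 4
my     <- mean(y)     # 9
sum_xy <- sum(x * y)  # 210
sum_x2 <- sum(x^2)    # 96
sum_y2 <- sum(y^2)    # 465

# Computational formula for Pearson's r.
r <- (sum_xy - n * mx * my) /
  sqrt((sum_x2 - n * mx^2) * (sum_y2 - n * my^2))

print(r)          # 0.9682458
print(cor(x, y))  # same value from the built-in function
```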


[Figure: Correlation coefficients]

Properties of Pearson's r

The Pearson correlation coefficient is symmetric: r = cor(X, Y) = cor(Y, X).

r is restricted to the range −1 ≤ r ≤ 1.

The Pearson correlation coefficient is invariant to separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient. (If b and d have opposite signs, r changes sign but keeps its magnitude.)

Positive values denote positive linear correlation.

Negative values denote negative linear correlation.

A value of 0 denotes no linear correlation.

The closer the value is to 1 or −1, the stronger the linear correlation.
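The symmetry and invariance properties are easy to verify numerically. In this R sketch the data and the constants of the linear transformations are arbitrary assumptions; note that when the two scale factors have opposite signs the coefficient keeps its magnitude but flips its sign.

```r
# Illustrative data (assumed for this sketch).
x <- c(1, 3, 5, 5, 6)
y <- c(4, 6, 10, 12, 13)

r <- cor(x, y)

# Symmetry: swapping the arguments does not change r.
r_swapped <- cor(y, x)

# Location/scale change with positive scale factors: r is unchanged.
r_same_sign <- cor(10 + 2 * x, -3 + 0.5 * y)

# Scale factors of opposite sign: the magnitude is kept, the sign flips.
r_opposite_sign <- cor(10 - 2 * x, -3 + 0.5 * y)

print(c(r = r, swapped = r_swapped,
        same_sign = r_same_sign, opposite_sign = r_opposite_sign))
```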



Assumptions

There are five assumptions that are made with respect to Pearson's correlation:

1 The variables must be either interval or ratio measurements.

2 The variables must be approximately normally distributed (we will discuss this later).

3 There is a linear relationship between the two variables.

4 Outliers are either kept to a minimum or are removed entirely. (Use a scatter plot to identify outliers.)

5 There is homoscedasticity of the data. (All random variables in the sequence or vector have the same finite variance. Homoscedasticity basically means that the variances along the line of best fit remain similar as you move along the line. Use a scatter plot to determine homo- or heteroscedasticity.)


[Figure: Removing outliers]

[Figure: Homo- and heteroscedasticity]

Caution!

1 The existence of a strong correlation does not imply a causal link between the variables!

2 We need to perform a significance test to decide whether, based on a given sample, there is evidence that a linear correlation is present in the population. (We will discuss significance tests later in the course.)



Recall:

Pearson correlation is a measure of the strength of a linear relationship between two variables. But any relationship should be assessed for its significance as well as its strength.

If your data do not meet the above assumptions, use Spearman's rank correlation (ρ) or the Kendall rank correlation (τ).
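To see why a rank correlation can be the better choice, consider a relationship that is perfectly monotone but far from linear. The R sketch below uses made-up data (an exponential curve, chosen only for illustration) and the `method` argument of `cor()`.

```r
# Made-up data: y always increases with x, but not along a straight line.
x <- 1:10
y <- exp(x)

pearson  <- cor(x, y)                       # linear association only
spearman <- cor(x, y, method = "spearman")  # based on ranks
kendall  <- cor(x, y, method = "kendall")   # based on concordant pairs

print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```

Both rank coefficients equal 1 here because the ordering of the y values matches the ordering of the x values exactly, while Pearson's r stays below 1 because the points do not lie on a straight line.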


3 Variance Sum Law II

Variance Sum Law I

If X and Y are independent (and therefore uncorrelated) variables, then

$$ \sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y $$

Variance Sum Law II

When X and Y are correlated variables, the following holds:

$$ \sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y \pm 2 \rho \, \sigma_X \sigma_Y $$

where ρ is the correlation between X and Y in the population.
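The law can be checked numerically in R. The sketch below is an illustration under stated assumptions: the data are simulated, and the sample quantities `var()`, `sd()` and `cor()` stand in for the population values σ², σ and ρ. With these sample analogues the identity holds exactly (up to floating-point rounding), for both the sum and the difference.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Simulated correlated variables (assumed for illustration).
x <- rnorm(1000)
y <- 0.6 * x + rnorm(1000)

r <- cor(x, y)

# Variance of the sum: left-hand and right-hand side of the law.
lhs_sum <- var(x + y)
rhs_sum <- var(x) + var(y) + 2 * r * sd(x) * sd(y)

# Variance of the difference: the sign of the last term changes.
lhs_diff <- var(x - y)
rhs_diff <- var(x) + var(y) - 2 * r * sd(x) * sd(y)

print(c(lhs_sum = lhs_sum, rhs_sum = rhs_sum))
print(c(lhs_diff = lhs_diff, rhs_diff = rhs_diff))
```

Setting the correlation term to zero recovers Variance Sum Law I, which is why that law requires uncorrelated variables.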


4 Computing r by R

Correlations in R

R computes correlations with the cor() function.

Three methods are built into the base distribution: Pearson, Kendall, and Spearman rank correlations.

Simplified forms of the function call are:
1 cor(x, y): the default call returns the Pearson correlation coefficient;
2 cor(dataset): if you pass a dataset (a numeric matrix or data frame) instead of separate variables, the result is a matrix of all the pairwise correlation coefficients;
3 cor(x, y, method = "spearman"): if you specify "spearman", you get the Spearman correlation coefficient;
4 cor(x, y, use = "complete.obs"): the parameter use specifies the handling of missing data. Options include all.obs (assumes no missing data; missing data will produce an error), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion).
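The calls listed above can be tried on a small example. In this sketch the vectors `x`, `y`, `z` and the data frame `dataset` are invented for illustration; two missing values are included to show the effect of the `use` argument.

```r
# Invented data with missing values (NA) in x and z.
x <- c(1, 3, 5, 5, 6, NA)
y <- c(4, 6, 10, 12, 13, 20)
z <- c(2, 1, 4, 3, NA, 5)

# Without a `use` argument, a missing value makes the result NA.
r_default <- cor(x, y)

# Listwise deletion: only pairs with both values present are used.
r_complete <- cor(x, y, use = "complete.obs")

# Spearman rank correlation on the complete pairs.
r_spearman <- cor(x, y, method = "spearman", use = "complete.obs")

# Passing a data frame returns the matrix of all pairwise coefficients.
dataset <- data.frame(x, y, z)
m_pairwise <- cor(dataset, use = "pairwise.complete.obs")  # per-pair deletion
m_listwise <- cor(dataset, use = "complete.obs")           # rows with any NA dropped

print(r_default)
print(r_complete)
print(r_spearman)
print(m_pairwise)
print(m_listwise)
```

With pairwise deletion each entry of the matrix may be based on a different set of rows, whereas listwise deletion uses the same fully observed rows for every entry.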

