RISK ENGINEERING
Modelling correlations with Python and SciPy
Eric Marsden <[email protected]>

Modelling correlations using Python

Jul 30, 2015


Measuring linear correlation

▷ Linear correlation coefficient: a measure of the strength and direction of a linear association between two random variables
  • also called the Pearson product-moment correlation coefficient

▷ $\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$

• 𝔼 is the expectation operator

• cov means covariance

• $\mu_X$ is the expected value of random variable X

• $\sigma_X$ is the standard deviation of X

▷ Python: scipy.stats.pearsonr(X, Y)

▷ Excel and Google Docs spreadsheet: CORREL
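
As a minimal sketch of how the definition maps onto code, the following computes ρ directly from the covariance and the standard deviations, then compares the result with scipy.stats.pearsonr. The sample values and the helper name pearson_from_definition are illustrative assumptions, not part of the original slides.

```python
import numpy
import scipy.stats

def pearson_from_definition(x, y):
    """Pearson correlation as cov(X, Y) / (sigma_X * sigma_Y)."""
    x = numpy.asarray(x, dtype=float)
    y = numpy.asarray(y, dtype=float)
    # Population covariance: E[(X - mu_X)(Y - mu_Y)]
    cov = numpy.mean((x - x.mean()) * (y - y.mean()))
    # numpy.std defaults to the population standard deviation (ddof=0)
    return cov / (x.std() * y.std())

# Illustrative data (hypothetical values)
X = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = numpy.array([2.1, 3.9, 6.2, 7.8, 10.1])

rho_manual = pearson_from_definition(X, Y)
rho_scipy, p_value = scipy.stats.pearsonr(X, Y)
print(rho_manual, rho_scipy)
```

The two values should agree to within floating-point rounding. The second value returned by pearsonr is a p-value for the hypothesis that the variables are uncorrelated.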


Measuring linear correlation

The linear correlation coefficient ρ quantifies the strength and direction of the joint movements of two random variables:

▷ the sign of ρ determines the relative directions in which the variables move

▷ the absolute value determines the strength of the relative movements (ρ ranges from -1 to +1)

▷ ρ = 0.5: when one variable moves by one standard deviation, the other moves on average by half a standard deviation in the same direction

▷ ρ = 0: variables are uncorrelated
  • does not imply that they are independent!
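
A small sketch of the last point, using made-up data: Y below is completely determined by X, yet the linear correlation coefficient is zero (up to rounding) because the relationship is symmetric rather than linear.

```python
import numpy
import scipy.stats

# X is symmetric around zero; Y is completely determined by X
X = numpy.linspace(-1, 1, 201)
Y = X ** 2

r, p_value = scipy.stats.pearsonr(X, Y)
print(f"Pearson r = {r:.3f}")
```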


Examples of correlations

Source: Wikipedia

correlation ≠ dependency


Online visualization: interpreting correlations

Try it out online: http://rpsychologist.com/d3/correlation/


Not all relations are linear!

▷ Example: Yerkes–Dodson law
  • empirical relationship between level of arousal/stress and level of performance

▷ Performance initially increases with stress/arousal

▷ Beyond a certain level of stress, performance decreases

Source: http://en.wikipedia.org/wiki/YerkesDodson_law
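
To illustrate why a single linear coefficient can hide such a relationship, here is a sketch using a hypothetical inverted-U curve. The quadratic formula and the 0 to 10 stress scale are assumptions chosen for illustration, not the actual Yerkes–Dodson data.

```python
import numpy
import scipy.stats

# Hypothetical inverted-U curve: performance peaks at a stress level of 5
stress = numpy.linspace(0, 10, 101)
performance = 25 - (stress - 5) ** 2

# Over the full range the rising and falling halves cancel out
r_all, _ = scipy.stats.pearsonr(stress, performance)

# Each half taken separately shows a strong linear association
rising = stress <= 5
r_rising, _ = scipy.stats.pearsonr(stress[rising], performance[rising])
r_falling, _ = scipy.stats.pearsonr(stress[~rising], performance[~rising])

print(f"full range: {r_all:.2f}, rising: {r_rising:.2f}, falling: {r_falling:.2f}")
```

The coefficient over the full range is close to zero even though performance depends entirely on stress in this toy model.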


Measuring correlation with NumPy

In [3]: import numpy
        import matplotlib.pyplot as plt
        import scipy.stats

In [4]: X = numpy.random.normal(10, 1, 100)
        Y = X + numpy.random.normal(0, 0.3, 100)
        plt.scatter(X, Y)
Out[4]: <matplotlib.collections.PathCollection at 0x7f7443e3c438>

In [5]: scipy.stats.pearsonr(X, Y)
Out[5]: (0.9560266103379802, 5.2241043747083435e-54)

Exercise: show that when the error in Y decreases, the correlation coefficient increases

Exercise: produce data and a plot with a negative correlation coefficient
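
One possible way to approach both exercises is sketched below. The seed, the sample size and the noise levels are arbitrary assumptions; other choices would work equally well.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(42)  # fixed seed, chosen arbitrarily
X = rng.normal(10, 1, 1000)

# Exercise 1: shrink the noise added to Y and watch r grow
correlations = {}
for noise_sd in (1.0, 0.3, 0.1):
    Y = X + rng.normal(0, noise_sd, X.size)
    r, _ = scipy.stats.pearsonr(X, Y)
    correlations[noise_sd] = r
    print(f"noise sd = {noise_sd}: r = {r:.3f}")

# Exercise 2: a negative slope gives a negative coefficient
Y_neg = -X + rng.normal(0, 0.3, X.size)
r_neg, _ = scipy.stats.pearsonr(X, Y_neg)
print(f"negative relationship: r = {r_neg:.3f}")
```

Calling plt.scatter(X, Y_neg) afterwards would produce the plot asked for in the second exercise.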


Anscombe’s quartet

[Figure: Anscombe's quartet, four scatterplots in panels I to IV]

Four datasets proposed by Francis Anscombe to illustrate the importance of graphing data rather than relying blindly on summary statistics

Each dataset has the same correlation coefficient!
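
The claim can be checked numerically. The sketch below types in the quartet's published values by hand (the variable names are my own) and computes the Pearson coefficient of each dataset.

```python
import numpy
import scipy.stats

# Anscombe's quartet; datasets I to III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

coefficients = {}
for name, (x, y) in quartet.items():
    r, _ = scipy.stats.pearsonr(numpy.array(x, dtype=float), numpy.array(y))
    coefficients[name] = r
    print(f"dataset {name}: r = {r:.3f}")
```

All four coefficients come out at roughly 0.816, although scatterplots of the four datasets look completely different.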


Plotting relationships between variables with matplotlib

▷ Scatterplot: use function plt.scatter

▷ Continuous plot or X-Y: function plt.plot

    import matplotlib.pyplot as plt
    import numpy
    x = numpy.linspace(1, 15, 100)
    plt.plot(x, numpy.sin(x))
    plt.show()

[Figure: plot of sin(x) for x between 1 and 15]


Aside: polio caused by ice cream!

▷ Polio: an infectious disease causing paralysis, which primarily affects young children

▷ Largely eliminated today but was once a worldwide concern

▷ Late 1940s: public health experts in USA noted that the incidence of polio increased with the consumption of ice cream

▷ Some suspected that ice cream caused polio… sales plummeted

▷ Polio incidence increases in hot summer weather

▷ Correlation is not causation: there may be a hidden, underlying variable
  • but it sure is a hint! [Edward Tufte]

More info: Freakonomics, Steven Levitt and Stephen J. Dubner


Aside: fire fighters and fire damage

▷ Statistical fact: the larger the number of fire-fighters attending the scene, the worse the damage!

▷ More fire fighters are sent to larger fires

▷ Larger fires lead to more damage

▷ Lurking (underlying) variable = fire size

▷ An instance of “Simpson’s paradox”
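
The effect of a lurking variable can be reproduced with simulated data. In the sketch below, fire size drives both the number of fire fighters and the damage. All coefficients and noise levels are invented for illustration, and the helper residuals is my own naming. Once the influence of fire size is removed from both quantities, the remaining (partial) correlation is close to zero.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 5000

# Hypothetical model: fire size drives both quantities
fire_size = rng.uniform(1, 10, n)
firefighters = 2 * fire_size + rng.normal(0, 1, n)
damage = 3 * fire_size + rng.normal(0, 1, n)

r_raw, _ = scipy.stats.pearsonr(firefighters, damage)

def residuals(y, x):
    """Part of y not explained by a straight-line fit on x."""
    slope, intercept = numpy.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Partial correlation: correlate what is left after removing fire size
r_partial, _ = scipy.stats.pearsonr(residuals(firefighters, fire_size),
                                    residuals(damage, fire_size))

print(f"raw correlation: {r_raw:.2f}")
print(f"controlling for fire size: {r_partial:.2f}")
```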


Aside: low birth weight babies of tobacco smoking mothers

▷ Statistical fact: low birth-weight children born to smoking mothers have a lower infant mortality rate than the low birth-weight children of non-smokers

▷ In a given population, low birth-weight babies have a significantly higher mortality rate than others

▷ Babies of smoking mothers are more likely to be of low birth weight than babies of non-smoking mothers

▷ Babies underweight because of smoking still have a lower mortality rate than children who have other, more severe, medical reasons why they are born underweight

▷ Lurking variable between smoking, birth weight and infant mortality

Source: Wilcox, A. (2001). On the importance — and the unimportance — of birthweight, International Journal of Epidemiology.30:1233–1241


Aside: cheese causes death by bedsheet strangulation

Note: real data!

Source: http://www.tylervigen.com/, with many more surprising correlations


Aside: correlation is not causation

Source: http://xkcd.com/552/ (CC BY-NC licence)


Directionality of effect problem

aggressive behaviour → watching violent films

aggressive behaviour ← watching violent films

Do aggressive children prefer violent TV programmes, or do violent programmes promote violent behaviour?


Analyzing data: wind speed

▷ Import wind speed data for Toulouse airport

▷ Find the mean of the distribution

▷ Plot a histogram of the data

▷ Does the data seem to follow a normal distribution?
  • use a Q-Q plot to check

▷ Check whether a Weibull distribution fits better

▷ Predict the highest wind speed expected in a 10-year interval

[Figure: histogram, "TLS wind speed in 2013"]

[Figure: "TLS wind speed qqnorm-plot", ordered values against quantiles, R² = 0.9645]

[Figure: "TLS wind speed qqweibull plot", ordered values against quantiles, R² = 0.9850]

Data downloaded from http://www.wunderground.com/history/airport/LFBO/
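
The steps of this exercise can be sketched as follows. The real Toulouse data is not reproduced here, so the code uses a synthetic stand-in drawn from a Weibull distribution with made-up parameters. It also assumes one independent observation per day, which is a simplification. The 10-year estimate is taken as the level exceeded on average once in 3650 daily observations.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(1)  # fixed seed, chosen arbitrarily
# Synthetic stand-in for one year of daily wind speeds (hypothetical values)
wind = 15 * rng.weibull(1.5, 365)

print("mean wind speed:", wind.mean())

# Q-Q comparison against a normal distribution; r is the correlation
# coefficient of the least-squares line through the Q-Q points
(_, _), (_, _, r_norm) = scipy.stats.probplot(wind, dist="norm")

# Fit a two-parameter Weibull (location fixed at zero), then Q-Q check
shape, loc, scale = scipy.stats.weibull_min.fit(wind, floc=0)
(_, _), (_, _, r_weibull) = scipy.stats.probplot(
    wind, sparams=(shape, loc, scale), dist="weibull_min")

print(f"R² normal: {r_norm ** 2:.4f}, R² Weibull: {r_weibull ** 2:.4f}")

# Level exceeded on average once in 10 years of daily observations,
# assuming independent days drawn from the fitted distribution
p_exceed = 1 / (10 * 365)
wind_10y = scipy.stats.weibull_min.ppf(1 - p_exceed, shape, loc=loc, scale=scale)
print(f"estimated 10-year wind speed: {wind_10y:.1f}")
```

A histogram can be drawn with plt.hist(wind, bins=30, density=True). Passing plot=plt to scipy.stats.probplot draws the corresponding Q-Q plot.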


For more information

▷ SciPy lecture notes: https://scipy-lectures.github.io/

▷ Analysis of the “pay for performance” principle (correlation between a CEO’s pay and their job performance, as measured by the stock market), http://freakonometrics.hypotheses.org/15999

This presentation is distributed under the terms of the Creative Commons Attribution – Share Alike licence.

For more free course materials on risk engineering, visit http://risk-engineering.org/