Measuring linear correlation
▷ Linear correlation coefficient: a measure of the strength and direction of a linear association between two random variables
• also called the Pearson product-moment correlation coefficient
▷ $\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$
• $\mathbb{E}$ is the expectation operator
• cov means covariance
• $\mu_X$ is the expected value of random variable X
• $\sigma_X$ is the standard deviation of X
▷ Python: scipy.stats.pearsonr(X, Y)
▷ Excel and Google Docs spreadsheet: CORREL
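The definition of ρ can be checked numerically. The following is a minimal sketch, not a reference implementation: the helper name `pearson_rho`, the seed and the synthetic data are assumptions made for illustration. It estimates ρ directly from the formula and compares the result with `numpy.corrcoef`.

```python
import numpy


def pearson_rho(x, y):
    """Sample estimate of rho = cov(X, Y) / (sigma_X * sigma_Y).

    Hypothetical helper written for illustration only.
    """
    x = numpy.asarray(x, dtype=float)
    y = numpy.asarray(y, dtype=float)
    dx = x - x.mean()                 # X - mu_X
    dy = y - y.mean()                 # Y - mu_Y
    cov_xy = numpy.mean(dx * dy)      # sample version of E[(X - mu_X)(Y - mu_Y)]
    return cov_xy / (x.std() * y.std())


# Synthetic data (assumption): Y is X plus a little Gaussian noise
rng = numpy.random.default_rng(42)    # arbitrary seed, for repeatability
X = rng.normal(10, 1, 100)
Y = X + rng.normal(0, 0.3, 100)

rho_manual = pearson_rho(X, Y)
rho_numpy = numpy.corrcoef(X, Y)[0, 1]
print(rho_manual, rho_numpy)          # the two estimates should agree
```

`scipy.stats.pearsonr(X, Y)` should return the same coefficient, together with a p-value.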
Measuring linear correlation
The linear correlation coefficient ρ quantifies the strength and direction of the joint movements of two random variables:
▷ sign of ρ determines the relative directions in which the variables move
▷ value determines the strength of the relative movements (ranging from -1 to +1)
▷ ρ = 0.5: when one variable moves by one standard deviation, the other moves on average by half a standard deviation in the same direction
▷ ρ = 0: variables are uncorrelated
• does not imply that they are independent!
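A small illustration of why ρ = 0 does not imply independence: in the sketch below (the data are an assumption chosen for illustration), Y is completely determined by X, yet the linear correlation coefficient is essentially zero because the relationship is symmetric rather than linear.

```python
import numpy

# X is symmetric around zero and Y = X**2 depends entirely on X
x = numpy.linspace(-1.0, 1.0, 201)
y = x ** 2

rho = numpy.corrcoef(x, y)[0, 1]
print(abs(rho) < 1e-9)   # the linear correlation vanishes despite full dependence
```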
Online visualization: interpreting correlations
Try it out online: http://rpsychologist.com/d3/correlation/
Not all relations are linear!
▷ Example: Yerkes–Dodson law
• empirical relationship between level of arousal/stress and level of performance
▷ Performance initially increases with stress/arousal
▷ Beyond a certain level of stress, performance decreases
Source: http://en.wikipedia.org/wiki/YerkesDodson_law
Measuring correlation with NumPy
In [3]: import numpy
        import matplotlib.pyplot as plt
        import scipy.stats

In [4]: # X is normally distributed; Y is X plus Gaussian noise (standard deviation 0.3)
        X = numpy.random.normal(10, 1, 100)
        Y = X + numpy.random.normal(0, 0.3, 100)
        plt.scatter(X, Y)
Out[4]: <matplotlib.collections.PathCollection at 0x7f7443e3c438>

In [5]: # returns the correlation coefficient and the associated p-value
        scipy.stats.pearsonr(X, Y)
Out[5]: (0.9560266103379802, 5.2241043747083435e-54)
Exercise: show that when the error in Y decreases, the correlation coefficient increases
Exercise: produce data and a plot with a negative correlation coefficient
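One possible way to approach both exercises is sketched below. The noise levels, sample size and seed are assumptions chosen for illustration, and `rho_for_noise` is a hypothetical helper. The first part repeats the experiment with decreasing noise in Y; the second flips the sign of the dependence to obtain a negative coefficient.

```python
import numpy

rng = numpy.random.default_rng(0)     # arbitrary seed
X = rng.normal(10, 1, 1000)


def rho_for_noise(noise_sd):
    """Correlation between X and Y = X + noise, for a given noise level."""
    Y = X + rng.normal(0, noise_sd, X.size)
    return numpy.corrcoef(X, Y)[0, 1]


# Exercise 1: less error in Y gives a coefficient closer to 1
noise_levels = [1.0, 0.3, 0.1]
rhos = [rho_for_noise(sd) for sd in noise_levels]
for sd, rho in zip(noise_levels, rhos):
    print(f"noise sd = {sd}: rho = {rho:.3f}")

# Exercise 2: Y decreases when X increases, so rho is negative
Y_neg = -X + rng.normal(0, 0.3, X.size)
rho_neg = numpy.corrcoef(X, Y_neg)[0, 1]
print(f"negative example: rho = {rho_neg:.3f}")
```

Calling `plt.scatter(X, Y_neg)` after importing `matplotlib.pyplot` as `plt` would display the downward-sloping point cloud.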
Anscombe’s quartet
[Figure: scatterplots of the four datasets, panels I to IV]
Four datasets proposed by Francis Anscombe to illustrate the importance of graphing data rather than relying blindly on summary statistics
Each dataset has the same correlation coefficient!
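The claim can be verified with a few lines of NumPy. The values below are the quartet as commonly published (Anscombe, 1973); they are not part of the original slides, so treat the listing as an illustrative sketch.

```python
import numpy

# Anscombe's quartet: datasets I to III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Correlation coefficient of each dataset
correlations = {
    name: numpy.corrcoef(x, y)[0, 1] for name, (x, y) in quartet.items()
}
for name, rho in correlations.items():
    print(f"dataset {name}: rho = {rho:.3f}")
```

All four coefficients come out at roughly 0.816, even though the scatterplots look nothing alike.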
Plotting relationships between variables with matplotlib
▷ Scatterplot: use function plt.scatter
▷ Continuous (X-Y) plot: use function plt.plot
    import matplotlib.pyplot as plt
    import numpy
    x = numpy.linspace(1, 15, 100)
    plt.plot(x, numpy.sin(x))
    plt.show()
[Figure: plot of sin(x) for x between 1 and 15]
Aside: polio caused by ice cream!
▷ Polio: an infectious disease causing paralysis, which primarily affects young children
▷ Largely eliminated today but was once a worldwide concern
▷ Late 1940s: public health experts in USA noted that the incidence of polio increased with the consumption of ice cream
▷ Some suspected that ice cream caused polio… sales plummeted
▷ Polio incidence increases in hot summer weather, and so does ice cream consumption
▷ Correlation is not causation: there may be a hidden, underlying variable
• but it sure is a hint! [Edward Tufte]
More info: Freakonomics, Steven Levitt and Stephen J. Dubner
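The effect of a hidden variable can be reproduced with simulated data. The sketch below uses entirely made-up numbers (the coefficients, noise levels and variable names are assumptions, not historical data): temperature drives both quantities, which are otherwise unrelated. The raw correlation is strong, but it largely disappears once the linear effect of temperature is removed from both series.

```python
import numpy

rng = numpy.random.default_rng(1)     # arbitrary seed
n = 500

# Synthetic data (assumption): temperature influences both variables,
# which have no direct link to each other
temperature = rng.normal(20, 8, n)
ice_cream = 2.0 * temperature + rng.normal(0, 5, n)
polio = 0.5 * temperature + rng.normal(0, 2, n)

rho_raw = numpy.corrcoef(ice_cream, polio)[0, 1]


def residuals(y, x):
    """Part of y not explained by a straight-line fit on x."""
    slope, intercept = numpy.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlation after controlling for temperature (partial correlation)
rho_partial = numpy.corrcoef(
    residuals(ice_cream, temperature), residuals(polio, temperature)
)[0, 1]

print(f"raw correlation:     {rho_raw:.2f}")
print(f"partial correlation: {rho_partial:.2f}")
```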
Aside: fire fighters and fire damage
▷ Statistical fact: the larger the number of fire fighters attending the scene, the worse the damage!
▷ More fire fighters are sent to larger fires
▷ Larger fires lead to more damage
▷ Lurking (underlying) variable = fire size
▷ An instance of “Simpson’s paradox”
Aside: low birth weight babies of tobacco smoking mothers
▷ Statistical fact: low birth weight children born to smoking mothers have a lower infant mortality rate than the low birth weight children of non-smokers
▷ In a given population, low birth weight babies have a significantly higher mortality rate than others
▷ Babies of smoking mothers are more likely to be of low birth weight than babies of non-smoking mothers
▷ Babies underweight because of smoking still have a lower mortality rate than children who have other, more severe, medical reasons why they are born underweight
▷ Lurking variable between smoking, birth weight and infant mortality
Source: Wilcox, A. (2001). On the importance — and the unimportance — of birthweight, International Journal of Epidemiology, 30:1233–1241
Aside: cheese causes death by bedsheet strangulation
Note: real data!
Source: http://www.tylervigen.com/, with many more surprising correlations
Aside: correlation is not causation
Source: http://xkcd.com/552/ (CC BY-NC licence)
Directionality of effect problem
aggressive behaviour → watching violent films
aggressive behaviour ← watching violent films
Do aggressive children prefer violent TV programmes, or do violent programmes promote violent behaviour?
Analyzing data: wind speed
▷ Import wind speed data for Toulouse airport
▷ Find the mean of the distribution
▷ Plot a histogram of the data
▷ Does the data seem to follow a normal distribution?
• use a Q-Q plot to check
▷ Check whether a Weibull distribution fits better
▷ Predict the highest wind speed expected in a 10-year interval
[Figure: TLS wind speed in 2013]
[Figure: TLS wind speed qqnorm-plot (ordered values against quantiles), R² = 0.9645]
[Figure: TLS wind speed qqweibull plot (ordered values against quantiles), R² = 0.9850]
Data downloaded from http://www.wunderground.com/history/airport/LFBO/
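The first steps of this analysis might look like the sketch below. It does not use the Toulouse data: the `wind` array is a synthetic stand-in drawn from a Weibull distribution (an assumption made so that the example is self-contained), and should be replaced by the imported measurements. It computes the mean, then compares the quality of normal and Weibull Q-Q fits using `scipy.stats.probplot`. The 10-year prediction is left out because it depends on the sampling frequency of the data.

```python
import numpy
import scipy.stats

# Synthetic stand-in for the wind speed measurements (assumption)
rng = numpy.random.default_rng(2013)  # arbitrary seed
wind = 12.0 * rng.weibull(2.0, 2000)

mean_speed = wind.mean()
print(f"mean wind speed: {mean_speed:.1f}")

# Q-Q comparison against a normal distribution;
# r is the correlation coefficient of the Q-Q points
_, (_, _, r_norm) = scipy.stats.probplot(wind, dist="norm")

# Fit a Weibull distribution (location fixed at 0), then Q-Q against it
shape, loc, scale = scipy.stats.weibull_min.fit(wind, floc=0)
_, (_, _, r_weib) = scipy.stats.probplot(
    wind, dist=scipy.stats.weibull_min, sparams=(shape, loc, scale)
)

print(f"normal  Q-Q: R^2 = {r_norm ** 2:.4f}")
print(f"Weibull Q-Q: R^2 = {r_weib ** 2:.4f}")
```

Passing `plot=plt` to `probplot` (with `matplotlib.pyplot` imported as `plt`) draws the Q-Q plot instead of only returning the fit.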
For more information
▷ SciPy lecture notes: https://scipy-lectures.github.io/
▷ Analysis of the “pay for performance” principle (correlation between a CEO’s pay and their job performance, as measured by the stock market): http://freakonometrics.hypotheses.org/15999
This presentation is distributed under the terms of the Creative Commons Attribution – Share Alike licence.
For more free course materials on risk engineering, visit http://risk-engineering.org/