
Statistics with R

Chapter 1: Introduction to statistics

Tabea Rebafka

October 2018

Master AIMS 2018–19

Tabea Rebafka Statistics with R Introduction to statistics 1 / 39


Outline

1 What is statistics?

2 Example: Coin tossing

3 Refresher on probability theory

4 Statistical modelling


What is statistics? I

What is the aim of statistics?

Analysis and interpretation of data (or observations, measurements):

- understand an observed phenomenon by statistical inference (i.e. modelling, estimation and testing)
- recover unobserved features (prediction)


What is statistics? II

Statistical approach

- Use a probabilistic model to explain the nature of the data (in contrast to data analysis).
- Let $x_1, \dots, x_n$ be the data. A statistician assumes that $(x_1, \dots, x_n)$ is the realization of a random variable $X$ with distribution $P$.
- The distribution $P$ is unknown (in contrast to probability theory).


What is statistics? III


Example: Coin tossing I

Data

- Observations: the outcomes of $n$ tosses of the same coin.
- Heads is encoded by 1, tails by 0.
- Data: $x_1, \dots, x_n$ with $x_i \in \{0, 1\}$. The number $n$ is called the sample size.

Probabilistic model

- Consider the $x_i$ as independent realizations of a Bernoulli distribution.
- More precisely, let the $X_i$ be i.i.d. (independent and identically distributed) random variables with Bernoulli distribution $B(p)$ with parameter $p \in (0, 1)$, i.e.

$$P(X_i = 1) = p = 1 - P(X_i = 0).$$

- The Bernoulli parameter $p$ is unknown.


Example: Coin tossing II

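Fit the model

- Estimate the Bernoulli parameter $p$ from the data $x_1, \dots, x_n$.
- Simple idea: we know that for $X_i \sim B(p)$ i.i.d., we have

$$E[X_1] = p \quad\text{and}\quad \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{P} p \quad (n \to \infty).$$

- Use the sample mean $\bar{X}_n$ as an estimate of $p$:

$$\hat{p}_n = \bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i.$$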
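The estimator above can be tried out on simulated tosses. The following is a minimal sketch in Python (standard library only; the slides themselves show no code, so the language, the function names `simulate_tosses` and `sample_mean`, the seed and the value `p_true = 0.3` are all assumptions made for illustration).

```python
import random


def simulate_tosses(n, p, seed=0):
    """Draw n i.i.d. Bernoulli(p) outcomes (1 = heads, 0 = tails)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [1 if rng.random() < p else 0 for _ in range(n)]


def sample_mean(x):
    """Estimate p by the proportion of ones in the data."""
    return sum(x) / len(x)


# Hypothetical "true" parameter; in a real application it is unknown.
p_true = 0.3
for n in (10, 1000, 100000):
    x = simulate_tosses(n, p_true)
    print(n, sample_mean(x))
```

As the sample size grows, the printed estimates should settle near the chosen `p_true`, which is what convergence in probability suggests.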


Example: Coin tossing III

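Properties of the estimator $\hat{p}_n$ of $p$

- $\hat{p}_n = \bar{X}_n \xrightarrow{P} p$ as $n \to \infty$, i.e. when the sample size $n$ is large, $\hat{p}_n$ tends to be close to $p$ (consistency).
- $E[\hat{p}_n] = p$, i.e. on average $\hat{p}_n$ takes the target value $p$ (unbiasedness).
- Mean squared error (MSE):

$$E[(\hat{p}_n - p)^2] = \frac{p(1 - p)}{n} \longrightarrow 0 \quad (n \to \infty).$$

- Limit distribution and rate of convergence:

$$\sqrt{n}\,(\hat{p}_n - p) \xrightarrow{d} N(0, p(1 - p)) \quad (n \to \infty).$$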
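Unbiasedness and the MSE formula can be checked exactly for small $n$. The sketch below (Python, standard library; function names are illustrative assumptions) uses the fact that the number of heads in $n$ tosses follows a binomial distribution, and sums over all possible head counts $k$.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_bias_and_mse(n, p):
    """Bias and MSE of p_hat = K / n, computed by summing over k = 0..n."""
    mean = sum((k / n) * binomial_pmf(k, n, p) for k in range(n + 1))
    mse = sum((k / n - p) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))
    return mean - p, mse


p = 0.4  # illustrative value
for n in (5, 50, 500):
    bias, mse = exact_bias_and_mse(n, p)
    # Last column is the closed-form value p(1 - p) / n for comparison.
    print(n, bias, mse, p * (1 - p) / n)
```

Up to floating-point rounding, the bias should be zero and the two MSE columns should agree.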


Example: Coin tossing IV

Quantify the uncertainty of the estimate

Instead of a point estimator $\hat{p}_n$, compute an interval $I$ that depends on the data (i.e. $I = I(x_1, \dots, x_n)$) and that contains the target $p$ with probability at least $\gamma$ (confidence interval):

$$P(p \in I) \ge \gamma.$$

The length of the interval $I$ indicates the uncertainty about our estimate of $p$.


Example: Coin tossing V

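Confidence interval for $p$ in the Bernoulli model

An asymptotic confidence interval is given by

$$I_n = \left[ \hat{p}_n + q^{N}_{\gamma_1} \sqrt{\frac{\hat{p}_n (1 - \hat{p}_n)}{n}},\ \hat{p}_n + q^{N}_{\gamma_2} \sqrt{\frac{\hat{p}_n (1 - \hat{p}_n)}{n}} \right]$$

with $\gamma_1 = (1 - \gamma)/2$ and $\gamma_2 = (1 + \gamma)/2$, where $q^{N}_{\alpha}$ denotes the $\alpha$-quantile of the standard normal distribution $N(0, 1)$, defined by

$$P(Z \le q^{N}_{\alpha}) = \alpha \quad \text{for } Z \sim N(0, 1).$$

Interval length:

$$2\, q^{N}_{\gamma_2} \sqrt{\frac{\hat{p}_n (1 - \hat{p}_n)}{n}},$$

using $q^{N}_{\gamma_1} = -q^{N}_{\gamma_2}$.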
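The interval formula translates almost line by line into code. Here is a minimal sketch in Python, assuming the standard-library `statistics.NormalDist` for the normal quantiles; the function name `asymptotic_ci` and the data (40 heads in 100 tosses) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def asymptotic_ci(x, gamma=0.95):
    """Asymptotic confidence interval for p from 0/1 data x at level gamma."""
    n = len(x)
    p_hat = sum(x) / n
    gamma1, gamma2 = (1 - gamma) / 2, (1 + gamma) / 2
    std_normal = NormalDist()  # standard normal N(0, 1)
    scale = sqrt(p_hat * (1 - p_hat) / n)
    lower = p_hat + std_normal.inv_cdf(gamma1) * scale  # quantile is negative
    upper = p_hat + std_normal.inv_cdf(gamma2) * scale
    return lower, upper


# Made-up data: 40 heads in 100 tosses.
x = [1] * 40 + [0] * 60
lower, upper = asymptotic_ci(x, gamma=0.95)
print(lower, upper, upper - lower)
```

Because the two quantiles are symmetric around zero, the interval is centred at the estimate, and its length shrinks at rate $1/\sqrt{n}$.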


Example: Coin tossing VI

Statistical testing

Answer questions such as: Is the coin fair?

- Mathematically speaking: is $p = 1/2$ or $p \ne 1/2$?
- Estimate $p$ and evaluate the uncertainty of the estimate.
- If the estimate is too far away from 1/2, then decide that $p \ne 1/2$.
- Otherwise retain the hypothesis that $p = 1/2$.
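The slides do not yet say how far is "too far". One possible rule, sketched below in Python as an assumption rather than the method of the course, compares the distance $|\hat{p}_n - 1/2|$ with a normal quantile times the standard deviation $\sqrt{1/(4n)}$ that the estimator has when $p = 1/2$. The function name `reject_fair_coin` and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def reject_fair_coin(x, gamma=0.95):
    """Return True if the estimate is 'too far' from 1/2 (assumed threshold rule)."""
    n = len(x)
    p_hat = sum(x) / n
    # Under p = 1/2, the estimator has standard deviation sqrt(0.25 / n).
    threshold = NormalDist().inv_cdf((1 + gamma) / 2) * sqrt(0.25 / n)
    return abs(p_hat - 0.5) > threshold


# Made-up data sets of 100 tosses each.
print(reject_fair_coin([1] * 45 + [0] * 55))  # 45 heads
print(reject_fair_coin([1] * 30 + [0] * 70))  # 30 heads
```

With 100 tosses the threshold is roughly 0.098, so an estimate of 0.45 is compatible with a fair coin under this rule, while 0.30 is not.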


Refresher on probability theory

Definition

- A sample space is any finite or infinite set $\Omega$ (it is thought of as the set of all possible outcomes of a random experiment).
- Any subset $A \subset \Omega$ is called an event, including $\Omega$ and the empty set $\emptyset$.

Example: Dice

Sample space for rolling a die: $\Omega = \{1, \dots, 6\}$.

Some events:

$$A = \{2\}, \quad B = \{2, 4, 6\} = \{\text{the result is even}\}, \quad C = \emptyset, \quad D = \Omega.$$


Probability measures I

Basically, a probability measure assigns a probability to every event.

Definition

Let $\Omega$ be a sample space. A probability measure $P$ on $\Omega$ is a map

$$P : \{\text{Events}\} \to [0, 1]$$

such that

- $P(\emptyset) = 0$, $P(\Omega) = 1$.
- (Countable additivity) For every sequence of disjoint events $A_1, A_2, \dots$,

$$P\Big( \bigcup_{n \ge 1} A_n \Big) = \sum_{n \ge 1} P(A_n).$$

A pair $(\Omega, P)$ is called a probability space.


Probability measures II

Examples

The uniform measure on a finite set $\Omega$ is defined by

$$\mu(A) = \frac{\operatorname{card}(A)}{\operatorname{card}(\Omega)}.$$

The Dirac measure (or Dirac mass) at some point $a$, denoted by $\delta_a$, puts all the mass on $a$:

$$\delta_a(A) = \begin{cases} 1 & \text{if } a \in A, \\ 0 & \text{otherwise.} \end{cases}$$

The Lebesgue measure on $\mathbb{R}$ is the measure $\lambda$ that assigns its length to each interval $[a, b]$:

$$\lambda([a, b]) = b - a.$$

It is not a probability measure, as its values are not restricted to $[0, 1]$.
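On a finite sample space, the uniform and Dirac measures are easy to compute by hand or by machine. A small sketch in Python, assuming events are represented as sets and using the die example $\Omega = \{1, \dots, 6\}$ (the function names are illustrative, and exact fractions avoid rounding):

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # sample space of one die roll


def uniform_measure(event, sample_space=omega):
    """mu(A) = card(A) / card(Omega) for an event A inside the sample space."""
    event = frozenset(event)
    if not event <= sample_space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))


def dirac_measure(a, event):
    """delta_a(A) = 1 if a belongs to A, 0 otherwise."""
    return 1 if a in event else 0


even = {2, 4, 6}
print(uniform_measure({2}), uniform_measure(even))
print(uniform_measure(set()), uniform_measure(omega))
print(dirac_measure(2, even), dirac_measure(3, even))
```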


Probability measures III

Proposition

Let $(\Omega, P)$ be a probability space.

(i) If $A \subset B$, then $P(A) \le P(B)$.
(ii) For any event $A$, $P(A^c) = 1 - P(A)$.
(iii) For any events $A, B$,

$$P(A \cup B) = P(A) + P(B) - P(A \cap B),$$

in particular $P(A \cup B) \le P(A) + P(B)$.


Probability measures IV

Proposition

(iv) (Union bound) More generally, let $(A_n)_{n \ge 1}$ be any sequence of sets (not necessarily disjoint). Then

$$P\Big( \bigcup_{n \ge 1} A_n \Big) \le \sum_{n \ge 1} P(A_n).$$

(v) (Law of total probability) Let $A$ be an event and $B_1, B_2, \dots$ be a sequence of disjoint sets such that $\bigcup_{n \ge 1} B_n = \Omega$. Then

$$P(A) = \sum_{n \ge 1} P(A \cap B_n).$$
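These rules can be checked on a concrete example. The sketch below (Python; the uniform measure on a die, the events `a` and `b`, and the partition are all chosen here for illustration) verifies the complement rule, inclusion-exclusion, the union bound and the law of total probability with exact fractions. It is a sanity check on one example, not a proof.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # one die roll, all outcomes equally likely


def prob(event):
    """Uniform probability of an event on the die sample space."""
    return Fraction(len(frozenset(event) & omega), len(omega))


a = {1, 2, 3}
b = {2, 4, 6}
partition = [{1, 2}, {3, 4}, {5, 6}]  # disjoint blocks whose union is omega

complement_ok = prob(omega - a) == 1 - prob(a)
incl_excl_ok = prob(a | b) == prob(a) + prob(b) - prob(a & b)
union_bound_ok = prob(a | b) <= prob(a) + prob(b)
total_prob_ok = prob(a) == sum(prob(a & block) for block in partition)

print(complement_ok, incl_excl_ok, union_bound_ok, total_prob_ok)
```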


Random variables I

From now on, we work on a fixed probability space $(\Omega, P)$, where $P$ is a probability measure.

Elements of $\Omega$ are often denoted by $\omega$.

Definition

Any function $X : \Omega \to \mathbb{R}$ is a random variable.


Random variables II

Example: Indicator function

For a given event $A$, the indicator function of $A$ is denoted by $1_A$ and defined as

$$1_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{otherwise.} \end{cases}$$

Indicator functions are similar to Dirac measures, as $1_A(\omega) = \delta_\omega(A)$.


Random variables III

Definition

- The distribution or law of $X$, denoted by $P_X$, is the probability measure on $\mathbb{R}$ such that for any event $A$

$$P_X(A) = P(\{\omega \text{ such that } X(\omega) \in A\}) = P(X \in A).$$

  We write $X \sim P_X$.
- The cumulative distribution function (or just distribution function) of $X$ is the function $F_X : \mathbb{R} \to [0, 1]$ defined by

$$F_X(t) = P(X \le t) \quad \text{for every } t.$$

Theorem

$X$ and $Y$ have the same law $\iff$ $F_X(t) = F_Y(t)$ for every $t$.


Random variables IV

Properties of the distribution function

(i) $F_X$ is non-decreasing.
(ii) $F_X$ is right-continuous.
(iii) $\lim_{t \to -\infty} F_X(t) = 0$, $\lim_{t \to +\infty} F_X(t) = 1$.

Theorem

Any function $F$ with properties (i), (ii) and (iii) above is the distribution function of some random variable.


Discrete distribution I

Definition

- We say that $X$ has a discrete distribution if $X$ takes its values in a finite or countable set $\{x_1, x_2, \dots\}$.
- Discrete distributions are entirely described by their probability mass function $p(x) = P(X = x)$ for $x \in \{x_1, x_2, \dots\}$.


Discrete distribution II

Examples of discrete distributions

- Bernoulli distribution $B(p)$ with parameter $p \in [0, 1]$, with values in $\{0, 1\}$:

$$P(X = 1) = p, \quad P(X = 0) = 1 - p.$$

  Model of the success or failure of an experiment.
- Binomial distribution $B(n, p)$ with parameters $n \ge 1$ and $p \in [0, 1]$:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k} \quad \text{for } k = 0, 1, \dots, n.$$

  Model of the number of successes in $n$ Bernoulli trials.


Discrete distribution III

Examples of discrete distributions

- Geometric distribution with parameter $p \in (0, 1]$:

$$P(X = k) = (1 - p)^{k-1} p \quad \text{for } k = 1, 2, \dots$$

  Model of the number of Bernoulli trials up to and including the first success.
- Poisson distribution with parameter $\lambda > 0$:

$$P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!} \quad \text{for } k = 0, 1, 2, \dots$$

- Discrete uniform distribution on a finite set of values $\{x_1, \dots, x_m\}$:

$$P(X = x_k) = \frac{1}{m} \quad \text{for } k = 1, \dots, m.$$
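The probability mass functions above can be written directly as functions. A minimal sketch in Python (standard library only; the function names and parameter values are illustrative assumptions), which also checks numerically that each set of probabilities sums to 1, truncating the infinite sums where the remaining tail is negligible:

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(X = k) for the binomial distribution B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def geometric_pmf(k, p):
    """P(X = k) for the geometric distribution, k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p


def poisson_pmf(k, lam):
    """P(X = k) for the Poisson distribution with parameter lam."""
    return exp(-lam) * lam**k / factorial(k)


binomial_total = sum(binomial_pmf(k, 8, 0.4) for k in range(9))
geometric_total = sum(geometric_pmf(k, 0.4) for k in range(1, 200))  # truncated
poisson_total = sum(poisson_pmf(k, 3.0) for k in range(100))  # truncated
print(binomial_total, geometric_total, poisson_total)
```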


Discrete distribution IV

Bernoulli distribution with parameter p = 0.4

[Figure: two panels plotted against x, one showing the probabilities and one showing the CDF.]


Discrete distribution V

Binomial distribution with parameters n = 8 and p = 0.4

[Figure: two panels plotted against x, one showing the probabilities and one showing the CDF.]


Discrete distribution VI

The cumulative distribution function of any discrete distribution is a step function.

The locations of the jumps indicate the values taken by the random variable, and the height of each jump indicates the associated probability.
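The step-function behaviour can be seen numerically. The sketch below (Python, using the binomial distribution with n = 8 and p = 0.4; the function names and evaluation points are illustrative assumptions) evaluates the CDF between and across a jump.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for the binomial distribution B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(t, n, p):
    """F(t) = P(X <= t): sum the probabilities of all values k <= t."""
    return sum(binomial_pmf(k, n, p) for k in range(n + 1) if k <= t)


n, p = 8, 0.4

# Between two consecutive values the CDF is flat ...
flat_left, flat_right = binomial_cdf(2.0, n, p), binomial_cdf(2.9, n, p)
# ... and at the value 3 it jumps by exactly P(X = 3).
jump_at_3 = binomial_cdf(3, n, p) - binomial_cdf(2.999, n, p)

print(flat_left, flat_right)
print(jump_at_3, binomial_pmf(3, n, p))
```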


Continuous distribution I

Definition

We say that $X$ has a continuous distribution if $X$ takes its values in $\mathbb{R}$ (or in an interval of $\mathbb{R}$) and if there is a non-negative function $f$ such that for any event $A$

$$P(X \in A) = \int_A f(x)\,dx.$$

The function $f$ is called the density of $X$.

Any density function $f$ is non-negative and satisfies $\int_{\mathbb{R}} f(x)\,dx = 1$.

The density entirely describes the distribution of the random variable.


Continuous distribution II

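Examples of continuous distributions

- Uniform distribution $U[a, b]$ on $[a, b]$:

$$f(x) = \frac{1}{b - a}\, 1_{[a,b]}(x).$$

- Exponential distribution $E(\lambda)$ with parameter $\lambda > 0$:

$$f(x) = \lambda \exp(-\lambda x)\, 1_{\{x \ge 0\}}.$$

- Normal distribution or Gaussian distribution $N(\mu, \sigma^2)$ with parameters $\mu \in \mathbb{R}$, $\sigma^2 > 0$:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$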
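The three densities can be coded directly from their formulas. A minimal sketch in Python (standard library only; the function names, the parameter values and the simple midpoint rule used for integration are assumptions made for illustration), which also checks numerically that each density integrates to about 1:

```python
from math import exp, pi, sqrt


def uniform_density(x, a, b):
    """Density of U[a, b]: constant 1 / (b - a) on [a, b], zero elsewhere."""
    return 1 / (b - a) if a <= x <= b else 0.0


def exponential_density(x, lam):
    """Density of E(lam): lam * exp(-lam * x) for x >= 0, zero elsewhere."""
    return lam * exp(-lam * x) if x >= 0 else 0.0


def normal_density(x, mu, sigma2):
    """Density of N(mu, sigma2), where sigma2 is the variance."""
    return exp(-((x - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)


def integrate(f, lower, upper, steps=200_000):
    """Approximate the integral of f over [lower, upper] by the midpoint rule."""
    h = (upper - lower) / steps
    return h * sum(f(lower + (i + 0.5) * h) for i in range(steps))


# Integration ranges are wide enough that the neglected tails are negligible.
uniform_total = integrate(lambda x: uniform_density(x, -1, 3), -2, 4)
exponential_total = integrate(lambda x: exponential_density(x, 1.0), -2, 50)
normal_total = integrate(lambda x: normal_density(x, 2.0, 1.0), -10, 14)
print(uniform_total, exponential_total, normal_total)
```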


Continuous distribution III

Exponential distribution E(1) with parameter λ = 1

[Figure: two panels plotted against x, one showing the density and one showing the CDF.]


Continuous distribution IV

Normal distribution N(2, 1) with parameters µ = 2 and σ² = 1

[Figure: two panels plotted against x, one showing the density and one showing the CDF.]


Continuous distribution V

Uniform distribution U[−1, 3] on [−1, 3]

[Figure: two panels plotted against x, one showing the density and one showing the CDF.]


Continuous distribution VI

The cumulative distribution function of any continuous distribution is continuous.

We have $F_X(t) = \int_{-\infty}^{t} f(x)\,dx$ for all $t$, and

$$f(t) = F_X'(t) \quad \text{for almost all } t.$$
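Both relations can be checked numerically for the exponential distribution $E(1)$, whose distribution function is $1 - e^{-t}$ for $t \ge 0$. The sketch below is in Python; the function names, the evaluation point `t = 1.5`, the midpoint rule and the finite-difference step are illustrative assumptions.

```python
from math import exp

LAM = 1.0  # parameter of the exponential distribution E(1)


def density(x):
    """Density of E(LAM)."""
    return LAM * exp(-LAM * x) if x >= 0 else 0.0


def cdf(t):
    """Closed-form distribution function of E(LAM)."""
    return 1 - exp(-LAM * t) if t >= 0 else 0.0


def cdf_by_integration(t, lower=-5.0, steps=100_000):
    """Midpoint-rule integral of the density from `lower` to t.

    The density vanishes below 0, so starting at `lower` instead of
    minus infinity loses nothing.
    """
    h = (t - lower) / steps
    return h * sum(density(lower + (i + 0.5) * h) for i in range(steps))


def derivative(f, t, h=1e-6):
    """Central finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)


t = 1.5
print(cdf(t), cdf_by_integration(t))   # integral of the density vs F(t)
print(density(t), derivative(cdf, t))  # f(t) vs numerical derivative of F
```

The derivative check is done away from $t = 0$, where $F$ has a kink; this is one of the exceptional points allowed by "for almost all $t$".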


Continuous distribution VII

There exist random variables which are neither discrete nor continuous!

For instance, $X = \min\{1, Y\}$ where $Y \sim E(1)$ (censored distribution).
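To see why this example is neither discrete nor continuous, here is a short worked computation of its distribution function. It is a sketch that uses only the density of $E(1)$ given earlier; the notation $F_X$ follows the slides.

```latex
\[
F_X(t) = P(\min\{1, Y\} \le t) =
\begin{cases}
0 & \text{if } t < 0, \\
1 - e^{-t} & \text{if } 0 \le t < 1, \\
1 & \text{if } t \ge 1.
\end{cases}
\]
\[
P(X = 1) = P(Y \ge 1) = e^{-1} \approx 0.368 .
\]
```

The jump of height $e^{-1}$ at $t = 1$ means $F_X$ is not continuous, so $X$ has no density. On the other hand, the remaining mass $1 - e^{-1}$ is spread over $[0, 1)$ with no individual point carrying positive probability, so $X$ is not discrete either.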


Statistical modelling I

In statistics, data $x = (x_1, \dots, x_n)$ are considered as a realization of a random vector $X = (X_1, \dots, X_n)$ with distribution $P$.

The distribution $P$ is unknown.

Statistical model

We introduce a family $\mathcal{P}$ of (known) probability distributions and suppose that $P$ belongs to this family, i.e.

$$P \in \mathcal{P}.$$

$\mathcal{P}$ is called a statistical model; it is a set of candidate distributions for $P$.


Statistical modelling II

A statistical model $\mathcal{P}$ is determined by using

- our prior knowledge of the observed phenomenon and
- tools from descriptive statistics.

Any model is false. A model is only an approximation of reality.

A model is always a trade-off between a precise description of a complex reality and mathematical convenience.


Statistical modelling III

Model parameter

- In general, we write $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$, where $\theta$ is the model parameter and $\Theta$ the parameter set.
- Denote by $\theta_0 \in \Theta$ the "true value" of the parameter, such that $P = P_{\theta_0}$.
- The problem of estimating $P$ becomes the problem of estimating the parameter $\theta_0$ from the data.

Identifiability

The model $\mathcal{P}$ is said to be identifiable if and only if

$$\forall \theta, \theta' \in \Theta, \quad P_\theta = P_{\theta'} \implies \theta = \theta'.$$


Statistical modelling IV

Example: Coin tossing

- The data $x = (x_1, \dots, x_n)$ are considered as a realization of the random vector $X = (X_1, \dots, X_n)$ with $X_i \sim B(p)$ i.i.d. and unknown parameter $p \in (0, 1)$.
- In other words, we suppose that the distribution $P$ of $X$ belongs to the family

$$\mathcal{P} = \{B(p)^{\otimes n}, p \in (0, 1)\}.$$

Here, $p$ is the model parameter.


Parameter estimation I

How to estimate θ0 from the data x = (x1, . . . , xn)?

Definition

- Any function $S = S(x)$ defined on the data $x$ is called a statistic. Examples: $S_1(x) = 0$ for all $x$; $S_2(x) = \bar{x}_n$.
- A statistic is called an estimator of $\theta_0$ if it is intended to approximate $\theta_0$.


Parameter estimation II

There are different estimation approaches depending on the size of the parameter set $\Theta$.

- If $\Theta \subset \mathbb{R}^d$ (i.e. if $\theta$ is a $d$-vector) for some $d < \infty$, the model is called parametric.
- If no parametrization of $\mathcal{P}$ exists such that $\Theta$ is of finite dimension, the model is called nonparametric.

Examples of nonparametric models:

- $\mathcal{P}$ = the set of all probability measures
- $\mathcal{P}$ = the set of all absolutely continuous probability measures
- $\mathcal{P}$ = the set of all absolutely continuous probability measures with continuous density
