
Introduction to Statistical Inference

Dr. Fatima Sanchez-Cabo

[email protected]

http://www.genome.tugraz.at

Institute for Genomics and Bioinformatics,

Graz University of Technology,

Austria

Introduction to Probability Theory. March 9, 2006 – p. 1/29


Summary last session

• Motivation
• Algebra of sets
• Definition of probability space:
  • Sample space
  • Sigma algebra
  • Probability axioms
• Conditional probability and independence of events
• Random variables
  • Continuous vs discrete
  • pdf and mass function
  • Distribution function


Part IV:

From Probability to Statistics


Statistical Inference

• The target of statistical inference is to provide some information about the probability distribution P defined over the probability space (Ω, F).
• Unlike the previous examples, where an exhaustive observation was possible, this is often difficult.
• Hence, statistical inference focuses on the analysis and interpretation of the realizations of the random variable in order to draw conclusions about the probability law under study.
• The conclusions can concern:
  • Estimation of a single value for a parameter or parameters essential to the probability distribution (e.g., p for a Binomial r.v.)
  • Estimation of a confidence interval for this parameter or these parameters.
  • Accepting or rejecting a certain hypothesis about the probability distribution of interest.


Statistical Inference

• In general, there is some knowledge about the probability distribution explaining a certain random process.
• The inference process often involves deciding, by looking at the available data, which distribution among a family {F_θ : θ ∈ Θ} best explains those data. This is known as parametric statistics.

Example 4.1: We are betting with a friend on "head" or "tail" when tossing a coin. We want to know if the coin is unbiased. The most logical approach is to toss the coin "enough" times to get an approximate value for the probability of head.


Choosing a random sample

• The first step in this process is to choose a sample that is representative of the whole population (what is "enough" in the previous example?)
• We should also take the cost and time into account.
• Random sampling is one way to obtain samples from a population: all individuals have the same probability of being chosen. This type of sampling is particularly useful because the observations are then independent of each other; hence $F(x_1, \dots, x_n) = \prod_{i} F(x_i)$
  • Sampling with replacement (some individuals might be repeated in the sample)
  • Sampling without replacement (each individual can appear only once in the sample)
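The two sampling schemes can be sketched with Python's standard `random` module. The population of 20 labelled individuals, the sample size of 8 and the fixed seed are illustrative assumptions, not values from the slides.

```python
import random

def draw_samples(population, k, seed=0):
    """Draw k individuals with and without replacement from the population."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    with_replacement = rng.choices(population, k=k)   # individuals may repeat
    without_replacement = rng.sample(population, k)   # each individual at most once
    return with_replacement, without_replacement

population = list(range(1, 21))  # hypothetical population of 20 labelled individuals
with_rep, without_rep = draw_samples(population, k=8)
print("with replacement:   ", with_rep)
print("without replacement:", without_rep)
```

In both cases every individual has the same chance of being selected at each draw; only the second scheme rules out repeated individuals.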


Combinatorics

• We can define the probability of an event A as:

$P(A) = \frac{\text{number of favourable cases}}{\text{number of possible cases}}$

• The number of possible samples of r elements chosen without replacement among n possible elements is $C(n, r) = \frac{n!}{(n-r)!\,r!}$
• The number of possible ways to create a sample of size n using r < n elements, if repetition is allowed, is $r^n$

Examples:

1. An urn contains 5 red, 3 green, 2 blue and 4 white balls. A sample of size 8 is selected at random without replacement. Calculate the probability that the sample contains 2 red, 2 green, 1 blue, and 3 white balls.

2. Consider a class with 30 students. Calculate the probability that all 30 birthdays are different.
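One possible way to work the two examples is to count favourable and possible cases directly. The sketch below assumes, for the birthday example, 365 equally likely birthdays and no leap years; the function names are illustrative.

```python
from math import comb, prod

def urn_probability():
    """P(2 red, 2 green, 1 blue, 3 white) when drawing 8 of 14 balls without replacement."""
    counts = {"red": 5, "green": 3, "blue": 2, "white": 4}
    wanted = {"red": 2, "green": 2, "blue": 1, "white": 3}
    # favourable cases: choose the wanted number of balls within each colour
    favourable = prod(comb(counts[c], wanted[c]) for c in counts)
    # possible cases: any 8 of the 14 balls
    possible = comb(sum(counts.values()), sum(wanted.values()))
    return favourable / possible

def all_birthdays_different(n_students=30, days=365):
    """P(all birthdays differ), assuming equally likely and independent birthdays."""
    return prod((days - i) / days for i in range(n_students))

print(f"urn example:      {urn_probability():.4f}")
print(f"birthday example: {all_birthdays_different():.4f}")
```

Under these assumptions the urn probability is 240/3003, roughly 0.08, and the probability that all 30 birthdays differ is roughly 0.29.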


Example 4.2

We are betting with a friend on "head" or "tail" when tossing a coin. We want to know the probability of getting "head" so that we can decide what to bet on. Hence, we have the random variable

$X = \begin{cases} 0 & \text{Head} \\ 1 & \text{Tail} \end{cases}$

with P(X = 0) = θ, P(X = 1) = 1 − θ. Given a random sample (X1, X2, X3), determine the mass function (the probability of each possible triplet). Does this sample have a known mass function?
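A minimal sketch of the requested mass function, assuming the three tosses are independent: each triplet then has probability θ raised to the number of heads times (1 − θ) raised to the number of tails, and the number of heads follows a Binomial(3, θ) distribution. The value θ = 0.5 is only an illustrative choice.

```python
from itertools import product

def triplet_mass_function(theta):
    """Return {(x1, x2, x3): probability} for three independent tosses."""
    pmf = {}
    for triplet in product((0, 1), repeat=3):
        heads = triplet.count(0)   # X = 0 codes "head"
        tails = 3 - heads          # X = 1 codes "tail"
        pmf[triplet] = theta ** heads * (1 - theta) ** tails
    return pmf

pmf = triplet_mass_function(theta=0.5)  # theta = 0.5 is an assumed, illustrative value
for triplet, prob in pmf.items():
    print(triplet, prob)
```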


Descriptive statistics

• Given a very large or very complex sample, it might not be possible to determine the distribution easily.
• It is often useful to use some tools that help to understand the target distribution and the main characteristics of the sample data:

                    Probability    Statistic
Base                Population     Sample
Central tendency    Expectation    Mean, Median
Dispersion          Variance       Sample variance, IQR
                                   Mode


Descriptive statistics

Given a sample (x1, . . . , xn) we can calculate the following expressions to get a rough idea of the properties of the sample.

1. Parametric
• Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Sample variance: $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• Sample quasi-variance: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

2. Non-parametric (order statistics, where $x_{[k]}$ denotes the k-th smallest observation)
• Median: $x_{[(n+1)/2]}$ if n is odd; $\frac{x_{[n/2]} + x_{[n/2]+1}}{2}$ if n is even
• Interquartile range: IQR = q3 − q1
• Mode: most repeated value
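The expressions above translate almost line by line into code. The following is a sketch using only the standard library; the data set is hypothetical, and when several values tie for the mode this version simply returns the first one encountered.

```python
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Divides by n."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def quasi_variance(xs):
    """Divides by n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                   # x_[(n+1)/2] in 1-based notation
    return (s[mid - 1] + s[mid]) / 2    # average of x_[n/2] and x_[n/2]+1

def mode(xs):
    return Counter(xs).most_common(1)[0][0]

data = [2, 4, 4, 5, 7, 9]  # hypothetical sample
print(mean(data), sample_variance(data), quasi_variance(data))
print(median(data), mode(data))
```

The interquartile range is left out here because it needs a percentile rule to obtain q1 and q3.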


Descriptive statistics

[Figure: (a) Histogram of x (horizontal axis: x, vertical axis: Frequency); (b) Boxplot of x]


Descriptive statistics

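Given x1, . . . , xn a sample of size n, sorted in increasing order, calculate the p-th percentile:

1. $R = \mathrm{int}\left(\frac{p}{100}(n+1)\right)$; $FR = \frac{p}{100}(n+1) - \mathrm{int}\left(\frac{p}{100}(n+1)\right)$

2. Find $x_{[R]}$ and $x_{[R]+1}$

3. The p-th percentile is $x_{[R]} + (x_{[R]+1} - x_{[R]}) \cdot FR$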
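The three steps can be sketched as a short function. The slides do not say what to do when the rank R falls below 1 or reaches the sample size, so clamping to the smallest or largest observation is an assumption of this sketch; the data set is hypothetical.

```python
def percentile(xs, p):
    """p-th percentile (0 < p < 100) using the rank rule p/100 * (size + 1)."""
    s = sorted(xs)
    size = len(s)
    rank = p / 100 * (size + 1)
    r = int(rank)      # integer part R
    fr = rank - r      # fractional part FR
    if r < 1:          # assumption: clamp to the smallest observation
        return s[0]
    if r >= size:      # assumption: clamp to the largest observation
        return s[-1]
    # s[r - 1] is x_[R] and s[r] is x_[R]+1 in the 1-based notation of the text
    return s[r - 1] + (s[r] - s[r - 1]) * fr

data = [2, 4, 4, 5, 7, 9]  # hypothetical sample
q1 = percentile(data, 25)
q3 = percentile(data, 75)
print("q1 =", q1, "q3 =", q3, "IQR =", q3 - q1)
```

Other percentile conventions exist (statistical packages offer several), so results may differ slightly from a library call on the same data.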


Estimators and statistics

• Definition: A statistic is a function T : (Ω, F) → (R^k, B^k), i.e. T(x1, . . . , xn), over which a probability function can be defined.
• k is the dimension of the statistic.
• An estimator is a particular case of a statistic that approximates one of the unknown parameters of the distribution.
• Examples:
  • $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • $\hat{\theta}$ in Example 4.2.
• An estimator is itself a random variable: its moments can be calculated.
• An estimator is said to be unbiased if $E[\hat{\theta}] = \theta$


Example 4.3

• Given (X1, . . . , Xn) a random sample such that E[Xi] = µ and V[Xi] = σ², show that for $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ we have $E[\bar{X}] = \mu$ and $V[\bar{X}] = \sigma^2/n$.

• Find an unbiased estimator for σ².

Note: The standard deviation of an estimator (the square root of its variance) is its standard error (SE).
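One way to carry out the exercise is sketched below. It assumes the Xi are independent, as they are in a random sample, and uses the identity $\sum_i (X_i - \bar{X})^2 = \sum_i X_i^2 - n\bar{X}^2$ together with $E[Y^2] = V[Y] + E[Y]^2$.

```latex
\begin{align*}
E[\bar{X}] &= E\Big[\frac{1}{n}\sum_{i=1}^{n} X_i\Big]
            = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n\mu = \mu, \\
V[\bar{X}] &= \frac{1}{n^2}\sum_{i=1}^{n} V[X_i]
            = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
            \quad \text{(using independence of the } X_i\text{)}, \\
E\Big[\sum_{i=1}^{n}(X_i-\bar{X})^2\Big]
           &= \sum_{i=1}^{n} E[X_i^2] - n\,E[\bar{X}^2]
            = n(\sigma^2+\mu^2) - n\Big(\frac{\sigma^2}{n}+\mu^2\Big)
            = (n-1)\,\sigma^2 .
\end{align*}
```

Dividing the last line by n − 1 suggests that the quasi-variance $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ is an unbiased estimator of σ², whereas the version that divides by n has expectation $\frac{n-1}{n}\sigma^2$.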


Distribution of the statistics

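• Since statistics are functions of random variables, they are themselves random variables for which a probability distribution can be defined.
• Particularly important are the distributions of statistics based on normally distributed random variables.
  • For a normal sample, $\bar{X} \equiv N(\mu, \sigma^2/n)$ (its mean and variance were obtained in Example 4.3).
  • By the Central Limit Theorem, $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \equiv N(0, 1)$ approximately for large n, even when the Xi are not normal.
  • If σ is unknown and we substitute it by its natural estimator S (the square root of the quasi-variance S²), then

  $t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$

  follows the Student-t distribution with n − 1 degrees of freedom.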


Student’s t


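χ² and F-Snedecor

Other important distributions of statistics from normally distributed random variables are:

• $\chi^2_n = \sum_{i=1}^{n} X_i^2$ when the $X_i \equiv N(0, 1)$ are independent
• $F_{n,m} = \frac{\chi^2_n / n}{\chi^2_m / m}$, with independent numerator and denominator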
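Both definitions can be checked by simulation. The sketch below builds χ² draws with 10 degrees of freedom and F draws with (5, 10) degrees of freedom from standard normal variates, then compares the simulated means with the theoretical values E[χ²_n] = n and E[F_{n,m}] = m/(m − 2) for m > 2. The number of draws and the seed are arbitrary choices.

```python
import random

def chi_square_draw(rng, df):
    """One chi-square variate: the sum of df squared independent N(0, 1) variates."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

def simulate(n_draws=20000, seed=1):
    rng = random.Random(seed)
    chi2 = [chi_square_draw(rng, 10) for _ in range(n_draws)]
    # F(5, 10): ratio of two independent chi-squares, each divided by its degrees of freedom
    f = [(chi_square_draw(rng, 5) / 5) / (chi_square_draw(rng, 10) / 10)
         for _ in range(n_draws)]
    return chi2, f

chi2_draws, f_draws = simulate()
chi2_mean = sum(chi2_draws) / len(chi2_draws)
f_mean = sum(f_draws) / len(f_draws)
print(f"chi-square(10): simulated mean {chi2_mean:.2f}, theory 10")
print(f"F(5, 10):       simulated mean {f_mean:.2f}, theory 1.25")
```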

[Figure: (c) χ² density, Chi-square (d.f. = 10); (d) F-Snedecor density, F(5,10). Horizontal axis: x, vertical axis: Density]


Part V:

Hypothesis testing


Motivation

• There are situations in which we want to test whether a certain hypothesis about the parameter of interest is true.

• Is the probability of "head" the same as the probability of "tail" when flipping the coin of the previous experiment?

[Figure 1: The Scientific Method. Flowchart: Question → Formulate a hypothesis → Experiment → either "Results support the hypothesis" or "Results DON'T support the hypothesis"]


Main concepts

• A statistical null hypothesis H0 is the hypothesis that we want to test.
• Statistical tools provide the appropriate framework to limit the probability that its rejection is due to chance rather than to real mechanisms. Its rejection provides what is called "statistical significance".
• We never talk about "accepting" H0: with the available data we can only say whether or not we reject H0.


Error types

H0 is really...

Decision | True | False
Accept H0 | OK! A true hypothesis has been accepted. | Error! A false hypothesis has been accepted. This is a Type II error. The probability of this is β.
Reject H0 | Error! A true hypothesis has been rejected. This is a Type I error. The probability of this occurring is α. | OK! A false hypothesis has been rejected.


Error types

• The ideal situation would be to minimize the probability of both errors (α, β), where ETI and ETII denote the Type I and Type II errors:

$\alpha = P(\text{Reject } H_0 \mid H_0 \text{ is true}) = P(\text{ETI})$

$\beta = P(\text{Accept } H_0 \mid H_0 \text{ is false}) = P(\text{ETII})$

• However, they are related and cannot be minimized at the same time.

• Definition: The p-value is the probability, computed assuming that the null hypothesis is true, of obtaining a value of the test statistic at least as extreme as the one observed. If the p-value is smaller than the allowed α-level, H0 is rejected (always with the available data!)
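The meaning of α can be illustrated by simulation: generate many samples for which H0 is true and count how often the test rejects. To keep the sketch dependency-free it uses a two-sided z-test with known σ and the normal critical value 1.96 (α = 0.05), which is a simplification rather than the t-test developed later in the slides. The sample size, µ0, σ and seed are all assumed values.

```python
import random
from math import sqrt

def type_one_error_rate(n=20, mu0=3.0, sigma=1.0, z_crit=1.96,
                        n_experiments=20000, seed=2):
    """Fraction of experiments in which a true H0: mu = mu0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        # H0 is true by construction: the data really have mean mu0
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        z = (xbar - mu0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_experiments

rate = type_one_error_rate()
print(f"observed Type I error rate: {rate:.3f} (nominal alpha = 0.05)")
```

The observed rejection rate should be close to 0.05: even when H0 is true, about 5 in 100 samples lead to a rejection purely by chance.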


Protocol for statistically testing a null hypothesis

1. Question that we want to address (careful: statistical tests can only reject the null hypothesis, H0)

2. Experimental design: How many repetitions do we need? How are we going to select a sample that represents the population? What are the sources of variation? How do we remove the systematic errors?

3. Data collection

4. Removal of systematic errors

5. Which statistic best summarizes my sample? Does it have a known distribution? (statistic: numerical variable that summarizes my sample data in a meaningful way)


Example (5)


Simple hypothesis testing

1. We want to test if a particular sample comes from a distribution with a particular mean value, let's say 3. It is quite logical to estimate how close the estimator of the population mean is to the proposed value. So we will calculate:

$(\bar{x} - 3)$

2. To account for the intrinsic variability of the sample mean, we divide this difference by the standard error (the estimated standard deviation of the sample mean). We then have the statistic:

$t = \frac{\bar{x} - 3}{s/\sqrt{n}}$

where s is the square root of the sample quasi-variance.

3. By the definition of the Student-t distribution, if H0 is true (H0 : µ = 3) then $t \equiv t_{n-1}$, where n is the sample size. This makes sense because the density of the Student-t distribution is highest around t = 0, and this is exactly what we expect under H0, because:

$\bar{x} \simeq 3 \rightarrow \bar{x} - 3 \simeq 0 \rightarrow t \simeq 0$


Simple hypothesis testing (cont.)

• α was defined as:

$1 - \alpha = P(|t| \le t^*) \rightarrow \alpha = P(|t| > t^*)$

where $t^* = t_{n-1;\,\alpha/2}$, and α is also defined as:

$\alpha = P(\text{Reject } H_0 \mid H_0 \text{ true}) = P(\text{ETI})$

• In this particular example, if H0 holds, the value calculated for the statistic |t| should be close to 0. Otherwise, we reject H0 (this would suggest that µ ≠ 3). But how big must the difference ($\bar{x}$ − 3) be? To decide, we fix the value of α such that we reject a true H0 in only 5 out of 100 samples that we take. For this particular test, this α determines a unique value, called the critical value:

$t^* = t_{n-1;\,\alpha/2}$


• If |t| > t*, we reject H0, because if H0 were true (µ = 3) the probability of observing a value this far from 0 would be very small (< 0.05). In only 5 out of 100 samples (which we attribute to chance) will the value of the statistic be this far from 0 although H0 is true. This condition is equivalent to:

$p = P(|t_{n-1}| > |t|) < \alpha = 0.05$

• If |t| < t*, the result obtained is what we would expect for a statistic following a t-distribution with n − 1 degrees of freedom, as is the case if H0 is true. We then do not have enough evidence to reject H0. This is equivalent to saying that:

$p = P(|t_{n-1}| > |t|) > \alpha = 0.05$
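The whole test can be sketched in a few lines. The data below are hypothetical (n = 11, so n − 1 = 10 degrees of freedom), the critical value 2.228 is the tabulated $t_{10;\,0.025}$, and the two-sided p-value is obtained by numerically integrating the Student-t density with Simpson's rule, a choice made here only to avoid external libraries.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of the Student-t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_obs, df, steps=2000):
    """P(|T| > |t_obs|): one minus the density integrated over [-|t_obs|, |t_obs|]."""
    a = abs(t_obs)
    if a == 0:
        return 1.0
    h = 2 * a / steps                  # steps must be even for Simpson's rule
    total = t_pdf(-a, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(-a + i * h, df)
    return 1 - total * h / 3

def one_sample_t(sample, mu0):
    """Return the t statistic and two-sided p-value for H0: mu = mu0."""
    n = len(sample)
    xbar = sum(sample) / n
    s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))  # root of the quasi-variance
    t = (xbar - mu0) / (s / sqrt(n))
    return t, two_sided_p_value(t, n - 1)

sample = [3.1, 2.7, 3.6, 3.3, 2.9, 3.8, 3.4, 3.0, 3.5, 3.2, 3.7]  # hypothetical data
t, p = one_sample_t(sample, mu0=3.0)
t_star = 2.228  # tabulated critical value for 10 degrees of freedom, alpha = 0.05
print(f"t = {t:.3f}, p = {p:.4f}, reject H0: {abs(t) > t_star}")
```

Comparing |t| with t* and comparing p with α lead to the same decision, which is the equivalence stated in the two bullets above.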


[Figure: t distribution (vertical axis: Density). The critical values −t* and t* = 2.22 are marked; each tail has area α/2 = 0.025, for a total α = 0.05]


