Introduction to Probability and Statistics

Xi Kathy Zhou, PhD
Division of Biostatistics and Epidemiology
Department of Public Health
http://www.med.cornell.edu/public.health/biostat.htm
Feb. 2010
Overview

Statistics:
"The mathematics of the collection, organization and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling" – definition in the American Heritage Dictionary

Why statistics:
Through studying the characteristics of a small collection of observations, proper inference for the entire population can be derived.

Probability theory consists of a set of mathematical tools useful for statistical inference.
Outline

Basic concepts in probability
  Events and random variables
  Probability and probability distributions
  Mean, variance and moments
  Joint, marginal and conditional probabilities
  Dependence and independence

Basic concepts in statistics
  Data
  Descriptive statistics
  Statistical inference – Estimation
  Statistical inference – Hypothesis testing
Probability – a measure of uncertainty

Example:
  Random experiment     Possible outcomes
  Toss a coin           H, T
  Roll a 6-sided die    {3}, {5}, {1,2,3}, …

How to describe these experiments and quantify their outcomes? How to evaluate the uncertainties associated with these outcomes, or even more complicated outcomes, or outcomes from more complicated random experiments?

Probability theory provides tools for describing and quantifying outcomes from random experiments.
Events

Definitions:
  Random experiment: an experiment which can result in different outcomes, and for which the outcome is unknown in advance.
  Sample space Ω: the set of all possible outcomes of an experiment.
  Event: a subset of the sample space Ω.

  Random experiment     Sample space       Events
  Toss a coin           {H, T}             {H}, {T}
  Roll a 6-sided die    {1,2,3,4,5,6}      {3}, {5}, {1,2,3}
Probability measure

Sigma field F: a set of subsets of Ω that satisfies the following:
  1. If A, B ∈ F, then A ∪ B ∈ F and A ∩ B ∈ F
  2. If A ∈ F, then A^c ∈ F
  3. Ø ∈ F

Probability measure P on (Ω, F): a function P: F → [0,1] satisfying the following properties (Axioms of Probability):
  1. P(A) ≥ 0 for any A ∈ F
  2. P(Ω) = 1
  3. If A, B ∈ F and A ∩ B = Ø, then P(A ∪ B) = P(A) + P(B)

The 6-sided die example:
  Sigma field: {Ø, {1}, …, {6}, {1,2}, …, {1,2,3,4,5,6}}
  Sigma field: {Ø, {1,2,3}, {4,5,6}, {1,2,3,4,5,6}}
Probability measure – some properties

Comparing the uncertainty of events:
  4. If A ⊆ Ω, B ⊆ Ω, and A ⊆ B, then P(A) ≤ P(B)

Assessing the uncertainty associated with other events:
  5. P(A^c) = 1 − P(A), where A^c = Ω − A
  6. P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any A, B ∈ Ω
  7. P(A1 ∪ … ∪ Ak ∪ …) = P(A1) + … + P(Ak) + … for pairwise disjoint A1, …, Ak, …

Illustration of rule 6: [Venn diagram of Ω showing the regions A−B, A∩B and B−A for two events A and B]

Example (rolling a 6-sided die):
If we know P(1), …, P(6), we should know the uncertainty of more complicated events such as P({1,2,4}).
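As a quick numerical check of rules 5 and 6 on the fair-die example, here is a minimal Python sketch (not part of the original slides; the event sets anticipate the examples used later in the deck):

```python
from fractions import Fraction

# Fair 6-sided die: P({i}) = 1/6 for each face i.
pmf = {i: Fraction(1, 6) for i in range(1, 7)}

def P(event):
    """Probability of an event, i.e. a subset of the sample space."""
    return sum(pmf[w] for w in event)

A = {2, 4, 6}       # outcome is even
B = {1, 2, 3, 4}    # outcome <= 4

# Rule 5 (complement): P(A^c) = 1 - P(A)
assert P(set(pmf) - A) == 1 - P(A)

# Rule 6 (inclusion-exclusion): P(A u B) = P(A) + P(B) - P(A n B)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))     # 5/6
```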
Probabilities of events – Examples

Experiment: randomly generating a DNA sequence of length 3
  Event E: the generated sequence is "ATG"
  P(E) = 1/4^3 = 1/64

Experiment: randomly generating a DNA sequence with 100 bases among which 20 bases are A's
  Event E: the sequence has 20 A's in a row
  What is the sample space, what is the probability of event E?
  Answer: the 20 A positions can be placed among the 100 bases in C(100, 20) equally likely ways, and a run of 20 A's can start at any of 81 positions, so P(E) = 81 / C(100, 20).
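Both computations can be checked in a few lines of Python (a sketch, not from the slides; `comb` is the standard-library binomial coefficient):

```python
from math import comb

# Length-3 sequence: 4 equally likely bases per position, independent.
p_atg = (1 / 4) ** 3
print(p_atg)                 # 0.015625 = 1/64

# 100 bases with exactly 20 A's: C(100, 20) equally likely placements
# of the A positions, of which 81 put all 20 A's in a consecutive run.
p_run = 81 / comb(100, 20)
print(p_run)                 # ~1.5e-19
```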
Conditional probability and independence

Conditional probability: quantifies the probability of an event given the happening of another event. The conditional probability of A given B is defined as
  P(·|B): Ω → [0,1],  P(A|B) = P(A ∩ B) / P(B),  for A, B ∈ Ω

Relationship of two events
Independence: two events A, B ∈ Ω with P(A) > 0, P(B) > 0 are called (stochastically) independent if one of the following equivalent conditions holds:
  • P(A ∩ B) = P(A)·P(B)
  • P(A|B) = P(A)
  • P(B|A) = P(B)

EXAMPLE: Throwing a 6-sided fair die, we consider the following events:
  A: outcome is an even number, B: outcome ≤ 4, C: outcome is a prime number, D: outcome is an odd number
Questions:
  1. If we know that event B happened in the experiment, what is the probability that event A also happened? (Think about P(A) = ?, P(B) = ?, P(A|B) = ?)
  2. Are A and B independent? How about B and C? How about A and C?
  3. Are A and D independent? How about D and B? How about D and C?
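The questions above can be answered mechanically; here is a minimal Python sketch (not part of the slides) that computes P(A|B) and checks each pair for independence via P(E ∩ F) = P(E)·P(F):

```python
from fractions import Fraction

def P(event):
    # Fair die: each of the 6 faces has probability 1/6.
    return Fraction(len(event), 6)

A = {2, 4, 6}       # even
B = {1, 2, 3, 4}    # <= 4
C = {2, 3, 5}       # prime
D = {1, 3, 5}       # odd

print(P(A & B) / P(B))      # P(A|B) = 1/2 = P(A)

pairs = {"A,B": (A, B), "B,C": (B, C), "A,C": (A, C),
         "A,D": (A, D), "D,B": (D, B), "D,C": (D, C)}
for name, (E, F) in pairs.items():
    print(name, P(E & F) == P(E) * P(F))
```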
Random variable

Random variable: a function X: Ω → R with the property that {ω ∈ Ω : X(ω) ≤ x} ∈ F for each x ∈ R.

  A more common description of the results of a random experiment.
  Takes on values from a set of mutually exclusive and collectively exhaustive states and represents each state with a number.
  Usually denoted by capital letters, e.g. X, Y, Z, etc.
  Realizations of random variables are usually denoted in lower case, e.g., x, y, z, etc.
  Can be discrete or continuous.
Probability distributions

Uncertainties associated with possible values of a random variable are characterized by the random variable's probability distribution.

Definition:
  The probability distribution of a random variable X is the function F: R → [0,1] given by F(x) = P(X ≤ x)
    Takes values between 0 and 1
    Follows the axioms of probability

Uncertainty associated with two (or more) random variables is characterized by their joint distribution function F(x, y) = P(X ≤ x, Y ≤ y).
Discrete Random Variable – Probability mass function

A discrete random variable X has a countable number of possible values x1, x2, …, xk, …

The probability mass function of X is
  P(X = xi) = pi,  i = 1, 2, …, k, …
where the probabilities pi satisfy
  0 ≤ pi ≤ 1
  p1 + p2 + … + pk + … = 1

Range of this random variable: x1, x2, …, xk, …
Discrete Random Variable – Cumulative distribution function (cdf)

The cumulative distribution function of a discrete random variable X is defined as
  F(x) = P(X ≤ x) = Σ_{i: xi ≤ x} pi

Properties of the cdf:
  0 ≤ F(x) ≤ 1
  If x ≤ y then F(x) ≤ F(y), i.e. non-decreasing
  Discrete case: step function, continuous from the right, with jump discontinuities at x1, x2, …, xk, … with heights p1, p2, …, pk, …
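The cdf of the fair die, for example, is the step function built below (a sketch, not from the slides):

```python
from fractions import Fraction
from bisect import bisect_right

xs = [1, 2, 3, 4, 5, 6]            # support points x_i
ps = [Fraction(1, 6)] * 6          # probabilities p_i

def cdf(x):
    """F(x) = sum of p_i over all x_i <= x: a right-continuous step function."""
    k = bisect_right(xs, x)        # number of support points <= x
    return sum(ps[:k], Fraction(0))

print(cdf(0), cdf(3), cdf(3.5), cdf(6))   # 0, 1/2, 1/2, 1
```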
Discrete random variables – Examples of common distributions

Discrete uniform distribution
  Random variable: outcome from rolling a fair 6-sided die
Geometric distribution
  Random variable: the number of Bernoulli trials taken till the first success

Discrete random variable – Discrete uniform distribution

A discrete random variable X is called uniformly distributed on the range x1, x2, …, xk if for all i = 1, …, k:
  P(X = xi) = 1/k

Example: Roll a fair die
  Probability mass function: P(X = i) = 1/6 for i = 1, …, 6
Discrete Random Variable – Geometric distribution (1)

Random experiment (repeat a Bernoulli experiment until the first success), e.g. tossing a coin till the first appearance of an H
  Events: H, TH, TTH, …
  Probability for a success in a single trial: P(H) = π
  Random variable X: "Number of trials until the first success", taking values 1, 2, …
  X has a geometric distribution with parameter π

The probability mass function has the form
  P(X = k) = (1 − π)^(k−1) π,  k = 1, 2, …
The cumulative distribution function has the form
  F(k) = P(X ≤ k) = 1 − (1 − π)^k

Discrete random variable – Geometric distribution (2)
[Figure: probability mass function of the geometric distribution]

Discrete Random Variable – Geometric distribution (3)
[Figure: cumulative distribution function of the geometric distribution]
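Since the plotted pmf and cdf did not survive extraction, here is a small Python sketch (not from the slides) that reproduces them numerically and checks the formulas by simulation:

```python
import random

PI = 0.5   # success probability per trial, e.g. a fair coin

def pmf(k, pi=PI):
    return (1 - pi) ** (k - 1) * pi      # P(X = k)

def cdf(k, pi=PI):
    return 1 - (1 - pi) ** k             # F(k) = P(X <= k)

# Monte Carlo check: count trials until the first success.
random.seed(1)
n = 100_000
draws = []
for _ in range(n):
    k = 1
    while random.random() > PI:
        k += 1
    draws.append(k)

print(pmf(3), sum(d == 3 for d in draws) / n)   # both ~0.125
print(cdf(3), sum(d <= 3 for d in draws) / n)   # both ~0.875
```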
Discrete random variable – Mean

The value you expect to get in a random experiment is the mean, i.e., the average value of a random experiment.

Example: If you toss a coin 10 times, you expect to get 5 heads and 5 tails. You expect this value because the probability of getting "heads" is 0.5, and if you toss 10 times you should get 5.

Definition: The mean of a discrete random variable with values x1, x2, …, xk, … and probability distribution p1, p2, …, pk, … is
  E(X) = Σ_i xi pi

Note that E(X) characterizes the expected outcome from a random experiment.

Discrete random variable – Mean (Example)

Binary random variable X:
  Assume P(X=1) = π and P(X=0) = 1 − π, then
  E(X) = 0·P(X=0) + 1·P(X=1) = π

Toss a fair coin:
  X = gain/loss of one dollar, X(H) = 1, X(T) = −1
  If P(X=1) = P(X=−1), E(X) = ?

Roll a fair 6-sided die:
  Once: X = value of the landing, E(X) = ?
  Twice: X = sum of the values, E(X) = ?
Discrete random variable – Variance and Standard deviation

The variance of a discrete random variable is
  Var(X) = Σ_i (xi − E(X))² pi
The standard deviation is
  σ = √Var(X)

Discrete random variable – Variance (Examples)

Binary random variable: Var(X) = π(1 − π)
  Proof: Var(X) = E(X²) − E(X)² = π − π² = π(1 − π)

Roll a fair die once: X is the value at the landing
  Var(X) = ?
Roll a fair die twice: X is the sum of the values
  Var(X) = ?
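The die questions on the last two slides can be answered by direct computation (a sketch, not from the slides):

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    return sum(p * x for x, p in dist.items())

def var(dist):
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist.items())

# One fair die
die = {x: Fraction(1, 6) for x in range(1, 7)}
print(mean(die), var(die))       # 7/2 and 35/12

# Sum of two independent fair dice
two = {}
for a, b in product(range(1, 7), repeat=2):
    two[a + b] = two.get(a + b, Fraction(0)) + Fraction(1, 36)
print(mean(two), var(two))       # 7 and 35/6 (twice the one-die variance)
```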
Discrete random variables – Independence

Definition: Two discrete random variables X, Y are called independent if the events X=x and Y=y are independent for all x and y, i.e.
  P(X=x, Y=y) = P(X=x)·P(Y=y)

More generally: n discrete random variables X1, X2, …, Xn are called independent if for arbitrary values x1, x2, …, xn in their respective ranges the following is true:
  P(X1=x1, …, Xn=xn) = P(X1=x1)···P(Xn=xn)
Discrete Random variable – Independence (Example)

Random experiment: Roll two dice; X = value of the first die, Y = value of the second die
  For all 1 ≤ i, j ≤ 6:
  P(X=i, Y=j) = 1/36 = 1/6 × 1/6 = P(X=i)·P(Y=j)

Random experiment: Roll a die
  Y = 1 if the value is a "prime number"
  Z = 1 if the value is "smaller than four"
  Are these two events independent? No.
  Because Y=1 and Z=1 means "2 or 3",
  so P(Y=1, Z=1) = 2/6 ≠ 1/2 · 1/2 = P(Y=1)·P(Z=1).
  Or equivalently: is P(Y|Z) = P(Y)?
Example: DNA sequence analysis (1)

• 70 positions of the hemoglobin alpha gene in humans and rats:

HUM ...ACGTCAAGGCCGCCTGGGGCAAGGTTGGCGCGCACGGCGAGTATGGTGCGGAGGCCCTGGAGAATGTTCC...
RAT ...ATGTAAGCCCCGGCTCTGCCCAGGTCAAGGCTCACGGCAAGAAGGTTGCTGATGCCCTGGCCAAAGCTGC...
    1011010001111110010101111000011011111101101010111011011111110011010101
    (match indicator; the alignment spans a mutated region and a conserved region)

• Empirical observation: 45 match positions and 25 mismatch positions
• Are both sequences related?
• Compare the result with a random experiment with independently generated sequences (null model), i.e. both sequences are dissimilar enough that such a model describes the observation well enough.
Example: DNA sequence analysis (2)

HUM ...ACGTCAAGGCCGCCTGGGGCAAGGTTGGCGCGCACGGCGAGTATGGTGCGGAGGCCCTGGAGAATGTTCC...
RAT ...ATGTAAGCCCCGGCTCTGCCCAGGTCAAGGCTCACGGCAAGAAGGTTGCTGATGCCCTGGCCAAAGCTGC...
    1011010001111110010101111000011011111101101010111011011111110011010101

• Define a random variable Zi with
  Zi = 1 if we observe the same nucleotide in both sequences at position i, 0 else
• How likely is it to observe z (= 45) or more match positions?
• This depends on the evolutionary process, but we can try to model this process with probability models.
Example: Model 1

• We assume that both sequences are independently identically distributed (iid) sequences and the two sequences are independent from each other.
• Moreover, we assume that all nucleotides have the same probability to occur, i.e. pA = pC = pG = pT = 0.25.
• Then P(Zi = 1) = 0.25·0.25·4 = 0.25, and Z = "number of matches" with Z = Z1 + … + Zn is B(n, π) with π = 0.25.
• Now it is possible to compute
  P(Z ≥ 45) = Σ_{z=45}^{70} C(70, z) 0.25^z 0.75^(70−z) ≈ 4.78 × 10^(−12)
i.e. 45 or more matches are very unlikely under the assumption of unrelated sequences.
Example: Model 2

• Now we assume that the sequence of base pairs is iid. The base pairs (xi, yi), i = 1, …, n are no longer independent realisations of two random experiments but follow an evolutionary process.
• We still assume that the observations of nucleotides in sequence 1 are uniformly distributed with (pA, pC, pG, pT) = (0.25, 0.25, 0.25, 0.25).
• But the observations in sequence 2 depend on the observations in sequence 1:
  pA|A = P(A in sequence 2 | A in sequence 1) = pC|C = pG|G = pT|T = 0.64
  pA|C = pA|G = pA|T = pC|A = pC|G = pC|T = pG|A = pG|C = pG|T = pT|A = pT|C = pT|G = 0.12
Example: Model 2 (continued)

• The total probability theorem gives us the distribution of sequence 2:
  qA = qC = qG = qT = 0.25
• Both sequences have the same (marginal) distribution but they are not independent, e.g.
  pAA = P("an A in both sequences") = pA|A · pA = 0.64·0.25 = 0.16
• Under this model assumption we get for the random variable Zi the probability distribution
  P(Zi = 1) = 4·0.16 = 0.64
and Z = "number of matches" is binomially distributed B(n, π) with π = 0.64.
• We get
  P(Z ≥ 45) = 1 − P(Z ≤ 44) ≈ 0.54
• Now the observation is much more likely; in fact it is above the expected value of Z:
  E(Z) = nπ = 70·0.64 = 44.8
• The observed number of matches is much more likely under the evolutionary model!
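Both tail probabilities can be reproduced with a short computation (a sketch using only the standard library; `binom_tail` is a helper name introduced here, not from the slides):

```python
from math import comb

def binom_tail(n, k, p):
    """P(Z >= k) for Z ~ B(n, p)."""
    return sum(comb(n, z) * p**z * (1 - p)**(n - z) for z in range(k, n + 1))

print(binom_tail(70, 45, 0.25))   # Model 1: ~4.78e-12
print(binom_tail(70, 45, 0.64))   # Model 2: ~0.54
```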
Continuous Random Variables

Continuous random variable – Probability distribution

Definition. If X: Ω → IR is a random variable, the function
  F: IR → IR,  F(x) = P(X ≤ x)
is called the distribution function of X.

If X is a continuous random variable with density f, the distribution function F can be expressed as
  F(x) = ∫_{−∞}^{x} f(t) dt

This formula is the continuous analogue of the discrete case, in which the distribution function was defined as
  F(x) = Σ_{xj ≤ x} f(xj)
Continuous random variables – mean and variance

The statistics "mean" and "variance", which were already defined for discrete random variables, can be defined in an analogous way for continuous random variables:

            X discrete, xj ∈ {x1, x2, …}              X continuous with density f
            (p.d.f. P(X=xj), c.d.f. P(X≤xj))          (density function f(x), distribution function F(x))

  Mean      E(X) = Σ_j xj P(X=xj)                     E(X) = ∫_{−∞}^{∞} x f(x) dx
  Variance  Var(X) = Σ_j (xj − E(X))² P(X=xj)         Var(X) = ∫_{−∞}^{∞} (x − E(X))² f(x) dx
Continuous random variables – Example 1

Uniform distribution. A continuous random variable X is called uniform or uniformly distributed (in the interval [a,b]) if it has a density function of the form
  f(x) = 1/(b − a)   for x ∈ [a, b]
  f(x) = 0           otherwise
for some real values a < b. This is denoted by X ~ U(a,b).

[Figure: density f and distribution function F of U(a,b); f is constant at 1/(b−a) on [a,b], and F rises from 0 to 1 between a and b]
Continuous random variables – Example 2

Normal / Gaussian distribution. A continuous random variable X is called normally distributed (with mean µ and standard deviation σ > 0), i.e. X ~ N(µ, σ²), if it has a density function of the form
  f(x) = (1/(σ√(2π))) exp( −(x − µ)² / (2σ²) )

There is no closed form for the distribution function F of such a variable; the distribution function has to be computed numerically.

This distribution is symmetric (around x = µ), uni-modal (with mode at x = µ) and shaped like a bell.

[Figure: density and distribution function of some normally distributed random variables X ~ N(µ, σ²), including the standard normal distribution µ=0, σ=1 and the cases µ=0, σ=0.8; µ=0, σ=2; µ=1.5, σ=0.8; µ=1, σ=2]
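Since F has no closed form, in practice it is evaluated numerically via the error function; a minimal Python sketch (not from the slides):

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    # No closed form: erf itself is computed numerically by the library.
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

print(normal_pdf(0.0))     # 0.3989... = 1/sqrt(2*pi)
print(normal_cdf(1.96))    # ~0.975
```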
Two more continuous distributions

The χ²-distribution. If X1, …, Xn are independent random variables that are N(0,1)-distributed, then the random variable
  Z = X1² + X2² + … + Xn²
is said to be Chi-squared distributed with n degrees of freedom, for short Z ~ χ²(n).

Student t-distribution (t-distribution). If X ~ N(0,1) and Z ~ χ²(n) are independent, then the random variable
  T = X / √(Z/n)
is said to have a t-distribution with n degrees of freedom, for short T ~ t(n).

This list of continuous random variables is by no means complete. For a survey, consult the statistics literature given in the reference list to this lecture series.
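Both definitions can be verified by simulation, building Z and T directly from independent N(0,1) draws (a sketch, not from the slides; it checks the standard facts E(Z) = n and, for n > 2, Var(T) = n/(n−2)):

```python
import random

random.seed(0)
n, draws = 5, 100_000

chi2_samples, t_samples = [], []
for _ in range(draws):
    xs = [random.gauss(0, 1) for _ in range(n)]
    z = sum(x * x for x in xs)                  # Z ~ chi^2(n) by definition
    chi2_samples.append(z)
    x = random.gauss(0, 1)                      # independent of z
    t_samples.append(x / (z / n) ** 0.5)        # T ~ t(n) by definition

print(sum(chi2_samples) / draws)                # ~5    (E(Z) = n)
print(sum(t * t for t in t_samples) / draws)    # ~1.67 (Var(T) = 5/3)
```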
Continuous random variables – Independence

Definition. Let Ω be a probability space with probability measure P. Let X: Ω → IR and Y: Ω → IR be continuous random variables. X and Y are called independent if
  P(X ≤ x, Y ≤ y) = P(X ≤ x)·P(Y ≤ y) = F_X(x)·F_Y(y)
for all x, y ∈ IR.

Corollary. If the continuous random variables X and Y are independent,
  P(a1 ≤ X ≤ a2, b1 ≤ Y ≤ b2) = P(a1 ≤ X ≤ a2)·P(b1 ≤ Y ≤ b2)
for all real values of a1 < a2, b1 < b2.
Continuous random variables – Joint and marginal probability distributions

Let X and Y be two random variables on the same probability space Ω. If there exists a function f: IR × IR → IR such that
  P(a1 ≤ X ≤ a2, b1 ≤ Y ≤ b2) = ∫_{a1}^{a2} ∫_{b1}^{b2} f(x, y) dy dx
for all real values of a1 < a2, b1 < b2, then X and Y are said to have a continuous joint (multivariate) distribution, and f is called their joint density. We will be concerned only with this case here.

The marginal distribution of X is given by
  P(a1 ≤ X ≤ a2) = ∫_{a1}^{a2} ∫_{−∞}^{∞} f(x, y) dy dx = ∫_{a1}^{a2} f_X(x) dx,
where
  f_X(x) = ∫_{−∞}^{∞} f(x, y) dy
is the density of the marginal distribution of X.
Continuous Random Variable – Conditional probability distributions

The conditional distribution of X, given Y = b, is given by
  P(a1 ≤ X ≤ a2 | Y = b) = ∫_{a1}^{a2} f_X(x | Y = b) dx,
where
  f_X(x | Y = b) = f(x, b) / ∫_{−∞}^{∞} f(t, b) dt
is the density of the conditional distribution of X, given Y = b.

We mention an equivalent condition for independence:
The random variables X and Y are independent if
  1. f(x, y) = f_X(x)·f_Y(y) for all x, y ∈ IR, or equivalently
  2. f_X(x | Y = b) = f_X(x) for all x, b ∈ IR, or equivalently
  3. f_Y(y | X = a) = f_Y(y) for all a, y ∈ IR.
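To make the marginal and conditional densities concrete, here is a small numeric sketch (not from the slides) for the hypothetical joint density f(x, y) = x + y on the unit square:

```python
def f(x, y):
    # Joint density: integrates to 1 over [0,1] x [0,1].
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def integrate(g, lo, hi, n=2000):
    # Simple midpoint rule, good enough for a smooth density.
    h = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * h) for i in range(n)) * h

def f_X(x):
    return integrate(lambda y: f(x, y), 0, 1)        # marginal: x + 1/2

def f_X_given_Y(x, b):
    return f(x, b) / integrate(lambda t: f(t, b), 0, 1)

print(f_X(0.3))                 # ~0.8
print(f_X_given_Y(0.3, 0.9))    # ~0.857, != f_X(0.3): X and Y are dependent
```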
Basic Concepts in Statistics

Data, sampling and statistical inference

Probability theory: reasoning from f to Y
  "If the experiment is like …, then f will be …, and (y1, …, yn) will be like …, or E(Y) must be …"

Statistics: reasoning from Y to f
  "Since (y1, …, yn) turned out to be …, it seems that f is likely to be …, or the parameter is likely to be around …"

Sampling: ways to select the subjects for which the characteristics/properties of interest will be assessed
  Examples: simple random sampling (SRS), stratified, clustered

Data: characteristics/properties of a random sample from a population.
  For example: age, weight, expression level of a certain gene, …

Statistical inference: learning from data,
  i.e. assuming these data are n draws from distribution f_θ, what can we learn about the population parameter.
Types of Data

There are different types of data:
• numerical data (discrete, continuous)
• categorical data (ordered, non-ordered)
• mixtures of both

If the properties consist of multiple features (like age, sex, smoking status, PGEM levels pre/post treatment here), the data is called multivariate, otherwise it is called univariate.

SubjectID  Age  Sex  Smoke  PackyrCat  PGEM.pre  PGEM.post
N29        66   M    N      0          7.44      1.77
N05        70   M    N      0          16.75     10.82
N04        47   M    N      0          20.38     3.51
N22        68   M    N      0          15.43     4.47
N21        65   F    N      0          2.99      3.14
…
F29        48   F    F      3          12.08     3.44
F07        64   M    F      2          16.64     7.74
F13        26   M    F      1          8.2       6.33
F02        76   M    F      2          7.95      5.36
F30        63   F    F      3          3.63      0.55
…
C14        50   M    C      2          18.23     6.18
C29        42   M    C      1          20.16     10.22
C07        45   F    C      1          10.43     5.67
C25        53   F    C      1          10.6      5.73
C23        35   M    C      3          10.62     8.41
Typical steps in statistical analysis of data

Describe the data (descriptive statistics)
Propose a reasonable probabilistic model
Make inference about parameters in the model
Check the model fitting/assumptions
Report results
Describing univariate categorical data

Frequency table: simply list the count and the relative frequency for each data category.

Example:
  Variable         Level  Count  Rel. Freq.
  Sex              M      53     0.56
                   F      42     0.44
  Smoking Status   N      31     0.33
                   F      30     0.31
                   C      34     0.36
Describing ordered univariate data – Histogram

Cut the possible range of the data into various bins of certain size/s
Count the number/frequency of the data in each bin
Display the count/frequency as a bar plot, with the width of the bars proportional to the length of the intervals
Empirical distribution of the data

[Figure: histograms of Age (roughly 20–80), PGEM.pre (roughly 0–60) and PGEM.post (roughly 0–40), with frequency on the vertical axis]
Descriptive statistics 1

The second and by far the most important way is to summarize the data by appropriate statistics. A statistic is a rule that assigns a number to a dataset. This number is meant to tell us something about the underlying dataset.

Examples:

Sample arithmetic mean. Given x1, …, xn, calculate the arithmetic mean as
  x̄ = (1/n) Σ_{j=1}^{n} xj

The arithmetic mean is one of the many statistics that aim to describe where the "centre" of the data is. The arithmetic mean minimizes the sum of the quadratic distances to the data points, namely
  x̄ = argmin_x Σ_{j=1}^{n} (x − xj)²
Descriptive statistics 2

Median. Let x1, …, xn be given in ascending order. The median x_med is defined as
  x_med = x_((n+1)/2)                 if n is odd
  x_med = (x_(n/2) + x_(n/2+1)) / 2   if n is even

The median is a value such that the number of data points smaller than x_med equals the number of data points greater than x_med. Like the arithmetic mean, the median is also a location measure for the "centre" of the data.

[Figure: a unimodal frequency distribution with its mean, median and mode marked]
Descriptive statistics 3

Sample variance, sample standard deviation. The variance v = Var(x1, …, xn) = Var(x) of a dataset x = (x1, …, xn) is defined as
  v = s² = (1/n) Σ_{j=1}^{n} (xj − x̄)²,  or more commonly  v = s² = (1/(n−1)) Σ_{j=1}^{n} (xj − x̄)²

(the average squared distance from all data points to x̄). The standard deviation s = s(x) is the positive square root of the variance, s² = v. The variance and the standard deviation are measures for the dispersion of the data.

[Figure: two frequency distributions around the same x̄ illustrating small vs. large variance]
Descriptive statistics 4

Symmetry. A frequency distribution is called symmetric if it has an axis of symmetry.

Skewness. A frequency distribution is called skewed to the right if the right tail of the distribution falls off slower than the left tail. Analogously: skewed to the left.

The statistic for sample skewness uses the sample third central moment and the sample variance.

[Figure: left-skewed, symmetric, and right-skewed distributions with mean, median and mode marked]

Posture rules:
  Left skew:   x̄ < x_med < x_mode
  Symmetric:   x̄ ≈ x_med ≈ x_mode
  Right skew:  x̄ > x_med > x_mode
Descriptive statistics 5

Quantiles. Let q ∈ (0,1). A q-quantile of a frequency distribution is a value xq such that the fraction of data lying left of xq is at least q, and the fraction lying right of xq is at least 1 − q. If the data is ordered, x(1) ≤ x(2) ≤ … ≤ x(n), then
  xq = x(⌈nq⌉)             if nq is not an integer
  xq ∈ [x(nq), x(nq+1)]    if nq is an integer

[Figure: density with the quantiles x0.05, x0.25, x0.5, x0.75, x0.95 marked]

Special quantiles are the quartiles x0.25, x0.5, x0.75 (which split up the data into four classes), and the quintiles x0.2, x0.4, x0.6, x0.8. They are frequently used to give a summary of the data distribution.
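A minimal Python sketch (not from the slides) implementing these definitions on the first ten PGEM.pre values from the data table shown earlier:

```python
import math

data = [7.44, 16.75, 20.38, 15.43, 2.99, 12.08, 16.64, 8.2, 7.95, 3.63]
n = len(data)
xs = sorted(data)

xbar = sum(data) / n
v = sum((x - xbar) ** 2 for x in data) / (n - 1)   # the "more common" n-1 version
s = math.sqrt(v)
med = (xs[n // 2 - 1] + xs[n // 2]) / 2 if n % 2 == 0 else xs[n // 2]

def quantile(q):
    nq = n * q
    if nq != int(nq):
        return xs[math.ceil(nq) - 1]
    return (xs[int(nq) - 1] + xs[int(nq)]) / 2     # midpoint of [x_(nq), x_(nq+1)]

print(xbar, med, s)                    # ~11.15, 10.14, ~5.98
print(quantile(0.25), quantile(0.75))
```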
Detailed description of univariate data

Density plots. If the number of data points is large, it is often convenient to approximate a histogram (of the relative frequencies) by a density curve.

[Figure: histogram with an overlaid density curve (red line); the segment between x0 and x1 is shaded grey]

A density function is a non-negative real-valued integrable function f such that
  ∫_{−∞}^{∞} f(x) dx = 1
(this condition says that the area enclosed by the graph of f and the x-axis is 1).
Interpretation: the area of a segment enclosed by the x-axis, the graph of f and the vertical lines x = x0 and x = x1 (the grey shaded area in the figure) equals the fraction of data points with values between x0 and x1.
Continuous univariate data – An important distribution

Normal distributions = Gaussian distributions. A very important family of density functions are the Gaussian distributions, defined as
  f(x) = (1/(σ√(2π))) exp( −(1/2)·((x − µ)/σ)² )
with parameters µ and σ > 0.

This distribution is symmetric (around x = µ), unimodal (with mode at x = µ) and shaped like a bell. The mean of Gaussian distributed data is µ, its variance is σ².

The 68-95-99.7 rule. If a dataset has a Gaussian distribution with mean µ and variance σ², then
  68% of the data lie within the interval [µ−σ, µ+σ]
  95% of the data lie within the interval [µ−2σ, µ+2σ]
  99.7% of the data lie within the interval [µ−3σ, µ+3σ]
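The rule is easy to check empirically (a sketch, not from the slides):

```python
import random

random.seed(42)
mu, sigma, n = 10.0, 2.0, 100_000
data = [random.gauss(mu, sigma) for _ in range(n)]

for k, target in [(1, 68), (2, 95), (3, 99.7)]:
    frac = sum(mu - k * sigma <= x <= mu + k * sigma for x in data) / n
    print(f"within {k} sd: {100 * frac:.1f}%  (rule: ~{target}%)")
```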
Summary

Frequency tables, bar plots, pie charts, histograms and density plots are possible ways to display the distribution of statistical data.
Mean, median and quantiles are summary statistics for the location of numerical data.
The variance is a measure of dispersion for numerical data.
Higher order statistics such as sample skewness and kurtosis are also used to describe additional characteristics of the distribution.
Multivariate descriptive statistics – Multidimensional data

In many applications a set of properties/features is measured for each study subject; such data is considered multidimensional.
It is often of interest to evaluate the relationship between/among these features and to quantify the strength of the relationship.

Examples:
  The PGEM data we saw earlier. For each study subject, multiple aspects were measured, such as age, gender, PGEM value, etc.
  Microarray gene expression data are multidimensional.

Describe the relationship between two discrete features: contingency table
Describe the relationship between two continuous features: correlation
General description: Contingency table – Absolute frequencies

The contingency table can be used to describe the joint distribution of X and Y in terms of absolute frequencies, X = a1, …, ak, Y = b1, …, bm.

A (k × m) contingency table of absolute frequencies has the form:

            Smoking Status
  Gender    C     F     N     Total
  F         16    12    14    42
  M         18    18    17    53
  Total     34    30    31    95

(The row and column totals are the marginal frequencies.)
Contingency table – Relative frequencies

The contingency table can also be used to describe the joint distribution of X and Y in terms of relative frequencies:

            Smoking Status
  Gender    C      F      N      Total
  F         0.17   0.13   0.15   0.44
  M         0.19   0.19   0.18   0.56
  Total     0.36   0.32   0.33   1.00

(The row and column totals are the marginal relative frequencies.)
Contingency table – Conditional frequencies

The contingency table can help us examine the dependency between two discrete variables, based on the relationship between the relative frequencies (joint distribution of the two variables) and the conditional relative frequencies (conditional distributions of the variables).

Therefore: look at conditional frequencies, i.e. the distribution of a feature for a fixed value of the second feature.

Relative freq. conditional on gender:
            Smoking Status
  Sex       C      F      N      Total
  F         0.38   0.29   0.33   1
  M         0.34   0.34   0.32   1

Relative freq. conditional on smoking status:
            Smoking Status
  Sex       C      F      N
  F         0.47   0.40   0.45
  M         0.53   0.60   0.55
  Total     1.00   1.00   1.00
Contingency table Contingency table ––
Conditional frequency distribution (1)Conditional frequency distribution (1)
Conditional frequency distribution of Y under the condition X=aConditional frequency distribution of Y under the condition X=a
given by:given by:given by:given by:
Conditional frequency distribution of X under the condition Y=bConditional frequency distribution of X under the condition Y=b
is given by:is given by:
Conditional frequency distribution (1)Conditional frequency distribution (1)
Conditional frequency distribution of Y under the condition X=aConditional frequency distribution of Y under the condition X=aii, also written Y|X=a, also written Y|X=aii , is , is
Conditional frequency distribution of X under the condition Y=bConditional frequency distribution of X under the condition Y=bj j , also written X|Y=b, also written X|Y=bjj , ,
Contingency table – Conditional frequency distribution (2)

Because the marginal frequencies are the row and column sums of the joint frequencies, f(a_i) = Σ_j f(a_i, b_j) and f(b_j) = Σ_i f(a_i, b_j), we also have

    f(b_j | a_i) = f(a_i, b_j) / Σ_j f(a_i, b_j)   and   f(a_i | b_j) = f(a_i, b_j) / Σ_i f(a_i, b_j)

The conditional distributions are thus computed by dividing the joint frequencies by the appropriate marginal frequencies.
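As a minimal R sketch of these computations — the counts below are hypothetical (n = 101) and only roughly reproduce the rounded table above:

    # hypothetical counts: rows = Sex (F, M), columns = Smoking Status (C, F, N)
    counts <- matrix(c(17, 13, 15,
                       19, 19, 18),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(Sex = c("F", "M"),
                                     Smoking = c("C", "F", "N")))
    prop.table(counts)              # joint relative frequencies
    rowSums(prop.table(counts))     # marginal distribution of Sex
    prop.table(counts, margin = 1)  # conditional on Sex: rows sum to 1
    prop.table(counts, margin = 2)  # conditional on Smoking: columns sum to 1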
Contingency table – χ² coefficient

Starting point: how should the joint frequencies look if we "empirically" assume independence between X and Y (given the marginal distributions)? Under independence, the expected relative frequencies are the products of the marginals, f(a_i)·f(b_j), i.e., the expected cell counts are

    expected_ij = n_i· · n_·j / n

where n_i· and n_·j are the row and column totals.
Contingency table – Empirical independence

Idea: X and Y are "empirically" independent if and only if the conditional frequencies are equal in each sub-population X = a_i, i.e., independent of a_i.
Contingency table – Assessing empirical independence

Idea: compare for each cell (i, j) the observed frequency with the theoretical frequency expected under the assumption of independence:

    χ² = Σ_ij (observed_ij − expected_ij)² / expected_ij,   expected_ij = n_i· · n_·j / n

(the χ² coefficient; a good approximation in large samples).

χ² = 0  <==>  X and Y are empirically independent
Contingency table – Properties of the χ² coefficient

χ² = 0  <==>  X and Y are empirically independent
χ² large  <==>  strong dependence/association
χ² small  <==>  weak dependence/association

Disadvantage: the magnitude of χ² depends on the dimensions of the table.
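A small R sketch of the χ² computation, reusing the hypothetical 2×3 counts from above:

    counts <- matrix(c(17, 13, 15, 19, 19, 18), nrow = 2, byrow = TRUE)
    n     <- sum(counts)
    expec <- outer(rowSums(counts), colSums(counts)) / n  # expected counts under independence
    sum((counts - expec)^2 / expec)                       # the chi-squared coefficient
    chisq.test(counts)$statistic                          # base-R equivalent (no continuity correction for 2x3)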
Graphical representation of two continuous features

The simplest representation of the values (x_i, y_i), i = 1,…,n from two continuous features X and Y: plotting (x_1, y_1),…,(x_n, y_n) in a coordinate system gives a scatter plot.

[Figure: scatter plot of log(PGEM.post) against log(PGEM.pre)]
Pearson's correlation coefficient (1)

The Pearson correlation coefficient is commonly used to describe the strength of the linear association between two continuous features. For the data (x_i, y_i), i = 1,…,n, it is defined as

    r = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² )

The range of r is [−1, 1]:
  r > 0: positive correlation, positive linear relationship, i.e., the values scatter around a straight line with positive slope
  r < 0: negative correlation, negative linear relationship, i.e., the values scatter around a straight line with negative slope
  r = 0: no correlation; X and Y are uncorrelated
Pearson's correlation coefficient (2)

The correlation coefficient r measures the strength of a linear relationship, and only of a linear one: a strong nonlinear association can still yield r close to 0.
Pearson's correlation coefficient (3)

Rule of thumb (one common convention for |r|):
  |r| < 0.5         "weak correlation"
  0.5 ≤ |r| < 0.8   "medium correlation"
  |r| ≥ 0.8         "strong correlation"

Linear transformations: for x̃_i = a + b·x_i and ỹ_i = c + d·y_i, the correlation coefficient between x̃ and ỹ satisfies

    r(x̃, ỹ) = sign(b·d) · r(x, y)

i.e., r is unchanged by linear transformations up to its sign.
Equivalent forms of r

Multiplying out the products yields

    Σ_i (x_i − x̄)(y_i − ȳ) = Σ_i x_i y_i − n·x̄·ȳ

(remember the analogous formula for variances!).

In terms of standard deviations and covariance,

    r = s_xy / (s_x · s_y)

with covariance s_xy = (1/(n−1)) Σ_i (x_i − x̄)(y_i − ȳ) and standard deviations s_x and s_y. (Whether 1/n or 1/(n−1) is used does not matter for r, as long as it is used consistently — the constants cancel.)
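A quick R check that the three forms agree, on made-up data:

    set.seed(1)
    x <- rnorm(20); y <- 2 * x + rnorm(20)      # made-up data with a linear trend
    r1 <- cor(x, y)                             # built-in Pearson correlation
    r2 <- cov(x, y) / (sd(x) * sd(y))           # covariance / (s_x * s_y)
    r3 <- sum((x - mean(x)) * (y - mean(y))) /
          sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
    c(r1, r2, r3)                               # all three values agree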
Statistical Inference

Estimation
  Finding approximations of the model parameters – point estimation
  Finding the uncertainty associated with the model parameters – interval estimation (finding the confidence intervals)
  Estimates are used to characterize a population characteristic

Hypothesis testing
  Examining the validity of our hypotheses regarding a population characteristic by using the observed data
Estimation

Example model for the baseline PGEM levels by smoking status:

    y_i,pgem.pre = θ_C x_i,sm=C + θ_F x_i,sm=F + θ_N x_i,sm=N + ε_i,   ε_i ~ N(0, σ²)

Point estimation, i.e., finding θ̂_C(y|x), θ̂_F(y|x), θ̂_N(y|x)

Interval estimation, i.e., finding the uncertainties associated with the point estimates: var(θ̂_C(y|x)), var(θ̂_F(y|x)), var(θ̂_N(y|x))

Desired properties of the estimator:
  unbiasedness (bias is measured as the expected difference between the estimator and the population parameter)
  efficiency (can be described by the inverse of the variance of the estimator)
  small mean square error (MSE), E(θ̂ − θ)²
  other: consistency, etc.

Common methods to find estimators:
  Method of moments
  Maximum likelihood estimation
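A minimal R sketch of point and interval estimation for a model of this form; the data frame, group means and sample sizes below are made up for illustration:

    set.seed(2)
    d <- data.frame(sm       = factor(rep(c("C", "F", "N"), each = 10)),
                    pgem.pre = rnorm(30, mean = rep(c(12, 10, 8), each = 10), sd = 2))
    fit <- lm(pgem.pre ~ sm - 1, data = d)  # no intercept: one mean theta per smoking group
    coef(fit)      # point estimates of theta_C, theta_F, theta_N
    confint(fit)   # interval estimates (95% confidence intervals)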
Estimation Methods

Method of moments:
  Match the first (E(X)), second (E(X²)), …, order moments to the parameters
  Solve the resulting equation system
  If E(X^k) = g(θ), then θ̂ = g⁻¹(sample k-th moment)

Maximum Likelihood Estimation (MLE):
  Assuming the data come from a parametric family indexed by a population parameter θ, i.e., X_1,…,X_n ~ i.i.d. f(x|θ), the joint density of the data is

      f(X_1,…,X_n | θ) = Π_i f(X_i | θ)

  The probability of observing the data, viewed as a function of θ under the assumed probabilistic model, is the likelihood function of the parameter:

      Likelihood = f(x_1,…,x_n | θ) = Π_i f(x_i | θ)
Example: Binomial data

Data: 6, 3, 5, 6, 8 – the numbers of successes in 5 repeated experiments of tossing a coin 10 times.

  Is this a fair coin?
  What is going to come up for the 11th toss?

Assuming a probabilistic model: X ~ Binom(π, 10)

Estimating π:
  MoM: because E(X) = 10π, the estimate of π = sample mean / 10 = (0.6 + 0.3 + 0.5 + 0.6 + 0.8) / 5 = 0.56
  MLE: L(π|data) = P(x_1 = 6,…, x_5 = 8 | π) = P(x_1 = 6 | π) ··· P(x_5 = 8 | π), then find the value of π that maximizes the likelihood function
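A short R check of both estimates; the data are from the slide, and numeric maximization of the log-likelihood is one way to find the MLE:

    x <- c(6, 3, 5, 6, 8)                       # successes out of 10 tosses each
    mom <- mean(x) / 10                         # method of moments: E(X) = 10*pi
    loglik <- function(p) sum(dbinom(x, size = 10, prob = p, log = TRUE))
    mle <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
    c(mom = mom, mle = mle)                     # both are (approximately) 0.56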
Example: Normal data

x_1, x_2,…, x_n ~ iid N(µ, σ²)

The pdf of a single observation is

    f(x | µ, σ²) = (1 / √(2πσ²)) · exp( −(x − µ)² / (2σ²) )

Joint pdf for the whole random sample:

    f(x_1, x_2,…, x_n | µ, σ²) = f(x_1 | µ, σ²) ··· f(x_n | µ, σ²)

The likelihood function is basically the joint pdf for the fixed sample:

    l(µ, σ² | x_1, x_2,…, x_n) = f(x_1,…, x_n | µ, σ²)

Maximum likelihood estimates of the model parameters µ and σ² are the numbers that maximize the joint pdf for the fixed sample, which is called the likelihood function:

    µ̂ = (1/n) Σ_i x_i,    σ̂² = (1/n) Σ_i (x_i − µ̂)²
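A quick R illustration on simulated data; note the MLE of σ² divides by n, while R's built-in var() divides by n − 1:

    set.seed(3)
    x <- rnorm(50, mean = 5, sd = 2)            # made-up sample
    mu.hat     <- mean(x)                       # MLE of mu: the sample mean
    sigma2.hat <- mean((x - mu.hat)^2)          # MLE of sigma^2 (divides by n)
    n <- length(x)
    c(mu.hat, sigma2.hat, var(x) * (n - 1) / n) # last two values agree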
Hypothesis Testing

Making inference about the value of the population parameter based on the data:

  Form hypotheses about the population parameter
    null hypothesis: H0
    alternative hypothesis: H1
  Construct and calculate the test statistic
    it utilizes the population parameter estimates
    a desired property is that it contains information about the uncertainty associated with the parameter estimates
  Define a rejection region (or the significance level) for the test statistic based on H0
  Reject H0 if the test statistic falls inside the rejection region (i.e., the test statistic is deemed highly unlikely to be generated from the probabilistic model defined by H0)
  Fail to reject H0 if the data are not highly unlikely to happen under H0
Hypothesis Testing Example

Question of interest: are the baseline PGEM levels different between two conditions (for example, current smokers vs. never smokers)?

Hypotheses to be tested:
  H0: θ_C = θ_N
  H1: θ_C ≠ θ_N

Choose an appropriate statistic D that is able to discriminate between the two hypotheses, and choose a rejection region / significance level based on H0.

The selection of the statistic defines the test.
Hypothesis Testing Example: Fold Change

Fold change: a test statistic commonly used by biologists. Divide the average PGEM level in current smokers by the average PGEM level in never smokers:

    D = θ̂_C / θ̂_N = [ (y_C,1 + y_C,2 + … + y_C,n_C) / n_C ] / [ (y_N,1 + y_N,2 + … + y_N,n_N) / n_N ]

This statistic does not take into account the variation in the study sample. If the distribution is skewed, using the median instead of the mean as the parameter estimate may be more robust.

Rejection region for D:
  Often based on experience with the biological system.
  For example, one may define the rejection region as D < 0.5 or D > 2, i.e.:
    if |log2 D| > 1, reject H0 in favour of H1
    if |log2 D| ≤ 1, fail to reject H0
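A minimal R sketch of the fold-change rule on made-up PGEM measurements:

    yC <- c(10.1, 12.3, 9.8, 11.5, 10.9)   # hypothetical current smokers
    yN <- c(4.2, 5.1, 3.8, 4.9, 5.3)       # hypothetical never smokers
    D  <- mean(yC) / mean(yN)              # fold change, about 2.3 here
    abs(log2(D)) > 1                       # TRUE here: reject H0 under the |log2 D| > 1 rule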
Hypothesis Testing Example: t-test

Another commonly used test statistic is the two-sample Student t-statistic.

Assumption:

    y_C,1,…, y_C,n_C ~ iid N(θ_C, σ_C²),    y_N,1,…, y_N,n_N ~ iid N(θ_N, σ_N²)

Calculating the T statistic:

    T = (θ̂_C − θ̂_N) / √( σ̂_C²/n_C + σ̂_N²/n_N )

The T statistic is a random variable with a somewhat complicated distribution. Under the normality assumption, T has approximately a t-distribution with d degrees of freedom, where d is the closest integer to

    (S_C²/n_C + S_N²/n_N)² / [ (S_C²/n_C)² / (n_C − 1) + (S_N²/n_N)² / (n_N − 1) ]
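A minimal R sketch on the same made-up samples as in the fold-change example; R's t.test() performs this unequal-variance (Welch) test by default, using the degrees-of-freedom formula above without rounding:

    yC <- c(10.1, 12.3, 9.8, 11.5, 10.9)
    yN <- c(4.2, 5.1, 3.8, 4.9, 5.3)
    T  <- (mean(yC) - mean(yN)) / sqrt(var(yC) / length(yC) + var(yN) / length(yN))
    T                 # hand-computed test statistic
    t.test(yC, yN)    # Welch two-sample t-test: same statistic, plus d.f. and p-value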
Hypothesis Testing Example: t-test (2)

Deciding on the rejection region – the significance level:
  • Usually represented by α ∈ (0, 1)
  • A probabilistic interpretation of the rejection region, i.e., the probability that the test statistic falls into the rejection region J under H0:

        P(D ∈ J | H0) = α

  • A commonly used α in simple hypothesis testing is 0.05
  • This means the rejection region lies in the part of the distribution that the test statistic has only a very small probability of reaching under H0
  • We can derive the rejection region from the significance level
Hypothesis Testing Example: t-test (3)

If the test statistic has a t distribution with 8 degrees of freedom and the significance level is α = 5%, the rejection region can be defined as the region where, under H0, we would expect T only 5% of the time: above t(0.975; 8) = 2.306 or below t(0.025; 8) = −2.306.

Thus a typical decision rule in this case would be: reject H0 in favour of H1 if |T| > t(0.975; 8) = 2.306.

[Figure: density of the t-statistic for k = 8 degrees of freedom; the symmetric rejection region for α = 5% puts 2.5% in each tail, with 95% in the middle.]
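The critical values can be looked up in R:

    qt(0.975, df = 8)   #  2.306: upper critical value t(0.975; 8)
    qt(0.025, df = 8)   # -2.306: lower critical value t(0.025; 8)
    # decision rule: reject H0 if abs(T) > qt(0.975, df = 8)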
Hypothesis Testing Example: t-test (4)

Instead of comparing the observed t-statistic to the critical values which define the rejection region, we can compare the p-value of the test statistic to the significance level.

P-value: the probability of observing values of T that are at least as extreme as the observed t under H0,

    p = P(|T| > |t| | H0)

Given a significance level α, we reject H0 if p < α.
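In R, for an observed t and d degrees of freedom (the values below are hypothetical):

    t.obs <- 2.8; d <- 8
    p <- 2 * pt(-abs(t.obs), df = d)   # two-sided p-value P(|T| > |t| | H0)
    p < 0.05                           # TRUE here, so reject H0 at alpha = 0.05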
Two types of tests

Parametric tests: a parametric distribution is assumed for the measured random variables. E.g., the t-test assumes that the variables are normally distributed. (If this were not the case, it would lead to wrong p-values or wrong confidence intervals.) Often, prior to computing a test statistic, the data are transformed in order to produce random variables that are easier to handle (e.g., to produce approximately normally distributed data).

Non-parametric tests: no parametric distribution function is assumed for the measured random variables. Used when the distribution of the measured variables is not known, or when there is no appropriate test that can deal with their distribution. These tests merely rely on the relative order of the values, or on some very mild constraints concerning the shape of the probability distributions of the measured variables (e.g., unimodality, symmetry).

We mention one parametric and one non-parametric test which are commonly used.
Wilcoxon rank sum test

Given two samples x = (x_1,…,x_n) and y = (y_1,…,y_m) drawn independently from the random variables X and Y, respectively.

Null hypothesis: the two variables X and Y have the same distribution.

Test statistic:
  Rank order all N = n + m values from both samples combined (n is the size of the larger sample and m is the size of the smaller sample).
  Sum the ranks of the smaller sample and call this value w.
  Look up the level of significance (p-value) in a table using w, m and n.
  The exact p-value can be calculated based on all permutations of the ranks over both samples (when n and m are large, approximations based on the central limit theorem can be used).

For large samples it is almost as sensitive as the two-sample Student t-test.
For small samples with unknown distributions this test is even more sensitive than the Student t-test.
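A one-line R version, reusing the made-up samples from the t-test example:

    x <- c(10.1, 12.3, 9.8, 11.5, 10.9)
    y <- c(4.2, 5.1, 3.8, 4.9, 5.3)
    wilcox.test(x, y)   # exact p-value for small samples without ties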
Hypothesis Testing – Error types

If we reject the null hypothesis when it is actually true, we have made what is called a type I error, or a false positive. (Example: falsely declaring a gene as differentially expressed.)

If we accept the null hypothesis when it is actually false, we have made a type II error, or a false negative. (Example: failing to identify a truly differentially expressed gene.)

                           H0 true                          H0 not true
H0 not rejected (P ≥ α)    True negatives                   Type II error (false negatives)
H0 rejected (P < α)        Type I error (false positives)   True positives
Hypothesis Testing – Error Types (Cont.)

In hypothesis testing, the probability of a Type I error is controlled to be at most as high as the significance level of the test.

It is harder to control the probability of a Type II error, because we usually do not have a statistic for testing the alternative hypothesis.

The smaller the true existing difference, the larger the probability of a Type II error.

Given a statistical testing procedure, it is impossible to keep both error types arbitrarily small by selecting a special significance level. There is a trade-off between the type I and type II errors, as depicted below.

[Figure: probability of a type II error plotted against the probability of a type I error – the trade-off between error types, plotted for different significance levels.]
Summary

Null hypothesis, test statistics
Significance level, rejection region, p-value
Type I and type II errors
5-step testing procedure
Parametric tests: t-test, ANOVA
Non-parametric tests: Wilcoxon rank sum test, Kruskal-Wallis
Multiple hypothesis testing

Golub et al. (1999) were interested in identifying genes that are differentially expressed in patients with two types of leukemia:
  - acute lymphoblastic leukemia (ALL, class 0) and
  - acute myeloid leukemia (AML, class 1).

Gene expression levels were measured using Affymetrix chips containing g = 6817 human genes.

n = 38 samples = 27 ALL cases + 11 AML cases.
Multiple hypothesis testing (Cont. 2)

The preprocessed data include the expression of 3051 genes for each of the 38 subjects. A two-sample t-test statistic was computed for each of the 3051 genes, and p-values were obtained for each gene based on the t statistic (computed in R as 2*(1-pnorm(abs(teststat)))).

[Figures: histogram of the 3051 test statistics; histogram of the corresponding p-values.]

Which of these genes can we consider as differentially expressed?
Multiple hypothesis testing (Cont. 3)

Can we use the 0.05 significance level to identify differentially expressed genes?

P-value: the probability of finding a difference equal to or greater than the observed one just by chance under the null hypothesis. In multiple comparisons (or repeated experiments), the p-value can be viewed as a measure of the false positive rate (F/m0).

                   Called significant   Called not significant   Total
Null true          F                    m0 − F                   m0
Alternative true   T                    m1 − T                   m1
Total              S                    m − S                    m

The commonly used 0.05 significance level can result in too many false positive findings.

Exercise: assuming that among the 3000 genes 20% are truly differentially expressed, can you give a conservative estimate of the rate of false positives among those called significant if we use the 0.05 significance level? How about if the proportion of truly differentially expressed genes is 10%? (A numerical check is sketched below.)
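One way to check the exercise numerically — this sketch assumes every truly differentially expressed gene is called significant, so the resulting F/S is a lower bound on the false positive rate among the calls:

    m <- 3000; alpha <- 0.05
    for (p1 in c(0.20, 0.10)) {      # proportion truly differentially expressed
      m0 <- m * (1 - p1)             # true null genes
      FP <- alpha * m0               # expected false positives at level 0.05
      S  <- FP + m * p1              # calls if every true positive is found
      cat(sprintf("p1 = %.2f: expected FP = %.0f, FP/S >= %.2f\n", p1, FP, FP / S))
    }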
Multiple hypothesis testing (Cont. 4)

Family-wise error rate (FWER): the probability of having at least one false positive in multiple comparisons.

Many versions of controlling procedures: Bonferroni, Holm (1979), Hochberg (1988), Hommel (1988).

Can be too conservative for genomic studies.

Table: FWER (and, in parentheses, expected number of false positives) for different numbers of comparisons N at different α levels:

  α \ N    1            5            10          50          100        1000
  0.01     0.01 (0.01)  0.05 (0.05)  0.10 (0.1)  0.39 (0.5)  0.63 (1)   1.00 (10)
  0.05     0.05 (0.05)  0.23 (0.25)  0.40 (0.5)  0.92 (2.5)  0.99 (5)   1.00 (50)
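The table can be reproduced in R from the identity FWER = 1 − (1 − α)^N for N independent tests (independence of the tests is the assumption behind these numbers):

    N <- c(1, 5, 10, 50, 100, 1000)
    for (alpha in c(0.01, 0.05)) {
      fwer <- 1 - (1 - alpha)^N    # P(at least one false positive)
      efp  <- alpha * N            # expected number of false positives
      print(round(rbind(FWER = fwer, expected.FP = efp), 2))
    }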
Multiple hypothesis testing (Cont. 5)

False discovery rate (FDR / pFDR): the proportion of hits that are false (F/S).

Several versions of controlling procedures (Benjamini & Hochberg (1995) and Benjamini & Yekutieli (2001)).

A significance measure based on the pFDR: the q-value (Storey & Tibshirani (2003)). q-value: the minimum false discovery rate that can be attained when calling a feature significant.

Requires estimating the proportion of true nulls (m0/m). An empirical estimate was provided based on the fact that the p-values for the null genes are uniformly distributed.
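A minimal R sketch of FDR control with the Benjamini-Hochberg procedure on made-up p-values:

    set.seed(4)
    p <- c(runif(900), rbeta(100, 1, 50))   # 900 null + 100 non-null p-values (made up)
    p.bh <- p.adjust(p, method = "BH")      # Benjamini-Hochberg adjusted p-values
    sum(p < 0.05)                           # calls at raw level 0.05
    sum(p.bh < 0.05)                        # calls controlling the FDR at 5%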
Summary

This only provides some flavor of probability, statistics and their usage.

To learn more, take a full course!
  Introduction to biostatistics for clinical investigators
  Statistical methods for observational studies
References and some useful info

Statistical methods in bioinformatics course slides developed by Dr. Christian Gieger and Dr. Achim Tresch, http://www.scaibit.de/index.php?id=92

Statistical Methods in Bioinformatics by Warren Ewens and Gregory Grant

Introduction to Statistical Thought by Michael Lavine, http://www.math.umass.edu/~lavine/Book/book.html

The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman

Statistical Methods in Molecular Biology, edited by Heejung Bang, Xi Kathy Zhou, Madhu Mazumdar and Heather L. Van Epps

Statistical software package and program language R, http://www.r-project.org