Introduction to Probability and Statistics

Xi Kathy Zhou, PhD
Division of Biostatistics and Epidemiology
Department of Public Health
http://www.med.cornell.edu/public.health/biostat.htm
Feb. 2010
Overview

Statistics:
"The mathematics of the collection, organization and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling" – definition in the American Heritage Dictionary

Why statistics:
Through studying the characteristics of a small collection of observations, proper inference for the entire population can be derived.

Probability theory consists of a set of mathematical tools useful for statistical inference.
Outline

Basic concepts in probability
  Events and random variables
  Probability and probability distributions
  Mean, variance and moments
  Joint, marginal and conditional probabilities
  Dependence and independence

Basic concepts in statistics
  Data
  Descriptive statistics
  Statistical inference – Estimation
  Statistical inference – Hypothesis testing
Probability – a measure of uncertainty

Example:
  Random experiment     Possible outcomes
  Toss a coin           H, T
  Roll a 6-sided die    {3}, {5}, {1,2,3}, …

How to describe these experiments and quantify their outcomes? How to evaluate the uncertainties associated with these outcomes, or even more complicated outcomes, or outcomes from more complicated random experiments?

Probability theory provides tools for describing and quantifying outcomes from random experiments.
Events

Definitions:
  Random experiment: an experiment which can result in different outcomes, and for which the outcome is unknown in advance.
  Sample space Ω: the set of all possible outcomes of an experiment.
  Event: a subset of the sample space Ω.

  Random experiment     Sample space       Events
  Toss a coin           {H, T}             {H}, {T}
  Roll a 6-sided die    {1,2,3,4,5,6}      {3}, {5}, {1,2,3}
Probability measure

Sigma field F: a set of subsets of Ω that satisfies the following:
  1. If A, B ∈ F, then A ∪ B ∈ F and A ∩ B ∈ F
  2. If A ∈ F, then A^c ∈ F
  3. Ø ∈ F

Probability measure P on (Ω, F): a function P: F → [0,1] satisfying the following properties (Axioms of Probability):
  1. P(A) ≥ 0 for any A ∈ F
  2. P(Ω) = 1
  3. If A, B ∈ F and A ∩ B = Ø, then P(A ∪ B) = P(A) + P(B)

The 6-sided die example:
  Sigma field: {Ø, {1}, …, {6}, {1,2}, …, {1,2,3,4,5,6}}
  Sigma field: {Ø, {1,2,3}, {4,5,6}, {1,2,3,4,5,6}}
Probability measure – some properties

Comparing the uncertainty of events:
  4. If A ⊆ Ω, B ⊆ Ω, and A ⊆ B, then P(A) ≤ P(B)

Assessing the uncertainty associated with other events:
  5. P(A^c) = 1 − P(A), where A^c = Ω − A
  6. P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any A, B ∈ Ω
  7. P(A1 ∪ … ∪ Ak ∪ …) = P(A1) + … + P(Ak) + … for pairwise disjoint A1, …, Ak, …

Illustration of rule 6: [Venn diagram of Ω showing the regions A−B, A∩B and B−A for two events A and B]

Example (rolling a 6-sided die):
If we know P(1), …, P(6), we should know the uncertainty of more complicated events such as P({1,2,4}).
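As a quick numerical check of rules 5 and 6 on the fair-die example, here is a minimal Python sketch (not part of the original slides; the event sets anticipate the examples used later in the deck):

```python
from fractions import Fraction

# Fair 6-sided die: P({i}) = 1/6 for each face i.
pmf = {i: Fraction(1, 6) for i in range(1, 7)}

def P(event):
    """Probability of an event, i.e. a subset of the sample space."""
    return sum(pmf[w] for w in event)

A = {2, 4, 6}       # outcome is even
B = {1, 2, 3, 4}    # outcome <= 4

# Rule 5 (complement): P(A^c) = 1 - P(A)
assert P(set(pmf) - A) == 1 - P(A)

# Rule 6 (inclusion-exclusion): P(A u B) = P(A) + P(B) - P(A n B)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))     # 5/6
```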
Probabilities of events – Examples

Experiment: randomly generating a DNA sequence of length 3
  Event E: the generated sequence is "ATG"
  P(E) = 1/4^3 = 1/64

Experiment: randomly generating a DNA sequence with 100 bases among which 20 bases are A's
  Event E: the sequence has 20 A's in a row
  What is the sample space, what is the probability of event E?
  Answer: the 20 A positions can be placed among the 100 bases in C(100, 20) equally likely ways, and a run of 20 A's can start at any of 81 positions, so P(E) = 81 / C(100, 20).
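Both computations can be checked in a few lines of Python (a sketch, not from the slides; `comb` is the standard-library binomial coefficient):

```python
from math import comb

# Length-3 sequence: 4 equally likely bases per position, independent.
p_atg = (1 / 4) ** 3
print(p_atg)                 # 0.015625 = 1/64

# 100 bases with exactly 20 A's: C(100, 20) equally likely placements
# of the A positions, of which 81 put all 20 A's in a consecutive run.
p_run = 81 / comb(100, 20)
print(p_run)                 # ~1.5e-19
```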
Conditional probability and independence

Conditional probability: quantifies the probability of an event given the happening of another event. The conditional probability of A given B is defined as
  P(·|B): Ω → [0,1],  P(A|B) = P(A ∩ B) / P(B),  for A, B ∈ Ω

Relationship of two events
Independence: two events A, B ∈ Ω with P(A) > 0, P(B) > 0 are called (stochastically) independent if one of the following equivalent conditions holds:
  • P(A ∩ B) = P(A)·P(B)
  • P(A|B) = P(A)
  • P(B|A) = P(B)

EXAMPLE: Throwing a 6-sided fair die, we consider the following events:
  A: outcome is an even number, B: outcome ≤ 4, C: outcome is a prime number, D: outcome is an odd number
Questions:
  1. If we know that event B happened in the experiment, what is the probability that event A also happened? (Think about P(A) = ?, P(B) = ?, P(A|B) = ?)
  2. Are A and B independent? How about B and C? How about A and C?
  3. Are A and D independent? How about D and B? How about D and C?
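The questions above can be answered mechanically; here is a minimal Python sketch (not part of the slides) that computes P(A|B) and checks each pair for independence via P(E ∩ F) = P(E)·P(F):

```python
from fractions import Fraction

def P(event):
    # Fair die: each of the 6 faces has probability 1/6.
    return Fraction(len(event), 6)

A = {2, 4, 6}       # even
B = {1, 2, 3, 4}    # <= 4
C = {2, 3, 5}       # prime
D = {1, 3, 5}       # odd

print(P(A & B) / P(B))      # P(A|B) = 1/2 = P(A)

pairs = {"A,B": (A, B), "B,C": (B, C), "A,C": (A, C),
         "A,D": (A, D), "D,B": (D, B), "D,C": (D, C)}
for name, (E, F) in pairs.items():
    print(name, P(E & F) == P(E) * P(F))
```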
Random variable

Random variable: a function X: Ω → R with the property that {ω ∈ Ω : X(ω) ≤ x} ∈ F for each x ∈ R.

  A more common description of the results of a random experiment.
  Takes on values from a set of mutually exclusive and collectively exhaustive states and represents each state with a number.
  Usually denoted by capital letters, e.g. X, Y, Z, etc.
  Realizations of random variables are usually denoted in lower case, e.g., x, y, z, etc.
  Can be discrete or continuous.
Probability distributions

Uncertainties associated with possible values of a random variable are characterized by the random variable's probability distribution.

Definition:
  The probability distribution of a random variable X is the function F: R → [0,1] given by F(x) = P(X ≤ x)
    Takes values between 0 and 1
    Follows the axioms of probability

Uncertainty associated with two (or more) random variables is characterized by their joint distribution function F(x, y) = P(X ≤ x, Y ≤ y).
Discrete Random Variable – Probability mass function

A discrete random variable X has a countable number of possible values x1, x2, …, xk, …

The probability mass function of X is
  P(X = xi) = pi,  i = 1, 2, …, k, …
where the probabilities pi satisfy
  0 ≤ pi ≤ 1
  p1 + p2 + … + pk + … = 1

Range of this random variable: x1, x2, …, xk, …
Discrete Random Variable – Cumulative distribution function (cdf)

The cumulative distribution function of a discrete random variable X is defined as
  F(x) = P(X ≤ x) = Σ_{i: xi ≤ x} pi

Properties of the cdf:
  0 ≤ F(x) ≤ 1
  If x ≤ y then F(x) ≤ F(y), i.e. non-decreasing
  Discrete case: step function, continuous from the right, with jump discontinuities at x1, x2, …, xk, … with heights p1, p2, …, pk, …
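The cdf of the fair die, for example, is the step function built below (a sketch, not from the slides):

```python
from fractions import Fraction
from bisect import bisect_right

xs = [1, 2, 3, 4, 5, 6]            # support points x_i
ps = [Fraction(1, 6)] * 6          # probabilities p_i

def cdf(x):
    """F(x) = sum of p_i over all x_i <= x: a right-continuous step function."""
    k = bisect_right(xs, x)        # number of support points <= x
    return sum(ps[:k], Fraction(0))

print(cdf(0), cdf(3), cdf(3.5), cdf(6))   # 0, 1/2, 1/2, 1
```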
Discrete random variables – Examples of common distributions

Discrete uniform distribution
  Random variable: outcome from rolling a fair 6-sided die
Geometric distribution
  Random variable: the number of Bernoulli trials taken till the first success

Discrete random variable – Discrete uniform distribution

A discrete random variable X is called uniformly distributed on the range x1, x2, …, xk if for all i = 1, …, k:
  P(X = xi) = 1/k

Example: Roll a fair die
  Probability mass function: P(X = i) = 1/6 for i = 1, …, 6
Discrete Random Variable – Geometric distribution (1)

Random experiment (repeat a Bernoulli experiment until the first success), e.g. tossing a coin till the first appearance of an H
  Events: H, TH, TTH, …
  Probability for a success in a single trial: P(H) = π
  Random variable X: "Number of trials until the first success", taking values 1, 2, …
  X has a geometric distribution with parameter π

The probability mass function has the form
  P(X = k) = (1 − π)^(k−1) π,  k = 1, 2, …
The cumulative distribution function has the form
  F(k) = P(X ≤ k) = 1 − (1 − π)^k

Discrete random variable – Geometric distribution (2)
[Figure: probability mass function of the geometric distribution]

Discrete Random Variable – Geometric distribution (3)
[Figure: cumulative distribution function of the geometric distribution]
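Since the plotted pmf and cdf did not survive extraction, here is a small Python sketch (not from the slides) that reproduces them numerically and checks the formulas by simulation:

```python
import random

PI = 0.5   # success probability per trial, e.g. a fair coin

def pmf(k, pi=PI):
    return (1 - pi) ** (k - 1) * pi      # P(X = k)

def cdf(k, pi=PI):
    return 1 - (1 - pi) ** k             # F(k) = P(X <= k)

# Monte Carlo check: count trials until the first success.
random.seed(1)
n = 100_000
draws = []
for _ in range(n):
    k = 1
    while random.random() > PI:
        k += 1
    draws.append(k)

print(pmf(3), sum(d == 3 for d in draws) / n)   # both ~0.125
print(cdf(3), sum(d <= 3 for d in draws) / n)   # both ~0.875
```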
Discrete random variable – Mean

The value you expect to get in a random experiment is the mean, i.e., the average value of a random experiment.

Example: If you toss a coin 10 times, you expect to get 5 heads and 5 tails. You expect this value because the probability of getting "heads" is 0.5, and if you toss 10 times you should get 5.

Definition: The mean of a discrete random variable with values x1, x2, …, xk, … and probability distribution p1, p2, …, pk, … is
  E(X) = Σ_i xi pi

Note that E(X) characterizes the expected outcome from a random experiment.

Discrete random variable – Mean (Example)

Binary random variable X:
  Assume P(X=1) = π and P(X=0) = 1 − π, then
  E(X) = 0·P(X=0) + 1·P(X=1) = π

Toss a fair coin:
  X = gain/loss of one dollar, X(H) = 1, X(T) = −1
  If P(X=1) = P(X=−1), E(X) = ?

Roll a fair 6-sided die:
  Once: X = value of the landing, E(X) = ?
  Twice: X = sum of the values, E(X) = ?
Discrete random variable – Variance and Standard deviation

The variance of a discrete random variable is
  Var(X) = Σ_i (xi − E(X))² pi
The standard deviation is
  σ = √Var(X)

Discrete random variable – Variance (Examples)

Binary random variable: Var(X) = π(1 − π)
  Proof: Var(X) = E(X²) − E(X)² = π − π² = π(1 − π)

Roll a fair die once: X is the value at the landing
  Var(X) = ?
Roll a fair die twice: X is the sum of the values
  Var(X) = ?
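The die questions on the last two slides can be answered by direct computation (a sketch, not from the slides):

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    return sum(p * x for x, p in dist.items())

def var(dist):
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist.items())

# One fair die
die = {x: Fraction(1, 6) for x in range(1, 7)}
print(mean(die), var(die))       # 7/2 and 35/12

# Sum of two independent fair dice
two = {}
for a, b in product(range(1, 7), repeat=2):
    two[a + b] = two.get(a + b, Fraction(0)) + Fraction(1, 36)
print(mean(two), var(two))       # 7 and 35/6 (twice the one-die variance)
```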
Discrete random variables – Independence

Definition: Two discrete random variables X, Y are called independent if the events X=x and Y=y are independent for all x and y, i.e.
  P(X=x, Y=y) = P(X=x)·P(Y=y)

More generally: n discrete random variables X1, X2, …, Xn are called independent if for arbitrary values x1, x2, …, xn in their respective ranges the following is true:
  P(X1=x1, …, Xn=xn) = P(X1=x1)···P(Xn=xn)
Discrete Random variable – Independence (Example)

Random experiment: Roll two dice; X = value of the first die, Y = value of the second die
  For all 1 ≤ i, j ≤ 6:
  P(X=i, Y=j) = 1/36 = 1/6 × 1/6 = P(X=i)·P(Y=j)

Random experiment: Roll a die
  Y = 1 if the value is a "prime number"
  Z = 1 if the value is "smaller than four"
  Are these two events independent? No.
  Because Y=1 and Z=1 means "2 or 3",
  so P(Y=1, Z=1) = 2/6 ≠ 1/2 · 1/2 = P(Y=1)·P(Z=1).
  Or equivalently: is P(Y|Z) = P(Y)?
Example: DNA sequence analysis (1)

• 70 positions of the hemoglobin alpha gene in humans and rats:

HUM ...ACGTCAAGGCCGCCTGGGGCAAGGTTGGCGCGCACGGCGAGTATGGTGCGGAGGCCCTGGAGAATGTTCC...
RAT ...ATGTAAGCCCCGGCTCTGCCCAGGTCAAGGCTCACGGCAAGAAGGTTGCTGATGCCCTGGCCAAAGCTGC...
    1011010001111110010101111000011011111101101010111011011111110011010101
    (match indicator; the alignment spans a mutated region and a conserved region)

• Empirical observation: 45 match positions and 25 mismatch positions
• Are both sequences related?
• Compare the result with a random experiment with independently generated sequences (null model), i.e. both sequences are dissimilar enough that such a model describes the observation well enough.
Example: DNA sequence analysis (2)

HUM ...ACGTCAAGGCCGCCTGGGGCAAGGTTGGCGCGCACGGCGAGTATGGTGCGGAGGCCCTGGAGAATGTTCC...
RAT ...ATGTAAGCCCCGGCTCTGCCCAGGTCAAGGCTCACGGCAAGAAGGTTGCTGATGCCCTGGCCAAAGCTGC...
    1011010001111110010101111000011011111101101010111011011111110011010101

• Define a random variable Zi with
  Zi = 1 if we observe the same nucleotide in both sequences at position i, 0 else
• How likely is it to observe z (= 45) or more match positions?
• This depends on the evolutionary process, but we can try to model this process with probability models.
Example: Model 1

• We assume that both sequences are independently identically distributed (iid) sequences and the two sequences are independent from each other.
• Moreover, we assume that all nucleotides have the same probability to occur, i.e. pA = pC = pG = pT = 0.25.
• Then P(Zi = 1) = 0.25·0.25·4 = 0.25, and Z = "number of matches" with Z = Z1 + … + Zn is B(n, π) with π = 0.25.
• Now it is possible to compute
  P(Z ≥ 45) = Σ_{z=45}^{70} C(70, z) 0.25^z 0.75^(70−z) ≈ 4.78 × 10^(−12)
i.e. 45 or more matches are very unlikely under the assumption of unrelated sequences.
Example: Model 2

• Now we assume that the sequence of base pairs is iid. The base pairs (xi, yi), i = 1, …, n are no longer independent realisations of two random experiments but follow an evolutionary process.
• We still assume that the observations of nucleotides in sequence 1 are uniformly distributed with (pA, pC, pG, pT) = (0.25, 0.25, 0.25, 0.25).
• But the observations in sequence 2 depend on the observations in sequence 1:
  pA|A = P(A in sequence 2 | A in sequence 1) = pC|C = pG|G = pT|T = 0.64
  pA|C = pA|G = pA|T = pC|A = pC|G = pC|T = pG|A = pG|C = pG|T = pT|A = pT|C = pT|G = 0.12
Example: Model 2 (continued)

• The total probability theorem gives us the distribution of sequence 2:
  qA = qC = qG = qT = 0.25
• Both sequences have the same (marginal) distribution but they are not independent, e.g.
  pAA = P("an A in both sequences") = pA|A · pA = 0.64·0.25 = 0.16
• Under this model assumption we get for the random variable Zi the probability distribution
  P(Zi = 1) = 4·0.16 = 0.64
and Z = "number of matches" is binomially distributed B(n, π) with π = 0.64.
• We get
  P(Z ≥ 45) = 1 − P(Z ≤ 44) ≈ 0.54
• Now the observation is much more likely; in fact it is above the expected value of Z:
  E(Z) = nπ = 70·0.64 = 44.8
• The observed number of matches is much more likely under the evolutionary model!
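Both tail probabilities can be reproduced with a short computation (a sketch using only the standard library; `binom_tail` is a helper name introduced here, not from the slides):

```python
from math import comb

def binom_tail(n, k, p):
    """P(Z >= k) for Z ~ B(n, p)."""
    return sum(comb(n, z) * p**z * (1 - p)**(n - z) for z in range(k, n + 1))

print(binom_tail(70, 45, 0.25))   # Model 1: ~4.78e-12
print(binom_tail(70, 45, 0.64))   # Model 2: ~0.54
```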
Continuous Random Variables

Continuous random variable – Probability distribution

Definition. If X: Ω → IR is a random variable, the function
  F: IR → IR,  F(x) = P(X ≤ x)
is called the distribution function of X.

If X is a continuous random variable with density f, the distribution function F can be expressed as
  F(x) = ∫_{−∞}^{x} f(t) dt

This formula is the continuous analogue of the discrete case, in which the distribution function was defined as
  F(x) = Σ_{xj ≤ x} f(xj)
Continuous random variables – mean and variance

The statistics "mean" and "variance", which were already defined for discrete random variables, can be defined in an analogous way for continuous random variables:

            X discrete, xj ∈ {x1, x2, …}              X continuous with density f
            (p.d.f. P(X=xj), c.d.f. P(X≤xj))          (density function f(x), distribution function F(x))

  Mean      E(X) = Σ_j xj P(X=xj)                     E(X) = ∫_{−∞}^{∞} x f(x) dx
  Variance  Var(X) = Σ_j (xj − E(X))² P(X=xj)         Var(X) = ∫_{−∞}^{∞} (x − E(X))² f(x) dx
Continuous random variables – Example 1

Uniform distribution. A continuous random variable X is called uniform or uniformly distributed (in the interval [a,b]) if it has a density function of the form
  f(x) = 1/(b − a)   for x ∈ [a, b]
  f(x) = 0           otherwise
for some real values a < b. This is denoted by X ~ U(a,b).

[Figure: density f and distribution function F of U(a,b); f is constant at 1/(b−a) on [a,b], and F rises from 0 to 1 between a and b]
Continuous random variables – Example 2

Normal / Gaussian distribution. A continuous random variable X is called normally distributed (with mean µ and standard deviation σ > 0), i.e. X ~ N(µ, σ²), if it has a density function of the form
  f(x) = (1/(σ√(2π))) exp( −(x − µ)² / (2σ²) )

There is no closed form for the distribution function F of such a variable; the distribution function has to be computed numerically.

This distribution is symmetric (around x = µ), uni-modal (with mode at x = µ) and shaped like a bell.

[Figure: density and distribution function of some normally distributed random variables X ~ N(µ, σ²), including the standard normal distribution µ=0, σ=1 and the cases µ=0, σ=0.8; µ=0, σ=2; µ=1.5, σ=0.8; µ=1, σ=2]
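Since F has no closed form, in practice it is evaluated numerically via the error function; a minimal Python sketch (not from the slides):

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    # No closed form: erf itself is computed numerically by the library.
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

print(normal_pdf(0.0))     # 0.3989... = 1/sqrt(2*pi)
print(normal_cdf(1.96))    # ~0.975
```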
Two more continuous distributions

The χ²-distribution. If X1, …, Xn are independent random variables that are N(0,1)-distributed, then the random variable
  Z = X1² + X2² + … + Xn²
is said to be Chi-squared distributed with n degrees of freedom, for short Z ~ χ²(n).

Student t-distribution (t-distribution). If X ~ N(0,1) and Z ~ χ²(n) are independent, then the random variable
  T = X / √(Z/n)
is said to have a t-distribution with n degrees of freedom, for short T ~ t(n).

This list of continuous random variables is by no means complete. For a survey, consult the statistics literature given in the reference list to this lecture series.
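Both definitions can be verified by simulation, building Z and T directly from independent N(0,1) draws (a sketch, not from the slides; it checks the standard facts E(Z) = n and, for n > 2, Var(T) = n/(n−2)):

```python
import random

random.seed(0)
n, draws = 5, 100_000

chi2_samples, t_samples = [], []
for _ in range(draws):
    xs = [random.gauss(0, 1) for _ in range(n)]
    z = sum(x * x for x in xs)                  # Z ~ chi^2(n) by definition
    chi2_samples.append(z)
    x = random.gauss(0, 1)                      # independent of z
    t_samples.append(x / (z / n) ** 0.5)        # T ~ t(n) by definition

print(sum(chi2_samples) / draws)                # ~5    (E(Z) = n)
print(sum(t * t for t in t_samples) / draws)    # ~1.67 (Var(T) = 5/3)
```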
Continuous random variables – Independence

Definition. Let Ω be a probability space with probability measure P. Let X: Ω → IR and Y: Ω → IR be continuous random variables. X and Y are called independent if
  P(X ≤ x, Y ≤ y) = P(X ≤ x)·P(Y ≤ y) = F_X(x)·F_Y(y)
for all x, y ∈ IR.

Corollary. If the continuous random variables X and Y are independent,
  P(a1 ≤ X ≤ a2, b1 ≤ Y ≤ b2) = P(a1 ≤ X ≤ a2)·P(b1 ≤ Y ≤ b2)
for all real values of a1 < a2, b1 < b2.
Continuous random variables – Joint and marginal probability distributions

Let X and Y be two random variables on the same probability space Ω. If there exists a function f: IR × IR → IR such that
  P(a1 ≤ X ≤ a2, b1 ≤ Y ≤ b2) = ∫_{a1}^{a2} ∫_{b1}^{b2} f(x, y) dy dx
for all real values of a1 < a2, b1 < b2, then X and Y are said to have a continuous joint (multivariate) distribution, and f is called their joint density. We will be concerned only with this case here.

The marginal distribution of X is given by
  P(a1 ≤ X ≤ a2) = ∫_{a1}^{a2} ∫_{−∞}^{∞} f(x, y) dy dx = ∫_{a1}^{a2} f_X(x) dx,
where
  f_X(x) = ∫_{−∞}^{∞} f(x, y) dy
is the density of the marginal distribution of X.
Continuous Random Variable – Conditional probability distributions

The conditional distribution of X, given Y = b, is given by
  P(a1 ≤ X ≤ a2 | Y = b) = ∫_{a1}^{a2} f_X(x | Y = b) dx,
where
  f_X(x | Y = b) = f(x, b) / ∫_{−∞}^{∞} f(t, b) dt
is the density of the conditional distribution of X, given Y = b.

We mention an equivalent condition for independence:
The random variables X and Y are independent if
  1. f(x, y) = f_X(x)·f_Y(y) for all x, y ∈ IR, or equivalently
  2. f_X(x | Y = b) = f_X(x) for all x, b ∈ IR, or equivalently
  3. f_Y(y | X = a) = f_Y(y) for all a, y ∈ IR.
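To make the marginal and conditional densities concrete, here is a small numeric sketch (not from the slides) for the hypothetical joint density f(x, y) = x + y on the unit square:

```python
def f(x, y):
    # Joint density: integrates to 1 over [0,1] x [0,1].
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def integrate(g, lo, hi, n=2000):
    # Simple midpoint rule, good enough for a smooth density.
    h = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * h) for i in range(n)) * h

def f_X(x):
    return integrate(lambda y: f(x, y), 0, 1)        # marginal: x + 1/2

def f_X_given_Y(x, b):
    return f(x, b) / integrate(lambda t: f(t, b), 0, 1)

print(f_X(0.3))                 # ~0.8
print(f_X_given_Y(0.3, 0.9))    # ~0.857, != f_X(0.3): X and Y are dependent
```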
Basic Concepts in Statistics

Data, sampling and statistical inference

Probability theory: reasoning from f to Y
  "If the experiment is like …, then f will be …, and (y1, …, yn) will be like …, or E(Y) must be …"

Statistics: reasoning from Y to f
  "Since (y1, …, yn) turned out to be …, it seems that f is likely to be …, or the parameter is likely to be around …"

Sampling: ways to select the subjects for which the characteristics/properties of interest will be assessed
  Examples: simple random sampling (SRS), stratified, clustered

Data: characteristics/properties of a random sample from a population.
  For example: age, weight, expression level of a certain gene, …

Statistical inference: learning from data,
  i.e. assuming these data are n draws from distribution f_θ, what can we learn about the population parameter.
Types of Data

There are different types of data:
• numerical data (discrete, continuous)
• categorical data (ordered, non-ordered)
• mixtures of both

If the properties consist of multiple features (like age, sex, smoking status, PGEM levels pre/post treatment here), the data is called multivariate, otherwise it is called univariate.

SubjectID  Age  Sex  Smoke  PackyrCat  PGEM.pre  PGEM.post
N29        66   M    N      0          7.44      1.77
N05        70   M    N      0          16.75     10.82
N04        47   M    N      0          20.38     3.51
N22        68   M    N      0          15.43     4.47
N21        65   F    N      0          2.99      3.14
…
F29        48   F    F      3          12.08     3.44
F07        64   M    F      2          16.64     7.74
F13        26   M    F      1          8.2       6.33
F02        76   M    F      2          7.95      5.36
F30        63   F    F      3          3.63      0.55
…
C14        50   M    C      2          18.23     6.18
C29        42   M    C      1          20.16     10.22
C07        45   F    C      1          10.43     5.67
C25        53   F    C      1          10.6      5.73
C23        35   M    C      3          10.62     8.41
Typical steps in statistical analysis of data

Describe the data (descriptive statistics)
Propose a reasonable probabilistic model
Make inference about parameters in the model
Check the model fitting/assumptions
Report results
Describing univariate categorical data

Frequency table: simply list the count and the relative frequency for each data category.

Example:
  Variable         Level  Count  Rel. Freq.
  Sex              M      53     0.56
                   F      42     0.44
  Smoking Status   N      31     0.33
                   F      30     0.31
                   C      34     0.36
Describing ordered univariate data – Histogram

Cut the possible range of the data into various bins of certain size/s
Count the number/frequency of the data in each bin
Display the count/frequency as a bar plot, with the width of the bars proportional to the length of the intervals
Empirical distribution of the data

[Figure: histograms of Age (roughly 20–80), PGEM.pre (roughly 0–60) and PGEM.post (roughly 0–40), with frequency on the vertical axis]
Descriptive statistics 1

The second and by far the most important way is to summarize the data by appropriate statistics. A statistic is a rule that assigns a number to a dataset. This number is meant to tell us something about the underlying dataset.

Examples:

Sample arithmetic mean. Given x1, …, xn, calculate the arithmetic mean as
  x̄ = (1/n) Σ_{j=1}^{n} xj

The arithmetic mean is one of the many statistics that aim to describe where the "centre" of the data is. The arithmetic mean minimizes the sum of the quadratic distances to the data points, namely
  x̄ = argmin_x Σ_{j=1}^{n} (x − xj)²
Descriptive statistics 2

Median. Let x1, …, xn be given in ascending order. The median x_med is defined as
  x_med = x_((n+1)/2)                 if n is odd
  x_med = (x_(n/2) + x_(n/2+1)) / 2   if n is even

The median is a value such that the number of data points smaller than x_med equals the number of data points greater than x_med. Like the arithmetic mean, the median is also a location measure for the "centre" of the data.

[Figure: a unimodal frequency distribution with its mean, median and mode marked]
Descriptive statistics 3

Sample variance, sample standard deviation. The variance v = Var(x1, …, xn) = Var(x) of a dataset x = (x1, …, xn) is defined as
  v = s² = (1/n) Σ_{j=1}^{n} (xj − x̄)²,  or more commonly  v = s² = (1/(n−1)) Σ_{j=1}^{n} (xj − x̄)²

(the average squared distance from all data points to x̄). The standard deviation s = s(x) is the positive square root of the variance, s² = v. The variance and the standard deviation are measures for the dispersion of the data.

[Figure: two frequency distributions around the same x̄ illustrating small vs. large variance]
Descriptive statistics 4

Symmetry. A frequency distribution is called symmetric if it has an axis of symmetry.

Skewness. A frequency distribution is called skewed to the right if the right tail of the distribution falls off slower than the left tail. Analogously: skewed to the left.

The statistic for sample skewness uses the sample third central moment and the sample variance.

[Figure: left-skewed, symmetric, and right-skewed distributions with mean, median and mode marked]

Posture rules:
  Left skew:   x̄ < x_med < x_mode
  Symmetric:   x̄ ≈ x_med ≈ x_mode
  Right skew:  x̄ > x_med > x_mode
Descriptive statistics 5

Quantiles. Let q ∈ (0,1). A q-quantile of a frequency distribution is a value xq such that the fraction of data lying left of xq is at least q, and the fraction lying right of xq is at least 1 − q. If the data is ordered, x(1) ≤ x(2) ≤ … ≤ x(n), then
  xq = x(⌈nq⌉)             if nq is not an integer
  xq ∈ [x(nq), x(nq+1)]    if nq is an integer

[Figure: density with the quantiles x0.05, x0.25, x0.5, x0.75, x0.95 marked]

Special quantiles are the quartiles x0.25, x0.5, x0.75 (which split up the data into four classes), and the quintiles x0.2, x0.4, x0.6, x0.8. They are frequently used to give a summary of the data distribution.
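A minimal Python sketch (not from the slides) implementing these definitions on the first ten PGEM.pre values from the data table shown earlier:

```python
import math

data = [7.44, 16.75, 20.38, 15.43, 2.99, 12.08, 16.64, 8.2, 7.95, 3.63]
n = len(data)
xs = sorted(data)

xbar = sum(data) / n
v = sum((x - xbar) ** 2 for x in data) / (n - 1)   # the "more common" n-1 version
s = math.sqrt(v)
med = (xs[n // 2 - 1] + xs[n // 2]) / 2 if n % 2 == 0 else xs[n // 2]

def quantile(q):
    nq = n * q
    if nq != int(nq):
        return xs[math.ceil(nq) - 1]
    return (xs[int(nq) - 1] + xs[int(nq)]) / 2     # midpoint of [x_(nq), x_(nq+1)]

print(xbar, med, s)                    # ~11.15, 10.14, ~5.98
print(quantile(0.25), quantile(0.75))
```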
Detailed description of univariate data

Density plots. If the number of data points is large, it is often convenient to approximate a histogram (of the relative frequencies) by a density curve.

[Figure: histogram with an overlaid density curve (red line); the segment between x0 and x1 is shaded grey]

A density function is a non-negative real-valued integrable function f such that
  ∫_{−∞}^{∞} f(x) dx = 1
(this condition says that the area enclosed by the graph of f and the x-axis is 1).
Interpretation: the area of a segment enclosed by the x-axis, the graph of f and the vertical lines x = x0 and x = x1 (the grey shaded area in the figure) equals the fraction of data points with values between x0 and x1.
Continuous univariate data – An important distribution

Normal distributions = Gaussian distributions. A very important family of density functions are the Gaussian distributions, defined as
  f(x) = (1/(σ√(2π))) exp( −(1/2)·((x − µ)/σ)² )
with parameters µ and σ > 0.

This distribution is symmetric (around x = µ), unimodal (with mode at x = µ) and shaped like a bell. The mean of Gaussian distributed data is µ, its variance is σ².

The 68-95-99.7 rule. If a dataset has a Gaussian distribution with mean µ and variance σ², then
  68% of the data lie within the interval [µ−σ, µ+σ]
  95% of the data lie within the interval [µ−2σ, µ+2σ]
  99.7% of the data lie within the interval [µ−3σ, µ+3σ]
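The rule is easy to check empirically (a sketch, not from the slides):

```python
import random

random.seed(42)
mu, sigma, n = 10.0, 2.0, 100_000
data = [random.gauss(mu, sigma) for _ in range(n)]

for k, target in [(1, 68), (2, 95), (3, 99.7)]:
    frac = sum(mu - k * sigma <= x <= mu + k * sigma for x in data) / n
    print(f"within {k} sd: {100 * frac:.1f}%  (rule: ~{target}%)")
```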
Summary

Frequency tables, bar plots, pie charts, histograms and density plots are possible ways to display the distribution of statistical data.
Mean, median and quantiles are summary statistics for the location of numerical data.
The variance is a measure of dispersion for numerical data.
Higher order statistics such as sample skewness and kurtosis are also used to describe additional characteristics of the distribution.
Multivariate descriptive statistics – Multidimensional data

In many applications a set of properties/features is measured for each study subject; such data is considered multidimensional.
It is often of interest to evaluate the relationship between/among these features and to quantify the strength of the relationship.

Examples:
  The PGEM data we saw earlier. For each study subject, multiple aspects were measured, such as age, gender, PGEM value, etc.
  Microarray gene expression data are multidimensional.

Describe the relationship between two discrete features: contingency table
Describe the relationship between two continuous features: correlation
General description: Contingency table – Absolute frequencies

The contingency table can be used to describe the joint distribution of X and Y in terms of absolute frequencies, X = a1, …, ak, Y = b1, …, bm.

A (k × m) contingency table of absolute frequencies has the form:

            Smoking Status
  Gender    C     F     N     Total
  F         16    12    14    42
  M         18    18    17    53
  Total     34    30    31    95

(The row and column totals are the marginal frequencies.)
Contingency table – Relative frequencies

The contingency table can also be used to describe the joint distribution of X and Y in terms of relative frequencies:

            Smoking Status
  Gender    C      F      N      Total
  F         0.17   0.13   0.15   0.44
  M         0.19   0.19   0.18   0.56
  Total     0.36   0.32   0.33   1.00

(The row and column totals are the marginal relative frequencies.)
Contingency table – Conditional frequencies

The contingency table can help us examine the dependency between two discrete variables, based on the relationship between the relative frequencies (joint distribution of the two variables) and the conditional relative frequencies (conditional distributions of the variables).

Therefore: look at conditional frequencies, i.e. the distribution of a feature for a fixed value of the second feature.

Relative freq. conditional on gender:
            Smoking Status
  Sex       C      F      N      Total
  F         0.38   0.29   0.33   1
  M         0.34   0.34   0.32   1

Relative freq. conditional on smoking status:
            Smoking Status
  Sex       C      F      N
  F         0.47   0.40   0.45
  M         0.53   0.60   0.55
  Total     1.00   1.00   1.00
Contingency table Contingency table ––
Conditional frequency distribution (1)Conditional frequency distribution (1)
Conditional frequency distribution of Y under the condition X=aConditional frequency distribution of Y under the condition X=a
given by:given by:given by:given by:
Conditional frequency distribution of X under the condition Y=bConditional frequency distribution of X under the condition Y=b
is given by:is given by:
Conditional frequency distribution (1)Conditional frequency distribution (1)
Conditional frequency distribution of Y under the condition X=aConditional frequency distribution of Y under the condition X=aii, also written Y|X=a, also written Y|X=aii , is , is
Conditional frequency distribution of X under the condition Y=bConditional frequency distribution of X under the condition Y=bj j , also written X|Y=b, also written X|Y=bjj , ,
Contingency table – Conditional frequency distribution (2)

Because the marginal frequencies are the row and column sums of the joint frequencies, f(a_i) = Σ_j f(a_i, b_j) and f(b_j) = Σ_i f(a_i, b_j), we also have

    f(b_j | a_i) = f(a_i, b_j) / Σ_j f(a_i, b_j)   and   f(a_i | b_j) = f(a_i, b_j) / Σ_i f(a_i, b_j)

The conditional distributions are thus computed by dividing the joint frequencies by the appropriate marginal frequencies.
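As a minimal R sketch of these computations — the counts below are hypothetical (n = 101) and only roughly reproduce the rounded table above:

    # hypothetical counts: rows = Sex (F, M), columns = Smoking Status (C, F, N)
    counts <- matrix(c(17, 13, 15,
                       19, 19, 18),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(Sex = c("F", "M"),
                                     Smoking = c("C", "F", "N")))
    prop.table(counts)              # joint relative frequencies
    rowSums(prop.table(counts))     # marginal distribution of Sex
    prop.table(counts, margin = 1)  # conditional on Sex: rows sum to 1
    prop.table(counts, margin = 2)  # conditional on Smoking: columns sum to 1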
Contingency table – χ² coefficient

Starting point: how should the joint frequencies look if we "empirically" assume independence between X and Y (given the marginal distributions)? Under independence, the expected relative frequencies are the products of the marginals, f(a_i)·f(b_j), i.e., the expected cell counts are

    expected_ij = n_i· · n_·j / n

where n_i· and n_·j are the row and column totals.
Contingency table – Empirical independence

Idea: X and Y are "empirically" independent if and only if the conditional frequencies are equal in each sub-population X = a_i, i.e., independent of a_i.
Contingency table – Assessing empirical independence

Idea: compare for each cell (i, j) the observed frequency with the theoretical frequency expected under the assumption of independence:

    χ² = Σ_ij (observed_ij − expected_ij)² / expected_ij,   expected_ij = n_i· · n_·j / n

(the χ² coefficient; a good approximation in large samples).

χ² = 0  <==>  X and Y are empirically independent
Contingency table – Properties of the χ² coefficient

χ² = 0  <==>  X and Y are empirically independent
χ² large  <==>  strong dependence/association
χ² small  <==>  weak dependence/association

Disadvantage: the magnitude of χ² depends on the dimensions of the table.
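A small R sketch of the χ² computation, reusing the hypothetical 2×3 counts from above:

    counts <- matrix(c(17, 13, 15, 19, 19, 18), nrow = 2, byrow = TRUE)
    n     <- sum(counts)
    expec <- outer(rowSums(counts), colSums(counts)) / n  # expected counts under independence
    sum((counts - expec)^2 / expec)                       # the chi-squared coefficient
    chisq.test(counts)$statistic                          # base-R equivalent (no continuity correction for 2x3)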
Graphical representation of two continuous features

The simplest representation of the values (x_i, y_i), i = 1,…,n from two continuous features X and Y: plotting (x_1, y_1),…,(x_n, y_n) in a coordinate system gives a scatter plot.

[Figure: scatter plot of log(PGEM.post) against log(PGEM.pre)]
Pearson's correlation coefficient (1)

The Pearson correlation coefficient is commonly used to describe the strength of the linear association between two continuous features. For the data (x_i, y_i), i = 1,…,n, it is defined as

    r = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² )

The range of r is [−1, 1]:
  r > 0: positive correlation, positive linear relationship, i.e., the values scatter around a straight line with positive slope
  r < 0: negative correlation, negative linear relationship, i.e., the values scatter around a straight line with negative slope
  r = 0: no correlation; X and Y are uncorrelated
Pearson's correlation coefficient (2)

The correlation coefficient r measures the strength of a linear relationship, and only of a linear one: a strong nonlinear association can still yield r close to 0.
Pearson's correlation coefficient (3)

Rule of thumb (one common convention for |r|):
  |r| < 0.5         "weak correlation"
  0.5 ≤ |r| < 0.8   "medium correlation"
  |r| ≥ 0.8         "strong correlation"

Linear transformations: for x̃_i = a + b·x_i and ỹ_i = c + d·y_i, the correlation coefficient between x̃ and ỹ satisfies

    r(x̃, ỹ) = sign(b·d) · r(x, y)

i.e., r is unchanged by linear transformations up to its sign.
Equivalent forms of r

Multiplying out the products yields

    Σ_i (x_i − x̄)(y_i − ȳ) = Σ_i x_i y_i − n·x̄·ȳ

(remember the analogous formula for variances!).

In terms of standard deviations and covariance,

    r = s_xy / (s_x · s_y)

with covariance s_xy = (1/(n−1)) Σ_i (x_i − x̄)(y_i − ȳ) and standard deviations s_x and s_y. (Whether 1/n or 1/(n−1) is used does not matter for r, as long as it is used consistently — the constants cancel.)
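A quick R check that the three forms agree, on made-up data:

    set.seed(1)
    x <- rnorm(20); y <- 2 * x + rnorm(20)      # made-up data with a linear trend
    r1 <- cor(x, y)                             # built-in Pearson correlation
    r2 <- cov(x, y) / (sd(x) * sd(y))           # covariance / (s_x * s_y)
    r3 <- sum((x - mean(x)) * (y - mean(y))) /
          sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
    c(r1, r2, r3)                               # all three values agree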
Statistical Inference

Estimation
  Finding approximations of the model parameters – point estimation
  Finding the uncertainty associated with the model parameters – interval estimation (finding the confidence intervals)
  Estimates are used to characterize a population characteristic

Hypothesis testing
  Examining the validity of our hypotheses regarding a population characteristic by using the observed data
Estimation

Example model for the baseline PGEM levels by smoking status:

    y_i,pgem.pre = θ_C x_i,sm=C + θ_F x_i,sm=F + θ_N x_i,sm=N + ε_i,   ε_i ~ N(0, σ²)

Point estimation, i.e., finding θ̂_C(y|x), θ̂_F(y|x), θ̂_N(y|x)

Interval estimation, i.e., finding the uncertainties associated with the point estimates: var(θ̂_C(y|x)), var(θ̂_F(y|x)), var(θ̂_N(y|x))

Desired properties of the estimator:
  unbiasedness (bias is measured as the expected difference between the estimator and the population parameter)
  efficiency (can be described by the inverse of the variance of the estimator)
  small mean square error (MSE), E(θ̂ − θ)²
  other: consistency, etc.

Common methods to find estimators:
  Method of moments
  Maximum likelihood estimation
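A minimal R sketch of point and interval estimation for a model of this form; the data frame, group means and sample sizes below are made up for illustration:

    set.seed(2)
    d <- data.frame(sm       = factor(rep(c("C", "F", "N"), each = 10)),
                    pgem.pre = rnorm(30, mean = rep(c(12, 10, 8), each = 10), sd = 2))
    fit <- lm(pgem.pre ~ sm - 1, data = d)  # no intercept: one mean theta per smoking group
    coef(fit)      # point estimates of theta_C, theta_F, theta_N
    confint(fit)   # interval estimates (95% confidence intervals)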
Estimation Methods

Method of moments:
  Match the first (E(X)), second (E(X²)), …, order moments to the parameters
  Solve the resulting equation system
  If E(X^k) = g(θ), then θ̂ = g⁻¹(sample k-th moment)

Maximum Likelihood Estimation (MLE):
  Assuming the data come from a parametric family indexed by a population parameter θ, i.e., X_1,…,X_n ~ i.i.d. f(x|θ), the joint density of the data is

      f(X_1,…,X_n | θ) = Π_i f(X_i | θ)

  The probability of observing the data, viewed as a function of θ under the assumed probabilistic model, is the likelihood function of the parameter:

      Likelihood = f(x_1,…,x_n | θ) = Π_i f(x_i | θ)
Example: Binomial data

Data: 6, 3, 5, 6, 8 – the numbers of successes in 5 repeated experiments of tossing a coin 10 times.

  Is this a fair coin?
  What is going to come up for the 11th toss?

Assuming a probabilistic model: X ~ Binom(π, 10)

Estimating π:
  MoM: because E(X) = 10π, the estimate of π = sample mean / 10 = (0.6 + 0.3 + 0.5 + 0.6 + 0.8) / 5 = 0.56
  MLE: L(π|data) = P(x_1 = 6,…, x_5 = 8 | π) = P(x_1 = 6 | π) ··· P(x_5 = 8 | π), then find the value of π that maximizes the likelihood function
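A short R check of both estimates; the data are from the slide, and numeric maximization of the log-likelihood is one way to find the MLE:

    x <- c(6, 3, 5, 6, 8)                       # successes out of 10 tosses each
    mom <- mean(x) / 10                         # method of moments: E(X) = 10*pi
    loglik <- function(p) sum(dbinom(x, size = 10, prob = p, log = TRUE))
    mle <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
    c(mom = mom, mle = mle)                     # both are (approximately) 0.56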
Example: Normal data

x_1, x_2,…, x_n ~ iid N(µ, σ²)

The pdf of a single observation is

    f(x | µ, σ²) = (1 / √(2πσ²)) · exp( −(x − µ)² / (2σ²) )

Joint pdf for the whole random sample:

    f(x_1, x_2,…, x_n | µ, σ²) = f(x_1 | µ, σ²) ··· f(x_n | µ, σ²)

The likelihood function is basically the joint pdf for the fixed sample:

    l(µ, σ² | x_1, x_2,…, x_n) = f(x_1,…, x_n | µ, σ²)

Maximum likelihood estimates of the model parameters µ and σ² are the numbers that maximize the joint pdf for the fixed sample, which is called the likelihood function:

    µ̂ = (1/n) Σ_i x_i,    σ̂² = (1/n) Σ_i (x_i − µ̂)²
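A quick R illustration on simulated data; note the MLE of σ² divides by n, while R's built-in var() divides by n − 1:

    set.seed(3)
    x <- rnorm(50, mean = 5, sd = 2)            # made-up sample
    mu.hat     <- mean(x)                       # MLE of mu: the sample mean
    sigma2.hat <- mean((x - mu.hat)^2)          # MLE of sigma^2 (divides by n)
    n <- length(x)
    c(mu.hat, sigma2.hat, var(x) * (n - 1) / n) # last two values agree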
Hypothesis Testing

Making inference about the value of the population parameter based on the data:

  Form hypotheses about the population parameter
    null hypothesis: H0
    alternative hypothesis: H1
  Construct and calculate the test statistic
    it utilizes the population parameter estimates
    a desired property is that it contains information about the uncertainty associated with the parameter estimates
  Define a rejection region (or the significance level) for the test statistic based on H0
  Reject H0 if the test statistic falls inside the rejection region (i.e., the test statistic is deemed highly unlikely to be generated from the probabilistic model defined by H0)
  Fail to reject H0 if the data are not highly unlikely to happen under H0
Hypothesis Testing Example

Question of interest: are the baseline PGEM levels different between two conditions (for example, current smokers vs. never smokers)?

Hypotheses to be tested:
  H0: θ_C = θ_N
  H1: θ_C ≠ θ_N

Choose an appropriate statistic D that is able to discriminate between the two hypotheses, and choose a rejection region / significance level based on H0.

The selection of the statistic defines the test.
Hypothesis Testing Example: Fold Change

Fold change: a test statistic commonly used by biologists. Divide the average PGEM level in current smokers by the average PGEM level in never smokers:

    D = θ̂_C / θ̂_N = [ (y_C,1 + y_C,2 + … + y_C,n_C) / n_C ] / [ (y_N,1 + y_N,2 + … + y_N,n_N) / n_N ]

This statistic does not take into account the variation in the study sample. If the distribution is skewed, using the median instead of the mean as the parameter estimate may be more robust.

Rejection region for D:
  Often based on experience with the biological system.
  For example, one may define the rejection region as D < 0.5 or D > 2, i.e.:
    if |log2 D| > 1, reject H0 in favour of H1
    if |log2 D| ≤ 1, fail to reject H0
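A minimal R sketch of the fold-change rule on made-up PGEM measurements:

    yC <- c(10.1, 12.3, 9.8, 11.5, 10.9)   # hypothetical current smokers
    yN <- c(4.2, 5.1, 3.8, 4.9, 5.3)       # hypothetical never smokers
    D  <- mean(yC) / mean(yN)              # fold change, about 2.3 here
    abs(log2(D)) > 1                       # TRUE here: reject H0 under the |log2 D| > 1 rule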
Hypothesis Testing Example: t-test

Another commonly used test statistic is the two-sample Student t-statistic.

Assumption:

    y_C,1,…, y_C,n_C ~ iid N(θ_C, σ_C²),    y_N,1,…, y_N,n_N ~ iid N(θ_N, σ_N²)

Calculating the T statistic:

    T = (θ̂_C − θ̂_N) / √( σ̂_C²/n_C + σ̂_N²/n_N )

The T statistic is a random variable with a somewhat complicated distribution. Under the normality assumption, T has approximately a t-distribution with d degrees of freedom, where d is the closest integer to

    (S_C²/n_C + S_N²/n_N)² / [ (S_C²/n_C)² / (n_C − 1) + (S_N²/n_N)² / (n_N − 1) ]
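A minimal R sketch on the same made-up samples as in the fold-change example; R's t.test() performs this unequal-variance (Welch) test by default, using the degrees-of-freedom formula above without rounding:

    yC <- c(10.1, 12.3, 9.8, 11.5, 10.9)
    yN <- c(4.2, 5.1, 3.8, 4.9, 5.3)
    T  <- (mean(yC) - mean(yN)) / sqrt(var(yC) / length(yC) + var(yN) / length(yN))
    T                 # hand-computed test statistic
    t.test(yC, yN)    # Welch two-sample t-test: same statistic, plus d.f. and p-value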
Hypothesis Testing Example: t-test (2)

Deciding on the rejection region – the significance level:
  • Usually represented by α ∈ (0, 1)
  • A probabilistic interpretation of the rejection region, i.e., the probability that the test statistic falls into the rejection region J under H0:

        P(D ∈ J | H0) = α

  • A commonly used α in simple hypothesis testing is 0.05
  • This means the rejection region lies in the part of the distribution that the test statistic has only a very small probability of reaching under H0
  • We can derive the rejection region from the significance level
Hypothesis Testing Example: t-test (3)

If the test statistic has a t distribution with 8 degrees of freedom and the significance level is α = 5%, the rejection region can be defined as the region where, under H0, we would expect T only 5% of the time: above t(0.975; 8) = 2.306 or below t(0.025; 8) = −2.306.

Thus a typical decision rule in this case would be: reject H0 in favour of H1 if |T| > t(0.975; 8) = 2.306.

[Figure: density of the t-statistic for k = 8 degrees of freedom; the symmetric rejection region for α = 5% puts 2.5% in each tail, with 95% in the middle.]
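The critical values can be looked up in R:

    qt(0.975, df = 8)   #  2.306: upper critical value t(0.975; 8)
    qt(0.025, df = 8)   # -2.306: lower critical value t(0.025; 8)
    # decision rule: reject H0 if abs(T) > qt(0.975, df = 8)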
Hypothesis Testing Example: t-test (4)

Instead of comparing the observed t-statistic to the critical values which define the rejection region, we can compare the p-value of the test statistic to the significance level.

P-value: the probability of observing values of T that are at least as extreme as the observed t under H0,

    p = P(|T| > |t| | H0)

Given a significance level α, we reject H0 if p < α.
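In R, for an observed t and d degrees of freedom (the values below are hypothetical):

    t.obs <- 2.8; d <- 8
    p <- 2 * pt(-abs(t.obs), df = d)   # two-sided p-value P(|T| > |t| | H0)
    p < 0.05                           # TRUE here, so reject H0 at alpha = 0.05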
Two types of tests

Parametric tests: a parametric distribution is assumed for the measured random variables. E.g., the t-test assumes that the variables are normally distributed. (If this were not the case, it would lead to wrong p-values or wrong confidence intervals.) Often, prior to computing a test statistic, the data are transformed in order to produce random variables that are easier to handle (e.g., to produce approximately normally distributed data).

Non-parametric tests: no parametric distribution function is assumed for the measured random variables. Used when the distribution of the measured variables is not known, or when there is no appropriate test that can deal with their distribution. These tests merely rely on the relative order of the values, or on some very mild constraints concerning the shape of the probability distributions of the measured variables (e.g., unimodality, symmetry).

We mention one parametric and one non-parametric test which are commonly used.
Wilcoxon rank sum test

Given two samples x = (x_1,…,x_n) and y = (y_1,…,y_m) drawn independently from the random variables X and Y, respectively.

Null hypothesis: the two variables X and Y have the same distribution.

Test statistic:
  Rank order all N = n + m values from both samples combined (n is the size of the larger sample and m is the size of the smaller sample).
  Sum the ranks of the smaller sample and call this value w.
  Look up the level of significance (p-value) in a table using w, m and n.
  The exact p-value can be calculated based on all permutations of the ranks over both samples (when n and m are large, approximations based on the central limit theorem can be used).

For large samples it is almost as sensitive as the two-sample Student t-test.
For small samples with unknown distributions this test is even more sensitive than the Student t-test.
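A one-line R version, reusing the made-up samples from the t-test example:

    x <- c(10.1, 12.3, 9.8, 11.5, 10.9)
    y <- c(4.2, 5.1, 3.8, 4.9, 5.3)
    wilcox.test(x, y)   # exact p-value for small samples without ties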
Hypothesis Testing – Error types

If we reject the null hypothesis when it is actually true, we have made what is called a type I error, or a false positive. (Example: falsely declaring a gene as differentially expressed.)

If we accept the null hypothesis when it is actually false, we have made a type II error, or a false negative. (Example: failing to identify a truly differentially expressed gene.)

                           H0 true                          H0 not true
H0 not rejected (P ≥ α)    True negatives                   Type II error (false negatives)
H0 rejected (P < α)        Type I error (false positives)   True positives
Hypothesis Testing – Error Types (Cont.)

In hypothesis testing, the probability of a Type I error is controlled to be at most as high as the significance level of the test.

It is harder to control the probability of a Type II error, because we usually do not have a statistic for testing the alternative hypothesis.

The smaller the true existing difference, the larger the probability of a Type II error.

Given a statistical testing procedure, it is impossible to keep both error types arbitrarily small by selecting a special significance level. There is a trade-off between the type I and type II errors, as depicted below.

[Figure: probability of a type II error plotted against the probability of a type I error – the trade-off between error types, plotted for different significance levels.]
Summary

Null hypothesis, test statistics
Significance level, rejection region, p-value
Type I and type II errors
5-step testing procedure
Parametric tests: t-test, ANOVA
Non-parametric tests: Wilcoxon rank sum test, Kruskal-Wallis
Multiple hypothesis testing

Golub et al. (1999) were interested in identifying genes that are differentially expressed in patients with two types of leukemia:
  - acute lymphoblastic leukemia (ALL, class 0) and
  - acute myeloid leukemia (AML, class 1).

Gene expression levels were measured using Affymetrix chips containing g = 6817 human genes.

n = 38 samples = 27 ALL cases + 11 AML cases.
Multiple hypothesis testing (Cont. 2)

The preprocessed data include the expression of 3051 genes for each of the 38 subjects. A two-sample t-test statistic was computed for each of the 3051 genes, and p-values were obtained for each gene based on the t statistic (computed in R as 2*(1-pnorm(abs(teststat)))).

[Figures: histogram of the 3051 test statistics; histogram of the corresponding p-values.]

Which of these genes can we consider as differentially expressed?
Multiple hypothesis testing (Cont. 3)

Can we use the 0.05 significance level to identify differentially expressed genes?

P-value: the probability of finding a difference equal to or greater than the observed one just by chance under the null hypothesis. In multiple comparisons (or repeated experiments), the p-value can be viewed as a measure of the false positive rate (F/m0).

                   Called significant   Called not significant   Total
Null true          F                    m0 − F                   m0
Alternative true   T                    m1 − T                   m1
Total              S                    m − S                    m

The commonly used 0.05 significance level can result in too many false positive findings.

Exercise: assuming that among the 3000 genes 20% are truly differentially expressed, can you give a conservative estimate of the rate of false positives among those called significant if we use the 0.05 significance level? How about if the proportion of truly differentially expressed genes is 10%? (A numerical check is sketched below.)
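One way to check the exercise numerically — this sketch assumes every truly differentially expressed gene is called significant, so the resulting F/S is a lower bound on the false positive rate among the calls:

    m <- 3000; alpha <- 0.05
    for (p1 in c(0.20, 0.10)) {      # proportion truly differentially expressed
      m0 <- m * (1 - p1)             # true null genes
      FP <- alpha * m0               # expected false positives at level 0.05
      S  <- FP + m * p1              # calls if every true positive is found
      cat(sprintf("p1 = %.2f: expected FP = %.0f, FP/S >= %.2f\n", p1, FP, FP / S))
    }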
Multiple hypothesis testing (Cont. 4)

Family-wise error rate (FWER): the probability of having at least one false positive in multiple comparisons.

Many versions of controlling procedures: Bonferroni, Holm (1979), Hochberg (1988), Hommel (1988).

Can be too conservative for genomic studies.

Table: FWER (and, in parentheses, expected number of false positives) for different numbers of comparisons N at different α levels:

  α \ N    1            5            10          50          100        1000
  0.01     0.01 (0.01)  0.05 (0.05)  0.10 (0.1)  0.39 (0.5)  0.63 (1)   1.00 (10)
  0.05     0.05 (0.05)  0.23 (0.25)  0.40 (0.5)  0.92 (2.5)  0.99 (5)   1.00 (50)
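The table can be reproduced in R from the identity FWER = 1 − (1 − α)^N for N independent tests (independence of the tests is the assumption behind these numbers):

    N <- c(1, 5, 10, 50, 100, 1000)
    for (alpha in c(0.01, 0.05)) {
      fwer <- 1 - (1 - alpha)^N    # P(at least one false positive)
      efp  <- alpha * N            # expected number of false positives
      print(round(rbind(FWER = fwer, expected.FP = efp), 2))
    }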
Multiple hypothesis testing (Cont. 5)

False discovery rate (FDR / pFDR): the proportion of hits that are false (F/S).

Several versions of controlling procedures (Benjamini & Hochberg (1995) and Benjamini & Yekutieli (2001)).

A significance measure based on the pFDR: the q-value (Storey & Tibshirani (2003)). q-value: the minimum false discovery rate that can be attained when calling a feature significant.

Requires estimating the proportion of true nulls (m0/m). An empirical estimate was provided based on the fact that the p-values for the null genes are uniformly distributed.
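A minimal R sketch of FDR control with the Benjamini-Hochberg procedure on made-up p-values:

    set.seed(4)
    p <- c(runif(900), rbeta(100, 1, 50))   # 900 null + 100 non-null p-values (made up)
    p.bh <- p.adjust(p, method = "BH")      # Benjamini-Hochberg adjusted p-values
    sum(p < 0.05)                           # calls at raw level 0.05
    sum(p.bh < 0.05)                        # calls controlling the FDR at 5%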
Summary

This only provides some flavor of probability, statistics and their usage.

To learn more, take a full course!
  Introduction to biostatistics for clinical investigators
  Statistical methods for observational studies
References and some useful info

Statistical methods in bioinformatics course slides developed by Dr. Christian Gieger and Dr. Achim Tresch, http://www.scaibit.de/index.php?id=92

Statistical Methods in Bioinformatics by Warren Ewens and Gregory Grant

Introduction to Statistical Thought by Michael Lavine, http://www.math.umass.edu/~lavine/Book/book.html

The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman

Statistical Methods in Molecular Biology, edited by Heejung Bang, Xi Kathy Zhou, Madhu Mazumdar and Heather L. Van Epps

Statistical software package and program language R, http://www.r-project.org