Intro to comp genomics Lecture 3-4: Examples, Approximate Inference.
Posted on 19-Dec-2015
Example 1: Mixtures of Gaussians
Pr(x | θ) = Σ_i p_i · N(x; μ_i, σ_i)      (the mixture model)
Pr(x | θ) = N(x; μ, σ)                    (a single Gaussian)
We have experimental results of some value. We want to describe the behavior of the experimental values: essentially one behavior? Two behaviors? More? In one dimension it may look very easy: just looking at the distribution will give us a good idea.
We can formulate the model probabilistically as a mixture of normal distributions.
As a generative model: to generate data from the model, we first select the sub-model by sampling from the mixture variable. We then generate a value using the selected normal distribution.
If the data is multi-dimensional, the problem becomes non-trivial.
Inference is trivial
Pr(x | θ) = Σ_i p_i · N(x; μ_i, σ_i)

Let’s represent the model as:

Pr(s | x) = Pr(s) Pr(x | s) / Σ_{s'} Pr(s') Pr(x | s')
          = p_s · N(x; μ_s, σ_s) / Σ_i p_i · N(x; μ_i, σ_i)
What is the inference problem in our model?
Pr(x) = Σ_s Pr(s) Pr(x | s)
Inference: computing the posterior probability of a hidden variable given the data and the model parameters.
For p_0 = 0.2, p_1 = 0.8, μ_0 = 0, μ_1 = 1, σ_0 = 1, σ_1 = 0.2, what is Pr(s = 0 | x = 0.8)?
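The arithmetic of this exercise can be checked with a short sketch (plain Python; `normal_pdf` and `mixture_posterior` are helper names introduced here, not from the lecture):

```python
import math

def normal_pdf(x, mu, sigma):
    # N(x; mu, sigma) density
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_posterior(x, p, mu, sigma):
    # Pr(s | x) = p_s N(x; mu_s, sigma_s) / sum_i p_i N(x; mu_i, sigma_i)
    lik = [p[s] * normal_pdf(x, mu[s], sigma[s]) for s in range(len(p))]
    total = sum(lik)
    return [l / total for l in lik]

post = mixture_posterior(0.8, p=[0.2, 0.8], mu=[0.0, 1.0], sigma=[1.0, 0.2])
print(post[0])  # Pr(s = 0 | x = 0.8) is small: component 1 fits x = 0.8 much better
```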
Estimation/parameter learning
Pr(x | θ) = Σ_i p_i · N(x; μ_i, σ_i)
Given data, how can we estimate the model parameters?
L(θ | x_1, …, x_n) = Π_i Pr(x_i | θ) = Π_i Σ_j p_j · N(x_i; μ_j, σ_j)
Transform it into an optimization problem!
Likelihood: a function of the parameters. Defined given the data.
Find parameters that maximize the likelihood: the ML problem
Can be approached heuristically: using any optimization technique.
But it is a non-linear problem which may be very difficult.
Generic optimization techniques:
Gradient ascent
Simulated annealing
Genetic algorithms
And more..

All aim to find θ̂ = argmax_θ L(θ | x_1, …, x_n)
The EM algorithm for mixtures – inference allows for learning
Pr(x | θ) = Σ_i p_i · N(x; μ_i, σ_i)
We start by guessing parameters θ⁰:
We now go over the samples and compute their posteriors (i.e., inference):
Pr(s | x_i, θ⁰) = p_s⁰ · N(x_i; μ_s⁰, σ_s⁰) / Σ_{s'} p_{s'}⁰ · N(x_i; μ_{s'}⁰, σ_{s'}⁰)
We use the posteriors to compute new estimates for the expected sufficient statistics of each distribution, and for the mixture coefficients:
μ_s¹ = E[x_s] = Σ_i Pr(s | x_i, θ⁰) · x_i / Σ_i Pr(s | x_i, θ⁰)

(σ_s¹)² = V[x_s] = Σ_i Pr(s | x_i, θ⁰) · (x_i − μ_s¹)² / Σ_i Pr(s | x_i, θ⁰)

p_s¹ = (1/N) Σ_i Pr(s | x_i, θ⁰)
Continue iterating until convergence.
The EM theorem: the algorithm will converge and will improve the likelihood monotonically.
But:
No guarantee of finding the optimum, or of finding anything meaningful.
The initial conditions are critical:
Think of starting from μ₀ = 0, μ₁ = 10, σ₀ = σ₁ = 1
Solutions: start from “reasonable” solutions; try many starting points.
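The E and M steps above can be sketched in plain Python (a minimal sketch under stated assumptions: a two-component 1-D mixture fit to synthetic data; all names are introduced here):

```python
import math, random

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_gmm(xs, p, mu, sigma, iters=200):
    # EM for a 1-D mixture of Gaussians, following the updates above.
    for _ in range(iters):
        # E step: posteriors Pr(s | x_i, theta)
        post = []
        for x in xs:
            lik = [p[s] * normal_pdf(x, mu[s], sigma[s]) for s in range(len(p))]
            tot = sum(lik)
            post.append([l / tot for l in lik])
        # M step: weighted mean, variance, and mixture coefficients
        for s in range(len(p)):
            w = sum(post[i][s] for i in range(len(xs)))
            mu[s] = sum(post[i][s] * xs[i] for i in range(len(xs))) / w
            var = sum(post[i][s] * (xs[i] - mu[s]) ** 2 for i in range(len(xs))) / w
            sigma[s] = max(math.sqrt(var), 1e-3)  # guard against variance collapse
            p[s] = w / len(xs)
    return p, mu, sigma

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(5, 0.5) for _ in range(700)]
p, mu, sigma = em_gmm(xs, p=[0.5, 0.5], mu=[-1.0, 6.0], sigma=[1.0, 1.0])
```

The starting points were chosen near the true modes; as the slide warns, bad initial conditions can converge to meaningless solutions.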
Example 2: Mixture of sequence models
• A probabilistic model for binding sites:
• This is the site independent model, defining a probability space over k-mers
• Assume a set of sequences contains unknown binding sites (one for each)
• The position of the binding site is a hidden variable h
• We introduce a background model Pb that describes the sequence outside of the binding site (usually a d-order Markov model)
• Given complete data we can write down the likelihood of a sequence s as:
The site model:

P(m) = Π_{i=1}^k P_i(m[i])

Pr(s, l | θ) = Pr(l) · P_back(s) · Π_{i=1}^k [ P_i(s[l+i]) / P_back(s[l+i] | s[l+i−d .. l+i−1]) ]

where the background probability of the whole sequence is

P_back(s) = Π_{i=1}^{|s|} P_back(s[i] | s[i−d .. i−1])
• Inference of the binding site location posterior:
• Note that only k-factors need to be computed for each location (Pb(s) is constant))
One hidden variable = trivial inference
Pr(l | s, θ) = Pr(s, l | θ) / Σ_{l'} Pr(s, l' | θ)
• If we assume some of the sequences may lack a binding site, this should be incorporated into the model:
Pr(s, l | θ) = Pr(hit) · Pr(l) · P_back(s) · Π_{i=1}^k [ P_i(s[l+i]) / P_back(s[l+i] | s[l+i−d .. l+i−1]) ]

(the site position l is defined only when hit = 1)
• This is sometimes called the ZOOPS model (Zero Or One Occurrence Per Sequence)
Hidden Markov Models
Observing only emissions of states to some probability space E. Each state is equipped with an emission distribution Pr(e | x) (x a state, e an emission).

Pr(s_i, e_i) = Pr(s_i | s_{i−1}) · Pr(e_i | s_i)

(Figure: the state space, each state emitting into the emission space.)

Caution! This is NOT the HMM Bayes net:
1. Cycles
2. States are NOT random variables!
Example 3: Mixture with “memory”
P(x | θ) = Σ_h P(h, x | θ) = Σ_h Pr(h_0) · Π_i Pr(h_i | h_{i−1}) · P(x_i | h_i)
We sample a sequence of dependent values. At each step, we decide if we continue to sample from the same distribution or switch, with probability p.
We can compute the probability directly only given the hidden variables.
P(x) is derived by summing over all possible combination of hidden variables. This is another form of the inference problem (why?)
There is an exponential number of h assignments, can we still solve the problem efficiently?
(Figure: two states A, B with transition probabilities P(B | A), P(A | B) and emission distributions P(x | A), P(x | B).)
Inference in HMM
Forward formula:
f_s(0) = 1 if s = start, else 0
f_s(i) = Pr(e_i | s) · Σ_{s'} f_{s'}(i−1) · Pr(s | s')

Backward formula:

b_s(N) = Pr(finish | s)
b_s(i) = Σ_{s'} Pr(s' | s) · Pr(e_{i+1} | s') · b_{s'}(i+1)

(Figure: the Start → states → Finish trellis with emissions, annotated with f_s(i) and b_s(i).)

The total likelihood, from either direction:

L = Σ_s f_s(N) · Pr(finish | s)
L = Σ_s Pr(s | start) · Pr(e_1 | s) · b_s(1)
Computing posteriors:

The posterior probability for a transition from s' to s after character i:

Pr(s_i = s', s_{i+1} = s | e) = (1/L) · f_{s'}(i) · Pr(s | s') · Pr(e_{i+1} | s) · b_s(i+1)

The posterior probability for emitting the i'th character from state s:

Pr(s_i = s | e) = (1/L) · f_s(i) · b_s(i)
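The forward/backward recursions and the emission posteriors can be sketched directly in plain Python (a toy two-state HMM; the state names, transition and emission tables are made-up numbers, not from the lecture):

```python
# Forward-backward for a small HMM, following the f/b recursions above.
# trans[s'][s] = Pr(s | s'); emit[s][e] = Pr(e | s); start/finish as in the slides.
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}    # Pr(s | start)
finish = {"H": 1.0, "L": 1.0}   # Pr(finish | s): here we may finish anywhere
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.2, "L": 0.8}}
emit = {"H": {"A": 0.4, "C": 0.6}, "L": {"A": 0.8, "C": 0.2}}

def forward_backward(obs):
    n = len(obs)
    f = [{s: emit[s][obs[0]] * start[s] for s in states}]
    for i in range(1, n):
        f.append({s: emit[s][obs[i]] * sum(f[i - 1][t] * trans[t][s] for t in states)
                  for s in states})
    b = [{s: finish[s] for s in states}]
    for i in range(n - 2, -1, -1):
        b.insert(0, {s: sum(trans[s][t] * emit[t][obs[i + 1]] * b[0][t] for t in states)
                     for s in states})
    L = sum(f[n - 1][s] * finish[s] for s in states)   # total likelihood
    # posterior Pr(state at i = s | obs) = f_s(i) b_s(i) / L
    post = [{s: f[i][s] * b[i][s] / L for s in states} for i in range(n)]
    return L, post

L, post = forward_backward("ACCA")
```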
Example 4: Hidden states
Example: Two Markov models describe our data. Switching between models occurs at random. How to model this?

(Figure: a hidden state with no emission.)
Example 5: Profile HMM for Protein or DNA motifs
(Figure: the profile HMM architecture — a chain of Match (M), Insert (I), and Delete (D) states between Start (S) and Finish (F).)
• M (Match) states emit a certain amino acid/nucleotide profile
• I (Insert) states emit some background profile
• D (Delete) states are silent (no emission)
• Use the model for classification or annotation (both emissions and transition probabilities are informative!)
• Can use EM to train the parameters from a set of examples
• (How do we determine the right size of the model?) (google PFAM, Prosite, “HMM profile protein domain”)
Example 6: N-order Markov model
•In most biological sequences, the Markov property is a big problem
•N-order relations can be modeled naturally:
Common error:
Forward/Backward in N-order HMM. Can dynamic programming work?
(Figure: Bayes-net unrollings contrasted —
1-HMM Bayes net: each state depends on the previous state.
2-HMM Bayes net: each state depends on the two previous states.)
Example 7: Pair-HMM
Given two sequences s1,s2, an alignment is defined by a set of ‘gaps’ (or indels) in each of the sequences.
ACGCGAACCGAATGCCCAA---GGAAAACGTTTGAATTTATA
ACCCGT-----ATGCCCAACGGGGAAAACGTTTGAACTTATA
(indels marked by '-')
Standard dynamic programming algorithms compute the best alignment given such a distance metric:

d(l_1, l_2, s_1, s_2) = Σ_i δ(s_1[i], s_2[i]) + λ · #(gaps)

d(s_1, s_2) = min_{l_1, l_2} d(l_1, l_2, s_1, s_2)

Affine gap cost: λ_open + λ_ext · l
Substitution matrix: δ(a, b) ~ log(Pr(a, b))
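The standard dynamic programming recurrence can be sketched with a linear gap cost (a minimal sketch; the unit mismatch penalty and gap cost `lam=2` are assumptions, not the lecture's parameters):

```python
# Global alignment by dynamic programming with a linear gap cost,
# following the distance metric above (delta and lam are made up).
def align_score(s1, s2, lam=2, delta=lambda a, b: 0 if a == b else 1):
    n, m = len(s1), len(s2)
    # d[i][j] = best cost of aligning s1[:i] with s2[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * lam
    for j in range(1, m + 1):
        d[0][j] = j * lam
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + delta(s1[i - 1], s2[j - 1]),  # match/mismatch
                          d[i - 1][j] + lam,                              # gap in s2
                          d[i][j - 1] + lam)                              # gap in s1
    return d[n][m]
```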
Pair-HMM
Generalize the HMM concept to probabilistically model alignments.
Problem: we are observing two sequences, not a-priori related. What will be emitted from our HMM?
(Figure: pair-HMM states — Match state M and gap states G1, G2, between Start (S) and Finish (F).)

Match states emit an aligned nucleotide pair. Gap states emit a nucleotide from one of the sequences only. Pr(M→Gi) – “gap open cost”, Pr(G1→G1) – “gap extension cost”.
Is it a BN template?Forward-backward formula?
Example 8: The simple tree model
(Figure: the simple tree — root H2 with children H1 and S3; H1 with children S1 and S2.)
Sequences of extant and ancestral species are random variables, with Val(X) = {A,C,G,T}
Extant species S_j, j = 1..n; ancestral species H_j, j = 1..(n−1)
Tree T: parent relations pa(S_i), pa(H_i)
(pa(S_1) = H_1, pa(S_3) = H_2; the root: H_2)
For multiple loci we can assume independence and use the same parameters (today):

Pr(s, h) = Π_j Pr(s_j, h_j)

Pr(x_i | pa x_i) = [exp(t_i Q)]_{pa x_i, x_i}

Pr(s_j, h_j) = Pr(h_root) · Π_{i ≠ root} Pr(x_i | pa x_i)

In the triplet:

Pr(s, h) = Pr(h_2) · Pr(h_1 | h_2) · Pr(s_3 | h_2) · Pr(s_1 | h_1) · Pr(s_2 | h_1)
The model is defined using conditional probability distributions and the root “prior” probability distribution
The model parameters can be the conditional probability distribution tables (CPDs)
Or we can have a single rate matrix Q and branch lengths:
Pr(x | y) =
  0.96  0.01  0.02  0.01
  0.01  0.96  0.01  0.02
  0.02  0.01  0.96  0.01
  0.01  0.02  0.01  0.96
Ancestral inference
We assume the model (structure, parameters) is given, and denote it by θ:

Pr(s, h | θ) = Pr(h_root) · Π_{i ≠ root} Pr(x_i | pa x_i)

The total probability of the data s:

Pr(s | θ) = Σ_h P(h, s | θ)

This is also called the likelihood L(θ). Computing Pr(s) is the inference problem.

Given the total probability it is easy to compute the posterior of h_i given the data:

Pr(h | s, θ) = P(h, s | θ) / Pr(s | θ)      (easy!)

Pr(h_i = x | s) = Σ_{h : h_i = x} P(h, s) / Pr(s | θ)      (marginalization over h_i – an exponential number of terms?)
Example:

(Figure: the triplet tree with leaves observed as A, C, A and hidden internal nodes marked '?'.)

Given partial observations s:

Pr(h_i = x | s) = Σ_{h : h_i = x} P(h, s) / Pr(s)

The total probability of the data: Pr((A, C, A))
The posterior of an ancestral state, e.g.: Pr(h_1 = A | (A, C, A))

Pr(x | y) =
  0.96  0.01  0.02  0.01
  0.01  0.96  0.01  0.02
  0.02  0.01  0.96  0.01
  0.01  0.02  0.01  0.96

Uniform prior at the root.
Algorithm (following Felsenstein 1981):

Up(i):
  if (extant) { up[i][a] = (a == S_i ? 1 : 0); return }
  up(r(i)); up(l(i))
  iterate on a:
    up[i][a] = Σ_{b,c} Pr(X_{l(i)} = b | X_i = a) · up[l(i)][b] · Pr(X_{r(i)} = c | X_i = a) · up[r(i)][c]

Down(i):
  iterate on a:
    down[i][a] = Σ_{b,c} Pr(X_{sib(i)} = b | X_{par(i)} = c) · up[sib(i)][b] · Pr(X_i = a | X_{par(i)} = c) · down[par(i)][c]
  down(r(i)); down(l(i))

Main:
  up(root); L = 0
  foreach a {
    L += Pr(root = a) · up[root][a]
    down[root][a] = Pr(root = a)
  }
  LL = log(L)
  down(r(root)); down(l(root))
Dynamic programming to compute the total probability:

(Figure: the triplet tree with hidden nodes marked '?', annotated with up[4] and up[5].)
(Figure: the same tree annotated with up[3], down[4], and down[5].)

Computing marginals and posteriors:

P(h_i | s) = up[i][c] · down[i][c] / Σ_j up[i][j] · down[i][j]
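The up pass on the triplet tree can be written out and checked against brute-force summation over the hidden states (a minimal sketch using the 4×4 matrix and uniform root prior from the slides; function names are introduced here):

```python
# Felsenstein's pruning ("up" pass) on the triplet tree, verified against
# brute-force summation over the hidden states h1, h2.
P = [[0.96, 0.01, 0.02, 0.01],
     [0.01, 0.96, 0.01, 0.02],
     [0.02, 0.01, 0.96, 0.01],
     [0.01, 0.02, 0.01, 0.96]]
ALPHA = "ACGT"

def up_leaf(obs):
    return [1.0 if a == obs else 0.0 for a in range(4)]

def up_internal(up_left, up_right):
    # up[a] = (sum_b P[a][b] up_left[b]) * (sum_c P[a][c] up_right[c])
    return [sum(P[a][b] * up_left[b] for b in range(4)) *
            sum(P[a][c] * up_right[c] for c in range(4)) for a in range(4)]

def likelihood(s1, s2, s3):
    # tree: root h2 with children h1 and s3; h1 with children s1, s2
    up_h1 = up_internal(up_leaf(ALPHA.index(s1)), up_leaf(ALPHA.index(s2)))
    up_h2 = up_internal(up_h1, up_leaf(ALPHA.index(s3)))
    return sum(0.25 * up_h2[a] for a in range(4))  # uniform root prior

def brute_force(s1, s2, s3):
    i1, i2, i3 = (ALPHA.index(c) for c in (s1, s2, s3))
    return sum(0.25 * P[h2][h1] * P[h2][i3] * P[h1][i1] * P[h1][i2]
               for h2 in range(4) for h1 in range(4))

pr_ACA = likelihood("A", "C", "A")
```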
Simple Tree: Inference as message passing
(Figure: messages passed over the tree from the DATA at the leaves; each upward message says “You are P(H | our data)”, and the root combines them: “I am P(H | all data)”.)
Transition posteriors: not independent!

(Figure: a parent-child edge with observed DATA A, C below it.)

Pr(x | y) =
  0.96  0.01  0.02  0.01
  0.01  0.96  0.01  0.02
  0.02  0.01  0.96  0.01
  0.01  0.02  0.01  0.96

Down: (0.25), (0.25), (0.25), (0.25)
Up: (0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)
Understanding the tree model (and BNs): reversing edges
The joint probability of the simple tree model:

Pr(h, x) = Pr(h_2) · Pr(h_1 | h_2) · Pr(x_1 | h_1) · Pr(x_2 | h_1) · Pr(x_3 | h_2)

Can we change the position of the root and keep the joint probability as is?

Pr(h, x) = Pr'(h_1) · Pr'(h_2 | h_1) · Pr(x_1 | h_1) · Pr(x_2 | h_1) · Pr(x_3 | h_2)

We need: Pr(h_2) · Pr(h_1 | h_2) = Pr'(h_1) · Pr'(h_2 | h_1)

This is satisfied by:

Pr'(h_1) = Σ_{h_2} Pr(h_2) · Pr(h_1 | h_2)
Pr'(h_2 | h_1) = Pr(h_2) · Pr(h_1 | h_2) / Pr'(h_1)
Inference can become difficult
Pr(s, h | θ) = Pr(h_root) · Π_{i ≠ root} Pr(x_i | pa x_i)
We want to perform inference in an extended tree model expressing context effects:
(Figure: a 3×3 grid of variables 1–9 with context edges, forming undirected cycles.)
With undirected cycles, the model is well defined but inference becomes hard
We want to perform inference on the tree structure itself!
Each structure imposes a probability on the observed data, so we can perform inference on the space of all possible tree structures, or tree structures + branch lengths.
Pr(T | D) = Pr(D | T) · Pr(T) / Σ_{T'} Pr(D | T') · Pr(T')

(Figure: candidate tree structures 1, 2, 3.)
What makes these examples difficult?
Factor graphs
Defining the joint probability for a set of random variables given:
1) Any set of node subsets (hypergraph)
2) Functions on the node subsets (Potentials)
Given variables V and a set of subsets A = {a | a ⊆ V} with potentials φ_a(x_a):

Joint distribution:   Pr(x) = (1/Z) · Π_a φ_a(x_a)
Partition function:   Z = Σ_x Π_a φ_a(x_a)
If the potentials are conditional probabilities, what will be Z?
Things are difficult when there are several modes
(Figure: a factor graph — square factor nodes connected to round random-variable (R.V.) nodes.)
Not necessarily 1! (can you think of an example?)
More definitions
The model:

log Pr(x) = −log Z + Σ_a log φ_a(x_a)

Potentials can be defined on discrete, real-valued variables etc. It is also common to define general log-linear models directly:

Pr(x) = (1/Z) · exp(Σ_a w_a log φ_a(x_a))

Inference:

Pr(D | θ) = (1/Z) · Σ_{x consistent with D} exp(Σ_a w_a log φ_a(x_a))

Pr(x_i | D, θ) = [(1/Z) · Σ_{x consistent with D, x_i} exp(Σ_a w_a log φ_a(x_a))] / Pr(D | θ)

Learning: find the factors' parameterization θ̂ = argmax_θ Pr(D | θ)
Belief propagation in a factor graph
P(x | θ) = (1/Z) · Π_a φ_a(x_a)
• Remember, a factor graph is defined given a set of random variables (indices i, j, k…) and a set of factors on groups of variables (indices a, b…)
• Think of messages as transmitting beliefs:
  a→i: given my other input variables, and ignoring your message, you are x
  i→a: given my other input factors and my potential, and ignoring your message, you are x
• x_a refers to an assignment of values to the inputs of the factor a
• Z is the partition function (which is hard to compute)
• The BP algorithm is constructed by computing and updating messages:
  Messages from factors to variables: m_{a→i}(x_i)
  Messages from variables to factors: m_{i→a}(x_i)
  (each message is a function: any value attainable by x_i → real values)
Message update rules:

Messages from variables to factors:
m_{i→a}(x_i) = Π_{c ∈ N(i)\a} m_{c→i}(x_i)

Messages from factors to variables:
m_{a→i}(x_i) = Σ_{x_a \ x_i} φ_a(x_a) · Π_{j ∈ N(a)\i} m_{j→a}(x_j)
The algorithm proceeds by updating messages:
• Define the beliefs as approximating single-variable posteriors (p(h_i | s)):

b_i(x_i) = Π_{a ∈ N(i)} m_{a→i}(x_i)

Algorithm:
Initialize all messages to uniform.
Iterate until no message changes:
  Update factor-to-variable messages
  Update variable-to-factor messages
• Why is this different from the mean field algorithm? In mean field:

q(h) = Π_i q(h_i)

Beliefs on factor inputs:

b_a(x_a) = φ_a(x_a) · Π_{j ∈ N(a)} m_{j→a}(x_j)

This is far from mean field, since for example:

b_a(x_a) = φ_a(x_a) · Π_{j ∈ N(a)} Π_{c ∈ N(j)\a} m_{c→j}(x_j)
The update rules can be viewed as derived from constraints on the beliefs:

1. Requirement on the variable beliefs (b_i):
   b_i(x_i) = Π_{a ∈ N(i)} m_{a→i}(x_i)

2. Requirement on the factor beliefs (b_a):
   b_a(x_a) = φ_a(x_a) · Π_{j ∈ N(a)} Π_{c ∈ N(j)\a} m_{c→j}(x_j)

3. Marginalization requirement:
   b_i(x_i) = Σ_{x_a \ x_i} b_a(x_a)
BP on Tree = Up-Down

(Figure: the tree as a factor graph — variables h_1, h_2, h_3 and observed s_1..s_4, with factors a, b, c, d, e on the edges; e.g. a = Pr(s_1 | h_1), b = Pr(s_2 | h_1), c = Pr(h_1 | h_3).)

The BP messages reproduce the up/down quantities, for example:

up_{h_1}(h_1) = Pr(s_1 | h_1) · Pr(s_2 | h_1) = m_{a→h_1}(h_1) · m_{b→h_1}(h_1)

m_{a→h_1}(h_1) = Σ_{x_a \ h_1} φ_a(x_a) · m_{s_1→a}(s_1) = Pr(s_1 | h_1)
m_{b→h_1}(h_1) = Σ_{x_b \ h_1} φ_b(x_b) · m_{s_2→b}(s_2) = Pr(s_2 | h_1)

down_{h_1}(h_1) = Σ_{h_2, h_3} up(h_2) · down(h_3) · Pr(h_2 | h_3) · Pr(h_1 | h_3)

m_{c→h_1}(h_1) = Σ_{h_3} φ_c(h_1, h_3) · m_{h_3→c}(h_3) = Σ_{h_3} Pr(h_1 | h_3) · m_{d→h_3}(h_3) · m_{e→h_3}(h_3)
Loopy BP is not guaranteed to converge
(Figure: two variables X and Y coupled by factors with the anti-diagonal potential table [[0, 1], [1, 0]]; BP messages oscillate between the assignments (1, 0) and (0, 1).)

This is not a hypothetical scenario – it frequently happens when there is too much symmetry. For example, most mutational effects are double-stranded and therefore symmetric, which can result in loops.
Sampling is a natural way to do approximate inference
Marginal probability (integration over the whole space) → Marginal probability (integration over a sample)
Sampling from a BN
Naively: if we could draw h, s' according to the distribution Pr(h, s'), then: Pr(s) ≈ (# samples with s) / (# samples)
Forward sampling: use a topological order on the network. Select a node whose parents are already determined; sample from its conditional distribution (all parents are already determined!).
Claim: Forward sampling is correct:
(Figure: the 3×3 grid network, nodes 1–9.)

How to sample from the CPD?

E_P[1{h, s}] = Pr(h, s)
Focus on the observations
Naïve sampling is terribly inefficient, why?
What is the sampling error?
Why don’t we constrain the sampling to fit the evidence s?
Two tasks: P(s) and P(f(h)|s), how to approach each/both?
This can be done, but we no longer sample from P(h,s), and not from P(h|s) (why?)
Likelihood weighting
Likelihood weighting: weight = 1.
Use a topological order on the network; select a node whose parents are already determined:
  if the variable was not observed: sample from its conditional distribution
  else: weight *= P(x_i | pa x_i), and fix the observation
Store the sample x and its weight w[x]
Pr(h|s) = (total weights of samples with h) / (total weights)
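The algorithm can be sketched on a tiny two-node network H → S where the posterior is also computable by hand (the CPD numbers are made up for illustration):

```python
import random

# Likelihood weighting on a tiny network H -> S with made-up CPDs,
# estimating Pr(H = 1 | S = 1).
P_H1 = 0.3
P_S1_given_H = {1: 0.9, 0: 0.2}

def likelihood_weighting(n_samples, seed=0):
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_samples):
        h = 1 if rng.random() < P_H1 else 0  # unobserved node: sample from its CPD
        w = P_S1_given_H[h]                  # observed node: fix S = 1, weight by its CPD
        num += w * (h == 1)
        den += w
    return num / den  # (total weight of samples with H = 1) / (total weight)

est = likelihood_weighting(200000)
exact = P_H1 * P_S1_given_H[1] / (P_H1 * P_S1_given_H[1] + (1 - P_H1) * P_S1_given_H[0])
```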
(Figure: the grid network with the bottom-row nodes observed; the sample's weight is the product of the observed nodes' CPD entries:)

Weight = Π_{observed s_j} Pr(s_j | pa s_j)
Importance sampling

Our estimator from M samples is:

Ê_P[f] = (1/M) Σ_{m=1}^M f(h[m])

But it can be difficult or inefficient to sample from P. Assume we sample instead from Q; then:

E_{P(H)}[f(H)] = E_{Q(H)}[f(H) · P(H)/Q(H)]

Unnormalized importance sampling: given a sample D = {h[1], …, h[M]} from Q,

Ê_D[f] = (1/M) Σ_{m=1}^M f(h[m]) · w(h[m]),   where w(h[m]) = P(h[m]) / Q(h[m])

Claim: to minimize the variance, use a Q distribution proportional to the target function:

Q(H) ∝ |f(H)| · P(H)

Var_Q(f(H) · w(H)) = E_Q[(f(H) · w(H))²] − (E_P[f(H)])²

Prove it!
Correctness of likelihood weighting: importance sampling

Unnormalized importance sampling with the likelihood weighting proposal distribution Q and any function on the hidden variables:

Proposition: the likelihood weighting algorithm is correct (in the sense that it defines an estimator with the correct expected value).

For the likelihood weighting algorithm, our proposal distribution Q is defined by fixing the evidence at the nodes in a set E and ignoring the CPDs of variables with evidence. We sample from Q just like forward sampling from a Bayesian network in which all edges going into evidence nodes were eliminated!

w(h) = P(h, s) / Q(h, s) = Π_{i ∈ E} Pr(x_i | pa x_i)

Q(x | D) = Π_{i ∉ E} Pr(x_i | pa x_i)

Ê_D[1{h_i}] = (1/M) · Σ_{m=1}^M 1{h_i[m]} · w(h[m])
Normalized importance sampling

When sampling from P(h | s) we don't know P, so we cannot compute w = P/Q. We do know P(h, s) = P(h | s) · P(s); denote this unnormalized target P'(h) = P(h, s) ∝ P(h | s). Then:

E_Q[w(H)] = Σ_h Q(h) · P'(h)/Q(h) = Σ_h P'(h)

E_{P(H)}[f(H)] = E_{Q(H)}[f(H) · w(H)] / E_Q[w(H)]

Sample D = {h[1], …, h[M]} from Q; the normalized importance sampling estimator is:

Ê_D[f] = Σ_{m=1}^M f(h[m]) · w(h[m]) / Σ_{m=1}^M w(h[m]),   w(h) = P'(h)/Q(h)

So we will use sampling to estimate both terms at once:

P̂(h | s) = Σ_{m=1}^M 1{h[m] = h} · w(h[m]) / Σ_{m=1}^M w(h[m])
Using the likelihood weighting Q, we can compute posterior probabilities in one pass (no need to sample P(s) and P(h, s) separately).

Limitations of forward sampling: (Figure: two networks contrasting observed and unobserved nodes — likelihood weighting is effective for one but not the other.)
Symmetric and reversible Markov processes
Definition: we call a Markov process symmetric if its rate matrix is symmetric:

∀ i, j: Q_ij = Q_ji

What would a symmetric process converge to?

Definition: a reversible Markov process is one for which (for times s < t):

Pr(X_t = j | X_s = i) = Pr(X_s = j | X_t = i)

Claim: a Markov process is reversible iff there exist π_i such that:

π_i · q_ij = π_j · q_ji

If this holds, we say the process is in detailed balance and the π are its stationary distribution.

Proof: Bayes law and the definition of reversibility.

Reversibility

Claim: a Markov process is reversible iff we can write:

q_ij = s_ij · π_j,  where S is a symmetric matrix.

(Composition: running the process Q for time t and then t' is the same as running it for time t + t'.)
Markov Chain Monte Carlo (MCMC)
We don’t know how to sample from P(h)=P(h|s) (or any complex distribution for that matter)
The idea: think of P(h|s) as the stationary distribution of a Reversible Markov chain
Find a process with transition probabilities for which detailed balance holds:

Pr(y | x) · P(x) = Pr(x | y) · P(y)

Then sample a trajectory y_1, y_2, …, y_m

Theorem (C a counter):

lim_{n→∞} (1/n) · C(y_i = x) = P(x)

The process must be irreducible (you can reach from anywhere to anywhere with p > 0), so we can start from anywhere!
The Metropolis(-Hastings) Algorithm
Why reversible? Because detailed balance makes it easy to define the stationary distribution in terms of the transitions.

So how can we find appropriate transition probabilities? We want:

Pr(y | x) · P(x) = Pr(x | y) · P(y)

Define a symmetric proposal distribution:

F(y | x) = F(x | y)

And an acceptance probability:

A(y | x) = min(1, P(y)/P(x))

Then detailed balance holds:

Pr(y | x) · P(x) = F(y | x) · min(1, P(y)/P(x)) · P(x)
               = min(F(y | x) · P(x), F(y | x) · P(y))
               = F(x | y) · min(1, P(x)/P(y)) · P(y)
               = Pr(x | y) · P(y)

What is the big deal? We reduced the problem to computing ratios between P(x) and P(y).
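A minimal Metropolis sketch: the target is a discrete distribution known only up to a constant (the weights [1, 2, 3, 4] over four states are made up), and the proposal is a symmetric move to a neighboring state:

```python
import random

# Metropolis sampling of a discrete target proportional to `weights`.
# The proposal moves to a uniformly random neighbor on a 4-cycle,
# so F(y | x) = F(x | y) is symmetric.
weights = [1.0, 2.0, 3.0, 4.0]

def metropolis(n_steps, seed=0):
    rng = random.Random(seed)
    counts = [0] * 4
    x = 0
    for _ in range(n_steps):
        y = (x + rng.choice([-1, 1])) % 4                     # symmetric proposal
        if rng.random() < min(1.0, weights[y] / weights[x]):  # accept w.p. min(1, P(y)/P(x))
            x = y
        counts[x] += 1
    return [c / n_steps for c in counts]

freqs = metropolis(200000)
target = [w / sum(weights) for w in weights]
```

Only ratios of the (unnormalized) target ever appear, which is exactly the point of the slide.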
Acceptance ratio for a BN

To sample from Pr(h | s), we will only have to compute acceptance ratios:

min(1, Pr(h' | s) / Pr(h | s)) = min(1, Pr(h', s) / Pr(h, s))

For example, if the proposal distribution changes only one variable h_i, what would be the ratio?

min(1, Pr(h_1, …, h_{i−1}, h_i', h_{i+1}, …, h_n | s) / Pr(h_1, …, h_{i−1}, h_i, h_{i+1}, …, h_n | s))

We affected only the CPDs of h_i and its children.

Definition: the minimal Markov blanket of a node in a BN includes its children, parents, and children's parents.

To compute the ratio, we care only about the values of h_i and its Markov blanket.
Gibbs sampling

A very similar algorithm (in fact, a special case of the Metropolis algorithm):

Start from any state h
do {
  Choose a variable H_i
  Form h^{t+1} by sampling a new h_i from Pr(h_i | h^t_{−i}, s)
}

This is a reversible process with our target stationary distribution:

Pr(h | s) · Pr(h → h') = Pr(h | s) · Pr(h_i' | h_1, …, h_{i−1}, h_{i+1}, …, h_n, s)
                      = Pr(h | s) · Pr(h_1, …, h_{i−1}, h_i', h_{i+1}, …, h_n | s) / Pr(h_1, …, h_{i−1}, h_{i+1}, …, h_n | s)

which is symmetric in h_i and h_i', so detailed balance holds.

Gibbs sampling is easy to implement for BNs: the conditional Pr(h_i' | h_{−i}, s) depends only on h_i's Markov blanket,

Pr(h_i' | h_{−i}, s) ∝ Pr(h_i' | pa h_i) · Π_{j : h_i ∈ pa h_j} Pr(h_j | pa h_j)
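The resampling loop can be sketched on a two-variable joint (a minimal sketch; the unnormalized table `w` is made up, and each step resamples one variable from its exact conditional given the other):

```python
import random

# Gibbs sampling for two binary variables with joint P(h1, h2) ∝ w[h1][h2].
w = [[1.0, 2.0], [3.0, 4.0]]

def gibbs(n_steps, seed=0):
    rng = random.Random(seed)
    h1, h2 = 0, 0
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for _ in range(n_steps):
        # resample h1 from Pr(h1 | h2)
        p1 = w[1][h2] / (w[0][h2] + w[1][h2])
        h1 = 1 if rng.random() < p1 else 0
        # resample h2 from Pr(h2 | h1)
        p2 = w[h1][1] / (w[h1][0] + w[h1][1])
        h2 = 1 if rng.random() < p2 else 0
        counts[(h1, h2)] += 1
    return {k: v / n_steps for k, v in counts.items()}

freqs = gibbs(200000)
total = sum(sum(row) for row in w)
target = {(a, b): w[a][b] / total for a in (0, 1) for b in (0, 1)}
```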
Sampling in practice
lim_{n→∞} (1/n) · C(y_i = x) = P(x)

How much time until convergence to P? (The burn-in time.)

(Figure: a sampling trajectory — a burn-in phase, then the sampling phase; mixing.)

Consecutive samples are still correlated! Should we sample only every n steps?

We sample while fixing the evidence, starting from anywhere but waiting some time before starting to collect data.
A problematic space would be loosely connected:
Examples for bad spaces
More terminology: make sure you know how to define these:
Inference
Parameter learning
Likelihood
Total probability/Marginal probability
Exact inference/Approximate inference
Z-scores, T-test – the basics
t = (X̄_A − X̄_B) / (S_p · √(1/n_A + 1/n_B)),   S_p² = ((n_A − 1)·S_A² + (n_B − 1)·S_B²) / (n_A + n_B − 2)
You want to test if the mean (RNA expression) of a gene set A is significantly different than that of a gene set B.
If you assume the variance of A and B is the same:
t is distributed like T with nA+nB-2 degrees of freedom
If you don’t assume the variance is the same:
t = (X̄_A − X̄_B) / √(s_A²/n_A + s_B²/n_B)

d.o.f. ≈ (s_A²/n_A + s_B²/n_B)² / [ (s_A²/n_A)²/(n_A − 1) + (s_B²/n_B)²/(n_B − 1) ]
But in this case the whole test becomes rather flaky!
In a common scenario, you have a small set of genes, and you screen a large set of conditions for interesting biases.
You need a quick way to quantify deviation of the mean
For a set of K genes, sampled from a standard normal distribution, how would the mean be distributed?

The mean ~ N(0, 1/√K)

So if your conditions are normally distributed, and pre-standardized to mean 0, std 1, you can quickly compute the sum of values over your set and generate a z-score:

Z = Σ_{i ∈ A} X_i / √|A|
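The z-score shortcut can be verified empirically (a minimal sketch: under the null of standard-normal values, the set z-score should itself be standard normal; names and sizes are illustrative):

```python
import math, random

# Z-score for the mean of a gene set, assuming values are pre-standardized
# to mean 0, std 1: Z = sum of the set's values / sqrt(set size).
def set_zscore(values):
    return sum(values) / math.sqrt(len(values))

# Under the null (standard normal values), Z should be ~ N(0, 1):
rng = random.Random(1)
zs = [set_zscore([rng.gauss(0, 1) for _ in range(50)]) for _ in range(2000)]
mean_z = sum(zs) / len(zs)
var_z = sum((z - mean_z) ** 2 for z in zs) / len(zs)
```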
Kolmogorov-Smirnov statistics

D = max_x |S_N(x) − P(x)|            (one sample vs. a distribution)
D = max_x |S_{N1}(x) − S_{N2}(x)|    (two samples)

The D statistic is a-parametric: you can transform x arbitrarily (e.g. log x) without changing it.

The D statistic's distribution is given by:

Q_KS(λ) = 2 · Σ_{j=1}^∞ (−1)^{j−1} · e^{−2 j² λ²}

P(D > observed) = Q_KS( (√N_e + 0.12 + 0.11/√N_e) · D ),   N_e = N_1·N_2 / (N_1 + N_2)
An a-parametric variant on the t-test theme is the Mann-Whitney test. You take your two sets and rank them together, then count the sum of ranks of one of your sets (R_1):
U = R_1 − n_1(n_1 + 1)/2

U ~ N(m_U, σ_U),   m_U = n_1·n_2 / 2,   σ_U = √( n_1·n_2·(n_1 + n_2 + 1) / 12 )
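The rank-sum computation and normal approximation above can be sketched directly (a minimal sketch; ties receive average ranks, and `mann_whitney_z` is a name introduced here):

```python
import math

# Mann-Whitney U with the normal approximation: rank both sets together,
# sum the ranks of set 1, and standardize U.
def mann_whitney_z(xs, ys):
    pooled = sorted((v, i) for i, v in enumerate(xs + ys))
    ranks = {}
    j = 0
    while j < len(pooled):              # assign average ranks to ties
        k = j
        while k + 1 < len(pooled) and pooled[k + 1][0] == pooled[j][0]:
            k += 1
        avg = (j + k) / 2 + 1           # ranks are 1-based
        for t in range(j, k + 1):
            ranks[pooled[t][1]] = avg
        j = k + 1
    n1, n2 = len(xs), len(ys)
    r1 = sum(ranks[i] for i in range(n1))
    u = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mu) / sigma

z = mann_whitney_z([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```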
Hyper-geometric and chi-square test
P(|A ∩ B| = k) = C(n_A, k) · C(N − n_A, n_B − k) / C(N, n_B)

(Figure: an m×n contingency table with counts n_ij, row sums n_i·, column sums n_·j, and total N.)

χ² = Σ_{i,j} (n_ij − n_i·n_·j/N)² / (n_i·n_·j/N)
χ² is chi-square distributed with m·n − m − n + 1 degrees of freedom.
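The hypergeometric overlap test can be sketched from the formula above (a minimal sketch; the set sizes N = 100, n_A = n_B = 10 and the helper names are illustrative):

```python
from math import comb

# Hypergeometric overlap p-value: probability that two random subsets of
# sizes nA and nB out of N share at least k elements.
def hypergeom_pmf(k, N, nA, nB):
    return comb(nA, k) * comb(N - nA, nB - k) / comb(N, nB)

def overlap_pvalue(k, N, nA, nB):
    # tail: P(|A ∩ B| >= k)
    return sum(hypergeom_pmf(i, N, nA, nB) for i in range(k, min(nA, nB) + 1))

p = overlap_pvalue(5, N=100, nA=10, nB=10)
```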