Intro to comp genomics Lecture 3-4: Examples, Approximate Inference.
Posted on 19-Dec-2015
Example 1: Mixtures of Gaussians
Pr(x | θ) = Σ_i p_i · N(x; μ_i, σ_i)      (the mixture model)
Pr(x | θ) = N(x; μ, σ)                    (a single Gaussian)
We have experimental results of some value. We want to describe the behavior of the experimental values: essentially one behavior? Two behaviors? More? In one dimension it may look very easy: just looking at the distribution will give us a good idea.
We can formulate the model probabilistically as a mixture of normal distributions.
As a generative model: to generate data from the model, we first select the sub-model by sampling from the mixture variable. We then generate a value using the selected normal distribution.
If the data is multi-dimensional, the problem becomes non-trivial.
Inference is trivial
Pr(x | θ) = Σ_i p_i · N(x; μ_i, σ_i)

Let’s represent the model as:

Pr(s | x) = Pr(s) Pr(x | s) / Σ_{s'} Pr(s') Pr(x | s')
          = p_s · N(x; μ_s, σ_s) / Σ_i p_i · N(x; μ_i, σ_i)
What is the inference problem in our model?
Pr(x) = Σ_s Pr(s) Pr(x | s)
Inference: computing the posterior probability of a hidden variable given the data and the model parameters.
For p_0 = 0.2, p_1 = 0.8, μ_0 = 0, μ_1 = 1, σ_0 = 1, σ_1 = 0.2, what is Pr(s = 0 | x = 0.8)?
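The arithmetic of this exercise can be checked with a short sketch (plain Python; `normal_pdf` and `mixture_posterior` are helper names introduced here, not from the lecture):

```python
import math

def normal_pdf(x, mu, sigma):
    # N(x; mu, sigma) density
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_posterior(x, p, mu, sigma):
    # Pr(s | x) = p_s N(x; mu_s, sigma_s) / sum_i p_i N(x; mu_i, sigma_i)
    lik = [p[s] * normal_pdf(x, mu[s], sigma[s]) for s in range(len(p))]
    total = sum(lik)
    return [l / total for l in lik]

post = mixture_posterior(0.8, p=[0.2, 0.8], mu=[0.0, 1.0], sigma=[1.0, 0.2])
print(post[0])  # Pr(s = 0 | x = 0.8) is small: component 1 fits x = 0.8 much better
```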
Estimation/parameter learning
Pr(x | θ) = Σ_i p_i · N(x; μ_i, σ_i)
Given data, how can we estimate the model parameters?
L(θ | x_1, …, x_n) = Π_i Pr(x_i | θ) = Π_i Σ_j p_j · N(x_i; μ_j, σ_j)
Transform it into an optimization problem!
Likelihood: a function of the parameters. Defined given the data.
Find parameters that maximize the likelihood: the ML problem
Can be approached heuristically: using any optimization technique.
But it is a non-linear problem which may be very difficult.
Generic optimization techniques:
Gradient ascent
Simulated annealing
Genetic algorithms
And more..

All aim to find θ̂ = argmax_θ L(θ | x_1, …, x_n)
The EM algorithm for mixtures – inference allows for learning
Pr(x | θ) = Σ_i p_i · N(x; μ_i, σ_i)
We start by guessing parameters θ⁰:
We now go over the samples and compute their posteriors (i.e., inference):
Pr(s | x_i, θ⁰) = p_s⁰ · N(x_i; μ_s⁰, σ_s⁰) / Σ_{s'} p_{s'}⁰ · N(x_i; μ_{s'}⁰, σ_{s'}⁰)
We use the posteriors to compute new estimates for the expected sufficient statistics of each distribution, and for the mixture coefficients:
μ_s¹ = E[x_s] = Σ_i Pr(s | x_i, θ⁰) · x_i / Σ_i Pr(s | x_i, θ⁰)

(σ_s¹)² = V[x_s] = Σ_i Pr(s | x_i, θ⁰) · (x_i − μ_s¹)² / Σ_i Pr(s | x_i, θ⁰)

p_s¹ = (1/N) Σ_i Pr(s | x_i, θ⁰)
Continue iterating until convergence.
The EM theorem: the algorithm will converge and will improve the likelihood monotonically.
But:
No guarantee of finding the optimum, or of finding anything meaningful.
The initial conditions are critical:
Think of starting from μ₀ = 0, μ₁ = 10, σ₀ = σ₁ = 1
Solutions: start from “reasonable” solutions; try many starting points.
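The E and M steps above can be sketched in plain Python (a minimal sketch under stated assumptions: a two-component 1-D mixture fit to synthetic data; all names are introduced here):

```python
import math, random

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_gmm(xs, p, mu, sigma, iters=200):
    # EM for a 1-D mixture of Gaussians, following the updates above.
    for _ in range(iters):
        # E step: posteriors Pr(s | x_i, theta)
        post = []
        for x in xs:
            lik = [p[s] * normal_pdf(x, mu[s], sigma[s]) for s in range(len(p))]
            tot = sum(lik)
            post.append([l / tot for l in lik])
        # M step: weighted mean, variance, and mixture coefficients
        for s in range(len(p)):
            w = sum(post[i][s] for i in range(len(xs)))
            mu[s] = sum(post[i][s] * xs[i] for i in range(len(xs))) / w
            var = sum(post[i][s] * (xs[i] - mu[s]) ** 2 for i in range(len(xs))) / w
            sigma[s] = max(math.sqrt(var), 1e-3)  # guard against variance collapse
            p[s] = w / len(xs)
    return p, mu, sigma

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(5, 0.5) for _ in range(700)]
p, mu, sigma = em_gmm(xs, p=[0.5, 0.5], mu=[-1.0, 6.0], sigma=[1.0, 1.0])
```

The starting points were chosen near the true modes; as the slide warns, bad initial conditions can converge to meaningless solutions.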
Example 2: Mixture of sequence models
• A probabilistic model for binding sites:
• This is the site independent model, defining a probability space over k-mers
• Assume a set of sequences contains unknown binding sites (one for each)
• The position of the binding site is a hidden variable h
• We introduce a background model Pb that describes the sequence outside of the binding site (usually a d-order Markov model)
• Given complete data we can write down the likelihood of a sequence s as:
The site model:

P(m) = Π_{i=1}^k P_i(m[i])

Pr(s, l | θ) = Pr(l) · P_back(s) · Π_{i=1}^k [ P_i(s[l+i]) / P_back(s[l+i] | s[l+i−d .. l+i−1]) ]

where the background probability of the whole sequence is

P_back(s) = Π_{i=1}^{|s|} P_back(s[i] | s[i−d .. i−1])
• Inference of the binding site location posterior:
• Note that only k-factors need to be computed for each location (Pb(s) is constant))
One hidden variable = trivial inference
Pr(l | s, θ) = Pr(s, l | θ) / Σ_{l'} Pr(s, l' | θ)
• If we assume some of the sequences may lack a binding site, this should be incorporated into the model:
Pr(s, l | θ) = Pr(hit) · Pr(l) · P_back(s) · Π_{i=1}^k [ P_i(s[l+i]) / P_back(s[l+i] | s[l+i−d .. l+i−1]) ]

(the site position l is defined only when hit = 1)
• This is sometimes called the ZOOPS model (Zero Or One Occurrence Per Sequence)
Hidden Markov Models
Observing only emissions of states to some probability space E. Each state is equipped with an emission distribution Pr(e | x) (x a state, e an emission).

Pr(s_i, e_i) = Pr(s_i | s_{i−1}) · Pr(e_i | s_i)

(Figure: the state space, each state emitting into the emission space.)

Caution! This is NOT the HMM Bayes net:
1. Cycles
2. States are NOT random variables!
Example 3: Mixture with “memory”
P(x | θ) = Σ_h P(h, x | θ) = Σ_h Pr(h_0) · Π_i Pr(h_i | h_{i−1}) · P(x_i | h_i)
We sample a sequence of dependent values. At each step, we decide if we continue to sample from the same distribution or switch, with probability p.
We can compute the probability directly only given the hidden variables.
P(x) is derived by summing over all possible combination of hidden variables. This is another form of the inference problem (why?)
There is an exponential number of h assignments, can we still solve the problem efficiently?
(Figure: two states A, B with transition probabilities P(B | A), P(A | B) and emission distributions P(x | A), P(x | B).)
Inference in HMM
Forward formula:
f_s(0) = 1 if s = start, else 0
f_s(i) = Pr(e_i | s) · Σ_{s'} f_{s'}(i−1) · Pr(s | s')

Backward formula:

b_s(N) = Pr(finish | s)
b_s(i) = Σ_{s'} Pr(s' | s) · Pr(e_{i+1} | s') · b_{s'}(i+1)

(Figure: the Start → states → Finish trellis with emissions, annotated with f_s(i) and b_s(i).)

The total likelihood, from either direction:

L = Σ_s f_s(N) · Pr(finish | s)
L = Σ_s Pr(s | start) · Pr(e_1 | s) · b_s(1)
Computing posteriors:

The posterior probability for a transition from s' to s after character i:

Pr(s_i = s', s_{i+1} = s | e) = (1/L) · f_{s'}(i) · Pr(s | s') · Pr(e_{i+1} | s) · b_s(i+1)

The posterior probability for emitting the i'th character from state s:

Pr(s_i = s | e) = (1/L) · f_s(i) · b_s(i)
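The forward/backward recursions and the emission posteriors can be sketched directly in plain Python (a toy two-state HMM; the state names, transition and emission tables are made-up numbers, not from the lecture):

```python
# Forward-backward for a small HMM, following the f/b recursions above.
# trans[s'][s] = Pr(s | s'); emit[s][e] = Pr(e | s); start/finish as in the slides.
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}    # Pr(s | start)
finish = {"H": 1.0, "L": 1.0}   # Pr(finish | s): here we may finish anywhere
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.2, "L": 0.8}}
emit = {"H": {"A": 0.4, "C": 0.6}, "L": {"A": 0.8, "C": 0.2}}

def forward_backward(obs):
    n = len(obs)
    f = [{s: emit[s][obs[0]] * start[s] for s in states}]
    for i in range(1, n):
        f.append({s: emit[s][obs[i]] * sum(f[i - 1][t] * trans[t][s] for t in states)
                  for s in states})
    b = [{s: finish[s] for s in states}]
    for i in range(n - 2, -1, -1):
        b.insert(0, {s: sum(trans[s][t] * emit[t][obs[i + 1]] * b[0][t] for t in states)
                     for s in states})
    L = sum(f[n - 1][s] * finish[s] for s in states)   # total likelihood
    # posterior Pr(state at i = s | obs) = f_s(i) b_s(i) / L
    post = [{s: f[i][s] * b[i][s] / L for s in states} for i in range(n)]
    return L, post

L, post = forward_backward("ACCA")
```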
Example 4: Hidden states
Example: Two Markov models describe our data. Switching between models occurs at random. How to model this?

(Figure: a hidden state with no emission.)
Example 5: Profile HMM for Protein or DNA motifs
(Figure: the profile HMM architecture — a chain of Match (M), Insert (I), and Delete (D) states between Start (S) and Finish (F).)
• M (Match) states emit a certain amino acid/nucleotide profile
• I (Insert) states emit some background profile
• D (Delete) states are silent (no emission)
• Use the model for classification or annotation (both emissions and transition probabilities are informative!)
• Can use EM to train the parameters from a set of examples
• (How do we determine the right size of the model?) (google PFAM, Prosite, “HMM profile protein domain”)
Example 6: N-order Markov model
•In most biological sequences, the Markov property is a big problem
•N-order relations can be modeled naturally:
Common error:
Forward/Backward in N-order HMM. Can dynamic programming work?
(Figure: Bayes-net unrollings contrasted —
1-HMM Bayes net: each state depends on the previous state.
2-HMM Bayes net: each state depends on the two previous states.)
Example 7: Pair-HMM
Given two sequences s1,s2, an alignment is defined by a set of ‘gaps’ (or indels) in each of the sequences.
ACGCGAACCGAATGCCCAA---GGAAAACGTTTGAATTTATA
ACCCGT-----ATGCCCAACGGGGAAAACGTTTGAACTTATA
(indels marked by '-')
Standard dynamic programming algorithms compute the best alignment given such a distance metric:

d(l_1, l_2, s_1, s_2) = Σ_i δ(s_1[i], s_2[i]) + λ · #(gaps)

d(s_1, s_2) = min_{l_1, l_2} d(l_1, l_2, s_1, s_2)

Affine gap cost: λ_open + λ_ext · l
Substitution matrix: δ(a, b) ~ log(Pr(a, b))
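The standard dynamic programming recurrence can be sketched with a linear gap cost (a minimal sketch; the unit mismatch penalty and gap cost `lam=2` are assumptions, not the lecture's parameters):

```python
# Global alignment by dynamic programming with a linear gap cost,
# following the distance metric above (delta and lam are made up).
def align_score(s1, s2, lam=2, delta=lambda a, b: 0 if a == b else 1):
    n, m = len(s1), len(s2)
    # d[i][j] = best cost of aligning s1[:i] with s2[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * lam
    for j in range(1, m + 1):
        d[0][j] = j * lam
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + delta(s1[i - 1], s2[j - 1]),  # match/mismatch
                          d[i - 1][j] + lam,                              # gap in s2
                          d[i][j - 1] + lam)                              # gap in s1
    return d[n][m]
```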
Pair-HMM
Generalize the HMM concept to probabilistically model alignments.
Problem: we are observing two sequences, not a-priori related. What will be emitted from our HMM?
(Figure: pair-HMM states — Match state M and gap states G1, G2, between Start (S) and Finish (F).)

Match states emit an aligned nucleotide pair. Gap states emit a nucleotide from one of the sequences only. Pr(M→Gi) – “gap open cost”, Pr(G1→G1) – “gap extension cost”.
Is it a BN template?Forward-backward formula?
Example 8: The simple tree model
(Figure: the simple tree — root H2 with children H1 and S3; H1 with children S1 and S2.)
Sequences of extant and ancestral species are random variables, with Val(X) = {A,C,G,T}
Extant species S_j, j = 1..n; ancestral species H_j, j = 1..(n−1)
Tree T: parent relations pa(S_i), pa(H_i)
(pa(S_1) = H_1, pa(S_3) = H_2; the root: H_2)
For multiple loci we can assume independence and use the same parameters (today):

Pr(s, h) = Π_j Pr(s_j, h_j)

Pr(x_i | pa x_i) = [exp(t_i Q)]_{pa x_i, x_i}

Pr(s_j, h_j) = Pr(h_root) · Π_{i ≠ root} Pr(x_i | pa x_i)

In the triplet:

Pr(s, h) = Pr(h_2) · Pr(h_1 | h_2) · Pr(s_3 | h_2) · Pr(s_1 | h_1) · Pr(s_2 | h_1)
The model is defined using conditional probability distributions and the root “prior” probability distribution
The model parameters can be the conditional probability distribution tables (CPDs)
Or we can have a single rate matrix Q and branch lengths:
Pr(x | y) =
  0.96  0.01  0.02  0.01
  0.01  0.96  0.01  0.02
  0.02  0.01  0.96  0.01
  0.01  0.02  0.01  0.96
Ancestral inference
We assume the model (structure, parameters) is given, and denote it by θ:

Pr(s, h | θ) = Pr(h_root) · Π_{i ≠ root} Pr(x_i | pa x_i)

The total probability of the data s:

Pr(s | θ) = Σ_h P(h, s | θ)

This is also called the likelihood L(θ). Computing Pr(s) is the inference problem.

Given the total probability it is easy to compute the posterior of h_i given the data:

Pr(h | s, θ) = P(h, s | θ) / Pr(s | θ)      (easy!)

Pr(h_i = x | s) = Σ_{h : h_i = x} P(h, s) / Pr(s | θ)      (marginalization over h_i – an exponential number of terms?)
Example:

(Figure: the triplet tree with leaves observed as A, C, A and hidden internal nodes marked '?'.)

Given partial observations s:

Pr(h_i = x | s) = Σ_{h : h_i = x} P(h, s) / Pr(s)

The total probability of the data: Pr((A, C, A))
The posterior of an ancestral state, e.g.: Pr(h_1 = A | (A, C, A))

Pr(x | y) =
  0.96  0.01  0.02  0.01
  0.01  0.96  0.01  0.02
  0.02  0.01  0.96  0.01
  0.01  0.02  0.01  0.96

Uniform prior at the root.
Algorithm (following Felsenstein 1981):

Up(i):
  if (extant) { up[i][a] = (a == S_i ? 1 : 0); return }
  up(r(i)); up(l(i))
  iterate on a:
    up[i][a] = Σ_{b,c} Pr(X_{l(i)} = b | X_i = a) · up[l(i)][b] · Pr(X_{r(i)} = c | X_i = a) · up[r(i)][c]

Down(i):
  iterate on a:
    down[i][a] = Σ_{b,c} Pr(X_{sib(i)} = b | X_{par(i)} = c) · up[sib(i)][b] · Pr(X_i = a | X_{par(i)} = c) · down[par(i)][c]
  down(r(i)); down(l(i))

Main:
  up(root); L = 0
  foreach a {
    L += Pr(root = a) · up[root][a]
    down[root][a] = Pr(root = a)
  }
  LL = log(L)
  down(r(root)); down(l(root))
Dynamic programming to compute the total probability:

(Figure: the triplet tree with hidden nodes marked '?', annotated with up[4] and up[5].)
(Figure: the same tree annotated with up[3], down[4], and down[5].)

Computing marginals and posteriors:

P(h_i | s) = up[i][c] · down[i][c] / Σ_j up[i][j] · down[i][j]
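The up pass on the triplet tree can be written out and checked against brute-force summation over the hidden states (a minimal sketch using the 4×4 matrix and uniform root prior from the slides; function names are introduced here):

```python
# Felsenstein's pruning ("up" pass) on the triplet tree, verified against
# brute-force summation over the hidden states h1, h2.
P = [[0.96, 0.01, 0.02, 0.01],
     [0.01, 0.96, 0.01, 0.02],
     [0.02, 0.01, 0.96, 0.01],
     [0.01, 0.02, 0.01, 0.96]]
ALPHA = "ACGT"

def up_leaf(obs):
    return [1.0 if a == obs else 0.0 for a in range(4)]

def up_internal(up_left, up_right):
    # up[a] = (sum_b P[a][b] up_left[b]) * (sum_c P[a][c] up_right[c])
    return [sum(P[a][b] * up_left[b] for b in range(4)) *
            sum(P[a][c] * up_right[c] for c in range(4)) for a in range(4)]

def likelihood(s1, s2, s3):
    # tree: root h2 with children h1 and s3; h1 with children s1, s2
    up_h1 = up_internal(up_leaf(ALPHA.index(s1)), up_leaf(ALPHA.index(s2)))
    up_h2 = up_internal(up_h1, up_leaf(ALPHA.index(s3)))
    return sum(0.25 * up_h2[a] for a in range(4))  # uniform root prior

def brute_force(s1, s2, s3):
    i1, i2, i3 = (ALPHA.index(c) for c in (s1, s2, s3))
    return sum(0.25 * P[h2][h1] * P[h2][i3] * P[h1][i1] * P[h1][i2]
               for h2 in range(4) for h1 in range(4))

pr_ACA = likelihood("A", "C", "A")
```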
Simple Tree: Inference as message passing
(Figure: messages passed over the tree from the DATA at the leaves; each upward message says “You are P(H | our data)”, and the root combines them: “I am P(H | all data)”.)
Transition posteriors: not independent!

(Figure: a parent-child edge with observed DATA A, C below it.)

Pr(x | y) =
  0.96  0.01  0.02  0.01
  0.01  0.96  0.01  0.02
  0.02  0.01  0.96  0.01
  0.01  0.02  0.01  0.96

Down: (0.25), (0.25), (0.25), (0.25)
Up: (0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)
Understanding the tree model (and BNs): reversing edges
The joint probability of the simple tree model:

Pr(h, x) = Pr(h_2) · Pr(h_1 | h_2) · Pr(x_1 | h_1) · Pr(x_2 | h_1) · Pr(x_3 | h_2)

Can we change the position of the root and keep the joint probability as is?

Pr(h, x) = Pr'(h_1) · Pr'(h_2 | h_1) · Pr(x_1 | h_1) · Pr(x_2 | h_1) · Pr(x_3 | h_2)

We need: Pr(h_2) · Pr(h_1 | h_2) = Pr'(h_1) · Pr'(h_2 | h_1)

This is satisfied by:

Pr'(h_1) = Σ_{h_2} Pr(h_2) · Pr(h_1 | h_2)
Pr'(h_2 | h_1) = Pr(h_2) · Pr(h_1 | h_2) / Pr'(h_1)
Inference can become difficult
Pr(s, h | θ) = Pr(h_root) · Π_{i ≠ root} Pr(x_i | pa x_i)
We want to perform inference in an extended tree model expressing context effects:
(Figure: a 3×3 grid of variables 1–9 with context edges, forming undirected cycles.)
With undirected cycles, the model is well defined but inference becomes hard
We want to perform inference on the tree structure itself!
Each structure imposes a probability on the observed data, so we can perform inference on the space of all possible tree structures, or tree structures + branch lengths.
Pr(T | D) = Pr(D | T) · Pr(T) / Σ_{T'} Pr(D | T') · Pr(T')

(Figure: candidate tree structures 1, 2, 3.)
What makes these examples difficult?
Factor graphs
Defining the joint probability for a set of random variables given:
1) Any set of node subsets (hypergraph)
2) Functions on the node subsets (Potentials)
Given variables V and a set of subsets A = {a | a ⊆ V} with potentials φ_a(x_a):

Joint distribution:   Pr(x) = (1/Z) · Π_a φ_a(x_a)
Partition function:   Z = Σ_x Π_a φ_a(x_a)
If the potentials are conditional probabilities, what will be Z?
Things are difficult when there are several modes
(Figure: a factor graph — square factor nodes connected to round random-variable (R.V.) nodes.)
Not necessarily 1! (can you think of an example?)
More definitions
The model:

log Pr(x) = −log Z + Σ_a log φ_a(x_a)

Potentials can be defined on discrete, real-valued variables etc. It is also common to define general log-linear models directly:

Pr(x) = (1/Z) · exp(Σ_a w_a log φ_a(x_a))

Inference:

Pr(D | θ) = (1/Z) · Σ_{x consistent with D} exp(Σ_a w_a log φ_a(x_a))

Pr(x_i | D, θ) = [(1/Z) · Σ_{x consistent with D, x_i} exp(Σ_a w_a log φ_a(x_a))] / Pr(D | θ)

Learning: find the factors' parameterization θ̂ = argmax_θ Pr(D | θ)
Belief propagation in a factor graph
P(x | θ) = (1/Z) · Π_a φ_a(x_a)
• Remember, a factor graph is defined given a set of random variables (indices i, j, k…) and a set of factors on groups of variables (indices a, b…)
• Think of messages as transmitting beliefs:
  a→i: given my other input variables, and ignoring your message, you are x
  i→a: given my other input factors and my potential, and ignoring your message, you are x
• x_a refers to an assignment of values to the inputs of the factor a
• Z is the partition function (which is hard to compute)
• The BP algorithm is constructed by computing and updating messages:
  Messages from factors to variables: m_{a→i}(x_i)
  Messages from variables to factors: m_{i→a}(x_i)
  (each message is a function: any value attainable by x_i → real values)
Message update rules:

Messages from variables to factors:
m_{i→a}(x_i) = Π_{c ∈ N(i)\a} m_{c→i}(x_i)

Messages from factors to variables:
m_{a→i}(x_i) = Σ_{x_a \ x_i} φ_a(x_a) · Π_{j ∈ N(a)\i} m_{j→a}(x_j)
The algorithm proceeds by updating messages:
• Define the beliefs as approximating single-variable posteriors (p(h_i | s)):

b_i(x_i) = Π_{a ∈ N(i)} m_{a→i}(x_i)

Algorithm:
Initialize all messages to uniform.
Iterate until no message changes:
  Update factor-to-variable messages
  Update variable-to-factor messages
• Why is this different from the mean field algorithm? In mean field:

q(h) = Π_i q(h_i)

Beliefs on factor inputs:

b_a(x_a) = φ_a(x_a) · Π_{j ∈ N(a)} m_{j→a}(x_j)

This is far from mean field, since for example:

b_a(x_a) = φ_a(x_a) · Π_{j ∈ N(a)} Π_{c ∈ N(j)\a} m_{c→j}(x_j)
The update rules can be viewed as derived from constraints on the beliefs:

1. Requirement on the variable beliefs (b_i):
   b_i(x_i) = Π_{a ∈ N(i)} m_{a→i}(x_i)

2. Requirement on the factor beliefs (b_a):
   b_a(x_a) = φ_a(x_a) · Π_{j ∈ N(a)} Π_{c ∈ N(j)\a} m_{c→j}(x_j)

3. Marginalization requirement:
   b_i(x_i) = Σ_{x_a \ x_i} b_a(x_a)
BP on Tree = Up-Down

(Figure: the tree as a factor graph — variables h_1, h_2, h_3 and observed s_1..s_4, with factors a, b, c, d, e on the edges; e.g. a = Pr(s_1 | h_1), b = Pr(s_2 | h_1), c = Pr(h_1 | h_3).)

The BP messages reproduce the up/down quantities, for example:

up_{h_1}(h_1) = Pr(s_1 | h_1) · Pr(s_2 | h_1) = m_{a→h_1}(h_1) · m_{b→h_1}(h_1)

m_{a→h_1}(h_1) = Σ_{x_a \ h_1} φ_a(x_a) · m_{s_1→a}(s_1) = Pr(s_1 | h_1)
m_{b→h_1}(h_1) = Σ_{x_b \ h_1} φ_b(x_b) · m_{s_2→b}(s_2) = Pr(s_2 | h_1)

down_{h_1}(h_1) = Σ_{h_2, h_3} up(h_2) · down(h_3) · Pr(h_2 | h_3) · Pr(h_1 | h_3)

m_{c→h_1}(h_1) = Σ_{h_3} φ_c(h_1, h_3) · m_{h_3→c}(h_3) = Σ_{h_3} Pr(h_1 | h_3) · m_{d→h_3}(h_3) · m_{e→h_3}(h_3)
Loopy BP is not guaranteed to converge
(Figure: two variables X and Y coupled by factors with the anti-diagonal potential table [[0, 1], [1, 0]]; BP messages oscillate between the assignments (1, 0) and (0, 1).)

This is not a hypothetical scenario – it frequently happens when there is too much symmetry. For example, most mutational effects are double-stranded and therefore symmetric, which can result in loops.
Sampling is a natural way to do approximate inference
Marginal probability (integration over the whole space) → Marginal probability (integration over a sample)
Sampling from a BN
Naively: if we could draw h, s' according to the distribution Pr(h, s'), then: Pr(s) ≈ (# samples with s) / (# samples)
Forward sampling: use a topological order on the network. Select a node whose parents are already determined; sample from its conditional distribution (all parents are already determined!).
Claim: Forward sampling is correct:
(Figure: the 3×3 grid network, nodes 1–9.)

How to sample from the CPD?

E_P[1{h, s}] = Pr(h, s)
Focus on the observations
Naïve sampling is terribly inefficient, why?
What is the sampling error?
Why don’t we constrain the sampling to fit the evidence s?
Two tasks: P(s) and P(f(h)|s), how to approach each/both?
This can be done, but we no longer sample from P(h,s), and not from P(h|s) (why?)
Likelihood weighting
Likelihood weighting: weight = 1.
Use a topological order on the network; select a node whose parents are already determined:
  if the variable was not observed: sample from its conditional distribution
  else: weight *= P(x_i | pa x_i), and fix the observation
Store the sample x and its weight w[x]
Pr(h|s) = (total weights of samples with h) / (total weights)
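The algorithm can be sketched on a tiny two-node network H → S where the posterior is also computable by hand (the CPD numbers are made up for illustration):

```python
import random

# Likelihood weighting on a tiny network H -> S with made-up CPDs,
# estimating Pr(H = 1 | S = 1).
P_H1 = 0.3
P_S1_given_H = {1: 0.9, 0: 0.2}

def likelihood_weighting(n_samples, seed=0):
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_samples):
        h = 1 if rng.random() < P_H1 else 0  # unobserved node: sample from its CPD
        w = P_S1_given_H[h]                  # observed node: fix S = 1, weight by its CPD
        num += w * (h == 1)
        den += w
    return num / den  # (total weight of samples with H = 1) / (total weight)

est = likelihood_weighting(200000)
exact = P_H1 * P_S1_given_H[1] / (P_H1 * P_S1_given_H[1] + (1 - P_H1) * P_S1_given_H[0])
```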
(Figure: the grid network with the bottom-row nodes observed; the sample's weight is the product of the observed nodes' CPD entries:)

Weight = Π_{observed s_j} Pr(s_j | pa s_j)
Importance sampling

Our estimator from M samples is:

Ê_P[f] = (1/M) Σ_{m=1}^M f(h[m])

But it can be difficult or inefficient to sample from P. Assume we sample instead from Q; then:

E_{P(H)}[f(H)] = E_{Q(H)}[f(H) · P(H)/Q(H)]

Unnormalized importance sampling: given a sample D = {h[1], …, h[M]} from Q,

Ê_D[f] = (1/M) Σ_{m=1}^M f(h[m]) · w(h[m]),   where w(h[m]) = P(h[m]) / Q(h[m])

Claim: to minimize the variance, use a Q distribution proportional to the target function:

Q(H) ∝ |f(H)| · P(H)

Var_Q(f(H) · w(H)) = E_Q[(f(H) · w(H))²] − (E_P[f(H)])²

Prove it!
Correctness of likelihood weighting: importance sampling

Unnormalized importance sampling with the likelihood weighting proposal distribution Q and any function on the hidden variables:

Proposition: the likelihood weighting algorithm is correct (in the sense that it defines an estimator with the correct expected value).

For the likelihood weighting algorithm, our proposal distribution Q is defined by fixing the evidence at the nodes in a set E and ignoring the CPDs of variables with evidence. We sample from Q just like forward sampling from a Bayesian network in which all edges going into evidence nodes were eliminated!

w(h) = P(h, s) / Q(h, s) = Π_{i ∈ E} Pr(x_i | pa x_i)

Q(x | D) = Π_{i ∉ E} Pr(x_i | pa x_i)

Ê_D[1{h_i}] = (1/M) · Σ_{m=1}^M 1{h_i[m]} · w(h[m])
Normalized importance sampling

When sampling from P(h | s) we don't know P, so we cannot compute w = P/Q. We do know P(h, s) = P(h | s) · P(s); denote this unnormalized target P'(h) = P(h, s) ∝ P(h | s). Then:

E_Q[w(H)] = Σ_h Q(h) · P'(h)/Q(h) = Σ_h P'(h)

E_{P(H)}[f(H)] = E_{Q(H)}[f(H) · w(H)] / E_Q[w(H)]

Sample D = {h[1], …, h[M]} from Q; the normalized importance sampling estimator is:

Ê_D[f] = Σ_{m=1}^M f(h[m]) · w(h[m]) / Σ_{m=1}^M w(h[m]),   w(h) = P'(h)/Q(h)

So we will use sampling to estimate both terms at once:

P̂(h | s) = Σ_{m=1}^M 1{h[m] = h} · w(h[m]) / Σ_{m=1}^M w(h[m])
Using the likelihood weighting Q, we can compute posterior probabilities in one pass (no need to sample P(s) and P(h, s) separately).

Limitations of forward sampling: (Figure: two networks contrasting observed and unobserved nodes — likelihood weighting is effective for one but not the other.)
Symmetric and reversible Markov processes
Definition: we call a Markov process symmetric if its rate matrix is symmetric:

∀ i, j: Q_ij = Q_ji

What would a symmetric process converge to?

Definition: a reversible Markov process is one for which (for times s < t):

Pr(X_t = j | X_s = i) = Pr(X_s = j | X_t = i)

Claim: a Markov process is reversible iff there exist π_i such that:

π_i · q_ij = π_j · q_ji

If this holds, we say the process is in detailed balance and the π are its stationary distribution.

Proof: Bayes law and the definition of reversibility.

Reversibility

Claim: a Markov process is reversible iff we can write:

q_ij = s_ij · π_j,  where S is a symmetric matrix.

(Composition: running the process Q for time t and then t' is the same as running it for time t + t'.)
Markov Chain Monte Carlo (MCMC)
We don’t know how to sample from P(h)=P(h|s) (or any complex distribution for that matter)
The idea: think of P(h|s) as the stationary distribution of a Reversible Markov chain
Find a process with transition probabilities for which detailed balance holds:

Pr(y | x) · P(x) = Pr(x | y) · P(y)

Then sample a trajectory y_1, y_2, …, y_m

Theorem (C a counter):

lim_{n→∞} (1/n) · C(y_i = x) = P(x)

The process must be irreducible (you can reach from anywhere to anywhere with p > 0), so we can start from anywhere!
The Metropolis(-Hastings) Algorithm
Why reversible? Because detailed balance makes it easy to define the stationary distribution in terms of the transitions.

So how can we find appropriate transition probabilities? We want:

Pr(y | x) · P(x) = Pr(x | y) · P(y)

Define a symmetric proposal distribution:

F(y | x) = F(x | y)

And an acceptance probability:

A(y | x) = min(1, P(y)/P(x))

Then detailed balance holds:

Pr(y | x) · P(x) = F(y | x) · min(1, P(y)/P(x)) · P(x)
               = min(F(y | x) · P(x), F(y | x) · P(y))
               = F(x | y) · min(1, P(x)/P(y)) · P(y)
               = Pr(x | y) · P(y)

What is the big deal? We reduced the problem to computing ratios between P(x) and P(y).
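A minimal Metropolis sketch: the target is a discrete distribution known only up to a constant (the weights [1, 2, 3, 4] over four states are made up), and the proposal is a symmetric move to a neighboring state:

```python
import random

# Metropolis sampling of a discrete target proportional to `weights`.
# The proposal moves to a uniformly random neighbor on a 4-cycle,
# so F(y | x) = F(x | y) is symmetric.
weights = [1.0, 2.0, 3.0, 4.0]

def metropolis(n_steps, seed=0):
    rng = random.Random(seed)
    counts = [0] * 4
    x = 0
    for _ in range(n_steps):
        y = (x + rng.choice([-1, 1])) % 4                     # symmetric proposal
        if rng.random() < min(1.0, weights[y] / weights[x]):  # accept w.p. min(1, P(y)/P(x))
            x = y
        counts[x] += 1
    return [c / n_steps for c in counts]

freqs = metropolis(200000)
target = [w / sum(weights) for w in weights]
```

Only ratios of the (unnormalized) target ever appear, which is exactly the point of the slide.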
Acceptance ratio for a BN

To sample from Pr(h | s), we will only have to compute acceptance ratios:

min(1, Pr(h' | s) / Pr(h | s)) = min(1, Pr(h', s) / Pr(h, s))

For example, if the proposal distribution changes only one variable h_i, what would be the ratio?

min(1, Pr(h_1, …, h_{i−1}, h_i', h_{i+1}, …, h_n | s) / Pr(h_1, …, h_{i−1}, h_i, h_{i+1}, …, h_n | s))

We affected only the CPDs of h_i and its children.

Definition: the minimal Markov blanket of a node in a BN includes its children, parents, and children's parents.

To compute the ratio, we care only about the values of h_i and its Markov blanket.
Gibbs sampling

A very similar algorithm (in fact, a special case of the Metropolis algorithm):

Start from any state h
do {
  Choose a variable H_i
  Form h^{t+1} by sampling a new h_i from Pr(h_i | h^t_{−i}, s)
}

This is a reversible process with our target stationary distribution:

Pr(h | s) · Pr(h → h') = Pr(h | s) · Pr(h_i' | h_1, …, h_{i−1}, h_{i+1}, …, h_n, s)
                      = Pr(h | s) · Pr(h_1, …, h_{i−1}, h_i', h_{i+1}, …, h_n | s) / Pr(h_1, …, h_{i−1}, h_{i+1}, …, h_n | s)

which is symmetric in h_i and h_i', so detailed balance holds.

Gibbs sampling is easy to implement for BNs: the conditional Pr(h_i' | h_{−i}, s) depends only on h_i's Markov blanket,

Pr(h_i' | h_{−i}, s) ∝ Pr(h_i' | pa h_i) · Π_{j : h_i ∈ pa h_j} Pr(h_j | pa h_j)
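The resampling loop can be sketched on a two-variable joint (a minimal sketch; the unnormalized table `w` is made up, and each step resamples one variable from its exact conditional given the other):

```python
import random

# Gibbs sampling for two binary variables with joint P(h1, h2) ∝ w[h1][h2].
w = [[1.0, 2.0], [3.0, 4.0]]

def gibbs(n_steps, seed=0):
    rng = random.Random(seed)
    h1, h2 = 0, 0
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for _ in range(n_steps):
        # resample h1 from Pr(h1 | h2)
        p1 = w[1][h2] / (w[0][h2] + w[1][h2])
        h1 = 1 if rng.random() < p1 else 0
        # resample h2 from Pr(h2 | h1)
        p2 = w[h1][1] / (w[h1][0] + w[h1][1])
        h2 = 1 if rng.random() < p2 else 0
        counts[(h1, h2)] += 1
    return {k: v / n_steps for k, v in counts.items()}

freqs = gibbs(200000)
total = sum(sum(row) for row in w)
target = {(a, b): w[a][b] / total for a in (0, 1) for b in (0, 1)}
```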
Sampling in practice
lim_{n→∞} (1/n) · C(y_i = x) = P(x)

How much time until convergence to P? (The burn-in time.)

(Figure: a sampling trajectory — a burn-in phase, then the sampling phase; mixing.)

Consecutive samples are still correlated! Should we sample only every n steps?

We sample while fixing the evidence, starting from anywhere but waiting some time before starting to collect data.
A problematic space would be loosely connected:
Examples for bad spaces
More terminology: make sure you know how to define these:
Inference
Parameter learning
Likelihood
Total probability/Marginal probability
Exact inference/Approximate inference
Z-scores, T-test – the basics
t = (X̄_A − X̄_B) / (S_p · √(1/n_A + 1/n_B)),   S_p² = ((n_A − 1)·S_A² + (n_B − 1)·S_B²) / (n_A + n_B − 2)
You want to test if the mean (RNA expression) of a gene set A is significantly different than that of a gene set B.
If you assume the variance of A and B is the same:
t is distributed like T with nA+nB-2 degrees of freedom
If you don’t assume the variance is the same:
t = (X̄_A − X̄_B) / √(s_A²/n_A + s_B²/n_B)

d.o.f. ≈ (s_A²/n_A + s_B²/n_B)² / [ (s_A²/n_A)²/(n_A − 1) + (s_B²/n_B)²/(n_B − 1) ]
But in this case the whole test becomes rather flaky!
In a common scenario, you have a small set of genes, and you screen a large set of conditions for interesting biases.
You need a quick way to quantify deviation of the mean
For a set of K genes, sampled from a standard normal distribution, how would the mean be distributed?

The mean ~ N(0, 1/√K)

So if your conditions are normally distributed, and pre-standardized to mean 0, std 1, you can quickly compute the sum of values over your set and generate a z-score:

Z = Σ_{i ∈ A} X_i / √|A|
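The z-score shortcut can be verified empirically (a minimal sketch: under the null of standard-normal values, the set z-score should itself be standard normal; names and sizes are illustrative):

```python
import math, random

# Z-score for the mean of a gene set, assuming values are pre-standardized
# to mean 0, std 1: Z = sum of the set's values / sqrt(set size).
def set_zscore(values):
    return sum(values) / math.sqrt(len(values))

# Under the null (standard normal values), Z should be ~ N(0, 1):
rng = random.Random(1)
zs = [set_zscore([rng.gauss(0, 1) for _ in range(50)]) for _ in range(2000)]
mean_z = sum(zs) / len(zs)
var_z = sum((z - mean_z) ** 2 for z in zs) / len(zs)
```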
Kolmogorov-Smirnov statistics

D = max_x |S_N(x) − P(x)|            (one sample vs. a distribution)
D = max_x |S_{N1}(x) − S_{N2}(x)|    (two samples)

The D statistic is a-parametric: you can transform x arbitrarily (e.g. log x) without changing it.

The D statistic's distribution is given by:

Q_KS(λ) = 2 · Σ_{j=1}^∞ (−1)^{j−1} · e^{−2 j² λ²}

P(D > observed) = Q_KS( (√N_e + 0.12 + 0.11/√N_e) · D ),   N_e = N_1·N_2 / (N_1 + N_2)
An a-parametric variant on the t-test theme is the Mann-Whitney test. You take your two sets and rank them together, then count the sum of ranks of one of your sets (R_1):
U = R_1 − n_1(n_1 + 1)/2

U ~ N(m_U, σ_U),   m_U = n_1·n_2 / 2,   σ_U = √( n_1·n_2·(n_1 + n_2 + 1) / 12 )
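The rank-sum computation and normal approximation above can be sketched directly (a minimal sketch; ties receive average ranks, and `mann_whitney_z` is a name introduced here):

```python
import math

# Mann-Whitney U with the normal approximation: rank both sets together,
# sum the ranks of set 1, and standardize U.
def mann_whitney_z(xs, ys):
    pooled = sorted((v, i) for i, v in enumerate(xs + ys))
    ranks = {}
    j = 0
    while j < len(pooled):              # assign average ranks to ties
        k = j
        while k + 1 < len(pooled) and pooled[k + 1][0] == pooled[j][0]:
            k += 1
        avg = (j + k) / 2 + 1           # ranks are 1-based
        for t in range(j, k + 1):
            ranks[pooled[t][1]] = avg
        j = k + 1
    n1, n2 = len(xs), len(ys)
    r1 = sum(ranks[i] for i in range(n1))
    u = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mu) / sigma

z = mann_whitney_z([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```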
Hyper-geometric and chi-square test
P(|A ∩ B| = k) = C(n_A, k) · C(N − n_A, n_B − k) / C(N, n_B)

(Figure: an m×n contingency table with counts n_ij, row sums n_i·, column sums n_·j, and total N.)

χ² = Σ_{i,j} (n_ij − n_i·n_·j/N)² / (n_i·n_·j/N)
χ² is chi-square distributed with m·n − m − n + 1 degrees of freedom.
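The hypergeometric overlap test can be sketched from the formula above (a minimal sketch; the set sizes N = 100, n_A = n_B = 10 and the helper names are illustrative):

```python
from math import comb

# Hypergeometric overlap p-value: probability that two random subsets of
# sizes nA and nB out of N share at least k elements.
def hypergeom_pmf(k, N, nA, nB):
    return comb(nA, k) * comb(N - nA, nB - k) / comb(N, nB)

def overlap_pvalue(k, N, nA, nB):
    # tail: P(|A ∩ B| >= k)
    return sum(hypergeom_pmf(i, N, nA, nB) for i in range(k, min(nA, nB) + 1))

p = overlap_pvalue(5, N=100, nA=10, nB=10)
```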