Class 3: Estimating Scoring Rules for Sequence Alignment
Dec 21, 2015

Reminder

Last class we discussed dynamic programming algorithms for
• global alignment
• local alignment

All of these assumed a pre-specified scoring rule (substitution matrix):

$$\sigma : (\Sigma \cup \{-\}) \times (\Sigma \cup \{-\}) \to \mathbb{R}$$

that determines the quality of perfect matches, substitutions and indels, independently of neighboring positions.

A Probabilistic Model

But how do we derive a “good” substitution matrix?

It should “encourage” pairs of letters that are likely to replace one another in closely related sequences, and “punish” others.

Let's examine a general probabilistic approach, guided by evolutionary intuitions.

Assume that we consider only two options:
M: the sequences are evolutionarily related
R: the sequences are unrelated

Unrelated Sequences

Our model of two unrelated sequences s, t is simple: for each position i, both s[i] and t[i] are sampled independently from some “background” distribution q(·) over the alphabet Σ.

Let q(a) be the probability of seeing letter a in any position.

Then the likelihood of s, t (the probability of seeing s, t), given that they are unrelated, is:

$$P(s[1..n], t[1..n] \mid R) = \prod_{i=1}^{n} q(s[i]) \prod_{i=1}^{n} q(t[i])$$

Related Sequences

Now let's assume that each pair of aligned positions (s[i], t[i]) evolved from a common ancestor ⇒ s[i], t[i] are dependent!

We assume s[i], t[i] are sampled from some distribution p(·,·) of letter pairs.

Let p(a,b) be the probability that some ancestral letter evolved into this particular pair of letters.

Then the likelihood of s, t, given that they are related, is:

$$P(s[1..n], t[1..n] \mid M) = \prod_{i=1}^{n} p(s[i], t[i])$$

Decision Problem

Given two sequences s[1..n] and t[1..n] decide whether they were sampled from M or from R

This is an instance of a decision problem that is quite frequent in statistics: hypothesis testing

We want to construct a procedure Decide(s,t) = D(s,t) that returns either M or R

Intuitively, we want to compare the likelihoods of the data in both models…

Types of Error

Our procedure can make two types of errors:
I. s and t are sampled from R, but D(s,t) = M
II. s and t are sampled from M, but D(s,t) = R

Define the following error probabilities:

$$\alpha(D) = \Pr(D(s,t) = M \mid R)$$

$$\beta(D) = \Pr(D(s,t) = R \mid M)$$

We want to find a procedure D(s,t) that minimizes both types of errors

Neyman-Pearson Lemma

•Suppose that D* is such that for some k

$$D^*(s,t) = \begin{cases} M & \text{if } \dfrac{P(s,t \mid M)}{P(s,t \mid R)} \ge k \\[2ex] R & \text{if } \dfrac{P(s,t \mid M)}{P(s,t \mid R)} < k \end{cases}$$

•If any other D is such that α(D) ≤ α(D*), then β(D) ≥ β(D*), where α and β are the type I and type II error probabilities ⇒ D* is optimal

•k may reflect the weights we wish to give to the two types of errors, and the relative abundance of M compared to R
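A minimal sketch of the decision rule D* in Python, assuming a hypothetical two-letter alphabet with made-up distributions q and p (these numbers and the function name `decide` are illustrative and do not come from the lecture). It only shows how the likelihood ratio of an ungapped alignment is compared against the threshold k.

```python
import math

# Assumed toy parameters, for illustration only.
q = {"A": 0.5, "B": 0.5}                      # background distribution
p = {("A", "A"): 0.4, ("B", "B"): 0.4,        # pair distribution for related sequences
     ("A", "B"): 0.1, ("B", "A"): 0.1}

def decide(s, t, k=1.0):
    """Return "M" if P(s,t|M) / P(s,t|R) >= k, otherwise "R"."""
    ratio = math.prod(p[(a, b)] / (q[a] * q[b]) for a, b in zip(s, t))
    return "M" if ratio >= k else "R"

same = decide("AAAA", "AAAA")          # ratio is 1.6 ** 4
different = decide("AAAA", "BBBB")     # ratio is 0.4 ** 4
strict = decide("AAAA", "AAAA", k=10)  # same data, larger threshold
```

With k = 1 the identical pair is assigned to M; raising k makes the rule more reluctant to declare two sequences related.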

Likelihood Ratio for Alignment

The likelihood ratio is a quantitative measure of how much more likely it is that two sequences derive from a common origin than that they are unrelated.

Let's see that it is a natural score for their alignment!

Plugging in the model, we have that:

$$\frac{P(s,t \mid M)}{P(s,t \mid R)} = \frac{\prod_i p(s[i],t[i])}{\prod_i q(s[i])\,q(t[i])} = \prod_i \frac{p(s[i],t[i])}{q(s[i])\,q(t[i])}$$

Likelihood Ratio for Alignment

Taking the logarithm of both sides, we get

$$\log\frac{P(s,t \mid M)}{P(s,t \mid R)} = \log\prod_i \frac{p(s[i],t[i])}{q(s[i])\,q(t[i])} = \sum_i \log\frac{p(s[i],t[i])}{q(s[i])\,q(t[i])}$$

We can see that the (log-)likelihood score decomposes into a sum of single-position scores, each dependent only on the two aligned letters!
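The decomposition can be checked numerically. The sketch below assumes toy values for q and p over a hypothetical two-letter alphabet (illustrative numbers only) and compares the logarithm of the likelihood ratio with the sum of per-position log-odds terms.

```python
import math

# Assumed toy parameters, for illustration only.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

def likelihood_unrelated(s, t):
    """P(s, t | R): every letter is drawn independently from q."""
    return math.prod(q[a] for a in s) * math.prod(q[b] for b in t)

def likelihood_related(s, t):
    """P(s, t | M): every aligned pair is drawn from p."""
    return math.prod(p[(a, b)] for a, b in zip(s, t))

def log_odds_sum(s, t):
    """Sum over positions of log(p(a, b) / (q(a) q(b)))."""
    return sum(math.log(p[(a, b)] / (q[a] * q[b])) for a, b in zip(s, t))

s, t = "AABA", "AABB"
log_ratio = math.log(likelihood_related(s, t) / likelihood_unrelated(s, t))
per_position = log_odds_sum(s, t)
```

Both quantities agree up to floating-point rounding, which is why the per-position terms can serve as alignment scores.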

Probabilistic Interpretation of Scoring Rule

Therefore, if we take our substitution matrix to be:

$$\sigma(a,b) = \log\frac{p(a,b)}{q(a)\,q(b)}$$

then the score of an alignment is the log-ratio between the likelihoods of the two models, which is nice.
Score > 0 ⇒ M is more “probable” (k=1)
Score < 0 ⇒ R is more “probable”

Modeling Assumptions

It is important to note that this interpretation depends on our modeling assumptions for the two hypotheses!

For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form.

Constructing Scoring Rules

The formula

$$\sigma(a,b) = \log\frac{p(a,b)}{q(a)\,q(b)}$$

suggests how to construct a scoring rule:
• Estimate p(·,·) and q(·) from the data
• Compute σ(a,b) based on p(·,·) and q(·)
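The second step can be sketched as follows, assuming that estimates of p and q are already available. The values below are made up for a hypothetical two-letter alphabet, and the helper names `substitution_matrix` and `alignment_score` are illustrative.

```python
import math

# Assumed toy estimates, for illustration only.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

def substitution_matrix(p, q):
    """Return sigma(a, b) = log(p(a, b) / (q(a) * q(b))) for every letter pair."""
    return {(a, b): math.log(p[(a, b)] / (q[a] * q[b])) for a in q for b in q}

def alignment_score(s, t, sigma):
    """Score of an ungapped alignment: the sum of per-position scores."""
    return sum(sigma[(a, b)] for a, b in zip(s, t))

sigma = substitution_matrix(p, q)
score = alignment_score("AABA", "AABB", sigma)
```

With these toy numbers, matching pairs get a positive score and mismatching pairs a negative one, because matches are more frequent under p than under independent sampling from q.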

Estimating Probabilities

Suppose we are given a long string s[1..n] of letters from Σ

We want to estimate the distribution q(·) that “generated” the sequence

How should we go about this?

We build on the theory of parameter estimation in statistics

Statistical Parameter Fitting

Consider instances x[1], x[2], …, x[M] such that
• The set of values that x can take is known
• Each is sampled from the same (unknown) distribution of a known family (multinomial, Gaussian, Poisson, etc.)
• Each is sampled independently of the rest

The task is to find a parameter set θ defining the most likely distribution P(x|θ) from which the instances could have been sampled.

The parameters depend on the given family of probability distributions.

Example: Binomial Experiment

When tossed, a thumbtack can land in one of two positions: Head or Tail

We denote by θ the (unknown) probability P(H).

Estimation task: Given a sequence of toss samples x[1], x[2], …, x[M] we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ

Why is Learning Possible?

Suppose we perform M independent flips of the thumbtack

The number of heads we see follows a binomial distribution

$$P(\#\text{Heads} = k) = \binom{M}{k}\,\theta^{k}(1-\theta)^{M-k}$$

and thus

$$E[\#\text{Heads}] = M\theta$$

This suggests that we can estimate θ by

$$\frac{\#\text{Heads}}{M}$$
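A small numerical check of the binomial formula and its expectation, as a sketch. The values M = 10 and θ = 0.3 are arbitrary choices for illustration, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(k, M, theta):
    """P(#Heads = k) for M independent tosses with P(H) = theta."""
    return comb(M, k) * theta ** k * (1 - theta) ** (M - k)

M, theta = 10, 0.3  # assumed values, for illustration only
pmf = [binomial_pmf(k, M, theta) for k in range(M + 1)]

# The expectation computed from the distribution should equal M * theta.
expected_heads = sum(k * pk for k, pk in enumerate(pmf))
```

Dividing `expected_heads` by M recovers θ, which is the intuition behind using #Heads / M as an estimator.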

Expected Behavior (θ = 0.5)

[Figure: probability (rescaled) over datasets of i.i.d. samples, plotted against #Heads/M on an axis from 0 to 1, for M = 10, M = 100 and M = 1000]

From most large datasets, we get a good approximation to θ

How do we derive such estimators in a principled way?

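The Likelihood Function

How good is a particular θ?

It depends on how likely it is to generate the observed data

$$L(\theta : D) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)$$

The likelihood for the sequence H, T, T, H, H is

$$L(\theta : D) = \theta \cdot (1-\theta) \cdot (1-\theta) \cdot \theta \cdot \theta$$

[Figure: L(θ:D) plotted against θ on an axis from 0 to 1]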
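The likelihood of the sequence H, T, T, H, H can be evaluated on a grid of θ values to see where it peaks. This is only an illustrative sketch; the grid resolution of 0.01 and the function name `likelihood` are assumptions.

```python
def likelihood(theta, data):
    """L(theta : D): product over tosses of P(x[m] | theta)."""
    result = 1.0
    for x in data:
        result *= theta if x == "H" else (1.0 - theta)
    return result

data = ["H", "T", "T", "H", "H"]

# Evaluate the likelihood at theta = 0.00, 0.01, ..., 1.00 and keep the best value.
grid = [i / 100 for i in range(101)]
best_theta = max(grid, key=lambda th: likelihood(th, data))
```

On this grid the maximum is attained at θ = 0.6, the fraction of heads in the data.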

Maximum Likelihood Estimation

MLE Principle:

Learn parameters that maximize the likelihood function

This is one of the most commonly used estimators in statistics

Intuitively appealing

Computing the Likelihood Functions

To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails)

$$L(\theta : D) = \theta^{N_H}(1-\theta)^{N_T}$$

N_H and N_T are sufficient statistics for the binomial distribution

Sufficient Statistics

A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood

Formally, s(D) is a sufficient statistic if for any two datasets D and D′

$$s(D) = s(D') \;\Rightarrow\; L(\theta : D) = L(\theta : D')$$

[Figure: datasets mapped to statistics]

Example: MLE in Binomial Data

Applying the MLE principle we get (after differentiating)

$$\hat{\theta} = \frac{N_H}{N_H + N_T}$$

(which coincides with what we would expect)

Example: (N_H, N_T) = (3, 2)

MLE estimate is 3/5 = 0.6

[Figure: L(θ:D) plotted against θ on an axis from 0 to 1]
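The differentiation step can be written out as follows. This is a sketch that works with the log-likelihood, which has the same maximizer as the likelihood because the logarithm is monotone.

```latex
\begin{align*}
\log L(\theta : D) &= N_H \log\theta + N_T \log(1-\theta) \\
\frac{d}{d\theta}\log L(\theta : D) &= \frac{N_H}{\theta} - \frac{N_T}{1-\theta} = 0 \\
N_H\,(1-\theta) &= N_T\,\theta \\
\hat{\theta} &= \frac{N_H}{N_H + N_T}
\end{align*}
```

For (N_H, N_T) = (3, 2) this gives 3/5 = 0.6.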

From Binomial to Multinomial

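Suppose X can have the values 1, 2, …, K

We want to learn the parameters θ_1, θ_2, …, θ_K

Sufficient statistics: N_1, N_2, …, N_K, the number of times each outcome is observed

Likelihood function:

$$L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$$

MLE (differentiation with Lagrange multipliers):

$$\hat{\theta}_k = \frac{N_k}{N}$$

where N = N_1 + … + N_K is the total number of observations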
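The Lagrange-multiplier step can be sketched as follows, maximizing the log-likelihood under the constraint that the parameters sum to one. The symbol λ for the multiplier and N for the total count are notational choices of this sketch.

```latex
\begin{align*}
\text{maximize}\quad & \log L(\theta : D) = \sum_{k=1}^{K} N_k \log\theta_k
  \quad\text{subject to}\quad \sum_{k=1}^{K}\theta_k = 1 \\
\mathcal{J}(\theta,\lambda) &= \sum_{k=1}^{K} N_k\log\theta_k
  - \lambda\Big(\sum_{k=1}^{K}\theta_k - 1\Big) \\
\frac{\partial\mathcal{J}}{\partial\theta_k} &= \frac{N_k}{\theta_k} - \lambda = 0
  \;\Rightarrow\; \theta_k = \frac{N_k}{\lambda} \\
\sum_{k=1}^{K}\theta_k = 1 &\;\Rightarrow\; \lambda = \sum_{k=1}^{K} N_k = N \\
\hat{\theta}_k &= \frac{N_k}{N}
\end{align*}
```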

At last: Estimating q(·)

Suppose we are given a long string s[1..n] of letters from Σ
• s can be the concatenation of all sequences in our database

We want to estimate the distribution q(·)

Likelihood function:

$$L(q : s) = \prod_{i=1}^{n} q(s[i]) = \prod_{a} q(a)^{N_a}$$

MLE parameters:

$$q(a) = \frac{N_a}{n}$$

where N_a is the number of times a appears in s
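In code, the MLE for q(·) amounts to counting letters. A minimal sketch, where the input string is a made-up example and `estimate_background` is a hypothetical helper name:

```python
from collections import Counter

def estimate_background(s):
    """MLE of q: the relative frequency N_a / n of each letter a in s."""
    counts = Counter(s)
    n = len(s)
    return {a: counts[a] / n for a in counts}

# Made-up string: A appears 4 times, C 3 times, G twice, T once (n = 10).
q_hat = estimate_background("ACGTAACCAG")
```

Letters that never occur in s are simply absent from the result, which corresponds to an estimated probability of zero.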

Estimating p(·,·)

Intuition:
• Find pairs of presumably related aligned sequences s[1..n], t[1..n]
• Estimate the probability of pairs in the sequences:

$$p(a,b) = \frac{N_{a,b}}{n}$$

where N_{a,b} is the number of times a is aligned with b in (s,t)

Again, s and t can be the concatenation of many aligned pairs from the database
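The same counting idea applies to aligned pairs. The sketch below assumes two ungapped aligned strings of equal length; the example sequences and the name `estimate_pair_distribution` are made up for illustration.

```python
from collections import Counter

def estimate_pair_distribution(s, t):
    """MLE of p: the relative frequency N_{a,b} / n of each aligned pair (a, b)."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have the same length")
    counts = Counter(zip(s, t))
    n = len(s)
    return {pair: c / n for pair, c in counts.items()}

# Made-up alignment with pairs (A,A), (C,C), (G,G), (T,A), (A,A).
p_hat = estimate_pair_distribution("ACGTA", "ACGAA")
```

As written, the estimate treats (a, b) and (b, a) as different events. Whether to pool the two orders is a modeling choice that this sketch leaves open.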

Estimating p(·,·)

Problems:
• How do we find pairs of presumably related aligned sequences?
• Can we ensure that the two sequences are indeed based on a common ancestor?
• How far back should this ancestor be?
  earlier divergence ⇒ low sequence similarity
  later divergence ⇒ high sequence similarity

The substitution score of each pair of letters should depend on the evolutionary distance between the compared sequences!

Let Evolution In

Again, we need to make some assumptions:
• Each position changes independently of the rest
• The probability of mutations is the same in each position
• Evolution does not “remember”

[Figure: example letters along a time axis]
Time: t, t+Δ, t+2Δ, t+3Δ, t+4Δ
A A C C G
T T T C G

Model of Evolution

How do we model such a process?

The process (for each position independently) is called a Markov Chain

A chain is defined by the transition probability P(X_{t+Δ} = b | X_t = a), the probability that the next state is b given that the current state is a

We often describe these probabilities by a matrix:

$$T[\Delta]_{ab} = P(X_{t+\Delta} = b \mid X_t = a)$$

Two-Step Changes

Based on T[Δ], we can compute the probabilities of changes over two time periods

$$P(X_{t+2\Delta} = b \mid X_t = a) = \sum_{c} T[\Delta]_{ac}\, T[\Delta]_{cb}$$

Thus T[2Δ] = T[Δ]T[Δ]

By induction: T[kΔ] = T[Δ]^k
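The two-step formula is an ordinary matrix product, and k steps correspond to the k-th matrix power. A minimal pure-Python sketch, assuming a made-up two-state transition matrix (the helper names `mat_mul` and `mat_pow` are illustrative):

```python
def mat_mul(A, B):
    """Matrix product: (AB)[i][j] = sum over c of A[i][c] * B[c][j]."""
    n = len(A)
    return [[sum(A[i][c] * B[c][j] for c in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(T, k):
    """k-step transition matrix T^k, computed by repeated multiplication."""
    n = len(T)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, T)
    return result

# Assumed one-step transition matrix over two states (each row sums to 1).
T = [[0.9, 0.1],
     [0.2, 0.8]]
T2 = mat_pow(T, 2)
```

For instance, the two-step probability of going from state 0 to state 1 is 0.9 · 0.1 + 0.1 · 0.8 = 0.17, summing over the intermediate state c.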

Longer Term Changes

Idea:
• Estimate T[Δ] from some set S of closely related sequences
• Use T[Δ] to compute T[kΔ]
• Derive substitution probability after time k:

$$p(a,b \mid k) = p(b \mid a,k)\,q(a) = T[k\Delta]_{ab}\,q(a) = \left(T[\Delta]^{k}\right)_{ab}\,q(a)$$

Note that the score depends on evolutionary distance, as requested
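Plugging p(a,b|k) into the log-odds form of the score gives log(p(a,b|k) / (q(a) q(b))), which simplifies to log((T^k)_ab / q(b)). The sketch below illustrates how this score changes with k, assuming a made-up two-state chain whose stationary distribution is used as q. The matrix, q and the helper names are all illustrative assumptions.

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][c] * B[c][j] for c in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(T, k):
    n = len(T)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, T)
    return result

# Assumed toy chain over two letters (indices 0 and 1).
T = [[0.9, 0.1],
     [0.2, 0.8]]
q = [2 / 3, 1 / 3]  # stationary distribution of T, used as the background

def score_after_k(a, b, k):
    """log(p(a, b | k) / (q(a) q(b))) with p(a, b | k) = (T^k)[a][b] * q(a)."""
    joint = mat_pow(T, k)[a][b] * q[a]
    return math.log(joint / (q[a] * q[b]))

match_scores = [score_after_k(0, 0, k) for k in (1, 10, 100)]
mismatch_scores = [score_after_k(0, 1, k) for k in (1, 10, 100)]
```

In this toy chain the match score shrinks and the mismatch score grows toward zero as k increases, reflecting that distant sequences look more and more like independent samples from q.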

Estimating PAM1

Collect counts N_ab of aligned pairs (a,b) in similar sequences in S
• Sources include phylogenetic trees and closely related sequences (at least 85% of positions have an exact match)

Normalize counts to get a transition matrix T[Δ], such that the average number of changes is 1%

$$\sum_{a} p(a,a \mid \Delta) = 0.99$$

This unit of change is called 1 point accepted mutation (PAM1): an evolutionary time unit!
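One simple way to carry out such a normalization is sketched below. It rests on assumptions that are not from the lecture: off-diagonal entries are taken proportional to the observed fraction N_ab / Σ_b N_ab, a single scale factor is applied to all of them, and the background q is given. It is not a reproduction of the original PAM construction, and the counts are made up.

```python
def pam1_from_counts(counts, q, target_change=0.01):
    """Scale off-diagonal rates so that sum_a q(a) * T[a][a] = 1 - target_change."""
    letters = sorted(q)
    row_total = {a: sum(counts[a][b] for b in letters) for a in letters}
    # Observed fraction of pairs starting at a that end in a different letter b.
    off_rate = {a: {b: counts[a][b] / row_total[a] for b in letters if b != a}
                for a in letters}
    avg_change = sum(q[a] * sum(off_rate[a].values()) for a in letters)
    scale = target_change / avg_change
    T = {}
    for a in letters:
        T[a] = {b: scale * off_rate[a][b] for b in letters if b != a}
        T[a][a] = 1.0 - sum(T[a].values())  # the rest of the row stays on a
    return T

# Made-up counts and background over a hypothetical two-letter alphabet.
counts = {"A": {"A": 90, "B": 10}, "B": {"A": 10, "B": 40}}
q = {"A": 2 / 3, "B": 1 / 3}
T = pam1_from_counts(counts, q)
unchanged = sum(q[a] * T[a][a] for a in q)
```

Here `unchanged` plays the role of Σ_a p(a,a|Δ), using p(a,a|Δ) = T_aa q(a).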

Using PAM

The matrix PAM-k is defined to be the score based on T^k

Historically researchers use PAM250
• Longer than 100!

Original PAM matrices were based on a small number of proteins

Later versions of PAM use more examples

Used to be the most popular scoring rule

Problems with PAM

PAM extrapolates statistics collected from closely related sequences onto distant ones.

But “short-time” substitutions behave differently than “long-time” substitutions:
• short-time substitutions are dominated by single-nucleotide changes that lead to a different translation (like L→I)
• long-time substitutions do not exhibit such behavior and are much more random.

Therefore, statistics would be different for different stages in evolution.

BLOSUM (blocks substitution) matrix

Source: aligned ungapped regions of protein families
• These are assumed to have a common ancestor

Procedure:
• Group together all sequences in a family with more than e.g. 62% identical residues
• Count the number of substitutions within the same family but across different groups
• Estimate frequencies of each pair of letters
• The resulting matrix is called BLOSUM62
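The grouping and counting steps can be sketched roughly as follows. Several details are assumptions of this sketch and are not stated in the lecture: groups are formed by single linkage, each cross-group sequence pair is weighted by 1 / (size of group 1 × size of group 2), and pairs are counted without regard to order. The block of sequences is made up, and the sketch stops at pair frequencies without computing log-odds scores.

```python
from collections import defaultdict
from itertools import combinations

def identity(s, t):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(a == b for a, b in zip(s, t)) / len(s)

def group_labels(block, threshold):
    """Single-linkage grouping: link sequences whose identity exceeds threshold."""
    parent = list(range(len(block)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(block)), 2):
        if identity(block[i], block[j]) > threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(block))]

def pair_frequencies(block, threshold=0.62):
    """Weighted frequencies of unordered letter pairs across different groups."""
    labels = group_labels(block, threshold)
    size = defaultdict(int)
    for lab in labels:
        size[lab] += 1
    counts = defaultdict(float)
    for i, j in combinations(range(len(block)), 2):
        if labels[i] == labels[j]:
            continue  # substitutions inside a group are not counted
        weight = 1.0 / (size[labels[i]] * size[labels[j]])
        for a, b in zip(block[i], block[j]):
            counts[tuple(sorted((a, b)))] += weight
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: c / total for pair, c in counts.items()}

# Made-up block: the first two sequences are 75% identical and form one group.
block = ["AAAA", "AAAB", "BBBB"]
freqs = pair_frequencies(block)
```

In this toy block only comparisons between the group {AAAA, AAAB} and the singleton {BBBB} contribute, so highly similar sequences do not dominate the counts.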