Top Banner
Gibbs sampling for motif finding in biological sequences Christopher Sheldahl
37

Gibbs sampling for motif finding in biological sequences

Jan 09, 2017

Download

Documents

tranthien
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Gibbs sampling for motif finding in biological sequences

Gibbs sampling for motif finding in biological sequences

Christopher Sheldahl

Page 2: Gibbs sampling for motif finding in biological sequences

Biological Sequences

• Proteins : “strings” of amino acids. 20 common types |alphabet| = 20

• Nucleic acids : “strings” of nucleotides 4 common types for RNA; 4 for DNA |alphabet| = 4

Page 3: Gibbs sampling for motif finding in biological sequences

Motif Finding

• Multiple sequences : How to detect related segments (or “motifs”)?

1: …ATACGTTGAGA… 2: …CGACGTTGCAA…

3: …CCACGCTGGAC…

Page 4: Gibbs sampling for motif finding in biological sequences

Sequence Databases are Large

• Genbank : NA : 108 million sequences• SwissProt : proteins : ~500,000

sequences

Page 5: Gibbs sampling for motif finding in biological sequences

Growth of SwissProt

http://www.expasy.org/sprot/relnotes/relstat.html

Page 6: Gibbs sampling for motif finding in biological sequences

The basic problem

• We have N biological sequences.• Assume that there is one evolutionarily

related segment (substring) of known width W in each sequence.

• How do we find this segment in each sequence? Remember that there might be mutations.

Page 7: Gibbs sampling for motif finding in biological sequences

Protein case : We’d like to know the following:

• Motif description (matrix): Probability of each of the 20 aa at each position in the motif of width w:qi,R : Probability of R type aa at position i of motif. Matrix is W x 20.

• Background probability (vector) of each of the 20 aa in sites not part of motif: p1….p20

• Ak : alignment vector of starting indices for the motif in each of the N sequences.

Page 8: Gibbs sampling for motif finding in biological sequences

Scoring a given string• If we knew the motif description matrix and the

background vector we could score any string of width W :

L : score of segment wrt the motif.Ri : the aa type at position i of the segment.

Q = qi,Rii=1

W∏

P = pRii=1

W∏

L =QP

Page 9: Gibbs sampling for motif finding in biological sequences

Scoring an alignment of N sequences

We can score an entire alignment :

AlignScore = Lxx=1

N∏

Lx =QxPx

Page 10: Gibbs sampling for motif finding in biological sequences

What if we just had one new sequence?

• What if we knew the motif description matrix and the background vector for a large number of sequences, and we want to quickly align a new (presumably related) sequence?

Page 11: Gibbs sampling for motif finding in biological sequences

What if we just had one new sequence?

• Score possible locations of the motif in the new sequence and pick one with a high (or max) value of L:

• Note that the better our motif description and background model are, the better the new alignment will be.

L =QP

Page 12: Gibbs sampling for motif finding in biological sequences

Motif Finding Dilemma

• If we knew the alignment, we could calculate the motif description matrix and the background vector.

• If we knew the motif description and the background, we could calculate the alignment (by choosing high L positions).

• We don’t know either of these - we’ve seen one way to approach this problem before...

Page 13: Gibbs sampling for motif finding in biological sequences

Expectation Maximization or The Road Not Taken in this Lecture

Start with Random motif description and background.Repeat until motif description converges:

1) E - step : Calculate alignment points (Ak) from Motif Description and Background. 2) M - step : Calculate Motif Description and background from alignments.

Lawrence et. al. 1990. “An Expectation Maximization Algorithm for the Identification and Characterization

of Common Sites in Unaligned Biolpolymer Sequences”. Proteins.

Page 14: Gibbs sampling for motif finding in biological sequences

Local minima

• After initialization, simple EM algorithm is a deterministic process.

• If the initialization is bad, can get trapped in local minima.

• Heuristic : Try many different starting points.

Page 15: Gibbs sampling for motif finding in biological sequences

Random Sampling

• Note the heuristic of trying many random initial positions. Maybe we can overcome local minima if we employ randomness more thoroughly.

• Can we take random sample alignments from the distribution :

p(Align | S, W, Motif Matrix, Background) where Align means A1, A2…AN.

Page 16: Gibbs sampling for motif finding in biological sequences

Markov Chain Monte Carlo

• Markov Chain Monte Carlo methods generate a Markov chain of points that converges to a distribution of interest.

• “Monte Carlo” : The methods employ randomness.

Page 17: Gibbs sampling for motif finding in biological sequences

Metropolis-Hastings• Metropolis-Hastings is an MCMC model that can sample from

any distribution P, using a proposal distribution Q(x’; x).

• Initialize with random x.• Generate new x’ =

Proposal position according to Q(x’; x)

• Compute α = min( (P(x’) / P(x) ), 1) and accept change with probability α.

Figure : Wikipedia

Page 18: Gibbs sampling for motif finding in biological sequences

Gibbs Sampling

• Gibbs sampling is a variety of Metropolis-Hastings sampling where the sampling step is always accepted.

• For multivariate distributions, in Gibbs sampling only one parameter is changed at a time.

• This makes Gibbs sampling particularly useful for multivariate distributions.

Page 19: Gibbs sampling for motif finding in biological sequences

Motif Finding with Gibbs :Site Sampler

• Site sampler : Sample starting points for motif in each sequence.

• Start with random alignments, then use random sampling for one sequence at a time to gradually improve the alignments.

Lawrence et al., “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment”, Science. 1993. 262(5131), 208-214

Page 20: Gibbs sampling for motif finding in biological sequences

Gibbs sampling: Changing One Thing at a Time

• N = 3• We want to sample alignment points A1 for

sequence 1, given the pattern description M, background model B, the sequences S and width W, and alignments A2 (for sequence 2) and A3 (for sequence 3).

p( A1 | S, M, B,W, A2, A3)

Page 21: Gibbs sampling for motif finding in biological sequences

Site Sampler Algorithm

Initialize with random alignment points.While not converged:

Do steps 1 and 2 for each of N sequences : 1) Predictive update step :

Calculate motif matrix and background using all sequences but the currently selected one.

2) Sampling step. Calculate Lx = Qx / Px for all starting points x in this sequence. Choose one with a probability proportional to Lx.

• The inner loop is an iteration of sequences.• In step 2 you sample a new value for Ak i.e. the alignment position for sequence k• Convergence : In theory you sample until alignment no longer changes

Page 22: Gibbs sampling for motif finding in biological sequences

Example: Random AlignmentsRouchka, 1997,“A Brief Overview of Gibbs Sampling”

DNA:Alphabet ={A,C,G,T}

Page 23: Gibbs sampling for motif finding in biological sequences

Example:

Initial Counts New CountsRemove 1st sequence (ATTTAT)

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT Rouchka, 1997.

Page 24: Gibbs sampling for motif finding in biological sequences

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCTRouchka, 1997.

Sequence 1:

Page 25: Gibbs sampling for motif finding in biological sequences

Final Alignment

Page 26: Gibbs sampling for motif finding in biological sequences

Assumptions for Site Sampler

• Assumption of one motif.• Assumption of one copy of motif in each

sequence.• Assumption that we know the motif

width.

Page 27: Gibbs sampling for motif finding in biological sequences

Motif Sampler

• Sample different motifs for each part of the sequence. Store motif descriptions and background vector for each motif.

• Allows multiple motifs, and varying number of each motif to be present.

• Still subject to assumption of known motif width.

Neuwald et al., “Gibbs Motif Sampling: Detection of bacterial membrane protein repeats”. 1995. Protein Science. 4:1618-1632.

Page 28: Gibbs sampling for motif finding in biological sequences

Null model

• Null model has pattern description probabilities identical to the background probabilities for its constituent amino acids.

Page 29: Gibbs sampling for motif finding in biological sequences

Widths and Copy Numbers• The motif sampler requires widths and expected

numbers of each motif per sequence as input parameters. The widths are fixed during the algorithm, but the number of copies can change.

• Construction of subfamilies (more specific models) favored by larger widths and smaller expected number of copies.

• Construction of superfamilies (less specific models) favored by smaller widths and larger expected number of copies.

Page 30: Gibbs sampling for motif finding in biological sequences

Motif Sampler AlgorithmInitialize with Ei non-overlapping random segments for each motif i in each of the

sequences.While not converged:

Do steps 1 and 2 for successive segments in the biological sequences : 1) Predictive update step :

If this segment is in an alignment remove it and recalculate motif matrixand background data excluding this segment.

2) Sampling step : Sample one of the motifs with a probability proportional to the score that the motif description assigns this segment.

• In the inner loop the motif sampler iterates through segments (substrings)• Motif sampler samples among different motif models for each segment.

Page 31: Gibbs sampling for motif finding in biological sequences

Modifications to the Samplers

• There are methods to allow for gapped motifs.

• High scoring alignments can be stored for later investigation.

Page 32: Gibbs sampling for motif finding in biological sequences

Porin structuresNeuwald et al., 1995

Page 33: Gibbs sampling for motif finding in biological sequences

Porin alignment

Neuwald et al., 1995

Page 34: Gibbs sampling for motif finding in biological sequences

Motif descriptions

Page 35: Gibbs sampling for motif finding in biological sequences

Motivation for Gibbs sampling

• We have a joint probability density f(x, y1, y2,…yp)

• We want a marginal density, f(x).

• The integrations are difficult.

Casella and George, “Explaining the Gibbs Sampler”

Page 36: Gibbs sampling for motif finding in biological sequences

Near-optimum Sampler

• A particular motif may not be present in the best alignment found, even though its found in many other near-optimal alignments.

• Sample among near-optimal alignments to capture all sites present.

Page 37: Gibbs sampling for motif finding in biological sequences

Column sampler

• The column sampler allows for a motif to shift:

GCACCTG --> GCACCTG • It also allows for the development of

motifs with gaps: GCACCTG --> GCACCTG