Profiles for Sequences

Profiles for Sequences. Sequence Profiles Often, sequences are characterized by similarities that are not well captured through matching algorithms. For.

Dec 18, 2015


Cory Goodwin
Transcript
Page 1

Profiles for Sequences

Page 2

Sequence Profiles

• Often, sequences are characterized by similarities that are not well captured through matching algorithms.

• For example, identification of genes in the presence of exons/introns, gene features (CpG islands, etc.), domain profiles in proteins, among others.

• For such sequences, Markov chains provide useful abstractions.

Page 3

Markov Chains

[Figure: state diagram linking the states Sunny, Rain and Cloudy]

State transition matrix: the probability of the weather given the previous day's weather.

Initial distribution: the probability of the system being in each of the states at time 0.

States: three states (sunny, cloudy, rainy).
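As a minimal sketch of how the three ingredients fit together, the snippet here scores a short weather sequence under a first-order Markov chain. All probabilities are hypothetical values chosen for illustration; the slides do not give numbers for the weather model.

```python
# Hypothetical weather Markov chain; every probability here is made up.
STATES = ["sunny", "cloudy", "rainy"]

# Initial distribution: probability of each state at time 0.
INITIAL = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}

# Transition matrix: TRANSITION[prev][cur] = P(cur today | prev yesterday).
TRANSITION = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.4, "rainy": 0.4},
}


def sequence_probability(days):
    """P(days) = P(day 0) times the product of P(day t | day t-1)."""
    prob = INITIAL[days[0]]
    for prev, cur in zip(days, days[1:]):
        prob *= TRANSITION[prev][cur]
    return prob


# 0.5 (start sunny) * 0.6 (sunny -> sunny) * 0.1 (sunny -> rainy)
p = sequence_probability(["sunny", "sunny", "rainy"])
```

Each row of the transition table sums to 1, because the chain must move to some state on the next day.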

Page 4

Hidden Markov Models

Hidden states : the (TRUE) states of a system that may be described by a Markov process (e.g., the weather).

Observable states : the states of the process that are 'visible'.

Page 5

Hidden Markov Models

Initial Distribution : Initial state probability vector.

State transition Matrix

Emission probabilities: the probability of observing a particular observable state given that the model is in a particular hidden state.
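One way to see how the three components interact is to use them generatively. The sketch here draws a hidden weather path and the observations it emits. The observable symbols ("dry", "damp", "soggy") and all numbers are hypothetical and are not taken from the slides.

```python
import random

# Hypothetical two-state HMM; all names and numbers are illustrative only.
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITION = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMISSION = {
    "sunny": {"dry": 0.6, "damp": 0.3, "soggy": 0.1},
    "rainy": {"dry": 0.1, "damp": 0.4, "soggy": 0.5},
}


def draw(distribution, rng):
    """Draw one outcome from a {outcome: probability} table."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample(length, rng):
    """Return (hidden path, observations), each of the given length."""
    hidden, observed = [], []
    state = draw(INITIAL, rng)          # initial distribution
    for _ in range(length):
        hidden.append(state)
        observed.append(draw(EMISSION[state], rng))   # emission step
        state = draw(TRANSITION[state], rng)          # transition step
    return hidden, observed


hidden_path, observations = sample(5, random.Random(0))
```

In practice only `observations` would be visible; `hidden_path` is what the scoring and alignment methods try to reason about.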

Page 6

Hidden Markov Models

Observed sequences can be scored if their state transitions are known.

The probability of ACCY along this path is:

0.4 × 0.3 × 0.46 × 0.6 × 0.97 × 0.5 × 0.015 × 0.73 × 0.01 × 1 = 1.76 × 10^-6

[Figure labels: transition probabilities, output probabilities]
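The arithmetic on this slide can be checked directly. The sketch multiplies the ten factors as listed and also sums their logarithms, which is the usual way to avoid underflow on longer sequences. Which factors are transition probabilities and which are output probabilities is not recoverable from the transcript, so the list is kept flat.

```python
import math

# Factors copied from the slide, in the order given.
factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

# Direct product of all factors along the path.
path_probability = math.prod(factors)

# Same quantity in log space; sums of logs do not underflow.
log_probability = sum(math.log(f) for f in factors)
```

Here `path_probability` comes out at roughly 1.76e-06, in agreement with the slide.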

Page 7

Methods for Hidden Markov Models

Scoring problem:

Given an existing HMM and an observed sequence, what is the probability that the HMM generates the sequence?
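The scoring problem is commonly solved with the forward algorithm, which sums over all hidden paths without enumerating them. The slide does not name the method, so this is a sketch under that assumption. The toy model (states, symbols, probabilities) is hypothetical, and a brute-force enumeration is included only as a cross-check.

```python
from itertools import product

# Hypothetical two-state HMM; all numbers are illustrative only.
STATES = ("sunny", "rainy")
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITION = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMISSION = {
    "sunny": {"dry": 0.6, "damp": 0.3, "soggy": 0.1},
    "rainy": {"dry": 0.1, "damp": 0.4, "soggy": 0.5},
}


def forward_probability(observations):
    """P(observations | model), summed over every hidden path."""
    # alpha[s] = P(observations so far, current hidden state is s)
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            cur: EMISSION[cur][obs]
            * sum(alpha[prev] * TRANSITION[prev][cur] for prev in STATES)
            for cur in STATES
        }
    return sum(alpha.values())


def brute_force_probability(observations):
    """Same quantity by explicit enumeration; exponential in length."""
    total = 0.0
    for path in product(STATES, repeat=len(observations)):
        p = INITIAL[path[0]] * EMISSION[path[0]][observations[0]]
        for t in range(1, len(path)):
            p *= TRANSITION[path[t - 1]][path[t]] * EMISSION[path[t]][observations[t]]
        total += p
    return total


score = forward_probability(["dry", "damp", "soggy"])
```

The forward recursion costs time proportional to the sequence length times the square of the number of states, whereas enumeration grows exponentially with the sequence length.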

Page 8

Methods, contd.

Alignment problem:

Given a sequence, what is the optimal state sequence that the HMM would use to generate it?
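A minimal sketch of the Viterbi algorithm, which later slides name as the parsing method, is given here for a toy model. The states, symbols and probabilities are hypothetical; only the recursion itself is the point.

```python
# Hypothetical two-state HMM; all numbers are illustrative only.
STATES = ("sunny", "rainy")
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITION = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMISSION = {
    "sunny": {"dry": 0.6, "damp": 0.3, "soggy": 0.1},
    "rainy": {"dry": 0.1, "damp": 0.4, "soggy": 0.5},
}


def joint_probability(path, observations):
    """P(path, observations) for one specific hidden path."""
    prob = INITIAL[path[0]] * EMISSION[path[0]][observations[0]]
    for t in range(1, len(path)):
        prob *= TRANSITION[path[t - 1]][path[t]] * EMISSION[path[t]][observations[t]]
    return prob


def viterbi(observations):
    """Return (probability, path) of the most probable hidden path."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (INITIAL[s] * EMISSION[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for cur in STATES:
            # Pick the predecessor that maximises the path probability.
            prev = max(STATES, key=lambda p: best[p][0] * TRANSITION[p][cur])
            prob = best[prev][0] * TRANSITION[prev][cur] * EMISSION[cur][obs]
            step[cur] = (prob, best[prev][1] + [cur])
        best = step
    return max(best.values(), key=lambda entry: entry[0])


best_prob, best_path = viterbi(["dry", "damp", "soggy"])
```

Storing whole paths in the table keeps the sketch short; a production implementation would more typically keep back-pointers and work in log space.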

Page 9

Methods, contd.

Training problem:

How do we estimate the structure and parameters of an HMM from data?
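When the hidden path of each training sequence is already known (for example from annotation), the maximum-likelihood parameters are simply normalised counts. The sketch assumes that labelled case; the training pairs are invented, and the state labels `x` (non-coding) and `c` (coding) follow the notation used on the gene-finding slide. When the paths are unknown, an iterative method such as EM is needed instead.

```python
from collections import Counter, defaultdict


def estimate_parameters(labelled_sequences):
    """Maximum-likelihood estimates from (hidden path, observations) pairs."""
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)
    for path, observations in labelled_sequences:
        # Count which symbol each state emitted.
        for state, symbol in zip(path, observations):
            emission_counts[state][symbol] += 1
        # Count each state-to-state move along the path.
        for prev, cur in zip(path, path[1:]):
            transition_counts[prev][cur] += 1

    def normalise(counts):
        # Turn each row of counts into a probability distribution.
        return {
            state: {key: n / sum(row.values()) for key, n in row.items()}
            for state, row in counts.items()
        }

    return normalise(transition_counts), normalise(emission_counts)


# Hypothetical annotated DNA: x = non-coding, c = coding.
training = [("xxccx", "ATGCA"), ("xccxx", "TGCAT")]
transition, emission = estimate_parameters(training)
```

With this toy data, `transition["x"]["c"]` and `emission["c"]["G"]` both come out at 0.5. Real data would usually need pseudocounts so that unseen events do not get probability zero.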

Page 10

HMMs: Some Applications

• Gene finding and prediction

• Protein-Profile Analysis

• Secondary Structure prediction

• Copy Number Variation

• Characterizing SNPs

Page 11

Gene Template

[Figure: gene template diagram; only the labels "(Removed)" and "(Left)" survive]

Page 12

HMMs: Applications

• Classification: classifying observations within a sequence

• Order: a DNA sequence is a set of ordered observations

• Structure: can be intuitively defined

• Measure of success: number of complete exons correctly labeled

• Training data: available from various genome annotation projects

Page 13

HMMs for Gene Finding

An HMM for unspliced genes.

x: non-coding DNA

c: coding state

• Training: Expectation Maximization (EM)

• Parsing: Viterbi algorithm

Page 14

Genefinders: a Comparison

                 Accuracy per nucleotide   Accuracy per exon
Method           Sn     Sp     AC          Sn     Sp     (Sn+Sp)/2   ME     WE
GENSCAN          0.93   0.93   0.91        0.78   0.81   0.8         0.09   0.05
FGENEH           0.77   0.85   0.78        0.61   0.61   0.61        0.15   0.11
GeneID           0.63   0.81   0.67        0.44   0.45   0.45        0.28   0.24
GeneParser2      0.66   0.79   0.66        0.35   0.39   0.37        0.29   0.17
GenLang          0.72   0.75   0.69        0.5    0.49   0.5         0.21   0.21
GRAILII          0.72   0.84   0.75        0.36   0.41   0.38        0.25   0.1
SORFIND          0.71   0.85   0.73        0.42   0.47   0.45        0.24   0.14
Xpound           0.61   0.82   0.68        0.15   0.17   0.16        0.32   0.13

Sn = Sensitivity; Sp = Specificity; AC = Approximate Correlation; ME = Missing Exons; WE = Wrong Exons

GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html

Page 15

Protein Profile HMMs

• Motivation
  – Given a single amino acid target sequence of unknown structure, we want to infer the structure of the resulting protein. Use profile similarity.

• What is a Profile?
  – Protein families of related sequences and structures
  – Same function
  – Clear evolutionary relationship
  – Patterns of conservation: some positions are more conserved than others

Page 16

HMMs From Alignment

ACA - - - ATG
TCA A C T ATC
ACA C - - AGC
AGA - - - ATC
ACC G - - ATC

An HMM for a DNA motif alignment. The transitions are shown with arrows whose thickness indicates their probability. In each state, the histogram shows the probabilities of the four bases.

[Figure labels: transition probabilities, output probabilities, insertion]
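The output probabilities of such a model can be read off the alignment by counting bases column by column. The sketch assumes the five 9-column rows recovered from this slide and treats columns 3 to 5 (zero-based) as the insertion region; that choice is an assumption based on the "insertion" label and is passed in explicitly rather than inferred.

```python
from collections import Counter

# Alignment rows as recovered from the slide (gaps written as '-').
ALIGNMENT = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]

# Assumption: zero-based columns 3-5 form the insertion region.
INSERT_COLUMNS = {3, 4, 5}


def frequencies(symbols):
    """Relative frequency of each symbol in an iterable."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


# One emission table per match column (these columns contain no gaps).
match_emissions = [
    frequencies(row[col] for row in ALIGNMENT)
    for col in range(len(ALIGNMENT[0]))
    if col not in INSERT_COLUMNS
]

# A single insert state pools every non-gap base in the insertion region.
insert_emissions = frequencies(
    row[col]
    for row in ALIGNMENT
    for col in sorted(INSERT_COLUMNS)
    if row[col] != "-"
)
```

With these rows the first match state emits A with probability 0.8 and the insert state emits C with probability 0.4. No pseudocounts are applied here, so bases never seen in a column get probability zero.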

Page 17

HMMs from Alignments

Matching states

Insertion states

Deletion states

No. of matching states = average sequence length in the family

PFAM Database of protein families (http://pfam.wustl.edu)

Page 18

Database Searching

• Given an HMM, M, for a sequence family, find all members of the family in a database.

• LL score: LL(x) = log P(x|M). The LL score is length dependent, so it must be normalized or converted to a Z-score.
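Two simple ways to remove the length dependence are sketched here: dividing the LL score by the sequence length, and computing a Z-score against the LL scores of comparison sequences. The slide does not say which reference set the Z-score uses, so taking it from comparison sequences of similar length is an assumption, and every number in the example is hypothetical.

```python
import statistics


def length_normalised(log_likelihood, length):
    """Average log-likelihood per residue."""
    return log_likelihood / length


def z_score(query_ll, reference_lls):
    """Standard deviations by which the query LL exceeds the reference mean."""
    mean = statistics.mean(reference_lls)
    spread = statistics.stdev(reference_lls)  # sample standard deviation
    return (query_ll - mean) / spread


# Hypothetical LL scores for comparison sequences of similar length.
reference = [-40.0, -42.0, -38.0, -41.0, -39.0]

z = z_score(-20.0, reference)
per_residue = length_normalised(-20.0, 10)
```

A large positive Z-score suggests the query fits the model much better than the comparison sequences do; the threshold for calling a family member is a separate choice.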

Page 19

Querying a Sequence

ACAC - - ATC

Consensus sequence: P(ACACATC) = 0.8 × 1 × 0.8 × 1 × 0.8 × 0.6 × 0.4 × 0.6 × 1 × 1 × 0.8 × 1 × 0.8 = 4.7 × 10^-2

Suppose I have a query protein sequence and want to know which family it belongs to. There can be many paths leading to the generation of this sequence. We need to find all these paths and sum their probabilities.
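The consensus-path figure can be reproduced by separating the factors into output and transition probabilities. Reading the product as alternating output and transition terms is an interpretation of how the slide groups the numbers, not something the transcript states.

```python
import math

# Assumed reading of the slide: one output probability per emitted base,
# with a transition probability between consecutive states.
emissions = [0.8, 0.8, 0.8, 0.4, 1.0, 0.8, 0.8]   # A C A C A T C
transitions = [1.0, 1.0, 0.6, 0.6, 1.0, 1.0]

# Probability of the consensus sequence along its single path.
consensus_probability = math.prod(emissions) * math.prod(transitions)
```

This gives about 0.047, in line with the slide. It covers one path only; the full probability of a query sums over every path that can generate it.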

Page 20

Multiple Alignments

• Try every possible path through the model that would produce the target sequences
  – Keep the best one and its probability.
  – Output: sequence of match, insert and delete states

• Viterbi algorithm (dynamic programming)

Page 21

HMMs from Unaligned Sequences

• Baum-Welch expectation-maximization method
  – Start with a model whose length matches the average length of the sequences and with random output and transition probabilities.
  – Align all the sequences to the model.
  – Use the alignment to alter the output and transition probabilities.
  – Repeat. Continue until the model stops changing.

• By-product: a multiple alignment
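A compact sketch of one Baum-Welch re-estimation step is given here for a small, fully connected HMM and a single training sequence. This is a simplification: a profile HMM would have match, insert and delete states and would be trained on many sequences. Note also that Baum-Welch weights every possible path by its probability rather than committing to one alignment per sequence. The starting parameters and the training string are hypothetical.

```python
def forward(obs, init, trans, emit):
    """alpha[t][s] = P(obs[0..t], state at t is s)."""
    states = list(init)
    alpha = [{s: init[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: emit[j][o] * sum(prev[i] * trans[i][j] for i in states)
                      for j in states})
    return alpha


def backward(obs, init, trans, emit):
    """beta[t][s] = P(obs[t+1..] | state at t is s)."""
    states = list(init)
    beta = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(trans[i][j] * emit[j][o] * nxt[j] for j in states)
                        for i in states})
    return beta


def baum_welch_step(obs, init, trans, emit):
    """One EM update; also returns P(obs) under the input parameters."""
    states = list(init)
    alpha = forward(obs, init, trans, emit)
    beta = backward(obs, init, trans, emit)
    likelihood = sum(alpha[-1].values())
    steps = len(obs)
    # gamma[t][s]: posterior probability of being in state s at time t.
    gamma = [{s: alpha[t][s] * beta[t][s] / likelihood for s in states}
             for t in range(steps)]
    # xi[t][(i, j)]: posterior probability of the move i -> j at time t.
    xi = [{(i, j): alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]]
           * beta[t + 1][j] / likelihood
           for i in states for j in states}
          for t in range(steps - 1)]
    new_init = dict(gamma[0])
    new_trans = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1])
                     for j in states} for i in states}
    new_emit = {s: {k: sum(g[s] for g, o in zip(gamma, obs) if o == k)
                    / sum(g[s] for g in gamma)
                    for k in emit[s]} for s in states}
    return new_init, new_trans, new_emit, likelihood


# Hypothetical starting model and training sequence.
init = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1},
        "s2": {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}}
sequence = list("AACCAAGGTTAACC")

likelihoods = []
for _ in range(10):
    init, trans, emit, lik = baum_welch_step(sequence, init, trans, emit)
    likelihoods.append(lik)
```

The recorded likelihoods should not decrease from one iteration to the next, which is the property used to decide when the model has stopped changing. For long sequences the probabilities would need scaling or log-space arithmetic.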

Page 22

PHMM Example

An alignment of 30 short amino acid sequences chopped out of an alignment of the SH3 domain. The shaded areas are the most conserved and were represented by the main states in the HMM. The unshaded area was represented by an insert state.