Multiple Sequence Alignment Alexei Drummond

Multiple Sequence Alignment Alexei Drummond. CS369 20072 Week 3 Learning Outcomes Be able to compute the Smith-Waterman (local) pairwise alignment of.

Jan 03, 2016


Earl Francis

Multiple Sequence Alignment

Alexei Drummond


CS369 2007

Week 3 Learning Outcomes

• Be able to compute the Smith-Waterman (local) pairwise alignment of two sequences given a score matrix and gap penalty

• Be able to compute the Needleman-Wunsch (global) pairwise alignment of two sequences given a score matrix and gap penalty

• Understand the principle of log-odds scoring.


Week 4 Learning Outcomes

• Be able to recognize simple problems that are amenable to dynamic programming (DP) and design a DP algorithm to solve such problems.

• Understand the principle of linear space optimal pairwise alignment

• Understand the principle of quadratic-time pairwise alignment with affine gap penalties.


Computational Biology

Multiple sequence alignment

Global Local

Evolutionary tree reconstruction

Substitution matrices

Pairwise sequence alignment (global and local)

Database searching

BLAST

Sequence statistics

Adapted from slide by Dannie Durant


Multiple sequence alignment

• Definition: Given sequences X(1)…X(N) of lengths n1…nN, seek aligned rows A(1)…A(N) of common length n ≥ max{ni} such that

– X(i) is obtained from A(i) by removing gap characters
– No column consists entirely of gaps
– The score of the alignment is optimal


Definitions

Sequence i: $X^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \dots, x_{n_i}^{(i)}\}$

Row i in alignment: $A^{(i)} = \{a_1^{(i)}, a_2^{(i)}, \dots, a_n^{(i)}\}$

Column j in alignment: $A_j = \{a_j^{(1)}, a_j^{(2)}, \dots, a_j^{(N)}\}$


Multiple sequence alignment

The first 55 amino acids of the albumin protein in 4 vertebrate animals unaligned and aligned.


Multiple sequence alignment

• Align N sequences, so that residues in each column share a property of interest
– A common ancestor / evolutionary history
– A structural or functional role


Multiple sequence alignment

[Figure: example multiple alignment with highlighted columns of residues]

Characters in the same column share evolutionary history


Structure-based alignment

Adapted from slide by Dannie Durant


Scoring function: sum of pairs

A(1)  A-CTCCAT
A(2)  A-GTCC-T
A(3)  ACGTCA-T

Column score:

$$S(A_i) = \sum_{j=1}^{N} \sum_{k>j} s\left(a_i^{(j)}, a_i^{(k)}\right)$$

ColumnScore(3) = s(C,G) + s(C,G) + s(G,G)

= Match + 2 ⋅ Mismatch


Scoring function: sum of pairs

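A(1)  A-CTCCAT
A(2)  A-GTCC-T
A(3)  ACGTCA-T

ColumnScore(7) = s(A,−) + s(A,−) + s(−,−)

= s(−,−) + 2 ⋅ GapPenalty

= 2 ⋅ GapPenalty (taking s(−,−) = 0)

Column score:

$$S(A_i) = \sum_{j=1}^{N} \sum_{k>j} s\left(a_i^{(j)}, a_i^{(k)}\right)$$

The alignment score is the sum of the column scores: $\sum_i S(A_i)$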


Scoring function: tree-based

(1) A-CTCCAT
(2) A-GTCC-T
(3) ACGTCA-T

[Figure: tree relating sequences (1), (2) and (3), with leaves labelled C, G, G and both internal nodes labelled G]

• Assumptions
– Sequences (in particular the characters in a column) evolved from a common ancestor
– Evolution is parsimonious: mutations are rare


Scoring function: tree-based

(1) A-CTCCAT
(2) A-GTCC-T
(3) ACGTCA-T

The score is the minimum number of substitutions needed to explain the data, considering all possible internal labels.

Here are 3 of the 16 possible internal labelings of two internal nodes, and the corresponding number of substitutions implied.

[Figure: the tree for sequences (1), (2), (3) with leaves C, G, G, shown with three internal labelings: G and G (1 substitution), G and C (1 substitution), C and C (2 substitutions)]
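The count over internal labelings can be sketched in a few lines. This is only an illustration: it assumes a rooted topology in which sequences (2) and (3) are siblings and sequence (1) attaches at the root, and the function names are hypothetical rather than taken from the slides.

```python
from itertools import product

ALPHABET = "ACGT"

def substitutions(leaves, inner, root):
    """Count label changes along the branches of the assumed tree (1,(2,3))."""
    leaf1, leaf2, leaf3 = leaves
    branches = [(root, leaf1), (root, inner), (inner, leaf2), (inner, leaf3)]
    return sum(parent != child for parent, child in branches)

def tree_score(leaves):
    """Minimum number of substitutions over all 4 * 4 = 16 internal labelings."""
    costs = {
        (inner, root): substitutions(leaves, inner, root)
        for inner, root in product(ALPHABET, repeat=2)
    }
    return min(costs.values()), costs

# Column with characters C, G, G in sequences (1), (2), (3).
score, costs = tree_score(("C", "G", "G"))
print(score)
print(costs[("G", "G")], costs[("G", "C")], costs[("C", "C")])
```

Under the assumed topology the three labelings printed on the last line imply 1, 1 and 2 substitutions, and the minimum over all 16 labelings is 1.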


Sum of pairs versus tree-based

[Figure: a tree whose leaves carry the column characters A, A, A, G, G]

SP_Score = 6, Tree_Score = 1
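One way to reproduce both numbers, as a sketch under stated assumptions: the sum-of-pairs value is taken to count mismatched pairs (cost 1 per mismatch, 0 per match), and the tree score is computed with Fitch's small-parsimony pass on an assumed binary tree in which the three A's and the two G's form separate groups. The tree shape and the function names are illustrative, not taken from the slide.

```python
from itertools import combinations

def sp_mismatches(column):
    """Number of mismatched pairs among the characters of one column."""
    return sum(a != b for a, b in combinations(column, 2))

def fitch(tree):
    """Fitch pass on a nested-tuple binary tree.

    Leaves are single characters; internal nodes are (left, right) tuples.
    Returns (candidate state set, minimum number of substitutions).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed on a branch below this node.
    return left_states | right_states, left_cost + right_cost + 1

column = "AAAGG"
tree = ((("A", "A"), "A"), ("G", "G"))  # assumed topology

print(sp_mismatches(column))
print(fitch(tree)[1])
```

Each of the three A's mismatches each of the two G's, giving 3 × 2 = 6 pairs, whereas a single substitution on the branch separating the two groups explains the column on the tree.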


Tree-based scores

• Thought to be the “most biological” but
– We don’t know the tree
– We need to infer the characters on internal nodes (more on that in later lectures)
– There may be different trees for different parts of the alignment (if recombination has occurred)
– Not always relevant for structural alignments
• Sum of pairs is almost always used in practice.


Linear gap scores & SP scoring

Treat gap as separate symbol

s(a,-) = s(-,a) = gap score

s(-,-) = 0

“Sum of Pairs” (SP) scoring function

For column $A_i$, with rows $j$ and $k$ ranging over the sequences $1, \dots, N$:

$$S(A_i) = \sum_{j=1}^{N} \sum_{k>j} s\left(a_i^{(j)}, a_i^{(k)}\right)$$

$$\text{AlignmentScore} = \sum_i S(A_i)$$
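A minimal sketch of SP scoring applied to the three-sequence example used earlier in the deck. The numeric values of `MATCH`, `MISMATCH` and `GAP` are assumptions chosen for illustration; the slides leave the score matrix and gap score unspecified.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2  # illustrative values, not from the slides

def s(a, b):
    """Pair score, treating the gap as a separate symbol with s(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def column_score(column):
    """S(A_i): sum of s over all pairs of rows j < k in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))

def alignment_score(rows):
    """Sum of the column scores over all columns of the alignment."""
    return sum(column_score(column) for column in zip(*rows))

rows = ["A-CTCCAT", "A-GTCC-T", "ACGTCA-T"]
print(column_score("CGG"))   # column 3: Match + 2 * Mismatch
print(column_score("A--"))   # column 7: 2 * GapPenalty
print(alignment_score(rows))
```

With these values column 3 scores -1, column 7 scores -4, and the whole alignment scores 2.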


Multidimensional dynamic programming

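Define $F(i_1, i_2, \dots, i_N)$ = max score of an alignment up to the subsequences ending with $x_{i_1}^{(1)}, x_{i_2}^{(2)}, \dots, x_{i_N}^{(N)}$.

$$
F(i_1, i_2, \dots, i_N) = \max
\begin{cases}
F(i_1 - 1, i_2 - 1, \dots, i_N - 1) + S\left(x_{i_1}^{(1)}, x_{i_2}^{(2)}, \dots, x_{i_N}^{(N)}\right) \\
F(i_1, i_2 - 1, \dots, i_N - 1) + S\left(-, x_{i_2}^{(2)}, \dots, x_{i_N}^{(N)}\right) \\
F(i_1 - 1, i_2, i_3 - 1, \dots, i_N - 1) + S\left(x_{i_1}^{(1)}, -, x_{i_3}^{(3)}, \dots, x_{i_N}^{(N)}\right) \\
\quad \vdots \\
F(i_1 - 1, \dots, i_{N-1} - 1, i_N) + S\left(x_{i_1}^{(1)}, x_{i_2}^{(2)}, \dots, x_{i_{N-1}}^{(N-1)}, -\right) \\
F(i_1, i_2, i_3 - 1, \dots, i_N - 1) + S\left(-, -, x_{i_3}^{(3)}, \dots, x_{i_N}^{(N)}\right) \\
\quad \vdots \\
F(i_1, i_2 - 1, \dots, i_{N-1} - 1, i_N) + S\left(-, x_{i_2}^{(2)}, \dots, x_{i_{N-1}}^{(N-1)}, -\right) \\
\quad \vdots
\end{cases}
$$

The maximum ranges over all ways of placing gaps in this column: $2^N - 1$ cases.

$\Theta(n^N 2^N)$ time, $\Theta(n^N)$ space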
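The recurrence can be written directly as a memoised function. The sketch below assumes SP column scoring with illustrative match, mismatch and gap values, returns only the optimal score (no traceback), and is practical only for a few short sequences; `gap_patterns` and `msa_score` are hypothetical names.

```python
from functools import lru_cache
from itertools import combinations, product

MATCH, MISMATCH, GAP = 1, -1, -2  # illustrative values, not from the slides

def s(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def column_score(column):
    return sum(s(a, b) for a, b in combinations(column, 2))

def gap_patterns(n):
    """All 2^n - 1 ways to choose which sequences contribute a residue."""
    return [p for p in product((0, 1), repeat=n) if any(p)]

def msa_score(seqs):
    """Optimal SP score of a global multiple alignment of seqs."""
    patterns = gap_patterns(len(seqs))

    @lru_cache(maxsize=None)
    def F(idx):
        if not any(idx):
            return 0  # nothing aligned yet
        best = float("-inf")
        for pattern in patterns:
            prev = tuple(i - d for i, d in zip(idx, pattern))
            if min(prev) < 0:
                continue  # this pattern would consume a residue that is not there
            column = tuple(
                seq[i - 1] if d else "-" for seq, i, d in zip(seqs, idx, pattern)
            )
            best = max(best, F(prev) + column_score(column))
        return best

    return F(tuple(len(seq) for seq in seqs))

print(len(gap_patterns(3)))
print(msa_score(("ACTCCAT", "AGTCCT", "ACGTCAT")))
```

The table has one cell per index tuple and each cell examines every gap pattern, which is where the exponential time and space bounds come from.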


Dynamic programming for multiple sequence alignment

Optimal score

Traceback


MSA

Carrillo and Lipman (1988), Lipman, Altschul and Kececioglu (1989).

Can optimally align up to 8-10 protein sequences of up to 500 residues.



Multiple alignment software

Exact dynamic programming does not scale beyond a handful of sequences, so approximation methods are really needed.

Different techniques

1. Progressive global alignment of sequences starting with an alignment of the most similar sequences and then building a full alignment by adding more sequences

2. Iterative methods that make an initial alignment of groups of sequences and then refine the alignment to achieve a better result (Barton-Sternberg, Simulated annealing, stochastic hill climbing, genetic algorithms)

3. Use of probabilistic models of the indel and substitution process to do statistical inference of alignment. (“Statistical alignment”)


Progressive alignment

Align sequences (pairwise) in some (greedy) order

Decisions

(1) Order of alignments
(2) Alignment of sequence to group (only), or allow group to group
(3) Method of alignment, and scoring function


Guide tree

[Figure: two alternative guide trees over sequences labelled A, B, C, ..., captioned “this?” and “or this?”]


Feng & Doolittle (1987)

Overview

(1) Calculate a triangular matrix of the N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment, converting raw alignment scores to approximate pairwise “distances” (either p-distance or a genetic distance based on a Markov model).

(2) Construct guide tree from the distance matrix by using appropriate clustering algorithm.

(3) Starting from first node added to the tree, align the child nodes (which may be two sequences, a sequence and an alignment, or two alignments). Repeat for all other nodes in the order that they were added to tree, until all sequences have been aligned.
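Step (1) might look like the following sketch. It assumes a simple match/mismatch score with a linear gap penalty (illustrative values) and uses the p-distance, taken here as the fraction of differing sites among the aligned, non-gap columns. The function names are hypothetical and this is not the original Feng and Doolittle implementation.

```python
from itertools import combinations

def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty; returns both rows."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Traceback from the bottom-right corner, preferring the diagonal move.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + sub:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))

def p_distance(ax, ay):
    """Fraction of differing sites among columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(ax, ay) if a != "-" and b != "-"]
    if not pairs:
        return 1.0
    return sum(a != b for a, b in pairs) / len(pairs)

def distance_matrix(seqs):
    """The N(N-1)/2 pairwise distances, keyed by index pair (i, j) with i < j."""
    return {
        (i, j): p_distance(*needleman_wunsch(seqs[i], seqs[j]))
        for i, j in combinations(range(len(seqs)), 2)
    }

distances = distance_matrix(["ACTCCAT", "AGTCCT", "ACGTCAT"])
for pair, d in sorted(distances.items()):
    print(pair, round(d, 3))
```

Only the pairs with i < j are stored, which is why N sequences give N(N-1)/2 entries.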


Feng & Doolittle (1987)

sequence-to-group

Best pairwise alignment determines alignment to group

[Figure: a single sequence being aligned to a group of already-aligned sequences]


[Figure: sequence-to-group alignment with one column highlighted, in which an X is aligned against gap symbols]

This column is encouraged because it has no cost



Feng & Doolittle (1987)

group-to-group

Best pairwise alignment determines alignment of groups

[Figure: two groups of already-aligned sequences being aligned to each other]



Feng & Doolittle (1987)

After an alignment is completed, gap symbols are replaced by “X”.

“Once a gap, always a gap”.

Encourages gaps to occur in same columns in subsequent alignments.

Implemented by PILEUP (from GCG package).
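A small sketch of the gap-freezing idea. It assumes that X is scored as neutral (cost 0 against any symbol, including a gap), which is one reading of the "no cost" remark on the earlier slide; the score values and function names are illustrative only.

```python
def freeze_gaps(rows):
    """Replace every gap symbol in a finished alignment by the neutral X."""
    return [row.replace("-", "X") for row in rows]

def s(a, b, match=1, mismatch=-1, gap=-2):
    """Pair score in which X costs nothing against any symbol (assumption)."""
    if "X" in (a, b):
        return 0
    if a == "-" and b == "-":
        return 0
    if "-" in (a, b):
        return gap
    return match if a == b else mismatch

group = freeze_gaps(["A-GT", "ACGT"])
print(group)
print(s("X", "-"), s("A", "-"))
```

Because a later gap placed opposite an X is free while a gap opposite a residue is penalised, new gaps tend to fall in columns that already contain one.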


Profile alignment

group-to-group

[Figure: two aligned groups, A and B, combined into a single alignment]

Total alignment score = score (A) + score (B) + score (A*B)
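The decomposition can be checked numerically under SP scoring. In this sketch score (A*B) is read as the sum of pair scores between a row of A and a row of B in the same column; that reading, the score values and the function names are assumptions made for illustration.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2  # illustrative values, not from the slides

def s(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def sp_score(rows):
    """Sum-of-pairs score of an alignment given as equal-length rows."""
    return sum(s(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

def cross_score(group_a, group_b):
    """Pair scores between rows of A and rows of B, column by column."""
    return sum(
        s(a, b)
        for col_a, col_b in zip(zip(*group_a), zip(*group_b))
        for a in col_a
        for b in col_b
    )

group_a = ["A-CTCCAT", "A-GTCC-T"]
group_b = ["ACGTCA-T"]

total = sp_score(group_a + group_b)
parts = sp_score(group_a) + sp_score(group_b) + cross_score(group_a, group_b)
print(total, parts)
```

Since score (A) and score (B) are fixed once the groups are frozen, aligning the two groups only has to optimise the cross term.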


CLUSTALW

Thompson, Higgins and Gibson (1994).

Widely used implementation of profile-based progressive multiple alignment.

Similar to Feng-Doolittle method, except for use of profile alignment methods.

Overview:

1. Calculate a triangular matrix of the N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment, converting raw alignment scores to approximate pairwise “distances”.

2. Construct guide tree from distance matrix by using an appropriate neighbour-joining clustering algorithm.

3. Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.

Plus many other heuristics.
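To illustrate how a distance matrix turns into an alignment order, here is a sketch that uses simple average-linkage (UPGMA-style) clustering. This is deliberately a simpler stand-in and not the neighbour-joining method named above; the distances and the function name are made up for the example.

```python
from itertools import combinations

def average_linkage_order(names, dist):
    """Return the sequence of merges (a guide-tree build order).

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    Clusters are tuples of sequence names; the closest pair is merged first.
    """
    clusters = [(name,) for name in names]
    merges = []

    def cluster_distance(c1, c2):
        # Average of the original pairwise distances between the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2))
    return merges

names = ["A", "B", "C", "D"]
dist = {frozenset(pair): 0.8 for pair in combinations(names, 2)}
dist[frozenset(("A", "B"))] = 0.1  # hypothetical distances
dist[frozenset(("C", "D"))] = 0.2

for left, right in average_linkage_order(names, dist):
    print(left, "+", right)
```

Each merge corresponds to one node of the guide tree: two single names mean a sequence-sequence alignment, a name and a cluster a sequence-profile alignment, and two clusters a profile-profile alignment.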


CLUSTAL W heuristics

• Closely related sequences are aligned with hard matrices (BLOSUM80) and distant sequences are aligned with soft matrices (BLOSUM50).

• Hydrophobic residues (which are more likely to be buried) are given higher gap penalties than hydrophilic residues (which are more likely to be surface-accessible).

• Gap-open penalties are also decreased if the position is spanned by 5 or more consecutive hydrophilic residues.


CLUSTAL W heuristics

• Both gap-open penalties and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment. This rule tries to force all gaps to occur in the same places in an alignment.

• In the progressive alignment stage, if the score of an alignment is low, the guide tree may be adjusted on the fly to defer the low scoring alignment until later in the progressive alignment phase when more profile information has been accumulated.


Iterative refinement

i.e. “hill climbing”. Slightly change solution to improve score. Converge to local optimum.

e.g. Barton-Sternberg (1987) multiple alignment

(1) Find the two sequences with the highest pairwise similarity and align them using standard dynamic programming alignment.

(2) Find the sequence most similar to a profile of the alignment of the first two, and align it to the first two by profile-sequence alignment. Repeat until all sequences have been included in the multiple alignment.

(3) Remove sequence X(1) and realign it to a profile of the other aligned sequences X(2)…X(N) by profile-sequence alignment. Repeat for sequences X(2)…X(N).

(4) Repeat the previous realignment step a fixed number of times, or until the alignment score converges.
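The control flow of the refinement loop can be sketched generically. The `realign` and `score` callables are placeholders to be supplied by the caller (for instance a profile-sequence aligner and an SP scorer); this variant also keeps a realignment only when it improves the score, which is a hill-climbing assumption rather than a detail stated on the slide.

```python
def iterative_refinement(alignment, realign, score, max_rounds=10):
    """Hill-climbing refinement skeleton.

    realign(alignment, i) must return a new alignment in which sequence i has
    been removed and realigned to the rest; score(alignment) returns a number.
    Stops after max_rounds passes or when a full pass brings no improvement.
    """
    best = score(alignment)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(alignment)):
            candidate = realign(alignment, i)
            candidate_score = score(candidate)
            if candidate_score > best:
                alignment, best, improved = candidate, candidate_score, True
        if not improved:
            break  # converged to a local optimum
    return alignment, best

# Toy stand-ins so the loop can be run: each "row" is a number capped at 2.
def toy_realign(rows, i):
    return rows[:i] + [min(rows[i] + 1, 2)] + rows[i + 1:]

result, result_score = iterative_refinement([0, 0, 0], toy_realign, sum)
print(result, result_score)
```

As with any hill-climbing scheme, the result depends on the starting alignment and may be only a local optimum.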


Clustal X


C_aminophilum       AGCT.YCGCA TGRAGCAGTG TGAAAA.... ............ACTCCGGT GGTACAGGAT
C_colinum           AGTA..GGCA TCTACAAGTT GGAAAA.... ............ACTGAGGT GGTATAGGAG
C_lentocellum       GGTATTCGCT TGATTATNAT AGTAAA.... ............GATTTATC GCCATAGGAT
C_botulinum_D       TTTA.TGGCA TCATACATAA AATAATCAAA ..........GGAGCAATCC GCTTTGAGAT
C_novyi_A           TTTA.CGGCA T....CGTAG AATAATCAAA ..........GGAGCAATCC GCTTTGAGAT
C_gasigenes         AGTT.TCGCA TGAAACA... GC.AATTAAA ..........GGAGAAATCC GCTATAAGAT
C_aurantibutyricum  A.NT.TCGCA TGGAGCA... AC.AATCAAA ..........GGAGCAAT.C ACTATAAGAT
C_sp_C_quinii       AGTT.T.GCA TGGGACA... GC.AATTAAA ..........GGAGCAATCC GCTATGAGAT
C_perfringens       AAGA.TGGCA T.CATCA... TTCAACCAAA ..........GGAGCAATCC GCTATGAGAT
C_cadaveris         TTTT.CTGCA TGGGAAA... GTC.ATGAAA ..........GGAGCAATCC GCTGTAAGAT
C_cellulovorans     ATTC.TCGCA TGAGAGA... .TGTATCAAA ..........GGAGCAATCC GCTATAAGAT
C_K21               TTGR.TCGCA TGATCKAAAC ATCAAAGGAT ..TTTTCTTTGGAAAATTCC ACTTTGAGAT
C_estertheticum     TTGA.TCGCA TGATCTTAAC ATCAAAGGAA ..TTT..TTCGG..AATTTC ACTTTGAGAT
C_botulinum_A       AGAA.TCGCA TGATTTTCTT ATCAAAGATT ..T............ATT.. GCTTTGAGAT
C_sporogenes        AGAA.TCGCA TGATTTTCTT ATCAAAGATT ..T............ATT.. GCTTTGAGAT
C_argentinense      AAGG.TCGCA TGACTTTTAT ACCAAAGGAG ..T............AATCC GCTATGAGAT
C_subterminale      AAGG.TCGCA TGACTTTTAT ACCAAAGGAG ..T............AATCC GCTATGAGAT
C_tetanomorphum     TTTT.CCGCA TGAAAAACTA ATCAAAGGAG ..T............AAT.C GCTTTGAGAT
C_pasteurianum      AGTT.TCACA TGGAGCTTTA ATTAAAGGAG ..T............AATCC GCTTTGAGAT
C_collagenovorans   TTGA.TCGCA TGGTCGAAAT ATTAAAGGAG ..T............AATCC GCTTACAGAT
C_histolyticum      TTTA.ATGCA TGTTAGAAAG ATTAAAGGAG ..............CAATCC GCTTTGAGAT
C_tyrobutyricum     AGTT.TCACA TGGAATTTGG ATGAAAGGAG ..T............AATTC GCTTTGAGAT
C_tetani            GGTT.TCGCA TGAAACTTTA ACCAAAGGAG ..T............AATCT GCTTTGAGAT
C_barkeri           GACA.TCGCA TGGTGTT... .TTAATGAAA ............ACTCCGGT GCCATGAGAT
C_thermocellum      GGCA.TCGTC CTGTTAT... .CAAAGGAGA ............AATCCGGT ...ATGAGAT
Pep_prevotii        AGTC.TCGCA TGGNGTTATC ATCAAAGA.. ..............TTTATC GGTGTAAGAT
C_innocuum          ACGGAGCGCA TGCTCTGTAT ATTAAAGCGC CCTTCAAGGCGTGAAC.... ....ATGGAT
S_ruminantium       AGTTTCCGCA TGGGAGCTTG ATTAAAGATG GCCTCTACTTGTAAGCTATC GCTTTGCGAT


[Highlighted in the alignment: the motif TCAAAGGAG, marked in two sequences]


Alignment - considerations

• The programs simply try to maximize the number of matches
– The “best” alignment may not be the correct biological one
• Multiple alignments are done progressively
– Such alignments get progressively worse as you add sequences
– Mistakes that occur during the alignment process are frozen in.
• Unless the sequences are very similar you will almost certainly have to correct manually


Manual Alignment - software

Geneious - cross-platform
– http://www.geneious.com/

CINEMA - Java applet available from:
– http://www.biochem.ucl.ac.uk

Seqapp/Seqpup - Mac/PC/UNIX available from:
– http://iubio.bio.indiana.edu

Se-Al for Macintosh, available from:
– http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html

BioEdit for PC, available from:
– http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html


[Figure annotations: “Extra T”, “Missing G”]


Hang on, what makes a good alignment?


What makes a good alignment?


Structural Alignment

Sequence Alignment



I hate ad hoc algorithms and manual sequence alignment!

Is there an alternative?


An evolutionary hypothesis

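[Figure: an evolutionary tree. Observations (the sequences at the tips): AAT, AAC, AC, ACCG, ACC. Hypothesis/Model: a tree containing the sequence AG, with events marked on its branches: Insert CC, Insert T, Delete G, G->C, G->A, T->C]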

Knowing the rates of different events (substitutions, insertions and deletions) provides a method of assessing the probability of these observations, given this hypothesis: Pr{D|T,Q}

T: the evolutionary tree

Q: parameters of the evolutionary process


Statistics: fitting versus modeling

• Statistical fitting of sequence variation
– Count frequencies of changes in real data sets
– Build empirical statistical descriptions of the data (Blosum62)
– Compare observed frequencies to well defined null hypothesis for testing (log-odds ratio and scores)
– Use scores in ad hoc algorithms for search and alignment (BLAST and ClustalX)

• Probabilistic models of sequence evolution
– Describe a probabilistic model in terms of a process of evolution, rates of substitution, insertion and deletion
– Estimate parameters of the models and compare models using model comparison (likelihood ratios, Bayes factors)
– Use maximum likelihood and Bayesian inference to co-estimate (uncertainty in) alignment and evolutionary history.
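As a small worked example of the log-odds idea from the first list: the score for aligning residues a and b compares the observed frequency of the aligned pair with the frequency expected if the two residues occurred independently. The frequencies below are invented for illustration, and the function name is hypothetical.

```python
import math

def log_odds(p_ab, q_a, q_b, base=2):
    """log of (observed pair frequency) / (frequency expected by chance)."""
    return math.log(p_ab / (q_a * q_b), base)

# Hypothetical frequencies: both residues have background frequency 0.25.
print(log_odds(0.125, 0.25, 0.25))    # pair seen twice as often as chance
print(log_odds(0.0625, 0.25, 0.25))   # pair seen exactly as often as chance
print(log_odds(0.03125, 0.25, 0.25))  # pair seen half as often as chance
```

Pairs that are aligned more often than chance get a positive score, pairs aligned less often get a negative one, and the scores of independent columns add up.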


Probabilistic models and biology

3D structure of myoglobin, showing six alpha-helices.


State of the art



Bali-Phy

Source: http://bioinformatics.oxfordjournals.org/cgi/content/full/22/16/2047


What does the future hold?

• No single “true” alignment
– In most situations there is a set of alignments that are consistent with the observations
– Understanding this uncertainty is as important as understanding the “best” alignment

• Explicit evolutionary model-based methods
– Methods that co-estimate alignment and phylogeny are beginning to appear
– Co-estimation of protein structure and alignment using evolutionary models may be on the horizon

• Death of manual sequence alignment?