Pairwise sequence alignments. Dynamic programming (Needleman-Wunsch) finds the optimal alignment. Heuristics: BLAST (Altschul et al.) does not guarantee finding the optimal alignment, but is fast.

Dec 25, 2015


Gervase Owen
Transcript
Page 1: Pairwise sequence alignments Dynamic programming (Needleman-Wunsch), finds optimal alignment Heuristics: Blast (Altschul et al) does not guarantee finding.

Pairwise sequence alignments

• Dynamic programming (Needleman-Wunsch), finds optimal alignment

• Heuristics: BLAST (Altschul et al.) does not guarantee finding the optimal alignment, but is fast

Page 2

Pairwise sequence alignments

APLFVA----ITRSDD

APVFIAGDTRITRSEE

Assumptions:
- sequences evolve through mutations and deletions/insertions;
- the greater the similarity between sequences, the more likely they are to be evolutionarily related.

Page 3

Similarity measures: Percent Identity

Identity score: exact matches receive a score of 1 and non-exact matches a score of 0.

AVLILKQW
AVLIILQT
--------
11110010 = 5 (score of the alignment under “identity”)

Percent identity: identity_score/length_of_the_shorter_protein

Disadvantage of percent identity: it does not take into account the similarity between the properties of the amino acids.
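The identity score and percent identity defined on this slide can be sketched in a few lines of Python. The function names are hypothetical, and skipping gap columns when measuring the shorter protein is an assumption, since the slide does not say how gaps are handled.

```python
def identity_score(a: str, b: str) -> int:
    """Count aligned positions where both rows carry the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def percent_identity(a: str, b: str) -> float:
    """Identity score divided by the ungapped length of the shorter protein."""
    shorter = min(len(a.replace("-", "")), len(b.replace("-", "")))
    return 100.0 * identity_score(a, b) / shorter


# The example from the slide: five of the eight positions are identical.
print(identity_score("AVLILKQW", "AVLIILQT"))
print(percent_identity("AVLILKQW", "AVLIILQT"))
```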

Page 4

Substitution Matrices – measure of “similarity” score of amino acids

• M(i,j) ~ probability of substituting i into j over some time period

• Percent Accepted Mutation (PAM) unit = evolutionary time corresponding to an average of 1 mutation per 100 residues.

• Two most popular classes of matrices:
– PAMn: relates to mutation probabilities in an evolutionary interval of n PAM units (PAM120 is often used in practice)
– BLOSUMx: relates to substitution probabilities observed in aligned blocks of related proteins, where sequences more than x% identical are clustered and counted as one. BLOSUM62 ~ PAM250
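PAMn probabilities are conventionally obtained by multiplying the PAM1 mutation probability matrix by itself n times. The sketch below illustrates that extrapolation on a made-up two-letter alphabet. The 0.99/0.01 values are hypothetical and chosen only so that one step corresponds to 1 change per 100 positions; they are not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM1-like matrix for a toy two-letter alphabet:
# a letter is kept with probability 0.99 and replaced with probability 0.01.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After 250 steps the rows are close to the equilibrium distribution, which is why high-numbered PAM matrices suit distantly related sequences.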

Page 5

Scoring the gaps

Solution: Have additional penalty for opening a gap

ATTTTAGTACATT- - AGTAC

ATTTTAGTACA-T-T -AGTAC

The two alignments below have the same score. The second alignment is better.

Affine gap penalty: w(k) = h + gk, where h and g are constants and k is the gap length.

Interpretation: cost of starting a gap: h + g; cost of extending a gap: +g.

Page 6

Dot plot illustration

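[Dot plot: sequence TTACTCAAT along the horizontal axis and ACTCATTAC along the vertical axis; a dot marks each pair of identical letters.]

The alignment corresponds to a path from the upper left corner to the lower right corner going through the maximum number of dots.

Deletions

TTACTCAAT---
--ACTCA-TTAC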

Adapted from T. Przytycka
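A dot plot like the one on this slide can be computed directly: put one sequence on each axis and mark every cell where the two letters are equal. This is a minimal sketch with hypothetical function names; it draws single-letter dots without any window filtering.

```python
def dot_plot(rows: str, cols: str):
    """Return a grid of booleans: True where the row and column letters match."""
    return [[r == c for c in cols] for r in rows]


def render(rows: str, cols: str) -> str:
    """Draw the dot plot as text, '*' for a dot and '.' for an empty cell."""
    lines = ["  " + " ".join(cols)]
    for letter, line in zip(rows, dot_plot(rows, cols)):
        lines.append(letter + " " + " ".join("*" if hit else "." for hit in line))
    return "\n".join(lines)


# The two sequences from the slide.
print(render("ACTCATTAC", "TTACTCAAT"))
```

Runs of dots along a diagonal correspond to matching stretches such as ACTCA; a shift from one diagonal to another corresponds to a gap.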

Page 7

Gap penalties

Consider two pairs of alignments:

ATCG   and   AT-CG
ATTG         ATT-G

AT-C-TA   and   ATC--TA
ATTTTTA         ATTTTTA

They have the same score but the right alignment is more likely from an evolutionary perspective (simpler explanation = better explanation).

• First problem is corrected by introducing a “gap penalty”: for each gap, subtract the gap penalty from the score.

• Second problem is corrected by introducing an additional penalty for opening a gap.

Affine gap penalty: w(k) = h + gk, where h and g are constants and k is the gap length.

Interpretation: cost of starting a gap: h + g; cost of extending a gap: +g.
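To see the effect of the opening penalty, the sketch below scores the second pair of alignments on this slide, read here as AT-C-TA/ATTTTTA and ATC--TA/ATTTTTA. The match score of 1, mismatch score of 0, and the values of h and g are assumptions chosen for illustration.

```python
import re


def affine_gap_penalty(k: int, h: float, g: float) -> float:
    """w(k) = h + g*k for a gap of length k."""
    return h + g * k


def alignment_score(top: str, bottom: str, h: float, g: float,
                    match: float = 1.0, mismatch: float = 0.0) -> float:
    """Sum substitution scores over ungapped columns, then subtract w(k) per gap."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    score = sum(match if x == y else mismatch
                for x, y in zip(top, bottom) if x != "-" and y != "-")
    for row in (top, bottom):
        for run in re.findall(r"-+", row):  # each maximal run of '-' is one gap
            score -= affine_gap_penalty(len(run), h, g)
    return score


scattered = ("AT-C-TA", "ATTTTTA")   # two gaps of length 1
contiguous = ("ATC--TA", "ATTTTTA")  # one gap of length 2

# With h = 0 (no opening cost) the two alignments tie.
print(alignment_score(*scattered, h=0, g=1), alignment_score(*contiguous, h=0, g=1))
# With h = 2 the single contiguous gap is preferred.
print(alignment_score(*scattered, h=2, g=1), alignment_score(*contiguous, h=2, g=1))
```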

Page 8

Organizing the computation – dynamic programming table

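\[
\mathrm{Align}(S_i, S'_j) = \max
\begin{cases}
\mathrm{Align}(S_{i-1}, S'_{j-1}) + s(a_i, a'_j) \\
\mathrm{Align}(S_{i-1}, S'_j) - g \\
\mathrm{Align}(S_i, S'_{j-1}) - g
\end{cases}
\]

[Figure: dynamic programming table indexed by i and j; the entry Align(i,j) is the maximum over its three neighbouring entries, with s(a_i, a'_j) added on the diagonal step.]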
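The recurrence can be turned into a table-filling routine almost line for line. This is a minimal sketch assuming a linear gap penalty g and a simple match/mismatch score for s; the function name and default values are hypothetical.

```python
def nw_table(s: str, t: str, g: int = 1, match: int = 1, mismatch: int = -1):
    """Fill the global-alignment table; table[i][j] scores s[:i] against t[:j]."""
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: s[:i] against nothing
        table[i][0] = -g * i
    for j in range(1, cols):          # first row: t[:j] against nothing
        table[0][j] = -g * j
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                table[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),  # align a_i with a'_j
                table[i - 1][j] - g,                            # a_i against a gap
                table[i][j - 1] - g,                            # a'_j against a gap
            )
    return table


for row in nw_table("ATTG", "ATGC"):
    print(row)
```

The bottom-right entry holds the score of the optimal global alignment.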

Page 9

Recovering the path

[Figure: dynamic programming table for the sequences ATTG and ATGC with the traceback path marked.]

ATTG-
AT-GC
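Recovering the path means walking back from the bottom-right cell and, at each step, asking which of the three neighbours produced the current value. The sketch below assumes the same linear gap penalty and match/mismatch scores as before; all names are hypothetical. When several moves tie, it prefers the diagonal, so it may return a different but equally scoring alignment than the one drawn on the slide.

```python
def needleman_wunsch(s: str, t: str, g: int = 1, match: int = 1, mismatch: int = -1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    def sub(x, y):
        return match if x == y else mismatch

    table = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        table[i][0] = -g * i
    for j in range(1, len(t) + 1):
        table[0][j] = -g * j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            table[i][j] = max(table[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
                              table[i - 1][j] - g,
                              table[i][j - 1] - g)

    # Traceback: follow the move that explains each cell's value.
    top, bottom = [], []
    i, j = len(s), len(t)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] - g:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return table[len(s)][len(t)], "".join(reversed(top)), "".join(reversed(bottom))


score, row1, row2 = needleman_wunsch("ATTG", "ATGC")
print(score)
print(row1)
print(row2)
```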

Page 10

Ignoring initial and final gaps – semiglobal comparison

Recall the initialization step for the dynamic programming table: the entries A[0,j] and A[i,0] are responsible for initial gaps. Set them to zero!

How to ignore final gaps?

CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

No penalties for these initial and final gaps.

Take the largest value in the last row/column and trace back from there.
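Both modifications are small changes to the global table: leave the first row and column at zero, and read the answer from the best cell in the last row or last column instead of the corner. This is a minimal sketch that returns only the score, with an assumed linear gap penalty and match/mismatch scores; the function name is hypothetical.

```python
def semiglobal_score(s: str, t: str, g: int = 1, match: int = 1, mismatch: int = -1) -> int:
    """Best alignment score when initial and final gaps are not penalized."""
    rows, cols = len(s) + 1, len(t) + 1
    # Zeros in the first row and column: initial gaps are free.
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + sub,
                              table[i - 1][j] - g,
                              table[i][j - 1] - g)
    # Largest value in the last row or last column: final gaps are free.
    last_row = max(table[-1])
    last_col = max(row[-1] for row in table)
    return max(last_row, last_col)


print(semiglobal_score("AAACGT", "CGT"))  # CGT sits at the end of the longer sequence
print(semiglobal_score("CGTAAA", "CGT"))  # CGT sits at the start
```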

Page 11

Comparing similar sequences

Similar sequences: the optimal alignment has a small number of gaps.

The “alignment path” stays close to the diagonal.

From the book: Setubal & Meidanis, “Introduction to Computational Molecular Biology”

Page 12

Local and global alignments

[Figure panels: Global, Local]

Page 13

Local alignment (Smith - Waterman)

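So far we have been dealing with global alignment.

Local alignment: alignment between substrings.

Main idea: if the alignment becomes too bad, drop it.

\[
a[i,j] = \max
\begin{cases}
a[i-1,j-1] + s(a_i, a'_j) \\
a[i-1,j] - g \\
a[i,j-1] - g \\
0
\end{cases}
\]

Here g is the gap penalty, subtracted as in the global recurrence; the extra option 0 is what drops an alignment that has become too bad.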
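The local recurrence differs from the global one only in the extra 0 and in where the answer is read: the best local alignment ends at the largest entry anywhere in the table. A minimal sketch, returning only the score, under an assumed linear gap penalty and match/mismatch scores with a hypothetical function name:

```python
def smith_waterman_score(s: str, t: str, g: int = 1, match: int = 1, mismatch: int = -1) -> int:
    """Score of the best local alignment between substrings of s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]  # first row and column stay at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            table[i][j] = max(0,                          # drop a bad alignment
                              table[i - 1][j - 1] + sub,
                              table[i - 1][j] - g,
                              table[i][j - 1] - g)
            best = max(best, table[i][j])
    return best


# Only the shared substring ACG contributes to the local score.
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```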

Page 14

Example

Page 15

BLAST

• Local heuristics
• Fast
• Good statistics
• Precalculated lookup table of all high-scoring word matches three residues long
• Extend the hit until the score drops below some threshold
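The last two bullets describe a seed-and-extend strategy, which the toy sketch below imitates. It is not BLAST itself: the lookup table here holds only exact words from the query (BLAST also includes similar words scoring above a threshold), extension is ungapped, scoring is a plain match/mismatch, and the names and the `x_drop` parameter are hypothetical.

```python
from collections import defaultdict


def word_index(query: str, w: int = 3):
    """Lookup table: every length-w word of the query -> its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def extend_hit(query, subject, qi, si, w, x_drop, match=1, mismatch=-1):
    """Extend a word hit without gaps; stop once the score falls more than
    x_drop below the best seen. Returns (score, q_start, q_end)."""
    def step(q, s):
        return match if query[q] == subject[s] else mismatch

    best = score = w * match
    q_start, q_end = qi, qi + w
    q, s = qi + w, si + w
    while q < len(query) and s < len(subject) and best - score <= x_drop:  # rightwards
        score += step(q, s)
        q, s = q + 1, s + 1
        if score > best:
            best, q_end = score, q
    score = best
    q, s = qi - 1, si - 1
    while q >= 0 and s >= 0 and best - score <= x_drop:                    # leftwards
        score += step(q, s)
        if score > best:
            best, q_start = score, q
        q, s = q - 1, s - 1
    return best, q_start, q_end


def best_hit(query: str, subject: str, w: int = 3, x_drop: int = 2):
    """Scan the subject for word hits and return the best extended hit, or None."""
    index = word_index(query, w)
    hits = [extend_hit(query, subject, qi, si, w, x_drop)
            for si in range(len(subject) - w + 1)
            for qi in index.get(subject[si:si + w], [])]
    return max(hits, default=None)


print(best_hit("AAACGTACGTTT", "CCCCGTACGCCC"))
```

Because only word hits are extended, a similar region that contains no matching word is never examined, which is why the heuristic is fast but cannot guarantee the optimal alignment.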

Page 16

Sequence-profile alignments: sequence profiles describe conserved features with respect to position in a multiple alignment

Multiple alignment:

IDVVVVC
LDLV--C
LDLVFVC
ADIIFLI

Profile (rows: residue types, columns: alignment positions):

     1   2   3   4   5   6   7
A    2  -2  -2  -1  -1  -1  -2
R   -3  -2  -3  -3  -2  -2  -4
N   -3   1  -4  -4  -2  -2  -4
D   -3   7  -4  -4  -3  -3  -4
C   -2  -4  -2  -1  -2  -1   6
...

Gribskov et al, PNAS, 1987;

Schaffer et al, Nucleic Acids Res., 2001
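Scoring a sequence against such a profile means looking up, for each position, the value in the row of the residue found there and summing over positions. The sketch below uses only the five rows visible on the slide, so it accepts only the residues A, R, N, D and C. The function names are hypothetical and gaps are not handled.

```python
# The five profile rows shown on the slide (positions 1-7).
PSSM = {
    "A": [2, -2, -2, -1, -1, -1, -2],
    "R": [-3, -2, -3, -3, -2, -2, -4],
    "N": [-3, 1, -4, -4, -2, -2, -4],
    "D": [-3, 7, -4, -4, -3, -3, -4],
    "C": [-2, -4, -2, -1, -2, -1, 6],
}
WIDTH = 7


def profile_score(window: str) -> int:
    """Sum the position-specific scores of a window as wide as the profile."""
    if len(window) != WIDTH:
        raise ValueError("window must have the same width as the profile")
    return sum(PSSM[residue][col] for col, residue in enumerate(window))


def best_window(seq: str):
    """Slide the profile along the sequence; return (best score, start index)."""
    return max((profile_score(seq[i:i + WIDTH]), i)
               for i in range(len(seq) - WIDTH + 1))


print(profile_score("ADAAAAC"))
print(best_window("RRADAAAACNN"))
```

Unlike a substitution matrix, the same residue scores differently at different positions: D is worth 7 in column 2 but negative everywhere else.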

Page 17

Computational aspects of protein structure

Page 18

Examples of protein architecture

β-sheet with all pairs of strands parallel

β-sheet with all pairs of strands anti-parallel

Architecture refers to the arrangement and orientation of SSEs, but not to the connectivity.

Page 19

Examples of protein topology

Topology refers to the manner in which the SSEs are connected.

Two β-sheets (all parallel) with different topologies.

Page 20

Secondary structures are connected to form motifs.

G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981

Page 21

Supersecondary structure: Greek key motifs

G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981

Page 22

Some supersecondary structure motifs are associated with specific function:

DNA binding motifs.

Helix-turn-helix motif: recognizes a specific palindromic DNA sequence

Zn-finger motif: Zn binds to two Cys and two His; binds in tandems along major groove

Page 23

P-loop motif.

Sequence pattern: G/AxxxxGK(x)S/T

Function: mononucleotide binding
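A sequence pattern of this kind maps naturally onto a regular expression. The sketch below assumes that "x" means any residue, that "G/A" and "S/T" are alternatives at one position, and that "(x)" is an optional residue; that reading of the notation is an assumption. The test sequence is the N-terminal segment of H-Ras, used only as an illustration.

```python
import re

# G/A x x x x G K (x) S/T, with "(x)" read as an optional residue (assumption).
P_LOOP = re.compile(r"[GA].{4}GK.?[ST]")


def find_p_loops(seq: str):
    """Return (start index, matched segment) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in P_LOOP.finditer(seq)]


print(find_p_loops("MTEYKLVVVGAGGVGKSALTIQ"))
```

A pattern match only suggests a candidate site; short patterns like this one also occur by chance in unrelated proteins.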

Page 24

Calcium-binding motif.

Calcium-binding sequence pattern: DxD/NxDxxxE/DxxE

Function: binding of Ca(2+);

calmodulin: Ca-dependent signaling pathways

A. Lewit-Bentley & S. Rety, 2000

Page 25

Protein domains can be defined based on:

• Geometry: a group of residues with high contact density; the number of contacts within domains is higher than the number of contacts between domains.

- chain-continuous domains
- chain-discontinuous domains

• Kinetics: domain as an independently folding unit.

• Physics: domain as a rigid body linked to other domains by flexible linkers.

• Genetics: minimal fragment of gene that is capable of performing a specific function.

Page 26

Domains as recurrent units of proteins.

• The same or similar domains are found in different proteins.

• Each domain has a well determined compact structure and performs a specific function.

• Proteins evolve through duplication and domain shuffling.

• Protein domain classification based on comparing their recurrent sequence, structure and functional features – Conserved Domain Database

Page 27

Conserved Domain Database (CDD).

• Protein domain classification based on comparing their recurrent sequence, structure and functional features – Conserved Domain Database

• CDD represents a collection of multiple sequence alignments corresponding to different protein domains

Page 28

CDD includes a set of multiple sequence alignments.

• Accurate alignments, since structure-structure alignments are reconciled with sequence alignments.

• Block-based alignments.
• Annotated alignments.
• Annotated functionally important sites.

Page 29

PSSMs for each CDD are calculated using observed residue frequencies and relationships between different residue types.

Multiple alignment:

IDVVVVC
LDLV--I
LDLVFVI
ADIIFLI

PSSM (rows: residue types, columns: alignment positions):

     1   2   3   4   5   6   7
A    2  -2  -2  -1  -1  -1  -2
R   -3  -2  -3  -3  -2  -2  -4
N   -3   1  -4  -4  -2  -2  -4
D   -3   7  -4  -4  -3  -3  -4
C   -2  -4  -2  -1  -2  -1   6
...

W(D,3) = log( Q(D,3) / P(D) )

P(D): background probability

Q(D,3): estimated probability for residue “D” to be found in column 3
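The log-odds formula can be tried out on a single alignment column. The sketch below estimates Q from the observed counts plus background-weighted pseudocounts. That estimator, the uniform background, the natural logarithm and the weight `beta` are all assumptions made for illustration; they are not the procedure CDD uses, which also takes relationships between residue types into account.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: a uniform background P(r) = 1/20 for every residue type.
BACKGROUND = {r: 1.0 / len(AMINO_ACIDS) for r in AMINO_ACIDS}


def column_scores(column: str, background=BACKGROUND, beta: float = 1.0):
    """W(r) = log(Q(r) / P(r)) for one alignment column.

    Q(r) = (count(r) + beta * P(r)) / (N + beta), where N is the number of
    non-gap residues in the column. The pseudocounts keep Q above zero for
    residues that were never observed.
    """
    residues = [r for r in column if r != "-"]
    counts = Counter(residues)
    n = len(residues)
    return {r: math.log((counts[r] + beta * p) / (n + beta) / p)
            for r, p in background.items()}


# A fully conserved column of aspartates, as in column 2 of the slide's alignment.
scores = column_scores("DDDD")
print(round(scores["D"], 3), round(scores["A"], 3))
```

The conserved residue gets a positive score and every unobserved residue a negative one, which is the qualitative pattern of the D row in the table above.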

Page 30

How to annotate domains in a protein using CDD?

• To annotate domains in a protein:

- find domain boundaries

- assign function (structure) for each domain

• For each query sequence, perform CD-search.

• CD-search: the query sequence is compared with sequence profiles derived from CDD multiple sequence alignments.

Page 31

Classwork

• Retrieve 1WQ1 from MMDB, look at structural domains and domains annotated by CDD. How different are they?

• Pretend you do not know the structure of 1WQ1, perform the CD-search, annotate domain boundaries.

Page 32

Protein folds.

• Fold definition: two folds are similar if they have a similar arrangement of SSEs (architecture) and connectivity (topology). Sometimes a few SSEs may be missing.

• Fold classification: structural similarity between folds is searched using structure-structure comparison algorithms.

• There is a limited number of folds ~1000 – 3000.

Page 33

Superfolds are the most populated protein folds.

C.Orengo et al, 1994

• There are about 10 types of folds, the superfolds, to which about 30% of the other folds are similar.

• Superfolds are characterized by a wide range of sequence diversity and span a range of non-similar functions.

Page 34

Why are some folds more populated than others?

• Thermodynamic stability?
• Fast folding?
• By chance, through the duplication processes?
• Perform essential functions?
• Symmetrical folds, emerged through gene duplication?
• High supersecondary structure content, higher fraction of local interactions?

Page 35

Distinguishing structural similarity due to common origin versus convergent evolution.

[Figure panels: Divergent evolution, homologs; Convergent evolution, analogs]

Page 36

TIM barrels

• Classified into 21 families in the CATH database.

• Mostly enzymes, but participate in a diverse collection of different biochemical reactions.

• There are intriguing common features across the families, e.g. the active site is always located at the C-terminal end of the barrel.

Catalytic and metal-binding residues aligned in structure-structure alignments

Nagano, C. Orengo and J. Thornton, 2002

Page 37

Functional diversity of TIM-barrels.

Page 38

TIM barrel evolutionary relationships

• Sequence analyses with advanced programs such as PSI-BLAST have identified further relationships among the families.

• Further interesting similarities observed from careful comparison of structures, e.g. a phosphate binding site commonly formed by loops 7, 8 and a small helix.

• In summary, there is evidence for evolutionary relationships between 17 of the 21 families.

Page 39

SCOP (Structural Classification of Proteins)

• http://scop.mrc-lmb.cam.ac.uk/scop/

• Levels of the SCOP hierarchy:
– Family: clear evolutionary relationship
– Superfamily: probable common evolutionary origin
– Fold: major structural similarity
– Class: secondary structure content

Page 40

CATH (Class, Architecture, Topology, Homologous superfamily)

• http://www.biochem.ucl.ac.uk/bsm/cath/

Page 41

Classwork

• Using SCOP and CATH classify four protein structures (1b5t, 1n8i, 1tph and 1hti).

• How different are the classifications produced by SCOP and CATH?

• Can these proteins be considered homologous?