GLOBAL PAIRWISE ALIGNMENT

GLOBAL ALIGNMENT OF: 2 NUCLEOTIDE SEQUENCES OR 2 AMINO-ACID SEQUENCES


Assumptions:

Life is monophyletic.

Biological entities (sequences, taxa) share common ancestry.

Any two organisms share a common ancestor in their past

[Figure: an ancestor and its two descendants, descendant 1 and descendant 2]

[Figure: ancestor (~5 MYA)]

[Figure: ancestor (~120 MYA)]

[Figure: ancestor (~1,500 MYA)]

Homologous sequences

(1) Speciation events
(2) Gene duplication
(3) Duplicative transposition

Homology: A term coined by Richard Owen in 1843.

Definition: Similarity resulting from common ancestry.

Homology

There are three main types of molecular homology: orthology, paralogy (including ohnology), and xenology.

Homology: General Definition

• Homology designates a qualitative relationship of common descent between entities.

• Two genes are either homologous or they are not!
– It doesn’t make sense to say “two genes are 43% homologous.”
– It doesn’t make sense to say “Linda is 43% pregnant.”

Orthology & Paralogy

• Two genes are orthologs if they originated from a single ancestral gene in the most recent common ancestor of their respective genomes

• Two genes are paralogs if they are related by gene duplication. Two genes are ohnologs if they are related by gene duplication due to genome duplication

[Figure legend: the marked symbol = gene death]

Xenology is due to horizontal (lateral) gene transfer (HGT or LGT).

XA and XB are xenologs.

Distinguishing orthologs from xenologs is impossible in pairwise genomic comparisons, but possible when multiple genomes are compared.

Orthology, Paralogy, Xenology (Fitch, Trends in Genetics, 2000. 16(5):227-231)

By comparing homologous characters, we can reconstruct the evolutionary events that have led to the formation of the extant sequences from the common ancestor.

Homology


When comparing sequences, we are interested in POSITIONAL HOMOLOGY. We identify POSITIONAL HOMOLOGY through SEQUENCE ALIGNMENT.

Homology

Alignment: A hypothesis concerning positional homology among residues from two or more sequences.

Positional homology = In pairwise alignment, a pair of nucleotides from two homologous sequences that have descended from one nucleotide in the ancestor of the two sequences.

Sequence alignment involves the identification of the correct location of deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor.

[Figure: unknown ancestral sequence; unknown events and unknown sequence of events in each of the two lineages]

The true alignment is unknown.

There are two modes of alignment.

Global alignment: each residue of sequence A is compared with each residue in sequence B. Global alignment algorithms are used in comparative and evolutionary studies.

Local alignment: Determining if sub-segments of one sequence are present in another. Local alignment methods have their greatest utility in database searching and retrieval (e.g., BLAST).

For reasons of computational complexity, sequence alignment is divided into two categories:

Pairwise alignment (i.e., the alignment of two sequences).

Multiple-sequence alignment (i.e., the alignment of three or more sequences).

Pairwise alignment problems have exact solutions.

Multiple-sequence alignment problems only have approximate (heuristic) solutions.


A pairwise alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:

(1) matches = the same nucleotide appears in both sequences.
(2) mismatches = different nucleotides are found in the two sequences.
(3) gaps = a base in one sequence and a null base in the other.

GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG

- Two DNA sequences: A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- Total number of bases in gaps is z.

There are internal and terminal gaps.

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG

A terminal gap may indicate missing data.


An internal gap indicates that a deletion or an insertion has occurred in one of the two lineages.

When sequences are compared through alignment, it is impossible to tell whether a deletion has occurred in one sequence or an insertion has occurred in the other. Thus, deletions and insertions are collectively referred to as indels (short for insertion or deletion).



The alignment is the first step in many functional and evolutionary studies.

Errors in alignment tend to amplify in later stages of the study.


Motivation for sequence alignment

Function – Similarity may be indicative of similar function.

Evolution – Similarity may be indicative of common ancestry.

Some definitions

Methods of alignment:

1. Manual
2. Dot matrix
3. Distance Matrix
4. Combined (Distance + Manual)

Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.

GCG-TCCATCAGGTAGTTGGTGTG
GCGATCCATCAGGTGGTTGGTGTG

Advantages of manual alignment:

(1) use of a powerful and trainable tool (the brain, well… some brains).

(2) ability to integrate additional data, e.g., domain structure, biological function.

Protein alignment may be guided by secondary and tertiary structures

[Figure: Homo sapiens DjlA protein; Escherichia coli DjlA protein]

Disadvantages of manual alignment:

subjectivity (the algorithm is unspecified)

irreproducibility (the results cannot be independently reproduced)

unscalability (inapplicable to long sequences)

incommensurability (the results cannot be compared to those obtained by other methods)

The dot-matrix method (Gibbs and McIntyre, 1970): The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot-matrix plot at a position where the nucleotides in the two sequences are identical.
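As a rough illustration of the construction just described, the sketch below builds a dot matrix as a list of Boolean rows and prints it as text. The function names `dot_matrix` and `render`, and the two short example sequences, are illustrative assumptions and do not come from the slides.

```python
def dot_matrix(top: str, left: str):
    """Rows follow `left`, columns follow `top`; True marks identical residues."""
    return [[a == b for b in top] for a in left]


def render(top: str, left: str) -> str:
    """Draw the matrix with `top` as column headings and `left` as row headings."""
    lines = ["  " + " ".join(top)]
    for residue, row in zip(left, dot_matrix(top, left)):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(f"{residue} {cells}")
    return "\n".join(lines)


# Hypothetical example sequences.
print(render("GATTC", "GATC"))
```

Each `*` marks a position where the two sequences carry the same nucleotide; runs of `*` along a diagonal correspond to stretches of consecutive matches.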


The alignment is defined by a path from the upper-left element to the lower-right element.

There are 4 possible steps in the path:

(1) a diagonal step through a dot = match.

(2) a diagonal step through an empty element of the matrix = mismatch.

(3) a horizontal step = a gap in the sequence on the left of the matrix.

(4) a vertical step = a gap in the sequence on the top of the matrix.
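The four step types listed above are enough to turn any path into an alignment. The following sketch assumes a path is encoded as a string of the letters D (diagonal), H (horizontal), and V (vertical); that encoding and the function name `path_to_alignment` are illustrative choices, not something the slides define.

```python
def path_to_alignment(top: str, left: str, steps: str):
    """Convert a path through the matrix into two aligned rows.

    top   : sequence written across the top of the matrix (columns)
    left  : sequence written down the left of the matrix (rows)
    steps : 'D' = diagonal, 'H' = horizontal, 'V' = vertical
    """
    i = j = 0                       # i indexes `left`, j indexes `top`
    row_top, row_left = [], []
    for step in steps:
        if step == "D":             # pair one residue from each sequence
            row_top.append(top[j])
            row_left.append(left[i])
            i += 1
            j += 1
        elif step == "H":           # advance along the top only: gap in the left sequence
            row_top.append(top[j])
            row_left.append("-")
            j += 1
        elif step == "V":           # advance along the left only: gap in the top sequence
            row_top.append("-")
            row_left.append(left[i])
            i += 1
        else:
            raise ValueError(f"unknown step {step!r}")
    if i != len(left) or j != len(top):
        raise ValueError("the path must end at the lower-right element")
    return "".join(row_top), "".join(row_left)


aligned_top, aligned_left = path_to_alignment("GAT", "GT", "DHD")
print(aligned_top)
print(aligned_left)
```

Whether a diagonal step is a match or a mismatch depends only on whether the cell it passes through holds a dot, so the same conversion serves both cases.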


A dot matrix may become cluttered. With DNA sequences, ~25% of the elements will be occupied by dots by chance alone.

The number of spurious matches is determined by: window size (how many residues are compared), stringency (the minimum number of matches for a hit), & alphabet size (number of character states). Window size must be an odd number.
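The effect of these three parameters on clutter can be estimated with a short calculation. The sketch below assumes that all character states are equally frequent and that positions are independent (neither assumption is stated on the slides), in which case the chance of a spurious dot is a binomial tail probability. The function name `chance_dot_probability` is hypothetical.

```python
from math import comb


def chance_dot_probability(window: int, stringency: int, alphabet_size: int) -> float:
    """Probability that a window scores a hit by chance alone.

    Assumes equally frequent, independent character states, so each
    position matches with probability 1 / alphabet_size.
    """
    if not 1 <= stringency <= window:
        raise ValueError("stringency must lie between 1 and the window size")
    p = 1 / alphabet_size
    return sum(
        comb(window, k) * p**k * (1 - p) ** (window - k)
        for k in range(stringency, window + 1)
    )


# The three parameter settings that appear on the slides.
for window, stringency, alphabet in [(1, 1, 4), (3, 2, 4), (1, 1, 20)]:
    prob = chance_dot_probability(window, stringency, alphabet)
    print(f"window={window} stringency={stringency} alphabet={alphabet}: {prob:.4f}")
```

Under these assumptions the first setting gives 0.25, in line with the ~25% figure quoted for DNA, while a larger window with higher stringency or a larger alphabet gives a lower chance of a spurious dot.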

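[Figure: dot matrix with window size = 1, stringency = 1, alphabet size = 4]

[Figure: two dot matrices, one with window size = 1, stringency = 1, alphabet size = 4 and one with window size = 3, stringency = 2, alphabet size = 4]

[Figure: dot matrix with window size = 1, stringency = 1, alphabet size = 20]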

Dot-matrix methods:

Advantages: By being a visual representation, and humans being visual animals, the method may unravel information on the evolution of sequences that cannot easily be gleaned from a line alignment.

Disadvantages: May not identify the best possible alignment.

Advantages: Highlighting Information

The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.

The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.

Window size = 60 amino acids; Stringency = 24 matches

Disadvantages:

Not possible to identify the best alignment.


Scoring Matrices & Gap Penalties

The true alignment between two sequences is the one that reflects accurately the evolutionary relationships between the sequences.

Since the true alignment is unknown, in practice we look for the optimal alignment, which is the one in which the numbers of mismatches and gaps are minimized according to certain criteria.


Unfortunately, reducing the number of mismatches results in an increase in the number of gaps, and vice versa.

[Figure legend: matches, mismatches, nucleotides in gaps, gaps]

The scoring scheme comprises a gap penalty and a scoring matrix, M(a,b), that specifies the score for each type of match (a = b) or mismatch (a ≠ b).

The units in a scoring matrix may be the nucleotides in the DNA or RNA sequences, the codons in protein-coding regions, or the amino acids in protein sequences.


DNA scoring matrices are usually simple. In the simplest scheme all mismatches are given the same penalty.

M(a,b) is positive if a = b and negative otherwise.

In more complicated matrices a distinction may be made between transition and transversion mismatches or each type of mismatch may be penalized differently.

$$M(a,b)\;\begin{cases} > 0 & \text{if } a = b \\ < 0 & \text{if } a \neq b \end{cases}$$
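The two kinds of DNA scoring matrix described above can be sketched as small functions. The numerical scores used here (+1, –1, –2) are arbitrary illustrative values, not values given in the slides, and the function names are hypothetical. A transition is taken to be a purine-purine (A, G) or pyrimidine-pyrimidine (C, T) mismatch; every other mismatch is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a: str, b: str) -> bool:
    """True for A<->G or C<->T mismatches; False for matches and transversions."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def simple_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Simplest scheme: every mismatch receives the same penalty."""
    return match if a == b else mismatch


def ti_tv_score(a: str, b: str, match: int = 1,
                transition: int = -1, transversion: int = -2) -> int:
    """Scheme that penalizes transversions more heavily than transitions."""
    if a == b:
        return match
    return transition if is_transition(a, b) else transversion


# Print the second scheme as a 4 x 4 table.
bases = "ACGT"
print("   " + "  ".join(f"{b:>2}" for b in bases))
for a in bases:
    print(f"{a}  " + "  ".join(f"{ti_tv_score(a, b):>2}" for b in bases))
```

Both functions satisfy the rule that M(a,b) is positive when a = b and negative otherwise, and both are symmetric in their two arguments.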


Further complications: Distinguishing among different matches and mismatches.

For example, a mismatched pair consisting of Leu & Ile, which are very similar biochemically to each other, may be given a lesser penalty than a mismatched pair consisting of Arg & Glu, which are very dissimilar from each other.

[Figure: the Leu/Ile mismatch receives a lesser penalty than the Arg/Glu mismatch]

BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)

B = asx (asp or asn); Z = glx (glu or gln); X = unknown; * = termination codon

The matrix is symmetrical

Positive numbers on the diagonal

Mismatches are usually penalized

Some mismatches are not penalized

A few mismatches are even rewarded

Gap penalty (or cost) is a factor (or a set of factors) by which the gap values (numbers and lengths of gaps) are mathematically manipulated to make the gaps equivalent in value to the mismatches.

The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions.

[Figure: mismatches versus gaps]

The gap penalty has two components: a gap-opening penalty and a gap-extension penalty.


Three main gap-penalty systems:

(1) Fixed gap-penalty system = 0 gap-extension costs.

(2) Linear gap-penalty system = the gap-extension cost is calculated by multiplying the gap length minus 1 by a constant representing the gap-extension penalty for increasing the gap by 1.

(3) Logarithmic gap-penalty system = the gap-extension penalty increases with the logarithm of the gap length, i.e., more slowly than in the linear system.
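The three systems can be compared by writing each as a function of gap length. This is a sketch under stated assumptions: penalties are expressed as positive costs, the function names are hypothetical, and the logarithmic form (opening cost plus a constant times the natural logarithm of the length) is only one possible reading of "increases with the logarithm of the gap length".

```python
from math import log


def fixed_gap_cost(length: int, opening: float) -> float:
    """Fixed system: only the opening penalty is charged, whatever the length."""
    _check(length)
    return opening


def linear_gap_cost(length: int, opening: float, extension: float) -> float:
    """Linear system: opening penalty plus (length - 1) extension penalties."""
    _check(length)
    return opening + (length - 1) * extension


def logarithmic_gap_cost(length: int, opening: float, extension: float) -> float:
    """Logarithmic system (assumed form): opening + extension * ln(length)."""
    _check(length)
    return opening + extension * log(length)


def _check(length: int) -> None:
    if length < 1:
        raise ValueError("a gap has a length of at least 1")


# Illustrative parameter values only.
for k in (1, 2, 5, 20):
    print(k,
          fixed_gap_cost(k, 4),
          linear_gap_cost(k, 4, 1),
          round(logarithmic_gap_cost(k, 4, 1), 2))
```

All three agree for a gap of length 1; for longer gaps the linear cost grows fastest, the logarithmic cost grows more slowly, and the fixed cost does not grow at all.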


Alignment algorithms


Aim: Given a predetermined set of criteria, find the alignment associated with the best score from among all possible alignments.

The OPTIMAL ALIGNMENT


The number of possible alignments may be astronomical.

$$\binom{n+m}{\min(n,m)} = \frac{(n+m)!}{n!\,m!} \approx \sqrt{\frac{n+m}{2\pi nm}}\;\frac{(n+m)^{n+m}}{n^{n}\,m^{m}}$$

where n and m are the lengths of the two sequences to be aligned.

For example, when two DNA sequences 200 residues long each are compared, there are more than 10^153 possible alignments.

In comparison, the number of protons in the universe is only ~10^80.

FORTUNATELY:

There are computer algorithms for finding the optimal alignment between two sequences that do not require an exhaustive search of all the possibilities.

The Needleman-Wunsch (1970) algorithm uses Dynamic Programming

Dynamic programming = a computational technique. It is applicable when large searches can be divided into a succession of small stages, such that (1) the solution of the initial search stage is trivial, (2) each partial solution in a later stage can be calculated by reference to only a small number of solutions in an earlier stage, and (3) the last stage contains the overall solution.


Dynamic programming can be applied to problems of alignment because ALIGNMENT SCORES obey the following rules:

$$S_{1 \to x,\,1 \to y} + S_{x+1,\,y+1} = S_{1 \to x+1,\,1 \to y+1}$$

Path graph for aligning two sequences

[Figures: steps through the path graph that are allowed and steps that are not allowed]

Scoring scheme

match = +5, mismatch = –3, gap-opening penalty = –4, gap-extension penalty = 0

Matrix initialization

[Figures: successive initialization steps, annotated 0 + match = 5, 0 + gap = –4, and 0 + gap = –4]

Matrix fill

[Figures: successive fill steps, annotated 0 + match = 5, 5 + gap = 1, and 0 + gap = –4, … and so on and so forth]

Complete matrix fill

Trace back

The alignment is produced by starting either at the highest score in the rightmost column or the bottom row (when terminal gaps are allowed) or at the bottom rightmost cell (when they are not), and proceeding from right to left by following the best pointers.

This stage is called the traceback. The graph of pointers in the traceback is also referred to as the path graph because it defines the paths through the matrix that correspond to the optimal alignment or alignments.
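The choice of starting cell for the traceback can be written as a small helper. This sketch assumes that the "highest score in the rightmost column or bottom row" rule applies when terminal gaps are allowed and that the bottom rightmost cell is used otherwise, which is one reading of the slide headings. The function name `traceback_start` and the toy matrix are hypothetical.

```python
def traceback_start(matrix, allow_terminal_gaps: bool):
    """Return the (row, column) cell at which the traceback begins."""
    last_row = len(matrix) - 1
    last_col = len(matrix[0]) - 1
    if not allow_terminal_gaps:
        # Both sequences must be aligned to their very ends.
        return last_row, last_col
    # Otherwise consider every cell in the bottom row and the rightmost column.
    candidates = [(last_row, j) for j in range(last_col + 1)]
    candidates += [(i, last_col) for i in range(last_row + 1)]
    return max(candidates, key=lambda cell: matrix[cell[0]][cell[1]])


# Toy score matrix, for illustration only.
toy = [
    [0, -4, -8],
    [-4, 5, 1],
    [-8, 9, 6],
]
print(traceback_start(toy, allow_terminal_gaps=True))
print(traceback_start(toy, allow_terminal_gaps=False))
```

In the toy matrix the best score on the border (9) is not in the corner, so the two policies start from different cells and can lead to different alignments.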

Trace back (if we DO allow terminal gaps)

Trace back (if we DO NOT allow terminal gaps)

At each cell, the pointer follows the neighbouring cell whose score plus the step value equals the score of the cell:

10 + gap ≠ 11; 14 + mismatch = 11; 10 + gap ≠ 11
10 + gap ≠ 14; 9 + match = 14; 5 + gap ≠ 14
4 + mismatch ≠ 9; 13 + gap = 9; 0 + gap ≠ 9
8 + match = 13; 4 + gap ≠ 13; 9 + gap ≠ 13
–1 + gap ≠ 8; 12 + gap = 8; 3 + match = 8
7 + gap = 3; –6 + gap ≠ 3; –2 + mismatch ≠ 3
7 + gap ≠ 12; 7 + match = 12; 3 + gap ≠ 12
…

Trace back (complete)

high road/low road/middle road

Two possible alignments:

GAATTCAGT
GGA-TC-GA
* * ** *

GAATTCAGT
GGAT-C-GA
* ** * *
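The fill and traceback stages walked through above can be sketched in a few lines. The two sequences, GAATTCAGT and GGATCGA, are read off the alignments shown on the slide. The sketch makes one simplifying assumption: it charges –4 for every gap position instead of modelling separate opening and extension penalties, which gives the same result here because every gap in the optimal alignments has length 1. It also penalizes terminal gaps, so the traceback starts at the bottom rightmost cell. The function name `needleman_wunsch` is a label for this sketch, not an API from any library.

```python
def needleman_wunsch(a: str, b: str, match: int = 5, mismatch: int = -3, gap: int = -4):
    """Global alignment of `a` (columns) and `b` (rows); returns (score, row_a, row_b)."""
    rows, cols = len(b) + 1, len(a) + 1

    # Initialization: the first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        score[i][0] = i * gap

    # Fill: each cell depends only on its three neighbours (diagonal, up, left).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[j - 1] == b[i - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # diagonal: match or mismatch
                score[i - 1][j] + gap,        # vertical: gap in `a`
                score[i][j - 1] + gap,        # horizontal: gap in `b`
            )

    # Traceback from the bottom rightmost cell, preferring diagonal steps on ties.
    i, j = rows - 1, cols - 1
    row_a, row_b = [], []
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[j - 1] == b[i - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[j - 1])
            row_b.append(b[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and score[i][j] == score[i][j - 1] + gap:
            row_a.append(a[j - 1])
            row_b.append("-")
            j -= 1
        else:
            row_a.append("-")
            row_b.append(b[i - 1])
            i -= 1
    return score[-1][-1], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("GAATTCAGT", "GGATCGA")
print(best)
print(top)
print(bottom)
```

The score in the bottom rightmost cell is 11, the same value that appears in the traceback annotations. Only one of the equally good alignments is returned, because ties are broken in a fixed order; following the other pointers would recover the alternatives.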


Scoring Matrices

Mismatch and gap penalties should be inversely proportional to the frequencies with which changes occur.

          To A           To T           To C           To G           Row totals
From A    -              3.4 ± 0.7      4.5 ± 0.8      12.5 ± 1.1     20.3
                         (3.6 ± 0.7)    (4.8 ± 0.9)    (13.3 ± 1.1)   (21.6)
From T    3.3 ± 0.6      -              13.8 ± 1.9     3.3 ± 0.6      20.4
          (3.5 ± 0.6)                   (14.7 ± 2.0)   (3.5 ± 0.6)    (21.7)
From C    4.2 ± 0.5      20.7 ± 1.3     -              4.6 ± 0.6      29.5
          (4.2 ± 0.5)    (16.4 ± 1.3)                  (4.4 ± 0.6)    (25.1)
From G    20.4 ± 1.4     4.4 ± 0.6      4.9 ± 0.7      -              29.7
          (21.9 ± 1.5)   (4.6 ± 0.6)    (5.2 ± 0.8)                   (31.6)
Column    27.9           28.5           23.2           20.5
totals    (29.5)         (24.6)         (23.2)         (21.3)

Transitions (68%) occur more frequently than transversions (32%).

Mismatch penalties for transitions should be smaller than those for transversions.

Empirical substitution matrices

PAM (Percent/Point Accepted Mutation)

BLOSUM (BLOcks SUbstitution Matrix)


PAM

• Developed by Margaret Dayhoff in 1978.

• Based on comparisons of very similar protein sequences.


Log-odds ratios

• A scoring matrix is a table of values that describe the probability of a residue (amino acid or base) pair occurring in an alignment.

• The values in a scoring matrix are log ratios of two probabilities. One is the probability of the pair occurring at random. The other is the empirically observed probability of the pair occurring.

• Because the scores are logarithms of probability ratios, they can be added to give a meaningful score for the entire alignment. The more positive the score, the better the alignment!

The PAM matrices (Percent accepted mutations)

• Align sequences that are at least 85% identical.
– Minimizes ambiguity in alignments and the number of coincident mutations.

• Reconstruct phylogenetic trees and infer ancestral sequences.

• Tally replacements "accepted" by natural selection, in all pairwise comparisons.
– Meaning, the number of times j was replaced by i in all comparisons.

• Compute amino acid mutability (i.e., the propensity of a given amino acid, j, to be replaced).

The PAM matrices

• Combine data to produce a Mutation Probability Matrix for one PAM of evolutionary distance, which is used to calculate the Log Odds Matrix for similarity scoring.

• Thus, depending on the protein family used, various PAM matrices result, some of which are “good” at locating evolutionarily distant conserved mutations and some of which are good at locating evolutionarily close conserved mutations.

More on log-odds ratios

In PAM, log-odds scores are multiplied by 10 to avoid decimals. Therefore, a PAM score of 2 actually corresponds to a log-odds ratio of 0.2.

0.2 = substitution score (i to j) = log10 { (observed i-to-j mutation rate) / (expected rate) }

The value 0.2 is log10 of the relative expectation value of the mutation. Therefore, the expectation value is 10^0.2 ≈ 1.6.

So, a PAM score of 2 indicates that (in related sequences) the mutation would be expected to occur 1.6 times as frequently as at random.
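The arithmetic above can be checked in both directions with a short sketch. The function names `pam_score` and `odds_ratio` are hypothetical, and rounding the scaled logarithm to the nearest integer is an assumption about how the decimals are removed.

```python
from math import log10


def pam_score(observed_rate: float, expected_rate: float) -> int:
    """Log-odds score scaled by 10 and rounded to an integer (assumed rounding)."""
    return round(10 * log10(observed_rate / expected_rate))


def odds_ratio(score: float) -> float:
    """Invert the scaling: how many times more often than chance the pair occurs."""
    return 10 ** (score / 10)


print(round(odds_ratio(2), 2))    # ratio implied by a score of 2
print(pam_score(1.6, 1.0))        # score implied by a ratio of 1.6
print(round(odds_ratio(-2), 2))   # a negative score implies a ratio below 1
```

Because the scores are logarithms, adding the scores of the individual columns of an alignment corresponds to multiplying their odds ratios.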


PAM250

– Calculated for families of related proteins (>85% identity)

– 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues

– A positive score signifies a common replacement whereas a negative score signifies an unlikely replacement

– PAM250 matrix assumes/is optimized for sequences separated by 250 PAM, i.e. 250 substitutions in 100 amino acids (longer evolutionary time)

PAM250 is a sequence alignment matrix that allows 250 accepted point mutations per 100 amino acids. It is suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing more closely related sequences.

Selecting a PAM Matrix

• Low PAM numbers: short sequences, strong local similarities.

• High PAM numbers: long sequences, weak similarities.
– PAM60 for close relations (60% identity)
– PAM120 recommended for general use (40% identity)
– PAM250 for distant relations (20% identity)

• If uncertain, try several different matrices.
– PAM40, PAM120, PAM250 recommended.

BLOSUM

• Blocks Substitution Matrix
– Steven Henikoff and Jorja G. Henikoff (1992).

• Based on BLOCKS database (www.blocks.fhcrc.org)
– Families of proteins with identical function.
– Highly conserved protein domains.

• Ungapped local alignment to identify motifs
– Each motif is a block of local alignment.
– Counts amino acids observed in same column.
– Symmetrical model of substitution.

BLOSUM62

• BLOSUM matrices are based on local alignments (“blocks” or conserved amino acid patterns).

• BLOSUM 62 is a matrix calculated from comparisons of sequences with no more than 62% identity.

• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

• BLOSUM 62 is the default matrix in BLAST 2.0.


BLOSUM Matrices

• Different BLOSUMn matrices are calculated independently from BLOCKS

• BLOSUMn is based on sequences that are at most n percent identical.


BLOSUM62

The procedure for calculating a BLOSUM matrix is based on a likelihood method estimating the occurrence of each possible pairwise substitution. Only aligned blocks are used to calculate the BLOSUMs.

The higher the score, the more closely related the sequences.

Why is BLOSUM62 called BLOSUM62?

Because, within each block, all members that shared at least 62% identity with ANY other member of that block were averaged and represented as 1 sequence.
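The grouping step in this explanation can be illustrated with a small clustering sketch. It covers only the grouping of block members that share at least 62% identity with any other member (single-linkage clustering); the subsequent averaging of each group into one sequence is not shown. The function names and the toy ungapped segments are hypothetical and are not taken from the BLOCKS database.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length, ungapped segments."""
    if len(a) != len(b):
        raise ValueError("block segments must have equal length")
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_block(segments, threshold: float = 62.0):
    """Group segments so that each one shares >= threshold identity with
    at least one other member of its group (single linkage)."""
    clusters = []
    for segment in segments:
        merged = [segment]
        remaining = []
        for cluster in clusters:
            if any(percent_identity(segment, other) >= threshold for other in cluster):
                merged.extend(cluster)      # linked to this cluster: merge it in
            else:
                remaining.append(cluster)   # not linked: leave it untouched
        clusters = remaining + [merged]
    return clusters


# Toy block of four ungapped segments.
block = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAATTTT", "CCCCCCCCCC"]
for group in cluster_block(block):
    print(group)
```

In the toy block the third segment is only 60% identical to the first, yet it joins the same group because it is 70% identical to the second. That is the sense of "ANY other member" in the explanation above.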


Selecting a BLOSUM Matrix

• For BLOSUMn, higher n is suitable for sequences that are more similar.
– BLOSUM62 recommended for general use
– BLOSUM80 for close relations
– BLOSUM45 for distant relations

Equivalent PAM and Blosum matrices

The following matrices are roughly equivalent (listed from less divergent to more divergent):

• PAM100 ==> Blosum90
• PAM120 ==> Blosum80
• PAM160 ==> Blosum60
• PAM200 ==> Blosum52
• PAM250 ==> Blosum45

Generally speaking:

• The Blosum matrices are best for detecting local alignments.
• The Blosum62 matrix is the best for detecting the majority of weak protein similarities.
• The Blosum45 matrix is the best for detecting long and weak alignments.

Comparison of PAM250 and BLOSUM62

The relationship between BLOSUM and PAM substitution matrices:

BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences.

BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins.

If distant relatives of the query sequence are specifically being sought, the matrix can be tailored to that type of search.


Scoring matrices commonly used

• PAM250
– Shown to be appropriate for searching for sequences of 17-27% identity.

• BLOSUM62
– Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships.

• BLOSUM50
– Shown to be better for FASTA searches.

Effect of gap penalties on amino-acid alignment: human pancreatic hormone precursor versus chicken pancreatic hormone

(a) Penalty for gaps is 0.
(b) Penalty for a gap of size k nucleotides is w_k = 1 + 0.1k.
(c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids.

Alignments: things to keep in mind

“Optimal alignment” means “having the highest possible score, given a substitution matrix and a set of gap penalties”

This is NOT necessarily the most meaningful alignment

The assumptions of the algorithm are often wrong:

- substitutions are not equally frequent at all positions,

- it is very difficult to realistically model insertions and deletions.

Pairwise alignment programs ALWAYS produce an alignment (even when it does not make sense to align sequences)
