GLOBAL PAIRWISE ALIGNMENT
GLOBAL ALIGNMENT OF: 2 NUCLEOTIDE SEQUENCES OR 2 AMINO-ACID SEQUENCES
Assumptions:
Life is monophyletic.
Biological entities (sequences, taxa) share common ancestry.
Any two organisms share a common ancestor in their past
[Figure: an ancestor giving rise to descendant 1 and descendant 2; examples with ancestors dated ~5 MYA, ~120 MYA, and ~1,500 MYA]
Homologous sequences
(1) Speciation events
(2) Gene duplication
(3) Duplicative transposition
Homology: A term coined by Richard Owen in 1843.
Definition: Similarity resulting from common ancestry.
Homology
There are three main types of molecular homology: orthology, paralogy (including ohnology), and xenology.
Homology: General Definition
• Homology designates a qualitative relationship of common descent between entities.
• Two genes are either homologous or they are not!
– It doesn’t make sense to say “two genes are 43% homologous.”
– It doesn’t make sense to say “Linda is 43% pregnant.”
Orthology & Paralogy
• Two genes are orthologs if they originated from a single ancestral gene in the most recent common ancestor of their respective genomes
• Two genes are paralogs if they are related by gene duplication.
• Two genes are ohnologs if they are related by gene duplication due to genome duplication.
[Figure legend: marked symbol = gene death]
Xenology is due to horizontal (lateral) gene transfer (HGT or LGT).
XA and XB are xenologs.
Distinguishing orthologs from xenologs is impossible in pairwise genomic comparisons, but possible when multiple genomes are compared.
Orthology, Paralogy, Xenology (Fitch, Trends in Genetics, 2000, 16(5):227-231)
By comparing homologous characters, we can reconstruct the evolutionary events that have led to the formation of the extant sequences from the common ancestor.
When comparing sequences, we are interested in POSITIONAL HOMOLOGY. We identify POSITIONAL HOMOLOGY through SEQUENCE ALIGNMENT.
Alignment: A hypothesis concerning positional homology among residues from two or more sequences.
Positional homology = In pairwise alignment, a pair of nucleotides from two homologous sequences that have descended from one nucleotide in the ancestor of the two sequences.
Sequence alignment involves the identification of the correct location of deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor.
[Figure: the ancestor is an unknown sequence; each of the two lineages has undergone unknown events in an unknown sequence of events]
The true alignment is unknown.
There are two modes of alignment.
Global alignment: each residue of sequence A is compared with each residue in sequence B. Global alignment algorithms are used in comparative and evolutionary studies.
Local alignment: Determining if sub-segments of one sequence are present in another. Local alignment methods have their greatest utility in database searching and retrieval (e.g., BLAST).
For reasons of computational complexity, sequence alignment is divided into two categories:
Pairwise alignment (i.e., the alignment of two sequences).
Multiple-sequence alignment (i.e., the alignment of three or more sequences).
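Pairwise alignment problems have exact solutions that can be computed efficiently.
Multiple-sequence alignment problems are, in practice, solved only approximately (heuristically), because exact solutions become computationally infeasible as the number of sequences grows.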
A pairwise alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:
(1) matches = the same nucleotide appears in both sequences.
(2) mismatches = different nucleotides are found in the two sequences.
(3) gaps = a base in one sequence and a null base in the other.
GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
- Two DNA sequences: A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- Total number of bases in gaps is z.
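The quantities x, y, and z can be tallied directly from the two rows of an alignment. The following is a minimal sketch; the function name `alignment_counts` is hypothetical, and the example rows are an alignment used elsewhere in these slides. Because every matched or mismatched pair uses one base from each sequence and every gap position uses one base from only one sequence, the counts satisfy m + n = 2(x + y) + z.

```python
def alignment_counts(row_a, row_b):
    """Count matched pairs (x), mismatched pairs (y), and bases in gaps (z)."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    x = y = z = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            z += 1          # a base opposite a null base
        elif a == b:
            x += 1          # match
        else:
            y += 1          # mismatch
    return x, y, z


row_a = "GCGG-CCATCAGGTAGTTGGTG--"
row_b = "GCGTTCCATC--CTGGTTGGTGTG"
x, y, z = alignment_counts(row_a, row_b)

# Ungapped sequence lengths m and n
m = len(row_a.replace("-", ""))
n = len(row_b.replace("-", ""))
print(x, y, z, m, n)  # 16 3 5 21 22
```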
There are internal and terminal gaps.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
A terminal gap may indicate missing data.
An internal gap indicates that a deletion or an insertion has occurred in one of the two lineages.
When sequences are compared through alignment, it is impossible to tell whether a deletion has occurred in one sequence or an insertion has occurred in the other. Thus, deletions and insertions are collectively referred to as indels (short for insertion or deletion).
The alignment is the first step in many functional and evolutionary studies.
Errors in alignment tend to amplify in later stages of the study.
Motivation for sequence alignment
Function – Similarity may be indicative of similar function.
Evolution – Similarity may be indicative of common ancestry.
Some definitions
Methods of alignment:
1. Manual
2. Dot matrix
3. Distance matrix
4. Combined (Distance + Manual)
Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.
GCG-TCCATCAGGTAGTTGGTGTG
GCGATCCATCAGGTGGTTGGTGTG
Advantages of manual alignment:
(1) use of a powerful and trainable tool (the brain, well… some brains).
(2) ability to integrate additional data, e.g., domain structure, biological function.
Protein alignment may be guided by secondary and tertiary structures.
[Figure labels, in order: Homo sapiens, DjlA protein, Escherichia coli, DjlA protein]
Disadvantages of manual alignment:
subjectivity (the algorithm is unspecified)
irreproducibility (the results cannot be independently reproduced)
unscalability (inapplicable to long sequences)
incommensurability (the results cannot be compared to those obtained by other methods)
The dot-matrix method (Gibbs and McIntyre, 1970): The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot-matrix plot at a position where the nucleotides in the two sequences are identical.
The alignment is defined by a path from the upper-left element to the lower-right element.
There are 4 possible steps in the path:
(1) a diagonal step through a dot = match.
(2) a diagonal step through an empty element of the matrix = mismatch.
(3) a horizontal step = a gap in the sequence on the left of the matrix.
(4) a vertical step = a gap in the sequence on the top of the matrix.
A dot matrix may become cluttered. With DNA sequences, ~25% of the elements will be occupied by dots by chance alone (with four equally frequent nucleotides, two randomly chosen bases are identical with probability 1/4).
The number of spurious matches is determined by: window size (how many residues are compared), stringency (the minimum number of matches for a hit), and alphabet size (number of character states). Window size must be an odd number.
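The dot-matrix construction with a window and a stringency threshold can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `dot_matrix` and `render` are hypothetical, the window is assumed to be centred on each cell, and cells too close to an edge to hold a full window are simply left empty.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return rows of booleans; dots[i][j] is True where a dot would be drawn.

    seq_a labels the rows and seq_b labels the columns. A dot is placed when
    at least `stringency` of the `window` residues centred on (i, j) are
    identical in the two sequences.
    """
    if window % 2 == 0:
        raise ValueError("window size must be an odd number")
    half = window // 2
    dots = [[False] * len(seq_b) for _ in seq_a]
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            hits = sum(seq_a[i + k] == seq_b[j + k] for k in range(-half, half + 1))
            dots[i][j] = hits >= stringency
    return dots


def render(dots):
    """Draw the matrix as text, '*' for a dot and '.' for an empty element."""
    return "\n".join("".join("*" if d else "." for d in row) for row in dots)


print(render(dot_matrix("GATTACA", "GATTACA")))
print(render(dot_matrix("GATTACA", "GATTACA", window=3, stringency=2)))
```

Raising the window size and stringency removes isolated dots while keeping diagonal runs, which is the point of the filtering.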
[Figures: dot matrices computed with (a) window size = 1, stringency = 1, alphabet size = 4; (b) window size = 3, stringency = 2, alphabet size = 4; (c) window size = 1, stringency = 1, alphabet size = 20]
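How strongly these three parameters affect clutter can be estimated with a binomial calculation. The sketch below assumes that all character states are equally frequent and that positions are independent, which real sequences only approximate; the function name `chance_hit_probability` is hypothetical.

```python
from math import comb


def chance_hit_probability(window, stringency, alphabet_size):
    """Probability that two random windows share at least `stringency` identities."""
    p = 1.0 / alphabet_size  # chance that two random residues are identical
    return sum(
        comb(window, k) * p**k * (1 - p) ** (window - k)
        for k in range(stringency, window + 1)
    )


for w, s, a in [(1, 1, 4), (3, 2, 4), (1, 1, 20)]:
    print(w, s, a, round(chance_hit_probability(w, s, a), 4))
```

Under these assumptions the expected fraction of spurious dots falls from 0.25 (window 1, stringency 1, DNA) to about 0.156 (window 3, stringency 2, DNA) and to 0.05 for a 20-letter amino-acid alphabet.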
Dot-matrix methods:
Advantages: By being a visual representation, and humans being visual animals, the method may unravel information on the evolution of sequences that cannot easily be gleaned from a line alignment.
Disadvantages: May not identify the best possible alignment.
Advantages: Highlighting Information
The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.
The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.
(Both plots: window size = 60 amino acids; stringency = 24 matches.)
Disadvantages:
Not possible to identify the best alignment.
Scoring Matrices & Gap Penalties
The true alignment between two sequences is the one that reflects accurately the evolutionary relationships between the sequences.
Since the true alignment is unknown, in practice we look for the optimal alignment, which is the one in which the numbers of mismatches and gaps are minimized according to certain criteria.
Unfortunately, reducing the number of mismatches results in an increase in the number of gaps, and vice versa.
[Figure legend: symbols denote matches, mismatches, nucleotides in gaps, and gaps]
The scoring scheme comprises a gap penalty and a scoring matrix, M(a,b), that specifies the score for each type of match (a = b) or mismatch (a ≠ b).
The units in a scoring matrix may be the nucleotides in the DNA or RNA sequences, the codons in protein-coding regions, or the amino acids in protein sequences.
DNA scoring matrices are usually simple. In the simplest scheme all mismatches are given the same penalty.
M(a,b) is positive if a = b and negative otherwise.
In more complicated matrices a distinction may be made between transition and transversion mismatches or each type of mismatch may be penalized differently.
$$M(a,b) > 0 \ \text{if } a = b, \qquad M(a,b) < 0 \ \text{if } a \neq b$$
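A DNA scoring function that distinguishes transitions from transversions might look like the sketch below. The numerical values (+5, –1, –3) and the name `dna_score` are illustrative assumptions, not values prescribed by these slides; the only property taken from the text is that matches score positive, mismatches negative, and the two kinds of mismatch may be penalized differently.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(a, b, match=5, transition=-1, transversion=-3):
    """Return M(a, b) for two nucleotides (hypothetical score values)."""
    if a == b:
        return match
    # A transition stays within a chemical class (purine <-> purine or
    # pyrimidine <-> pyrimidine); a transversion crosses classes.
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    return transversion


print(dna_score("A", "A"), dna_score("A", "G"), dna_score("A", "C"))  # 5 -1 -3
```

Setting `transition` equal to `transversion` recovers the simplest scheme, in which all mismatches receive the same penalty.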
Further complications: Distinguishing among different matches and mismatches.
For example, a mismatched pair consisting of Leu & Ile, which are very similar biochemically to each other, may be given a lesser penalty than a mismatched pair consisting of Arg & Glu, which are very dissimilar from each other.
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
B = Asx (Asp or Asn); Z = Glx (Glu or Gln); X = unknown; * = termination codon
The matrix is symmetrical.
Positive numbers on the diagonal.
Mismatches are usually penalized.
Some mismatches are not penalized.
A few mismatches are even rewarded.
Gap penalty (or cost) is a factor (or a set of factors) by which the gap values (numbers and lengths of gaps) are mathematically manipulated to make the gaps equivalent in value to the mismatches.
The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions.
The gap penalty has two components: a gap-opening penalty and a gap-extension penalty.
Three main gap-penalty systems:
(1) Fixed gap-penalty system = 0 gap-extension costs.
(2) Linear gap-penalty system = the gap-extension cost is calculated by multiplying the gap length minus 1 by a constant representing the gap-extension penalty for increasing the gap by 1.
(3) Logarithmic gap-penalty system = the gap-extension penalty increases with the logarithm of the gap length, i.e., more slowly.
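The three systems can be compared side by side as functions of gap length. In the sketch below the costs are expressed as positive numbers to be subtracted from the alignment score; the constants (opening = 4, extension = 1), the use of the natural logarithm, and the function names are assumptions made only for illustration.

```python
import math


def fixed_gap_cost(length, opening=4.0):
    """Fixed system: the gap-opening penalty only; extension costs nothing."""
    return opening


def linear_gap_cost(length, opening=4.0, extension=1.0):
    """Linear system: opening penalty plus (length - 1) extension steps."""
    return opening + (length - 1) * extension


def logarithmic_gap_cost(length, opening=4.0, extension=1.0):
    """Logarithmic system: the extension term grows with log(length)."""
    return opening + extension * math.log(length)


for k in (1, 2, 10, 100):
    print(k, fixed_gap_cost(k), linear_gap_cost(k), round(logarithmic_gap_cost(k), 2))
```

All three agree for a gap of length 1 and diverge as the gap grows, with the logarithmic cost rising far more slowly than the linear one.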
Alignment algorithms
Aim: Given a predetermined set of criteria, find the alignment associated with the best score from among all possible alignments: the OPTIMAL ALIGNMENT.
The number of possible alignments may be astronomical.
$$\binom{n+m}{\min(n,m)} = \frac{(n+m)!}{n!\,m!} \approx \sqrt{\frac{n+m}{2\pi n m}}\;\frac{(n+m)^{n+m}}{n^{n}\,m^{m}}$$
where n and m are the lengths of the two sequences to be aligned.
For example, when two DNA sequences 200 residues long each are compared, there are more than $10^{153}$ possible alignments.
In comparison, the number of protons in the universe is only ~$10^{80}$.
FORTUNATELY:
There are computer algorithms for finding the optimal alignment between two sequences that do not require an exhaustive search of all the possibilities.
The Needleman-Wunsch (1970) algorithm uses Dynamic Programming.
Dynamic programming = a computational technique. It is applicable when large searches can be divided into a succession of small stages, such that (1) the solution of the initial search stage is trivial, (2) each partial solution in a later stage can be calculated by reference to only a small number of solutions in an earlier stage, and (3) the last stage contains the overall solution.
Dynamic programming can be applied to problems of alignment because ALIGNMENT SCORES obey the following rules:
$$S_{1 \to x,\,1 \to y} + S_{x+1,\,y+1} = S_{1 \to x+1,\,1 \to y+1}$$
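Because alignment scores are additive in this way, the best score for any pair of prefixes can be built from the best scores of shorter prefixes. A common way to write this, given here as a sketch with assumed notation ($F(i,j)$ for the best score of aligning the first $i$ residues of sequence $a$ with the first $j$ residues of sequence $b$, and $g$ for a negative per-position gap penalty, neither of which appears on the slides), is:

```latex
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + M(a_i, b_j) & \text{(match or mismatch)} \\
F(i-1,\,j) + g & \text{(gap in sequence } b\text{)} \\
F(i,\,j-1) + g & \text{(gap in sequence } a\text{)}
\end{cases}
\qquad F(0,0) = 0,\quad F(i,0) = i\,g,\quad F(0,j) = j\,g
```

Each cell depends on only three previously computed cells, which is what makes the stage-by-stage computation possible.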
Path graph for aligning two sequences
[Figures: examples of allowed and not allowed paths]
Scoring scheme
match = +5, mismatch = –3, gap-opening penalty = –4, gap-extension penalty = 0
Matrix initialization
0 + match = 5
0 + gap = –4
0 + gap = –4
Matrix fill
0 + match = 5
5 + gap = 1
0 + gap = –4
… and so on and so forth
Complete matrix fill
Trace back
The alignment is produced by starting either at the highest score in the rightmost column or the bottom row (if terminal gaps are allowed) or at the bottom-right cell (if terminal gaps are not allowed), and then proceeding from right to left by following the best pointers.
This stage is called the traceback. The graph of pointers in the traceback is also referred to as the path graph because it defines the paths through the matrix that correspond to the optimal alignment or alignments.
Trace back (if we DO allow terminal gaps)
Trace back (if we DO NOT allow terminal gaps)
10 + gap ≠ 11; 14 + mismatch = 11; 10 + gap ≠ 11
10 + gap ≠ 14; 9 + match = 14; 5 + gap ≠ 14
4 + mismatch ≠ 9; 13 + gap = 9; 0 + gap ≠ 9
8 + match = 13; 4 + gap ≠ 13; 9 + gap ≠ 13
–1 + gap ≠ 8; 12 + gap = 8; 3 + match = 8
7 + gap = 3; –6 + gap ≠ 3; –2 + mismatch ≠ 3
7 + gap ≠ 12; 7 + match = 12; 3 + gap ≠ 12
…
Trace back (complete)
high road/low road/middle road
Two possible alignments:
GAATTCAGT
GGA-TC-GA
* * ** *

GAATTCAGT
GGAT-C-GA
* ** * *
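The matrix fill and traceback can be reproduced with a short Needleman-Wunsch sketch. It uses the scoring values from the worked example (match = +5, mismatch = –3) and, as a simplifying assumption, charges –4 for every gap position, terminal gaps included; this coincides with a gap-opening penalty of –4 and no extension penalty only when each gap is a single residue long, as in the alignments shown. The function name is hypothetical, and when several alignments are co-optimal the traceback returns just one of them.

```python
def needleman_wunsch(seq_a, seq_b, match=5, mismatch=-3, gap=-4):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1

    # Initialization: the first row and column hold runs of leading gaps.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill: each cell is the best of a diagonal, vertical, or horizontal step.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback from the bottom-right cell, preferring the diagonal step.
    row_a, row_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                row_a.append(seq_a[i - 1])
                row_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("GAATTCAGT", "GGATCGA")
print(best)
print(top)
print(bottom)
```

For these two sequences the optimal score is 11, corresponding to five matches, two mismatches, and two single-residue gaps (5 × 5 – 2 × 3 – 2 × 4).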
Scoring Matrices
Mismatch and gap penalties should be inversely proportional to the frequencies with which changes occur.
          To A                     To T                     To C                     To G                     Row totals
From A    -                        3.4 ± 0.7 (3.6 ± 0.7)    4.5 ± 0.8 (4.8 ± 0.9)    12.5 ± 1.1 (13.3 ± 1.1)  20.3 (21.6)
From T    3.3 ± 0.6 (3.5 ± 0.6)    -                        13.8 ± 1.9 (14.7 ± 2.0)  3.3 ± 0.6 (3.5 ± 0.6)    20.4 (21.7)
From C    4.2 ± 0.5 (4.2 ± 0.5)    20.7 ± 1.3 (16.4 ± 1.3)  -                        4.6 ± 0.6 (4.4 ± 0.6)    29.5 (25.1)
From G    20.4 ± 1.4 (21.9 ± 1.5)  4.4 ± 0.6 (4.6 ± 0.6)    4.9 ± 0.7 (5.2 ± 0.8)    -                        29.7 (31.6)
Column totals  27.9 (29.5)         28.5 (24.6)              23.2 (23.2)              20.5 (21.3)
Transitions (68%) occur more frequently than transversions (32%).
Mismatch penalties for transitions should be smaller than those for transversions.
Empirical substitution matrices
PAM (Percent/Point Accepted Mutation)
BLOSUM (BLOcks SUbstitution Matrix)
PAM
• Developed by Margaret Dayhoff in 1978.
• Based on comparisons of very similar protein sequences.
Log-odds ratios
• A scoring matrix is a table of values that describe the probability of a residue (amino acid or base) pair occurring in an alignment.
• The values in a scoring matrix are log ratios of two probabilities. One is the probability of the pair occurring at random. The other is the empirically observed probability of the pair.
• Because the scores are logarithms of probability ratios, they can be added to give a meaningful score for the entire alignment. The more positive the score, the better the alignment!
The PAM matrices (Percent accepted mutations)
• Align sequences that are at least 85% identical.
– Minimizes ambiguity in alignments and the number of coincident mutations.
• Reconstruct phylogenetic trees and infer ancestral sequences.
• Tally replacements "accepted" by natural selection, in all pairwise comparisons.
– Meaning, the number of times j was replaced by i in all comparisons.
• Compute amino acid mutability (i.e., the propensity of a given amino acid, j, to be replaced).
• Combine data to produce a Mutation Probability Matrix for one PAM of evolutionary distance, which is used to calculate the Log Odds Matrix for similarity scoring.
• Thus, depending on the protein family used, various PAM matrices result: some are good at locating evolutionarily distant conserved mutations and some are good at locating evolutionarily close conserved mutations.
More on log-odds ratios
In PAM, log-odds scores are multiplied by 10 to avoid decimals. Therefore, a PAM score of 2 actually corresponds to a log-odds ratio of 0.2.
$$0.2 = \log_{10}\left(\frac{\text{observed } i \to j \text{ mutation rate}}{\text{expected rate}}\right)$$
The value 0.2 is $\log_{10}$ of the relative expectation value of the mutation. Therefore, the expectation value is $10^{0.2} \approx 1.6$.
So, a PAM score of 2 indicates that (in related sequences) the mutation would be expected to occur 1.6 times more frequently than random.
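The conversion between a scaled score and the underlying odds ratio is a one-line calculation in each direction. The sketch below assumes the scaling factor of 10 and base-10 logarithms described above; the function names are hypothetical.

```python
import math


def pam_score_to_odds(score, scale=10):
    """Convert a scaled log-odds score back to an observed/expected ratio."""
    return 10 ** (score / scale)


def odds_to_pam_score(odds, scale=10):
    """Convert an observed/expected ratio to a rounded, scaled log-odds score."""
    return round(scale * math.log10(odds))


print(round(pam_score_to_odds(2), 2))   # 1.58, i.e. about 1.6 times more often than random
print(odds_to_pam_score(1.6))           # 2
print(round(pam_score_to_odds(-3), 2))  # 0.5, i.e. about half as often as random
```

Because the scores are logarithms, adding the scores of two aligned positions corresponds to multiplying their odds ratios.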
PAM250
– Calculated for families of related proteins (>85% identity)
– 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues
– A positive score signifies a common replacement whereas a negative score signifies an unlikely replacement
– PAM250 matrix assumes/is optimized for sequences separated by 250 PAM, i.e., 250 substitutions in 100 amino acids (longer evolutionary time)
PAM250 is a sequence alignment matrix that allows 250 accepted point mutations per 100 amino acids. PAM250 is suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing more closely related sequences.
Selecting a PAM Matrix
• Low PAM numbers: short sequences, strong local similarities.
• High PAM numbers: long sequences, weak similarities.
– PAM60 for close relations (60% identity)
– PAM120 recommended for general use (40% identity)
– PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices
– PAM40, PAM120, PAM250 recommended.
BLOSUM
• Blocks Substitution Matrix
– Steven Henikoff and Jorja G. Henikoff (1992).
• Based on BLOCKS database (www.blocks.fhcrc.org)
– Families of proteins with identical function.
– Highly conserved protein domains.
• Ungapped local alignment to identify motifs
– Each motif is a block of local alignment.
– Counts amino acids observed in same column.
– Symmetrical model of substitution.
BLOSUM62
• BLOSUM matrices are based on local alignments (“blocks” or conserved amino acid patterns).
• BLOSUM62 is a matrix calculated from comparisons of sequences that are no more than 62% identical.
• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
• BLOSUM62 is the default matrix in BLAST 2.0.
BLOSUM Matrices
• Different BLOSUMn matrices are calculated independently from BLOCKS.
• BLOSUMn is based on sequences that are at most n percent identical.
The procedure for calculating a BLOSUM matrix is based on a likelihood method estimating the occurrence of each possible pairwise substitution. Only aligned blocks are used to calculate the BLOSUMs.
The higher the score, the more closely related the sequences.
Why is BLOSUM62 called BLOSUM62?
Because, within each block, all sequences that shared at least 62% identity with ANY other member of that block were averaged and represented as 1 sequence.
Selecting a BLOSUM Matrix
• For BLOSUMn, higher n is suitable for sequences that are more similar
– BLOSUM62 recommended for general use
– BLOSUM80 for close relations
– BLOSUM45 for distant relations
Equivalent PAM and Blosum matrices
The following matrices are roughly equivalent (listed from less divergent to more divergent):
• PAM100 ==> Blosum90
• PAM120 ==> Blosum80
• PAM160 ==> Blosum60
• PAM200 ==> Blosum52
• PAM250 ==> Blosum45
Generally speaking:
• The Blosum matrices are best for detecting local alignments.
• The Blosum62 matrix is the best for detecting the majority of weak protein similarities.
• The Blosum45 matrix is the best for detecting long and weak alignments.
Comparison of PAM250 and BLOSUM62
The relationship between BLOSUM and PAM substitution matrices:
BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences.
BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins.
If distant relatives of the query sequence are specifically being sought, the matrix can be tailored to that type of search.
Scoring matrices commonly used
• PAM250
– Shown to be appropriate for searching for sequences of 17-27% identity.
• BLOSUM62
– Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships.
• BLOSUM50
– Shown to be better for FASTA searches.
Effect of gap penalties on amino-acid alignment
Human pancreatic hormone precursor versus chicken pancreatic hormone
(a) Penalty for gaps is 0
(b) Penalty for a gap of size k nucleotides is $w_k = 1 + 0.1k$
(c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids
Alignments: things to keep in mind
“Optimal alignment” means “having the highest possible score, given a substitution matrix and a set of gap penalties”
This is NOT necessarily the most meaningful alignment
The assumptions of the algorithm are often wrong:
- substitutions are not equally frequent at all positions,
- it is very difficult to realistically model insertions and deletions.
Pairwise alignment programs ALWAYS produce an alignment (even when it does not make sense to align sequences)