The Basic Local Alignment Search Tool (BLAST)
Rapid database search tool (1990)
Idea:
(1) Search for high scoring segment pairs
A Y W T Y I V A L T - Q V R Q Y E A T
S I L C I V M I Y S R A - Q Y R Y W R Y
Most local alignments contain highly conserved sections without gaps
-> search for high-scoring segment pairs (HSPs), i.e. gap-free local alignments
Advantages: (a) speed
(b) statistical theory about HSP exists.
(2) Use word pairs as seeds
Pair-wise sequence alignment
T W L M H C A Q Y I C I M X H X C X T H Y
(1) Search for word pairs of length 3 with score > T and use them as seeds.
A naïve search for such word pairs would have a complexity of O(l1 * l2), where l1 is the length of the query and l2 the length of the database.
Solution: preprocess the query sequence:
Compile a list of all words that have a score > T when aligned to a word in the query. Complexity: O(l1)
Organize the words in an efficient data structure (tree) for fast look-up
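The preprocessing step above can be sketched in a few lines. This is a minimal illustration, not BLAST's implementation: the scoring function `pair_score` (match 5, mismatch -1), the threshold T = 8 and the function names are assumptions, and a Python dictionary stands in for the tree.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids


def pair_score(a, b, match=5, mismatch=-1):
    """Toy similarity score (assumption); a real search uses a substitution matrix."""
    return match if a == b else mismatch


def word_score(u, v):
    """Score of two words of equal length aligned without gaps."""
    return sum(pair_score(a, b) for a, b in zip(u, v))


def neighborhood_index(query, w=3, T=8):
    """Map every word that scores > T against a query word to the query positions."""
    index = {}
    for i in range(len(query) - w + 1):
        q_word = query[i:i + w]
        for letters in product(ALPHABET, repeat=w):
            cand = "".join(letters)
            if word_score(q_word, cand) > T:
                index.setdefault(cand, []).append(i)
    return index


def find_seeds(index, subject, w=3):
    """Scan the database sequence once; each word costs one dictionary look-up."""
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))  # (position in query, position in database sequence)
    return seeds


index = neighborhood_index("AYWTYIVALT")
print(find_seeds(index, "SILCIVMIYS"))
```

The number of candidate words per query position is a constant (20^3 for w = 3), so building the index is linear in the query length, and the database is scanned only once.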
(3) Extend seed alignments until score drops below threshold value
T W L M H C A Q Y I C I M X H X C X T H Y
Extend seeds in both directions until the score drops by X below the best score seen so far.
The algorithm is a heuristic: it is not guaranteed to find the best segment pair.
But it works well in practice!
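The extension step can be sketched as follows. The scoring values (match 5, mismatch -4), the drop-off X = 10 and the function names are assumptions chosen for illustration; the extension is gap-free and stops in each direction once the running score falls more than X below the best score reached.

```python
def pair_score(a, b, match=5, mismatch=-4):
    """Toy similarity score (assumption); a real search uses a substitution matrix."""
    return match if a == b else mismatch


def extend_one_way(query, subject, i, j, step, X):
    """Extend from (i, j) in direction step (+1 or -1).

    Returns (best score gain, number of positions that achieve it).
    """
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += pair_score(query[i], subject[j])
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > X:  # score dropped by more than X: give up
            break
        i += step
        j += step
    return best, best_len


def extend_seed(query, subject, i, j, w=3, X=10):
    """Extend the seed of length w at (i, j); return (score, (start_q, start_s, length))."""
    seed = sum(pair_score(query[i + k], subject[j + k]) for k in range(w))
    right, r_len = extend_one_way(query, subject, i + w, j + w, +1, X)
    left, l_len = extend_one_way(query, subject, i - 1, j - 1, -1, X)
    return seed + left + right, (i - l_len, j - l_len, w + l_len + r_len)


print(extend_seed("TTACGTACGGG", "CCACGTACGAA", 2, 2))
```

Because the extension gives up after a drop of X, a better segment pair beyond a poorly scoring stretch can be missed, which is why the method is a heuristic.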
New BLAST version (1997)
Two-hit strategy
Pair-wise sequence alignment
W L M H C A Q Y A R V I M X H X C X T H W A X R X v X
Search for two word pairs on the same diagonal; use a lower threshold T
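A minimal sketch of the two-hit test, assuming the word hits have already been found as (query position, database position) pairs. The window size `A`, the function name and the restriction to consecutive hits on a diagonal are assumptions made for this illustration.

```python
from collections import defaultdict


def two_hit_seeds(hits, w=3, A=40):
    """Return pairs of non-overlapping hits on the same diagonal within distance A.

    hits: iterable of (i, j) word hits; the diagonal of a hit is i - j.
    Only consecutive hits on a diagonal are compared.
    """
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[i - j].append(i)
    pairs = []
    for diag, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if w <= second - first <= A:  # no overlap, but close enough
                pairs.append(((first, first - diag), (second, second - diag)))
    return pairs


print(two_hit_seeds([(4, 3), (5, 4), (10, 9), (20, 2)]))
```

Requiring two hits makes single chance hits harmless, so the word threshold T can be lowered without triggering many more extensions.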
New BLAST version (1997):
Two-hit strategy
Gapped BLAST
Position-Specific Iterative BLAST (PSI-BLAST)
Multiple sequence alignment
1aboA 1 .NLFVALYDfvasgdntlsitkGEKLRVLgynhn..............gE
1ycsB 1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede............deiE
1pht 1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG
1ihvA 1 .NFRVYYRDsrd......pvwkGPAKLLWkg.................eG
1vie 1 .drvrkksga.........awqGQIVGWYctnlt.............peG

1aboA 36 WCEAQt..kngqGWVPSNYITPVN......
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......
1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd.....
1vie 28 YAVESeahpgsvQIYPVAALERIN......
First question: how to score multiple alignments?
Possible scoring scheme: sum-of-pairs score
Multiple alignment implies pairwise alignments:
1aboA 36 WCEAQt..kngqGWVPSNYITPVN......
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......
1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd.....
1vie 28 YAVESeahpgsvQIYPVAALERIN......
For example, the rows 1aboA and 1ycsB:
1aboA 36 WCEAQt..kngqGWVPSNYITPVN......
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......
Removing the columns that contain only gaps gives the implied pairwise alignment:
1aboA 36 WCEAQtkngqGWVPSNYITPVN
1ycsB 39 WWWARlndkeGYVPRNLLGLYP
The rows 1aboA and 1pht imply another pairwise alignment:
1aboA 36 WCEAQt..kngqGWVPSNYITPVN......
1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
Use the sum of the scores of these pairwise alignments (sum-of-pairs score):
1aboA 36 WCEAQt..kngqGWVPSNYITPVN......
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......
1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd.....
1vie 28 YAVESeahpgsvQIYPVAALERIN......
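A minimal sketch of the sum-of-pairs score. The scoring values (match 1, mismatch -1, gap -2, gap against gap 0) and the function names are assumptions for illustration; a real scheme would use a substitution matrix and its own gap costs.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of one column of an implied pairwise alignment (toy values)."""
    if a == "-" and b == "-":
        return 0  # a column of two gaps does not belong to the pairwise alignment
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows):
    """Sum the scores of all pairwise alignments implied by a multiple alignment."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for r1, r2 in combinations(rows, 2):
        total += sum(pair_score(a, b) for a, b in zip(r1, r2))
    return total


print(sum_of_pairs(["ACGT", "ACGT", "A-GT"]))
```

For n rows the sum runs over n(n-1)/2 implied pairwise alignments.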
Goal: find the multiple alignment with maximum score!
The Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment
Multidimensional search space instead of two-dimensional matrix!
Complexity:
For 3 sequences of lengths l1, l2, l3:
O(l1 * l2 * l3)
For n sequences (average length l):
O(l^n)
Exponential complexity!
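A small calculation illustrates the growth. The sequence length of 100 is an arbitrary example value, and `dp_cells` is a hypothetical helper that counts only the cells of the dynamic-programming array.

```python
def dp_cells(length, n):
    """Number of cells in an n-dimensional dynamic-programming array."""
    return length ** n


for n in (2, 3, 5, 10):
    print(n, "sequences:", dp_cells(100, n), "cells")
```

Two sequences of length 100 need 10,000 cells, ten such sequences already need 10^20.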
Optimal solution not feasible:
-> Heuristics necessary
(A) Carrillo and Lipman (MSA)
Find a sub-space of the dynamic-programming matrix in which the optimal path can be found
(B) Stoye, Dress (DCA)
Divide search space into small sub-spaces
Calculate optimal alignment for sub-spaces
Concatenate sub-alignments
Most popular way of constructing multiple alignments:
Progressive alignment.
Carry out a series of pair-wise alignments
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP
Align most similar sequences
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASFQPVAALERIN
WLNYNEERGDFPGTYVEYIGRKKISP
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN------
WLN-YNEERGDFPGTYVEYIGRKKISP
Align sequence to alignment
WCEAQTKNGQGWVPSNYITPVN-
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN------
WLN-YNEERGDFPGTYVEYIGRKKISP
Align alignment to alignment
WCEAQTKNGQGWVPSNYITPVN--------
WW--RLNDKEGYVPRNLLGLYP--------
AVVIQDNSDIKVVP--KAKIIRD-------
YAVESEA---SVQ--PVAALERIN------
WLN-YNE---ERGDFPGTYVEYIGRKKISP
Rule: “once a gap, always a gap”
The order of the pair-wise profile alignments is determined by a phylogenetic tree based on pair-wise similarity values (guide tree)
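How a guide tree fixes the order of the alignment steps can be sketched with a simple average-linkage clustering. This is an illustration under assumptions, not the procedure of any particular program: the similarity values are made up, and `guide_tree_order` is a hypothetical helper that only returns the merge order, without aligning anything.

```python
def guide_tree_order(names, sim):
    """Return the order in which sequences / groups would be aligned.

    names: list of sequence names
    sim:   dict mapping frozenset({a, b}) to a pair-wise similarity value
    """
    clusters = [(name,) for name in names]
    merges = []

    def avg_sim(c1, c2):
        # average similarity between all members of the two groups
        return sum(sim[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # pick the most similar pair of groups
        _, x, y = max(
            (avg_sim(c1, c2), x, y)
            for x, c1 in enumerate(clusters)
            for y, c2 in enumerate(clusters)
            if x < y
        )
        merges.append((clusters[x], clusters[y]))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges


sim = {
    frozenset("AB"): 0.9, frozenset("CD"): 0.8, frozenset("AC"): 0.3,
    frozenset("AD"): 0.2, frozenset("BC"): 0.4, frozenset("BD"): 0.1,
}
for left, right in guide_tree_order(["A", "B", "C", "D"], sim):
    print("align", left, "with", right)
```

In this example the two most similar pairs are aligned first and the two resulting alignments are aligned to each other last.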
Problem: a simple guide tree determines the multiple alignment; the multiple alignment in turn determines the phylogenetic analysis
Implementations:
Clustal W, PileUp, MultAlin
Local multiple alignment
[Figure: motifs M and M´ marked in the input sequences]
Find motifs contained in all sequences in the data set
Problem:
Motifs are often present only in sub-families
Neither local nor global methods applicable
Alignment possible if order conserved
The DIALIGN approach
Combination of local and global methods.
Find local pair-wise similarities between input sequences (fragments)
Compose alignments from fragments
Ignore non-related parts of the sequences
atctaatagttaaactcccccgtgcttagagatccaaaccagtgcgtgtattactaacggttcaatcgcgcacatccgc
------atctaatagttaaaccccctcgtgcttag-------agatccaaaccagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc--
------atcTAATAGTTAaaccccctcgtGCTTag-------AGATCCaaaccagtgcgtgTATTACTAAc----------GGTTcaatcgcgcACATCCgc--
Score of an alignment:
Define score of fragment f:
l(f) = length of f
s(f) = sum of matches (similarity values)
P(f) = probability to find a fragment with length l(f) and at least s(f) matches in random sequences that have the same length as the input sequences.
Score w(f) = -ln P(f)
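A rough numerical sketch of such a fragment score for DNA. The probability model is an assumption made for this illustration and is not DIALIGN's own formula: s(f) is taken as the number of matches, a single match has probability 0.25, and the chance of seeing such a fragment anywhere in two random sequences is bounded by the binomial tail multiplied by the number of possible start positions.

```python
from math import comb, log


def tail_probability(l, s, p=0.25):
    """Probability of at least s matches among l random position pairs."""
    return sum(comb(l, k) * p**k * (1 - p) ** (l - k) for k in range(s, l + 1))


def fragment_weight(l, s, len1, len2, p=0.25):
    """w(f) = -ln P(f), with P(f) approximated by a capped union bound (assumption)."""
    prob = min(1.0, len1 * len2 * tail_probability(l, s, p))
    return -log(prob)


print(fragment_weight(10, 10, 100, 100))  # short perfect fragment
print(fragment_weight(20, 20, 100, 100))  # longer perfect fragment scores higher
print(fragment_weight(10, 3, 100, 100))   # expected by chance: weight 0
```

Fragments that are likely to occur by chance get a weight near zero, so they contribute little to the score of an alignment.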
Define score of alignment as sum of scores w(f) of its fragments
No gap penalty is used!
Optimization problem for pair-wise alignment:
Find chain of fragments with maximal total score
A fragment-chaining algorithm finds the optimal chain of fragments.
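The chaining problem can be sketched with a simple quadratic-time dynamic program. It is a generic illustration under assumptions, not DIALIGN's own algorithm: fragments are given as hypothetical tuples (start in sequence 1, start in sequence 2, length, weight), and a fragment may follow another only if it starts after the other one ends in both sequences.

```python
def best_chain(fragments):
    """Return (total weight, chain) of the best chain of gap-free fragments."""
    frags = sorted(fragments)  # sorted by start position in sequence 1
    if not frags:
        return 0.0, []
    best = [w for (_, _, _, w) in frags]  # best chain weight ending in each fragment
    prev = [None] * len(frags)
    for b, (i2, j2, _, w2) in enumerate(frags):
        for a in range(b):
            i1, j1, l1, _ = frags[a]
            # fragment a must end before fragment b starts, in both sequences
            if i1 + l1 <= i2 and j1 + l1 <= j2 and best[a] + w2 > best[b]:
                best[b] = best[a] + w2
                prev[b] = a
    end = max(range(len(frags)), key=best.__getitem__)
    chain, k = [], end
    while k is not None:
        chain.append(frags[k])
        k = prev[k]
    return best[end], chain[::-1]


fragments = [(0, 0, 5, 3.0), (6, 7, 4, 2.0), (3, 9, 5, 4.0), (12, 12, 3, 1.5)]
print(best_chain(fragments))
```

The fragment with the highest single weight, (3, 9, 5, 4.0), is left out here because it is not compatible with the chain that has the larger total score.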
The DIALIGN approach
Multiple fragment alignment
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Step by step, gaps are introduced as fragments are added to the alignment:

atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa

atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa

atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaac----------ggttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa

atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac----------gg-ttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
Consistency: it is possible to introduce gaps such that all segment pairs are aligned.
atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg
caaa--GAGTATCAcc----------CCTGaaTTGAATaa
Program evaluation
Use biologically verified alignments (known 3D structure of proteins)
Compare alignments produced by computer programs to “biologically correct” alignments.
(1) First evaluation of multiple alignment programs (McClure, Vasi, Fitch, 1994)
4 protein families used: globin, kinase, protease, ribonuclease H
All globally related -> global programs performed best
(2) The BAliBASE (Thompson et al., 1999)
~ 100 protein families with known 3D structure, some with large insertions/deletions.
[Figure: BAliBASE reference alignment of 1aboA, 1ycsB, 1pht, 1ihvA and 1vie. Key: alpha helix red, beta strand green, core blocks underscored]
Results:
Four programs performed best, but no method was best in all test examples.
ClustalW, SAGA and PRRP best for global alignment, DIALIGN best for sequences with large insertions or deletions.
(3) Lassmann and Sonnhammer (2002)
Used BAliBASE plus artificial sequences for local alignment
Results: T-COFFEE best for closely related sequences, DIALIGN best for distantly related sequences.
Alignment of large genomic sequences
Important tool for identifying functional sites (e.g. genes or regulatory elements)
Phylogenetic footprinting:
Functional sites are more conserved during evolution
=> Sequence similarity indicates biological function
DIALIGN performs well in identifying local homologies, but is slow
Quadratic program running time
Solution: Anchored alignments
Find anchor points to reduce search space
Use fast heuristic method to find anchor points:
CHAOS developed together with Mike Brudno
Brudno et al. (2003), BMC Bioinformatics 4:66
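How anchor points reduce the search space can be illustrated with a small calculation. The sequence lengths and anchor positions are made-up example values, and `search_space` is a hypothetical helper: it counts the cells of the comparison matrix that remain when the alignment is forced to pass through every anchor point.

```python
def search_space(len1, len2, anchors):
    """Cells left to search between consecutive anchor points.

    anchors: list of (i, j) points, strictly increasing in both coordinates.
    """
    cells = 0
    prev_i = prev_j = 0
    for i, j in anchors + [(len1, len2)]:
        cells += (i - prev_i) * (j - prev_j)  # rectangle between two anchors
        prev_i, prev_j = i, j
    return cells


print(search_space(1000, 1000, []))
print(search_space(1000, 1000, [(250, 250), (500, 500), (750, 750)]))
```

With three evenly spaced anchor points the quadratic search space shrinks to a quarter of its original size.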
First step to gene prediction:
Exon discovery by genomic alignment
Evaluation of different alignment programs:
Compare local sequence similarity identified by alignment programs to known exons
Morgenstern et al. (2002), Bioinformatics 18:777-787
DIALIGN alignment of human and murine genomic sequences
DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences
Evaluation of DIALIGN, PipMaker, WABA, BLASTN and TBLASTX on a set of 42 human and murine genomic sequences.
Compare similarities to annotated exons
Apply cut-off parameter to resulting alignments
Measure sensitivity and specificity
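The two measures can be sketched at the nucleotide level. The definitions used here are an assumption, following the usual convention in gene prediction: sensitivity is the fraction of annotated exon positions that are covered, specificity is the fraction of reported positions that lie in annotated exons. The function name and the example coordinates are hypothetical.

```python
def sensitivity_specificity(predicted, annotated):
    """Compare predicted and annotated sequence positions.

    Returns (sensitivity, specificity) = (TP / annotated, TP / predicted).
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


annotated_exon = range(100, 200)   # positions of a known exon
aligned_region = range(150, 175)   # positions with similarity above the cut-off
print(sensitivity_specificity(aligned_region, annotated_exon))
```

Raising the cut-off parameter typically reports fewer positions, which tends to raise specificity and lower sensitivity.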
Performance of long-range alignment programs for exon discovery (human - mouse comparison)
Performance of long-range alignment programs for exon discovery (Arabidopsis thaliana - tomato comparison)
AGenDA:
Alignment-based Gene Detection Algorithm
Bridge small gaps between DIALIGN fragments -> cluster of fragments
Search for conserved splice sites and start/stop codons at cluster boundaries to identify candidate exons
Recursive algorithm finds biologically consistent chain of potential exons
Identification of candidate exons
Fragments in DIALIGN alignment
Build cluster of fragments
Identify conserved splice sites
Candidate exons bounded by conserved splice sites
Construct gene models using candidate exons
Score of candidate exon (E) based on DIALIGN scores for fragments, score of splice junctions and penalty for shortening / extending
Find biologically consistent chain of candidate exons (starting with start codon, ending with stop codon, no internal stop codons …) with maximal total score
\[ sc(E) = \frac{len(C) - dis(C,E)}{len(C)} \cdot \sum_i w(f_i) + sc(SP) \]
Find optimal consistent chain of candidate exons
[Figure: signals atg gt ag gt ag tga atg tga along the sequence, with gene models G1 and G2]
Recursive algorithm calculates optimal chain of candidate exons in O(N log N) time
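An O(N log N) chaining step can be sketched with the classic weighted-interval technique: sort by end position and use binary search to find the compatible predecessors. This is a generic illustration and not the AGenDA algorithm itself. It only enforces that exons do not overlap and ignores the biological consistency rules (start and stop codons, reading frame); the tuple layout (start, end, score) with an exclusive end and the function name are assumptions.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Return (total score, chain) of non-overlapping candidate exons.

    exons: list of (start, end, score) with start < end, end exclusive.
    """
    exons = sorted(exons, key=lambda e: e[1])  # sort by end position
    ends = [end for _, end, _ in exons]
    n = len(exons)
    dp = [0.0] * (n + 1)   # dp[k]: best score using the first k exons
    take = [False] * (n + 1)
    pred = [0] * (n + 1)
    for k, (start, _, score) in enumerate(exons, start=1):
        p = bisect_right(ends, start)  # number of exons ending at or before start
        pred[k] = p
        if dp[p] + score > dp[k - 1]:
            dp[k], take[k] = dp[p] + score, True
        else:
            dp[k] = dp[k - 1]
    chain, k = [], n
    while k > 0:  # trace back the chosen exons
        if take[k]:
            chain.append(exons[k - 1])
            k = pred[k]
        else:
            k -= 1
    return dp[n], chain[::-1]


candidates = [(0, 10, 3.0), (5, 15, 5.0), (12, 20, 4.0), (16, 30, 2.0)]
print(best_exon_chain(candidates))
```

Sorting costs O(N log N) and each of the N binary searches costs O(log N), which gives the stated running time for this simplified problem.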
[Figure: DIALIGN fragments, candidate exons and complete gene model]
Results: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)
[Figure: sensitivity and specificity of AGenDA and GenScan, axis from 0% to 100%]
[Figure: comparison of AGenDA and GenScan; values shown: 64 %, 12 %, 17 %]
Results:
Quality of AGenDA-based gene models is comparable to results from GenScan
Exons were identified that had not been identified by GenScan
No statistical models derived from known genes (no training data necessary!)
Method generally applicable
WWW server:
http://bibiserv.TechFak.Uni-Bielefeld.DE/agenda
Rinner, Taher, Goel, Sczyrba, Brudno, Batzoglou, Morgenstern, submitted