BIOS477/877 L13 - 1
Spring 2020
BIOS 477/877
Bioinformatics and Molecular Evolution
Lecture 13
BIOS477/877 L13 - 2
TODAY'S TOPICS
➢ Assignment 4 Review
➢ BLASTP & BLASTN outputs
➢ BLAST & FASTA statistics
BIOS477/877 L13 - 3
blastp Similarity Search: Result Page
BIOS477/877 L13 - 4
blastp Similarity Search: Result Page
Phylogeny based on pairwise distances from BLAST pairwise alignments.
➜ Approximate tree. For a more accurate phylogeny, distances need to be estimated from a multiple alignment.
BIOS477/877 L13 - 5
blastp Similarity Search: Result Page
Download the BLAST result:
- BLAST search result in text format
- Sequences and alignments in FASTA format
- BLAST hit statistics in “Hit Table (csv)” [can be imported into any spreadsheet program, e.g., Excel]
BIOS477/877 L13 - 6
BLASTP results
Query coverage: proportion of the query aligned
Bit scores
E-value
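As a rough illustration of "proportion of the query aligned", the sketch below merges the query intervals of several HSPs and divides by the query length. The function name and the 1-based inclusive coordinates are assumptions made for this example; the exact procedure used on the BLAST result page may differ.

```python
def query_coverage(hsp_ranges, query_length):
    """Fraction of query positions covered by at least one HSP.

    hsp_ranges: iterable of (start, end) query coordinates,
    1-based and inclusive, with start <= end (an assumption of this sketch).
    """
    covered = 0
    current_end = 0  # last query position already counted
    for start, end in sorted(hsp_ranges):
        # Skip the part of this HSP that overlaps positions already counted.
        start = max(start, current_end + 1)
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / query_length


# Two overlapping HSPs on a 100-residue query cover positions 1..80.
print(query_coverage([(1, 50), (40, 80)], 100))
```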
BIOS477/877 L13 - 7
Nucleotide Similarity Search
megablast*: w=28 (16~256)
*This is the default search method
discontiguous megablast: w=11 (or 12); allows some mismatches
blastn: w=11 (7~15); w=7 for a short sequence
Default DB: nr/nt
BIOS477/877 L13 - 8
Discontiguous megablast
If discontiguous megablast is chosen:
Word matching based on a discontiguous pattern (template):
e.g., for coding: 1101101101101101 (w=11, t=16)
➜ mismatches are allowed at '0' positions
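The template idea can be sketched in a few lines of Python. This is only an illustration of matching under the coding template from the slide: the function names are made up for this example, and the brute-force scan stands in for the indexed lookup a real search program would use.

```python
TEMPLATE = "1101101101101101"  # coding template from the slide (w=11, t=16)


def template_match(a, b, template=TEMPLATE):
    """True if words a and b agree at every '1' position of the template."""
    if len(a) != len(template) or len(b) != len(template):
        raise ValueError("both words must have the template length")
    return all(x == y for x, y, t in zip(a, b, template) if t == "1")


def find_seed_hits(query, subject, template=TEMPLATE):
    """Brute-force scan: all (query_pos, subject_pos) pairs that seed a hit."""
    t = len(template)
    hits = []
    for i in range(len(query) - t + 1):
        for j in range(len(subject) - t + 1):
            if template_match(query[i:i + t], subject[j:j + t], template):
                hits.append((i, j))
    return hits


query = "ATGGCTAAAGCTGGTC"
# Same word with changes only at '0' positions (third codon positions).
subject = "CC" + "ATAGCCAAGGCTGGTC"
print(find_seed_hits(query, subject))
```

With this template the '0' positions fall on every third base, so synonymous third-position changes in a coding sequence do not prevent a seed match.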
BIOS477/877 L13 - 9
BLASTN results
megablast (only 4 hits, all E<7e-91)
discontiguous megablast (12 hits, E<3e-111)
blastn (137 hits, E<5.3)
BIOS477/877 L13 - 10
BLASTN/BLASTX results
[blastn]
[blastx] (translated query vs. protein db)
Low-complexity regions are masked (shown in lowercase)
6 possible frames
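The "6 possible frames" of a translated search can be made concrete with a small sketch: three reading frames on the given strand and three on its reverse complement. The code assumes the standard genetic code and plain A/C/G/T input; the helper names and the frame labels (+1..+3 as 1..3, reverse-strand frames as -1..-3) are choices made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return {frame: protein} for frames 1, 2, 3 and -1, -2, -3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[k + 1] = translate(seq[k:])
        frames[-(k + 1)] = translate(rc[k:])
    return frames


for frame, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(frame, protein)
```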
BIOS477/877 L13 - 11
BLASTP results
[blastp]
Positive (+) scoring AA pairs (similar AA pairs)
BIOS477/877 L13 - 12
BLAST results
Click to see the BLAST search statistics
BIOS477/877 L13 - 13
BLAST results
Used to calculate the scores for alignments with gaps
BIOS477/877 L13 - 14
BLAST Statistics
[blastp]
Raw Score ($S$): simply based on pairwise scores & gap penalties
Normalized Score or Bit Score ($S'_{bit}$):
$S'_{bit} = (\lambda S - \ln K) / \ln 2$, [$S'_{nat} = \lambda S - \ln K$]
$\lambda = 0.267$, $K = 0.041$: $S'_{bit} = \{0.267 \times 795 - \ln(0.041)\} / \ln 2 = 310.8$
$\lambda$ and $K$ are scoring-system specific (for gapped alignments)
BIOS477/877 L13 - 15
BLAST Statistics
[blastp]
Raw scores ($S$) depend on the scoring system ➜ cannot be compared across scoring systems
Bit scores ($S'_{bit}$) are normalized using $\lambda$ and $K$ ➜ independent of the scoring system; can be compared
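The bit-score arithmetic on the slide can be checked with a short sketch. The function names are chosen for this example; the values of λ, K, and S are the ones quoted on the slide.

```python
import math


def nat_score(raw_score, lam, k):
    """Normalized score in nats: lambda * S - ln K."""
    return lam * raw_score - math.log(k)


def bit_score(raw_score, lam, k):
    """Normalized score in bits: (lambda * S - ln K) / ln 2."""
    return nat_score(raw_score, lam, k) / math.log(2)


# Values from the slide: lambda = 0.267, K = 0.041, S = 795
print(round(bit_score(795, 0.267, 0.041), 1))  # 310.8
```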
BIOS477/877 L13 - 16
Pairwise alignment vs. database searching
[For a pairwise alignment]
➢ Karlin-Altschul equation (Karlin & Altschul, 1990)
$P(S \ge x) = 1 - \exp[-Kmn\,e^{-\lambda x}] \approx Kmn\,e^{-\lambda x}$
Probability of getting the alignment score $S \ge x$ by chance
$E = NP$ [N: number of random alignments; used in PRSS and LALIGN]
[For database searching]
➢ Multiple pairwise alignments: multiple testing problem
• $P(S \ge x)$: probability of getting an alignment score $S \ge x$ by chance from one pairwise alignment
• If $P(S \ge x) = 0.05$, then $1 - P(S \ge x) = 0.95$
➜ 0.95 is the probability of having $S < x$ by chance for one pairwise alignment
• For 10 alignments, $0.95^{10} \approx 0.60$ is the probability that all 10 alignments have $S < x$
➜ $1 - 0.60 = 0.40$ is the probability of having $S \ge x$ by chance for at least one alignment
• For 100 alignments, $0.95^{100} \approx 0.006$ is the probability that all 100 have $S < x$
➜ $1 - 0.006 \approx 0.99$ is the probability of having $S \ge x$ by chance for at least one alignment
[Figure: $P(S \ge x)$ plotted against $x$]
Pr = 0.05 as the significance level is not good enough if many alignments need to be tested!
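The numbers in the multiple-testing argument can be reproduced with a few lines of Python. The function name is chosen for this example, and the calculation assumes the alignments are independent, as the slide's arithmetic does.

```python
def prob_at_least_one(p, n):
    """Probability that at least one of n independent alignments
    reaches S >= x by chance, when each does so with probability p."""
    return 1 - (1 - p) ** n


for n in (1, 10, 100):
    print(n, round(prob_at_least_one(0.05, n), 2))
```

For a per-alignment threshold of 0.05 this gives about 0.40 for 10 alignments and about 0.99 for 100, matching the slide.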
BIOS477/877 L13 - 17
Bonferroni correction
➢ Multiple comparison correction
Instead of using Prob = $\alpha$, use Prob = $\alpha/N$ (for $N$ comparisons) as the threshold
• For 10 alignments, use $P(S \ge x) = 0.05/10 = 0.005$ (instead of 0.05)
➜ $(1 - 0.005)^{10} \approx 0.95$ is the probability of having $S < x$ by chance for all 10 alignments
➜ $1 - 0.95 = 0.05$ is the probability of having $S \ge x$ by chance for at least one alignment
• For 100 alignments, use $P(S \ge x) = 0.05/100 = 0.0005$ (instead of 0.05)
➜ $(1 - 0.0005)^{100} \approx 0.95$ is the probability of having $S < x$ by chance for all 100 alignments
➜ $1 - 0.95 = 0.05$ is the probability of having $S \ge x$ by chance for at least one alignment
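A small sketch shows why dividing the threshold by N keeps the overall error near α. The function names are illustrative, and independence between comparisons is assumed.

```python
def bonferroni_threshold(alpha, n):
    """Per-comparison threshold for n comparisons."""
    return alpha / n


def overall_error(per_test_p, n):
    """Chance of at least one S >= x among n independent comparisons."""
    return 1 - (1 - per_test_p) ** n


alpha = 0.05
for n in (10, 100, 10000):
    threshold = bonferroni_threshold(alpha, n)
    print(n, threshold, round(overall_error(threshold, n), 4))
```

The overall error stays just below 0.05 for every N, which is the point of the correction.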
BIOS477/877 L13 - 18
Bonferroni correction in database searching
➢ Multiple comparison correction
Instead of using Prob = $\alpha$, use $\alpha' = \alpha/N$ (for $N$ comparisons) as the threshold ➜ $\alpha = N \times \alpha'$
$E = N \times$ Prob ➜ $E = \alpha$ can be used as the threshold for multiple comparisons
• For database searching, $N$ is the database size (the number of entries) ➜ the number of alignments
The E-value threshold can be considered a P-value threshold corrected for multiple comparisons in database searching
BIOS477/877 L13 - 19
BLAST Statistics
➢ Karlin-Altschul equation (Karlin & Altschul, 1990)
[For a pairwise alignment]
$P = Kmn\,e^{-\lambda S}$ (Lec 11 slide 13)
m, n: lengths of the sequences compared ➜ m × n: search space
[For database similarity searching]
$E = Kmn\,e^{-\lambda S}$ (NOTE: E = NP is not used in BLAST)
m: length of the query
n: length of the database (total number of residues)
Consider a database as a single very long sequence
E-value: the expected number of HSPs with scores ≥ S
$P = 1 - e^{-E}$ ($P \approx E$ if $E < 0.01$)
➜ the probability of having at least one HSP with its score ≥ S
BIOS477/877 L13 - 20
BLAST Statistics
m': effective length of the query
n': effective length of the database
$m' = m - l$
$n' = n - l \times$ (number of sequences in the database)
l: length adjustment ➜ correction for edge effects
• HSPs cannot occur too close to the search space edges.
• HSPs need to be a certain length.
[Figure: HSP cannot start too close to the edge]
Note: Calculation methods for the length adjustment (l) and m'n' have been changed based on a new finite-size correction (FSC). See Park et al. (2012, BMC Research Notes).
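The effective-length formulas on this slide can be written out as a short sketch. All numbers below (query length, database size, number of sequences, and the length adjustment l) are hypothetical values picked for illustration; as the note on the slide says, current BLAST versions derive the adjustment with a finite-size correction, which this sketch does not attempt.

```python
def effective_lengths(m, n, num_seqs, length_adjustment):
    """Return (m', n') using m' = m - l and n' = n - l * num_seqs."""
    m_eff = m - length_adjustment
    n_eff = n - length_adjustment * num_seqs
    return m_eff, n_eff


# Hypothetical example values, not taken from any real search.
m_eff, n_eff = effective_lengths(m=250, n=50_000_000,
                                 num_seqs=150_000, length_adjustment=30)
print(m_eff, n_eff, m_eff * n_eff)  # m', n', effective search space m'n'
```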
BIOS477/877 L13 - 21
P-value, E-value, and database search
➢ P-value for pairwise alignment: $P = 1 - \exp[-Kmn\,e^{-\lambda x}] \approx Kmn\,e^{-\lambda x}$
➜ Probability of getting an alignment score ≥ x from a random pairwise comparison (m and n are the lengths of the 2 sequences compared)
➢ E-value: $E = Kmn\,e^{-\lambda S}$
➜ Number of alignments with a score ≥ S expected by chance from a database search
m: length of the query (or effective length, m')
n: length of the database (or effective length, n')
➢ P-value for a database search (Bonferroni corrected)
➜ the probability of having at least one HSP with its score ≥ S
➜ $P = 1 - e^{-E}$ or $E = -\ln(1 - P)$
$0 < P < 1$
$0 < E < N$ (N: number of random comparisons)
Altschul et al. (1994) & BLAST Statistics Tutorial
(P = E/N or E = PN is used in FASTA; N: database size)
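The conversion between E and P stated on this slide is easy to check numerically. The function names are chosen for this example; `math.expm1` and `math.log1p` are used only to keep precision when E or P is very small.

```python
import math


def e_to_p(e_value):
    """P = 1 - exp(-E): probability of at least one HSP with score >= S."""
    return -math.expm1(-e_value)


def p_to_e(p_value):
    """E = -ln(1 - P)."""
    return -math.log1p(-p_value)


for e in (1e-5, 0.01, 1.0, 10.0):
    print(e, e_to_p(e))
```

For small E the two values are nearly equal (P ≈ E), while for large E the P-value saturates near 1.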
BIOS477/877 L13 - 22
BLAST Statistics
[blastp HSP]
$\lambda = 0.267$, $K = 0.041$, $S = 795$, $S'_{bit} = \{0.267 \times 795 - \ln(0.041)\} / \ln 2 = 310.8$
Expect ($E$) $= K m' n' e^{-\lambda S} = m' n' e^{-S'_{nat}} = m' n' 2^{-S'_{bit}}$
$E = 0.041 \times m' \times n' \times e^{-0.267 \times 795}$ [from the raw score]
$E = m' \times n' \times 2^{-310.8}$ [from the bit score]
$m' \times n'$: effective search space
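The two routes to the E-value (from the raw score and from the bit score) can be compared in a short sketch. λ, K, and S are the values from the slide; the effective search space m'n' is not given there, so the value used below is a hypothetical placeholder.

```python
import math

LAMBDA, K, RAW_SCORE = 0.267, 0.041, 795   # values from the slide
SEARCH_SPACE = 1.0e10                      # hypothetical m' * n'


def evalue_from_raw(raw_score, search_space, lam=LAMBDA, k=K):
    """E = K * m'n' * exp(-lambda * S)."""
    return k * search_space * math.exp(-lam * raw_score)


def evalue_from_bits(bits, search_space):
    """E = m'n' * 2 ** (-S'bit)."""
    return search_space * 2.0 ** (-bits)


bits = (LAMBDA * RAW_SCORE - math.log(K)) / math.log(2)
print(evalue_from_raw(RAW_SCORE, SEARCH_SPACE))
print(evalue_from_bits(bits, SEARCH_SPACE))
```

Both calls give the same value up to floating-point rounding, since the bit score already folds λ and K into the score.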
BIOS477/877 L13 - 23
BLAST search summary statistics
Scoring matrix & gap penalties
Word size (W)
Neighborhood threshold (T)
Length separating two HSPs to trigger extension (A: two-hit methods)
$\lambda$, $K$, and $H$ are pre-estimated for each combination of scoring matrix and gap penalties