
Bioinformatics and Molecular Evolution, Lecture 13 (BIOS477/877 L13) · bioinfolab.unl.edu/emlab/Courses/BIOS877/current/slides/BIOS877_L…

Jul 18, 2020


BIOS477/877 L13 - 1

Spring 2020

BIOS 477/877

Bioinformatics and Molecular Evolution

Lecture 13


BIOS477/877 L13 - 2

TODAY'S TOPICS

➢ Assignment 4 Review
➢ BLASTP & BLASTN outputs
➢ BLAST & FASTA statistics

BIOS477/877 L13 - 3

blastp Similarity Search: Result Page


BIOS477/877 L13 - 4

blastp Similarity Search: Result Page

Phylogeny based on pairwise distances from BLAST pairwise alignments.

➜ Approximate tree. For a more accurate phylogeny, distances need to be estimated from a multiple alignment.

BIOS477/877 L13 - 5

blastp Similarity Search: Result Page

Download the BLAST result:
- BLAST search result in text format
- Sequences and alignments in FASTA format
- BLAST hit statistics in "Hit Table (csv)" [can be imported into any spreadsheet program, e.g., Excel]
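A downloaded hit table can also be read with a short script instead of a spreadsheet. The sketch below is illustrative only: it assumes the file has no header row and follows the standard 12-column BLAST tabular order (check this against your own download), and the two sample rows are made-up placeholders, not real BLAST output.

```python
import csv
import io

# ASSUMPTION: standard BLAST tabular column order; verify against your file.
COLUMNS = ["query", "subject", "pct_identity", "align_len", "mismatches",
           "gap_opens", "q_start", "q_end", "s_start", "s_end",
           "evalue", "bit_score"]
INT_FIELDS = ("align_len", "mismatches", "gap_opens",
              "q_start", "q_end", "s_start", "s_end")
FLOAT_FIELDS = ("pct_identity", "evalue", "bit_score")


def read_hit_table(text):
    """Parse hit-table CSV text into a list of dicts with numeric fields."""
    hits = []
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))  # extra trailing columns are ignored
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


# Hypothetical rows for illustration only (not real search results).
SAMPLE = """\
QUERY_1,SUBJ_A,98.5,200,3,0,1,200,5,204,1e-120,400.0
QUERY_1,SUBJ_B,45.0,180,90,4,10,189,1,176,2e-30,110.5
"""

hits = read_hit_table(SAMPLE)
significant = [h["subject"] for h in hits if h["evalue"] < 1e-50]
print(significant)
```

To use it on a real download, pass the file's contents (for example from `open(path).read()`) to `read_hit_table`.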

BIOS477/877 L13 - 6

BLASTP results

Query coverage: proportion of the query aligned

Bit scores

E-value
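As a rough illustration of "proportion of the query aligned", the sketch below merges the query ranges of several HSPs and divides the covered positions by the query length. It assumes 1-based inclusive coordinates and that overlapping HSPs are counted once; the HSP coordinates are hypothetical, and only the query length (570) comes from the lecture example.

```python
def query_coverage(hsp_ranges, query_length):
    """Fraction of query positions covered by at least one HSP.

    hsp_ranges: iterable of (start, end) pairs, 1-based and inclusive.
    """
    covered = 0
    current_end = 0  # last query position already counted
    for start, end in sorted(hsp_ranges):
        start = max(start, current_end + 1)  # drop the already-counted overlap
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / query_length


# Hypothetical HSP coordinates on a 570-residue query.
coverage = query_coverage([(1, 100), (50, 150), (300, 400)], 570)
print(f"{coverage:.0%}")
```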


BIOS477/877 L13 - 7

Nucleotide Similarity Search

- megablast*: w=28 (16~256)
  *This is the default search method
- discontiguous megablast: w=11 (or 12); allows some mismatches
- blastn: w=11 (7~15); w=7 for a short sequence

Default DB: nr/nt

BIOS477/877 L13 - 8

Discontiguous megablast

If discontiguous megablast is chosen:

Word matching is based on a discontiguous pattern (template):
e.g., for coding: 1101101101101101 (w=11, t=16)

➜ mismatches are allowed at '0' positions
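The template idea can be sketched in a few lines: two 16-base words count as a match when they agree at every '1' position, whatever happens at the '0' positions. This is a simplified illustration of the matching rule only, not how BLAST indexes words internally; the function name and the example words are hypothetical.

```python
TEMPLATE = "1101101101101101"  # coding template from the slide (w=11, t=16)


def template_match(word_a, word_b, template=TEMPLATE):
    """True if the two words agree at every '1' position of the template."""
    if not (len(word_a) == len(word_b) == len(template)):
        raise ValueError("both words must be as long as the template")
    return all(a == b
               for a, b, t in zip(word_a, word_b, template)
               if t == "1")


query_word = "ATGGCTAAAGTTCTGA"
# Differs from query_word only at the five '0' positions of the template.
wobble_variant = "ATAGCCAAGGTCCTAA"
# Differs from query_word at the first position, which is a '1' position.
first_base_variant = "CTGGCTAAAGTTCTGA"

print(template_match(query_word, wobble_variant))      # True
print(template_match(query_word, first_base_variant))  # False
```

In the coding template every third position is a '0', so differences confined to those positions do not prevent a word match.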

BIOS477/877 L13 - 9

BLASTN results

- megablast (only 4 hits, all E < 7e-91)
- discontiguous megablast (12 hits, E < 3e-111)
- blastn (137 hits, E < 5.3)

BIOS477/877 L13 - 10

BLASTN/BLASTX results

[blastn]

[blastx] (translated query vs. protein db)

Low-complexity regions are masked (shown in lowercase)

6 possible reading frames

BIOS477/877 L13 - 11

BLASTP results [blastp]

Positives (+): positively scoring AA pairs (similar AA pairs)

BIOS477/877 L13 - 12

BLAST results

Click to see the BLAST search statistics


BIOS477/877 L13 - 13

BLAST results

Used to calculate the scores for the alignments with gaps

BIOS477/877 L13 - 14

BLAST Statistics [blastp]

$\lambda$ and $K$ are scoring-system specific (for gapped alignments)

Normalized Score or Bit Score ($S'_{bit}$):
$S'_{bit} = (\lambda S - \ln K) / \ln 2$, [$S'_{nat} = \lambda S - \ln K$]
$\lambda = 0.267$, $K = 0.041$, $S'_{bit} = \{0.267 \times 795 - \ln(0.041)\} / \ln 2 = 310.8$

Raw Score ($S$): simply based on pairwise scores & gap penalties

BIOS477/877 L13 - 15

BLAST Statistics [blastp]

Normalized Score or Bit Score ($S'_{bit}$):
$S'_{bit} = (\lambda S - \ln K) / \ln 2$, [$S'_{nat} = \lambda S - \ln K$]
$\lambda = 0.267$, $K = 0.041$, $S'_{bit} = \{0.267 \times 795 - \ln(0.041)\} / \ln 2 = 310.8$

Raw Score ($S$): simply based on pairwise scores & gap penalties

Raw scores ($S$) depend on the scoring system; cannot be compared
Bit scores ($S'_{bit}$) are normalized using $\lambda$ and $K$;
➜ independent of scoring system; can be compared
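The bit-score normalization is easy to check numerically. The sketch below plugs the slide's values (λ = 0.267, K = 0.041, S = 795) into the formula; the function name is only illustrative.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score S'bit = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


print(round(bit_score(795, 0.267, 0.041), 1))  # 310.8
```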

BIOS477/877 L13 - 16

Pairwise alignment vs. database searching

[For a pairwise alignment]
➢ Karlin-Altschul equation (Karlin & Altschul, 1990)

$$P(S \ge x) = 1 - \exp\left[-Kmn\,e^{-\lambda x}\right] \approx Kmn\,e^{-\lambda x}$$

Probability of getting the alignment score $S \ge x$ by chance
$E = NP$ [$N$: number of random alignments; used in PRSS and LALIGN]

[Figure: $P(S \ge x)$ plotted against $x$]

[For database searching]
➢ Multiple pairwise alignments: multiple testing problem

• $P(S \ge x)$: probability of getting the alignment score ($S$) $\ge x$ by chance from one pairwise alignment

• If $P(S \ge x) = 0.05$, $1 - P(S \ge x) = 0.95$
  ➜ 0.95 is the probability of having $S < x$ by chance for one pairwise alignment

• For 10 alignments, $0.95^{10} \approx 0.60$ is the probability that all 10 alignments have $S < x$
  ➜ $1 - 0.60 = 0.40$ is the probability of having $S \ge x$ by chance for at least one alignment

• For 100 alignments, $0.95^{100} \approx 0.006$ is the probability that all 100 have $S < x$
  ➜ $1 - 0.006 \approx 0.99$ is the probability of having $S \ge x$ by chance for at least one alignment

A significance level of P = 0.05 is not stringent enough if many alignments need to be tested!
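The arithmetic behind the multiple testing problem can be reproduced with a small helper. This is a minimal sketch that assumes the alignments are independent, as the slide's calculation does; the function name is hypothetical.

```python
def prob_at_least_one(p_single, n_alignments):
    """Chance that at least one of n independent alignments reaches S >= x."""
    return 1.0 - (1.0 - p_single) ** n_alignments


for n in (1, 10, 100):
    print(n, round(prob_at_least_one(0.05, n), 2))
```

With a per-alignment threshold of 0.05, the chance of at least one spurious hit grows from 0.05 for a single alignment to about 0.40 for 10 alignments and about 0.99 for 100.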

BIOS477/877 L13 - 17

Bonferroni correction

➢ Multiple comparison correction
Instead of using Prob = $\alpha$, use Prob = $\alpha/N$ (for $N$ comparisons) as the threshold

• For 10 alignments, use $P(S \ge x) = 0.05/10 = 0.005$ (instead of 0.05)
  ➜ $(1 - 0.005)^{10} \approx 0.95$ is the probability of having $S < x$ by chance for all 10 alignments
  ➜ $1 - 0.95 = 0.05$ is the probability of having $S \ge x$ by chance for at least one alignment

• For 100 alignments, use $P(S \ge x) = 0.05/100 = 0.0005$ (instead of 0.05)
  ➜ $(1 - 0.0005)^{100} \approx 0.95$ is the probability of having $S < x$ by chance for all 100 alignments
  ➜ $1 - 0.95 = 0.05$ is the probability of having $S \ge x$ by chance for at least one alignment
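A short numerical check of the correction, assuming independent comparisons: dividing the threshold by the number of comparisons keeps the overall chance of at least one false positive near the intended 0.05. The function names are illustrative.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison threshold alpha / N."""
    return alpha / n_comparisons


def familywise_error(per_test_p, n_comparisons):
    """Chance of at least one false positive among N independent comparisons."""
    return 1.0 - (1.0 - per_test_p) ** n_comparisons


for n in (10, 100):
    threshold = bonferroni_threshold(0.05, n)
    print(n, threshold, round(familywise_error(threshold, n), 2))
```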

BIOS477/877 L13 - 18

Bonferroni correction in database searching

➢ Multiple comparison correction
Instead of using Prob = $\alpha$, use $\alpha' = \alpha/N$ (for $N$ comparisons) as the threshold
➜ $\alpha = N \times \alpha'$

$E = N \times$ Prob
➜ $E = \alpha$ can be used as the threshold for multiple comparisons

• For database searching, $N$ is the database size (the number of entries) ➜ the number of alignments

The E-value threshold can be considered a P-value threshold corrected for multiple comparisons in database searching


BIOS477/877 L13 - 19

BLAST Statistics

➢ Karlin-Altschul equation (Karlin & Altschul, 1990)

[For a pairwise alignment]
$P = Kmn\,e^{-\lambda S}$ (Lec 11 slide 13)
$m$, $n$: lengths of the sequences compared
➜ $m \times n$: search space

[For database similarity searching]
$E = Kmn\,e^{-\lambda S}$ (NOTE: $E = NP$ is not used in BLAST)
$m$: length of the query
$n$: length of the database (total number of residues)

E-value: the expected number of HSPs with scores $\ge S$
$P = 1 - e^{-E}$ ($P \approx E$ if $E < 0.01$)
➜ the probability of having at least one HSP with its score $\ge S$

Consider a database as a single very long sequence

[Figure: search space drawn as a grid of one sequence (T T A G A C G C G T A) against another (A C A G A G C T A)]

BIOS477/877 L13 - 20

BLAST Statistics

➢ Karlin-Altschul equation (Karlin & Altschul, 1990)

$$E = K m' n' e^{-\lambda S}$$

$m'$: effective length of the query
$n'$: effective length of the database
$m' = m - l$
$n' = n - l \times$ (number of sequences in the database)

$l$: length adjustment
➜ correction for edge effects
• HSPs cannot occur too close to the search space edges.
• HSPs need to be a certain length.

[Figure: an HSP cannot start too close to the edge]

Note: Calculation methods for the length adjustment ($l$) and $m'n'$ have been changed based on a new finite-size correction (FSC). See Park et al. (2012, BMC Research Notes).
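The effective lengths can be sketched directly from the two formulas on this slide. This follows the simple subtraction shown here, not the newer finite-size correction mentioned in the note; the length adjustment, database length, and sequence count in the example are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def effective_search_space(m, n, num_sequences, length_adjustment):
    """Return (m', n', m' * n') using m' = m - l and n' = n - l * num_sequences."""
    m_eff = m - length_adjustment
    n_eff = n - length_adjustment * num_sequences
    if m_eff <= 0 or n_eff <= 0:
        raise ValueError("length adjustment is too large for these lengths")
    return m_eff, n_eff, m_eff * n_eff


# Hypothetical database: 1,000,000 residues in 2,000 sequences, l = 100.
m_eff, n_eff, space = effective_search_space(570, 1_000_000, 2_000, 100)
print(m_eff, n_eff, space)  # 470 800000 376000000
```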

BIOS477/877 L13 - 21

P-value, E-value, and database search

➢ P-value for pairwise alignment $= 1 - \exp\left[-Kmn\,e^{-\lambda x}\right] \approx Kmn\,e^{-\lambda x}$
➜ Probability of getting the alignment score $\ge x$ from random pairwise comparison ($m$ and $n$ are the lengths of the 2 sequences compared)

➢ E-value $= Kmn\,e^{-\lambda S}$
➜ Number of alignments with a score $\ge S$ expected by chance from a database search
$m$: length of the query (or effective length, $m'$)
$n$: length of the database (or effective length, $n'$)

➢ P-value for a database search (Bonferroni corrected)
➜ the probability of having at least one HSP with its score $\ge S$
➜ $P = 1 - e^{-E}$ or $E = -\ln(1 - P)$

$0 < P < 1$
$0 < E < N$ ($N$: number of random comparisons)
($P = E/N$ or $E = PN$ is used in FASTA; $N$: database size)

Altschul et al. (1994) & BLAST Statistics Tutorial
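The conversion between E-values and database-search P-values can be written as two small functions. The sketch uses `math.expm1` and `math.log1p` so that very small values (such as E around 1e-80) are not rounded to zero; the function names are illustrative.

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E), computed stably for tiny E."""
    return -math.expm1(-e_value)


def e_from_p(p_value):
    """E = -ln(1 - P), computed stably for tiny P."""
    return -math.log1p(-p_value)


for e in (5.0, 1.0, 0.01, 1e-80):
    print(e, p_from_e(e))
```

For E below about 0.01 the two values are nearly identical, while for large E the P-value saturates toward 1.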

BIOS477/877 L13 - 22

BLAST Statistics [blastp HSP]

$\lambda = 0.267$, $K = 0.041$, $S = 795$, $S'_{bit} = \{0.267 \times 795 - \ln(0.041)\} / \ln 2 = 310.8$

Expect ($E$) $= K m' n' e^{-\lambda S}$ or $m' n' e^{-S'_{nat}}$ or $m' n' 2^{-S'_{bit}}$

$E = 0.041 \times m' \times n' \times e^{-0.267 \times 795}$ [from the raw score]
$E = m' \times n' \times 2^{-310.8}$ [from the bit score]

$m' \times n'$: Effective search space

BIOS477/877 L13 - 23

BLAST search summary statistics

Query: P45897.1
Query length: 570 amino acids

- Scoring matrix & gap penalties
- Word size (W)
- Neighborhood threshold (T)
- Length separating two HSPs to trigger extension (A: two-hit method)
- λ, K, and H are pre-estimated for each combination of scoring matrix and gap penalties, for gapped alignment

BIOS477/877 L13 - 24

BLAST search summary statistics

Query: P45897.1
Query length: 570 amino acids
m = 570 (length of query)
n: length of database


BIOS477/877 L13 - 25

BLAST Statistics [blastp HSP]

$\lambda = 0.267$, $K = 0.041$, $S = 795$, $S'_{bit} = \{0.267 \times 795 - \ln(0.041)\} / \ln 2 = 310.8$

Expect ($E$) $= K m' n' e^{-\lambda S}$ or $m' n' e^{-S'_{nat}}$ or $m' n' 2^{-S'_{bit}}$

$E = 0.041 \times m' \times n' \times e^{-0.267 \times 795}$ [from the raw score]
$E = m' \times n' \times 2^{-310.8}$ [from the bit score]

1.44E-80 ($1.44 \times 10^{-80}$)

W/O length adjustment: $m = 570$, $n = 94{,}578{,}689{,}328$
$E = 0.041 \times 570 \times 94{,}578{,}689{,}328 \times e^{-0.267 \times 795} = 1.44\text{E-}80$
$E = 570 \times 94{,}578{,}689{,}328 \times 2^{-310.8} = 1.48\text{E-}80$

$P = 1 - e^{-E} = 1 - \exp(-1.44 \times 10^{-80}) \approx 0$ ($P \approx E$ if $E < 0.01$)
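The two E-value calculations can be reproduced with the numbers given on this slide (no length adjustment). The sketch also shows that the small gap between 1.44E-80 and 1.48E-80 comes only from rounding the bit score to 310.8; with the unrounded bit score the two routes agree. Function and constant names are illustrative.

```python
import math

LAMBDA, K = 0.267, 0.041        # scoring-system parameters from the slide
M, N = 570, 94_578_689_328      # query length and database length from the slide


def evalue_from_raw(raw_score, m=M, n=N, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def evalue_from_bits(bits, m=M, n=N):
    """E = m * n * 2^(-S'bit)."""
    return m * n * 2.0 ** (-bits)


e_raw = evalue_from_raw(795)
e_bit = evalue_from_bits(310.8)  # bit score rounded to one decimal place
exact_bits = (LAMBDA * 795 - math.log(K)) / math.log(2)
e_exact = evalue_from_bits(exact_bits)

print(f"{e_raw:.2e} {e_bit:.2e} {e_exact:.2e}")
```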

BIOS477/877 L13 - 26

BLAST search set vs. format option

[Before search] Restrict a search against the selected organism
Search space will be limited
$E = Kmn\,e^{-\lambda S}$
➜ E-values become smaller

[After search] Restrict the result shown for a selected organism
Search space is not affected

BIOS477/877 L13 - 27

BLAST search size and format options

Search is not limited; results are filtered to show "Archaea" sequences

Search space is not affected

BIOS477/877 L13 - 28

BLAST search size and format options

Search is limited to "Archaea" sequences

Archaea (taxid:2157)

BIOS477/877 L13 - 29

BLAST search size and format options

Search is limited vs. search is NOT limited (results are filtered)

[Figure: two result tables, each with columns Score, Query cov, E-value, % ident]

(Database size is 72 times larger)
(E-value is 100 times larger)

$E = Kmn\,e^{-\lambda S}$

The E-value is affected by the database size!
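The dependence on database size follows directly from the formula: with everything else fixed, E is proportional to n. The sketch below uses the uncorrected equation with the λ, K, query length, and raw score from the earlier blastp example, and a hypothetical database length. Because it ignores length adjustment, it returns exactly the ratio of the database lengths (72) rather than the roughly 100-fold change reported on the slide.

```python
import math


def evalue(m, n, raw_score, lam=0.267, k=0.041):
    """Uncorrected E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


m, raw_score = 570, 795
n_restricted = 1_000_000_000    # hypothetical length of the restricted database
n_full = 72 * n_restricted      # a database 72 times larger

ratio = evalue(m, n_full, raw_score) / evalue(m, n_restricted, raw_score)
print(round(ratio, 6))
```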

BIOS477/877 L13 - 30

FASTA

http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml (also includes SSEARCH)

http://www.ebi.ac.uk/Tools/sss/fasta/ (also includes SSEARCH)
- With graphic output
- Results can be obtained through email

http://fasta.genome.jp/