BLAST Slides BLAST ALGORITHM

Jun 04, 2018

daljit-singh
Transcript
  • 8/13/2019 BLAST Slides BLAST ALGORITHM

    MPI for Developmental Biology, Tübingen

    BLAST and FASTA: Heuristics in pairwise sequence alignment

    Christoph Dieterich

    Department of Evolutionary Biology, Max Planck Institute for Developmental Biology

    Introduction BLAST Statistical analysis FASTA

    Introduction

    1 Pairwise alignment is used to detect homologies between different protein or DNA sequences, e.g. as global or local alignments.

    2 The problem is solved using dynamic programming in O(nm) time and O(n) space.

    3 This is too slow for searching current databases.
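
The O(nm) time and O(n) space bound can be illustrated with a minimal sketch of local alignment scoring that keeps only one previous row of the dynamic programming matrix. The function name and the match, mismatch, and gap values are assumptions chosen for illustration; they are not taken from the slides.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (score only, no traceback).

    Time is O(len(a) * len(b)); memory is O(len(b)) because only the
    previous row is kept. Scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Even in this compact form, every query has to visit all n·m cells for every database sequence, which is the cost the heuristics in the rest of the lecture try to avoid.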


    Heuristics for large-scale database searching

    In practice algorithms are used that run much faster, at the

    expense of possibly missing some significant hits due to the

    heuristics employed.

    Such algorithms are usually seed-and-extend approaches in which first small exact matches are found, which are then extended to obtain long inexact ones.


    Basic idea: Preprocessing

    After preprocessing, a large part of the computation is already finished before we search for similarities.

    For biological sequences of length $n$ there are $4^n$ (for the DNA alphabet $\Sigma = \{A, G, C, T\}$) or $20^n$ (for the amino acid alphabet) different strings.

    For small $|\Sigma|$ and $n$, it is possible to store all of them in a hash table.
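
As a rough illustration of such a hash table, the sketch below maps every length-k substring of a sequence to its start positions. The function name `build_kmer_index` is hypothetical, and positions are 0-based.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map each k-mer occurring in `sequence` to its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return dict(index)


index = build_kmer_index("ACGACG", 3)
# "ACG" occurs at positions 0 and 3, so a lookup costs one hash access
# instead of a scan over the whole sequence.
```

Only k-mers that actually occur are stored here, so the table stays far smaller than the $4^k$ or $20^k$ upper bound for realistic k.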


    Basic idea: Preprocessing

    1 A certain consistency of the database entries is assumed.

    2 Large sequence databases are split into two parts: a constant part, containing all the sequences that were used for hash table generation, and a dynamic part, containing all new entries.


    BLAST

    BLAST, the Basic Local Alignment Search Tool (Altschul et al., 1990), is perhaps the most widely used bioinformatics tool ever written. It is an alignment heuristic that determines local alignments between a query and a database. It uses an approximation of the Smith-Waterman algorithm.

    BLAST consists of two components: a search algorithm and

    computation of the statistical significance of solutions.


    BLAST terminology

    Definition

    Let q be the query and d the database. A segment is simply a substring s of q or d.

    A segment pair (s, t) (or hit) consists of two segments, one in q and one in d, of the same length.


    BLAST terminology

    Example

    V A L L A R

    P A M M A R

    We think of s and t as being aligned without gaps and score this alignment using a substitution score matrix, e.g. BLOSUM or PAM in the case of protein sequences.

    The alignment score for (s, t) is denoted by $\sigma(s, t)$.


    BLAST terminology

    A locally maximal segment pair (LMSP) is any segment pair (s, t) whose score cannot be improved by shortening or extending the segment pair.

    A maximum segment pair (MSP) is any segment pair (s, t) of maximal alignment score $\sigma(s, t)$.

    Given a cutoff score S, a segment pair (s, t) is called a high-scoring segment pair (HSP) if it is locally maximal and $\sigma(s, t) \ge S$.

    Finally, a word is simply a short substring of fixed length w.


    The BLAST algorithm

    Goal: Find all HSPs for a given cut-off score.

    We are given three parameters: a word size w, a word similarity threshold T and a minimum cut-off score S. We are then looking for a segment pair with a score of at least S that contains at least one word pair of length w with score at least T.


    The BLAST algorithm - Preprocessing

    Preprocessing: First, all words of length w in the query sequence q are generated. Then a list of all w-mers over the alphabet $\Sigma$ that have similarity > T to some word in the query sequence q is generated.


    The BLAST algorithm - Preprocessing

    Example

    For the query sequence RQCSAGW, the words of length w = 2 and the 2-mers with a score > T = 8 under the BLOSUM62 matrix are:

    word   2-mers with score > 8
    RQ     RQ
    QC     QC, RC, EC, NC, DC, KC, MC, SC
    CS     CS, CA, CN, CD, CQ, CE, CG, CK, CT
    SA     -
    AG     AG
    GW     GW, AW, RW, NW, DW, QW, EW, HW, KW, PW, SW, TW, WW
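
The neighbourhood generation can be sketched as a brute-force enumeration of all w-mers over the alphabet. Note the assumptions: `toy_score` is a hypothetical stand-in (+5 for identity, -1 otherwise) and is not BLOSUM62, so the sketch does not reproduce the table above; a real substitution matrix could be passed in as the `score` argument instead.

```python
from itertools import product


def toy_score(a, b):
    # Hypothetical scoring for illustration only (NOT BLOSUM62).
    return 5 if a == b else -1


def neighborhood(query, w, threshold, alphabet, score=toy_score):
    """For each word of length w in the query, list all w-mers scoring > threshold."""
    words = {}
    for start in range(len(query) - w + 1):
        word = query[start:start + w]
        if word in words:
            continue  # repeated query word, neighbours already computed
        neighbors = []
        for candidate in product(alphabet, repeat=w):
            total = sum(score(x, y) for x, y in zip(word, candidate))
            if total > threshold:
                neighbors.append("".join(candidate))
        words[word] = neighbors
    return words
```

The enumeration visits $|\Sigma|^w$ candidates per query word, which is acceptable for small w (e.g. w = 2 or 3 over 20 amino acids) but is not how an optimized implementation would proceed.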


    The BLAST algorithm - Searching

    1 Localization of the hits: The database sequence d is scanned for all hits t of w-mers s in the list, and the position of each hit is saved.

    2 Detection of hits: First all pairs of hits are searched that have a distance of at most A (think of them lying on the same diagonal in the matrix of the SW algorithm).

    3 Extension to HSPs: Each such seed (s, t) is extended in both directions until its score $\sigma(s, t)$ cannot be enlarged (LMSP). Then all best extensions are reported that have score $\ge S$; these are the HSPs. Originally the extension did not include gaps; the modern BLAST2 algorithm allows insertion of gaps.
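
Step 3 can be sketched for the ungapped case as follows. This is a minimal illustration, not the BLAST implementation: the function names, the +1/-1 scoring, and the use of a drop-off parameter `x_drop` to stop an extension once the running score falls too far below the best score seen are assumptions made for the example.

```python
def match_score(a, b):
    # Hypothetical scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


def extend_ungapped(q, d, qi, di, w, x_drop, score=match_score):
    """Extend a seed of length w at q[qi:], d[di:] in both directions without gaps.

    Returns (total_score, q_start, q_end) with q_end exclusive.
    """
    seed = sum(score(q[qi + k], d[di + k]) for k in range(w))

    def one_direction(i, j, step):
        best = running = 0
        best_len = length = 0
        while 0 <= i < len(q) and 0 <= j < len(d):
            running += score(q[i], d[j])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # score has faded too far below the best seen
            i += step
            j += step
        return best, best_len

    right, right_len = one_direction(qi + w, di + w, +1)
    left, left_len = one_direction(qi - 1, di - 1, -1)
    return seed + left + right, qi - left_len, qi + w + right_len
```

For example, with q = "TTACGTACGG", d = "CCACGTACAA" and the seed "ACG" at position 2 in both sequences, the extension with `x_drop=2` covers q[2:8] = "ACGTAC" with score 6. The result would still have to be compared against the cut-off S to decide whether it is reported as an HSP.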


    The BLAST algorithm - Searching

    The list L of all words of length w that have similarity > T to some word in the query sequence q can be produced in O(|L|) time.

    These are placed in a keyword tree and then, for each word in the tree, all exact locations of the word in the database d are detected in time linear in the length of d.

    As an alternative to storing the words in a tree, a finite-state machine can be used, which Altschul et al. found to have the faster implementation.


    The BLAST algorithm

    Use of seeds of length w and the termination of extensions with fading scores (score dropoff threshold X) are both steps that speed up the algorithm.

    Recent improvements (BLAST 2.0):

    Two word hits must be found within a window of A residues.

    Explicit treatment of gaps.

    Position-specific iterative BLAST (PSI-BLAST).
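
The two-hit criterion can be sketched as a grouping of word hits by diagonal. The sketch assumes that hits are given as (query position, database position) pairs, that only neighbouring hits on a diagonal are compared, and that the two hits must not overlap (distance at least w); the function name is hypothetical.

```python
from collections import defaultdict


def two_hit_seeds(hits, w, window):
    """Return (first_q_pos, second_q_pos, diagonal) for pairs of hits on the
    same diagonal that do not overlap and lie within `window` residues."""
    by_diagonal = defaultdict(list)
    for qi, di in hits:
        by_diagonal[di - qi].append(qi)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if w <= second - first <= window:
                seeds.append((first, second, diagonal))
    return seeds
```

Requiring two hits lets the word threshold T be lowered for sensitivity while still triggering far fewer extensions than a single-hit scheme.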


    The BLAST algorithm - DNA

    For DNA sequences, BLAST operates as follows:

    The list of all words of length w in the query sequence q is generated. In practice, w = 12 for DNA.

    The database d is scanned for all hits of words in this list.

    BLAST uses a two-bit encoding for DNA. This saves space and also search time, as four bases are encoded per byte.

    Note that the T parameter dictates the speed and sensitivity of the search.
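
A two-bit encoding can be sketched as below. The bit assignment (A=0, C=1, G=2, T=3) and the order in which bases are placed within a byte are assumptions for illustration and are not claimed to match BLAST's database format; ambiguity codes such as N are not handled.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_dna(seq):
    """Pack a DNA string into bytes, four bases per byte (2 bits each)."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        packed[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(packed)


def unpack_dna(packed, length):
    """Inverse of pack_dna; `length` is needed because the last byte may be partial."""
    return "".join(BASES[(packed[i // 4] >> (2 * (i % 4))) & 3]
                   for i in range(length))
```

A sequence of 10 bases occupies 3 bytes in this scheme instead of 10.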


    All flavors of BLAST

    BLASTN: compares a DNA query sequence to a DNA sequence database; $q_{\mathrm{DNA}} \leftrightarrow s_{\mathrm{DNA}}$

    BLASTP: compares a protein query sequence to a protein sequence database; $q_{\mathrm{protein}} \leftrightarrow s_{\mathrm{protein}}$

    TBLASTN: compares a protein query sequence to a DNA sequence database (6 frames translation); $q_{\mathrm{protein}} \leftrightarrow s_{t\{+1,+2,+3,-1,-2,-3\}(\mathrm{DNA})}$


    All flavors of BLAST - continued

    BLASTX: compares a DNA query sequence (6 frames translation) to a protein sequence database.

    TBLASTX: compares a DNA query sequence (6 frames translation) to a DNA sequence database (6 frames translation).
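
The six-frame translation used by TBLASTN, BLASTX and TBLASTX can be sketched as follows, assuming the standard genetic code and an input containing only A, C, G and T. The function names and the frame numbering (+1 to +3 forward, -1 to -3 on the reverse complement) are choices made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return a dict mapping frame (+1..+3, -1..-3) to its protein translation."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames
```

For the input "ATGGCCTAA", frame +1 yields "MA*" (with `*` marking a stop codon) and frame -1 yields "LGH".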


    Statistical analysis

    1 The null hypothesis H0 states that the two sequences (s, t) are not homologous. The alternative hypothesis then states that the two sequences are homologous.

    2 Choose an experiment to find the pair (s, t): use BLAST to detect HSPs.

    3 Compute the probability of the result under the hypothesis H0, $P(\mathrm{Score} \ge \sigma(s, t) \mid H_0)$, by generating a probability distribution with random sequences.

    4 Fix a rejection level for H0.

    5 Perform the experiment, compute the probability of achieving the result or higher, and compare with the rejection level.


    Poisson and extreme value distributions

    The Karlin and Altschul theory for local alignments (without gaps) is based on Poisson and extreme value distributions. The details of that theory are beyond the scope of this lecture, but the basics are sketched in the following.


    Poisson distribution

    Definition

    The Poisson distribution with parameter $v$ is given by

    $$P(X = x) = \frac{v^x}{x!} e^{-v}$$

    Note that $v$ is the expected value as well as the variance. From the equation it follows that the probability that a variable X will have a value of at least x is

    $$P(X \ge x) = 1 - \sum_{i=0}^{x-1} \frac{v^i}{i!} e^{-v}$$
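
The two formulas translate directly into code. The function names below are hypothetical, and the sketch favours readability over numerical robustness for large x.

```python
import math


def poisson_pmf(x, v):
    """P(X = x) for a Poisson distribution with parameter v."""
    return v ** x / math.factorial(x) * math.exp(-v)


def poisson_tail(x, v):
    """P(X >= x) = 1 - sum_{i=0}^{x-1} P(X = i)."""
    return 1.0 - sum(poisson_pmf(i, v) for i in range(x))
```

For x = 1 the tail reduces to $1 - e^{-v}$, which is the form used for the P-value of an HSP later in the lecture.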


    Statistical significance of an HSP

    Problem

    Given an HSP (s, t) with score $\sigma(s, t)$. How significant is this match (i.e., local alignment)?

    Given the scoring matrix $S(a, b)$, the expected score for aligning a random pair of amino acids is required to be negative:

    $$E = \sum_{a,b} p_a p_b S(a, b) < 0$$


    Statistical significance of an HSP

    HSP scores are characterized by two parameters, $K$ and $\lambda$. The parameters $K$ and $\lambda$ depend on the background probabilities of the symbols and on the employed scoring matrix. $\lambda$ is the unique positive value for $y$ that satisfies the equation

    $$\sum_{a,b} p_a p_b e^{S(a,b)\, y} = 1$$

    $K$ and $\lambda$ are scaling factors for the search space and for the scoring scheme, respectively.
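
Since the equation for $\lambda$ has no closed form in general, it is usually solved numerically. The sketch below uses plain bisection and assumes a negative expected score and at least one positive score, so that exactly one positive root exists. The function name, the uniform DNA background and the +1/-3 scoring in the usage example are illustrative assumptions.

```python
import math


def solve_lambda(probs, score, tol=1e-10):
    """Find the positive y with sum_{a,b} p_a p_b exp(S(a,b) * y) = 1 by bisection."""
    symbols = list(probs)

    def f(y):
        return sum(probs[a] * probs[b] * math.exp(score(a, b) * y)
                   for a in symbols for b in symbols) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) <= 0:          # grow the bracket until f changes sign
        lo, hi = hi, hi * 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


uniform_dna = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform_dna, lambda a, b: 1 if a == b else -3)
# lam is approximately 1.374 for this toy scoring scheme
```

Computing $K$ is more involved and is not attempted here.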


    Statistical significance of an HSP

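    The number of random HSPs (s, t) with $\sigma(s, t) \ge S$ can be described by a Poisson distribution with parameter $v = K m n e^{-\lambda S}$, where $m$ and $n$ are the lengths of the query and of the database. The number of HSPs with score $\ge S$ that we expect to see due to chance is then the parameter $v$, also called the E-value:

    $$E(\text{HSPs with score} \ge S) = K m n e^{-\lambda S}$$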


    Hence, the probability of finding exactly $x$ HSPs with a score $\ge S$ is given by

    $$P(X = x) = e^{-E} \frac{E^x}{x!},$$

    where $E$ is the E-value for $S$. The probability of finding at least one HSP with score $\ge S$ by chance is

    $$P(S) = 1 - P(X = 0) = 1 - e^{-E}.$$
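
The E-value and the resulting P-value can be computed in a few lines. The function names are hypothetical, and the values of K and $\lambda$ must be supplied by the caller; the numbers in the usage example are placeholders, not parameters of any particular scoring matrix.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """Expected number of chance HSPs with score >= raw_score: K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of at least one chance HSP: 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


e = e_value(raw_score=60, m=300, n=1_000_000, K=0.1, lam=0.3)  # placeholder parameters
p = p_value(e)
```

For small E the two quantities are nearly identical, since $1 - e^{-E} \approx E$.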


    We would like to hide the parameters $K$ and $\lambda$ to make it easier to compare results from different BLAST searches.

    For a given HSP (s, t) we transform the raw score $S$ into a bit-score:

    $$S' := \frac{\lambda S - \ln K}{\ln 2}.$$

    Such bit-scores can be compared between different BLAST

    searches, as the parameters of the given scoring systems are

    subsumed in them.


    Significance of a bit-score

    To determine the significance of a given bit-score $S'$ the only additional value required is the size of the search space. Since $S = (S' \ln 2 + \ln K)/\lambda$, we can express the E-value in terms of the bit-score as follows:

    $$E = K m n e^{-\lambda S} = K m n e^{-(S' \ln 2 + \ln K)} = m n 2^{-S'}.$$
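
The conversion can be checked numerically with a short sketch. The function names and the parameter values in the usage example are assumptions for illustration.

```python
import math


def bit_score(raw_score, K, lam):
    """S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Placeholder parameters; both routes to the E-value should agree.
S, K, lam, m, n = 50, 0.1, 0.3, 250, 500_000
direct = K * m * n * math.exp(-lam * S)
via_bits = e_value_from_bits(bit_score(S, K, lam), m, n)
```

Here `direct` and `via_bits` agree up to floating-point rounding, which mirrors the derivation above.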


    FASTA

    FASTA (pronounced fast-ay) is a heuristic for finding significant matches between a query string q and a database string d. It is the older of the two heuristics introduced in the lecture.

    FASTA's general strategy is to find the most significant diagonals in the dot-plot or dynamic programming matrix.

    The algorithm consists of four phases: hashing, 1st scoring, 2nd scoring, alignment.


    Phase 1: Hashing

    The first step of the algorithm is to determine all exact matches of length k (word-size) between the two sequences, called hot-spots.

    A hot-spot is given by (i, j), where i and j are the locations (i.e., start positions) of an exact match of length k in the query and database sequence respectively.

    Any such hot-spot (i, j) lies on the diagonal (i − j) of the dot-plot or dynamic programming matrix.
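
The hashing phase can be sketched with a dictionary of query k-mers. The function names are hypothetical and positions are 0-based, which is a choice of this example.

```python
from collections import defaultdict


def hot_spots(query, database, k):
    """Return all (i, j) such that query[i:i+k] == database[j:j+k]."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    spots = []
    for j in range(len(database) - k + 1):
        for i in positions.get(database[j:j + k], []):
            spots.append((i, j))
    return spots


def group_by_diagonal(spots):
    """Group hot-spots by their diagonal number i - j."""
    diagonals = defaultdict(list)
    for i, j in spots:
        diagonals[i - j].append((i, j))
    return dict(diagonals)


spots = hot_spots("ACTGAC", "TACCGA", 2)
diagonals = group_by_diagonal(spots)
```

For the sequences ACTGAC and TACCGA with k = 2, the hot-spots (0, 1) and (3, 4) fall on diagonal -1 and the hot-spot (4, 1) on diagonal 3, so diagonal -1 is the candidate for a diagonal run.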


    Phase 1: Hashing

    Using this scheme, the main diagonal has number 0 (i = j), whereas diagonals above the main one have positive numbers (i


    Phase 2+3:

    The ten diagonal runs with the highest scores are processed further. Within each of these runs an optimal local alignment is computed using a substitution score matrix. These alignments are called initial regions.

    The score of the best sub-alignment found in this phase is reported as init1.

    The next step is to combine high scoring sub-alignments into a single larger alignment, allowing the introduction of gaps into the alignment. The score of this alignment is reported as initn.


    Phase 4:

    Finally, a banded Smith-Waterman dynamic program is used to produce an optimal local alignment along the best matched regions. The center of the band is determined by the region with the score init1, and the band has width 8 for ktup = 2. The score of the resulting alignment is reported as opt.

    In this way, FASTA determines a highest scoring region, not all high scoring alignments between two sequences. Hence, FASTA may miss instances of repeats or multiple domains shared by two proteins.

    After all sequences of the database have thus been searched, a statistical significance similar to the BLAST statistics is computed and reported.
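
A banded Smith-Waterman computation can be sketched by restricting the dynamic program to cells near a chosen diagonal. The function name, the `half_width` parameterisation of the band, linear gap costs and the +1/-1/-2 scoring are assumptions for illustration; the sketch returns only the score (the analogue of opt), not the alignment.

```python
def banded_local_score(q, d, center, half_width, match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells (i, j) with
    |(i - j) - center| <= half_width (i, j are 1-based matrix indices)."""
    best = 0
    prev = {}                      # previous row: column index -> score
    for i in range(1, len(q) + 1):
        curr = {}
        lo = max(1, i - center - half_width)
        hi = min(len(d), i - center + half_width)
        for j in range(lo, hi + 1):
            sub = match if q[i - 1] == d[j - 1] else mismatch
            # Cells outside the band count as 0, i.e. a fresh local start.
            cell = max(0,
                       prev.get(j - 1, 0) + sub,
                       prev.get(j, 0) + gap,
                       curr.get(j - 1, 0) + gap)
            curr[j] = cell
            best = max(best, cell)
        prev = curr
    return best
```

The work per row is bounded by the band width rather than by the length of the database sequence, which is what makes this final phase affordable. Any alignment that leaves the band is missed, so the result depends on how well the earlier phases located the band center.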


    FASTA

    Example

    Two sequences, ACTGAC and TACCGA: the hot spots for k = 2 are marked as pairs of black bullets, and a diagonal run is shaded in dark grey. An optimal sub-alignment in this case coincides with the diagonal run. The light grey shaded band of width 3 around the sub-alignment denotes the area in which the optimal local alignment is searched.


    Comparing BLAST and FASTA

    [Figure: two panels, (a) BLAST and (b) FASTA, each plotting Query against Database.]


    BLAT - BLAST-Like Alignment Tool

    BLAT (Kent, 2002) is reported to be more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments, and 50 times faster for protein alignments, at sensitivity settings typically used when comparing vertebrate sequences.

    BLAT's speed stems from an index of all non-overlapping K-mers in the genome. The program has several stages: It uses the index to find regions in the genome that are possibly homologous to the query sequence. It performs an alignment between such regions. It stitches together the aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons and adjusts large gap boundaries that have canonical splice sites where feasible.

    BLAST and FASTA

    Introduction BLAST Statistical analysis FASTA

    http://find/
  • 8/13/2019 BLAST Slides BLAST ALGORITHM

    38/38

    MPI for Developmental Biology, Tubingen

    logo

    Job Advertisement

    I encourage applications for

    Student research projects

    Diploma thesis projects

    HiWi jobs

    http://www2.tuebingen.mpg.de/abt4/plone/BT
