Overview Two stage search Indexing on intervals Frames Scoring Optimization Results Questions + answers.

Overview

Two stage search Indexing on intervals Frames Scoring Optimization Results Questions + answers

Two-stage searching

Coarse searching Uses heuristics to find promising alignments Interval matching

Fine searching Intensive processing of results from coarse search Calculates the actual score of the alignments by using

more detailed information from the matching sequences

This requires retrieval of sequences from the database which is expensive

Intervals

Overlapping intervals:Sequence: ACCTGACG, with length l=8 Interval length: n=3 Intervals: ACC, CCT, TGA, GAC, ACG

Indexed VS. Exhaustive

Exhaustive:Every sequence in the database is retrieved

and processed, this is costly for large databases

Often use heuristics to reduce the number of sequences that need to be aligned

Popular exhaustive systems FASTA, BLAST1, BLAST2 are based on Wilbur-Lipman interval approach

Indexed VS. Exhaustive

Wilbur-LipmanBuild a hashing structure with all intervals in

the query sequence as keysHash all intervals in the entire database, and

if the interval is present in the hashing structure we know that there is a match

This is faster than walking through the query sequence for every interval in the database

Wilbur-Lipman

A C T G A C T C

A C T

C T G

T G A

G A C

C T C

0 4

5

1

3

2

intervaloffsets

query sequence

hashing structure

Indexed search

Instead of retrieving all sequences from the database, use an index to locate promising alignments

Coststorage of the index

Advantages fast lookup, only index needs to be accessedenables partial retrieval (for fine searching)

Indexed search

FLASH Enables gapped searching by indexing permuted

intervals Example:

Interval length = 5, subsequence length = 3 Sequence: ACCTGATT, results in: ACC, ACT, ACG ACT ACG ATG

Problem: Index becomes huge

Indexed search in CAFE

(similar to RAMDB) Index contains:

Search structure Searching is done on keys, these keys

represent overlapping intervals that occur in the database

Posting lists For every interval a list is stored

containing references to all occurences of this interval. These references contain a sequence id and an offset

Indexed search in CAFE

Posting lists can be long Compressed to reduce used disk space

AA

AC

8 12 1117

4

3

8

6 142

2011

interval sequence idoffsets

Compression techniques

To reduce the size of the index posting lists are compressed

Instead of storing each sequence number of the sequence in which the interval occurs, store the first one and use relative offsets for the rest

Eg. 101, 109, 217, 412, 980, 1013 becomes 101, 8, 108, 195, 568, 33

Compression techniques

Of course the same same can be done with the places where the interval occurs inside a query

Also Elias Gamma and Delta coding But just compression was not good

enoughUse heuristics to reduce the size of the index

even further

Heuristics to decrease index size

Limit indexing of wildcardswildcards are instantiated before being

indexed.Several adjacent wildcards may lead to an

explosion of matches (Eg. NNNNN)Solution: only index intervals containing at

most one wildcard

Heuristics to decrease index size

Another one:Do not create an entry in the index for

intervals that occur in more than x% of the sequences

Coarse searching

Uses FRAMES “A frame is a set of one or more matching

intervals between a database sequence and a query sequence that are at the same relative offset”

FRAMES

Frames are constructed using the offsets stored in the index

A frame can be represented by a combination of sequence id and offset

The information contained in frames is used to determine the most promising alignments

FRAMES

F2 : (9,7), (10,8), (27,25), (28,26), (29,27), (30,28), (31,29), (32,…)

F1 : (17,16), (18,17)

F26 : (53,27), (54,28)

Constructing frames

For every interval match with offsets x and y:Calculate the relative offset z If a frame with this relative offset (Fz) already

exists, append (x, y) to this frameOtherwise, create a new frame containing (x, y)

Frame ranking schemes

First approach: FRAMECOUNTCount the amount of matching intervals for

each frame, and take the maximum over all frames

FRAMES

F2 : Contains 8 matching intervals



FRAMECOUNT = 8, resulting from frame F2


Problem with FRAMECOUNT:Does not take into account relative positioning

of intervals within a single frame COVERAGE:

Amount of matching bases per frameTakes into account overlapping of intervals

Comparing FRAMECOUNT and COVERAGE: A : FRAMECOUNT = 7, COVERAGE = 9

B : FRAMECOUNT = 7, COVERAGE = 21

COVERAGE seems to be a more accurate scoring scheme


Another scheme: LENGTHAmount of bases between first and last base

contained in matching intervals in a frame

A: LENGTH = 21 B: LENGTH = 55 C: LENGTH = 55


What we did not understand: “Despite this, the length scheme is particularly attractive since it

ranks highly regions that are longer and, therefore, will rank long homologous alignments ahead of shorter alignments.”

COMBINED scheme: COVERAGE and LENGTH combined Takes into account residues that are not part of matching intervals COMBINED = COVERAGE – k * (LENGTH – COVERAGE) The value for k is determined empirically (<< 1)

Scoring with amino acids

COVERAGE scheme can be improved by using a substitution matrix Instead of counting the number of matches,

use the sum of appropriate values in the substitution matrix

Interval scores can be cached

Normalizing scores

To compensate for increased score with longer sequences:Snorm = (S * 21.21)/(ln l1 * ln l2)

“l1 and l2 are the lengths of the two aligned regions” We assume they mean the lengths of the

aligned sequences, since we have read that this is used by Shpaer et al.

Optimising frames

Apply a fixed ceiling on the amount of frames. Decreases memory usage and CPU time Which frames are most important ?

Frames containing the most discriminating intervals So we should look up the most discriminating intervals first. The database-frequency of each interval occuring in the

query is determined After sorting them, we lookup the intervals with the lowest

frequency first, in this way have the best discrimination.

Optimising frames

NEIGHBOURHOOD schemeGives higher scores for frames that are close

to other framesAllows gaps (indels), by “combining” framesFor every other frame in the same sequence:

add s1/d to the score, where s is the score of the other frame d is the difference in offset from the other frame

Optimising frames

Information stored for frames can be (re)used for fine searching:Matching regions can be used as a starting

point for fine-searchingFrames allow us to partially retrieve all

relevant sequences from the database

Test data

PIR database used to assess accuracy Sequences used that are classified in super families Single member families removed Filtered test set: 1834 sequences Only query sequences with #residues < 500 Precision and recall are measured

Precision = Relevant results/Total results Recall = Relevant results/Total relevant

Test data

Used for assessing speed and index size:Genbank databases:

GENBANK97: 652 mln. nucleotide bases in 1 mln. sequences

GENBANK108: 1797 mln. nucleotide bases in 2.5 mln. sequences

VERTE: 177 mln. nucleotide bases in 0.12 mln. sequences

Results

CAFE compared to BLAST and FASTA: CAFE has similar precision CAFE becomes relatively faster with increasing

database size

Further observations For CAFE, queries take more time and memory for

processing when their intervals occur often in the database

For BLAST and FASTA, performance drops when the entire database can not be stored in memory

Questions Bogdan:

Question 1(overlapping intervals): Why do they use overlapping intervals? Is that

because if you have the string “ABCDEF” and the intervals “ABC” and “DEF”, that than a query interval “BCD” wouldn’t match ?

Answer: We think your intuition is correct, if we do not use

overlapping intervals lots of existing intervals would be missed

Questions Question 2 (optimising frames):

What do they mean with sorting the query intervals,and how does sorting make them discriminate well between sequences?

Answer: Query intervals: the intervals that are generated from the query

sequence. The database-frequency of each interval occuring in the query is determined After sorting them, we lookup the intervals with the lowest frequency first, in

this way have the best discrimination.

Question 3: “Could you explain the two alternatives when the threshold is reached?”

Answer: Option 1: Stop checking matches immediately when the ceiling is reached Option 2: Keep adjusting existing frames with new matches

Questions

Marjolijn:Question:

Can you explain the difference between an inverted index and a fine-grain index?

Answer: Only one type of index is used This index is an inverted list The index also has a high granularity (fine-grain)

Questions

Lee:Question 1 (compression):

What sorts of compression are used to decrease the size of the index ?

Answer: See the part of this presentation about

compression

Questions

Question 2: “They say that the problem with uncompressed lists is that

you can be penalized in time by the disk retrievals. On page 22 they are saying "However, these algorithms are highly reliant on having sufficent memory to store the complete database." So if I understand it right, they store the index on the disk and the database in main memory? Isn't it more common to put the index in main memory and the database on disk?

Answer: The sentence from page 22 is not about CAFE, but about

two other methods: FASTA and BLAST. These methods do not use an inverted index and hence

need fast access to the database.

Questions Jacob:

Question(COMBINED): “I was wondering if the combination of coverage and length as

mentioned on page 14 is sufficient for scoring alignments. I believe that there are lots of examples with different alignments which would score the same. Like this one

CGATCGAATAGCATCGTGGCGGTGAGCGGTTTCTGTTTCTGTTCTT :::::: :::TTATCGAATAGCGGCGCTAGCATCGATCATTCTACTTTCAAACTGC

CGATCGAATAGCATCGTGGCGGTGAGCGGTTTCTGTTTCTGTTCTT ::: ::: :::TTATCTCTATAGCGGCGGGCGCATCGATCATTCTCTTTCAAACTGC

Answer: We should investigate whether clustering influences the probability of

a successful allignment If so we could try to think of a new scoring scheme:

Scheme: CLUSTERING Calculate the degree of clustering within a frame

Questions

Laurence: Question:

“On page 15 in section 3.3 in paragraph 4 it says that composition alignment between matching intervals in the length metric is difficult to model without fine-searching the region. Therefore they elect not to modify the LENGTH scheme for complex models but rather apply statistical normalisation to incorporate variations in composition the resultant score.

Can you maybe explain in further detail why is chosen to apply statistical normalisation in contrast to modifying the LENGTH scheme, and what does fine-searching the region have to do with this”

Answer: LENGTH misses compositional information Compositional information requires retrieval of database sequences However, when random query and database strings get larger, average

allignment scores get higher. To rule this out we do a normalisation. By doing this, length still plays an important role in scoring.

Questions Bram:

Question: “For COVERAGE they use pre-calculations of the scores to

incorporate variations in composition in the resultant score. In LENGTH they use the Shpaer normalisation for it. At the end of the paragraph they propose to use the Shpaer scheme to normalise the COVERAGE scores. What's the advantage of normalising the COVERAGE scores, if we already did the pre-calculation?”

Answer: You probably think of normalisation as ruling out which substitution

matrix was used, that’s not what is meant. The Shpaer normalisation is applied to values calculated for frames

Questions

Adriano Question:

page 22 “In CAFE, evaluation times depend on query length and on statistics of the intervals in the query; intervals with longer inverted lists require more processing. What does "longer inverted lists" mean? and why does it require more processing?”

Answer: Inverted lists = posting lists for matching intervals For every matching interval we need to process its posting

list in order to constuct frames

Overview Two stage search Indexing on intervals Frames Scoring Optimization Results Questions + answers.

Documents

database slide

index intervals

acg slide

huge slide

offset slide

expensive slide

wildcard slide

fine searching slide