Overview Two stage search Indexing on intervals Frames Scoring Optimization Results Questions + answers.

Dec 18, 2015

Overview

Two stage search
Indexing on intervals
Frames
Scoring
Optimization
Results
Questions + answers

Two-stage searching

Coarse searching
  Uses heuristics to find promising alignments
  Interval matching

Fine searching
  Intensive processing of results from coarse search
  Calculates the actual score of the alignments by using more detailed information from the matching sequences
  This requires retrieval of sequences from the database, which is expensive

Intervals

Overlapping intervals:
  Sequence: ACCTGACG, with length l = 8
  Interval length: n = 3
  Intervals: ACC, CCT, CTG, TGA, GAC, ACG
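Interval extraction can be sketched in a few lines of Python. This is only an illustration: the function name `overlapping_intervals` is ours, not something from the paper. A sequence of length l yields l - n + 1 overlapping intervals of length n.

```python
def overlapping_intervals(sequence: str, n: int) -> list[str]:
    """Return every length-n substring, sliding one base at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


if __name__ == "__main__":
    # l = 8 and n = 3 give 8 - 3 + 1 = 6 intervals.
    print(overlapping_intervals("ACCTGACG", 3))
```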

Indexed VS. Exhaustive

Exhaustive:
  Every sequence in the database is retrieved and processed; this is costly for large databases
  Often use heuristics to reduce the number of sequences that need to be aligned
  Popular exhaustive systems (FASTA, BLAST1, BLAST2) are based on the Wilbur-Lipman interval approach

Indexed VS. Exhaustive

Wilbur-Lipman
  Build a hashing structure with all intervals in the query sequence as keys
  Hash all intervals in the entire database; if an interval is present in the hashing structure, we know that there is a match
  This is faster than walking through the query sequence for every interval in the database

Wilbur-Lipman

Query sequence: A C T G A C T C

Hashing structure (interval: offsets in the query sequence)
  ACT: 0, 4
  CTG: 1
  TGA: 2
  GAC: 3
  CTC: 5
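A minimal sketch of the Wilbur-Lipman style lookup, assuming a plain Python dict as the hashing structure. The function names and the short database sequence in the usage part are made up for illustration; real systems such as FASTA and BLAST add further heuristics on top of this idea.

```python
from collections import defaultdict


def build_query_table(query: str, n: int) -> dict[str, list[int]]:
    """Map every length-n interval of the query to its offsets in the query."""
    table = defaultdict(list)
    for offset in range(len(query) - n + 1):
        table[query[offset:offset + n]].append(offset)
    return dict(table)


def find_matches(table: dict[str, list[int]], db_sequence: str, n: int) -> list[tuple[int, int]]:
    """Scan one database sequence; return (database offset, query offset) pairs."""
    matches = []
    for db_offset in range(len(db_sequence) - n + 1):
        interval = db_sequence[db_offset:db_offset + n]
        # One hash lookup per database interval instead of a walk over the query.
        for query_offset in table.get(interval, []):
            matches.append((db_offset, query_offset))
    return matches


if __name__ == "__main__":
    table = build_query_table("ACTGACTC", 3)
    print(table)
    print(find_matches(table, "GGACTC", 3))  # "GGACTC" is a hypothetical database sequence
```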

Indexed search

Instead of retrieving all sequences from the database, use an index to locate promising alignments

Cost:
  Storage of the index

Advantages:
  Fast lookup, only the index needs to be accessed
  Enables partial retrieval (for fine searching)

Indexed search

FLASH
  Enables gapped searching by indexing permuted intervals
  Example:
    Interval length = 5, subsequence length = 3
    Sequence: ACCTGATT; the first interval, ACCTG, results in: ACC, ACT, ACG, ACT, ACG, ATG
  Problem: index becomes huge
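The example list can be reproduced with a short sketch. It rests on an assumption of ours: every subsequence keeps the first base of the interval and picks the remaining bases in order from the rest of the interval. That rule yields exactly the six subsequences listed for ACCTG, but it is not a confirmed description of how FLASH works.

```python
from itertools import combinations


def permuted_subsequences(interval: str, k: int) -> list[str]:
    """Length-k ordered subsequences of an interval that start with its first base.

    Assumption (not taken from the FLASH paper): the first base is always kept.
    """
    first, rest = interval[0], interval[1:]
    return [first + "".join(choice) for choice in combinations(rest, k - 1)]


if __name__ == "__main__":
    # Six index entries for a single interval, which hints at why the index grows so quickly.
    print(permuted_subsequences("ACCTG", 3))
```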

Indexed search in CAFE

(similar to RAMDB)

Index contains:
  Search structure
    Searching is done on keys; these keys represent overlapping intervals that occur in the database
  Posting lists
    For every interval a list is stored containing references to all occurrences of this interval
    These references contain a sequence id and an offset
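A small in-memory sketch of such an index, assuming a dict as the search structure and uncompressed lists of (sequence id, offset) pairs as posting lists. The toy database and the name `build_index` are hypothetical; CAFE's on-disk layout is not described in these slides.

```python
from collections import defaultdict


def build_index(database: dict[int, str], n: int) -> dict[str, list[tuple[int, int]]]:
    """Map each length-n interval to a posting list of (sequence id, offset)."""
    index = defaultdict(list)
    for seq_id in sorted(database):
        sequence = database[seq_id]
        for offset in range(len(sequence) - n + 1):
            index[sequence[offset:offset + n]].append((seq_id, offset))
    return dict(index)


if __name__ == "__main__":
    toy_database = {1: "ACCTG", 2: "CCTA"}  # hypothetical sequences
    for interval, postings in build_index(toy_database, 3).items():
        print(interval, postings)
```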

Indexed search in CAFE

Posting lists can be long
  Compressed to reduce used disk space

[Figure: posting lists for the intervals AA and AC; each entry holds a sequence id and the offsets at which the interval occurs]

Compression techniques

To reduce the size of the index, posting lists are compressed

Instead of storing the full number of each sequence in which the interval occurs, store the first one and use relative offsets (differences) for the rest

E.g. 101, 109, 217, 412, 980, 1013 becomes 101, 8, 108, 195, 568, 33
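The relative-offset idea (often called gap or delta encoding) is easy to check in code. The sketch below uses our own function names and assumes the sequence numbers are stored in ascending order.

```python
from itertools import accumulate


def delta_encode(numbers: list[int]) -> list[int]:
    """Keep the first number, then store each difference to the previous one."""
    if not numbers:
        return []
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


def delta_decode(gaps: list[int]) -> list[int]:
    """Running sums restore the original numbers."""
    return list(accumulate(gaps))


if __name__ == "__main__":
    ids = [101, 109, 217, 412, 980, 1013]
    print(delta_encode(ids))
    print(delta_decode(delta_encode(ids)) == ids)
```

The gaps are smaller than the original numbers, which is what makes a variable-length integer code worthwhile afterwards.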

Compression techniques

Of course the same can be done with the positions where the interval occurs inside a sequence

Also Elias gamma and delta coding

But compression alone was not good enough
  Use heuristics to reduce the size of the index even further
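For reference, the Elias gamma code writes a positive integer x as floor(log2 x) zero bits followed by the binary representation of x, so small gaps take few bits. The sketch below works on strings of '0' and '1' for readability; a real index would pack bits, and how CAFE applies the code in detail is not covered by these slides.

```python
def elias_gamma(x: int) -> str:
    """Encode a positive integer as an Elias gamma bit string."""
    if x < 1:
        raise ValueError("Elias gamma is only defined for integers >= 1")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits: str) -> list[int]:
    """Decode a concatenation of well-formed Elias gamma codes."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


if __name__ == "__main__":
    gaps = [8, 108, 195, 568, 33]
    encoded = "".join(elias_gamma(g) for g in gaps)
    print(encoded, len(encoded), "bits")
    print(elias_gamma_decode(encoded))
```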

Heuristics to decrease index size

Limit indexing of wildcards
  Wildcards are instantiated before being indexed
  Several adjacent wildcards may lead to an explosion of matches (e.g. NNNNN)
  Solution: only index intervals containing at most one wildcard
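A sketch of this heuristic for nucleotide data, assuming the only wildcard is N and that it stands for any of A, C, G or T. Other wildcard codes are ignored here, and the function name is ours.

```python
BASES = "ACGT"


def instantiate(interval: str, wildcard: str = "N") -> list[str]:
    """Return the concrete intervals to index for one raw interval.

    Intervals with more than one wildcard are skipped entirely;
    a single wildcard is expanded into every possible base.
    """
    count = interval.count(wildcard)
    if count > 1:
        return []
    if count == 0:
        return [interval]
    return [interval.replace(wildcard, base) for base in BASES]


if __name__ == "__main__":
    print(instantiate("ANC"))    # one wildcard: four index entries
    print(instantiate("NNNNN"))  # skipped; full expansion would give 4**5 = 1024 entries
```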

Heuristics to decrease index size

Another one:
  Do not create an entry in the index for intervals that occur in more than x% of the sequences
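As a sketch, this pruning can be applied to an in-memory index whose posting lists hold (sequence id, offset) pairs. The index layout, the function name and the threshold in the usage example are assumptions for illustration only.

```python
def prune_common_intervals(
    index: dict[str, list[tuple[int, int]]],
    num_sequences: int,
    max_percent: float,
) -> dict[str, list[tuple[int, int]]]:
    """Drop intervals that occur in more than max_percent of all sequences."""
    pruned = {}
    for interval, postings in index.items():
        sequences_with_interval = {seq_id for seq_id, _ in postings}
        if 100.0 * len(sequences_with_interval) / num_sequences <= max_percent:
            pruned[interval] = postings
    return pruned


if __name__ == "__main__":
    toy_index = {"AC": [(1, 0), (2, 3), (3, 1)], "GT": [(1, 5)]}  # hypothetical
    print(prune_common_intervals(toy_index, num_sequences=4, max_percent=50))
```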

Coarse searching

Uses FRAMES
  “A frame is a set of one or more matching intervals between a database sequence and a query sequence that are at the same relative offset”

FRAMES

Frames are constructed using the offsets stored in the index

A frame can be represented by a combination of sequence id and offset

The information contained in frames is used to determine the most promising alignments

FRAMES

F2 : (9,7), (10,8), (27,25), (28,26), (29,27), (30,28), (31,29), (32,…)

F1 : (17,16), (18,17)

F26 : (53,27), (54,28)

Constructing frames

For every interval match with offsets x and y:
  Calculate the relative offset z
  If a frame with this relative offset (Fz) already exists, append (x, y) to this frame
  Otherwise, create a new frame containing (x, y)
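The construction steps can be sketched as follows for a single database sequence. We assume the relative offset is z = x - y, which agrees with the frame examples in this presentation (the match (9,7) belongs to F2). The function name is ours.

```python
from collections import defaultdict


def build_frames(matches: list[tuple[int, int]]) -> dict[int, list[tuple[int, int]]]:
    """Group interval matches (x, y) by their relative offset z = x - y."""
    frames = defaultdict(list)
    for x, y in matches:
        # Appending to a missing key creates the new frame on the fly.
        frames[x - y].append((x, y))
    return dict(frames)


if __name__ == "__main__":
    matches = [(9, 7), (17, 16), (10, 8), (53, 27), (18, 17), (54, 28)]
    for z, frame in sorted(build_frames(matches).items()):
        print(f"F{z}:", frame)
```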

Frame ranking schemes

First approach: FRAMECOUNT
  Count the number of matching intervals for each frame, and take the maximum over all frames

FRAMES

F2 : Contains 8 matching intervals

F1 : Contains 2 matching intervals

F26 : Contains 2 matching intervals

FRAMECOUNT = 8, resulting from frame F2

Frame ranking schemes

Problem with FRAMECOUNT:
  Does not take into account relative positioning of intervals within a single frame

COVERAGE:
  Number of matching bases per frame
  Takes into account overlapping of intervals

Comparing FRAMECOUNT and COVERAGE:

A : FRAMECOUNT = 7, COVERAGE = 9

B : FRAMECOUNT = 7, COVERAGE = 21

COVERAGE seems to be a more accurate scoring scheme
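The difference between the two schemes can be reproduced with a small sketch. The interval length of 3 and the offsets below are our own choice: seven consecutive, fully overlapping intervals cover 9 bases, while seven non-overlapping ones cover 21, matching the values for A and B.

```python
def framecount(offsets: list[int]) -> int:
    """Number of matching intervals in a frame."""
    return len(offsets)


def coverage(offsets: list[int], n: int) -> int:
    """Number of distinct bases covered by the matching intervals of a frame."""
    covered = set()
    for start in offsets:
        covered.update(range(start, start + n))
    return len(covered)


if __name__ == "__main__":
    frame_a = [0, 1, 2, 3, 4, 5, 6]      # heavily overlapping intervals
    frame_b = [0, 3, 6, 9, 12, 15, 18]   # intervals that do not overlap
    for name, frame in (("A", frame_a), ("B", frame_b)):
        print(name, framecount(frame), coverage(frame, 3))
```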

Frame ranking schemes

Another scheme: LENGTH
  Number of bases between the first and last base contained in matching intervals in a frame

A: LENGTH = 21

B: LENGTH = 55

C: LENGTH = 55

Frame ranking schemes

What we did not understand:
  “Despite this, the length scheme is particularly attractive since it ranks highly regions that are longer and, therefore, will rank long homologous alignments ahead of shorter alignments.”

COMBINED scheme:
  COVERAGE and LENGTH combined
  Takes into account residues that are not part of matching intervals
  COMBINED = COVERAGE - k * (LENGTH - COVERAGE)
  The value for k is determined empirically (<< 1)
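A sketch of LENGTH and COMBINED on one frame, given the start offsets of its matching intervals. We assume LENGTH counts the span from the first to the last covered base inclusive. The offsets and the value k = 0.1 are made up for illustration; the slides only say that k is chosen empirically and is much smaller than 1.

```python
def coverage(offsets: list[int], n: int) -> int:
    """Number of distinct bases covered by the matching intervals."""
    covered = set()
    for start in offsets:
        covered.update(range(start, start + n))
    return len(covered)


def length(offsets: list[int], n: int) -> int:
    """Span from the first to the last base of any matching interval (inclusive)."""
    return max(offsets) + n - min(offsets)


def combined(offsets: list[int], n: int, k: float) -> float:
    """COMBINED = COVERAGE - k * (LENGTH - COVERAGE)."""
    cov = coverage(offsets, n)
    return cov - k * (length(offsets, n) - cov)


if __name__ == "__main__":
    frame = [0, 3, 20]  # hypothetical interval offsets within one frame
    print(coverage(frame, 3), length(frame, 3), combined(frame, 3, k=0.1))
```

The penalty term counts the bases inside the span that no matching interval covers, so two frames with equal COVERAGE are separated by how spread out their matches are.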

Scoring with amino acids

COVERAGE scheme can be improved by using a substitution matrix
  Instead of counting the number of matches, use the sum of the appropriate values in the substitution matrix

Interval scores can be cached

Normalizing scores

To compensate for increased score with longer sequences:
  S_norm = (S * 21.21) / (ln l1 * ln l2)

“l1 and l2 are the lengths of the two aligned regions”
  We assume they mean the lengths of the aligned sequences, since we have read that this is used by Shpaer et al.
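The normalisation is a one-liner. As an observation of ours (not stated on the slide), 21.21 is approximately (ln 100)^2, so a score between two sequences of length 100 is left practically unchanged while longer sequences are scaled down. The function name is hypothetical.

```python
import math


def normalise_score(score: float, l1: int, l2: int) -> float:
    """S_norm = (S * 21.21) / (ln l1 * ln l2); both lengths must exceed 1."""
    return (score * 21.21) / (math.log(l1) * math.log(l2))


if __name__ == "__main__":
    print(normalise_score(50.0, 100, 100))    # close to the raw score
    print(normalise_score(50.0, 1000, 1000))  # scaled down for longer sequences
```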

Optimising frames

Apply a fixed ceiling on the number of frames
  Decreases memory usage and CPU time

Which frames are most important?
  Frames containing the most discriminating intervals
  So we should look up the most discriminating intervals first
  The database frequency of each interval occurring in the query is determined
  After sorting them, we look up the intervals with the lowest frequency first; in this way we get the best discrimination
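A rough sketch of this lookup order with a frame ceiling. Several things are assumptions on our part: the index maps intervals to lists of (sequence id, offset), the database frequency is taken to be the posting list length, frames are keyed by (sequence id, relative offset), and once the ceiling is reached only existing frames keep receiving matches.

```python
def coarse_search(query, index, n, max_frames):
    """Build frames for a query, looking up the rarest intervals first."""
    # Offsets of every distinct interval in the query.
    query_offsets = {}
    for y in range(len(query) - n + 1):
        query_offsets.setdefault(query[y:y + n], []).append(y)

    # Lowest database frequency first: rare intervals discriminate best.
    ordered = sorted(
        (interval for interval in query_offsets if interval in index),
        key=lambda interval: len(index[interval]),
    )

    frames = {}
    for interval in ordered:
        for seq_id, x in index[interval]:
            for y in query_offsets[interval]:
                key = (seq_id, x - y)
                if key not in frames and len(frames) >= max_frames:
                    continue  # ceiling reached: do not open new frames
                frames.setdefault(key, []).append((x, y))
    return frames


if __name__ == "__main__":
    toy_index = {"ACG": [(1, 0)], "CGT": [(1, 1), (2, 5), (3, 2)]}  # hypothetical
    print(coarse_search("ACGT", toy_index, n=3, max_frames=2))
```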

Optimising frames

NEIGHBOURHOOD scheme
  Gives higher scores for frames that are close to other frames
  Allows gaps (indels), by “combining” frames
  For every other frame in the same sequence: add s1/d to the score, where
    s is the score of the other frame
    d is the difference in offset from the other frame

Optimising frames

Information stored for frames can be (re)used for fine searching:
  Matching regions can be used as a starting point for fine searching
  Frames allow us to partially retrieve all relevant sequences from the database

Test data

PIR database used to assess accuracy
  Sequences used that are classified in superfamilies
  Single-member families removed
  Filtered test set: 1834 sequences
  Only query sequences with #residues < 500

Precision and recall are measured
  Precision = Relevant results / Total results
  Recall = Relevant results / Total relevant

Test data

Used for assessing speed and index size:

GenBank databases:
  GENBANK97: 652 million nucleotide bases in 1 million sequences
  GENBANK108: 1797 million nucleotide bases in 2.5 million sequences
  VERTE: 177 million nucleotide bases in 0.12 million sequences

Results

CAFE compared to BLAST and FASTA:
  CAFE has similar precision
  CAFE becomes relatively faster with increasing database size

Further observations
  For CAFE, queries take more time and memory for processing when their intervals occur often in the database
  For BLAST and FASTA, performance drops when the entire database cannot be stored in memory

Questions

Bogdan:

Question 1 (overlapping intervals):
  Why do they use overlapping intervals? Is that because if you have the string “ABCDEF” and the intervals “ABC” and “DEF”, then a query interval “BCD” wouldn’t match?

Answer:
  We think your intuition is correct: if we do not use overlapping intervals, lots of existing intervals would be missed

Questions

Question 2 (optimising frames):
  What do they mean with sorting the query intervals, and how does sorting make them discriminate well between sequences?

Answer:
  Query intervals: the intervals that are generated from the query sequence
  The database frequency of each interval occurring in the query is determined
  After sorting them, we look up the intervals with the lowest frequency first; in this way we get the best discrimination

Question 3:
  “Could you explain the two alternatives when the threshold is reached?”

Answer:
  Option 1: Stop checking matches immediately when the ceiling is reached
  Option 2: Keep adjusting existing frames with new matches

Questions

Marjolijn:

Question:
  Can you explain the difference between an inverted index and a fine-grain index?

Answer:
  Only one type of index is used
  This index is an inverted list
  The index also has a high granularity (fine-grain)

Questions

Lee:

Question 1 (compression):
  What sorts of compression are used to decrease the size of the index?

Answer:
  See the part of this presentation about compression

Questions

Question 2:
  “They say that the problem with uncompressed lists is that you can be penalized in time by the disk retrievals. On page 22 they are saying "However, these algorithms are highly reliant on having sufficient memory to store the complete database." So if I understand it right, they store the index on the disk and the database in main memory? Isn't it more common to put the index in main memory and the database on disk?”

Answer:
  The sentence from page 22 is not about CAFE, but about two other methods: FASTA and BLAST
  These methods do not use an inverted index and hence need fast access to the database

Questions

Jacob:

Question (COMBINED):
  “I was wondering if the combination of coverage and length as mentioned on page 14 is sufficient for scoring alignments. I believe that there are lots of examples with different alignments which would score the same. Like this one”

CGATCGAATAGCATCGTGGCGGTGAGCGGTTTCTGTTTCTGTTCTT
  ::::::                           :::
TTATCGAATAGCGGCGCTAGCATCGATCATTCTACTTTCAAACTGC

CGATCGAATAGCATCGTGGCGGTGAGCGGTTTCTGTTTCTGTTCTT
  :::            :::               :::
TTATCTCTATAGCGGCGGGCGCATCGATCATTCTCTTTCAAACTGC

Answer:
  We should investigate whether clustering influences the probability of a successful alignment
  If so, we could try to think of a new scoring scheme:
    Scheme: CLUSTERING
    Calculate the degree of clustering within a frame

Questions

Laurence:

Question:
  “On page 15 in section 3.3 in paragraph 4 it says that composition alignment between matching intervals in the length metric is difficult to model without fine-searching the region. Therefore they elect not to modify the LENGTH scheme for complex models but rather apply statistical normalisation to incorporate variations in composition in the resultant score.

  Can you maybe explain in further detail why they chose to apply statistical normalisation in contrast to modifying the LENGTH scheme, and what does fine-searching the region have to do with this?”

Answer:
  LENGTH misses compositional information
  Compositional information requires retrieval of database sequences
  However, when random query and database strings get larger, average alignment scores get higher. To rule this out we do a normalisation
  By doing this, length still plays an important role in scoring

Questions

Bram:

Question:
  “For COVERAGE they use pre-calculations of the scores to incorporate variations in composition in the resultant score. In LENGTH they use the Shpaer normalisation for it. At the end of the paragraph they propose to use the Shpaer scheme to normalise the COVERAGE scores. What's the advantage of normalising the COVERAGE scores, if we already did the pre-calculation?”

Answer:
  You probably think of normalisation as ruling out which substitution matrix was used; that’s not what is meant
  The Shpaer normalisation is applied to values calculated for frames

Questions

Adriano:

Question:
  Page 22: “In CAFE, evaluation times depend on query length and on statistics of the intervals in the query; intervals with longer inverted lists require more processing.” What does "longer inverted lists" mean, and why does it require more processing?

Answer:
  Inverted lists = posting lists for matching intervals
  For every matching interval we need to process its posting list in order to construct frames