
Jan 20, 2016


TEMPLATE DESIGN © 2007

www.PosterPresentations.com

SSAHA: Search with Speed

Nick Altemose, Kelvin Gu, Tiffany Lin, Kevin Tao, Owen Astrachan

Duke University Focus Program: The Genome Revolution and Its Impact on Society

Introduction

APT: k-tuple Coordinates in a Hash Table

How Does SSAHA Code for Base Pairs?

Works Referenced

Contact Information

Nick Altemose, Student, Duke University Trinity College of Arts and Sciences, email: [email protected]
Owen Astrachan, Professor, Duke University Department of Computer Science, email: [email protected]
Kelvin Gu, Student, Duke University Trinity College of Arts and Sciences, email: [email protected]
Tiffany Lin, Student, Duke University Trinity College of Arts and Sciences, email: [email protected]
Kevin Tao, Student, Duke University Pratt School of Engineering, email: [email protected]

Introduction

•The human genome contains an immense amount of information: 3 billion base pairs, or about 3 gigabytes of storage space. In fact, if you were to read it aloud continuously at the alarming rate of 10 base pairs per second, it would take you about 9.5 years to get through it all.

•When you want to search for a sequence in the genome, you must use smart search methods like those used by internet search engines. If you were to search through the entire genome one base pair at a time, it could take hours to get back any results, even with the best computers.

•SSAHA is a search method that was developed with one goal in mind: speed.

•SSAHA takes a sequence database, like the 3-gigabase human genome reference sequence, and organizes it into a fast, searchable index known as a hash table.

•This organization is the key to SSAHA’s speed, and the origin of its name: Sequence Search and Alignment by Hashing Algorithm.

•The hash table is like a dictionary to the English language, or an index to an encyclopedia. It allows for fast lookup of organized information. After making this hash table, it is then very easy to search for a given sequence.
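
The reading-aloud estimate in the first bullet can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the genome size and reading rate come from the poster, while the average year length of 365.25 days is an assumption made here.

```python
# Back-of-the-envelope check of the reading-aloud estimate.
GENOME_BASES = 3_000_000_000                # 3 billion base pairs (from the poster)
BASES_PER_SECOND = 10                       # reading rate (from the poster)
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60    # assumption: average year length

seconds_needed = GENOME_BASES / BASES_PER_SECOND
years_needed = seconds_needed / SECONDS_PER_YEAR

print(f"{years_needed:.1f} years")
```

Under these assumptions the result rounds to 9.5 years, in line with the figure quoted above.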

How Does SSAHA Code for Base Pairs?

•SSAHA stores each nucleotide base as 2 bits of information:

A is stored as 00
C is stored as 01
G is stored as 10
T is stored as 11

•This allows for fast and efficient storage and processing, using less memory and less processor power.

•However, the drawback is that only 4 characters are possible, and real DNA data sometimes has gaps and ambiguous characters.

•Other search engines treat these ambiguities as separate letters, like R or N (N means it could be any base).

•SSAHA converts all ambiguous characters to A, since AAAAA… is the most common ‘word’ in the genome and the search can easily recognize it as uninformative.
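
The encoding described in these bullets can be sketched in a few lines of Python. This is an illustration rather than SSAHA's actual code: the function names `encode_bits` and `encode_int` are hypothetical, and the bit order used when packing into an integer (first base most significant) is an assumption.

```python
# 2-bit codes from the poster; anything else (N, R, gaps, ...) is stored as A.
BASE_CODES = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode_bits(sequence):
    """Return the 2-bit-per-base encoding of `sequence` as a string of 0s and 1s."""
    return "".join(BASE_CODES.get(base, BASE_CODES["A"]) for base in sequence.upper())


def encode_int(sequence):
    """Pack the same bits into one integer (assumed order: first base most significant)."""
    return int(encode_bits(sequence), 2) if sequence else 0


print(encode_bits("ACTNGNCA"))  # 0001110010000100
print(encode_int("CG"))         # binary 0110, i.e. 6
```

Both N characters in ACTNGNCA come out as 00, the code for A, which is the fallback behavior described above.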

APT: k-tuple Coordinates in a Hash Table

Problem Statement
In this APT, you will generate a string similar to a single hash table entry. Given an array of DNA strings (our database) and a dimer string (our 2-tuple), output a string containing the positions of every occurrence of that 2-tuple in the database (not limited to a single reading frame). Each position takes the form “(array element index, character index)”, with both indices 0-based, and positions should be sorted in ascending order, first by array element index, then by character index. Note that these positions count from 0, unlike actual SSAHA positions, which count from 1.

Notes
Positions should be space-delimited in the string. For example, an output string containing 5 positions should have this format: “(0,2) (1,3) (2,5) (5,3) (6,7)”

Constraints
• database can contain strings of any length, including 0 and 1.
• All input strings will contain only the characters ‘A’, ‘G’, ‘C’, or ‘T’ in any combination.
• twotuple will always have exactly 2 characters.

Examples
1. {“ATTCGT”, “TACGG”, “GAATCA”, “G”} “CG” Returns: “(0, 3) (1, 2)”

2. {“TTCGT”, “AAATC”, “ATATATTTA”, “TCGAAATG”} “AT” Returns: “(1, 2) (2, 0) (2, 2) (2, 4) (3, 5)”

3. {“TTTTT”, “ATTCG”} “TT” Returns: “(0, 0) (0, 1) (0, 2) (0, 3) (1, 1)”

4. {“AAAAA”} “GG” Returns: “”

For full APT, see http://www.cs.duke.edu/courses/cps004g/fall07/apt/
For sample solution, see http://www.duke.edu/~krt10/
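
One way to approach this APT is sketched below in Python. It is not the sample solution linked above: the function name `tuple_positions` is hypothetical, and the output format with a space after the comma follows the worked examples rather than the Notes, so treat that spacing as an assumption.

```python
def tuple_positions(database, twotuple):
    """Return the space-delimited "(index, offset)" positions of `twotuple`."""
    positions = []
    for index, sequence in enumerate(database):
        # Slide one base at a time so that every reading frame is checked.
        # Strings shorter than the tuple give an empty range and are skipped.
        for offset in range(len(sequence) - len(twotuple) + 1):
            if sequence[offset:offset + len(twotuple)] == twotuple:
                positions.append(f"({index}, {offset})")
    # Positions are already ordered by index, then offset, by construction.
    return " ".join(positions)


print(tuple_positions(["ATTCGT", "TACGG", "GAATCA", "G"], "CG"))  # (0, 3) (1, 2)
print(tuple_positions(["TTTTT", "ATTCG"], "TT"))
print(repr(tuple_positions(["AAAAA"], "GG")))                     # ''
```

In “TCGAAATG” the dimer AT begins at character index 5, so for the second example this sketch reports (3, 5) as the last position.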

Works Referenced

1. Genome size data from the Department of Energy website: http://www.ornl.gov/sci/techresources/Human_Genome/faq/faqs1.shtml

2. SSAHA data from Genome Research 2001 volume 11, pages 1725-1729: “SSAHA: A Fast Search Method for Large DNA Databases”

Example: the sequence A C T N G N C A is stored as 00 01 11 00 10 00 01 00 (each N is stored as A).

SSAHA Step-by-Step

1) Break sequences in database into k-tuples.

2) Make a hash table. Ideally, each cell in the hash table is allotted to one unique k-tuple. Each cell contains the locations where that k-tuple occurs. A location specifies the sequence in which the k-tuple occurs and the index within that sequence where it begins.

3) The query sequence is also broken into k-tuples.

4) These query k-tuples are matched to k-tuples in the database. Matches are noted.

5) SSAHA returns search results, listing the sequences from the database that best match the query sequence. The degree of similarity is based on the number of matching k-tuples and how consecutively they match (this involves pathing, which is not explained here), among other factors, and the search attempts to account for insertions, deletions, and SNPs.
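
The five steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not SSAHA itself: the function names are hypothetical, the database is split into non-overlapping k-tuples while the query uses overlapping ones, and counting matches that share the same shift (database offset minus query offset) stands in for the pathing step, which the poster does not explain. Gaps and SNPs are not handled here.

```python
from collections import Counter, defaultdict


def build_hash_table(database, k):
    """Steps 1-2: map each k-tuple to the (index, offset) positions where it occurs."""
    table = defaultdict(list)
    for index, sequence in enumerate(database):
        # Assumption: non-overlapping k-tuples (offsets 0, k, 2k, ...).
        for offset in range(0, len(sequence) - k + 1, k):
            table[sequence[offset:offset + k]].append((index, offset))
    return table


def search(table, query, k):
    """Steps 3-5: count k-tuple matches per (sequence index, shift) and rank them."""
    hits = Counter()
    for query_offset in range(len(query) - k + 1):  # overlapping query k-tuples
        ktuple = query[query_offset:query_offset + k]
        for index, offset in table.get(ktuple, []):
            # Matches that share a shift line up without gaps.
            hits[(index, offset - query_offset)] += 1
    return hits.most_common()  # best-supported (index, shift) first


database = ["ATTCGTAC", "GGATTCGA"]
table = build_hash_table(database, k=2)
print(search(table, "TCGTAC", k=2))
```

With this toy database the query TCGTAC gets three matching 2-tuples in sequence 0 at shift 2 and only one in sequence 1, so sequence 0 is ranked first.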

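[Figure: the sequences in the database are broken into k-tuples (k) and stored in the hash table; the query sequence is broken into k-tuples (k), which are matched (=) against the hash table.]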

How Does a Hash Table Work?

•The hash table stores positions at which different short sequences, called k-tuples, occur in the database. A position looks like this: (index, offset), where index identifies which DNA strand contains the k-tuple, and offset identifies at what position in the DNA strand the k-tuple begins.

•In lay terms, SSAHA likes to keep things organized, like a meticulously tidy person who puts everything in labeled Tupperware, which they then organize into labeled drawers. The index is like the labeled drawer, and the offset is like the labeled Tupperware. If you tell them to get something from the third drawer, fourth Tupperware, they can quickly and easily locate what you’re looking for. Or, if you ask where you can find something, they can easily tell you in which drawer and Tupperware you’ll find it.

•You only have to make a hash table once for a given database, kind of like you only have to tidy your room up once, then you can find things extremely easily thereafter.
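
To make the (index, offset) idea concrete, here is a minimal sketch of a table with one cell per possible k-tuple, built once and then queried in both directions. The names `cell_number` and `build_table` are hypothetical, the cell numbering (first base most significant) is an assumption, and every overlapping k-tuple is stored for simplicity.

```python
K = 2
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # the 2-bit codes 00, 01, 10, 11


def cell_number(ktuple):
    """Turn a k-tuple into the number of its cell in the table."""
    number = 0
    for base in ktuple:
        number = number * 4 + CODE[base]
    return number


def build_table(database, k=K):
    """Build the table once: 4**k cells, each holding a list of (index, offset)."""
    cells = [[] for _ in range(4 ** k)]
    for index, strand in enumerate(database):
        for offset in range(len(strand) - k + 1):
            cells[cell_number(strand[offset:offset + k])].append((index, offset))
    return cells


database = ["ATTCGT", "TACGG", "GAATCA", "G"]
cells = build_table(database)

# "Where can I find CG?" -> every (index, offset) where it occurs.
print(cells[cell_number("CG")])            # [(0, 3), (1, 2)]

# "What is in drawer 1, Tupperware 2?" -> the k-tuple that starts there.
index, offset = 1, 2
print(database[index][offset:offset + K])  # CG
```

Building the table walks the whole database once; after that, each lookup goes straight to a single cell instead of rescanning the sequences.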

Visual of Hash Table Search