Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster Department of Computer Science and Engineering, Washington University in Saint Louis, MO Supported by an NIH STTR Grant & NSF Grants DBI-0237902, ITR-0313203, CCR-0217334
Biosequence Similarity Search on the Mercury System
Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster
Department of Computer Science and Engineering, Washington University in Saint Louis, MO
Supported by an NIH STTR Grant & NSF Grants DBI-0237902, ITR-0313203, CCR-0217334
Washington University in St. Louis Slide # 2
Outline
• Overview of BLAST
• Overview of the Mercury system
• Description of BLASTN algorithm
• Algorithmic changes to BLASTN
• Improvement in performance
• Related work
• Conclusion
Washington University in St. Louis Slide # 3
Basic Local Alignment Search Tool
• Biosequence comparison software
  – Query sequence (new genome) to large database of known biosequences
    • Look for similar regions
• Exponential growth of genomic databases
  – Longer time for searches to complete
  – Solutions
    • Perform comparison over multiple machines
    • Specialized hardware - Our Approach
• Proximity to disk
  – Simple operations performed close to disk
• Avoids CPU use
  – 400 Mbytes/s throughput from the disk
• Concurrent independent operation
  – Does not use processor cache cycles, memory or I/O buses
• Reconfigurable logic
  – Logic can be tuned to the particular need of the application
Washington University in St. Louis Slide # 6
BLASTN
• BLASTN
  – Both the query and the database are long DNA strings
  – Consist of {A, C, T, G} and some unknowns
• Each successive stage processes less data
• Each successive stage is more computationally expensive
Washington University in St. Louis Slide # 7
BLASTN - Terminology
Query: …ACTGTGTTTCACTGACGGGTGT…
Database: …CTGTGTCCCCAACACTGCTGACGTAGAATCGTGTAG…
‘w-mer’ is a sequence of ‘w’ consecutive bases
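To make the definition concrete, here is a minimal Python sketch that enumerates the w-mers of a sequence. The function name `wmers` is a hypothetical helper introduced only for illustration; the query string is the visible fragment from the slide, treated as if it were the whole sequence.

```python
def wmers(seq, w):
    """Yield (offset, w-mer) for every window of w consecutive bases in seq."""
    for i in range(len(seq) - w + 1):
        yield i, seq[i:i + w]

# Visible part of the query from the slide (the real sequence is longer).
query = "ACTGTGTTTCACTGACGGGTGT"

# A sequence of length n contains n - w + 1 overlapping w-mers.
for offset, mer in wmers(query, 11):
    print(offset, mer)
```

Adjacent w-mers overlap by w - 1 bases, so every base position except the last w - 1 starts a new w-mer.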
Washington University in St. Louis Slide # 8
BLASTN - Pipeline - Stage 1
• Matches each ‘11-mer’ in query to database
  – Exact string matching
• 83% of overall time is spent in this stage
• Filters 92% of data entering this stage
  – Only 8% of data proceeds to the next stage
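The exact-matching step can be sketched in software as a table lookup: index every 11-mer of the query, then slide an 11-base window over the database and report each hit. This is only an illustrative Python sketch, not the hardware design described in the talk; `build_query_index` and `find_seeds` are hypothetical names, and the two strings are the visible fragments from the stage 2 slide.

```python
from collections import defaultdict

def build_query_index(query, w=11):
    """Map each w-mer of the query to the list of offsets where it occurs."""
    index = defaultdict(list)
    for q in range(len(query) - w + 1):
        index[query[q:q + w]].append(q)
    return index

def find_seeds(query, database, w=11):
    """Return (query_offset, database_offset) for every exact w-mer match."""
    index = build_query_index(query, w)
    seeds = []
    for d in range(len(database) - w + 1):
        # Exact string matching: the database window must equal a query w-mer.
        for q in index.get(database[d:d + w], ()):
            seeds.append((q, d))
    return seeds

# Sequence fragments as shown on the slides (assumed complete for this demo).
query = "ACTGTGTTTCACTGACGGGTGT"
database = "GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG"
seeds = find_seeds(query, database)
```

Most database windows miss the table entirely, which is why this stage can discard the bulk of its input while still touching every database position.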
Washington University in St. Louis Slide # 9
BLASTN - Pipeline - Stage 2
• Extends the matches from stage 1
…ACTGTGTTTCACTGACGGGTGT…
…GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG…
Washington University in St. Louis Slide # 10
BLASTN - Pipeline - Stage 2
• Extends the matches from stage 1
  – Allows mismatches of individual bases
  – Does not allow gaps in either the query or the database
  – Match score should be higher than threshold to proceed
• 16% of pipeline time is spent in this stage
• Only 2/100,000 of data entering this stage proceeds to the next stage
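A rough Python sketch of ungapped extension follows. The slides do not give the scoring scheme or the stopping rule, so the match reward (+1), mismatch penalty (-3), and the `xdrop` cutoff are illustrative assumptions, and `extend_ungapped` is a hypothetical name. The seed used in the example is the 11-mer TTTCACTGACG shared by the two slide fragments at query offset 6 and database offset 12.

```python
def extend_ungapped(query, database, q, d, w=11,
                    match=1, mismatch=-3, xdrop=5):
    """Extend a w-mer seed at (q, d) left and right without gaps.

    Returns (score, query_start, database_start, length) of the best
    ungapped segment containing the seed. Scoring values are assumed.
    """
    def extend(qi, di, step):
        best = score = 0
        best_len = n = 0
        while 0 <= qi < len(query) and 0 <= di < len(database):
            # Mismatches are allowed but penalized; gaps are never opened.
            score += match if query[qi] == database[di] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > xdrop:
                break  # score fell too far below the best seen so far
            qi += step
            di += step
        return best, best_len

    right, right_len = extend(q + w, d + w, +1)
    left, left_len = extend(q - 1, d - 1, -1)
    score = w * match + left + right
    return score, q - left_len, d - left_len, left_len + w + right_len

# Sequence fragments as shown on the slides (assumed complete for this demo).
query = "ACTGTGTTTCACTGACGGGTGT"
database = "GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG"
result = extend_ungapped(query, database, q=6, d=12)
```

A caller would compare the returned score against the stage threshold and forward only the segments that exceed it; the threshold value itself is not stated in the slides.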
Washington University in St. Louis Slide # 11
BLASTN - Pipeline - Stage 3
• Extends the matches from stage 2
…ACCACTGTTTCACTGACG_GA_T_GT…
…CTGTGTCCCCAC_GTTTCACTGACGAGAATCGTGTAG…
Washington University in St. Louis Slide # 12
BLASTN - Pipeline - Stage 3
• Extends the matches from stage 2
  – Scores matches with gaps inserted in both the