Post on 15-Dec-2015
Copyright © 2004 Synamatix sdn bhd (538481-U)
SynaBASE™: A novel structured-network pattern database platform
for storage, ultra-high-throughput and sensitive data analysis
October 03 2006
Copyright © 2005 Synamatix sdn bhd (538481-U)
Aims
To learn about current research priorities and bioinformatics initiatives
To review Synamatix science and technologies
To demonstrate Synamatix performance capabilities
To explore potential fit and research synergies
Synamatix Introductions
Robert Hercus - Synamatix, MD and Inventor
Australian, over 30 years IT Sciences experience
Pioneered many large-scale IT projects
Dr. Arif Anwar - Synamatix, CEO
British, Ph.D. Oxford Uni./UCL
12 yrs+ post-Ph.D. US and EU genomics background
Silicon Genetics, Becton-Dickinson-CLONTECH
Poh Yang Ming - Synamatix, Senior Bioinformatician
Malaysian, B.Sc. Biotechnology, M.Sc. IT
6 yrs Biotechnology industry and research
IMCB, Singapore
MUST
Johan Poole-Johnson - Synamatix, Accounts Manager
Australian, B.Com, Murdoch University, Australia
8 yrs+ Multinational and Start-up Technology Companies
4 yrs Experience in science informatics
Core IP
Patented world-wide
SynaBASE™ database platform for high-throughput genomics
Market shifting towards very high-throughput genomics
High-growth market
Investing heavily in Personalised Genome and Healthcare revolution
Who's who list of customers
USA, Europe, Australia and Singapore
Core competencies
Algorithm development
Software and UI
Bioinformatics and HPC know-how
Training/Support
International Collaborations
Database platform flexibility
New Customers in 2006
Command line interface
CORE Database platform
SynaRex Bulk
SynaProbe Bulk
SynaSearch Bulk
SynaMer
SynaFrag
SXSequenceRefs
SXLRESearch
SXFuzzyPatternSearch
SXPet
SXParse
Data analysis
Develop Tools
Another 20+ apps
Graphical Interface
Open platform approach
Applications and Research
User or Synamatix
Internal/Custom development
Modify Synamatix Applications at source level
IP owned by User:
Why?
Current database platforms will not be able to scale to manage ever-increasing data volume and complexity
Novel database platform to meet needs, not a:
Suffix tree
Relational database
Suffix array
How?
What do we know about data?
Similarity & association
Common PATTERNS and functionality
[Figure: pattern network built from the sequence A T G C A T G A A T...; patterns shown at each level]
Level 1: A T G C
Level 2: AT TG GC CA GA AA
Level 3: ATG TGC GCA CAT TGA GAA AAT
Level 4: ATGC TGCA GCAT CATG
Level 5: ATGCA TGCAT
1. SynaBASE is very efficient - scales very well
When more data is added, the database does not grow proportionally, because sub-patterns may already exist
Only new leaf nodes are added; for patterns that already exist, references are stored
More efficient with more data
Every overlapping pattern, at every position is stored
Patterns are extended until they become unique
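The bullets above can be illustrated with a toy sketch. It is not the SynaBASE data structure, which these slides do not specify; it only counts every overlapping pattern at every position, level by level, and stops once every pattern at a level is unique. The function name `build_pattern_levels` is hypothetical.

```python
from collections import Counter


def build_pattern_levels(sequence):
    """Count every overlapping pattern of length 1, 2, 3, ... in `sequence`.

    Toy illustration only (an assumption, not the SynaBASE format):
    levels are added until every pattern at the current level occurs
    exactly once, i.e. patterns have been extended until unique.
    """
    levels = []
    k = 1
    while k <= len(sequence):
        counts = Counter(
            sequence[i:i + k] for i in range(len(sequence) - k + 1)
        )
        levels.append(counts)
        if all(n == 1 for n in counts.values()):
            break  # every pattern at this length is already unique
        k += 1
    return levels


levels = build_pattern_levels("ATGCATGAAT")
for length, counts in enumerate(levels, start=1):
    print(length, dict(counts))
```

For this ten-base example the repeated sub-patterns (such as ATG) are counted once per level rather than stored once per occurrence, which is the intuition behind the claim that growth is less than proportional.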
[Figure: Database size (Mbytes, 0 to 250) against number of Streptococcus pneumoniae R6 genomes (0 to 100), comparing SynaBASE with a flat file. S. pneumoniae R6 genome size = 2.068 Mbytes]
2. SynaBASE enables very fast access
Number of levels is small
For a query:
Match the first longest pattern
Follow an Eulerian path through the network, picking up the longest matching pattern for each position in the query
Processing time is:
Proportional to query size to obtain all unique sub-patterns
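[Figure: pattern network for the query example; patterns shown at each level]
Level 1: A C T
Level 2: AA AC CT TC
Level 3: AAC ACT CTC
Level 4: AACT ACTC CTCG TCGA
Level 5: AACTC ACTCG CTCGA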
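A minimal sketch of the "longest matching pattern for each position" idea follows. It swaps in a brute-force substring set for the network and does not implement the Eulerian-path traversal named on the slide; `build_index`, `longest_matches` and the `max_len` cap are hypothetical names and parameters chosen for illustration.

```python
def build_index(sequence, max_len):
    """Return the set of all substrings of `sequence` up to `max_len` long."""
    return {
        sequence[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(sequence) - k + 1)
    }


def longest_matches(index, query, max_len):
    """For each query position, return the longest indexed pattern starting there.

    The early exit is valid because the index is closed under prefixes:
    if a pattern of length k is absent, no longer pattern can be present.
    """
    matches = []
    for pos in range(len(query)):
        best = ""
        for k in range(1, min(max_len, len(query) - pos) + 1):
            candidate = query[pos:pos + k]
            if candidate not in index:
                break
            best = candidate
        matches.append(best)
    return matches


index = build_index("AACTCGA", max_len=5)
print(longest_matches(index, "ACTCT", max_len=5))
```

The work per query position is bounded by `max_len`, so the total is proportional to the query length for a fixed cap, which mirrors the scaling claim on the slide without reproducing its mechanism.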
Q * log N (base A)
[Figure: Speed (milliseconds, 100 to 900) against size of database (1, 10, 100, 1000), comparing a conventional database with SynaBASE]
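The slide does not define the symbols in Q * log N (base A). The sketch below assumes Q is the query length, N the database size and A the alphabet size (4 for DNA); under those assumptions it shows how slowly such a cost grows as the database gets larger. The function name `estimated_cost` is hypothetical and the result is a unitless operation count, not a timing.

```python
import math


def estimated_cost(query_length, database_size, alphabet_size=4):
    """Evaluate Q * log_A(N) under the assumed meaning of the symbols."""
    return query_length * math.log(database_size, alphabet_size)


# A 1000-fold larger database raises the estimate by a constant amount per
# query base, not by a factor of 1000.
for size in (10**6, 10**9):
    print(size, round(estimated_cost(100, size), 1))
```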
Case Study - Comparison of Human v Mouse genome
[Figure: run-time comparison of SynaBASE, BLAST and PatternHunter; the values shown on the chart are 6h, 22 days and 3 yrs]
3. Increased sensitivity
BLASTN vs. SynaSearch-Bulk
Cumulative number of hits shows SynaSearch Bulk found extra hits at low-to-mid identities
SynaBASE and BLAST DB of 700,000 bacterial ORFs queried with 100 1 kb sequences
[Figure label: Novel hits]
4. Novel annotation using SynaBASE
The elephant and the giraffe walked up the mountain
A graph showing the frequency of "string (word)" patterns in a sentence does not reflect meaning
A graph showing the probabilities of predicting predecessor and successor characters/events (string significance) reflects meaning
SIGNIFICANCE = Actual Freq / Expected Freq
Pattern network: a1, a2, a3 at level 1; a1a2, a2a3 at level 2; a1a2a3 at level 3
Expected Frequency:
\[ \mathrm{Ef}(a_1a_2a_3) = \frac{F(a_1a_2)\,F(a_2a_3)}{F(a_2)} \]
Significance:
\[ \mathrm{Sig}(a_1a_2a_3) = \frac{F(a_1a_2a_3)}{\mathrm{Ef}(a_1a_2a_3)} = \frac{F(a_1a_2a_3)\,F(a_2)}{F(a_1a_2)\,F(a_2a_3)} \]
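As a worked example of the significance ratio, the sketch below counts overlapping patterns in a short string and evaluates actual frequency over expected frequency. It assumes F is a raw occurrence count and, as a further assumption beyond the three-symbol case on the slide, treats longer patterns the same way by using the overlap of the two sub-patterns as the denominator term. `count_patterns` and `significance` are hypothetical names.

```python
from collections import Counter


def count_patterns(text, max_len=3):
    """Count every overlapping substring of length 1..max_len."""
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(text) - k + 1):
            counts[text[i:i + k]] += 1
    return counts


def significance(counts, pattern):
    """Sig(p) = F(p) / Ef(p), with Ef(p) = F(left) * F(right) / F(middle)."""
    if len(pattern) < 3:
        raise ValueError("pattern must have at least three symbols")
    if counts[pattern] == 0:
        return 0.0  # an unseen pattern has no actual frequency
    left, right, middle = pattern[:-1], pattern[1:], pattern[1:-1]
    expected = counts[left] * counts[right] / counts[middle]
    return counts[pattern] / expected


counts = count_patterns("ATGCATGAAT")
print(significance(counts, "ATG"))  # seen exactly as often as expected
print(significance(counts, "AAT"))  # seen more often than expected
```

A value above 1 means the pattern occurs more often than its two overlapping sub-patterns would predict, and a value below 1 means it occurs less often.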
Gene models correlate with "SIGNIFICANCE"
On-going Research Case Studies &
Performance Demonstration
Case Study 1 - contamination identification
High-throughput identification of contaminant reads on the basis of over-representation in a SynaBASE
Major problem as vector databases incomplete and/or not updated
Causes bottlenecks in sequence finishing pipeline
Analysis Steps
1. Build SynaBASE of 5239 Lamprey sequences using SXBuild
2. Extract patterns of length 40mer and above using SXPet
SynaBASE identifies 18,389 patterns
3. Filter patterns to remove polynucleotide repeats of more than 75% identical base composition
475 patterns removed, resulting in 17,914 Lamprey patterns
Function definitions
SXBuild: A SynaBASE API call for building SynaBASEs from raw sequence data
SXPet: A SynaBASE API call for reporting patterns based on frequency
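Steps 2 and 3 can be approximated in plain Python. This is a sketch under stated assumptions, not the SXBuild or SXPet API, whose signatures are not given in these slides: it uses fixed-length k-mers instead of variable-length patterns, counts the number of reads containing each k-mer, and reads "more than 75% identical base composition" as a single base making up more than 75% of the pattern. Both function names are hypothetical.

```python
from collections import Counter


def overrepresented_kmers(reads, k, min_reads):
    """Return k-mers that occur in at least `min_reads` different reads."""
    counts = Counter()
    for read in reads:
        # A set ensures each read contributes at most once per k-mer.
        counts.update({read[i:i + k] for i in range(len(read) - k + 1)})
    return {kmer: n for kmer, n in counts.items() if n >= min_reads}


def is_low_complexity(pattern, max_fraction=0.75):
    """True if one base accounts for more than `max_fraction` of the pattern."""
    most_common_count = Counter(pattern).most_common(1)[0][1]
    return most_common_count / len(pattern) > max_fraction


reads = ["ACGTACGTAA", "TTACGTACGT", "GGGGGGGGGA", "GGGGGGGGGC"]
candidates = overrepresented_kmers(reads, k=8, min_reads=2)
kept = sorted(kmer for kmer in candidates if not is_low_complexity(kmer))
print(kept)
```

In this toy input the poly-G pattern is shared by two reads but is dropped by the composition filter, leaving only the shared ACGTACGT pattern as a contaminant candidate.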
Verification (optional)
Search the resulting 17,914 patterns against the UniVec SynaBASE using Bulksearch
Map patterns back against vector source references
Unique vector-contaminated sequences: 3374 / 5239 (64%)
Function definitions
Bulksearch: A SynaBASE API call for batch searching of sequences
By using an approach based upon filtering of over-represented patterns in SynaBASE, 100% of the vector-contaminated sequences are identified.
This removes the need to screen against the UniVec database, and does so in 1 step.
Case Study 2 - Overlapper
Task to accomplish
Original user data set and requirement was:
To find all overlapping exact 100-mers in 50 million 1 kb sequencing reads, i.e. 50 billion bp
Report n-mers that have a frequency >2 and <m
Using conventional software and approaches, the user took 500 hrs and 1.5 TB of disk space to find all 100-mer overlaps
Hence the standard approach limits usage to 32-mers
Longer mers help bridge repetitive and low-complexity regions
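The reporting requirement (n-mers with a frequency above 2 and below m) can be sketched with an in-memory counter. This is a naive illustration and not SynaMer; it would not scale to 50 billion bp, which is the point the slide makes about conventional approaches. The function name and the `low`/`high` parameters are hypothetical.

```python
from collections import Counter


def kmers_in_frequency_window(reads, k, low, high):
    """Return exact k-mers whose total count n satisfies low < n < high."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer: n for kmer, n in counts.items() if low < n < high}


reads = ["ACGTAC", "CGTACG", "GTACGT", "ACGTTT"]
print(kmers_in_frequency_window(reads, k=4, low=2, high=4))
```

The upper bound is useful because very frequent n-mers usually come from repeats and would otherwise generate spurious overlaps.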
Long v Short n-mers: advantages and disadvantages
100-mer
+ve
Fewer false positives
Improvement in final assembly
-ve
Errors in reads may lead to false negatives
Slow to process with conventional software
Explanation of advantages
[Figure: reads A and B overlapping across a low-complexity region]
A shorter overlap results in more false positives
A longer overlap results in fewer false positives
Final assembly improved
Using SynaMer there is no time increase with longer n-mers
Conclusions
Processing 30 million 1 kb reads took 5 hours on a dual-CPU Itanium machine, with temporary file size less than 200 GB
Finding overlapping sequences for 33,000 900 bp reads from a bacterial WGSS took less than 20 s
100-fold faster than conventional method
Allows use of longer n-mers
Potentially increases quality of assembly
SynaMer will be released as a product later this summer
Case Study 3 - 454 Life Sciences
Rapid genome assembly from 454 generated reads
Conventional approach to Genome Assembly
Cluster by sequence overlaps
Filter out repeats and detectable errors
Assemble each cluster into one or more contigs
Derive contig consensus
Validate results by comparison to a reference genome sequence (if available)
FragBASE - using the SynaBASE structure...
Select patterns of high coverage
Use corrected FragBASE
Use FragBASE network* to extend patterns
Increase pattern size to overcome shorter repeat sections
Stage 3 - error correction
Build a database of patterns - FragBASE
Compare patterns with M. genitalium and analyse
Database consists of:
Total patterns - f/r
Genitalium patterns - f/r
Error patterns - f/r
Fragments
Correct errors using significance
Corrected fragments
454 assembly result
400,000 reads assembled into 11 contigs in 11 minutes, 2 minutes for error correction
Genome coverage 99.89%
Case Study 4 - Plant Comparative Genomics
RefSeq plant release
Covers complete and partially sequenced genomes
74 898 419 bp in 205 780 sequences
Generate sequence alignments
Sequence-based clustering using common K-mers
Whole genome phylogeny
Performance Results
Sequence clustering based on shared K-mers
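The slides do not say how shared K-mers are turned into clusters. One common way, shown here purely as an assumed illustration, is to score each pair of sequences by the Jaccard similarity of their k-mer sets and join pairs above a threshold with single linkage. All function names, the value of k and the threshold are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of distinct k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Shared k-mers divided by all k-mers seen in either sequence."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_shared_kmers(sequences, k, threshold):
    """Single-linkage clustering of named sequences by k-mer Jaccard score."""
    names = list(sequences)
    kmers = {name: kmer_set(sequences[name], k) for name in names}
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if jaccard(kmers[a], kmers[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


sequences = {"s1": "ACGTACGTGG", "s2": "ACGTACGTCC", "s3": "TTTTGGGGCC"}
print(cluster_by_shared_kmers(sequences, k=4, threshold=0.3))
```

The all-pairs loop is quadratic in the number of sequences, so this is only suitable for small inputs; it is meant to show the scoring idea, not how a 205 780-sequence release would be processed.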
Case Study 5 - Pattern Frequency statistics and SynaBASE
SynaBASE stores all patterns from data
Pattern frequencies and offsets on source sequences
Characterize/annotate data
Sequence clustering
Conserved regions
Simple and complex repeats
Genome segmental duplications
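To make "pattern frequencies and offsets on source sequences" concrete, the sketch below builds a plain dictionary from each fixed-length pattern to the list of places it occurs. It is an assumed stand-in for illustration, not the SynaBASE storage format, and `index_patterns` is a hypothetical name.

```python
from collections import defaultdict


def index_patterns(sequences, k):
    """Map each k-mer to a list of (sequence id, offset) occurrences."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return dict(index)


index = index_patterns({"chr1": "ATGATG", "chr2": "GATG"}, k=3)
for pattern, hits in sorted(index.items()):
    # The frequency is the number of recorded occurrences.
    print(pattern, len(hits), hits)
```

A pattern with many hits on one sequence suggests a repeat, while hits spread across sequences suggest a conserved region, which is how such an index supports the annotation uses listed above.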
Yeast Genes SynaBASE Frequency Statistics
Arabidopsis thaliana (thale cress)
Human
Mus musculus
All Bacteria genomes
Summary
Cutting-edge Bioinformatics: SynaBASE novel database PLATFORM
Unique
Patented worldwide
Leads to massive increases in speed and scalability
Accuracy and sensitivity enhanced