Rule-based Knowledge Aggregation for Large-Scale Protein Sequence Analysis of Influenza A Viruses
Sixth International Conference on Bioinformatics (InCoB2007), Hong Kong, 28th August 2007
Olivo Miotto, Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore
Tan Tin Wee, Yong Loo Lin School of Medicine, National University of Singapore
Vladimir Brusic, Cancer Vaccine Center, Dana-Farber Cancer Inst.
Page 2
Outline
Knowledge Aggregation in large-scale analysis
Semantic Technologies for Knowledge Aggregation
Task: Annotating the Influenza Dataset
XML-based structural rules
Rule-based knowledge restructuring
Discussion and Conclusions
Page 3
Outline: section 1, Knowledge Aggregation in large-scale analysis
Page 4
Knowledge Aggregation: Scaling up Bioinformatics
Bioinformatic analysis is currently limited in scope
  Usually single domain (single aspect)
  Mostly small datasets (single genes, or few sequences)
"Horizontal" scalability: connecting domains
  Multiple database sources, diversely purposed data
  Systemic and semantic heterogeneity
  Discovery by relationship analysis
"Vertical" scalability: analyzing large datasets
  Many thousands of records
  Diversity of geography, tissue types, host, etc.
  Discovery by comparative analysis
Currently, dataset preparation is manual
Page 5
Horizontal Scalability
BioHaystack Semantic Web Browser (IBM + MIT)
Quan, D (2004): BioHaystack: Gateway to the Biological Semantic Web
Ontologies: vocabularies of concepts and properties that describe a field of knowledge
OWL technology allows users to define ontologies
Shared ontologies allow interchange of data
Ontologies support REASONING by means of programs that
  Read RDF data, encoded using an ontology
  Apply rules that relate to the described properties
  Generate new knowledge from these rules
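The read / apply / generate cycle described above can be sketched in a few lines of Python. This is only a toy illustration and not the reasoner used in the study: triples are plain tuples rather than real RDF, and the single rule and the property names `isolatedIn` and `locatedIn` are invented for the example.

```python
from typing import Set, Tuple

# A triple is (subject, property, value), standing in for one RDF statement.
Triple = Tuple[str, str, str]


def forward_chain(facts: Set[Triple]) -> Set[Triple]:
    """Apply one rule repeatedly until no new triples can be derived.

    Hypothetical rule: if a record is isolatedIn a place, and that place is
    locatedIn a larger place, then the record is also isolatedIn the larger one.
    """
    known = set(facts)
    while True:
        derived = {
            (record, "isolatedIn", outer)
            for (record, p1, place) in known if p1 == "isolatedIn"
            for (inner, p2, outer) in known if p2 == "locatedIn" and inner == place
        }
        if derived <= known:  # nothing new was generated: fixed point reached
            return known
        known |= derived


facts = {
    ("record:1", "isolatedIn", "Memphis"),
    ("Memphis", "locatedIn", "Tennessee"),
    ("Tennessee", "locatedIn", "USA"),
}
new_knowledge = forward_chain(facts) - facts
```

Here `new_knowledge` holds the two statements that were never stated explicitly: the record was isolated in Tennessee and in the USA. The second one only appears on the second pass, which is why the rule is applied until nothing changes.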
Page 13
Outline: section 3, Task: Annotating the Influenza Dataset
Page 14
Study goals
Analyze all influenza protein sequences available
  GenBank + GenPept = 92,343 documents
  Final dataset comprises 40,169 unique sequences
Various types of analysis, e.g.
  Identify amino acid mutation sites that characterize human-transmissible strains
  Compare the diversity of viral sequences over different periods of time and geographical areas
Several metadata fields required
  Protein name, Subtype, Isolate, Host, Country, Year
Manual Curation is not an Option!
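The six required metadata fields could be held in a small record type that also reports which fields are still unfilled. This is a minimal sketch under assumptions: the attribute names are Python spellings of the list above, the `missing` helper is hypothetical, and the example values are invented.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SequenceMetadata:
    """One record per sequence; None means the field has not been recovered."""
    protein_name: Optional[str] = None
    subtype: Optional[str] = None
    isolate: Optional[str] = None
    host: Optional[str] = None
    country: Optional[str] = None
    year: Optional[int] = None

    def missing(self) -> List[str]:
        # Field names are returned in declaration order.
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Invented example values, for illustration only.
record = SequenceMetadata(protein_name="HA", subtype="H5N1", host="Chicken", year=2004)
gaps = record.missing()
```

Running such a check over every record gives a quick measure of how much metadata the automated steps still leave for a curator.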
Page 15
Inconsistencies in GenBank records
[Figure: examples labelled "Good", "Not so Good" and "Pretty Bad"]
Page 16
Experimental Approach
1. Retrieve all influenza A records from GenBank and GenPept in XML format, using the ABK platform (Miotto O, Tan TW, Brusic V (2005) LNCS 3578, 398-405)
2. Use XML structural rules to extract, merge and reconcile the metadata from the records
3. Use RDF encoding and an Ontology to encode and structure the resulting metadata
4. Use a Reasoner with Semantic Rules to restructure the metadata, and make inferences that improve the consistency
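Step 3 above, encoding the extracted metadata as RDF, can be illustrated by writing each field as one statement in N-Triples syntax. This is a sketch under assumptions: the namespace `http://example.org/flu#`, the record identifier and the property names are all hypothetical, and the study's own ontology is not reproduced here.

```python
from typing import Dict, List

BASE = "http://example.org/flu#"  # hypothetical namespace, not the study's ontology


def to_ntriples(record_id: str, metadata: Dict[str, object]) -> List[str]:
    """Return one N-Triples line per metadata field, sorted by field name."""
    lines = []
    for prop, value in sorted(metadata.items()):
        # Minimal literal escaping: backslashes first, then double quotes.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{BASE}{record_id}> <{BASE}{prop}> "{literal}" .')
    return lines


triples = to_ntriples("rec1", {"host": "Avian", "year": 2004})
```

Each line has the form subject, property, value, which is the shape a reasoner works on in step 4.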
Page 17
Outline: section 4, XML-based structural rules
Page 18
Leveraging on XML
XML offers great advantages for extracting heterogeneous metadata
  Wide availability
  Popular encoding for source databases
  Standard processing software
  Independence from source schemas
  Query Language (XPath)
Some disadvantages
  Almost unreadable by humans
  Interpretation of semantics requires understanding the schema
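To illustrate the XPath point, the sketch below pulls one metadata value out of an XML record using only Python's standard library. It is an assumption-laden example: the element names are modelled on GenBank's GBSeq XML but should be treated as illustrative, the sample record is hand-written, and the `qualifier` helper is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hand-written sample, loosely modelled on a GenBank XML record.
SAMPLE = """
<GBSeq>
  <GBSeq_feature-table>
    <GBFeature>
      <GBFeature_key>source</GBFeature_key>
      <GBFeature_quals>
        <GBQualifier>
          <GBQualifier_name>strain</GBQualifier_name>
          <GBQualifier_value>A/Goose/Guangdong/1/96(H5N1)</GBQualifier_value>
        </GBQualifier>
        <GBQualifier>
          <GBQualifier_name>country</GBQualifier_name>
          <GBQualifier_value>China</GBQualifier_value>
        </GBQualifier>
      </GBFeature_quals>
    </GBFeature>
  </GBSeq_feature-table>
</GBSeq>
"""


def qualifier(root: ET.Element, name: str) -> Optional[str]:
    """Return the value of the first qualifier with the given name, if any."""
    # ElementTree supports a limited XPath subset, including [child='text'].
    node = root.find(f".//GBQualifier[GBQualifier_name='{name}']/GBQualifier_value")
    return node.text if node is not None else None


root = ET.fromstring(SAMPLE)
country = qualifier(root, "country")
host = qualifier(root, "host")  # not present in this record
```

The same query works wherever the qualifier sits in the tree, which is one way of reading "independence from source schemas". The nesting of the sample also shows why raw XML is hard for humans to read.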
Page 19
Page 20
ABK Structural Rules
Concise visualization of XML as name/value tree
Familiar presentation of metadata for biologists
Point-and-click selection of location and constraints
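One way to picture a structural rule is as a location in the name/value tree plus a constraint on the value found there. The sketch below is not ABK's actual rule format, which the slides do not show: the `StructuralRule` class, the path strings and the sample tree are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class StructuralRule:
    field: str                          # metadata field the rule fills
    location: str                       # path of the node in the name/value tree
    constraint: Callable[[str], bool]   # value must satisfy this to be accepted


def apply_structural_rules(
    tree: List[Tuple[str, str]], rules: List[StructuralRule]
) -> Dict[str, List[str]]:
    """Collect, per field, every value whose path and content satisfy a rule."""
    found: Dict[str, List[str]] = {}
    for rule in rules:
        for path, value in tree:
            if path == rule.location and rule.constraint(value):
                found.setdefault(rule.field, []).append(value)
    return found


# Hypothetical flattened name/value tree for one record.
tree = [
    ("source/country", "Viet Nam"),
    ("source/collection_date", "unknown"),
]
rules = [
    StructuralRule("country", "source/country", lambda v: v != ""),
    StructuralRule("year", "source/collection_date",
                   lambda v: v.isdigit() and len(v) == 4),
]
extracted = apply_structural_rules(tree, rules)
```

The year rule finds the right node but rejects its value, so only the country is extracted. Several rules may target the same field, leaving a list of candidate values to merge and reconcile.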
Large-scale metadata recovery from public databases is difficult even for simple requirements
Relatively simple approaches such as structural rules can do most of the tedious work
  Accuracy can be further improved with machine learning
Semantic inferences can improve data quality
  Significant impact on manual curation task
Rules have more potential for intuitive end-user GUI than programming
  cf. email rules, firewall rules
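As one example of an inference that could fill gaps in a record, influenza A isolate names conventionally follow the pattern A/host/place/number/year(subtype), with the host omitted for human isolates, so several metadata fields can be read from the name itself. The slides do not confirm that the study used this particular inference; the pattern and the `parse_isolate` helper below are a hypothetical sketch.

```python
import re
from typing import Dict, Optional

ISOLATE_NAME = re.compile(
    r"^A/(?:(?P<host>[^/]+)/)?"     # host is omitted for human isolates
    r"(?P<place>[^/]+)/"
    r"(?P<number>[^/]+)/"
    r"(?P<year>\d{4}|\d{2})\s*"     # two- or four-digit year, kept as written
    r"\((?P<subtype>H\d+N\d+)\)$"
)


def parse_isolate(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Split an isolate name into its parts, or return None if it does not fit."""
    match = ISOLATE_NAME.match(name.strip())
    return match.groupdict() if match else None


avian = parse_isolate("A/Goose/Guangdong/1/96(H5N1)")
human = parse_isolate("A/Hong Kong/1/68(H3N2)")
```

A record that lacks an explicit subtype or year but carries a well-formed isolate name can have those fields proposed automatically. Names that do not fit the pattern return `None` and are left for other rules or for a curator.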
Page 33
Discussion - 2
Semantic Technologies are suitable for bioinformatics metadata management today
  Limited infrastructure requirements
  Flexibility and extensibility of ontologies (Open World)
Enormous potential for analysis tool integration
  Build tools that are "semantically agnostic"
Reasoning is currently computationally expensive
  Our simple reasoning tasks exceeded the power of a current desktop when applied to tens of thousands of records
  Divide-and-conquer strategies were effective, but require manual work, and are not always applicable
  Reasoning services and computing grids can help scalability, but only if easy to access
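The divide-and-conquer idea can be sketched as splitting the records into fixed-size batches and running the reasoner on each batch separately. This is a generic illustration, not the partitioning actually used in the study: `reason` stands for any function that takes a batch of records and returns results, and the batch size is arbitrary.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batches(records: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def reason_in_batches(
    records: Iterable[T], reason: Callable[[List[T]], List[R]], size: int
) -> List[R]:
    """Run `reason` on each batch independently and concatenate the results."""
    results: List[R] = []
    for chunk in batches(records, size):
        results.extend(reason(chunk))
    return results


# Stand-in "reasoner" that just doubles each value, to show the plumbing.
doubled = reason_in_batches(range(5), lambda chunk: [x * 2 for x in chunk], size=2)
```

Batching only gives the same answer as a single run when no rule needs to relate records that fall in different batches, which is one reason such strategies are not always applicable.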
Page 34
Acknowledgements and Thanks
Institute of Systems Science, NUS
  Funding support for this conference
Prof. J Thomas August, Johns Hopkins University
AT Heiny, NUS
Partial Grant Support:
National Institute of Allergy and Infectious Diseases, NIH
  Grant No. 5 U19 AI56541, Contract No. HHSN2662-00400085C