caBIG: the cancer Biomedical Informatics Grid
Ken Buetow, NCICB/NCI/NIH/DHHS
Dec 25, 2015
NCI biomedical informatics
Goal: A virtual web of interconnected data, individuals, and organizations redefines how research is conducted, care is provided, and patients/participants interact with the biomedical research enterprise
• Trials • Animal Models
states
context • pathways • ontologies
agents • therapeutics • probes
components • genes • genotypes • gene expression • proteins • protein expression
etiology, treatment, prevention
Molecular Pathology
Clinical Trials
caCORE
access portals
participating group nodes
Cancer Genomics
Mouse Models
building common architecture, common tools, and common standards
Interoperability
Semantic interoperability
Syntactic interoperability
Courtesy: Charlie Mead
in·ter·op·er·a·bil·i·ty - ability of a system ... to use the parts or equipment of another system
Source: Merriam-Webster web site
interoperability - ability of two or more systems or components to exchange information and to use the information that has been exchanged.
Source: IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries, IEEE, 1990
Enterprise Vocabulary
NCI Meta-Thesaurus (cross-maps standard vocabularies/ontologies, e.g. SNOMED, MedDRA, ICD)
- Semantic integration, inter-vocabulary mapping
- UMLS Metathesaurus extended with cancer-oriented vocabularies
• 800,000 Concepts, 2,000,000 terms and phrases
• Mappings among over 50 vocabularies
NCI Thesaurus
- Description logic-based
- 18,000 “Concepts”
• Concept is the semantic unit
• One or more terms describe a Concept – synonymy
• Semantic relationships between Concepts
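The Concept-centred structure described above (a Concept as the semantic unit, several terms naming one Concept, and typed relationships between Concepts) can be sketched roughly as follows. This is only an illustration: the class names, the codes "X001"/"X002", and the "is_a" relationship label are hypothetical and are not taken from the NCI Thesaurus or the EVS APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A semantic unit; several terms (synonyms) may name the same Concept."""
    code: str                 # hypothetical identifier, not a real thesaurus code
    preferred_name: str
    synonyms: set = field(default_factory=set)
    # relationship name -> set of target concept codes
    relations: dict = field(default_factory=dict)


class Thesaurus:
    """Toy in-memory vocabulary: term lookup plus typed concept-to-concept links."""

    def __init__(self):
        self._by_code = {}
        self._by_term = {}

    def add(self, concept):
        self._by_code[concept.code] = concept
        # Every term, preferred or synonym, resolves to the same concept code.
        for term in {concept.preferred_name, *concept.synonyms}:
            self._by_term[term.lower()] = concept.code

    def relate(self, source_code, relation, target_code):
        source = self._by_code[source_code]
        source.relations.setdefault(relation, set()).add(target_code)

    def lookup(self, term):
        """Return the Concept named by `term`, or None if the term is unknown."""
        code = self._by_term.get(term.lower())
        return self._by_code.get(code)


def build_example():
    thesaurus = Thesaurus()
    thesaurus.add(Concept("X001", "Neoplasm", {"Tumor", "Tumour"}))
    thesaurus.add(Concept("X002", "Breast Carcinoma", {"Carcinoma of the Breast"}))
    thesaurus.relate("X002", "is_a", "X001")
    return thesaurus
```

Looking up "Tumour" and "Neoplasm" in this toy returns the same Concept object, which is the synonymy point made on the slide.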
biomedical objects
common data elements
controlled vocabulary
Common Data Elements
Structured data reporting elements
Precisely defining the questions and answers
- What question are you asking, exactly?
- What are the possible answers, and what do they mean?
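One way to picture a common data element is as a question paired with an explicit set of permissible answers, each carrying its meaning. The sketch below is a minimal illustration under that assumption; the class name, the example question, and the Y/N/U codes are all hypothetical and do not come from caDSR.

```python
from dataclasses import dataclass


@dataclass
class CommonDataElement:
    """A precisely worded question plus its permissible answers and their meanings."""
    question: str
    permissible_values: dict  # answer code -> what that answer means

    def meaning_of(self, answer):
        """Return the meaning of a permissible answer; reject anything else."""
        if answer not in self.permissible_values:
            allowed = ", ".join(sorted(self.permissible_values))
            raise ValueError(
                f"{answer!r} is not a permissible value; expected one of: {allowed}"
            )
        return self.permissible_values[answer]


# Hypothetical element, for illustration only.
smoking_history = CommonDataElement(
    question="Has the participant ever smoked tobacco?",
    permissible_values={
        "Y": "Yes, at any time in the participant's life",
        "N": "No, never",
        "U": "Unknown or not reported",
    },
)
```

Rejecting values outside the permissible set is what keeps two sites' answers comparable: both are forced to pick from the same defined list.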
Biomedical Information Objects
Data service infrastructure developed using OMG’s Model Driven Architecture approach
Object models expressed in UML represent actual biomedical research entities such as genes, sequences, chromosomes, cellular pathways, ontologies, clinical protocols, etc.
The object models form the basis for uniform APIs (Java, SOAP, HTTP-XML, Perl) that provide an abstraction layer and interfaces for developers to access information without worrying about the back-end data stores
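The abstraction-layer idea can be sketched as a domain object, a storage interface, and a service that callers use without knowing which back end is behind it. This is a minimal illustration, not the caBIO API: `Gene`, `GeneStore`, `InMemoryGeneStore`, and `GeneService` are hypothetical names, and the in-memory store stands in for whatever database a real deployment would use.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Gene:
    """Domain object: what client code sees, independent of storage."""
    symbol: str
    chromosome: str


class GeneStore(ABC):
    """Back-end interface; each data store supplies its own implementation."""

    @abstractmethod
    def find_by_symbol(self, symbol):
        """Return a Gene, or None when the symbol is not found."""


class InMemoryGeneStore(GeneStore):
    """Stand-in back end built from plain rows (hypothetical row layout)."""

    def __init__(self, rows):
        self._rows = {row["symbol"]: row for row in rows}

    def find_by_symbol(self, symbol):
        row = self._rows.get(symbol)
        if row is None:
            return None
        return Gene(symbol=row["symbol"], chromosome=row["chrom"])


class GeneService:
    """Uniform entry point; callers never touch the store directly."""

    def __init__(self, store):
        self._store = store

    def get_gene(self, symbol):
        return self._store.find_by_symbol(symbol)


service = GeneService(InMemoryGeneStore([{"symbol": "TP53", "chrom": "17"}]))
```

Swapping `InMemoryGeneStore` for a database-backed implementation would leave `service.get_gene("TP53")` unchanged for the caller.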
Standards supporting infrastructure
Enterprise Vocabulary Services (EVS)
- Browsers
- APIs
cancer Bioinformatics Infrastructure Objects (caBIO)
- Applications
- APIs
cancer Data Standards Repository (caDSR)
- CDEs
- Case Report Forms
- Object models
- ISO 11179 model
Integrating Architecture
[Architecture diagram; tiers labeled Data, Object, Presentation, Client]
Components shown: Data Access Objects; Object Managers; Domain Objects; RMI; Web Server; Tomcat Servlets; JSPs; SOAP; XML; XSL/XSLT; HTML/XML Clients (Browsers); SOAP Clients; Java Applications; Perl Clients; Meta-Data
Semantic Integration: Modeling Time
Class
Attributes
EVS Concept for Attribute ‘agentName’
EVS Concept for Class ‘Agent’
EVS Concept for Attribute ‘id’
... etc.
EVS Concept for instance objects
Object
Mapping to EVS Concepts Done at Modeling Time
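Attaching a vocabulary concept to a class and to each of its attributes at modeling time might look like the sketch below. The class `Agent` and the attributes `id` and `agentName` come from the slide; the concept codes ("CONCEPT-AGENT" and so on) and the `evs_concept` metadata key are placeholders invented for illustration, not real EVS identifiers.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Agent:
    # Concept for the class itself. It has no type annotation, so the
    # dataclass machinery treats it as a plain class attribute, not a field.
    CONCEPT = "CONCEPT-AGENT"  # placeholder code

    # Each attribute carries its own concept code in the field metadata.
    id: int = field(metadata={"evs_concept": "CONCEPT-ID"})
    agentName: str = field(metadata={"evs_concept": "CONCEPT-AGENT-NAME"})


def concept_map(model_class):
    """Collect the class-level and attribute-level concept annotations."""
    mapping = {model_class.__name__: model_class.CONCEPT}
    for attribute in fields(model_class):
        key = f"{model_class.__name__}.{attribute.name}"
        mapping[key] = attribute.metadata["evs_concept"]
    return mapping
```

Because the annotations live on the model rather than on instances, every object created from the class inherits the same semantics without any per-record work.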
Semantic Integration: Metadata Registration Time
UML model, including EVS Concept mappings
ISO 11179 mapping
caDSR loading
Curation: Data standards registration
for instance data
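The registration step (annotated UML model, then ISO 11179 mapping, then loading into a registry) can be sketched in a much simplified form. ISO 11179 describes a data element in terms of an object class, a property, and a value domain; the sketch assumes one data element per UML attribute. The dictionary layout of `AGENT_MODEL`, the function `register_model`, the datatypes, and the concept codes are hypothetical, and a plain dict stands in for caDSR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """Simplified ISO 11179-style record derived from one UML attribute."""
    object_class: str     # from the UML class name
    property_name: str    # from the UML attribute name
    value_domain: str     # from the attribute's datatype
    concept_codes: tuple  # (class concept, attribute concept)


def register_model(uml_class, registry):
    """Derive one DataElement per attribute and load it into the registry."""
    for attr in uml_class["attributes"]:
        key = f'{uml_class["name"]}.{attr["name"]}'
        registry[key] = DataElement(
            object_class=uml_class["name"],
            property_name=attr["name"],
            value_domain=attr["type"],
            concept_codes=(uml_class["concept"], attr["concept"]),
        )
    return registry


# Hypothetical annotated model; codes and datatypes are placeholders.
AGENT_MODEL = {
    "name": "Agent",
    "concept": "CONCEPT-AGENT",
    "attributes": [
        {"name": "id", "type": "Long", "concept": "CONCEPT-ID"},
        {"name": "agentName", "type": "String", "concept": "CONCEPT-AGENT-NAME"},
    ],
}
```

Calling `register_model(AGENT_MODEL, {})` yields two entries, keyed "Agent.id" and "Agent.agentName"; curation would then review those records before they are treated as standards.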
Semantic Integration: Runtime
[Architecture diagram; tiers labeled Data, Object, Presentation, Client]
Components shown: Research DBs; Data Access Objects (OJB); Object Managers; Domain Objects [Gene, Disease, Concept, DataElement]; RMI; Web Server; Tomcat Servlets (XML, XSL/XSLT); JSPs; SOAP; HTML/XML Clients (Browsers); SOAP Clients; Java Applications; Perl Clients
caGRID caCORE architecture extension
caBIO server
caBIO client
OGSA-DAI + Globus
Globus
OGSA-DAI
caGRID extension (metadata)
caGRID extension (caBIO adapter)
caGRID extension (query)
Client
Grid
Data Source
caGRID extension (Concept Discovery)
caGRID extension (Federated Query)
caGRID Extension (Integration of Discovery and Query Services)
NCICB applications:
• clinical trials support - C3DS
• molecular pathology - caArray
• cancer images - caImage
• pre-clinical models - caModelsDb
• laboratory support - caLIMS
Standards-based Data System for the conduct of clinical trials:
• C3D (Cancer Central Clinical Database)
– WWW-based eCRF-based primary data capture by protocol
• C3PR (Cancer Central Clinical Participant Registry)
– WWW-based central registration of participants across protocols
• C3PA (Cancer Central Clinical Protocol Administration)
– Scientific management system for clinical protocols
• C3TR (Cancer Central Clinical Tissue Repository)
– Tissue repository
• C3DW (Cancer Central Clinical Data Warehouse)
– De-identified patient information accessed via caBIO
Image Portal
• The NCICB has developed an image portal to allow researchers to search for mouse and human images and annotations
– Human and mouse images and annotations were provided by the MMHCC
Pathway Database
• Enhance value of imperfect, but available, pathway knowledge
• Make biological assumptions explicit
• Combine sources of data (e.g. KEGG, BioCarta, ...)
• Merge data from separate pathways
• Build a causal framework to support (future) quantitative simulation/analysis
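Combining sources and merging separate pathways can be pictured as building one edge set in which each edge remembers which sources assert it. The sketch below is a toy under that assumption: the node names "A", "B", "C", the source labels, and both function names are placeholders, and it says nothing about how the actual Pathway Database stores or merges data.

```python
from collections import defaultdict


def merge_pathways(sources):
    """Merge edges from several sources, keeping provenance per edge.

    sources: dict of source name -> iterable of (upstream, downstream) pairs.
    Returns a dict of edge -> set of source names that contain that edge.
    """
    merged = defaultdict(set)
    for source_name, edges in sources.items():
        for edge in edges:
            merged[edge].add(source_name)
    return dict(merged)


def downstream_of(merged, node):
    """Every node reachable from `node` by following merged edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for upstream, downstream in merged:
            if upstream == current and downstream not in seen:
                seen.add(downstream)
                stack.append(downstream)
    return seen


# Placeholder data: two sources that overlap on one edge.
example = merge_pathways({
    "source_1": [("A", "B")],
    "source_2": [("A", "B"), ("B", "C")],
})
```

Keeping the source set on each edge makes the assumption behind every link explicit: an edge asserted by only one source can be weighed differently from one that several sources agree on.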
Cancer Biomedical Informatics Grid (caBIG)
Common, widely distributed infrastructure permits cancer research community to focus on innovation
Shared vocabulary, data elements, data models facilitate information exchange
Collection of interoperable applications developed to common standard
Raw published cancer research data is available for mining and integration
caBIG action plan
Establish pilot network of Cancer Centers
- Groups agreeing to caBIG principles
- Mixture of capabilities
- Mixture of contributions
Expanding collection of participants
Establish consortium development process
- Collecting and sharing expertise
- Identifying and prioritizing community needs
- Expanding development efforts
Moving at the speed of the internet…
Three Domain Workspaces and two Cross Cutting Workspaces have been launched during the Pilot phase
DOMAIN WORKSPACE 3: Tissue Banks & Pathology Tools
provides for the integration, development, and implementation of tissue and pathology tools.
DOMAIN WORKSPACE 2: Integrative Cancer Research
provides tools and systems to enable integration and sharing of information.
DOMAIN WORKSPACE 1: Clinical Trial Management Systems
addresses the need for consistent, open and comprehensive tools for clinical trials management.
CROSS CUTTING WORKSPACE 2: Architecture
developing architectural standards and architecture necessary for other workspaces.
CROSS CUTTING WORKSPACE 1: Vocabularies & Common Data Elements
responsible for evaluating, developing, and integrating systems for vocabulary and ontology content, standards, and software systems for content delivery
Key deliverables of caBIG pilot
Componentized, standards-based Clinical Trials Management System
- e-IND filing/regulatory reporting with FDA
- Electronic management of trials
- Integration of diverse trials
Tissue Management System
- Systematic description and characterization of tissue resources
- Ability to link tissue resources to clinical and molecular correlative descriptions
“Plug and Play” analytic tool set
- microarray
- proteomics
- pathways
- data analysis and statistical methods
- gene annotation
Diverse library of raw, structured data
Cancer Molecular Analysis Project (CMAP)
- a prototypic biomedical data integration effort
Profiles, Targets, Agents, Clinical Trials
CGAP
NCBI
UCSC (via DAS)
BioCarta
KEGG
Gene Ontologies
CTEP clinical trials
CGAP gene expression
NCI drug screening
caBIG community contributions
Infrastructure
- Ontologies
- Databases
Applications
- Clinical trials support
- Analytic tools
- Data mining
Data
- Trials
- Experimental outcomes
• Genomic
• Microarray
• Proteomic
acknowledgements
NCICB
- Peter Covitz
- Sue Dubman
- Mary Jo Deering
- Leslie Derr
- Carl Schaefer
- Christos Andonyadis
- Mervi Heiskanen
- Denise Hise
- Kotien Wu
- Fei Xu
- Frank Hartel
LPG/CCR
- Michael Edmundson
- Bob Clifford
- Cu Nguyen
http://ncicb.nci.nih.gov
http://cmap.nci.nih.gov
http://caBIG.nci.nih.gov