Jun 25, 2015
Why not use data intensive science to build better models of diseases together?
– beyond current rewards
Stephen H Friend MD PhD Sage Bionetworks (non-profit)
July 27, 2012 UCSF
So what is the problem?
Most approved therapies were assumed to be monotherapies for diseases representing homogeneous populations
Our existing disease models often assume pathway knowledge sufficient to infer correct therapies
Familiar but Incomplete
Reality: Overlapping Pathways
The value of appropriate representations/ maps
Equipment capable of generating massive amounts of data
“Data Intensive” Science- Fourth Scientific Paradigm
Open Information System
IT Interoperability
Host evolving computational models in a “Compute Space”
WHY NOT USE “DATA INTENSIVE” SCIENCE
TO BUILD BETTER DISEASE MAPS?
what will it take to understand disease?
DNA RNA PROTEIN (dark matter)
MOVING BEYOND ALTERED COMPONENT LISTS
2002 Can one build a “causal” model?
Preliminary Probabilistic Models- Rosetta/Schadt
Gene symbol | Gene name | Variance of OFPM explained by gene expression* | Mouse model | Source
Zfp90 | Zinc finger protein 90 | 68% | tg | Constructed using BAC transgenics
Gas7 | Growth arrest specific 7 | 68% | tg | Constructed using BAC transgenics
Gpx3 | Glutathione peroxidase 3 | 61% | tg | Provided by Prof. Oleg Mirochnitchenko (University of Medicine and Dentistry at New Jersey, NJ) [12]
Lactb | Lactamase beta | 52% | tg | Constructed using BAC transgenics
Me1 | Malic enzyme 1 | 52% | ko | Naturally occurring KO
Gyk | Glycerol kinase | 46% | ko | Provided by Dr. Katrina Dipple (UCLA) [13]
Lpl | Lipoprotein lipase | 46% | ko | Provided by Dr. Ira Goldberg (Columbia University, NY) [11]
C3ar1 | Complement component 3a receptor 1 | 46% | ko | Purchased from Deltagen, CA
Tgfbr2 | Transforming growth factor beta receptor 2 | 39% | ko | Purchased from Deltagen, CA
Networks facilitate direct identification of genes that are causal for disease
Evolutionarily tolerated weak spots
Nat Genet (2005) 205:370
DIVERSE POWERFUL USE OF MODELS AND NETWORKS
50 network papers http://sagebase.org/research/resources.php
List of Influential Papers in Network Modeling
(Eric Schadt)
“Data Intensive” Science- Fourth Scientific Paradigm: Score Card for Medical Sciences
Equipment capable of generating massive amounts of data: A-
Open Information System: D-
IT Interoperability: D
Host evolving computational models in a “Compute Space”: F
We still conduct much clinical research as if we were “hunter-gatherers”- not sharing
TENURE FEUDAL STATES
Clinical/genomic data are accessible but minimally usable
Little incentive to annotate and curate data for other scientists to use
Mathematical models of disease are not built to be reproduced or versioned by others
Lack of standard forms for future rights and consents
Lack of data standards
Background: Information Commons for Biological Functions
Sage Mission
Sage Bionetworks is a non-profit organization with a vision to create a “commons” where integrative bionetworks are evolved by
contributor scientists with a shared vision to accelerate the elimination of human disease
Sagebase.org
Data Repository
Discovery Platform
Building Disease Maps
Commons Pilots
Sage Bionetworks Collaborators
Pharma Partners: Merck, Pfizer, Takeda, AstraZeneca, Amgen, Johnson & Johnson
Foundations: Kauffman, CHDI, Gates Foundation
Government: NIH, LSDF, NCI
Academic: Levy (Framingham), Rosengren (Lund), Krauss (CHORI)
Federation: Ideker, Califano, Nolan, Schadt
Better Models of Disease:
INFORMATION COMMONS
Biomedical Information Commons
Technology Platform
Challenges
Impactful Models
Governance
Products/Approaches
IT
Pharma
Academic Consortia
Joint Patient/Scientist Communities
Biotech
Patient Foundations
Individual Patients
RNDP/FA/MEL Communities engaging COMMONS PLATFORM
Takeda
WPP
Discovery Network
BrCA/Challenges
Ongoing Sage Bionetworks Initiatives
Cell Line Challenge
Common Mind/ Mt. Sinai Neuro
TCGA/Challenge
ClearScience
Roche
SB/Gates
Sage CCSB
EU PARTICIPATION
A) Miller 159 samples
B) Christos 189 samples
C) NKI 295 samples
D) Wang 286 samples
Cell cycle
Pre-mRNA
ECM
Immune response
Blood vessel
E) Super modules
Zhang B et al., Towards a global picture of breast cancer (manuscript).
NKI: N Engl J Med. 2002 Dec 19;347(25):1999.
Wang: Lancet. 2005 Feb 19-25;365(9460):671.
Miller: Breast Cancer Res. 2005;7(6):R953.
Christos: J Natl Cancer Inst. 2006 15;98(4):262.
Impactful Models Breast Cancer: Co-expression Networks
What is this?
Bayesian networks enriched in inflammation genes correlated with disease severity in the prefrontal cortex of 250 Alzheimer’s patients.
What does it mean?
Inflammation in AD is an interactive multi-pathway system. More broadly, network structure organizes complex disease effects into coherent sub-systems and can prioritize key genes.
Are you joking?
Gene validation shows novel key drivers increase Abeta uptake and decrease neurite length through an ROS burst (highly relevant to AD pathology).
CHRIS GAITERI - ALZHEIMER’S
Liver
Adipose
Fatty acids
Hypothalamus
Macrophage/inflammation
Leptin signaling
Phagocytosis-induced lipolysis
M1 macrophage
A multi-tissue immune-driven theory of weight loss
IMPACTFUL MODELS
Two approaches to building common scientific and technical knowledge
Text summary of the completed project
Assembled after the fact
Social Coding
• Every code change versioned
• Every issue tracked
• Every project the starting point for new work
• All evolving and accessible in real time
Synapse is GitHub for Biomedical Data
Social Science
• Data and code versioned
• Analysis history captured in real time
• Work anywhere, and share the results with anyone
Leveraging Existing Technologies
Taverna
Addama
tranSMART
Watch What I Do, Not What I Say
sage bionetworks synapse project
Most of the People You Need to Work with Don’t Work with You
My Other Computer is “The Cloud”
Data Analysis with Synapse
Run Any Tool
On Any Platform
Record in Synapse
Share with Anyone
[Figure labels: Performance assessment; Expression, Copy number, Mutation, Phenotype (repeated for each dataset); Predictive model generation]
• Automated workflows for curation, QC, and sharing of large-scale datasets.
• All of TCGA, GEO, and user-submitted data processed with standard normalization methods.
• Searchable TCGA data:
  • 23 cancers
  • 11 data platforms
  • Standardized metadata ontologies
• Comparison of many modeling approaches applied to the same data.
• Models transparently shared and reusable through Synapse.
• Displayed is a comparison of 6 modeling approaches to predict sensitivity to 130 drugs.
• Extending pipeline to evaluate prediction of TCGA phenotypes.
• Hosting of collaborative competitions to compare models from many groups.
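The comparison described in the bullets above can be sketched roughly as follows. This is a minimal illustration and not the Synapse pipeline itself: the two toy approaches (`fit_mean`, `fit_line`), the mean-squared-error score, and the sample numbers are all assumptions made for the example. The point is only that every approach is trained and scored on the same data, so the ranking is directly comparable.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Hypothetical approach 1: always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Hypothetical approach 2: one-feature least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def mse(predict, xs, ys):
    """Mean squared error of a fitted predictor on a test set."""
    return mean((predict(x) - y) ** 2 for x, y in zip(xs, ys))

def compare(approaches, train, test):
    """Fit every approach on the same training data, score each on the
    same test data, and return (name, error) pairs from best to worst."""
    train_x, train_y = train
    test_x, test_y = test
    scores = {
        name: mse(fit(train_x, train_y), test_x, test_y)
        for name, fit in approaches.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Made-up numbers standing in for (feature, drug sensitivity) pairs.
TRAIN = ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
TEST = ([4.0, 5.0], [9.0, 11.0])
APPROACHES = {"mean": fit_mean, "line": fit_line}
RANKING = compare(APPROACHES, TRAIN, TEST)

for name, error in RANKING:
    print(f"{name}: test MSE = {error:.2f}")
```

Adding a new approach only means adding one entry to `APPROACHES`, which is the property that makes a shared leaderboard possible.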
INTEROPERABILITY
Genome Pattern CYTOSCAPE tranSMART I2B2
SYNAPSE
Clinical Trial Comparator Arm Partnership (CTCAP)
Description: Collate, Annotate, Curate and Host Clinical Trial Data with Genomic Information from the Comparator Arms of Industry and Foundation Sponsored Clinical Trials: Building a Site for Sharing Data and Models to evolve better Disease Maps.
Public-Private Partnership of leading pharmaceutical companies, clinical trial groups and researchers.
Neutral Conveners: Sage Bionetworks and Genetic Alliance [nonprofits].
Initiative to share existing trial data (molecular and clinical) from non-proprietary comparator and placebo arms to create powerful new tool for drug development.
Started Sept 2010
Sharing and analysis of clinical/genomic data will maximize clinical impact and enable discovery
• Graphic: curated data to QC’d data to models
Arch2POCM
Restructuring the Precompetitive Space for Drug Discovery
How to potentially De-Risk High-Risk Therapeutic Areas
The Federation
2008 2009 2010 2011
How can we accelerate the pace of scientific discovery?
Ways to move beyond “traditional” collaborations?
Intra-lab vs Inter-lab Communication
Pfizer CTI/ Industrial PPPs Academic Unions
(Nolan and Haussler)
sage federation: model of biological age
[Figure: Predicted Age (liver expression) vs. Chronological Age (years); labels: Faster Aging, Slower Aging, Age Differential]
Clinical Association: Gender, BMI, Disease
Genotype Association
Gene Pathway Expression
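One way to read the Age Differential on this slide is as the gap between predicted and chronological age. The sketch below assumes that sign convention (predicted minus chronological, with a positive gap read as faster aging); the slide does not state the convention, and the function names, tolerance parameter, and sample values are hypothetical.

```python
def age_differential(predicted_age, chronological_age):
    """Assumed convention: predicted age minus chronological age, in years."""
    return predicted_age - chronological_age

def aging_label(differential, tolerance=0.0):
    """Label a sample by the sign of its age differential.
    `tolerance` is a hypothetical dead band around zero."""
    if differential > tolerance:
        return "faster aging"
    if differential < -tolerance:
        return "slower aging"
    return "as expected"

# Made-up samples: (predicted age from liver expression, chronological age).
SAMPLES = [(60.0, 50.0), (42.0, 55.0), (30.0, 30.0)]
LABELS = [aging_label(age_differential(p, c)) for p, c in SAMPLES]
print(LABELS)
```

The resulting differential is the quantity that would then be tested for association with clinical, genotype, and pathway variables.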
Reproducible science==shareable science
Sweave: combines programmatic analysis with narrative
Friedrich Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. In Wolfgang Härdle and Bernd Rönz, editors, Compstat 2002 – Proceedings in Computational Statistics, pages 575-580. Physica Verlag, Heidelberg, 2002. ISBN 3-7908-1517-9
Portable Legal Consent
(Activating Patients)
John Wilbanks
weconsent.us
What is the problem?
Our current models of disease biology are primitive and limit doctors’ understanding and ability to treat patients
Current incentives reward those who silo information and work in closed systems
The Solution: Competitions to crowd-source research in biology and other fields
Why competitions?
• Objective assessments
• Acceleration of progress
• Transparency
• Reproducibility
• Extensible, reusable models
Competitions in biomedical research
• CASP (protein structure)
• Foldit / EteRNA (protein / RNA structure)
• CAGI (genome annotation)
• Assemblathon / Alignathon (genome assembly / alignment)
• SBV Improver (industrial methodology benchmarking)
• DREAM (co-organizer of Sage/DREAM competition)
Generic competition platforms
• Kaggle, InnoCentive, MLComp
The Sage/DREAM breast cancer prognosis challenge
Goal: Challenge to assess the accuracy of computational models designed to predict breast cancer survival using patient clinical and genomic data
Why this is unique:
Pre-collated cohort: this Sage/DREAM Challenge uses 2,000 breast cancer samples from the METABRIC cohort
Accessible to all: a cloud-based common compute architecture is being made available by Google to support the computation needed to develop and test Challenge models
New rigor:
• Contestants will evaluate their models on a validation data set composed of newly generated data (provided by Dr. Anne-Lise Børresen-Dale)
• Contestants must demonstrate that their models can be reproduced by others
New incentives: a leaderboard to energize participants, and a Science Translational Medicine publication for the winning team
Breast cancer patients, funders and researchers can track this Challenge on BRIDGE, an open source online community being built by Sage and Ashoka Changemakers and affiliated with this Challenge
Sage/DREAM Challenge: Details and Timing
Phase 1: July thru end-Sep 2012
Training data: 2,000 breast cancer samples from METABRIC cohort
• Gene expression • Copy number • Clinical covariates • 10 year survival
Supporting data: Other Sage-curated breast cancer datasets
• >1,000 samples from GEO
• ~800 samples from TCGA
• ~500 additional samples from Norway group
• Curated and available on Synapse, Sage’s compute platform
Data released in phases on Synapse from now through end-September
Will evaluate accuracy of models built on METABRIC data to predict survival in:
• Held out samples from METABRIC
• Other datasets
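The slide does not say which accuracy metric is used to score survival predictions. As an illustration only, the sketch below assumes a concordance index, a common choice for survival models: it measures how often a model ranks the patient who died earlier as the higher-risk one. The held-out values are made up.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs whose predicted risks are
    ordered consistently with observed survival. A pair (i, j) is
    comparable when patient i had the event and a strictly shorter
    survival time than patient j. Tied risk scores count as half."""
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in this dataset")
    return concordant / comparable

# Made-up held-out samples: survival time in years, event flag, model risk.
HELD_OUT_TIMES = [2.0, 4.0, 6.0, 8.0]
HELD_OUT_EVENTS = [True, True, False, True]
GOOD_RISK = [0.9, 0.7, 0.5, 0.1]

PERFECT = concordance_index(HELD_OUT_TIMES, HELD_OUT_EVENTS, GOOD_RISK)
REVERSED = concordance_index(HELD_OUT_TIMES, HELD_OUT_EVENTS, GOOD_RISK[::-1])
UNINFORMATIVE = concordance_index(HELD_OUT_TIMES, HELD_OUT_EVENTS, [0.5] * 4)
print(PERFECT, REVERSED, UNINFORMATIVE)
```

Under this metric a value near 0.5 means the model ranks patients no better than chance, which gives a natural floor for comparing submissions on held-out data.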
Phase 2: Oct 1 thru Nov 12, 2012
Evaluation of models in novel dataset.
Validation data: ~500 fresh frozen tumors from Norway group with:
• Clinical covariates • 10 year survival
Gene expression and copy number data to be generated for model evaluation
• Sent to Cancer Research UK to generate data at same facility as METABRIC
• Models built on training data evaluated on newly generated data
Winners announced at November 12 DREAM conference
METABRIC cohort: 997 breast cancer samples
Clinical covariates
Gene expression (Illumina HT12v3)
Copy number (Affy SNP 6.0)
10 year survival
Loaded through Synapse R client as Bioconductor objects.
Custom models implement train() and predict() API.
Implementation of simple clinical-only survival model used as baseline predictor.
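A train()/predict() contract of this kind might look like the sketch below. It is written in Python for illustration; the slides say Challenge data were loaded through the Synapse R client, and the actual signatures are not shown, so the class names, argument shapes, and the `age` covariate are all hypothetical. The baseline here is a deliberately simple stand-in (a one-covariate least-squares fit that ignores censoring), not the Challenge's actual clinical model.

```python
from abc import ABC, abstractmethod
from statistics import mean

class SurvivalModel(ABC):
    """Hypothetical contract: any custom model exposes train() and predict()."""

    @abstractmethod
    def train(self, samples, survival_times):
        """Fit the model on training samples and their survival times."""

    @abstractmethod
    def predict(self, samples):
        """Return one risk score per sample (higher means higher risk)."""

class ClinicalOnlyBaseline(SurvivalModel):
    """Toy baseline using a single clinical covariate and no genomic data."""

    def __init__(self, covariate="age"):
        self.covariate = covariate
        self.slope = 0.0
        self.intercept = 0.0

    def train(self, samples, survival_times):
        xs = [sample[self.covariate] for sample in samples]
        mx, my = mean(xs), mean(survival_times)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, survival_times))
        self.slope = sxy / sxx if sxx else 0.0
        self.intercept = my - self.slope * mx
        return self

    def predict(self, samples):
        # Risk is the negated predicted survival time, so shorter
        # predicted survival yields a higher risk score.
        return [
            -(self.slope * sample[self.covariate] + self.intercept)
            for sample in samples
        ]

# Made-up training data: survival shortens as age increases.
TRAIN_SAMPLES = [{"age": 40.0}, {"age": 50.0}, {"age": 60.0}, {"age": 70.0}]
TRAIN_TIMES = [10.0, 8.0, 6.0, 4.0]
MODEL = ClinicalOnlyBaseline().train(TRAIN_SAMPLES, TRAIN_TIMES)
SCORES = MODEL.predict([{"age": 45.0}, {"age": 65.0}])
print(SCORES)
```

Because every model shares the same two methods, an evaluation harness can score any submission without knowing how it works internally.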
Trey Ideker, Janusz Dutkowski
Eric Schadt, Gaurav Pandey
Gustavo Stolovitzky, Erhan Bilal
Andrea Califano, Yishai Shimoni, Mukesh Bansal, Mariano Alvarez
Garry Nolan
In Sock Jang, Ben Sauerwine, Stephen Friend, Justin Guinney
Marc Vidal
Adam Margolin
Ben Logsdon
Federation modeling competition
Models submitted and evaluated in real-time
leaderboard >200 models tested within 3 months
Summary https://synapse.sagebase.org/ - BCCOverview:0
Transparency, reproducibility
Validation in novel dataset
Publication in Science Translational Medicine
Donation of Google-scale compute space.
For the goal of promoting democratization of medicine… Registration starting NOW…
sign up at: synapse.sagebase.org
SUMMARY
These new data intensive models of disease will be strikingly powerful
They will not arise within the current academic/industrial loop
They will be harder and more expensive than we can accept
Citizens as donors of data, insights and funds will be critical
For these benefits to be realized, they must become affordable; therefore we will need:
Compute spaces
A Commons
New ways of being rewarded
More eyeballs working without being paid
More willingness to share until after Clinical Proof of Concept
Alignments between the UCSF Strategic Plan and the matchstick pilots for the Information Commons being pursued by Sage Bionetworks
• Invest in infrastructure that enables UCSF to excel in basic, clinical and population science: SYNAPSE as a compute space; links with Michael Wiener/OneMind
• Build a Bioinformatics initiative across all schools by June 2014: Impactful Models/Challenges with Laura Esserman/Laura van ’t Veer
• Enhance existing data repositories and mining tools by June 2012: Sage collaborations with Alex Pico, Geoff Manley
• Develop infrastructure to support new team-based, interdisciplinary learning models: PLC (The Federation?)
• Accelerate translation of groundbreaking science into therapies: Arch2POCM