Big Data in Biomedicine: Translating 300 trillion points of data into new drugs and diagnostics
Atul Butte, MD, PhD
Chief, Division of Systems Medicine, Departments of Pediatrics, Genetics, and, by courtesy, Computer Science, Pathology, and Medicine
Center for Pediatric Bioinformatics, LPCH
Stanford University
@atulbutte @ImmPortDB
Preeclampsia: large cause of maternal and fetal death
• Incidence
– 5-8% of all pregnancies in the U.S. and worldwide
– 4.1 million births in the U.S. in 2009
– Up to 300K cases of preeclampsia annually in the U.S.
• Mortality
– Responsible for 18% of all maternal deaths in the U.S.
– Maternal death in 56 out of every 100,000 live births in the U.S.
– Neonatal death in 71 out of every 100,000 live births in the U.S.
• Cost
– $20 billion in direct costs in the U.S. annually
– Average hospital stay of 3.5 days
Linda Liu, Matt Cooper, Bruce Ling
New markers for preeclampsia
[Figure: control vs. preeclampsia groups, y-axis in ng/ml, x-axis gestational age (weeks). Panels: GA 23-34 weeks and GA > 34 weeks. Group sizes shown: Control N=16 and Preeclampsia N=15; Control N=16 and Preeclampsia N=17. Reported p values: 3.49 × 10^-4, 1.79 × 10^-5, and 1.92 × 10^-8.]
March of Dimes Prematurity Research Center at Stanford University School of Medicine
Linda Liu, Bruce Ling
Sequencing Excitement
• 454/Roche, Life Technologies
• Helicos: $30k genome
• Pacific Biosystems: sequence human genome in 15 minutes
• Run times in minutes at a cost of hundreds of dollars
• Complete Genomics: 80 genomes/day
• Ion Torrent and Illumina: ~$1500 per genome
• Oxford: USB stick
Lancet, 375:1525, May 1, 2010.
Credit: Euan Ashley, Russ Altman, Steve Quake, Lancet
• Study published in 2008 in Inflammatory Bowel Disease
• Crohn’s Disease and Ulcerative Colitis
• Investigated 9 loci in 700 Finnish IBD patients
• We record 100+ items
– GWAS, non-GWAS papers
– Disease, Phenotype
– Population, Gender
– Alleles and Genotypes
– p-value (and confidence)
– Odds ratio (and confidence)
– Technology, Study design
– Genetic model
• Mapped to UMLS concepts
Rong Chen, Optra Systems
• Study published in 2009 in Rheumatology
• Ankylosing spondylitis
• Investigated 8 SNPs in IL23R in 2000 UK case-control patients
• Tables can be rotated
• NLP is hard
What are the alleles for rs1004819?
Alleles for rs1004819 are C and T
~11% of records reported genotypes on the negative strand
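The strand issue above can be illustrated with a minimal sketch. It assumes each curated record carries a genotype string and that the reference alleles for the SNP are known (C and T for rs1004819, as stated above). The function name `to_reference_strand` is hypothetical and is not part of VARIMED.

```python
# Complement of each DNA base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def to_reference_strand(genotype, reference_alleles):
    """Return the genotype expressed in the reference alleles.

    Hypothetical helper: if the reported bases are not among the reference
    alleles, assume the paper reported the opposite strand and complement them.
    """
    ref = set(reference_alleles)
    bases = set(genotype)
    if not bases <= set(COMPLEMENT):
        raise ValueError(f"unexpected base in genotype {genotype!r}")
    if bases <= ref:
        # Already consistent with the reference alleles. Note that A/T and
        # C/G SNPs are their own complements, so a strand flip cannot be
        # detected this way for them.
        return genotype
    flipped = "".join(COMPLEMENT[base] for base in genotype)
    if set(flipped) <= ref:
        return flipped
    raise ValueError(f"genotype {genotype!r} matches neither strand")

# A record reporting G/A for rs1004819 is on the opposite strand of C/T.
print(to_reference_strand("GA", "CT"))  # CT
```

The check only works when the SNP's alleles differ from their own complements. For A/T and C/G SNPs, strand has to be resolved from other information, such as allele frequencies or flanking sequence.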
Number of papers curated: ~19,000
Number of records: ~1.6 million
Distinct SNPs: ~473,000
Diseases and phenotypes: ~7,400
Rong Chen Anil Patwardhan
Michael Clark Optra Systems
Personalis
VARIMED: Variants Informing Medicine
Chen R, Davydov EV, Sirota M, Butte AJ. PLoS One. 2010 October; 5(10): e13574.
Diseases and Traits
• Risk factors are associated with an increased likelihood of developing a given disease
– Smoking → chronic obstructive pulmonary disease
• Risk factors are identified for diseases through large-scale epidemiological studies, which are resource intensive
• GWAS have identified genetic variants for thousands of diseases and traits
• If traits and diseases share the same associated genetic variants, could the trait be used to suggest risk factors for disease?
Identify significant disease-trait genetic associations and clinically validate using EMR data
[Workflow figure. Labels recovered from the Genetics Module: Varimed; gene counts > 3; Disease (n=201); Trait (n=85); Disease (n=69); Trait (n=249); p < 1e-8; TF-IDF weighting, cosine distance, random shuffling; q ≤ 0.01; Disease-Trait Pair (n=120); Disease modules (n=8); Trait modules (n=7); Published findings (n=94); Novel predictions (n=26). Labels recovered from Clinical Validation: complications, diagnostic tests, risk factors; timing labels: before dx, 1st dx, after dx.]
Li Li
Assessing significance of disease-trait (D-T) pair
• Each gene within an individual disease or trait was weighted by taking into account the frequency of the gene: Term Frequency–Inverse Document Frequency
• tf-idf(i, j) = tf(i, j) × idf_i = n_{i,j} / (∑_k n_{k,j}) × log(D / D_i), which adjusts the score tf(i, j) by taking into account the popularity of gene i.
• e.g., with 154 diseases and traits (D+T), 28 genes in Alzheimer's disease (AD) and 5 genes in ESR, and CR1 in common (so D = 154 and D_i = 2):
– tf-idf(AD) = 1/28 × log10(154/2) = 0.067
– tf-idf(ESR) = 1/5 × log10(154/2) = 0.377
• D-T distance score was calculated using cosine distance to evaluate similarity between all pairs.
• Genes were randomly sampled across all the traits and the D-T similarity was recalculated; this was repeated 1,000 times, and the q value was generated from the number of samplings.
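The weighting and scoring steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline. The toy corpus is built only to reproduce the CR1 arithmetic on this slide. All function names are hypothetical. The shuffling step draws a random gene set of the trait's size from the pooled genes and reports the fraction of draws that score at least as high as the observed pair, which is one plausible reading of "based on the number of samplings". The code computes cosine similarity; the corresponding distance is 1 minus that value.

```python
import math
import random

def idf_table(gene_sets):
    """idf_i = log10(D / D_i): D diseases and traits, D_i of them contain gene i."""
    n_docs = len(gene_sets)
    doc_freq = {}
    for genes in gene_sets.values():
        for gene in genes:
            doc_freq[gene] = doc_freq.get(gene, 0) + 1
    return {gene: math.log10(n_docs / df) for gene, df in doc_freq.items()}

def tf_idf_vector(genes, idf):
    """Each gene occurs once per disease or trait, so tf = 1 / (number of genes)."""
    tf = 1.0 / len(genes)
    return {gene: tf * idf[gene] for gene in genes}

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse gene-weight vectors."""
    dot = sum(weight * v[gene] for gene, weight in u.items() if gene in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def shuffled_fraction(disease_vec, trait_size, idf, observed, n_iter=1000, seed=0):
    """Fraction of random trait gene sets scoring >= the observed similarity.

    Assumption: random sets are drawn from the pooled genes and weighted
    with the original idf values.
    """
    rng = random.Random(seed)
    pool = sorted(idf)
    hits = 0
    for _ in range(n_iter):
        sample = rng.sample(pool, trait_size)
        if cosine_similarity(disease_vec, tf_idf_vector(sample, idf)) >= observed:
            hits += 1
    return hits / n_iter

# Toy corpus: 154 diseases and traits; AD has 28 genes, ESR has 5, CR1 is shared.
gene_sets = {f"other_{k}": {f"gene_{k}"} for k in range(152)}
gene_sets["AD"] = {"CR1"} | {f"ad_gene_{k}" for k in range(27)}
gene_sets["ESR"] = {"CR1"} | {f"esr_gene_{k}" for k in range(4)}

idf = idf_table(gene_sets)
ad_vec = tf_idf_vector(gene_sets["AD"], idf)
esr_vec = tf_idf_vector(gene_sets["ESR"], idf)
observed = cosine_similarity(ad_vec, esr_vec)
fraction = shuffled_fraction(ad_vec, len(gene_sets["ESR"]), idf, observed)

print(round(ad_vec["CR1"], 3), round(esr_vec["CR1"], 3))  # 0.067 0.377
```

The printed weights match the worked example because CR1 appears in 2 of the 154 gene sets. Every other gene in the toy corpus is unique to one set.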
Categorizations for known D-T pairs and discover potential confounders in GWAS studies
[Figure: 93 known D-T pairs, shown as groups of 38, 27, and 28 pairs, organized by timing of disease progression into Risk Factor, Consequence, and Diagnostic Test. Each panel links a trait (T) and a disease (D) through genetic variants.]
Li Li
Diagnostic tests where traits occur at the same time as disease onset
Antibody titer
Hepatitis B vaccine response, Png et al, Hum Mol Genet, 2011
Even though this GWAS did not explicitly include participants with the autoimmune diseases above, our approach inferred known relationships between diseases and traits based on their shared genetic architecture.
[Diagram: Diagnostic Test, with trait (T) and disease (D) linked by genetic variants]
Li Li
Significant genes shared between antibody titer and 16 autoimmune diseases
Known clinical study: Smoking is the primary risk factor for COPD, although little was known about the pathogenesis linking smoking and COPD. Pauwels et al, 2001; Vestbo et al, 2012
In GWAS study: Six GWAS studies are related to COPD in VARIMED, and their COPD cohorts all consist of patients who smoke. Cho et al, 2012; Pillai SG, 2010; Wang et al, 2010; Cho et al, 2010; Lambrechts et al, 2010; Pillai SG, 2009
As COPD occurs after smoking, the variants associated with COPD could be influenced by smoking, and the genetic variants for COPD could be unmasked if smoking is excluded as a confounder in GWAS.
[Diagram: Smoking and COPD linked by genetic variants]
Li Li
Consequence where traits occur after the disease onset
Trait                              Common Genes   Genes Shared             q-value
Alanine aminotransferase levels    1              C12orf51                 0.001
Cholesterol levels                 3              ALDH2; BRAP; C12orf51    0.001
HDL cholesterol levels             2              C12orf51; OAS3           <0.001
Known clinical study: A high HDL criterion was observed with triple frequency in the ADS group, a high-cholesterol diet was associated with ADS patients, and ALT levels have been seen to increase with daily alcohol intake in patients who developed ADS. Kahl et al, 2010; Imhof et al, 2001; Gross GA, 1994
In GWAS study: 3 genes for cholesterol levels reported by Kato et al. and 2 genes for ALT and HDL-C reported by Young et al. could be biased by the effect of alcohol, as the authors did not adjust for alcohol intake or control for drinking habits on these genes in their GWAS studies. Kato et al, 2011; Kamatani et al, 2010
The GWAS to identify concrete genetic variants for these three clinical measurements should be performed in patients without ADS as a confounder.
[Diagram: Alcohol dependence syndrome (ADS) linked to ALT and HDL-C]
Li Li
27 novel pairs
Trait                                 Disease                        Common Genes   Genes Shared   q-value
Mean corpuscular volume               Acute lymphoblastic leukemia   1              IKZF1          0.001
Mean cell hemoglobin concentration    Alcohol dependence             1              ALDH2          0.005
Independent patient cohort validation: clinical data warehouses
• STRIDE: clinical data warehouse with ICD9 diagnosis codes, CPT procedure codes, and lab results on over 1.7 million pediatric and adult patients at Stanford Hospital and Clinic; independent cohort from 1/1/2005 to 7/15/2012
• Collaborations also with Columbia University and Mount Sinai School of Medicine to validate findings
• Time frame for analysis: within one year before the 1st disease diagnosis or within one year after the 1st disease diagnosis
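The analysis window above can be sketched as a small date check. This is a hedged illustration only. It assumes "one year" means 365 days, that the window boundaries are inclusive, and that a measurement on the diagnosis date is reported separately. The function name `classify_timing` and the returned labels are hypothetical and do not come from STRIDE.

```python
from datetime import date, timedelta

# Assumption: "one year" is taken as 365 days, boundaries inclusive.
WINDOW = timedelta(days=365)

def classify_timing(first_dx: date, measured: date) -> str:
    """Place a lab measurement relative to the first disease diagnosis."""
    delta = measured - first_dx
    if abs(delta) > WINDOW:
        return "outside window"
    if delta < timedelta(0):
        return "before dx"
    if delta > timedelta(0):
        return "after dx"
    return "at 1st dx"

first_diagnosis = date(2010, 6, 1)
print(classify_timing(first_diagnosis, date(2010, 1, 15)))  # before dx
print(classify_timing(first_diagnosis, date(2012, 1, 1)))   # outside window
```

Keeping the "before" and "after" labels separate lets the same measurements be compared on either side of the first diagnosis.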
Collaborators
• Jeff Wiser, Patrick Dunn, Mike Atassi / Northrop Grumman
• Ashley Xia and Quan Chen / NIAID
• Takashi Kadowaki, Momoko Horikoshi, Kazuo Hara, Hiroshi Ohtsu / U Tokyo
• Kyoko Toda, Satoru Yamada, Junichiro Irie / Kitasato Univ and Hospital
• Shiro Maeda / RIKEN
• Alejandro Sweet-Cordero, Julien Sage / Pediatric Oncology
• Mark Davis, C. Garrison Fathman / Immunology
• Russ Altman, Steve Quake / Bioengineering
• Euan Ashley, Joseph Wu, Tom Quertermous / Cardiology
• Mike Snyder, Carlos Bustamante, Anne Brunet / Genetics
• Jay Pasricha / Gastroenterology
• Rob Tibshirani, Brad Efron / Statistics
• Hannah Valantine, Kiran Khush / Cardiology
• Ken Weinberg / Pediatric Stem Cell Therapeutics
• Mark Musen, Nigam Shah / National Center for Biomedical Ontology
• Minnie Sarwal / Nephrology
• David Miklos / Oncology
Support
• Lucile Packard Foundation for Children's Health
• NIH: NIAID, NLM, NIGMS, NCI, NIDDK, NHGRI, NIA, NHLBI, NCATS
• March of Dimes
• Hewlett Packard
• Howard Hughes Medical Institute
• California Institute for Regenerative Medicine
• Luke Evnin and Deann Wright (Scleroderma Research Foundation)
• Clayville Research Fund
• PhRMA Foundation
• Stanford Cancer Center, Bio-X, SPARK
• Tarangini Deshpande
• Alan Krensky, Harvey Cohen
• Hugh O’Brodovich
• Isaac Kohane
Admin and Tech Staff
• Susan Aptekar
• Jen Cory
• Boris Oskotsky