RA rs6457620 Intergenic C hr.6 75 138 MS rs3135388 D R B 1*1501 108 61 RA rs6679677 RSBN1 238 134 RA rs2476601 PTPN22 238 134 AF rs2200733 C hr.4q25 292 147 CD rs11805303 IL23R 493 107 T2D rs4506565 TC F7L2 503 532 CD rs17234657 C hr.5 513 106 CD rs1000113 C hr.5 626 107 T2D rs12255372 TC F7L2 745 510 T2D rs12243326 TC F7L2 746 520 CD rs17221417 NOD2 866 107 AF rs10033464 C hr.4q25 1046 143 CD rs2542151 PTPN22 1104 107 MS rs2104286 IL2R A 2133 61 MS rs6897932 IL7R A 2263 61 T2D rs10811661 C D K N 2B 2406 534 T2D rs8050136 FTO 2569 533 T2D rs5219 K C N J11 2792 533 T2D rs5215 K C N J11 2908 527 T2D rs4402960 IG F2B P2 3111 527 gene /region m arker num ber needed num ber identified disease O dds ratio 0.5 1.0 2.0 5.0 0.1 1 10 PGPop: PharmacoGenomic discovery and replication in very large patient POPulations PGPop: SUMMARY PGPop was conceived as a network resource to provide to PGRN an opportunity to identify large groups of real world patients with known drug exposures and outcomes for pharmacogenomic study in a clinical setting. Each PGPop node includes a very large collection of patient data, drug exposures, and outcomes, and they share the general characteristic that they include “all comers” rather than more narrowly defined clinical trial populations. Some consortium nodes include large DNA collections in place, while others cover millions of lives and have committed to an infrastructure to collect DNA from patients with identified phenotypes. The participating systems include •BioVU, the Vanderbilt DNA databank that currently links 90,000 de-identified electronic health records (EHR) records with DNA obtained from discarded blood samples •The Marshfield Clinic Personalized Medicine Research Project (PMRP) that includes DNA from almost 20,000 individuals coupled to an EHR that extends back to the 1960s •Informatics for Integrating Biology and the Bedside (i2b2), an informatics capability at Harvard supported by the National Center for Biomedical Computing. The i2b2 group will not only contribute informatics excellence, but has also developed the Crimson Project that can provide DNA linked to de-identified medical records to Harvard Partners investigators from over 800,000 patient visits annually. •BioBank Japan, a resource that includes DNA and other biospecimens in >300,000 subjects. Clinical data are collected by medical coordinators at each of the 66 participating hospitals that cover 2% of all Japanese hospital beds (~25,000). •The integrated pharmacoepidemiology program of 13 health plans participating in the HMO Research Network Center for Education and Research in Therapeutics (CERT); these plans together cover 11,000,000 lives. •The Pharmacy benefits company Medco, that currently provides services to >60 million patients and has an active program in pharmacogenomics Vanderbilt BioVU – design and current status Leadership at PGPop nodes Top: The BioVU model. BioVU uses DNA extracted from blood samples that were obtained in the course of clinical care and that are about to be discarded. Using discarded biologic material as a research resource requires that the associated clinical information be de-identified. Accordingly, the first step (top left) in creation of the BioVU resource was creation of an image, termed the Synthetic Derivative, of the Vanderbilt EMR in which identifiers have been scrubbed and the medical record number has been hashed. The medical record number in eligible blood samples is labeled with the same hashed number, and DNA extracted. Bottom: Sample access procedures. After signing a data use agreement, investigators gain access to the Synthetic Derivative. The Data Use Agreement includes further stipulations against attempts at re- identification, and mandates that genotype data be redeposited into the resource. Tools to conduct simple automated searches are in place, but investigator curation is generally required to more precisely identify cases and controls for subsequent studies. Sam ple retrieval B699tre563msd.. scrub be d F5rt783m bncds… scrub bed B699tre563msd.. scrub bed F5rt783m bncds… scru bbed B699tre563msd.. scrub be d F5rt783m bncds… scrub be d B699tre563msd.. scrub bed F5rt783m bncds… scrub bed B699tre563msd.. scrub bed F5rt783m bncds… scru bbed B699tre563msd.. scrub be d F5rt783m bncds… scrub be d B699tre563msd.. scrub bed F5rt783m bncds… scru bbed B699tre563msd.. scrub bed F5rt783m bncds… scru bbed B699tre563msd.. sc ru bbed F5rt783m bncds… sc r u bbed B699tre563msd.. scrub be d F5rt783m bncds… scrubbed B699tre563msd.. scrub bed F5rt783m bncds… scru bbed B699tre563msd.. scr u bbed F5rt783m bncds… scrub bed B699tre563msd.. scrub be d F5rt783m bncds… scrubbed B699tre563msd.. scru bbed F5rt783m bncds… scru bbed B699tre563msd.. scru bbed F5rt783m bncds… scrub be d B699tre563msd.. scrub be d F5rt783m bncds… scrubbed B699tre563msd.. scru bbed F5rt783m bncds… scru bbed B699tre563msd.. scru bbed F5rt783m bncds… scrub be d B699tre563msd.. scrub be d F5rt783m bncds… scrubbed B699tre563msd.. scru bbed F5rt783m bncds… scrub bed B699tre563msd.. scru bbed F5rt783m bncds… scrub be d B699tre563msd.. scrubbed F5rt783m bncds… scrubbed B 699tre563m sd.. scrubbed F5rt783m bncds… scrub be d B699tre563msd.. scru bbed F5rt783m bncds… scrub be d F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B 699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B 699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . G enotyping, genotype- phenotype relations cases controls + O ne w ay hash Investigator query cases controls + D ata use agreem ent BioVU (Vanderbilt University Medical Center) Marshfield Clinic Personalized Medicine Research Project (PMRP) Crimson Project (i2b2 at Harvard) HMO Research Network Center for Education and Research in Therapeutics Biobank Japan Medco (Pharmacy benefits) R esource C urrent size EM R DNA in hand Ethnicity (% ) C aucasian African American Asian Hispanic BioVU 90,000 Y Y 85 12 1 1 PM RP 20,000 Y Y 98 0.5 1 Crim son 800,000 Y 60 10 15 15 Biobank Japan 300,000 Y 100 HMORN CERT 11,000,000 Y varies 1-33 1-9 1-39 M edco 65,000,000 Hua Xu Josh Denny Yusuke Nakamura Zak Kohane Cathy McCarty Bob Davis Felix Frueh Dan Roden, PI The BioVU “demonstration project”. The first 10,000 subjects accrued were all genotyped at multiple SNP sites previously associated with disease susceptibility, and then natural language processing methods were used to identify cases and controls in the entire set. The experiment thus mimics a situation in which genotypic information is available in many subjects, and sets are then selected for genotype-phenotype analysis. The results are ordered by the number of cases estimated for replication (“number needed” column), calculated from previously-reported odds ratios, indicated by a red square. The number of cases actually identified is also shown (“number identified”). The blue diamonds indicate the point estimate of the allelic odds ratio derived from analysis of cases and controls identified. The confidence intervals for these estimates are also provided. This analysis used only cases in which European ancestry had been assigned. AF: atrial fibrillation; CD: Crohn’s Disease; MS: multiple sclerosis; RA: rheumatoid arthritis; T2D: type 2 diabetes. eligible John D oe O ne w ay hash A7C C F99D E5732… . A7C C F99D E65732… . Extract DNA A7C C F99D E65732… . John D oe The “synthetic derivative” (SD ) Searches conducted in B ioVU (April-M ay 2009,in preparation forthe PG Pop subm ission) Phenotype Location in EM R searched R equesting investigator/ site Num ber % w om en % A frican- American BioVU (M ay 21,2009) 56,907 58.1 9.9 w arfarin medications PAT 4,482 48.3 9.5 5 m ostcomm only prescribed statins medications Krauss/PA R C 10,216 46.0 10.9 clopidogrel medications Shuldiner/PAPI Limdi/UAB 4,407 42.4 10.1 prednisone or dexam ethasone medications Relling /PA AR 4KID S 10,584 58.7 12.1 m etform in + Type 2 diabetes + H gA1c Com plex N LP-based search Giacomini/PM T 1,794 55.7 21.3 rheum atoid arthritis Com plex N LP-based search Plenge/M GH 1,777 77.1 9.5 asthma ICD 9 code W eiss/PH AT 3,916 70.8 17.5 hypertension ICD 9 code Johnson/P EA R 21,102 52.0 14.3 Zyban,Wellbutrin,bupropion, C hantix,V arenicline in m edications O R “nicotine replacem ent”in the history and physical,problem listor discharge sum m ary. Tyndale/P N AT 3,855 70.2 8.3 N LP:N atural Language processing;EM R :Electronic M edical R ecord PGPop goals PGPop will be managed by a Steering Committee that will include representation from the participating nodes. Our initial task will be (1) organization of the resource and (2) execution of a demonstration project that will establish mechanisms for access to samples from multiple resource nodes. We anticipate that mechanisms to access PGPop will be similar to those being established for access to other PGRN resources. This will likely involve an application process to be reviewed by components of the PGRN and by PGPop. There will be costs associated with accessing the samples, which remain to be determined. PGPop goals are 1. Establish the infrastructure to enable rapid access to well-phenotyped samples across nodes • Catalog resource components • Facilitate access to cases and controls, and ultimately samples • Coordination of methods to define phenotypes across nodes. 2. Undertake a demonstration project across nodes in Year 01 3. Deploy the resource for pharmacogenomic studies proposed by PGRN sites • The Steering Committee and PGRN will receive applications and decide on scientific merit. The Steering Committee will establish which PGPop node(s) can and wish to collaborate on a given project. Any single PGRN center could interact individually with any participating node. We anticipate that PGPop would support 1-2 projects/year. 4. Evaluate best practices and models for using large resources for pharmacogenomic science