PGPop: PharmacoGenomic discovery and replication in very large patient POPulations

Figure: forest plot of the odds ratio for each marker (x-axis: odds ratio).

| Disease | Marker | Gene / region | Number needed | Number identified |
|---|---|---|---|---|
| RA | rs6457620 | Intergenic Chr. 6 | 75 | 138 |
| MS | rs3135388 | DRB1*1501 | 108 | 61 |
| RA | rs6679677 | RSBN1 | 238 | 134 |
| RA | rs2476601 | PTPN22 | 238 | 134 |
| AF | rs2200733 | Chr. 4q25 | 292 | 147 |
| CD | rs11805303 | IL23R | 493 | 107 |
| T2D | rs4506565 | TCF7L2 | 503 | 532 |
| CD | rs17234657 | Chr. 5 | 513 | 106 |
| CD | rs1000113 | Chr. 5 | 626 | 107 |
| T2D | rs12255372 | TCF7L2 | 745 | 510 |
| T2D | rs12243326 | TCF7L2 | 746 | 520 |
| CD | rs17221417 | NOD2 | 866 | 107 |
| AF | rs10033464 | Chr. 4q25 | 1046 | 143 |
| CD | rs2542151 | PTPN22 | 1104 | 107 |
| MS | rs2104286 | IL2RA | 2133 | 61 |
| MS | rs6897932 | IL7RA | 2263 | 61 |
| T2D | rs10811661 | CDKN2B | 2406 | 534 |
| T2D | rs8050136 | FTO | 2569 | 533 |
| T2D | rs5219 | KCNJ11 | 2792 | 533 |
| T2D | rs5215 | KCNJ11 | 2908 | 527 |
| T2D | rs4402960 | IGF2BP2 | 3111 | 527 |


PGPop: SUMMARY

PGPop was conceived as a network resource that gives PGRN an opportunity to identify large groups of real-world patients with known drug exposures and outcomes for pharmacogenomic study in a clinical setting.

Each PGPop node includes a very large collection of patient data, drug exposures, and outcomes. The nodes share the general characteristic that they include “all comers” rather than more narrowly defined clinical trial populations. Some consortium nodes have large DNA collections in place, while others cover millions of lives and have committed to an infrastructure to collect DNA from patients with identified phenotypes. The participating systems include:

• BioVU, the Vanderbilt DNA databank, which currently links 90,000 de-identified electronic health records (EHRs) with DNA obtained from discarded blood samples.

• The Marshfield Clinic Personalized Medicine Research Project (PMRP), which includes DNA from almost 20,000 individuals coupled to an EHR that extends back to the 1960s.

• Informatics for Integrating Biology and the Bedside (i2b2), an informatics capability at Harvard supported by the National Center for Biomedical Computing. The i2b2 group will not only contribute informatics excellence but has also developed the Crimson Project, which can provide Harvard Partners investigators with DNA linked to de-identified medical records from over 800,000 patient visits annually.

• BioBank Japan, a resource that includes DNA and other biospecimens from >300,000 subjects. Clinical data are collected by medical coordinators at each of the 66 participating hospitals, which cover 2% of all Japanese hospital beds (~25,000).

• The integrated pharmacoepidemiology program of 13 health plans participating in the HMO Research Network Center for Education and Research in Therapeutics (CERT); these plans together cover 11,000,000 lives.

• The pharmacy benefits company Medco, which currently provides services to >60 million patients and has an active program in pharmacogenomics.

Vanderbilt BioVU – design and current status

Leadership at PGPop nodes

Top: The BioVU model. BioVU uses DNA extracted from blood samples that were obtained in the course of clinical care and that are about to be discarded. Using discarded biologic material as a research resource requires that the associated clinical information be de-identified. Accordingly, the first step (top left) in creating the BioVU resource was to build an image of the Vanderbilt EMR, termed the Synthetic Derivative, in which identifiers have been scrubbed and the medical record number has been hashed. Eligible blood samples are labeled with the same hashed medical record number, and DNA is extracted.

Bottom: Sample access procedures. After signing a Data Use Agreement, investigators gain access to the Synthetic Derivative. The Data Use Agreement includes further stipulations against attempts at re-identification, and mandates that genotype data be redeposited into the resource. Tools to conduct simple automated searches are in place, but investigator curation is generally required to identify cases and controls more precisely for subsequent studies. Samples are retrieved for genotyping after review of a genotyping plan.

Planning for BioVU began in 2004 and the first samples were acquired in 2007. The resource currently accrues 500-1000 samples/week, and now holds ~90,000 samples. Samples from the Vanderbilt Children’s Hospital were included in spring 2010.
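The hashing step described above can be sketched as follows. This is a minimal illustration, not the BioVU implementation: the poster does not name the hash algorithm or say whether it is keyed, so the keyed SHA-512 (HMAC), the `SECRET_KEY` value, the record fields, and the function names `hash_mrn` and `scrub_record` are all assumptions.

```python
import hashlib
import hmac

# Hypothetical secret held only by the de-identification service (an assumption;
# the source does not describe how the hash is keyed or which algorithm is used).
SECRET_KEY = b"replace-with-a-protected-secret"


def hash_mrn(mrn: str, key: bytes = SECRET_KEY) -> str:
    """Return a one-way research identifier for a medical record number."""
    normalized = mrn.strip().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha512).hexdigest().upper()


def scrub_record(record: dict) -> dict:
    """Drop direct identifiers and key the record by the hashed MRN."""
    identifiers = {"mrn", "name", "address", "phone"}
    scrubbed = {k: v for k, v in record.items() if k not in identifiers}
    scrubbed["research_id"] = hash_mrn(record["mrn"])
    return scrubbed


# Hypothetical record, for illustration only.
emr_record = {"mrn": "000123456", "name": "John Doe", "diagnosis": "atrial fibrillation"}
sd_record = scrub_record(emr_record)

# The blood sample is labeled with the same hash, so it links to the scrubbed
# record without exposing the original medical record number.
sample_label = hash_mrn("000123456")
```

Because the same function labels both the scrubbed record and the sample, the two can be joined on `research_id` while the original identifier is never stored alongside either.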

[Figure: BioVU sample access workflow. Panel labels: One way hash; Data use agreement; Investigator query; cases + controls; Sample retrieval; Genotyping, genotype-phenotype relations. The Synthetic Derivative is drawn as scrubbed records keyed by hashed identifiers such as B699tre563msd… and F5rt783mbncds….]

BioVU (Vanderbilt University Medical Center)

Marshfield Clinic Personalized Medicine Research Project (PMRP)

Crimson Project (i2b2 at Harvard)

HMO Research Network Center for Education and Research in Therapeutics

Biobank Japan

Medco (Pharmacy benefits)

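Table 1: PGPop nodes

Columns: Resource | Current size | EMR | DNA in hand | Ethnicity (%): Caucasian | African American | Asian | Hispanic

BioVU | 90,000 | Y | Y | 85 | 12 | 1 | 1

PMRP | 20,000 | Y | Y | 98 | 0.5 | 1

Crimson | 800,000 | Y | 60 | 10 | 15 | 15

Biobank Japan | 300,000 | Y | 100

HMORN CERT | 11,000,000 | Y | varies | 1-33 | 1-9 | 1-39

Medco | 65,000,000

Rows with fewer entries had blank cells in the original layout; their values are listed in the original left-to-right order.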

Leadership: Hua Xu; Josh Denny; Yusuke Nakamura; Zak Kohane; Cathy McCarty; Bob Davis; Felix Frueh; Dan Roden (PI)

The BioVU “demonstration project”. The first 10,000 subjects accrued were all genotyped at multiple SNP sites previously associated with disease susceptibility, and natural language processing methods were then used to identify cases and controls in the entire set. The experiment thus mimics a situation in which genotypic information is available for many subjects, and sets are then selected for genotype-phenotype analysis. The results are ordered by the number of cases estimated to be needed for replication (“number needed” column), calculated from previously reported odds ratios, which are indicated by red squares. The number of cases actually identified is also shown (“number identified”). The blue diamonds indicate the point estimates of the allelic odds ratio derived from analysis of the cases and controls identified; confidence intervals for these estimates are also provided. This analysis used only cases in which European ancestry had been assigned. AF: atrial fibrillation; CD: Crohn’s Disease; MS: multiple sclerosis; RA: rheumatoid arthritis; T2D: type 2 diabetes.
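For readers unfamiliar with the quantity plotted, one common way to obtain an allelic odds ratio and its confidence interval from allele counts is sketched below. The poster does not state which method was used, so the Woolf (log-based) interval, the function name `allelic_odds_ratio`, and the example counts are assumptions for illustration, not study data.

```python
import math


def allelic_odds_ratio(case_risk, case_other, ctrl_risk, ctrl_other, z=1.96):
    """Allelic odds ratio with a Woolf (log-based) confidence interval.

    Arguments are allele counts (two alleles per subject), not subject counts.
    """
    odds_ratio = (case_risk * ctrl_other) / (case_other * ctrl_risk)
    # Standard error of log(OR) from the four cell counts.
    se_log = math.sqrt(1 / case_risk + 1 / case_other + 1 / ctrl_risk + 1 / ctrl_other)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical allele counts, for illustration only (not data from the study).
or_, lo, hi = allelic_odds_ratio(case_risk=120, case_other=180, ctrl_risk=200, ctrl_other=600)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

An interval that excludes 1.0 suggests an association at the chosen level; with few identified cases the interval widens, which is why the “number identified” matters relative to the “number needed”.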

[Figure: the BioVU model. Panel labels: eligible; John Doe; One way hash (A7CCF99DE65732….); Extract DNA; The “synthetic derivative” (SD).]

Searches conducted in BioVU (April-May 2009, in preparation for the PGPop submission)

| Phenotype | Location in EMR searched | Requesting investigator / site | Number | % women | % African-American |
|---|---|---|---|---|---|
| BioVU (May 21, 2009) | | | 56,907 | 58.1 | 9.9 |
| warfarin | medications | PAT | 4,482 | 48.3 | 9.5 |
| 5 most commonly prescribed statins | medications | Krauss/PARC | 10,216 | 46.0 | 10.9 |
| clopidogrel | medications | Shuldiner/PAPI, Limdi/UAB | 4,407 | 42.4 | 10.1 |
| prednisone or dexamethasone | medications | Relling/PAAR4KIDS | 10,584 | 58.7 | 12.1 |
| metformin + Type 2 diabetes + HgA1c | Complex NLP-based search | Giacomini/PMT | 1,794 | 55.7 | 21.3 |
| rheumatoid arthritis | Complex NLP-based search | Plenge/MGH | 1,777 | 77.1 | 9.5 |
| asthma | ICD9 code | Weiss/PHAT | 3,916 | 70.8 | 17.5 |
| hypertension | ICD9 code | Johnson/PEAR | 21,102 | 52.0 | 14.3 |
| Zyban, Wellbutrin, bupropion, Chantix, Varenicline OR “nicotine replacement” | medications (drug names); history and physical, problem list or discharge summary (“nicotine replacement”) | Tyndale/PNAT | 3,855 | 70.2 | 8.3 |

NLP: Natural Language Processing; EMR: Electronic Medical Record
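The last search in the table combines a medication match with a free-text phrase match. A minimal sketch of that kind of query is shown below; the `Patient` structure, its field names, and the sample records are hypothetical and do not reflect the actual Synthetic Derivative schema or search tools.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    # Hypothetical, simplified view of a de-identified record.
    research_id: str
    sex: str
    medications: list = field(default_factory=list)
    notes: list = field(default_factory=list)  # e.g. problem list, discharge summary


DRUG_TERMS = ("zyban", "wellbutrin", "bupropion", "chantix", "varenicline")
NOTE_PHRASE = "nicotine replacement"


def matches(patient: Patient) -> bool:
    """True if any drug term is in medications OR the phrase is in any note."""
    meds = [m.lower() for m in patient.medications]
    in_meds = any(term in med for med in meds for term in DRUG_TERMS)
    in_notes = any(NOTE_PHRASE in note.lower() for note in patient.notes)
    return in_meds or in_notes


def summarize(patients):
    """Return (count, % women) for the matching patients."""
    hits = [p for p in patients if matches(p)]
    if not hits:
        return 0, 0.0
    pct_women = 100.0 * sum(p.sex == "F" for p in hits) / len(hits)
    return len(hits), pct_women


# Hypothetical records, for illustration only.
cohort = [
    Patient("A1", "F", medications=["Bupropion 150 mg"]),
    Patient("B2", "M", notes=["Discharge summary: started nicotine replacement."]),
    Patient("C3", "F", medications=["Metformin 500 mg"]),
]
count, pct_women = summarize(cohort)
```

Simple substring matching like this will over- and under-call cases (for example, negated mentions in notes), which is consistent with the poster's remark that investigator curation is generally still required.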

PGPop goals

PGPop will be managed by a Steering Committee that will include representation from the participating nodes. Our initial tasks will be (1) organization of the resource and (2) execution of a demonstration project that will establish mechanisms for access to samples from multiple resource nodes.

We anticipate that mechanisms to access PGPop will be similar to those being established for access to other PGRN resources. This will likely involve an application process to be reviewed by components of the PGRN and by PGPop. There will be costs associated with accessing the samples, which remain to be determined.

PGPop goals are:

1. Establish the infrastructure to enable rapid access to well-phenotyped samples across nodes

• Catalog resource components

• Facilitate access to cases and controls, and ultimately samples

• Coordinate methods to define phenotypes across nodes

2. Undertake a demonstration project across nodes in Year 01

3. Deploy the resource for pharmacogenomic studies proposed by PGRN sites

• The Steering Committee and PGRN will receive applications and decide on scientific merit. The Steering Committee will establish which PGPop node(s) can and wish to collaborate on a given project. Any single PGRN center could interact individually with any participating node. We anticipate that PGPop would support 1-2 projects/year.

4. Evaluate best practices and models for using large resources for pharmacogenomic science
