integrating Data for Analysis, Anonymization, and SHaring
Infrastructure to Host Sensitive Data:
HIPAA Cloud Storage and Compute
Claudiu Farcas, Olivier Harismendy, Antonios Koures
UCSD
Outline
• History
• iDASH CLOUD/SHADE Current State
• Future Plans for CLOUD
• The Quest for Repeatable Science
• Genomics Collaborations
9/30/2015 Supported by the NIH Grant U54 HL108460 to the University of California, San Diego 2
In the beginning…
There was a “MindMap” to serve the needs of a very diverse community.
We had a plan…
Conquer the world of hardware through extensive virtualization.
… and started our journey.
Some hardware and lots of ideas… some good, others …
Roadblocks and mishaps…
Some things simply don’t work out… so we start from scratch.
… and successes to keep us going.
Tools and data create science!
Fast forward to today…
[Architecture diagram]
SHADE: Safe HIPAA-compliant Annotated Data deposit-box Environment — holds HIPAA and non-public data alongside public data, tools, and recipes; powered by MIDAS; users upload and download data.
iDASH CLOUD: On-demand Virtualized Elastic Resilient Compute And Storage Technology — compute nodes, memory, disk storage, and networking; powered by VMware; fully automated; middleware and HIPAA security developed by iDASH; handles compute requests and direct upload and download of proprietary data, tools, and recipes.
Quarantine → Development → Staging → Production: successive progression through the environments towards Production

Technical Specifications
• 3 computation tiers, 3 storage tiers
• 10GbE throughout
• Full redundancy
• RSA two-factor authentication
• Remote data replication
• 1000+ cores, 9TB+ RAM, 1PB+ storage
Cloud Environments
• Quarantine
» An isolated environment for incoming code and applications
• Only accessible internally
• All ports closed, except SSH
» Apps and/or code can be scanned for vulnerabilities, malware, etc.
Cloud Environments
• Development & Testing
» A controlled environment designed for agile development
» No Personal Health Information (PHI)
• VPN access, no two-factor authentication
» Source code control
» Bug tracking
» Development wiki
» Confluence
» Group chat utility
Cloud Environments
• Staging/QA Environment
» Uses both VPN and two-factor authentication
» Mirrors, but is independent of, Production
• Used for pre-production tests and user acceptance testing (UAT)
• Must pass user acceptance before promotion into Production
• Production
» Highly secure, with VPN, two-factor authentication, and significant development restrictions
Cloud Improvements Y5
• Added a Quarantine environment with tools/utilities to analyze unknown incoming applications and source code
• Added Development services to the Development environment
» GitLab source code control
» Jira for bug and issue tracking
» Openfire chat
» Confluence for organizing and sharing information
Cloud Improvements Y5
• Added 3 Dell FX630 chassis, each with 4 blades
• Each blade has two CPU sockets populated with Intel Haswell E5-2699 v2 processors and 512GB of RAM
• This brings the available core count to over 1000
• Added additional disks (SSD and non-SSD) to increase the capacity of the cloud to over 1PB
Future plans
• FISMA ATO
• Integration of popular pipelines (e.g., SeqWare, OmicsPipe) into blueprints
• Billing and accounting
Future State: NSX implementation
• Analogous to server virtualization for compute, NSX network virtualization lets system admins treat the physical network as a pool of transport capacity that can be consumed and repurposed on demand
• Network services are programmatically distributed to each virtual machine, independent of the underlying network hardware or topology, so workloads can be dynamically added or moved, and all network and security services attached to a virtual machine move with it
» Automate network provisioning for tenants, with customization and complete isolation
» Better, dynamically adjustable isolation of the cloud environments (Quarantine, Dev, QA, Prod)
Future State: Hybrid Cloud
[Diagram: hybrid cloud — the private cloud linked to a secure section of a public cloud]
Challenges for reproducible research
• missing or obsolete source code
• undocumented or unexpected dependencies to install and configure applications
• undisclosed values of the parameters used in published analyses
• requirements for querying and pre-processing external reference datasets.
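One way to address the gaps above is to record tool versions, parameter values, and reference-data fingerprints in a machine-readable manifest alongside each analysis. The sketch below is illustrative only — the field names and values are assumptions, not an iDASH format.

```python
import hashlib
import json

def make_manifest(tools, parameters, reference_files):
    """Build an analysis manifest; reference files are fingerprinted
    by SHA-256 so a later run can verify it uses the same data."""
    return {
        "tools": tools,            # tool name -> pinned version
        "parameters": parameters,  # every parameter value used in the run
        "references": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in reference_files.items()
        },
    }

# Hypothetical tools, parameters, and reference content for illustration.
manifest = make_manifest(
    tools={"bwa": "0.7.17", "gatk": "3.4"},
    parameters={"min_base_quality": 20, "min_coverage": 8},
    reference_files={"hg19.fa": b"ACGT placeholder bytes"},
)
print(json.dumps(manifest, indent=2))
```

Publishing such a manifest with the results removes the "undisclosed parameters" and "obsolete dependencies" failure modes in one step.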
Support for containers
Running Docker within a Linux VM:
• Flexibility of applications (bundle necessary libraries, retain provenance)
• Improve efficiency, scalability, economics, and security of the cloud
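A minimal sketch of the container approach: each pipeline step runs a pinned image with its inputs and outputs bind-mounted from the VM. The image name and paths are hypothetical placeholders, not the iDASH configuration.

```python
def build_docker_cmd(image, mounts, tool_args):
    """Assemble a `docker run` invocation for one pipeline step.

    mounts: (host_path, container_path, mode) tuples. Pinning the image
    tag keeps the exact tool version with the result (provenance).
    """
    cmd = ["docker", "run", "--rm"]
    for host, container, mode in mounts:
        cmd += ["-v", f"{host}:{container}:{mode}"]
    cmd.append(image)
    return cmd + list(tool_args)

# Hypothetical image and paths, for illustration only.
cmd = build_docker_cmd(
    "biocontainers/bwa:v0.7.17",
    [("/data/reads", "/input", "ro"), ("/data/out", "/output", "rw")],
    ["bwa", "mem", "/input/ref.fa", "/input/sample.fastq"],
)
print(" ".join(cmd))
```

Because the libraries travel inside the image, the same command reproduces the step on any Docker-capable VM in the cloud.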
FlightDeck
Repeatable Results
Workflow: short reads → index reference → align to reference → call variants → annotate variants → pick high-impact deleterious SNPs

iDASH On-Demand Resources: CLOUD, SHADE Repository, Automation
Repeatable Results
Workflow: short reads → index reference → align to reference → call variants → annotate variants → pick high-impact deleterious SNPs

Blueprint = Workflow + Context (reference DB, test data, configuration, helper tools, OS)

iDASH On-Demand Resources: Bookshelf (blueprint templates), MyDATA
Repeatable Results
Workflow: short reads → index reference → align to reference → call variants → annotate variants → pick high-impact deleterious SNPs

Blueprint = Workflow + Context (reference DB, test data, configuration, helper tools, OS)

Instance: a deployed Blueprint that takes Input and produces Results

iDASH On-Demand Resources: Bookshelf, MyDATA, External Data
Protected Health Information
• Cancer genomic data is Protected Health Information
» DNA sequences
• Germline polymorphisms, insertions, deletions
• Somatic mutations
• Structural variations
» RNA sequences
» Genotyping arrays
Cancer Genomics Datasets
• Moores Cancer Center internal datasets (sequencing or genotyping)
» 1032 Chronic Lymphocytic Leukemia
» 20 Myelodysplastic syndromes
» 12 Mesotheliomas
» 29 Appendix cancers
» 38 Breast cancers (sequencing)
» 36 Breast cancers (genotyping)
• “Public” datasets
» The Cancer Genome Atlas (1078 Breast, 478 Lung)
» The International Cancer Genome Consortium (100 Ovarian)
» dbGaP datasets
Genomics Machines Available
Li Ding, Jay Mashl (WashU); Brad Chapman (Harvard)
Best Practice Pipelines
• Germline variant calling
• Cancer variant calling
• Structural variant calling
• RNA-seq
• smallRNA-seq
• ChIP-seq
• Standard
Local vs Remote Public Data
[Bar chart: remote vs. local installation duration for the hg38, ExAC, dbSNP, and dbNSFP reference databases]
A local copy installs up to 1000x faster; cumulative installation time drops from 6.8 hrs to 29 seconds.
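A quick check of the cumulative figures above:

```python
# Cumulative installation time for the reference databases, from the chart.
remote_s = 6.8 * 3600          # remote: 6.8 hours, in seconds
local_s = 29                   # local: 29 seconds
speedup = remote_s / local_s   # ~844x cumulative; individual items reach ~1000x
print(f"{speedup:.0f}x")
```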
Broad Institute, NCBI, SoftGenetics, UC Santa Cruz
Tumor vs Normal Exome
Per sample (NORMAL DNA, TUMOR DNA): .fastq → .bam → .refined.bam → .realigned.bam
Paired: .realigned.bam pair → .vcf and .copy_number → .annotated.vcf

• Alignment (BWA)
• Duplicate removal (Picard), quality recalibration (GATK)
• Indel realignment (GATK)
• Variant calling (VarScan)
• Variant annotation (Oncotator, VariantTools)
• Databases: dbNSFP, ExAC, dbSNP, COSMIC
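The pipeline above can be sketched as an ordered step plan. Tool names match the slide (BWA, Picard, GATK, VarScan, Oncotator/VariantTools); the file names are illustrative placeholders, not the production configuration.

```python
def per_sample_steps(sample):
    """Steps run independently on the normal and the tumor DNA."""
    return [
        ("Alignment (BWA)",              f"{sample}.bam"),
        ("Duplicate removal (Picard)",   f"{sample}.refined.bam"),
        ("Quality recalibration (GATK)", f"{sample}.refined.bam"),
        ("Indel realignment (GATK)",     f"{sample}.realigned.bam"),
    ]

def paired_steps():
    """Steps that compare the two realigned BAMs."""
    return [
        ("Variant calling (VarScan)",                    "somatic.vcf"),
        ("Variant annotation (Oncotator, VariantTools)", "somatic.annotated.vcf"),
    ]

# Full plan: both samples processed independently, then compared.
plan = per_sample_steps("normal") + per_sample_steps("tumor") + paired_steps()
for step, output in plan:
    print(f"{step:45s} -> {output}")
```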
• Performance is similar to the public cloud
• Little/no overhead from Docker
Pan Cancer Analysis of Whole Genomes
• 2,601 donors (Tumor-Normal WG pairs)
• ~300GB of data per donor
• Sanger Center Dockerized workflow
• 9 VMs, 32 CPUs, 256GB RAM, 1TB storage
• 115 donors analyzed
» Shortest 21hrs
» Longest 17 days (not included in freeze)
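A back-of-envelope scale check for the figures above:

```python
# PCAWG cohort size from the slide: 2,601 donors at ~300 GB each.
donors = 2601
gb_per_donor = 300
total_tb = donors * gb_per_donor / 1000
print(f"~{total_tb:.0f} TB of input data across the full cohort")  # ~780 TB
```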
[Workflow diagram: PCAWG Sanger Workflow — parallel download and input-preparation steps (getBasFile, ASCAT allele count, Pindel input, BRASS input) feed the ASCAT, Pindel (×24), BRASS, and Caveman (split/mstep/estep ×N) branches; each branch packages its results, ending with metrics and VCF upload]
PCAWG Sanger Wall Time
[Box plot: wall time (s) of the Sanger workflow by analysis center — bsc, dkfz, ebi, etri, idash, oicr, osdc, pdc, riken, sanger, ucsc]
• Not corrected for number of CPUs, available RAM, or file size
• Workflow versions 1.0.4/5/6
• OICR is mainly run on AWS spot instances