Infrastructure for Sharing Very Large Data Sets http://www.sam.pitt.edu Antonio M. Ferreira, PhD Executive Director, Center for Simulation and Modeling Research Associate Professor Departments of Chemistry and Computational & Systems Biology University of Pittsburgh
PARTS OF THE INFRASTRUCTURE PUZZLE
• Hardware
  • Networking
  • Storage
  • Compute
• Software
  • Beyond scp/rsync
  • Globus, gtdownload, bbcp, etc.
• Policies
  • Not all data is “free”
  • Access controls
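One thing that sets tools in the "beyond scp/rsync" category apart is that they can move a single large file over several parallel streams instead of one. The sketch below only illustrates that idea by splitting a file of a given size into contiguous byte ranges, one per stream. The function name `split_ranges` and its parameters are hypothetical, and this is not how Globus, gtdownload, or bbcp are actually implemented.

```python
def split_ranges(size, streams):
    """Split `size` bytes into up to `streams` contiguous (start, end) ranges.

    `end` is exclusive. Ranges differ in length by at most one byte.
    Hypothetical helper for illustration only.
    """
    if size < 0 or streams < 1:
        raise ValueError("size must be >= 0 and streams must be >= 1")
    base, extra = divmod(size, streams)
    ranges = []
    start = 0
    for i in range(streams):
        # The first `extra` streams each carry one additional byte.
        length = base + (1 if i < extra else 0)
        if length == 0:
            break  # fewer bytes than streams: stop handing out empty ranges
        ranges.append((start, start + length))
        start += length
    return ranges


if __name__ == "__main__":
    for byte_range in split_ranges(10, 4):
        print(byte_range)
```

Each range could then be handed to its own worker or connection; the receiving side reassembles the pieces by offset.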
THE “OLD” MODEL
[Diagram labels: CPU Core, L1i Cache, L1 Cache, L2 Cache, L3 Cache, Main Memory, Disk, Bus]
NETWORK IS THE NEW BUS
[Diagram labels: CPU Core, L1i Cache, L1 Cache, L2 Cache, L3 Cache, Main Memory, Bus, Disk, Network]
DATA SOURCES AT PITT
• TCGA
  • Currently 1.1 PB, growing by ~50 TB/mo.
  • Pitt is the largest single contributor
• UPMC Hospital System
  • 27 individual hospitals generating clinical and genomic data
  • ~30,000 patients in BRCA alone
• LHC
  • Generates more than 10 PB/year
  • Pitt is a Tier 3 site
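To put the growth figures above in network terms, the following sketch converts a data volume per time period into the sustained bit rate needed just to keep up. It assumes decimal units (1 TB = 10^12 bytes), a 30-day month, and a 365-day year; the function name is hypothetical. The LHC figure is a generation rate for the whole experiment, not what a Tier 3 site receives.

```python
SECONDS_PER_DAY = 86_400


def sustained_mbit_per_s(num_bytes, seconds):
    """Average rate in Mbit/s needed to move `num_bytes` in `seconds`."""
    return num_bytes * 8 / seconds / 1e6


if __name__ == "__main__":
    # TCGA growth: ~50 TB per month (30-day month assumed).
    tcga = sustained_mbit_per_s(50e12, 30 * SECONDS_PER_DAY)
    # LHC output: 10 PB per year (365-day year assumed).
    lhc = sustained_mbit_per_s(10e15, 365 * SECONDS_PER_DAY)
    print(f"TCGA growth: {tcga:.1f} Mbit/s sustained")
    print(f"LHC output:  {lhc / 1000:.2f} Gbit/s sustained")
```

These are averages; real transfers arrive in bursts, so the link and the landing file system need headroom well above the sustained rate.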
TCGA DATA BREAKDOWN
[Table columns: Cancer, Pitt Contribution, All Univ's Contribution, Pitt's …]
• Need the right file systems to backend a DMZ
  • Lustre/GPFS
  • How do you pull data from the high-speed network?
  • Where will it land?
• DMZ explicitly avoids certain security restrictions
• Access Controls
  • Genomics/Bioinformatics is growing enormously
  • DMZ is likely not HIPAA-compliant
    • Is it EPHI?
    • Can we let it live with non-EPHI data?
CURRENT FILE SYSTEMS
• /home directories are traditional NFS
• SLASH2 filesystem for long-term storage
  • 1 PB of total storage
  • Accessible from both PSC and Pitt compute hardware
• Lustre for “active” data
  • 5 GB/s total throughput
  • 800 MB/s single-stream performance
  • InfiniBand connectivity
    • Important for both compute and I/O
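The gap between the Lustre aggregate and single-stream figures is why parallel I/O matters. As a rough sketch using only the two numbers above (decimal units assumed, helper names hypothetical), this estimates how many full-speed streams it takes to reach the aggregate rate and how long 1 TB takes in each case. It ignores contention, metadata overhead, and network limits.

```python
import math

TOTAL_MB_PER_S = 5_000   # 5 GB/s aggregate throughput, from the slide
STREAM_MB_PER_S = 800    # 800 MB/s single-stream performance, from the slide


def streams_to_saturate(total_mb_s, stream_mb_s):
    """Smallest number of full-speed streams whose sum reaches the aggregate."""
    return math.ceil(total_mb_s / stream_mb_s)


def seconds_to_move(num_bytes, mb_per_s):
    """Idealized time to move `num_bytes` at `mb_per_s` (1 MB = 10**6 bytes)."""
    return num_bytes / (mb_per_s * 1e6)


if __name__ == "__main__":
    print("streams needed:", streams_to_saturate(TOTAL_MB_PER_S, STREAM_MB_PER_S))
    print("1 TB, single stream:", seconds_to_move(1e12, STREAM_MB_PER_S), "s")
    print("1 TB, aggregate:    ", seconds_to_move(1e12, TOTAL_MB_PER_S), "s")
```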
• Computing on Distributed Genomes
  • How do we make this work once we get the data?
  • Need the APIs
• Genomic data from UPMC
  • UPMC has data collection
  • UPMC lacks HPC systems for analysis
INSTITUTE FOR PERSONALIZED MEDICINE
• Pitt/UPMC joint venture
  • Drug Discovery Institute
  • Pitt Cancer Institute
  • UPMC Cancer Institute
  • UPMC Enterprise Analytics
• Improve patient care
• Discover novel uses for existing therapeutics
• Develop novel therapeutics
• Enable genomics-based research and treatment
WHAT IS PGRR?
What PGRR IS…
1. A common information technology framework for accessing deidentified national big data datasets that are important for Personalized Medicine
2. A portal that allows you to use this data easily with tools and resources provided by the Simulation and Modeling Center (SaM), Pittsburgh Supercomputing Center (PSC), and UPMC Enterprise Analytics (EA)
3. A managed environment to help you meet the information security and regulatory requirements for using this data
4. A process for helping you stay current about updates and modifications made to these datasets
What PGRR is not…
1. A place to store your individual research results
2. A system to access UPMC clinical data
3. A service for analyzing data on your behalf
Pittsburgh Genome Resource Repository
[Architecture diagram. Labels: Source (e.g. NCI, CGHub); TCGA; PGRR; PSC; Pitt (IPM, UPCI); IPM Portal; Pipeline Codes; Data Exacell Storage (SLASH2); Frank; Blacklight; Sherlock; Virtuoso; Metadata; MDS; Database nodes (n0, n1, n2, n3); Panasas 40 TB; Brashear 290 TB; supercell 100 TB; Xyratex 240 TB; Bl1 local 75 TB; Bl2 local 100 TB; BAM and Non-BAM data flows (~8 TB*, ~100 TB*); 10 Gbit (throttled to 2 Gbit); Network Replication; InfiniBand; 1 Gbit (assumed)]
*Growing to ~1 PB of BAM data and 33 TB of non-BAM data
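The diagram labels a 10 Gbit link throttled to 2 Gbit and notes growth to ~1 PB of BAM data. A back-of-the-envelope sketch of what that throttle means for a full replication is below. It assumes decimal units (1 PB = 10^15 bytes), a perfectly utilized link, and no protocol overhead, so real transfers would take longer; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(num_bytes, link_gbit_per_s):
    """Idealized days needed to move `num_bytes` over a link of the given rate."""
    bits = num_bytes * 8
    seconds = bits / (link_gbit_per_s * 1e9)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    one_pb = 1e15  # ~1 PB of BAM data, from the diagram footnote
    print(f"at  2 Gbit/s: {transfer_days(one_pb, 2):.1f} days")
    print(f"at 10 Gbit/s: {transfer_days(one_pb, 10):.1f} days")
```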
How Do We Protect Data?
• Genomic Data (~424 TB)
  • Deidentified genomic data
  • Patient genomic data from UPMC system
• DUAs (Data Use Agreements)
  • Umbrella document signed by all Pitt/UPMC researchers
  • Required training for all users
  • Access restricted to DUA users only
• dbGaP (not HIPAA)
• We host, but user (via DUA) is ultimately responsible for data protection
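The DUA rules above amount to a simple gate in front of the data. The sketch below is purely illustrative: the `User` record, its fields, and the rule that both a signed DUA and completed training are required are assumptions drawn from the bullets, not a description of the actual PGRR access system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical user record; field names are illustrative only."""
    name: str
    signed_dua: bool
    completed_training: bool


def may_access_genomic_data(user):
    """Allow access only to users covered by the DUA who finished training.

    Assumption: both conditions must hold at the same time.
    """
    return user.signed_dua and user.completed_training


if __name__ == "__main__":
    users = [
        User("alice", signed_dua=True, completed_training=True),
        User("bob", signed_dua=True, completed_training=False),
        User("carol", signed_dua=False, completed_training=True),
    ]
    for u in users:
        print(u.name, "allowed" if may_access_genomic_data(u) else "denied")
```

A gate like this covers only the hosting side; as the last bullet says, the user remains responsible for protecting the data once access is granted.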