Infrastructure for Sharing Very Large Data Sets http://www.sam.pitt.edu Antonio M. Ferreira, PhD Executive Director, Center for Simulation and Modeling Research Associate Professor Departments of Chemistry and Computational & Systems Biology University of Pittsburgh


Dec 19, 2015

Page 1: Infrastructure for Sharing Very Large Data Sets  Antonio M. Ferreira, PhD Executive Director, Center for Simulation and Modeling.

Infrastructure for Sharing Very Large Data Sets

http://www.sam.pitt.edu

Antonio M. Ferreira, PhD

Executive Director, Center for Simulation and Modeling

Research Associate Professor Departments of Chemistry and Computational & Systems Biology

University of Pittsburgh

Page 2

PARTS OF THE INFRASTRUCTURE PUZZLE

• Hardware
  • Networking
  • Storage
  • Compute
• Software
  • Beyond scp/rsync
  • Globus
  • gtdownload
• Policies
  • Not all data is “free”
  • Access controls

Page 3

PARTS OF THE INFRASTRUCTURE PUZZLE

• Hardware
  • Networking
  • Storage
  • Compute
• Software
  • Beyond scp/rsync
  • Globus, gtdownload, bbcp, etc.
• Policies
  • Not all data is “free”
  • Access controls
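One reason tools such as Globus and bbcp can outperform a single scp/rsync session is that they can move one file over several parallel streams. The sketch below only illustrates the bookkeeping for that idea: dividing a file into contiguous byte ranges, one per stream. The function name `split_ranges` is hypothetical and is not taken from any of these tools.

```python
def split_ranges(total_bytes: int, streams: int) -> list[tuple[int, int]]:
    """Split [0, total_bytes) into `streams` contiguous half-open byte ranges.

    Hypothetical helper for illustration; real transfer tools manage
    their own chunking and stream scheduling.
    """
    if total_bytes < 0 or streams < 1:
        raise ValueError("total_bytes must be >= 0 and streams >= 1")
    base, extra = divmod(total_bytes, streams)
    ranges = []
    start = 0
    for i in range(streams):
        # The first `extra` ranges take one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


if __name__ == "__main__":
    for begin, end in split_ranges(10, 3):
        print(f"stream covers bytes [{begin}, {end})")
```

Each range could then be handed to its own connection; how much this helps in practice depends on the network path and the storage at both ends.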

Page 4

THE “OLD” MODEL

[Diagram: memory hierarchy. Labels: CPU Core, L1i Cache, L1 Cache, L2 Cache, L3 Cache, Main Memory, Bus, Disk]

Page 5

NETWORK IS THE NEW BUS

[Diagram: memory hierarchy extended with a network. Labels: CPU Core, L1i Cache, L1 Cache, L2 Cache, L3 Cache, Main Memory, Bus, Disk, Network]

Page 6

DATA SOURCES AT PITT

• TCGA
  • Currently 1.1 PB, growing by ~50 TB/mo.
  • Pitt is largest single contributor
• UPMC Hospital System
  • 27 individual hospitals generating clinical and genomic data
  • ~30,000 patients in BRCA alone
• LHC
  • Generates more than 10 PB/year
  • Pitt is a Tier 3 site
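To put the TCGA figures in perspective, the following sketch projects the repository size and the average network rate needed just to keep up with new data. It assumes linear growth, decimal units (1 TB = 10^12 bytes) and a 30-day month; none of these assumptions come from the slides, and the function names are illustrative.

```python
TB = 10**12  # decimal terabyte, assumed
PB = 10**15  # decimal petabyte, assumed


def projected_size_bytes(current_bytes: float, growth_per_month: float, months: int) -> float:
    """Size after `months`, assuming constant (linear) monthly growth."""
    return current_bytes + growth_per_month * months


def sustained_rate_bps(bytes_per_month: float, days_per_month: int = 30) -> float:
    """Average bit rate needed to ingest `bytes_per_month` with no backlog."""
    seconds = days_per_month * 24 * 3600
    return bytes_per_month * 8 / seconds


if __name__ == "__main__":
    size = projected_size_bytes(1.1 * PB, 50 * TB, 12)
    rate = sustained_rate_bps(50 * TB)
    print(f"Projected size after 12 months: {size / PB:.2f} PB")
    print(f"Sustained ingest rate: {rate / 1e6:.0f} Mbit/s")
```

The average ingest rate is modest; the harder problem is moving or re-syncing large portions of the archive at once.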

Page 7

TCGA DATA BREAKDOWN

Cancer | Pitt Contribution | All Univ's Contribution | Pitt's Percentage (%)
Mesothelioma (MESO) | 9 | 37 | 24.32
Prostate adenocarcinoma (PRAD) | 95 | 427 | 22.25
Kidney renal clear cell carcinoma (KIRC) | 107 | 536 | 19.96
Head and Neck squamous cell carcinoma (HNSC) | 74 | 517 | 14.31
Breast Invasive Carcinoma (BRCA) | 149 | 1061 | 14.04
Ovarian serous cystadenocarcinoma (OV) | 63 | 597 | 10.55
Uterine Carcinosarcoma (UCS) | 6 | 57 | 10.53
Thyroid carcinoma (THCA) | 49 | 500 | 9.80
Skin Cutaneous Melanoma (SKCM) | 41 | 431 | 9.51
Bladder Urothelial Carcinoma (BLCA) | 23 | 268 | 8.58
Uterine Corpus Endometrial Carcinoma (UCEC) | 44 | 556 | 7.91
Lung adenocarcinoma (LUAD) | 31 | 500 | 6.20
Pancreatic adenocarcinoma (PAAD) | 7 | 113 | 6.19
Colon adenocarcinoma (COAD) | 21 | 449 | 4.68
Lung squamous cell carcinoma (LUSC) | 21 | 493 | 4.26
Stomach adenocarcinoma (STAD) | 15 | 373 | 4.02
Kidney renal papillary cell carcinoma (KIRP) | 9 | 227 | 3.96
Rectum adenocarcinoma (READ) | 6 | 169 | 3.55
Sarcoma (SARC) | 7 | 199 | 3.52
Pheochromocytoma and Paraganglioma (PCPG) | 4 | 179 | 2.23
Liver hepatocellular carcinoma (LIHC) | 3 | 240 | 1.25
Cervical Squamous cell carcinoma and endocervical adenocarcinoma (CESC) | 3 | 242 | 1.24
Esophageal carcinoma (ESCA) | 2 | 165 | 1.21
Adrenocortical Carcinoma (ACC) | 0 | 92 | 0.00
Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (DLBC) | 0 | 38 | 0.00
Glioblastoma multiforme (GBM) | 0 | 661 | 0.00
Kidney chromophobe (KICH) | 0 | 113 | 0.00
Acute Myeloid Leukemia (LAML) | 0 | 200 | 0.00
Brain Lower Grade Glioma (LGG) | 0 | 516 | 0.00
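The percentage column appears to be the Pitt count divided by the all-university count, rounded to two decimals. A minimal sketch of that arithmetic, using three rows from the table (the function name is illustrative):

```python
def contribution_percentage(site_cases: int, all_cases: int) -> float:
    """Return the site's share of all cases as a percentage, to 2 decimals."""
    if all_cases <= 0:
        raise ValueError("all_cases must be positive")
    return round(100.0 * site_cases / all_cases, 2)


# (Pitt contribution, all universities' contribution), copied from the table
rows = {
    "MESO": (9, 37),
    "BRCA": (149, 1061),
    "GBM": (0, 661),
}

if __name__ == "__main__":
    for code, (pitt, total) in rows.items():
        print(f"{code}: {contribution_percentage(pitt, total):.2f}%")
```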

Page 8
Page 9

HOW QUICKLY DO YOU NEED YOUR DATA?

http://fasterdata.es.net/home/requirements-and-expectations
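As a rough guide to the question on this slide, the sketch below estimates how long a data set takes to move at a given link rate. It assumes decimal units and an ideal, fully used link unless an `efficiency` factor below 1 is supplied; real transfers are usually slower. The 1.1 PB figure is the TCGA size quoted earlier in the talk, and the function names are illustrative.

```python
def transfer_time_seconds(size_bytes: float, link_bps: float, efficiency: float = 1.0) -> float:
    """Idealized transfer time: total bits divided by the usable bit rate."""
    if link_bps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_bps must be > 0 and efficiency in (0, 1]")
    return size_bytes * 8 / (link_bps * efficiency)


def human(seconds: float) -> str:
    """Format a duration as days, hours and minutes (seconds are dropped)."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    return f"{days}d {hours}h {rem // 60}m"


if __name__ == "__main__":
    size = 1100 * 10**12  # 1.1 PB in bytes, decimal units assumed
    for label, bps in [("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10), ("100 Gbit/s", 1e11)]:
        print(f"{label}: {human(transfer_time_seconds(size, bps))}")
```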

Page 10

HOW DO WE LEVERAGE THIS ON CAMPUS?

http://noc.net.internet2.edu/i2network/maps-documentation/maps.html

Page 11

SCIENCE DMZ

http://fasterdata.es.net/science-dmz/science-dmz-architecture/

Page 12

AFTER THE DMZ

• Now that you have a DMZ, what’s next?

• It’s the last mile
  • Relatively easy to bring 100 Gbps to the data center
  • It’s another thing entirely to deliver such speeds to clients (disk, compute, etc.)
• How do we address the challenge?
  • DCE and IB are converging
  • Right now, high bandwidth network to storage is probably the best we can do
  • Select users and laboratories get 10 GE to their systems
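The last-mile point can be stated as a simple rule: end-to-end throughput is bounded by the slowest segment on the path. The sketch below applies that rule to the two rates mentioned on this slide; the segment names are illustrative, and the model ignores protocol overhead and contention.

```python
def bottleneck(segments_gbps: dict[str, float]) -> tuple[str, float]:
    """Return the name and rate of the slowest segment on a path."""
    if not segments_gbps:
        raise ValueError("at least one segment is required")
    name = min(segments_gbps, key=segments_gbps.get)
    return name, segments_gbps[name]


if __name__ == "__main__":
    # Rates from the slide: 100 Gbps to the data center, 10 GE to selected labs.
    path = {"WAN to data center": 100.0, "data center to lab system": 10.0}
    name, rate = bottleneck(path)
    print(f"Limited to {rate:g} Gbit/s by: {name}")
```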

Page 13

CAMPUS 100GE NETWORKING

Page 14

PITT/UPMC NETWORKING

Page 15

BEYOND THE CAMPUS: XSEDE

• The most advanced, powerful, and robust collection of integrated digital resources and services in the world.

• 11 supercomputers, 3 dedicated visualization servers. Over 2 PFLOPS peak computational power.

Single virtual system that scientists can use to interactively share computing resources, data, and expertise …

• Online training for XSEDE and general HPC topics

• Annual XSEDE conference

Learn more at http://www.xsede.org

Page 16

PSC/PITT STORAGE

http://www.psc.edu/index.php/research-programs/advanced-systems/data-exacell

Page 17

SLASH2 ARCHITECTURE

Page 18

AFTER THE DMZ (CONT.)

• Need the right file systems to backend a DMZ
  • Lustre/GPFS
  • How do you pull data from the high-speed network?
  • Where will it land?
• DMZ explicitly avoids certain security restrictions
• Access Controls
  • Genomics/Bioinformatics is growing enormously
  • DMZ is likely not HIPAA-compliant
    • Is it EPHI?
    • Can we let it live with non-EPHI data?

Page 19

CURRENT FILE SYSTEMS

• /home directories are traditional NFS
• SLASH2 filesystem for long-term storage
  • 1 PB of total storage
  • Accessible from both PSC and Pitt compute hardware
• Lustre for “active” data
  • 5 GB/s total throughput
  • 800 MB/s single-stream performance
  • InfiniBand connectivity
    • Important for both compute and I/O
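The two Lustre figures imply that one stream cannot use the full file system bandwidth. The sketch below estimates how many concurrent streams would be needed, assuming ideal linear scaling and decimal units (1 GB/s = 1000 MB/s); both are simplifying assumptions, and the function names are illustrative.

```python
import math


def streams_to_saturate(aggregate_mb_s: float, single_stream_mb_s: float) -> int:
    """Smallest number of streams whose combined rate reaches the aggregate."""
    if aggregate_mb_s <= 0 or single_stream_mb_s <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(aggregate_mb_s / single_stream_mb_s)


def mb_s_to_gbit_s(mb_s: float) -> float:
    """Convert megabytes per second to gigabits per second (decimal units)."""
    return mb_s * 8 / 1000


if __name__ == "__main__":
    aggregate, single = 5000.0, 800.0  # MB/s, from the slide
    print(f"Streams needed: {streams_to_saturate(aggregate, single)}")
    print(f"Aggregate: {mb_s_to_gbit_s(aggregate):g} Gbit/s")
    print(f"Single stream: {mb_s_to_gbit_s(single):g} Gbit/s")
```

Under these assumptions the file system, not a 100 Gbit/s link, would be the limiting factor for a large transfer.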

Page 20

• Computing on Distributed Genomes
  • How do we make this work once we get the data?
  • Need the APIs
• Genomic data from UPMC
  • UPMC has data collection
  • UPMC lacks HPC systems for analysis

Page 21

INSTITUTE FOR PERSONALIZED MEDICINE

• Pitt/UPMC joint venture
  • Drug Discovery Institute
  • Pitt Cancer Institute
  • UPMC Cancer Institute
  • UPMC Enterprise Analytics
• Improve patient care
• Discover novel uses for existing therapeutics
• Develop novel therapeutics
• Enable genomics-based research and treatment

Page 22

WHAT IS PGRR?

What PGRR IS…

1. A common information technology framework for accessing deidentified national big data datasets that are important for Personalized Medicine

2. A portal that allows you to use this data easily with tools and resources provided by the Center for Simulation and Modeling (SaM), Pittsburgh Supercomputing Center (PSC), and UPMC Enterprise Analytics (EA)

3. A managed environment to help you meet the information security and regulatory requirements for using this data

4. A process for helping you stay current about updates and modifications made to these datasets

What PGRR is not…

1. A place to store your individual research results

2. A system to access UPMC clinical data

3. A service for analyzing data on your behalf

Page 23

Pittsburgh Genome Resource Repository (PGRR)

[Architecture diagram. Labels recoverable from the figure:]
• Sites and groups: PSC; Pitt; Pitt (IPM, UPCI); IPM
• Data source: Source (e.g. NCI, CGHub), TCGA
• Components: Portal; Pipeline Codes; GO; Data Exacell Storage (SLASH2); MDS; Metadata; Virtuoso; Database nodes; n0, n1, n2, n3; Frank; Blacklight; Sherlock
• Storage: Panasas 40 TB; Brashear 290 TB; supercell 100 TB; Xyratex 240 TB; Bl1 local 75 TB; Bl2 local 100 TB
• Links: 10 Gbit (throttled to 2 Gbit) Network; Replication; InfiniBand; 1 Gbit (assumed)
• Data flows: BAM and Non-BAM; volumes marked ~8 TB* and ~100 TB*

*Growing to ~1 PB of BAM data and 33 TB of non-BAM data

Page 24

How Do We Protect Data?

• Genomic Data (~424 TB)
  • Deidentified genomic data
  • Patient genomic data from UPMC system
• DUAs (Data Use Agreements)
  • Umbrella document signed by all Pitt/UPMC researchers
  • Required training for all users
  • Access restricted to DUA users only
• dbGaP (not HIPAA)
• We host, but user (via DUA) is ultimately responsible for data protection
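The access rule described above (a signed DUA plus required training) can be expressed as a simple check. This is only a sketch of the policy logic under that reading of the slide; the `Researcher` type, its fields, and the function name are hypothetical and do not describe the actual PGRR implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Researcher:
    """Hypothetical record of one user's compliance status."""
    name: str
    signed_dua: bool
    completed_training: bool


def may_access_genomic_data(user: Researcher) -> bool:
    """Grant access only to users covered by the DUA who finished training."""
    return user.signed_dua and user.completed_training


if __name__ == "__main__":
    users = [
        Researcher("user_a", signed_dua=True, completed_training=True),
        Researcher("user_b", signed_dua=True, completed_training=False),
    ]
    for user in users:
        status = "granted" if may_access_genomic_data(user) else "denied"
        print(f"{user.name}: access {status}")
```

A check like this only gates entry; as the slide notes, responsibility for protecting the data afterwards stays with the user under the DUA.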

Page 25

TCGA ACCESS RULES

Page 26

CONTROLLING ACCESS

Page 27

PGRR DATA NOTIFICATIONS

Page 28

ACKNOWLEDGEMENTS

• Albert DeFusco (Pitt/SaM)

• Brian Stengel (Pitt/CSSD)

• Rebecca Jacobson (Pitt/DBMI)

• Adrian Lee (Pitt/Cancer Institute)

• J. Ray Scott (PSC)

• Jared Yanovich (PSC)

• Phil Blood (PSC)

Page 29

CENTER FOR SIMULATION AND MODELING

Center for Simulation and Modeling (SaM)

326 Eberly, (412) 648-3094, http://www.sam.pitt.edu

• Co-directors: Ken Jordan & Karl Johnson

• Associate Director: Michael Barmada

• Executive Director: Antonio Ferreira

• Administrative Coordinator: Wendy Janocha

• Consultants: Albert DeFusco, Esteban Meneses, Patrick Pisciuneri, Kim Wong

Network Operations Center (NOC)

• RIDC Park

• Lou Passarello

• Jeff Raymond, Jeff White

Swanson School of Engineering (SSoE)

• Jeremy Dennis