Page 1
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Igor Bogicevic, CTO
Security and Compliance
at the Petabyte Scale
Lessons from the National Cancer Institute’s
Cancer Genomics Cloud Pilot
Angel Pizarro, AWS Scientific Computing
October 2015
Page 2
What to expect from this session
• Background: Unique challenges for securing genomics
information
• Case study: Democratizing access to The Cancer
Genome Atlas (TCGA) through the Seven Bridges
Cancer Genomics Cloud
• Deep dives: How we’ve leveraged AWS to support
secure and compliant genomics research
Page 3
Why is securing genomics
information hard?
Page 4
i) Genomics data is big…and getting bigger
NGS: Next Generation Sequencing
NGS sequencers include machines from Illumina, Life Technologies, and Pacific Biosciences. Human genome data based on estimates of whole human genomes sequenced
Sources: Financial reports of Illumina, Life Technologies, Pacific Biosciences; revenue guidances; JP Morgan; The Economist; Seven Bridges Analysis.
Between 2014–2018, production of new NGS data to exceed 2 exabytes
[Figure: number of sequencers and genomic data (Tb)]
Page 5
ii) Genomes are inherently sensitive
Very personal (including your relatives…)
Can’t fully anonymize information
Can’t take it back once it’s out there
Page 6
iii) Research is highly collaborative and
diverse
It occurs in large teams... ...with numerous analytical tools
Page 7
The Challenge
Enable thousands of researchers
using hundreds of (custom) tools
to analyze petabytes of highly sensitive data
in a secure and compliant environment
Page 8
Case study:
Bringing the Cancer Genome
Atlas (TCGA) to the Cloud
This project has been funded in whole or in part with Federal funds from the
National Cancer Institute, National Institutes of Health, Department of Health
and Human Services, under Contract No. HHSN261201400008C.
Page 9
TCGA is one of the richest and most complete
genomics data sets in the world
34 tumor types
from thousands
of patients…
…analyzed across
multiple
dimensions…
…by researchers
across the US…
…at a cost of
$375 million.
1.5+ petabytes, growing to 3.5 petabytes in the next year
Page 10
But learning from this data is challenging
Page 11
The Cancer Genomics Cloud Pilots seek to
directly address these difficulties
• Initiated by Dr. Harold Varmus in 2013
• BAA issued in January 2014
• 3 pilots awarded September 2014
o Broad Institute
o Institute for Systems Biology
o Seven Bridges Genomics
Early access: November 2015
Open release: January 2016
www.CancerGenomicsCloud.org
Page 12
Our approach to democratizing
access to TCGA data
Page 13
The components of democratized access –
Data
● Immediately and securely access
petabytes of open-access and
controlled-access cancer genomics
data.
● Analyze data from your private
cohorts alongside public data.
● Data access governed by the NIH
Genomic Data Sharing Policy.
● As an NIH trusted partner, Seven
Bridges is able to authorize approved
researchers.
● First controlled access genomic
dataset on AWS.
● Coming soon:
http://aws.amazon.com/public-data-sets/tcga/
Page 14
The components of democratized access –
Reproducibility
[Figure labels: versions 1.1.2, 2.0a, 2.3Lite]
● Execute workflows from primary
analysis through visualization.
● Each result is always associated with
a complete snapshot of the tool
versions, parameters, and input files.
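One way to picture the snapshot idea is a deterministic fingerprint computed over the tool versions, parameters, and input checksums of a run. The sketch below is illustrative only and is not the Seven Bridges implementation; the function name `snapshot_id`, the record layout, and the tool and file names are assumptions made for the example.

```python
import hashlib
import json


def snapshot_id(tool_versions, parameters, input_checksums):
    """Return a deterministic SHA-256 fingerprint of one analysis run.

    All three arguments are plain dicts; the layout is hypothetical.
    """
    record = {
        "tools": tool_versions,
        "parameters": parameters,
        "inputs": input_checksums,
    }
    # sort_keys makes the serialization independent of dict insertion order,
    # so the same run description always hashes to the same value.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical tool names, parameter, and checksum placeholder.
    print(snapshot_id(
        {"aligner": "1.1.2", "caller": "2.0a"},
        {"min_quality": 20},
        {"sample.bam": "sha256:placeholder"},
    ))
```

Changing any tool version, parameter, or input checksum changes the fingerprint, which is what lets a result be tied to exactly one configuration.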
Page 15
The components of democratized access –
Open standards
● Native execution of Docker-based Common
Workflow Language (CWL) pipelines allows
portability and sharing of custom tools.
● APIs support workflow automation and
enhance interoperability.
Page 16
...implemented through our genomics platform
Page 17
How we’ve leveraged AWS to
support secure and compliant
genomics research
Page 18
Security and compliance―connected, but separate.
Page 19
Security
• Network and data security overview
• Parallel file access at scale
• Enabling secure computation using researcher-
contributed tools
• Enabling secure user access and collaboration
Page 20
Simplified system architecture
[Figure: simplified system architecture on AWS. Labels: encrypted Amazon S3 buckets; virtual private cloud (development environment) and virtual private cloud (production environment), each with dynamic worker instances and an infrastructure server; Seven Bridges website; Seven Bridges offices connected by IPSEC VPN to AWS IPSEC; remote workforce connected through an Open VPN gateway; user accessing the platform and downloading data. Legend: data flow, secure access point.]
Page 21
Securing the network
• Extensive use of virtual private clouds (VPCs)
• Separate dev and production environments
● Built-in IPSEC allows easy
network integration
• Open VPN to secure remote
user access
● Each instance and VPC is
individually firewalled
Page 22
Securing data
• At-rest encryption
• Amazon S3 SSE, SSE-KMS
• Amazon EBS encryption
• Ephemeral storage
• In transit
• Protecting data in transit: TLS exclusively on S3
● From other users
• AWS IAM to access other users’ buckets
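The "TLS exclusively on S3" point can be enforced at the bucket level with a policy that denies any request not made over TLS. The sketch below builds such a policy as a Python dict using the `aws:SecureTransport` condition key; the function name, the `Sid`, and the bucket name are assumptions for illustration, and the deck does not say this is the exact policy Seven Bridges uses.

```python
import json


def deny_insecure_transport(bucket):
    """Build a bucket policy that rejects any request not sent over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",  # hypothetical identifier
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # aws:SecureTransport is "false" for plain HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(deny_insecure_transport("examplebucket"), indent=2))
```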
Page 23
Controls to support secure data
• Atomic data access
• Data locality
• Dedicated tenancy on
computation instances
• Using only encrypted storage
• Strict data purging
Amazon S3 Amazon EBS Amazon EC2
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "112",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::examplebucket/*",
      "Condition": {
        "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
      }
    }
  ]
}
dm-crypt
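The bucket policy on this slide denies any `s3:PutObject` request whose `x-amz-server-side-encryption` header is not `AES256`, which also covers uploads that omit the header. A minimal sketch of generating the same policy for an arbitrary bucket follows; the helper name `deny_unencrypted_puts` and its parameters are assumptions for illustration. Passing `"aws:kms"` would require SSE-KMS instead, matching the SSE-KMS option listed under at-rest encryption.

```python
import json

# Header values S3 accepts for server-side encryption in this sketch.
ALLOWED_ALGORITHMS = {"AES256", "aws:kms"}


def deny_unencrypted_puts(bucket, algorithm="AES256", sid="112"):
    """Build a policy that rejects uploads not using the given SSE mode."""
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": sid,
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": algorithm
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(deny_unencrypted_puts("examplebucket"), indent=2))
```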
Page 24
Security
• Network and data security overview
• Parallel file access at scale
• Enabling secure computation using researcher-
contributed tools
• Enabling secure user access and collaboration
Page 25
Parallel file access at scale
The Challenge:
Many bioinformatics tasks require sharing of
intermediary results between multiple instances.
Page 26
Parallel file access at scale – NFS
Observed network
saturation at ~8 NFS clients.
Page 27
Hypothesis
• Amazon S3 would remove single NFS server bandwidth
bottleneck.
• Presenting user’s S3 objects as a local filesystem could provide
an elegant abstraction that any application could use.
• Cumulative S3 read/write speed should scale mostly linearly
with number of workers.
• Total read/write speed on shared S3 objects should significantly
exceed NFS server solution speed on >10 workers.
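The hypothesis can be stated as a toy model: aggregate NFS throughput is capped by the single server's bandwidth, while aggregate S3 throughput grows with the number of workers. The numbers below are illustrative assumptions, not measurements from this deck; they are chosen only so that the NFS cap is reached at 8 clients, in line with the saturation point observed above.

```python
def nfs_throughput(workers, per_client_mb_s, server_cap_mb_s):
    """Aggregate MB/s when every client shares one NFS server link."""
    return min(workers * per_client_mb_s, server_cap_mb_s)


def s3_throughput(workers, per_worker_mb_s, efficiency=1.0):
    """Aggregate MB/s when each worker reads from S3 independently.

    efficiency < 1.0 models the "mostly linear" part of the hypothesis.
    """
    return workers * per_worker_mb_s * efficiency


if __name__ == "__main__":
    # Hypothetical figures: 100 MB/s per client, 800 MB/s server cap.
    for n in (4, 8, 16, 64):
        print(n, nfs_throughput(n, 100, 800), s3_throughput(n, 100, 0.9))
```

Under these assumptions the two curves match up to the saturation point and diverge beyond it, which is the behavior the hypothesis predicts for more than 10 workers.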
Page 28
Parallel access at scale – SBG-FS/Amazon S3
Amazon S3
Page 29
SBG-FS single worker performance
[Figure: throughput (MB/s) versus number of compute instances (50 to 300) for three series: 1st read (SBG-FS Prefetch), write (SBG-FS Upload), and 2nd read (SBG-FS Cache). Values labeled on the chart: 90, 215, 894.]
Page 30
SBG-FS cumulative worker performance
[Figure: cumulative throughput (GB/s) versus number of compute instances (50 to 300) for three series: 1st read (SBG-FS Prefetch), write (SBG-FS Upload), and 2nd read (SBG-FS Cache).]
Page 31
SBG-FS auditing capabilities
Amazon S3
Page 32
Security
• Network and data security overview
• Parallel file access at scale
• Enabling secure computation using researcher-
contributed tools
• Enabling secure user access and collaboration
Page 33
Enabling secure computation using
researcher-contributed tools
The Challenge:
10,000+ bioinformatics tools
50+ tools used in a single TCGA marker paper
Our Approach:
Common Workflow Language (CWL) wrapper
Seven Bridges Platform
Page 34
Benefits of using Docker to deploy user-
contributed tools
• Enables solid resource
isolation at the container
level
• Simplifies deploying and
managing tools at scale
Page 35
Security risks posed by use of Docker
• Docker daemon runs under
root privileges
• User can intentionally or
unintentionally add malicious
apps
• If resource management is not
set properly, apps could do
damage outside their container
Page 36
Enabling secure use of Docker containers
● Know your private vs. public
resources
● Isolate network resources for
each container (firewalling)
• Be careful with linking
containers
• Aggregate logs (forensics)
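Several of these precautions map onto standard `docker run` options. The sketch below only assembles an argument list and does not start anything; the function name, default limits, and user ID are assumptions for illustration, and `--network none` is a stricter stand-in for the per-container firewalling described above, not a description of how the Seven Bridges Platform is configured.

```python
def hardened_run_args(image, command, memory="2g", cpus="1.0"):
    """Return a `docker run` argument list with restrictive defaults."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access at all
        "--cap-drop", "ALL",                    # drop every Linux capability
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--read-only",                          # read-only root filesystem
        "--user", "1000:1000",                  # hypothetical non-root UID:GID
        "--memory", memory,                     # hard memory limit
        "--cpus", cpus,                         # CPU quota
        "--pids-limit", "256",                  # cap on process count
        "--log-driver", "syslog",               # ship logs off the container
        image, *command,
    ]


if __name__ == "__main__":
    # Hypothetical image name and tool invocation.
    print(" ".join(hardened_run_args("example/tool:1.0", ["tool", "--help"])))
```

The list form can be passed to `subprocess.run` without a shell, which avoids quoting issues when tool arguments come from users.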
Page 37
Security
• Network and data security overview
• Parallel file access at scale
• Enabling secure computation using researcher-
contributed tools
• Enabling secure user access and collaboration
Page 38
Enabling secure access
● Organizations have diverse
models of internal structure
and responsibilities
• Roles and authentication
models are very diverse
• Federated authentication
and SSO
Page 39
Supporting federated login for controlled data
access
[Figure: federated login flow. Labels: SAML; Verify; Approved Researchers; cron x 24hr; Metadata service; ELK stack; Error Message.]
Page 40
Enabling collaboration
• SBG Platform provides isolation
of resources at project level
• Users can share projects and
control access through roles
• Basic role provides read access
only; write/copy privileges are
separate from execution
One billing group per project
Multiple users and roles per project
Users participate in projects and can provide funding
Project-specific user roles
Multiple users per project
Clear funding/payment responsibility
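The project-level model described on this slide, with read access as the baseline and write/copy kept separate from execution, can be sketched with a small permission-flag structure. Role names, class names, and the billing group label below are hypothetical; the deck does not list the platform's actual roles.

```python
from enum import Flag, auto


class Permission(Flag):
    READ = auto()
    COPY = auto()
    WRITE = auto()
    EXECUTE = auto()


# Hypothetical role names; write/copy and execute are granted independently.
ROLES = {
    "viewer": Permission.READ,
    "contributor": Permission.READ | Permission.COPY | Permission.WRITE,
    "analyst": Permission.READ | Permission.EXECUTE,
}


class Project:
    """One project, one billing group, many members with per-project roles."""

    def __init__(self, name, billing_group):
        self.name = name
        self.billing_group = billing_group
        self.members = {}  # user name -> role name

    def add_member(self, user, role):
        if role not in ROLES:
            raise ValueError(f"unknown role: {role!r}")
        self.members[user] = role

    def can(self, user, permission):
        role = self.members.get(user)
        # Users outside the project have no access: projects are isolated.
        return role is not None and permission in ROLES[role]


if __name__ == "__main__":
    project = Project("tcga-study", billing_group="lab-grant")
    project.add_member("alice", "viewer")
    print(project.can("alice", Permission.READ))
    print(project.can("alice", Permission.EXECUTE))
```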
Page 41
Overall system security is enabled by
monitoring and testing
• Penetration testing
• Patch management
• Software and infrastructure vulnerability assessments
• Monitoring of platform performance and availability
• Pandora FMS/OSSEC/Sysdig
• Auditing and logs at a project and platform level
• Logs aggregated and available for inspection with ELK
stack
Page 42
Putting it all together
1. User logs on to the platform
2. Platform creates a unique signed URL
for the user
3. Using signed URL, data is uploaded to
an encrypted Amazon S3 bucket
4. After the user starts a computation, the
Seven Bridges Platform calculates the
optimal execution plan and starts
dedicated task worker instances
5. Worker instances securely pull data
from Amazon S3
6. Worker instances are able to securely
share intermediate data
7. Final results are uploaded to
Amazon S3
[Figure: data flow for steps 1 to 7 among the user, the Seven Bridges Platform, the encrypted Amazon S3 bucket, and the Amazon EC2 instances in the computation environment, with data sharing between instances at step 6.]
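Steps 2 and 3 rely on a URL that is unique to the user, signed, and time-limited. The sketch below shows the general idea with an HMAC over the URL, user, and expiry time. It is a generic illustration and is not the Amazon S3 presigned URL format or the platform's actual scheme; the function names, query parameter names, and example URL are all assumptions.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse


def _signature(secret, base_url, user, expires):
    message = f"{base_url}|{user}|{expires}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(base_url, user, secret, ttl_seconds, now=None):
    """Return base_url with user, expiry, and HMAC signature appended."""
    now = int(time.time()) if now is None else int(now)
    expires = now + ttl_seconds
    query = urlencode({
        "user": user,
        "expires": expires,
        "signature": _signature(secret, base_url, user, expires),
    })
    return f"{base_url}?{query}"


def verify_url(url, secret, now=None):
    """Accept the URL only if it is unmodified and not yet expired."""
    now = int(time.time()) if now is None else int(now)
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        user = params["user"][0]
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expected = _signature(secret, base_url, user, expires)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature) and now < expires


if __name__ == "__main__":
    url = sign_url("https://uploads.example.org/bucket/key", "alice",
                   b"server-side-secret", ttl_seconds=60)
    print(url, verify_url(url, b"server-side-secret"))
```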
Page 43
Lessons learned from petabyte-scale security
• Isolate resources as much as possible
• Encrypt everything―it will make your life easier
• Understand the scale of the data
• Measure everything
• Leverage the infrastructure
Page 45
When we talk about compliance, we talk about
Building trust Shared language
Page 46
Compliance frameworks
dbGaP: Protects against risk associated with release of genomes of individuals consenting to participate in research studies.
HIPAA: Protects against risk associated with release of Protected Health Information (PHI).
ISO 27001: Provides a framework for general security management of assets across the organization and is a general specification for an information security management system (ISMS).
Page 47
Shared responsibility == compliance coordination
Stacked responsibility
AWS: Facilities; Infrastructure; Virtualization; API and Service Endpoints
Seven Bridges Genomics: Data Security; Data Provenance; Application Monitoring; OS, Network, etc.
Researcher: Users | Groups | Projects | Applications
Auditor
Page 48
Shared responsibility across frameworks
dbGaP
HIPAA
ISO 27001
Researcher | AWS | Seven Bridges
Page 51
Securely integrating with platforms
Page 52
Security and compliance in practice
Stacked responsibility
Data Security
Data Provenance
Application Monitoring
OS, Network, etc.
Users | Groups | Projects | Applications
Facilities
Infrastructure
Virtualization
API and Service Endpoints
Horizontal responsibility
Seven Bridges Genomics | Researcher | Amazon Web Services
Page 53
Use case: Analyze Personal Genome Project data
http://personalgenomes.org
VPC subnet
Dedicated instance
1000 Genomes
Page 54
Strategies to follow
• Rely on the platform as much as possible
• Follow security best practices outlined in the AWS
documentation
• Have a checklist!
Page 55
Compliance checklist
AWS security
VPC, security groups, encrypted storage
Protect AWS credentials
Protect platform credentials
SOPs for OS and application updates
Audit and logging of the activities outside of platform
Data provenance and lifecycle
Page 56
AWS architecture
IAM instance role
VPC subnet
Security group
Virtual private cloud
• Access platforms via
Internet or VPC peering
• DevOps for instance and
application management
• Protect credentials with
AWS IAM and AWS KMS
Page 57
Secure bootstrapping with instance UserData
Page 58
AWS Command Line Interface
Page 59
Secure and format local storage
Page 60
Compliance checklist
AWS security
VPC, security groups, encrypted storage
Protect AWS credentials
Protect platform credentials
SOPs for OS and application updates
❑ Audit and logging of the activities outside of platform
❑ Data provenance and lifecycle