Amazon Web Services for Bioinformatics June 2, 2015.



Overview

• Cloud Service Providers
• Amazon Web Services Offerings
• Hands-on
  – Setting up an AWS account
  – Initiating a Cloud Server for Galaxy
  – Running Analysis on Galaxy
• Break
• Cloud Use Case: 1000 Genomes Project
  – Accessing and analyzing 1000 Genomes data on AWS
  – Terminate AWS cluster
• AWS usage costs and terminating services
• Break
• Cloud Use Case: Million Veterans Program


Introductions and Workshop Considerations

• Introduction
  • What’s your name?
  • Where are you from?
  • What do you do?
  • Tell us something interesting about yourself!

• Workshop Considerations
  • Content only requires basic computing skills, so don’t get discouraged if you don’t understand everything
  • Follow along with your computer
  • Help thy neighbor
  • Ask questions
  • Engage and enjoy


Cloud Service Providers (CSP)

• Amazon Web Services (AWS)

• Verizon Terremark

• Microsoft Azure

• Google

• IBM

• HP

• Apple

• CenturyLink


Amazon Web Services (AWS) Offerings

• EC2 – Elastic Compute Cloud (virtual servers)

• S3 – Simple Storage Service (object storage)

• EMR – Elastic MapReduce

• IAM – Identity and Access Management

• RDS – Relational Database Service

• Glacier – Archival Storage

• AWS Zones – Data transfer between zones incurs a fee

• Free Usage Tier


Getting Started: Setting up an AWS Account

This document contains Booz Allen Hamilton Inc. proprietary and confidential business information.

• Access Amazon Web Services

https://352634094794.signin.aws.amazon.com/console

• Logging in

User Name: user[number] (e.g. user1, user2, user37)

Password: hpcc

[Note: “I have an MFA token” should be left unchecked]


Getting Started: Setup an EC2 Instance


• What’s an AMI? Amazon Machine Image
• Two ways to launch an EC2 server instance
  o AWS console
  o AMI
    » Amazon Marketplace
    » Public URL

• Launch through public Galaxy AMI: https://usegalaxy.org/cloudlaunch

• Locate Key ID and Secret Key
  – AWS Console > Identity and Access Management > Rotate your access keys
  – Click “Manage your access keys”
  – Scroll down and click “Manage access keys”
  – Click “Create access key”
  – Click “Show user security credentials”
  – Click “Download Credentials”
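Once the Key ID and Secret Key are downloaded, they are usually kept out of scripts and slides. A minimal sketch, assuming the keys have been exported as the standard AWS environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; the helper names `load_aws_credentials` and `mask` are hypothetical, and the key values shown are placeholders, not real credentials:

```python
import os


def load_aws_credentials(env=None):
    """Return (key_id, secret_key) read from the standard AWS environment variables."""
    env = os.environ if env is None else env
    key_id = env.get("AWS_ACCESS_KEY_ID", "").strip()
    secret = env.get("AWS_SECRET_ACCESS_KEY", "").strip()
    if not key_id or not secret:
        raise ValueError("Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first")
    return key_id, secret


def mask(secret, visible=4):
    """Hide all but the last few characters so a key can be shown on screen."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Placeholder values for illustration only (assumption: not real keys).
placeholder_env = {
    "AWS_ACCESS_KEY_ID": "AKIAIOSFODNN7EXAMPLE",
    "AWS_SECRET_ACCESS_KEY": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
}
key_id, secret = load_aws_credentials(placeholder_env)
print(key_id, mask(secret))
```

Calling `load_aws_credentials()` with no argument reads the real environment instead of the placeholder dictionary. The values it returns are the ones to paste into the “Enter Key ID” and “Enter Secret Key” fields.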


Getting Started: Launch EC2 Instance


1. Bring back the “Launch a Galaxy Cloud Instance” screen
2. Copy and paste the Key ID into the “Enter Key ID” field
3. Copy and paste the Secret Key into the “Enter Secret Key” field
4. Choose a name for your Galaxy server
5. Choose a simple password (e.g. hpcc)
6. Key Pair: “Create New”
7. Instance Type: “Compute optimized Large (2 vCPU/4GB RAM)”
8. Click Submit
   [Takes a few minutes to launch an instance – check the console]
9. Click the instance URL to access the CloudMan interface
10. Username “admin”, password “hpcc”, select “Transient Storage”
   [Takes a few minutes to launch Galaxy]
11. Click “Access Galaxy”
   [Galaxy can also be accessed by typing the URL from the console]


The 1000 Genomes Project

• Goal is to study genetic variants with at least 1% frequency in populations

• Phase I started in 2010 with 4 populations and 1000 genomes
• Phase II and III completed in 2013 with 2500 genomes from 25 populations


1000 Genomes Project Data, Analysis, and Results

• Data is stored by EBI, NCBI, and AWS

• 2500 whole genomes sequenced at 28x

• Genome Wide Association Studies

• Focus on common and rare genetic conditions, population genetics, evolution and ancestry


Create an S3 Bucket and Add Data

Create S3 bucket
• Return to the AWS console and click “S3”
• Create new S3 data bucket – Name: “user[x]data”
  [Note: bucket name should be unique, lowercase, and alphanumeric]
• Create new folder in your bucket – Name: “user[x]folder”

Find 1000 Genomes Data
• Go to 1000 Genomes Data Browser: http://browser.1000genomes.org/tools.html
• Select “Data Slicer > Online Version”
• Select genome location on Chr 7: “7:50000-100000”
• Select VCF Filters “By Population”
• Select CLM and download file to your local computer

Upload to S3 bucket
• Upload the file to your S3 bucket – Rename it to: “CLM.vcf.gz”
• Change permissions of your file to “everyone”
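The naming note on this slide can be checked before creating the bucket. A minimal sketch, assuming the workshop rule (lowercase letters and digits only) plus S3's 3 to 63 character length limit; the rule is stricter than what S3 itself accepts, uniqueness can only be verified by S3, and the function names are hypothetical:

```python
import re

# Workshop rule from the slide: lowercase letters and digits only.
_WORKSHOP_NAME = re.compile(r"[a-z0-9]+")


def is_valid_bucket_name(name):
    """Check a bucket name against the workshop rule and S3's length limit."""
    return 3 <= len(name) <= 63 and _WORKSHOP_NAME.fullmatch(name) is not None


def public_url(bucket, key):
    """Build the public URL of an object once its permissions allow everyone to read it."""
    return f"http://{bucket}.s3.amazonaws.com/{key}"


print(is_valid_bucket_name("user7data"))
print(public_url("user7data", "user7folder/CLM.vcf.gz"))
```

Here `user7data` and `user7folder` stand in for “user[x]data” and “user[x]folder” with x = 7.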


Command Line Access to EC2 Server and S3 Bucket

Command line access to your server
• Windows – Download “PuTTY” or any other SSH client
• Mac – Open “Terminal”
• Go to CloudMan console and copy server address for command line access
  ssh -i cloudman_key_pair.pem [email protected]

Access your S3 Data Bucket
• Access your S3 bucket
  wget http://user[x]data.S3.amazonaws.com/user[x]folder/CLM.vcf.gz
• Unzip and view your VCF file
  gunzip CLM.vcf.gz
  head CLM.vcf

Access 1000 Genomes Data [Public Bucket on S3]
• Download 1000 Genomes XML file
  wget http://S3.amazonaws.com/1000genomes
• Download populations file
  wget http://1000genomes.S3.amazonaws.com/20131219.populations.tsv
• View 1000 Genomes populations
  head 20131219.populations.tsv
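After `head` shows the first lines, the downloaded VCF can also be read programmatically. A minimal sketch, assuming a tab-delimited VCF whose first two columns are chromosome and position, and a region string in the Data Slicer form “7:50000-100000”; the function names are hypothetical, and this is not a full VCF parser:

```python
import gzip


def parse_region(region):
    """Split a region such as '7:50000-100000' into (chromosome, start, end)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def records_in_region(path, region):
    """Yield the fields of each VCF record whose position falls inside the region."""
    chrom, start, end = parse_region(region)
    # Read the file directly whether or not it has been gunzipped.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and the '##' / '#CHROM' header lines.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if fields[0] == chrom and start <= int(fields[1]) <= end:
                yield fields
```

For example, `sum(1 for _ in records_in_region("CLM.vcf.gz", "7:50000-100000"))` counts the variants in the slice. The chromosome is matched as plain text, so a file that labels it “chr7” would need the region written the same way.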


AWS Usage Costs and terminating services

• Usage costs are calculated and billed monthly
• Usage is determined by the hour during which an instance starts
• E.g. EC2 instance running from 2:55 PM - 4:05 PM will be billed for 3 hours
• Be sure to stop or terminate instances when not in use
  • EC2
    • Server Instance
    • Storage Volume
  • S3
• Terminating our instance
  • Go to CloudMan webpage and click “Terminate Cluster”
  • Terminate EC2 storage volumes
  • Delete S3 buckets and folders
  • Check console to ensure all services have been stopped
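The billing example on this slide can be reproduced with a few lines of arithmetic. A minimal sketch of the rule as the slide states it, where every clock hour an instance touches counts as a full hour; this is an illustration of the slide's example, not AWS's authoritative billing logic, and the function name is hypothetical:

```python
from datetime import datetime, timedelta


def billed_clock_hours(start, end):
    """Count the clock hours an instance touches between start and end."""
    if end <= start:
        return 0
    first_hour = start.replace(minute=0, second=0, microsecond=0)
    # Step back one microsecond so an instance stopped exactly on the hour
    # is not charged for the hour that begins at that instant.
    last_instant = end - timedelta(microseconds=1)
    last_hour = last_instant.replace(minute=0, second=0, microsecond=0)
    return (last_hour - first_hour) // timedelta(hours=1) + 1


# The slide's example: 2:55 PM to 4:05 PM touches the 2 PM, 3 PM, and 4 PM hours.
hours = billed_clock_hours(datetime(2015, 6, 2, 14, 55), datetime(2015, 6, 2, 16, 5))
print(hours)  # 3
```

Multiplying the result by the hourly rate of the chosen instance type gives a rough cost estimate. Check the current AWS pricing pages for actual rates and billing granularity.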


The Million Veterans Program (MVP)

• National voluntary research program funded by the Department of Veterans Affairs Office of Research & Development

• Goal is to study how genes and environmental factors affect veterans’ health

• Building one of the world's largest medical databases containing biological samples and health information from one million veterans
  • Blood samples for genomic profiling
    – Single Nucleotide Polymorphism (SNP) Array Analysis
    – Next Generation Sequencing (NGS) Analysis
  • Personal health surveys and military deployment history
  • Electronic health records

• Genomic Informatics for Integrative Science (GenISIS) comprises hardware, platform, and tools to manage, store, and analyze MVP data

• Current recruitment has passed 400K samples with a goal of 1 Million samples in 5 years

• Total Data Volume expected to exceed 10 Petabytes in 5 years



Overview



MVP Data Warehouse

• Metadata extracted from vendor-generated genomic data (SNP array genotyping, whole genome sequencing, and whole exome sequencing) will be cataloged in a Metadata Database

• Genomic data will be linked with corresponding de-identified clinical and survey data by an Honest Broker system

• Terminology and Annotation Server will allow researchers to incorporate a wide array of genomic and clinical annotations to integrate genomic, survey, and clinical data

• Query Mart will enable researchers to build cohorts and subset data using clinical and genomic information and export to the Data Mart for further analysis



Cloud Broker


• Cloud Portal manages access control for different types of data and users

• Cloud Engine co-locates data with analytical tools

• Intelligent Orchestration Tool maps data and processes to storage and compute clusters to efficiently manage resources

• Geographically distributed computational resources pooled through a virtual private cloud


Data Lake – Key Value Data Store


[Figure: example key-value pairs in the data lake, organized into access-control tiers (Tier 1, Tier 2, Tier 3)]

Key | Value
SNP rs4362914 | Gene TCF7L2
Sample SHIP000675221 | Patient PT-00589A
Patient PT-00589A | Condition Diabetes Type II
SNP rs4362914 | Genome Loc Chr7:4344859978
Sample SHIP000675221 | SNP rs4362914
SNP rs4362914 | Condition Diabetes Type II
Survey S-2014-06-18-A3288 | Deployment Vietnam War
Genome Loc Chr7:4344859978 | Genotype T
Sample SHIP000675221 | Survey S-2014-06-18-A3288
Gene TCF7L2 | Condition Diabetes Type II
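The key-value pairs and tiers on this slide suggest how a lookup might be gated by access level. A minimal in-memory sketch, assuming a reader with clearance n may follow pairs stored in tiers 1 through n; the slide does not say which pair belongs to which tier, so the tier assignments below, the class, and the function names are all hypothetical:

```python
from collections import defaultdict, deque


class KeyValueStore:
    """Toy key-value store in which every pair carries an access tier."""

    def __init__(self):
        self._pairs = defaultdict(list)  # key -> list of (value, tier)

    def put(self, key, value, tier):
        self._pairs[key].append((value, tier))

    def get(self, key, clearance):
        """Return the values under key that the given clearance may read."""
        return [value for value, tier in self._pairs.get(key, []) if tier <= clearance]


def reachable(store, start, clearance):
    """Follow pairs breadth-first and collect everything linked to start."""
    seen, queue = set(), deque([start])
    while queue:
        key = queue.popleft()
        for value in store.get(key, clearance):
            if value not in seen:
                seen.add(value)
                queue.append(value)
    return seen


def build_example_store():
    """Load a few pairs from the slide; the tier numbers are assumptions."""
    store = KeyValueStore()
    store.put("SNP:rs4362914", "Gene:TCF7L2", tier=1)
    store.put("Gene:TCF7L2", "Condition:Diabetes Type II", tier=1)
    store.put("Sample:SHIP000675221", "SNP:rs4362914", tier=2)
    store.put("Sample:SHIP000675221", "Patient:PT-00589A", tier=3)
    store.put("Patient:PT-00589A", "Condition:Diabetes Type II", tier=3)
    return store


store = build_example_store()
print(sorted(reachable(store, "Sample:SHIP000675221", clearance=2)))
```

With clearance 2 the walk from the sample reaches the SNP, gene, and condition but never the patient identifier, which only appears at clearance 3.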


Challenges and Lessons Learned


• Petabyte-scale genomics data poses storage, transfer, and processing challenges
  • Cloud computing offers optimal solutions for data storage and analytics
  • Next generation algorithms with built-in scalability features (e.g. Apache Hadoop/MapReduce)
  • Co-locating data and analytical tools to reduce data replication and transfer bottlenecks

• Genomic data is PHI and should be protected using Data-in-Motion and Data-at-Rest best practices
  • Encryption and decryption of genomic datasets constitute a significant fraction of data transfer and analysis time – YMMV
  • Efficient architectural design of storage and processing systems diminishes security risks and encryption/decryption bottlenecks

• Data integration and metadata annotation are critical in deriving knowledge from data
  • Lack of unified standard formats in genomics necessitates substantial effort in highly specialized analytical pipelines
  • Data integration can be powered by annotation using multiple ontologies
  • Data annotation upon ingest is crucial in a rapidly changing genomic sequencing landscape


Questions
