Talk at NCRR P41 Director's Meeting


DESCRIPTION

Invited Talk given at the NCRR P41 Director's meeting on October 12, 2010

Transcript

Amazon Web Services: A platform for life science research

Deepak Singh, Ph.D., Amazon Web Services

NCRR P41 PI meeting, October 2010

the new reality

lots and lots and lots and lots and lots of data

lots and lots and lots and lots and lots of people

lots and lots and lots and lots and lots of places

constant change

science in a new reality


goal

optimize the most valuable resource

compute, storage, workflows, memory,

transmission, algorithms, cost, …

enter the cloud

what is the cloud?

infrastructure

scalable

3000 CPUs for one firm’s risk management application

[Chart: “Number of EC2 Instances”, Thursday 4/23/2009 through Wednesday 4/29/2009: 300 CPUs on weekends, scaling up to 3000 during the week]

highly available

US East Region

Availability Zone A

Availability Zone B

Availability Zone C

Availability Zone D
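One way to use multiple zones: spread a fleet across them so the loss of any single zone removes only a fraction of capacity. A minimal sketch in Python (the zone names follow the slide; the round-robin placement logic is illustrative, not an AWS API):

```python
from itertools import cycle

# Sketch: spread N instances across the region's Availability Zones so a
# single-zone failure takes out only a fraction of capacity. The zone names
# match the slide; the placement scheme is an illustrative round-robin.
zones = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]

placement = {}
zone_cycle = cycle(zones)
for i in range(10):
    placement["instance-%d" % i] = next(zone_cycle)

# count how many instances landed in each zone
counts = {z: list(placement.values()).count(z) for z in zones}
print(counts)   # {'us-east-1a': 3, 'us-east-1b': 3, 'us-east-1c': 2, 'us-east-1d': 2}
```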

durable

99.999999999%
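Eleven nines of durability translates into a vanishingly small expected annual loss. A back-of-the-envelope calculation (the object count below is a made-up example, not an AWS figure):

```python
# Expected annual object loss at 99.999999999% (11 nines) durability.
# The durability figure is from the slide; the object count is hypothetical.
durability = 0.99999999999
loss_rate = 1 - durability            # ~1e-11 chance of losing a given object per year

objects_stored = 10**7                # hypothetical: ten million objects
expected_losses_per_year = objects_stored * loss_rate

print(expected_losses_per_year)       # ~1e-4: about one lost object every 10,000 years
```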

dynamic

extensible

secure

a utility

on-demand instances

reserved instances

spot instances
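The three purchasing models trade commitment for price: on-demand charges a flat hourly rate, reserved instances discount the hourly rate in exchange for an upfront fee, and spot instances use spare capacity that can be reclaimed. A rough cost sketch for a year-round workload (all rates below are hypothetical placeholders, not actual EC2 prices):

```python
# Rough cost comparison of the three EC2 purchasing models for a steady,
# year-round workload. All prices are hypothetical placeholders.
hours_per_year = 8760

on_demand_rate = 0.10          # $/hour, pay as you go
reserved_upfront = 300.00      # one-time fee for a 1-year reservation (hypothetical)
reserved_rate = 0.04           # discounted $/hour after the upfront fee
spot_rate = 0.03               # market-driven $/hour; can be interrupted

on_demand_cost = on_demand_rate * hours_per_year
reserved_cost = reserved_upfront + reserved_rate * hours_per_year
spot_cost = spot_rate * hours_per_year   # assumes the job tolerates interruption

print(on_demand_cost, reserved_cost, spot_cost)
```

Under these made-up numbers, reservations beat on-demand once utilization is high enough to amortize the upfront fee, and spot is cheapest of all if the workload can survive being interrupted.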

infrastructure as code

class Instance
  attr_accessor :aws_hash, :elastic_ip

  def initialize(hash, elastic_ip = nil)
    @aws_hash = hash
    @elastic_ip = elastic_ip
  end

  def public_dns
    @aws_hash[:dns_name] || ""
  end

  def friendly_name
    public_dns.empty? ? status.capitalize : public_dns.split(".")[0]
  end

  def id
    @aws_hash[:aws_instance_id]
  end
end

include_recipe "packages"
include_recipe "ruby"
include_recipe "apache2"

if platform?("centos", "redhat")
  if dist_only?
    # just the gem, we'll install the apache module within apache2
    package "rubygem-passenger"
    return
  else
    package "httpd-devel"
  end
else
  %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
    package pkg do
      action :upgrade
    end
  end
end

gem_package "passenger" do
  version node[:passenger][:version]
end

execute "passenger_module" do
  command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
  creates node[:passenger][:module_path]
end

import boto
import boto.emr
from boto.emr.step import StreamingStep
from boto.emr.bootstrap_action import BootstrapAction
import time

# set your aws keys and S3 bucket, e.g. from environment or .boto
AWSKEY =
SECRETKEY =
S3_BUCKET =
NUM_INSTANCES = 1

conn = boto.connect_emr(AWSKEY, SECRETKEY)

bootstrap_step = BootstrapAction("download.tst",
    "s3://elasticmapreduce/bootstrap-actions/download.sh", None)

step = StreamingStep(name='Wordcount',
                     mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
                     cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
                     reducer='aggregate',
                     input='s3n://elasticmapreduce/samples/wordcount/input',
                     output='s3n://' + S3_BUCKET + '/output/wordcount_output')

jobid = conn.run_jobflow(name="testbootstrap",
                         log_uri="s3://" + S3_BUCKET + "/logs",
                         steps=[step],
                         bootstrap_actions=[bootstrap_step],
                         num_instances=NUM_INSTANCES)

print "finished spawning job (note: starting still takes time)"

state = conn.describe_jobflow(jobid).state
print "job state = ", state
print "job id = ", jobid
while state != u'COMPLETED':
    print time.localtime()
    time.sleep(30)
    state = conn.describe_jobflow(jobid).state
    print "job state = ", state
    print "job id = ", jobid

print "final output can be found in s3://" + S3_BUCKET + "/output" + TIMESTAMP
print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output" + TIMESTAMP + " ."

Connect to Elastic MapReduce

Install packages

Set up mappers & reducers

job state

a data science platform

dataspaces

Further reading: Jeff Hammerbacher, Information Platforms and the rise of the data scientist, Beautiful Data

accept all data formats

evolve APIs

beyond the database and the data warehouse

move compute to the data

data is a royal garden

compute is a fungible commodity

“I terminate the instance and relaunch it. That’s my error handling”

Source: @jtimberman on Twitter
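The quote captures a pattern that cheap, fast provisioning enables: treat instances as disposable and handle failure by replacing rather than repairing. A toy sketch of the idea in plain Python (launch() and the scripted failures are hypothetical stand-ins for real provisioning calls):

```python
# "Terminate and relaunch" as error handling: no diagnosis, no repair.
# launch() and the failure script below are hypothetical stand-ins for
# real provisioning calls and real hardware failures.
outcomes = iter([False, False, True])     # scripted: two failures, then success

launched = []

def launch():
    """Pretend to provision a fresh instance; returns an instance id."""
    launched.append("i-%04d" % (len(launched) + 1))
    return launched[-1]

attempts = 0
while True:
    attempts += 1
    instance = launch()
    if next(outcomes):                    # did the job succeed on this instance?
        break
    # on failure: discard the instance and start over with a fresh one

print("succeeded on %s after %d attempt(s)" % (instance, attempts))
```

Because instances are interchangeable and launch in minutes, replacing a broken one is often simpler and more reliable than debugging it in place.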

the cloud is an architectural and

cultural fit for data science

amazon web services

your data science platform

s3://1000genomes

Credit: Angel Pizarro, U. Penn

AWS knows scalable infrastructure

you know the science

we can make this work together

deesingh@amazon.com Twitter:@mndoci

http://slideshare.net/mndoci
http://mndoci.com

Inspiration and ideas from Matt Wood, James Hamilton

& Larry Lessig

Credit: Oberazzi under a CC-BY-NC-SA license
