Scientific Computing in the Clouds
Karan Bhatia, Google
May 1, 2017

Jul 24, 2020




Investing to meet University and research needs


$29.4 Billion
Google’s trailing 3-year CAPEX investment

1 Billion End users served by GCP customers


[Map: Google Cloud regions and network. Legend: current regions and number of zones; committed regions for 2017 and number of zones; points of presence; network path; data localization.]

Regions shown: Oregon, Iowa, S Carolina, N Virginia, São Paulo, London, Belgium, Frankfurt, Finland, Mumbai, Singapore, Taiwan, Tokyo (2016), Sydney

Sources: https://peering.google.com and https://cloud.google.com/compute/docs/regions-zones/regions-zones


Agenda

Big Compute

Big Data

Programs

Patterns


Big Compute


Proprietary + Confidential

SC16 CMS Demonstrator

Target: generate 1 billion events in 48 hours during Supercomputing 2016 on Google Cloud via HEPCloud

35% filter efficiency = stage out 380 million events → 150 TB output

Double the size of global CMS computing resources

CMS Higgs Event - credit: CERN https://commons.wikimedia.org/wiki/File:CMS_Higgs-event.jpg
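
The scale of the demonstrator is easier to judge as sustained rates. The sketch below uses only the figures quoted on the slide; treating 1 TB as 10^12 bytes and spreading the output evenly over the 48 hours are assumptions made for illustration.

```python
# Back-of-the-envelope rates for the SC16 CMS demonstrator.
# All inputs are the figures quoted on the slide.
TARGET_EVENTS = 1_000_000_000   # events to generate
WINDOW_HOURS = 48               # time budget
STAGED_EVENTS = 380_000_000     # events staged out after filtering
OUTPUT_TB = 150                 # total output, assuming 1 TB = 1e12 bytes

window_seconds = WINDOW_HOURS * 3600
events_per_second = TARGET_EVENTS / window_seconds
bytes_per_event = OUTPUT_TB * 1e12 / STAGED_EVENTS
output_gb_per_second = OUTPUT_TB * 1e3 / window_seconds

print(f"generation rate: {events_per_second:,.0f} events/s")
print(f"average staged event size: {bytes_per_event / 1e3:,.0f} kB")
print(f"sustained stage-out rate: {output_gb_per_second:.2f} GB/s")
```

Under these assumptions the target works out to several thousand events per second and a little under 1 GB/s of output, sustained for two days.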


Cores from Google


MIT Research w/ VMs

Products used: Google Compute Engine, Cloud Storage, Datastore

220,000 cores on preemptible VMs

2,250 32-core instances, 60 CPU-years of computation in a single afternoon

Answers in hours v. months
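
As a rough check on the "single afternoon" claim, the sketch below converts 60 CPU-years into wall-clock time on 2,250 32-core instances. It assumes one CPU-year means one core busy for a 365-day year and that the work parallelizes perfectly; neither is stated on the slide, and the separate 220,000-core figure is not used here.

```python
# Convert the quoted CPU-years into wall-clock hours on the quoted fleet.
HOURS_PER_YEAR = 365 * 24       # assumption: non-leap year

cpu_years = 60                  # from the slide
instances = 2_250               # from the slide
cores_per_instance = 32         # from the slide

core_hours = cpu_years * HOURS_PER_YEAR
cores = instances * cores_per_instance
wall_clock_hours = core_hours / cores   # assumes perfect parallel efficiency

print(f"{core_hours:,} core-hours on {cores:,} cores")
print(f"about {wall_clock_hours:.1f} hours of wall-clock time")
```

With these inputs the estimate comes to roughly seven hours, which is consistent with the slide's framing.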


Broad FireCloud: WDL, Cromwell and Google Genomics

WDL: an external DSL used by computational biologists to express the analytical pipelines

Cromwell: a scalable, robust engine for executing WDL against pluggable backends including local, Docker, Grid Engine or …

Google Genomics Pipelines API: co-developed by Broad and Google Genomics, a scalable Docker-as-a-Service with data scheduling


Pipeline definition

{
  "name": "samtools index",
  "description": "Run samtools index to generate a BAM index file",
  "inputParameters": [
    {
      "name": "inputFile",
      "localCopy": {"disk": "data", "path": "input.bam"}
    }
  ],
  "outputParameters": [
    {
      "name": "outputFile",
      "localCopy": {"disk": "data", "path": "output.bam.bai"}
    }
  ],
  "resources": {
    "minimumCpuCores": 1,
    "minimumRamGb": 1,
    "disks": [
      {
        "name": "data",
        "type": "PERSISTENT_HDD",
        "sizeGb": 200,
        "mountPoint": "/mnt/data"
      }
    ]
  },
  "docker": {
    "imageName": "quay.io/cancercollaboratory/dockstore-tool-samtools-index",
    "cmd": "samtools index /mnt/data/input.bam /mnt/data/output.bam.bai"
  }
}
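
Hand-written JSON like the definition above is easy to get slightly wrong (a stray comma, a disk name that does not match). One way to guard against that is to build the definition as a Python dictionary and serialize it with the standard `json` module. The field names below are copied from the slide; `undeclared_disks` is a hypothetical helper written for this sketch, not part of any Google tool, and the dictionary is a shortened version of the definition.

```python
import json


def undeclared_disks(pipeline):
    """Return names of parameters whose localCopy disk is not listed under resources.disks."""
    declared = {d["name"] for d in pipeline.get("resources", {}).get("disks", [])}
    params = pipeline.get("inputParameters", []) + pipeline.get("outputParameters", [])
    return [
        p["name"]
        for p in params
        if "localCopy" in p and p["localCopy"]["disk"] not in declared
    ]


# Shortened version of the samtools index definition, for illustration only.
pipeline = {
    "name": "samtools index",
    "inputParameters": [
        {"name": "inputFile", "localCopy": {"disk": "data", "path": "input.bam"}},
    ],
    "resources": {
        "disks": [
            {"name": "data", "type": "PERSISTENT_HDD", "sizeGb": 200,
             "mountPoint": "/mnt/data"},
        ],
    },
}

# json.dumps always emits strict JSON, so comma mistakes cannot creep in.
text = json.dumps(pipeline, indent=2)
assert json.loads(text) == pipeline  # round-trips cleanly

print(undeclared_disks(pipeline))
```

The printed list is empty when every parameter refers to a declared disk; any names it does contain point at a mismatch to fix before submitting the file.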


Create, run, monitor, and kill pipelines

Create

$ gcloud alpha genomics pipelines create --pipeline-json-file samtools_index.json
Created samtools index, id: PIPELINE-ID

Run

$ gcloud alpha genomics pipelines run --pipeline_id PIPELINE-ID \
    --logging gs://YOUR-BUCKET/YOUR-DIRECTORY/logs \
    --inputs inputFile=gs://genomics-public-data/gatk-examples/example1/NA12878_chr22.bam \
    --outputs outputFile=gs://YOUR-BUCKET/YOUR-DIRECTORY/output/NA12878_chr22.bam.bai
Running: operations/OPERATION-ID

Status

$ gcloud alpha genomics operations describe OPERATION-ID

Kill

$ gcloud alpha genomics operations cancel OPERATION-ID
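
When many samples have to be submitted, the run command can be assembled programmatically. The sketch below only builds and prints the argument list; it does not call gcloud. The flags are copied from the slide (an alpha command from 2017) and may not match current gcloud releases; `build_run_command` and the bucket names are hypothetical.

```python
import shlex


def build_run_command(pipeline_id, logging_uri, input_uri, output_uri):
    """Assemble the argv for the `pipelines run` command shown on the slide."""
    return [
        "gcloud", "alpha", "genomics", "pipelines", "run",
        "--pipeline_id", pipeline_id,
        "--logging", logging_uri,
        "--inputs", f"inputFile={input_uri}",
        "--outputs", f"outputFile={output_uri}",
    ]


argv = build_run_command(
    pipeline_id="PIPELINE-ID",
    logging_uri="gs://YOUR-BUCKET/YOUR-DIRECTORY/logs",
    input_uri="gs://genomics-public-data/gatk-examples/example1/NA12878_chr22.bam",
    output_uri="gs://YOUR-BUCKET/YOUR-DIRECTORY/output/NA12878_chr22.bam.bai",
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would execute it; keeping the arguments as a list avoids shell-quoting problems with unusual file names.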


dsub (Google Genomics Pipelines)


Lessons

● Integration with third-party workload manager vs roll your own vs something in between
○ HTCondor, Slurm, Google Genomics Pipelines, ssh
○ Managed instance groups
● On-premise + hybrid vs on-cloud
● Cost optimizations
○ Preemptible VMs and custom machine types
○ Per-minute billing
● Networking is a key differentiator, public peering + Internet2 member
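
The effect of per-minute billing is easiest to see for a job that runs just past an hour boundary. The sketch below compares rounding up to whole hours with rounding up to whole minutes. The hourly rate is a made-up number chosen only for illustration, and `billed_minutes` with its optional minimum charge is a hypothetical helper, not a published pricing formula.

```python
import math


def billed_minutes(runtime_minutes, increment_minutes, minimum_minutes=0):
    """Round a runtime up to the billing increment, after applying a minimum charge."""
    charged = max(runtime_minutes, minimum_minutes)
    return math.ceil(charged / increment_minutes) * increment_minutes


HOURLY_RATE = 1.00   # hypothetical price per VM-hour, for illustration only
runtime = 61         # minutes: a job that just overruns one hour

per_hour_cost = billed_minutes(runtime, 60) / 60 * HOURLY_RATE
per_minute_cost = billed_minutes(runtime, 1) / 60 * HOURLY_RATE

print(f"hourly increments:     {billed_minutes(runtime, 60)} min billed, cost {per_hour_cost:.2f}")
print(f"per-minute increments: {billed_minutes(runtime, 1)} min billed, cost {per_minute_cost:.2f}")
```

For short, bursty jobs the gap between the two schemes approaches a factor of two, which is why billing granularity matters alongside preemptible pricing.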


Intel Skylake

● Significant “per core” performance improvements
● Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
○ 2x flops/second
● Accelerated IO with Intel® Omni-Path Architecture (Fabric)
● Integrated Intel® QuickAssist Technology (crypto & compression offload)
● Intel® Resource Director Technology (Intel® RDT) for Efficiency & TCO
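
The "2x flops/second" figure can be read as a statement about vector width. The slide does not name its baseline; the sketch below assumes the comparison is against 256-bit AVX2 registers and counts how many floating-point values fit in one register. It describes peak arithmetic per instruction only, not measured application speedup.

```python
def lanes(vector_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


# Assumed baseline: 256-bit AVX2 versus 512-bit AVX-512.
for element_bits, label in [(64, "float64"), (32, "float32")]:
    avx2 = lanes(256, element_bits)
    avx512 = lanes(512, element_bits)
    print(f"{label}: {avx2} -> {avx512} lanes ({avx512 // avx2}x per instruction)")
```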


Hardware Accelerated

● Available Today: NVIDIA K80 GPU
● Coming Soon: Tensor Processing Unit (TPU)
● Custom ASIC built and optimized for TensorFlow
● Used in production at Google for over 16 months
● 7 years ahead of GPU performance per watt


Data


© 2016 Google

Data Prep (beta)

[Diagram: Cloud Dataprep workflow]

1. Ingest Data: Cloud Pub/Sub, Cloud Dataflow
2. Instantly Prepare Data: Cloud Dataprep (Raw Data → Clean Data)
3. Analyze Data: Google BigQuery, Data Studio, Cloud ML


Cloud Dataprep

Supports Common Data Sources of Any Size
Process diverse datasets, structured and unstructured. Transform data stored in CSV, JSON, or relational table formats. Prepare datasets of any size, megabytes to terabytes, with equal ease.

Instant Data Exploration
Visually explore and interact with data in seconds. Instantly understand data distribution and patterns. You do not need to write code; you can prepare data with a few clicks.

Intelligent Data Cleansing
Cloud Dataprep automatically identifies data anomalies and helps you take corrective action quickly. Get data transformation suggestions based on your usage pattern. Standardize, structure, and join datasets easily with a guided approach.

Serverless
Cloud Dataprep is a serverless service, so you do not need to create or manage infrastructure.

Seriously Powerful
Cloud Dataprep is built on top of the powerful Google Cloud Dataflow service. Cloud Dataprep is auto-scalable and can easily handle processing massive data sets.


Google cloud computing can help universities transform


Teaching faculty in select countries

Teaching university courses

In computer science or related fields


Funding Agency Partnerships

● National Science Foundation
○ BIGDATA
● National Institutes of Health
○ Data Commons


Google Cloud Public Datasets Program

Mission: Facilitate the onboarding of datasets into Google Cloud products


42+ datasets


You can contribute too!

Visit: https://cloud.google.com/public-datasets/

Email: [email protected]


Themes / Patterns for Scientific Computing


Extending the Cloud APIs

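[Diagram: ISB-CGC platform layered on Google Cloud]

User roles and access methods:
● PI/Biologist: web access
● Computational Research Scientist: Python, R, SQL
● Algorithm Developer: ssh, programmatic access

Interfaces: ISB-CGC GUI, Google GUI, Google / ISB-CGC API

Google Cloud services: Compute Engine VMs, Cloud Storage, BigQuery, Genomics API

Storage and data: Local Storage, ISB-CGC Hosted Data, Controlled-Access Data, Open-Access Data, User Data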


TensorFlow

● World’s most popular ML framework

● Developer friendly yet performance optimized

● Powers over 100 Google services

● Managed infrastructure with Cloud ML

● Tutorials at https://www.tensorflow.org


Linear Regression VS Neural Network


Thank you!