Application of Cloud Computing for HPC problems

Transcript
Page 1: Cloud HPC

© 2009 Grid Dynamics — Proprietary and Confidential

MOVING HPC APPLICATIONS TO CLOUD

The Practitioner Perspective

Victoria Livschitz

CEO, Grid Dynamics

Page 2: Cloud HPC


AGENDA

• What clouds & HPC are being discussed?

• HPC & Clouds: match made in heaven?

• Concerns: dealing with performance, data and security

• Strategies of moving HPC to clouds

• Overview of HPC cloudware platforms

• Case studies: Monte Carlo, Batch Analytics, Excel @ Cloud

• Conclusions: where is cloud-HPC headed?

Page 3: Cloud HPC

WHAT ARE WE TALKING ABOUT?

HPC = Grid + Map/Reduce

Clouds = Public Clouds

Page 4: Cloud HPC

HPC + CLOUD: Match made in Heaven or Hell?

Page 5: Cloud HPC

THE BLESSINGS

Use Case | Cloud | So What?

Limited budget for new hardware | Lease as much as needed and pay as you go | Why buy a cow when all you need is milk?

Infrequent "monster" jobs | Cost neutrality: 100 VM @ 1 h = 10 VM @ 10 h | Impossible -> easy & cost-effective

Pressing time to market | Speed up innovation with multiple isolated dev and QA environments | Get to market first with a higher-quality product

Disaster recovery | Promptly restore fallen capacity; redundant geo-distributed storage | Quickly deploy "Plan B" while restoring the fallen system

Increasing IT complexity | Outsource core IT concerns | Concentrate on your core value, not on IT
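The "cost neutrality" row above is simple arithmetic worth making explicit: under pure per-VM-hour pricing (the $0.10/hour rate below is hypothetical), a short burst of many VMs costs the same as a long run of few VMs.

```python
# Illustrative only: pure pay-per-VM-hour pricing with a made-up rate.
HOURLY_RATE = 0.10  # hypothetical price per VM-hour

def job_cost(vms: int, hours: float, rate: float = HOURLY_RATE) -> float:
    """Total cost of running `vms` machines for `hours` each."""
    return vms * hours * rate

burst = job_cost(vms=100, hours=1)    # 100 VM-hours, done in 1 hour
steady = job_cost(vms=10, hours=10)   # also 100 VM-hours, done in 10 hours
assert burst == steady                # same money, 10x less wall-clock time
```

The same equation is what turns "impossible" monster jobs into routine ones: the money is fixed by total VM-hours, so you might as well buy the fast shape of the rectangle.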

Page 6: Cloud HPC

THE CURSE: BARRIERS OF ADOPTION

• Performance issues
  • Virtualization takes a toll on CPU and especially I/O
  • Cloud networks are not designed for low-latency communication

• Data issues
  • HPC can consume or produce enormous data volumes
  • Moving them in and out of the cloud adds latency and cost

• Vendor-related issues
  • Memory caps (currently ~16 GB) limit some shared-memory jobs
  • Legacy issues: clouds support only the latest kernels and libs
  • Licensing and certification of vendor software

• Security issues
  • Data privacy, availability and integrity
  • Private data moving over the WAN


Page 7: Cloud HPC

RAW CLOUD PERFORMANCE: IS IT REALLY AN ISSUE?

• Cloud HPC is slower than a bare-metal cluster
  • For the majority of use cases: 5% to 30% slower
• Not an issue if you can compensate by adding more VMs
• Consider the time sharing and queuing on a static HPC cluster
• A slow but dedicated cloud can get things done faster
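The compensation argument above is easy to quantify. A minimal sketch (the node counts and slowdowns are illustrative, not measurements): if each cloud VM is some fraction slower than a physical node, how many VMs match an N-node cluster's throughput?

```python
import math

def vms_needed(physical_nodes: int, slowdown: float) -> int:
    """VMs needed so aggregate throughput matches the physical cluster.
    `slowdown` is the fractional per-node loss, e.g. 0.30 for 30% slower."""
    per_vm = 1.0 - slowdown           # relative throughput of one VM
    return math.ceil(physical_nodes / per_vm)

assert vms_needed(100, 0.05) == 106   # 5% slower: only ~6 extra VMs
assert vms_needed(100, 0.30) == 143   # 30% slower: ~43 extra VMs
```

Since cost scales with VM count but the VMs are dedicated (no queueing behind other teams), the extra 6-43% capacity is often cheaper than the wall-clock time lost waiting on a shared static cluster.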


Page 8: Cloud HPC

MITIGATING DATA ISSUES

Concern | Mitigation

In-house data storage is easy and cheap | Redundant geo-distributed storage is neither easy nor cheap

Data movement to and from the cloud is slow and costly | Not all providers charge for data movement; compress; overnight data FedExing

Using data on a cloud is slow | Use native cloud data sources that scale; data grids help to cache, serve and process performance-critical data


Page 9: Cloud HPC

MITIGATING SECURITY ISSUES

Concern | Mitigation

Lack of perimeter defense | Hybrid architecture

Data security: data and IP in the cloud seem much more vulnerable in terms of privacy, integrity and availability | Transient in-memory management of sensitive data; encrypted file systems

Data transport: WAN data movement concerns | VPN, SSH tunnels, NFSv4, Amazon VPC, or hybrid architecture

Data persistency: proprietary data encryption and replication; cloud provider business continuity; is data really gone when it is deleted? | Due diligence of the cloud provider


Page 10: Cloud HPC

HYBRID CLOUD ARCHITECTURE

• Keep your private data secure in a colo

• Perimeter firewall for internet-facing services

• LAN connection to elastic capacity

Page 11: Cloud HPC

IS CLOUD HPC ALREADY A REALITY?

• Gaia ESA mission
  • To build a catalogue of 1B stars (1% of the Galaxy)
  • To be launched in 2011 for a 5-year mission
  • 3-8 Mbit/s downlink, 30 Gb/day

• Data reduction cycles
  • Multiple observations allow refining star positions
  • 6-month observation cycle followed by a 2-week catalog refinement cycle

• Reasons to go to the cloud
  • Bursty load profile
  • EC2-based solution is cheaper: 350K EUR vs 720K EUR in-house, excluding power and storage
  • Risk mitigation: no need to purchase a datacenter up front for a 5-year mission, as the probe may be lost any day


Page 12: Cloud HPC

STRATEGIES FOR MOVING HPC TO THE CLOUD

Build
  • Cloudware-based HPC
  • Native cloud HPC solutions

Buy
  • Move a commercial grid to a cloud environment

Page 13: Cloud HPC

MOVING A GRID TO A CLOUD

• WHEN?
  • CPU is the bounding factor
  • Legacy code or black-box tasks
  • Re-architecting is just not feasible or practical
  • For dev and test grids
  • The grid vendor is already there

• HOW?
  • Build your own
    • Custom worker machine image
    • Keep the scheduler and data sources on premises for maximum control and security
    • Consider SSH tunneling or VPN for maximum security
  • Or use the vendor's cloud adapters
    • DataSynapse Federator
    • Sun Grid Engine DRM
    • Univa UniCloud (SGE)
    • Condor – CycleComputing CycleCloud


Page 14: Cloud HPC

DATASYNAPSE FEDERATOR

• Policies for starting / stopping cloud based engines

• Secure connections to cloud based engines

[Diagram: a Grid Client talks to the Federator and the DataSynapse Manager on premise; the engines run in the cloud.]

Page 15: Cloud HPC

DATASYNAPSE FEDERATOR

• SSH tunnel to communicate over the WAN
  • For managing engines
  • For engines to access on-premise data
• A proxy does basic caching
• The DS engine updates grid libraries on boot

[Diagram: DataSynapse engines on EC2 — the on-premise DS Manager and Federator (with its activation policy) connect through proxy services and a secure SSH tunnel to AWS, where a custom AMI is built from the DS base AMI using artifacts from DS and client S3 buckets.]

Page 16: Cloud HPC

ADOPTING DATAGRID CLOUDWARE

• WHEN?
  • Data access is the bounding factor
  • White-box tasks
  • The luxury of a re-design

• HOW?
  • Plenty of powerful clustered middleware:
    • Oracle Coherence
    • GigaSpaces XAP
    • GridGain
    • Terracotta
  • Design the application with these in mind:
    • Data partitioning
    • Compute-data affinity
    • In-place data processing
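The three design points above fit together: partition data by key, route computation to the partition that owns the key (affinity), and run the work inside that partition rather than shipping data out. A minimal, hypothetical sketch with no real middleware — `PartitionedGrid` and its keys are invented for illustration:

```python
# Toy stand-in for a data grid: partitions are plain dicts.
class PartitionedGrid:
    def __init__(self, partitions: int):
        self.partitions = [dict() for _ in range(partitions)]

    def _route(self, key):
        # Compute-data affinity: the same key always maps to the same partition.
        return self.partitions[hash(key) % len(self.partitions)]

    def put(self, key, value):
        self._route(key)[key] = value

    def process(self, key, fn):
        """In-place processing: run fn inside the owning partition."""
        part = self._route(key)
        part[key] = fn(part[key])
        return part[key]

grid = PartitionedGrid(partitions=4)
grid.put("portfolio:42", [1.0, 2.0, 3.0])
total = grid.process("portfolio:42", lambda v: sum(v))
assert total == 6.0
```

Products like Coherence or GigaSpaces do the same routing across JVMs over the network; the payoff is that the expensive object (the portfolio above) never leaves the node that stores it.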


Page 17: Cloud HPC

GIGASPACES XAP

• Full app stack
  • General frameworks
  • In-memory data grid
  • Messaging
  • Web container

• Collapsed tiers
  • Processing unit as the logical unit of scalability
  • SLA-driven container as the physical unit of scalability

• Cloud adapter to provision containers on demand


Page 18: Cloud HPC

ORACLE COHERENCE

[Diagram: the Oracle Coherence data grid sits between the application tier / data services and the data sources (mainframes, databases, web services).]

• Most popular data grid product

• True dynamic scalability

• Shared common virtualized app platform

• In-memory data grid

• In-place data processing

• Explicit locking

• ACID Transactions

Page 19: Cloud HPC

NATIVE CLOUD HPC

• WHEN?
  • Innovative path-finding solutions (speed of innovation)
  • True massive-scale data processing
  • Naturally bursty applications
  • Analysis and processing of Big Data

• HOW?
  • Amazon Elastic MapReduce (Hadoop in the cloud)
    • HDFS to store large files
    • MapReduce to manage the workload
    • HBase to manage semi-structured data on top of HDFS
    • Hive for batch queries and aggregation with SQL-like queries
  • Cloudera
  • RightScale RightGrid
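The MapReduce model named above reduces to three phases: map each record to key-value pairs, shuffle pairs by key, and reduce each group. A self-contained sketch of the classic word count over an in-memory corpus (no Hadoop involved — this only illustrates the programming model):

```python
from collections import defaultdict

def map_phase(doc: str):
    """Map: emit (word, 1) for every word in a document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["cloud hpc", "hpc on the cloud", "cloud cloud"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
assert counts["cloud"] == 4 and counts["hpc"] == 2
```

Hadoop runs the same three phases, but with the map and reduce calls spread across a cluster and the shuffle done over the network, which is exactly why the model scales to "true massive" data.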


Page 20: Cloud HPC


CASE STUDIES

• Monte-Carlo @ Cloud
• Batch processing @ Cloud
• Excel analytics @ Cloud

Page 21: Cloud HPC

MONTE-CARLO @ CLOUD

Page 22: Cloud HPC

ANALYTICS APPLICATIONS

• Analytics applications analyze data or perform computations based on mathematical models

• Typical usage examples
  • Project sales numbers
  • Estimate inventory levels
  • Evaluate portfolio values
  • Value-at-risk (VaR) calculations
  • Project web site traffic

• The information helps in making better decisions
• Identify and mitigate risks


Page 23: Cloud HPC

ANALYTICS APPLICATIONS

Traditional Approach | New Approach

(both: always compute-intensive, sometimes data-intensive)

Runs as a batch | Runs as a service

Fixed static footprint; uses idle compute cycles (CPU scavenging) | Dynamically scalable

Based on popular scheduler-based grid frameworks | Based on emerging HPC technologies

Not designed for near-real-time processing | Oriented to near-real-time processing


Page 24: Cloud HPC

CLOUD-BASED SOLUTION FOR NEAR-REAL-TIME ANALYTICS

• Pros
  • Dynamically scale up and down based on the size of the computation
  • Create and dispose of infrastructure once the computation is done
  • Add more machines to bring the compute time close to real time

• Cons
  • Massive data transfer in and out of the cloud can be time-consuming; problems that depend on lots of dynamic data may not be suitable
  • Shared processor memory is no longer available; share-all models are poor candidates


Page 25: Cloud HPC

BUSINESS DRIVERS

• Major investment bank
  • Annuity calculator application
  • Monte-Carlo simulation with geometric Brownian motion (GBM)
  • Fully parallelizable algorithm
  • The customer talks to an agent, and the agent gets back to the customer the next business day
  • Currently a nightly batch job computes the annuity amounts

• Problems with the current approach
  • The system is constrained by the time available for the batch
  • Customer satisfaction can be improved if this can be computed on the spot, in near real time
  • Adding new resources to the system is hard and expensive
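The bank's actual models are C++, but the GBM step is worth spelling out because it shows why the algorithm is fully parallelizable: every draw is independent. A toy Python sketch with made-up parameters (S0, drift, volatility and the seed are all illustrative):

```python
import math
import random
import statistics

def gbm_terminal_price(s0, mu, sigma, t, rng):
    """One GBM draw: S_T = S0 * exp((mu - sigma^2/2) t + sigma sqrt(t) Z)."""
    z = rng.gauss(0.0, 1.0)
    return s0 * math.exp((mu - 0.5 * sigma**2) * t + sigma * math.sqrt(t) * z)

def simulate(draws, s0=100.0, mu=0.05, sigma=0.20, t=1.0, seed=7):
    """Average many independent draws; each cloud worker could run its own
    batch with its own seed and the master would just average the averages."""
    rng = random.Random(seed)
    return statistics.fmean(
        gbm_terminal_price(s0, mu, sigma, t, rng) for _ in range(draws)
    )

est = simulate(50_000)
# The theoretical mean of S_T is S0 * exp(mu * t) ~= 105.13
assert abs(est - 100.0 * math.exp(0.05)) < 1.0
```

Because there is no shared state between draws, splitting 1M draws across N VMs divides wall-clock time by roughly N, which is what turns the nightly batch into a near-real-time answer for the agent.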


Page 26: Cloud HPC

REQUIREMENTS AND SOLUTION

• Business requirements
  • Ability to quickly launch and shut down the application on demand
  • Ability to scale up or down based on the size of the problem
  • Complete the simulation in near real time
  • Model functionality should be reusable
  • Security
  • Re-use existing Monte Carlo models (written in C++)

• Solution
  • Amazon Web Services
  • GridGain cloudware


Page 27: Cloud HPC

GRIDGAIN CLOUDWARE

Page 28: Cloud HPC

CASE STUDY: SOLUTION ARCHITECTURE

Page 29: Cloud HPC

CASE STUDY: HIGHLIGHTS

• Monte Carlo simulation service that can be launched at the click of a button
• Simulation cluster up and serving in less than 4 minutes
• Scale up the cluster in under 2 minutes
• Simulation cluster can be dismissed at the click of a button
• ~1M draws in the MC simulation yield accurate results in near real time
• SOA architecture: the simulation is a web service that can be consumed by any client
• Dynamically loads the application code and reference data, and configures the application on boot-up from S3 (storage cloud)


Page 30: Cloud HPC

BATCH ANALYTICS @ CLOUD

Page 31: Cloud HPC

WHY BATCH PROCESSING @ CLOUD?

• Traditional batch processing limitations
  • Limited by the number of server resources
  • Low utilization
  • No way to process burst workloads
  • HW failure reduces capacity

• The cloud way
  • Unlimited server resources
  • 100% utilization
  • Opportunity to scale with load
  • Opportunity to automatically restore capacity on failure
  • Do it as quickly as you need
    • Neutral cost equation: 1000 servers @ 1 hour = 10 servers @ 100 hours


Page 32: Cloud HPC

EXAMPLE: LOG PROCESSING @ CLOUD

• Problem:
  • Processing of traffic usage in a large enterprise
  • NetFlow logs gathered, stored and processed into reports for the business
  • Various analytics, like the biggest traffic offender within the enterprise

• Solution:
  • Terracotta cloudware for cluster management, job distribution and results gathering
  • Logs are served by a scalable nginx web server
  • Automated provisioning and dynamic scalability
  • Deployed on top of Amazon EC2


Page 33: Cloud HPC

BATCH PROCESSING ARCHITECTURE

[Diagram: a frontend submits job requests to the master of the batch processing cluster; the master farms work out to a worker server array, which reads from the data source and returns job results; a provisioning service issues scale-up requests through the cloud API to add new servers.]

Page 34: Cloud HPC

TERRACOTTA CLOUDWARE

• Clusters the JVM, not the application
  • Transparent clustering
  • Network-attached memory
  • Separation of application from infrastructure

• No new API
  • Java is the API
  • Java memory model
  • Java concurrency

[Diagram: the Terracotta server clusters the JVM beneath several scaled-out app servers, each running a web app (frameworks plus business logic) on its own JVM.]

Page 35: Cloud HPC

TERRACOTTA MASTER-WORKER ARCHITECTURE

[Diagram: a master JVM and multiple worker JVMs each run a TC driver; their heaps are shared through the TC server over the TC communication layer.]
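The master-worker pattern in this architecture is simple to sketch. Below, a plain thread-safe queue stands in for Terracotta's network-attached memory (Terracotta itself shares real heap objects across JVMs; this Python toy only shows the job flow, with squaring as a stand-in task):

```python
import queue
import threading

def worker(tasks: "queue.Queue", results: "queue.Queue"):
    """Pull jobs until a poison pill (None) arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        results.put(item * item)   # the "job": square the input

tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for w in workers:
    w.start()

for n in range(10):                # master enqueues jobs...
    tasks.put(n)
for _ in workers:                  # ...then one pill per worker
    tasks.put(None)
for w in workers:
    w.join()

assert sorted(results.queue) == [n * n for n in range(10)]
```

With Terracotta the `tasks` and `results` structures would live in the shared, network-attached heap, so workers can run in other JVMs (including freshly provisioned cloud VMs) without any code change to the pattern.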

Page 36: Cloud HPC

SCHEDULER-BASED BATCH PROCESSING @ CLOUD

• Sun Grid Engine + AWS
  • When tasks are highly heterogeneous
  • For cloud bursting
  • Advanced resource management capabilities
  • Self-contained AMI to boot and self-organize an SGE cluster
  • SDM + EC2 adapters to grow and shrink the cluster depending on the work queue
• Univa UD
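A grow/shrink adapter of the kind described above is, at its core, a policy function from queue depth to desired cluster size. A hypothetical sketch (no real cloud API calls; the bounds and tasks-per-worker ratio are invented):

```python
def desired_workers(queue_depth: int, tasks_per_worker: int,
                    min_workers: int = 1, max_workers: int = 64) -> int:
    """Size the cluster from the pending-queue depth, within fixed bounds."""
    wanted = -(-queue_depth // tasks_per_worker)   # ceiling division
    return max(min_workers, min(max_workers, wanted))

assert desired_workers(0, 10) == 1        # idle: shrink to the floor
assert desired_workers(95, 10) == 10      # grow with the backlog
assert desired_workers(10_000, 10) == 64  # but never past the cap
```

The real adapter would call this on a timer, then launch or terminate instances to close the gap between the current and desired worker counts.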


Page 37: Cloud HPC

EXAMPLE: DNA SEQUENCER

• Problem: a DNA sequencer tool
  • Produces TBs of raw data in one experiment
  • Processed by an in-house SGE cluster
  • Refined to GBs after processing
  • Storage is cheap, but redundant geo-distributed storage is not
  • Frequent need to re-run processing of old experiments, ad hoc
  • Hard to allocate resources for ad-hoc runs; raw data may become unavailable

• Solution: SGE + AWS
  • Raw data from the tool is FedExed to Amazon and uploaded to S3
  • Run an ad-hoc SGE cluster in the cloud to re-process (same codebase as in-house)
  • SGE workers process data from and store results to S3
  • Consume refined results: either download directly, or FedEx back to the labs


Page 38: Cloud HPC

RIGHTGRID: CLOUD WAY FOR BATCH PROCESSING

• An easy way to utilize the full power of cloud computing
  • Dynamic SLA-based scaling of worker machines
  • Truly scalable storage
  • Truly scalable messaging

• RightGrid offers a lightweight yet powerful framework:
  • EC2 as the worker pool, S3 as mediated storage, SQS as messaging
  • Ruby-based framework for JobProducer, JobConsumer, message codec, etc.
  • Designed to wrap and run arbitrary code on worker nodes
  • Transient and persistent worker execution models
  • Failover, error reporting and audit
  • Custom scaling policies


Page 39: Cloud HPC

RIGHTGRID ARCHITECTURE

Page 40: Cloud HPC

EXAMPLE: DOCUMENT CONVERSION

• Problem:
  • A publishing house needs to convert its document repository to a standard format for later indexing
  • All kinds of document formats to be rendered as PDF documents
  • A once-in-a-blue-moon job

• Solution:
  • Use Amazon EC2 and RightScale's RightGrid framework
  • Document storage FedExed to Amazon, uploaded to S3
  • Documents converted by an application built on top of the RightGrid framework
  • Converted documents stored on S3
  • The resulting document pack is FedExed from Amazon back to the customer


Page 41: Cloud HPC

EXCEL ANALYTICS @ CLOUD

Page 42: Cloud HPC

WHY EXCEL @ CLOUD?

• Ubiquitous
  • Financial analysts think in Excel
  • Excel + VBA is the current financial analyst's IDE
  • For many financial institutions, Excel is the main data analysis tool
  • Used by analysts and engineers

• Limited programming model
  • Single-threaded, memory-limited, not that performant

• Need to run large Excel workloads
  • Parallelization of workload and data is the only way out
  • On-demand infrastructure to run parallel Excel


Page 43: Cloud HPC

MOVING EXCEL TO CLOUD

• Calculation flow
  • A DAG of calculation units (macro, UDF, workbook recalc)
  • Representable as a "DAG table" or task dependency table

• Data flow
  • The workbook as a system of record and data synchronization point
  • Moving workbooks around is costly – moving data deltas is essential
  • Template regions are used to capture input and output parameters
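The "DAG table" above is just a task dependency table that the scheduler walks in topological order: a unit runs only after every unit it depends on has finished. A minimal sketch with hypothetical unit names (`macro_load`, `udf_price`, `recalc_book` are invented for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency table: unit -> set of prerequisite units.
dag = {
    "recalc_book": {"macro_load", "udf_price"},
    "udf_price":   {"macro_load"},
    "macro_load":  set(),
}

order = list(TopologicalSorter(dag).static_order())
# Prerequisites always come before their dependents.
assert order.index("macro_load") < order.index("udf_price") < order.index("recalc_book")
```

Units with no ordering constraint between them can be dispatched to different compute nodes in parallel, which is what makes the DAG-table form useful for cloud scheduling.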


Page 44: Cloud HPC

MOVING EXCEL TO CLOUD: DEPLOYMENT

[Diagram: user PCs (MS Windows & Excel) on the customer premises (1) submit a job to a web server; (2) the workbook is staged in to a staging server in the cloud (private or public) over a private link or the internet, with an HTTP or FTP server used for public clouds only; (3) the scheduler submits tasks to compute nodes (MS Windows & Excel); (4) results are staged out.]

Page 45: Cloud HPC

FUTURE OF CLOUD HPC

Specialized IaaS and PaaS offerings for HPC

• Bare metal with provisioning on demand

• Integrated HPC engines

• Math services

• Domain specific reference data services

Page 46: Cloud HPC


Thank You!

Victoria Livschitz

CEO, Grid Dynamics