MOVING HPC APPLICATIONS TO CLOUD
The Practitioner Perspective
Victoria Livschitz, CEO, Grid Dynamics
© 2009 Grid Dynamics — Proprietary and Confidential
May 11, 2015
AGENDA
• What clouds & HPC are being discussed?
• HPC & Clouds: match made in heaven?
• Concerns: dealing with performance, data and security
• Strategies for moving HPC to clouds
• Overview of HPC cloudware platforms
• Case studies: Monte Carlo, Batch Analytics, Excel @ Cloud
• Conclusions: where is cloud-HPC headed?
WHAT ARE WE TALKING ABOUT?
HPC = Grid + Map/Reduce
Clouds = Public Clouds
HPC + CLOUD: Match made in Heaven or Hell?
THE BLESSINGS
• Limited budget for new hardware
  • Cloud: lease as much as needed and pay as you go
  • So what? Why buy a cow when all you need is milk?
• Infrequent "monster" jobs
  • Cloud: cost neutrality — 100 VM @ 1 h = 10 VM @ 10 h
  • So what? Impossible -> easy & cost-effective
• Pressing time to market
  • Cloud: speed up innovation with multiple isolated dev and QA environments
  • So what? Get to market first with a higher-quality product
• Disaster recovery
  • Cloud: promptly restore fallen capacity; redundant geo-distributed storage
  • So what? Quickly deploy "Plan B" while restoring the fallen system
• Increasing IT complexity
  • Cloud: outsource IT concerns
  • So what? Concentrate on your core value, not on IT
THE CURSE: BARRIERS TO ADOPTION
• Performance issues
  • Virtualization taxes CPU and especially I/O
  • Cloud networks are not designed for low-latency communication
• Data issues
  • HPC can consume or produce enormous data volumes
  • Moving them in and out of the cloud adds latency & cost
• Vendor-related issues
  • Memory caps (currently ~16 GB) limit some shared-memory jobs
  • Legacy issues: clouds support only the latest and greatest kernels and libs
  • Licensing and certification of vendor software
• Security issues
  • Data privacy, availability and integrity
  • Private data moving over the WAN
RAW CLOUD PERFORMANCE: IS IT REALLY AN ISSUE?
• Cloud HPC is slower than a bare-metal cluster
  • For the majority of use cases: 5% to 30% slower
  • Not an issue if you can compensate by adding more VMs
• Consider time sharing and queuing on a static HPC cluster
  • A slower but dedicated cloud can get things done faster
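The "slower but dedicated" point is easy to quantify. A minimal sketch — the queue wait, slowdown factor and VM counts below are purely illustrative assumptions, not benchmarks:

```python
def completion_time(run_hours, queue_wait_hours=0.0, slowdown=0.0, vm_scale=1):
    """Wall-clock completion time for an embarrassingly parallel job.

    slowdown: fractional virtualization penalty (0.2 = 20% slower per VM)
    vm_scale: how many times more VMs the cloud run uses
    """
    return queue_wait_hours + run_hours * (1 + slowdown) / vm_scale

# Illustrative numbers: a 10-hour job that waits 8 hours in the shared
# cluster's queue, vs. a 20%-slower cloud run on twice the VMs.
on_premise = completion_time(10, queue_wait_hours=8)        # 18.0 h
cloud = completion_time(10, slowdown=0.2, vm_scale=2)       # 6.0 h
```

Even with the virtualization penalty, skipping the queue and scaling out wins.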
MITIGATING DATA ISSUES
• Concern: In-house data storage is easy and cheap
  • Mitigation: Redundant geo-distributed storage is neither easy nor cheap
• Concern: Data movement to and from the cloud is slow and costly
  • Mitigation: Not all providers charge for data movement; compress; overnight data FedExing
• Concern: Using data on a cloud is slow
  • Mitigation: Use native cloud data sources that scale; data grids help cache, serve and process performance-critical data
MITIGATING SECURITY ISSUES
• Concern: Lack of perimeter defense
  • Mitigation: Hybrid architecture
• Concern: Data security — data and IP in the cloud seem much more vulnerable in terms of privacy, integrity and availability
  • Mitigation: Transient in-memory management of sensitive data; encrypted file systems
• Concern: Data transport — WAN data movement
  • Mitigation: VPN, SSH tunnels, NFSv4, Amazon VPC, or hybrid architecture
• Concern: Data persistency — proprietary data encryption and replication; cloud provider business continuity; is data really gone when it is deleted?
  • Mitigation: Due diligence of the cloud provider
HYBRID CLOUD ARCHITECTURE
• Keep your private data secure in a colocation facility
• Perimeter firewall for internet-facing services
• LAN connection to elastic capacity
IS CLOUD HPC ALREADY A REALITY?
• Gaia ESA mission
  • To build a catalogue of 1B stars (1% of the Galaxy)
  • To be launched in 2011 for a 5-year mission
  • 3–8 Mbit/s downlink, 30 Gb/day
• Data reduction cycles
  • Multiple observations allow refining star positions
  • 6-month observation cycle followed by a 2-week catalog refinement cycle
• Reasons to go to the cloud
  • Bursty load profile
  • EC2-based solution is cheaper: 350K EURO vs. 720K EURO in-house, excluding power and storage
  • Risk mitigation: no need to purchase a datacenter up front for a 5-year mission, as the probe may be lost any day
STRATEGIES FOR MOVING HPC TO THE CLOUD
• Build
  • Cloudware-based HPC
  • Native cloud HPC solutions
• Buy
  • Move a commercial grid to a cloud environment
MOVING A GRID TO A CLOUD
• WHEN?
  • CPU is the bounding factor
  • Legacy code or black-box tasks
  • Re-architecting is just not feasible or practical
  • For dev and test grids
  • Grid vendor is already there
• HOW?
  • Build your own
    • Custom worker machine image
    • Keep the scheduler and data sources on premises for maximum control and security
    • Consider SSH tunneling or VPN for maximum security
  • Or use a vendor's cloud adapters
    • DataSynapse Federator
    • Sun Grid Engine DRM
    • Univa UniCloud (SGE)
    • Condor – CycleComputing CycleCloud
DATASYNAPSE FEDERATOR
• Policies for starting / stopping cloud-based engines
• Secure connections to cloud-based engines
[Diagram: Grid Client → Federator → DataSynapse Manager, split between "On Premise" and "In the Cloud"]
DATASYNAPSE FEDERATOR
• SSH tunnel to communicate over the WAN
  • For managing engines
  • For engines to access on-premise data
• Proxy does basic caching
• DS Engine updates grid libraries on boot
[Diagram: on premise — DS Manager, Federator with activation policy, Proxy Service; on AWS — DataSynapse engines on EC2 booted from a custom AMI built on the DS base AMI, DS and client S3 buckets, Proxy Service; connected by a secure SSH tunnel]
ADOPTING DATA GRID CLOUDWARE
• WHEN?
  • Data access is the bounding factor
  • White-box tasks
  • The luxury of re-design
• HOW?
  • Plenty of powerful clustered middleware:
    • Oracle Coherence
    • GigaSpaces XAP
    • GridGain
    • Terracotta
  • Design the application with these in mind:
    • Data partitioning
    • Compute-data affinity
    • In-place data processing
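The three design considerations can be sketched in a few lines of Python — a toy stand-in for products like Coherence or GridGain, not their actual APIs:

```python
NUM_PARTITIONS = 4

def partition_of(key):
    # Data partitioning: a key is always routed to the same partition.
    return hash(key) % NUM_PARTITIONS

class PartitionedGrid:
    """Toy in-memory data grid: entries are sharded across partitions,
    and computations are shipped to the partition that owns the data
    (compute-data affinity)."""

    def __init__(self):
        self.partitions = [dict() for _ in range(NUM_PARTITIONS)]

    def put(self, key, value):
        self.partitions[partition_of(key)][key] = value

    def process_in_place(self, key, fn):
        # In-place processing: fn runs "next to" the data; only the
        # (small) result crosses the wire, not the entry itself.
        shard = self.partitions[partition_of(key)]
        return fn(shard[key])

grid = PartitionedGrid()
grid.put("portfolio:42", [100.0, 250.0, 75.0])
total = grid.process_in_place("portfolio:42", sum)  # 425.0
```

Real data grids add replication, locking and network transport, but the routing-plus-affinity idea is the same.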
GIGASPACES XAP
• Full app stack
  • General frameworks
  • In-memory data grid
  • Messaging
  • Web container
• Collapsed tiers
  • Processing unit as the logical unit of scalability
  • SLA-driven container as the physical unit of scalability
• Cloud adapter to provision containers on demand
ORACLE COHERENCE
• Most popular data grid product
• True dynamic scalability
• Shared common virtualized app platform
• In-memory data grid
• In-place data processing
• Explicit locking
• ACID transactions
[Diagram: the application tier connects through the Oracle Coherence data grid (data services) to the data sources — mainframes, databases, web services]
NATIVE CLOUD HPC
• WHEN?
  • Innovative path-finding solutions (speed of innovation)
  • Truly massive-scale data processing
  • Naturally bursty applications
  • Analysis and processing of Big Data
• HOW?
  • Amazon Elastic MapReduce (Hadoop in the cloud)
    • HDFS to store large files
    • MapReduce to manage the workload
    • HBase to manage semi-structured data on top of HDFS
    • Hive for batch queries and aggregation with QL queries
  • Cloudera
  • RightScale RightGrid
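For readers new to the model, the map/group/reduce cycle that Hadoop distributes across a cluster can be sketched in a single process; word count is the canonical example. This is an illustration of the programming model, not Hadoop's API:

```python
from collections import defaultdict
from itertools import chain

def map_reduce(records, mapper, reducer):
    """Minimal single-process MapReduce: map each record to (key, value)
    pairs, group the values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: each line maps to (word, 1) pairs; the reducer sums them.
lines = ["cloud hpc", "cloud grid", "hpc"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
# counts == {"cloud": 2, "hpc": 2, "grid": 1}
```

Hadoop's contribution is running the map and reduce phases in parallel over HDFS blocks with shuffling, scheduling and fault tolerance; the logical model is exactly this.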
CASE STUDIES
• Monte-Carlo @ Cloud
• Batch processing @ Cloud
• Excel analytics @ Cloud
MONTE-CARLO @ CLOUD
ANALYTICS APPLICATIONS
• Analytics applications analyze data or perform computations based on mathematical models
• Typical usage examples
  • Project sales numbers
  • Estimate inventory levels
  • Evaluate portfolio values
  • Value-at-risk (VaR) calculations
  • Project web site traffic
• The information helps in making better decisions
  • Identify and mitigate risks
ANALYTICS APPLICATIONS
Always compute intensive, sometimes data intensive.

Traditional Approach → New Approach
• Runs as a batch → runs as a service
• Fixed static footprint, using idle compute cycles (CPU scavenging) → dynamically scalable
• Based on popular scheduler-based grid frameworks → based on emerging HPC technologies
• Not designed for near-real-time processing → oriented to near-real-time processing
CLOUD-BASED SOLUTION FOR NEAR-REAL-TIME ANALYTICS
• Pros
  • Dynamically scale up and down based on the size of the computation
  • Create and dispose of the infrastructure once the computation is done
  • Add more machines to bring compute time close to real time
• Cons
  • Massive data transfer in and out of the cloud can be time consuming; problems that depend on lots of dynamic data may not be suitable
  • Shared processor memory is no longer available; share-all models are poor candidates
BUSINESS DRIVERS
• Major investment bank: annuity calculator application
  • Monte-Carlo simulation with geometric Brownian motion (GBM)
  • Fully parallelizable algorithm
  • A customer talks to an agent, and the agent gets back to the customer the next business day
  • Currently a nightly batch job computes the annuity amounts
• Problems with the current approach
  • The system is constrained by the time available for the batch
  • Customer satisfaction could be improved if this were computed on the spot, in near real time
  • Adding new resources to the system is hard and expensive
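To make the workload concrete, here is a minimal GBM Monte Carlo sketch. The bank's actual models are in C++; all parameter values below are illustrative assumptions:

```python
import math
import random

def gbm_terminal_price(s0, mu, sigma, t, rng):
    """One GBM draw: S_T = S_0 * exp((mu - sigma^2/2) t + sigma sqrt(t) Z)."""
    z = rng.gauss(0.0, 1.0)
    return s0 * math.exp((mu - 0.5 * sigma**2) * t + sigma * math.sqrt(t) * z)

def monte_carlo_mean(s0, mu, sigma, t, draws, seed=0):
    # Each draw is independent — this is why the simulation is fully
    # parallelizable: split `draws` across workers and average the results.
    rng = random.Random(seed)
    total = sum(gbm_terminal_price(s0, mu, sigma, t, rng) for _ in range(draws))
    return total / draws

estimate = monte_carlo_mean(s0=100.0, mu=0.05, sigma=0.2, t=1.0, draws=100_000)
# The analytic mean is E[S_T] = s0 * exp(mu * t) ≈ 105.13;
# the estimate converges toward it as draws grow.
```

Because the draws share nothing, a cloud deployment simply partitions the draw count across VMs and combines the partial averages.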
REQUIREMENTS AND SOLUTION
• Business requirements
  • Ability to quickly launch and shut down the application on demand
  • Ability to scale up or down based on the size of the problem
  • Complete the simulation in near real time
  • Model functionality should be reusable
  • Security
  • Re-use existing Monte Carlo models (written in C++)
• Solution
  • Amazon Web Services
  • GridGain cloudware
GRIDGAIN CLOUDWARE
CASE STUDY: SOLUTION ARCHITECTURE
CASE STUDY: HIGHLIGHTS
• Monte Carlo simulation service that can be launched at the click of a button
• Simulation cluster up and serving in less than 4 minutes
• Scale up the cluster in under 2 minutes
• Simulation cluster can be dismissed at the click of a button
• ~1M draws in the MC simulation yield accurate results in near real time
• SOA architecture: the simulation is a web service that can be consumed by any client
• Dynamically loads the application code and reference data, and configures the application on boot from S3 (storage cloud)
BATCH ANALYTICS @ CLOUD
WHY BATCH PROCESSING @ CLOUD?
• Traditional batch processing limitations
  • Limited by the number of server resources
  • Low utilization
  • No way to process burst workloads
  • HW failure reduces capacity
• The cloud way
  • Practically unlimited server resources
  • 100% utilization
  • Opportunity to scale with load
  • Opportunity to automatically restore capacity on failure
  • Do it as quickly as you need: neutral cost equation — 1000 servers @ 1 hour = 10 servers @ 100 hours
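The neutral cost equation is just linearity of pay-as-you-go pricing in VM-hours; a quick check (the per-hour price is an arbitrary assumption):

```python
def vm_hours_cost(num_vms, hours, price_per_vm_hour=0.10):
    # Pay-as-you-go pricing is linear in VM-hours, so cost depends only
    # on the product num_vms * hours, not on how you split it.
    return num_vms * hours * price_per_vm_hour

fast = vm_hours_cost(1000, 1)   # 1000 servers for 1 hour
slow = vm_hours_cost(10, 100)   # 10 servers for 100 hours
# fast == slow: same price, 100x faster turnaround
```

The equation holds only while the workload parallelizes cleanly — per-job overheads and non-parallel phases break the neutrality.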
EXAMPLE: LOG PROCESSING @ CLOUD
• Problem
  • Processing of traffic usage in a large enterprise
  • NetFlow logs gathered, stored and processed into reports for the business
  • Various analytics, e.g. the biggest traffic offender within the enterprise
• Solution
  • Terracotta cloudware for cluster management, job distribution and result gathering
  • Logs served by a scalable nginx web server
  • Automated provisioning and dynamic scalability
  • Deployed on top of Amazon EC2
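The "biggest traffic offender" analytic reduces to a per-host byte aggregation. A single-node sketch of the logic each worker would apply to its share of the logs — the record layout here is an illustrative assumption, not the NetFlow wire format:

```python
from collections import Counter

def top_offenders(flow_records, n=1):
    """Aggregate bytes per source host and return the n biggest talkers.

    flow_records: iterable of (src_host, byte_count) pairs, as would be
    parsed out of NetFlow-style logs upstream."""
    totals = Counter()
    for src, nbytes in flow_records:
        totals[src] += nbytes
    return totals.most_common(n)

flows = [("10.0.0.5", 4_000), ("10.0.0.9", 1_500), ("10.0.0.5", 2_500)]
print(top_offenders(flows))  # [('10.0.0.5', 6500)]
```

In the clustered deployment each worker computes partial `Counter`s over its log slice and the master merges them, which is why the job distributes so naturally.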
BATCH PROCESSING ARCHITECTURE
[Diagram: the frontend sends job requests to the master of the batch processing cluster, which distributes work to the worker server array reading from the data source; on a scale-up request the provisioning service uses the cloud API to add new servers; job results flow back to the frontend]
TERRACOTTA CLOUDWARE
• Clusters the JVM, not the application
  • Transparent clustering
  • Network-attached memory
  • Separation of application from infrastructure
• No new API
  • Java is the API
  • Java memory model
  • Java concurrency
[Diagram: scale-out — a Terracotta server clusters the JVMs of multiple app servers, each running a web app with business logic on frameworks in its own JVM]
TERRACOTTA MASTER-WORKER ARCHITECTURE
[Diagram: a master JVM and worker JVMs, each with a TC driver and its own heap, share state through the TC server over the TC communication layer]
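The master-worker pattern in the diagram can be sketched with plain thread-safe queues standing in for Terracotta's shared heap — an illustration of the pattern, not Terracotta's API (which shares ordinary Java objects instead):

```python
import queue
import threading

def run_master_worker(jobs, work_fn, num_workers=4):
    """Master puts jobs on a shared queue; workers pull, process, and push
    results back. In Terracotta, clustered heap objects play the role of
    these queues across JVMs; here they are in-process stand-ins."""
    job_q, result_q = queue.Queue(), queue.Queue()

    def worker():
        while True:
            job = job_q.get()
            if job is None:          # poison pill: shut this worker down
                break
            result_q.put(work_fn(job))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        job_q.put(job)
    for _ in threads:                # one poison pill per worker
        job_q.put(None)
    for t in threads:
        t.join()
    return sorted(result_q.get() for _ in jobs)

results = run_master_worker(range(5), lambda x: x * x)  # [0, 1, 4, 9, 16]
```

Terracotta's value is that the same queue objects, written in plain Java, become visible to worker JVMs on other machines with no serialization code in the application.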
SCHEDULER-BASED BATCH PROCESSING @ CLOUD
• Sun Grid Engine + AWS
  • When tasks are highly heterogeneous
  • For cloud bursting
  • Advanced resource management capabilities
  • Self-contained AMI to boot and self-organize the SGE cluster
  • SDM + EC2 adapters to grow and shrink the cluster depending on the work queue
• Univa UD
EXAMPLE: DNA SEQUENCER
• Problem: a DNA sequencer tool
  • Produces TBs of raw data in one experiment
  • Processed by an in-house SGE cluster
  • Refined to GBs after processing
  • Storage is cheap, but redundant geo-distributed storage is not
  • Frequent need to re-run processing of old experiments, ad hoc
  • Hard to allocate resources for ad-hoc runs; raw data may become unavailable
• Solution: SGE + AWS
  • Raw data from the tool is FedExed to Amazon and uploaded to S3
  • Run an ad-hoc SGE cluster in the cloud to re-process (same codebase as in-house)
  • SGE workers read data from and store results to S3
  • Consume refined results: either download directly, or FedEx back to the labs
RIGHTGRID: THE CLOUD WAY FOR BATCH PROCESSING
• An easy way to utilize the full power of cloud computing
  • Dynamic SLA-based scaling of worker machines
  • Truly scalable storage
  • Truly scalable messaging
• RightGrid offers a lightweight yet powerful framework
  • EC2 as the worker pool, S3 as mediated storage, SQS as messaging
  • Ruby-based framework for JobProducer, JobConsumer, message codec, etc.
  • Designed to wrap and run arbitrary code on worker nodes
  • Transient and persistent worker execution models
  • Failover, error reporting and audit
  • Custom scaling policies
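The failover and audit behaviour above can be sketched as a worker loop that re-queues failed jobs a bounded number of times. `drain_with_retries` and every name in it are illustrative — this is the pattern, not RightGrid's Ruby API:

```python
import queue

def drain_with_retries(job_q, work_fn, max_attempts=3):
    """Pull (job_id, payload) jobs from a queue and run them, re-queueing
    failures so an erroring job is retried rather than lost; jobs that
    exhaust their attempts are reported for auditing."""
    results, failed, attempts = {}, [], {}
    while not job_q.empty():
        job_id, payload = job_q.get()
        try:
            results[job_id] = work_fn(payload)
        except Exception:
            attempts[job_id] = attempts.get(job_id, 0) + 1
            if attempts[job_id] < max_attempts:
                job_q.put((job_id, payload))   # retry later
            else:
                failed.append(job_id)          # give up, report for audit
    return results, failed

q = queue.Queue()
for i, doc in enumerate(["ok", "ok", "bad"]):
    q.put((i, doc))

def convert(doc):
    if doc == "bad":
        raise ValueError("unconvertible document")
    return doc.upper()

results, failed = drain_with_retries(q, convert)
# results == {0: 'OK', 1: 'OK'}, failed == [2]
```

In the real system SQS's visibility timeout provides the re-queueing: a message a crashed worker never deletes simply reappears for another worker.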
RIGHTGRID ARCHITECTURE
EXAMPLE: DOCUMENT CONVERSION
• Problem
  • A publishing house needs to convert its document repository to a standard format for later indexing
  • All kinds of document formats to be rendered as PDF documents
  • Once-in-a-blue-moon job
• Solution
  • Use Amazon EC2 and RightScale's RightGrid framework
  • Document storage FedExed to Amazon, uploaded to S3
  • Documents converted by an application built on top of the RightGrid framework
  • Converted documents stored on S3
  • The resulting document pack is FedExed from Amazon to the customer
EXCEL ANALYTICS @ CLOUD
WHY EXCEL @ CLOUD?
• Ubiquitous
  • Financial analysts think in Excel
  • Excel + VBA is the current financial analyst's IDE
  • For many financial institutions, Excel is the main data analysis tool
  • Used by analysts and engineers
• Limited programming model
  • Single-threaded, memory-limited, not very performant
• Need to run large Excel workloads
  • Parallelization of workload and data is the only way out
  • On-demand infrastructure to run parallel Excel
MOVING EXCEL TO THE CLOUD
• Calculation flow
  • A DAG of calculation units (macro, UDF, workbook recalc)
  • Representable as a "DAG table", i.e. a task dependency table
• Data flow
  • The workbook is the system of record and the data synchronization point
  • Moving whole workbooks around is costly; moving data deltas is essential
  • Template regions are used to capture input and output parameters
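Evaluating such a "DAG table" is a walk of the units in topological order, so every unit sees up-to-date inputs. A minimal sketch — the workbook cells and formulas below are hypothetical:

```python
from graphlib import TopologicalSorter

def evaluate_dag(deps, compute):
    """Evaluate calculation units in dependency order.

    deps: {unit: set of units it depends on} — the "DAG table".
    compute: fn(unit, inputs) -> value, where inputs maps each
    dependency to its already-computed value."""
    values = {}
    for unit in TopologicalSorter(deps).static_order():
        values[unit] = compute(unit, {d: values[d] for d in deps.get(unit, ())})
    return values

# Hypothetical workbook: two input cells feed a recalc formula.
deps = {"total": {"price", "qty"}}
formulas = {
    "price": lambda inp: 10.0,
    "qty": lambda inp: 3,
    "total": lambda inp: inp["price"] * inp["qty"],
}
values = evaluate_dag(deps, lambda unit, inp: formulas[unit](inp))
# values["total"] == 30.0
```

The same ordering tells the scheduler which units may run in parallel: any units with no path between them in the DAG can be dispatched to different compute nodes at once.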
MOVING EXCEL TO CLOUD: DEPLOYMENT
[Diagram: user PCs (MS Windows & Excel) on the customer premises (1) submit a job through a web server; the workbook is (2) staged in to a staging server in the cloud (private or public) over a private link or the internet; the scheduler (3) submits tasks to compute nodes (MS Windows & Excel); results are (4) staged out, via an HTTP or FTP server for public clouds]
FUTURE OF CLOUD HPC
Specialized IaaS and PaaS offerings for HPC
• Bare metal with provisioning on demand
• Integrated HPC engines
• Math services
• Domain specific reference data services
Thank You!
Victoria Livschitz
CEO, Grid Dynamics