Rev Up Your HPC Engine
Fritz Ferstl, CTO Univa Corp, ff[email protected]
Jan 17, 2015
Who is Univa?
Copyright © 2014 Univa Corporation. All Rights Reserved.
• Profile
  • Based in Chicago, global reach
  • >500 customers in 3 yrs (mostly Fortune 500)
• Products / Technologies:
  • Univa Grid Engine
  • UniSight
  • Univa License Orchestrator
  • UniCloud
Data Center Automation Experts
Do more with less in Big Compute and Big Data
Help organizations play a better game of Tetris
Challenges for Workload and Resource Management Systems
Scalability
• Node counts stay flat or go down, sockets stay flat, cores explode
  • With the core explosion, the number of jobs also explodes
• Ever shorter run-times, more applications, more use cases
• Large commercial sites approach or go beyond 100K
• Throughput clusters process >150 million jobs / month
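To put the throughput figure in perspective, the arithmetic below converts ">150 million jobs / month" into an average dispatch rate. It is only a back-of-the-envelope sketch: the 30-day month and the function names are assumptions made for illustration, and real clusters see bursts well above the average.

```python
# Back-of-the-envelope conversion of a monthly job count into a sustained rate.
# Assumption (not from the slides): a 30-day month with evenly spread load.
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_jobs_per_second(jobs_per_month: float, days_per_month: int = 30) -> float:
    """Average number of jobs that must be dispatched every second."""
    return jobs_per_month / (days_per_month * SECONDS_PER_DAY)


def dispatch_budget_ms(jobs_per_month: float, days_per_month: int = 30) -> float:
    """Average time per job, in milliseconds, if dispatches were strictly serialized."""
    return 1000.0 / sustained_jobs_per_second(jobs_per_month, days_per_month)


rate = sustained_jobs_per_second(150e6)   # roughly 57.9 jobs per second
budget = dispatch_budget_ms(150e6)        # roughly 17.3 ms per job
```

At this rate the scheduler has, on average, well under 20 ms per job for matching, dispatch, and accounting, which is why ultra-short jobs stress the architecture.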
Heterogeneity
• Hardware
  • Multi-sockets, multi-cores
  • Partial cluster upgrades
  • Evolving memory, network and storage architectures
  • Accelerators: GPUs, Phi
• Job Profiles
  • Throughput
  • Array Jobs
  • Large Parallel
  • Interactive
  • Sessions
  • Reservations
  • Transactional
  • Hybrid
  • Dependencies, Workflows
Policy Variety
• Automated Transparency?
• Manual overrides
• Preferential access
• Priorities
• Reservations
• Resource Urgencies
• Quotas
• Deadlines
• Conflict Resolution
  • E.g. don't starve large parallel plus maintain high utilization
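One widely used way to reconcile "don't starve large parallel" with "maintain high utilization" is backfilling around a reservation. The sketch below is a simplified, single-resource illustration of that idea (often called EASY backfill); it is not taken from the slides and is not how any particular product implements it. The `Job` type, the `easy_backfill` function, and the slot and runtime units are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    slots: int    # slots (cores) requested
    runtime: int  # requested run-time limit, in arbitrary time units


def easy_backfill(free_slots, running, queue, now=0):
    """Return the names of queued jobs to start now.

    running: list of (end_time, slots) for jobs already executing.
    queue:   pending jobs in priority order.
    The first job that does not fit gets a reservation; lower-priority jobs
    may only start if they cannot delay that reservation.
    """
    free = free_slots
    running = list(running)
    pending = list(queue)
    started = []

    # 1. Start jobs in priority order for as long as they fit.
    while pending and pending[0].slots <= free:
        job = pending.pop(0)
        free -= job.slots
        running.append((now + job.runtime, job.slots))
        started.append(job.name)
    if not pending:
        return started

    # 2. Reserve for the blocked head job: find the earliest time at which
    #    enough slots will have been released.
    head = pending[0]
    available, reservation_start = free, None
    for end_time, slots in sorted(running):
        available += slots
        if available >= head.slots:
            reservation_start = end_time
            break
    if reservation_start is None:
        return started  # head job is larger than the cluster; not handled here
    spare_at_reservation = available - head.slots

    # 3. Backfill: a job may jump ahead if it ends before the reservation
    #    starts, or if it only uses slots the head job will not need.
    for job in pending[1:]:
        if job.slots > free:
            continue
        if now + job.runtime <= reservation_start:
            free -= job.slots
            started.append(job.name)
        elif job.slots <= spare_at_reservation:
            free -= job.slots
            spare_at_reservation -= job.slots
            started.append(job.name)
    return started
```

For example, with 4 free slots, one running job holding 12 slots until time 10, and a queue of a 16-slot job followed by two 2-slot jobs with run-times 5 and 50, only the short 2-slot job starts: the long one would push back the large job's reservation.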
Use Case Variety
• Classical HPC (simulation): large parallel / many mid-size parallel
• Verification / Test: throughput
• From single simulation to parameter study array jobs
• Ultra-short jobs
• Big Data / Data Mining
• Exclusive usage of nodes vs shared usage
Geographical Distribution / Clouds
• Resource sharing: servers, licenses, data, other
• Data access latencies
• Security
• File system dependencies
  • Pre-/Post-Staging
• Data locality:
  • Bring the job to the data
  • Or bring the data to the job
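The choice between bringing the job to the data and bringing the data to the job can be framed as a comparison of expected completion delays. The sketch below uses a deliberately crude model, assumed purely for illustration: staging time is data volume divided by link bandwidth, and each site has a known queue wait. The function names and parameters are hypothetical.

```python
def staging_seconds(data_gb: float, link_gbit_per_s: float, latency_s: float = 0.0) -> float:
    """Time to copy the input data over the link (8 bits per byte)."""
    return latency_s + data_gb * 8 / link_gbit_per_s


def placement(data_gb, link_gbit_per_s, wait_at_data_site_s, wait_at_remote_site_s):
    """Pick the option with the shorter delay before the job can start.

    'job-to-data': queue at the site that already holds the data.
    'data-to-job': queue at the remote site and pre-stage the data there.
    """
    move_job = wait_at_data_site_s
    move_data = wait_at_remote_site_s + staging_seconds(data_gb, link_gbit_per_s)
    return "job-to-data" if move_job <= move_data else "data-to-job"


# 500 GB over a 10 Gbit/s link stages in 400 s, less than a 600 s queue wait.
small = placement(500, 10, wait_at_data_site_s=600, wait_at_remote_site_s=0)
# 5000 GB needs 4000 s of staging, so waiting at the data site wins.
large = placement(5000, 10, wait_at_data_site_s=600, wait_at_remote_site_s=0)
```

A real decision would also weigh security constraints, license availability, and post-staging of results, as the bullets above indicate.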
Solutions, Approaches, Best Practices
Evolve
• Architecture Evolution
  • More cores / nodes / jobs: make it faster
  • Integration with GPUs, Phi, etc.
• New Scheduling Algorithms
  • Efficient handling of job mixes: parallel / array / sequential jobs
  • Scheduling of ultra-short jobs
• More Monitoring, Better Error Tracking
• Reporting, Accounting & Analytics
Be Street-Smart
• Simplify where possible!
• Be-all solution can be the most expensive
  • Effort
  • Poor utilization, slow ROI
• Focus on most important goals
Think Different
• Examples:
  • Less HA at more throughput via fast SSD RAID with regular back-up
  • Use array jobs wherever possible
  • More smaller jobs vs fewer bigger jobs
  • All considered, preemption may be a good option
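As an illustration of the array-job advice, a parameter study can be submitted as a single array job instead of many individual jobs. In Grid Engine, `qsub -t 1-N` submits N tasks and each task sees its index in the `SGE_TASK_ID` environment variable. The sketch below only builds the command and maps a task index to a parameter combination; the helper names, the script name `run.sh`, and the example parameters are hypothetical.

```python
import os
from itertools import product


def qsub_array_command(script: str, n_tasks: int) -> list:
    """Build (but do not run) a Grid Engine array-job submission command."""
    return ["qsub", "-t", f"1-{n_tasks}", script]


def task_id_from_env(env=os.environ) -> int:
    """Read the 1-based task index that Grid Engine sets for each array task."""
    return int(env["SGE_TASK_ID"])


def parameters_for_task(task_id: int, temperatures, pressures):
    """Map a 1-based task index to one (temperature, pressure) combination."""
    combos = list(product(temperatures, pressures))
    return combos[task_id - 1]


# A 2 x 2 parameter study becomes one array job with four tasks.
temperatures, pressures = [300, 350], [1, 2]
command = qsub_array_command("run.sh", len(temperatures) * len(pressures))
```

The scheduler then tracks one job object with four tasks rather than four separate jobs, which reduces submission and bookkeeping overhead.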
Accept Difference
• Simple: temporarily designate parts of cluster
• Advanced: Cloud-share
  • Share resources across separate workload management system instances
  • Dynamically re-assign resources (servers) based on demand
  • Provides autonomy while maintaining high utilization
  • But avoid meta-scheduling where you can!
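Demand-based re-assignment of servers between separate workload manager instances could look roughly like the sketch below. It is an assumed policy for illustration only, not a description of any product: each instance keeps a guaranteed minimum (its autonomy), and spare servers go one at a time to the instance with the largest unmet demand. The function name and the instance names are hypothetical.

```python
def share_servers(total: int, guaranteed: dict, demand: dict) -> dict:
    """Assign servers to workload manager instances.

    guaranteed: servers each instance always keeps, regardless of demand.
    demand:     servers each instance could use right now.
    Returns a dict mapping instance name to its number of servers.
    """
    if sum(guaranteed.values()) > total:
        raise ValueError("guarantees exceed the number of servers")
    allocation = dict(guaranteed)
    spare = total - sum(allocation.values())
    while spare > 0:
        unmet = {name: demand[name] - allocation[name]
                 for name in allocation if demand[name] > allocation[name]}
        if not unmet:
            break  # nobody needs more; remaining servers stay idle
        # Largest unmet demand wins; ties are broken alphabetically.
        neediest = max(sorted(unmet), key=unmet.get)
        allocation[neediest] += 1
        spare -= 1
    return allocation


busy = share_servers(10, {"eda": 2, "sim": 2}, {"eda": 9, "sim": 3})
quiet = share_servers(10, {"eda": 2, "sim": 2}, {"eda": 3, "sim": 1})
```

Each instance still schedules its own jobs on whatever servers it currently holds, so no meta-scheduler has to make per-job decisions.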
Tailored Solutions
• Tailoring & add-ons can make all the difference
• Tailoring such as
  • Job Classes
  • Customized reports
• Add-ons such as
  • Submission portals and wrappers
Conclusions
• Workload & Resource Management Systems are needed more than ever
• Specifically in the “new” era of Cloud and Big Data
• Allows you to benefit from 20+ years of experience in HPC workload orchestration and to move beyond
• Clear-cut set of challenges, non-trivial solutions
• Build on best-in-class products, architectures and development teams
• Being “street-smart” about architecting and configuring a cluster has a big impact
Thank You
http://[email protected]