Rev Up Your HPC Engine Fritz Ferstl, CTO Univa Corp, ff[email protected]

Rev Up Your HPC Engine

Jan 17, 2015


Technology

insideHPC

In this slidecast, Fritz Ferstl from Univa presents: Rev Up Your HPC Engine. The presentation explores the challenges for Workload Management systems in today's datacenters with ever-increasing core counts.

See the presentation video and the full transcript: http://wp.me/p3RLHQ-cjs
Transcript
Page 1: Rev Up Your HPC Engine

Rev Up Your HPC Engine
Fritz Ferstl, CTO Univa Corp, [email protected]

Page 2: Rev Up Your HPC Engine

Who is Univa?

Copyright © 2014 Univa Corporation. All Rights Reserved.

• Profile
  • Based in Chicago, global reach
  • >500 customers in 3 yrs (mostly Fortune 500)

• Products / Technologies:
  • Univa Grid Engine
  • UniSight
  • Univa License Orchestrator
  • UniCloud

Data Center Automation Experts

Do more with less in Big Compute and Big Data

Help organizations play a better game of Tetris

Page 3: Rev Up Your HPC Engine

Challenges for Workload and Resource Management Systems


Page 4: Rev Up Your HPC Engine

Scalability

• Node counts stay flat or go down, sockets stay flat, cores explode

• With the core explosion, the number of jobs also explodes

• Ever shorter run-times, more applications, more use cases

• Large commercial sites approach or go beyond 100K

• Throughput clusters process >150 million jobs / month
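To put the 150 million jobs / month figure in perspective, the short calculation below converts it into a sustained dispatch rate. It is only a back-of-the-envelope sketch: the 30-day month and the assumption of an evenly spread load are mine, not the presenter's, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(jobs_per_month, days_per_month=30):
    """Average jobs per second, assuming the load is spread evenly (an assumption)."""
    return jobs_per_month / (days_per_month * SECONDS_PER_DAY)


def dispatch_budget_ms(jobs_per_month, days_per_month=30):
    """Average time in milliseconds available per job if jobs are handled one after another."""
    return 1000.0 / sustained_rate(jobs_per_month, days_per_month)


if __name__ == "__main__":
    jobs = 150_000_000
    print(f"{sustained_rate(jobs):.2f} jobs/s")           # 57.87 jobs/s
    print(f"{dispatch_budget_ms(jobs):.2f} ms per job")   # 17.28 ms per job
```

Under these assumptions the scheduler has to accept, dispatch and account for roughly 58 jobs every second around the clock. Real load is bursty, so peak rates would be higher.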


Page 5: Rev Up Your HPC Engine

Heterogeneity


• Hardware
  • Multi-sockets, multi-cores
  • Partial cluster upgrades
  • Evolving memory, network and storage architectures
  • Accelerators: GPUs, Phi

• Job Profiles
  • Throughput
  • Array Jobs
  • Large Parallel
  • Interactive
  • Sessions
  • Reservations
  • Transactional
  • Hybrid
  • Dependencies, Workflows

Page 6: Rev Up Your HPC Engine

Policy Variety


• Automated Transparency?

• Manual overrides

• Preferential access

• Priorities

• Reservations

• Resource Urgencies

• Quotas

• Deadlines

• Conflict Resolution
  • E.g. don't starve large parallel jobs, yet maintain high utilization
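One common way to reconcile "don't starve large parallel jobs" with "maintain high utilization" is backfilling around a reservation. The sketch below is a simplified illustration of that general idea, not Univa Grid Engine's actual scheduler. The `Job` class, the `backfill_step` function and the assumption that runtime estimates are accurate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    runtime: int  # runtime estimate in arbitrary time units (assumed accurate)


def backfill_step(total_cores, running, queue, now=0):
    """One scheduling pass.

    running: list of (end_time, cores) for jobs already executing.
    queue:   list of Job in priority order (highest first).
    Returns (names_started_now, reservation_time_of_blocked_head_or_None).
    """
    free = total_cores - sum(cores for _, cores in running)
    running = list(running)
    pending = list(queue)
    started = []

    # Start jobs strictly in priority order for as long as they fit.
    while pending and pending[0].cores <= free:
        job = pending.pop(0)
        free -= job.cores
        running.append((now + job.runtime, job.cores))
        started.append(job.name)
    if not pending:
        return started, None

    head = pending[0]
    if head.cores > total_cores:
        raise ValueError(f"{head.name} can never fit on this cluster")

    # Reserve the earliest time at which enough cores are free for the head job.
    available = free
    reservation = None
    for end_time, cores in sorted(running):
        available += cores
        if available >= head.cores:
            reservation = end_time
            break

    # Backfill: lower-priority jobs may use idle cores only if they are
    # expected to finish before the reservation, so the head job is not delayed.
    for job in pending[1:]:
        if job.cores <= free and now + job.runtime <= reservation:
            free -= job.cores
            started.append(job.name)
    return started, reservation


if __name__ == "__main__":
    queue = [Job("big_mpi", 16, 100), Job("small_a", 2, 5),
             Job("long_b", 4, 20), Job("small_c", 2, 8)]
    print(backfill_step(16, [(10, 12)], queue))  # (['small_a', 'small_c'], 10)
```

In this example the 16-core job keeps its start time of 10 while the four idle cores are used by two short jobs. The longer `long_b` job is held back because it would run past the reservation.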

Page 7: Rev Up Your HPC Engine

Use Case Variety


• Classical HPC (simulation): large parallel / many mid-size parallel

• Verification / Test: throughput

• From single simulation to parameter study: array jobs

• Ultra-short jobs

• Big Data / Data Mining

• Exclusive usage of nodes vs shared usage

Page 8: Rev Up Your HPC Engine

Geographical Distribution / Clouds


• Resource sharing: servers, licenses, data, other

• Data access latencies

• Security

• File system dependencies
  • Pre-/Post-Staging

• Data locality:
  • Bring the job to the data
  • Or bring the data to the job

Page 9: Rev Up Your HPC Engine

Solutions, Approaches, Best Practices


Page 10: Rev Up Your HPC Engine

Evolve

• Architecture Evolution
  • More cores / nodes / jobs: make it faster
  • Integration with GPUs, Phi, etc.

• New Scheduling Algorithms
  • Efficient handling of job mixes: parallel / array / sequential jobs
  • Scheduling of ultra-short jobs

• More Monitoring, Better Error Tracking

• Reporting, Accounting & Analytics


Page 11: Rev Up Your HPC Engine

Be Street-Smart

• Simplify where possible!

• Be-all solution can be the most expensive
  • Effort
  • Poor utilization leads to slow ROI

• Focus on most important goals


Page 12: Rev Up Your HPC Engine

Think Different

• Examples:
  • Less HA in exchange for more throughput, via fast SSD RAID with regular back-ups
  • Use array jobs wherever possible
  • More smaller jobs vs fewer bigger jobs
  • All considered, preemption may be a good option
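As an illustration of the array-job advice, the sketch below shows how one array job can cover a whole parameter study: each task reads its index from the `SGE_TASK_ID` environment variable that Grid Engine sets for array tasks, and maps it to one parameter combination. The parameter grid, the function names and the script name in the usage note are hypothetical.

```python
import itertools
import os


def task_parameters(task_id, grid):
    """Map a 1-based array task index to one combination of the parameter grid."""
    names = sorted(grid)
    combos = list(itertools.product(*(grid[name] for name in names)))
    if not 1 <= task_id <= len(combos):
        raise ValueError(f"task id {task_id} outside 1..{len(combos)}")
    return dict(zip(names, combos[task_id - 1]))


def current_task_id(environ=os.environ):
    """Read the array task index; fall back to 1 outside an array job."""
    raw = environ.get("SGE_TASK_ID", "1")
    # For non-array jobs the variable is not a number (e.g. "undefined").
    return int(raw) if raw.isdigit() else 1


if __name__ == "__main__":
    # Hypothetical study: 2 mesh sizes x 3 viscosities = 6 tasks.
    grid = {"mesh": [64, 128], "viscosity": [0.1, 0.2, 0.3]}
    print(task_parameters(current_task_id(), grid))
```

Submitted once with something like `qsub -t 1-6 run_study.sh` (script name assumed), this replaces six separate submissions with a single job entry that the scheduler has to track.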


Page 13: Rev Up Your HPC Engine

Accept Difference

• Simple: temporarily designate parts of cluster

• Advanced: Cloud-share
  • Share resources across separate workload management system instances
  • Dynamically re-assign resources (servers) based on demand
  • Provides autonomy while maintaining high utilization

• But avoid meta-scheduling where you can!


Page 14: Rev Up Your HPC Engine

Tailored Solutions

• Tailoring & add-ons can make all the difference

• Tailoring such as
  • Job Classes
  • Customized reports

• Add-ons such as
  • Submission portals and wrappers


Page 15: Rev Up Your HPC Engine

Conclusions

• Workload & Resource Management Systems are needed more than ever

• Specifically in the “new” era of Cloud and Big Data

• Allows you to benefit from 20+ years of experience in HPC workload orchestration and to move beyond

• Clear-cut set of challenges, non-trivial solutions

• Build on best-in-class products, architectures and development teams

• Being “street-smart” about architecting and configuring a cluster has a big impact


Page 16: Rev Up Your HPC Engine

Thank You
http://[email protected]
