Identifying Opportunities to Improve Efficiency in HPC Clusters Jordi Blasco Co-founder & CTO HPC Advisory Council - Perth - August 2018

Feb 17, 2019

Transcript
Page 1

Identifying Opportunities to Improve Efficiency in HPC Clusters

Jordi Blasco, Co-founder & CTO

HPC Advisory Council - Perth - August 2018

Page 2

Quick introduction to HPCNow!

Identifying Opportunities to Improve Efficiency in HPC Clusters - Jordi Blasco

● Global HPC consulting company
● IT + scientific background
● HPC services and solutions
● User-oriented company
● Hardware agnostic

Page 3

System Administrators and User Support

Top500 Supercomputer Users

Page 4

Contributions to HPC Community

Page 5

IISW

Page 6

Public sector Private Companies

Page 7

Motivation

Page 8

Motivation: Are you familiar with these issues?

1. Risk of user dissatisfaction
2. Longer waiting times
3. More IO-demanding workflows
4. Hardware no longer supported
5. Cluster contention

Page 9

Motivation: Potential Solutions

Buy a New Cluster
Large procurements usually involve a long and complex RfP process.

Page 10

Motivation: Potential Solutions

Buy a New Cluster
Large procurements usually involve a long and complex RfP process.

Use Cloud
Extending the current compute capacity through cloud bursting is definitely the best option for accommodating peaks in demand. Unfortunately, ongoing, regular usage becomes expensive.

Page 11

Use Cloud: Cloud Bursting Capabilities - Hybrid Cloud

Page 12

Motivation: Potential Solutions

Buy a New Cluster
Large procurements usually involve a long and complex RfP process.

Use Cloud
Extending the current compute capacity through cloud bursting is definitely the best option for accommodating peaks in demand. Unfortunately, ongoing, regular usage becomes expensive.

Improve Efficiency
By improving performance and efficiency, you effectively free up allocation for new jobs with zero investment in hardware capacity.

Page 13

Improve Job Efficiency: Impact on Job Allocation and Resource Availability
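
As a rough illustration of that impact, the sketch below estimates the core-hours released when a workload delivers the same useful work at a higher efficiency. The function name, the constant-useful-work assumption, and the sample numbers are illustrative assumptions, not figures from the talk.

```python
def freed_core_hours(allocated_core_hours, eff_before, eff_after):
    """Core-hours released when the same useful work runs at a higher efficiency.

    Assumption (illustrative only): useful work = allocated core-hours * efficiency,
    and the amount of useful work does not change after tuning.
    """
    if not (0 < eff_before <= 1 and 0 < eff_after <= 1):
        raise ValueError("efficiencies must be in the range (0, 1]")
    useful = allocated_core_hours * eff_before
    needed_after = useful / eff_after
    return allocated_core_hours - needed_after


# Hypothetical workload: 1,000,000 core-hours at 50% efficiency, tuned to 80%.
print(round(freed_core_hours(1_000_000, 0.50, 0.80)))
```

Under these assumptions, the released core-hours become available to other jobs in the queue without any new hardware.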

Page 14

How to identify tuning opportunities easily?

Page 15

Job Efficiency Monitoring

Traditional tools like Ganglia are not capable of representing the metrics required to identify inefficient jobs.
● based load monitoring
● no link to user
● no link to job
● no link to other nodes allocated for the same job
● no information regarding efficiency in the allocated resources

Page 16

Job Efficiency Monitoring Requirements

1. Review resources requested vs used
Key fundamental metrics to understand how well the requested resources are used.

2. Real Time Monitoring
Enables proactive job profiling and makes it possible to trigger real-time actions.

3. 30 Seconds Resolution
A 30-second resolution is quite reasonable for the majority of HPC workloads.

The main goal is to identify opportunities to improve user workflows, user codes and applications, and also to catch user mistakes.

Given the huge number of jobs and the large number of nodes, the solution requires a big data strategy.
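
A minimal sketch of the "requested vs used" comparison follows. The record layout, the field names, and the 50% threshold are assumptions made for illustration; the talk does not specify how its collectors compute these ratios.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    """Hypothetical per-job record; all field names are assumptions."""
    cores_requested: int
    walltime_used_s: float
    cpu_time_used_s: float   # CPU seconds summed over all allocated cores
    mem_requested_mb: float
    mem_peak_mb: float


def cpu_efficiency(job):
    """Fraction of the allocated core-seconds actually spent on CPU."""
    allocated = job.cores_requested * job.walltime_used_s
    return job.cpu_time_used_s / allocated if allocated else 0.0


def mem_efficiency(job):
    """Peak memory used as a fraction of the memory requested."""
    return job.mem_peak_mb / job.mem_requested_mb if job.mem_requested_mb else 0.0


def is_inefficient(job, threshold=0.5):
    """Flag a job when either ratio falls below the (assumed) threshold."""
    return cpu_efficiency(job) < threshold or mem_efficiency(job) < threshold


job = JobSample(cores_requested=16, walltime_used_s=3600,
                cpu_time_used_s=14400, mem_requested_mb=64000, mem_peak_mb=8000)
print(f"CPU {cpu_efficiency(job):.0%}, memory {mem_efficiency(job):.1%}, "
      f"flagged: {is_inefficient(job)}")
```

Evaluating such ratios on every sampling cycle, rather than only at job end, is what would allow a real-time action to be triggered while the job is still running.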

Page 17

Architecture

Page 18

Architecture

The large number of events to analyse requires the use of Big Data technologies. Data is gathered using custom codes and aggregated into Elasticsearch, an open source search and analytics engine with high reliability and proven scalability. Finally, the data is presented through Grafana and Kibana, which are leading tools for querying and visualizing large datasets and metrics.
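
To make the aggregation step concrete, the sketch below serializes job-level events into the newline-delimited JSON body that the Elasticsearch bulk API accepts (one action line followed by one document line per event). The index name and the event fields are hypothetical, and the talk's custom collectors may work differently; the example only builds the payload and does not contact a server.

```python
import json


def to_bulk_body(events, index_name):
    """Build an NDJSON payload for the Elasticsearch bulk API.

    Each event becomes two lines: an "index" action and the document itself.
    The payload must end with a newline.
    """
    if not events:
        return ""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Compact separators keep each document small on the wire.
        lines.append(json.dumps(event, separators=(",", ":")))
    return "\n".join(lines) + "\n"


# Hypothetical events tying a node sample to its job and user.
events = [
    {"@timestamp": "2018-08-01T00:00:00Z", "job_id": 1234,
     "user": "alice", "node": "node001", "cpu_pct": 97.5},
    {"@timestamp": "2018-08-01T00:00:00Z", "job_id": 1234,
     "user": "alice", "node": "node002", "cpu_pct": 12.0},
]
body = to_bulk_body(events, "hpc-job-metrics")
print(body)
```

Carrying the job and user identifiers in every document is what lets dashboards in Grafana or Kibana group node-level samples by job, which node-centric monitoring cannot do.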

Page 19

Traditional Stack Pipeline

Page 20

Custom Monitoring Stack Pipeline for HPC

Page 21

Custom Monitoring Stack for HPC prototype

Page 22

Standard vs Custom

Current statistics               Standard   Custom
Metrics per user (s)             8          8
Resolution (seconds)             30         30
Avg. events/cycle                380        380
Avg. size per package (bytes)    2000       400
Avg. TB/year in ElasticSearch    1.80       0.17
Theoretical limit (events/s)     50k        260k

Page 23

Standard vs Custom vs Prototype

Current statistics               Standard   Custom   Prototype
Metrics per user (s)             8          8        318
Resolution (seconds)             30         30       10
Avg. events/cycle                380        380      380
Avg. size per package (bytes)    2000       400      6800
Avg. TB/year in ElasticSearch    1.80       0.17     8.5
Theoretical limit (events/s)     50k        260k     15k

The prototype setup is based on LXD containers allocated across two bare metal nodes with 24 cores (Intel Haswell), 32 GB of memory, 2 TB of SSD disks and 1 Gb Ethernet.
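
The yearly storage figures in the table above can be sanity-checked with a back-of-envelope estimate: cycles per year times events per cycle times bytes per event. This is a sketch under stated assumptions, not the author's method: it treats "Avg. size per package" as the size of one event, uses decimal terabytes, and ignores any indexing or replication overhead, so the published figures need not match it exactly.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def raw_tb_per_year(events_per_cycle, bytes_per_event, resolution_s):
    """Raw ingest volume in decimal TB per year, with no index overhead."""
    cycles_per_year = SECONDS_PER_YEAR / resolution_s
    return cycles_per_year * events_per_cycle * bytes_per_event / 1e12


# (name, average bytes per event, resolution in seconds), taken from the table.
setups = [("Standard", 2000, 30), ("Custom", 400, 30), ("Prototype", 6800, 10)]
for name, size, resolution in setups:
    print(f"{name}: {raw_tb_per_year(380, size, resolution):.2f} TB/year")
```

The estimate makes the trade-off visible: shrinking the event payload reduces storage proportionally, while moving from a 30-second to a 10-second resolution triples the number of cycles to store.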

Page 24

Job Efficiency Monitoring (prototype): Additional Metrics and Features

● Going down to 10 seconds resolution
● Job level usage
● Task level usage
● Allocated CPU usage
● Memory usage
● IPC
● Disk IO
● Network (InfiniBand)
● Cluster File System
● Read / Write calls
● inodes updates
● MB write / read
● Open / Close requests
● Walltime used / requested
● Memory used / requested
● Retention / purging policy
● Alerts and event correlation
● MPI stats (collectives)

Page 25

Need to Scale Up?

● Migrate to 10 Gb Ethernet, which could increase the number of events digested tenfold
● Use buffers, which could increase the number of events digested tenfold
● Add more Elasticsearch nodes, for virtually unlimited scalability
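
The buffering option can be sketched as a small in-memory batcher that forwards events in groups instead of one at a time. The class name, the callback interface, and the batch size are assumptions for illustration; the talk does not describe which buffering component was used.

```python
class EventBuffer:
    """Collect events and hand them to a flush callback in batches."""

    def __init__(self, flush, max_events=500):
        self._flush = flush            # callable that receives a list of events
        self._max_events = max_events
        self._pending = []

    def add(self, event):
        """Queue one event and flush automatically when the batch is full."""
        self._pending.append(event)
        if len(self._pending) >= self._max_events:
            self.flush()

    def flush(self):
        """Send whatever is pending; does nothing when the buffer is empty."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._flush(batch)


# Usage: gather batches in a list instead of sending them over the network.
batches = []
buf = EventBuffer(batches.append, max_events=3)
for i in range(7):
    buf.add({"seq": i})
buf.flush()  # push the final partial batch
print([len(b) for b in batches])
```

Batching reduces the number of requests the indexing tier has to handle, which is one plausible reason buffering raises the sustainable event rate.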

Page 26

No performance penalty based on HPL results
Additional Metrics and Features

Page 27

Custom Monitoring Stack for HPC prototype

Page 28

Most Relevant Case Studies

Case Study              CPUTime (h)   Output
VASP user               4,265,883     Improved efficiency from 74% to 97.6%
ORCA user               2,300,033     Improved efficiency (from 18% to 87%) and IO
R code user             1,670,402     Resilience issues (zombie tasks)
Fluent workflow user    1,401,825     Improved efficiency by 200%
Ansys Fluent user       1,391,951     5x efficiency (100 vs 500 cores) + resilience
OMNeT++ user            1,253,462     Improved efficiency from 6% to 96%
Custom CESM user        1,093,184     Improved efficiency from 1% to 98%

Page 29

Conclusions

Page 30

Conclusions: Thanks to the job efficiency monitoring we have been able to improve

● Efficiency
● Scalability
● Performance

Page 31

Conclusions: Thanks to the job efficiency monitoring we have also been able to

● Detect user mistakes early
● Avoid massive waste of CPU time
● Improve user workflows
● Accelerate research
● Improve reliability
● Improve user satisfaction

Page 32

“The best way to predict the future is to invent it.” -- Alan Kay

Page 33

[email protected]

www.hpcnow.com

Marie Curie, 8 - 08042 Barcelona (Spain)

34 Fernly Rise, 2019 Auckland (New Zealand)

Barcelona

Auckland