A Ping Too Far Real World Network Latency Measurement Gary Jackson JHU/APL Work done while at the University of Maryland Pete Keleher and Alan Sussman University of Maryland Department of Computer Science

Jan 01, 2016



A Ping Too Far: Real World Network Latency Measurement

Gary Jackson

JHU/APL

Work done while at the University of Maryland

Pete Keleher and Alan Sussman

University of Maryland

Department of Computer Science


Introduction

Context: Peer-to-peer HPC resource discovery and management

Goal
– Collect a high-quality all-to-all network latency map
– Campus or department scale, including HPC resources
– As opposed to Internet-scale, which is well-trod ground

Purpose
– Compare latency prediction techniques
– Increase the fidelity of peer-to-peer system simulations

Solved many problems
– Technical solutions to technical and policy obstacles

Managed only partial success
– Could not get measurements on more than one HPC-equipped cluster, so the data set is not useful to us
– But maybe the data set is useful to someone else


Four Policy Challenges

1. Where to measure?
– Ask for access ✖
– Compel stakeholders ✖
– Find existing resources that meet needs ✔

2. Work around policy obstacles
– Cannot run persistent daemons on resources

3. Minimal change
– Cannot ask for significant changes to environment or other policies

4. Non-disruptive
– Use of resources cannot disrupt other users


Five Technical Challenges

1. Load interferes with measurement

2. User-level programs on both ends

3. Quick measurements

4. Quality measurements

5. Fix technical obstacles


The Plan

Use local resources

UMIACS HTCondor Pool

– ~160 nodes spread out over several clusters

– Two clusters equipped with InfiniBand (IB)

– "Backfills" HTC jobs onto clusters managed with TORQUE

OSU MPI microbenchmarks

Distributed system to schedule & collect measurements


Particulars of the Environment

Scheduling
– Cannot schedule arbitrary pairs of nodes in HTCondor

Static Slots
– 1 job per slot
– 1 slot per core
– All slots must be controlled for exclusive measurement

Node Heterogeneity

Lesson Learned: Compute environment exists to support somebody’s research, but maybe not yours


Aside: Load Affects Network Latency

Space-sharing application model

Measurements between two IB connected nodes

Varied CPU load

Higher load leads to
– Increased latency
– Unpredictable latency

Lesson Learned: Environment for measurement should match model environment


Solved Technical Obstacles

OpenMPI is finicky about OS & libraries
– Build OpenMPI separately for every single host

OpenMPI over TCP mysteriously hangs
– Bogus bridge interface for virtualization
– Tell OpenMPI not to use it

User limits for mapped memory prevent RDMA over IB
– Had to modify HTCondor init script

IB library provided by OS didn't work
– Had to build it ourselves on Cluster E
– First hint that something was really wrong

Lesson Learned: There are going to be a lot of little problems along the way.
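Two of the fixes on this slide lend themselves to a short sketch. The slides give neither the name of the bridge interface nor the command lines used, so `virbr0`, the host names, and both helper names below are illustrative assumptions. `btl_tcp_if_exclude` is the Open MPI MCA parameter that keeps the TCP transport off the listed interfaces, and the sketch assumes the "user limit for mapped memory" is the locked-memory limit (`RLIMIT_MEMLOCK`) that RDMA memory registration usually depends on.

```python
import resource


def mpirun_command(hosts, program, excluded_ifaces=("lo", "virbr0")):
    """Build an mpirun argv that keeps Open MPI's TCP transport off some interfaces.

    "virbr0" is an assumed name for the virtualization bridge. Loopback is
    listed too, because setting btl_tcp_if_exclude replaces the default list.
    """
    return [
        "mpirun",
        "--mca", "btl_tcp_if_exclude", ",".join(excluded_ifaces),
        "--host", ",".join(hosts),
        "-np", str(len(hosts)),
        program,
    ]


def memlock_is_unlimited(limits=None):
    """Return True if the soft locked-memory limit is unlimited.

    `limits` is an optional (soft, hard) pair, mainly for testing; by default
    the limits of the current process are read.
    """
    if limits is None:
        limits = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    soft, _hard = limits
    return soft == resource.RLIM_INFINITY


if __name__ == "__main__":
    # Hypothetical host names; osu_latency is the OSU point-to-point latency test.
    print(" ".join(mpirun_command(["nodeA", "nodeB"], "./osu_latency")))
    print("memlock unlimited:", memlock_is_unlimited())
```

Running the limit check from inside a job, rather than from a login shell, matters here: the limit that counts is the one the job inherits from the HTCondor daemon, which is why the init script had to change.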


Solved Policy Obstacles

Local Resource Management Systems
– Cannot schedule arbitrary pairs of nodes
– Cannot run processes outside of HTCondor & TORQUE
– Cannot ask to change the way resources are allocated
– Solution: Built distributed system to schedule & collect measurements

Accounts
– Cannot get accounts on some systems
– Solution: Workaround to start OpenMPI daemon processes on both ends without SSH

Lesson Learned: Sometimes, there are technical solutions to policy problems.


Setting the Stage for Failure

Cluster E
– One of the two clusters in pool with IB
– Upstream IB libraries from OS vendor didn’t work
– IB used exclusively for IPoIB to support Lustre
– Nodes have a large amount of memory

OpenMPI processes crashing
– Despite rebuild of IB libraries from hardware vendor


Fatal Obstacle

IB driver has tunable parameter to adjust the amount of memory that can be mapped (64GB)

Nodes have twice that physical memory (128GB)

The mappable limit needs to be twice the physical memory size (256GB)

OS vendor has no guidelines for adjusting that value

Unknown impact on Lustre filesystem using IPoIB

So this can't be fixed

Lesson Learned: Sometimes there’s nothing you can do.
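The arithmetic behind the obstacle can be sketched as follows. The slides do not name the driver or its parameter, so this assumes a driver in the style of Mellanox `mlx4_core`, where registerable memory is `2^log_num_mtt * 2^log_mtts_per_seg * page_size`. The specific parameter values are hypothetical, chosen only to reproduce the 64GB figure, and GB is treated as GiB.

```python
GIB = 1024 ** 3


def registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Memory the driver can map, under the assumed mlx4-style formula."""
    return (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size


def required_bytes(physical_bytes, factor=2):
    """Mappable memory needed: twice physical memory, per the slide."""
    return factor * physical_bytes


if __name__ == "__main__":
    current = registerable_bytes(log_num_mtt=24, log_mtts_per_seg=0)  # assumed values
    needed = required_bytes(128 * GIB)
    print(f"mappable now: {current // GIB} GiB")
    print(f"needed:       {needed // GIB} GiB")
    print(f"shortfall:    {needed // current}x")
```

Under these assumptions the limit would have to be raised fourfold, which is the change nobody could vouch for on nodes that also serve Lustre over IPoIB.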


Cannot Lay Blame

Sysadmins?
– No, they made a conservative decision to support primary stakeholders

IB vendor?
– Driver right from the IB vendor probably would have worked

OS vendor?
– Supports what they intended to support (IPoIB)

Me?
– Using native RDMA over IB isn't asking too much

Lesson Learned: Sometimes it’s no one’s fault.


Results

Ping is not a good predictor of application-level latency
– Tends to over-estimate

Compared latency prediction techniques
– Distributed Tree Metric (DTM)
– Vivaldi
– Global Network Positioning

Result: DTM continues to perform better than the other techniques


Takeaway

IF you are building a big system/thesis that will rely on many different systems/admin domains

THEN you need to check all the potential choke-points in advance

If the work is self-contained, this is much easier.

I should have tested MPI over IB RDMA on that cluster much earlier in the process.


Policy

Cannot ask for invasive changes to policy or implementation

Cannot disrupt HTCondor pool

Cannot interfere with TORQUE users

Cannot get accounts on compute nodes

Must be prepared for preemption

Lesson Learned: Policies exist to support someone’s research, but maybe not yours.


Seizing a Node

Submitter: query HTCondor and submit master & slave jobs

Node masters & slaves: seize exclusive control over a node

For a node with n slots
– Submit n-1 slaves
– Submit 1 master
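The slot arithmetic above can be sketched as a small planning function. This is only an illustration under assumptions: `seize_plan` and the node name are hypothetical, and the actual query and submission to HTCondor are left out.

```python
def seize_plan(node, slots):
    """List the jobs needed to occupy every slot on one node.

    One master plus (slots - 1) slaves, so that no other job can land on
    the node while a measurement runs.
    """
    if slots < 1:
        raise ValueError("a node must have at least one slot")
    return [(node, "master")] + [(node, "slave")] * (slots - 1)


if __name__ == "__main__":
    # Hypothetical 4-slot node: 1 master and 3 slaves.
    for target, role in seize_plan("node01", 4):
        print(target, role)
```

Because slots are static (one per core, one job per slot), filling every slot is the only user-level way to get a node to itself.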


Scheduling Measurements

When all slaves & master are running, contact central control

Slaves & master yield periodically to allow other jobs to run

Scheduler:
– Schedule measurements between masters
– Collect & store results from masters
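The slides do not say how the scheduler orders measurements. One plausible approach, shown here purely as an assumption, is a round-robin pairing: every pair of seized nodes is measured once, and no node takes part in two measurements at the same time, so measurements do not interfere with each other. The function name is hypothetical and pairs are treated as unordered.

```python
def measurement_rounds(nodes):
    """Group all node pairs into rounds where each node appears at most once.

    Uses the circle method for round-robin tournaments. With an odd number
    of nodes, one node sits out each round.
    """
    nodes = list(nodes)
    if len(nodes) % 2:
        nodes.append(None)  # placeholder: the node paired with it sits out
    n = len(nodes)
    rounds = []
    for _ in range(n - 1):
        pairs = [(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)]
        rounds.append([(a, b) for a, b in pairs if a is not None and b is not None])
        # Keep the first node fixed and rotate the rest by one position.
        nodes = [nodes[0], nodes[-1]] + nodes[1:-1]
    return rounds


if __name__ == "__main__":
    for number, pairs in enumerate(measurement_rounds(["A", "B", "C", "D"]), 1):
        print(number, pairs)
```

In practice nodes come and go as jobs are preempted, so a real scheduler would have to re-plan as masters appear and disappear; this sketch covers only the static case.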