Page 1: HPC Controls Future

HPC Controls: View to the Future

Ralph H. Castain

June, 2015

Page 2: HPC Controls Future

This Presentation: A Caveat

• One person’s view into the future

o Where I am leading the open source community

o Compiled from presentations and emails with that community, the national labs, and various corporations

o Spans last 20+ years

• Not what I am expecting any particular entity to do

Page 3: HPC Controls Future

Controls Overview: A Definition

• Resource Manager

o Workload manager

o Run-time environment

• Monitoring

• Resiliency/Error Management

• Overlay Network

• Pub-sub Network

• Console

Page 4: HPC Controls Future

Controls Overview: A Definition

• Resource Manager

o Workload manager

o Run-time environment

• Monitoring

• Resiliency/Error Management

• Overlay Network

• Pub-sub Network

• Console

Open Ecosystems Today

• Every element is an independent island, each with its own community

• Multiple programming languages

• Conflicting licensing

• Cross-element interactions, where they exist, are via text-based messaging

Page 5: HPC Controls Future

General Requirements

• Scalable to exascale levels & beyond

– Better-than-linear scaling

– Constrained memory footprint

• Dynamically configurable

– Sense and adapt, user-directable

– On-the-fly updates

• Open source (non-copy-left)

• Maintainable, flexible

– Single platform that can be utilized to build multiple tools

– Existing ecosystem

• Resilient

– Self-heal around failures

– Reintegrate recovered resources

Page 6: HPC Controls Future

Chosen Software Platform

• Demonstrated scalability

• Established community

• Clean non-copy-left licensing

• Compatible architecture

• …

Open Resilient Cluster Manager (ORCM) [for reference implementation]

Page 7: HPC Controls Future

[Timeline diagram: FDDP (1994-2003) leading into Open MPI / OpenRTE (2003-present) and on to ORCM/SCON; adopters/contributors include Cisco, Intel, and EMC; uses span enterprise routers, cluster monitoring, and SCON/RM; deployed at 10s-100s of thousands of nodes (80K in production).]

Page 8: HPC Controls Future

Abstractions

• Divide functional blocks into abstract frameworks

o Standardize the API for each identified functional area

o Can be used externally, or dedicated to internal use

o Single-select: pick active component at startup

o Multi-select: dynamically decide for each execution

• Multiple implementations

o Each in its own isolated plugin

o Fully implement the API

o Base “component” holds common functions
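
As an illustration of this pattern, here is a minimal C sketch of a single framework with two plugins and single-select at startup; every name in it is hypothetical rather than taken from the ORCM code base.

/* Minimal sketch of the framework/plugin pattern: one standardized API per
 * functional area, multiple isolated plugin implementations, and a base
 * that single-selects the active component at startup. Hypothetical names. */
#include <stdio.h>
#include <string.h>

/* Standardized API for one functional area ("framework") */
typedef struct {
    const char *name;
    int  (*init)(void);
    int  (*sample)(double *value);
    void (*finalize)(void);
} sensor_module_t;

/* Plugin 1: does nothing useful */
static int  null_init(void)        { return 0; }
static int  null_sample(double *v) { *v = 0.0; return 0; }
static void null_finalize(void)    { }
static const sensor_module_t null_plugin = { "null", null_init, null_sample, null_finalize };

/* Plugin 2: pretends to read a sensor */
static int  sim_init(void)         { return 0; }
static int  sim_sample(double *v)  { *v = 42.0; return 0; }
static void sim_finalize(void)     { }
static const sensor_module_t sim_plugin = { "sim", sim_init, sim_sample, sim_finalize };

/* Base component: common logic shared by all plugins */
static const sensor_module_t *available[] = { &null_plugin, &sim_plugin, NULL };

static const sensor_module_t *select_component(const char *requested)
{
    for (int i = 0; NULL != available[i]; i++) {
        if (0 == strcmp(available[i]->name, requested)) {
            return available[i];
        }
    }
    return NULL;
}

int main(void)
{
    const sensor_module_t *active = select_component("sim");  /* single-select at startup */
    double v;
    if (NULL != active && 0 == active->init() && 0 == active->sample(&v)) {
        printf("[%s] sampled %.1f\n", active->name, v);
        active->finalize();
    }
    return 0;
}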

Page 9: HPC Controls Future

Example: SCalable Overlay Network (SCON)

[Diagram: the SCON-MSG framework exposes Send/Recv and selects among RMQ, ZMQ, BTL, and OOB plugins; the BTL/OOB plugins carry transports such as TCP, UDP, IB, uGNI, SM, usNIC, CUDA, Portals4, and SCIF.]

Page 10: HPC Controls Future

Example: SCalable Overlay Network (SCON)

[Same diagram, annotated "Inherit": the BTL and OOB plugins and their transports are inherited from the existing code base.]

Page 11: HPC Controls Future

ORCM and Plug-ins

• Plug-ins are shared libraries

o Central set of plug-ins in installation tree

o Users can also have plug-ins under $HOME

o Proprietary binary plugins picked up at runtime

• Can add / remove plug-ins after install

o No need to recompile / re-link apps

o Download / install new plug-ins

o Develop new plug-ins safely

• Update “on-the-fly”

o Add, update plug-ins while running

o Frameworks “pause” during update
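
Because plug-ins are ordinary shared libraries, runtime pickup can be sketched with the standard dlopen/dlsym interface; the path and symbol name below are purely illustrative, not the actual ORCM layout.

/* Hypothetical sketch of runtime plugin pickup via a shared library.
 * Build with: cc plugin_load.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*plugin_query_fn)(void);

int main(void)
{
    /* Illustrative path and symbol name only */
    void *handle = dlopen("/opt/orcm/lib/plugins/mca_sensor_example.so",
                          RTLD_NOW | RTLD_LOCAL);
    if (NULL == handle) {
        fprintf(stderr, "skipping plugin: %s\n", dlerror());
        return 0;   /* a missing plugin is not fatal: it simply is not selected */
    }
    plugin_query_fn query = (plugin_query_fn)dlsym(handle, "plugin_query");
    if (NULL != query) {
        printf("plugin priority: %d\n", query());
    }
    dlclose(handle);  /* plugins can be added or dropped without relinking apps */
    return 0;
}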

Page 12: HPC Controls Future

Controls Overview: A Definition

• Resource Manager

o Workload manager

o Run-time environment

• Monitoring

• Resiliency/Error Management

• Overlay Network

• Pub-sub Network

• Console

Page 13: HPC Controls Future

Definition: RM

• Scheduler/Workload Manager

o Allocates resources to session

o Interactive and batch

• Run-Time Environment

o Launch and monitor applications

o Support inter-process communication wireup

o Serve as intermediary between applications and WM

• Dynamic resource requests

• Error notification

o Implement failure policies

Page 14: HPC Controls Future

Breaking it Down

• Workload Manager

o Dedicated framework

o Plugins for two-way integration with external WMs (Moab, Cobalt)

o Plugins for implementing internal WM (FIFO)

• Run-Time Environment

o Broken down into functional blocks, each with its own framework

• Loosely divided into three general categories

• Messaging, launch, error handling

• One or more frameworks for each category

o Knitted together via “state machine”

• Event-driven, async

• Each functional block can be separate thread

• Each plugin within each block can be separate thread(s)
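
The "state machine" idea can be sketched as a table of per-state callbacks driven by events; the states, handlers, and job id below are hypothetical, and the real system services events asynchronously, potentially with a thread per functional block.

/* Minimal sketch of an event-driven launch state machine (hypothetical). */
#include <stdio.h>

typedef enum { JOB_ALLOCATED, JOB_MAPPED, JOB_LAUNCHED, JOB_COMPLETE } job_state_t;

typedef job_state_t (*state_cbfunc_t)(int jobid);

static job_state_t do_map(int jobid)
{
    printf("job %d: mapping processes to nodes\n", jobid);
    return JOB_MAPPED;
}

static job_state_t do_launch(int jobid)
{
    printf("job %d: launching daemons and processes\n", jobid);
    return JOB_LAUNCHED;
}

static job_state_t do_wireup(int jobid)
{
    printf("job %d: wiring up communications\n", jobid);
    return JOB_COMPLETE;
}

/* One handler per state; each handler posts the next state as an "event" */
static state_cbfunc_t handlers[] = {
    [JOB_ALLOCATED] = do_map,
    [JOB_MAPPED]    = do_launch,
    [JOB_LAUNCHED]  = do_wireup,
};

int main(void)
{
    job_state_t state = JOB_ALLOCATED;
    while (JOB_COMPLETE != state) {
        state = handlers[state](42);
    }
    return 0;
}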

Page 15: HPC Controls Future

Key Objectives for Future

• Orchestration via integration

o Monitoring system, file system, network, facility, console

• Power control/management

• “Instant On”

• Application interface

o Request RM actions

o Receive RM notifications

o Fault tolerance

• Advanced workload management algorithms

o Alternative programming model support (Hadoop, Cloud)

o Anticipatory scheduling

Page 16: HPC Controls Future

RM as Orchestrator

[Diagram: the Resource Manager at the hub, connected through SCON, the Overlay Network, and Pub-Sub to Monitoring, Console, DB, File System, Network, and the Provisioning Agent.]

Page 17: HPC Controls Future

Hierarchical Design

[Diagram: an orcmd on each compute node (CN), plus the node BMC, reports to the orcmd on its Rack Controller; multiple rack controllers report to a Row Controller orcmd backed by a DB.]

Page 19: HPC Controls Future

Flexible Architecture

• Each tool built on top of same plugin system

o Different combinations of frameworks

o Different plugins activated to play different roles

o Example: orcmd on compute node vs on rack/row controllers

• Designed for distributed, centralized, hybrid operations

o Centralized for small clusters

o Hybrid for larger clusters

o Example: centralized scheduler, distributed “worker-bees”

• Accessible to users for interacting with RM

o Add shim libraries (abstract, public APIs) to access framework APIs

o Examples: SCON, pub-sub, in-flight analytics

Page 20: HPC Controls Future

Code Reuse: Scheduler

[Diagram: the scheduler reuses the orcmd plugin stack: a sched framework with moab, cobalt, and FIFO plugins, alongside the db, pwmgt, cfgi, and pvn frameworks and the DB.]

Page 21: HPC Controls Future

File System Integration

• Input

o Current data location, time to retrieve and position

o Data the application intends to use

• Job submission, dynamic request (via RTE), persistence across jobs and sessions

o What OS/libraries application requires

• Workload Manager

o Scheduling algorithm factors

• Current provisioning map, NVM usage patterns/requests, persistent data/ckpt location

• Data/library locality, loading time

• Output

o Pre-location requirements to file system (e.g., SPINDLE cache tree)

o Hot/warm/cold data movement, persistence directives to RTE

o NVM & Burst Buffer allocations (job submission, dynamic)

Page 22: HPC Controls Future

Provisioning Integration

• Support multiple provisioning agents

o Warewulf, xCAT, ROCKS

• Support multiple provisioning modes

o Bare metal, Virtual Machines (VMs)

• Job submission includes desired provisioning

o Scheduler can include provisioning time in allocation decision, anticipate re-use of provisioned nodes

o Scheduler notifies provisioning agent of what nodes to provision

o Provisioning agent notifies RTE when provisioning is complete so launch can proceed

Page 23: HPC Controls Future

Network Integration

• Quality of service allocations

o Bandwidth, traffic priority, power constraints

o Specified at job submission, dynamic request (via RTE)

• Security requirements

o Network domain definitions

• State-of-health

o Monitored by monitoring system

o Reported to RM for response

• Static endpoint

o Allocate application endpoints prior to launch

• Minimize startup time

o Update process location upon fault recovery

Page 24: HPC Controls Future

Power Control/Management

• Power/heat-aware scheduling

o Specified cluster-level power cap, ramp up/down rate limits

o Specified thermal limits (ramp up/down rates, level)

o Node-level idle power, shutdown policies between sessions/time-of-day

• Site-level coordination

o Heat and power management subsystem

• Consider system capacity in scheduling

• Provide load anticipation levels to site controllers

o Coordinate ramps (job launch/shutdown)

• Within cluster, across site

o Receive limit (high/low) updates

o Direct RTE to adjust controls while maintaining sync across cluster

Page 25: HPC Controls Future

“Instant On”

• Objective

o Reduce startup from minutes to seconds

o 1M procs, 50k nodes thru MPI_Init

• On track: 20 sec

• 2018 target: 5 sec

• What we require

o Use HSN for communication during launch

o Static endpoint allocation prior to launch

o Integrate file system with RM for prepositioning, and with scheduler for anticipatory locations

Today: ~15-30 min

Page 26: HPC Controls Future

“Instant On” Value-Add

• Requires two things

o Process can compute endpoint of any remote process

o Hardware can reserve and translate virtual endpoints to local ones

• RM knows process map

o Assign endpoint info for each process

o Communicate map to local processes at initialization

• Programming libraries

o Compute connection info from map

• Eliminates costly sharing of endpoint info
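
A small sketch of the idea, assuming a hypothetical map layout in which the fabric reserves a contiguous block of queue pairs per node, so any peer's endpoint is a pure function of the RM-provided process map rather than something that must be exchanged.

/* Sketch: compute a peer's endpoint from the RM-provided process map
 * instead of exchanging endpoint info at startup (hypothetical layout). */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t node_id;     /* node hosting the process */
    uint16_t local_rank;  /* rank of the process on that node */
} proc_map_entry_t;

typedef struct {
    uint32_t node_id;
    uint16_t queue_pair;  /* e.g., HSN queue pair or port offset */
} endpoint_t;

/* Assumes the fabric reserves a contiguous block of queue pairs per node */
static endpoint_t compute_endpoint(const proc_map_entry_t *e, uint16_t qp_base)
{
    endpoint_t ep = { e->node_id, (uint16_t)(qp_base + e->local_rank) };
    return ep;
}

int main(void)
{
    proc_map_entry_t peer = { .node_id = 1207, .local_rank = 3 };
    endpoint_t ep = compute_endpoint(&peer, 64);
    printf("peer endpoint: node %u, qp %u\n", ep.node_id, ep.queue_pair);
    return 0;
}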

Page 27: HPC Controls Future

Application Interface: PMIx

• Current PMI implementations are limited

o Only used for MPI wireup

o Don’t scale adequately

• Communication required for every piece of data

• All blocking operations

o Licensing and desire for standalone client library

• Increasing requests for app-RM interactions

o Job spawn, data pre-location, power control

o Current approach is fragmented

• Every RM creating its own APIs

Page 29: HPC Controls Future

PMIx Approach

• Ease adoption

o Backward compatible with the PMI-1/2 APIs

o Standalone client library

o Server convenience library

• BSD-licensed

• Replace blocking with non-blocking operations for scalability

• Add APIs to support

o New use-cases: IO, power, error notification, checkpoint, …

o Programming models beyond MPI
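
For reference, a hedged sketch of a PMIx client contributing and retrieving wire-up data. The call signatures follow the later PMIx standard (v2-style) and differ slightly from the 1.x snapshot discussed in this deck; the key name and values are illustrative only.

/* Hedged PMIx client sketch; build against the PMIx library (-lpmix).
 * Signatures follow the later PMIx standard and may differ from 1.x. */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, peer;
    pmix_value_t val, *rval = NULL;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    /* Contribute a piece of wire-up data under a hypothetical key */
    val.type = PMIX_UINT32;
    val.data.uint32 = 12345;                 /* e.g., a local endpoint id */
    PMIx_Put(PMIX_GLOBAL, "example.endpoint", &val);
    PMIx_Commit();

    /* Collective exchange; a non-blocking PMIx_Fence_nb form also exists */
    PMIx_Fence(NULL, 0, NULL, 0);

    /* Look up the same key as published by rank 0 of our namespace */
    peer = myproc;
    peer.rank = 0;
    if (PMIX_SUCCESS == PMIx_Get(&peer, "example.endpoint", NULL, 0, &rval)) {
        printf("rank 0 endpoint id: %u\n", rval->data.uint32);
        /* value cleanup omitted for brevity */
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}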


Page 30: HPC Controls Future

PMIx: Fault Tolerance

• Notification

o App can register for error notifications, incipient faults

o RM will notify when app would be impacted

• Notify procs, system monitor, user as directed (e.g., email, tweet)

o App responds with desired action

• Terminate/restart job, wait for checkpoint, etc.

• Checkpoint support

o Timed or on-demand, binary or SCR

o BB allocations, bleed/storage directives across hierarchical storage

• Restart support

o From remote NVM checkpoint, relocate checkpoint, etc.

Page 31: HPC Controls Future

PMIx: Status

• Version 1.0 release

o Preliminary version for commentary

• Version 1.1 release

o Production version

o Scheduled for release by Supercomputing 2015

• Server integrations underway

o SLURM

o ORCM

• PGAS/GASNet integration underway

o Extend support for that programming model

Page 32: HPC Controls Future

Alternative Models: Supporting Hadoop & Cloud

• HPC

o Optimize for performance, single-tenancy

• Hadoop support

o Optimize for data locality

o Multi-tenancy

o Pre-location of data to NVM, pre-staging from data store

o Dynamic allocation requests

• Cloud = capacity computing

o Optimize for cost

o Multi-tenancy

o Provisioning and security support for VMs

Page 33: HPC Controls Future

Workload Manager: Job Description Language

• Complexity of describing job is growing

o Power, file/lib positioning

o Performance vs capacity, programming model

o System, project, application-level defaults

• Provide templates?

o System defaults, with modifiers

• --hadoop:mapper=foo,reducer=bar

o User-defined

• Application templates

• Shared, group templates

o Markup language definition of behaviors, priorities


Page 34: HPC Controls Future

Anticipatory Scheduling: Key to Success

• Current schedulers are reactive

o Compute next allocation when prior one completes

o Mandates fast algorithm to reduce dead time

o Limits what can be done in terms of pre-loading, etc., as it adds to dead time

• Anticipatory scheduler

o Look ahead and predict range of potential schedules

o Instantly select “best” when prior one completes

o Support data and binary prepositioning in anticipation of most likely option

o Improved optimization of resources as pressure on the algorithm's computational time is relieved

Page 35: HPC Controls Future

Anticipatory Scheduler

[Diagram: the Resource Manager publishes its projected schedule over the Overlay Network to the Console, File System, and Provisioning Agent; the file system retrieves required files from backend storage and the provisioning agent caches images to staging points.]

Page 36: HPC Controls Future

Controls Overview: A Definition

• Resource Manager

o Workload manager

o Run-time environment

• Monitoring

• Resiliency/Error Management

• Overlay Network

• Pub-sub Network

• Console

Page 37: HPC Controls Future

Key Functional Requirements

• Support all available data collection sensors

o Environmental (junction temps, power)

o Process usage statistics (cpu, memory, disk, network)

o MCA, BMC events

• Support variety of backend databases

• Admin configuration

o Select which sensors, how often sampled, how often reported, when and where data is stored

o Severity/priority/definition of events

o Local admin customizes config, on-the-fly changes

Page 38: HPC Controls Future

Monitoring System

[Diagram: on each Compute Node, an orcmd (which launches/monitors apps) runs sensor and diag frameworks, collecting data, inventory, and diagnostics from the BMC; node daemons connect over SCON to the orcmd on the Rack Controller, which aggregates data, provides IO/provisioning caching, and records data/inventory to the DB via the db framework.]

Page 39: HPC Controls Future

Sensor Framework

• Each sensor can operate in its own time base

o Sample at independent rate

o Output collected in common framework-level bucket

o Can trigger immediate send of bucket for critical events

• Separate reporting time base

o Send bucket to aggregator node at scheduled intervals

o Aggregator receives the bucket in the sensor framework base, extracts the contribution from each sensor, and passes it to that sensor's module for unpacking/recording

[Diagram: sensor framework with freq, junction-temperature, power, and BMC/IPMI plugins plus the framework base.]
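
A sketch of the per-sensor sampling and common-bucket reporting described above, with hypothetical sensor names, payload formats, and thresholds; in the real system each sensor runs on its own timer and the reporting pass runs on a separate time base.

/* Sketch of independent sensor sampling into a shared bucket. */
#include <stdio.h>
#include <string.h>

#define BUCKET_MAX 4096

typedef struct {
    char   data[BUCKET_MAX];
    size_t used;
    int    critical;   /* set to trigger an immediate send */
} bucket_t;

typedef struct {
    const char *name;
    int sample_interval_sec;          /* each sensor keeps its own time base */
    void (*sample)(bucket_t *bkt);
} sensor_t;

static void pack(bucket_t *b, const char *s)
{
    size_t n = strlen(s);
    if (b->used + n < BUCKET_MAX) { memcpy(b->data + b->used, s, n); b->used += n; }
}

static void coretemp_sample(bucket_t *b)
{
    int temp_c = 97;                                  /* pretend reading */
    char buf[32];
    snprintf(buf, sizeof(buf), "coretemp:%dC;", temp_c);
    pack(b, buf);
    if (temp_c > 95) b->critical = 1;                 /* critical event */
}

static void power_sample(bucket_t *b) { pack(b, "power:310W;"); }

static void report(bucket_t *b)       /* send to the aggregator node */
{
    printf("send %zu bytes to aggregator%s\n", b->used, b->critical ? " (critical)" : "");
    b->used = 0; b->critical = 0;
}

int main(void)
{
    sensor_t sensors[] = { {"coretemp", 10, coretemp_sample}, {"power", 60, power_sample} };
    bucket_t bkt = { .used = 0, .critical = 0 };

    for (size_t i = 0; i < sizeof(sensors)/sizeof(sensors[0]); i++) {
        sensors[i].sample(&bkt);
        if (bkt.critical) report(&bkt);   /* immediate send for critical events */
    }
    report(&bkt);                         /* normal send on the reporting time base */
    return 0;
}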

Page 40: HPC Controls Future

Inventory Collection/Tracking

• Performed at boot

o Update on demand, or when a warm change is detected

• Data obtained from three sources

o HWLOC topology collection (processors, memory, disks, NICs)

o Sensor components (BMC/IPMI)

o Network fabric managers (switches/routers)

• Store current inventory and updates

o Track prior locations of each FRU

o Correlate inventory to RAS events

o Detect inadvertent return-to-service of failed units

Page 41: HPC Controls Future

Database Framework

• Database commands go to base “stub” functions

o Cycle across active modules until one acknowledges handling the request

o API provides “attributes” to specify database, table, etc.

• Modules check attributes to determine ability to handle

o Plugins call schema framework to format data

• Multi-threading supported

o Each command can be processed by separate thread

o Each database module can process requests using multiple threads

[Diagram: db framework with postgres, ODBC, and other plugins plus the framework base.]
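
A sketch of the base "stub" cycling across active modules until one accepts the request, using hypothetical attribute strings and module names; a real implementation would parse structured attributes rather than substrings.

/* Sketch: db framework base cycles across active modules until one accepts. */
#include <stdio.h>
#include <string.h>

#define DB_SUCCESS      0
#define DB_ERR_NOT_MINE 1   /* module checked the attributes and declined */

typedef struct {
    const char *name;
    int (*store)(const char *attrs, const char *table, const char *row);
} db_module_t;

static int postgres_store(const char *attrs, const char *table, const char *row)
{
    if (NULL == strstr(attrs, "backend=postgres")) return DB_ERR_NOT_MINE;
    printf("postgres: insert into %s: %s\n", table, row);
    return DB_SUCCESS;
}

static int odbc_store(const char *attrs, const char *table, const char *row)
{
    if (NULL == strstr(attrs, "backend=odbc")) return DB_ERR_NOT_MINE;
    printf("odbc: insert into %s: %s\n", table, row);
    return DB_SUCCESS;
}

static const db_module_t active_modules[] = {
    { "postgres", postgres_store },
    { "odbc",     odbc_store     },
};

/* Base "stub" function: try each active module in turn */
static int db_base_store(const char *attrs, const char *table, const char *row)
{
    for (size_t i = 0; i < sizeof(active_modules)/sizeof(active_modules[0]); i++) {
        int rc = active_modules[i].store(attrs, table, row);
        if (DB_ERR_NOT_MINE != rc) return rc;   /* handled (or a real error) */
    }
    return DB_ERR_NOT_MINE;                     /* no module could handle it */
}

int main(void)
{
    return db_base_store("backend=odbc;db=ras", "ras_events", "node42,overtemp");
}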

Page 42: HPC Controls Future

Diagnostics: Another Framework

• Test the system on demand

o Checks for running jobs and returns an error if any are found

o Execute at boot, when hardware changed, if errors reported, …

• Each plugin tests specific area

o Data reported to database

o Tied to FRUs as well as overall node

[Diagram: diag framework with cpu, mem, and eth plugins plus the framework base.]

Page 43: HPC Controls Future

Controls Overview: A Definition

• Resource Manager

o Workload manager

o Run-time environment

• Monitoring

• Resiliency/Error Management

• Overlay Network

• Pub-sub Network

• Console

Page 44: HPC Controls Future

Planned Progression

[Diagram: Monitoring → Fault detection (RAS event generation) → Fault diagnosis → Fault prediction]

Page 45: HPC Controls Future

Planned Progression

[Diagram: Monitoring → Fault detection (RAS event generation) → Fault diagnosis → Fault prediction (FDDP) → Action/Response]

Page 46: HPC Controls Future

Analytics Workflow Concept

[Diagram: an input module converts data from sensors or other workflows into a generalized format; workflow output can feed other workflows, generate RAS events, be published via pub-sub, or be stored in the database. Available in SCON as well.]

Page 47: HPC Controls Future

Workflow Elements

• Average (window, running, etc.)

• Rate (convert incoming data to events/sec)

• Threshold (high, low)

• Filter

o Selects input values based on provided params

• RAS event

o Generates a RAS event corresponding to input description

• Publish data
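
These elements compose into simple chains; the sketch below is a hypothetical workflow that window-averages a sensor stream, applies a high threshold, and raises a RAS event. Names and limits are illustrative only.

/* Sketch of an analytics workflow: window average -> threshold -> RAS event. */
#include <stdio.h>

#define WINDOW 4

static double window_average(const double *samples, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += samples[i];
    return sum / n;
}

static int threshold_high(double value, double limit) { return value > limit; }

static void raise_ras_event(const char *what, double value)
{
    printf("RAS event: %s = %.1f exceeds limit\n", what, value);
}

int main(void)
{
    double coretemp[WINDOW] = { 78.0, 81.0, 86.0, 90.0 };   /* degrees C */
    double avg = window_average(coretemp, WINDOW);
    if (threshold_high(avg, 80.0)) {
        raise_ras_event("coretemp.window_avg", avg);
    }
    return 0;
}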

Page 48: HPC Controls Future

Analytics

• Execute on aggregator nodes for in-flight reduction

o Sys admin defines, user can define (if permitted)

• Event-based state machine

o Each workflow in its own thread, with its own instance of each plugin

o Branch and merge of workflow

o Tap stream between workflow steps

o Tap data streams (sensors, others)

• Event generation

o Generate events/alarms

o Specify data to be included (window)

Page 49: HPC Controls Future

Distributed Architecture

• Hierarchical, distributed approach for unlimited scalability

o Utilize daemons on rack/row controllers

• Analysis done at each level of the hierarchy

o Support rapid response to critical events

o Distribute processing load

o Minimize data movement

• RM’s error manager framework controls response

o Based on specified policies

Page 50: HPC Controls Future

Fault Diagnosis

• Identify root cause and location

o Sometimes obvious – e.g., when there is a direct measurement

o Other times non-obvious

• Multiple cascading impacts

• Cause identified by multi-sensor correlations (indirect measurement)

• Direct measurement yields early report of non-root cause

• Example: power supply fails due to borderline cooling + high load

• Estimate severity

o Safety issue, long-term damage, imminent failure

• Requires in-depth understanding of hardware

Page 51: HPC Controls Future

Fault Prediction: Methodology

• Exploit access to internals

o Investigate optimal location, number of sensors

o Embed intelligence, communications capability

• Integrate data from all available sources

o Engineering design tests

o Reliability life tests

o Production qualification tests

• Utilize learning algorithms to improve performance

o Both embedded, post process

o Seed with expert knowledge

Page 52: HPC Controls Future

Fault Prediction: Outcomes

• Continuous update of mean-time-to-preventative-maintenance

o Feed into projected downtime planning

o Incorporate into scheduling algo

• Alarm reports for imminent failures

o Notify impacted sessions/applications

o Plan/execute preemptive actions

• Store predictions

o Algorithm improvement

Page 53: HPC Controls Future

Error Manager

• Log errors for reporting, future analysis

• Primary responsibility: fault response

o Contains defined response for given types of faults

o Responds to faults by shifting resources, processes

• Secondary responsibility: resilience strategy

o Continuously update and define possible response options

o Fault prediction triggers pre-emptive action

• Select various response strategies via component

o Run-time, configuration, or on-the-fly command

Page 54: HPC Controls Future

Example: Network Communications

Scenario: a network interface card repeatedly loses connection or shows data loss.

Fault tolerant:

• Modify run-time to avoid automatic abort upon loss of communication

• Detect failure of any on-going communication

• Reconnect quickly

• Resend lost messages from me

• Request resend of lost messages to me

Resilient:

• Estimate probability of failure during session

– Failure history

– Monitor internals

• Temperature, …

• Take preemptive action

– Reroute messages via alternative transports

– Coordinated move of processes to another node

Page 55: HPC Controls Future

Example: Node Failure

Scenario: a node fails during execution of a high-priority application.

Fault tolerant:

• Detect failure of process(es) on that node

• Find last checkpoint

• Restart processes on another node at the checkpoint state

– May (likely) require restarting all processes at same state

• Resend any lost messages

Resilient:

• Monitor state-of-health of node

– Temperature of key components, other signatures

• Estimate probability of failure during future time interval(s)

• Take preemptive action

– Direct checkpoint/save

– Coordinate move with application

• Avoids potential need to reset ALL processes back to earlier state!

Page 56: HPC Controls Future

Controls Overview: A Definition

• Resource Manager

o Workload manager

o Run-time environment

• Monitoring

• Resiliency/Error Management

• Overlay Network

• Pub-sub Network

• Console

Page 57: HPC Controls Future

Definition: Overlay Network

• Messaging system

o Scalable/resilient communications

o Integration-friendly with user applications, system management software

o Quality of service support

• In-flight analytics

o Insert analytic workflows anywhere in the data stream

o Tap data stream at any point

o Generate RAS events from data stream

Page 58: HPC Controls Future

Requirements

• Scalable to exascale levels

– Better-than-linear scaling of broadcast

• Resilient

– Self-heal around failures

– Reintegrate recovered resources

– Support quality of service (QoS) levels

• Dynamically configurable

– Sense and adapt, user-directable

– On-the-fly updates

• In-flight analytics

– User-defined workflows

– Distributed, hierarchical analysis of sensor data to identify RAS events

– Used by error manager

• Multi-fabric

– OOB Ethernet

– In-band fabric

– Auto-switchover for resilience, QoS

• Open source (non-copy-left)

Page 59: HPC Controls Future

High-Level Architecture

[Diagram: SCON-MSG (Send/Recv) with RMQ, ZMQ, BTL, and OOB plugins over transports such as TCP, UDP, IB, uGNI, SM, usNIC, CUDA, Portals4, and SCIF (BTL/OOB inherited); a QoS layer with ACK, NACK, and hybrid workflow plugins; and SCON-ANALYTICS with filter, average, and threshold plugins.]

Page 60: HPC Controls Future

Messaging System

• Message management level

o Manages assignment of messages (and/or fragments) to transport layers

o Detects/redirects messages upon transport failure

o Interfaces to transports

o Matching logic

• Transport level

o Byte-level message movement

o May fragment across wires within transport

o Per-message selection policies (system default, user-specified, etc.)

Page 61: HPC Controls Future

Quality of Service Controls

• Plugin architecture

o Selected per transport, requested quality of service

• ACK-based (cmd/ctrl)

o ACK each message, or window of messages, based on QoS

o Resend or return error – QoS-specified policy and number of retries before giving up

• NACK-based (streaming)

o NACK if message sequence number is out of order, indicating lost message(s)

o Request resend or return error, based on QoS

o May ACK after N messages

• Security level

o Connection authorization, encryption, …
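
A sketch of the NACK-based streaming policy: the receiver watches sequence numbers and requests a resend when it sees a gap. The framing, types, and delivery behavior are hypothetical.

/* Sketch: NACK-based QoS detects a sequence gap and requests a resend. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t seq; const char *payload; } msg_t;

static uint32_t expected = 0;

static void send_nack(uint32_t from, uint32_t to)
{
    printf("NACK: request resend of %u..%u\n", from, to - 1);
}

static void recv_msg(const msg_t *m)
{
    if (m->seq != expected) {       /* out-of-order => lost message(s) */
        send_nack(expected, m->seq);
    }
    expected = m->seq + 1;
    printf("deliver seq %u: %s\n", m->seq, m->payload);
}

int main(void)
{
    msg_t stream[] = { {0, "a"}, {1, "b"}, {3, "d"} };   /* seq 2 was lost */
    for (int i = 0; i < 3; i++) {
        recv_msg(&stream[i]);
    }
    return 0;
}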

Page 62: HPC Controls Future

Controls Overview: A Definition

• Resource Manager

o Workload manager

o Run-time environment

• Monitoring

• Resiliency/Error Management

• Overlay Network

• Pub-sub Network

• Console

Page 63: HPC Controls Future

Pub-Sub: Two Models

• Client-Server (MQ)

o Everyone publishes data to one or more servers

o Clients subscribe to server

o Server pushes data matching subscription to clients

o Option: server logs and provides history prior to subscription

• Broker (MRNet)

o Publishers register data with server

o Clients log request with server

o Server connects client to publisher

o Publisher directly provides data to clients

o Up to publisher to log, provide history

Page 65: HPC Controls Future

Pub-Sub: Two Models

ORCM: Support Both Models

Page 66: HPC Controls Future

ORCM Pub-Sub Architecture

• Abstract, extendable APIs

o Utilize “attributes” to specify desired data and other parameters

o Allows each component to determine ability to support request

• Internal lightweight version

o Support for smaller systems

o When “good enough” is enough

o Broker and client-server architectures

• External heavyweight versions supported via plugins

o RabbitMQ

o MRNet

Page 67: HPC Controls Future

ORCM Pub/Sub API

• Required functions

o Advertise available event/data (publisher)

o Publish event/data (publisher)

o Get catalog of published events (subscriber)

o Subscribe for event(s)/data (subscriber; asynchronous callback-based and polling-based)

o Unsubscribe for event/data (subscriber)

• Security – Access control for requested event/data

• Possible future functions

o QoS

o Statistics
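
A sketch of what such an abstract API surface could look like, with a trivial in-process "lightweight" implementation behind it; every name here is hypothetical and not taken from the actual ORCM headers.

/* Sketch of an abstract pub-sub API plus a trivial single-topic,
 * single-subscriber stand-in for the lightweight internal version. */
#include <stdio.h>
#include <string.h>

typedef void (*event_cbfunc_t)(const char *topic, const char *data, void *cbdata);

typedef struct {
    int (*advertise)(const char *topic);                  /* publisher  */
    int (*publish)(const char *topic, const char *data);  /* publisher  */
    int (*subscribe)(const char *topic, event_cbfunc_t cb, void *cbdata);
    int (*unsubscribe)(const char *topic);
} pubsub_module_t;

static char registered_topic[64];
static event_cbfunc_t subscriber_cb;
static void *subscriber_cbdata;

static int lw_advertise(const char *topic)
{
    snprintf(registered_topic, sizeof(registered_topic), "%s", topic);
    return 0;
}

static int lw_publish(const char *topic, const char *data)
{
    if (subscriber_cb && 0 == strcmp(topic, registered_topic)) {
        subscriber_cb(topic, data, subscriber_cbdata);   /* push to subscriber */
    }
    return 0;
}

static int lw_subscribe(const char *topic, event_cbfunc_t cb, void *cbdata)
{
    (void)topic; subscriber_cb = cb; subscriber_cbdata = cbdata; return 0;
}

static int lw_unsubscribe(const char *topic) { (void)topic; subscriber_cb = NULL; return 0; }

static const pubsub_module_t lightweight = { lw_advertise, lw_publish, lw_subscribe, lw_unsubscribe };

static void on_event(const char *topic, const char *data, void *cbdata)
{
    (void)cbdata;
    printf("event on %s: %s\n", topic, data);
}

int main(void)
{
    lightweight.advertise("ras.node42");
    lightweight.subscribe("ras.node42", on_event, NULL);
    lightweight.publish("ras.node42", "coretemp over threshold");
    lightweight.unsubscribe("ras.node42");
    return 0;
}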

Page 68: HPC Controls Future

Controls Overview: A Definition

• Resource Manager

o Workload manager

o Run-time environment

• Monitoring

• Resiliency/Error Management

• Overlay Network

• Pub-sub Network

• Console

Page 69: HPC Controls Future

Basic Approach

• Command-line interface

o Required for mid- and high-end systems

o Control on-the-fly adjustments, update/customize config

o Custom code

• GUI

o Required for some market segments

o Good for visualizing cluster state, data trends

o Provide centralized configuration management

o Integrate with existing open source solution

Page 70: HPC Controls Future

Configuration Management

• Central interface for all configuration

o Eliminate the frustrating game of “whack-a-mole”

o Ensure consistent definition across system elements

o Output files/interface to controllers for individual software packages

• RM (queues), network, file system

• Database backend

o Provide historical tracking

o Multiple methods for access, editing

• Command line and GUI

o Dedicated config management tools
