Cluster Reliability Project ISIS Vanderbilt University.


Purpose

Put together a set of software tools that allows us to easily create “pluggable” components that:
– Monitor our clusters (hardware, utilization, errors, jobs, etc.)
– Recognize anomalous conditions (inference rules, model comparison, probabilities)
– Take actions to correct or alleviate the problems (triggered by other components)
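The three component roles above (monitor, recognize, act) can be illustrated with a minimal sketch. Everything named here (`Reading`, `PluginBus`, the temperature threshold, the `drain` action) is a hypothetical illustration and not part of the project's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Reading:
    """One observation published by a monitor (hypothetical schema)."""
    node: str
    metric: str
    value: float


class PluginBus:
    """Hypothetical registry wiring monitors, recognizers and actions together."""

    def __init__(self) -> None:
        self.monitors: List[Callable[[], List[Reading]]] = []
        self.recognizers: List[Callable[[Reading], Optional[str]]] = []
        self.actions: Dict[str, Callable[[Reading], str]] = {}

    def run_once(self) -> List[str]:
        """Poll every monitor, classify each reading, trigger matching actions."""
        log: List[str] = []
        for poll in self.monitors:
            for reading in poll():
                for recognize in self.recognizers:
                    condition = recognize(reading)
                    if condition in self.actions:
                        log.append(self.actions[condition](reading))
        return log


def temperature_monitor() -> List[Reading]:
    # Stand-in for a real hardware probe; the values are made up.
    return [Reading("node01", "cpu_temp_c", 71.0),
            Reading("node02", "cpu_temp_c", 93.5)]


def overheating(reading: Reading) -> Optional[str]:
    # Simple inference rule; the 90 degree threshold is an assumption.
    if reading.metric == "cpu_temp_c" and reading.value > 90.0:
        return "overheat"
    return None


def drain_node(reading: Reading) -> str:
    # A real action taker would call the scheduler; here we only describe it.
    return f"drain {reading.node}"


bus = PluginBus()
bus.monitors.append(temperature_monitor)
bus.recognizers.append(overheating)
bus.actions["overheat"] = drain_node
print(bus.run_once())
```

Because each role is a plain callable registered on the bus, a new monitor or action can be plugged in without touching the others.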

Goals

Increase the availability, utilization, and reliability of the computing clusters
Reduce the time it takes to diagnose problems
Reduce the administrative workload associated with operating the clusters
Automate routine administration tasks
Monitor the use of the clusters
Allow other systems and user scripts to easily interact with this system

Some basic features include:

Identifying types of information that must be collected or communicated,
APIs for communicating information and the creation of monitors and reactors,
Popular programming language bindings,
An environment for writing, testing, and releasing new reactors and monitors,
A basic set of problems that must be addressed, the monitors, reactors and data that they will communicate,
Recording of monitoring information and actions taken,
Administration tools that allow for single-point release distribution and installation, and control of the runtime environment,
A configuration system that allows for uniform parameter setting and allows for tuning to adjust the performance impact on the system.
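As a rough sketch of what "uniform parameter setting" with tuning could look like, the example below resolves cluster-wide defaults against per-monitor overrides. The parameter names (`interval_s`, `enabled`, `record`) and the monitor names are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class MonitorConfig:
    """Hypothetical uniform parameter set shared by every monitor."""
    interval_s: int = 60   # polling period; larger values lower the overhead
    enabled: bool = True   # switch a monitor off without uninstalling it
    record: bool = True    # whether readings are written to the database


def resolve(defaults: MonitorConfig,
            overrides: Dict[str, dict],
            name: str) -> MonitorConfig:
    """Return the defaults with any per-monitor overrides applied."""
    return replace(defaults, **overrides.get(name, {}))


defaults = MonitorConfig()
overrides = {
    "ipmi": {"interval_s": 300},     # poll hardware sensors less often
    "user_proc": {"enabled": False}, # temporarily disabled
}

for name in ("ipmi", "user_proc", "service"):
    print(name, resolve(defaults, overrides, name))
```

Keeping one parameter set for all monitors means a single tuning knob, such as the polling interval, has the same meaning everywhere.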

ISIS Goals

Monitor the health (performance, utilization, state) of all processors and networks in the system (leveraging existing tools and standards).
It should closely monitor performance and the status of a job, and work together with the workflow subsystem, ensuring good progress for larger analysis campaigns that are being conducted.
It should be coupled to the application. Mitigation actions depend on the properties of the application and its overall workflow.
It should be integrated with workflow planning, to allow for resource optimization. This will include interactions with real-time scheduling systems.

ISIS Research

Overall planned deliverables
– Customizable Monitoring & Control framework
– Mitigation engine
– Monitoring and mitigation design tool
– Monitoring and mitigation system generator
In this project we will address the critical difficulties in achieving a fault mitigation framework for a large cluster that is configurable and strives to minimally affect the performance of the cluster.
For this, we will use a model-based design approach that uses domain-specific modeling languages and model transformers to enable system design using domain-specific and higher-level abstractions, and that uses the monitoring and control framework.
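The idea of generating runtime artifacts from a higher-level model can be sketched as follows. The model layout, the node and monitor names, and the `monitor <name>` output format are all hypothetical; the project's actual modeling language and generator are not shown in these slides.

```python
from typing import Dict

# Hypothetical high-level model: node classes declare which monitors they
# need, and each node is assigned to exactly one class.
MODEL = {
    "node_classes": {
        "worker": ["ipmi", "ib_hca"],
        "head": ["pbs"],
    },
    "nodes": {"w001": "worker", "w002": "worker", "h001": "head"},
}


def generate(model: dict) -> Dict[str, str]:
    """Transform the model into one configuration text per node."""
    configs: Dict[str, str] = {}
    for node, node_class in sorted(model["nodes"].items()):
        monitors = model["node_classes"][node_class]
        configs[node] = "\n".join(f"monitor {name}" for name in monitors)
    return configs


for node, text in generate(MODEL).items():
    print(f"# {node}\n{text}")
```

The point of the sketch is that the designer edits only the model; per-node configurations are regenerated rather than hand-maintained.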

Current activities

Strong desire to leverage existing infrastructures for networked-system health monitoring.
The evaluation criteria are:
– Architecture: centralized vs. hierarchical
– Monitoring: available sensors as well as configurability for new sensors
– Scalability: cluster size vs. resource consumption
– Handling: smart data mining and virtual sensors
– Programmability: configuration language
– Report visualization
Currently, we are evaluating two packages:
– Open NMS
– Aware

Near-term major work

Nov/Dec 06:
– Get all job information and some currently measured attributes into a database
– Complete evaluation of products
Jan/Feb 07:
– Additions to current code to record resource utilization, IPMI information, errors, and actions
– Create test system for selected monitoring product
Mar/Apr 07:
– Replicating existing functionality in new framework
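The first milestone, getting job information and measured attributes into a database, might look roughly like the sketch below. It uses Python's standard `sqlite3` module with an in-memory database; the table and column names are assumptions, not the project's schema.

```python
import sqlite3

# Hypothetical two-table layout: one row per job, many measurements per job.
SCHEMA = """
CREATE TABLE job (
    job_id TEXT PRIMARY KEY,
    owner  TEXT,
    state  TEXT
);
CREATE TABLE measurement (
    job_id    TEXT REFERENCES job(job_id),
    node      TEXT,
    attribute TEXT,
    value     REAL,
    taken_at  TEXT  -- ISO 8601 timestamp, so text order is time order
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_job(conn, job_id, owner, state):
    conn.execute("INSERT OR REPLACE INTO job VALUES (?, ?, ?)",
                 (job_id, owner, state))


def record_measurement(conn, job_id, node, attribute, value, taken_at):
    conn.execute("INSERT INTO measurement VALUES (?, ?, ?, ?, ?)",
                 (job_id, node, attribute, value, taken_at))


def latest(conn, job_id, attribute):
    """Most recent value of one attribute for a job, or None."""
    row = conn.execute(
        "SELECT value FROM measurement WHERE job_id = ? AND attribute = ? "
        "ORDER BY taken_at DESC LIMIT 1",
        (job_id, attribute),
    ).fetchone()
    return None if row is None else row[0]


conn = open_store()
record_job(conn, "1234.head", "alice", "running")
record_measurement(conn, "1234.head", "w001", "cpu_temp_c", 61.0,
                   "2006-11-20T10:00:00")
record_measurement(conn, "1234.head", "w001", "cpu_temp_c", 64.5,
                   "2006-11-20T10:05:00")
print(latest(conn, "1234.head", "cpu_temp_c"))
```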

Worker Functions – Final System

[Diagram: worker-node components; labels listed below]
Coordinator
Monitors: IB HCA Monitor, IPMI Monitor, IP Network Monitor, Phys Attr Monitor, User Proc Monitor, Service Monitor, Storage Monitor, PBS Monitor, Job Resource Monitor, Job Activity Monitor, Job Class/Profile Monitor, Driver Monitor, CPU State Monitor, Uptime Monitor
Action Takers: restart services, report success/fail, recycle drivers, reboot machine
Bookkeeping Database
Link: To/From Manager
Note: activity timing, running, staging, etc.
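One way the action takers named on this diagram could be combined is an escalation ladder: try the least disruptive action first and report success or failure for each attempt. The ordering, the function names, and the bookkeeping record format below are assumptions for illustration; the slides do not specify how actions are chosen.

```python
from typing import Callable, List, Optional, Tuple

# An action is a name plus a callable that returns True on success.
Action = Tuple[str, Callable[[str], bool]]


def mitigate(node: str,
             ladder: List[Action],
             bookkeeping: List[Tuple[str, str, str]]) -> Optional[str]:
    """Try each action in order until one succeeds; record every attempt."""
    for name, act in ladder:
        ok = act(node)
        bookkeeping.append((node, name, "success" if ok else "fail"))
        if ok:
            return name
    return None  # nothing worked; a real system would escalate to the manager


# Stand-ins for real action takers; the outcomes are hard-coded for the demo.
ladder: List[Action] = [
    ("restart_service", lambda node: False),
    ("recycle_driver", lambda node: True),
    ("reboot_machine", lambda node: True),
]

records: List[Tuple[str, str, str]] = []
result = mitigate("node07", ladder, records)
print(result)
print(records)
```

Recording each attempt, not only the final outcome, matches the stated need to keep a record of monitoring information and actions taken.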

General architecture

Head Node Functions – Final System

[Diagram: head-node components; labels listed below]
Coordinator
Monitors: IB Fabric Monitor, IPMI Monitor, Email Monitor, Dcache Monitor, IP Network Monitor, Phys Attr Monitor, User Proc Monitor, Service Monitor, Disk Monitor, Help Ticket Monitor
Other components: Archivers, Alarm Presenters, Email Senders, Job Scanner, Job Checker, Action Takers, Bookkeeping Database
Other labels: PBS, qstat, Database, Acct Log, Maui
Link: To/From Subordinates

Backup slides follow

Model-Based Approach

[Diagram: labels listed below]
Cluster Computing Resources/Grid
Actuator
Job Control
System Health Monitor: Status/Diagnostics/Prognostics
Fault Mitigation Engine
Monitoring
Dynamic Job Scheduling & Resource Allocation
System Planner / Global Manager
Resource Planner / Multi-Campaign Manager
Analysis Campaigns
Design / Modeling Environment


[Diagram: labels listed below; the Worker Functions – Final System diagram is embedded again here]
Actuator, Job Control, Engine, Monitoring, Dynamic Job Scheduling & Resource Allocation, Analysis Campaigns
System Health Monitoring: Open NMS / Aware
Fault handling, Process dataflow, Hardware Configuration
Modeling in GME
Run Time
Fault Mitigation Engines

Proposed Architecture



System Health Monitoring
– Cluster
– Campaign
– Application
– Leverage existing tools and standards
Mitigation
– Application
– Campaign
– Cluster
Both monitoring and mitigation must be synchronized across related jobs.
Design/Modeling environment for deploying campaigns, monitoring and mitigation policies.

Motivation

Jobs on the LQCD cluster are usually long-running and interdependent.
A failure on one node can have a domino effect on other nodes.
We cannot rely on job-level fault tolerance, because it:
– will be computationally expensive
– will cause a decrease in performance
– will make synchronization between related jobs difficult
We need a cluster-wide fault-tolerant framework that does monitoring and mitigation and is integrated with the scheduling framework.

Fault Mitigation Engine

Two mitigation schemes:
– Reflex-based scheme, e.g. relocate jobs, shut down jobs, rewire nodes
– Planning-based scheme, e.g. optimize the campaign, reschedule jobs
Mitigation schemes are integrated with the workflow and job scheduling system.
Constraints: deadlines, resource consumption.
Approach: model-based generators transform the designs into components and configurations for the runtime system, making sure that end users can flexibly modify the characteristics of the generated artifacts.
We will integrate with a tool for the definition of workflows, monitoring, and mitigation actions.
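The difference between the two schemes can be sketched in a few lines: a reflex is a fixed fault-to-action lookup, while planning reconsiders the whole set of jobs against deadlines and available nodes. The fault names, the `Job` fields, and the earliest-deadline-first heuristic are assumptions chosen for illustration; the slides do not say which planning algorithm is intended.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Reflex scheme: an immediate, table-driven response (hypothetical fault names).
REFLEXES = {
    "node_down": "relocate_jobs",
    "job_hung": "shutdown_job",
    "link_down": "rewire_nodes",
}


def reflex(fault: str) -> Optional[str]:
    return REFLEXES.get(fault)


@dataclass
class Job:
    name: str
    hours: int     # estimated remaining run time
    deadline: int  # hours from now by which the job must finish


def plan(jobs: List[Job], healthy_nodes: List[str]) -> List[Tuple[str, str, int, bool]]:
    """Planning scheme: earliest deadline first, onto the node free soonest.

    Returns (job, node, start_hour, meets_deadline) for every job.
    """
    if not healthy_nodes:
        raise ValueError("no healthy nodes to schedule on")
    free_at = {node: 0 for node in healthy_nodes}
    schedule = []
    for job in sorted(jobs, key=lambda j: j.deadline):
        node = min(free_at, key=free_at.get)
        start = free_at[node]
        free_at[node] = start + job.hours
        schedule.append((job.name, node, start, free_at[node] <= job.deadline))
    return schedule


jobs = [Job("a", 4, 10), Job("b", 6, 6), Job("c", 5, 8)]
print(reflex("node_down"))
for entry in plan(jobs, ["n1", "n2"]):
    print(entry)
```

Rerunning `plan` with fewer healthy nodes shows which jobs would miss their deadlines, which is the kind of information a campaign-level planner needs before rescheduling.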
