Cluster Reliability Cluster Reliability Project Project ISIS ISIS Vanderbilt University Vanderbilt University
Dec 30, 2015
Cluster Reliability ProjectCluster Reliability Project
ISISISIS
Vanderbilt UniversityVanderbilt University
PurposePurpose
Put together a set of software tools that Put together a set of software tools that allows us to easily create “pluggable” allows us to easily create “pluggable” components thatcomponents that– Monitor our clusters (hardware, utilization, Monitor our clusters (hardware, utilization,
errors, jobs, etc.)errors, jobs, etc.)– Recognizes anomalous conditions (inference Recognizes anomalous conditions (inference
rules, model comparison, probabilities)rules, model comparison, probabilities)– Take actions to correct or alleviate the Take actions to correct or alleviate the
problems (triggered by other components)problems (triggered by other components)
GoalsGoals
Increase the availability, utilization, and Increase the availability, utilization, and reliability of the computing clustersreliability of the computing clustersReduce the time it takes to diagnose Reduce the time it takes to diagnose problemsproblemsReduce the administrative workload Reduce the administrative workload associated with operating the clustersassociated with operating the clustersAutomate routine administration tasksAutomate routine administration tasksMonitor the use of the clustersMonitor the use of the clustersAllow other systems and user scripts to Allow other systems and user scripts to easily interact with this systemeasily interact with this system
Some basic features include:Some basic features include:
Identifying types of information that must be collected or Identifying types of information that must be collected or communicated,communicated,APIs for communicating information and the creation of monitors APIs for communicating information and the creation of monitors and reactors,and reactors,Popular programming language bindings,Popular programming language bindings,An environment for writing, testing, and releasing new reactors and An environment for writing, testing, and releasing new reactors and monitors,monitors,A basic set of problems that must be addressed, the monitors, A basic set of problems that must be addressed, the monitors, reactors and data that they will communicate,reactors and data that they will communicate,Recording of monitoring information and actions taken,Recording of monitoring information and actions taken,Administration tools that allow for single-point release distribution Administration tools that allow for single-point release distribution and installation, and control of the runtime environment,and installation, and control of the runtime environment,A configuration system that allows for uniform parameter setting and A configuration system that allows for uniform parameter setting and allows for tuning to adjust the performance impact on the system.allows for tuning to adjust the performance impact on the system.
ISIS Goals ISIS Goals
Monitor the health (performance, utilization, state) of all Monitor the health (performance, utilization, state) of all processors and networks in the system (leveraging processors and networks in the system (leveraging existing tools and standards).existing tools and standards).It should closely monitor performance and the status of a It should closely monitor performance and the status of a job, and work together with the workflow subsystem, job, and work together with the workflow subsystem, ensuring good progress for larger analysis campaigns ensuring good progress for larger analysis campaigns that are being conducted. that are being conducted. It should be coupled to the application. Mitigation It should be coupled to the application. Mitigation actions depend on the properties of the application and actions depend on the properties of the application and its overall workflow.its overall workflow.It should be integrated with workflow planning, to allow It should be integrated with workflow planning, to allow for resource optimization. This will include interactions for resource optimization. This will include interactions with real-time scheduling systems.with real-time scheduling systems.
ISIS ResearchISIS Research
Overall planned deliverablesOverall planned deliverables– Customizable Monitoring & Control framework.Customizable Monitoring & Control framework.– Mitigation engine.Mitigation engine.– Monitoring and mitigation design tool.Monitoring and mitigation design tool.– Monitoring and mitigation system generatorMonitoring and mitigation system generator
In this project we will address the critical difficulties in achieving a In this project we will address the critical difficulties in achieving a fault mitigation framework for a large cluster, which is configurable, fault mitigation framework for a large cluster, which is configurable, and strives to minimally affect the performance of the cluster.and strives to minimally affect the performance of the cluster.For this, we will utilize a model-based design approach, that uses For this, we will utilize a model-based design approach, that uses domain-specific modeling languages and model transformer to domain-specific modeling languages and model transformer to enable system design using domain-specific and higher-level enable system design using domain-specific and higher-level abstractions, and uses the monitoring and control framework.abstractions, and uses the monitoring and control framework.
Current activitiesCurrent activities
Strong desire to leverage existing infrastructures for Strong desire to leverage existing infrastructures for networked-system health monitoring.networked-system health monitoring.The evaluation criteria is based on:The evaluation criteria is based on:– Architecture: centralized vs. hierarchicalArchitecture: centralized vs. hierarchical– Monitoring: available sensors as well as configurability for new Monitoring: available sensors as well as configurability for new
sensors.sensors.– Scalability: cluster size vs. resource consumptionScalability: cluster size vs. resource consumption– Handling: smart data mining and virtual sensors.Handling: smart data mining and virtual sensors.– Programmability: configuration language.Programmability: configuration language.– Report Visualization.Report Visualization.
Currently, we are evaluating two packages:Currently, we are evaluating two packages:– Open NMSOpen NMS– AwareAware
Near-term major workNear-term major work
Nov/Dec 06: Nov/Dec 06: – Get all job information and some currently measured Get all job information and some currently measured
attributes into a databaseattributes into a database– Complete evaluation of productsComplete evaluation of products
Jan/Feb 07:Jan/Feb 07:– Additions to current code to record resource Additions to current code to record resource
utilization, IMPI information, errors, and actionsutilization, IMPI information, errors, and actions– Create test system for selected monitoring productCreate test system for selected monitoring product
Mar/Apr 07:Mar/Apr 07:– Replicating existing functionality in new frameworkReplicating existing functionality in new framework
Coordinator
IB HCAMonitor IPMI
MonitorIP Network
Monitor
Phys AttrMonitor
User Proc Monitor
ServiceMonitor
Storage Monitor
PBS Monitor
Job Resource Monitor
Worker Functions – Final System
Action Takers
To/From Manager
Bookeeping Database
Job Activity Monitor
Job Class/Profile Monitor
Driver Monitor CPU state
MonitorUptime Monitor
Restart services,Report success/fail,Recycle drivers,Reboot machine
Activity timing,Running, staging,etc.
General architectureGeneral architecture
Coordinator
ArchiversIB FabricMonitor
IPMIMonitor
EmailMonitor
AlarmPresenters
EmailSenders
DcacheMonitor
IP NetworkMonitor
Phys AttrMonitor
User Proc Monitor
ServiceMonitor
Disk Monitor
Help Ticket Monitor
Job Scanner
Job Checker
PBS
qstat
Database
Acct Log
Maui
Head Node Functions – Final System
Action Takers
To/From Subordinates
BookeepingDatabase
Backup slides followBackup slides follow
Model-Based ApproachModel-Based Approach
Cluster ComputingResources/Grid
Actuator
Job
Co
ntrol
System Health MonitorStatus/Diagnostics/
Prognostics FaultMitigation
Engine
Monitoring Dynamic
JobScheduling
&ResourceAllocation
System PlannerGlobal Manager
Resource PlannerMulti-Campaign Manager
AnalysisCampaigns
Design / ModelingEnvironment
Cluster ComputingResources/Grid
Actuator
Job
Co
ntrol
System Health MonitorStatus/Diagnostics/
Prognostics FaultMitigation
Engine
Monitoring Dynamic
JobScheduling
&ResourceAllocation
System PlannerGlobal Manager
Resource PlannerMulti-Campaign Manager
AnalysisCampaigns
Design / ModelingEnvironment
System Health Monitoring:
Open NMS /Aware
Fault handlingProcess dataflowHardware Configuration
Modeling in G
ME
Run T
ime
Coordinator
IB HCAMonitor IPMI
MonitorIP Network
Monitor
Phys AttrMonitor
User Proc Monitor
ServiceMonitor
Storage Monitor
PBS Monitor
Job Resource Monitor
Worker Functions – Final System
Action Takers
To/From Manager
Bookeeping Database
Job Activity Monitor
Job Class/Profile Monitor
Driver Monitor CPU state
MonitorUptime Monitor
Restart services,Report success/fail,Recycle drivers,Reboot machine
Activity timing,Running, staging,etc.
Fault M
itigation Engines
Proposed ArchitectureProposed Architecture
Cluster ComputingResources/Grid
Actuator
Job
Co
ntrol
System Health MonitorStatus/Diagnostics/
Prognostics FaultMitigation
Engine
Monitoring Dynamic
JobScheduling
&ResourceAllocation
System PlannerGlobal Manager
Resource PlannerMulti-Campaign Manager
AnalysisCampaigns
Design / ModelingEnvironment
Cluster ComputingResources/Grid
Actuator
Job
Co
ntrol
System Health MonitorStatus/Diagnostics/
Prognostics FaultMitigation
Engine
Monitoring Dynamic
JobScheduling
&ResourceAllocation
System PlannerGlobal Manager
Resource PlannerMulti-Campaign Manager
AnalysisCampaigns
Design / ModelingEnvironment
System Health Monitoring System Health Monitoring – ClusterCluster– CampaignCampaign– ApplicationApplication– Leverage existing tool Leverage existing tool
and standardand standardMitigationMitigation– ApplicationApplication– CampaignCampaign– ClusterCluster
Both monitoring and Both monitoring and mitigation must be mitigation must be synchronized across synchronized across related jobs.related jobs.
Design/Modeling Design/Modeling environment for deploying environment for deploying campaigns, monitoring and campaigns, monitoring and mitigation policies.mitigation policies.
MotivationMotivation
Jobs on LQCD cluster are usually long term and Jobs on LQCD cluster are usually long term and interdependent.interdependent.Failure on one node can have domino effect on Failure on one node can have domino effect on other nodes.other nodes.We cannot rely on job-level fault tolerance:We cannot rely on job-level fault tolerance:– as it will be computationally expensiveas it will be computationally expensive– will cause a decrease in performancewill cause a decrease in performance– will make synchronization between related jobs will make synchronization between related jobs
difficult.difficult.
We need a cluster-wide fault-tolerant framework that does We need a cluster-wide fault-tolerant framework that does monitoring and mitigation and is integrated with the monitoring and mitigation and is integrated with the
scheduling framework.scheduling framework.
Fault Mitigation EngineFault Mitigation Engine
Two mitigation schemes:Two mitigation schemes:– Reflex based scheme e.g. relocate jobs, shutdown jobs, rewire Reflex based scheme e.g. relocate jobs, shutdown jobs, rewire
nodes.nodes.– Planning based scheme e.g. optimize the campaign, reschedule Planning based scheme e.g. optimize the campaign, reschedule
jobsjobs
Mitigation schemes integrated with the workflow and job Mitigation schemes integrated with the workflow and job scheduling system.scheduling system.Constraints: deadlines, resources consumption.Constraints: deadlines, resources consumption.Approach: model-based generators to transform the Approach: model-based generators to transform the designs into components and configurations for the designs into components and configurations for the runtime system, making sure that end-users can flexibly runtime system, making sure that end-users can flexibly modify the characteristics of the generated artifactsmodify the characteristics of the generated artifactsWe will integrate with tool for definition of workflows, We will integrate with tool for definition of workflows, monitoring, and mitigation actions.monitoring, and mitigation actions.