Page 1: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

CIFTS: Coordinated Infrastructure for Fault Tolerant Systems

Page 2: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Agenda

• The Problem and the purpose

• The CIFTS framework

• The CIFTS team

• Getting Involved

Page 3: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Current HPC Systems

*System        CPUs     Reliability
ASCI-Q         8,192    MTBI: 6.5 hrs (storage, CPU, memory)
ASCI-W         8,192    MTBF: 5 hrs (’01) and 40 hrs (’03) (storage, CPU, 3rd party HW)
PSC Lemieux    3,016    MTBI: 9.7 hrs
Google         15,000   20 reboots/day; 2-3% machines replaced/year (storage, memory)

*“A Power-aware Run-Time System for High-Performance Computing”, Chung-hsing Hsu and Wu-chun Feng, IEEE International Supercomputing Conference (SC), 2005

• Top 500 statistics

– Performance growth: 35.86 TF/s (2002) to 280 TF/s (2007)

– Average node count growth: 128-258 (2002) to 1024-2048 (2007)

Page 4: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Downtime Cost

*Service                       Cost of One Hour Downtime
Brokerage Operations           $6,450,000
Credit Card Authorization      $2,600,000
eBay                           $225,000
Amazon                         $180,000
Package Shipping Services      $150,000
Home Shopping Channel          $113,000
Catalog Sales Center           $90,000

*“A Power-aware Run-Time System for High-Performance Computing”, Chung-hsing Hsu and Wu-chun Feng, IEEE International Supercomputing Conference (SC), 2005

“Faults directly impact system downtime and TCO”

Page 5: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Fault Tolerance in HPC

• Available for some HPC components

– Storage (RAID variations) and File Systems (dCache, Tera Grid FS, Panasas, IBRIX, BulkFS)

– Checkpointing software (application checkpointing ex: BLCR, Condor; operating system checkpointing ex: TICK)

– Software built using hardware technologies like lmsensors, OpenMPI, BMC and other monitoring software like Ganglia

– Middleware (FT-MPI, MPICH-V, FE-MPI, FT ARMCI)

Components mostly deal with faults on an individual basis!

Sharing of fault information globally is missing!

Page 6: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

A typical scenario

1. Job Scheduler launches MPI Job 1.

2. MPI application (job1) detects “communication failure” with node X.

3. MPI aborts! Application aborts!

4. Job Scheduler launches MPI Job 2.

5. More failures.

Other software on the cluster is unaware of this MPI job failure. Other software is also unaware of the reason for the MPI job failure!

Page 7: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

The CIFTS Framework

[Diagram: the Fault Tolerant Backplane connects the following software.]

• System components, libraries and applications
– Linear Algebra Libraries
– HPC Middleware
– Job Scheduler/Resource manager
– File Systems
– Operating systems
– Networking libraries
– System Monitoring software
– System Management hardware
– Applications

• Autonomics
– Universal Logger
– Automatic Actions
– Diagnostics Tools
– Event Analysis

Page 8: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

CIFTS - Usage Scenario

[Diagram: an I/O node fails and each component reacts to the shared fault information.]

• Parallel FS: I/O node failure, file system down. The file system shares this information.

• Job Scheduler: launches jobs with the NFS file system; migrates existing jobs

• MPI-IO: prints a coherent error message

• Applications: each checkpoints itself

Page 9: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

CIFTS - Usage Scenario

[Diagram: a hardware sensor reports a problem and each component reacts to the shared fault information.]

• Hardware sensor: detects increasing disk temperature on Node X. The sensor shares this knowledge.

• Job Scheduler: does not launch jobs on Node X until further diagnosis

• Diagnostics Utility: runs scripts for further root-causing

• MPI: starts checkpointing

• Parallel FS: prepares for I/O data migration from Node X

• Application: starts checkpointing

Page 10: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Lifecycle of a component interaction with FTB

1. Register with FTB

2. Subscribe for events

3. Publish events

4. Deregister from FTB

[Diagram: component instances performing steps 1 to 4 against the distributed Fault Tolerant Backplane.]
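The four lifecycle steps can be pictured as a small state machine. The sketch below is only an illustration under assumptions: the `lc_*` names are hypothetical and are not part of the FTB API, and the rule that subscribing or publishing fails before registration is assumed rather than stated on the slide.

```c
/* Hypothetical lifecycle model; none of these names come from FTB. */
typedef enum { LC_UNREGISTERED, LC_REGISTERED } lc_state_t;

typedef struct {
    lc_state_t state;
    int subscriptions;  /* active subscriptions */
    int published;      /* events published so far */
} lc_component_t;

static void lc_init(lc_component_t *c)
{
    c->state = LC_UNREGISTERED;
    c->subscriptions = 0;
    c->published = 0;
}

/* Step 1: register. Returns -1 if the component is already registered. */
static int lc_register(lc_component_t *c)
{
    if (c->state != LC_UNREGISTERED)
        return -1;
    c->state = LC_REGISTERED;
    return 0;
}

/* Step 2: subscribe. Assumed to require prior registration. */
static int lc_subscribe(lc_component_t *c)
{
    if (c->state != LC_REGISTERED)
        return -1;
    c->subscriptions++;
    return 0;
}

/* Step 3: publish. Assumed to require prior registration. */
static int lc_publish(lc_component_t *c)
{
    if (c->state != LC_REGISTERED)
        return -1;
    c->published++;
    return 0;
}

/* Step 4: deregister. Drops all subscriptions in this model. */
static int lc_deregister(lc_component_t *c)
{
    if (c->state != LC_REGISTERED)
        return -1;
    c->state = LC_UNREGISTERED;
    c->subscriptions = 0;
    return 0;
}
```

A component would call `lc_init` once, then `lc_register`, any number of `lc_subscribe` and `lc_publish` calls, and finally `lc_deregister`.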

Page 11: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Delving deeper in FTB framework

[Diagram: a network of interconnected FTB Agents. Component instances register with an agent, subscribe to a set of events, and publish events.]

Page 12: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

FTB Internal Architecture Layers

[Diagram: two software stacks, running on Linux, BGL and CRAY.]

• Component software stack: Component 1 to Component n, FTB Client API, Client Library, Manager Library, Network (Network Module 1, Network Module 2)

• FTB Agent software stack: FTB Agent, FTB Manager API, Manager Library, Network (Network Module 1, Network Module 2)

Page 13: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

What you need to know!

Just the FTB Client API

[Diagram: the component and FTB Agent software stacks from the previous page.]

Page 14: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

CIFTS API* Snapshot

• FTB_Init (IN FTB_comp_info_t *comp_info, OUT FTB_client_handle_t *client_handle, OUT char *error_msg)

• FTB_Publish_event (IN FTB_client_handle_t handle, IN char *event_name, IN FTB_event_data_t *datadetails, OUT char *error_msg)

• FTB_Create_mask (INOUT FTB_event_mask_t *evt_mask, IN char *field_name, IN char *field_val, OUT char *error_msg)

• FTB_Subscribe (IN FTB_client_handle_t chandle, IN FTB_event_mask_t *event_mask, OUT FTB_subscribe_handle_t *shandle, OUT char *error_msg, IN int (*callback)(OUT FTB_catch_event_info_t *, OUT void*), IN void *arg)

• FTB_Poll_for_event (IN FTB_subscribe_handle_t shandle, OUT FTB_catch_event_info_t *catch_event, OUT char *error_msg)

• FTB_Finalize (IN FTB_client_handle_t handle)

*Work in progress
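FTB_Subscribe accepts an optional callback, and FTB_Poll_for_event reads events from a subscription handle, which suggests two delivery styles: push (callback) and pull (polling). The sketch below illustrates that distinction only. Every `dl_*` name is hypothetical, the fixed-size queue is an assumption, and nothing here reflects how FTB is actually implemented.

```c
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the FTB types. */
typedef struct { const char *name; } dl_event_t;
typedef int (*dl_callback_t)(const dl_event_t *event, void *arg);

#define DL_QUEUE_CAP 8  /* assumed queue size, chosen for illustration */

typedef struct {
    dl_callback_t callback;          /* NULL means polling style */
    void *arg;                       /* passed back to the callback */
    dl_event_t queue[DL_QUEUE_CAP];  /* ring buffer used only when polling */
    int head;
    int len;
} dl_subscription_t;

static void dl_subscribe(dl_subscription_t *s, dl_callback_t cb, void *arg)
{
    s->callback = cb;
    s->arg = arg;
    s->head = 0;
    s->len = 0;
}

/* Deliver one event: invoke the callback if there is one, otherwise queue
 * the event for a later poll. Returns -1 if the queue is full. */
static int dl_deliver(dl_subscription_t *s, const dl_event_t *e)
{
    if (s->callback != NULL)
        return s->callback(e, s->arg);
    if (s->len == DL_QUEUE_CAP)
        return -1;
    s->queue[(s->head + s->len) % DL_QUEUE_CAP] = *e;
    s->len++;
    return 0;
}

/* Poll: copy the oldest queued event into *out.
 * Returns 1 if an event was returned, 0 if the queue was empty. */
static int dl_poll(dl_subscription_t *s, dl_event_t *out)
{
    if (s->len == 0)
        return 0;
    *out = s->queue[s->head];
    s->head = (s->head + 1) % DL_QUEUE_CAP;
    s->len--;
    return 1;
}

/* Example callback: counts events through the user-supplied argument. */
static int dl_count_callback(const dl_event_t *event, void *arg)
{
    (void)event;
    ++*(int *)arg;
    return 0;
}
```

In this model, passing NULL as the callback corresponds to the polling style, where events wait until the component asks for them.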

Page 15: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

FTB-enabled Software -- Planned

[Diagram: software planned to connect to the Fault Tolerant Backplane.]

BLCR, FT-LA, SWIM IPS, LAMMPS, OpenMPI, PVFS, MPICH2, MVAPICH2, LAM/MPI, Cobalt, ScaLAPACK, ROMIO, NWChem, ZeptoOS, CCA Applications

Page 16: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Status

• Alpha version in progress

– Demos available on SC exhibit floor

• Client API to be finalized by Q4 CY07

• Beta release, targeted Q1 CY08

– Platforms supported: Linux clusters, IBM BGL, Cray XT

Page 17: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

CIFTS team

• Argonne National Laboratory
– Pete Beckman, Rinku Gupta, Ewing Lusk, Rob Ross, Rajeev Thakur

• Indiana University
– Andrew Lumsdaine

• Lawrence Berkeley National Laboratory
– Paul Hargrove

• Oak Ridge National Laboratory
– Al Geist, David Bernholdt, Pratul Agarwal, Scott Hampton, Byung-Hoon Park, Aniruddha Shet

• Ohio State University
– D.K. Panda

• University of Tennessee, Knoxville
– Jack Dongarra

Page 18: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Call for Action

[Diagram: the Fault Tolerant Backplane surrounded by the following software.]

BLCR, FT-LA, SWIM IPS, PBS/Pro, LAMMPS, OpenMPI, Lustre, PVFS, Scali MPI, Global Arrays, Intel MPI, MPICH2, Polyserv, GPFS, GFS, IBRIX, MVAPICH2, MPICH-MX, Panasas, LAM/MPI, Other Applications, SGE, MAUI, Condor, LSF, Cobalt, Intel MKL, ScaLAPACK, ROMIO, SLURM, NWChem, Fluent, MM5, LS-Dyna, ZeptoOS, Linux, Eclipse, BLAST, Star-CD

Page 19: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Need more information?

• SC’07 Exhibit floor
– Demos and/or talks at ANL, ORNL and LBNL booths

• CIFTS website
– http://www.mcs.anl.gov/research/cifts/

• CIFTS wiki
– http://wiki.mcs.anl.gov/cifts

• CIFTS mailing list
– [email protected]

Page 20: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Discussion Topics

• Need for CIFTS infrastructure in enterprise environments

• Requirements/constraints for adoption of CIFTS?

• …..

Page 21: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Backup

Page 22: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

CIFTS - The working view

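[Diagram: components connected through CIFTS, together with a Bootstrap Server.]

• Libraries and Applications
– Middleware like MPI, MPI-IO
– Linear Algebra Libraries

• System Components
– Checkpoint Restart System
– PVFS
– Resource Manager/JS

• Autonomics
– Universal Logger
– Automatic Actions
– Diagnostics Tools
– Event Analysis

• Bootstrap Server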

Page 23: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Building an FTB-enabled sample component

1. List the events you may want to publish in an XML file (for convenience)

2. Use the API to make the component FTB-enabled

3. Publish and subscribe to events

Page 24: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

FTB-Enabled Component Development (Step 1)

STEP 1: Create an XML file, outlining the publishable events

<ftb_component_details>
  <namespace>ftb.ftb_examples.watchdog</namespace>
  <publish_event>
    <event_name>WATCH_DOG_EVENT</event_name>
    <event_severity>Info</event_severity>
    <event_desc>This event is used by watchdog</event_desc>
  </publish_event>
  <publish_event>
    …
  </publish_event>
</ftb_component_details>

Page 25: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

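Developing an FTB-enabled component (Step 2)

STEP 2: Enabling your FTB component

#include <string.h>
#include "libftb.h"
#include "ftb_event_def.h"
#include "ftb_throw_events.h"

int main (int argc, char *argv[])
{
    /* Declarations are not shown on the slide; the types follow the API snapshot. */
    FTB_comp_info_t cinfo;
    FTB_client_handle_t handle;
    FTB_event_mask_t mask;
    FTB_subscribe_handle_t shandle;
    FTB_catch_event_info_t caught_event;
    FTB_event_data_t *publish_event_data = NULL; /* assumption: no extra event data */
    char err_msg[1024];                          /* assumption: buffer size */

    /* Describe this component instance, then register with FTB. */
    strcpy(cinfo.comp_namespace, "FTB.FTB_EXAMPLES.Watchdog");
    strcpy(cinfo.schema_ver, "0.5");
    strcpy(cinfo.inst_name, "watchdog");
    strcpy(cinfo.jobid, "watchdog-111");
    strcpy(cinfo.catch_style, "FTB_POLLING_CATCH");
    FTB_Init(&cinfo, &handle, err_msg);

    /* Register the events this component may publish. */
    FTB_Register_publishable_events(handle, ftb_ftb_examples_watchdog_events,
                                    FTB_FTB_EXAMPLES_WATCHDOG_TOTAL_EVENTS, err_msg);

    /* Subscribe to all events; no callback is passed because events are polled. */
    FTB_Create_mask(&mask, "all", "init", err_msg);
    FTB_Subscribe(handle, &mask, &shandle, err_msg, NULL, NULL);

    /* Publish one event, then poll for an event on the subscription. */
    FTB_Publish_event(handle, "WATCH_DOG_EVENT", publish_event_data, err_msg);
    FTB_Poll_for_event(shandle, &caught_event, err_msg);

    /* Deregister from FTB. */
    FTB_Finalize(handle);
    return 0;
}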

Page 26: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Developing an FTB-enabled component (Step 2, contd.)

Creating your subscribe event mask

Create a mask to catch all events
1. FTB_Create_mask(&mask, "all", "init", err_msg);

Create a mask to catch "WATCH_DOG_EVENT"
1. FTB_Create_mask(&mask, "all", "init", err_msg);
2. FTB_Create_mask(&mask, "event_name", "WATCH_DOG_EVENT", err_msg);

Create a mask to catch events of severity fatal
1. FTB_Create_mask(&mask, "all", "init", err_msg);
2. FTB_Create_mask(&mask, "severity", "FTB_FATAL", err_msg);
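The call pattern above (reset with "all"/"init", then narrow by one field) can be modeled in a few lines. This is a minimal sketch under assumptions, not the FTB implementation: the `sketch_*` types and functions are hypothetical, and the matching rule (an unset field matches anything, a set field must match exactly) is assumed for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT the FTB types. */
typedef struct {
    char event_name[64];  /* empty string = match any event name */
    char severity[32];    /* empty string = match any severity */
} sketch_mask_t;

typedef struct {
    const char *event_name;
    const char *severity;
} sketch_event_t;

/* Mirrors the slide's usage: ("all", "init") resets the mask so that it
 * matches everything; a named field then narrows it.
 * Returns 0 on success, -1 for an unknown field. */
static int sketch_mask_set(sketch_mask_t *m, const char *field, const char *val)
{
    if (strcmp(field, "all") == 0 && strcmp(val, "init") == 0) {
        m->event_name[0] = '\0';
        m->severity[0] = '\0';
        return 0;
    }
    if (strcmp(field, "event_name") == 0) {
        snprintf(m->event_name, sizeof m->event_name, "%s", val);
        return 0;
    }
    if (strcmp(field, "severity") == 0) {
        snprintf(m->severity, sizeof m->severity, "%s", val);
        return 0;
    }
    return -1;
}

/* Returns 1 if the event passes every field set in the mask, else 0. */
static int sketch_mask_matches(const sketch_mask_t *m, const sketch_event_t *e)
{
    if (m->event_name[0] != '\0' && strcmp(m->event_name, e->event_name) != 0)
        return 0;
    if (m->severity[0] != '\0' && strcmp(m->severity, e->severity) != 0)
        return 0;
    return 1;
}
```

Starting every mask from the "all"/"init" reset keeps the narrowing steps independent of whatever the mask held before, which is presumably why each recipe on the slide begins with that call.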

Page 27: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Developing an FTB-enabled component (Step 3)

STEP 3: Provide options to end user to compile your code with FTB

• Modify configure.in and makefiles, so that you can compile your code

• ./configure --with-ftb=<PATH to FTB install directory>

Page 28: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Setting up FTB environment

Compiling FTB

• Download FTB

1. ./configure --with-platform=linux --with-bstrap-name=hostname

2. make

3. make install

Page 29: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Using FTB

Starting FTB

1. ./ftb_database_server

2. ./ftb_agent on all Linux nodes

3. Run your component executables

Connection Topology

[Diagram: FTB Agents connecting to a Bootstrap DB server.]

• Agent contacts server

• BS-Server provides parent address
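The slide states only that each agent contacts the bootstrap server and receives a parent address. One way to picture this is a server that hands out parents so that the agents form a tree. The sketch below assumes exactly that: the `sketch_bootstrap_*` names, the fixed fan-out, and the assignment rule are all hypothetical and are not taken from FTB.

```c
#define SKETCH_MAX_AGENTS 64
#define SKETCH_FANOUT 2  /* assumed children per agent; not from the slides */

/* Hypothetical bootstrap-server state: the parent id of every agent. */
typedef struct {
    int parent[SKETCH_MAX_AGENTS];  /* -1 marks the root agent */
    int count;                      /* agents registered so far */
} sketch_bootstrap_t;

static void sketch_bootstrap_init(sketch_bootstrap_t *bs)
{
    bs->count = 0;
}

/* An agent contacts the server. The server assigns it the next id and
 * writes the id of its parent to *parent_out (-1 for the first agent).
 * Returns the new agent's id, or -1 if the table is full. */
static int sketch_bootstrap_join(sketch_bootstrap_t *bs, int *parent_out)
{
    if (bs->count >= SKETCH_MAX_AGENTS)
        return -1;
    int id = bs->count++;
    /* Fill the tree level by level: agent k's parent is (k - 1) / fan-out. */
    int parent = (id == 0) ? -1 : (id - 1) / SKETCH_FANOUT;
    bs->parent[id] = parent;
    *parent_out = parent;
    return id;
}
```

With a fan-out of 2, the first agent becomes the root, the next two attach to it, and the two after that attach to the second agent. A real deployment would hand out network addresses rather than integer ids.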

Page 30: CIFTS Coordinated Infrastructure for Fault Tolerant Systems.

Open Issues

We don’t know the answers to these questions, so we should not be discussing them in the BOF?

• Policy management

– Global knowledge of component prioritization for handling events

• How can components announce their FT capabilities?

• How can components request action from other components?

• How do we establish scoping of events?