Vijay Lakamraju Israel Koren C. Mani Krishna Low Overhead Fault Tolerant Networking (in Myrinet) Architecture and Real-Time Systems (ARTS) Lab. Department of Electrical and Computer Engineering University of Massachusetts Amherst MA 01003

Dec 20, 2015

Page 1

Vijay Lakamraju Israel Koren

C. Mani Krishna

Low Overhead Fault Tolerant Networking (in Myrinet)

Architecture and Real-Time Systems (ARTS) Lab. Department of Electrical and Computer Engineering

University of Massachusetts Amherst MA 01003

Page 2

Motivation

An increasing use of COTS components in systems has been motivated by the need to reduce design and maintenance cost and to reduce software complexity.

The emergence of low-cost, high-performance COTS networking solutions, e.g., Myrinet, SCI, and Fibre Channel.

The increasing complexity of network interfaces has renewed concerns about their reliability: the amount of silicon used has increased tremendously.

Page 3

The Basic Question

How can we incorporate fault tolerance into a COTS network technology without greatly compromising its performance?

Page 4

Microprocessor-based Networks

Most modern network technologies have processors in their interface cards that help to achieve superior network performance

Many of these technologies allow changes in the program running on the network processor

Such programmable interfaces offer numerous benefits: developing different fault tolerance techniques, validating fault recovery using fault injection, and experimenting with different communication protocols.

We use Myrinet as the platform for our study

Page 5

Myrinet

Myrinet is a cost-effective, high-performance (2.2 Gb/s) packet-switching technology

At its core is a powerful RISC processor

It is scalable to thousands of nodes

Low-latency communication (8 μs) is achieved through direct interaction with the network interface (“OS bypass”)

Flow control, error control and simple “heartbeat mechanisms” are incorporated in hardware

Link and routing specifications are public & standard

Myrinet support software is supplied “open source”

Page 6

Myrinet Configuration

[Figure: Myrinet configuration block diagram. Host node: host processor, system bridge, system memory, I/O bus. Interface card: PCI bridge, PCI DMA, DMA engine, host interface, RISC, packet interface, LANai SRAM, timers 0-2 (LANai 9), SAN/LAN conversion.]

Page 7

Hardware & Software

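[Figure: hardware and software layers of the programmable interface. Hardware: host processor and system memory, connected over the I/O bus to the Myrinet card with its network processor and local memory. Software: application, middleware (e.g., MPI), TCP/IP interface, OS driver, and the Myrinet Control Program.]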

Page 8

Susceptibility to Failures

Dependability evaluation was carried out using software-implemented fault injection. Faults were injected in the Myrinet Control Program (MCP).

A wide range of failures was observed: unexpected latencies and reduced bandwidth; the network processor can hang and stop responding; a host system can crash or hang; a remote network interface can be affected.

Similar types of failures can be expected from other high-speed networks

Such failures can greatly impact the reliability/availability of the system

Page 9

Summary of Experiments

More than 50% of the failures were host interface hangs

Failure Category             Count   % of Injections
Host Interface Hang            514   24.6
Messages Dropped/Corrupted     264   12.7
MCP Restart                     65    3.1
Host Computer Crash              9    0.43
Other Errors                    23    1.1
No Impact                     1205   57.9
Total                         2080   100
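The headline figure on this slide can be checked from the counts in the table. The following is a small sketch; the variable names are ours, and only the counts come from the slide. It treats every injection outside the "No Impact" row as a failure.

```python
# Counts per failure category, copied from the table on this slide.
counts = {
    "Host Interface Hang": 514,
    "Messages Dropped/Corrupted": 264,
    "MCP Restart": 65,
    "Host Computer Crash": 9,
    "Other Errors": 23,
    "No Impact": 1205,
}

total_injections = sum(counts.values())
# Injections that produced any observable failure.
total_failures = total_injections - counts["No Impact"]
hang_share = counts["Host Interface Hang"] / total_failures

print(total_injections)            # 2080
print(total_failures)              # 875
print(round(100 * hang_share, 1))  # 58.7
```

Under that reading, host interface hangs account for roughly 59% of the observed failures, which is consistent with the "more than 50%" statement.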

Page 10

Design Considerations

The faults must be detected and diagnosed as quickly as possible

The network interface must be up and running as soon as possible

The recovery process must ensure that no messages are lost or improperly received/sent. Complete correctness should be achieved.

The overhead on the normal running of the system must be minimal

The fault tolerance should be made as transparent to the user as possible

Page 11

Fault Detection

Continuously polling the card can be very costly

We use a spare interval timer to implement a watchdog timer functionality for fault detection

We set the LANai to raise an interrupt when the timer expires

A routine (L_timer) that the LANai is supposed to execute periodically resets this interval timer

If the interface hangs, then L_timer is not executed, causing our interval timer to expire and raising a FATAL interrupt
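The watchdog idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: time is modeled as discrete ticks, and `WatchdogTimer`, `run`, and the timeout and period values are hypothetical names and numbers chosen for illustration. None of it is the LANai or GM interface.

```python
class WatchdogTimer:
    """Toy model of an interval timer used as a watchdog (not the LANai API)."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.fatal_raised = False

    def reset(self):
        # Called from the periodic routine (L_timer on the slide).
        self.remaining = self.timeout_ticks

    def tick(self):
        # One unit of time passes; raise FATAL once the timer runs out.
        if self.fatal_raised:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.fatal_raised = True


def run(total_ticks, l_timer_period, hang_at=None, timeout_ticks=5):
    """Return the tick at which FATAL is raised, or None if it never is."""
    watchdog = WatchdogTimer(timeout_ticks)
    for now in range(1, total_ticks + 1):
        hung = hang_at is not None and now >= hang_at
        if not hung and now % l_timer_period == 0:
            watchdog.reset()  # healthy interface: L_timer runs on schedule
        watchdog.tick()
        if watchdog.fatal_raised:
            return now
    return None


print(run(total_ticks=50, l_timer_period=3))              # None
print(run(total_ticks=50, l_timer_period=3, hang_at=20))  # 22
```

As long as the periodic routine runs more often than the timeout, the timer never expires and the host pays no polling cost. Once the routine stops running, the interrupt follows within one timeout period.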

Page 12

Fault Recovery Summary

The FATAL interrupt signal is picked up by the fault recovery daemon on the host

The failure is verified through numerous probing messages

The control program is reloaded into the LANai SRAM

Any process that was accessing the board prior to the failure is also restored to its original state

Simply reloading the MCP will not ensure correctness

Page 13

Myrinet Programming Model

Flow control is achieved through send and receive tokens

Myrinet software (GM) provides reliable, in-order delivery of messages. A modified form of the “Go-Back-N” protocol is used. Sequence numbers for the protocol are provided by the MCP. One stream of sequence numbers exists per destination.
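For readers unfamiliar with Go-Back-N, the receiver side can be sketched as follows. This is the textbook protocol, not GM's modified variant, and `GoBackNReceiver` and the node names are hypothetical. The sketch shows one stream of sequence numbers per sender, in-order acceptance, and dropping of anything else.

```python
class GoBackNReceiver:
    """Accepts only the next expected sequence number from each sender."""

    def __init__(self):
        self.expected = {}   # sender id -> next expected sequence number
        self.delivered = []  # (sender, payload) handed to the application

    def receive(self, sender, seq, payload):
        """Return the cumulative ACK: the last in-order sequence number."""
        expected = self.expected.get(sender, 0)
        if seq == expected:
            self.delivered.append((sender, payload))
            self.expected[sender] = expected + 1
        # Out-of-order or duplicate packets are dropped; the ACK tells the
        # sender where to go back to.
        return self.expected.get(sender, 0) - 1


rx = GoBackNReceiver()
acks = [
    rx.receive("node-A", 0, "m0"),
    rx.receive("node-A", 2, "m2"),  # gap: m1 is missing, so m2 is dropped
    rx.receive("node-A", 1, "m1"),
    rx.receive("node-A", 2, "m2"),  # retransmission after going back
    rx.receive("node-A", 2, "m2"),  # duplicate, dropped
]
print(acks)          # [0, 0, 1, 2, 2]
print(rx.delivered)  # [('node-A', 'm0'), ('node-A', 'm1'), ('node-A', 'm2')]
```

Duplicate suppression depends entirely on the sequence-number state, which is why it matters where that state is kept.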

Page 14

Typical Control Flow

Sender:
User process prepares message
User process sets send token
LANai sdmas message
LANai sends message
LANai receives ACK
LANai sends event to process
User process handles notification event
User process reuses buffer

Receiver:
User process provides receive buffer
User process sets recv token
LANai recvs message
LANai sends ACK
LANai rdmas message
LANai sends event to process
User process handles notification event
User process reuses buffer

Page 15

Duplicate Messages

Sender:
User process prepares message
User process sets send token
LANai sdmas message
LANai sends message
LANai goes down (the ACK is lost)
Driver reloads MCP into board
Driver resends all unacked messages
LANai sdmas message
LANai sends message

Receiver:
User process provides receive buffer
User process sets recv token
LANai recvs message
LANai sends ACK
LANai rdmas message
LANai sends event to process
User process handles notification event
User process reuses buffer
LANai recvs message (duplicate message): ERROR!

Lack of redundant state information is the cause of this problem

Page 16

Lost Messages

Sender:
User process prepares message
User process sets send token
LANai sdmas message
LANai sends message
LANai receives ACK
LANai sends event to process
User process handles notification event
User process reuses buffer

Receiver:
User process provides receive buffer
User process sets recv token
LANai recvs message
LANai sends ACK
LANai goes down
Driver reloads MCP into board
Driver sets all recv tokens again
LANai waits for message: ERROR!

Incorrect commit point is the cause of this problem
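The commit-point problem can be reproduced in a toy model. The sketch below is an illustration under assumptions, not MCP code: `receive_one`, `message_lost`, and both flags are hypothetical names. It compares only the order of two steps, sending the ACK and DMAing the message to host memory, when the interface fails between them.

```python
def receive_one(ack_before_dma, crash_between_steps):
    """Toy model of a receiving interface handling one message.

    Returns (sender_saw_ack, host_has_message).
    """
    sender_saw_ack = False
    host_has_message = False

    def send_ack():
        nonlocal sender_saw_ack
        sender_saw_ack = True

    def dma_to_host():
        nonlocal host_has_message
        host_has_message = True

    steps = [send_ack, dma_to_host] if ack_before_dma else [dma_to_host, send_ack]
    steps[0]()
    if crash_between_steps:
        # Interface memory is lost here; the second step never runs.
        return sender_saw_ack, host_has_message
    steps[1]()
    return sender_saw_ack, host_has_message


def message_lost(sender_saw_ack, host_has_message):
    # The sender reuses its buffer once it sees the ACK, so an ACKed message
    # that never reached host memory cannot be recovered.
    return sender_saw_ack and not host_has_message


print(message_lost(*receive_one(ack_before_dma=True, crash_between_steps=True)))   # True
print(message_lost(*receive_one(ack_before_dma=False, crash_between_steps=True)))  # False
```

With the ACK sent first, a failure between the two steps leaves the sender believing the message was delivered while the host never sees it. With the DMA first, the worst case is an unacknowledged message that the sender still holds.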

Page 17

Fault Recovery

We need to keep a copy of the state information. Checkpointing can be a big overhead; logging critical message information is enough.

GM functions are modified so that:
A copy of the send tokens and the receive tokens is made with every send and receive call.
The host processes provide the sequence numbers, one per (destination node, local port) pair.
The copy of a send or receive token is removed when the send/receive completes successfully.

The MCP is modified: an ACK is sent out only after a message is DMAed to host memory.
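The host-side logging described on this slide might look roughly like the sketch below. It is not the GM API: `HostTokenLog` and its methods are hypothetical, and only the send side is modeled. The sketch shows the three ideas from the slide: sequence numbers assigned on the host per (destination node, local port) pair, a copy kept for every outstanding send, and the copy dropped on successful completion.

```python
class HostTokenLog:
    """Host-side copy of outstanding send tokens (a sketch, not the GM API)."""

    def __init__(self):
        self.next_seq = {}  # (destination node, local port) -> next sequence number
        self.pending = {}   # (destination node, local port, seq) -> payload

    def log_send(self, dest, port, payload):
        """Assign a sequence number on the host and keep a copy of the token."""
        seq = self.next_seq.get((dest, port), 0)
        self.next_seq[(dest, port)] = seq + 1
        self.pending[(dest, port, seq)] = payload
        return seq

    def complete(self, dest, port, seq):
        """Drop the copy once the send has completed successfully."""
        self.pending.pop((dest, port, seq), None)

    def replay(self):
        """After the control program is reloaded: unfinished sends, oldest first."""
        return sorted(self.pending.items())


log = HostTokenLog()
s0 = log.log_send("node-B", 2, "m0")
s1 = log.log_send("node-B", 2, "m1")
log.complete("node-B", 2, s0)  # m0 was acknowledged before the failure
# ... interface hangs, control program is reloaded ...
print(log.replay())  # [(('node-B', 2, 1), 'm1')]
```

Because the log lives in host memory, it survives a reload of the control program. A resent message keeps its original sequence number, which presumably lets a receiver that has already delivered it recognize the retransmission as a duplicate.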

Page 18

Performance Impact

The scheme has been integrated successfully into GM. The complete implementation took over 1 man-year.

How much of the performance of the system has been compromised? After all, one can’t get a free lunch these days!

Performance is measured using two key parameters: bandwidth obtained with large messages and latency of small messages.

Page 19

Latency

Page 20

Bandwidth

Page 21

Summary of Results

Performance Metric                 GM          FTGM
Bandwidth                          92.4 MB/s   92 MB/s
Latency                            11.5 μs     13.0 μs
Host-CPU utilization for send      0.3 μs      0.55 μs
Host-CPU utilization for receive   0.75 μs     1.15 μs
LANai-CPU utilization              6.0 μs      6.8 μs

Host Platform: Pentium III with 256 MB, Red Hat Linux 7.2
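The relative cost of the fault-tolerant version (FTGM) against plain GM follows from the table by simple arithmetic. This is a minimal sketch; the dictionary keys and the helper `relative_change` are our own names, and the numbers are the (GM, FTGM) pairs from the slide.

```python
# Values from the results table: (GM, FTGM).
metrics = {
    "bandwidth": (92.4, 92.0),
    "latency_us": (11.5, 13.0),
    "host_send_us": (0.30, 0.55),
    "host_recv_us": (0.75, 1.15),
    "lanai_us": (6.0, 6.8),
}


def relative_change(gm, ftgm):
    """Change of FTGM relative to GM, in percent."""
    return 100.0 * (ftgm - gm) / gm


for name, (gm, ftgm) in metrics.items():
    print(f"{name}: {relative_change(gm, ftgm):+.1f}%")
```

Bandwidth drops by about 0.4%, while small-message latency rises by about 13%. The per-message host overheads grow the most in relative terms, but they remain well under a microsecond for send and just over one microsecond for receive.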

Page 22

Summary of Results

Fault Detection Latency = 50 ms
Fault Recovery Latency = 0.765 s
Per-Process Latency = 0.50 s

Page 23

Our Contributions

We have devised smart ways to detect and recover from network interface failures

Our fault detection technique for “network processor hangs” uses software-implemented watchdog timers

Fault recovery time (including reloading of the network control program) is about 2 seconds

Performance impact is under 1% for messages over 1KB

Complete user transparency was achieved