Presenter: Jyun-Yan Li Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors Pramod Subramanyan, Virendra.

Paper Report

Presenter: Jyun-Yan Li

Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors

Pramod Subramanyan, Virendra Singh
Supercomputer Education and Research Center, Indian Institute of Science, Bangalore, India

Kewal K. Saluja
Electrical and Computer Engg. Dept., University of Wisconsin-Madison, Madison, WI

Erik Larsson
Dept. of Computer and Info. Science, Linkoping University, Linkoping, Sweden

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010

Cite count: 16

2

Continued CMOS scaling is expected to make future microprocessors susceptible to transient faults, hard faults, manufacturing defects and process variations causing fault tolerance to become important even for general purpose processors targeted at the commodity market.

To mitigate the effect of decreased reliability, a number of fault-tolerant architectures have been proposed that exploit the natural coarse-grained redundancy available in chip multiprocessors (CMPs). These architectures execute a single application using two threads, typically as one leading thread and one trailing thread. Errors are detected by comparing the outputs produced by these two threads. These architectures schedule a single application on two cores or two thread contexts of a CMP.

Abstract – part1

3

As a result, besides the additional energy consumption and performance overhead that is required to provide fault tolerance, such schemes also impose a throughput loss. Consequently a CMP which is capable of executing 2n threads in non-redundant mode can only execute half as many (n) threads in fault-tolerant mode.

In this paper we propose multiplexed redundant execution (MRE), a low-overhead architectural technique that executes multiple trailing threads on a single processor core. MRE exploits the observation that it is possible to accelerate the execution of the trailing thread by providing execution assistance from the leading thread.

Abstract – part2

4

Execution assistance combined with coarse-grained multithreading allows MRE to schedule multiple trailing threads concurrently on a single core with only a small performance penalty. Our results show that MRE increases the throughput of fault-tolerant CMP by 16% over an ideal dual modular redundant (DMR) architecture.

Abstract – part3

5

- Chip multiprocessors (CMPs) have become the main vehicle for performance growth
- They are susceptible to soft errors, wear-out related permanent faults …
- Redundant execution uses 2 cores or thread contexts of the CMP to run a single program
  - Throughput loss
    - The throughput of the CMP drops to half
  - System cost
    - Cooling, energy and maintenance cost

What’s the problem

6

Related work

- AR-SMT [22], SRT [21]: use SMT to detect transient faults. The leading thread stores its results in a delay buffer, and the trailing thread re-executes and compares the results
- CRT [18]
- SRTR [29]: SRT with recovery added
- CRTR [13]: CRT with recovery added
- Razor [11]: replicates critical pipeline registers and compares them to detect errors
- Power efficient redundant execution [26]: dynamic frequency and voltage scaling to reduce power
- This paper: Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors

7

Input replication
- Issue the same input values to both threads
- Optimize the trailing thread
  - Load Value Queue (LVQ) for loaded data
  - Branch Outcome Queue (BOQ) for instruction fetch

Output comparator
- Verify the results of both threads before they are forwarded to the rest of the system
- The store queue holds store data back until it has been compared

Chip-level Redundant Thread (CRT)
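The two CRT mechanisms above can be illustrated with a minimal Python sketch. It is not the hardware design from the paper: the class `LoadValueQueue`, the function `compare_stores`, the exception `FaultDetected`, and the representation of a store as an `(address, value)` tuple are all assumptions made for illustration.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when the two redundant threads disagree (hypothetical name)."""


class LoadValueQueue:
    """FIFO of (address, value) pairs filled by the leading thread."""

    def __init__(self):
        self._entries = deque()

    def push(self, addr, value):
        # Leading thread: record the value it loaded.
        self._entries.append((addr, value))

    def pop(self, addr):
        # Trailing thread: take the replicated value instead of reading memory.
        lead_addr, value = self._entries.popleft()
        if lead_addr != addr:
            raise FaultDetected(f"load address mismatch: {lead_addr} vs {addr}")
        return value


def compare_stores(leading_stores, trailing_stores):
    """Return the leading stores that may be released to the memory system.

    A store is released only after the trailing thread has produced the same
    (address, value) pair. Leading stores without a trailing counterpart yet
    are simply not released.
    """
    released = []
    for lead, trail in zip(leading_stores, trailing_stores):
        if lead != trail:
            raise FaultDetected(f"store mismatch: {lead} vs {trail}")
        released.append(lead)
    return released
```

In this sketch both threads see identical load values because the trailing thread never reads memory itself, and nothing leaves the pair of threads until both agree on it.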

8

Multiplexed Redundant Execution (MRE)
- Cores are logically partitioned into pools
  - Leading core pool and trailing core pool: execute applications that require fault tolerance
  - Third pool: runs non-redundant applications
- Chunk: a segment of the application's execution
  - The leading core sends a message to the trailing core
  - The trailing core pushes the request into its Run Request Queue (RRQ)

Proposal method
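The chunk hand-off can be sketched as follows, assuming one message per chunk. The names `RunRequest`, `LeadingCore`, `TrailingCore` and `announce_chunk` are hypothetical, and the exact moment at which the leading core sends its message is not stated on the slide, so the sketch leaves it to the caller.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRequest:
    """Message from a leading core asking for a chunk to be re-executed."""
    thread_id: int
    chunk_id: int


class TrailingCore:
    def __init__(self):
        self.rrq = deque()  # Run Request Queue, kept in arrival order

    def receive(self, request):
        # The trailing core pushes each incoming request into its RRQ.
        self.rrq.append(request)


class LeadingCore:
    def __init__(self, thread_id, trailing_core):
        self.thread_id = thread_id
        self.trailing_core = trailing_core
        self.next_chunk = 0

    def announce_chunk(self):
        # Send one message per chunk to the paired trailing core.
        request = RunRequest(self.thread_id, self.next_chunk)
        self.trailing_core.receive(request)
        self.next_chunk += 1
        return request
```

Two leading cores sharing one trailing core simply interleave their requests in the same RRQ, which is what allows a single core to serve several trailing threads.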

9

Input Replication
- Use the LVQ to accelerate the trailing thread's loads
  - After a load instruction retires, the leading thread transfers the loaded value to the trailing core's LVQ
  - The trailing thread reads its load data from the LVQ
  - Eliminates cache misses in the trailing thread
- Use the BOQ to eliminate mispredictions
  - The leading thread's branch outcome is stored in the BOQ after the branch instruction retires
  - The trailing thread takes its branch predictions from the BOQ
  - Eliminates branch mispredictions in the trailing thread

Interconnect
- Cores are divided into clusters, and the cores of a cluster are connected by a bus interconnect

MRE-enabled processor
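How the two queues assist the trailing thread can be sketched with a toy instruction stream. Everything here is an assumption made for illustration: the two-operation "ISA" (`load`, `branch_if_odd`), the functions `run_leading` and `run_trailing`, and the use of plain Python deques for the LVQ and BOQ.

```python
from collections import deque


def run_leading(program, memory, lvq, boq):
    """Execute the program, forwarding load values and branch outcomes."""
    acc = 0
    for op, arg in program:
        if op == "load":
            value = memory[arg]      # real memory access (may miss in cache)
            lvq.append(value)        # forwarded when the load retires
            acc += value
        elif op == "branch_if_odd":
            taken = acc % 2 == 1
            boq.append(taken)        # forwarded when the branch retires
            if taken:
                acc += arg
    return acc


def run_trailing(program, lvq, boq):
    """Re-execute the program using only the LVQ and BOQ as inputs."""
    acc = 0
    for op, arg in program:
        if op == "load":
            acc += lvq.popleft()     # no memory access, so no cache miss
        elif op == "branch_if_odd":
            predicted = boq.popleft()
            actual = acc % 2 == 1
            if predicted != actual:
                # With correct outcomes supplied, a mismatch hints at a fault.
                raise RuntimeError("trailing thread mispredicted")
            if actual:
                acc += arg
    return acc
```

The trailing thread never touches `memory`, which is the point of input replication: it sees exactly the values the leading thread saw and follows the same control flow.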

10

MRE-enabled processor architecture (figure annotations)
- Run Request Queue: filled by the leading thread; the trailing core takes its chunks from it
- Branch Outcome Queue: filled by the leading thread; the trailing core takes its branch outcomes from it
- Load Value Queue: filled by the leading thread; the trailing core takes its load values from it
- Fingerprints are exchanged with the other core to detect faults

11

Using coarse-grained multithreading to execute the redundant threads
- The trailing core executes the chunks requested in the RRQ
  - The RRQ is a FIFO queue

Coarse-grained multithreading implementation
- The visible state of every trailing thread is stored in the trailing core
  - Registers are needed to hold these states

[Figure: trailing core with a storage space holding the registers of each thread; on a thread switch, the new thread's state is copied into the core's registers]
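The scheduling loop described on this slide can be sketched as below. It is only a model under stated assumptions: the class `MultiplexedTrailingCore`, the single `acc` register, and the representation of a chunk as a list of integers to add are all hypothetical.

```python
from collections import deque


class MultiplexedTrailingCore:
    """Trailing core that serves several trailing threads from one RRQ."""

    def __init__(self):
        self.rrq = deque()      # FIFO of (thread_id, chunk) requests
        self.saved_state = {}   # per-thread storage space for visible state
        self.regs = None        # registers of the thread currently on the core
        self.current = None

    def request(self, thread_id, chunk):
        self.rrq.append((thread_id, chunk))

    def run_next(self):
        thread_id, chunk = self.rrq.popleft()
        if thread_id != self.current:
            # Thread switch: save the outgoing state, copy in the new one.
            if self.current is not None:
                self.saved_state[self.current] = dict(self.regs)
            self.regs = dict(self.saved_state.get(thread_id, {"acc": 0}))
            self.current = thread_id
        for value in chunk:     # "execute" the chunk
            self.regs["acc"] += value
        return thread_id, self.regs["acc"]
```

Switching happens only at chunk boundaries, which is what makes the multithreading coarse-grained: the state copy is paid once per chunk rather than once per instruction.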

12

Sharing the LVQ and BOQ for maximum utilization
- Sections are allocated dynamically, on demand
- Free queue: a list of unused sections
  - Shared by all threads
- Allocated queue: a list of allocated sections

[Figure: a core's LVQ and BOQ divided into sections; one free queue of unused sections; one allocated queue per thread; an example LVQ whose allocated sections hold load values]
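The free-queue and allocated-queue bookkeeping can be sketched as follows. The class `SectionedQueue`, its method names, and the policy of returning a section to the free queue as soon as it is drained are assumptions, not details taken from the paper.

```python
from collections import deque


class SectionedQueue:
    """An LVQ or BOQ split into fixed-size sections shared by several threads."""

    def __init__(self, num_sections, section_size):
        self.section_size = section_size
        self.sections = [deque() for _ in range(num_sections)]
        self.free = deque(range(num_sections))  # free queue: unused sections
        self.allocated = {}                     # thread_id -> allocated queue

    def push(self, thread_id, value):
        """Append a value for a thread; return False if no section is free."""
        owned = self.allocated.setdefault(thread_id, deque())
        if not owned or len(self.sections[owned[-1]]) == self.section_size:
            if not self.free:
                return False                    # producer would have to stall
            owned.append(self.free.popleft())   # allocate on demand
        self.sections[owned[-1]].append(value)
        return True

    def pop(self, thread_id):
        """Remove the oldest value of a thread, freeing drained sections."""
        owned = self.allocated[thread_id]
        head = owned[0]
        value = self.sections[head].popleft()
        if not self.sections[head]:
            owned.popleft()
            self.free.append(head)
        return value
```

A thread that produces many values can take most of the sections while an idle thread holds none, which is the utilization benefit of sharing over a fixed per-thread partition.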

13

Fault Detection
- Execution fingerprints are exchanged
  - The execution results of the two threads should be the same
- Branch misprediction in the trailing thread
  - In fault-free execution the trailing thread never mispredicts, so a misprediction points to a fault

Fault Isolation
- Faults must not propagate
  - A speculative (S) bit is added to the data cache (D$)
    - S=1 when data is written: the cache line is locked and cannot be written back to memory
    - If the line has to be replaced, the fingerprints must be compared first
    - S=0 once the comparison matches
  - I/O operations: take a checkpoint and compare fingerprints before the I/O operation

Fault tolerance mechanism
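A small sketch of the two ideas on this slide, fingerprint comparison and the speculative bit, is given below. The use of CRC-32 from Python's `zlib` as the fingerprint function is an assumption chosen for convenience, as are the names `fingerprint` and `SpeculativeCache` and the dictionary-based cache model.

```python
import zlib


def fingerprint(retired_values):
    """Compress a sequence of retired integer results into one short hash."""
    fp = 0
    for value in retired_values:
        fp = zlib.crc32(value.to_bytes(8, "little", signed=True), fp)
    return fp


class SpeculativeCache:
    """Data cache model in which unverified lines carry S=1."""

    def __init__(self):
        self.lines = {}    # addr -> [value, s_bit]
        self.memory = {}

    def write(self, addr, value):
        self.lines[addr] = [value, True]   # S=1: line is locked

    def on_fingerprint_match(self):
        for line in self.lines.values():
            line[1] = False                # S=0: contents are verified

    def write_back(self, addr):
        """Write a line to memory; refuse while its S bit is still set."""
        value, s_bit = self.lines[addr]
        if s_bit:
            return False                   # fingerprints must be compared first
        self.memory[addr] = value
        return True
```

Only a short hash crosses the interconnect instead of every result, and memory never sees data that the two threads have not yet agreed on.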

14

Checkpoint and Recovery
- If the comparison matches, the leading core stores all registers in the checkpoint store
- If it does not match
  - Restore the register state of both cores from the checkpoint store
  - Invalidate the data cache lines whose speculative bit is set (S=1)
  - The leading core re-executes from the last checkpoint

Fault coverage
- Processor logic and certain parts of the memory access circuitry
  - Faults in the cache and the memory controller cannot be detected
- Fingerprint aliasing

Fault tolerance mechanism (cont.)
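The checkpoint and recovery steps can be sketched as one function called after each fingerprint comparison. The names `Core` and `end_of_chunk`, the register set, and the dictionary-based cache with an `"s"` flag per line are hypothetical and only serve to show the control flow.

```python
class Core:
    def __init__(self):
        self.regs = {"pc": 0, "acc": 0}


def end_of_chunk(leading, trailing, lead_fp, trail_fp, checkpoint, cache):
    """Commit or roll back after comparing the two fingerprints.

    checkpoint: dict holding the last verified register state.
    cache: dict addr -> {"value": int, "s": bool} (s is the speculative bit).
    """
    if lead_fp == trail_fp:
        # Match: the leading core's registers become the new checkpoint
        # and the speculative lines are now verified.
        checkpoint.clear()
        checkpoint.update(leading.regs)
        for line in cache.values():
            line["s"] = False
        return "committed"
    # Mismatch: restore both cores and drop every unverified cache line.
    leading.regs = dict(checkpoint)
    trailing.regs = dict(checkpoint)
    for addr in [a for a, line in cache.items() if line["s"]]:
        del cache[addr]
    return "rolled back"
```

After a rollback the leading core would resume from the restored `pc`, so the chunk that produced the mismatch is executed again from the last verified state.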

15

- Hardware & software cost
- Comparison with SRT and CRT
- Fault tolerance capability
- Performance degradation or improvement
  - Throughput
- Power consumption

Before Experiment

16

Simulation Methodology
- SESC is used to model MRE
  - SESC is a microprocessor architectural simulator
- Wattch is used for the power model
  - The LVQ and BOQ are modeled with CACTI
- The workload consists of 9 benchmarks from SPEC CPU 2000
  - MinneSPEC reduced input sets are used to shorten simulation time
  - Total instructions

Normalized Throughput Per Core (NTPC)

Simulation & evaluation

17

One logical thread
- Comparing stores between the leading and trailing threads increases interconnect traffic
- CRT compares store values using the store comparator in the store queue
- The leading thread's store buffer waits for the trailing thread's stores

[Chart labels: -2%, -18%, 1.86, 1.81]

18

Multiplexing with 2 logical threads
- MRE: 3 cores, CRT: 4 cores

[Chart labels: 0.58, 0.47]

19

Multiplexed Redundant Execution (MRE)
- Mitigates the throughput loss caused by redundant execution
  - One trailing core executes the redundant threads of several leading cores, using the RRQ and coarse-grained multithreading

My comments
- Load Value Queue (LVQ) and Branch Outcome Queue (BOQ)
  - Reduce the extra time spent loading from the lower levels of the memory hierarchy
- The paper does not describe which type of multithreading the leading core uses

Conclusions

20

- One form of thread-level parallelism (TLP)
- Executes multiple instructions from multiple threads at the same time
- Architecture: superscalar

Simultaneous multithreading (SMT)

Picture from “Computer Architecture – a Quantitative Approach”, John L. Hennessy, David A. Patterson
