ExtraVirt: Detecting and recovering from transient processor faults. Dominic Lucchetti, Steve Reinhardt, Peter Chen. University of Michigan.

Dec 22, 2015

Page 1

ExtraVirt: Detecting and recovering from transient processor faults

Dominic Lucchetti, Steve Reinhardt, Peter Chen

University of Michigan

Page 2

Flips Happen

Similar die area + decreasing transition energy = increasing risk of transient failure

Page 3

Multi-Processors & Virtual Machine

Multi-Processor
  Ensure error independence
  Enable fault detection
  Efficient resource sharing

Virtual Machine
  No changes to OS or applications
  VM replay
    Synchronize replicas
    Recover correct state

[Figure: Replica 1, Replica 2, Hypervisor, Device Drivers, Replication Management Layer (RML)]
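The "VM replay" and "Synchronize replicas" points can be illustrated with a minimal sketch. It is not the ExtraVirt implementation. It assumes each replica is deterministic given its inputs, so feeding the same logged events to two replicas should yield identical state, and any difference indicates a fault. The `Replica` class, the `replay` function, the event names, and the use of a SHA-256 running digest as the "state" are all hypothetical choices made for illustration.

```python
import hashlib
from typing import List, Optional


class Replica:
    """Toy deterministic 'VM' (hypothetical): state is a running digest of inputs."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = b"\x00" * 32

    def step(self, event: bytes, inject_fault: bool = False) -> None:
        # Deterministic update: same prior state + same event -> same new state.
        digest = hashlib.sha256(self.state + event).digest()
        if inject_fault:
            # Simulate a transient fault by flipping one bit of the result.
            digest = bytes([digest[0] ^ 0x01]) + digest[1:]
        self.state = digest


def replay(replica: Replica, log: List[bytes],
           fault_at: Optional[int] = None) -> bytes:
    """Feed the logged events to a replica; optionally inject a fault at one step."""
    for i, event in enumerate(log):
        replica.step(event, inject_fault=(i == fault_at))
    return replica.state


# The log stands in for recorded nondeterministic inputs (assumed event names).
log = [b"timer-interrupt", b"network-packet:42", b"disk-read-done"]

clean_1 = replay(Replica("replica-1"), log)
clean_2 = replay(Replica("replica-2"), log)
faulty = replay(Replica("replica-2"), log, fault_at=1)

replicas_agree = clean_1 == clean_2   # same log, no fault
fault_detected = faulty != clean_1    # one flipped bit shows up as a mismatch
```

Comparing a compact digest rather than full state is only a convenience of the sketch; the slides do not say how replicas are compared.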

Page 4

Example: Memory

Copy on write
  Reduces overhead
  Protects checkpoints

Merge on checkpoint
  Verify correctness
  Re-execute on deviation

Memory Fault Protection
  ECC against RAM faults
  MMU against CPU faults

[Figure: Memory Checkpoint, Replica 1, Replica 2, Replica 3, Verify. Page columns as extracted: A B CD E; A B CX E; A B C E; and, under Replica 3, A B CD E.]
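The memory example can be sketched in a few lines, under stated assumptions. The checkpoint holds pages A, B, C, E; each replica runs against a copy-on-write view, so the checkpoint itself is never modified; at the next checkpoint the replicas' written pages are compared and merged only if they agree. On a deviation the sketch re-executes from the checkpoint with a third run and accepts whichever earlier result it matches. That voting rule is one reading of the figure, not something the slides state. All names (`CowMemory`, `run_epoch`, `advance`) are hypothetical.

```python
from typing import Callable, Dict

Pages = Dict[str, str]


class CowMemory:
    """Copy-on-write view over a checkpoint: writes go to a private dirty set."""

    def __init__(self, checkpoint: Pages) -> None:
        self._checkpoint = checkpoint   # shared and never modified
        self.dirty: Pages = {}          # pages this replica has written

    def read(self, page: str) -> str:
        return self.dirty.get(page, self._checkpoint[page])

    def write(self, page: str, value: str) -> None:
        self.dirty[page] = value


Work = Callable[[CowMemory], None]


def run_epoch(checkpoint: Pages, work: Work) -> Pages:
    """Run one replica from the checkpoint and return the pages it wrote."""
    mem = CowMemory(checkpoint)
    work(mem)
    return mem.dirty


def advance(checkpoint: Pages, work_1: Work, work_2: Work, work_3: Work) -> Pages:
    """Verify two replicas; on deviation, re-execute and take the matching result."""
    d1 = run_epoch(checkpoint, work_1)
    d2 = run_epoch(checkpoint, work_2)
    if d1 == d2:
        agreed = d1
    else:
        d3 = run_epoch(checkpoint, work_3)   # re-execute from the checkpoint
        if d3 == d1:
            agreed = d1
        elif d3 == d2:
            agreed = d2
        else:
            raise RuntimeError("no two runs agree; cannot merge")
    return {**checkpoint, **agreed}          # merge into a new checkpoint


def correct(mem: CowMemory) -> None:
    mem.write("C", "D")                      # intended update: C becomes D


def faulty(mem: CowMemory) -> None:
    mem.write("C", "X")                      # transient fault corrupts the write


checkpoint = {"A": "A", "B": "B", "C": "C", "E": "E"}
new_checkpoint = advance(checkpoint, correct, faulty, correct)
```

Only written pages are copied and compared here, which is where the "reduces overhead" point comes from; the old checkpoint stays intact as the recovery point.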

Page 5

Status

Present
  VM Replay
  Beginnings of Replication Management Layer (RML)
  Still much to do…

Future
  Replicate the un-replicated
  Handle faults in device drivers
  Expanded fault model

[Figure: Replica 1, Replica 2, Hypervisor/RML, Device Drivers]

Page 6

Questions?