Top Banner
Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il 11/06 Self-Stabilizing Device Drivers
25

Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

Dec 22, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

Shlomi Dolev and Reuven Yagel,Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel{dolev|yagel}@cs.bgu.ac.il

SSS06

11/06

Self-Stabilizing Device Drivers

Page 2: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

2

Motivation for Self-stabilizing OS

• Coping with transient faults(e.g. soft-error (SEU) in a remote sensor results unpredictable system states)

• Growing interest in self-* / autonomic computing systems

• Self-stabilizing algorithms/programs assume hardware and operating system are also stabilizing

• Pentium HALTING problem: “… if the ESP or SP register is 1 when the PUSH instruction is executed, the processor shuts down…”

Page 3: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

3

Proposed solution

• To build according to the well defined and understood paradigm of self-stabilization (traditionally used in distributed systems)

• Thereby achieving: trustworthiness, dependability, self-healing, automatic recovery, adaptive systems, …

Page 4: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

4

OS Dependability• Past examples:

– Dijkstra, “THE” Multiprogramming System ’68 (Layered Approach)

– Denning, Lampson ~’70 (Protection)– KeyKOS ’85 (Capabilities)– EROS ’92 (Checkpoints)– Micro-kernel ~‘90, Exo-kernel ’94 (Minimal TCB)

• Recent:– IBM: K42 (Autonomic Computing), SUN: Solaris 10,

Predictive Self-Healing– MSR: Singularity (managed code, protection,

verification)– JHU: The Coyotos Secure Operating System

(Capabilities, type safety)– HW e.g. Intel : Itanium 2, Machine Check Arch.

Page 5: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

5

Goal: Autonomic Computer

• Following any sequence of transient faults, the (operating) system converges

• Using self stabilization:– A system can be started in an arbitrary state

and converge to a desired behavior– Using fair composition to run hardware + OS

layaers• BGU: Self-stabilizing systems, tools & paradigms

– Microprocessor [DH04]– Operating System [DY04, DY05, DY06]– Compiler [DH05]– Framework: autonomic recoverer [BDK03, BD06]– Middleware: File System [DK02], Group Comm. [DS01]

Page 6: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

6

SOS - Directions

• Black-box– Take existing (Desktop/Real-time) OS– Add stabilization layer– Detailed formal specification needed

• Carefully tailoring a tiny kernel– Processor scheduling [SAACS04]– Memory management [SOSP05, SSS05]– Device drivers [SSS06]

Page 7: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

7

A problem has been detected and Windows has been shut down to prevent damage to your computer.

PFN_LIST_CORRUPT

If this is the first time you've seen this error screen,restart your computer. If this screen appears again, followthese steps:

Check to make sure any new hardware or software is properly installed.If this is a new installation, ask your hardware or software manufacturerfor any Windows updates you might need.

If problems continue, disable or remove any newly installed hardwareor software. Disable BIOS memory options such as caching or shadowing.If you need to use Safe Mode to remove or disable components, restartyour computer, press F8 to select Advanced Startup Options, and thenselect Safe Mode.

Technical information:*** STOP: 0x0000004e (0x00000099, 0x00000000, 0x00000000, 0x00000000)

Beginning dump of physical memoryPhysical memory dump complete.Contact your system administrator or technical support group for furtherassistance.

/* Third Edition UNIX, early 1973 */

panic(s)char *s;{

prproc(); update(); printf("panic: %s\n", s); for(;;) idle();}

Page 8: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

8

Assumptions

• Every configuration (processor registers/ memory/IO controllers) is possible

• (Some) program code is hardwired (in ROM) and is correct [Castro & Liskov, 2000]

• Stabilization of other layers• CPU instruction manual (e.g. Pentium)

defines a transition function– Restrict access to privileged instructions

and registers (e.g. HLT, LIDT)– Segmentation for enforcing isolation

Page 9: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

9

Method

• Specify main (additional) requirements for each OS function

• Evolve self-stabilizing solutions that follow computer-architecture/OS progress

• Detailed proof for self-stabilization of algorithms AND implementation

• Using processor instruction set– Don’t rely on existing compilers

Page 10: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

10

Device Driver Challenges

• Practically an essential part of any OS• Co-operation with external controllers

– Adaptation (connecting interfaces)– Translation (abstracting e.g. speed

variation)– (and sometimes adding functionality)

• ”a modern Seagate drive contains roughly 400,000 lines of code” [IRON FS]

• “modern disk controllers often have many megabytes of memory inside the controller” [Minix]

Page 11: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

11

Unreliable Components

• “In Windows XP, for example, device drivers cause 85% of reported failures” [Nooks]

• ”Since drivers account for the majority of the code (over 70% in this release)…

…the error rate for drivers is almost seven times higher than the error rate for the rest of the kernel” [Engler et. al.]

Page 12: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

12

Where is the Problem?

• Device drivers are loaded into kernel and run in privileged mode

• No monitoring of operation & resource usage

• Usually external & less tested code• Not-stabilizing!

• "malicious driver can program a DMA capable device to overwrite any part of memory...We anticipate that future hardware will provide memory protection for DMA transfers." [Singularity]

Page 13: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

13

What Has Been Done?

• Isolation: Micro-kernel [Mach], user mode drivers [Minix], Monitoring [Nooks]

• Virtualization [Xen, …]• Type safety [Singularity, Coyotos]• Model Checking [SLAM]

• ”our goal is to prevent the vast majority of driver-caused crashes with little or no change to existing driver” [Nooks]

Page 14: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

14

MSR Sinuglarity Project• Achieving OS dependability through advances

in languages (compilers), tools (verifiers and monitors).

• Device drivers isolation through Software Isolated Processes (SIP) and resource useage specification

• Recently adding hardware protection:“increased use of hardware error detection and correction techniques, such as error correcting codes (ECC) on memory and data paths, to ensure the assumption of correct execution that underlies all program safety” [MSR-TR-2006-43]

Page 15: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

15

Example: ATA Standard

• 1990 ATA-1 a de-facto protocol for storage devices (ANSI 1994)

• T13 Technical Committee http://www.t13.org/#Projects

• Current draft: ATA-8 • ATA ZOO: AT-Attachment, (E-)IDE,

ATAPI, SATA/PATA

Page 17: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

17

ATA ModelHost:

:Device

Page 18: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

18

ATA Non-stability

Page 19: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

19

Device-Driver Requirements

• Error model: state of device driver & controller program (including PC) might get any value

• Ping-pong: infinitely I/O commands in which there is update & read of the device buffer

• Progress: eventually every (and only) I/O request is executed completely and correctly according to the ATA spec.

Page 20: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

20

Solution 1: Leasing

Page 21: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

21

Solution 1: Leasing

• The driver reaches its idle state infinitely often

• From any configuration a safe (idle) configuration is eventually reached – device reset

• From there ping-pong holds• Eventually progress holds

Page 22: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

22

Solution 2: Consistency Checks

• Device driver contains consistency checker

• Consistency of driver current state with a snapshot of the controller

Page 23: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

23

Solution 2: Consistency Checks

• Eventually ping-pong holds• Eventually progress holds

• Combining the 2 solutions:– varying snapshot sizes according to

degree of device stabilization– fallback to leases and restarts

Page 24: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

24

Future Work

• I/O device drivers– Protocols for co-operation of more than

one microprocessor– Detailed driver / General monitoring

layer

• Gather the different parts

Page 25: Shlomi Dolev and Reuven Yagel, Computer Science Department Ben-Gurion University of the Negev, Beer-Sheva, Israel {dolev|yagel}@cs.bgu.ac.il SSS06 11/06.

25

Conclusion

• The work shows theoretical and practical ways to achieve the goal of a self-stabilizing OS, particularly the device drivers part

• By supplying an infrastructure for practical self-stabilizing systems, robust and dependable systems can be achieved

• http://www.cs.bgu.ac.il/~yagel/sos