J. Kochocki October 16, 2008 Aerospace Control and Guidance Systems Meeting 102 1.


Reliable Computing



Introduction to Reliable Computing

• Attributes, Concepts and Definitions
• Designing and Specifying Redundant, Fault Tolerant Architectures
• Reliable Systems Development at Draper Laboratory
• Example Avionics Computational Systems
• Redundancy Management in NMR Systems
• Summary

Reference: Design by Extrapolation: An Evaluation of Fault Tolerant Architectures, Rob Hammett, Draper Laboratory, IEEE AESS Systems Magazine, April 2002


Highly Reliable Systems

Architecting/Design is a trade process
• Redundancy/Modularity
• Component quality
• Interconnections
• Redundancy management approach
• Maintenance
• Operations
• . . .

And these architecting decisions impact
• Mass, power
• Reliability, availability, …ility
• Integration, test, maintenance
• SW size and complexity
• Extend-ability/evolve-ability, ability to upgrade
• Emergent properties
• Development cost
• Ownership cost
• Logistics


Redundant Systems Decisions are Non-Intuitively Coupled

Will Adding Redundancy Increase Cost?

Yes
• Increased acquisition cost
• More parts
• More spares needed
• Increased weight
• More maintenance
• Inter-component complexity

No
• Higher reliability with less costly parts
• Better fault detection and isolation reduces repair costs
• Ability to defer maintenance
• Reduced component complexity
• Less need for BIT

[Figure: relationships among Redundancy, Quality, System Architecture, Unit Failure Rate, Reliability and Total Cost; further labels: Cost of System, Weight, Complexity; Cost of System; Cost of Unreliability]

An Integrated Analysis is Required
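
The coupling between redundancy, reliability and total cost can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration rather than a model from the presentation: units fail independently at a constant rate, the system works as long as at least one unit survives (perfect fault detection and switching), and the cost figures are invented.

```python
import math


def system_reliability(n_units: int, unit_failure_rate: float, mission_hours: float) -> float:
    """Probability that at least one of n independent units survives the mission."""
    unit_reliability = math.exp(-unit_failure_rate * mission_hours)
    return 1.0 - (1.0 - unit_reliability) ** n_units


def total_cost(n_units: int, unit_cost: float, unit_failure_rate: float,
               mission_hours: float, loss_cost: float) -> float:
    """Total cost = cost of system + expected cost of unreliability (hypothetical model)."""
    cost_of_system = n_units * unit_cost
    p_loss = 1.0 - system_reliability(n_units, unit_failure_rate, mission_hours)
    cost_of_unreliability = p_loss * loss_cost
    return cost_of_system + cost_of_unreliability


def cheapest_redundancy(unit_cost: float, unit_failure_rate: float,
                        mission_hours: float, loss_cost: float, max_units: int = 8) -> int:
    """Redundancy level (number of units) that minimizes total cost."""
    return min(range(1, max_units + 1),
               key=lambda n: total_cost(n, unit_cost, unit_failure_rate,
                                        mission_hours, loss_cost))


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, round(total_cost(n, 1.0, 1e-4, 1000.0, 1000.0), 3))
```

With these invented numbers the total cost falls as units are added, reaches a minimum, and then rises again as the cost of the extra hardware outweighs the shrinking cost of unreliability. A real trade would also have to account for weight, spares, maintenance and imperfect fault coverage.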


Redundancy for Spacecraft is More Than Traditional Avionics

• Avionics (middle “A”): The Digital System
• avionics (small “a”): The Computer

[Figure labels: X-38 Fault-Tolerant Computer; 777 Airplane Information Management System]


Redundancy (and Fault Tolerance) for Spacecraft

Big “a” Avionics: All Redundancy in the System
• All of the electrical elements and their support
• Includes power, thermal, data, sensing, actuation, comms, …
• This is the only context to talk about Fault Tolerance, Reliability, … for Redundant Systems

[Figure: Space Station Freedom Modules. Block diagram of the Hab, Node 1, Node 2 and Lab A modules showing DDCU, SPDA, MDM, RC, SDP, MSU, UPS, MPAC, gateway, MITCS and LITCS units; legend: Power, Cooling]

Fault Tolerance (FT) is an Integrated Systems Issue

As the mass/power/cost is reduced, the system becomes more coupled and interdependent
• Across technology subsystems
• Organizational responsibilities
• Near term decisions constrain long term behaviors

A Good FT System Design Process:
• Can be applied at the conceptual phase
• Accounts for interactions across subsystems and organizations
• Exposes consequences of decisions
• Guides the designers and owners/operators


Getting the integration right is critical

Fault Tolerance is an Integrated Systems Issue

Fault Tolerance is a Systems Problem
• It crosses traditional technical and organizational boundaries: data, power, thermal, comms, ECLSS, GN&C, …
• It is entangled with maintenance, testing, operations
• For mass-limited systems, the level of cross-function integration is increased
  ▪ This complicates the integration, requiring a dedicated team with a system-wide view

Fault Tolerance Taxonomy

DEPENDABILITY
• ATTRIBUTES: AVAILABILITY, RELIABILITY, SAFETY, CONFIDENTIALITY, INTEGRITY, MAINTAINABILITY
• MEANS: FAULT PREVENTION, FAULT REMOVAL, FAULT TOLERANCE, FAULT FORECASTING
• THREATS: FAULTS, ERRORS, FAILURES

SECURITY

Prof. Kishor S. Trivedi

Some Fault Tolerance Concepts

• Fault: An incorrect state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design
• Error: The manifestation of a fault
• Failure: A result of a delivered service deviating from the specified service, caused by an error or fault

[Figure: Sensor Inputs → Computer → Actuator Outputs, annotated Fault → Error → Failure]

Fault Tolerance Background: Redundancy

[Figure: Sensors → Distributed Agreement Mechanism (Interactive Consistency) → Redundancy Management Processing → Command Distribution Mechanism → Actuators]

This is the crux of the difficulty in redundant components for reliable Input-Compute-Output

Fault Tolerance Methods

Fault Tolerant Methods
• Redundancy
  ▪ Most practical means to tolerance
• Error (and/or Fault) Detection

Fault Removal
• Reconfiguration
• Error Masking

Removal + Tolerance
• Software Fault Tolerance

“The only thing (redundancy) guarantees is a higher fault arrival rate compared to a non-redundant system…” [Lala & Harper, IEEE 1994]
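
The quoted remark can be made concrete with a short calculation. This is a sketch under stated assumptions, not an analysis from the presentation: units fail independently at a constant rate, there is no repair, and the voter is perfect. The function names are illustrative.

```python
import math


def fault_arrival_rate(n_units: int, unit_failure_rate: float) -> float:
    """Rate at which the first fault arrives somewhere among n identical units."""
    return n_units * unit_failure_rate


def simplex_reliability(unit_failure_rate: float, hours: float) -> float:
    """Reliability of a single, non-redundant unit."""
    return math.exp(-unit_failure_rate * hours)


def tmr_reliability(unit_failure_rate: float, hours: float) -> float:
    """Reliability of a 2-of-3 majority-voted triplex: 3R^2 - 2R^3."""
    r = simplex_reliability(unit_failure_rate, hours)
    return 3.0 * r ** 2 - 2.0 * r ** 3


if __name__ == "__main__":
    lam = 1e-4  # assumed failures per hour
    print("triplex fault arrival rate:", fault_arrival_rate(3, lam))
    for hours in (100.0, 10000.0):
        print(hours, simplex_reliability(lam, hours), tmr_reliability(lam, hours))
```

Three units see faults three times as often as one. Under these assumptions the triplex is more reliable than the simplex only while the unit reliability stays above 0.5; on a long enough mission without repair or reconfiguration it is actually worse, which is why redundancy management matters as much as the redundancy itself.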


Building Reliable Software

• Software defect prevention (fault avoidance): Using a disciplined development approach that minimizes the likelihood that defects will be introduced or will remain undetected in the software product (traditional V&V)
• Software fault tolerance: Designing and implementing the software under an assumption that a limited number of residual defects will remain despite the best efforts to eliminate them


40 Years of Mission-Critical and Safety-Critical Systems

• Apollo Guidance Computer: fault-free operation to the moon and back
• Digital Fly-By-Wire: fault-tolerant aircraft control
• Argonne National Laboratory: plant monitor
• Venture Star Military Space Plane
• Process Control
• Next Generation Systems
• Swim-by-wire
• X-38: technology demo for crew return vehicle
• Space Shuttle: data management system
• UUV: vehicle control
• SSN-21
• Charles Stark “Doc” Draper, Father of Inertial Navigation
• Space Shuttle Upgrades, Crew Return Vehicle
• Trident D5 Missile
• Autonomous Ops


Apollo IMU and Guidance Computer

Three gimbal system imposed risk due to the chance of gimbal lock

“AUGUST 1963”


Seawolf Ship Control Processing Unit

• Quad 2 Fault Tolerant Processor
• Low Maintenance
• High Reliability
• Extensive FDIR
• In use since early 1990’s


NASA Crew Return Vehicle – X-38 FTPP

• Draper developed the quad FT avionics processing architecture for NASA/JSC
• Draper continues to support NASA/MSFC as the avionics processing design agent.
• Byzantine-resilient X-38 FTPP System Availability is estimated to be 99.999%

[Figure: four channels, each with a Network Element, ICP, FCP, VME Bus, MPCC, Decomm, Analog Out and Digital I/O boards, interconnected by the Cross Channel Data Link (CCDL)]

NE is an FPGA which implements time synchronization, data exchange and voting across the CCDL
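
To put the 99.999% availability estimate in perspective, the arithmetic below converts an availability figure into expected downtime. It is only a unit conversion; the 8760-hour year is an assumption, and nothing here reproduces how the estimate itself was derived.

```python
def downtime_minutes_per_year(availability: float, hours_per_year: float = 8760.0) -> float:
    """Expected unavailable time per year, in minutes, for a given availability."""
    return (1.0 - availability) * hours_per_year * 60.0


if __name__ == "__main__":
    for a in (0.999, 0.9999, 0.99999):
        print(a, round(downtime_minutes_per_year(a), 2), "min/year")
```

An availability of 99.999% corresponds to a little over five minutes of downtime per year of continuous operation.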

NASA/MSFC Crew Launch Vehicle

Draper is NASA’s trusted design agent for the avionics and software for the upper stage of the Ares I Crew Launch Vehicle.

NASA Human-rated flight imposes a 2-FT FO/FO/FS requirement (now 1 FT)

X-38 architecture modifications made to meet new NASA requirements

• Apollo – Guidance
• Shuttle – Backup Flight System
• Ares I CLV – US Avionics & FSW
• Ares IV CaLV – Avionics & FSW

[Figure labels: Apollo, Shuttle, CLV, CaLV]

A long history of great space system success


Simple Avionics Control Systems

[Figure: Simplex System: Sensor Inputs → Computer → Actuator Outputs]

[Figure: Simplex with BIT: Sensor Inputs → Computer with BIT → Actuator Outputs, with a Disconnect]

[Figure: Dual Standby: Sensor Inputs → Computer with BIT → Actuator Outputs, with Engage Backup to a second Computer with BIT]


Self Checking Pairs

[Figure: Self Checking Pair: Sensor Inputs → Computer FC1 and Computer FC2 → Identical Outputs? → Actuator Outputs]

[Figure: Self Checking Pair with Simplex Fault Down: Sensor Inputs → Computer with BIT FC1 and Computer with BIT FC2, each reporting BIT Fail → Output if Agreement; Pass BIT on fail? → Actuator Outputs]
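
The two self-checking-pair variants can be sketched as simple output-selection rules. This is an illustrative reading of the diagrams, not flight logic: outputs are compared bit-for-bit as integers, `None` stands for "no output released", and the fault-down rule (trust the computer that passes BIT when the pair disagrees) is an assumption drawn from the "Pass BIT on fail?" label.

```python
from typing import Optional


def self_checking_pair(out_fc1: int, out_fc2: int) -> Optional[int]:
    """Release the command only when both computers produced identical outputs."""
    return out_fc1 if out_fc1 == out_fc2 else None


def scp_with_simplex_fault_down(out_fc1: int, bit1_pass: bool,
                                out_fc2: int, bit2_pass: bool) -> Optional[int]:
    """On disagreement, fall back to the one computer that still passes BIT."""
    if out_fc1 == out_fc2:
        return out_fc1
    if bit1_pass and not bit2_pass:
        return out_fc1
    if bit2_pass and not bit1_pass:
        return out_fc2
    # Both pass or both fail BIT: the faulty side cannot be identified.
    return None


if __name__ == "__main__":
    print(self_checking_pair(0x2A, 0x2A))
    print(self_checking_pair(0x2A, 0x2B))
    print(scp_with_simplex_fault_down(0x2A, True, 0x2B, False))
```

The plain pair detects a disagreement but cannot tell which side is wrong, so it goes silent. The fault-down variant keeps operating in simplex mode only when BIT isolates the failed computer.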

Self Checking Pair vs. N-Modular Redundancy

[Figure: Dual Self Checking Pair (SCP): two Self Checking Pairs, each with CPU1, CPU2 and a High Integrity NIC, connected through a Switch to the Sensor and Effector]

[Figure: N-Modular Redundant using Cross-Channel Data Link (CCDL): three Flight Computers, each with CPU1 and its own C&DH BUS, Sensor and Effector]


CCDL Voting Overview

Each channel reads IMU data (good or faulty) from each data bus

IMU_1, IMU_2 and IMU_3 are all exchanged over the CCDL so that each computer has an identical copy of each set of data prior to execution of GN&C software and sensor redundancy management

[Figure: Channel 1, Channel 2 and Channel 3 read IMU_1, IMU_2 and IMU_3, exchange the data over the Cross Channel Data Link (CCDL), and each run Navigation, Guidance and Control]
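
The effect of the exchange can be sketched as follows. The sketch assumes an ideal, fault-free exchange and uses mid-value selection as the sensor redundancy management rule; both are illustrative choices, and the function names are hypothetical.

```python
from statistics import median
from typing import Dict, List


def exchange_over_ccdl(channels: List[str],
                       local_readings: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Ideal exchange: afterwards every channel holds a copy of every IMU reading."""
    return {channel: dict(local_readings) for channel in channels}


def mid_value_select(readings: Dict[str, float]) -> float:
    """Pick the middle of three readings, which masks a single wild value."""
    return median(readings.values())


if __name__ == "__main__":
    channels = ["Channel 1", "Channel 2", "Channel 3"]
    # IMU_3 is faulty in this made-up data set.
    readings = {"IMU_1": 1.00, "IMU_2": 1.02, "IMU_3": 9.99}
    views = exchange_over_ccdl(channels, readings)
    for channel, view in views.items():
        print(channel, mid_value_select(view))
```

Because every channel runs the same selection on identical copies of the data, all three channels feed the same value into the GN&C software, so their outputs can later be compared bit-for-bit.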


ARINC 653 Port/CCDL/Voting: CH1 Up Close

[Figure: Channel 1 software stack, split into Applications (user mode) and Operating System (supervisor mode). Elements shown:
• Channel 1 GN&C Applications, using Read_Port() on CH1 Data, CH2 Data and CH3 Data (Redundant Sensor Management Function) and Write_Port()
• VxWorks ARINC 653 APEX Shared Library; VxWorks System Shared Library
• Channel 1 VxWorks 653 Core OS with FTSS Input Task, FTSS Sync Task and FTSS Output Task
• ARINC 653 INPUT Port Memory (e.g. IMU); ARINC 653 OUT Port Memory (e.g. MC)
• 1553 Rcv DMA memory; 1553 Send DMA memory; CH2 CCDL DMA mem; CH3 CCDL DMA mem
• BSP and Drivers: 1553 DMA Driver (in), 1553 DMA Driver (out), CH2 CCDL DMA Driver, CH3 CCDL DMA Driver
• Hardware: Mil-Std-1553 from Channel 1 Sensors (IMU_1 Data) and to Channel 1 Actuators (MC Cmds); CCDL to/from other channels]

50 Hz Control Process:
• Read_Sensor_Ports();
• Redundancy_Mgmt();
• Execute Algorithms();
• Write_Cmd_Ports()

Multiple rate groups (1, 10, 20 Hz, etc.) for different GN&C processes

Redundant Sensor Management Function
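
The 50 Hz control process and its rate groups can be sketched as a cyclic executive. This is an illustration only: the four callables stand in for the port and GN&C functions named on the slide, the scheduler is hypothetical, and it handles only rates that divide the 50 Hz base rate evenly.

```python
from typing import Callable, List, Sequence

BASE_RATE_HZ = 50  # minor-frame rate assumed for this sketch


def due_rate_groups(frame: int, rates_hz: Sequence[int]) -> List[int]:
    """Rate groups that run in the given 50 Hz minor frame."""
    return [rate for rate in rates_hz if frame % (BASE_RATE_HZ // rate) == 0]


def control_process(read_sensor_ports: Callable[[], List[int]],
                    redundancy_mgmt: Callable[[List[int]], int],
                    execute_algorithms: Callable[[int], int],
                    write_cmd_ports: Callable[[int], None]) -> int:
    """One pass of the control process: read, select, compute, command."""
    channel_data = read_sensor_ports()        # CH1, CH2, CH3 copies of the sensor data
    selected = redundancy_mgmt(channel_data)  # e.g. mid-value selection
    command = execute_algorithms(selected)    # GN&C computation
    write_cmd_ports(command)                  # hand the command to the output port
    return command


if __name__ == "__main__":
    sent: List[int] = []
    for frame in range(10):
        if 50 in due_rate_groups(frame, [50, 10, 1]):
            control_process(lambda: [10, 11, 50],
                            lambda data: sorted(data)[1],
                            lambda value: 2 * value,
                            sent.append)
    print(sent)
```

Keeping the steps in a fixed order within each frame means every channel performs redundancy management on the exchanged data before any algorithm runs.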


Summary

Highly reliable computational systems are required for embedded avionics systems

An integrated design approach is required for designing reliable fault tolerant systems
• Overlaps engineering disciplines, affects O&M

There are different levels of redundancy which have to be managed, some at the computer level (bit-for-bit) and some at application level (redundant sensor processing)

Vehicle Health Monitoring/FDIR must be designed upfront into the overall fault tolerant architecture


Backup Material

Byzantine resilience to asymmetric faults
• Single Source Exchange
• One vs. two round exchange

ARINC 653 HW/SW FCR


Byzantine Fault Tolerant Single Source Exchange - Step 1

[Figure: Sensor 1, Sensor 2 and Sensor 3 connect to FC1 Bus, FC2 Bus and FC3 Bus; the local sensor values are 7, 8 and 9]

Each FC obtains data from its LOCAL sensor connected to the local bus

Byzantine Resilient Fault Tolerant Two Round Single Source Exchange - Step 2

[Figure: FC1, FC2 and FC3 hold local values 7, 8 and 9 and send [7, S], [8, S] and [9, S] to the other FCs. Data received during Round 1 at FC1: [8, S] from FC2, [9, S] from FC3]

Perform “Round 1” of a two round exchange.
• Calculate a digital signature “S” (i.e. a 32-bit CRC) for local sensor data
• Send local sensor data and signature “S” to other FCs.



Byzantine Resilient Fault Tolerant Two Round Single Source Exchange - Step 3

[Figure: each FC reflects the Round 1 data to the other FCs. Data received during Round 2 at FC1: [4,S] [9,S] from FC2, [7,S] [8,S] from FC3]

Perform “Round 2” of the two round exchange.
• Reflect data and signature received during Round 1 to other FCs.
• Data and signature are not intentionally altered by the reflecting FC.
• However, in this example, the FC2 CCDL transmitter corrupts the original FC1 data during reflection back to FC1. The original “7” becomes a “4” but the original signature S is unaltered.
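
The corruption in this example is detectable because the reflected value no longer matches its signature. The sketch below shows that check using Python's `zlib.crc32` as the 32-bit CRC. The 4-byte big-endian encoding and the resolution rule (discard copies with a bad signature, then take the most common surviving value) are assumptions, since the slides stop at Step 3.

```python
import zlib
from collections import Counter
from typing import List, Optional, Tuple

Message = Tuple[int, int]  # (sensor data, signature S)


def sign(value: int) -> Message:
    """Round 1: attach a 32-bit CRC to the local sensor value."""
    return value, zlib.crc32(value.to_bytes(4, "big"))


def is_valid(message: Message) -> bool:
    """Check that the data still matches the signature it was sent with."""
    value, signature = message
    return zlib.crc32(value.to_bytes(4, "big")) == signature


def resolve(copies: List[Message]) -> Optional[int]:
    """Assumed resolution rule: drop corrupted copies, keep the most common value."""
    valid_values = [value for value, _ in copies if is_valid((value, _))]
    if not valid_values:
        return None
    return Counter(valid_values).most_common(1)[0][0]


if __name__ == "__main__":
    original = sign(7)                # FC1's Round 1 message [7, S]
    via_fc3 = original                # reflected intact by FC3
    via_fc2 = (4, original[1])        # FC2 transmitter corrupts 7 -> 4, S unchanged
    print(is_valid(via_fc2))          # the mismatch exposes the corruption
    print(resolve([via_fc3, via_fc2]))
```

A CRC only guards against accidental corruption; it is not a cryptographic signature, so this check would not stop a sender that recomputes the CRC over altered data.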




Hardware and Software Fault Containment Regions

[Figure: FC1, FC2 and FC3 are each an SBC750GX running the VxWorks 653 RTOS with a GN&C Partition and a Mission Mgr Partition (not all partitions shown), linked by a point-to-point, full duplex, Gigabit Ethernet CCDL (1000 Mbps)]

• Each FC is a Hardware Fault Containment Region (Region 1, Region 2, Region 3)
• Each A653 partition is a Software Fault Containment Region


Attributes of Reliable Systems

• Reliability (Probability of success)
• Mission Length
• Fault Tolerance (e.g. FO/FO/FS)
• Byzantine Resilience
• Inputs (data congruency)
• Synchronization
• Outputs (command voting)
• System Recovery/Restart (Fault Recoverable)
• Radiation Tolerance
• Mass, Power, Volume Constraints
• Concurrent/Parallel Processing
• Proprietary Hardware (or COTS)
• Scalability
• …

Fault Tolerance Background: Redundancy

Redundancy must be considered across the whole system
• Temporal redundancy (Primary/Backup/Spare)
• Parallel redundancy
• Functional (analytical fault detection, isolation)

• Redundancy makes a simple problem hard.
• The Fault Hypothesis drives the necessary redundancy and the assumptions about the implementations of functional elements
• Fault Containment Regions partition the analysis of system failures to make reliability analysis tractable
• Redundancy doesn’t always work (garbage-in-garbage-out)
• Redundancy can also improve the results (such as accuracy of estimators of dynamic state)

Self Checking Pair with Simplex Fault Down

[Figure: Sensor Inputs → two Computers → Actuator Outputs; output if agreement; pass BIT on fail?]

Dual SCP and TMR

[Figure: Dual Self Checking Pair: two pairs of Computers, each pair fed by Sensor Inputs and producing output if agreement (pass BIT on fail?), with Engage Backup selecting the pair that drives the Actuator Outputs]

[Figure: Triple Modular Redundancy: three Computers, each with Sensor Inputs, feed a Voted Output (2/3 majority) (middle value) to the Actuator Outputs]
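
The 2/3 majority vote used in Triple Modular Redundancy can be sketched as below. It assumes the three computers produce bit-for-bit comparable integer outputs; the channel names and the reporting of dissenting channels are illustrative additions.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def vote_2_of_3(outputs: Dict[str, int]) -> Tuple[Optional[int], List[str]]:
    """Return the majority output and the channels that disagreed with it.

    If no two channels agree, no output is released and every channel is suspect.
    """
    value, votes = Counter(outputs.values()).most_common(1)[0]
    if votes < 2:
        return None, sorted(outputs)
    dissenters = sorted(ch for ch, out in outputs.items() if out != value)
    return value, dissenters


if __name__ == "__main__":
    print(vote_2_of_3({"FC1": 0x1F, "FC2": 0x1F, "FC3": 0x00}))
    print(vote_2_of_3({"FC1": 1, "FC2": 2, "FC3": 3}))
```

Unlike a self-checking pair, the voter both masks a single faulty output and identifies which channel produced it, which gives redundancy management the information it needs for isolation.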
