Reliable Computing
J. Kochocki, October 16, 2008, Aerospace Control and Guidance Systems Meeting 102
Jan 12, 2016

Introduction to Reliable Computing
• Attributes, Concepts and Definitions
• Designing and Specifying Redundant, Fault Tolerant Architectures
• Reliable Systems Development at Draper Laboratory
• Example Avionics Computational Systems
• Redundancy Management in NMR Systems
• Summary
Reference: Design by Extrapolation: An Evaluation of Fault Tolerant Architectures, Rob Hammett, Draper Laboratory, IEEE AESS Systems Magazine, April 2002

Highly Reliable Systems
• Architecting/Design is a trade process
  ▪ Redundancy/Modularity
  ▪ Component quality
  ▪ Interconnections
  ▪ Redundancy management approach
  ▪ Maintenance
  ▪ Operations
  ▪ . . .
• And these architecting decisions impact
  ▪ Mass, power
  ▪ Reliability, availability, …ility
  ▪ Integration, test, maintenance
  ▪ SW size and complexity
  ▪ Extend-ability/evolve-ability, ability to upgrade
  ▪ Emergent properties
  ▪ Development cost
  ▪ Ownership cost
  ▪ Logistics

Redundant Systems Decisions are Non-Intuitively Coupled
Will Adding Redundancy Increase Cost?
• Yes
  ▪ Increased acquisition cost
  ▪ More parts
  ▪ More spares needed
  ▪ Increased weight
  ▪ More maintenance
  ▪ Inter-component complexity
• No
  ▪ Higher reliability with less costly parts
  ▪ Better fault detection and isolation reduces repair costs
  ▪ Ability to defer maintenance
  ▪ Reduced component complexity
  ▪ Less need for BIT
[Figure: diagram relating Redundancy, Quality, System Architecture, Unit Failure Rate, Reliability, Weight, Complexity, Cost of System, Cost of Unreliability and Total Cost]
An Integrated Analysis is Required
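A minimal sketch of why the analysis has to be integrated: the cost of the system rises with redundancy while the cost of unreliability falls, so the total has a minimum somewhere in between. The 1-of-N parallel model and every number below (unit cost, unit failure probability, cost of a loss) are illustrative assumptions, not values from the presentation.

```python
def system_cost(n_units, unit_cost=1.0):
    """Acquisition cost, assumed to grow linearly with the number of units."""
    return n_units * unit_cost


def unreliability(n_units, p_unit_fail=0.1):
    """Probability that all n independent units fail (1-of-n parallel model)."""
    return p_unit_fail ** n_units


def total_cost(n_units, unit_cost=1.0, p_unit_fail=0.1, loss_cost=500.0):
    """Cost of system plus expected cost of unreliability (hypothetical units)."""
    return system_cost(n_units, unit_cost) + loss_cost * unreliability(n_units, p_unit_fail)


# Sweep the redundancy level and find where the total cost bottoms out.
costs = {n: total_cost(n) for n in range(1, 7)}
best_n = min(costs, key=costs.get)
```

With these assumed numbers the total cost falls from one unit to three and rises again afterwards. Changing the unit failure probability or the cost of a loss moves the minimum, which is the coupling the slide points at.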

Redundancy for Spacecraft is More Than Traditional Avionics
• avionics (small “a”): The Computer
• Avionics (middle “A”): The Digital System
[Figure: X-38 Fault-Tolerant Computer; 777 Airplane Information Management System]

Redundancy (and Fault Tolerance) for Spacecraft
Big “A” Avionics: All Redundancy in the System
• All of the electrical elements and their support
• Includes power, thermal, data, sensing, actuation, comms, …
• This is the only context to talk about Fault Tolerance, Reliability, … for Redundant Systems
[Figure: Space Station Freedom Modules. Hab, Node 1, Node 2 and Lab A, each with DDCU and SPDA units; Node 2 and Lab A also show MITCS and LITCS loops and their MDM, SDP, MSU, RC, UPS, MPAC, Bridge and gateway elements. Legend: Power, Cooling.]

Fault Tolerance (FT) is an Integrated Systems Issue
• As the mass/power/cost is reduced, the system becomes more coupled and interdependent
  ▪ Across technology subsystems
  ▪ Organizational responsibilities
  ▪ Near term decisions constrain long term behaviors
• A Good FT System Design Process:
  ▪ Can be applied at the conceptual phase
  ▪ Accounts for interactions across subsystems and organizations
  ▪ Exposes consequences of decisions
  ▪ Guides the designers and owners/operators

Fault Tolerance is an Integrated Systems Issue
Getting the integration right is critical
• Fault Tolerance is a Systems Problem
  ▪ It crosses traditional technical and organizational boundaries: data, power, thermal, comms, ECLSS, GN&C, …
  ▪ It is entangled with maintenance, testing, operations
• For mass-limited systems, the level of cross-function integration is increased
  ▪ This complicates the integration, requiring a dedicated team with a system-wide view

Fault Tolerance Taxonomy (Prof. Kishor S. Trivedi)
DEPENDABILITY and SECURITY
• ATTRIBUTES: AVAILABILITY, RELIABILITY, SAFETY, CONFIDENTIALITY, INTEGRITY, MAINTAINABILITY
• MEANS: FAULT PREVENTION, FAULT REMOVAL, FAULT TOLERANCE, FAULT FORECASTING
• THREATS: FAULTS, ERRORS, FAILURES

Some Fault Tolerance Concepts
• Fault: An incorrect state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design
• Error: The manifestation of a fault
• Failure: A result of a delivered service deviating from the specified service, caused by an error or fault
[Figure: Sensor Inputs → Computer → Actuator Outputs, annotated Fault → Error → Failure]

Fault Tolerance Background: Redundancy
[Figure: Sensors → Distributed Agreement Mechanism (Interactive Consistency) → Redundancy Management Processing → Command Distribution Mechanism → Actuators]
This is the crux of the difficulty in redundant components for reliable Input-Compute-Output.

Fault Tolerance Methods
• Redundancy
  ▪ Most practical means to tolerance
• Error (and/or Fault) Detection
  ▪ Fault Removal
• Reconfiguration
• Error Masking
  ▪ Removal + Tolerance
• Software Fault Tolerance
“The only thing (redundancy) guarantees is a higher fault arrival rate compared to a non-redundant system…” [Lala & Harper, IEEE 1994]
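The Lala & Harper remark can be made concrete with a constant-failure-rate model: if each of N independent units fails at rate λ, the first fault somewhere in the system arrives at rate Nλ. The sketch below assumes that exponential model and a hypothetical λ; neither is stated in the slides.

```python
def fault_arrival_rate(n_units, unit_failure_rate):
    """Rate at which the first fault appears among n independent units."""
    return n_units * unit_failure_rate


def mean_time_to_first_fault(n_units, unit_failure_rate):
    """Expected time until any one unit has faulted (exponential model)."""
    return 1.0 / fault_arrival_rate(n_units, unit_failure_rate)


LAMBDA = 1e-4  # failures per hour, hypothetical

simplex_hours = mean_time_to_first_fault(1, LAMBDA)  # one computer
triplex_hours = mean_time_to_first_fault(3, LAMBDA)  # three computers
```

Under these assumptions the triplex system sees its first fault in about a third of the time of the simplex one. Redundancy only pays off if those faults are detected and tolerated, which is what the remaining methods in the list address.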

Building Reliable Software
• Software defect prevention (fault avoidance): Using a disciplined development approach that minimizes the likelihood that defects will be introduced or will remain undetected in the software product (traditional V&V)
• Software fault tolerance: Designing and implementing the software under an assumption that a limited number of residual defects will remain despite the best efforts to eliminate them
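One classic way to write software under the assumption that residual defects remain is a recovery block: run a primary routine, check its result with an acceptance test, and fall back to an independently written alternate if the check fails. The slides do not name a specific technique, so the pattern and all routines below (`primary_sqrt`, `backup_sqrt`, `sqrt_is_acceptable`) are hypothetical illustrations.

```python
def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in order; return the first result that passes."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def primary_sqrt(x):
    # Hypothetical primary routine with a residual defect for inputs below 1.
    return x / 2.0 if x < 1.0 else x ** 0.5


def backup_sqrt(x):
    # Simpler, independently written alternate (Newton iteration).
    guess = max(x, 1.0)
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess


def sqrt_is_acceptable(x, y):
    # Acceptance test: squaring the answer must give back the input.
    return abs(y * y - x) <= 1e-9 * max(1.0, x)


root = recovery_block([primary_sqrt, backup_sqrt], sqrt_is_acceptable, 0.25)
```

Here the primary's defect is caught by the acceptance test and the backup supplies the answer. The quality of the acceptance test bounds what this pattern can catch.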

40 Years of Mission-Critical and Safety-Critical Systems
• Apollo Guidance Computer: fault-free operation to the moon and back
• Digital Fly-By-Wire: fault-tolerant aircraft control
• Argonne National Laboratory: plant monitor
• Venture Star
• Military Space Plane
• Process Control
• Next Generation Systems
• Swim-by-wire
• X-38: technology demo for crew return vehicle
• Space Shuttle: data management system
• UUV: vehicle control
• SSN-21
• Space Shuttle Upgrades
• Crew Return Vehicle
• Trident D5 Missile
• Autonomous Ops
Charles Stark “Doc” Draper, Father of Inertial Navigation

Apollo IMU and Guidance Computer
Three gimbal system imposed risk due to the chance of gimbal lock.
[Figure caption: “AUGUST 1963”]

Seawolf Ship Control Processing Unit
• Quad 2 Fault Tolerant Processor
• Low Maintenance
• High Reliability
• Extensive FDIR
• In use since early 1990’s

NASA Crew Return Vehicle – X-38 FTPP
• Draper developed the quad FT avionics processing architecture for NASA/JSC.
• Draper continues to support NASA/MSFC as the avionics processing design agent.
• Byzantine-resilient X-38 FTPP System Availability is estimated to be 99.999%.
[Figure: X-38 FTPP block diagram. Network Elements, each on a VME Bus with an ICP; the channels that host an FCP also carry Digital I/O, Decomm, Analog Out and MPCC cards.]
NE is an FPGA which implements time synchronization, data exchange and voting across the CCDL (Cross Channel Data Link).
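To put the 99.999% availability estimate quoted for the X-38 FTPP in more tangible terms, the short calculation below converts an availability figure into expected downtime per year. It assumes a 365.25-day year and continuous operation, both simplifications added here.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumed average year length


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60.0


five_nines = downtime_minutes_per_year(0.99999)
three_nines = downtime_minutes_per_year(0.999)
```

Under these assumptions, 99.999% corresponds to roughly five minutes of downtime per year, against close to nine hours for 99.9%.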

NASA/MSFC Crew Launch Vehicle
• Draper is NASA’s trusted design agent for the avionics and software for the upper stage of the Ares I Crew Launch Vehicle.
• NASA Human-rated flight imposes a 2-FT FO/FO/FS requirement (now 1 FT).
• X-38 architecture modifications made to meet new NASA requirements.
A long history of great space system success:
• Apollo – Guidance
• Shuttle – Backup Flight System
• Ares I CLV – US Avionics & FSW
• Ares IV CaLV – Avionics & FSW

Simple Avionics Control Systems
[Figure: Simplex System. Sensor Inputs → Computer → Actuator Outputs]
[Figure: Simplex with BIT. Sensor Inputs → Computer with BIT → Actuator Outputs, with a Disconnect]
[Figure: Dual Standby. Sensor Inputs → Computer with BIT → Actuator Outputs, with Engage Backup to a second Computer with BIT]

Self Checking Pairs
[Figure: Self Checking Pair. Sensor Inputs → Computer FC1 and Computer FC2 → “Identical Outputs?” → Actuator Outputs]
[Figure: Self Checking Pair with Simplex Fault Down. Sensor Inputs → Computer with BIT FC1 and Computer with BIT FC2 → “Output if Agreement. Pass BIT on fail?” → Actuator Outputs; each computer reports BIT Fail]
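The comparison logic in the two self checking pair diagrams can be sketched as follows. The slide poses “Pass BIT on fail?” as a question, so the fault-down rule here (on a miscompare, forward the output of the one computer whose built-in test still passes) is one possible reading, and the function name and return convention are hypothetical.

```python
def self_checking_pair(out_fc1, out_fc2, bit_ok_fc1=True, bit_ok_fc2=True,
                       simplex_fault_down=False):
    """Return (command, status); command is None when output is inhibited."""
    if out_fc1 == out_fc2:
        return out_fc1, "agree"  # identical outputs: pass the command on
    if simplex_fault_down:
        # Miscompare: trust the one computer whose BIT still passes.
        if bit_ok_fc1 and not bit_ok_fc2:
            return out_fc1, "simplex FC1"
        if bit_ok_fc2 and not bit_ok_fc1:
            return out_fc2, "simplex FC2"
    return None, "inhibited"  # cannot tell which computer is right


normal = self_checking_pair(5, 5)
miscompare = self_checking_pair(5, 6)
fault_down = self_checking_pair(5, 6, bit_ok_fc1=True, bit_ok_fc2=False,
                                simplex_fault_down=True)
```

A plain pair can only detect a disagreement and go silent. The fault-down variant keeps operating after the first fault, but only as well as BIT can identify the faulty side.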

Self Checking Pair vs. N-Modular Redundancy
[Figure: Dual Self Checking Pair (SCP). Two Self Checking Pairs, each with CPU1, CPU2 and a High Integrity NIC, connected through Switches to Sensor and Effector]
[Figure: N-Modular Redundant using Cross-Channel Data Link (CCDL). Three Flight Computers, each with CPU1 and a C&DH BUS to its Sensor and Effector]

CCDL Voting Overview
• Each channel reads IMU data (good or faulty) from each data bus.
• IMU_1, IMU_2 and IMU_3 are all exchanged over the CCDL so that each computer has an identical copy of each set of data prior to execution of GN&C software and sensor redundancy management.
[Figure: Channel 1, Channel 2 and Channel 3 with IMU_1, IMU_2 and IMU_3, joined by the Cross Channel Data Link (CCDL); each channel runs Navigation, Guidance and Control]
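The exchange described above can be sketched as a small simulation in which every channel forwards its local IMU sample to every other channel. This is a fault-free illustration only: the channel names, sample values and dictionary layout are assumptions, and a real CCDL must also cope with the asymmetric faults covered in the backup material.

```python
def ccdl_exchange(local_readings):
    """Every channel sends its local sample to all channels (itself included),
    so each channel ends up holding the full set of samples."""
    views = {channel: {} for channel in local_readings}
    for sender, sample in local_readings.items():
        for receiver in views:
            views[receiver][sender] = sample
    return views


# Hypothetical body-rate samples, one per channel's local IMU.
local = {"CH1": 0.101, "CH2": 0.099, "CH3": 0.100}
views = ccdl_exchange(local)

# Congruency check: all channels must hold identical copies
# before GN&C software and sensor redundancy management run.
congruent = views["CH1"] == views["CH2"] == views["CH3"]
```

Because every channel then runs the same software on the same inputs, the outputs of healthy channels can be compared bit-for-bit.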

ARINC 653 Port/CCDL/Voting: CH1 Up Close
[Figure: Channel 1 software stack, split into Applications and Operating System layers (user mode, supervisor mode).
• Channel 1 GN&C Applications, VxWorks ARINC 653 APEX Shared Library, VxWorks System Shared Library
• Channel 1 VxWorks 653 Core OS with FTSSInputTask, FTSSSyncTask and FTSSOutputTask
• ARINC 653 INPUT Port Memory (e.g. IMU) and ARINC 653 OUT Port Memory (e.g. MC)
• 1553 Rcv DMA memory, 1553 Send DMA memory, CH2 CCDL DMA mem, CH3 CCDL DMA mem
• BSP and Drivers: 1553 DMA Driver (in), 1553 DMA Driver (out), CH2 CCDL DMA Driver, CH3 CCDL DMA Driver
• Hardware: Mil-Std-1553 to Channel 1 Sensors (IMU_1 Data) and Channel 1 Actuators (MC Cmds); CCDL to/from other channels
• Port calls: Read_Port() (CH1 Data, CH2 Data, CH3 Data), Write_Port()]
• 50 Hz Control Process: Read_Sensor_Ports(); Redundancy_Mgmt(); Execute Algorithms(); Write_Cmd_Ports()
• Multiple rate groups (1, 10, 20 Hz, etc.) for different GN&C processes
• Redundant Sensor Management Function
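The four calls in the 50 Hz control process can be read as one frame of a read, select, compute, write pipeline. The sketch below mirrors that order with plain dictionaries standing in for ARINC 653 port memory. The function bodies, the outlier threshold and the control gain are hypothetical, and nothing here uses the real APEX API.

```python
import statistics


def read_sensor_ports(input_ports):
    """Stand-in for Read_Sensor_Ports(): one sample per channel."""
    return [input_ports["CH1"], input_ports["CH2"], input_ports["CH3"]]


def redundancy_mgmt(samples, threshold=1.0):
    """Stand-in for Redundancy_Mgmt(): drop samples far from the median,
    then average the survivors (an assumed selection rule)."""
    mid = statistics.median(samples)
    healthy = [s for s in samples if abs(s - mid) <= threshold]
    return statistics.mean(healthy)


def execute_algorithms(rate, gain=-2.0):
    """Stand-in for the GN&C algorithms: a hypothetical proportional law."""
    return gain * rate


def write_cmd_ports(output_ports, command):
    """Stand-in for Write_Cmd_Ports(): place the command in the OUT port."""
    output_ports["MC"] = command


def control_frame(input_ports, output_ports):
    """One control frame, in the order listed on the slide."""
    samples = read_sensor_ports(input_ports)
    selected = redundancy_mgmt(samples)
    command = execute_algorithms(selected)
    write_cmd_ports(output_ports, command)
    return command


inputs = {"CH1": 0.5, "CH2": 99.0, "CH3": 0.25}  # CH2 carries a faulty sample
outputs = {}
command = control_frame(inputs, outputs)
```

The faulty CH2 sample is excluded before the control law runs, so the command depends only on the two consistent channels.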

Summary
• Highly reliable computational systems are required for embedded avionics systems
• An integrated design approach is required for designing reliable fault tolerant systems
  ▪ Overlaps engineering disciplines, affects O&M
• There are different levels of redundancy which have to be managed, some at the computer level (bit-for-bit) and some at application level (redundant sensor processing)
• Vehicle Health Monitoring/FDIR must be designed upfront into the overall fault tolerant architecture

Backup Material
• Byzantine resilience to asymmetric faults
  ▪ Single Source Exchange
  ▪ One vs. two round exchange
• ARINC 653 HW/SW FCR

Byzantine Fault Tolerant Single Source Exchange - Step 1
[Figure: Sensor 1, Sensor 2 and Sensor 3 feed FC1, FC2 and FC3 over their local buses; the local readings are 7, 8 and 9]
Each FC obtains data from its LOCAL sensor connected to the local bus.

Byzantine Resilient Fault Tolerant Two Round Single Source Exchange - Step 2
Perform “Round 1” of a two round exchange.
• Calculate a digital signature “S” (i.e. 32 bit CRC) for local sensor data.
• Send local sensor data and signature “S” to other FCs.
Data received during Round 1:
• FC1: [8, S] from FC2, [9, S] from FC3
• FC2: [7, S] from FC1, [9, S] from FC3
• FC3: [7, S] from FC1, [8, S] from FC2

Byzantine Resilient Fault Tolerant Two Round Single Source Exchange - Step 3
Perform “Round 2” of a two round exchange.
• Reflect data and signature received during Round 1 to other FCs.
• Data and signature are not intentionally altered by the reflecting FC.
• However, in this example, the FC2 CCDL transmitter corrupts the original FC1 data during reflection back to FC1. The original “7” becomes a “4” but the original signature S is unaltered.
Data received during Round 2:
• FC1: [4,S] [9,S] from FC2; [7,S] [8,S] from FC3
• FC2: [8,S] [9,S] from FC1; [7,S] [8,S] from FC3
• FC3: [8,S] [9,S] from FC1; [7,S] [9,S] from FC2
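The three steps above can be replayed in a few lines. The sketch uses `zlib.crc32` as the 32 bit CRC and injects the same corruption as the example (FC2 turns FC1's 7 into a 4 when reflecting it back to FC1). The message layout, function names and the way the fault is injected are assumptions, and the slides do not say what an FC does once it finds a mismatch, so the code only flags it.

```python
import zlib


def sign(value):
    """Signature S: CRC-32 over the value's decimal text (assumed encoding)."""
    return zlib.crc32(str(value).encode())


def round1(local_data):
    """Round 1: each FC sends [data, S] to every other FC."""
    return {rx: {tx: (v, sign(v)) for tx, v in local_data.items() if tx != rx}
            for rx in local_data}


def round2(received, corrupt=None):
    """Round 2: each FC reflects what it received to the other FCs.
    corrupt maps (reflector, source, destination) to a corrupted value."""
    corrupt = corrupt or {}
    reflected = {rx: [] for rx in received}
    for reflector, msgs in received.items():
        for source, (value, sig) in msgs.items():
            for dest in received:
                if dest == reflector:
                    continue
                sent = corrupt.get((reflector, source, dest), value)
                reflected[dest].append((reflector, source, sent, sig))
    return reflected


def bad_reflections(reflected_to_me):
    """List (reflector, source) pairs whose data no longer matches S."""
    return [(reflector, source)
            for reflector, source, value, sig in reflected_to_me
            if sign(value) != sig]


local = {"FC1": 7, "FC2": 8, "FC3": 9}
r1 = round1(local)
r2 = round2(r1, corrupt={("FC2", "FC1", "FC1"): 4})
faults_seen_by_fc1 = bad_reflections(r2["FC1"])
```

FC1 finds that the copy of its own data reflected by FC2 fails the CRC check, while the copy reflected by FC3 still matches. That mismatch is what exposes the fault in the FC2 transmitter.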

Hardware and Software Fault Containment Regions
[Figure: FC1, FC2 and FC3 form Hardware Fault Containment Regions 1, 2 and 3. Each is an SBC750GX running the VxWorks 653 RTOS with a GN&C Partition and a Mission Mgr Partition. Not all partitions shown.]
• Point-to-point, Full Duplex, Gigabit Ethernet CCDL (1000 Mbps)
• Each A653 partition is a Software Fault Containment Region

Attributes of Reliable Systems
• Reliability (Probability of success)
• Mission Length
• Fault Tolerance (e.g. FO/FO/FS)
• Byzantine Resilience
• Inputs (data congruency)
• Synchronization
• Outputs (command voting)
• System Recovery/Restart (Fault Recoverable)
• Radiation Tolerance
• Mass, Power, Volume Constraints
• Concurrent/Parallel Processing
• Proprietary Hardware (or COTS)
• Scalability
• …
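Reliability and mission length interact in a way that is easy to check numerically. Assuming a constant failure rate λ, a single computer has reliability R = exp(-λt), and a 2-of-3 voted triplex with a perfect voter and no repair has 3R² - 2R³. The model and the failure rate below are textbook simplifications chosen for illustration, not figures from the presentation.

```python
import math


def simplex_reliability(failure_rate, mission_hours):
    """Probability that one computer survives the mission."""
    return math.exp(-failure_rate * mission_hours)


def tmr_reliability(failure_rate, mission_hours):
    """Probability that at least 2 of 3 computers survive
    (perfect voter, no repair, independent failures)."""
    r = simplex_reliability(failure_rate, mission_hours)
    return 3 * r**2 - 2 * r**3


LAMBDA = 1e-4  # failures per hour, hypothetical

short_mission = (simplex_reliability(LAMBDA, 100), tmr_reliability(LAMBDA, 100))
long_mission = (simplex_reliability(LAMBDA, 20000), tmr_reliability(LAMBDA, 20000))
```

For the short mission the triplex is far more likely to succeed than the simplex. For the long one, without repair or spares, it is less likely to succeed, which is why mission length appears alongside fault tolerance in the list.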

Fault Tolerance Background: Redundancy
Redundancy must be considered across the whole system
• Temporal redundancy (Primary/Backup/Spare)
• Parallel redundancy
• Functional (analytical fault detection, isolation)
Observations:
• Redundancy makes a simple problem hard.
• The Fault Hypothesis drives the necessary redundancy and the assumptions of the implementations of functional elements.
• Fault Containment Regions partition the analysis of system failures to make reliability analysis tractable.
• Redundancy doesn’t always work (garbage-in-garbage-out).
• Redundancy can also improve the results (such as accuracy of estimators of dynamic state).

Dual SCP and TMR
[Figure: Self Checking Pair with Simplex Fault Down. Sensor Inputs → two Computers → “Output if Agreement. Pass BIT on fail?” → Actuator Outputs]
[Figure: Dual Self Checking Pair. Sensor Inputs → Computers → “Output if Agreement. Pass BIT on fail?”, with Engage Backup]
[Figure: Triple Modular Redundancy. Three Sensor Inputs → three Computers → Voted Output (2/3 majority) (middle value) → Actuator Outputs]
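The two voting rules named in the TMR diagram, 2/3 majority and middle value, can each be written in a few lines. The sketch assumes exactly three channels, bit-for-bit comparison for the majority vote, and numeric samples for the middle value; the function names are hypothetical.

```python
from collections import Counter


def majority_vote(commands):
    """Return the command that at least 2 of 3 channels agree on
    bit-for-bit, or None when no two channels agree."""
    value, count = Counter(commands).most_common(1)[0]
    return value if count >= 2 else None


def middle_value(samples):
    """Return the middle of three numeric samples, which is never
    the outlier on either side."""
    return sorted(samples)[1]


voted_cmd = majority_vote([0x1F, 0x1F, 0x2A])  # one channel disagrees
no_quorum = majority_vote([1, 2, 3])           # all three disagree
selected = middle_value([10.2, 55.0, 10.1])    # one sample is wild
```

Majority voting suits outputs that healthy channels compute identically, while middle-value selection suits redundant analog inputs that legitimately differ by small amounts.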