Using Communication by Time to Implement Fail-Safe Duplex Redundancy
IFIP 10.4 Workshop on Time and Dependability, Les Trois Ilets, Martinique, 20-21 January 2000
David Powell, Jean Arlat, Didier Essamé
Asynchronous, or "time-free", model
➥ either the communication or the reaction time bound does not exist
➥ P cannot decide whether Q has stopped or whether Q, m1 or m2 are very slow
Synchronous, or "bounded time", model
➥ communication bound guaranteed (the network never fails)
➥ P can declare that Q has failed if TD - TA > 2∆P + σP
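The timeout test of the synchronous model can be written as a one-line predicate. This is a minimal sketch, not the authors' code: the function name is hypothetical, and it assumes that TA and TD are readings of P's clock and that ∆P and σP are the communication and reaction bounds seen by P.

```python
def q_has_failed(t_a: float, t_d: float, delta_p: float, sigma_p: float) -> bool:
    """Return True if P may declare Q failed under the synchronous model.

    Assumption: t_a and t_d are readings of P's clock, delta_p bounds the
    one-way communication delay and sigma_p bounds Q's reaction time.
    """
    # Round trip (2 * delta_p) plus reaction time (sigma_p) is the longest
    # wait that a correct Q can impose on P.
    return (t_d - t_a) > 2 * delta_p + sigma_p


# Hypothetical values, in milliseconds.
late = q_has_failed(t_a=0.0, t_d=25.0, delta_p=10.0, sigma_p=4.0)
on_time = q_has_failed(t_a=0.0, t_d=24.0, delta_p=10.0, sigma_p=4.0)
print(late, on_time)
```

The predicate is only sound if the bounds really hold, which is exactly what the asynchronous model refuses to assume.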
The Real System
■ Assumptions
● human lives are at stake, so must assume that communication is uncertain:
➥ messages can be lost (omission failures)
➥ messages can be delayed (performance failures)
● fail-safe processing units (coded processor technique)
● table-driven process scheduling
● fail-safe local clocks
Timed Asynchronous Model
■ The real system
● human lives are at stake, so must assume that communication is uncertain:
➥ messages can be lost (omission failure)
➥ messages can be delayed (performance failure)
● fail-safe processing units (coded processor technique)
● table-driven process scheduling
● fail-safe local clocks
■ The model
● Datagram service
➥ Defined upper quantile on transmission delay (δ)
➥ Messages can only suffer omission/performance failures
● Process management service
➥ Defined upper quantile on scheduling delay (σ)
➥ Processes can only suffer crash/performance failures
● Hardware clock service
➥ Each non-crashed process has access to a hardware clock with a known upper bound on drift rate (ρ) (NB: clocks are not, and cannot be, deterministically synchronized)
[Cristian & Fetzer 1998]
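The failure classes of the datagram service can be illustrated with a small classifier. This is a sketch from the point of view of an omniscient observer who knows the real transmission delay, which no process in the model can measure directly. The names `Delivery` and `classify_delivery` are hypothetical, and representing a lost message by `None` is an assumption of the example.

```python
from enum import Enum
from typing import Optional


class Delivery(Enum):
    TIMELY = "timely"
    PERFORMANCE_FAILURE = "performance failure"
    OMISSION_FAILURE = "omission failure"


def classify_delivery(transmission_delay: Optional[float], delta: float) -> Delivery:
    """Classify one datagram against the transmission delay bound delta.

    Assumption: transmission_delay is the real delay, or None if the
    message was never delivered.
    """
    if transmission_delay is None:
        return Delivery.OMISSION_FAILURE      # message lost
    if transmission_delay > delta:
        return Delivery.PERFORMANCE_FAILURE   # delivered, but too late
    return Delivery.TIMELY


# Hypothetical delays in seconds, with delta = 10 ms.
print(classify_delivery(0.004, delta=0.010).value)
print(classify_delivery(0.030, delta=0.010).value)
print(classify_delivery(None, delta=0.010).value)
```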
Fail-Aware Datagram Service
■ Let td(m) be the real delay incurred by a message m, and ub(m) an upper bound on that delay: td(m) ≤ ub(m)
■ Choose a constant ∆ so that m can be classified according to:
● if ub(m) ≤ ∆, message is fast
● if ub(m) > ∆, message is slow
■ Moreover, if periodicity of messages is ≤ τ, can calculate ∆ such that, when td(m') < δ and td(m) < δ (P and Q "connected"), then m is delivered as fast:
● ∆ ≥ 4τρ + (2+4ρ)δ - δmin (ensures progress when P, Q and the channel between them are timely)
[Figure: timing diagram on a real-time axis (t1 to t4) of messages m' and m exchanged between P and Q, with clock readings TQ(t1), TP(t2), TP(t3), TQ(t4) and the annotations "≥ δmin" and "td(m) ≤ ub(m)"]
[Fetzer & Cristian 1997]
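The fast/slow classification and the lower bound on ∆ can be worked through numerically. The sketch below only evaluates the inequality stated above and is not the authors' implementation. The function names and the numeric values are hypothetical, and the computation of ub(m) itself is not shown because the slides do not give it.

```python
def min_fast_threshold(tau: float, rho: float, delta: float, delta_min: float) -> float:
    """Smallest ∆ satisfying ∆ >= 4τρ + (2+4ρ)δ - δmin."""
    return 4 * tau * rho + (2 + 4 * rho) * delta - delta_min


def classify(ub: float, threshold: float) -> str:
    """Classify a message from the upper bound ub(m) on its delay."""
    return "fast" if ub <= threshold else "slow"


# Hypothetical parameters, in seconds.
tau = 0.1          # message period
rho = 1e-5         # maximum clock drift rate
delta = 0.010      # transmission delay bound
delta_min = 0.001  # minimum transmission delay

threshold = min_fast_threshold(tau, rho, delta, delta_min)
print(threshold)
print(classify(0.015, threshold), classify(0.025, threshold))
```

With these values the threshold is a little over 19 ms, so a message whose bound is 15 ms is fast and one whose bound is 25 ms is slow.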
Protocol for Asymmetric Duplex REdundancy (PADRE)
[Figure: two units, each comprising an application module, PADRE, a fail-aware datagram layer and an unreliable datagram layer]
■ Idea:
● Cannot guarantee consistency of duplicated units since communication is uncertain
● So, build a fail-aware multicast protocol
● Indicator signals when consistency is ensured
➥ Nominal duplex configuration
• primary unit in primary mode
• secondary unit in standby mode
● Inhibit redundancy switching when consistency is not ensured
➥ Safe duplex configuration
• primary unit in primary mode
• secondary unit in quarantine mode
[Essamé et al. 1999]
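The link between the secondary's mode, the duplex configuration and the inhibition of redundancy switching can be sketched as follows. The enum and function names are hypothetical. The only behaviour encoded is what the slide states: standby gives the nominal configuration, quarantine gives the safe configuration, and switching is inhibited in the latter.

```python
from enum import Enum


class SecondaryMode(Enum):
    STANDBY = "standby"
    QUARANTINE = "quarantine"


def duplex_configuration(mode: SecondaryMode) -> str:
    """Name the duplex configuration implied by the secondary's mode."""
    return "nominal" if mode is SecondaryMode.STANDBY else "safe"


def switching_allowed(mode: SecondaryMode) -> bool:
    """Redundancy switching is inhibited unless consistency is ensured."""
    return mode is SecondaryMode.STANDBY


for mode in SecondaryMode:
    print(mode.value, duplex_configuration(mode), switching_allowed(mode))
```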
PADRE System Modes
[Figure: state diagram of the system modes. Configurations: nominal duplex config., safe duplex config., simplex config. Other labels: nominal service, safe, unsafe, (benign failure), catastrophic failure. Transition labels: fault of primary (twice), fault of secondary, fault of primary or secondary, potential inconsistency (transmission fault), state restoration, repair (twice).]
Protocol Properties
■ Safety properties
● Unique Primary (UP): at any instant, only one unit is in the primary mode
● Quarantine (MQ): Secondary must leave standby mode within bounded delay if inconsistent with Primary; return to standby mode only allowed when consistent
● Prefix of History (PH): history of Primary must always be a prefix of that of Secondary
■ Progress properties
● Agreement (AP): in the absence of faults, any input accepted by one unit at time t is accepted by the other unit in the interval [t-ω, t+ω]
● Limited Quarantine (LQ): in the absence of faults, a unit in quarantine must eventually switch to standby
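Two of these properties lend themselves to simple executable checks, for instance in a test harness. The sketch below is illustrative only. The function names are hypothetical, and it assumes that a history is an ordered sequence of accepted inputs.

```python
from typing import Sequence


def prefix_of_history(primary: Sequence[str], secondary: Sequence[str]) -> bool:
    """PH: the primary's history must be a prefix of the secondary's."""
    n = len(primary)
    return n <= len(secondary) and list(secondary[:n]) == list(primary)


def agreement(t_unit_a: float, t_unit_b: float, omega: float) -> bool:
    """AP: both units accept the same input within omega of each other."""
    return abs(t_unit_a - t_unit_b) <= omega


# The secondary accepts first, so it may be ahead of the primary.
print(prefix_of_history(["m1", "m2"], ["m1", "m2", "m3"]))
# The primary must never be ahead of the secondary.
print(prefix_of_history(["m1", "m2", "m3"], ["m1", "m2"]))
print(agreement(10.0, 10.4, omega=0.5))
```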
Protocol Principle
■ Primary only accepts an input if the secondary:
➥ has accepted it, or
➥ has been placed in quarantine, or
➥ has failed
■ Secondary only accepts messages sent to it from the primary
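The two acceptance rules can be expressed as predicates. This is a minimal sketch with hypothetical names. In particular, the `UNKNOWN` status is an assumption added to represent the case where the primary has no evidence yet about the secondary.

```python
from enum import Enum


class SecondaryStatus(Enum):
    ACCEPTED = "has accepted the input"
    QUARANTINED = "has been placed in quarantine"
    FAILED = "has failed"
    UNKNOWN = "no evidence yet"  # assumption: not named on the slide


def primary_may_accept(status: SecondaryStatus) -> bool:
    """The primary accepts an input only in the three cases listed above."""
    return status in (
        SecondaryStatus.ACCEPTED,
        SecondaryStatus.QUARANTINED,
        SecondaryStatus.FAILED,
    )


def secondary_may_accept(sender: str) -> bool:
    """The secondary accepts only messages sent to it from the primary."""
    return sender == "primary"


print(primary_may_accept(SecondaryStatus.UNKNOWN))
print(secondary_may_accept("primary"))
```

Because the secondary always accepts before the primary does, the primary's history cannot run ahead of the secondary's.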
Reception Protocol
[Figure: message sequence between Primary and Secondary for messages mi and mi+1, with labels mr, ack(mi) and the acknowledgement timeout interval A. Annotations: for mi, the primary "can accept since secondary has accepted"; for mi+1, the primary "can only accept when sure that secondary is in quarantine or has failed" ("secondary, go to quarantine!").]
Quarantine Control Protocol
[Figure: labels "go", "stop", "don't go to quarantine", "quarantine"]
R: refresh period
I: survival timeout interval
Q: delay for certain quarantine or failure of secondary
[Figure: message sequence between Primary and Secondary for messages mi and mi+1, with labels mr, ack(mi), the acknowledgement timeout A, the delay Q, and periodic "Don't go to quarantine" messages, with refresh periods R and survival timeout intervals I marked]
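The secondary's side of this protocol can be sketched as a watchdog on its local clock. The class below is an illustration under stated assumptions, not the PADRE implementation. It assumes that each "don't go to quarantine" message delivered as fast restarts the survival timeout I, that the secondary enters quarantine when I expires, and that leaving quarantine is handled elsewhere (by the de-quarantine protocol). All names are hypothetical.

```python
class SecondarySurvivalTimer:
    """Watchdog that moves the secondary from standby to quarantine."""

    def __init__(self, survival_timeout: float, now: float) -> None:
        self.survival_timeout = survival_timeout  # I, on the local clock
        self.last_refresh = now
        self.mode = "standby"

    def on_dont_go_to_quarantine(self, now: float) -> None:
        """Restart the timeout; has no effect once in quarantine."""
        if self.mode == "standby":
            self.last_refresh = now

    def tick(self, now: float) -> str:
        """Check the timeout and return the current mode."""
        if self.mode == "standby" and now - self.last_refresh >= self.survival_timeout:
            self.mode = "quarantine"
        return self.mode


# Hypothetical scenario with I = 3 time units.
timer = SecondarySurvivalTimer(survival_timeout=3.0, now=0.0)
timer.on_dont_go_to_quarantine(now=2.5)  # refresh arrives in time
print(timer.tick(now=5.0))               # 2.5 units since the last refresh
print(timer.tick(now=5.5))               # 3.0 units since the last refresh
```

Choosing the refresh period R smaller than I leaves room for an occasional lost or slow refresh before the secondary quarantines itself.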
Choice of Value for Q
[Figure: Primary and Secondary timelines against real time, with instants t1, t2, t3, clock readings TP(t1), TP(t3), TS(t2), TS(t3), a "Don't go to quarantine" message, and the intervals Q (primary) and I (secondary)]
Need: TP(t3) ≤ TP(t1) + Q
Equivalently: Q ≥ TP(t3) - TP(t1)
Now: (t3-t1) = (t3-t2) + (t2-t1)
and:
(t3-t2) ≤ I (1+ρ)
(t2-t1) ≤ ∆ (fail-aware datagram)
so:
(t3-t1) ≤ ∆ + I (1+ρ)
but:
TP(t3) - TP(t1) ≤ (t3-t1) (1+ρ)
Therefore, must choose Q such that:
Q ≥ [∆ + I (1+ρ)] (1+ρ)
or, neglecting the ρ² term (ρ ≪ 1):
Q ≥ ∆ (1+ρ) + I (1+2ρ)
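The two forms of the bound on Q can be compared numerically. The sketch below evaluates both expressions from the derivation above. The function names and the parameter values are hypothetical.

```python
def quarantine_delay_exact(delta_fa: float, survival_timeout: float, rho: float) -> float:
    """Q >= [∆ + I(1+ρ)](1+ρ)."""
    return (delta_fa + survival_timeout * (1 + rho)) * (1 + rho)


def quarantine_delay_first_order(delta_fa: float, survival_timeout: float, rho: float) -> float:
    """Q >= ∆(1+ρ) + I(1+2ρ), i.e. the exact bound without the I·ρ² term."""
    return delta_fa * (1 + rho) + survival_timeout * (1 + 2 * rho)


# Hypothetical parameters, in seconds.
delta_fa = 0.020   # fail-aware datagram threshold ∆
interval = 0.500   # survival timeout interval I
rho = 1e-4         # maximum clock drift rate

q_exact = quarantine_delay_exact(delta_fa, interval, rho)
q_first = quarantine_delay_first_order(delta_fa, interval, rho)
print(q_exact, q_first, q_exact - q_first)
```

The two values differ by I·ρ², which is negligible for realistic drift rates; a strictly conservative design would nevertheless use the exact form, since it is the larger of the two.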
Unique Primary Property
● Unique Primary (UP): at any instant, only one unit is in the primary mode
➥ Software implementation would require third party to allow majority election of a leader
➥ Hardware implementation by means of a bistable safety relay
[Figure: two units, each comprising an application module, PADRE, a fail-aware datagram layer and an unreliable datagram layer]
De-quarantine Protocol
Objective: secondary to revert from quarantine to standby, to resume its role as a back-up
Principle:
● transfer state of primary to secondary
● in general case, state cannot be transferred in single message
● state of primary may be updated while transfer is being carried out
[Bondavalli et al. 1998]
[Figure: the primary holds state blocks State [1] to State [n], each with a tag bit (0 or 1 in the example shown); the secondary (in quarantine) holds its own State [1] to State [n]. A concurrent update can modify the primary's state during the transfer.]
Primary: do while ∃ tag=1; if last, resume sending "don't go to quarantine"
Secondary (in quarantine): if last, switch to "standby" mode
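The tagged transfer loop can be sketched as below. This is an illustration under assumptions, not the published protocol. It assumes that tag=1 marks a block that still has to be sent (either never sent or modified since it was last sent), that a block's tag is cleared just before it is sent, and that concurrent updates eventually stop so that the loop terminates. The `concurrent_update` callback is a hypothetical stand-in for the application modifying the primary's state.

```python
from typing import Callable, List, Optional


def transfer_state(
    primary_state: List[int],
    tags: List[int],
    secondary_state: List[int],
    concurrent_update: Optional[Callable[[int], None]] = None,
) -> int:
    """Copy tagged blocks until no tag is set; return the number of sends."""
    sent = 0
    while any(tags):                      # do while ∃ tag = 1
        index = tags.index(1)             # first block still to be sent
        tags[index] = 0                   # cleared first, so a later update re-tags it
        secondary_state[index] = primary_state[index]
        sent += 1
        if concurrent_update is not None:
            concurrent_update(sent)       # may change primary_state and set tags
    return sent


primary = [10, 20, 30]
tags = [1, 1, 1]
secondary = [0, 0, 0]


def update(step: int) -> None:
    # Hypothetical concurrent update: block 0 changes after the second send.
    if step == 2:
        primary[0] = 11
        tags[0] = 1


sent = transfer_state(primary, tags, secondary, update)
print(sent, secondary)
```

When the loop exits, the block just sent was the last one: at that point the primary would resume sending "don't go to quarantine" and the secondary would switch to standby, as stated above.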
Conclusion
■ Timed asynchronous model
● safety does not rely on coverage of synchronous assumptions
● progress can be made when system behaves "as if" it were synchronous
● appropriate model for designing fail-safe distributed systems
■ Asymmetric redundancy management
● tolerance of potential inconsistency
● fault-tolerance temporarily sacrificed to guarantee safety
■ Feasibility study
● automatic subway in San Juan, Puerto Rico
■ Projected applications
● automatic subway in Hong Kong