Transcript
Slide 1
Validating Computer System and Network Trustworthiness
Prof. William H. Sanders
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
• Dependability is the ability of a system to deliver a specified service.
• System service is classified as proper if it is delivered as specified; otherwise it is improper.
• System failure is a transition from proper to improper service.
• System restoration is a transition from improper to proper service.
⇒ The “properness” of service depends on the user’s viewpoint!
Basic Validation Terms
• Measures -- What you want to know about a system. Used to determine if a realization meets a specification.
• Models -- An abstraction of the system at an appropriate level of abstraction and/or detail to determine the desired measures about a realization.
• Dependability Model Solution Methods -- Methods by which one determines measures from a model. Models can be solved by a variety of techniques:
– Combinatorial Methods -- The structure of the model is used to obtain a simple arithmetic solution.
– Analytical/Numerical Methods -- A system of linear differential equations or linear equations is constructed, which is solved to obtain the desired measures.
– Simulation -- A realization of the system is executed, and estimates of the measures are calculated based on the resulting executions (also known as sample paths or trajectories).
⇒ Möbius supports performance/reliability/availability validation by analytical/numerical and simulation-based methods.
• Reliability - a measure of the continuous delivery of service
– R(t) is the probability that a system delivers proper service throughout [0,t].
• Safety - a measure of the time to catastrophic failure
– S(t) is the probability that no catastrophic failures occur during [0,t].
– Analogous to reliability, but concerned with catastrophic failures.
• Time to Failure - measure of the time to failure from last restoration. (Expected value of this measure is referred to as MTTF - Mean time to failure.)
• Maintainability - measure of the time to restoration from last experienced failure. (Expected value of this measure is referred to as MTTR - Mean time to repair.)
• Coverage - the probability that, given a fault, the system can tolerate the fault and continue to deliver proper service.
• The fact that the exponential random variable has the memoryless property indicates that the “rate” at which events occur is constant, i.e., it does not change over time.
• Often, the event associated with a random variable X is a failure, so the “event rate” is often called the failure rate or the hazard rate.
• The event rate of X is defined as the probability that the event associated with X occurs within the small interval [t, t + ∆t], given that the event has not occurred by time t, per the interval size ∆t:
• This can be thought of as looking at X at time t, observing that the event has not occurred, and measuring the number of events (probability of the event) that occur per unit of time at time t.
Important Fact 2: The exponential random variable has a constant failure rate!
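To make Important Fact 2 concrete, here is the standard hazard-rate calculation for an exponential random variable with parameter λ (a short derivation added here for reference; it follows directly from the definitions above).

```latex
h(t) = \lim_{\Delta t \to 0} \frac{P[t < X \le t + \Delta t \mid X > t]}{\Delta t}
     = \lim_{\Delta t \to 0} \frac{F_X(t+\Delta t) - F_X(t)}{\Delta t \,(1 - F_X(t))}
     = \frac{f_X(t)}{1 - F_X(t)}
     = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}}
     = \lambda .
```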
Probability Review: Minimum of Two Independent Exponentials
Another interesting property of exponential random variables is that the minimum of two independent exponential random variables is also an exponential random variable. Let A and B be independent exponential random variables with rates α and β, respectively. Let us define X = min{A, B}. What is FX(t)?
Probability Review: Competition of Two Independent Exponentials
If A and B are independent and exponential with rates α and β respectively, and A and B are competing, then we know that one will “win” with an exponentially distributed time (with rate α + β). But what is the probability that A wins?
P[A < B] = ∫₀^∞ P[A < B | A = x] fA(x) dx
         = ∫₀^∞ P[x < B] α e^(−αx) dx
         = ∫₀^∞ P[B > x] α e^(−αx) dx
         = ∫₀^∞ (1 − P[B ≤ x]) α e^(−αx) dx
         = ∫₀^∞ (1 − [1 − e^(−βx)]) α e^(−αx) dx
         = ∫₀^∞ e^(−βx) α e^(−αx) dx
         = α ∫₀^∞ e^(−(α+β)x) dx
         = α / (α + β)
Important Fact 4: If A and B are independent, competing exponentials, with rates α and β respectively, the probability that A occurs before B is α/(α + β)!
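Facts 3 and 4 are easy to check numerically. The following short Python sketch (not part of the original slides; the rates and sample size are arbitrary choices) estimates both the mean of min{A, B} and the probability that A wins.

```python
import random

alpha, beta = 2.0, 3.0      # example rates (arbitrary values for illustration)
n = 100_000                 # number of samples

a_wins = 0
min_sum = 0.0
for _ in range(n):
    a = random.expovariate(alpha)   # sample A ~ Exp(alpha)
    b = random.expovariate(beta)    # sample B ~ Exp(beta)
    if a < b:
        a_wins += 1
    min_sum += min(a, b)

print("P[A wins]   estimate:", a_wins / n, " theory:", alpha / (alpha + beta))
print("E[min(A,B)] estimate:", min_sum / n, " theory:", 1.0 / (alpha + beta))
```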
• Combinatorial validation methods are the simplest kind of analytical/numerical techniques and can be used for reliability and availability modeling under certain assumptions.
• Assumptions are that component failures are independent, and for availability, repairs are independent.
• When these assumptions hold, simple formulas for reliability and availability exist.
• One key to building highly available systems is the use of reliable components and systems.
• Reliability: The reliability of a system at time t (R(t)) is the probability that the system operation is proper throughout the interval [0,t].
• Probability theory and combinatorics can be directly applied to reliability models.
• Let X be a random variable representing the time to failure of a component. The reliability of the component at time t is given by RX(t) = P[X > t] = 1 - P[X ≤ t] = 1 - FX(t).
• Similarly, we can define unreliability at time t by UX(t) = P[X ≤ t] = FX(t).
Failure Rate
What is the rate at which a component fails at time t? This is the probability that a component that has not yet failed fails in the interval (t, t + ∆t), as ∆t → 0.
Note that we are not looking at P[X ∈ (t, t + ∆t)] = fX(t). Rather, we are seeking P[X ∈ (t, t + ∆t) | X > t].
System Reliability
While FX can give the reliability of a component, how do you compute the reliability of a system?
System failure can occur when one, all, or some of the components fail. If one makes the independent failure assumption, system failure can be computed quite simply. The independent failure assumption states that all component failures of a system are independent, i.e., the failure of one component does not cause another component to be more or less likely to fail.
Given this assumption, one can determine:1) Minimum failure time of a set of components2) Maximum failure time of a set of components3) Probability that k of N components have failed at a particular time t.
Maximum of n Independent Failure Times
Let X1, . . . , Xn be independent component failure times. Suppose the system fails at time S if all the components fail.
Thus, S = max{X1, . . . , Xn}.
What is Fs(t)?
FS(t) = P[S ≤ t]
      = P[X1 ≤ t AND X2 ≤ t AND . . . AND Xn ≤ t]
      = P[X1 ≤ t] P[X2 ≤ t] . . . P[Xn ≤ t]      (by independence)
      = FX1(t) FX2(t) . . . FXn(t)                (by definition)
Let X1, . . . , Xn be independent component failure times. A system fails at time S if any of the components fail. Thus, S = min{X1, . . . , Xn}. What is FS(t)?
FS(t) = P[S ≤ t] = P[X1 ≤ t OR X2 ≤ t OR . . . OR Xn ≤ t]
This is an application of the law of total probability (LOTP).
Minimum of n Independent Component Failure Times
Trick: If A is an event, and Ā is the complement set such that A ∪ Ā = Ω and A ∩ Ā = ∅, then
P[A1 OR A2 OR . . . OR An] = 1 - P[Ā1 AND Ā2 AND . . . AND Ān].
Applying this with independence: FS(t) = 1 - (1 - FX1(t))(1 - FX2(t)) . . . (1 - FXn(t)).
k of N
Let X1, . . . , Xn be component failure times that have identical distributions (i.e., FX1 = FX2 = . . . = FXn). The system fails at time S if k of the N components fail.
FS(t) = P[at least k components failed by time t]
      = P[k failed OR k + 1 failed OR . . . OR N failed]
      = P[k failed] + P[k + 1 failed] + . . . + P[N failed]
What is P[exactly k failed]?
P[exactly k failed] = P[k failed and (N - k) have not]
                    = C(N, k) [FX(t)]^k [1 - FX(t)]^(N-k),
where FX(t) is the failure distribution of each component.
A system comprises N components, where the component failure times are given by the random variables X1, . . . , XN. The system fails at time S with distribution FS if:
Reliability Formalisms
There are several popular graphical formalisms to express system reliability. The core of the solvers is the methods we have just examined. In particular, we will examine fault trees and reliability graphs.
There is nothing particularly special about these formalisms except their popularity. It is easy to implement these formalisms, or design your own, in a spreadsheet, for example.
Example
A NASA satellite architecture under study is designed for high reliability. The major computer system components include the CPU system, the high-speed network for data collection and transmission, and the low-speed network for engineering and control. The satellite fails if any of the major systems fail.
There are 3 computers, and the computer system fails if 2 or more of the computers fail. Failure distribution of a computer is given by FC.
There is a redundant (2) high-speed network, and the high-speed network system fails if both networks fail. The distribution of a high-speed network failure is given by FH.
The low-speed network is arranged similarly, with a failure distribution of FL.
• Components are leaves in the tree.
• A component fails = logical value of true; otherwise false.
• The nodes in the tree are boolean AND, OR, and k of N gates.
• The system fails if the root is true.
AND gates: true if all the components are true (fail).
OR gates: true if any of the components are true (fail).
k of N gates: true if at least k of the components are true (fail).
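As a concrete illustration of how these gates translate into arithmetic, the sketch below evaluates the satellite example at a single time t, assuming (purely for illustration) exponential component failure distributions; the rates and the mission time are made-up values, not from the slides.

```python
from math import comb, exp

def F_exp(rate, t):
    """Unreliability of a component with an exponential failure time."""
    return 1.0 - exp(-rate * t)

def k_of_n(k, n, F):
    """P[at least k of n i.i.d. components have failed], component unreliability F."""
    return sum(comb(n, j) * F**j * (1 - F)**(n - j) for j in range(k, n + 1))

t = 5 * 8760.0                 # 5 years in hours (illustrative mission time)
F_C = F_exp(1e-6, t)           # computer unreliability (assumed rate)
F_H = F_exp(5e-7, t)           # high-speed network unreliability (assumed rate)
F_L = F_exp(5e-7, t)           # low-speed network unreliability (assumed rate)

F_cpu_sys = k_of_n(2, 3, F_C)  # 2-of-3 gate: computer system fails if >= 2 computers fail
F_hs_sys = F_H * F_H           # AND gate: both high-speed networks fail
F_ls_sys = F_L * F_L           # AND gate: both low-speed networks fail

# OR gate at the root: satellite fails if any major system fails
F_sat = 1.0 - (1 - F_cpu_sys) * (1 - F_hs_sys) * (1 - F_ls_sys)
print("Satellite unreliability at t:", F_sat, " reliability:", 1 - F_sat)
```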
Reliability Graph Example
Reliability graphs can implement more complex interactions. For example, a telephone network “fails” if there is no path from source to sink.
Conditioning Fault Trees
It is also possible to use conditioning to solve more complex fault trees. If the same component appears more than once in a fault tree, it violates the independent failure assumption. However, a conditioned fault tree can be solved.
Example: A component C appears multiple times in the fault tree.
FS(t) = FS|C failed(t) · FC(t) + FS|C not failed(t) · (1 - FC(t)),
where S|C failed is the system given that C has failed, and S|C not failed is the system given that C has not failed.
• Frequently, the desired measure of a reliability model is the reliability at some time t. Thus, the distribution of the system reliability is superfluous; R(t) is the only thing of interest.
• This condition simplifies computation because all that is necessary for solution is the reliability of the components at time t. Solution then becomes a straightforward computation.
• If a system is described in terms of the availability of components at time t, then we may compute the system availability in the same way that reliability is computed. The restriction is that all component behaviors must be independent of one another.
Reliability/Availability Tables
A system comprises N components. Reliability of component i at time t is given by RXi(t), and the availability of component i at time t is given by AXi(t).
Condition — System Reliability — System Availability
• System fails if all components fail:  RS(t) = 1 − ∏i (1 − RXi(t));  AS(t) = 1 − ∏i (1 − AXi(t))
• System fails if one component fails:  RS(t) = ∏i RXi(t);  AS(t) = ∏i AXi(t)
• System fails if at least k components fail, identical distribution:  RS(t) = Σ(j=0..k−1) C(N, j) [1 − RX(t)]^j [RX(t)]^(N−j)  (availability analogous, with AX(t) in place of RX(t))
• System fails if at least k components fail, general case:  computed by summing, over all subsets of at least k components, the probability that exactly those components have failed
• For hardware, MIL-HDBK-217 is widely used.
– Not always current with modern components.
– Lacks distributions; it only contains failure rates.
– While not perfect, it seems to be the best source that exists. However, numbers from MIL-HDBK-217 should be used with caution.
• Due to the nature of software, no accepted mechanism exists to predict software reliability before the software is built.
– The best guess is the reliability of previously built similar software.
• In all cases, numbers should be used with caution and adjusted based on observation and experience.
• No substitute for empirical observation and experience!
– The amount of time a program takes to execute can be computed precisely if all factors are known, but this is nearly impossible and sometimes useless. At a more abstract level, we can approximate the running time by a random variable.
– Fault arrivals almost always must be modeled by a random process.
We begin by describing a subset of SANs: stochastic Petri nets.
Stochastic activity networks, or SANs, are a convenient, graphical, high-level language for describing system behavior. SANs are useful in capturing the stochastic (or random) behavior of a system.
Stochastic Petri Net Review
One of the simplest high-level modeling formalisms is called stochastic Petri nets. A stochastic Petri net is composed of the following components:
• Places: which contain tokens, and are like variables
• tokens: which are the “value” or “state” of a place
• transitions: which change the number of tokens in places
• input arcs: which connect places to transitions
• output arcs: which connect transitions to places
Firing Rules for SPNs
A stochastic Petri net (SPN) executes according to the following rules:
• A transition is said to be enabled if for each place connected by input arcs, the number of tokens in the place is ≥ the number of input arcs connecting the place and the transition.
Firing Rules, cont.
• A transition may fire if it is enabled. (More about this later.)
• If a transition fires, for each input arc, a token is removed from the corresponding place, and for each output arc, a token is added to the corresponding place.
Example:
Note: tokens are not necessarily conserved when a transition fires.
Specification of Stochastic Behavior of an SPN
• A stochastic Petri net is made from a Petri net by
– Assigning an exponentially distributed time to all transitions.
– Time represents the “delay” between enabling and firing of a transition.
– Transitions “execute” in parallel with independent delay distributions.
• Since the minimum of multiple independent exponentials is itself exponential, time between transition firings is exponential.
• If a transition t becomes enabled, and before t fires, some other transition fires and changes the state of the SPN such that t is no longer enabled, then t aborts, that is, t will not fire.
• Since the exponential distribution is memoryless, one can say that transitions that remain enabled continue or restart, as is convenient, without changing the behavior of the network.
Notes on SPNs
• SPNs are much easier to read, write, modify, and debug than Markov chains.
• SPN to Markov chain conversion can be automated to afford numerical solutions to Markov chains.
• Most SPN formalisms include a special type of arc called an inhibitor arc, which enables the associated transition only if there are zero tokens in the associated place, and the identity (do nothing) function. Example: modify an SPN to give writes priority.
• Limited in their expressive power: may only perform +, -, >, and test-for-zero operations.
• These very limited operations make it very difficult to model complex interactions.
• Simplicity allows for certain analysis, e.g., a network protocol modeled by an SPN may detect deadlock (if inhibitor arcs are not used).
• More general and flexible formalisms are needed to represent real systems.
Stochastic Activity Networks
The need for more expressive modeling languages has led to several extensions to stochastic Petri nets. One extension that we will examine is called stochastic activity networks. Because there are a number of subtle distinctions relative to SPNs, stochastic activity networks use different words to describe ideas similar to those of SPNs.
Stochastic activity networks have the following properties:
• A general way to specify that an activity (transition) is enabled
• A general way to specify a completion (firing) rule
• A way to represent zero-timed events
• A way to represent probabilistic choices upon activity completion
• State-dependent parameter values
• General delay distributions on activities
A fault-tolerant computer system is made up of two redundant computers. Each computer is composed of three redundant CPU boards. A computer is operational if at least 1 CPU board is operational, and the system is operational if at least 1 computer is operational.
CPU boards fail at a rate of 1 per 10⁶ hours, there is a 0.5% chance that a board failure will cause a computer failure, and there is a 0.8% chance that a board will fail in a way that causes a catastrophic system failure.
Reward Variables
Reward variables are a way of measuring performance- or dependability-related characteristics about a model.
Examples:
– Expected time until service
– System availability
– Number of misrouted packets in an interval of time
– Processor utilization
– Length of downtime
– Operational cost
– Module or system reliability
Reward Structure Example
A web server failure model is used to predict profits. When the web server is fully operational, profits accumulate at $N/hour. In a degraded mode, profits accumulate at $N/6 per hour. Repairs cost $K.
By carefully integrating the reward structure from 0 to t, we get the profit at time t. This is an example of an “interval-of-time” variable.
Rate reward structure:
R(m) = N      if m is a fully functioning marking
     = N/6    if m is a degraded-mode marking
     = 0      otherwise
Impulse reward: C(a) = -K for each completion of the repair activity a.
Rationale
There are many good reasons for using composed models.
– Building highly reliable systems usually involves redundancy. The replicate operation models redundancy in a natural way.
– Systems are usually built in a modular way. Replicates and Joins are usually good for connecting together similar and different modules.
– Tools can take advantage of something called the Strong Lumping Theorem that allows a tool to generate a Markov process with a smaller state space (to be described in Session 7).
Composed Model
How does adding an additional computer affect reliability?
– In the composed model, change the number of replications to 3 and change various reward variables - easy. (Use a global variable if you suspect you may want to do this.)
– In “flat” model, add another computer - hard
In composed model, the number of states in the underlying Markov chain is much smaller, especially for large numbers of replications. (Details will be given in Session 7.)
• Simulation relies on good pseudo-random number generation, sufficient observations, and good statistical techniques to produce an approximate solution
• Increasing accuracy by a factor of n requires on the order of n² more work, which can be prohibitively expensive.
For example, a 5-Nines system reliability model will require approximately 100,000 observations to observe one failure. One digit of accuracy can easily require over 1,000,000 observations!
(For many models, 1,000,000 observations can be generated quickly, but as system failure becomes even rarer, standard simulation quickly becomes infeasible.)
If you can model using exponential delays and your model is sufficiently small, continuous time Markov chains (CTMCs) offer some advantages. These include:
– Typically faster solution time for systems with rare events– Typically takes less time to get more accurate answers– Typically more confidence in the solution
In order to understand when we get these advantages, we must better understand the methods of obtaining solutions to CTMCs.
Random Variable Review
It is often convenient to assign a (real) number to every element in Ω. This assignment, or rule, or function, is called a random variable.
Random Process Review
Random processes are useful for characterizing the behavior of real systems.
A random process is a collection of random variables indexed by time.
Example: X(t) is a random process. Let X(1) be the result of tossing a die. Let X(2) be the result of tossing a die plus X(1), and so on. Notice that T = {1, 2, 3, . . .}.
Describing a Random Process
Recall that for a random variable X, we can use the cumulative distribution FX to describe the random variable.
In general, no such simple description exists for a random process.
However, a random process can often be described succinctly in various different ways. For example, if Y is a random variable representing the roll of a die, and X(t) is the sum after t rolls, then we can describe X(t) by
X(t) - X(t - 1) = Y,
P[X(t) = i|X(t - 1) = j] = P[Y = i - j],
or X(t) = Y1 + Y2 + . . . + Yt, where the Yi’s are independent.
Classifying Random Processes: Characteristics of T
If the number of time points defined for a random process, i.e., |T|, is finite or countable (e.g., integers), then the random process is said to be a discrete-time random process.
If |T| is uncountable (e.g., real numbers) then the random process is said to be a continuous-time random process.
Example: Let X(t) be the number of fault arrivals in a system up to time t. Since t ∈ T is a real number, X(t) is a continuous-time random process.
Classifying Random Processes: State Space Type
Let X be a random process. The state space of a random process is the set of all possible values that the process can take on, i.e.,
S = {y : X(t) = y, for some t ∈ T}.
If X is a random process that models a system, then the state space of X can represent the set of all possible configurations that the system could be in.
State Occupancy Probability Vector
Let π be a row vector. We denote πi to be the i-th element of the vector. If π is a state occupancy probability vector, then πi(k) is the probability that a DTMC has value i (or is in state i) at time-step k.
Assume that a DTMC X has a state-space size of n, i.e., S = {1, 2, . . . , n}. We say formally that πi(k) = P[X(k) = i], and
πj(k + 1) = Σ(i=1..n) πi(k) P[X(k + 1) = j | X(k) = i].
Notice that this resembles vector-matrix multiplication. In fact, if we arrange the matrix P = [Pij], that is, if P is the n × n matrix whose (i, j) entry is Pij = P[X(k + 1) = j | X(k) = i], then pij = Pij, and π(1) = π(0)P, where π(0) and π(1) are row vectors, and π(0)P is a vector-matrix multiplication.
The important consequence of this is that we can easily specify a DTMC in terms of an occupancy probability vector π and a transition probability matrix P.
Solution, cont.
Alternatively, we could compute P³ since we found
π(3) = π(0)P³.
Working out solutions by hand can be tedious and error-prone, especially for “larger” models (i.e., models with many states). Software packages are used extensively for this sort of analysis.
Software packages compute π(k) by (. . . ((π(0)P)P)P . . .)P rather than computing Pᵏ, since computing the latter results in a large “fill-in.”
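A minimal sketch of this computation, using a hypothetical 2-state weather DTMC (the transition probabilities below are invented for illustration, not taken from the slides):

```python
import numpy as np

# Hypothetical weather DTMC: state 0 = sunny, state 1 = rainy
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])
pi = np.array([1.0, 0.0])     # pi(0): start in the sunny state

# Compute pi(k) as (...((pi(0)P)P)...)P, avoiding explicit P**k
for k in range(1, 4):
    pi = pi @ P
    print(f"pi({k}) =", pi)
```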
Graphical Representation
It is frequently useful to represent the DTMC as a directed graph. Nodes represent states, and edges are labeled with probabilities. For example, our weather prediction model would look like this:
Limiting Behavior of DTMCs
It is sometimes useful to know the time-limiting behavior of a DTMC. This translates into the “long term,” where the system has settled into some steady-state behavior.
Formally, we are looking for π* = lim(n→∞) π(n).
To compute this, what we want is lim(n→∞) π(0)Pⁿ.
There are various ways to compute this. The simplest is to calculate π(n) for increasingly large n, and when π(n + 1) ≅ π(n), we can believe that π(n) is a good approximation to steady-state. This can be rather inefficient if n needs to be large.
Classifications
It is much easier to solve for the steady-state behavior of some DTMCs than others. To determine if a DTMC is “easy” to solve, we need to introduce some definitions.
Definition: A state j is said to be accessible from state i if there exists an n ≥ 0 such that (Pⁿ)ij > 0. We write i → j.
Note: recall that (Pⁿ)ij is the probability of moving from state i to state j in exactly n steps.
If one thinks of accessibility in terms of the graphical representation, a state j is accessible from state i if there exists a path of non-zero edges (arcs) from node i to node j.
Steady-State Solution of DTMCs
The steady-state behavior can be computed by solving the linear equation
π = πP, with the constraint that Σi πi = 1. For irreducible DTMCs, it can be shown that this solution is unique. If the DTMC is periodic, then this solution yields π*.
One can understand the equation π = πP in two different ways.
• In steady-state, the probability distribution π(n + 1) = π(n)P, and by definition π(n + 1) = π(n) in steady-state.
• “Flow” equations.
Flow equations require some visualization. Imagine a DTMC graph, where the nodes are assigned the occupancy probability, or the probability that the DTMC has the value of the node.
Let πiPij be the “probability mass” that moves from state i to state j in one time-step. Since probability must be conserved, the probability mass entering a state must equal the probability mass leaving a state.
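Written out, the balance (flow) equation for each state j says that, in steady state, the probability mass flowing into j equals the mass flowing out of j:

```latex
\pi_j \sum_{i} P_{ji} \;=\; \sum_{i} \pi_i P_{ij}
\quad\Longleftrightarrow\quad
\pi_j \;=\; \sum_{i} \pi_i P_{ij} \qquad (\text{since } \textstyle\sum_i P_{ji} = 1),
```

which is just the j-th component of π = πP.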
Continuous Time Markov Chains (CTMCs)
For most systems of interest, events may occur at any point in time. This leads us to consider continuous time Markov chains. A continuous time Markov chain (CTMC) has the following property: the future behavior depends only on the current state, not on the past history, i.e., P[X(t + τ) = j | X(t) = i and the history before t] = P[X(t + τ) = j | X(t) = i] = pij(τ).
A CTMC is completely described by the initial probability distribution π(0) and the transition probability matrix P(t) = [pij(t)]. Then we can compute π(t) = π(0)P(t).
The problem is that pij(t) is generally very difficult to compute.
CTMC Properties
This definition of a CTMC is not very useful until we understand some of the properties.
First, notice that pij(τ) is independent of how long the CTMC has previously been in state i; that is, the remaining time in state i has the same distribution no matter how long the chain has already been there.
There is only one random variable that has this property: the exponential random variable. This indicates that CTMCs have something to do with exponential random variables. First, we examine the exponential r.v. in some detail.
Event Rate
The fact that the exponential random variable has the memoryless property indicates that the “rate” at which events occur is constant, i.e., it does not change over time.
Often, the event associated with a random variable X is a failure, so the “event rate” is often called the failure rate or the hazard rate.
The event rate of X is defined as the probability that the event associated with X occurs within the small interval [t, t + ∆t], given that the event has not occurred by time t, per the interval size ∆t:
This can be thought of as looking at X at time t, observing that the event has not occurred, and measuring the number of events (probability of the event) that occur per unit of time at time t.
Minimum of Two Independent Exponentials
Another interesting property of exponential random variables is that the minimum of two independent exponential random variables is also an exponential random variable. Let A and B be independent exponential random variables with rates α and β, respectively. Let us define X = min{A, B}. What is FX(t)?
FX(t) = P[X ≤ t]
      = P[min{A, B} ≤ t]
      = P[A ≤ t OR B ≤ t]
      = 1 - P[A > t AND B > t]       (see combinatorial methods section)
      = 1 - P[A > t] P[B > t]
      = 1 - (1 - P[A ≤ t])(1 - P[B ≤ t])
      = 1 - (1 - FA(t))(1 - FB(t))
      = 1 - (1 - [1 - e^(-αt)])(1 - [1 - e^(-βt)])
      = 1 - e^(-αt) e^(-βt)
      = 1 - e^(-(α+β)t)
Competition of Two Independent Exponentials
If A and B are independent and exponential with rates α and β respectively, and A and B are competing, then we know that one will “win” with an exponentially distributed time (with rate α + β). But what is the probability that A wins?
Imagine a random process X with state space S = {1, 2, 3}. X(0) = 1. X goes to state 2 (takes on a value of 2) with an exponentially distributed time with parameter α. Independently, X goes to state 3 with an exponentially distributed time with parameter β. These state transitions are like competing random variables.
We say that from state 1, X goes to state 2 with rate α and to state 3 with rate β.
X remains in state 1 for an exponentially distributed time with rate α + β. This is called the holding time in state 1. Thus, the expected holding time in state 1 is 1/(α + β).
The probability that X goes to state 2 is α/(α + β). The probability X goes to state 3 is β/(α + β).
Competing Exponentials vs. a Single Exponential With Choice
Consider the following two scenarios:
1. Event A will occur after an exponentially distributed time with rate α. Event B will occur after an independent exponential time with rate β.
2. After waiting an exponential time with rate α + β, event A occurs with probability α/(α + β) and event B occurs with probability β/(α + β).
These two scenarios are indistinguishable. In fact, we frequently interchange the two scenarios rather freely when analyzing a system modeled as a CTMC.
State-Transition-Rate Matrix
A CTMC can be completely described by an initial distribution π(0) and a state-transition-rate matrix. A state-transition-rate matrix Q = [qij] is defined as follows:
qij = the rate of going from state i to state j, for i ≠ j, and qii = −Σ(j≠i) qij (so each row sums to zero).
Example: A computer is idle, working, or failed. When the computer is idle, jobs arrive with rate α, and they are completed with rate β. When the computer is working, it fails with rate λw, and with rate λi when it is idle.
Analysis of “Simple Computer” Model
Some questions that this model can be used to answer:
– What is the availability at time t?
– What is the steady-state availability?
– What is the expected time to failure?
– What is the expected number of jobs lost due to failure in [0,t]?
– What is the expected number of jobs served before failure?
– What is the throughput of the system (jobs per unit time), taking into account failures?
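As a sketch of the kind of analysis meant here, the Python code below builds a generator matrix for the three-state model (idle, working, failed), with made-up numeric rates standing in for α, β, λi, and λw, and computes the expected time to failure starting from the idle state. In this simple version the failed state is treated as absorbing, so the MTTF is obtained by solving a small linear system over the transient states; the numbers are purely illustrative.

```python
import numpy as np

# States: 0 = idle, 1 = working, 2 = failed (absorbing here)
alpha, beta = 2.0, 3.0          # job arrival and completion rates (assumed values)
lam_i, lam_w = 0.001, 0.01      # failure rates when idle / working (assumed values)

Q = np.array([
    [-(alpha + lam_i), alpha,            lam_i],
    [beta,             -(beta + lam_w),  lam_w],
    [0.0,              0.0,              0.0  ],   # failed state is absorbing
])

# Restrict Q to the transient states {idle, working}; the MTTF vector m solves Q_T m = -1
Q_T = Q[:2, :2]
mttf = np.linalg.solve(Q_T, -np.ones(2))
print("Expected time to failure from idle:   ", mttf[0])
print("Expected time to failure from working:", mttf[1])
```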
State-Space Generation from SANs
If the activity delays are exponential, it is straightforward to convert a SAN to a CTMC. We first look at the simple case, where there is no composed model.
• “Reduced Base Model” construction techniques make use of composed model structure to reduce the number of states generated.
• A state in the reduced base model is composed of a state tree and an impulse reward.
• During reduced base model construction, the use of state trees permits an algorithm to automatically determine valid lumpings based on symmetries in the composed model.
• The reduced base model is constructed by finding all possible (state tree, impulse reward) combinations and computing the transition rates between states.
• Generation of the detailed base model is avoided.
CTMC Transient Solution
We have seen that it is easy to specify a CTMC in terms of the initial probability distribution π(0) and the state-transition-rate matrix.
Earlier, we saw that the transient solution of a CTMC is given by π(t) = π(0)P(t), and we noted that P(t) was difficult to define.
Due to the complexity of the math, we omit the derivation and show the relationship
Solving this differential equation in some form is difficult but necessary to compute a transient solution.
d/dt P(t) = P(t)Q = QP(t), where Q is the state-transition-rate matrix of the Markov chain.
Transient Solution Techniques
Solutions to d/dt P(t) = P(t)Q can be obtained in many (dubious) ways*:
– Direct: If the CTMC has N states, one can write N² PDEs with N² initial conditions and solve N² linear equations.
– Laplace transforms: Unstable with multiple “poles.”
– Nth-order differential equations: Uses determinants and hence is numerically unstable.
– Matrix exponentiation: P(t) = e^(Qt), where
Matrix exponentiation has some potential. Directly computing e^(Qt) by summing the defining series can be expensive and prone to instability.
If the CTMC is irreducible, it is possible to take advantage of the fact that Q = ADA⁻¹, where D is a diagonal matrix. Computing e^(Qt) becomes A e^(Dt) A⁻¹, where
e^(Qt) = I + Σ(n=1..∞) (Qt)ⁿ / n!
and
e^(Dt) = diag(e^(d₁₁t), e^(d₂₂t), . . . , e^(dₙₙt)).
* See C. Moler and C. Van Loan, “Nineteen Dubious Ways to Compute the Exponential of a Matrix,” SIAM Review, vol. 20, no. 4, pp. 801-836, October 1978.
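A small sketch of the matrix-exponential approach using SciPy's expm, for the same illustrative three-state generator used earlier (the rates are made-up values, not from the slides). For large or stiff generators, uniformization or the differential-equation view is usually preferred, as the cited paper discusses.

```python
import numpy as np
from scipy.linalg import expm

alpha, beta, lam_i, lam_w = 2.0, 3.0, 0.001, 0.01   # assumed rates
Q = np.array([
    [-(alpha + lam_i), alpha,            lam_i],
    [beta,             -(beta + lam_w),  lam_w],
    [0.0,              0.0,              0.0  ],
])
pi0 = np.array([1.0, 0.0, 0.0])       # start in the idle state

t = 100.0                              # time of interest (hours, illustrative)
pi_t = pi0 @ expm(Q * t)               # pi(t) = pi(0) e^{Qt}
print("pi(t) =", pi_t)
print("Availability at t (not failed):", pi_t[0] + pi_t[1])
```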
This yields the elegant equation π*Q = 0, where π* is the steady-state probability distribution. If the CTMC is irreducible, then π* can be computed with the constraint that Σi π*i = 1.
Definition: A CTMC is irreducible if every state in the CTMC is reachable from every other state.
If the CTMC is not irreducible, then more complex solution methods are required.
Notice that for irreducible CTMCs, the steady-state distribution is independent of the initial-state distribution.
Direct Steady-State Solution
One steady-state solver in Möbius is the direct steady-state solver. This solver solves the augmented matrix using a form of Gaussian elimination.
The augmented system solved is π* Q̃ = eᵢᵀ, where the augmented matrix Q̃ incorporates the normalization constraint Σi π*i = 1.
Pros:
– Can get a very accurate solution in a fixed amount of time; “stiffness” (described later) does not affect solution time.
Cons:
– Solution complexity is O(n³), so it does not scale well to large models; memory requirements are high due to fill-in and are not known a priori.
Recommendation: Use for small CTMCs (tens of states) or medium-sized and stiff CTMCs (hundreds to a few thousand states), or when high accuracy is required.
Reminder: High accuracy in solution does not mean high accuracy in prediction. Use accuracy to do relative comparisons.
Iterative Solution Methods
The simplest iterative solution methods are called stationary iterative methods, and they can be expressed as
π(k + 1) = π(k)M,
where M is a constant (stationary) matrix. Computing π(k + 1) from π(k) requires one vector-matrix multiplication, or one iteration, which on modern workstations is extremely fast.
The simplest stationary iterative method for CTMCs is called the power method. Recall π*Q = 0. Let M = Q + I.
Stationary iterative solution methods have the following characteristics:
– Low memory usage (no fill-in); predictable memory usage
– Low time per iteration, proportional to the number of non-zero entries
– Fast solution time for non-stiff matrices (tens or hundreds of iterations)
– Stop when sufficiently accurate
– Slow solution time for stiff matrices
– Difficult to quantify accuracy, especially for stiff matrices
– Easy to implement
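A sketch of a stationary iteration for π*Q = 0. Note that the plain choice M = Q + I converges only if the rates happen to be scaled appropriately; the safer variant shown below uses M = I + Q/λ with λ ≥ maxᵢ|qᵢᵢ| (essentially uniformization), which is an assumption added here rather than part of the slides, as is the toy generator.

```python
import numpy as np

Q = np.array([                      # illustrative generator (toy three-state model)
    [-2.001, 2.0,    0.001],
    [3.0,   -3.01,   0.01 ],
    [0.1,    0.0,   -0.1  ],        # a repair transition makes the chain irreducible
])

lam = np.abs(np.diag(Q)).max()      # uniformization rate
M = np.eye(Q.shape[0]) + Q / lam    # stochastic matrix with the same stationary vector

pi = np.full(Q.shape[0], 1.0 / Q.shape[0])
for _ in range(10000):              # iterate pi <- pi M until it stops changing
    new = pi @ M
    if np.abs(new - pi).max() < 1e-12:
        pi = new
        break
    pi = new

print("steady-state pi:", pi, " check pi Q ~ 0:", pi @ Q)
```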
(a) If only rate rewards are used, the time-averaged interval-of-time steady-state measure is identical to the instant-of-time steady-state measure (if both exist).
(b) Provided the instant-of-time steady-state distribution is well-defined. Otherwise, the time-averaged interval-of-time steady-state variable is computed and only results for rate rewards should be derived.
Problem Origin
• This problem was originally posed in 1992 as a reliability model of a large, embedded fault-tolerant computer, presumably for space-borne applications. It was posed as a hierarchical model with non-perfect coverage at each level, with the purpose of showing the inadequacy of existing techniques.
– Combinatorial methods were incapable of including coverage at all levels of the hierarchy, thus grossly overstating the reliability.
– Markov- or SPN-based methods create far too many states to solve.
– Monte-Carlo simulation works, but provides only an estimate (which is often not good enough).
– A specialized tool was developed to do numerical integration of a semi-Markov process to solve this and similar problems.
• In Möbius, we solve a smaller version of the same architecture “exactly” using Markov models generated by SANs. This is made possible by automatic state lumping using composed models.
• System consists of 2 computers
• Each computer consists of:
– 3 memory modules (2 must be operational)
– 3 CPU units (2 must be operational)
– 2 I/O ports (1 must be operational)
– 2 error-handling chips (non-redundant)
• Each memory module consists of:
– 41 RAM chips (39 must be operational)
– 2 interface chips (non-redundant)
• A CPU consists of 6 non-redundant chips
• An I/O port consists of 6 non-redundant chips
• 10 to 20 year operational life
• The system is operational if at least one computer is operational
• A computer is operational if all the modules are operational:
– A memory module is operational if at least 39 RAM chips and both interface chips are operational
– A CPU unit is operational if all 6 CPU chips are operational
– An I/O port is operational if all 6 I/O chips are operational
– The error-handling unit is operational if both error-handling chips are operational
• Failure rate per chip is 100 failures per 1 billion hours
Coverage
• This system could be modeled using combinatorial methods if we did not take coverage into account. Coverage is the probability that, when a chip fails, the failure does not cause the larger subsystem to fail, provided sufficient redundancy exists; i.e., coverage is the probability that the fault is contained.
The coverage probabilities are given in the following table:
• For example, if a RAM chip fails, there is a 0.2% chance the memory module will fail even if sufficient redundancy exists. If the memory module fails, there is a 5% chance the computer will fail. If a computer fails, there is a 5% chance the system will fail.
Redundant Component    Fault Coverage Probability
RAM Chip               0.998
Memory Module          0.95
CPU Unit               0.995
I/O Port               0.99
Computer               0.95
• Seven places represent the state of the system:
1. cpus – the number of operational CPU modules
2. ioports – the number of operational I/O modules
3. errorhandlers – whether the two error-handler chips are operational
4. computer_failed – the number of failed computers
5. memory_failed – the number of failed memory modules
6. memory_chips – number of operational RAM chips
7. interface_chips – number of operational interface chips
• Five activities represent failures in the system:
1. cpu_failure – the failure of any CPU chip
2. ioport_failure – the failure of any I/O chip
3. errorhandling_chip_failure – the failure of either error-handler chip
4. memory_chip_failure – the failure of a RAM chip
5. interface_chip_failure – the failure of a memory interface chip
Cases on these activities represent behavior based on coverage or non-coverage.
For the modeled two-computer system with non-perfect coverage at all levels (i.e., the model as described), the state space contains 10,114 states. The 10-year mission reliability was computed to be 0.995579.
Configuration                                                                    States    Reliability (10-year mission time)
100% coverage at all levels                                                      4278      0.999539
Nonperfect coverage considered at all levels                                     10114     0.995579
Nonperfect coverage considered at all levels, no spare memory module             1335      0.987646
Nonperfect coverage considered at all levels, no spare CPU module                3299      0.973325
Nonperfect coverage considered at all levels, no spare IO port                   3299      0.985419
Nonperfect coverage considered at all levels, no spare memory module,
CPU module, or IO port                                                           511       0.935152
100% coverage at all levels, no spare memory module, CPU module, IO port,
or RAM chips
• High-level formalisms (like SANs) make it easy to specify realistic systems, but they also make it easy to specify systems that have unreasonably large state spaces.
• State-of-the-art tools (like Möbius) can handle state-level models with a few tens of millions of states, but not more.
• When state spaces become too large, discrete event simulation is often a viable alternative.
• Discrete-event simulation can be used to solve models with arbitrarily large state spaces, as long as the desired measure is not based on a “rare event.”
• When “rare events” are present, variance reduction techniques can sometimes be used.
• Simulation can be applied to any SAN model. The most prominent difference, compared with analytic solvers, is that generally distributed activities can be used.
• Simulation does not require the generation of a state space and therefore does not require a finite state space. Therefore, much more detailed models can be solved.
Disadvantages of Simulation
• Simulation only provides an estimate of the desired measure. An approximate confidence interval is constructed that contains the actual result with some user-specified probability.
• Higher desired accuracy dramatically increases the necessary simulation time. As a rule, to make the confidence interval n times narrower, the simulation has to be run n² times as long.
• The “rare event problem” may arise. If simulation is used to estimate a small probability, such as the reliability of a highly-reliable system, extremely long simulations may have to be performed to encounter the particular event often enough.
• Complicated models can require long simulation times, even if the rare event problem is not an issue. The simulators in Möbius perform the necessary event scheduling very efficiently, but it should be realized that simulation is not a panacea.
Simulation as Model Experimentation
• State-based methods (such as Markov chains) work by enumerating all possible states a system can be in, and then invoking a numerical solution method on the generated state space.
• Simulation, on the other hand, generates one or more trajectories (possible behaviors from the high-level model), and collects statistics from these trajectories to estimate the desired performance/dependability measures.
• Just how this trajectory is generated depends on the:
– nature of the notion of state (continuous or discrete)
– type of stochastic process (e.g., ergodic, reducible)
– nature of the measure desired (transient or steady-state)
– types of delay distributions considered (exponential or general)
• We will consider each of these issues in this module, as well as the simulation of systems with rare events.
Types of Simulation
Continuous-state simulation is applicable to systems where the notion of state is continuous and typically involves solving (numerically) systems of differential equations. Circuit-level simulators are an example of continuous-state simulation.
Discrete-event simulation is applicable to systems in which the state of the system changes at discrete instants of time, with a finite number of changes occurring in any finite interval of time.
Since we will focus on validating end-to-end systems, rather than circuits, we will focus on discrete-event simulation.
There are two types of discrete-event simulation execution algorithms:
– Fixed-time-stamp advance
– Variable-time-stamp advance
Fixed-Time-Stamp Advance Simulation
• Simulation clock is incremented a fixed time ∆t at each step of the simulation.
• After each time increment, each event type (e.g., activity in a SAN) is checked to see if it should have completed during the time of the last increment.
• All event types that should have completed are completed, and a new state of the model is generated.
• Rules must be given to determine the ordering of events that occur in each interval of time.
• Example:
• Good for all models where most events happen at fixed increments of time (e.g., gate-level simulations).
• Has the advantage that no “future event list” needs to be maintained.
• Can be inefficient if events occur in a bursty manner, relative to the time-step used.
• Simulation clock is advanced a variable amount of time at each step of the simulation, to the time of the next event.
• If all event times are exponentially distributed, the next event to complete and the time of the next event can be determined using the equation for the minimum of n exponentials (since memoryless), and no “future event list” is needed.
• If event times are general (have memory), then a “future event list” is needed.
• Has the advantage (over fixed-time-stamp increment) that periods of inactivity are skipped over, and models with a bursty occurrence of events are not inefficient.
Basic Variable-Time-Step Advance Simulation Loop for SANs
A) Set list_of_active_activities to null.
B) Set current_marking to initial_marking.
C) Generate potential_completion_time for each activity that may complete in the current_marking and add it to list_of_active_activities.
D) While list_of_active_activities ≠ null:
1) Set current_activity to the activity with the earliest potential_completion_time.
2) Remove current_activity from list_of_active_activities.
3) Compute new_marking by selecting a case of current_activity and executing the appropriate input and output gates.
4) Remove all activities from list_of_active_activities that are not enabled in new_marking.
5) Remove all activities from list_of_active_activities for which new_marking is a reactivation marking.
6) Select a potential_completion_time for all activities that are enabled in new_marking but not on list_of_active_activities, and add them to list_of_active_activities.
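A compressed Python sketch of this loop for a tiny model (a single queue with exponential arrivals and services, standing in for a SAN with two activities). All names and rates are invented for illustration, and cases and reactivation markings are omitted to keep it short.

```python
import random

rates = {"arrive": 1.0, "serve": 1.5}        # activity rates (assumed values)
marking = {"queue": 0}                       # current_marking
clock = 0.0
active = {}                                  # activity -> potential_completion_time

def enabled(act, m):
    return act == "arrive" or (act == "serve" and m["queue"] > 0)

def fire(act, m):
    m["queue"] += 1 if act == "arrive" else -1

# Step C: schedule every activity enabled in the initial marking
for act in rates:
    if enabled(act, marking):
        active[act] = clock + random.expovariate(rates[act])

# Step D: main loop (run for a fixed simulated time, an arbitrary stopping rule)
while active and clock < 1000.0:
    act = min(active, key=active.get)        # D1: earliest potential completion
    clock = active.pop(act)                  # D2: remove it and advance the clock
    fire(act, marking)                       # D3: compute new_marking
    # D4: drop activities no longer enabled in the new marking
    for a in [a for a in active if not enabled(a, marking)]:
        del active[a]
    # D6: schedule newly enabled activities not already on the list
    for a in rates:
        if enabled(a, marking) and a not in active:
            active[a] = clock + random.expovariate(rates[a])

print("final time:", clock, "queue length:", marking["queue"])
```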
Types of Discrete-Event Simulation
• Basic simulation loop specifies how the trajectory is generated, but does not specify how measures are collected, or how long the loop is executed.
• How measures are collected, and how long (and how many times) the loop is executed depends on type of measures to be estimated.
• Two types of discrete-event simulation exist, depending on what type of measures are to be estimated.
– Terminating - Measures to be estimated are measured at fixed instants of time or intervals of time with fixed finite point and length. This may also include random but finite (in some sense) times, such as a time to failure.
– Steady-state - Measures to be estimated depend on instants of time or intervals whose starting points are taken to be t → ∞.
Generation of Potential Completion Times
1) Generation of uniform [0,1] random variates
– Used as a basis for all random variate samples
– Types:
• Linear congruential generators
• Tausworthe generators
• Other types of generators
– Tests of uniform [0,1] generators
2) Generation of non-uniform random variates
– Inverse transform technique
– Convolution technique
– Composition technique
– Acceptance-rejection technique
– Technique for discrete random variates
Let X be an exponentially distributed random variable with parameter λ. Let U be a uniform [0,1] random variable generated by a pseudo-random number generator.
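The slide leads up to the inverse transform step, which (filling in the standard result for the exponential case) is:

```latex
F_X(x) = 1 - e^{-\lambda x}
\;\Rightarrow\;
X = F_X^{-1}(U) = -\frac{\ln(1 - U)}{\lambda},
\quad\text{or equivalently } X = -\frac{\ln U}{\lambda},
\text{ since } 1 - U \text{ is also uniform on } [0,1].
```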
Convolution Technique
• Technique can be used for all random variables X that can be expressed as the sum of n random variables:
X = Y1 + Y2 + Y3 + . . . + Yn
• In this case, one can generate a random variate X by generating n random variates, one from each of the Yi, and summing them.
• Examples of random variables:
– Sum of n Bernoulli random variables is a binomial random variable.
– Sum of n exponential random variables is an n-Erlang random variable.
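A minimal sketch of the convolution technique for an n-Erlang variate (summing n exponential samples; the parameter values are arbitrary choices for illustration):

```python
import random

def erlang_variate(n, rate):
    """n-Erlang random variate as the sum of n independent Exp(rate) samples."""
    return sum(random.expovariate(rate) for _ in range(n))

samples = [erlang_variate(3, 2.0) for _ in range(100_000)]
print("sample mean:", sum(samples) / len(samples), " theory:", 3 / 2.0)
```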
Composition Technique
• Technique can be used when the distribution of a desired random variable can be expressed as a weighted sum of other distributions.
• In this case F(x) can be expressed as F(x) = Σi pi Fi(x), where pi ≥ 0 and Σi pi = 1.
• The composition technique is as follows:
1) Generate a random variate i such that P[I = i] = pi for i = 0, 1, . . . (This can be done as discussed for discrete random variables.)
2) Return x as a random variate from distribution Fi(x), where i is as chosen above.
• A variant of composition can also be used if the density function of the desired random variable can be expressed as weighted sum of other density functions.
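A sketch of the composition technique for a two-phase hyperexponential distribution, F(x) = p1 F1(x) + p2 F2(x) with exponential phases. The weights and rates are arbitrary illustrative values.

```python
import random

weights = [0.3, 0.7]          # p_i, must sum to 1 (assumed values)
rates   = [0.5, 5.0]          # rate of the exponential F_i chosen in each case

def hyperexp_variate():
    # 1) choose component i with probability p_i
    i = random.choices(range(len(weights)), weights=weights)[0]
    # 2) return a variate from the chosen distribution F_i
    return random.expovariate(rates[i])

samples = [hyperexp_variate() for _ in range(100_000)]
mean_theory = sum(p / r for p, r in zip(weights, rates))
print("sample mean:", sum(samples) / len(samples), " theory:", mean_theory)
```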
• Useful for generating any discrete distribution, e.g., case probabilities in a SAN.
• More efficient algorithms exist for special cases; we will review the most general case.
• Suppose a random variable has probability distribution p(0), p(1), p(2), . . . on the non-negative integers. Then a random variate for this random variable can be generated using the inverse transform method:
1) Generate u with distribution uniform [0,1].
2) Return the smallest j satisfying Σ(i=0..j) p(i) ≥ u.
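A direct rendering of these two steps in Python (the example distribution is arbitrary; it could stand for case probabilities on a SAN activity):

```python
import random

def discrete_variate(p):
    """Inverse transform for a discrete distribution p(0), p(1), ... (sums to 1)."""
    u = random.random()              # 1) u ~ uniform [0,1]
    cumulative = 0.0
    for j, pj in enumerate(p):       # 2) return smallest j with sum_{i<=j} p(i) >= u
        cumulative += pj
        if u <= cumulative:
            return j
    return len(p) - 1                # guard against floating-point round-off

p = [0.2, 0.5, 0.3]
counts = [0, 0, 0]
for _ in range(100_000):
    counts[discrete_variate(p)] += 1
print([c / 100_000 for c in counts])
```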
Recommendations/Issues in Random Variate Generation
• Use standard/well-tested uniform [0,1] generators. Don’t assume that because a method is complicated, it produces good random variates.
• Make sure the uniform [0,1] generator that is used has a long enough period. Modern simulators can consume random variates very quickly (multiple per state change!).
• Use separate random number streams for different activities in a model system. Regular division of a single stream can cause unwanted correlation.
• Consider multiple random variate generation techniques when generating non-uniform random variates. Different techniques have very different efficiencies.
Typical Estimators of a Simulation Measure
• Can be:
– Instant-of-time, at a fixed t, or in steady-state
– Interval-of-time, for a fixed interval, or in steady-state
– Time-averaged interval-of-time, for a fixed interval, or in steady-state
• Estimators on these measures include:
– Mean
– Variance
– Interval - Probability that the measure lies in some interval [x,y]
• Don't confuse with an interval-of-time measure.
• Can be used to estimate density and distribution function.
– Percentile - the 100βth percentile is the smallest value of the estimator x such that F(x) ≥ β.
Different Types of Processes and Measures Require Different Statistical Techniques
• Transient measures (terminating simulation):
– Multiple trajectories are generated by running the basic simulation loop multiple times using different random number streams. Called independent replications.
– Each trajectory is used to generate one observation of each measure.
• Steady-state measures (steady-state simulation):
– Initial transient must be discarded before observations are collected.
– If the system is ergodic (irreducible, recurrent non-null, aperiodic), a single long trajectory can be used to generate multiple observations of each measure.
– For all other systems, multiple trajectories are needed.
– Generate multiple independent observations of each measure, one observation of each measure per trajectory of the simulation.
– Observations of each measure will be independent of one another if different random number streams are used for each trajectory.
– From a practical point of view, new stream is obtained by continuing to draw numbers from old stream (without resetting stream seed).
Notation (for subsequent slides):
– Let F(x) = P[X ≤ x] be the measure to be estimated.
– Define µ = E[X], σ² = E[(X - µ)²].
– Define xi as the ith observation value of X (ith replication, for terminating simulation).
Issue: How many trajectories are necessary to obtain a good estimate?
Terminating Simulation: Estimating the Variance of a Measure I
• Computation of estimator and confidence interval for variance could be done like that done for mean, but result is sensitive to deviations from the normal assumption.
• So, use a technique called jackknifing developed by Miller (1974).
• Informally speaking, steady-state simulation is used to estimate measures that depend on the “long run” behavior of a system.
• Note that the notion of “steady-state” is with respect to a measure (which has some initial transient behavior), not a model.
• Different measures in a model will converge to steady state at different rates.
• Simulation trajectory can be thought of as having two phases: the transient phase and the steady-state phase (with respect to a measure).
• Multiple approaches to collect observations and generate confidence intervals:
• Which method to use depends on characteristics of the system being simulated.
• Before discussing these methods, we need to discuss how the initial transient is handled.
Problem: Observations of measures are different during so-called “transient phase,” and should be discarded when computing an estimator for steady-state behavior.
Need: A method to estimate transient phase, to determine when we should begin to collect observations.
Approaches:
– Let the user decide: not sophisticated, but a practical solution.
– Look at long-term trends: take a moving average and measure differences.
– Use more sophisticated statistical measures, e.g., standardized time series (Schruben 1982).
Recommendation:
– Let the user decide, since automated methods can fail.
Methods of Steady-State Measure Estimation: Replication/Deletion
• Statistics similar to those for terminating simulation, but observations collected only on steady-state portion of trajectory.
• One or more observations collected per trajectory:
• Compute
xi = (1/Mi) Σ(j=1..Mi) Oij
as the ith observation, where Oij is the jth observation collected on trajectory i and Mi is the number of observations in trajectory i.
• xi are considered to be independent, and confidence intervals are generated.
• Useful for a wide range of models/measures (the system need not be ergodic), but slower than other methods, since the transient phase must be repeated multiple times.
Methods of Steady-State Measure Estimation: Batch Means
• Similar to Replication/Deletion, but constructs observations from a single trajectory by breaking it into multiple batches.
• Example
• Observations from each batch are combined to construct a single observation; these observations are assumed to be independent and are used to construct the point estimator and confidence interval.
• Issues:
– How to choose batch size?
– Only applicable to ergodic systems (i.e., those for which a single trajectory has the same statistics as multiple trajectories).
– Initial transient only computed once.
• In summary, a good method, often used in practice.
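A sketch of the batch-means calculation on a stream of (already post-transient) observations. The data here is synthetic and the batch count is an arbitrary choice.

```python
import random
import statistics

# Synthetic post-transient observations (stand-in for values measured on one long trajectory)
observations = [random.expovariate(1.0) for _ in range(20_000)]

num_batches = 20
batch_size = len(observations) // num_batches
batch_means = [
    statistics.fmean(observations[b * batch_size:(b + 1) * batch_size])
    for b in range(num_batches)
]

point = statistics.fmean(batch_means)
s = statistics.stdev(batch_means)
half_width = 2.093 * s / num_batches ** 0.5    # t-quantile for 19 d.o.f., ~95% confidence
print(f"estimate {point:.4f} +/- {half_width:.4f} (95% CI from {num_batches} batches)")
```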
Other Steady-State Measure Estimation Methods II
• Spectral Method (Heidelberger and Welch 1981)
– Assumes the process is covariance stationary, but does not make further assumptions (as the previous method does).
– Efficient method, if certain parameters are chosen correctly, but the choice requires sophistication on the part of the user.
• Standardized Time Series (Schruben 1983)
– Assumes the process is strictly stationary and “phi-mixing.”
– Phi-mixing means that Oi and Oi+j become uncorrelated if j is large.
– As with the spectral method, has parameters whose values must be chosen correctly.
Summary: Measure Estimation and Confidence Interval Generation
1) Only use the mean as an estimator if it has meaning for the situation being studied. Often a percentile gives more information. This is a common mistake!
2) Use some confidence interval generation method! Even if the results rely on assumptions that may not always be completely valid, the methods give an indication of how long a simulation should be run.
3) Pick a confidence interval generation method that is suited to the system that you are studying. In particular, be aware of whether the system being studied is ergodic.
4) If batch means is used, be sure that batch size is large enough that batches are practically uncorrelated. Otherwise the simulation can terminate prematurely with an incorrect result.
1) Know how random variates are generated in the simulator you use. Make sure:
– A good uniform [0,1] generator is used
– Independent streams are used when appropriate
– Non-uniform random variates are generated in a proper way.
2) Compute and use confidence intervals to estimate the accuracy of your measures.
– Choose the correct confidence interval computation method based on the nature of the system and measures being studied.
1. Generate MDD representation of unlumped SS
2. Build MD representation of CTMC
3. Convert unlumped SS to lumped SS
4. Solve CTMC by iterating through MD data structure
• Local SS expansion in levels corresponding to atomic models. No assumption of knowing the local state space in advance ⇒
– Online computation of transitive closure based on Ibaraki and Katoh's algorithm
• Avoids costly computation of transitive closure from scratch
Validating Computer System Security: Research Goal
CONTEXT: Create robust software and hardware that are fault-tolerant, attack resilient, and easily adaptable to changes in functionality and performance over time.
GOAL: Create an underlying scientific foundation, methodologies, and tools that will:
– Enable clear and concise specifications,
– Quantify the effectiveness of novel solutions,
– Test and evaluate systems in an objective manner, and
• Most traditional approaches to security validation have focused on avoiding intrusions (non-circumventability), or have not been quantitative, instead focusing on and specifying procedures that should be followed during the design of a system (e.g., the Security Evaluation Criteria [DOD85, ISO99]).
• When quantitative methods have been used, they have typically either been based on formal methods (e.g., [Lan81]), aiming to prove that certain security properties hold given a specified set of assumptions, or been quite informal, using a team of experts (often called a “red team,” e.g. [Low01]) to try to compromise a system.
• Both of these approaches have been valuable in identifying system vulnerabilities, but probabilistic techniques are also needed.
• Provide convincing evidence that the design, when implemented, will provide satisfactory mission support under real use scenarios and in the face of cyber-attacks.
• More specifically, determine whether the design, when implemented, will meet the project goals:
• This assurance case is supported by:– Rigorous logical arguments – Experimental evaluation– A detailed executable model of the design
[Figure: Functional Model of the Relevant Subset of the System — models for the Client, Access Proxy, and PSQ Server; components AA1, AA2, AA3, AP1, AP2; M1 (Network Domains), M2–M6; L1 (ADF), L2, L3.]
3. Detailed descriptions of model component behaviors representing 2a and 2b, along with statements of underlying assumptions made for each component. [Probabilistic modeling or logical argumentation, depending on requirement]
4. Construct executable functional model [Probabilistic modeling, if model constructed in 3 is probabilistic]
5. a) Verification of the modeling assumptions of Step 3 [Logical argumentation] and, b) where possible, justification of model parameter values chosen in Step 4. [Experimentation]
7. Comparison of results obtained in Step 6, noting in particular the configurations and parameter values for which the requirements of Step 1 are satisfied.
Steps
Note that if the requirement being addressed is not quantitative, steps 4 and 6 are skipped.
Example
High-level description…
Steps 4-5: The Access Proxy verifies that the client is in a valid session by sending the session key accompanying the IO to the Downstream Controller for verification.
Step 6: The Access Proxy forwards the IO to the PSQ Server in its quadrant.
....
Step 3: Detailed descriptions of model component behaviors and Assumptions (Access Proxy)
4.4 Access Proxy
4.4.1 Model Description
AM1: If a process domain in the DJM proxy is not corrupted, it forwards the traffic it is designated to handle from the quadrant isolation switch to core quadrant elements and vice versa. All traffic being forwarded is well-formed (if the proxy is correct). The following kinds of traffic are handled:
1. IOs (together with tokens) sent from publishing clients to the core (we do not distinguish between IOs sent via different protocols such as RMI or SOAP/HTTP).
….
AM2: Attacks on access proxy: attacks on an access proxy are enabled if either/both
1. The quadrant isolation switch is ON, and one or more clients are corrupted, leading to:
a) Direct attacks: can cause the corruption of the process domain corresponding to the domain of the attacking process on the compromised client.
….
AM3: If an attack occurs on the access proxy, it can have the following effects:
1. Direct attacks leading to process corruption:
a) Enable corruption of other process domains on the host.
…..
4.4.2 Facts and Simplifications
AF1: Each access proxy runs on a dedicated host machine.
AF2: DoS attacks result in increased delays.
….
4.4.3 Assumptions
AA1: Only well-formed traffic is forwarded by a correct access proxy.
AA2: The access proxy cannot access cryptographic keys used to sign messages that pass through it.
AA3: The access proxy cannot access the contents of an IO if application-level end-to-end encryption is being used.
AA4: Attacks on an access proxy can only be launched from compromised clients, or from corrupted core elements that interact with the access proxy during the normal course of a mission.
….
Steps 6 and 7: Measures and Results
• Assumptions: CPUB is the conjunction of
– C1PUB = the publishing client is successfully registered with the core
– C2PUB = the publishing client's mission application interacts with the client as intended
• Definition of a successful publish: EPUB is the conjunction of
– E1PUB = the data flow for the IO is correct
– E2PUB = the time required for the publish operation is less than tmax
– E3PUB = the content of the IO received by the subscriber has the same essential content as that assembled by the publisher
• Measure: P[EPUB|CPUB]
– Fraction of successful publishes in a 12-hour period
– Between clients that cannot be compromised
• Objective
– P[EPUB|CPUB] ≥ pPUB for a 12-hour mission
• Issues in Model-Based Validation of High-Availability Computer Systems/Networks
• Stochastic Activity Network Concepts
• Analytic/Numerical State-Based Modeling
• Case Study: Embedded Fault-Tolerant Multiprocessor System
• Solution by Simulation
• The Art of System Dependability / Conclusions
The “Art” of Performance and Dependability Validation
• Performance and dependability validation is an art because:
– There is no recipe for producing a good analysis,
– The key is knowing how to abstract away unimportant details, while retaining important components and relationships,
– This intuition only comes from experience,
– Experience comes from making mistakes.
Doing it Right: Model Construction
• Understand the desired measure before you build the model.
• The desired measure determines the type of model and the level of detail required. No model is universal!
• Steps in constructing a model:
1. Choose the desired measures:
• The choice of measures forms a basis for comparison.
• It's easy to choose the wrong measure and see patterns where none exist.
• Measures should be refined during the design and validation process.
2. Choose the appropriate level of detail/abstraction for model components.
• The key is to represent the model at the right level of detail for the chosen measures.
• It is almost never possible or practical to include all system aspects.
• Model the system at the highest level possible to obtain a good estimate of the desired measures.
3. Build the model.
• Decide how to break up the model into modules, and how the modules will interact with one another.
• Test the model as you build it, to ensure it executes as intended.
Doing it Right: Model Solution
• Use the appropriate model solution technique:
– Just because you have a hammer doesn't mean the world is a nail.
– There is no universal model solution technique (not even simulation!)
– The appropriate model solution technique depends on model characteristics.
• Use representative input values:
– The results of a model solution are only as good as the inputs.
– The inputs will never be perfect.
– Understand how uncertainty in inputs affects measures.
– Do sensitivity analysis.
• Include important points in the design/parameter space:
– Parameterize choices when design or input values are not fixed.
– A complete parametric study is usually not possible.
– Some parameters will have to be fixed at "nominal" values.
– Make sure you vary the important ones.
Doing it Right: Model Interpretation/Documentation
• Make all your assumptions explicit:
– Results from models are only as good as the assumptions that were made in obtaining them.
– It's easy to forget assumptions if they are not recorded explicitly.
• Understand the meaning of the obtained measures:
– Numbers are not insights.
– Understand the accuracy of the obtained measures, e.g., confidence intervals for simulation.
• Keep social aspects in mind:
– Performance and dependability analysts almost always bring bad news.
– Bearers of bad news are rarely welcomed.
– In presentations, concentrate on results, not the process.