Transcript
Slide 1
Validating Computer System and Network Trustworthiness
Prof. William H. Sanders
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
• Dependability is the ability of a system to deliver a specified service.
• System service is classified as proper if it is delivered as specified; otherwise it is improper.
• System failure is a transition from proper to improper service.
• System restoration is a transition from improper to proper service.
⇒ The “properness” of service depends on the user’s viewpoint!
Basic Validation Terms
• Measures -- What you want to know about a system. Used to determine if a realization meets a specification.
• Models -- An abstraction of the system at an appropriate level of abstraction and/or detail to determine the desired measures about a realization.
• Dependability Model Solution Methods -- Methods by which one determines measures from a model. Models can be solved by a variety of techniques:
– Combinatorial Methods -- The structure of the model is used to obtain a simple arithmetic solution.
– Analytical/Numerical Methods -- A system of linear differential equations or linear equations is constructed, which is solved to obtain the desired measures.
– Simulation -- A realization of the system is executed, and estimates of the measures are calculated based on the resulting executions (also known as sample paths or trajectories).
⇒ Möbius supports performance/reliability/availability validation by analytical/numerical and simulation-based methods.
• Reliability - a measure of the continuous delivery of service
– R(t) is the probability that a system delivers proper service throughout [0,t].
• Safety - a measure of the time to catastrophic failure
– S(t) is the probability that no catastrophic failures occur during [0,t].
– Analogous to reliability, but concerned with catastrophic failures.
• Time to Failure - measure of the time to failure from last restoration. (Expected value of this measure is referred to as MTTF - Mean time to failure.)
• Maintainability - measure of the time to restoration from last experienced failure. (Expected value of this measure is referred to as MTTR - Mean time to repair.)
• Coverage - the probability that, given a fault, the system can tolerate the fault and continue to deliver proper service.
• The fact that the exponential random variable has the memoryless property indicates that the “rate” at which events occur is constant, i.e., it does not change over time.
• Often, the event associated with a random variable X is a failure, so the “event rate” is often called the failure rate or the hazard rate.
• The event rate of X is defined as the probability that the event associated with X occurs within the small interval [t, t + ∆t], given that the event has not occurred by time t, per the interval size ∆t:
• This can be thought of as looking at X at time t, observing that the event has not occurred, and measuring the number of events (probability of the event) that occur per unit of time at time t.
Important Fact 2: The exponential random variable has a constant failure rate!
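To make Important Fact 2 concrete, here is the standard hazard-rate calculation for an exponential random variable with parameter λ (a short derivation added here for reference; it follows directly from the definitions above).

```latex
h(t) = \lim_{\Delta t \to 0} \frac{P[t < X \le t + \Delta t \mid X > t]}{\Delta t}
     = \lim_{\Delta t \to 0} \frac{F_X(t+\Delta t) - F_X(t)}{\Delta t \,(1 - F_X(t))}
     = \frac{f_X(t)}{1 - F_X(t)}
     = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}}
     = \lambda .
```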
Probability Review: Minimum of Two Independent Exponentials
Another interesting property of exponential random variables is that the minimum of two independent exponential random variables is also an exponential random variable. Let A and B be independent exponential random variables with rates α and β, respectively. Let us define X = min{A, B}. What is FX(t)?
Probability Review: Competition of Two Independent Exponentials
If A and B are independent and exponential with rates α and β respectively, and A and B are competing, then we know that one will “win” with an exponentially distributed time (with rate α + β). But what is the probability that A wins?
P[A < B] = ∫₀^∞ P[A < B | A = x] fA(x) dx
         = ∫₀^∞ P[x < B] α e^(−αx) dx
         = ∫₀^∞ P[B > x] α e^(−αx) dx
         = ∫₀^∞ (1 − P[B ≤ x]) α e^(−αx) dx
         = ∫₀^∞ (1 − [1 − e^(−βx)]) α e^(−αx) dx
         = ∫₀^∞ e^(−βx) α e^(−αx) dx
         = α ∫₀^∞ e^(−(α+β)x) dx
         = α / (α + β)
Important Fact 4: If A and B are independent, competing exponentials, with rates α and β respectively, the probability that A occurs before B is α/(α + β)!
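Facts 3 and 4 are easy to check numerically. The following short Python sketch (not part of the original slides; the rates and sample size are arbitrary choices) estimates both the mean of min{A, B} and the probability that A wins.

```python
import random

alpha, beta = 2.0, 3.0      # example rates (arbitrary values for illustration)
n = 100_000                 # number of samples

a_wins = 0
min_sum = 0.0
for _ in range(n):
    a = random.expovariate(alpha)   # sample A ~ Exp(alpha)
    b = random.expovariate(beta)    # sample B ~ Exp(beta)
    if a < b:
        a_wins += 1
    min_sum += min(a, b)

print("P[A wins]   estimate:", a_wins / n, " theory:", alpha / (alpha + beta))
print("E[min(A,B)] estimate:", min_sum / n, " theory:", 1.0 / (alpha + beta))
```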
• Combinatorial validation methods are the simplest kind of analytical/numerical techniques and can be used for reliability and availability modeling under certain assumptions.
• Assumptions are that component failures are independent, and for availability, repairs are independent.
• When these assumptions hold, simple formulas for reliability and availability exist.
• One key to building highly available systems is the use of reliable components and systems.
• Reliability: The reliability of a system at time t (R(t)) is the probability that the system operation is proper throughout the interval [0,t].
• Probability theory and combinatorics can be directly applied to reliability models.
• Let X be a random variable representing the time to failure of a component. The reliability of the component at time t is given by RX(t) = P[X > t] = 1 - P[X ≤ t] = 1 - FX(t).
• Similarly, we can define unreliability at time t by UX(t) = P[X ≤ t] = FX(t).
Failure Rate
What is the rate at which a component fails at time t? This is the probability that a component that has not yet failed fails in the interval (t, t + ∆t), as ∆t → 0.
Note that we are not looking at P[X ∈ (t, t + ∆t)] = fX(t). Rather, we are seeking P[X ∈ (t, t + ∆t) | X > t].
System Reliability
While FX can give the reliability of a component, how do you compute the reliability of a system?
System failure can occur when one, all, or some of the components fail. If one makes the independent failure assumption, system failure can be computed quite simply. The independent failure assumption states that all component failures of a system are independent, i.e., the failure of one component does not cause another component to be more or less likely to fail.
Given this assumption, one can determine:1) Minimum failure time of a set of components2) Maximum failure time of a set of components3) Probability that k of N components have failed at a particular time t.
Maximum of n Independent Failure Times
Let X1, . . . , Xn be independent component failure times. Suppose the system fails at time S if all the components fail.
Thus, S = max{X1, . . . , Xn}.
What is Fs(t)?
FS(t) = P[S ≤ t]
      = P[X1 ≤ t AND X2 ≤ t AND . . . AND Xn ≤ t]
      = P[X1 ≤ t] P[X2 ≤ t] . . . P[Xn ≤ t]      (by independence)
      = FX1(t) FX2(t) . . . FXn(t)                (by definition)
Let X1, . . . , Xn be independent component failure times. A system fails at time S if any of the components fail. Thus, S = min{X1, . . . , Xn}. What is FS(t)?
FS(t) = P[S ≤ t] = P[X1 ≤ t OR X2 ≤ t OR . . . OR Xn ≤ t]
This is an application of the law of total probability (LOTP).
Minimum of n Independent Component Failure Times
Trick: If A is an event, and Ā is the complement set such that A ∪ Ā = Ω and A ∩ Ā = ∅, then
P[A1 OR A2 OR . . . OR An] = 1 - P[Ā1 AND Ā2 AND . . . AND Ān].
Applying this with independence: FS(t) = 1 - (1 - FX1(t))(1 - FX2(t)) . . . (1 - FXn(t)).
k of N
Let X1, . . . , Xn be component failure times that have identical distributions (i.e., FX1 = FX2 = . . . = FXn). The system fails at time S if k of the N components fail.
FS(t) = P[at least k components failed by time t]
      = P[k failed OR k + 1 failed OR . . . OR N failed]
      = P[k failed] + P[k + 1 failed] + . . . + P[N failed]
What is P[exactly k failed]?
P[exactly k failed] = P[k failed and (N - k) have not]
                    = C(N, k) [FX(t)]^k [1 - FX(t)]^(N-k),
where FX(t) is the failure distribution of each component.
A system comprises N components, where the component failure times are given by the random variables X1, . . . , XN. The system fails at time S with distribution FS if:
Reliability Formalisms
There are several popular graphical formalisms to express system reliability. The core of the solvers is the methods we have just examined. In particular, we will examine fault trees and reliability graphs.
There is nothing particularly special about these formalisms except their popularity. It is easy to implement these formalisms, or design your own, in a spreadsheet, for example.
Example
A NASA satellite architecture under study is designed for high reliability. The major computer system components include the CPU system, the high-speed network for data collection and transmission, and the low-speed network for engineering and control. The satellite fails if any of the major systems fail.
There are 3 computers, and the computer system fails if 2 or more of the computers fail. Failure distribution of a computer is given by FC.
There is a redundant (2) high-speed network, and the high-speed network system fails if both networks fail. The distribution of a high-speed network failure is given by FH.
The low-speed network is arranged similarly, with a failure distribution of FL.
• Components are leaves in the tree.
• A component fails = logical value of true; otherwise false.
• The nodes in the tree are boolean AND, OR, and k of N gates.
• The system fails if the root is true.
AND gates: true if all the components are true (fail).
OR gates: true if any of the components are true (fail).
k of N gates: true if at least k of the components are true (fail).
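As a concrete illustration of how these gates translate into arithmetic, the sketch below evaluates the satellite example at a single time t, assuming (purely for illustration) exponential component failure distributions; the rates and the mission time are made-up values, not from the slides.

```python
from math import comb, exp

def F_exp(rate, t):
    """Unreliability of a component with an exponential failure time."""
    return 1.0 - exp(-rate * t)

def k_of_n(k, n, F):
    """P[at least k of n i.i.d. components have failed], component unreliability F."""
    return sum(comb(n, j) * F**j * (1 - F)**(n - j) for j in range(k, n + 1))

t = 5 * 8760.0                 # 5 years in hours (illustrative mission time)
F_C = F_exp(1e-6, t)           # computer unreliability (assumed rate)
F_H = F_exp(5e-7, t)           # high-speed network unreliability (assumed rate)
F_L = F_exp(5e-7, t)           # low-speed network unreliability (assumed rate)

F_cpu_sys = k_of_n(2, 3, F_C)  # 2-of-3 gate: computer system fails if >= 2 computers fail
F_hs_sys = F_H * F_H           # AND gate: both high-speed networks fail
F_ls_sys = F_L * F_L           # AND gate: both low-speed networks fail

# OR gate at the root: satellite fails if any major system fails
F_sat = 1.0 - (1 - F_cpu_sys) * (1 - F_hs_sys) * (1 - F_ls_sys)
print("Satellite unreliability at t:", F_sat, " reliability:", 1 - F_sat)
```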
Reliability Graph Example
Reliability graphs can implement more complex interactions. For example, a telephone network “fails” if there is no path from source to sink.
Conditioning Fault Trees
It is also possible to use conditioning to solve more complex fault trees. If the same component appears more than once in a fault tree, it violates the independent failure assumption. However, a conditioned fault tree can be solved.
Example: A component C appears multiple times in the fault tree.
FS(t) = FS|C failed(t) · FC(t) + FS|C not failed(t) · (1 - FC(t)),
where S|C failed is the system given that C has failed, and S|C not failed is the system given that C has not failed.
• Frequently, the desired measure of a reliability model is the reliability at some time t. Thus, the distribution of the system reliability is superfluous; R(t) is the only thing of interest.
• This condition simplifies computation because all that is necessary for solution is the reliability of the components at time t. Solution then becomes a straightforward computation.
• If a system is described in terms of the availability of components at time t, then we may compute the system availability in the same way that reliability is computed. The restriction is that all component behaviors must be independent of one another.
Reliability/Availability Tables
A system comprises N components. Reliability of component i at time t is given by RXi(t), and the availability of component i at time t is given by AXi(t).
Condition — System Reliability — System Availability
• System fails if all components fail:  RS(t) = 1 − ∏i (1 − RXi(t));  AS(t) = 1 − ∏i (1 − AXi(t))
• System fails if one component fails:  RS(t) = ∏i RXi(t);  AS(t) = ∏i AXi(t)
• System fails if at least k components fail, identical distribution:  RS(t) = Σ(j=0..k−1) C(N, j) [1 − RX(t)]^j [RX(t)]^(N−j)  (availability analogous, with AX(t) in place of RX(t))
• System fails if at least k components fail, general case:  computed by summing, over all subsets of at least k components, the probability that exactly those components have failed
• For hardware, MIL-HDBK-217 is widely used.
– Not always current with modern components.
– Lacks distributions; it only contains failure rates.
– While not perfect, it seems to be the best source that exists. However, numbers from MIL-HDBK-217 should be used with caution.
• Due to the nature of software, no accepted mechanism exists to predict software reliability before the software is built.
– The best guess is the reliability of previously built similar software.
• In all cases, numbers should be used with caution and adjusted based on observation and experience.
• No substitute for empirical observation and experience!
– The amount of time a program takes to execute can be computed precisely if all factors are known, but this is nearly impossible and sometimes useless. At a more abstract level, we can approximate the running time by a random variable.
– Fault arrivals almost always must be modeled by a random process.
We begin by describing a subset of SANs: stochastic Petri nets.
Stochastic activity networks, or SANs, are a convenient, graphical, high-level language for describing system behavior. SANs are useful in capturing the stochastic (or random) behavior of a system.
Stochastic Petri Net Review
One of the simplest high-level modeling formalisms is called stochastic Petri nets. A stochastic Petri net is composed of the following components:
• Places: which contain tokens, and are like variables
• tokens: which are the “value” or “state” of a place
• transitions: which change the number of tokens in places
• input arcs: which connect places to transitions
• output arcs: which connect transitions to places
Firing Rules for SPNs
A stochastic Petri net (SPN) executes according to the following rules:
• A transition is said to be enabled if for each place connected by input arcs, the number of tokens in the place is ≥ the number of input arcs connecting the place and the transition.
Firing Rules, cont.
• A transition may fire if it is enabled. (More about this later.)
• If a transition fires, for each input arc, a token is removed from the corresponding place, and for each output arc, a token is added to the corresponding place.
Example:
Note: tokens are not necessarily conserved when a transition fires.
Specification of Stochastic Behavior of an SPN
• A stochastic Petri net is made from a Petri net by
– Assigning an exponentially distributed time to all transitions.
– Time represents the “delay” between enabling and firing of a transition.
– Transitions “execute” in parallel with independent delay distributions.
• Since the minimum of multiple independent exponentials is itself exponential, time between transition firings is exponential.
• If a transition t becomes enabled, and before t fires, some other transition fires and changes the state of the SPN such that t is no longer enabled, then t aborts, that is, t will not fire.
• Since the exponential distribution is memoryless, one can say that transitions that remain enabled continue or restart, as is convenient, without changing the behavior of the network.
Notes on SPNs
• SPNs are much easier to read, write, modify, and debug than Markov chains.
• SPN to Markov chain conversion can be automated to afford numerical solutions to Markov chains.
• Most SPN formalisms include a special type of arc called an inhibitor arc, which enables the associated transition only if there are zero tokens in the associated place, and the identity (do nothing) function. Example: modify an SPN to give writes priority.
• Limited in their expressive power: may only perform +, -, >, and test-for-zero operations.
• These very limited operations make it very difficult to model complex interactions.
• Simplicity allows for certain analysis, e.g., a network protocol modeled by an SPN may detect deadlock (if inhibitor arcs are not used).
• More general and flexible formalisms are needed to represent real systems.
Stochastic Activity Networks
The need for more expressive modeling languages has led to several extensions to stochastic Petri nets. One extension that we will examine is called stochastic activity networks. Because there are a number of subtle distinctions relative to SPNs, stochastic activity networks use different words to describe ideas similar to those of SPNs.
Stochastic activity networks have the following properties:
• A general way to specify that an activity (transition) is enabled
• A general way to specify a completion (firing) rule
• A way to represent zero-timed events
• A way to represent probabilistic choices upon activity completion
• State-dependent parameter values
• General delay distributions on activities
A fault-tolerant computer system is made up of two redundant computers. Each computer is composed of three redundant CPU boards. A computer is operational if at least 1 CPU board is operational, and the system is operational if at least 1 computer is operational.
CPU boards fail at a rate of 1 per 10⁶ hours, there is a 0.5% chance that a board failure will cause a computer failure, and there is a 0.8% chance that a board will fail in a way that causes a catastrophic system failure.
Reward Variables
Reward variables are a way of measuring performance- or dependability-related characteristics about a model.
Examples:
– Expected time until service
– System availability
– Number of misrouted packets in an interval of time
– Processor utilization
– Length of downtime
– Operational cost
– Module or system reliability
Reward Structure Example
A web server failure model is used to predict profits. When the web server is fully operational, profits accumulate at $N/hour. In a degraded mode, profits accumulate at $N/6 per hour. Repairs cost $K.
By carefully integrating the reward structure from 0 to t, we get the profit at time t. This is an example of an “interval-of-time” variable.
Rate reward structure:
R(m) = N      if m is a fully functioning marking
     = N/6    if m is a degraded-mode marking
     = 0      otherwise
Impulse reward: C(a) = -K for each completion of the repair activity a.
Rationale
There are many good reasons for using composed models.
– Building highly reliable systems usually involves redundancy. The replicate operation models redundancy in a natural way.
– Systems are usually built in a modular way. Replicates and Joins are usually good for connecting together similar and different modules.
– Tools can take advantage of something called the Strong Lumping Theorem that allows a tool to generate a Markov process with a smaller state space (to be described in Session 7).
Composed Model
How does adding an additional computer affect reliability?
– In the composed model, change the number of replications to 3 and change various reward variables - easy. (Use a global variable if you suspect you may want to do this.)
– In “flat” model, add another computer - hard
In composed model, the number of states in the underlying Markov chain is much smaller, especially for large numbers of replications. (Details will be given in Session 7.)
• Simulation relies on good pseudo-random number generation, sufficient observations, and good statistical techniques to produce an approximate solution
• Increasing accuracy by a factor of n requires on the order of n² more work, which can be prohibitively expensive.
For example, a 5-Nines system reliability model will require approximately 100,000 observations to observe one failure. One digit of accuracy can easily require over 1,000,000 observations!
(For many models, 1,000,000 observations can be generated quickly, but as system failure becomes even rarer, standard simulation quickly becomes infeasible.)
If you can model using exponential delays and your model is sufficiently small, continuous time Markov chains (CTMCs) offer some advantages. These include:
– Typically faster solution time for systems with rare events– Typically takes less time to get more accurate answers– Typically more confidence in the solution
In order to understand when we get these advantages, we must better understand the methods of obtaining solutions to CTMCs.
Random Variable Review
It is often convenient to assign a (real) number to every element in Ω. This assignment, or rule, or function, is called a random variable.
Random Process Review
Random processes are useful for characterizing the behavior of real systems.
A random process is a collection of random variables indexed by time.
Example: X(t) is a random process. Let X(1) be the result of tossing a die. Let X(2) be the result of tossing a die plus X(1), and so on. Notice that T = {1, 2, 3, . . .}.
Describing a Random Process
Recall that for a random variable X, we can use the cumulative distribution FX to describe the random variable.
In general, no such simple description exists for a random process.
However, a random process can often be described succinctly in various different ways. For example, if Y is a random variable representing the roll of a die, and X(t) is the sum after t rolls, then we can describe X(t) by
X(t) - X(t - 1) = Y,
P[X(t) = i|X(t - 1) = j] = P[Y = i - j],
or X(t) = Y1 + Y2 + . . . + Yt, where the Yi’s are independent.
Classifying Random Processes: Characteristics of T
If the number of time points defined for a random process, i.e., |T|, is finite or countable (e.g., integers), then the random process is said to be a discrete-time random process.
If |T| is uncountable (e.g., real numbers) then the random process is said to be a continuous-time random process.
Example: Let X(t) be the number of fault arrivals in a system up to time t. Since t ∈ T is a real number, X(t) is a continuous-time random process.
Classifying Random Processes: State Space Type
Let X be a random process. The state space of a random process is the set of all possible values that the process can take on, i.e.,
S = {y : X(t) = y, for some t ∈ T}.
If X is a random process that models a system, then the state space of X can represent the set of all possible configurations that the system could be in.
State Occupancy Probability Vector
Let π be a row vector. We denote πi to be the i-th element of the vector. If π is a state occupancy probability vector, then πi(k) is the probability that a DTMC has value i (or is in state i) at time-step k.
Assume that a DTMC X has a state-space size of n, i.e., S = {1, 2, . . . , n}. We say formally that πi(k) = P[X(k) = i], and
πj(k + 1) = Σ(i=1..n) πi(k) P[X(k + 1) = j | X(k) = i].
Notice that this resembles vector-matrix multiplication. In fact, if we arrange the matrix P = [Pij], that is, if P is the n × n matrix whose (i, j) entry is Pij = P[X(k + 1) = j | X(k) = i], then pij = Pij, and π(1) = π(0)P, where π(0) and π(1) are row vectors, and π(0)P is a vector-matrix multiplication.
The important consequence of this is that we can easily specify a DTMC in terms of an occupancy probability vector π and a transition probability matrix P.
Solution, cont.
Alternatively, we could compute P³ since we found
π(3) = π(0)P³.
Working out solutions by hand can be tedious and error-prone, especially for “larger” models (i.e., models with many states). Software packages are used extensively for this sort of analysis.
Software packages compute π(k) by (. . . ((π(0)P)P)P . . .)P rather than computing Pᵏ, since computing the latter results in a large “fill-in.”
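A minimal sketch of this computation, using a hypothetical 2-state weather DTMC (the transition probabilities below are invented for illustration, not taken from the slides):

```python
import numpy as np

# Hypothetical weather DTMC: state 0 = sunny, state 1 = rainy
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])
pi = np.array([1.0, 0.0])     # pi(0): start in the sunny state

# Compute pi(k) as (...((pi(0)P)P)...)P, avoiding explicit P**k
for k in range(1, 4):
    pi = pi @ P
    print(f"pi({k}) =", pi)
```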
Graphical Representation
It is frequently useful to represent the DTMC as a directed graph. Nodes represent states, and edges are labeled with probabilities. For example, our weather prediction model would look like this:
Limiting Behavior of DTMCs
It is sometimes useful to know the time-limiting behavior of a DTMC. This translates into the “long term,” where the system has settled into some steady-state behavior.
Formally, we are looking for π* = lim(n→∞) π(n).
To compute this, what we want is lim(n→∞) π(0)Pⁿ.
There are various ways to compute this. The simplest is to calculate π(n) for increasingly large n, and when π(n + 1) ≅ π(n), we can believe that π(n) is a good approximation to steady-state. This can be rather inefficient if n needs to be large.
Classifications
It is much easier to solve for the steady-state behavior of some DTMCs than others. To determine if a DTMC is “easy” to solve, we need to introduce some definitions.
Definition: A state j is said to be accessible from state i if there exists an n ≥ 0 such that (Pⁿ)ij > 0. We write i → j.
Note: recall that (Pⁿ)ij is the probability of moving from state i to state j in exactly n steps.
If one thinks of accessibility in terms of the graphical representation, a state j is accessible from state i if there exists a path of non-zero edges (arcs) from node i to node j.
Steady-State Solution of DTMCs
The steady-state behavior can be computed by solving the linear equation
π = πP, with the constraint that Σi πi = 1. For irreducible DTMCs, it can be shown that this solution is unique. If the DTMC is periodic, then this solution yields π*.
One can understand the equation π = πP in two different ways.
• In steady-state, the probability distribution π(n + 1) = π(n)P, and by definition π(n + 1) = π(n) in steady-state.
• “Flow” equations.
Flow equations require some visualization. Imagine a DTMC graph, where the nodes are assigned the occupancy probability, or the probability that the DTMC has the value of the node.
Let πiPij be the “probability mass” that moves from state i to state j in one time-step. Since probability must be conserved, the probability mass entering a state must equal the probability mass leaving a state.
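Written out, the balance (flow) equation for each state j says that, in steady state, the probability mass flowing into j equals the mass flowing out of j:

```latex
\pi_j \sum_{i} P_{ji} \;=\; \sum_{i} \pi_i P_{ij}
\quad\Longleftrightarrow\quad
\pi_j \;=\; \sum_{i} \pi_i P_{ij} \qquad (\text{since } \textstyle\sum_i P_{ji} = 1),
```

which is just the j-th component of π = πP.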
Continuous Time Markov Chains (CTMCs)
For most systems of interest, events may occur at any point in time. This leads us to consider continuous time Markov chains. A continuous time Markov chain (CTMC) has the following property: the future behavior depends only on the current state, not on the past history, i.e., P[X(t + τ) = j | X(t) = i and the history before t] = P[X(t + τ) = j | X(t) = i] = pij(τ).
A CTMC is completely described by the initial probability distribution π(0) and the transition probability matrix P(t) = [pij(t)]. Then we can compute π(t) = π(0)P(t).
The problem is that pij(t) is generally very difficult to compute.
CTMC Properties
This definition of a CTMC is not very useful until we understand some of the properties.
First, notice that pij(τ) is independent of how long the CTMC has previously been in state i; that is, the remaining time in state i has the same distribution no matter how long the chain has already been there.
There is only one random variable that has this property: the exponential random variable. This indicates that CTMCs have something to do with exponential random variables. First, we examine the exponential r.v. in some detail.
Event Rate
The fact that the exponential random variable has the memoryless property indicates that the “rate” at which events occur is constant, i.e., it does not change over time.
Often, the event associated with a random variable X is a failure, so the “event rate” is often called the failure rate or the hazard rate.
The event rate of X is defined as the probability that the event associated with X occurs within the small interval [t, t + ∆t], given that the event has not occurred by time t, per the interval size ∆t:
This can be thought of as looking at X at time t, observing that the event has not occurred, and measuring the number of events (probability of the event) that occur per unit of time at time t.
Minimum of Two Independent Exponentials
Another interesting property of exponential random variables is that the minimum of two independent exponential random variables is also an exponential random variable. Let A and B be independent exponential random variables with rates α and β, respectively. Let us define X = min{A, B}. What is FX(t)?
FX(t) = P[X ≤ t]
      = P[min{A, B} ≤ t]
      = P[A ≤ t OR B ≤ t]
      = 1 - P[A > t AND B > t]       (see combinatorial methods section)
      = 1 - P[A > t] P[B > t]
      = 1 - (1 - P[A ≤ t])(1 - P[B ≤ t])
      = 1 - (1 - FA(t))(1 - FB(t))
      = 1 - (1 - [1 - e^(-αt)])(1 - [1 - e^(-βt)])
      = 1 - e^(-αt) e^(-βt)
      = 1 - e^(-(α+β)t)
Competition of Two Independent Exponentials
If A and B are independent and exponential with rates α and β respectively, and A and B are competing, then we know that one will “win” with an exponentially distributed time (with rate α + β). But what is the probability that A wins?
Imagine a random process X with state space S = {1, 2, 3}. X(0) = 1. X goes to state 2 (takes on a value of 2) with an exponentially distributed time with parameter α. Independently, X goes to state 3 with an exponentially distributed time with parameter β. These state transitions are like competing random variables.
We say that from state 1, X goes to state 2 with rate α and to state 3 with rate β.
X remains in state 1 for an exponentially distributed time with rate α + β. This is called the holding time in state 1. Thus, the expected holding time in state 1 is 1/(α + β).
The probability that X goes to state 2 is α/(α + β). The probability X goes to state 3 is β/(α + β).
Competing Exponentials vs. a Single Exponential With Choice
Consider the following two scenarios:
1. Event A will occur after an exponentially distributed time with rate α. Event B will occur after an independent exponential time with rate β.
2. After waiting an exponential time with rate α + β, event A occurs with probability α/(α + β) and event B occurs with probability β/(α + β).
These two scenarios are indistinguishable. In fact, we frequently interchange the two scenarios rather freely when analyzing a system modeled as a CTMC.
State-Transition-Rate Matrix
A CTMC can be completely described by an initial distribution π(0) and a state-transition-rate matrix. A state-transition-rate matrix Q = [qij] is defined as follows:
qij = the rate of going from state i to state j, for i ≠ j, and qii = −Σ(j≠i) qij (so each row sums to zero).
Example: A computer is idle, working, or failed. When the computer is idle, jobs arrive with rate α, and they are completed with rate β. When the computer is working, it fails with rate λw, and with rate λi when it is idle.
Analysis of “Simple Computer” Model
Some questions that this model can be used to answer:
– What is the availability at time t?
– What is the steady-state availability?
– What is the expected time to failure?
– What is the expected number of jobs lost due to failure in [0,t]?
– What is the expected number of jobs served before failure?
– What is the throughput of the system (jobs per unit time), taking into account failures?
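As a sketch of the kind of analysis meant here, the Python code below builds a generator matrix for the three-state model (idle, working, failed), with made-up numeric rates standing in for α, β, λi, and λw, and computes the expected time to failure starting from the idle state. In this simple version the failed state is treated as absorbing, so the MTTF is obtained by solving a small linear system over the transient states; the numbers are purely illustrative.

```python
import numpy as np

# States: 0 = idle, 1 = working, 2 = failed (absorbing here)
alpha, beta = 2.0, 3.0          # job arrival and completion rates (assumed values)
lam_i, lam_w = 0.001, 0.01      # failure rates when idle / working (assumed values)

Q = np.array([
    [-(alpha + lam_i), alpha,            lam_i],
    [beta,             -(beta + lam_w),  lam_w],
    [0.0,              0.0,              0.0  ],   # failed state is absorbing
])

# Restrict Q to the transient states {idle, working}; the MTTF vector m solves Q_T m = -1
Q_T = Q[:2, :2]
mttf = np.linalg.solve(Q_T, -np.ones(2))
print("Expected time to failure from idle:   ", mttf[0])
print("Expected time to failure from working:", mttf[1])
```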
State-Space Generation from SANs
If the activity delays are exponential, it is straightforward to convert a SAN to a CTMC. We first look at the simple case, where there is no composed model.
• “Reduced Base Model” construction techniques make use of composed model structure to reduce the number of states generated.
• A state in the reduced base model is composed of a state tree and an impulse reward.
• During reduced base model construction, the use of state trees permits an algorithm to automatically determine valid lumpings based on symmetries in the composed model.
• The reduced base model is constructed by finding all possible (state tree, impulse reward) combinations and computing the transition rates between states.
• Generation of the detailed base model is avoided.
CTMC Transient Solution
We have seen that it is easy to specify a CTMC in terms of the initial probability distribution π(0) and the state-transition-rate matrix.
Earlier, we saw that the transient solution of a CTMC is given by π(t) = π(0)P(t), and we noted that P(t) was difficult to define.
Due to the complexity of the math, we omit the derivation and show the relationship
Solving this differential equation in some form is difficult but necessary to compute a transient solution.
d/dt P(t) = P(t)Q = QP(t), where Q is the state-transition-rate matrix of the Markov chain.
Transient Solution Techniques
Solutions to d/dt P(t) = P(t)Q can be obtained in many (dubious) ways*:
– Direct: If the CTMC has N states, one can write N² PDEs with N² initial conditions and solve N² linear equations.
– Laplace transforms: Unstable with multiple “poles.”
– Nth-order differential equations: Uses determinants and hence is numerically unstable.
– Matrix exponentiation: P(t) = e^(Qt), where
Matrix exponentiation has some potential. Directly computing e^(Qt) by summing the defining series can be expensive and prone to instability.
If the CTMC is irreducible, it is possible to take advantage of the fact that Q = ADA⁻¹, where D is a diagonal matrix. Computing e^(Qt) becomes A e^(Dt) A⁻¹, where
e^(Qt) = I + Σ(n=1..∞) (Qt)ⁿ / n!
and
e^(Dt) = diag(e^(d₁₁t), e^(d₂₂t), . . . , e^(dₙₙt)).
* See C. Moler and C. Van Loan, “Nineteen Dubious Ways to Compute the Exponential of a Matrix,” SIAM Review, vol. 20, no. 4, pp. 801-836, October 1978.
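A small sketch of the matrix-exponential approach using SciPy's expm, for the same illustrative three-state generator used earlier (the rates are made-up values, not from the slides). For large or stiff generators, uniformization or the differential-equation view is usually preferred, as the cited paper discusses.

```python
import numpy as np
from scipy.linalg import expm

alpha, beta, lam_i, lam_w = 2.0, 3.0, 0.001, 0.01   # assumed rates
Q = np.array([
    [-(alpha + lam_i), alpha,            lam_i],
    [beta,             -(beta + lam_w),  lam_w],
    [0.0,              0.0,              0.0  ],
])
pi0 = np.array([1.0, 0.0, 0.0])       # start in the idle state

t = 100.0                              # time of interest (hours, illustrative)
pi_t = pi0 @ expm(Q * t)               # pi(t) = pi(0) e^{Qt}
print("pi(t) =", pi_t)
print("Availability at t (not failed):", pi_t[0] + pi_t[1])
```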
This yields the elegant equation π*Q = 0, where π* is the steady-state probability distribution. If the CTMC is irreducible, then π* can be computed with the constraint that Σi π*i = 1.
Definition: A CTMC is irreducible if every state in the CTMC is reachable from every other state.
If the CTMC is not irreducible, then more complex solution methods are required.
Notice that for irreducible CTMCs, the steady-state distribution is independent of the initial-state distribution.
Direct Steady-State Solution
One steady-state solver in Möbius is the direct steady-state solver. This solver solves the augmented matrix using a form of Gaussian elimination.
The augmented system solved is π* Q̃ = eᵢᵀ, where the augmented matrix Q̃ incorporates the normalization constraint Σi π*i = 1.
Pros:
– Can get a very accurate solution in a fixed amount of time; “stiffness” (described later) does not affect solution time.
Cons:
– Solution complexity is O(n³), so it does not scale well to large models; memory requirements are high due to fill-in and are not known a priori.
Recommendation: Use for small CTMCs (tens of states) or medium-sized and stiff CTMCs (hundreds to a few thousand states), or when high accuracy is required.
Reminder: High accuracy in solution does not mean high accuracy in prediction. Use accuracy to do relative comparisons.
Iterative Solution Methods
The simplest iterative solution methods are called stationary iterative methods, and they can be expressed as
π(k + 1) = π(k)M,
where M is a constant (stationary) matrix. Computing π(k + 1) from π(k) requires one vector-matrix multiplication, or one iteration, which on modern workstations is extremely fast.
The simplest stationary iterative method for CTMCs is called the power method. Recall π*Q = 0. Let M = Q + I.
Stationary iterative solution methods have the following characteristics:
– Low memory usage (no fill-in); predictable memory usage
– Low time per iteration, proportional to the number of non-zero entries
– Fast solution time for non-stiff matrices (tens or hundreds of iterations)
– Stop when sufficiently accurate
– Slow solution time for stiff matrices
– Difficult to quantify accuracy, especially for stiff matrices
– Easy to implement
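A sketch of a stationary iteration for π*Q = 0. Note that the plain choice M = Q + I converges only if the rates happen to be scaled appropriately; the safer variant shown below uses M = I + Q/λ with λ ≥ maxᵢ|qᵢᵢ| (essentially uniformization), which is an assumption added here rather than part of the slides, as is the toy generator.

```python
import numpy as np

Q = np.array([                      # illustrative generator (toy three-state model)
    [-2.001, 2.0,    0.001],
    [3.0,   -3.01,   0.01 ],
    [0.1,    0.0,   -0.1  ],        # a repair transition makes the chain irreducible
])

lam = np.abs(np.diag(Q)).max()      # uniformization rate
M = np.eye(Q.shape[0]) + Q / lam    # stochastic matrix with the same stationary vector

pi = np.full(Q.shape[0], 1.0 / Q.shape[0])
for _ in range(10000):              # iterate pi <- pi M until it stops changing
    new = pi @ M
    if np.abs(new - pi).max() < 1e-12:
        pi = new
        break
    pi = new

print("steady-state pi:", pi, " check pi Q ~ 0:", pi @ Q)
```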
(a) If only rate rewards are used, the time-averaged interval-of-time steady-state measure is identical to the instant-of-time steady-state measure (if both exist).
(b) Provided the instant-of-time steady-state distribution is well-defined. Otherwise, the time-averaged interval-of-time steady-state variable is computed and only results for rate rewards should be derived.
Problem Origin
• This problem was originally posed in 1992 as a reliability model of a large, embedded fault-tolerant computer, presumably for space-borne applications. It was posed as a hierarchical model with non-perfect coverage at each level, with the purpose of showing the inadequacy of existing techniques.
– Combinatorial methods were incapable of including coverage at all levels of the hierarchy, thus grossly overstating the reliability.
– Markov- or SPN-based methods create far too many states to solve.
– Monte-Carlo simulation works, but provides only an estimate (which is often not good enough).
– A specialized tool was developed to do numerical integration of a semi-Markov process to solve this and similar problems.
• In Möbius, we solve a smaller version of the same architecture “exactly” using Markov models generated by SANs. This is made possible by automatic state lumping using composed models.
• System consists of 2 computers
• Each computer consists of:
– 3 memory modules (2 must be operational)
– 3 CPU units (2 must be operational)
– 2 I/O ports (1 must be operational)
– 2 error-handling chips (non-redundant)
• Each memory module consists of:
– 41 RAM chips (39 must be operational)
– 2 interface chips (non-redundant)
• A CPU consists of 6 non-redundant chips
• An I/O port consists of 6 non-redundant chips
• 10 to 20 year operational life
• The system is operational if at least one computer is operational
• A computer is operational if all the modules are operational:
– A memory module is operational if at least 39 RAM chips and both interface chips are operational
– A CPU unit is operational if all 6 CPU chips are operational
– An I/O port is operational if all 6 I/O chips are operational
– The error-handling unit is operational if both error-handling chips are operational
• Failure rate per chip is 100 failures per 1 billion hours
Coverage
• This system could be modeled using combinatorial methods if we did not take coverage into account. Coverage is the probability that, when a chip fails, the failure does not cause the larger subsystem to fail, provided sufficient redundancy exists; i.e., coverage is the probability that the fault is contained.
The coverage probabilities are given in the following table:
• For example, if a RAM chip fails, there is a 0.2% chance the memory module will fail even if sufficient redundancy exists. If the memory module fails, there is a 5% chance the computer will fail. If a computer fails, there is a 5% chance the system will fail.
Redundant Component    Fault Coverage Probability
RAM Chip               0.998
Memory Module          0.95
CPU Unit               0.995
I/O Port               0.99
Computer               0.95
• Seven places represent the state of the system:
1. cpus – the number of operational CPU modules
2. ioports – the number of operational I/O modules
3. errorhandlers – whether the two error-handler chips are operational
4. computer_failed – the number of failed computers
5. memory_failed – the number of failed memory modules
6. memory_chips – number of operational RAM chips
7. interface_chips – number of operational interface chips
• Five activities represent failures in the system:
1. cpu_failure – the failure of any CPU chip
2. ioport_failure – the failure of any I/O chip
3. errorhandling_chip_failure – the failure of either error-handler chip
4. memory_chip_failure – the failure of a RAM chip
5. interface_chip_failure – the failure of a memory interface chip
Cases on these activities represent behavior based on coverage or non-coverage.
For the modeled two-computer system with non-perfect coverage at all levels (i.e., the model as described), the state space contains 10,114 states. The 10-year mission reliability was computed to be 0.995579.
Configuration                                                                    States    Reliability (10-year mission time)
100% coverage at all levels                                                      4278      0.999539
Nonperfect coverage considered at all levels                                     10114     0.995579
Nonperfect coverage considered at all levels, no spare memory module             1335      0.987646
Nonperfect coverage considered at all levels, no spare CPU module                3299      0.973325
Nonperfect coverage considered at all levels, no spare IO port                   3299      0.985419
Nonperfect coverage considered at all levels, no spare memory module,
CPU module, or IO port                                                           511       0.935152
100% coverage at all levels, no spare memory module, CPU module, IO port,
or RAM chips
• High-level formalisms (like SANs) make it easy to specify realistic systems, but they also make it easy to specify systems that have unreasonably large state spaces.
• State-of-the-art tools (like Möbius) can handle state-level models with a few tens of millions of states, but not more.
• When state spaces become too large, discrete event simulation is often a viable alternative.
• Discrete-event simulation can be used to solve models with arbitrarily large state spaces, as long as the desired measure is not based on a “rare event.”
• When “rare events” are present, variance reduction techniques can sometimes be used.
• Simulation can be applied to any SAN model. The most prominent difference, compared with analytic solvers, is that generally distributed activities can be used.
• Simulation does not require the generation of a state space and therefore does not require a finite state space. Therefore, much more detailed models can be solved.
Disadvantages of Simulation
• Simulation only provides an estimate of the desired measure. An approximate confidence interval is constructed that contains the actual result with some user-specified probability.
• Higher desired accuracy dramatically increases the necessary simulation time. As a rule, to make the confidence interval n times narrower, the simulation has to be run n² times as long.
• The “rare event problem” may arise. If simulation is used to estimate a small probability, such as the reliability of a highly-reliable system, extremely long simulations may have to be performed to encounter the particular event often enough.
• Complicated models can require long simulation times, even if the rare event problem is not an issue. The simulators in Möbius perform the necessary event scheduling very efficiently, but it should be realized that simulation is not a panacea.
Simulation as Model Experimentation
• State-based methods (such as Markov chains) work by enumerating all possible states a system can be in, and then invoking a numerical solution method on the generated state space.
• Simulation, on the other hand, generates one or more trajectories (possible behaviors from the high-level model), and collects statistics from these trajectories to estimate the desired performance/dependability measures.
• Just how this trajectory is generated depends on the:
– nature of the notion of state (continuous or discrete)
– type of stochastic process (e.g., ergodic, reducible)
– nature of the measure desired (transient or steady-state)
– types of delay distributions considered (exponential or general)
• We will consider each of these issues in this module, as well as the simulation of systems with rare events.
Types of Simulation
Continuous-state simulation is applicable to systems where the notion of state is continuous and typically involves solving (numerically) systems of differential equations. Circuit-level simulators are an example of continuous-state simulation.
Discrete-event simulation is applicable to systems in which the state of the system changes at discrete instants of time, with a finite number of changes occurring in any finite interval of time.
Since we will focus on validating end-to-end systems, rather than circuits, we will focus on discrete-event simulation.
There are two types of discrete-event simulation execution algorithms:
– Fixed-time-stamp advance
– Variable-time-stamp advance
Fixed-Time-Stamp Advance Simulation
• Simulation clock is incremented a fixed time ∆t at each step of the simulation.
• After each time increment, each event type (e.g., activity in a SAN) is checked to see if it should have completed during the time of the last increment.
• All event types that should have completed are completed, and a new state of the model is generated.
• Rules must be given to determine the ordering of events that occur in each interval of time.
• Example:
• Good for all models where most events happen at fixed increments of time (e.g., gate-level simulations).
• Has the advantage that no “future event list” needs to be maintained.
• Can be inefficient if events occur in a bursty manner, relative to the time-step used.
• Simulation clock is advanced a variable amount of time at each step of the simulation, to the time of the next event.
• If all event times are exponentially distributed, the next event to complete and the time of the next event can be determined using the equation for the minimum of n exponentials (since memoryless), and no “future event list” is needed.
• If event times are general (have memory), then a “future event list” is needed.
• Has the advantage (over fixed-time-stamp increment) that periods of inactivity are skipped over, and models with a bursty occurrence of events are not inefficient.
Basic Variable-Time-Step Advance Simulation Loop for SANs
A) Set list_of_active_activities to null.
B) Set current_marking to initial_marking.
C) Generate potential_completion_time for each activity that may complete in the current_marking and add it to list_of_active_activities.
D) While list_of_active_activities ≠ null:
1) Set current_activity to the activity with the earliest potential_completion_time.
2) Remove current_activity from list_of_active_activities.
3) Compute new_marking by selecting a case of current_activity and executing the appropriate input and output gates.
4) Remove all activities from list_of_active_activities that are not enabled in new_marking.
5) Remove all activities from list_of_active_activities for which new_marking is a reactivation marking.
6) Select a potential_completion_time for all activities that are enabled in new_marking but not on list_of_active_activities, and add them to list_of_active_activities.
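A compressed Python sketch of this loop for a tiny model (a single queue with exponential arrivals and services, standing in for a SAN with two activities). All names and rates are invented for illustration, and cases and reactivation markings are omitted to keep it short.

```python
import random

rates = {"arrive": 1.0, "serve": 1.5}        # activity rates (assumed values)
marking = {"queue": 0}                       # current_marking
clock = 0.0
active = {}                                  # activity -> potential_completion_time

def enabled(act, m):
    return act == "arrive" or (act == "serve" and m["queue"] > 0)

def fire(act, m):
    m["queue"] += 1 if act == "arrive" else -1

# Step C: schedule every activity enabled in the initial marking
for act in rates:
    if enabled(act, marking):
        active[act] = clock + random.expovariate(rates[act])

# Step D: main loop (run for a fixed simulated time, an arbitrary stopping rule)
while active and clock < 1000.0:
    act = min(active, key=active.get)        # D1: earliest potential completion
    clock = active.pop(act)                  # D2: remove it and advance the clock
    fire(act, marking)                       # D3: compute new_marking
    # D4: drop activities no longer enabled in the new marking
    for a in [a for a in active if not enabled(a, marking)]:
        del active[a]
    # D6: schedule newly enabled activities not already on the list
    for a in rates:
        if enabled(a, marking) and a not in active:
            active[a] = clock + random.expovariate(rates[a])

print("final time:", clock, "queue length:", marking["queue"])
```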
Types of Discrete-Event Simulation
• Basic simulation loop specifies how the trajectory is generated, but does not specify how measures are collected, or how long the loop is executed.
• How measures are collected, and how long (and how many times) the loop is executed depends on type of measures to be estimated.
• Two types of discrete-event simulation exist, depending on what type of measures are to be estimated.
– Terminating - Measures to be estimated are measured at fixed instants of time or intervals of time with fixed finite point and length. This may also include random but finite (in some sense) times, such as a time to failure.
– Steady-state - Measures to be estimated depend on instants of time or intervals whose starting points are taken to be t → ∞.
Generation of Potential Completion Times
1) Generation of uniform [0,1] random variates
– Used as a basis for all random variate samples
– Types:
• Linear congruential generators
• Tausworthe generators
• Other types of generators
– Tests of uniform [0,1] generators
2) Generation of non-uniform random variates
– Inverse transform technique
– Convolution technique
– Composition technique
– Acceptance-rejection technique
– Technique for discrete random variates
Let X be an exponentially distributed random variable with parameter λ. Let U be a uniform [0,1] random variable generated by a pseudo-random number generator.
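The slide leads up to the inverse transform step, which (filling in the standard result for the exponential case) is:

```latex
F_X(x) = 1 - e^{-\lambda x}
\;\Rightarrow\;
X = F_X^{-1}(U) = -\frac{\ln(1 - U)}{\lambda},
\quad\text{or equivalently } X = -\frac{\ln U}{\lambda},
\text{ since } 1 - U \text{ is also uniform on } [0,1].
```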
Convolution Technique
• Technique can be used for all random variables X that can be expressed as the sum of n random variables:
X = Y1 + Y2 + Y3 + . . . + Yn
• In this case, one can generate a random variate X by generating n random variates, one from each of the Yi, and summing them.
• Examples of random variables:
– Sum of n Bernoulli random variables is a binomial random variable.
– Sum of n exponential random variables is an n-Erlang random variable.
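A minimal sketch of the convolution technique for an n-Erlang variate (summing n exponential samples; the parameter values are arbitrary choices for illustration):

```python
import random

def erlang_variate(n, rate):
    """n-Erlang random variate as the sum of n independent Exp(rate) samples."""
    return sum(random.expovariate(rate) for _ in range(n))

samples = [erlang_variate(3, 2.0) for _ in range(100_000)]
print("sample mean:", sum(samples) / len(samples), " theory:", 3 / 2.0)
```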
Composition Technique
• Technique can be used when the distribution of a desired random variable can be expressed as a weighted sum of other distributions.
• In this case F(x) can be expressed as F(x) = Σi pi Fi(x), where pi ≥ 0 and Σi pi = 1.
• The composition technique is as follows:
1) Generate a random variate i such that P[I = i] = pi for i = 0, 1, . . . (This can be done as discussed for discrete random variables.)
2) Return x as a random variate from distribution Fi(x), where i is as chosen above.
• A variant of composition can also be used if the density function of the desired random variable can be expressed as weighted sum of other density functions.
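A sketch of the composition technique for a two-phase hyperexponential distribution, F(x) = p1 F1(x) + p2 F2(x) with exponential phases. The weights and rates are arbitrary illustrative values.

```python
import random

weights = [0.3, 0.7]          # p_i, must sum to 1 (assumed values)
rates   = [0.5, 5.0]          # rate of the exponential F_i chosen in each case

def hyperexp_variate():
    # 1) choose component i with probability p_i
    i = random.choices(range(len(weights)), weights=weights)[0]
    # 2) return a variate from the chosen distribution F_i
    return random.expovariate(rates[i])

samples = [hyperexp_variate() for _ in range(100_000)]
mean_theory = sum(p / r for p, r in zip(weights, rates))
print("sample mean:", sum(samples) / len(samples), " theory:", mean_theory)
```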
• Useful for generating any discrete distribution, e.g., case probabilities in a SAN.
• More efficient algorithms exist for special cases; we will review the most general case.
• Suppose a random variable has probability distribution p(0), p(1), p(2), . . . on the non-negative integers. Then a random variate for this random variable can be generated using the inverse transform method:
1) Generate u with distribution uniform [0,1].
2) Return the smallest j satisfying Σ(i=0..j) p(i) ≥ u.
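A direct rendering of these two steps in Python (the example distribution is arbitrary; it could stand for case probabilities on a SAN activity):

```python
import random

def discrete_variate(p):
    """Inverse transform for a discrete distribution p(0), p(1), ... (sums to 1)."""
    u = random.random()              # 1) u ~ uniform [0,1]
    cumulative = 0.0
    for j, pj in enumerate(p):       # 2) return smallest j with sum_{i<=j} p(i) >= u
        cumulative += pj
        if u <= cumulative:
            return j
    return len(p) - 1                # guard against floating-point round-off

p = [0.2, 0.5, 0.3]
counts = [0, 0, 0]
for _ in range(100_000):
    counts[discrete_variate(p)] += 1
print([c / 100_000 for c in counts])
```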
Recommendations/Issues in Random Variate Generation
• Use standard/well-tested uniform [0,1] generators. Don’t assume that because a method is complicated, it produces good random variates.
• Make sure the uniform [0,1] generator that is used has a long enough period. Modern simulators can consume random variates very quickly (multiple per state change!).
• Use separate random number streams for different activities in a model system. Regular division of a single stream can cause unwanted correlation.
• Consider multiple random variate generation techniques when generating non-uniform random variates. Different techniques have very different efficiencies.
Typical Estimators of a Simulation Measure
• Can be:
– Instant-of-time, at a fixed t, or in steady-state
– Interval-of-time, for a fixed interval, or in steady-state
– Time-averaged interval-of-time, for a fixed interval, or in steady-state
• Estimators on these measures include:
– Mean
– Variance
– Interval - Probability that the measure lies in some interval [x,y]
• Don't confuse with an interval-of-time measure.
• Can be used to estimate density and distribution function.
– Percentile - the 100βth percentile is the smallest value of the estimator x such that F(x) ≥ β.
Different Types of Processes and Measures Require Different Statistical Techniques
• Transient measures (terminating simulation):
– Multiple trajectories are generated by running the basic simulation loop multiple times using different random number streams. Called independent replications.
– Each trajectory is used to generate one observation of each measure.
• Steady-state measures (steady-state simulation):
– Initial transient must be discarded before observations are collected.
– If the system is ergodic (irreducible, recurrent non-null, aperiodic), a single long trajectory can be used to generate multiple observations of each measure.
– For all other systems, multiple trajectories are needed.
– Generate multiple independent observations of each measure, one observation of each measure per trajectory of the simulation.
– Observations of each measure will be independent of one another if different random number streams are used for each trajectory.
– From a practical point of view, new stream is obtained by continuing to draw numbers from old stream (without resetting stream seed).
Notation (for subsequent slides):
– Let F(x) = P[X ≤ x] be the measure to be estimated.
– Define µ = E[X], σ² = E[(X - µ)²].
– Define xi as the ith observation value of X (ith replication, for terminating simulation).
Issue: How many trajectories are necessary to obtain a good estimate?
Terminating Simulation: Estimating the Variance of a Measure I
• Computation of estimator and confidence interval for variance could be done like that done for mean, but result is sensitive to deviations from the normal assumption.
• So, use a technique called jackknifing developed by Miller (1974).
• Informally speaking, steady-state simulation is used to estimate measures that depend on the “long run” behavior of a system.
• Note that the notion of “steady-state” is with respect to a measure (which has some initial transient behavior), not a model.
• Different measures in a model will converge to steady state at different rates.
• Simulation trajectory can be thought of as having two phases: the transient phase and the steady-state phase (with respect to a measure).
• Multiple approaches to collect observations and generate confidence intervals:
• Which method to use depends on characteristics of the system being simulated.
• Before discussing these methods, we need to discuss how the initial transient is handled.
Problem: Observations of measures are different during so-called “transient phase,” and should be discarded when computing an estimator for steady-state behavior.
Need: A method to estimate transient phase, to determine when we should begin to collect observations.
Approaches:
– Let the user decide: not sophisticated, but a practical solution.
– Look at long-term trends: take a moving average and measure differences.
– Use more sophisticated statistical measures, e.g., standardized time series (Schruben 1982).
Recommendation:
– Let the user decide, since automated methods can fail.
Methods of Steady-State Measure Estimation: Replication/Deletion
• Statistics similar to those for terminating simulation, but observations collected only on steady-state portion of trajectory.
• One or more observations collected per trajectory:
• Compute
xi = (1/Mi) Σ(j=1..Mi) Oij
as the ith observation, where Oij is the jth observation collected on trajectory i and Mi is the number of observations in trajectory i.
• xi are considered to be independent, and confidence intervals are generated.
• Useful for a wide range of models/measures (the system need not be ergodic), but slower than other methods, since the transient phase must be repeated multiple times.
Methods of Steady-State Measure Estimation: Batch Means
• Similar to Replication/Deletion, but constructs observations from a single trajectory by breaking it into multiple batches.
• Example
• Observations from each batch are combined to construct a single observation; these observations are assumed to be independent and are used to construct the point estimator and confidence interval.
• Issues:
– How to choose batch size?
– Only applicable to ergodic systems (i.e., those for which a single trajectory has the same statistics as multiple trajectories).
– Initial transient only computed once.
• In summary, a good method, often used in practice.
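A sketch of the batch-means calculation on a stream of (already post-transient) observations. The data here is synthetic and the batch count is an arbitrary choice.

```python
import random
import statistics

# Synthetic post-transient observations (stand-in for values measured on one long trajectory)
observations = [random.expovariate(1.0) for _ in range(20_000)]

num_batches = 20
batch_size = len(observations) // num_batches
batch_means = [
    statistics.fmean(observations[b * batch_size:(b + 1) * batch_size])
    for b in range(num_batches)
]

point = statistics.fmean(batch_means)
s = statistics.stdev(batch_means)
half_width = 2.093 * s / num_batches ** 0.5    # t-quantile for 19 d.o.f., ~95% confidence
print(f"estimate {point:.4f} +/- {half_width:.4f} (95% CI from {num_batches} batches)")
```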
Other Steady-State Measure Estimation Methods II
• Spectral Method (Heidelberger and Welch 1981)
– Assumes the process is covariance stationary, but does not make further assumptions (as the previous method does).
– Efficient method, if certain parameters are chosen correctly, but the choice requires sophistication on the part of the user.
• Standardized Time Series (Schruben 1983)
– Assumes the process is strictly stationary and “phi-mixing.”
– Phi-mixing means that Oi and Oi+j become uncorrelated if j is large.
– As with the spectral method, has parameters whose values must be chosen correctly.
Summary: Measure Estimation and Confidence Interval Generation
1) Only use the mean as an estimator if it has meaning for the situation being studied. Often a percentile gives more information. This is a common mistake!
2) Use some confidence interval generation method! Even if the results rely on assumptions that may not always be completely valid, the methods give an indication of how long a simulation should be run.
3) Pick a confidence interval generation method that is suited to the system that you are studying. In particular, be aware of whether the system being studied is ergodic.
4) If batch means is used, be sure that batch size is large enough that batches are practically uncorrelated. Otherwise the simulation can terminate prematurely with an incorrect result.
1) Know how random variates are generated in the simulator you use. Make sure:
– A good uniform [0,1] generator is used
– Independent streams are used when appropriate
– Non-uniform random variates are generated in a proper way.
2) Compute and use confidence intervals to estimate the accuracy of your measures.
– Choose the correct confidence interval computation method based on the nature of the system and measures being studied.
1. Generate MDD representation of unlumped SS
2. Build MD representation of CTMC
3. Convert unlumped SS to lumped SS
4. Solve CTMC by iterating through MD data structure
• Local SS expansion in levels corresponding to atomic models. No assumption of knowing the local state space in advance ⇒
– Online computation of transitive closure based on Ibaraki and Katoh's algorithm
• Avoids costly computation of transitive closure from scratch
Validating Computer System Security: Research Goal
CONTEXT: Create robust software and hardware that are fault-tolerant, attack resilient, and easily adaptable to changes in functionality and performance over time.
GOAL: Create an underlying scientific foundation, methodologies, and tools that will:
– Enable clear and concise specifications,
– Quantify the effectiveness of novel solutions,
– Test and evaluate systems in an objective manner, and
• Most traditional approaches to security validation have focused on avoiding intrusions (non-circumventability), or have not been quantitative, instead focusing on and specifying procedures that should be followed during the design of a system (e.g., the Security Evaluation Criteria [DOD85, ISO99]).
• When quantitative methods have been used, they have typically either been based on formal methods (e.g., [Lan81]), aiming to prove that certain security properties hold given a specified set of assumptions, or been quite informal, using a team of experts (often called a “red team,” e.g. [Low01]) to try to compromise a system.
• Both of these approaches have been valuable in identifying system vulnerabilities, but probabilistic techniques are also needed.
• Provide convincing evidence that the design, when implemented, will provide satisfactory mission support under real use scenarios and in the face of cyber-attacks.
• More specifically, determine whether the design, when implemented, will meet the project goals:
• This assurance case is supported by:– Rigorous logical arguments – Experimental evaluation– A detailed executable model of the design
[Figure: Functional Model of the Relevant Subset of the System — models for the Client, Access Proxy, and PSQ Server; components AA1, AA2, AA3, AP1, AP2; M1 (Network Domains), M2–M6; L1 (ADF), L2, L3.]
3. Detailed descriptions of model component behaviors representing 2a and 2b, along with statements of underlying assumptions made for each component. [Probabilistic modeling or logical argumentation, depending on requirement]
4. Construct executable functional model [Probabilistic modeling, if model constructed in 3 is probabilistic]
5. a) Verification of the modeling assumptions of Step 3 [Logical argumentation] and, b) where possible, justification of model parameter values chosen in Step 4. [Experimentation]
7. Comparison of results obtained in Step 6, noting in particular the configurations and parameter values for which the requirements of Step 1 are satisfied.
Steps
Note that if the requirement being addressed is not quantitative, steps 4 and 6 are skipped.
Example
High-level description…
Steps 4-5: The Access Proxy verifies that the client is in a valid session by sending the session key accompanying the IO to the Downstream Controller for verification.
Step 6: The Access Proxy forwards the IO to the PSQ Server in its quadrant.
....
Step 3: Detailed descriptions of model component behaviors and Assumptions (Access Proxy)
4.4 Access Proxy
4.4.1 Model Description
AM1: If a process domain in the DJM proxy is not corrupted, it forwards the traffic it is designated to handle from the quadrant isolation switch to core quadrant elements and vice versa. All traffic being forwarded is well-formed (if the proxy is correct). The following kinds of traffic are handled:
1. IOs (together with tokens) sent from publishing clients to the core (we do not distinguish between IOs sent via different protocols such as RMI or SOAP/HTTP).
….
AM2: Attacks on access proxy: attacks on an access proxy are enabled if either/both
1. The quadrant isolation switch is ON, and one or more clients are corrupted, leading to:
a) Direct attacks: can cause the corruption of the process domain corresponding to the domain of the attacking process on the compromised client.
….
AM3: If an attack occurs on the access proxy, it can have the following effects:
1. Direct attacks leading to process corruption:
a) Enable corruption of other process domains on the host.
…..
4.4.2 Facts and Simplifications
AF1: Each access proxy runs on a dedicated host machine.
AF2: DoS attacks result in increased delays.
….
4.4.3 Assumptions
AA1: Only well-formed traffic is forwarded by a correct access proxy.
AA2: The access proxy cannot access cryptographic keys used to sign messages that pass through it.
AA3: The access proxy cannot access the contents of an IO if application-level end-to-end encryption is being used.
AA4: Attacks on an access proxy can only be launched from compromised clients, or from corrupted core elements that interact with the access proxy during the normal course of a mission.
….
Steps 6 and 7: Measures and Results
• Assumptions: CPUB is the conjunction of
– C1PUB = the publishing client is successfully registered with the core
– C2PUB = the publishing client's mission application interacts with the client as intended
• Definition of a successful publish: EPUB is the conjunction of
– E1PUB = the data flow for the IO is correct
– E2PUB = the time required for the publish operation is less than tmax
– E3PUB = the content of the IO received by the subscriber has the same essential content as that assembled by the publisher
• Measure: P[EPUB|CPUB]
– Fraction of successful publishes in a 12-hour period
– Between clients that cannot be compromised
• Objective
– P[EPUB|CPUB] ≥ pPUB for a 12-hour mission
• Issues in Model-Based Validation of High-Availability Computer Systems/Networks
• Stochastic Activity Network Concepts
• Analytic/Numerical State-Based Modeling
• Case Study: Embedded Fault-Tolerant Multiprocessor System
• Solution by Simulation
• The Art of System Dependability / Conclusions
The “Art” of Performance and Dependability Validation
• Performance and dependability validation is an art because:
– There is no recipe for producing a good analysis,
– The key is knowing how to abstract away unimportant details, while retaining important components and relationships,
– This intuition only comes from experience,
– Experience comes from making mistakes.
Doing it Right: Model Construction
• Understand the desired measure before you build the model.
• The desired measure determines the type of model and the level of detail required. No model is universal!
• Steps in constructing a model:
1. Choose the desired measures:
• The choice of measures forms a basis for comparison.
• It's easy to choose the wrong measure and see patterns where none exist.
• Measures should be refined during the design and validation process.
2. Choose the appropriate level of detail/abstraction for model components.
• The key is to represent the model at the right level of detail for the chosen measures.
• It is almost never possible or practical to include all system aspects.
• Model the system at the highest level possible to obtain a good estimate of the desired measures.
3. Build the model.
• Decide how to break up the model into modules, and how the modules will interact with one another.
• Test the model as you build it, to ensure it executes as intended.
Doing it Right: Model Solution
• Use the appropriate model solution technique:
– Just because you have a hammer doesn't mean the world is a nail.
– There is no universal model solution technique (not even simulation!)
– The appropriate model solution technique depends on model characteristics.
• Use representative input values:
– The results of a model solution are only as good as the inputs.
– The inputs will never be perfect.
– Understand how uncertainty in inputs affects measures.
– Do sensitivity analysis.
• Include important points in the design/parameter space:
– Parameterize choices when design or input values are not fixed.
– A complete parametric study is usually not possible.
– Some parameters will have to be fixed at "nominal" values.
– Make sure you vary the important ones.
Doing it Right: Model Interpretation/Documentation
• Make all your assumptions explicit:
– Results from models are only as good as the assumptions that were made in obtaining them.
– It's easy to forget assumptions if they are not recorded explicitly.
• Understand the meaning of the obtained measures:
– Numbers are not insights.
– Understand the accuracy of the obtained measures, e.g., confidence intervals for simulation.
• Keep social aspects in mind:
– Performance and dependability analysts almost always bring bad news.
– Bearers of bad news are rarely welcomed.
– In presentations, concentrate on results, not the process.