1 Some thoughts for the industry session Prof. Kishor S. Trivedi Department of Electrical and Computer Engineering Duke University Durham, NC 27708-0291.

1

Some thoughts for the industry session

Prof. Kishor S. Trivedi

Department of Electrical and Computer Engineering

Duke University

Durham, NC 27708-0291

Phone: (919) 660-5269

e-mail: [email protected]

At present: Visiting Professor, IIT Kanpur, CSE Dept.

Cochin Conference, Dec 18, 2002


2

What does industry want?

Well trained students

Short term research problems solved

Short courses on timely topics

3

What do faculty want?

Funding for "their" research

Place their students in good company labs

Hope to get their research results transferred to industry

To get to know important and difficult problems that can drive their research

4

Some lessons learned

Student placement should be guided by the advisor

Start early with summer internship

Patience is needed in listening to problems from industry

Patience is needed in getting the IP problems resolved

Expect to do at least 50% more work than the funding provided

Tech transfer is a double-edged sword

Practical problems can give rise to respectable research papers

Short courses are ideal entry points

5

Characteristics of the Systems being Studied

Redundancy: Hardware (Static,Dynamic), Information, Time

Fault Types: Permanent, Intermittent, Transient, Design

Fault Detection, Automated Reconfiguration

Imperfect Coverage

Maintenance: scheduled, unscheduled

Dependability (Reliability, Availability, Safety)

6

Characteristics of the Systems being Studied

Performance: Resource Contention, Concurrency and Synchronization

Timeliness (Have to Meet Deadlines)

Composite Performance and Dependability: Degradable Levels of Performance

Need Techniques and Tools that can Evaluate:

Systems with All the Characteristics Above

Explicitly Address Complexity

7

MEASURES TO BE EVALUATED

Dependability

Reliability: R(t), System MTTF

Availability: Steady-state, Transient, Interval

Safety

"Does it work, and for how long?"

Performance

Throughput, Loss Probability, Response Time

"Given that it works, how well does it work?"

8

MEASURES TO BE EVALUATED

Composite Performance and Dependability

"How much work will be done (lost) in a given interval including the effects of failure/repair/contention?"

Need Techniques and Tools That Can Evaluate Performance, Dependability and Their Combinations

9

PURPOSE OF EVALUATION

Understanding a System

Observation

Operational Environment

Controlled Environment

Reasoning

A Model is a Convenient Abstraction

10

PURPOSE OF EVALUATION

Predicting Behavior of a System

Need a Model

Accuracy Based on Degree of Extrapolation

All Models are Wrong; Some Models are Useful

Prediction is fine as long as it is not about the future

11

Methods of Quantitative EVALUATION

Measurement-Based

Most believable, most expensive

Not always possible or cost effective during system design

12

Methods of Quantitative Evaluation(Continued)

Model-Based

Less believable, Less expensive

1. Discrete-Event Simulation vs. Analytic

2. State-Space Methods vs. Non-State-Space Methods

3. Hybrid: Simulation + Analytic (SPNP)

4. State Space + Non-State Space (SHARPE)

13

Why MODEL?

Provides a framework for gathering, organizing, understanding and evaluating information about a system, e.g. Zitel, US&S, HP

A cost-effective means to evaluate a system, e.g. Boeing, US&S, HP, IBM, Motorola, Cisco, SUN

14

Why MODEL? (continued)

Provides a means of evaluating a set of alternatives in a structured and quantitative manner e.g. Zitel, DEC,HP

Sometimes needed due to legal and contractual obligations e.g. FAA

Sometimes needed for business reasons: Motorola, SUN, Cisco

15

Compare two CLIENT-SERVER Architectures

Architecture 2

Architecture 1

16

Compare Connection Reliabilities

Connection reliability R(t) is the probability that throughout the interval [0,t) at least one path exists from the client to server on which all components are operational.

From R(t), system mean time to failure can be computed:

$$\mathrm{MTTF} = \int_0^\infty R(t)\,dt$$
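The mean time to failure is the integral of R(t) over [0, ∞). As a minimal sketch, assuming a hypothetical connection with two independent paths in parallel, each with an exponentially distributed lifetime (the rate `lam` and the function names are illustrative, not taken from the two architectures on the slides), the integral can be approximated numerically and checked against the closed form 1.5/lam:

```python
import math

def reliability_parallel(t, lam, n=2):
    """R(t) for n independent parallel paths, each with an exponential lifetime of rate lam."""
    return 1.0 - (1.0 - math.exp(-lam * t)) ** n

def mttf(reliability, horizon, steps=200_000):
    """Approximate the integral of R(t) over [0, horizon] with the trapezoidal rule."""
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h

lam = 0.001  # assumed failure rate per hour (illustrative value)
# The horizon is chosen so that R(horizon) is negligible.
numeric = mttf(lambda t: reliability_parallel(t, lam), horizon=50.0 / lam)
closed_form = 1.5 / lam  # (1/lam) * (1 + 1/2) for two parallel exponential paths
```

The truncation horizon is a design choice: it only needs to be large enough that the remaining tail of R(t) contributes nothing visible to the integral.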

17

Compare Connection Reliabilities

18

Compare Connection Availabilities

Connection (instantaneous, transient or point) availability A(t) is the probability that at time t at least one path exists from the client to server on which all components are operational.

$A(t) \ge R(t)$, and the limiting or steady-state availability is

$$A = \lim_{t \to \infty} A(t)$$
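To make the relation between point availability, reliability and the steady-state limit concrete, here is a small sketch for the simplest possible case: a single repairable component modeled as a two-state Markov chain with constant failure rate `lam` and repair rate `mu`. Both rates and all function names are assumptions chosen for illustration; the closed forms are the standard ones for this two-state model, not results from the slides.

```python
import math

def point_availability(t, lam, mu):
    """A(t) for a two-state (up/down) Markov model that starts in the up state."""
    return mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)

def reliability(t, lam):
    """R(t): probability of no failure at all in [0, t), so repair plays no role."""
    return math.exp(-lam * t)

def steady_state_availability(lam, mu):
    """Limit of A(t) as t grows without bound."""
    return mu / (lam + mu)

lam, mu = 0.001, 0.1  # assumed rates per hour (illustrative values)
times = [1.0, 10.0, 100.0, 1000.0, 10000.0]
table = [(t, point_availability(t, lam, mu), reliability(t, lam)) for t in times]
```

Each row of `table` holds (t, A(t), R(t)); A(t) stays at or above R(t) and settles at mu/(lam+mu), while R(t) decays to zero.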

19

Compare Connection Availabilities

20

MODELING THROUGHOUT SYSTEM LIFECYCLE

System Specification/Design Phase

Answer "What-if" Questions

Compare design alternatives (Zitel, HP, Motorola)

Performance-Dependability Trade-offs (DEC)

Design Optimization (wireless handoff)

21

MODELING THROUGHOUT SYSTEM LIFECYCLE

Design Verification Phase

Use Measurements + Models

E.g. Fault Injection + Reliability Model

Union Switch and Signals, Boeing, Draper

Configuration Selection Phase: DEC

System Operational Phase: Lucent

• It is fun!

22

CASE STUDY: ZITEL

Comparison of two different fault-tolerant RAMdisks.

Stochastic Petri Net Package (SPNP) was used to model the two systems for their reliability.

23

CASE STUDY: ZITEL

Trivedi worked with the designers directly:

Model Validation was done using face validation and sanity checks.

Parameterization was easy due to the experience of the designers.

One difficult research problem originated from the study; subsequently solved and published in Microelectronics and Reliability journal.

24

CASE STUDY: VAXCLUSTER

Developed three models of Processor Subsystem:

Two-Level Decomposition (IEEE-TR, Apr 89)

Inner Level: 9-state Markov

Outer level: n parallel diodes

A Detailed SPN Model (PNPM 89)

A Detailed SPN model for Heterogeneous Cluster (Avresky book)

25

CASE STUDY: VAXCLUSTER

Storage Subsystem Model: A fixed-point iteration over a set of Markov submodels. (IEEE-TR, to appear)

Observed that availability is maximized with 2 processors (HCSS 90)

Many interesting reliability, availability, performability measures computed

26

Case Study: HP

Cluster Availability Modeling

Server Availability

Mass Storage Arrays Availability Modeling

Started with Markov chains via SHARPE

Progressed toward Stochastic Petri Nets and Stochastic Reward Nets via SPNP

27

CASE STUDY: LUCENT

A Validated Model of Hardware-Software Availability.

Worked with V. Mendiratta of Naperville.

Model is semi-Markov; solved using SHARPE.

Parameters collected from field data.

Model results validated against actual measurements.

28

CASE STUDY: LUCENT, IBM, Motorola, SUN

Software Rejuvenation:

A technique to counter software "aging" and increase its availability to clients.

Evaluated optimum rejuvenation interval which maximizes steady-state availability (minimizes expected cost) for IBM cluster, Motorola CMTS cluster

Collected data from real systems to show aging and to determine proactive fault management strategies.

Worked in our lab, with SUN Microsystems

29

CASE STUDY: MOTOROLA

Availability & Performability Modeling:

Modeled several configurations of Communication Enterprise Common Platform.

Practical approaches for approximating steady-state measures in large, repairable, and highly dependable systems: model decomposition, state space truncation, etc.

Both SHARPE and SPNP used.

30

CASE STUDY: MOTOROLA

Recovery strategies in wireless handoff:

proposed and modeled several strategies

a patent being filed by Motorola

SPNP was used

Hierarchy of two-level models used

Fixed-point iteration was used

31

CASE STUDY: BELLCORE

Architecture-based software reliability:

proposed a methodology

applied the methodology to SHARPE

used Bellcore’s test coverage tool, ATAC, to parameterize

the model

Bellcore is currently enhancing ATAC to incorporate our

methodology

32

CASE STUDY: DRAPER LAB

Overall aim was verification of a system with very high reliability/availability specifications. Prototype under consideration was FTPP cluster 3.

Hybrid approach proposed:

Fault injection based measurements.

Statistical analysis of measured data to enable parameterization of analytical models.

33

CASE STUDY: DRAPER LAB

Reliability modeling of the prototype done:

Parameterization done with the aid of existing reliability databases.

Analytical solution provided exact closed form expressions

Markov model solved using SHARPE

Petri net model solved using SPNP

Reliability bottlenecks found

34

CASE STUDY: AT & T

GSHARPE: A Preprocessor to SHARPE developed at Bell Labs by a Duke Student.

User can specify Weibull failure times and lognormal and other repair time distributions.

GSHARPE fits these to phase type distributions and produces a Markov model for processing by SHARPE

35

CASE STUDY: BOEING

An Integrated Reliability Environment

A working prototype

Developed a high-level modeling language (SDM)

Designed and implemented an intelligent interpreter

36

CASE STUDY: BOEING (Continued)

Interpreter determines which solution method is applicable

Five different modeling engines are integrated: CAFTA, SETS, EHARP, SHARPE and SPNP.

37

QUANTITATIVE EVALUATION TAXONOMY

Closed-form solution

Numerical solution using a tool

38

MODELING TAXONOMY

39

STATE SPACE MODELING TAXONOMY

40

ANALYTIC MODELING TAXONOMY

NON-STATE SPACE MODELING TECHNIQUES

SP reliability block diagrams

Non-SP reliability block diagrams

Product form queuing models

41

State Space Modeling Taxonomy

State space methods

Markovian modeling

non-Markovian modeling

discrete-time Markov chains

continuous-time Markov chains

Markov reward models

Semi-Markov models

Markov regenerative models

Non-Homogeneous Markov

42

State-Space Based Models

Transition label:

Probability: (homogeneous) discrete-time Markov chain (DTMC)

Time-independent Rate: homogeneous continuous-time Markov chain

Time-dependent Rate: non-homogeneous continuous-time Markov chain

Distribution function: semi-Markov process

Two Dist. Functions: Markov Regenerative Process
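For the homogeneous continuous-time case, the steady-state probabilities π satisfy πQ = 0 with the entries of π summing to 1, and attaching a reward rate to each state turns the chain into a Markov reward model. The sketch below solves a small, hypothetical three-state model (two processors sharing one repair facility) by Gaussian elimination; the model, the rates and the reward vector are assumptions for illustration, and tools such as SHARPE or SPNP use far more scalable sparse solvers.

```python
def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for an irreducible CTMC generator Q."""
    n = len(Q)
    # Row i of A is column i of Q (one balance equation per state);
    # the last balance equation is replaced by the normalization condition.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

lam, mu = 0.001, 0.1  # assumed per-processor failure rate and repair rate
# States: 0 = both processors up, 1 = one up, 2 = none up.
Q = [[-2 * lam, 2 * lam, 0.0],
     [mu, -(lam + mu), lam],
     [0.0, mu, -mu]]
pi = steady_state(Q)
availability = pi[0] + pi[1]          # system is up if at least one processor is up
rewards = [2.0, 1.0, 0.0]             # assumed reward rate: processors available in each state
expected_reward = sum(r * p for r, p in zip(rewards, pi))
```

Changing only the reward vector switches the measure: rewards of (1, 1, 0) give availability, while (2, 1, 0) give the expected processing capacity.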

43

IN ORDER TO FULFILL OUR GOALS OF

Modeling Performance, Dependability and

Performability

Modeling Complex Systems

We Need

Automatic Generation and Solution of Large

Markov Reward Models

44

IN ORDER TO FULFILL OUR GOALS OF

Facility for State Truncation, Hierarchical composition of Non-State-Space and State-Space Models, Fixed-Point Iteration

There are Two Tools that Potentially meet these Goals

Stochastic Petri Net Package (SPNP)

Symbolic Hierarchical Automated Rel. and Perf. Evaluator (SHARPE)

45

MODELING SOFTWARE PACKAGES

HARP - Hybrid Automated Reliability Predictor (Duke Univ., funded by NASA Langley)

SAVE - System Availability Estimator (Duke Univ., funded by IBM)

SHARPE - Symbolic Hierarchical Automated Reliability and Performance Evaluator; installed at nearly 280 locations (GUI available)

SPNP - Stochastic Petri Net Package; installed at nearly 120 locations (iSPN - GUI available)

D_RAMP for Union Switch and Signals by Duke, UVA and CMU

SDM - Boeing Integrated Reliability Modeling Environment (jointly developed by Duke Univ., Univ. of Wash. and Boeing)

SDDS - Developed by Sohar with help from K. Trivedi

SREPT - Software Reliability Estimation and Prediction Tool

Challenges in Modeling

47

COMPLEXITIES OF MODELS

Large State Space

Model construction problem

Model solution problem

Model Stiffness.

Fast and slow rates acting together

Failure And Recovery/Repair

Performance and failure

48

COMPLEXITIES OF MODELS

Modeling Non-Exponential Distributions

Combining performance and reliability

Believability/Understandability/Usability

Incorporation in the design process

Connection between measurements & models:

Parameterization

Validation

49

LARGENESS TOLERANCE

Automated Model Construction

Stochastic Petri nets (GreatSPN, SPNP, SHARPE, DSPNexpress, ULTRASAN)

High level languages (SAVE, QNAP, ASSIST, SDM)

Fault-Tree + Recovery Info (HARP)

Object-Oriented Approaches (TANGRAM)

Loops in the specification of CTMC (SHARPE)

50

LARGENESS TOLERANCE

Efficient numerical solution techniques

Sparse Storage

Accurate and Efficient Solution Methods

We have Generated and Solved Models with 1,000,000 states (has gone up considerably recently)

Steady-State: Near-Optimal SOR

Transient: Modified Jensen's method
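Jensen's method is also known as uniformization: the CTMC is converted into a DTMC subordinated to a Poisson process, and the transient state probabilities become a Poisson-weighted sum of DTMC step vectors. The following is a minimal sketch of plain uniformization on a dense matrix, not the modified variant named on the slide and not the implementation in SHARPE or SPNP; the 1.02 safety factor, the tolerance and the two-state example are assumptions.

```python
import math

def transient_probabilities(Q, p0, t, tol=1e-12, max_terms=100_000):
    """State probabilities at time t for a CTMC with generator Q and initial vector p0."""
    n = len(Q)
    # Uniformization rate: slightly above the largest exit rate.
    q = 1.02 * max(-Q[i][i] for i in range(n))
    # Uniformized DTMC: P = I + Q / q.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)] for i in range(n)]
    # Naive Poisson weights; exp(-q*t) underflows for very large q*t (stiff models),
    # which is why production solvers compute the weights more carefully.
    weight = math.exp(-q * t)
    v = list(p0)
    result = [weight * x for x in v]
    accumulated = weight
    k = 0
    while 1.0 - accumulated > tol and k < max_terms:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]  # one DTMC step
        weight *= q * t / k                                            # next Poisson weight
        accumulated += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result

lam, mu, t = 0.001, 0.1, 50.0       # assumed failure rate, repair rate and time point
Q = [[-lam, lam], [mu, -mu]]        # state 0 = up, state 1 = down
p_t = transient_probabilities(Q, [1.0, 0.0], t)
```

Here `p_t[0]` is the point availability A(t) of the two-state model, which can be compared with its closed form as a sanity check.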

51

MODEL SPECIFICATION LANGUAGES

Different languages can be used to specify a single model type: SAVE, QNAP, SPNP all appear very different; underlying model type is Markov

Same language can be used to specify different model types: RESQ input language used for PFQN or EQN

52

LARGENESS AVOIDANCE

Non-State-Space methods

Reliability block diagrams

Fault-trees

Product-Form Queuing Networks

Approximate solutions

State Truncation: SAVE, SPNP, ASSIST (Kantz and Trivedi: PNPM91)

53

LARGENESS AVOIDANCE

Approximate solutions

Hierarchical Decomposition (Chapter 11) and Fixed-Point Iteration among submodels:

Heidelberger and Trivedi; IEEE-TC, 1983 (Queueing Models)

Ciardo and Trivedi; PNPM91 (SPN Models)

Tomek and Trivedi (Availability Models)

Singhal (IEEE-TPDS, 1992)

Chapter 11 of Sahner et al.
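Fixed-point iteration among submodels means that each submodel is solved with parameters supplied by the others, and the exchange is repeated until the values stop changing. The sketch below is a deliberately tiny, hypothetical illustration and not one of the cited models: each of `n` components is a two-state availability submodel, and the coupling rule (the repair rate seen by one component shrinks with the expected number of other failed components) is an assumption invented for this example.

```python
def component_availability(lam, mu_eff):
    """Steady-state availability of a two-state (up/down) Markov submodel."""
    return mu_eff / (lam + mu_eff)

def effective_repair_rate(mu, n, availability):
    """Hypothetical coupling: repair slows as more of the other components are expected to be down."""
    expected_other_failed = (n - 1) * (1.0 - availability)
    return mu / (1.0 + expected_other_failed)

def fixed_point(lam, mu, n, tol=1e-12, max_iter=1000):
    """Iterate A <- f(A) until successive values differ by less than tol."""
    a = 1.0  # initial guess: everything is up
    for iteration in range(1, max_iter + 1):
        a_next = component_availability(lam, effective_repair_rate(mu, n, a))
        if abs(a_next - a) < tol:
            return a_next, iteration
        a = a_next
    raise RuntimeError("fixed-point iteration did not converge")

# Assumed rates and component count (illustrative values).
availability, iterations = fixed_point(lam=0.01, mu=1.0, n=5)
```

Convergence is not guaranteed in general; in this toy case the update is a strong contraction, so a handful of iterations suffices.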

54

LARGENESS AVOIDANCE

Approximate solutions

Time-Scale Decomposition

Bobbio and Trivedi(IEEE-TC;1986); Section 11.2

Fluid Approximation:

Mitra; Kulkarni; Ciardo; Nicol, and Trivedi; FSPN

Performability (Chapters 6 and 12)

55

Difficulties in Modeling Using MRMs

Stiffness

Causes numerical difficulties in solution

Stiffness Tolerance

Develop stiffness tolerant numerical

solution methods

Stiffness Avoidance

Avoid generating stiff models through

decomposition

56

STIFFNESS TOLERANCE

Automatic Detection of Stiffness (HARP)

Special Stable ODE Solver

Reibman and Trivedi (TR-BDF2)

Computers and Operations Research, 1988.

Malhotra and Trivedi (Pade, Implicit RK)

57

STIFFNESS TOLERANCE

Uniformization for Stiff Markov Chains

Muppala and Trivedi

We can solve models with rate ratios of $10^8$ or higher

Implemented in SHARPE & SPNP

58

STIFFNESS AVOIDANCE

Model-level decomposition

Behavioral Decomposition (HARP, Bobbio & Trivedi): Fault-Occurrence vs. Fault/Error Handling

Hierarchical Composition (SHARPE): Composition of Submodel solutions without generating a single one-level overall model

Fixed-Point Iteration (Ciardo and Trivedi; SPNP)

59

Non-Exponential Behavior

Non-state-space models: Fault Trees, Reliability Graphs, RBDs; no problem

60

Non-Exponential Behavior in State Space Models

61

NON-EXPONENTIAL DISTRIBUTIONS

Phase-Type Expansions

Malhotra and Reibman (GSHARPE)

See Figure 9.38 on p. 191(Red Book)

Non-Homogeneous Markov Chains

CARE III, HARP

Soft Reliability model with imperfect repairs

solved using SHARPE
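A phase-type expansion replaces a non-exponential distribution with a small network of exponential stages so that the overall model stays Markovian. As a rough sketch of the idea, and not the fitting procedure used by GSHARPE, the code below matches the mean of a Weibull time to failure exactly and its coefficient of variation approximately with an Erlang distribution; the Weibull parameters and function names are assumed for illustration.

```python
import math

def weibull_moments(shape, scale):
    """Mean and variance of a Weibull distribution with the given shape and scale."""
    mean = scale * math.gamma(1.0 + 1.0 / shape)
    second_moment = scale ** 2 * math.gamma(1.0 + 2.0 / shape)
    return mean, second_moment - mean ** 2

def fit_erlang(mean, variance):
    """Pick an Erlang(stages, rate) whose mean matches exactly.

    An Erlang with k stages has squared coefficient of variation 1/k, so this
    simple fit only suits distributions with cv^2 <= 1 and matches cv^2 only
    to the nearest 1/k.
    """
    cv2 = variance / mean ** 2
    stages = max(1, round(1.0 / cv2))
    rate = stages / mean  # rate of each exponential stage
    return stages, rate

mean, variance = weibull_moments(shape=2.0, scale=1000.0)  # assumed parameters
stages, rate = fit_erlang(mean, variance)
```

Each Erlang stage becomes one extra state in the Markov chain, which is the price paid in state-space size for capturing the non-exponential shape.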

62

NON-EXPONENTIAL DISTRIBUTIONS

Semi-Markov Chains

Ciardo et al, IEEE-TC Oct. 90

Markov Regenerative Processes:

Choi, Logothetis, Kulkarni, Trivedi

DSPN and MRSPN:

Choi, Kulkarni, Trivedi

Discrete-Event Simulation

Now in SPNP (FSPN and Non-Markovian SPN Simulation), RESQ, QNAP

63

BELIEVABILITY/UNDERSTANDABILITY

Integration of Measurements and Models

Measurements Provide Parameters to Models

Models Provide Guidelines For Measurements

Models Validated Against Measurements

Integration of Different Modeling Tools

Boeing SDM project

IDEAS project at Duke

64

BELIEVABILITY/UNDERSTANDABILITY

Many Case-Studies of Validations Needed

Vaxcluster Availability Model: Wein & Sathaye

Hsueh, Iyer and Trivedi; IEEE-TC, Apr. 1988

AT & T Validation of ESS

Technology Transfer

Seminars and Workshops

Development and Dissemination of Tools

Application of the Techniques and Tools

65

MODELING AND MEASUREMENTS: INTERFACES

Measurements supply Input Parameters to Models

(Model Calibration or Parameterization)

Confidence Intervals should be obtained

Boeing, Draper, Union Switch projects

Model Sensitivity Analysis can suggest which

Parameters to Measure More Accurately: Blake,

Reibman and Trivedi: SIGMETRICS 1988.
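One simple way to see which parameters deserve more accurate measurement is the scaled sensitivity (p / f) · ∂f/∂p of an output measure f with respect to each input parameter p. The sketch below estimates it by central finite differences for the steady-state availability of a two-state model; it is a generic illustration under assumed rates, not the method of the cited paper.

```python
def availability(lam, mu):
    """Steady-state availability of a two-state (up/down) Markov model."""
    return mu / (lam + mu)

def scaled_sensitivity(f, params, name, rel_step=1e-6):
    """Estimate (p / f) * df/dp for parameter `name` with a central finite difference."""
    base = f(**params)
    p = params[name]
    h = p * rel_step
    up = dict(params, **{name: p + h})
    down = dict(params, **{name: p - h})
    derivative = (f(**up) - f(**down)) / (2.0 * h)
    return p * derivative / base

params = {"lam": 0.001, "mu": 0.1}  # assumed failure and repair rates
s_lam = scaled_sensitivity(availability, params, "lam")
s_mu = scaled_sensitivity(availability, params, "mu")
```

For this model the two scaled sensitivities have equal magnitude and opposite sign (about ∓0.0099), so a 1% error in either rate shifts the availability by roughly 0.01%.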

66


Model Validation

1. Face Validation

2. Input-Output Validation

3. Validation of Model Assumptions

(Hypothesis Testing)

Rejection of a hypothesis regarding model assumption

based on measurement data leads to an improved model

67


Model Structure Based on Measurement Data

Hsueh, Iyer and Trivedi; IEEE TC, April 1988;

Gokhale et al, IPDS 98
