Some thoughts for the industry session. Prof. Kishor S. Trivedi, Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291. Phone: (919) 660-5269. E-mail: [email protected]. At present: visiting Professor, IIT Kanpur, CSE Dept. Cochin Conference, Dec 18, 2002.
1
Some thoughts for the industry session
Prof. Kishor S. Trivedi
Department of Electrical and Computer Engineering
Duke University
Durham, NC 27708-0291
Phone: (919) 660-5269
e-mail: [email protected]
At present: visiting Professor, IIT Kanpur, CSE Dept.
Well trained students
Short term research problems solved
Short courses on timely topics
3
What do faculty want?
Funding for 'their' research
Place their students in good company labs
Hope to get their research results transferred to industry
To get to know important and difficult problems that can drive their research
4
Some lessons learned
Student placement should be guided by the advisor
Start early with summer internship
Patience is needed in listening to problems from industry
Patience is needed in getting the IP problems resolved
Expect to do at least 50% more work than the funding provided
Tech transfer is a double-edged sword
Practical problems can give rise to respectable research papers
Short courses are ideal entry points
5
Characteristics of the Systems being Studied
Redundancy: Hardware (Static, Dynamic), Information, Time
Performance: Resource Contention, Concurrency and Synchronization
Timeliness (Have to Meet Deadlines)
Composite Performance and Dependability: Degradable Levels of Performance
Need Techniques and Tools that can Evaluate:
Systems with All the Characteristics Above
Explicitly Address Complexity
Characteristics of the Systems being Studied
7
MEASURES TO BE EVALUATED
Dependability
Reliability: R(t), System MTTF
Availability: Steady-state, Transient, Interval
Safety
"Does it work, and for how long?"
Performance
Throughput, Loss Probability, Response Time
"Given that it works, how well does it work?"
8
MEASURES TO BE EVALUATED
Composite Performance and Dependability
"How much work will be done (lost) in a given interval, including the effects of failure/repair/contention?"
Need Techniques and Tools That Can Evaluate
Performance, Dependability and Their Combinations
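One common way to express such a composite measure is a Markov reward model: attach a performance level (reward rate) to every state of a failure/repair model and take the expectation. The sketch below is only an illustration. It assumes a hypothetical two-processor system with a single shared repair facility, and the rates and throughput figures are invented for the example. It computes the long-run expected throughput, which is a simplification of the interval measure quoted above.

```python
def birth_death_steady_state(failure_rates, repair_rates):
    """Steady-state probabilities of a birth-death Markov chain.

    failure_rates[i] is the rate from state i to state i+1 (one more unit down);
    repair_rates[i] is the rate from state i+1 back to state i.
    """
    weights = [1.0]
    for fail, repair in zip(failure_rates, repair_rates):
        # Local balance: pi[i] * fail == pi[i+1] * repair
        weights.append(weights[-1] * fail / repair)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameters (per hour), not taken from the slides.
LAM = 0.001   # failure rate of one processor
MU = 0.1      # repair rate of the single repair facility

# States: 0 = both processors up, 1 = one up, 2 = none up.
probs = birth_death_steady_state([2 * LAM, LAM], [MU, MU])

# Hypothetical reward rates: throughput (jobs/hour) delivered in each state.
reward_rates = [180.0, 100.0, 0.0]

expected_throughput = sum(p * r for p, r in zip(probs, reward_rates))
availability = probs[0] + probs[1]   # system is up if at least one processor is up

print(f"steady-state availability: {availability:.6f}")
print(f"expected throughput:       {expected_throughput:.3f} jobs/hour")
```

Setting every up-state reward to 1 and every down-state reward to 0 reduces the same computation to plain availability, which is why one tool can serve both purposes.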
9
PURPOSE OF EVALUATION
Understanding a System
Observation
Operational Environment
Controlled Environment
Reasoning
A Model is a Convenient Abstraction
10
PURPOSE OF EVALUATION
Predicting Behavior of a System
Need a Model
Accuracy Based on Degree of Extrapolation
All Models are Wrong; Some Models are Useful
Prediction is fine as long as it is not about the future
11
Methods of Quantitative EVALUATION
Measurement-Based
Most believable, most expensive
Not always possible or cost effective during system design
12
Methods of Quantitative Evaluation (Continued)
Model-Based
Less believable, Less expensive
1. Discrete-Event Simulation vs. Analytic
2. State-Space Methods vs. Non-State-Space Methods
3. Hybrid: Simulation + Analytic (SPNP)
4. State Space + Non-State Space (SHARPE)
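The contrast between discrete-event simulation and an analytic solution can be seen on the smallest possible model. The sketch below is an illustration only: it assumes a single repairable component with exponentially distributed times to failure and repair, with made-up rates, and compares a simulated estimate of steady-state availability with the closed-form value. It does not use SPNP or SHARPE.

```python
import random


def simulate_availability(lam, mu, cycles, seed=1):
    """Estimate steady-state availability by simulating failure/repair cycles."""
    rng = random.Random(seed)
    up_time = 0.0
    down_time = 0.0
    for _ in range(cycles):
        up_time += rng.expovariate(lam)    # time until the next failure
        down_time += rng.expovariate(mu)   # time until the repair completes
    return up_time / (up_time + down_time)


def analytic_availability(lam, mu):
    """Closed-form steady-state availability of the two-state Markov model."""
    return mu / (lam + mu)


# Hypothetical rates (per hour): MTTF = 100 h, MTTR = 1 h.
LAM, MU = 0.01, 1.0

simulated = simulate_availability(LAM, MU, cycles=20_000)
exact = analytic_availability(LAM, MU)

print(f"simulated: {simulated:.5f}")
print(f"analytic:  {exact:.5f}")
```

The analytic answer is exact and instantaneous, while the simulation carries sampling error that shrinks only as the number of cycles grows. The simulation, on the other hand, would keep working if the exponential assumption were dropped.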
13
Why MODEL?
Provides a framework for gathering, organizing, understanding and evaluating information about a system, e.g. Zitel, US&S, HP
A cost-effective means to evaluate a system, e.g. Boeing, US&S, HP, IBM, Motorola, Cisco, SUN
14
Why MODEL? (continued)
Provides a means of evaluating a set of alternatives in a structured and quantitative manner, e.g. Zitel, DEC, HP
Sometimes needed due to legal and contractual obligations e.g. FAA
Sometimes needed for business reasons: Motorola, SUN, Cisco
15
Compare two CLIENT-SERVER Architectures
[Figure: Architecture 1 and Architecture 2]
16
Compare Connection Reliabilities
Connection reliability R(t) is the probability that throughout the interval [0,t) at least one path exists from the client to server on which all components are operational.
From R(t), system mean time to failure can be computed:
$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt$$
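As a rough numerical illustration of obtaining MTTF from R(t), the sketch below integrates R(t) with the trapezoidal rule for two hypothetical connections: one with a single client-to-server path and one with two independent paths. The architectures on the slides are not reproduced here; each path is treated as one component with an assumed exponential failure rate, and no repair is modeled.

```python
import math


def mttf(reliability, horizon, steps=200_000):
    """Approximate the integral of R(t) over [0, horizon] (trapezoidal rule).

    The horizon must be large enough that R(horizon) is negligible.
    """
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h


LAM = 0.001  # hypothetical failure rate of one path, per hour


def r_single_path(t):
    """R(t) when the connection has exactly one path."""
    return math.exp(-LAM * t)


def r_two_paths(t):
    """R(t) when at least one of two independent paths must survive."""
    return 1.0 - (1.0 - math.exp(-LAM * t)) ** 2


mttf_single = mttf(r_single_path, horizon=50_000.0)
mttf_two = mttf(r_two_paths, horizon=50_000.0)

print(f"MTTF, one path:  {mttf_single:.2f} h")   # analytic value: 1/LAM
print(f"MTTF, two paths: {mttf_two:.2f} h")      # analytic value: 1.5/LAM
```

Under these assumptions the second path raises MTTF by only 50 percent, a reminder that comparing R(t) over the mission time of interest can tell a different story than comparing MTTF alone.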
17
Compare Connection Reliabilities
18
Compare Connection Availabilities
Connection (instantaneous, transient or point) availability A(t) is the probability that at time t at least one path exists from the client to server on which all components are operational.
$A(t) \ge R(t)$, and the limiting or steady-state availability is
$$A = \lim_{t \to \infty} A(t)$$
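For the simplest repairable model these quantities have closed forms, which makes the relationships easy to check. The sketch below assumes a single component that starts in the up state, with an exponential failure rate `lam` and repair rate `mu`; both values are hypothetical, and a real connection with several paths would need a larger model.

```python
import math


def point_availability(t, lam, mu):
    """A(t) for a two-state repairable component that is up at t = 0."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


def reliability(t, lam):
    """R(t): probability of no failure at all in [0, t)."""
    return math.exp(-lam * t)


def steady_state_availability(lam, mu):
    """Limit of A(t) as t grows without bound."""
    return mu / (lam + mu)


# Hypothetical rates (per hour).
LAM, MU = 0.001, 0.1

for t in (0.0, 10.0, 100.0, 1000.0, 10000.0):
    a = point_availability(t, LAM, MU)
    r = reliability(t, LAM)
    print(f"t={t:>8.0f}  A(t)={a:.6f}  R(t)={r:.6f}")

print(f"steady-state availability: {steady_state_availability(LAM, MU):.6f}")
```

A(t) stays at or above R(t) because a repaired component counts as available again, whereas reliability is lost at the first failure.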
MODELING THROUGHOUT SYSTEM LIFECYCLE
Design Verification Phase
Use Measurements + Models
E.g. Fault Injection + Reliability Model
Union Switch and Signal, Boeing, Draper
Configuration Selection Phase: DEC
System Operational Phase: Lucent
• It is fun!
22
CASE STUDY: ZITEL
Comparison of two different fault-tolerant RAMdisks.
Stochastic Petri Net Package (SPNP) was used to model the two systems for their reliability.
23
CASE STUDY: ZITEL
Trivedi worked with the designers directly:
Model validation was done using face validation and sanity checks.
Parameterization was easy due to the experience of the designers.
One difficult research problem originated from the study; it was subsequently solved and published in the Microelectronics and Reliability journal.
24
CASE STUDY: VAXCLUSTER
Developed three models of Processor Subsystem:
Two-Level Decomposition (IEEE-TR, Apr 89)
Inner Level: 9-state Markov
Outer level: n parallel diodes
A Detailed SPN Model (PNPM 89)
A Detailed SPN model for Heterogeneous Cluster (Avresky book)
25
CASE STUDY: VAXCLUSTER
Storage Subsystem Model: A fixed-point iteration over a set of Markov submodels. (IEEE-TR, to appear)
Observed that availability is maximized with 2 processors (HCSS 90)
Many interesting reliability, availability, performability measures computed
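Fixed-point iteration over submodels means solving each small model with parameters taken from the current solutions of the others, and repeating until nothing changes. The sketch below shows only the mechanics. The two submodels and their coupling are hypothetical (two components sharing one repair person, whose effective repair rate is halved while the other component is down); they are not the VAXcluster storage model.

```python
def two_state_availability(lam, mu):
    """Steady-state availability of one failure/repair submodel."""
    return mu / (lam + mu)


def solve_coupled_submodels(lam_a, lam_b, mu, tol=1e-12, max_iter=1000):
    """Fixed-point iteration over two submodels that share a repair person.

    Hypothetical coupling: each submodel sees the full repair rate while the
    other component is up, and half of it while the other component is down.
    """
    avail_a = avail_b = 1.0  # initial guess
    for iteration in range(1, max_iter + 1):
        new_a = two_state_availability(lam_a, mu * (0.5 + 0.5 * avail_b))
        new_b = two_state_availability(lam_b, mu * (0.5 + 0.5 * new_a))
        if abs(new_a - avail_a) < tol and abs(new_b - avail_b) < tol:
            return new_a, new_b, iteration
        avail_a, avail_b = new_a, new_b
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical rates (per hour).
avail_a, avail_b, iterations = solve_coupled_submodels(lam_a=0.01, lam_b=0.02, mu=1.0)

print(f"submodel A availability: {avail_a:.6f}")
print(f"submodel B availability: {avail_b:.6f}")
print(f"iterations: {iterations}")
```

The appeal of the approach is that no joint state space is ever built. The price is that convergence has to be checked for each model rather than taken for granted.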
26
Case Study: HP
Cluster Availability Modeling
Server Availability
Mass Storage Arrays Availability Modeling
Started with Markov chains via SHARPE
Progressed toward Stochastic Petri Nets
and Stochastic Reward nets via SPNP
27
CASE STUDY: LUCENT
A Validated Model of Hardware-Software Availability.
Worked with V. Mendiratta of Naperville.
Model is semi-Markov; solved using SHARPE.
Parameters collected from field data.
Model results validated against actual measurements.
28
CASE STUDY: LUCENT, IBM, Motorola, SUN
Software Rejuvenation:
A technique to counter software "aging" and increase its availability to clients.
Evaluated optimum rejuvenation interval which maximizes steady state availability (minimizes expected cost) for IBM cluster, Motorola CMTS cluster
Collected data from real systems to show aging and to determine proactive fault management strategies.
Worked in our lab, with SUN Microsystems
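The idea of an optimum rejuvenation interval can be illustrated with a deliberately simple renewal model; it is not one of the models used in these case studies. The assumptions are: time to failure follows a Weibull distribution with increasing failure rate (aging), a failure costs `repair_time` of downtime, a planned rejuvenation costs the much shorter `rejuv_time`, and either event returns the software to an as-new state. All parameter values are hypothetical.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that aging software survives to time t without failing."""
    return math.exp(-((t / scale) ** shape))


def availability_without_rejuvenation(shape, scale, repair_time):
    """Steady-state availability when the system is only repaired after failure."""
    mean_life = scale * math.gamma(1.0 + 1.0 / shape)
    return mean_life / (mean_life + repair_time)


def best_rejuvenation_interval(shape, scale, repair_time, rejuv_time,
                               max_interval, step=1.0):
    """Grid search for the interval that maximizes steady-state availability.

    Per cycle: expected uptime is the integral of R(t) over [0, interval];
    expected downtime is repair_time if a failure came first, else rejuv_time.
    """
    best_avail, best_interval = 0.0, None
    uptime, t, prev_r = 0.0, 0.0, 1.0
    while t < max_interval:
        t += step
        r = weibull_reliability(t, shape, scale)
        uptime += 0.5 * (prev_r + r) * step        # running trapezoidal integral
        prev_r = r
        downtime = repair_time * (1.0 - r) + rejuv_time * r
        avail = uptime / (uptime + downtime)
        if avail > best_avail:
            best_avail, best_interval = avail, t
    return best_interval, best_avail


# Hypothetical parameters, in hours.
interval, avail = best_rejuvenation_interval(
    shape=2.0, scale=1000.0, repair_time=10.0, rejuv_time=0.5, max_interval=5000.0)
baseline = availability_without_rejuvenation(2.0, 1000.0, 10.0)

print(f"best interval: {interval:.0f} h, availability {avail:.5f}")
print(f"no rejuvenation: availability {baseline:.5f}")
```

Rejuvenating too often wastes time on planned restarts and too rarely lets expensive failures through, so availability peaks at an interior interval. With a shape parameter of 1 (no aging) the same search would find no benefit from rejuvenation.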
29
CASE STUDY: MOTOROLA
Availability & Performability Modeling:
Modeled several configurations of Communication Enterprise Common Platform.
Practical approaches for approximating steady state measures in large, repairable, and highly dependable systems: model decomposition, state space truncation, etc.
Both SHARPE and SPNP used.
30
CASE STUDY: MOTOROLA
Recovery strategies in wireless handoff:
proposed and modeled several strategies
a patent being filed by Motorola
SPNP was used
Hierarchy of two-level models used
Fixed-point iteration was used
31
CASE STUDY: BELLCORE
Architecture-based software reliability:
proposed a methodology
applied the methodology to SHARPE
used Bellcore's test coverage tool, ATAC, to parameterize the model
Bellcore is currently enhancing ATAC to incorporate our methodology
32
CASE STUDY: DRAPER LAB
Overall aim was verification of a system with very high reliability/availability specifications.
Prototype under consideration was FTPP cluster 3.
Hybrid approach proposed
Fault injection based measurements.
Statistical analysis of measured data to enable parameterization of analytical models.
33
CASE STUDY: DRAPER LAB
Reliability modeling of the prototype done:
Parameterization done with the aid of existing reliability databases.
Analytical solution provided exact closed form expressions
Markov model solved using SHARPE
Petri net model solved using SPNP
Reliability bottlenecks found
34
CASE STUDY: AT & T
GSHARPE:
A Preprocessor to SHARPE developed at Bell Labs by a Duke student.
User can specify Weibull failure times and lognormal and other repair time distributions.
GSHARPE fits these to phase type distributions and produces a Markov model that is generated for processing by SHARPE
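The slides do not say how GSHARPE performs the fit. As a stand-in, the sketch below uses plain two-moment matching to an Erlang distribution, one of the simplest phase-type families: the number of stages is chosen from the squared coefficient of variation and the stage rate from the mean. The function names are hypothetical, and the method only suits distributions whose coefficient of variation is at most 1, such as a Weibull with shape greater than 1.

```python
import math


def weibull_moments(shape, scale):
    """Mean and variance of a Weibull(shape, scale) distribution."""
    mean = scale * math.gamma(1.0 + 1.0 / shape)
    second_moment = scale ** 2 * math.gamma(1.0 + 2.0 / shape)
    return mean, second_moment - mean ** 2


def fit_erlang(mean, variance):
    """Moment-match an Erlang-k distribution (k exponential stages in series).

    Erlang-k has squared coefficient of variation 1/k, so k is the nearest
    integer to 1/cv2; the mean is matched exactly, the variance approximately.
    """
    cv2 = variance / mean ** 2
    stages = max(1, round(1.0 / cv2))
    rate = stages / mean          # rate of each exponential stage
    return stages, rate


# Hypothetical failure-time distribution: Weibull with shape 2, scale 1000 h.
mean, variance = weibull_moments(shape=2.0, scale=1000.0)
stages, rate = fit_erlang(mean, variance)

print(f"Weibull mean {mean:.1f} h, cv^2 {variance / mean ** 2:.4f}")
print(f"Erlang fit: {stages} stages, each with rate {rate:.6f} per hour")
print(f"Erlang mean {stages / rate:.1f} h, cv^2 {1.0 / stages:.4f}")
```

Each Erlang stage becomes one extra state in the resulting Markov chain, which is the sense in which a non-exponential distribution is turned into a Markov model at the cost of a larger state space.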
35
CASE STUDY: BOEING
An Integrated Reliability Environment
A working prototype
Developed a high-level modeling language (SDM)
Designed and implemented an intelligent interpreter
36
CASE STUDY: BOEING (Continued)
Interpreter determines which solution method is applicable
Five different modeling engines are integrated: CAFTA, SETS, EHARP, SHARPE and