Sean carter dan_deans

Sean Carter, NASA JSC

Daniel Deans, ManTech SRS Technologies

Constellation Reliability Engineering Process – Optimizing CxP Risk

Used with Permission

DFRAM Overview

Why does reliability engineering exist?

How does it fit within the life cycle?

Success space vs. failure space

Partnership on system engineering team

The value of “designing-out” failure modes

Where does it fit in the lifecycle?

What are some of the tools?

How are they applied?

Real examples

Reliability Engineering Value - Clichés

Failure is not an option…

A design engineer does not know what he does not know

An extra set of eyes and ears is always good

You have to spend money to make money

Mr. Murphy tends to rear his ugly head when you are not expecting it…

What all this means is: You have to work at it – nothing worth accomplishing comes easy

Reliability engineering is a discipline that adds value to the systems engineering process!

Typical System Engineering Lifecycle

Reliability Engineering Throughout Project Life

The Life Cycle Approach

Reliability is best designed-in; it is, for the most part, not:
• Analyzed in
• Tested in
• Operated in

Successful reliability performance begins with a diligent, intentional approach at the very beginning of a project

• Pre-phase A: requirements
• Phase A: allocation; plan; resources
• Phase B: analysis, design input, preliminary design review
• Phase C: detailed design inputs; more analysis; trade studies; design verification; critical design review
• Phase D: test planning, test readiness, manufacturing, final validation; flight readiness review
• Phase E/F: ops, growth, disposal and lessons learned

[Figure: system engineering lifecycle diagram. Only the labels survive; they are listed below.]

Two halves of the lifecycle: System Engineering; Test and Assessment

Lifecycle stages: System Concept Exploration; Preliminary Design; Design Synthesis; Component Fabrication, Assembly, Integrate, & Test; Element Integration & Test; System Integration Test; System Element Data Reduction and Assessment

Supporting functions:
• System Analysis: System, Element, Subsystem Models; System Performance Analyses
• Requirements Compliance: Specifications; Verification
• Project Direction, Control, & Planning: Management Plan; Budget Development & Control; Project Plan Development; Schedule Development & Control
• Configuration Management: Design Data Base; Problem/Failure Reports (PFR); Engineering Change Orders
• Risk Management: Risk Planning; Risk Assessment; Risk Handling/Mitigation; Risk Monitoring

Success Space vs. Failure Space

A design engineer thinks in success space (typically)
• How will the widget work?
• When it is designed, what function will it perform?
• What are the performance requirements?

A reliability engineer is paid to think in failure space
• How will the widget fail?
• What about the operating environment will cause issues?
• What materials, processes, and tools will accentuate failure modes?
• Is redundancy required?
• Are there operational work-arounds?
• How will faults propagate through the system?
• What are the effects of a failure mode on the mission?

Superimpose the two processes, you get success!

Credibility: Partnership on System Engineering Team

Safety and Mission Assurance organization provides discipline experts to support design teams

Our job is to serve; not to inhibit

We help the system engineering teams identify hazards and failure modes and design them out

Our sole reason for existing is to ensure project/program success and to reduce/eliminate operational risk

We are partners for success

The aim in partnership is to duplicate our knowledge in the collective heads of our design-team partners

The Value of “Designing-Out” Failure Modes

A failure mode is an obstacle to mission success

Not every failure mode will cause mission failure, but any failure of a component has the potential to

In the commercial world, a failure in the field costs 10 times what it costs to mitigate in the design process

In the space business, a failure can and will cost the mission and quite possibly endanger people

Identifying and designing-out failure modes is important!

How Do We Design Out Failure Modes?

Methodical process; starts in pre-phase A, follows the lifecycle.

DMEDI – Define, Measure, Explore, Develop, Implement (12 steps)
• Define requirements
• Allocate requirements
• Plan activities and analysis, including test and verification
• Collect data and develop data sources
• Use RAM simulation, FMEA, FTA, worst case analysis, derating, proven design practices to drive the design
• Support design reviews and require improvement
• Verify and ensure that design will meet requirements
• Plan and implement thorough testing
• Finalize verification, ascertain flight readiness
• Identify reliability growth opportunities once design is complete
• Investigate and eliminate root causes of anomalies
• Develop lessons learned, provide feedback to future engineering teams

Pre-Phase A Concept Development

Very important part of process – DFRAM starts here

Develop requirements that will optimize RAM for program/project

Requirements include availability, mean time to failure, fault tolerance, mean time to repair, time to replace

Import lessons learned from similar programs/systems

Collect similar system failure history data

Begin development of system model

Begin development of RAM Plan
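
The requirement types listed on this slide are linked: for a repairable item, steady-state availability follows from mean time to failure and mean time to repair as A = MTTF / (MTTF + MTTR). The sketch below shows that relationship and its inverse, which is one way to trade a repair-time requirement against an availability requirement. The function names and the sample numbers are illustrative assumptions, not Constellation values.

```python
def inherent_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def max_mttr_for_availability(mttf_hours: float, target: float) -> float:
    """Largest MTTR that still meets an availability target for a given MTTF."""
    if mttf_hours <= 0 or not 0 < target <= 1:
        raise ValueError("MTTF must be positive and target in (0, 1]")
    # Rearranged from A = MTTF / (MTTF + MTTR).
    return mttf_hours * (1.0 - target) / target


if __name__ == "__main__":
    # Hypothetical item: fails on average every 990 h, takes 10 h to repair.
    print(inherent_availability(990.0, 10.0))
    print(max_mttr_for_availability(990.0, 0.99))
```

With the hypothetical inputs above, the item is available about 99% of the time, and holding that target with the same MTTF allows roughly 10 hours of mean repair time.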

Phase A: Preliminary Analysis

Refine requirements, negotiate allocations with design elements

Finalize RAM Plan and educate design team on process; what role reliability engineering team will fill

Continue to develop preliminary model; begin FMEAs, FTAs, Probabilistic assessments

Allocate requirements to lowest design-to level

Negotiate failure definitions, failure budgets with design teams

Identify initial critical items, compare with lessons learned from previous systems

Continue to identify data sources

Identify critical suppliers; begin to form partnerships
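
Allocating a requirement to the lowest design-to level and negotiating failure budgets can be illustrated with two textbook apportionment rules for elements in series. This is a minimal sketch, assuming a series architecture and constant failure rates; the slides do not say which allocation method the program uses, and the element names and weights are hypothetical.

```python
import math


def equal_apportionment(system_reliability: float, n_elements: int) -> float:
    """Reliability each of n series elements needs so the product meets the target."""
    if not 0 < system_reliability <= 1 or n_elements < 1:
        raise ValueError("need 0 < R <= 1 and at least one element")
    return system_reliability ** (1.0 / n_elements)


def allocate_failure_rates(system_reliability, mission_hours, weights):
    """Split the system failure-rate budget across series elements by weight.

    Assumes constant failure rates, so R_sys = exp(-lambda_sys * t) and
    lambda_sys is the sum of the element failure rates.
    """
    if not 0 < system_reliability <= 1 or mission_hours <= 0:
        raise ValueError("need 0 < R <= 1 and a positive mission time")
    lambda_sys = -math.log(system_reliability) / mission_hours
    total_weight = sum(weights.values())
    return {name: lambda_sys * w / total_weight for name, w in weights.items()}


if __name__ == "__main__":
    print(equal_apportionment(0.99, 3))
    # Hypothetical weights: a more complex element gets a larger failure budget.
    print(allocate_failure_rates(0.99, 100.0, {"avionics": 3.0, "structure": 1.0}))
```

Weighted allocation gives the design teams a concrete number to negotiate: changing one element's weight shows directly what the others must give up.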

Phase B – Preliminary Design

Continue to build simulation (model) and add more details

Identify most effective analysis tools to use to drive design

Complete preliminary FMEA, FTA, PRA

Continue to develop supplier partnerships

Prepare for preliminary design review

Perform maintenance task analysis

Identify design improvement initiatives and optimize using simulation

Perform other sensitivity studies based on fault tolerance requirements

Begin developing and finalizing FRACAS, test plans, reliability growth strategy

Partner with designers to identify failure modes, design them out

Support concept of operations optimization

Phase C – Detailed Design

Perform detailed design analysis – PDR recovery

Focus on Pareto items identified from analyses (Top 10)

Continue to develop and use RAM simulation, FMEA, FTA, etc. to design out failure modes

Use Con-Ops to develop operational work-arounds as failure mode mitigation

Finalize test plans – review for reliability success criteria

Audit suppliers, provide support for reliability improvement

Mitigate schedule risks

Finalize critical items, document for testing

Begin life testing of components and subsystems as feasible

Perform specialized analysis (sneaks, fault propagation)

Prepare for and support CDR

Phase D – Development

Finalize design - CDR recovery, cut into manufacturing

Finalize FMEAs, FTAs, Simulations, CILs

Support testing, root cause investigations and corrective action

Begin collection of failure and operational history data (upon first application of power)

Finalize reliability growth strategy

Develop and begin implementation of reliability-centered maintenance approach

Make “last minute” improvements based on test results

Identify lessons learned and document

Update Con-Ops with operational work-arounds for critical items

Phase E/F – Ops and Disposal

Continue to gather data, monitor operations for anomalies

Support failure analyses, root cause investigations

Implement reliability growth process, identify areas for growth, design solutions

Document lessons learned

Use simulation to validate reliability growth strategy, sensitivities

Update RAM Plan with lessons learned

Support system disposal via identification of reliability challenges to shutdown

What are the Tools?

Some of the tools that we use are:

Requirements allocation

RAM simulation/probabilistic risk assessment

FMEA/FMECA

Fault tree analysis (FTA)/event tree assessment

Parts stress analysis/derating

Detailed design analysis

Worst case analysis

Redundancy screens

Extensive testing and verification analysis

Reliability growth planning and implementation

Others….
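
As one illustration of the tools listed above, a fault tree combines basic-event probabilities through AND and OR gates to estimate the probability of a top-level failure. The sketch below is a minimal example that assumes statistically independent basic events; the event names and probabilities are hypothetical and do not come from any Constellation analysis.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if every independent input event occurs."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


if __name__ == "__main__":
    # Hypothetical tree: loss of flow if the pump fails OR both redundant valves fail.
    p_pump = 0.001
    p_valve_a = 0.01
    p_valve_b = 0.01
    p_both_valves = and_gate([p_valve_a, p_valve_b])
    p_loss_of_flow = or_gate([p_pump, p_both_valves])
    print(p_both_valves, p_loss_of_flow)
```

In this hypothetical tree the redundant valve pair contributes far less to the top event than the single pump, which is the kind of result that points the design team at the pump first.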

Reliability and Maintainability Simulation

A very powerful process

Can help design out failure modes without cutting metal

Provides for the Pareto Principle (20/80)

Gives design team a tool for sensitivity analysis

Allows for trying many different scenarios

Helps to optimize the return on investment based on cost to improve curve

[Figure: reliability (vertical axis) versus $ cost (horizontal axis). The curve rises through a region labeled "High rate of return", reaches the point labeled KITC, and flattens into the "Area of diminishing return".]

KITC = Point on Curve where rise becomes less than run (reliability improvement = rise, cost to improve = run)
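
The KITC definition can be turned into a small calculation. Comparing rise with run only makes sense when both axes share a scale, so this sketch first normalizes cost and reliability to the range 0 to 1; that normalization, the function name, and the sample data are assumptions added for illustration. It also assumes the points are sorted by increasing cost.

```python
def find_kitc(costs, reliabilities):
    """Return the index of the first point after which rise < run.

    Both axes are normalized to 0..1 before comparing, so the result does
    not depend on the units of cost. Returns None if the curve never
    flattens below a normalized slope of 1.
    """
    if len(costs) != len(reliabilities) or len(costs) < 3:
        raise ValueError("need at least three matching (cost, reliability) points")
    cost_span = costs[-1] - costs[0]
    rel_span = reliabilities[-1] - reliabilities[0]
    if cost_span <= 0 or rel_span <= 0:
        raise ValueError("cost and reliability must both increase overall")
    for i in range(len(costs) - 1):
        run = (costs[i + 1] - costs[i]) / cost_span
        rise = (reliabilities[i + 1] - reliabilities[i]) / rel_span
        if rise < run:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical cost-to-improve curve (cost in arbitrary units).
    costs = [0, 1, 2, 3, 4]
    reliabilities = [0.90, 0.95, 0.98, 0.99, 0.995]
    knee = find_kitc(costs, reliabilities)
    print(knee, costs[knee], reliabilities[knee])
```

Spending beyond the returned point still buys reliability, but each additional unit of cost buys less than it did before.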

Simulation Basics

Simulations are built based on the system architecture

Model provides for “RAM” characteristics of system

Input data includes failure rates, repair times, sparing information, logistics information, operational work-arounds

Simulation is run based on mission profiles

“Monte Carlo” methodology is used

Typically data is input using statistical distributions

Outputs are system availability and cutsets (and other failure “illuminators”)

Cutsets lead to sensitivity analyses which in turn can drive improvements (failure mode elimination)
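
The basics above can be sketched as a very small Monte Carlo availability model. This is an illustration under stated assumptions, not the program's simulation tool: the architecture is a simple series system, times to failure and repair are drawn from exponential distributions, nothing else fails while the system is down for repair, and sparing and logistics are ignored. The component names and values are hypothetical.

```python
import random


def simulate_series_availability(components, mission_hours, trials=2000, seed=1):
    """Estimate availability of a series system by Monte Carlo simulation.

    components maps a name to (mttf_hours, mttr_hours).
    Returns (availability, downtime_by_component), where the second item
    shows which components drive the downtime.
    """
    rng = random.Random(seed)
    total_up = 0.0
    downtime_by_component = {name: 0.0 for name in components}
    for _ in range(trials):
        t = 0.0
        while t < mission_hours:
            # Exponential draws are memoryless, so re-sampling every
            # component after each repair is statistically valid.
            ttf, failed = min(
                (rng.expovariate(1.0 / mttf), name)
                for name, (mttf, _mttr) in components.items()
            )
            if t + ttf >= mission_hours:
                total_up += mission_hours - t
                break
            total_up += ttf
            t += ttf
            repair = rng.expovariate(1.0 / components[failed][1])
            down = min(repair, mission_hours - t)  # truncate at mission end
            downtime_by_component[failed] += down
            t += down
    return total_up / (trials * mission_hours), downtime_by_component


if __name__ == "__main__":
    parts = {"pump": (500.0, 10.0), "valve": (1000.0, 5.0), "controller": (2000.0, 20.0)}
    availability, downtime = simulate_series_availability(parts, 10000.0, trials=200)
    print(round(availability, 4))
    print(sorted(downtime, key=downtime.get, reverse=True))
```

In a series system every component is a single-element cutset, so the downtime ranking plays the role of the failure “illuminators”; a model with redundancy would need a proper cutset calculation.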

RAM Simulation Example

Simulation is dynamic, not static analysis

Can provide much information about overall availability of system under many different sets of conditions

Today’s tools can include operational concepts and rules, optimization of spares (some automatic)

Requires specific input data

How Results are Used

Outputs of baseline simulations are verified and validated using expert elicitation

Once all agree that the simulation is in the “ballpark,” begin the sensitivity analyses (do not get wrapped around the axle on the numbers; it is the gap elimination that provides the most value)

Identify opportunities for improvement, plug those back into the sim, ascertain value of improvements

Continue this process until gaps are eliminated or at least reduced

This can include block improvement of overall component failure rates – get the suppliers in on the act (supplier partnerships)

Ensure data from simulation is used in the design process
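
The improve-and-rerun loop described on this slide can be sketched with a closed-form stand-in for the simulation. The example assumes a series system in which nothing else fails during a repair, so steady-state availability is 1 / (1 + sum of MTTR/MTTF); a real study would rerun the simulation instead. The component names, values, and the factor-of-two improvement are hypothetical.

```python
def series_availability(components):
    """Steady-state availability of a series system.

    components maps a name to (mttf_hours, mttr_hours). Assumes no other
    component fails while the system is down for repair.
    """
    return 1.0 / (1.0 + sum(mttr / mttf for mttf, mttr in components.values()))


def rank_improvements(components, mttf_factor=2.0):
    """Improve one component's MTTF at a time and rank the availability gains."""
    baseline = series_availability(components)
    gains = {}
    for name, (mttf, mttr) in components.items():
        trial = dict(components)
        trial[name] = (mttf * mttf_factor, mttr)  # candidate improvement
        gains[name] = series_availability(trial) - baseline
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    parts = {"pump": (500.0, 10.0), "valve": (1000.0, 5.0), "controller": (2000.0, 20.0)}
    for name, gain in rank_improvements(parts):
        print(f"{name}: +{gain:.4f} availability")
```

The ranking, not the absolute numbers, is what gets carried back to the design team and suppliers: it shows where a failure-rate improvement closes the most gap.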

Success Stories: NASA Instrument Design

Validation of proper installation of sample cup retaining springs on Sample Manipulation System to preclude workmanship failures (single ring failure would result in loss of solid sample science).

Use of physics of failure methods to identify and eliminate, where possible, failure modes of Pyrolysis Oven.

Implementation of HiPot test for Wide Range Pump motor to eliminate workmanship related failures.

Identification of Hall Effect Device on actuators as possible Radiation Sensitive device. Subsequent testing validated suitability of device.

Identification of thermal switch on Gas Trap as Reliability Issue. Redesign produced higher Reliability solution.

FMEA of Gas Processing System provided justification for addition of limited redundancy.

Improved reliability of instrument by approximately 25% based on initial predictions.

Complex Space Systems Application

Predicated on effective requirements implementation

Detailed RAM Plan developed and implemented at Program Level

RAM requirements, RAM Plan flowed down to systems, elements of systems

System owners responsible for DFRAM, but program will facilitate and audit

Program level analyses including simulation, FMEA, PRA being performed

Verification and validation will be program level functions

PRA will be part of flight readiness decision

Software included in DFRAM activities (no longer black box)

System Engineering organization partnering with S&MA organization for RAM implementation

SUMMARY

Success of a system predicated on intentional implementation of DFRAM

It will not happen spontaneously

Must be married with the system engineering process

Program management must be disciples – will not work otherwise

It is always easier and more cost effective to do it right the first time

Implementation requires people skills and a service mentality
