Tutorial book on Asset Management - Maintenance and Replacement Strategies at the IEEE PES GM 2007 IR-EE-ETK 2007:004

Authors:

Dr. George Anders, Dr. Lina Bertling, Dr. Gerard Cliteur, Dr. John Endrenyi, Dr. Andrew Jardine, Dr. Wenyuan Li

Edited by:

Dr. Lina Bertling

Contents

Preface
1 Introduction
2 Maintenance as a strategic tool for asset management
    2.1 Introduction
    2.2 Are Utility assets aging?
    2.3 Condition Assessments
    2.4 Driving today’s network into the future
    2.5 Biography
3 Introduction to maintenance
    3.1 What is maintenance?
    3.2 Review of maintenance policies
    3.3 Linking reliability and maintenance: a probabilistic approach
    3.4 Conclusions
    3.5 References
    3.6 Appendix: Deterministic or probabilistic models
    3.7 Biography
4 RCM and its extension into a quantitative approach RCAM
    4.1 Introduction
    4.2 Reliability-centred maintenance (RCM)
    4.3 Reliability-centred asset management (RCAM)
    4.4 RCAM application study for an electrical distribution system [5]
    4.5 Conclusions
    4.6 References
    4.7 Biography
5 Optimizing condition monitoring decisions for maintenance planning
    5.1 Introduction
    5.2 Optimizing Condition Based Maintenance Decisions
    5.3 Software for CBM Optimization
    5.4 Recent Developments
    5.5 EXAKT Summary
    5.6 Conclusion
    5.7 References
    5.8 Biography
6 Computer program for decision support in the management of equipment maintenance
    6.1 Introduction
    6.2 Asset Management Planner (AMP) Program
    6.3 Asset Reliability Model (ARM) Program
    6.4 Optimal refurbishment strategy
    6.5 Program description
    6.6 Numerical example
    6.7 Conclusions
    6.8 References
    6.9 Biography
7 Risk Based Asset Management – Applications at Transmission Companies
    7.1 Introduction
    7.2 Replacement Strategy of Aged HVDC Components
    7.3 Determination of the Number and Timing of Spare Transformers
    7.4 Further Discussions
    7.5 References
    7.6 Biography

IEEE Tutorial on Asset Management – Maintenance and Replacement Strategies, 24-28 June 2007, Tampa, USA

Preface

It is a pleasure to present this book, which has been prepared for the tutorial on Asset Management - Maintenance and Replacement Strategies at the IEEE Power Engineering Society General Meeting, 24-28 June 2007, Tampa, Florida, USA. The tutorial is sponsored by the Reliability, Risk and Probability Applications (RRPA) Subcommittee, chaired by A. W. Schneider, Jr., and the Power System Planning & Implementation Committee (PSPI), chaired by Dr. M. L. Chan. Dr. Lina Bertling, KTH (Royal Institute of Technology), Sweden, is the tutorial chair and editor of the book.

The book shows how maintenance is turned into a strategic tool for asset management. It reviews maintenance policies and shows the link to probabilistic approaches and to the reliability-centred maintenance methods. It shows how condition-based monitoring can be used to optimize maintenance decisions. Furthermore, it introduces computer programs for decision support in the management of equipment maintenance. Finally, it presents applications of risk-based asset management at transmission companies.

The material in the book has been prepared together with five further authors: Dr. George Anders, Dr. Gerard Cliteur, Dr. John Endrenyi, Dr. Andrew Jardine, and Dr. Wenyuan Li. All of these authors are well-known experts in the field of maintenance and asset management.

The idea for this tutorial came up at the 9th International Conference on Probabilistic Methods Applied to Power Systems (PMAPS2006), held at the KTH Campus, 11-15 June 2006. The picture below is a memory from a workshop during PMAPS2006, which gathered several of the authors of this book. It has been a good and busy year since then, and maintenance keeps getting more useful as time goes by!

Lina Bertling, Editor

Stockholm, March 15, 2007

Contact for further information:
Lina Bertling, Assistant Professor
KTH Electrical Engineering
100 44 Stockholm, Sweden
Phone: +46 8 7906508
E-mail: [email protected]
Web: www.ee.kth.se/rcam or www.ee.kth.se/users/linab

Picture, from left: Andrew Jardine, Ulf Sandberg, Gerard Cliteur, John Endrenyi and Lina Bertling

1 Introduction

Maximal asset value and minimal life cycle cost are typical economic objectives of the electric utilities. However, attaining these objectives is constrained by the requirements of customers and regulators concerning the reliability of power supply. De-regulation of the electricity market has increased the incentives for cost effective and efficient use of available assets. Optimization of maintenance is one possible technique to reduce life cycle costs while improving reliability, and utilities need to implement new strategies for more effective maintenance techniques and asset management methods.

The term asset management here implies making the right decisions on: what assets to perform maintenance on, what level of maintenance to perform, what specific maintenance steps to perform, and when to perform the selected maintenance. However, to make the right decisions the manager needs strategic tools, planning tools and data and different support systems.

This book covers these different needs by: showing maintenance as a strategic tool for asset management, introducing maintenance planning methods such as reliability-centered maintenance (RCM), showing condition monitoring methods for collecting maintenance data and maintenance software, and finally showing an example of asset management methods in practical use in a transmission company.

Maintenance as a strategic tool for asset management G. Cliteur

2 Maintenance as a strategic tool for asset management

Dr. Gerard Cliteur Power System Planning & Management

KEMA, Inc.

Abstract - Equipment maintenance and replacement strategies that address system reliability issues are growing in importance in North American power grids. Reliability problems in these grids typically stem from lightning- and weather-induced outages, trees, animals and equipment deterioration. Vegetation management, automation (especially in distribution), insulation coordination and system hardening are common initiatives. However, none of these addresses equipment deterioration directly. As the infrastructure ages (average ages approach 40 years, and some equipment categories have appreciable numbers of units exceeding 55 years), the real question is how long failure rates will stay constant. If they go up due to wear-out, how fast will they increase? Can we do something about this right now? Can we, for instance, maintain equipment more effectively and thereby extend its useful life? Can we apply life extension kits? The answer is yes, but it depends on the actual business case and on what the respective Utilities are already doing. What does it cost in terms of O&M labour and materials to do all of this, and what does it buy in terms of deferred capital spending (replacement) and improved system reliability? Similar questions can be raised for equipment replacements going forward. Should we spend more capital to pro-actively replace certain equipment? If so, what equipment and at what rate? How does this affect O&M spending and system reliability? And, more challenging, in light of the other above-mentioned options to improve system reliability, what is the most cost-effective option? This chapter addresses these issues and options, and provides practical examples of how utilities deal with project ranking, prioritization and optimization under given objectives, constraints and uncertainties.

2.1 Introduction

Asset Management is more than Condition Based Maintenance. It is less than corporate portfolio planning. It boils down to connecting execution and funding; connecting operations with asset ownership and corporate objectives. Asset Management is not operational excellence; instead it focuses on effectiveness, getting the most out of every capital investment or expense from a planning perspective. It takes a long-term view, strives for balanced investment-risk-performance levels and supports the data-driven decision-making required for all ‘discretionary spending’. Thus, and most importantly, Asset Management is for utilities with an aging asset base.

Aging is not necessarily a bad thing. Equipment condition may actually improve for a certain period. However, it is clear that every piece of equipment eventually deteriorates due to wear, incidents, chemical processes, etc. Two issues need further elaboration here. First, as aging and condition deterioration are time dependent, forecasting becomes of interest. Secondly, there is a quantification issue: uncertainties put engineers and managers ill at ease, either because data are lacking for informed decision-making or because there is too much information to process… Both are long-standing topics in the industry and fraught with uncertainty, confusion and doubt. Leaving aside any Sarbanes-Oxley implications, let’s start with the forecasting issue.

2.1.1 Condition forecasting

In the medical profession, health is an individual’s physical body condition: a momentary snapshot of that person’s well-being and potential performance. “I am healthy (currently have no diseases) and am trained, willing and capable of running a marathon within 2 hours and 15 minutes.” This is an example of someone expressing his or her condition, and it is a useful statement for the application screening committee. The claim can be tested and verified, if only by having the person run the marathon once. If, however, we change perspective and look at this claim from a sponsor’s point of view, we will want a couple of additional data points. Apart from the looks of the runner… we will want to know the age and, most importantly, how long this person can perform up to these specifications. Any physical body is subject to condition enhancement and deterioration, which emphasizes the importance of forecasting condition over time, together with the related certainty. Any professional responsible for budgets that depend on a certain expected, repeated or continued performance by assets that can deteriorate needs this information.

Back to physical Utility assets: asset managers are similarly interested in such forecasted condition data. Adding assets like transformers is not much of a deliberate decision, as it grows the (asset base of the) company and typically yields incremental revenues. As Utility asset bases tend to age, the successful Utility will increasingly be the one that can deploy fact-based decision-making for asset replacements, often earmarked as ‘discretionary’ spending. If one is too late, the related performance goes down and other risk elements may become exposed. If one is too early, capital is wasted. Forecasted equipment condition (footnote 4) and system performance feed multi-dimensional cost-benefit analysis (footnote 5) and improved decision-making.

2.1.2 Condition quantification

The classifiers that Utilities use for equipment condition are ‘as new’, ‘good’ or ‘acceptable’, ‘critical’ and ‘urgent attention required’. Many of these are convoluted due to the inherent mix of condition and criticality of the unit in question. We will discuss this later in this paper. Other Utilities introduce a condition index: a parameter between 0 and 10, or even higher if extended granularity is warranted or needed. The question then is what zero and the maximum mean. The maximum typically refers to the ‘as-new’ condition, even though ‘as-new’ equipment has many teething troubles related to a potentially new design or to manufacturing or material defects. Focusing on aging infrastructure, the real question is what a condition index of (close to) zero means. Is this imminent failure? Likely failure? The CFO will still ask the infamous ‘when?’.

A hazard function expresses the annual likelihood of failure (i.e. not performing up to specified values) as a function of age. This is not to be confused with the failure rate of a certain population. The failure rate is the number of failures divided by the number of units of equipment in any given year. It results from the more physically meaningful hazard rates when these are convolved with the actual age distribution of the total population. Based on a hazard function one can make a replace-retain decision, because it relates to an actual physical unit. Failure rates cannot be trended, nor be used for single-unit replace-retain decisions, as they depend on the total population. Utilities reporting failure rates and their constant trends need to be more self-critical and, for instance, consider the failure rate in an imaginary family… having 4 family members over the age of 78 with no one perishing doesn’t mean that the rate of perishing will stay constant at that favourable figure over the next decade or so.

Footnote 4: Condition as a function of operations and life extension measures (typically periodic maintenance).

Footnote 5: As opposed to analyzing and benchmarking performance separate from expenditures. Multi-dimensional benchmarking additionally takes regional differences, network differences and, most importantly, time into account by averaging several periods (i.e. years).
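The distinction between a hazard function and a population failure rate can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the Weibull form of the hazard, its shape and scale parameters, and the fleet age distribution are hypothetical and are not taken from the chapter. The point is only that the same fixed hazard function produces a different failure rate once the age distribution shifts.

```python
def weibull_hazard(age_years, shape=3.0, scale_years=60.0):
    """Assumed hazard function: approximate annual likelihood of failure at a given age."""
    return (shape / scale_years) * (age_years / scale_years) ** (shape - 1)


def population_failure_rate(units_by_age, hazard=weibull_hazard):
    """Expected failures per unit per year for a fleet.

    units_by_age maps age in years to the number of units of that age,
    so the result is the hazard function weighted by the age distribution.
    """
    total_units = sum(units_by_age.values())
    expected_failures = sum(n * hazard(age) for age, n in units_by_age.items())
    return expected_failures / total_units


def aged(units_by_age, years):
    """The same fleet some years later, with no additions or replacements."""
    return {age + years: n for age, n in units_by_age.items()}


# Hypothetical fleet: number of units at each age.
fleet_today = {10: 200, 25: 300, 40: 400, 55: 100}
rate_today = population_failure_rate(fleet_today)
rate_in_10_years = population_failure_rate(aged(fleet_today, 10))
print(f"failure rate today:       {rate_today:.4f} per unit-year")
print(f"failure rate in 10 years: {rate_in_10_years:.4f} per unit-year")
```

Under these assumptions the fleet failure rate rises as the fleet ages although the hazard function itself has not changed, which is why a flat historical failure rate says little about the coming decade.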

2.1.3 Why this matters

Condition forecasting and quantification are important because:

1) Performance is expected to be stretched to the limits with the current grid, increased loadings and deteriorating equipment conditions, as witnessed by the abundant recent summer-loading-related outages.

2) Uncertainty puts operations in a scrambling mode to obtain replacement dollars for unplanned replacements (typically from planned projects) and ultimately destroys credit ratings and customer perception.

We all have an intuitive sense of risk. Examples that come to mind are typically related to automobiles and children. When safety is an obvious factor we all agree without being too critical, granular or quantified. Even though risk is defined as the probability of an event times its impact, we readily have acceptable and unacceptable classifications at hand. However, when it comes to events that are unprecedented but have extremely high impact (e.g. the flooding of New Orleans during Hurricane Katrina) or events we know are going to happen but are hard to assess (e.g. when is this nice 100 or 120 Hz-humming piece of steel going to give up the ghost?), we tend to be insufficiently critical and reluctant to take pro-active measures (footnote 6).

With most Utility assets there is a clear responsibility and benefit in being critical and open to assessments. The impact side of the equation for a power transformer failure, for instance, is related to congestion costs, non-delivered energy, replacement/repair cost and safety-related liabilities. A power transformer can fail violently, with sharp porcelain debris that cuts through walls, oil fires and spills, not to mention the indirect impact of negative headlines related to such a catastrophic failure. Planned replacements that are well timed avoid all the negative energy, indirect dollars and effort related to emergency replacements. The biggest savings, however, come from improved supply chain management, as procurement can now anticipate the need for units and negotiate discounts for multi-unit advance orders with a strategic vendor. Here is where the large volumes of distribution equipment kick in. Other benefits relate to improved transparency of reinvestment plans, which may be used in a long-term regulatory strategy framework.

Some Utilities are indeed deploying asset condition forecasts in relation to expected system performance under different spending scenarios in an interactive discussion between planning, finance and the regulator. Forecasting and quantification are beneficial to support prudent or, better, optimal spending. Quantification needs to take uncertainty into account, especially when forecasting 5 to 15 years out. It is important to understand the data and algorithms that underlie the quantified hazard functions in order to verify and improve the forecasts with each newly obtained data point (to be discussed in the section on condition assessment). The Utility with the best data and best forecasting algorithms, like the best performers on Wall Street, will have the highest certainty and in aggregate lose the least money on an aging asset base. Note that this is not a general plea for pro-active replacement strategies. It is about getting your arms around costs, risks and performance over a certain period of time, evaluating several scenarios and making deliberate choices.

The next two sections address the questions related to aging asset bases and elaborate on what data to store and which algorithms to use for condition assessment (quantification) and forecasting.

Footnote 6: It will probably take just one major incident, with an obviously rusted bridge collapsing, to trigger a nationwide aging-bridge assessment and management program with a corresponding capital and maintenance budget.
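The definition of risk used in this section, probability of an event times its impact, can be written out as a small sketch. The impact components follow the list given above for a power transformer failure; the class and function names, the dollar amounts and the failure probabilities are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureImpact:
    """Cost components of one failure, in dollars (all values hypothetical)."""
    congestion: float
    non_delivered_energy: float
    replacement_or_repair: float
    safety_liabilities: float

    def total(self) -> float:
        return (self.congestion + self.non_delivered_energy
                + self.replacement_or_repair + self.safety_liabilities)


def annual_risk(failure_probability: float, impact: FailureImpact) -> float:
    """Risk as the annual probability of failure times the impact of failure."""
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must lie between 0 and 1")
    return failure_probability * impact.total()


# Hypothetical power transformer.
transformer_impact = FailureImpact(
    congestion=400_000.0,
    non_delivered_energy=250_000.0,
    replacement_or_repair=1_200_000.0,
    safety_liabilities=150_000.0,
)
risk_now = annual_risk(0.02, transformer_impact)    # assumed hazard today
risk_later = annual_risk(0.08, transformer_impact)  # assumed hazard after further aging
print(f"annual risk now:   ${risk_now:,.0f}")
print(f"annual risk later: ${risk_later:,.0f}")
```

Expressed this way, the engineering input (the forecasted likelihood) and the financial input (the impact) stay separate, so each can be reviewed by the people who own it.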

2.2 Are Utility assets aging?

Yes. All asset bases are aging and this is a good thing. Every year, an asset base gets one year older if we leave aside system expansion, load-growth-related upgrades (upgrades comprise new equipment, as opposed to uprates, where only modifications to existing equipment are performed) and replacements (e.g. poles replaced because of road widening). It is the deterioration component of aging that should worry us. If an asset base does not deteriorate, or we have some kind of proof that deterioration will not occur within the next 20 years or so, we have peace of mind and can focus solely on other Utility issues (e.g. the aging workforce). As long as we are pro-actively replacing equipment at rates of less than 1% per year, we are inherently assuming that the equipment has a useful life exceeding 100 years. This implies we should be accruing the money for emergency replacement up to the assumed lifespan.
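The arithmetic behind the replacement-rate remark is simply the reciprocal of the rate: at a steady pace, the time needed to turn over the whole asset base is one divided by the annual replacement fraction. A minimal sketch follows; the function name and the sample rates other than 1% are illustrative assumptions.

```python
def implied_useful_life(annual_replacement_rate: float) -> float:
    """Years needed to turn over the whole asset base at a steady replacement rate."""
    if annual_replacement_rate <= 0:
        raise ValueError("replacement rate must be positive")
    return 1.0 / annual_replacement_rate


for rate in (0.005, 0.01, 0.025):
    life = implied_useful_life(rate)
    print(f"{rate:.1%} replaced per year -> implied useful life of {life:.0f} years")
```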

2.2.1 Do we accrue money for emergency replacements?

No, because we do not assume an actual lifespan, at least not one that is documented and acted on in terms of a dedicated replacement budget. The general belief is that the variability is large (footnote 7) and one hopes for the longest lifespan. As a matter of fact, ignoring indirect costs of failure and maintenance spending, the optimal replacement age of all assets is at failure. A big secret of operations is that Utility staff keep their fingers crossed and maintain and repair based on experience and engineering judgment. Not to mention the water hosing of critical power transformers during hot summers…

2.2.2 If this is true, do we have a time bomb?

No. Hot summers and other weather events will take out the weak units in a few isolated incidents. There will be a budget to do what is deemed necessary (…) in a one-time effort. These events, however, increase awareness of an aging (and incapable) power grid and of the magnitude of indirect costs. As the number of such events seems to increase, it may be more appropriate to speak of an aging-asset mine field.

Footnote 7: There is also the belief that newer equipment has shorter lifespans than older equipment.

2.3 Condition Assessments

Why is it done the way it is done now? Because it is difficult, your engineers will tell you. Because we have no data: our crews only want to repair equipment without logging the details. Because we have no time to sit and think: there is too much capital work (new construction) and too few resources. Most often all of this is true. The major omission, however, is the creation of a ‘case with inherent proof’. We all know and have experienced that budgets swiftly become available to address issues that have just become painfully apparent through actual failures and related outages. If only these had been predicted and articulated on paper (as distinct from the typical “I told you so” complaints about denied past budget applications), with likelihood and impact stated for verification when an actual event took place, this would make a compelling case for non-discretionary spending in order to avoid adverse events or, at least, mitigate their impacts. It is this single omission that jeopardizes the discussion between execution and funding. As long as there is no compelling case with actual proof, but only strict engineering condition assessments in language unfamiliar to even the best-willing CFO, little money will be dedicated to the case. In the CFO’s defense, it is not hard to imagine a host of other initiatives to be financed that have a better (or better defined) ROI or other measure of bang for the buck. Again, the way to go is forecasted hazard functions (as the engineering side of the risk equation) in combination with impact of failure (as the financial side of the risk equation).

2.3.1 So, what is done?

Much Utility plant is assessed on a regular basis. In fact, all plant is in theory subject to preventive maintenance based on inspections, as even distribution line equipment is eyeballed during walkdowns every 10-15 years. Having said that, substation equipment is typically assessed on a monthly basis and operational data is available through SCADA systems. The assessment includes cross-examination of inspection parameters, operational data, maintenance data and diagnostic measurement results. The cross-examinations consist of comparing the raw data to thresholds or applying them in algorithms published by professional organizations, etc.

There is much attention to power transformers, potentially because the important deterioration mechanisms are thermal and mechanical, which allow better extrapolation and prediction than, for instance, the sudden dielectric phenomena in circuit breakers. Also, condition assessment of power transformers is well reported in the literature, with more commonly accepted standards and thresholds than for other power system devices.

The assessed conditions are typically reported in a so-called risk matrix, representing the condition or health index on one axis and the criticality (or ‘importance’) of each unit on the other axis. The area is then divided into three or more arbitrary zones representing categories dubbed, for example, ‘normal operation’, ‘suspected / increased monitoring’, ‘alarm 1’ (plan replacement or more detailed assessment) and ‘alarm 2’ (take out of service immediately). The problem with these risk matrices is twofold. Firstly, they lack time dependency on both axes. Condition deteriorates over time, and criticality changes with availability of spare parts, topology changes, added customers or load and a host of other influences. The risk matrices indicate immediate problems but are not predictive. Secondly, the zones are arbitrary and coarse-grained (not quantified). It is equally arbitrary whether a red zone is actually red and deserves spending. Again, it is the forecasting and quantification that allow for proper allocation of dollars that, in turn, provide the real benefits and ensure a sustainable electric power supply.
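A risk matrix of the kind described in section 2.3.1 can be sketched in a few lines, which also makes its two weaknesses visible. The 0 to 10 scales, the scoring rule and the cut-off values below are assumptions made purely for illustration; as the text notes, such zone boundaries are arbitrary in practice. Only the zone names come from the chapter.

```python
def risk_zone(condition_index: float, criticality: float) -> str:
    """Place a unit in a risk-matrix zone.

    condition_index: 0 (worst) to 10 ('as new').
    criticality: 0 (unimportant) to 10 (most important).
    The score and the cut-offs are arbitrary illustrations.
    """
    if not (0 <= condition_index <= 10 and 0 <= criticality <= 10):
        raise ValueError("both inputs must lie between 0 and 10")
    score = (10 - condition_index) * criticality  # 0 (no concern) to 100
    if score >= 60:
        return "alarm 2"
    if score >= 35:
        return "alarm 1"
    if score >= 15:
        return "suspected / increased monitoring"
    return "normal operation"


# The same hypothetical unit at two moments: the matrix only reports a
# snapshot and says nothing about when the unit will change zone.
print(risk_zone(condition_index=7, criticality=6))
print(risk_zone(condition_index=5, criticality=8))
```

The function takes no time argument at all, which is the first weakness named above, and moving any cut-off by a few points reclassifies units without any change in the equipment, which is the second.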

2.3.2 And what is not done?

One of the most elegant yet often omitted applications is trending the assessment outcome. Last year we measured this value and it was 80% away from the threshold (for failure or a certain alarm value); now it is only 70% away. Correcting for potential differences in operation and maintenance regimes, this would yield an expected remaining lifetime (all else being equal) of 7 years. Of course, the threshold is not deterministic. If only there were one single indicator that was easy to determine yet 100% predictive... In reality, the measurement result has inaccuracy (related to the repeatability of the measurement) and the threshold has uncertainty (related to restricted knowledge, past data of comparable events, etc.). However, the accuracy of the measurement should be known, and the uncertainty related to the threshold can be diminished; incremental research may deliver more predictive results.

This ‘incremental research’ is not a static, expensive, off-line R&D assignment but can be integrated into day-to-day operations. It requires the same data as used for the assessment itself, augmented with failure data. Equipment failure is a unique moment to learn and improve: track age and operational data relevant to deterioration and, at the least, log the actual failure mode, as a minimal set of parameters to be evaluated post-mortem. The process is depicted in Figure 2.1.

[Figure 2.1 not reproduced. Labels recovered from the flowchart: Asset list (equipment types, position, make, model, year); Failure mode (per equipment type); Cause (per failure mode); Indicator (per cause); Read checks (per indicator); Failure threshold (per indicator); Trigger level (per failure threshold); Mx. orders (per trigger); Maintenance plan; Capacity additions; Reliability replacements; Corrective replacements; Theoretical (physics, design); Historical (failure database, OMS).]

Figure 2.1 Improvement process for integrated condition assessments
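The trending example in section 2.3.2 (a margin of 80% one year, 70% the next, hence roughly 7 years remaining) is a linear extrapolation. The sketch below reproduces that arithmetic under simplifying assumptions: a straight-line trend fitted by least squares, a deterministic threshold and no correction for operating or maintenance regime. The function name and data layout are hypothetical.

```python
def remaining_life_years(readings):
    """Linear extrapolation of margin-to-threshold readings.

    readings: list of (year, margin_percent) pairs, where margin_percent is
    how far the measured value still is from the threshold.
    Returns the latest margin divided by the fitted yearly rate of decline,
    or None if the margin is not shrinking.
    """
    if len(readings) < 2:
        raise ValueError("need at least two readings to trend")
    n = len(readings)
    mean_t = sum(t for t, _ in readings) / n
    mean_m = sum(m for _, m in readings) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in readings)
    if sxx == 0:
        raise ValueError("readings must span more than one point in time")
    # Least-squares slope of margin versus time (percentage points per year).
    slope = sum((t - mean_t) * (m - mean_m) for t, m in readings) / sxx
    if slope >= 0:
        return None
    last_margin = readings[-1][1]
    return last_margin / -slope


# The example from the text: 80% margin last year (year 0), 70% now (year 1).
print(remaining_life_years([(0, 80.0), (1, 70.0)]))
```

With more than two readings the same function averages out some measurement scatter, although it still ignores the threshold uncertainty discussed in the text.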

There are two reasons why such data is not available and such analyses are not made. First, the assessment-related IT tools (i.e. Computerized Maintenance Management Systems) are predominantly used for administrative purposes: work tickets are generated, followed up and closed out. The problem of field crews being unwilling to fill out the relevant data can be solved by providing them with concise pull-down lists of data entries and with training. The real problem is that these tools lack the analytical engine power to load and run queries or any type of algorithm over historic data and selected assets. As such, there is simply no possibility for review and feedback. Secondly, few Utilities have a consolidated database spanning asset registry, operations, maintenance and planning. The Utilities that do want to review and improve spend a handful of resources in an uncoordinated one-time effort to collect the data. After this effort there are typically only a few process adaptations to facilitate a continued effort.

2.4 Driving today’s network into the future

The most useful approach to take responsibility for an adequate future power infrastructure is a repeated and combined fleet assessment and bad-actor approach. Both will be discussed now, including their interrelation.

A fleet assessment requires the regular asset registry data, inspection and maintenance data, and operational data. The condition scores, and the predictions of when alarm values may be reached, can either be recomputed automatically after each newly generated data point or be redone manually after a certain period. This effort consists of reviewing condition data against operational data to detect or refine correlations. Every time a failure happens, or is detected before actual failure (see footnote 8), the failure mode is evaluated. If it is aging related, this data point is included in a revised hazard function computation. If further details are known, such as condition data before failure, this may lead to revised alarm thresholds, inspection and maintenance intervals, etc. It may also provide clues about indicators that are predictive but have not been considered so far.

8 Failure is defined as not being able to perform the specified tasks. As such, a circuit breaker, for instance, has already failed when its contacts are stuck. The implications will be noticed upon a tripping signal. The failed condition needs to be detected before this trigger with a timely condition assessment. Note that with the suggested approach this does not necessarily imply a diagnostic measurement or inspection.

The fleet assessment results in three sets of information: the actual bad actors (or suspected units), individual hazard functions for all units, and a consolidated hazard function for all comparable units together. The bad actors can be short-listed for replacement (with timings based on their hazard functions; one can apply a Life Cycle Cost analysis of certain alternatives with time series of costs, including direct and indirect cost of failure) or they can be put on a watch-list for increased attention (e.g. condition monitoring).

Other measures for consideration are extending useful life by re-rating (uprating, by deploying latent margin without modification, or downrating, by decreasing operational parameters), upgrading (increasing the ratings by a physical modification), refurbishment (replacement of deteriorated components), or improved effectiveness of maintenance. Each measure can be considered for anything from individual units up to entire asset categories. The actual measures and budget should depend on the criticality of each listed unit, as risk is determined not only by the condition of the unit (i.e. the hazard function) but also by the impact of failure. The impact of failure depends, among other things, on the unit's node in the network.
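The criticality weighting described in this section can be illustrated with a minimal sketch. It assumes, purely for illustration, that risk is scored as the product of a unit's annual hazard rate and a monetary impact of failure; the tutorial does not prescribe this formula, and the unit names and figures below are hypothetical.

```python
def rank_by_risk(units):
    """Sort units by descending risk score.

    Assumption: risk = hazard rate x impact of failure. The product form
    is an illustrative choice, not a formula given in the text.
    """
    return sorted(units, key=lambda u: u["hazard"] * u["impact"], reverse=True)


# Hypothetical fleet: hazard in failures per unit-year, impact in dollars.
fleet = [
    {"name": "breaker A", "hazard": 0.10, "impact": 50_000},
    {"name": "breaker B", "hazard": 0.02, "impact": 900_000},
    {"name": "breaker C", "hazard": 0.05, "impact": 200_000},
]

ranked = rank_by_risk(fleet)
for unit in ranked:
    print(unit["name"], unit["hazard"] * unit["impact"])
```

In this made-up fleet the unit in the best condition (breaker B) ranks first because its failure would have the largest impact, which is the point the text makes about criticality.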


The hazard function for the entire fleet can be used to forecast next year’s failures. In this case, we do not know for sure which units are going to fail and exactly when, but we do have a measure for the likely quantity of units failing. This concept is depicted and described in Figure 2.2.

[Figure 2.2: chart titled "Aging Asset Base - computations". Horizontal axis: Age (1 to 96). Left axis: Number of units (0 to 20). Right axis: Hazard rate (0% to 60%). A label "Aging" marks the older part of the age distribution. Annotations on the chart:
- Units prone to failure: actual number of units failing = hazard rate times number of units.
- Failed units will be inserted at the age = 0 column, representing replacement with new equipment. This estimates the capital budget required for replacements as a baseline.
- Failure rates, impact on system reliability, average population age and corresponding maintenance budget will be computed.]

Figure 2.2 Concept of hazard rate and age distribution convolution
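The computation annotated in Figure 2.2 can be sketched as follows. This is a minimal illustration, not the tool used by the author: it assumes one-year age bins, a hazard rate given per bin, and that every failed unit is replaced by a new one at age 0. The function name and the numbers are hypothetical.

```python
def step_year(counts, hazard):
    """Advance a fleet age distribution by one year.

    counts[a]: number of units of age a (the last bin collects the oldest units)
    hazard[a]: probability that a unit of age a fails during the year
    Returns (expected number of failures, next year's age distribution).
    """
    failures = [n * h for n, h in zip(counts, hazard)]
    total_failed = sum(failures)
    survivors = [n - f for n, f in zip(counts, failures)]
    # Failed units re-enter at age 0 as replacements; survivors age by one year.
    next_counts = [total_failed] + survivors[:-1]
    next_counts[-1] += survivors[-1]  # the oldest bin accumulates
    return total_failed, next_counts


# Hypothetical three-bin fleet.
failed, fleet_next = step_year([10, 20, 30], [0.0, 0.1, 0.5])
print(failed, fleet_next)
```

Multiplying the expected number of failures by a unit replacement cost would give the baseline capital budget mentioned in the figure; repeating the step year after year gives the forecast.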

As discussed, this information supports supply chain management, as procurement can now anticipate the need for units and negotiate rebates for multi-unit advance orders with a strategic Vendor. In addition, there is no scramble to find money for replacements, which could otherwise disadvantage planned projects. Most importantly, a Utility can establish the maintenance and replacement costs for all Utility plant going forward as a baseline, including effects on system performance. This baseline can then serve as a reference against which to compare pro-active measures such as replacements, uprates, changes in maintenance and inspection, etc. This quantification and forecasting will support the shift from engineering and standards driven planning to performance based planning for those Utilities that are willing to bridge the gap between execution and funding. Figure 2.3 represents such a baseline for one of these Utilities.


[Figure 2.3: chart titled "Baseline assessment – Equipment capital costs". Vertical axis: Forecasted annual capital costs, in million $ (0 to 35). Asset categories on the horizontal axis: EHV/HV OH lines; EHV/HV transf. (>10MVA); EHV/HV breakers (>69kV); 69kV breakers; EHV/HV buswork; Protection & Control; Distr. subs. transf. (<10MVA); MV buswork; MV breakers (<69kV); MV OH lines; MV UG lines; OH service transf.; Service transf. (Pad mount). Legend: Maximum capital cost; Current spending; Capital cost at sustainable point.]

Figure 2.3 Baseline capital cost assessment result for a selected Utility

For completeness, it must be mentioned that this is just the groundwork for true Asset Management covering aging infrastructure, with maintenance, replacement, monitoring and rerating as strategic options. However, Utilities face further challenges, such as system hardening (being able to withstand storms), lightning and animal induced outages, and vegetation. All of these need to be reviewed, and potential projects and programs need to be defined, each with alternative capital and expense options for optimisation. It is the comprehensive approach of evaluating all T&D issues (quantified and forecasted, dealing with uncertainties) and tying them to system performance and investment levels that will define the successful Utility.


2.5 Biography

Dr. Gerard Cliteur. Gerard is a senior principal consultant with KEMA and specializes in helping utilities improve business performance through management and technical consulting. He has 14 years of experience in equipment condition assessment and valuation, equipment modelling and design, failure analyses and expert witness work, maintenance strategy, and asset management. He is responsible for the initiation and management of large volume projects including consulting, R&D, process improvement, and technical audits. Dr. Cliteur is a recognized expert in the interpretation of inspection, maintenance and operational data in order to assess equipment health, O&M procedures, capital project planning, budgeting, and project prioritization in order to minimize cost, achieve performance targets, and proactively manage risk. He has published more than thirty technical papers in these areas, and is a regular instructor for international courses and seminars. Prior to joining KEMA, he worked for six years at Toshiba Corporation in Japan, developing Ultra High Voltage switchgear, and he has worked for Endesa in Spain. With KEMA, he has performed consulting assignments for major utilities including Tennet (The Netherlands), El Paso (USA), CLP Power (Hong Kong), Public Power Corporation (Greece), Dhofar Power Company (PSE&G subsidiary in Oman), Tenaga National Berhad (Malaysia), National Hydro Power Company (India), Cinergy (USA), and many others.

Dr. Cliteur holds an M.Sc. in electrical engineering from Eindhoven University of Technology (The Netherlands) and a Ph.D. from Kanazawa University (Japan), and has completed several executive training programs on business management and finance. He is an IEEE member and chairs the Asset Management Working Group.


3 Introduction to maintenance

Dr. J. Endrenyi, Fellow IEEE Scientist Emeritus, Kinectrics Inc.

Toronto, Ontario, Canada

Abstract – One goal of power system operators and asset managers is, now more than ever, to minimize system operating costs and ensure that the system is running most economically. An important operating cost is the cost of maintenance. Those making decisions about equipment maintenance must have a clear understanding about what maintenance can achieve, what maintenance methods are available and what are the assumptions used in the various approaches. This presentation describes the difference between regular and as-needed maintenance, the effect of maintenance that does not achieve as-new conditions, and empirical and mathematical maintenance models. Probabilistic mathematical methods and Reliability Centered Maintenance are highlighted as two promising approaches in the future.

3.1 What is maintenance?

Maintenance, according to definitions published in an IEEE Task Force Report [1], is a form of restoration of a device where restoration is “an activity which improves the condition of a device”. Specifically, maintenance is a “restoration wherein an unfailed device has, from time to time, its deterioration arrested, reduced or eliminated”. This contrasts with the activity of repair, which is a “restoration wherein a failed device is returned to operable condition.” The quoted definitions are reprinted in Reference 2.

The purpose of maintenance, as generally perceived, is to increase the lifetime of a device and extend its time between failures, by restoring it to a “younger” condition. This is a worthwhile goal, because it would help to increase component and system reliability. Electric utilities have always relied on maintenance programs to keep their equipment in good working condition. It must be pointed out, however, that maintenance is just one of the tools for increasing reliability. Others include adding more generation, increasing transmission redundancy and installing more reliable components. At a time, however, when these approaches are heavily constrained, electric utilities are forced to get the most out of the devices they already own, through more effective operating policies, including more effective maintenance programs.

An important relation can be observed in the above definition of maintenance: the concept is linked with the process of equipment deterioration. It is obvious that a sequence of ever-increasing deterioration would lead to failure. Maintenance is carried out in the hope that by slowing deterioration the (mean) time to failure can be made longer. Asset managers might be willing to pay for an increase in relatively inexpensive maintenance activities if thereby the number of costly repairs following failures can be reduced. But it is clear that the sum of the two expenses will reach a point of optimum where it is the lowest. It is the task of maintenance planners to identify this point and install maintenance policies where the minimal cost is at least approximated.

Not every failure is the consequence of deterioration. Devices can fail for many reasons. Some failures are caused by external events such as weather phenomena (lightning, ice, wind, heat), or damage inflicted by animals or humans. The device in question sees these as random phenomena, and no oiling, adjusting, cleaning or tuning will make any difference in the frequency of such failures. These failures are called external failures, as opposed to failures intrinsic to the device itself, being the consequences of deterioration and ageing, which are internal failures. The times to internal failures can be controlled by maintenance performed on the device itself. Such maintenance is called internal maintenance, or simply maintenance, if this does not cause any confusion.

The rates of external failures can be reduced only by changes in design, such as the erection of barriers and fences, improved shielding of transmission lines against lightning, or burying the circuits underground. In some cases one can speak of external maintenance; for example, when trees in the vicinity of overhead lines are regularly trimmed to avoid failures due to contact with tree branches. Note that external maintenance is performed outside the device, not on the device. This presentation will not be concerned with external failures and maintenance.

Maintenance is an important part of asset management. As deterioration increases, the asset value (condition) of a device decreases. The connection between asset value, time, maintenance and reliability is shown in Figure 3.1 [3]. The curves in the figure are called life curves. Since they are derived from probabilistic information, the times shown represent means.

Figure 3.1 Life curves

Figure 3.1 illustrates conditions for three maintenance policies, including Policy 0 where no maintenance is performed at all. If failure is defined as the asset condition where asset value becomes zero, and lifetime, as the mean time it takes to reach this condition, the extensions of mean life T0 to T1 when Policy 1 is applied instead of Policy 0, and T1 to T2 when Policy 1 is replaced by Policy 2, can be clearly seen in the figure. So are the changes in the asset condition (value) at any time T. Note that both failure and lifetime can be defined differently; e.g., failure could be tied to any asset condition which is deemed unacceptable.

As far as reliability is concerned (measured in this case by the mean time to failure), Policy 2 is superior to Policy 1. Maintenance clearly affects component and system reliability. But maintenance has its own costs, and when comparing policies, this has to be taken into account. The increasing costs of carrying out maintenance more frequently must be balanced against the gains resulting from improved reliability. When costs are also considered, Policy 2 in Figure 3.1 may be very costly and, therefore, may not be superior to Policy 1.


3.2 Review of maintenance policies

Maintenance has been performed for a long time on a great variety of devices and machines, and over the decades many routines have been devised for the purpose. Originally, maintenance policies were chosen on the basis of long-time experience and, later, by following the recommendations of manuals issued by manufacturers. In most cases, maintenance has been carried out at regular, fixed intervals. This practice is also called scheduled maintenance (see footnote 9), and it is still the maintenance policy most often used.

3.2.1 Improvement vs. replacement

The simplest representation of scheduled maintenance in terms of life curves is shown in Figure 3.2a. Maintenance is commenced at equally spaced times TM, 2TM, . . . (scheduled maintenance). The diagram is constructed on the assumption that maintenance would invariably result in as-new conditions, an assumption frequently made or tacitly implied. From Figure 3.2a it appears that the device would never fail, except for the fact that life processes are probabilistic and failure can occur, with low probability, at every point of a deterioration curve. Neither the curves in Figure 3.1 nor those in the various models in Figure 3.2 take account of this possibility: these representations are inherently deterministic.

If maintenance invariably resulted in as-new conditions, it would have the same effect as replacing the device every time with an identical new component. Only costs would decide which one to choose, and perhaps, nowadays, replacement would win more and more often. However, the assumption is not realistic. Maintenance is not carried out to regain 100% of the asset’s value but only a fraction of it; in most cases, this makes maintenance cheaper than replacement. If it is assumed that maintenance restores the asset to 90% of the condition level reached at the previous maintenance, the resulting life curve will run as shown in Figure 3.2b. Maintenance is still triggered by reaching its due time but is terminated at the predefined level (dotted line). Now failure would occur even in the deterministic process.
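The deterministic "90% rule" can be illustrated with a small sketch. The shape of the deterioration curve is not specified in the text, so the sketch assumes, purely for illustration, that the asset condition falls linearly by a fixed amount during each maintenance interval; the function name, the loss per interval and the interval length are hypothetical.

```python
def deterministic_failure_time(restore_fraction=0.9, loss_per_interval=0.3,
                               interval=1.0, max_intervals=10_000):
    """Time at which the asset condition reaches zero, or None if it never does.

    Assumptions (not from the source): condition starts at 1.0 (as new),
    drops linearly by loss_per_interval during each interval, and each
    scheduled maintenance restores it to restore_fraction times the level
    reached at the previous maintenance.
    """
    restored = 1.0  # level after the most recent maintenance
    elapsed = 0.0
    for _ in range(max_intervals):
        if restored - loss_per_interval <= 0.0:
            # Condition hits zero part-way through this interval.
            return elapsed + interval * restored / loss_per_interval
        elapsed += interval
        restored *= restore_fraction
    return None  # e.g. perfect maintenance: no failure in the deterministic picture


print(deterministic_failure_time())                       # imperfect maintenance
print(deterministic_failure_time(restore_fraction=1.0))   # "perfect" maintenance
```

With these assumed numbers the imperfect policy leads to failure during the interval after the twelfth maintenance, whereas the "perfect" policy never fails, which mirrors the contrast between Figures 3.2b and 3.2a.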

9 This presentation follows the terminology proposed in Reference 1. Other terms exist and are referred to in the terminology. The IEEE Task Force which approved the proposed terms saw no reason why any of the other terms should be preferred to those recommended.


Figure 3.2 Life curves for various maintenance approaches: (a) “perfect” regular maintenance, (b) imperfect maintenance, (c) as-needed maintenance. All ordinates are “Asset Conditions”.

A large number of replacement policies are described in the literature; in fact, most of the literature concerns itself with replacement only, neglecting the possibility that maintenance may result in smaller improvements at smaller costs. Maintenance policies involving limited condition improvement are mostly based on experience, and such empirical approaches cannot predict and compare changes in reliability as a result of applying various maintenance policies.

3.2.2 Regular vs. as-needed maintenance

In the last decade or so, a growing number of industrial operators saw merit in freeing up the regularity of maintenance intervals in favor of performing maintenance only when needed (see footnote 10). This approach obviously offers savings, but it also requires new expenses for routines to identify times for maintenance. To find out when maintenance is needed, condition monitoring (periodic or continuous) and appropriate criteria for triggering action are required. Development of a life curve for this approach is shown in Figure 3.2c.

10 Actually, as-needed maintenance has been practiced for centuries. The bearings on the wheels of horse-drawn carriages were greased only when the driver noticed that they were running dry.


The lower dotted line represents the outcome of condition monitoring; it “triggers” maintenance as soon as the component deterioration curves (the curved lines parallel to the appropriate sections of the M0 curve) reach it. When the resulting improvements touch the upper dotted line, maintenance is completed. It seems that maintenance frequency increases at old age and so does (assuming the 90% rule) the “depth” of maintenance: at the beginning, minor maintenance may suffice, but later on, major maintenance or even overhaul may be required. The lines for policies 1 and 2 in Figure 3.1 run between the two dotted lines, obtained by some arbitrary rule, and provide a smooth representation of the process.

3.2.3 Empirical vs. mathematical approaches

Many empirical models are simple and the rules involved are easy to understand. But they are not very flexible and the benefits obtained from their application cannot be clearly identified. Also, cost and reliability optimization cannot be carried out.

Notwithstanding the above, some empirical approaches developed in the last 20 years are far from simple, but their logic is very clear and they hold the promise of being used more generally. One such approach is Reliability Centered Maintenance (RCM), first proposed about 20 years ago [4,5]. It is based on condition monitoring and, therefore, does not follow rigid maintenance schedules. It includes failure cause analysis and an investigation of operating needs and priorities. From this information, it selects the critical components in a system (those that are dominant contributors to system failure or to the resulting financial loss) and indicates more stringent maintenance policies for these components; in fact, it assists in deciding where the next dollar budgeted for maintenance should go. An important advantage of the RCM approach is that it also considers external, non deterioration-originated failures (e.g., those caused by weather, animals, humans).

Example

Consider the case of overhead lines in distribution systems. According to fault and interruption statistics in the UK, the percentages of failure causes of such lines are the following [6] (since only the dominant failure causes are shown, the percentages are rounded and do not add up to 100):

Weather 55%

Damage from animals 5%

Human damage 3%

Trees 11%

Ageing 14%

The conclusion appears to be that the maintenance budget for overhead lines should be divided almost equally between internal and external programs. The external budget would be spent mostly on tree trimming and some design changes, such as the erection of barriers and fences.

The RCM approach is discussed in more detail in Chapter 4 by Dr. L. Bertling.

Maintenance policies based on mathematical models are much more flexible than heuristic policies. Mathematical models can incorporate a wide variety of assumptions and constraints, but in the process they can become quite complex. A great advantage of the mathematical approach is that the outcomes can be optimized. Optimization with regard to changes in some basic model parameter can be carried out for maximal reliability or minimal costs.

Mathematical models can be deterministic or probabilistic. Since maintenance models are used for predicting the effects of maintenance in the future, probabilistic methods are more appropriate than deterministic ones, even if the price for their use is increased complexity and a consequent loss in transparency. For these reasons, the use of such methods is spreading only slowly.

The simpler mathematical models are still based on fixed maintenance intervals (scheduled maintenance), and optimization will be carried out, in most cases, through sensitivity analysis, by varying, say, the frequency of maintenance. More complex models [7,8,9] incorporate the idea of condition monitoring where decisions about the timing and amount of maintenance are dependent on the actual condition of the device (predictive maintenance). Such policies can be optimized with respect to any of the model parameters, such as the frequency of inspections.

3.2.4 A simple deterministic model

This example is based on one in Reference [10]. Consider a device that breaks down from time to time. To reduce the number of breakdowns, inspections are made n times a year, during which minor modifications may be carried out. The optimal number of inspections, which minimizes the total yearly outage time consisting of the repair times after failures and the inspection durations, is to be determined.

Let the failure rate be λ(n) occurrences per year, where λ is independent of time but is a function of the inspection frequency. Therefore, the total downtime T(n) is also a function of n. Further, let it be assumed that

$$\lambda(n) = \frac{k}{n+1} \qquad (3.1)$$

where the numerical value of k indicates the failure frequency when no inspections are made.

If t_r is the average duration of one repair and t_i the average duration of one inspection, then

$$T(n) = \lambda(n)\, t_r + n\, t_i \qquad (3.2)$$

Substituting (3.1), taking the derivative of T(n) with respect to n, and equating it with zero,

$$\frac{dT(n)}{dn} = -\frac{k\, t_r}{(n+1)^2} + t_i = 0 \qquad (3.3)$$

From the second equality, the optimal value of n becomes

$$n_{opt} = \left(\frac{k\, t_r}{t_i}\right)^{1/2} - 1 \qquad (3.4)$$

With k = 5 per yr, t_r = 6 h and t_i = 0.6 h, one obtains n_opt = 6.07 per yr; that is, the optimal inspection frequency is about one every two months. The total outage time is T(6) = 7.9 h/yr, whereas without inspections it would be T(0) = 30 h/yr.
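The numbers in this example can be checked with a few lines of code. The sketch below simply evaluates the failure-rate assumption λ(n) = k/(n + 1), the downtime formula and the optimum derived from them; the function names are illustrative and not taken from the reference.

```python
import math


def failure_rate(n, k):
    """Failures per year with n inspections per year: lambda(n) = k / (n + 1)."""
    return k / (n + 1)


def downtime(n, k, t_r, t_i):
    """Total yearly outage time T(n) = lambda(n) * t_r + n * t_i, in hours."""
    return failure_rate(n, k) * t_r + n * t_i


def optimal_inspections(k, t_r, t_i):
    """Inspection frequency that minimizes T(n): n_opt = sqrt(k * t_r / t_i) - 1."""
    return math.sqrt(k * t_r / t_i) - 1


k, t_r, t_i = 5, 6.0, 0.6  # values used in the text
print(round(optimal_inspections(k, t_r, t_i), 2))  # about 6.07 inspections per year
print(round(downtime(6, k, t_r, t_i), 1))          # about 7.9 h/yr
print(downtime(0, k, t_r, t_i))                    # 30 h/yr without inspections
```

Evaluating `downtime` for neighbouring integer values of n (for example 5 and 7) is a quick way to confirm that the minimum is flat near the optimum, so rounding n_opt to six inspections a year costs very little.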


3.3 Linking reliability and maintenance: a probabilistic approach

As already mentioned, one of the tasks of maintenance studies is cost optimization, where the costs include both the maintenance and repair costs. Repairs are assumed to be done, of course, after each failure. If it is decided to do maintenance more often or to more exacting standards, its costs will increase; as a result, however, lower failure frequency and associated repair costs can be expected. The goal is to balance these expenditures. To do so, a model is needed which can calculate the effect of changes in maintenance parameters on the various reliability parameters. In other words, a model which can provide a fast answer to questions like “what is the effect on the mean time to failure if the maintenance frequency is raised by 20%?”

As one can see from the “Simple deterministic model” above, optimization is easily included in mathematical models. On the other hand, modelling the relation between maintenance (inspection) and reliability (failure rate) is still a problem. In the example above, this relation is given by (3.1). It should be observed that this relation is assumed, and not a result of calculations. What is missing is a mathematical model where this relation is part of the model itself, and the effect of maintenance on reliability is part of the solution.

In the following, probabilistic models will be presented for a device without and with maintenance.

3.3.1 Basic models

A simple failure-repair process for a deteriorating device is shown in Figure 3.3. The various states in the diagram are explained in the legend. The deterioration process is represented by a sequence of stages of increasing wear, finally leading to equipment failure. Deterioration is, of course, a continuous process in time, and only for easier modeling is it considered to occur in discrete steps.

Figure 3.3 State diagram including stages of deterioration (D1, D2, . . .). F: failure state.

The number of deterioration stages may vary, and so do their definitions. In most applications, the stages are defined through physical signs such as markers on wear or corrosion. This, of course, makes periodic inspections necessary to determine the stage of deterioration the device has reached. The mean times of the stages are usually uneven, and are selected from performance data or by judgment based on experience.

The process in Figure 3.3 can be readily represented by a probabilistic mathematical model. If the rates of transitions shown between the states can be assumed time-independent, the mathematical model describing such a process is known as a Markov model. Well-known techniques exist for the solution of these models [11,12,13]. It can be proven that in a Markov model the times of transitions between states are exponentially distributed. This property and the constant-rate property follow from each other.


One way of incorporating maintenance into the model in Figure 3.3 is shown in Figure 3.4. It is immediately clear that in this arrangement there is no assumption made that maintenance would produce “new” conditions; in fact, the effect of maintenance can now be limited: it is assumed that it will improve the device’s condition to that which existed in the previous stage of deterioration [14]. This contrasts with many strategies described in the literature where maintenance is considered equivalent to replacement.

If a failure has external causes (e.g., inclement weather), there is a single step from the working to the failed state. Now, the constant failure-rate assumption leads to the result that maintenance cannot produce any improvement because the chances of failure in any future time interval are the same with or without maintenance (a property of the exponential distribution). That maintenance will not do any good in such cases agrees with experience as expressed by the oft-quoted piece of wisdom: “If it ain’t broke, don’t fix it!” The situation is quite different for deterioration processes where the times from new conditions to failure are not exponentially distributed even if the times between subsequent stages of deterioration are (this can be rigorously proven). In such a process, maintenance will bring about improvement, and one can conclude that if failures are the consequences of ageing, maintenance has an important role to play.
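The contrast between the two cases can be made concrete with the hazard rate of the time to failure. The sketch below assumes, for illustration only, that all deterioration stages have the same exponential rate, in which case the time from new to failure follows an Erlang distribution; the text notes that stage durations are usually uneven in practice. The function name is hypothetical.

```python
import math


def erlang_hazard(stages, rate, t):
    """Hazard rate at time t when failure requires passing `stages`
    exponential stages in sequence, each with the same transition rate.

    stages = 1 is the single-step (external failure) case.
    """
    x = rate * t
    density = rate * x ** (stages - 1) * math.exp(-x) / math.factorial(stages - 1)
    survival = math.exp(-x) * sum(x ** j / math.factorial(j) for j in range(stages))
    return density / survival


for t in (0.5, 1.0, 2.0, 4.0):
    print(t, erlang_hazard(1, 1.0, t), round(erlang_hazard(3, 1.0, t), 3))
```

With a single stage the hazard is constant, so returning the device to an earlier state changes nothing. With three stages the hazard grows with time, so moving the device back one stage lowers its chance of failing in the next interval.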

Figure 3.4 State diagram including three deterioration stages and the corresponding maintenance states (F: failure state)

In Figure 3.4, the dotted-line transitions to and from state M1 indicate that maintenance while in state D1 should really not be performed because it would lead back to state D1 and, therefore, it would be meaningless. State M1 could be omitted if the maintainer knew that the deterioration process was still in its first stage and, therefore, no maintenance was necessary. Otherwise, maintenance must be carried out regularly from the beginning, and state M1 must be part of the diagram.

It should be observed that this and similar models solve the problem of linking maintenance and reliability. Upon changing any of the maintenance parameters, the effect on reliability (say, the mean time to failure) can be readily computed.
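A minimal sketch of such a computation is given below. It is one reading of the structure in Figure 3.4, not the author's implementation: it assumes the same deterioration rate for every stage, the same rate of starting maintenance from every stage, and a single maintenance completion rate; all rate values and function names are hypothetical. The mean time to failure is obtained by solving the standard linear equations for mean first-passage times of a Markov chain with an absorbing failure state.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mean_times_to_failure(lam, gamma, mu, stages=3):
    """Mean times to reach F from D1..Dk (assumed rates, per unit time).

    lam:   rate of moving to the next deterioration stage (last stage -> F)
    gamma: rate of starting maintenance from any stage Di
    mu:    rate of completing maintenance (Mi -> D(i-1); M1 -> D1)
    """
    size = 2 * stages  # unknowns: D1..Dk followed by M1..Mk
    a = [[0.0] * size for _ in range(size)]
    b = [1.0] * size
    for i in range(stages):
        d, m = i, stages + i
        a[d][d] = lam + gamma           # total rate of leaving Di
        if i + 1 < stages:
            a[d][i + 1] -= lam          # Di -> D(i+1); from the last stage to F
        a[d][m] -= gamma                # Di -> Mi
        a[m][m] = mu                    # total rate of leaving Mi
        a[m][max(i - 1, 0)] -= mu       # Mi -> previous deterioration stage
    return solve_linear(a, b)[:stages]


base = mean_times_to_failure(lam=1.0, gamma=1.0, mu=10.0)[0]
more = mean_times_to_failure(lam=1.0, gamma=1.2, mu=10.0)[0]
print(base, more)  # mean time to failure from D1, before and after a 20% increase
```

The last two lines address the kind of question posed in Section 3.3: they compare the mean time to failure from the new condition before and after the maintenance rate is raised by 20%.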

A further comparison of the model in Figure 3.4 and similar deterministic models is given in the Appendix.

3.3.2 The Asset Management Planner (AMP): a practical model

A more sophisticated model [15] based on the scheme in Figure 3.4 and tested in practical applications is shown in Figure 3.5. A program, called Asset Management Planner (AMP), using this model, was developed by Kinectrics Inc. in Toronto, Canada. It computes the probabilities, frequencies and mean durations of the states of a component exposed to deterioration but undergoing regular inspections and receiving maintenance on an as-needed basis.

Without maintenance, the path from the onset (entering D1) would run through the stages of deterioration to the failure state F. With maintenance, this straight path to failure is regularly deflected by inspection and maintenance. According to the diagram, in all stages of deterioration regular inspections take place (I1, I2, I3), possibly several times, and at the end of each inspection a decision is made to continue with minor (M) or major (MM) maintenance, or forgo maintenance and return the device to the state of deterioration it was in before the inspection. Another point of decision is after minor maintenance when, if the results are considered unsatisfactory, major maintenance can be initiated.

Figure 3.5 The AMP model

The result of all maintenance activities is expected to be a single-step improvement in the deterioration chain, following the principle shown in Figure 3.4. However, allowances are made for instances when no improvement is achieved or even when some damage is done during maintenance, the latter resulting in the next stage of deterioration. The choice probabilities (at the points of decision making) and the probabilities associated with the various possible outcomes are based on user input and are estimated from historical records.

Another technique, developed for computing the so-called first passage times (FPT) between states [16], provides the average times of first reaching any state from any other state. Although not shown, the technique is implemented in the AMP model. If the end-state is F, the FPTs are the mean remaining lifetimes from any of the initiating states. This information is necessary for constructing life curves.
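The computation used in [16] is not reproduced here. As a general sketch, and assuming the model can be treated as a continuous-time Markov chain with transition rates $q_{ik}$ and total departure rate $q_i = \sum_{k \neq i} q_{ik}$ (notation introduced only for this illustration), the mean first passage times $m_{ij}$ from state $i$ to a target state $j$ satisfy a set of linear equations:

```latex
m_{jj} = 0, \qquad
m_{ij} = \frac{1}{q_i} + \sum_{k \neq i,\, k \neq j} \frac{q_{ik}}{q_i}\, m_{kj}
\quad (i \neq j).
```

The first term is the mean time spent in state $i$ before leaving it, and the sum weights the remaining passage time from each possible next state by the probability of moving there. Choosing $j = F$ gives the mean remaining lifetimes mentioned above.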

It can be observed that the AMP model can handle both scheduled (regular) and predictive (as-needed) maintenance policies. Figure 3.4 shows an arrangement for scheduled maintenance: the rate of starting maintenance is always the same. (This rate is the reciprocal of the mean time to maintenance; the actual times constitute a random variable.) The equivalent in Figure 3.5 would be the removal of the inspection states. The scheme in Figure 3.5, as shown, also takes care of as-needed maintenance. Condition monitoring is done through regular inspections, and if it is found that no maintenance is needed, the device is returned to the “main line” without being sent for maintenance. Maintenance is carried out only when needed.

For further elaboration and detailed applications, see Chapter 6 by G. Anders.


3.3.3 Generation of life curves

Life curves have been discussed in Section 3.2, and the present process starts out from the diagram in Figure 3.2c. Now, however, the generation of a specific life curve that accommodates the conditions in Figure 3.4 and Figure 3.5 will be discussed. The process occurs in several steps, as explained below with the help of Figure 3.6.

• First, the borderlines between the deterioration stages D1, D2 and D3, expressed in terms of percentages of equipment condition, are marked on the vertical axis and entered into the program.

• Next, AMP/FPT calculations are carried out by the program, to determine the first passage times between states D1 and D2, D1 and D3, and D1 and F. These are entered on the time-axis of Figure 3.6. By using the AMP model, the effects of maintenance are already incorporated.

• If there were no maintenance, the FPTs D1D2*, D1D3* and D1F* would be obtained and the corresponding life curve would run as shown. (This is identical to the curves M0 in Figure 3.2.) With maintenance, the life curve is no longer a smooth line but a rugged one, indicating the deterioration between maintenances and the improvements caused by them. A crude realization of the process is shown in Figure 3.6. Note that the placement of the dotted lines ensures that maintenances out of the state D2 take the device into D1, and those out of D3 into D2, as prescribed in Figure 3.4. Some niceties in Figure 3.5 are not considered.

• The equivalent smooth life curve is drawn by observing the following simple rules. At time 0 it must be at 100%, at D1F it must be 0. At the remaining two ordinates, by arbitrary decision, it should be near the lower quarter of the respective domains. (In Figure 3.6, the midpoints are used, an earlier convention.)

Figure 3.6 Development of life curves without maintenance (a), and with maintenance (b)

3.4 Conclusions

In this review, a survey is offered of the various maintenance methods available to operators. The methods range from the simplest, “follow the manual” types to detailed probabilistic approaches. To get the most out of maintenance, one would have to select a mathematical model where optimization is possible: optimization for highest reliability or lowest operating costs. There can be little doubt that such probabilistic models would be the best tools for identifying policies that provide the highest cost savings.

Another choice, of which operators are becoming more and more aware, is to apply a maintenance policy based not on a rigid schedule but on the “as needed” principle. This can be implemented with or without mathematical models; an example of the latter is the RCM approach. RCM, steadily gaining in popularity, is based on an analysis of failure causes and past performance, and helps to decide where to put the next dollar budgeted for maintenance. The method is good for comparing policies, but not for true optimization.

In today’s competitive environment, cost optimization is becoming even more important. This is particularly true for transmission and distribution equipment where the maintenance choices described in this review fully apply. While the maintenance times of generating units may be determined by different considerations, many of the basic principles discussed in this Chapter will still have relevance.

3.5 References

[1] IEEE/PES Task Force, “The Present Status of Maintenance Strategies and the Impact of Maintenance on Reliability”, IEEE Trans. on Power Systems, 16, 4, pp. 638-646, November 2001.
[2] IEEE Tutorial on Electric Delivery System Reliability Evaluation, 05TP175, Chapter 5, “Reliability and Maintenance”, by J. Endrenyi. IEEE/PES General Meeting, San Francisco, CA, 2005.
[3] Anders, G.J. and Endrenyi, J., “Using Life Curves in the Management of Equipment Maintenance”, Proceedings of the 7th PMAPS Conference, Naples, 2002.
[4] Smith, A.M., Reliability-Centered Maintenance. McGraw-Hill, Inc., New York, 1993.
[5] Moubray, J., Reliability-centered maintenance. Industrial Press Inc., New York, 1992.
[6] Bertling, L., Reliability Centred Maintenance for Electric Power Distribution Systems, PhD thesis, Royal Institute of Technology (KTH), Stockholm, 2002.
[7] Canfield, R.V., “Cost Optimization of Periodic Preventive Maintenance”, IEEE Trans. on Reliability, 35, 1, pp. 78-81, April 1986.
[8] Anders, G.J. et al., “Maintenance Planning Based on Probabilistic Modeling of Aging in Rotating Machines”, CIGRE Conference Paper No. 11-309, Paris, 1992.
[9] Reichman, B. et al., “Application of a Maintenance Planning Model for Rotating Machines”, CIGRE Conference Paper No. 11-204, Paris, 1994.
[10] Jardine, A.K.S., Maintenance, Replacement and Reliability. Pitman Publishing, London, 1973.
[11] Endrenyi, J., Reliability Modeling in Electric Power Systems. J. Wiley & Sons, Chichester, 1978.
[12] Anders, G.J., Probability Concepts in Electric Power Systems. J. Wiley & Sons, New York, 1990.
[13] Billinton, R. and Allan, R.N., Reliability Evaluation of Engineering Systems, Second Edition. Plenum Press, London, 1992.
[14] Sim, S.H. and Endrenyi, J., “Optimal Preventive Maintenance with Repair”, IEEE Trans. on Reliability, 37, 1, pp. 92-96, April 1988.
[15] Endrenyi, J., Anders, G.J. and Leite da Silva, A.M., “Probabilistic Evaluation of the Effect of Maintenance on Reliability - An Application”, IEEE Trans. on Power Systems, 13, 2, pp. 576-583, May 1998.
[16] Anders, G.J. and Leite da Silva, A.M., “Cost Related Reliability Measures for Power System Equipment”, IEEE Trans. on Power Systems, 15, 2, pp. 654-660, May 2000.


3.6 Appendix: Deterministic or probabilistic models

In this Appendix, a short comparison is made between a deterministic and a probabilistic approach describing the same situation, and a potential weakness of the deterministic approach is pointed out. Consider a deterioration-maintenance process similar to that shown in Figure 3.4. A deterministic equivalent is presented in Figure 3.7(a). It is assumed that without maintenance the device would fail after (exactly) 10 years, the (rigid) maintenance interval is 3 years, and the effect of maintenance is a 1-year improvement in deterioration. Deterioration and maintenance are still linked through an algorithm based on the diagram; this algorithm constitutes a deterministic mathematical model. It can be seen that the time to failure now becomes 14 years as a result of the four maintenances carried out in the interval.

Figure 3.7 Maintenance every 3 years, resulting in (a) a 1-year improvement; (b) a 3-year improvement if total wear is 6 years or more, otherwise as in (a). M: maintenance; MM: overhaul; F: failure.

While it is conceivable that the improvement due to a maintenance activity is less than the deterioration between two consecutive maintenances, especially early in the life of a device when only minor maintenances are performed, later the effect of maintenance should equal or exceed the deterioration occurring between maintenances. This can be ensured by scheduling overhauls (major maintenances) beyond a given stage of deterioration. If, for instance, in the above example overhaul is required instead of maintenance after the deterioration stage of 6 years, and if the effect of overhaul is a 3-year improvement in deterioration, the diagram will change to that shown in Figure 3.7 (b). Note that now the expected time to failure is infinite.
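The two deterministic cases can be traced with a few lines of code. The sketch below is only an illustration of the bookkeeping described above: the function name, the year-by-year stepping and the finite horizon used to stand in for "never fails" are assumptions made for this example.

```python
def time_to_failure(life=10, interval=3, gain=1,
                    overhaul_at=None, overhaul_gain=3, horizon=200):
    """Year of failure for the deterministic model, or None if no failure
    occurs within the (assumed) horizon."""
    wear = 0  # accumulated deterioration, in years
    for year in range(1, horizon + 1):
        wear += 1
        if wear >= life:
            return year
        if year % interval == 0:
            # Overhaul replaces ordinary maintenance once wear is large enough.
            if overhaul_at is not None and wear >= overhaul_at:
                wear -= overhaul_gain
            else:
                wear -= gain
    return None


print(time_to_failure())               # case (a): 1-year improvement every 3 years
print(time_to_failure(overhaul_at=6))  # case (b): 3-year overhaul when wear >= 6 years
```

Case (a) reproduces the 14-year time to failure quoted in the text. In case (b) the wear settles into a cycle between 4 and 7 years, so the function returns None for any horizon.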

The problem with this deterministic representation (and many others) becomes obvious in the last example. It is easy to visualize that if the improvement resulting from maintenance is less than the maintenance interval, the process will tend “to the right” and end in failure. However, this can be considered an unlikely case. Every time the improvement equals the maintenance interval, the process will oscillate within a given range, as in Figure 3.7 (b), and if it exceeds the maintenance interval, the process will move “to the left”. In both latter cases the implication is that failure will never occur. This is a false conclusion and is due to the assumptions that (a) failures cannot occur during the various stages of deterioration, and (b) all quantities involved have fixed values. If variability is allowed and in no state is the probability of failure assumed to be zero, as in a probabilistic model, the failure state will, sooner or later, always be reached. This agrees with experience and can be rigorously proven.
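A short sketch of why the failure state is eventually reached, under one simplifying assumption introduced here for illustration: whatever the state at the start of a maintenance interval, the probability of failing before the next maintenance is at least some fixed $p > 0$. The probability of surviving $n$ consecutive intervals is then bounded by a geometric term:

```latex
P(\text{no failure in the first } n \text{ intervals}) \le (1 - p)^{n}
\;\longrightarrow\; 0 \quad \text{as } n \to \infty .
```

With a finite number of states and a non-zero failure probability in each, such a lower bound $p$ exists, which is the essence of the argument; the rigorous proof referred to in the text is not reproduced here.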


3.7 Biography

John Endrenyi (M’59, SM’76, F’87, LF’94) is Principal Scientist Emeritus at Kinectrics, Toronto (formerly Ontario Hydro Technologies), and retired Adjunct Professor at the University of Toronto. He received a Diploma of Electrical Engineering from the Technical University of Budapest, the MASc degree from the University of Waterloo (Ontario) and the Ph.D. from the University of Toronto. He joined Ontario Hydro’s Research Division in 1959 where he was first engaged in station and transmission line grounding studies and, later, in the development of probabilistic models for power system reliability. He has contributed to the methodology of power system reliability and maintenance through numerous papers, seminars, tutorials, a book, and participation in several IEEE, EPRI, CIGRE and IEC committees. In 2004, he received the biennial award of the PMAPS (Probability Methods Applied to Power Systems) International Society. Dr. Endrenyi is a registered Professional Engineer in the Province of Ontario. (e-mail: [email protected])


4 RCM and its extension into a quantitative approach RCAM

Dr. Lina Bertling, Member IEEE
KTH (Royal Institute of Technology), Stockholm, Sweden

Abstract - Reliability-centred maintenance (RCM) is a qualitative systematic approach to organizing maintenance. It originates from the need to develop more efficient approaches for planning preventive maintenance without lowering the level of reliability. The main feature of RCM is its focus on preserving system function, where components critical for system reliability are prioritized for PM measures. However, the method is generally not capable of showing the benefits of maintenance for system reliability and costs. For this purpose a quantitative approach to RCM has been developed, the reliability-centred asset management (RCAM) method. This chapter provides an overview of two different approaches to RCM, namely RCM II and RCAM. The chapter also presents application studies using the RCAM approach. Results from these studies show how the RCAM method can be used to compare different maintenance methods and PM strategies based on the total cost of maintenance, which includes the impact of the PM measure on the system reliability. Relating maintenance effort and reliability improvement is, however, a complex problem, and substantial input data is required to support the method. The RCAM approach, like the RCM approach, consequently provides a means for creating resources to provide input data.

4.1 Introduction

Overall reliability can be improved by lowering either the frequency or the duration of interruptions. Preventive maintenance (PM) activities can reduce the frequency by preventing the actual cause of the failure. Consequently, PM is cost-effective when the reliability benefit outweighs the cost of implementing the PM measure. There is, therefore, a need for utilities to incorporate systematic methods which relate maintenance of system assets to the improvement in system reliability. This is part of the wider concept of asset management. Asset management involves making decisions to allow the network business to maximize long term profits, while delivering high service levels to the customers with acceptable and manageable risks. Reliability evaluation and maintenance planning techniques have separately been well developed, for example [1] and [2], with reliability assessment starting in the 1930s [3]. However, few techniques relate system reliability to component maintenance. Furthermore, the available techniques are not generally put into practice. Reasons for this, according to the author, are typically the lack of suitable input data and a general reluctance to use theoretical tools to address the practical problem of maintenance planning. There is, however, an existing and demonstrably successful approach for relating reliability to PM, known as reliability-centred maintenance (RCM). This chapter briefly describes two different RCM approaches. The first method described, RCM II, is a well-known approach proposed by John Moubray in his book "Reliability centred maintenance" [4]. The second method presented has been developed within a research project at KTH (The Royal Institute of Technology) and involves a high degree of modelling [5], [6]. The different steps in the two approaches are presented, and for the RCAM approach results from application studies are included. Finally a comparison of these methods is made, and future challenges are summarized.

4.2 Reliability-centred maintenance (RCM)

4.2.1 The background and concepts of RCM

RCM is a qualitative systematic approach to organizing maintenance [4], [7] and [8]. It originated in the civil aircraft industry in the 1960s with the introduction of the Boeing 747 series, and the need to lower PM costs while attaining a certain level of reliability. The results were successful and the methodology was developed further. In 1975 the US Department of Commerce defined the concept RCM and declared that it should be used in all major military systems [4]. In the 1980s, the Electric Power Research Institute (EPRI) introduced RCM into the nuclear power industry. Today RCM is used or being considered by an increasing number of electrical utilities [9], [10]. The main feature of RCM is its focus on preserving system function, where components critical for system reliability are prioritized for PM measures. However, the method is generally not capable of showing the benefits of maintenance for system reliability and costs.

There are different versions of the RCM approach in use. In the 1980s, questions concerning the environment became important issues. This led to more focus being put on these issues, according to Moubray [4]. Streamlined reliability-centred maintenance (SRCM) approaches are simplified versions of RCM, developed to lower the resources needed to perform RCM.

Maintenance and reliability are important because of the large costs associated with maintenance tasks and the costs due to loss in production and breakdowns. Breakdowns can also lead to consequences that affect the environment or personal safety. These aspects could also be taken into consideration when performing an RCM analysis.

4.2.2 RCM according to Moubray

The RCM II method has a strong focus on environmental and safety issues. A short summary of the method found in Moubray's book [4] is presented in this section. The RCM II process involves asking seven questions about the studied system:

1. What are the functions and associated performance standards of the asset in its present operating context?
2. In what ways does it fail to fulfil its functions?
3. What causes each functional failure?
4. What happens when each failure occurs?
5. In what way does each failure matter?
6. What can be done to predict or prevent each failure?
7. What should be done if a suitable preventive task cannot be found?

These steps are described in more detail below, followed by some additional features of this method.


4.2.2.1 What are the functions of the asset?

To answer this question the asset's functions are divided into primary and secondary functions. The primary functions are the main purposes of the asset, while secondary functions are additional properties that the asset is expected to have. Functions should be described by a verb, an object and a standard of performance.

4.2.2.2 In what ways does it fail to fulfil its functions?

The next step is to identify in what ways the asset can fail to perform the functions established in step one. There could be several ways in which the asset fails to fulfil its desired functions.

4.2.2.3 What causes each functional failure?

Each functional failure may have several causes, called failure modes. It is at this level that the maintenance of the system is to be done. It is stressed that the analysis must be applied at an appropriate level of detail; otherwise the work may become either very extensive or meaningless.

4.2.2.4 What happens when each failure occurs?

The effects of the failure should be recorded. This includes evidence that a failure has occurred, environmental or safety threats, effects on production, physical damage and how to restore the system after the failure.

4.2.2.5 In what way does each failure matter?

This step analyses the consequences of each failure. First the failures are classified as evident or hidden. Hidden failures will not be noticed if they occur on their own, whereas evident failures will. Evident functional failures are classified into three groups according to the consequences of the failure. The three groups, listed below, are ordered by importance.

1. Safety and environmental consequences
2. Operational consequences
3. Non-operational consequences

Operational failures affect costs in connection with production and operation. Non-operational failures only affect the cost of repair.

4.2.2.6 What can be done to predict or prevent each failure?

Examine whether any maintenance can be done to prevent or predict the failure. Such tasks are called preventive tasks. Predetermined tasks which may be used are scheduled restoration and scheduled discard. These are often appropriate when dealing with age-related failures. To use these strategies there must be a point in time at which the probability of failure increases. Condition based tasks are used to identify potential failures. If condition based tasks are feasible, the question of how frequently to perform them must be answered. This can be difficult if reliable information about the failure probabilities and P-F intervals (the time between the point where a potential failure becomes detectable and the functional failure) is hard to acquire. Condition based tasks are feasible if a potential failure condition can be identified, the P-F interval is reasonably constant and not too short, and the item can be monitored at intervals shorter than the P-F interval.


Condition based maintenance and monitoring are discussed in more detail in Chapter 5 by Dr. A. Jardine.

4.2.2.7 What should be done if a suitable preventive task cannot be found?

If no appropriate preventive task is feasible or worth doing there are three choices: redesign, no scheduled maintenance, or failure-finding tasks. Failure-finding tasks are intended for hidden failures. When deciding which option to choose, the consequences of failures must be considered. If the consequence is non-operational, economy can rule the choice, but when there are safety or environmental consequences redesign might be the only option.

When applying the RCM II process there is a decision diagram that should be followed. When a maintenance strategy is found that is feasible and worth doing, it is chosen, and further analysis of other maintenance tasks is not required. Whenever possible, scheduled on-condition tasks should be chosen. Otherwise scheduled restoration tasks and then scheduled discard tasks are selected. The last choice when dealing with less severe consequences (operational and non-operational) is no scheduled maintenance or redesign. If the consequence involves environmental or safety hazards, the problem must be addressed and no scheduled maintenance is not an option.

4.2.2.8 Characteristics of RCM II

Moubray's method has a predetermined preference of maintenance strategies. The method steers towards performing preventive tasks rather than corrective tasks after a failure. Of the preventive strategies, condition based maintenance is preferred to predetermined maintenance. Environmental and safety consequences have a high priority in the analysis. Since the process stops when an acceptable maintenance strategy is found, it is possible that another type of strategy would have been more efficient had it also been evaluated. On the other hand, some work is saved and the process is made faster this way.
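The order of preference described in Section 4.2.2.7 can be written down as a small selection function. This is a simplified sketch, not Moubray's actual decision diagram: the function name, the argument names and the string labels are hypothetical, each boolean flag stands for "this task type is feasible and worth doing", and hidden failures are routed directly to failure-finding as a simplification.

```python
def select_task(consequence, hidden=False,
                on_condition=False, restoration=False, discard=False):
    """Return the first acceptable option in the order of preference.

    consequence: "safety/environmental", "operational" or "non-operational"
    """
    # Preventive tasks, in the preferred order; the search stops at the first hit.
    if on_condition:
        return "scheduled on-condition task"
    if restoration:
        return "scheduled restoration"
    if discard:
        return "scheduled discard"
    # No suitable preventive task was found.
    if hidden:
        return "failure-finding task"
    if consequence == "safety/environmental":
        return "redesign"  # no scheduled maintenance is not an option here
    return "no scheduled maintenance or redesign"


print(select_task("operational", on_condition=True, restoration=True))
print(select_task("safety/environmental"))
```

Because the function returns at the first acceptable task, it mirrors the characteristic noted above: later options are never evaluated, which saves work but may miss a more efficient strategy.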

4.3 Reliability-centred asset management (RCAM)

The RCAM method is developed from RCM principles, attempting to relate more closely the impact of maintenance to the cost and reliability of the system. The method has been developed from comprehensive application studies for real power distribution systems [5], [6] and [11]. As a first step in the method, the critical components for the system reliability are identified from a sensitivity analysis. These components are further studied, focusing on the impact of maintenance measures.

The relationship between reliability and maintenance has been established by relating the effect of PM to the causes of failures for the component being assessed. Two different approaches have been used. The first approach assumes a constant reduction ratio between failure rates and the effect of PM, whereas the second approach assumes this ratio to be dependent on time. In the first case λ(PM) depends only on the effect of PM (Approach I). In the second case, λ(t,PM) is also time-dependent (Approach II), and the failure rate reduction is a consequence of the PM actions considered for the specific component that is studied. Formulating the failure rate model for Approach II is a complicated task. Studies on this have been made for the underground cable component [5], [6] and [11] and for breaker components [11], [12], [13] and [14]. Results from these studies are presented in Section 4.4.


The main stages of the RCAM approach are:

Stage 1 System reliability analysis: defines the system and evaluates critical components affecting system reliability.
Stage 2 Component reliability modelling: analyzes the components in detail and, with the support of appropriate input data, defines the quantitative relationship between reliability and PM measures.
Stage 3 System reliability and cost/benefit analysis: puts the results of Stage 2 into a system perspective, and evaluates the effect of component maintenance on system reliability and the impact on cost of different PM strategies.

These three stages emphasize a central feature of the method: that the analysis moves from the system level to the component level and back to the system level.

4.3.1 Economic evaluation

The economic evaluation brings the RCAM analysis to its final step: to relate the benefits in costs due to the impact of maintenance on reliability. The motivation for any PM strategy is that the cost of applying the PM measure should be less than the cost of taking no action at all. If little or no PM is done, then more system failures are likely to occur, resulting in more repair actions being required, i.e. in more corrective maintenance (CM) actions. Therefore, the important issue is to compare the costs associated with different maintenance methods, including both PM and CM, with the objective of minimizing the total cost of maintenance.

There are several costs that can be related to the effect of system failures. Two direct utility costs are: (a) cost of failure (CM), e.g. repair costs and losses in revenue due to non-delivered energy, and (b) cost of the PM actions, e.g. planned maintenance or replacement of a component in advance of failure. However, the cost of failure also depends on the customer cost [15]. A supply interruption affects the customer, who will suffer supply unavailability and may suffer direct costs and/or be compensated via a penalty payment. Consequently, the proposed cost analysis considers:

• the cost of failure $C_{f}$
• the cost of preventive maintenance $C_{PM}$
• the cost of interruption $C_{int}$

The optimal maintenance method and PM strategy is the solution that minimizes the sum of these three costs. However, in some cases it may not be necessary to include $C_{int}$, for example for a simple or first order comparison of strategies. Section 4.4 presents an application study following the RCAM approach.

The economic evaluations have been made using fundamental techniques. The costs are evaluated on an annual basis with an assumed increase due to inflation $d_{1}$. Furthermore, the investments in PM measures are spread over the remaining time of the assessment period T. Finally the present worth value of the total annualized costs is evaluated. The present worth value of one outlay (C) to be paid after n years with the discount rate $d_{2}$ is obtained by multiplying by the present worth value factor

$$PV_{f}(n, d_{2}) = (1 + d_{2})^{-n}.$$
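The present worth calculation and the cost comparison can be sketched in a few lines. The code below is an illustration under assumptions: the function names are hypothetical, the annual cost is assumed to grow by the inflation rate compounded once per year, and the strategy names and cost figures are invented for the example.

```python
def present_worth_factor(n, d2):
    """Present worth value factor (1 + d2)^(-n) for an outlay paid after n years."""
    return (1.0 + d2) ** (-n)


def present_worth_of_annual_costs(annual_cost, years, d1, d2):
    """Present worth of a cost paid at the end of each year 1..years.

    Assumption: the cost grows with inflation d1, compounded yearly,
    and is discounted with rate d2.
    """
    return sum(annual_cost * (1.0 + d1) ** n * present_worth_factor(n, d2)
               for n in range(1, years + 1))


def cheapest_strategy(strategies):
    """Pick the strategy with the lowest total of (C_f, C_PM, C_int)."""
    return min(strategies, key=lambda name: sum(strategies[name]))


# Invented annual costs (C_f, C_PM, C_int) for two hypothetical strategies.
costs = {"no PM": (50.0, 0.0, 30.0), "yearly inspection": (20.0, 15.0, 10.0)}
print(cheapest_strategy(costs))
print(round(present_worth_of_annual_costs(45.0, 10, 0.02, 0.07), 2))
```

When the inflation and discount rates are equal, growth and discounting cancel and the present worth is simply the annual cost times the number of years, which is a convenient sanity check.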


4.3.2 The steps in the RCAM approach [6]

Figure 8 illustrates the logic for the RCAM method. This figure includes the different stages and steps in the method, and the systematic process for analyzing the system components and their causes of failures. The ten steps needed to perform the RCAM approach, as identified in Figure 8, are presented in more detail in this section.

[Figure 8: flowchart. Stages: Stage 1, System reliability analysis; Stage 2, Component reliability modelling; Stage 3, System reliability cost/benefit analysis. Steps: 1. Define reliability model and required input data; 2. Identify critical components by reliability analysis; 3. Identify failure causes by failure mode analysis; 4. Define a failure rate model; 5. Model effect of PM on reliability; 6. Deduce PM plans and evaluate resulting model; 7. Define strategy for PM (when, what, how); 8. Estimate composite failure rate; 9. Compare reliability for PM methods and strategies; 10. Identify cost-effective PM strategy, leading to the RCAM plan. Decision points (Yes/No): Are there more causes of failures? Are there alternative PM methods? Are there more critical components? The loops run for each critical component i, PM method j, and failure cause k.]

Figure 8 Logic for the RCAM approach [6].


4.3.2.1 Define reliability model and required input data.

Define input data including: network data, component reliability data and customer data, and a reliability model.

4.3.2.2 Identify critical voltage levels and components for the system reliability based on results from reliability analysis.

The approach for the sensitivity analysis is as follows: categorize components according to their type, vary their input failure rates for one type at a time, and evaluate the resulting indices for the system and different load points. Perform this analysis for different voltage levels and load points. The results provide a prioritized list of components for PM measures.

4.3.2.3 Identify failure causes by failure modes analysis for each component identified as critical and affected by PM

• Identify causes of failures from an understanding of: component functions, failure modes and failure events.
• Determine the percentage each cause contributes to the total number of failures from interruption data and expertise.
• Identify experience data for interruptions due to these causes of failures.
• Identify possible effect of alternative PM methods.

4.3.2.4 Define a failure rate model

For components $i$, $i = 1, \dots, n$, model the failure rate function $\lambda^{i}$ as follows:

4.3.2.5 Approach I: Assume that the failure rate equals the average failure rate, $\lambda_{a}^{i}$, from the reliability input data (from Step 1):

$$\lambda^{i} = \lambda_{a}^{i} \qquad (4.1)$$

4.3.2.6 Approach II: Assume that the component failure rate function can be obtained as a sum of contributions from the different causes of failures of type $k$, $k = 1, \dots, m$. Deduce a model for the failure rate as a function of time, using experience data from Step 2 for the failure rate modeling, as follows:

$$\lambda^{i}(t) = \sum_{k=1}^{m} \lambda_{k}^{i}(t) \qquad (4.2)$$

4.3.2.7 Model the effect of PM methods on reliability for each failure cause
Assume that the PM method $j$, $j = 1, \dots, z$, preventing failure cause $k$ is applied to component number $i$. For each PM method $j$ define a failure rate model as follows:

4.3.2.8 Approach I
• Assume that the effect of applying PM is a reduction of the actual failure cause $k$ by $x_{jk}$ %, where $x_{jk} \in [0, a]$ and $a$ is the percentage contribution of that failure cause to the total failures, given from Step 3.
• Assume that the failure rate for the analysed component is reduced by the same percentage. The resulting failure rate function can be evaluated from:

$$\lambda^{i}(PM) = \lambda_{av}^{i} \left( 1 - \sum_{j=1}^{z} \sum_{k=1}^{m} \frac{x_{jk}}{100} \right) \tag{4.3}$$
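As a minimal sketch of (4.3), the function below scales an average failure rate by the summed percentage reductions. The function name, the dictionary layout and all numbers are assumptions made for illustration; they are not taken from the RCAM studies.

```python
def failure_rate_with_pm(lambda_av, reductions_percent):
    """Approach I, eq. (4.3): lambda(PM) = lambda_av * (1 - sum(x_jk) / 100).

    reductions_percent maps (pm_method j, failure_cause k) -> x_jk in percent.
    """
    total = sum(reductions_percent.values())
    if not 0.0 <= total <= 100.0:
        raise ValueError("summed reductions must lie between 0 and 100 percent")
    return lambda_av * (1.0 - total / 100.0)


# Hypothetical input: one PM method acting on two failure causes.
x = {
    ("inspection", "material fault"): 14.0,
    ("inspection", "lack of maintenance"): 5.0,
}
lam_pm = failure_rate_with_pm(0.10, x)  # average rate of 0.10 failures/yr
print(f"{lam_pm:.3f} failures/yr")
```

Each $x_{jk}$ is expressed as a share of the total failures, so the reductions can simply be added before scaling.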

Page 37: IR-EE-ETK_2007_004

IEEE Tutorial on Asset Management – Maintenance and Replacement Strategies, 24-28 June 2007, Tampa, USA

RCM and its extension into a quantitative approach RCAM L. Bertling

34

4.3.2.9 Approach II
• Deduce a model for the functional relationship between reliability and PM activities as a function of time. This model requires more knowledge about the component behaviour, the effect of applying PM with method $j$, and the impact on specific failure causes.
• The resulting failure rate function can be evaluated from:

$$\lambda^{i}(t, PM) = \sum_{j=1}^{z} \sum_{k=1}^{m} \lambda_{jk}^{i}(t, PM) \tag{4.4}$$

4.3.2.10 Deduce different plans for applying PM, and evaluate the resulting effect on the component failure rate
Note that for Approach II this requires the effect of applying PM at different times on the resulting failure rate functions to be evaluated.

4.3.2.11 Define and implement different strategies for PM
A PM strategy, $S$, for the system is defined by:
• the applied PM methods $j$, denoted by $j \in S$,
• the proportion of component type $i$ affected by each PM method, denoted by $s_{j}$,
and also, for Approach II and within the period $t \in [t_{0}, T]$:
• the number of times PM is applied, $v$, and
• the times at which PM is applied, $(t_{PM_1}, t_{PM_2}, \dots, t_{PM_v})$.

4.3.2.12 Estimate the resulting composite failure rate.
This step implies developing the failure rate model for the component $i$ applied with PM strategy $S$. The resulting failure rate function provides the input data for component type $i$ to the system reliability model.
• Define which failure causes are affected by each PM method $j$ in the strategy. Let $k \in j$ denote the affected causes, and $k \notin j$ denote the non-affected causes.
• The resulting failure rate function captures the average composite failure rate characteristic for the component $i$. It is made up of several parts, depending on the PM strategy.

4.3.2.13 Approach I
• Define the extent of the effect for each failure cause affected by PM method $j$, that is $x_{jk}$.
• Evaluate the resulting composite failure rate for component type $i$, which is given as follows:

$$\lambda^{i}(S) = \sum_{j \notin S} \lambda_{j}^{i} + \sum_{j \in S} \Big( (1 - s_{j}) \cdot \lambda_{j}^{i} + s_{j} \cdot \sum_{k \in j} \lambda_{jk}^{i}(PM) + s_{j} \cdot \sum_{k \notin j} \lambda_{jk}^{i} \Big), \qquad \lambda_{j}^{i} = \sum_{k=1}^{m} \lambda_{jk}^{i} \tag{4.5}$$
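The weighting in (4.5) can be sketched for the simplest case of a single PM method applied to a proportion $s$ of one component type. This is an illustration under stated assumptions, not the RCAM implementation: the function name, the cause names and the rates are hypothetical, and the PM-affected cause rates are supplied directly rather than derived from $x_{jk}$.

```python
def composite_failure_rate(rates, rates_with_pm, s):
    """Composite failure rate for one component type and one PM method.

    rates:         dict cause k -> failure rate contribution without PM [f/yr]
    rates_with_pm: dict cause k -> contribution after PM, for affected causes only
    s:             proportion of the population that receives PM (0..1)
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a proportion between 0 and 1")
    untreated = sum(rates.values())
    # Affected causes use the PM rate, non-affected causes keep their rate.
    treated = sum(rates_with_pm.get(k, r) for k, r in rates.items())
    return (1.0 - s) * untreated + s * treated


# Hypothetical cable data: PM acts on water treeing only, on 30 % of the cables.
rates = {"water treeing": 0.03, "digging": 0.02, "other": 0.01}
rates_with_pm = {"water treeing": 0.01}
lam_s = composite_failure_rate(rates, rates_with_pm, s=0.3)
print(f"{lam_s:.3f} failures/yr")
```

Setting `s=0` returns the failure rate without PM, and `s=1` returns the rate with the whole population treated.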

4.3.2.14 Approach II
• The following equations define the resulting failure rate function:

$$\lambda^{i}(t, S) = \begin{cases} \lambda_{0}^{i}(t), & t_{0} \le t \le t_{PM_1} \\ \lambda_{1}^{i}(t), & t_{PM_1} \le t \le t_{PM_2} \\ \quad \vdots \\ \lambda_{v}^{i}(t), & t_{PM_v} \le t \le T \end{cases} \tag{4.6}$$

where

$$\lambda_{0}^{i}(t) = \lambda^{i}(t)$$

$$\lambda_{1}^{i}(t) = \sum_{j \notin S} \lambda_{j}^{i}(t) + \sum_{j \in S} \Big( (1 - s_{j1}) \cdot \lambda_{j}^{i}(t) + s_{j1} \cdot \sum_{k \in j} \lambda_{jk}^{i}(t, PM_{1}) + s_{j1} \cdot \sum_{k \notin j} \lambda_{jk}^{i}(t) \Big)$$

and, for each later occasion up to $v$, $\lambda_{v}^{i}(t)$ is built in the same way, with the untreated proportion $1 - (s_{j1} + s_{j2} + \dots + s_{jv})$ weighting $\lambda_{j}^{i}(t)$, and the proportion $s_{jv}$ treated at occasion $v$ weighting the cause-level rates $\lambda_{jk}^{i}(t, PM_{v})$ for $k \in j$ and $\lambda_{jk}^{i}(t)$ for $k \notin j$.

4.3.2.15 Compare system reliability when applying different maintenance methods and PM strategies.
• Perform system reliability analysis with the result from Step 8 as input data for the included components. The output is the system and load-point reliability indices that show the different effects of the PM strategy ($S$) on the system.
• Compare the impact of PM strategy ($S$) on system and load-point reliability indices.
• For Approach II, an alternative is to compare the average load-point indices during the period, evaluated as follows:

$$\lambda_{av,Lp_i} = \frac{\Delta t}{T - t_{0}} \sum_{t = t_{0}}^{T - \Delta t} \lambda_{Lp_i}(t, S) \tag{4.7}$$

and similarly $U_{av,Lp_i}$, $r_{av,Lp_i}$ and $E_{av,Lp_i}$ for each load point, $Lp_i$, in the system model.
• Analyse the effect of using different PM strategies on system reliability.

4.3.2.16 Identify cost effective PM strategy
• Evaluate cost functions in [cost/yr], based on those that were introduced in Section 4.3.1:
  • the cost of failure, $CCM_{f}$,
  • the cost of preventive maintenance, $CPM_{PM}$, and
  • the cost of interruption, $CCM_{int}$,
  with and without PM respectively, as follows:

4.3.2.17 Approach I

$$CCM_{f} = \sum_{i=1}^{n} \lambda^{i} \cdot c_{f}^{i}, \qquad CPM_{f}(S) = \sum_{i=1}^{n} \lambda^{i}(S) \cdot c_{f}^{i} \tag{4.8}$$

where $c_{f}^{i}$ is the cost of failure for component $i$ [cost/int].

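4.3.2.18 Approach II

$$\begin{aligned} CCM_{f}(t) &= \sum_{i=1}^{n} \lambda^{i}(t) \cdot c_{f}^{i} \cdot (1 + d_{1})^{t} \\ CPM_{f}(t, S) &= \sum_{i=1}^{n} \lambda^{i}(t, S) \cdot c_{f}^{i} \cdot (1 + d_{1})^{t} \end{aligned} \tag{4.9}$$

where $d_{1}$ is the inflation rate.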
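A short sketch of the time-dependent cost of failure in (4.9) follows. The two failure rate functions, the failure costs and the 2 % rate are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def cost_of_failure(t, failure_rates, failure_costs, d1):
    """Eq. (4.9): sum_i lambda_i(t) * c_f_i * (1 + d1)**t for year index t."""
    base = sum(lam(t) * c for lam, c in zip(failure_rates, failure_costs))
    return base * (1.0 + d1) ** t


# Hypothetical components: one ageing (rate grows with t), one constant.
rates = [lambda t: 0.02 + 0.001 * t, lambda t: 0.05]
costs = [1000.0, 400.0]  # cost per failure [cost/int]
ccm_f = [cost_of_failure(t, rates, costs, d1=0.02) for t in range(3)]
print([round(c, 2) for c in ccm_f])
```

The same function gives $CPM_{f}(t, S)$ when the failure rate functions passed in are those obtained under the PM strategy.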

4.3.2.19 Approach I

$$CPM_{PM}(S) = \sum_{i=1}^{n} \sum_{j \in S} C_{PM_j}^{i} \tag{4.10}$$

where $C_{PM_j}^{i}$ is the cost of applying PM method $j$ for component $i$ [cost/measure].

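4.3.2.20 Approach II

$$CPM_{PM}(t, S) = \begin{cases} 0, & t_{0} \le t \le t_{PM_1} \\ \dfrac{1}{T - t_{PM_1} + 1} \displaystyle\sum_{i=1}^{n} \sum_{j \in S} C_{PM_j}^{i}, & t_{PM_1} \le t \le t_{PM_2} \\ \quad \vdots \\ \displaystyle\sum_{u=1}^{v} \Big( \dfrac{1}{T - t_{PM_u} + 1} \sum_{i=1}^{n} \sum_{j \in S} C_{PM_j}^{i} \Big), & t_{PM_v} \le t \le T \end{cases} \tag{4.11}$$

where the cost of applying PM, at each PM occasion, is equally spread over the remaining time period, and the index $u$ runs over the PM occasions applied up to time $t$.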
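The spreading rule stated above can be sketched as follows, assuming yearly time steps and the same total cost at every PM occasion. The function name and the numbers are hypothetical; only the rule that each occasion's cost is spread evenly over the remaining period comes from the text.

```python
def annual_pm_cost(t, pm_times, cost_per_occasion, T):
    """Annual PM cost in year t.

    Each PM occasion at year t_pm contributes cost / (T - t_pm + 1) in every
    year from t_pm to the end of the period T, so its full cost is recovered.
    """
    return sum(
        cost_per_occasion / (T - t_pm + 1)
        for t_pm in pm_times
        if t_pm <= t <= T
    )


# Hypothetical plan: PM in years 9, 11 and 12 of a 12-year period.
pm_times = [9, 11, 12]
profile = {t: annual_pm_cost(t, pm_times, 300.0, T=12) for t in range(8, 13)}
print(profile)
```

Summing the yearly values over the period returns the total amount spent on PM, which is a convenient check on any implementation of this rule.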

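4.3.2.21 Approach I

$$CCM_{int} = c_{int}^{Lp_i} \cdot E_{Lp_i}, \qquad CPM_{int}(S) = c_{int}^{Lp_i} \cdot E_{Lp_i}(PM) \tag{4.12}$$

where $c_{int}^{Lp_i}$ is the customer interruption cost in [cost/kWh].

4.3.2.22 Approach II

$$\begin{aligned} CCM_{int}(t) &= \sum_{i=1}^{nlp} E_{Lp_i}(t, CM) \cdot c_{int}^{Lp_i} \cdot (1 + d_{1})^{t} \\ CPM_{int}(t, S) &= \sum_{i=1}^{nlp} E_{Lp_i}(t, S) \cdot c_{int}^{Lp_i} \cdot (1 + d_{1})^{t} \end{aligned} \tag{4.13}$$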

• Evaluate the total annualized costs in [cost/yr]:

4.3.2.23 Approach I

$$\begin{aligned} TCCM &= CCM_{f} + CCM_{int} \\ TCPM(S) &= CPM_{f}(S) + CPM_{PM}(S) + CPM_{int}(S) \end{aligned} \tag{4.14}$$

4.3.2.24 Approach II

$$\begin{aligned} TCCM(t) &= CCM_{f}(t) + CCM_{int}(t) \\ TCPM(t, S) &= CPM_{f}(t, S) + CPM_{PM}(t, S) + CPM_{int}(t, S) \end{aligned} \tag{4.15}$$

• Evaluate present values in [cost].

4.3.2.25 Approach I

The same value as given by (4.14).


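4.3.2.26 Approach II

$$\begin{aligned} PV\big(TCCM(t)\big) &= \sum_{t = t_{0}}^{T} TCCM(t) \cdot PV_{f}(t, d_{2}) \\ PV\big(TCPM(t, S)\big) &= \sum_{t = t_{0}}^{T} TCPM(t, S) \cdot PV_{f}(t, d_{2}) \end{aligned} \tag{4.16}$$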

The cost-effective solution is the maintenance strategy that provides the lowest total cost when comparing the total costs for PM with different sets of S , and with no PM, that is CM.
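The final comparison can be sketched as below. The present value factor is assumed here to be the standard discount factor $1/(1 + d_{2})^{t}$, which this section does not state explicitly, and the strategy names and annual costs are invented for illustration.

```python
def present_value(annual_costs, d2):
    """Discount a list of annual total costs (years 0, 1, ...) at rate d2.

    Assumption: the present value factor is 1 / (1 + d2)**t.
    """
    return sum(cost / (1.0 + d2) ** t for t, cost in enumerate(annual_costs))


# Hypothetical total annual costs [cost/yr] for CM and two PM strategies.
strategies = {
    "CM": [100.0, 100.0, 100.0],
    "PM_si S1": [130.0, 90.0, 85.0],
    "PM_rp S1": [180.0, 85.0, 80.0],
}
pv = {name: present_value(costs, d2=0.05) for name, costs in strategies.items()}
best = min(pv, key=pv.get)  # strategy with the lowest present value of total cost
print(best, {name: round(value, 1) for name, value in pv.items()})
```

With these invented numbers the no-PM case comes out cheapest, which shows that the comparison must always include CM as a candidate.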

4.4 RCAM application study for an electrical distribution system [6]
This section provides results from application studies using the RCAM approach for assessment of an urban electrical distribution system, the Birka system. The application study includes failure rate modelling for the underground cables, with the effect of PM on one failure cause (water treeing). For each of the results presented, the corresponding step in the RCAM method is noted.

4.4.1 Stage 1 - System reliability analysis for the Birka system (Step 1-3)
The disturbance data for the Stockholm city power system (from 220, 110, 33, to 11kV level) and the period 1982-1999 was surveyed [17]. The statistics showed that the 11kV voltage level contributed most to the number of failures and customers affected. A system was selected to investigate this voltage level in more detail. This system includes the 220/110 kV Bredäng station and 33/11 kV Liljeholmen station, which are connected to each other via two parallel 110 kV cables. From the Liljeholmen station (LH11) there are 32 outgoing 11 kV feeders that supply the southern part of central Stockholm and 14,300 customers. Figure 9 shows the resulting model for the Birka system. Customers are represented as one average 11kV load point. The following component types were included: bus bars, breakers, underground cables, and transformers. Furthermore, these were categorized into the different voltage levels between 220 and 11kV.


Figure 9 Reliability model of the Birka system [5]
[Network diagram: components c1 to c58 arranged over the 220 kV, 110 kV, 33 kV and 11 kV levels, with the nodes Sp, SJ, HD and LH11.]

Figure 10 Identifying critical components for the Birka system with cases: (1) base case, (2) bus bars, (3) breakers, (4) cables, and (5) transformers. (Step 2.)


The reliability of the Birka system was analysed using input reliability data from experience and statistics and RADPOW tool (a computer program developed at KTH [5]) [16]. Figure 10 shows results from Step 2 in the RCAM method defining the critical components. For each case, a specific component failure rate is assumed to be zero, and the resulting effect on the load point indices is evaluated. Case 1 refers to the base case with no PM. The most significant reduction occurs in Case 4, when cables are considered 100% reliable. This shows that these have the greatest impact on the failure rate and the unavailability for the average 11kV customer. The significant rise in average outage time is because the repair time for the dominant population of cables, that is 11kV, is much lower than the repair times for the other components. Therefore the average restoration time increases when the number of short interruptions is reduced. The conclusion is that the 11kV cables are critical components for this system.
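The effect described above, where removing the cable failures lowers the failure rate and unavailability but raises the average outage time, can be reproduced with a small sketch. It assumes a purely series system with the usual approximations $\lambda = \sum \lambda_i$, $U = \sum \lambda_i r_i$ and $r = U / \lambda$, which is much simpler than the Birka model, and all input values are hypothetical.

```python
def load_point_indices(components):
    """Series-system load-point indices: failure rate, unavailability, outage time."""
    lam = sum(c["lam"] for c in components.values())
    unavail = sum(c["lam"] * c["r"] for c in components.values())
    r = unavail / lam if lam > 0 else 0.0
    return lam, unavail, r


def sensitivity(components):
    """Set the failure rate of one component type at a time to zero."""
    results = {"base case": load_point_indices(components)}
    for name in components:
        perfect = {k: (dict(v, lam=0.0) if k == name else v)
                   for k, v in components.items()}
        results[name] = load_point_indices(perfect)
    return results


# Hypothetical data: failure rate [f/yr] and repair time [h] per component type.
components = {
    "bus bars": {"lam": 0.01, "r": 4.0},
    "breakers": {"lam": 0.02, "r": 5.0},
    "cables": {"lam": 0.40, "r": 1.0},
    "transformers": {"lam": 0.01, "r": 10.0},
}
results = sensitivity(components)
for case, (lam, unavail, r) in results.items():
    print(f"{case:13s} lam={lam:.2f} f/yr  U={unavail:.2f} h/yr  r={r:.2f} h/int")
```

Because the cables dominate the number of failures but have the shortest repair time in this example, making them perfect removes many short interruptions and leaves the long ones, so the average outage time rises.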

4.4.2 Stage 2 – Component reliability modeling (Step 3-6)
A comprehensive failure modes analysis was made (Step 3) using 18 years of data and 58 interruptions that were caused by the 11kV underground cables. The underlying causes of failures for each of these interruptions were investigated. The class of material or method made the most significant contribution with 59% of the total failures, including the underlying failure causes of material faults.

Approach I
The information from the failure modes analysis provides input data for the failure rate modelling (Step 4).

Approach II
Data from the statistics (Step 3) were complemented with practical experience. From discussions with maintenance personnel a list of underlying causes of cable faults was defined. One of these causes was water treeing. This is a tree-like phenomenon that involves water penetration through the insulation, occurring primarily in the early produced (mid-1970s) XLPE insulation cables. Data related to this failure cause were collected and selected. These include disturbance statistics [18], measurements and modelling of the cable condition [19], and PM of cables [20]. One effective method for preventing failures of water-treed cables is the rehabilitation method [20][21]. This involves injecting a silicone-based liquid between the individual wires of the conductor, which stops the growth of the existing water trees. The water trees, on the other hand, affect the breakdown strength of the cable, which can be measured with diagnostic methods. Based on the experience data and the logic shown in Figure 11, a failure rate model (Step 4) and a functional relationship between the failure rate and the effect of PM measures (Step 5) were defined [5].

[Diagram: Water-tree growth → Decreased breakdown voltage → Increased failure rate]

Figure 11 Process to relate underlying failure cause to reliability. (Step 3-5.)

Three different maintenance activities were considered for these studies: no PM activities, PM by the rehabilitation method and PM by replacing cables systematically before they failed (the replacement method) with notations: org, si and rp respectively. Figure 12 shows the final result for modelling the failure rate, assuming one PM action on each cable. The initial value for the cable failure rate is relatively small but not zero, as the figure


indicates. The failure rate characteristic with no PM is the resulting approximation of a function obtained from experience data [5]. The data is assessed from a complete population of cables over a 13-year aging period. It was assumed that the failure rate, after this time and due to this specific failure cause, is constant. Furthermore, it was assumed that replacement is made with a cable having the same characteristics as the current cable had when new. These assumptions were motivated by two aspects: that the water trees grow to a maximum length (that of the insulation thickness) and that this provides a worst-case scenario when showing the benefit of PM. However, it should be noted that for these XLPE insulated cables, a new cable would not have the same characteristics due to changes in the manufacturing techniques. Nevertheless, a changed characteristic can be included quite readily. In practice, PM procedures are likely to be performed several times during the lifetime of a particular component, in which case the characteristic shown in Figure 12 would have a series of decrements similar to that shown. The number of occasions and their timing should depend on the cost of performing the PM actions and the cost-benefit of doing so. The RCAM approach described in this paper allows this to be assessed objectively. The resulting cable failure rate model was used for the Birka system. The characteristics of the XLPE cables in this system are consequently assumed to follow those of the XLPE cables with insulation degradation due to water treeing. (It should be stressed that this assumption enabled complete demonstration of the RCAM method, rather than providing a true picture of the cables in the Birka system.) To obtain the composite failure rate for the cable it was assumed that the total failure causes were due to water trees and other causes. 
The resulting input data for the component then consisted of the developed failure rate model for failures due to water trees, and the average failure rate for the 11kV cable in the Birka system due to other causes. (Step 6.)

Figure 12 Resulting failure rate model for a water-treed cable affected by PM measures after 11 years. (Step 4-5, Approach II.)

4.4.3 Stage 3 - System Reliability and Cost/Benefit Analysis (Step 7-10)

Approach I
Results from the survey of statistics provided input data for modelling the relationship between PM and reliability using Approach I. Sensitivity studies were made to see the effect at the system


level if each of these causes of failures were decreased individually or in combination. The different cases are as follows:

1. base case,
2. fabric or material faults = 14%,
3. lack of maintenance = 5%,
4. wrong method or instruction = 15%,
5. total of (2-4) = 34%, and
6. total for (material and method) = 59%.

The difference in percentages between cases 5 and 6 (25%) relates to those causes that were reported as included in material and method, but with no further detailed level of classification. Figure 13 shows the benefit of these different cases on the system indices. It has been assumed for each case that the causes of failures can be eliminated by the PM activities. Thus the corresponding failures would be eliminated and the reliability indices influenced. The results show that PM measures to reduce individual causes of failures for a critical component in the system can significantly improve the system reliability. The cases represent different maintenance strategies for the RCAM method with Approach I (Step 7).

Figure 13 Effect on system reliability for different maintenance

strategies using Approach I for the Birka system. (Step 9.)

Approach II
A system analysis is performed for the Birka system including two strategies for applying the PM, with either rehabilitation ($PM_{si}$) or replacement ($PM_{rp}$). Both of these involve PM applied on three occasions (years $t_{PM} = 9, 11, 12$), and with the following proportions of cables subject to PM per occasion: 10% for $S_{1}$ and 30% for $S_{2}$ (Step 7). The results from the system reliability analysis, as shown in Table 3-1 (Step 9), show consistently that the best reliability is achieved with PM by replacement and with as much as possible of the component replaced, that is $S_{2}$. Figure 14 shows one result from the economic evaluation according to the RCAM method. Input data for the economic assessment was provided by the utility, and from the Swedish customer interruption costs included in [22]. It is seen that the cost of failures is decreased for the Birka system when the 11kV cables are affected by PM measures. Furthermore, it is seen that the most significant decrease in cost of failures is achieved with the replacement method.


Table 3-1 Reliability results applying different maintenance methods

Reliability factor      Unit       CM      PM_si, S1   PM_si, S2   PM_rp, S1   PM_rp, S2
λ_av,Lpi                [int/yr]   0.52    0.50        0.47        0.50        0.45
U_av,Lpi                [h/yr]     0.70    0.70        0.65        0.68        0.63
r_av,Lpi                [h/int]    1.40    1.41        1.44        1.42        1.45
E_av,Lpi                [MWh/yr]   16.14   15.75       14.97       15.61       14.57
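Period averages of this kind follow (4.7). A minimal sketch is given below, assuming the index is sampled at equal steps $\Delta t$ from $t_{0}$ to $T - \Delta t$; the yearly values are hypothetical and are not the data behind Table 3-1.

```python
def period_average(values, dt, t0, T):
    """Eq. (4.7): (dt / (T - t0)) * sum of the index sampled at t0, t0+dt, ..., T-dt."""
    expected = round((T - t0) / dt)
    if len(values) != expected:
        raise ValueError(f"expected {expected} samples, got {len(values)}")
    return dt / (T - t0) * sum(values)


# Hypothetical yearly load-point failure rates [int/yr] under a PM strategy.
lambda_lp = [0.52, 0.52, 0.50, 0.48]
lambda_av = period_average(lambda_lp, dt=1.0, t0=0.0, T=4.0)
print(f"{lambda_av:.3f} int/yr")
```

The same function applies unchanged to the unavailability, outage time and energy not supplied series.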

Figure 14 The impact of maintenance methods and PM strategies on cost

of failure for the Birka system. (Step 10, Approach II.)


Figure 15 The impact of different maintenance methods on the total annual costs of applying a PM strategy for the Birka system. Results are shown for the case with the interest rate $d_{1} = 2\%$. (Step 10, Approach II.)

The final step in the RCAM analysis is to evaluate the present worth values of the annualised total costs of maintenance. Figure 15 presents annual costs for the different maintenance methods using PM strategy S1. It can be seen directly from the annual costs that PM is a dominating cost. Furthermore, it is clearly more cost-effective to rehabilitate the cable than to replace it, since the greater benefit in reliability by the replacement method is offset by the higher investment cost. Consequently, the cost-effective solution is not to carry out PM in this case, but if PM is carried out, rehabilitation is better than replacement. This is, however, a constructed example considering only one type of component and does not provide the complete result for the Birka system. It is also important to note that cables compared with other components in a power system involve extremely high PM costs with relatively few possible PM actions. It is, however, of significant importance for efficient maintenance planning to evaluate the relative values of implementing different maintenance strategies, as shown in this application example.

4.4.4 Further developments into maintenance prioritization
The question of prioritization of maintenance resources is fundamental for all types of systematic and cost-efficient maintenance planning approaches. In the RCAM approach the first stage, and Step 1, involves identifying the most critical components, i.e. those that have the greatest impact on the system reliability. This section briefly introduces a proposed approach for component reliability importance indices, which have been developed for a first stage in maintenance optimization [23][24][25]. The proposed indices focus on customer interruption cost as a measure of system performance and reliability. The customer interruption costs have been calculated based on customer specific


initial costs for every interruption plus a cost linearly dependent on the duration of the interruption. The interruption cost based index is defined as follows:

$$I_{i}^{H} = \frac{\partial C_{s}}{\partial \lambda_{i}} \quad \text{[€/f]} \tag{4.17}$$

where $C_{s}$ [€/yr] is the total yearly customer interruption cost and $\lambda_{i}$ [f/yr] is the failure rate of component $i$. The index identifies components that are critical for the system with respect to their individual impact on total interruption cost with changes in component failure rate [26]. One interpretation of $I^{H}$ is that it corresponds to the total expected interruption cost (for all load points) that would occur if component $i$ failed. Hence, if there were one maintenance action available that would result in the same absolute change in failure rate for any component in the network, $I^{H}$ would be the adequate index to use for prioritizing which component the action should be performed on. The proposed index, $I^{H}$, is not affected by the studied component's failure rate but "only" by component repair time and the position of the component and all other components in the system. This is analogous to Birnbaum's importance index [27]. Hence, the concept of maintenance potential [26] is introduced. Maintenance potential corresponds to the expected system cost reduction that would occur in the case of a perfect component, i.e. no failures for the studied component (hence maintenance potential). Another way to express this measure is the expected total interruption cost that the studied component's failures will result in (alone and/or together with other components) during one year. Maintenance potential is defined as:

$$I_{i}^{MP} = C_{S}(1_{i}, \lambda) - C_{S}(\lambda) \quad \text{[€/yr]} \tag{4.18}$$

where $C_{S}$ is the total system (interruption) cost and $\lambda$ [f/yr] is the failure rate for the studied components. Results from applying these indices to the Birka system are presented below. First, the different component reliability indices are calculated. Then one of three maintenance strategies is implemented for each component. The three component strategies are:

1. Keep current preventive maintenance level, average failure rate is assumed to remain unchanged, no change in cost.

2. Improve the preventive maintenance, the failure rate of the component is assumed to become reduced, increased cost of preventive maintenance.

3. Decrease the preventive maintenance, the failure rate is assumed to increase for the studied component, cost savings on preventive maintenance.

The selection of the component strategies has to be performed in an optimization process that recalculates the indices several times, in order to ensure that an optimal point is reached.
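The two indices in (4.17) and (4.18) can be sketched for a hypothetical series system in which each interruption costs a fixed amount plus an amount proportional to its duration, in line with the cost structure described above. The cost model, function names and numbers are assumptions for illustration; the maintenance potential is reported here as a positive cost reduction.

```python
def interruption_cost(lam, repair_h, cost_per_int, cost_per_hour):
    """Yearly interruption cost of a series system [cost/yr]."""
    return sum(l * (cost_per_int + cost_per_hour * r) for l, r in zip(lam, repair_h))


def importance_index(i, lam, repair_h, cost_per_int, cost_per_hour, eps=1e-6):
    """Finite-difference estimate of dC_s / dlambda_i, cf. eq. (4.17)."""
    bumped = list(lam)
    bumped[i] += eps
    base = interruption_cost(lam, repair_h, cost_per_int, cost_per_hour)
    return (interruption_cost(bumped, repair_h, cost_per_int, cost_per_hour) - base) / eps


def maintenance_potential(i, lam, repair_h, cost_per_int, cost_per_hour):
    """Cost reduction if component i never failed, cf. eq. (4.18)."""
    perfect = list(lam)
    perfect[i] = 0.0
    return (interruption_cost(lam, repair_h, cost_per_int, cost_per_hour)
            - interruption_cost(perfect, repair_h, cost_per_int, cost_per_hour))


# Hypothetical data: component 0 fails often with short repairs,
# component 1 fails rarely with long repairs.
lam = [0.40, 0.02]      # failure rates [f/yr]
repair_h = [1.0, 10.0]  # repair times [h]
ih = [importance_index(i, lam, repair_h, 1000.0, 500.0) for i in range(2)]
mp = [maintenance_potential(i, lam, repair_h, 1000.0, 500.0) for i in range(2)]
print([round(v) for v in ih], [round(v) for v in mp])
```

In this example the rarely failing component has the higher importance index, while the frequently failing one has the higher maintenance potential, so the two indices can rank components differently.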


Figure 16 An optimal maintenance strategy for the Birka system [23]. (Legend: increase maintenance, decrease maintenance, keep maintenance level.)

Figure 16 illustrates the results for the Birka system using the component importance indices to search for an optimal maintenance strategy. The results show, for each component, whether the maintenance level should be kept, increased or decreased. From the figure it is seen that the maintenance could be decreased for several components at the 33 kV level. This result is reasonable since these components are located in a highly redundant part of the system.

4.5 Conclusions
This chapter has provided an overview of two different approaches to RCM, i.e. RCM II and RCAM. RCAM differs mainly in that its approach is purely mathematical. Therefore RCAM needs more input data and requires a higher level of research than RCM II. RCAM was developed for a structurally complex system. For analysis of individual components the structure would be very simple, since nearly all failures lead to a functional failure, and little would be gained from the system analysis, which plays a major part in the RCAM method. The RCAM method has the advantage of being able to view the system as a unit when deciding maintenance strategies, while RCM II views the system at component and failure mode level when maintenance strategies are determined. The chapter has also presented application studies for the RCAM approach. Results from the application studies show how the RCAM method can be used to compare different maintenance methods and PM strategies based on the total cost of maintenance, which includes the impact of the PM measure on the system reliability. Furthermore, the application study shows that the RCAM method can be performed and supported by real input data. Relating maintenance effort and reliability improvement is, however, a complex problem, and substantial input data is required to support the method. The RCAM, as well as the RCM, approach consequently provides a means for creating resources to provide input data.

4.6 References
[1] Billinton, R., Fotuhi-Firuzabad, M. and Bertling, L., “Bibliography on the application of probability methods in power system reliability evaluation 1996-1999”, IEEE Transactions on Power Systems, Vol. 16, No. 4, November 2001.
[2] Endrenyi, J., et al, “The Present Status of Maintenance Strategies and the Impact of Maintenance on Reliability”, IEEE Transactions on Power Systems, vol. 16, no. 4, November, 2001.
[3] R. Billinton, “Bibliography on the application of probability methods in power system reliability evaluation,” IEEE Trans. Power App. Syst., vol. PAS-91, Mar./Apr. 1972.
[4] J. Moubray, Reliability-centred Maintenance, Butterworth-Heinemann, Oxford, 1995.
[5] Bertling, L., “Reliability Centred Maintenance for Electric Power Distribution Systems”, ISBN 91-7283-345-9, TRITA-ETS-2002-01, ISSN 1650-674X, KTH Electrical Engineering, August 2002.
[6] Bertling, L., Allan, R.N., Eriksson, R., “A reliability-centred asset maintenance method for assessing the impact of maintenance in power distribution systems”, TPWRS-00271-2003.R3, IEEE Transactions on Power Systems, Vol. 20, No. 1, Feb. 2005.
[7] Nowlan, F. S. and Heap, H. F., Reliability-Centered Maintenance, National Technical Information Service, U.S. Department of Commerce, Springfield, Virginia, US, 1978.
[8] Smith, A. M., Reliability-Centered Maintenance, McGraw-Hill, U.S, 1993.
[9] Swedenergy AB, “RCM for Electrical Distribution Systems - A Simplified Decision Model for Maintenance Planning Part I” (RCM För Elnät En förenklad beslutsmetod för underhållsplanering - Del 1 Användningsområden och arbetssätt), (ISBN 91-7622-167-9, In Swedish), 2001.
[10] Cigré Working Group 13.08, “Life Management of Circuit-Breakers, International Council on Large Electric Systems”, Cigré, Paris, France, Working Group 13.08 Report 165, 2000.
[11] Eriksson, R., Lindquist, T., Bertling, L., “Reliability modelling of aged XLPE cables”, Nordic Insulation Symposium, Tampere, June 11-13, 2003.
[12] Lindquist, T., Bertling, L., Eriksson, R., “A Method for Age Modelling of Power System Components based on Experiences from the Design Process with the purpose of Maintenance Optimization”, Presented at the Reliability and Maintainability Annual Symposium (RAMS), January 2005.
[13] Lindquist, T., Bertling, L., Eriksson, R., “A Feasibility Study for Probabilistic Modeling of Aging in Circuit Breakers for Maintenance Optimization”, Proceedings of PMAPS, Ames, Iowa, September 2004.
[14] Lindquist, T., Bertling, L., Eriksson, “Estimation of disconnector contact condition for modeling the effect of maintenance and ageing”, IEEE PowerTech'05, St. Petersburg, June 2005.
[15] Kariuki, K.K. and Allan, R.N., “Application of customer outage costs in system planning, design and operation”, IEE. Gener. Transm. Distrib., vol 143, no 2, March 1996.
[16] Bertling, L., Eriksson, R. and Allan, R.N., “Relation between preventive maintenance and reliability for a cost-effective distribution systems”, Proceedings of IEEE PowerTech'01, vol 4, no 208, September 2001.
[17] Bertling, L., Eriksson, R., Allan, R.N., Gustafsson, L.Å. and Åhlén, M., “Survey of Causes of Failures Based on Statistics and Practice for Improvements of Preventive Maintenance Plans”, 14th PSCC in Seville, June 2002.
[18] Swedenergy AB, The Lifetime and Usefulness of XLPE Cables (PEX-kablar livslängd och användbarhet), (In Swedish), 1990.
[19] Werelius, P., Thärning, P., Eriksson, R., Holmgren, B. and Gäfvert, U., “Dielectric Spectroscopy for Diagnostics of Water Tree Deteriorated XLPE Cables”, IEEE Transactions on Dielectrics and Electrical Insulation, vol 8, no 1, February 2001.
[20] SINTEF, Faremo, H., Report: Rehabilitation of XLPE Cables with long Water-trees, (Energiforsyningens Forskningsinstitutt (EFI), EFI TR A 4512, In Norwegian), Trondheim, Norway, 1997.
[21] Pilling, J. and Bertini, G., “Incorporating Cablecure injection into a Cost-Effective Reliability Program”, IEEE Industry Applications Magazine, Vol. 3333, No 208333, September/October 2000.
[22] Cigré Task Force 38-06-01, Methods to Consider Customer Interruption Costs in Power System Analysis, Paris, 2001.


[23] Hilber, P., “Component reliability importance indices for maintenance optimization of electrical networks”, Licentiate thesis, ETS, KTH. Usab ab. ISBN 91-7178-055-6, 2005.

[24] Hilber, P., and Bertling, L., ``A method for extracting reliability importance indices from reliability simulations of electrical networks”, Proceedings of the 15th PSCC in Liege, Belgium, Aug. 2005.

[25] Hilber, P., Hällgren, B. and Bertling, L. “Optimizing the replacement of overhead lines in rural distribution systems with respect to reliability and customer value”, Accepted to be presented at the 18th International Conference on Electricity Distribution (CIRED) in Turin, June 2005.

[26] Hilber, P., Bertling, L., and Hällgren, B., “Effects of correlation between failures and power consumption on customer interruption cost”, Proceedings of the 9th international conference on Probabilistic Methods Applied to Power Systems, PMAPS, Stockholm, Sweden, June 2006.

[27] Rausand, M. and Høyland, A., “System reliability theory”, 2nd ed, Hoboken, New Jersey: John Wiley & Sons. ISBN 0-471-47133-X, 2004.

4.7 Biography Lina Bertling (S’98-M’02) was born in Stockholm in 1973. She received her Ph.D in Electric Power Systems in 2002 and M.Sc. in Systems Engineering in 1997, from KTH - the Royal Institute of Technology, Stockholm, Sweden. She is currently employed at KTH School of Electrical Engineering as Assistant Professor, and is the leader for a research program at the Swedish Centre of Excellence in Electric Power Systems (EKC2) on maintenance management. Since 2003 she has been working as a lecturer and research leader at KTH developing a research group on reliability-centered asset management (RCAM). Her research interests are in power system maintenance planning and optimization including reliability-centered maintenance (RCM) methods, reliability modeling and assessment for complex systems, and lifetime and reliability modeling for electrical components. Dr. Bertling is a member of the IEEE Power Engineering Society (PES) Subcommittee on Risk, Reliability, and Probability Applications (RRPA), and the IEEE PES Committee on Power System Planning and Implementation. She was the general chair of the 9th international conference on probabilistic methods applied to power systems (PMAPS) in Stockholm in 2006. ([email protected],www.ee.kth.se/rcam .)

Page 51: IR-EE-ETK_2007_004

IEEE Tutorial on Asset Management – Maintenance and Replacement Strategies, 24-28 June 2007, Tampa, USA

Optimizing condition monitoring decisions for maintenance planning A. K. S. Jardine

48

5 Optimizing condition monitoring decisions for maintenance planning

Dr. Andrew K.S. Jardine Department of Mechanical & Industrial Engineering

University of Toronto Toronto, Ontario Canada, M5S 3G8

Abstract - Condition monitoring is widely used, mainly for expensive and complex equipment or systems that consist of a large number of simpler components and are subject to different failure modes. Due to the rapid development of IT, much more data from condition monitoring and from all types of maintenance and corrective activity is now collected and stored in maintenance databases. This Chapter focuses on current industry-driven research that employs proportional hazards modeling to identify, from amongst the signals obtained during equipment health monitoring, the key risk factors that should be used to assess the health of equipment. Economic considerations are then blended with the risk estimate to establish optimal condition-based maintenance (CBM) decisions. Recent results of the research program are included, among them the development of the EXAKT software and its successful application to the condition monitoring techniques of vibration monitoring and oil analysis. The remaining useful life (RUL) of a system and its associated conditional reliability function are also considered as tools that may be used in optimizing CBM decisions.

5.1 Introduction

Condition Monitoring (CM) has become a recognized tool for assessing the health of equipment; one example is the use of oil analysis for power transformers. Maintenance can be planned and scheduled on the basis of CM information. Examples of CM techniques that can be utilized include, but are not limited to, vibration monitoring, infrared thermography, oil analysis, ultrasonics and motor current analysis [Dunn, 2005].

Control charts are one of the most commonly applied techniques for interpreting CM data. At each inspection, the levels of some measurements are compared with the corresponding predefined “warning limits” and a judgment is made based on the outcome. The method has been applied for several decades and has proved to be a helpful and simple-to-understand technique. However, control charts leave several important questions unanswered. Among the variety of measurements related to the item’s condition that one can collect, which ones should be paid attention to? What if there is no single variable that can provide information on the true condition of the equipment? What are the optimal warning limits, and should these limits change with the operating age of the item? [Jardine et al, 2006].

In this Chapter we present a procedure that takes into account both the age of the item and its history. It significantly expands the space of available maintenance strategies and is termed Condition-Based Maintenance (CBM).

The CBM Consortium research laboratory was established in 1995 at the Department of Mechanical and Industrial Engineering in the University of Toronto. The lab has developed theory that combines age and condition monitoring data with economic and/or performance data (which may include the cost of failure, the cost of planned maintenance and the corresponding down times) to produce a long-run optimal maintenance decision policy. Among the current activities of the project is the development of software that can assist maintenance and reliability specialists in optimizing decisions in a CBM environment. The current state of development of the software, called EXAKT™, is presented in Section 5.5. Details of the CBM Lab can be obtained at www.mie.utoronto.ca/cbm.

5.2 Optimizing Condition Based Maintenance Decisions

5.2.1 Introduction

Possibly the most common approach to understanding the health of equipment is through plotting various measurements and comparing them to specified standards. This procedure is illustrated in Figure 5.1, where measurements of iron deposits in an oil sample are plotted on the Y-axis and compared to warning and alarm limits. The maintenance professional then takes remedial action if deemed appropriate. Many software vendors addressing the needs of maintenance have packages available to assist in interpreting CM measurements, with the goal of predicting failures.

Figure 5.1 Classical Approach to Condition Monitoring. [Plot of iron concentration (ppm) against working age, with three zones: Normal < 200 ppm, Warning > 200 ppm, Alarm > 300 ppm.]
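The classical comparison against fixed limits can be sketched in a few lines of Python. This is only an illustration of the control-chart idea in Figure 5.1: the 200 ppm and 300 ppm limits come from the figure, while the function name, the treatment of a reading that falls exactly on a limit, and the sample readings are assumptions made for the example.

```python
def classify_iron_reading(fe_ppm: float,
                          warning_ppm: float = 200.0,
                          alarm_ppm: float = 300.0) -> str:
    """Return the control-chart zone for one iron (Fe) reading in ppm.

    Assumption: a reading exactly on a limit stays in the lower zone.
    """
    if fe_ppm > alarm_ppm:
        return "alarm"
    if fe_ppm > warning_ppm:
        return "warning"
    return "normal"


# Hypothetical inspection records: (working age in hours, Fe in ppm).
readings = [(500, 120.0), (1000, 210.0), (1500, 320.0)]
zones = [(age, classify_iron_reading(fe)) for age, fe in readings]
print(zones)
```

Note that the working age is carried along but never used in the decision, and only one measurement is examined. These are exactly the limitations of control charts that the rest of the Chapter addresses.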

Clearly there is a need to focus attention on the optimization of condition monitoring procedures. In the following section we present an approach for estimating the hazard (conditional probability of failure) that combines the age of equipment and condition monitoring data using a proportional hazards model (PHM). We then examine the optimization of the CM decision by blending the hazard calculation with the economic consequences of both preventive maintenance, including complete replacement, and equipment failure [Jardine & Tsang 2006].

5.2.2 The Proportional Hazards Model (PHM)

A valuable statistical procedure for estimating the risk of equipment failing when it is subject to condition monitoring is the proportional hazards model [Cox, 1972]. There are various forms that can be taken by a PHM, all of which combine a baseline hazard function with a component that takes into account covariates that are used to improve the prediction of failure. The particular form used in this section is known as a Weibull baseline PHM, which is:

$$h(t, Z(t)) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1} \exp\left\{\sum_{i=1}^{m} \gamma_i z_i(t)\right\} \qquad (5.1)$$

where h(t, Z(t)) is the (instantaneous) conditional probability of failure at time t, given the values of $z_1(t), z_2(t), \ldots, z_m(t)$. Each $z_i(t)$ in equation (5.1) represents a monitored condition data item at the time of inspection, t, such as the parts per million of iron or the vibration amplitude at the second harmonic of shaft rotation. These condition data are called covariates. The γ’s are the covariate parameters indicating the degree of influence each covariate has on the hazard function. The model consists of two parts. The first part is a baseline hazard function that takes into account the age of the equipment at the time of inspection, $\frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}$. The second part, $e^{\gamma_1 z_1(t) + \gamma_2 z_2(t) + \cdots + \gamma_m z_m(t)}$, takes into account the variables (which may be thought of as the key risk factors used to monitor the health of equipment) and their associated weights.

In the study by Anderson et al [1982] the form of the hazard model for the aircraft engines was:

$$h(t) = \frac{4.47}{24100}\left(\frac{t}{24100}\right)^{3.47} \exp\left(0.41 z_1 + 0.98 z_2\right) \qquad (5.2)$$

where z1 is the Fe concentration and z2 is the Cr concentration, both in parts per million, and t is the age of the aircraft engine in flying hours at the time of inspection. Since β = 4.47 is greater than 1, the baseline hazard increases with age, so the age of the aircraft engine is an influencing factor in estimating the hazard rate of the engine. η = 24,100 hours is the scale parameter of the Weibull distribution. The values 0.41 and 0.98 are the weights given to the iron and chromium measurements when calculating the hazard rate. They are estimated from the data that is analyzed, will be different for different engines, and will depend on the operating environment. The procedure for estimating the values of β, η and the weights, along with determining the condition monitoring variables to be included in the model, is discussed in a number of books and papers, including Kalbfleisch and Prentice [2002]. Standard statistical software such as SAS and S-Plus has routines to fit a PHM.
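To make the roles of the two parts of the model concrete, the following minimal sketch evaluates the Weibull baseline PHM of equation (5.1) with the parameter values quoted for equation (5.2). The function name and the inspection inputs (an engine age of 10,000 flying hours, 5 ppm Fe and 1 ppm Cr) are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def weibull_phm_hazard(t: float, beta: float, eta: float,
                       gammas: Sequence[float], z: Sequence[float]) -> float:
    """Hazard of a Weibull-baseline PHM, as in equation (5.1)."""
    if len(gammas) != len(z):
        raise ValueError("one covariate value is needed per weight")
    baseline = (beta / eta) * (t / eta) ** (beta - 1.0)       # age effect
    covariate_factor = math.exp(sum(g * zi for g, zi in zip(gammas, z)))
    return baseline * covariate_factor


# Parameter values quoted in the text for equation (5.2).
BETA, ETA = 4.47, 24100.0
GAMMAS = (0.41, 0.98)  # weights for Fe and Cr (ppm)

# Hypothetical inspection at 10,000 flying hours.
h_clean = weibull_phm_hazard(10000.0, BETA, ETA, GAMMAS, (0.0, 0.0))
h_worn = weibull_phm_hazard(10000.0, BETA, ETA, GAMMAS, (5.0, 1.0))
print(h_clean, h_worn, h_worn / h_clean)
```

At a fixed age the ratio of the two hazards depends only on the covariates: here it is exp(0.41 × 5 + 0.98 × 1) = exp(3.03), roughly 20.7. This is the "proportional" property that gives the model its name.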

5.2.3 Blending Hazard and Economics: Optimizing the CBM Decision

Makis and Jardine [1992] presented an approach to identify the optimal interpretation of condition monitoring signals. The approach is illustrated graphically in Figure 5.2 and Figure 5.3.

Figure 5.2 Calculating Hazard from Condition Monitoring Measurements. [Two panels: Data Plot (data against age) and Risk Plot (risk against age).]

Figure 5.3 Establishing the Optimal Hazard Level for Preventive Replacement. [Two panels: Risk Plot (risk against age, with the optimal risk level marked) and Cost Plot (cost per unit time against risk, marking the "ignore risk / replace at failure only" cost, the minimal cost and the optimal risk).]

Figure 5.2 illustrates that, given a set of condition monitoring measurements (the data plot), it is possible to convert the measurements to the equivalent hazard estimate (the risk plot). This conversion is achieved by using a PHM.

Once we have a method of monitoring an equipment’s hazard value, the next question is: what should we do about it to make an optimal maintenance decision? The answer is illustrated in Figure 5.3. There it can be seen that one possibility is to ignore risk (Risk Plot). If risk information is ignored, then the equipment will be used until it fails, and only then will it be maintained. (For the time being, assume that the maintenance action is equivalent to a replacement, as is the case for some complex equipment such as aircraft engines, which after maintenance are re-lifed and carry the same guarantees as a new engine.) The cost associated with this decision (ignoring risk) is the cost of a failure replacement divided by the mean time to failure of the equipment. Thus we obtain the cost of replacing only on failure, as identified on the Cost Plot. As the risk level at which we intervene is reduced, there will be more preventive replacements and fewer failure replacements. Assuming that the cost of a failure replacement is greater than the cost of a preventive replacement, a cost function as illustrated on the Cost Plot will be obtained. Thus it is possible to identify the optimal hazard level at which the equipment should be replaced: if the hazard rate is greater than a certain threshold value, preventive replacement should take place; otherwise, operations can continue as normal.

In the Makis and Jardine [1992] paper it is shown that the expected average cost per unit time, Φ(d), is a function of the threshold risk level, d, and is given by:

$$\Phi(d) = \frac{C\,(1 - Q(d)) + (C + K)\,Q(d)}{W(d)} \qquad (5.3)$$

where C is the preventive replacement cost and C+K the failure replacement cost. Q(d) represents the probability that failure replacement will occur, at hazard level d. W(d) is the expected time until replacement, either preventive or failure. The optimal risk, d*, is that value that minimizes the right hand side of equation (5.3), and the optimal decision is then to replace the item whenever the estimated hazard, h(t, Z(t)), calculated on completion of the condition monitoring inspection, equals or exceeds d*.
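The trade-off expressed by equation (5.3) can be sketched as a simple search over candidate thresholds. In the sketch below the formula is taken from the text, but everything else is an assumption: the cost values, the tabulated pairs (Q(d), W(d)) and the function names are hypothetical. In practice Q(d) and W(d) are derived from the fitted PHM and the covariate behaviour, which is not reproduced here.

```python
import math
from typing import Dict, Tuple


def expected_cost_rate(c: float, k: float, q: float, w: float) -> float:
    """Equation (5.3): [C(1 - Q) + (C + K)Q] / W for one threshold d."""
    if not 0.0 <= q <= 1.0 or w <= 0.0:
        raise ValueError("need 0 <= Q <= 1 and W > 0")
    return (c * (1.0 - q) + (c + k) * q) / w


def best_threshold(c: float, k: float,
                   table: Dict[float, Tuple[float, float]]) -> Tuple[float, float]:
    """Return (d*, minimal cost rate) over the tabulated thresholds."""
    costs = {d: expected_cost_rate(c, k, q, w) for d, (q, w) in table.items()}
    d_star = min(costs, key=costs.get)
    return d_star, costs[d_star]


# Hypothetical costs: preventive replacement C, extra cost of failure K.
C_PREVENTIVE, K_EXTRA = 1000.0, 2000.0

# Hypothetical table: threshold d -> (Q(d), W(d) in days).
# d = infinity means "ignore risk": Q = 1 and W is the mean time to failure.
TABLE = {
    0.001: (0.05, 200.0),
    0.002: (0.10, 320.0),
    0.005: (0.40, 430.0),
    math.inf: (1.00, 500.0),
}

d_star, cost = best_threshold(C_PREVENTIVE, K_EXTRA, TABLE)
print(d_star, cost)
```

With these made-up numbers the run-to-failure policy costs (C + K)/W = 3000/500 = 6.0 per day, while the minimum over the table is 3.75 per day at d = 0.002. A very low threshold is also expensive (5.5 per day at d = 0.001) because replacements become too frequent, which reproduces the shape of the Cost Plot in Figure 5.3.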

5.2.4 Applications

The topic of optimizing CBM decisions has been an active research thrust at the University of Toronto, conducted for some years in partnership with a number of companies, many of them having global operations (www.mie.utoronto.ca/cbm). As a consequence, pilot studies have been undertaken and published in the open literature. Brief summaries of three of them, each utilizing a different form of condition monitoring, follow.

5.2.4.1 Use of vibration monitoring

A company undertook regular vibration monitoring of critical shear pump bearings. At each inspection 21 measurements were provided by an accelerometer. Using the theory described in the previous section, and its embedding in software called EXAKT (see Section 5.3), it was established that of the 21 measurements there were 3 key vibration measurements: velocity in the axial direction in both the first band width and the second band width, and velocity in the vertical direction in the first band width. In the plant the economic consequence of a bearing failure was 9.5 times greater than when the bearing was replaced on a preventive basis. Taking account of the risk obtained from the PHM and of the costs, it was clear that by following the optimization approach total cost could be reduced by 35%. Fuller details are available in Jardine et al. [1999].

5.2.4.2 Use of oil analysis

Electric wheel motors on a fleet of haul trucks in an open-pit mining operation were subject to oil sampling on a regular basis. Twelve measurements resulted from each inspection: Al, Cr, Ca, Fe, Ni, Ti, Pb, Si, Sn, Visc 40, Visc 100, and Sediment. These were compared to warning and action limits in order to decide whether or not the wheel motor should be removed preventively. After applying a PHM to the data set, it was found that there were only two key risk factors, that is, oil analysis measurements that were highly correlated to the risk of the wheel motor failing: iron (Fe) and sediment. The cost consequence of a wheel motor failure was estimated as being three times the cost of replacing it preventively. The economic advantage of following the optimal replacement strategy was a cost reduction of 22%. Fuller details are available in Jardine et al. [2001].

5.2.4.3 Use of visual inspection: Transportation

Traction motor ball bearings on trains were inspected at regular intervals to determine the color of the grease, which could be in one of four states: light grey, grey, light black or black. Depending on the color of the grease, and knowing the next inspection time, a decision was made either to replace the ball bearings or to leave them in service. As a result of building a PHM relating the hazard of a bearing failing before the next planned inspection, a decision was made to reduce the interval between checks dramatically, from 3.5 years to 1 year. Before the study was undertaken the transportation organization was suffering, on average, 9 train stoppages per year. The expected number with a reduced inspection interval was estimated to be one per year. In the year following the study the transportation system identified two system failures due to a ball bearing defect. The overall economic benefit was identified as a reduction in total cost of 55%. It should be mentioned that this included the cost of additional inspectors and took into account the reduction in passenger disruption. A “notional” cost was identified with passenger delays.

5.2.5 Further Comments

Case studies dealing with the optimization of CBM decisions in the utilities sector include nuclear plant refueling [Jardine et al, 2003] and turbines in a nuclear plant [Chevalier et al, 2004].

5.3 Software for CBM Optimization

To ease the application of the theory described in Section 5.2, a software package named EXAKT (www.omdec.com) has been developed. As explained by Wiseman [2004], “EXAKT takes processed signals, correlates them with past failure and potential failure events. Using modeling, it subsequently provides failure risk and residual life estimates tuned to the economic considerations and the availability requirements for that asset in its current operating context”.

Table 5-1 shows the form of condition monitoring data that EXAKT requires if the CM tool is vibration monitoring. In addition, “event data” is required. This is information about when equipment went into service and when it came out of service. It also covers any maintenance interventions that took place between installation and removal of the equipment, such as the events defined in Table 5-2, which may affect interpretation of the CM data. A sample of the vibration analysis event data for the example being illustrated in this section is provided in Table 5-3, where the working age of the bearing being monitored is measured in days.

Table 5-1 Vibration Monitoring Data.

Table 5-2: Different Forms of Event Data

Definition of an event:

1. A beginning event. This indicates the start of a history (a “history” is the time from installation to removal of an item). Designated by “B”.
2. A failure event. Designated by “EF” (ending with failure).
3. A preventive replacement. Designated by “ES” (ending by suspension).

An event is also an occurrence during a history which affects the condition data. Here are some examples:

1. An oil change
2. A rotor balance
3. A shaft/coupling alignment
4. A soft foot correction
5. Tightening, calibration, minor adjustments that affect the condition data
6. A filter replacement
7. and so on

Table 5-3 Vibration Analysis Event Data.
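As a rough illustration of how event data of this kind might be organized before model fitting, the sketch below stores events with the codes defined in Table 5-2 and summarizes each history by its final working age and whether it ended in failure. The record layout, the item identifiers, the intermediate code "OC" (oil change) and the sample values are all hypothetical; they are not the EXAKT data format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    item_id: str        # hypothetical identifier of the monitored bearing
    working_age: float  # working age in days at the time of the event
    code: str           # "B", "EF", "ES", or an intermediate event code


def summarize_histories(events: List[Event]) -> Dict[str, Tuple[float, bool]]:
    """Map item_id -> (working age at end of history, ended_in_failure).

    Simplifying assumption: each item_id has exactly one history.
    """
    summary: Dict[str, Tuple[float, bool]] = {}
    for ev in sorted(events, key=lambda e: (e.item_id, e.working_age)):
        if ev.code == "EF":      # ending with failure
            summary[ev.item_id] = (ev.working_age, True)
        elif ev.code == "ES":    # ending by suspension (preventive)
            summary[ev.item_id] = (ev.working_age, False)
    return summary


# Hypothetical event records for two bearings.
events = [
    Event("B1", 0.0, "B"),
    Event("B1", 120.0, "OC"),   # "OC" = oil change, an assumed code
    Event("B1", 310.0, "EF"),
    Event("B2", 0.0, "B"),
    Event("B2", 250.0, "ES"),
]
print(summarize_histories(events))
```

Histories that end with "ES" give only a lower bound on the time to failure, so a fitting routine would treat them as suspended (censored) observations rather than as failures.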

Data from Table 5-1 and Table 5-3 are used to obtain the PHM. The same data are used to obtain the transition probabilities, which are then used in combination with cost data to obtain the optimal decision figure; see Banjevic et al [2001]. Table 5-4 is an example of the transition probability matrix for the vibration measurement “velocity in the axial direction, first band width” when the interval for the transition is specified as 30 days. Thus, if today the velocity is in the range 0.15 – 0.22, there is a probability of 0.37788 that the equipment will be in the same state 30 days from today. Similarly, the table can be used to estimate the probability of the equipment being in a failure state in 30 days’ time as 0.199714. Transition probabilities are provided for all possible combinations of states.

Table 5-4 Transition Probability Matrix. [States: Very Smooth, Smooth, Rough, Very Rough, Failure. Inspection interval = 30 days.]
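The way such a matrix is read, and how it can be extended beyond one inspection interval, can be sketched as follows. The five state names are those of Table 5-4, but the probabilities are hypothetical round numbers (the table values themselves are not reproduced here), and the sketch assumes a time-homogeneous Markov chain with an absorbing Failure state, which is a simplification.

```python
from typing import List

STATES = ["Very Smooth", "Smooth", "Rough", "Very Rough", "Failure"]

# Hypothetical 30-day transition probabilities; each row sums to 1.
# Row = state today, column = state 30 days from today.
P30 = [
    [0.70, 0.20, 0.05, 0.03, 0.02],
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.05, 0.15, 0.50, 0.20, 0.10],
    [0.00, 0.05, 0.15, 0.50, 0.30],
    [0.00, 0.00, 0.00, 0.00, 1.00],  # Failure is absorbing
]


def mat_mul(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def n_step(p: List[List[float]], n: int) -> List[List[float]]:
    """Transition probabilities over n intervals (n >= 1)."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result


P60 = n_step(P30, 2)
rough, failure = STATES.index("Rough"), STATES.index("Failure")
print(P30[rough][failure], P60[rough][failure])
```

With these made-up values, an item that is "Rough" today has a probability of 0.10 of being in the failure state after 30 days and 0.2185 after 60 days.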

Finally using the PHM, transition matrices and the costs associated with preventive and failure replacement, the figure used for decision-making is obtained – Figure 5.4.

Figure 5.4 Optimizing the CBM Decision. [Decision chart titled "Vibration Monitoring Decision".]

Thus whenever an inspection is made the values of the key risk factors are obtained. In this case the key risk factors are: velocity in the axial direction, first band width; velocity in the axial direction, second band width; and velocity in the vertical direction, first band width. These measurements are multiplied by their weighting factors, 5.8312, 36.552 and 24.053 respectively, and then added together to give a Z-value, which is marked on the Y-axis. The X-axis defines the age of the item (a bearing in this example) at the time of inspection. The intersection of a horizontal line from the Z-value and a vertical line from the age indicates the optimal decision. If the intersection is in the light shaded area (green), the recommendation is to continue operating; with reference to the lower plot in Figure 5.3, the cost curve is still declining. If the intersection is in the dark shaded area (red), the recommendation is to replace; with reference to the lower plot in Figure 5.3, the cost curve is now in the increasing range. If the intersection lies in the clear area, it indicates that the optimal change-out time is between two inspections.

On the site www.omdec.com there is a detailed explanation of EXAKT along with the answers to many frequently asked questions and a number of tutorial problems. The Chapter “Interpretation of inspection data emanating from equipment condition monitoring tools: Method and software” in Mathematical and Statistical Methods in Reliability [Armijo, Y.M. (Editor), (2005)] provides an overview of the theory and application of the CBM optimization approach presented in this section.
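The Z-value calculation described above, and the reason the boundary on the decision chart depends on age, can be sketched as follows. Setting the hazard of equation (5.1) equal to a threshold d* and solving for the weighted sum gives the condition Z ≥ ln(d* η^β / β) − (β − 1) ln t for replacement. The three weights are those quoted in the text; the values of β, η and d*, the velocity readings and the function names are hypothetical, and the sketch distinguishes only "continue" and "replace", without modeling the clear area of the chart.

```python
import math
from typing import Sequence

# Weighting factors quoted in the text for the three key vibration measurements.
WEIGHTS = (5.8312, 36.552, 24.053)


def z_value(velocities: Sequence[float]) -> float:
    """Weighted sum of the three key risk factors."""
    if len(velocities) != len(WEIGHTS):
        raise ValueError("expected three velocity readings")
    return sum(w * v for w, v in zip(WEIGHTS, velocities))


def recommend(age_days: float, velocities: Sequence[float],
              beta: float, eta: float, d_star: float) -> str:
    """Return 'replace' when the PHM hazard reaches d_star, else 'continue'."""
    if age_days <= 0.0:
        raise ValueError("age must be positive")
    # Boundary obtained by solving h(t, Z) = d_star for Z.
    limit = math.log(d_star * eta ** beta / beta) - (beta - 1.0) * math.log(age_days)
    return "replace" if z_value(velocities) >= limit else "continue"


# Hypothetical model parameters and readings, for illustration only.
BETA, ETA, D_STAR = 2.0, 400.0, 0.01
print(recommend(100.0, (0.05, 0.02, 0.01), BETA, ETA, D_STAR))
print(recommend(100.0, (0.10, 0.04, 0.03), BETA, ETA, D_STAR))
```

Because β is greater than 1 in this example, the boundary falls as the item ages, so a given Z-value that is acceptable for a young bearing can call for replacement in an older one.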

5.4 Recent Developments

5.4.1 Conditional distribution of time to failure [Banjevic and Jardine, 2005]

Within the framework of the statistical models introduced in Section 5.2, the conditional reliability function of the item, given the current state of the covariate process, can be expressed as follows:

$$R(t \mid x, i) = P(T > t \mid T > x, Z(x) = i) = \sum_{j} L_{ij}(x, t) \qquad (5.4)$$

Once the conditional reliability function is calculated we can obtain the conditional density from its derivative. We can also find the conditional expectation of $T - t$, termed the remaining useful life (RUL), as

$$E(T - t \mid T > t, Z(t)) = \int_{t}^{\infty} R(x \mid t, Z(t))\, dx \qquad (5.5)$$

In addition, the conditional probability of failure in a short period of time $[t, t + \Delta t]$ can be found as

$$P(\text{fail during } [t, t + \Delta t] \mid t, Z(t)) = R(t \mid t, Z(t)) - R(t + \Delta t \mid t, Z(t)) \qquad (5.6)$$

For a maintenance engineer, predictive information based on current CM data, such as RUL and probability of failure in a certain period of time, can be a valuable tool for assessment of risks and planning appropriate maintenance actions.
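A minimal numerical sketch of equations (5.5) and (5.6) is given below. It makes a strong simplifying assumption that is not part of the source model: the covariates are frozen at their current values, so the conditional reliability follows directly from the Weibull PHM of equation (5.1) instead of from the Markov covariate process. The function names, the integration step and the parameter values in the usage lines are hypothetical.

```python
import math


def conditional_reliability(x: float, t: float, beta: float, eta: float,
                            z_sum: float) -> float:
    """P(T > x | T > t) for a Weibull PHM with the covariate sum held fixed."""
    scale = math.exp(z_sum)
    return math.exp(-scale * ((x / eta) ** beta - (t / eta) ** beta))


def remaining_useful_life(t: float, beta: float, eta: float, z_sum: float,
                          step: float = 1.0, tol: float = 1e-9) -> float:
    """Trapezoidal approximation of the integral in equation (5.5)."""
    total, x, r_prev = 0.0, t, 1.0
    while r_prev > tol:          # stop once the survival tail is negligible
        r_next = conditional_reliability(x + step, t, beta, eta, z_sum)
        total += 0.5 * (r_prev + r_next) * step
        x += step
        r_prev = r_next
    return total


def failure_probability(t: float, dt: float, beta: float, eta: float,
                        z_sum: float) -> float:
    """Equation (5.6): R(t | t) - R(t + dt | t), with R(t | t) = 1."""
    return 1.0 - conditional_reliability(t + dt, t, beta, eta, z_sum)


# Hypothetical item: age 200 days, beta = 2, eta = 400 days, covariate sum 1.2.
print(remaining_useful_life(200.0, 2.0, 400.0, 1.2))
print(failure_probability(200.0, 30.0, 2.0, 400.0, 1.2))
```

A quick sanity check is the case β = 1 with no covariate effect, where the lifetime is exponential and the RUL equals η whatever the current age.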

5.5 EXAKT Summary

The current state of development of the software, named EXAKT™, allows the user to:

• Create a convenient database by extracting the event and condition (inspection) data from external databases;
• Detect logical errors in the databases;
• Perform data analysis and preprocessing, using graphical and statistical analysis;
• Estimate parameters of the PHM and Markov process model. The model can be evaluated based on such statistical tests as the Wald test, log-likelihood test, Kolmogorov-Smirnov test, and χ² test for independence of covariates and for homogeneity of the Markov process;
• Calculate and graphically present the conditional probability distribution for a given item and provide such characteristics as RUL and probability of failure in a short time period;
• Compute and save the optimal replacement policy. Alternate policies are also available based on Age and Block replacement strategies;
• Perform separate analysis for different failure modes or components of the system and create an integrated decision module;
• Make and save decisions for current records whenever it is required, using the developed decision model.

Figure 5.5 illustrates the principle of the software and the way it can be used in decision-making. As outlined above, the program utilizes the age data and the condition-monitoring data in order to produce a statistical model, which in turn can be used to derive useful justified predictions and/or to optimize economic considerations. It is our belief that when supplied with the results of these analyses, an engineer can make better maintenance decisions.

Figure 5.5 Principle of EXAKT™.

5.5.1 Marginal Analysis in EXAKT™

For a multi-component system, or a system with multiple failure modes, the software has an option called Marginal Analysis. Under this option, for a single set of data, separate models can be built for different components (or failure modes) and then integrated to produce one general decision model. Separate analyses of different components (or failure modes) can support better planning and scheduling of preventive maintenance activities, more targeted work orders, possibilities for opportunistic preventive maintenance, etc. However, marginal analysis requires additional information on the lifetime history of equipment, such as the classification of failure events, which might not always be accessible.

One of the case studies undertaken by the CBM lab analyzed the performance of diesel engines employed on ships. As many as ten different failure modes were defined, five of which were found to be related to the available condition monitoring data (oil analysis data) collected by the user over the years. If the distinction between failure modes had been ignored, interactions between different causes of failure could have led to the conclusion that time was not a significant risk factor for the engine. When the failure modes were analyzed separately, however, it was possible to build time-dependent statistical models at the component level and thus derive more targeted policies for component replacements. In terms of the system, this translated into a component replacement strategy that yielded a 20% to 50% improvement (depending on the ratio of the costs of planned and failure replacements) in the long-run cost per unit time compared with the run-to-failure strategy.

A challenge remains to develop theory revealing the relations between different components (or failure modes) within a system. This problem, among others, is one of the current research interests of the CBM lab. An approach to the analysis and modeling of complex systems, as well as a review of the literature, can be found for example in Lugtigheid et al [2004].

5.6 Conclusion

The growing competitiveness in the industrial world is driving the interest in improving asset effectiveness. The application of condition monitoring techniques is growing and creates a challenge to develop appropriate decision-making strategies. Statistical modeling of acquired data and economic considerations of maintenance activities have proven to be useful for making evidence-based decisions and building justified predictions for the future behavior of the equipment. Development of theoretical optimization models should be followed by the development of software for analysis of condition-monitoring and equipment lifetime data in order to ensure successful implementation of new techniques in industry.

5.7 References

[1] Anderson, M., Jardine, A.K.S., and Higgins, R.T., (1982), The use of concomitant variables in reliability estimation, Modeling and Simulation, Vol 13, pp 73-81.

[2] Armijo, Y.M. (Editor), (2005), Interpretation of inspection data emanating from equipment condition monitoring tools: Method and software, in Mathematical and Statistical Methods in Reliability, World Scientific Publishing Company.

[3] Banjevic, D., Jardine, A.K.S., Calculation of reliability function and remaining useful life for a Markov failure time process, IMA Journal of Management Mathematics, [Online] doi:10.1093/imaman/dpi029, 2005.

[4] Banjevic, D., Jardine, A.K.S., Makis, V. and Ennis, M., (2001), A control-limit policy and software for condition-based maintenance optimization, INFOR, Vol 39, pp 32-50.

[5] Barlow, R.E., Hunter, L.C., Optimum preventive maintenance policies, Operations Research, Vol. 8, pp. 90–100, 1960.

[6] Chevalier, R., Benas, J-C., Garnero, M.A., Montgomery, N., Banjevic, D. and Jardine, A.K.S., (2004), Optimizing CM Data from EDF Main Rotating Equipment Using Proportional Hazard Model, Surveillance5 Conference, France, October 11-13, 2004.

[7] Cox, D.R., (1972), Regression models and life tables (with discussion), J. Roy. Stat. Soc. B, 34, 187-220.

[8] Dunn, S., Condition monitoring in the 21st century, [Online] http://www.plant-maintenance.com/articles/ConMon21stCentury.shtml, 2005.

[9] Jardine, A.K.S., Banjevic, D., Montgomery, N., and Pak, A., Repairable system reliability: recent developments in CBM optimization, International Journal of Performability Engineering, (in press).

[10] Jardine, A.K.S., Banjevic, D., Wiseman, M., Buck, S., (2001), Optimizing a mine haul truck wheel motors’ condition monitoring program, Journal of Quality in Maintenance Engineering, No 1, pp. 286-301.

[11] Jardine, A.K.S., Joseph, T. and Banjevic, D., (1999), Optimizing condition-based maintenance decisions for equipment subject to vibration monitoring, Journal of Quality in Maintenance Engineering, Vol. 5, No. 3, pp 192-202.

[12] Jardine, A.K.S., Kahn, K., Banjevic, D., Wiseman, M. and Lin, D., (2003), An Optimized Policy for the Interpretation of Inspection Data from a CBM Program at a Nuclear Reactor Station, COMADEM, Sweden, August 27-29.

[13] Jardine, A.K.S., and Tsang, A.H.C., Maintenance, Replacement, and Reliability: Theory and Applications, CRC Press, Taylor and Francis, 2006.

[14] Kalbfleisch, J.D., and Prentice, R.L., (1980), The statistical analysis of failure times, Wiley.

[15] Lugtigheid, D., Banjevic, D., Jardine, A.K.S., Modelling repairable system reliability with explanatory variables and repair and maintenance actions, IMA Journal of Management Mathematics, Vol. 15, pp. 89–110, 2004.

[16] Makis, V., Jardine, A.K.S., (1992), Optimal Replacement in the Proportional Hazards Model, INFOR, Vol. 30, pp 172-183.

[17] Wiseman, M., (2004), Private communication.

5.8 Biography

Andrew K.S. Jardine, Ph.D., C.Eng., M.I.Mech.E., M.I.E.E., P.Eng. is Professor and Principal Investigator at the Condition-Based Maintenance (CBM) Laboratory at the University of Toronto, where the EXAKT software for CBM optimization and the SMS software for the optimization of emergency spares have been developed. The CBM Laboratory is funded by the following 10 organizations. From Canada: ABB, Department of National Defence, Diavik Diamond Mines, Dofasco Steel, Hydro One, INCO, Irving Pulp and Paper, Syncrude Canada and Teck Cominco; and internationally: the Ministry of Defence (U.K.). CBM lab details can be found at www.mie.utoronto.ca/cbm. Dr. Jardine also serves as an advisor to IBM’s Asset Management Centre of Excellence. Dr. Jardine is the author of the economic life software AGE/CON and PERDEC, which is licensed to organizations including transportation, mining, electrical utilities, and process industries, and is the author of the OREST software used for optimizing component preventive replacement decisions and forecasting demand for spare parts. Professor Jardine wrote the book “Maintenance, Replacement and Reliability”, first published in 1973 and now in its 6th printing. He is the co-editor, with J.D. Campbell, of the book “Maintenance Excellence: Optimizing Equipment Life Cycle Decisions”, published in 2001. His new book “Maintenance, Replacement & Reliability: Theory and Applications”, co-authored with Dr. A.H.C. Tsang, was published by CRC Press in 2006. Professor Jardine was the 1993 Eminent Speaker to the Maintenance Engineering Society of Australia and in 1998 was the first recipient of the Sergio Guy Memorial Award from the Plant Engineering and Maintenance Association of Canada in recognition of his outstanding contribution to the Maintenance profession. He is listed in Who’s Who in Canada. ([email protected])

IEEE Tutorial on Asset Management – Maintenance and Replacement Strategies, June 24-28, 2007, Tampa, USA

6 Computer program for decision support in the management of equipment maintenance

Dr. G.J. Anders, Fellow IEEE

Kinectrics Inc., Toronto, Canada

Abstract – In the new business environment of competition and re-regulation, the determination of asset values and methods of reaching the best investment decisions are of increasing interest. Traditional approaches to establishing maintenance and replacement expenditures can no longer satisfy regulators or bottom-line-driven decision-makers. Quantitative methods are needed which combine technical factors with financial and business risk factors. This document describes three closely related computer programs for selecting optimal maintenance policies. A combination of probabilistic, financial and engineering information is used to compute the effects of several maintenance programs on equipment reliability and the incurred costs. In two of the programs the approach is based on the evaluation of asset life curves, which describe equipment condition as a function of time. The computation of costs involves an analysis of the present value, over a given time horizon, of future capital investment and of the costs of maintenance and possible failures. A description of the features of the computer software implementing these concepts is presented, together with a numerical example. The third program takes a more general view of the optimal timing of major interventions requiring large capital investments for power equipment refurbishment. A numerical example is also included.

6.1 Introduction

In the emerging operating environment of deregulation and market-based competition, every management decision involves a certain amount of risk. These risks need to be evaluated and courses of action selected so that they are minimized. Quantitative risk evaluation requires analytical tools. This chapter describes three closely linked computer programs for decision support in the area of equipment maintenance.

For the maintenance (or asset sustainment) function at an electric utility, the following question is of particular interest: faced with multiple options for re-investment in equipment maintenance, what is the best course of action to take in order to maximize reliability at minimum cost? Typical options could be: (1) to continue the present maintenance policy; (2) to do nothing, i.e., to run the equipment in the future without any maintenance; (3) to perform a major overhaul, followed by the original or a modified maintenance policy; (4) to replace aging or failed equipment with new equipment and apply the original or a modified maintenance policy to the replacement.

The decision-maker can use several criteria for selecting the best re-investment policy. In the past, engineers operating an electric power system were mainly concerned about equipment reliability, with the financial aspect playing a secondary role. However, in the new economic environment the reliability and financial aspects of system operation will be equally important. Hence, both reliability and cost should be considered in the selection of maintenance alternatives. With this in mind, a substantial effort has been put into developing suitable decision support tools to address the question of option selection. The following sections build on earlier studies [1, 2] and describe three programs, AMP (Asset Management Planner), ARM (Asset Reliability Modeling) and LcmPlus, and their application in the re-investment decision process.

6.2 Asset Management Planner (AMP) Program

The Asset Management Planner is a computer program designed for equipment exposed to deterioration but undergoing maintenance at prescribed times [1]. It computes the probabilities, frequencies and mean durations of the states of such equipment. The basic ideas in the AMP model are the probabilistic representation of the deterioration process through discrete stages, and the provision of a link between deterioration and maintenance (Figure 6.1). Clearly, maintenance is expected to slow the rate of deterioration. In most applications, it is sufficient to represent deterioration by three stages: an initial (D1), a minor (D2), and a major (D3) stage. The last stage is followed, in due time, by equipment failure (F), which requires extensive repair or replacement. A detailed description of the principles governing the construction of an AMP model is given in Chapter 3 by Dr John Endrenyi; the present chapter concentrates on the computer implementation of this model.

Figure 6.1 Representation of deterioration and maintenance in the AMP model.
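The link between deterioration stages, maintenance and remaining life can be illustrated with a much simplified stand-in for the AMP model. The sketch below assumes a continuous-time Markov chain with states D1, D2, D3 and F, constant transition rates, and maintenance represented as a single transition from D2 back to D1. The inspection and maintenance states of the full model (Chapter 3) are not reproduced, and all rates are hypothetical.

```python
def mean_time_to_failure(rates, n_transient):
    """Mean first passage time to failure from each transient state.

    rates[i][j] is the transition rate (per year) from state i to state j;
    states 0..n_transient-1 are transient, the last column is failure F.
    Solves  q_i * t_i - sum_j q_ij * t_j = 1  by Gaussian elimination.
    """
    n = n_transient
    a = [[0.0] * n for _ in range(n)]
    b = [1.0] * n
    for i in range(n):
        a[i][i] = sum(rates[i])            # total rate of leaving state i
        for j in range(n):
            if j != i:
                a[i][j] = -rates[i][j]
    for col in range(n):                   # forward elimination with pivoting
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    t = [0.0] * n
    for i in reversed(range(n)):           # back substitution
        s = sum(a[i][j] * t[j] for j in range(i + 1, n))
        t[i] = (b[i] - s) / a[i][i]
    return t

# Hypothetical rates: each stage lasts 10 years on average.
#                 D1   D2   D3   F
no_maintenance = [[0.0, 0.1, 0.0, 0.0],
                  [0.0, 0.0, 0.1, 0.0],
                  [0.0, 0.0, 0.0, 0.1]]
# Same chain, plus maintenance that returns D2 to D1 at rate 0.1 per year.
with_maintenance = [[0.0, 0.1, 0.0, 0.0],
                    [0.1, 0.0, 0.1, 0.0],
                    [0.0, 0.0, 0.0, 0.1]]

life_without = mean_time_to_failure(no_maintenance, 3)
life_with = mean_time_to_failure(with_maintenance, 3)
```

With these illustrative rates the mean time from D1 to F is longer when the maintenance transition is present, which is the qualitative effect the model is meant to capture.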

6.2.1 Studies of the effect of changes in maintenance policies

6.2.1.1 Input data requirements

The input information needed to apply the model in this program includes the mean durations of the various stages of deterioration and of the inspection and maintenance activities, and the probabilities associated with the various choice and outcome possibilities. Estimates of these quantities are based on historical experience with similar units operated in similar conditions: usually they are obtained by analyzing the records of all abnormal operating conditions, maintenance activities and their results, and general observations from maintenance personnel.

6.2.1.2 Computed values

In order to assess the effect of maintenance policies on the remaining life and the associated cost, a number of cases are normally examined. The first is a base case study which models the present maintenance policy. The other cases may range from consideration of no maintenance at all to a full replacement of the equipment in question. The results give the remaining life of the equipment, the probabilities of residing in each deterioration state and the expected lifetime cost of the equipment. Sample output from the program is shown in Figure 6.2a. In addition to displaying the remaining life of the equipment, a sensitivity study is usually carried out to explore the effects of changing the inspection frequency. Figure 6.2b shows how the remaining life from entering the initial stage of deterioration varies with the time between inspections. The curve indicates that, not surprisingly, the higher the frequency of inspections, the longer the life of the equipment. There is, however, a price to be paid for the extension of the equipment life. To explore the cost aspect of the maintenance policy, the sensitivity study can be repeated for costs, with sample results shown in Figure 6.2c. The cost change for longer times between inspections is composed of two components: (1) a decrease in the maintenance cost caused by the reduction of maintenance activities, and (2) an increase in the replacement cost caused by an increase in the number of replacements of failed equipment. In this study, the latter component is always smaller than the former. Therefore, increasing the interval between inspections results in a decrease of the total maintenance and replacement costs.

Figure 6.2 Sample results from the AMP program: (a) remaining life and expected cost, (b) sensitivity analysis of remaining life as a function of the time between inspections, and (c) expected cost as a function of the time between inspections.

As routinely used, these calculations yield only the mean values of the durations involved. Often, however, questions of the type "what is the probability that the remaining time (to failure) is shorter (or longer) than a given value?" need to be answered. For this, the probability distributions of these times are required in addition to their mean values. These distributions can be obtained by extending the standard Markov and first passage time (FPT) techniques through Monte Carlo simulation [1].
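How a simulation yields a distribution rather than a mean can be sketched as follows. This is only an illustration under stated assumptions: a three-stage chain D1, D2, D3, F with exponentially distributed stage durations and no maintenance, hypothetical rates, and a fixed random seed. It is not the procedure of [1].

```python
import random

def simulate_time_to_failure(rates, n_transient, rng, start=0):
    """Draw one random time to failure by walking through the chain."""
    state, elapsed = start, 0.0
    while state < n_transient:
        total = sum(rates[state])
        elapsed += rng.expovariate(total)      # sojourn time in current state
        u = rng.random() * total               # choose the next state
        acc = 0.0
        for nxt, q in enumerate(rates[state]):
            acc += q
            if u < acc:
                state = nxt
                break
    return elapsed

# Hypothetical rates (per year); columns are D1, D2, D3, F.
rates = [[0.0, 0.1, 0.0, 0.0],
         [0.0, 0.0, 0.1, 0.0],
         [0.0, 0.0, 0.0, 0.1]]

rng = random.Random(12345)
samples = [simulate_time_to_failure(rates, 3, rng) for _ in range(20000)]

def prob_shorter_than(samples, limit):
    """Estimated probability that the time to failure is at most `limit`."""
    return sum(1 for s in samples if s <= limit) / len(samples)

mean_life = sum(samples) / len(samples)
p_within_30 = prob_shorter_than(samples, 30.0)
```

The sample mean approximates the mean first passage time, while `prob_shorter_than` answers the "shorter than a given value" question directly from the simulated sample.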

6.2.2 Generation of life curves

Equipment at substations, generating plants or transmission lines ages with time and in-service duty, and the probability of failure generally increases with time and usage as well. A convenient way to represent the aging process is by the life curve of the equipment. Such a curve shows the relationship between asset condition, expressed in either engineering or financial terms, and time. Since there are many uncertainties related to the prediction of equipment life, probabilistic analysis must be applied to construct and evaluate life curves. This analysis integrates directly into established classical decision and financial analysis methods. The objective is to determine the type of optimal asset sustainment action, and the year in which this action is to take place, so that the net present value (NPV) is maximized while not violating financial and reliability constraints. The subject is treated in more detail in Chapter 3 by Dr John Endrenyi; only a brief introduction to how the life curves are generated in the AMP computer program is given below. The generation of a life curve requires several steps, described in the following.

1. In the first step, decisions are made about where the borderlines lie between deterioration stages D1, D2 and D3, in terms of the (percentage) equipment condition. The results are entered into the program as shown on the screen in Figure 6.3, and marked on the vertical axis of the life curve diagram.

Figure 6.3 Selection of asset condition ranges for deterioration stages.

2. Next, input data are collected for the AMP model and FPT calculations are carried out by the program, to determine the first passage times between states D1 and D2, D1 and D3, and D1 and F. These are entered on the time axis of the life curve diagram.

3. The life curve must lie somewhere between the two borders. At time 0 it should be at 100%; at D1F, the first passage time from D1 to F, it should be at 0%. At the remaining two ordinates, by arbitrary decision, it will be at the midpoint of the respective domains, as shown in Figure 3.6.
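The three steps above can be sketched as a piecewise-linear curve through four breakpoints. The code is a minimal illustration, not the AMP implementation. It assumes that the curve is linear between breakpoints, and that the two intermediate ordinates are the midpoints of the condition bands chosen for D2 and D3; which band belongs to which breakpoint is an interpretation of step 3. The band limits and first passage times used in the example are hypothetical.

```python
def band_midpoint(band):
    """Midpoint of a (low, high) asset-condition band, in percent."""
    low, high = band
    return 0.5 * (low + high)

def life_curve(fpt_d1_d2, fpt_d1_d3, fpt_d1_f, band_d2, band_d3):
    """Breakpoints (time in years, condition in %) of the life curve."""
    return [
        (0.0, 100.0),                          # as new
        (fpt_d1_d2, band_midpoint(band_d2)),   # assumed ordinate at D1 -> D2
        (fpt_d1_d3, band_midpoint(band_d3)),   # assumed ordinate at D1 -> D3
        (fpt_d1_f, 0.0),                       # failure
    ]

def condition_at(curve, t):
    """Linearly interpolated asset condition (%) at time t."""
    if t <= curve[0][0]:
        return curve[0][1]
    for (t0, c0), (t1, c1) in zip(curve, curve[1:]):
        if t <= t1:
            return c0 + (c1 - c0) * (t - t0) / (t1 - t0)
    return curve[-1][1]                        # after failure

# Hypothetical inputs: first passage times of 10, 20 and 30 years,
# D2 band 40-70 %, D3 band 10-40 %.
curve = life_curve(10.0, 20.0, 30.0, (40.0, 70.0), (10.0, 40.0))
condition_after_5_years = condition_at(curve, 5.0)
```

Reading the curve in the other direction (from a condition value to an equivalent age) is what locates a piece of equipment on its life curve at the "present time".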

6.2.3 Finding an optimal inspection interval

Comparing the curves in Figure 6.2(b) and Figure 6.2(c), it is not obvious what the best inspection policy is in this case. On the one hand, an increase in the interval between inspections reduces the remaining life of the equipment; on the other hand, the life cycle costs decrease. Recently, a new mathematical model for the selection of an optimal maintenance policy using the AMP model has been proposed [5]. The original model proposed in [1] presented a method for calculating the remaining life of equipment. Reference [5] defines several possible optimization procedures for finding the best maintenance policy and demonstrates an implementation of the simulated annealing algorithm for this purpose. The whole procedure is illustrated with a practical numerical example involving high voltage circuit breakers. The objective function is composed of three components: (1) the remaining life of the equipment, represented in the model as the first passage time (FPT) from the current deterioration state to the failure state; (2) the life cycle costs, represented as the cost of maintenance and failure; and (3) equipment unavailability. The goal is thus to define an optimization model that minimizes a function of these three parameters, that is:

$$F(\mathbf{r}) = \min f(\text{total\_cost},\ -\text{FPT},\ \text{unavailability}) \qquad (6.1)$$

Vector r represents the parameters of the model that can be varied, all of which are related to the amount of money that the utility is willing to spend on maintenance activities for a particular piece of equipment. Thus, the model assumes that putting more money into maintenance activities can result in faster repairs, more thorough work, or both. More thorough work is translated into an increased probability of the equipment ending up in a better deterioration state after the repair. Function f transforms the three quantities so that they are expressed in the same units of measurement. In addition, the model allows representation of various degrees of risk aversion of the person performing the analysis. The results of the analysis are new parameters of the semi-Markov model and the revised (optimized) values of the first passage times, component unavailability and lifetime cost.
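A minimal sketch of simulated annealing applied to an objective of the form (6.1) is given below. Everything model-specific is an assumption made for illustration: `evaluate` is a hypothetical stand-in for the semi-Markov model, the weighted sum in `objective` is one simple way of bringing the three quantities to common units, and the cooling schedule and step size are arbitrary. It does not reproduce the formulation of [5].

```python
import math
import random

def evaluate(r):
    """Hypothetical response of the equipment to maintenance spending r.

    r = (spend on minor work, spend on major work), in k$ per year.
    Returns (total_cost, fpt, unavailability).
    """
    spend_minor, spend_major = r
    fpt = 10.0 + 6.0 * math.sqrt(spend_minor) + 4.0 * math.sqrt(spend_major)
    unavailability = 0.02 / (1.0 + spend_minor) + 0.001 * spend_major
    total_cost = spend_minor + spend_major + 200.0 / fpt
    return total_cost, fpt, unavailability

def objective(r, w_cost=1.0, w_life=0.5, w_unavail=100.0):
    """Assumed form of f: weighted sum with -FPT, so longer life is rewarded."""
    total_cost, fpt, unavailability = evaluate(r)
    return w_cost * total_cost - w_life * fpt + w_unavail * unavailability

def simulated_annealing(start, bounds, rng, steps=5000,
                        temp0=1.0, cooling=0.999, step=0.5):
    current, f_cur = list(start), objective(start)
    best, f_best = list(current), f_cur
    temp = temp0
    for _ in range(steps):
        # Perturb every parameter and clamp it to its allowed range.
        cand = [min(hi, max(lo, x + rng.gauss(0.0, step)))
                for x, (lo, hi) in zip(current, bounds)]
        f_cand = objective(cand)
        # Always accept improvements; accept worse points with a
        # probability that shrinks as the temperature falls.
        if f_cand <= f_cur or rng.random() < math.exp((f_cur - f_cand) / temp):
            current, f_cur = cand, f_cand
            if f_cur < f_best:
                best, f_best = list(current), f_cur
        temp *= cooling
    return best, f_best

start = [1.0, 1.0]
bounds = [(0.0, 10.0), (0.0, 10.0)]
best_r, best_value = simulated_annealing(start, bounds, random.Random(0))
```

Changing the weights is a crude way of expressing different attitudes to risk: a larger `w_unavail`, for example, penalizes outages more heavily relative to cost.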

6.3 Asset Reliability Model (ARM) Program

The aim of this program is to use the information provided by the life curves generated by AMP to help in the selection of the maintenance policy option.

6.3.1 Input required

A typical study involves the establishment of a semi-Markov model for the equipment in question, including maintenance states and appropriate transitions to and from them. The study can be directed to analyze life curves, cost curves, or probabilities of failure. These are carried out by inputting data and other information with the help of a series of screens. The first screen asks for instructions as to the analyses to be performed and the results to be displayed. On the next three screens input data can be entered. This data can be automatically copied from the AMP program. The results are shown in the form of graphs as illustrated in the example that follows. On the screen, these graphs appear in color.

6.3.2 Cost computations

In many financial evaluations, the costs are expressed as present value quantities. The present value approach is also used in this study because maintenance decisions on aging equipment include timing, and the time value of money is an important consideration in any decision analysis. In selecting the best course of action, the proposed alternatives are compared with some reference action. The corresponding cost difference is often referred to as the Net Present Value (NPV). In the case of maintenance, the NPV can be obtained for several re-investment options, each compared with the present maintenance policy. Cost computations involve calculation of the following cost components:

1. cost of maintenance activities,

2. cost of the action selected (overhaul or replacement),

3. costs associated with failures (cost of repairs, system cost, penalties).

The costs are given as Present Value (PV). To compute the PV, inflation and discount rates are required for a specified time horizon. The time horizon is a period of time, starting at the present and ending after a chosen number of years, for which the costs of the various operating and maintenance options are calculated and compared. The costs associated with equipment failure over the time horizon are computed as the sum of two components: one for failures that occur before the action is taken (during the delay period) and one for failures that occur after. These costs are multiplied by the probabilities of failures before and after the action, respectively, and the two products are added.
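The cost components described above can be sketched as follows. The text does not give the formulas used by ARM, so the code assumes a common convention: a cost quoted in today's dollars is inflated to the year in which it occurs and then discounted back. It further assumes, purely for illustration, that a failure is costed at the midpoint of the period in which it occurs. The failure probabilities and the failure cost in the example are hypothetical.

```python
def present_value(cost_today, year, inflation, discount):
    """PV of a cost quoted in today's dollars and paid `year` years from now."""
    return cost_today * ((1.0 + inflation) / (1.0 + discount)) ** year

def pv_recurring(cost_today, interval_years, horizon, inflation, discount):
    """PV of an activity repeated every `interval_years` within the horizon."""
    count = int(horizon / interval_years + 1e-9)
    return sum(present_value(cost_today, k * interval_years, inflation, discount)
               for k in range(1, count + 1))

def expected_failure_cost(failure_cost, p_before, p_after, delay, horizon,
                          inflation, discount):
    """Sum of the before-action and after-action failure cost components."""
    # Assumption: each failure is costed at the midpoint of its period.
    t_before = delay / 2.0
    t_after = delay + (horizon - delay) / 2.0
    return (p_before * present_value(failure_cost, t_before, inflation, discount)
            + p_after * present_value(failure_cost, t_after, inflation, discount))

# Illustrative figures: 10-year horizon, 3% inflation, 5% discount rate,
# an action costing 75,000 taken after a 3-year delay.
minor = pv_recurring(700.0, 8.0 / 12.0, 10.0, 0.03, 0.05)
action = present_value(75_000.0, 3.0, 0.03, 0.05)
failures = expected_failure_cost(20_000.0, 0.05, 0.10, 3.0, 10.0, 0.03, 0.05)
total = minor + action + failures
```

Repeating the calculation for each option and subtracting the result for the reference policy gives the cost differences that the text refers to as NPV.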

6.3.3 Sample application: maintenance of high voltage air blast breakers

6.3.3.1 General

This study involves the analysis of several breakers with a total operating history of about 100 breaker-years [1]. According to the current policy, three types of maintenance are routinely performed on each breaker. About every eight months, minor maintenance is performed involving timing adjustments and lubrication at a cost of about $700. Its average duration is 0.25 day. Medium maintenance involving replacement of some parts, taking on average 2 days at a cost of about $6000, is performed approximately every ten years. Major maintenance involving breaker overhaul takes place every twelve years with an average duration of 22 days and a cost of about $75,000. It follows that in this application a simplified form of the AMP model is used: instead of having regular inspections, and maintenance performed only as needed, the various types of maintenance are performed at regular (but still stochastically determined) intervals. Note that if at a given point in time it is decided that the optimal maintenance policy is, say, to perform an overhaul as soon as possible and then continue by resuming the original maintenance routine, this overhaul is out of step with the original policy and incurs extra costs. As mentioned before, other alternatives include making no changes in the maintenance policy, stopping all maintenance altogether, or installing a new breaker.

6.3.3.2 Financial information

The financial assumptions used below are usually well established in the approved financial procedures of a corporation, and are available to the engineer. These assumptions must be included because they have a significant effect on the impact of re-investment action timing. Generally, two sets of financial assumptions must be considered. The first set concerns the time value of the dollar. It includes the projected inflation rate, to account for the eroding value of money with time, as well as the corporate discount rate used to set a required return on investment. The second set has to do with the composite income tax and the property tax rates. In the example presented here, only the first set of financial assumptions is considered, with the following numerical values.

Time horizon: 10 years
Inflation rate: 3%
Discount rate: 5%

The system and penalty costs associated with equipment failure are assumed to be $10,000 each.

In order to calculate the effect of the proposed action, we need to specify the asset condition, or asset value, at “present time” (the beginning of the time horizon). In this example, it is at 80% which, for the given equipment, corresponds to 20 years of service. This information determines where the equipment is located on the life curve.

6.3.3.3 Engineering information

The engineering information required is a simple description of the current maintenance practices. In the breaker example, the three types of maintenance routines mentioned above are modeled. In order to analyze re-investment alternatives, possible options need to be defined. Such options were discussed before. More can be added or some deleted. In case of failure, the user has a choice of either repairing or replacing the equipment. In the former case, the condition of the asset after repair has to be specified. In case of replacement, a new equipment type can be entered, if desired.

The option “Continue As Before” represents the current maintenance policy and cannot be deleted. This option does not require any additional parameters. The “Do Nothing” option (named “Stop All Maintenance” in this example) requires only one additional parameter, the delay period after which this “action” is implemented. The “Overhaul” and “Replacement” options require three parameters: delay, cost of action, and the state of the equipment after the action has been taken (in the replacement case, it is assumed that the equipment returns to the 100% condition level). Note that in these cases, the overhaul and replacement actions are carried out just once, and it is assumed that after the action regular maintenance continues either in the original or in a changed form. In the present example, the policy will change slightly: the minor maintenance after overhaul or replacement will be performed once every 15 months rather than every eight months. With the financial and engineering data specified, the calculation of reliability and costs can now proceed.

6.3.3.4 Life curves

Figure 6.4 shows three typical life curves for the selected breaker. Curve (a) describes the existing maintenance policy as calculated by the program (action: “Continue As Before”). Curve (b) is valid for the “reduced” maintenance policy where minor maintenance is performed less frequently, as specified above. Curve (c) describes conditions where no maintenance is performed at all. Note that the life curves always start at 100% asset condition and the policies shown end when a failure occurs.

Figure 6.4 Life curves computed by the program: (a) present maintenance policy, (b) reduced maintenance policy, (c) no maintenance.

The curves can be edited manually by inserting or deleting points. In the present study it is assumed that the mid-point for the as-new stage, in terms of asset condition, is 68%, for minor deterioration it is 25%, and for major deterioration 8% (not shown in the figure). Failure is at 0%. Figure 6.5 shows two life curves. Curve (a) represents the option where replacement is carried out after a 3-year delay from the present time and, following that, regular maintenance is continued, but in the “reduced” form. The time horizon is indicated by a heavy line on the time axis and it begins, as explained before, at the “present time”. In the replacement action, the equipment is assumed to return to the “as new” condition. Since the new policy prescribes less frequent maintenance, the 40-year life expectancy of a new breaker (Figure 6.4) shrinks to 26 years; thus the replacement of a fairly good breaker after 23 years (20 years of service plus the 3-year delay) results in an only slightly improved expected life of 49 years.

Figure 6.5 Life curves with (a) replacement, (b) no maintenance, after a 3-year delay.

Curve (b) in Figure 6.5 represents the interesting situation where a decision is made to abandon all maintenance activities altogether after a 3-year delay period. Upon equipment failure, a repair is performed that brings the equipment to an assumed 90% of its original condition but, again, no further maintenance is performed afterwards. Since the entire life of a breaker without any maintenance is about 7 years (see Figure 6.4), the repair after failure adds only about six years to the life of the equipment.

6.3.3.5 Cost diagrams

Cost computations involve the calculation of the expected numbers of failures and the various types of maintenance activities during the specified time horizon. These expected numbers are computed separately for the periods before and after the action. The cost of each maintenance activity is then expressed by its present value. The probabilities of failures before and after the action are either computed by the program or entered by the user if the life curves are specified by him/her. The cost curves are then presented as functions of the delay.

Figure 6.6 illustrates the present values of the costs for all options, with a three-year delay for each. This diagram shows that in the case of a 3-year delay in starting a new policy, the best action (of those considered) is to continue with the original maintenance policy. The expected cost of this is $100,000 for the 10-year time horizon. The costs are the highest for the “Stop All Maintenance” option because the probability of failure after 3 years is much higher than for the other options. The maintenance cost is high for the “Continue As Before” policy because minor maintenance is performed quite often and, during the time horizon, a major maintenance can also be expected to occur.

Figure 6.6 Cost diagram for various actions performed after a three-year delay.

6.3.3.6 Probability of failure

In order to compute the expected costs during the specified time horizon, the probability of failure within this time period is required. The probabilities for each option are computed by ARM and can be displayed as functions of the delay. The before-action and after-action values can be obtained separately, or in a composite form [2].

6.3.3.7 Sensitivity studies

Looking at the results in Section 6.3.3.5, the question arises how “robust” the findings are if some of the input values are subject to uncertainty. To find an answer, several of the inputs were varied to see how these changes affect the costs of options, and the selection of the preferred option.

Some of the results are shown in Figure 6.7. The diagrams indicate the present values of the costs associated with each option for two time horizons, 10 years and 20 years, and for a range of delays before the actions are implemented. One can observe that the option “Continue As Before” is the least expensive, closely approached by the “Do Overhaul” option in certain ranges. Thus, in this example, “Continue As Before” appears to be a “robust” choice.

The sudden jump at 4.5 years occurs because during the delay period the original maintenance policy is continued and in the course of this a major maintenance is expected at 4.5 years. If the delay is less, this major maintenance will not happen because the maintenance schedule is restarted at the time of action.

Figure 6.7 Costs of options for two time horizons in terms of the delay in action.

It is interesting to find that costs are not at a minimum if action is taken without delay. This is partly because the present value of the cost of action becomes less if the action is delayed, and partly because during the delay period the comparatively cheap original maintenance policy is applied.
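The first of these two effects can be made concrete with a short worked calculation. It assumes a common convention that the text does not state explicitly: a cost C quoted in today's dollars and incurred after a delay of t years is inflated at rate i and discounted at rate d. With the rates of Section 6.3.3.2 (i = 3%, d = 5%):

$$\mathrm{PV}(t) = C\left(\frac{1+i}{1+d}\right)^{t}, \qquad \frac{1.03}{1.05} \approx 0.981, \qquad \mathrm{PV}(3) \approx 0.944\,C$$

Under this assumption, a three-year delay lowers the present value of the action cost by roughly 5.6%, and the reduction grows with the delay as long as the discount rate exceeds the inflation rate.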

If the inflation rate is varied between 1 and 10% (and the corresponding discount rate between 1.5 and 14%), all curves show a maximum near 2% inflation, and the “Continue As Before” option is still the most desirable in every case. The latter also holds if the rate of minor maintenance after action is varied between 0.5 and 2 occurrences per year (this range includes the rate of 1.5 per year, which represents the case of no change in maintenance policies after action). The costs vary hardly at all over this range.

6.4 Optimal refurbishment strategy

Both programs described above deal with analysis of various maintenance scenarios. Maintenance is aimed at slowing down the deterioration process of the equipment. The problem of ageing equipment is a universal engineering concern. Every system, structure or component (SSC) is designed to function for some specified period of time; the actual degree of deterioration during this specified time, however, will strongly vary by equipment and application. Carefully planned use of equipment, including appropriately selected maintenance policies, can reduce the number of failures and, thus, result in considerable savings. Moreover, it can prolong useful equipment life, thereby increasing equipment reliability. Obviously, the savings obtained must be balanced against the costs incurred by employing a possibly more costly maintenance plan.

The following developments present a software tool that enables Life Cycle Management (LCM) analysis. The first version of the software [6-7] allows the comparison of up to four alternative plans. This is done by computing, through simulation, the present values of the costs expected for each alternative, and also the Benefit to Investment Ratios (BIR) for alternatives B, C and D, assuming that plan A forms the base case (usually the present plan). In comparisons of this type, results are relative, not absolute numbers. Relative results minimize the effect of erroneous statistical inputs, as all alternatives are affected in the same way. The above approach was implemented in the Electric Power Research Institute (EPRI) LCM studies, completed by several nuclear power plants in the US, using a program called LcmVALUE [8-11]. Deterministic calculations compared the total NPV of up to four alternatives. In another approach, most of the parameters were treated as random variables with triangular distributions and Monte Carlo simulations were performed to obtain the mean values of these costs [11]. The industry has developed other LCM evaluation tools, including Westinghouse’s Proactive Asset Management (PAM) and the EPRI/STP/ABS Risk-Informed Asset Management (RIAM) method. These methods are distinctly different tools with unique features, yet each assists in LCM planning for important SSCs.

The LcmPlus program is an extension of EPRI’s approach in which an optimization is performed to find the best timing of the possible investments. A genetic algorithm, belonging to the class of evolutionary optimization methods, is employed to minimize the total life cycle cost of the system, structure or component (SSC). The life cycle cost includes operation, proactive maintenance, the cost of failure (corrective maintenance and lost revenue) and the major investment costs planned for the future. Thus, instead of defining various scenarios for the timing of future major refurbishments, the software finds the optimal timing of all investments planned for the SSC. Since most of the parameters entering the analysis are not known with certainty, they are treated as random variables with prescribed probability distributions and the whole process is treated as a stochastic optimization problem.
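The idea of letting a genetic algorithm choose investment dates can be sketched on a toy problem. This is not the LcmPlus implementation. The cost model, failure rates, investment data and algorithm settings below are hypothetical, all inputs are held at fixed values (so the stochastic part of the problem is left out), and no subsequent local refinement is performed.

```python
import random

HORIZON = 20                      # study period in years (illustrative)
INFLATION, DISCOUNT = 0.03, 0.05
BASE_FAILURE_RATE = 0.30          # failures per year (hypothetical)
FAILURE_COST = 500.0              # k$ per failure (hypothetical)
# Each investment: (cost in k$, failure-rate multiplier, (earliest, latest) year)
INVESTMENTS = [(800.0, 0.6, (0, 19)), (300.0, 0.8, (2, 10))]

def pv(cost, year):
    """Present value of a cost in today's dollars incurred in `year`."""
    return cost * ((1 + INFLATION) / (1 + DISCOUNT)) ** year

def life_cycle_cost(plan):
    """plan[k] is the year of investment k, or None if it is not made."""
    total = sum(pv(INVESTMENTS[k][0], y)
                for k, y in enumerate(plan) if y is not None)
    for year in range(HORIZON):
        rate = BASE_FAILURE_RATE
        for k, y in enumerate(plan):
            if y is not None and y <= year:
                rate *= INVESTMENTS[k][1]      # effects are cumulative
        total += pv(rate * FAILURE_COST, year)
    return total

def random_gene(k, rng):
    """A random date inside the allowed window, or None (not selected)."""
    lo, hi = INVESTMENTS[k][2]
    return None if rng.random() < 0.2 else rng.randint(lo, hi)

def genetic_search(rng, pop_size=30, generations=60, mutation=0.2):
    n = len(INVESTMENTS)
    # Seed the population with the "no investment" plan plus random plans.
    population = [[None] * n] + [[random_gene(k, rng) for k in range(n)]
                                 for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=life_cycle_cost)
        next_gen = population[:2]              # elitism: keep the two best
        while len(next_gen) < pop_size:
            # Tournament selection of two parents, then uniform crossover.
            a, b = (min(rng.sample(population, 3), key=life_cycle_cost)
                    for _ in range(2))
            child = [a[k] if rng.random() < 0.5 else b[k] for k in range(n)]
            child = [random_gene(k, rng) if rng.random() < mutation else g
                     for k, g in enumerate(child)]
            next_gen.append(child)
        population = next_gen
    best = min(population, key=life_cycle_cost)
    return best, life_cycle_cost(best)

best_plan, best_cost = genetic_search(random.Random(1))
```

Allowing `None` as a gene value reflects the possibility that an investment is not selected at all, and the per-investment windows play the role of the timing constraints.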

6.4.1 Optimization of the timing of investments for LCM of SSCs The problem that the software models can be briefly described as follows. In order to keep a particular SSC in good operating conditions, the company monitors its operation and performs routine predictive maintenance. In spite of the best company efforts, the equipment occasionally fails. We are interested only in those failures whose occurrence is caused by equipment deterioration because through improved maintenance activities we hope to reduce the rate at which the SSC fails. The most promising way of achieving this goal is through system refurbishments. Such refurbishments may have various beneficial effects for the operation of the SSC. For example, a replacement of the major parts of the SSC or the installation of new monitoring equipment may reduce the failure rate or can reduce the time the equipment will be out of service following a forced outage. The usual practice in the LCM of the SSCs is to postulate several possible investment alternatives. Each alternative will result in a predefined outcome, such as a reduction of the equipment failure rate or outage duration or both. Such investments are usually very costly and because of the usual financial constraints, they are staggered in time. We will assume that each investment can occur in a predefined time interval, which in the most general case may span the period from the present moment to the end date of the study. Our objective is to minimize the


total life cycle costs (or maximize BIR) by optimally timing the investments. The constraints describe the intervals during which the refurbishments can take place. This is a fairly complex optimization problem because each investment has a different effect on the outage costs and some of those effects may be cumulative. LcmPlus employs a genetic algorithm for this purpose. Evolutionary algorithms seldom find the absolute minimum of the objective function but usually give a value in its close vicinity. Therefore, after the neighbourhood of the absolute minimum is established, an additional classic non-linear optimization is performed to home in on the best timing of the investments. The approach yields a set of dates at which the various refurbishments will be undertaken, together with the associated costs. It should be pointed out that it is quite possible that some investments will not be selected at all.

As already mentioned, some quantities are treated as random variables. This further complicates the search for the best timing of the investments. The following procedure is used to perform the stochastic optimization.

1. Enter the most probable values of all the input variables and define their probability distributions. Any distribution can be used.

2. Select time intervals during which major investments can take place. Costs of such investments can also be treated as random variables.

3. Select via Monte Carlo simulation the values of all random variables.

4. Perform optimization to find the best timing of the investments.

5. For the optimal set of investment dates obtained in step 4, perform Monte Carlo simulation a selected number of times and compute the mean value of the NPV of the total cost and of the benefit-to-investment ratios.

6. Repeat steps 3 to 5 a specified number of times.

7. From the results obtained, select the set that gives the optimal value of NPV cost or BIR.

The above process is summarized graphically in Figure 6.8. If the number of MC simulations in both the first and the second runs is equal to N, then, in the worst case, the program will need to perform $N^2$ MC runs and N optimizations. Through numerous tests it has been determined that N = 10,000 gives satisfactory results. Thus, up to 100,000,000 MC simulations may be required. This number is much reduced in practice since many of the date sequences are not acceptable, because the order of investments on the same SSC may be predetermined. Block four in Figure 6.8 selects allowable sequences. The details of the calculations are described in the following sections.
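The seven-step procedure can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the LcmPlus implementation: the one-investment cost model `total_cost`, the exhaustive search that stands in for the genetic algorithm, and all numerical values are hypothetical.

```python
import random

# Hypothetical toy problem: choose the year of a single investment.
YEARS = range(0, 21)                  # candidate investment years (assumed)
FAIL_COST = (200.0, 10000.0, 800.0)   # triangular (low, high, mode) in k$, assumed
INVEST_COST = (250.0, 400.0, 300.0)   # triangular (low, high, mode) in k$, assumed


def sample_inputs(rng):
    """Step 3: draw one realization of every random input."""
    return {
        "fail_cost": rng.triangular(*FAIL_COST),
        "invest_cost": rng.triangular(*INVEST_COST),
    }


def total_cost(year, x, rate_before=0.03, rate_after=0.02, horizon=20, discount=0.09):
    """Toy life cycle cost: expected failures plus the discounted investment."""
    failures = rate_before * year + rate_after * (horizon - year)
    return failures * x["fail_cost"] + x["invest_cost"] / (1.0 + discount) ** year


def optimize(x):
    """Step 4: best date for fixed inputs (exhaustive search stands in for the GA)."""
    return min(YEARS, key=lambda y: total_cost(y, x))


def expected_cost(year, rng, n_inner):
    """Step 5: mean cost of a fixed date over fresh Monte Carlo samples."""
    return sum(total_cost(year, sample_inputs(rng)) for _ in range(n_inner)) / n_inner


def stochastic_optimization(n_outer=50, n_inner=200, seed=1):
    rng = random.Random(seed)
    evaluated = {}                    # each distinct date is evaluated once (sketch shortcut)
    for _ in range(n_outer):          # step 6: repeat steps 3 to 5
        year = optimize(sample_inputs(rng))
        if year not in evaluated:
            evaluated[year] = expected_cost(year, rng, n_inner)
    best = min(evaluated, key=evaluated.get)   # step 7: best expected value
    return best, evaluated[best]


print(stochastic_optimization())
```

The outer loop produces candidate dates that are optimal for individual samples, and the inner loop ranks those candidates by their expected cost, which is the quantity the decision is finally based on.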


Figure 6.8 Flowchart of the stochastic optimization process.

6.4.2 Objective function

6.4.2.1 Cost computations

The total cost of a plan is composed of many components, including fixed and variable planned costs as well as the unplanned costs of failures. In a probabilistic analysis, the last component is computed by multiplying the equipment failure rate by the repair expenses and consequential costs of failures. One of the greatest challenges of LCM is to build a probabilistic model of the SSC’s failure rates that would be as close as possible to reality. In comparing alternative plans, the most obvious way is to look at the total cost of each. The total cost of an alternative is composed of the following two components:

1. Cost of maintenance (MC)

2. Cost of failure (FC)

In addition to regular maintenance costs, each alternative may have one-time costs associated with the selected maintenance action. For example, in the case examined later, such costs would include the purchase of a spare rotor, or the rewinding of a stator. Similarly, the cost of failure can be composed of several components. The total cost, TC, of an alternative is equal to

$$TC = MC + FC \tag{6.3}$$

The MC expenses include ongoing yearly costs (YC), planned refurbishment costs (RFC) and special one-time costs (SC). Components of YC are the engineering expenses, operating expenses and costs of craftsmen, all computed as man-hours times rates. Rates may change yearly. To be added are the costs of subcontractors, materials, and other expenses not mentioned above. The RFC costs are computed similarly, except that now the costs pertain to refurbishment rather than to the ongoing yearly costs.

The second component in equation (6.3) represents the costs associated with the SSC failures. This component includes the costs of repairs (RC), lost production (LPC) and consequential effects (CC). Further, one needs to estimate the expected number of failures in each year, that is, the failure rate. The latter will be denoted by $\lambda_i(k)$ for failure mode i in year k. The total cost of failures, FC, is then expressed as

$$FC = \sum_{k=1}^{K} \sum_{i=1}^{n} \lambda_i(k) \left[ RC_i(k) + D_i(k)\, LE(k)\, PL + CC_i(k) \right] \tag{6.4}$$

where
K = number of study years
n = number of failure modes
$D_i(k)$ = average outage duration of failure mode i in year k (h/occ)
LE(k) = PV of the lost production cost per MWh of energy lost in year k ($/MWh)
PL = power lost during each outage (MW)

The middle term of the right-hand side represents LPC, the cost of lost production due to failure mode i in year k. All cost components are converted to present values, taking into account the inflation and discount rates.

6.4.2.2 Benefit to Investment Ratio

Another criterion which is widely used in decision making is the Benefit to Investment Ratio (B/I Ratio or BIR). In general, BIR is used in comparisons of alternatives, and is defined as the change in failure costs (benefit), divided by the change in maintenance costs (investment) as one alternative plan is substituted by another. All values used are present values. Thus,

$$BIR_{AB} = -\frac{\Delta FC}{\Delta MC} \tag{6.5}$$

The negative sign in the numerator accounts for the fact that if the failure costs decrease when moving from plan A to plan B, the benefits increase. Equation (6.5) can be rewritten as

$$BIR_{AB} = -\frac{FC_B - FC_A}{MC_B - MC_A} \tag{6.6}$$

and substituting (6.3) in the numerator,

$$BIR_{AB} = -\frac{(TC_B - MC_B) - (TC_A - MC_A)}{MC_B - MC_A} = 1 - \frac{TC_B - TC_A}{MC_B - MC_A} = 1 - \frac{\Delta TC}{\Delta MC} \tag{6.7}$$

Values of BIR greater than one indicate that alternative B is better than the reference plan A. On the other hand, if BIR is less than one, the investment required by plan B would be ineffective. The BIR can also attain negative values; by (6.7), for a positive additional investment this occurs when the total cost of the alternative exceeds that of the base case by more than the additional maintenance cost, that is, when the failure costs increase.
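The cost and ratio definitions in (6.3) to (6.7) translate directly into code. The sketch below is an illustration only; the function names, the nested-list data layout and all numbers are assumptions made for this example and are not taken from the software.

```python
def failure_cost(lam, rc, dur, le, pl, cc):
    """Equation (6.4). lam, rc, dur, cc are indexed [year][mode]; le is indexed [year].

    dur is in hours per occurrence, le in $/MWh and pl in MW, so that
    dur * le * pl is the lost production cost of one failure.
    """
    return sum(
        lam[k][i] * (rc[k][i] + dur[k][i] * le[k] * pl + cc[k][i])
        for k in range(len(lam))
        for i in range(len(lam[k]))
    )


def bir(mc_a, fc_a, mc_b, fc_b):
    """Equation (6.6): benefit-to-investment ratio of plan B relative to base plan A."""
    return -(fc_b - fc_a) / (mc_b - mc_a)


def bir_from_totals(mc_a, tc_a, mc_b, tc_b):
    """Equation (6.7): the same ratio expressed through total costs."""
    return 1.0 - (tc_b - tc_a) / (mc_b - mc_a)


# Hypothetical present values: plan B spends 50 more on maintenance
# and lowers the failure cost by 150.
mc_a, fc_a = 100.0, 500.0
mc_b, fc_b = 150.0, 350.0
print(bir(mc_a, fc_a, mc_b, fc_b))                            # 3.0
print(bir_from_totals(mc_a, mc_a + fc_a, mc_b, mc_b + fc_b))  # 3.0
```

Both forms return the same value, which is a quick check that the substitution of (6.3) leading to (6.7) has been carried out consistently.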


6.4.3 Optimization problem

The optimization problem we are solving involves minimization of the objective function given by equations (6.3) and (6.4) subject to the constraints representing permissible investment periods. In our example, the investment period covers the entire study time horizon. Let R be the set of all possible investments. Some parameters are dependent on a particular combination of the investments; hence, the objective function takes the form

$$\min_{\mathbf{r} \in R} \left\{ \sum_{r \in \mathbf{r}} MC_r(k) + \sum_{k=1}^{N} \sum_{i=1}^{n} \lambda_i(k, \mathbf{r}) \cdot \left[ RC_i(k, \mathbf{r}) + D_i(k, \mathbf{r})\, LE(k)\, PL + CC_i(k, \mathbf{r}) \right] \right\} \tag{6.8}$$

where $\mathbf{r}$ denotes a sequence of investments in R and $MC_r(k)$ represents the rth investment that took place in year k. The algorithm considers all feasible investment combinations. In this application, we assumed that only the failure rate, the repair and consequential costs and the duration of the outage are affected by the refurbishment scenarios. As an alternative, the objective function could maximize the BIR value given by (6.7).

6.4.3.1 Evolutionary algorithm

Since the proposed optimization problem involves a class of functions that cannot be defined a priori (the number of variables varies and is a function of the selected investments), the program employs evolutionary algorithms, which are well suited to handle such situations [12-14]. When designing the optimization problem one has to remember that the analyzed solutions (dates of refurbishments) are dependent on additional constraints that eliminate some investments. For example, if we have two possible investments, one involving replacement of the equipment and the other only the replacement of some parts, the second investment cannot follow the first one for obvious reasons, whereas the reverse order is permitted. Evolutionary algorithms use techniques inspired by evolutionary biology such as inheritance, mutation, natural selection, and recombination (or crossover). Discussion of the implementation of an evolutionary algorithm for the LCM optimization problem is given in [6].
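To make the ingredients named above concrete (selection, crossover, mutation and the ordering constraint), the following is a generic genetic algorithm sketch over vectors of investment dates. It is not the algorithm of [6]: the operators, the parameter values, the way infeasible candidates are discarded and the quadratic `toy_cost` are all assumptions chosen for brevity.

```python
import random

HORIZON = 240  # 20-year study period expressed in months


def evolve(cost, n_genes, must_precede, pop_size=40, generations=60, seed=0):
    """Minimize cost(dates) over integer month vectors.

    must_precede holds pairs (i, j) meaning investment i may not be
    scheduled after investment j (dates[i] <= dates[j]).
    """
    rng = random.Random(seed)

    def ok(ind):
        return all(ind[i] <= ind[j] for i, j in must_precede)

    def random_individual():
        while True:
            ind = [rng.randrange(HORIZON + 1) for _ in range(n_genes)]
            if ok(ind):
                return ind

    def tournament(pop):
        a, b = rng.sample(pop, 2)            # natural selection: the fitter of two
        return a if cost(a) <= cost(b) else b

    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        children = [min(pop, key=cost)]      # elitism: keep the best so far
        while len(children) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            cut = rng.randrange(1, n_genes) if n_genes > 1 else 0
            child = p1[:cut] + p2[cut:]      # one-point crossover (inheritance)
            if rng.random() < 0.3:           # mutation: shift one date by up to a year
                g = rng.randrange(n_genes)
                child[g] = min(HORIZON, max(0, child[g] + rng.randint(-12, 12)))
            if ok(child):                    # discard sequences in a forbidden order
                children.append(child)
        pop = children
    return min(pop, key=cost)


TARGET = [35, 119]  # hypothetical optimum of the toy cost, in months


def toy_cost(dates):
    return sum((d - t) ** 2 for d, t in zip(dates, TARGET))


best = evolve(toy_cost, n_genes=2, must_precede=[(0, 1)])
print(best, toy_cost(best))
```

As noted in Section 6.4.1, a result of this kind is typically only close to the true minimum, which is why a classic non-linear optimization step is applied afterwards.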

6.5 Program description

The web application was developed using J2EE technology. The project is based on the Model-View-Controller (MVC) design pattern and is discussed in [6]. The input data consist of economic information (inflation and discount rates, cost of energy, etc.) and the routine maintenance and refurbishment costs. The possible investments and their mutual relationships are also defined and the effects of the investments are specified (e.g., the change of the failure rate or the outage duration or costs). The required data are illustrated in the numerical example in the next section.

6.6 Numerical example

The example concerns the LCM plan for a main generator in an electric power station. The licensed period of the station is 40 years and it is assumed that the license will not be renewed. The study period starts at year 20 of the plant’s operation; therefore, the pay-off time must be shorter than the remaining 20 years if the plan is to be successful. The data are based on real-life industry experience.


6.6.1 Economic data

Table 6-1 Summary of the economic parameters and their bounds for this study.

Parameter                          Lower bound    Nominal value    Upper bound
Replacement Energy Cost ($/MWh)    12             24               72
Discount Rate (%)                  6.75           9                12
Inflation Rate (%)                 1.5            3                4.5

Other costs are as follows: labor cost is 60 $/h and engineering cost equals 70 $/h.

6.6.2 Base case equipment parameters

6.6.2.1 Failure rates

For the analysis of the LCM plans for a large generator, four failure modes were considered as follows.

1. Stator winding and core.

2. Rotor winding, forging and RR.

3. Exciter and voltage regulator.

4. Other.

Traditionally, the equipment failure rate is computed by dividing the number of outages by the equipment equivalent operating years considered in the studies. The failure and maintenance rates computed in such a way are usually assumed to be constant throughout equipment life. However, many characteristics could influence equipment failure or maintenance rates causing variation in equipment failure rates with time and usage. These could include, for example, equipment age, manufacturer and the maintenance depth and frequency. The LCM Solutions software can accommodate other mathematical models of failure rate histories such as linear and Weibull. In this example, all the failure rates will be treated as linearly changing with time; that is, they will take the form

$$\lambda_i(k) = a_i + b_i \cdot k \tag{6.9}$$

where k represents a year and i is the failure mode. In particular, when $b_i = 0$ the failure rate is independent of age but still can be a random variable. Four failure types are considered:

1. A stator failure

2. A rotor failure

3. Excitation system failure

4. Other equipment failure

6.6.2.2 Random variables

Most of the parameters in the LCM studies are uncertain, including in particular failure rates, outage costs and outage durations. For the purpose of this example, only the failure rates will be changed following each investment. It was assumed that each parameter follows a triangular distribution with a given mode and given minimum and maximum values. The assumed values of the parameters are summarized in Table 6-2, where the failure rate λ, outage cost C and outage duration D are given for each failure mode of the unit.
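As an illustration of how such inputs can be drawn, the sketch below samples a triangular distribution with Python's standard `random.triangular` and applies the linear failure rate model (6.9). The stator outage cost bounds are the Table 6-2 values; the helper names, the sample size and the seed are assumptions of this example.

```python
import random

# Stator (failure mode 1) outage cost from Table 6-2, in k$.
LOW, MODE, HIGH = 200.0, 800.0, 10000.0


def triangular_mean(low, mode, high):
    """Closed-form mean of a triangular distribution."""
    return (low + mode + high) / 3.0


def sample_failure_rate(rng, a_low, a_mode, a_high, b, year):
    """Draw a_i from its triangular distribution and apply lambda_i(k) = a_i + b_i * k."""
    a = rng.triangular(a_low, a_high, a_mode)  # argument order is (low, high, mode)
    return a + b * year


rng = random.Random(0)
draws = [rng.triangular(LOW, HIGH, MODE) for _ in range(100_000)]
print(triangular_mean(LOW, MODE, HIGH))  # about 3666.7 k$, far above the 800 k$ mode
print(sum(draws) / len(draws))           # sample mean, close to the closed-form value

# Rotor failure rate (mode 2 in Table 6-2) ten years into the study, one random draw.
print(sample_failure_rate(rng, 0.01, 0.03, 0.1, 0.0005, year=10))
```

Because the upper bound lies much further from the mode than the lower bound, the mean of the sampled cost is several times the nominal value, so expected costs computed by simulation can differ markedly from those computed with nominal inputs.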


Table 6-2 Parameters of random variables - base case.

6.6.3 The alternative plans

The following investment alternatives have been defined.

A. The base case. The assumption is that the current maintenance program is being continued. Maintenance is carried out regularly, and after failures, repairs are performed; the associated lost production costs are the dominant expenses in this case. Failures are expected to occur with some regularity – this frequency is estimated from past performance.

B. In this plan, the rotor is rewound in the future. The estimated cost of this investment is $300,000.

C. This plan invests even more in maintenance. It includes the purchase of a spare rotor at the cost of $4,000,000.

D. This investment postulates a purchase of a digital voltage regulator at the price of $1,000,000.

E. In this plan, a new exciter is purchased for $4,000,000.

The investment plans are selected in the hope of reducing the expenditures necessitated by failures, including the lost production costs caused by curtailed energy. The parameters may have different limits following each investment. The effect of each investment is summarized in Table 6-3.

Values for Table 6-2 (base case):

Failure type   Parameter     a: lower bound   a: nominal   a: upper bound   b
1              λ_1 (1/y)     0.02             0.038        0.08             0
1              C_1 (k$)      200              800          10,000           0
1              D_1 (days)    5                30           90               0
2              λ_2 (1/y)     0.01             0.03         0.1              0.0005
2              C_2 (k$)      200              500          6,000            0
2              D_2 (h)       15               20           50               0
3              λ_3 (1/y)     0.05             0.076        0.15             0.0005
3              C_3 (k$)      5                20           100              0
3              D_3 (days)    0.5              2            10               0
4              λ_4 (1/y)     0.05             0.076        0.15             0
4              C_4 (k$)      10               10           100              0
4              D_4 (days)    0.5              1            10               0


Table 6-3 Bounds for failure rates following each investment.

Investment   Failure type   a: lower   a: nominal   a: upper   b: nominal
B            2              0.005      0.02         0.02       0.00025
C            2              0.01       0.02         0.05       0.00020
D            3              0.025      0.035        0.04       0.00050
E            3              0.01       0.057        0.10       0.00025

Since investments B and C both concern the rotor, an additional constraint is added that once a new rotor is purchased, no rewind is required. Similarly, for the excitation system, we will assume that if a new exciter is purchased (investment E), there is no need for a new voltage regulator (investment D). We will assume that each of the above investments can take place at any time during the entire study period. As stated earlier, the plant is 20 years old and the study period extends from the present moment to the end of the planned life of 40 years; that is, 20 years from now.
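The two ordering constraints can be expressed as a small filter on a candidate set of dates. The sketch below is one possible reading of the rule, under the assumption that a minor investment scheduled in the same month as, or later than, the investment that supersedes it is simply dropped; the function name and the example dates are hypothetical.

```python
# B = rotor rewind, C = spare rotor, D = digital voltage regulator, E = new exciter.
# Dates are months from the beginning of the study period.
SUPERSEDED_BY = {"B": "C", "D": "E"}  # B is not needed once C is done; D once E is done


def effective_investments(dates):
    """Return the investments that are actually carried out for a candidate schedule."""
    kept = {}
    for name, month in dates.items():
        major = SUPERSEDED_BY.get(name)
        if major in dates and dates[major] <= month:
            continue  # e.g. a rewind planned after the new rotor is skipped
        kept[name] = month
    return kept


print(effective_investments({"B": 35, "C": 119, "D": 14, "E": 131}))
# {'B': 35, 'C': 119, 'D': 14, 'E': 131}
print(effective_investments({"B": 131, "C": 26, "D": 14, "E": 115}))
# {'C': 26, 'D': 14, 'E': 115}
```

In the second schedule the rewind falls after the rotor purchase, so only three investments remain and the cost of the rewind is not incurred.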

6.6.4 Study results

Several cases of progressively increasing complexity are presented in this section.

6.6.4.1 Deterministic parameters

In this study it is assumed that the parameters take their most probable (nominal) values. The cost of the base case alternative is 319,315.97 k$. This cost includes operating expenses and the cost of forced outages. The sequence of optimal dates for investments B, C, D and E is [35, 119, 14, 131], with a total cost of 222,184.90 k$, which is equal to 69.6% of the base case cost. The investment dates are expressed in months from the beginning of the study period.

Normally, the installation of new equipment can take place only during a planned outage. If we were to select the optimal investment sequence given above, we would install a new voltage regulator during the first outage (outages take place every 18 months) and we would rewind the rotor during the second outage. On the other hand, the more expensive investments would take place further in the future. A new rotor would be installed 10 years from now and a new excitation system would be purchased 1 year later. In practice, both investments could take place during the same outage. The results might be quite different if a different allowable period were selected for each investment.

Additional studies were performed in which only investments B and C were considered. The optimal solution is given by the vector [35, 120] with a total cost of 232,145.17 k$. We can observe that the optimal investment dates are practically the same as before, but the total cost is about 10,000 k$ larger. Figure 6.9 shows these results in a graphical form.
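The step from optimal dates to an outage schedule can be reproduced with a few lines of arithmetic. The 18-month outage interval, the dates and the costs are those quoted above; rounding each date to the nearest planned outage is an assumed scheduling rule used only for this illustration.

```python
OUTAGE_INTERVAL = 18  # months between planned outages


def nearest_outage(month, interval=OUTAGE_INTERVAL):
    """Round an optimal date to the closest planned outage, no earlier than the first one."""
    return max(interval, round(month / interval) * interval)


optimal = {"B": 35, "C": 119, "D": 14, "E": 131}  # months, deterministic study
schedule = {name: nearest_outage(m) for name, m in optimal.items()}
print(schedule)                         # {'B': 36, 'C': 126, 'D': 18, 'E': 126}
print(round(222184.90 / 319315.97, 3))  # 0.696, i.e. 69.6% of the base case cost
```

Under this rule the voltage regulator lands on the first outage, the rewind on the second, and the rotor and exciter purchases on the same outage at month 126, in line with the discussion above.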


Figure 6.9 Optimal investment dates for scenarios B and C.

If, on the other hand, only the excitation system is considered (investments D and E), the optimal sequence is [15, 132] with a total cost of 309,355.84 k$. We can observe that, in this case, installing a new excitation system or purchasing a new voltage regulator has a very small influence on the total life cycle cost of the generator. This can be explained by the fact that the cost of failure associated with the excitation system is much smaller than the cost resulting from a rotor or a stator failure.

6.6.4.2 Probabilistic analysis

In order to assess the effect of the uncertainty in the input parameters on the cost and the B/I ratios, a Monte Carlo study was performed. The purpose of the Monte Carlo analysis is to establish the probability distributions of the Total Cost and the Benefit to Investment ratios for the optimal sequence of investments. This will allow us to answer the following questions:

• What is the chance that the Total Cost of the selected sequence will be greater/smaller than a specified value?

• What is the probability that the selected sequence will be better than an alternative one?

Through experimentation, 30,000 simulations were selected for the Monte Carlo runs. From the first round of simulations, 30,000 sets of optimal investment dates were obtained. For each of these sets, the second set of Monte Carlo simulations was performed to find the sequence with the lowest expected NPV. The base case scenario costs 691,650.2 k$ and the optimal sequence of investments is given by the vector [32, 122, 0, 122] with a cost of 499,497.03 k$, which is 72.2% of the original cost. We can observe that the costs in this study are more than double the values of the deterministic case. This can be explained by the fact that the triangular probability distributions of the input costs are skewed to the right, with the upper limit much further away from the most probable value than the lower limit.

In the stochastic optimization study, a sequence [131, 26, 0, 115] gave a cost of 502,156.3 k$, which is only slightly higher than the optimal one. However, in this sequence, we would purchase a new rotor during the first outage and would not do the rewind at all, since the rewind falls after the new rotor installation, which is not an allowed sequence of investments. The utility may prefer this scenario to the optimal one. In order to analyze these two alternatives further, the probability density functions of the BIR for both sequences were plotted in Figure 6.10.


Figure 6.10 The frequency chart of the BIR values of two alternative sequences of optimal investment dates (horizontal axis: BIR, 0 to 80; vertical axis: probability, 0 to 0.08; series: optimal solution and alternative solution). The graph has been truncated for negative values of the BIR.

The information in Figure 6.10 gives additional valuable insight into the decision-making process. The higher mean value is confirmed, since for the original optimal investment dates the density of the BIR is somewhat skewed to the left, while the alternative scenario has a fairly high probability of large values of BIR.

6.7 Conclusions

This chapter presents advanced computer programs to help decision-makers choose the best maintenance strategy from a selection of options. The intention was to develop tools that can be easily used by asset management planners and field engineers. Their development was guided by the needs of users at Hydro One Networks Inc. of Toronto. The tools are used to complement other methods in the area employed at Hydro One and other utilities. The successful application of the method employed in AMP and ARM hinges on the proper representation of the equipment deterioration process under various maintenance policies. These processes are graphically represented by life curves. The creation and application of such curves was described in this report. The most important features are the following.

• Probabilistic modeling of all variables entering the analysis.

• Intensive application of semi-Markov models and Monte Carlo simulation, allowing application of several types of standard probability distributions.

• Calculation of the First Passage Times for analysis of the remaining life of the power equipment undergoing maintenance.

• Calculation of the benefit-to-investment ratios for alternative investment/asset-sustainment plans extended over a period of time spanning the time horizon.

• Implementation of simulated annealing and evolutionary algorithms for the selection of the optimal investment intervals.

The methods encoded in the programs use sophisticated probability techniques. In the real world, many parameters are really random variables; that is, their values are uncertain and can be described only by probability distributions. These distributions can take on many shapes and, once chosen in an application, can be best evaluated through either semi-Markov models or Monte Carlo simulation techniques, as implemented in these programs.


6.8 References

[1] Endrenyi, J., Anders, G.J. and Leite da Silva, A.M., “Probabilistic Evaluation of the Effect of Maintenance on Reliability - An Application”, IEEE Trans. on Power Systems, Vol. 13, No. 2, May 1998, pp. 575-583.

[2] Anders, G., Endrenyi, J. and Yung, C., “Risk-Based Planner for Asset Management”, IEEE Computer Applications in Power, Vol. 14, No. 4, pp. 20-26, October 2001.

[3] Ross, S., Stochastic Processes, John Wiley & Sons, N.Y., 1995.

[4] Anders, G.J., Leite da Silva, A.M., “Cost Related Reliability Measure For Power System Equipment”, IEEE Trans. on Power Systems, Vol. 15, No. 2, May 2000, pp. 654-660.

[5] Stopczyk, M., Sakowicz, B., Anders, G.J., “Application of a semi-Markov model and a simulated annealing algorithm for the selection of an optimal maintenance policy for power equipment”, submitted to IEEE Trans. on Power Systems.

[6] Sakowicz, B., Stopczyk, M., Anders, G.J., “Scheduling of Major Investments for a Steam Generating Unit Using a Stochastic Model”, submitted to IEEE Trans. on Energy Conversion.

[7] Anders, G.J. and Sakowicz, B., “Life Cycle Management – distributed Web-based software development with evolutionary programming and stochastic optimization”, PMAPS’2006 Int. Conference, Stockholm, June 2006.

[8] EPRI Life Cycle Management Planning Tool, LcmVALUE, Beta Version 0.2, June 2002.

[9] EPRI Technical Report 1000806, “Demonstration of Life Cycle Management Planning for Systems, Structures and Components” With Pilot Applications at Oconee and Prairie Island Nuclear Stations, January 2001.

[10] EPRI Technical Report 1003058, “Life Cycle Management Planning Sourcebooks-Overview Report”, December 2001.

[11] Electric Power Research Institute, Inc. (EPRI), “Demonstration of Life Cycle Management Planning for Systems, Structures and Components – LcmVALUE User Manual and Tutorial Final Version 1.0”, Project no. 6118, July 2002.

[12] Goldberg, David E., Genetic Algorithms in Search, Optimization and Machine Learning, Kluwer Academic Publishers, Boston, MA, 1989.

[13] Goldberg, David E., The Design of Innovation: Lessons from and for Competent Genetic Algorithms, Addison-Wesley, Reading, MA, 2002.

[14] Schmitt, Lothar M., Theory of Genetic Algorithms II: models for genetic operators over the string-tensor representation of populations and convergence to global optima for arbitrary fitness function under scaling, Theoretical Computer Science (310), pp. 181-231, 2004.

6.9 Biography

George Anders received a Masters Degree in Electrical Engineering from Technical University of Lodz in Poland in 1973, an M.Sc. Degree in Mathematics and Ph.D. Degree in Power System Reliability from the University of Toronto in 1977 and 1980, respectively. He also received a Doctor of Science degree from the Technical University of Lodz in Poland in 2000. Since 1975 he has been employed by Ontario Hydro, first as a System Design Engineer in Transmission System Design Department and currently as a Principal Engineer/Scientist in the Electrical Systems Technologies Department of Kinectrics Inc. which is a successor company of Ontario Hydro Technologies. For several years, Dr. Anders has been teaching at the University of Toronto and he is now an Adjunct Professor in the Department of Electrical and Computer Engineering. He is author of over 160 technical papers and several books. Dr. Anders is a registered Professional Engineer in the Province of Ontario and a Fellow of the IEEE.


Risk Based Asset Management – Applications at Transmission Companies W. Li


7 Risk Based Asset Management – Applications at Transmission Companies

Wenyuan Li, Fellow IEEE British Columbia Transmission Corporation,

Vancouver, Canada

Abstract - This chapter of the tutorial discusses two actual applications of risk based asset management approaches at British Columbia Transmission Corporation, Vancouver, Canada. The concepts and methods presented are general and can be applied in any utility. The first application is the risk evaluation based approach to the replacement strategy of aged HVDC components. It includes estimation of unavailability of individual HVDC components due to repairable and aging failures, calculations of capacity state probabilities of the HVDC system, quantified risk evaluation of the power system containing the HVDC link and benefit/cost analysis for different replacement strategies. The approach can also be used for other system components. The replacement strategy for an aged submarine cable of the HVDC link in a power supply system at BCTC is analyzed as an example to demonstrate the actual aspects. The procedure of the analysis is explained in detail in the example. The second application is a probabilistic approach to determining the number of spare transformers for a group of transformers and the timing requirement for each spare transformer to meet the specified reliability criterion. The historical reliability performance metric designated as System Average Interruption Duration Index (SAIDI) is used to establish the specified reliability criterion. The proposed method considers both repairable and aging failures of transformers. The 138/25 kV 25 MVA transformer group in the BCTC system, consisting of both fixed turn ratio and on-load tap changing transformers, is used for an illustration. The detailed analysis in determining the number of spare transformers and their timing requirements during a 10 year planning period is presented.

7.1 Introduction

Asset management is associated with a variety of topics, including maintenance, replacement, aging and retirement, life cycle assessment, equipment spare planning, risk management and reliability evaluations. Both traditional and risk based asset management methods have been addressed in the past [1 - 16]. This chapter of the tutorial discusses two actual applications of risk based asset management approaches at British Columbia Transmission Corporation (BCTC). The concepts and methods presented are general and can be applied in any utility. The first application is the risk evaluation based approach to the replacement strategy of aged HVDC components. It includes estimation of unavailability of individual HVDC components due to repairable and aging failures, calculations of capacity state probabilities of the HVDC system, quantified risk evaluation of the power system containing the HVDC link and benefit/cost analysis for different replacement strategies. The approach can also be applied to other system components. The replacement strategy for an aged submarine cable of the HVDC link in a power supply system at BCTC is analyzed as an example to demonstrate the actual aspects. The procedure of the analysis is explained in detail in the example.


The second application is a probabilistic approach to determining the number of spare transformers for a group of transformers and the timing requirement for each spare transformer to meet the specified reliability criterion. The historical reliability performance metric designated as System Average Interruption Duration Index (SAIDI) is used to establish the specified reliability criterion. The proposed method considers both repairable and aging failures of transformers. The 138/25 kV 25 MVA transformer group in the BCTC system, consisting of both fixed turn ratio and on-load tap changing transformers, is used for an illustration. The detailed analysis in determining the number of spare transformers and their timing requirements during a 10 year planning period is presented.

7.2 Replacement Strategy of Aged HVDC Components [17]

7.2.1 Problem description

HVDC links have been widely used in electric power systems across the world for many years. An HVDC link is much more complex than a simple AC circuit since it consists not only of overhead lines or underground/submarine cables but also of a variety of converter station equipment, including valves, converter transformers, smoothing reactors, filters and auxiliary protection and control devices. It is actually a sub-system of multiple components. Many HVDC systems in the world have been operated for 25 to 35 years or even longer and some components have reached their end-of-life stage [18]. An important issue is the replacement strategy for an aged HVDC component. Utilities have different practices for replacement, including:

• The aged component is continuously used until it dies. The problem with this policy is that for major transmission system components (e.g. cables, transformers, reactors, etc.), it will take more than one year to complete the whole replacement process including purchase, transportation, installation and commissioning of a new component. The power system may be exposed to severe risks of being unable to meet security criteria during the replacement period.

• The aged component is continuously used with close field monitoring. The process of purchasing a new component for replacement starts when phenomena associated with fatal failure are observed. Unfortunately, some components cannot be monitored in such a way. For example, it is extremely difficult to monitor a cable, since a sample taken from one section cannot represent the status of the whole cable. For a power transformer, although oil sampling can be performed to partially monitor its wear-out status, the decision on replacement is still difficult.

• The replacement is set at a given retirement age which is normally around the estimated mean life of component. Once a component reaches this age, the replacement is imposed. The problem with this policy is the fact that any aged component may die before or after the specified retirement age. If it dies before, it will result in a high system risk that is caused by its absence from the system. If it can survive longer, its early retirement will result in a waste of capital because of unnecessary earlier investment for replacement.

The questions utilities are facing for replacement strategy are:

• Should a piece of equipment be replaced?

• If yes, when should it be replaced: before or after it fails?

Page 88: IR-EE-ETK_2007_004

IEEE Tutorial on Asset Management – Maintenance and Replacement Strategies, June 24-28, 2007, Tampa, USA

Risk Based Asset Management – Applications at Transmission Companies W. Li

85

This section presents a risk evaluation based approach to answer these two questions for aged components. The basic idea is to quantify the expected system risks and risk costs of three replacement options: replacing the aged component before it fails, replacing it after it fails and not replacing it at all. The difference in the expected risk cost between the options can be compared with the difference in the capital cost between them. Although the approach can be applied to any equipment in power systems, the descriptions and example given here focus on an aged HVDC component, since evaluating the impact of an HVDC component on the system risk requires more effort than for an AC component.

7.2.2 Methodology

7.2.2.1 Procedure of the approach

Conceptually, the value of a component in a power system depends on the variation of system risk caused by its absence from the system. If the absence of a component creates only marginal degradation in system reliability, the benefit of replacing it is minor. This situation may not occur often, since the majority of power system components are installed for a specific purpose that contributes to the reliable delivery of power. However, when the system configuration is changed or a system enhancement is performed, some equipment may become less important to the system. Generally speaking, the impact of any equipment on the system risk is an extremely complex function of system configuration, effects of other new equipment, load levels and failure probabilities of all system components, which vary from year to year. In other words, the decision to replace and the choice of replacing before or after component failure will have different impacts on the total system risk in a period. Therefore, quantified system risk evaluation is the key to selecting a replacement strategy. Calculating failure probabilities due to aging failures is one of the crucial steps in the risk assessment.

Considerable efforts have been devoted to risk evaluation of power systems in the past [19 - 22]. However, relatively little literature has discussed risk evaluation of power systems containing HVDC links. It is difficult to directly evaluate the risk of a power system containing HVDC sub-systems using traditional methods. An HVDC system consists of multiple components and can be operated at different capacity levels. The proposed method is to calculate a capacity probability distribution of the HVDC system and incorporate it into the risk evaluation of the whole power system as an equivalent component with multiple states. The presented approach includes the following steps, with a focus on the equivalent modeling of the HVDC system:

1. Estimating average unavailability of individual HVDC components including both repairable and end-of-life failure modes

2. Calculating capacity levels and capacity probability distributions of the HVDC system for three cases: with all existing components, with the replacement of a component whose replacement strategy is investigated, and with the component out-of-service without replacement

3. Evaluating the risks of the power system containing the HVDC system for the three cases in Step 2

4. Performing the analysis for the replacement strategy of the component under consideration

It can be seen that Steps 1 and 2 serve to obtain an equivalent component of the HVDC system under different replacement strategies. For a system without an HVDC link, the procedure is simpler. An AC component can generally be represented using a two-state model (up and down), so only the unavailability of the AC components needs to be prepared. The three cases to be evaluated are the same for an AC component under investigation for replacement strategy.

7.2.2.2 Estimating unavailability of system components

The unavailability of a system component due to repairable failure is defined as [1]:

$$U_r = \frac{f \cdot MTTR}{8760} \qquad (7.1)$$

where f is the average failure frequency (failures/year) and MTTR is the mean time to repair (hours/repair). For an aged system component, particularly the component under investigation for replacement, the aging failure mode should also be considered. The unavailability due to aging failures depends on the age of a component and the subsequent period considered. Denoting the age and the subsequent period by T and t respectively, and dividing t into N equal intervals of length D, the unavailability due to aging failure can be calculated by [1, 14-15]:

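$$U_a = \frac{1}{t} \sum_{i=1}^{N} P_i \left[ t - \frac{(2i-1)D}{2} \right] \qquad (7.2)$$

where

$$P_i = \frac{\int_{T}^{T+iD} f(x)\,dx - \int_{T}^{T+(i-1)D} f(x)\,dx}{\int_{T}^{\infty} f(x)\,dx} \qquad (i = 1, 2, \ldots, N) \qquad (7.3)$$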

Here f(x) is the failure probability density function. The Weibull distribution is often used, in which case Equation (7.3) becomes:

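$$P_i = \frac{\exp\left[-\left(\frac{T+(i-1)D}{\alpha}\right)^{\beta}\right] - \exp\left[-\left(\frac{T+iD}{\alpha}\right)^{\beta}\right]}{\exp\left[-\left(\frac{T}{\alpha}\right)^{\beta}\right]} \qquad (i = 1, 2, \ldots, N) \qquad (7.4)$$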

where α and β are the scale and shape parameters for the Weibull distribution, which can be estimated using historical data [23].
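As a rough illustration of Equations (7.1) to (7.4), the following Python sketch computes the repairable and aging unavailability of a single component under the Weibull model. The function names and all numerical inputs (age, Weibull parameters, failure frequency, repair time) are assumptions made for this example only; they are not data from the study.

```python
import math


def repairable_unavailability(f, mttr_hours):
    """Equation (7.1): f in failures/year, MTTR in hours/repair."""
    return f * mttr_hours / 8760.0


def weibull_survival(x, alpha, beta):
    """Weibull survival function exp[-(x/alpha)^beta]."""
    return math.exp(-((x / alpha) ** beta))


def aging_failure_probs(T, t, N, alpha, beta):
    """Equation (7.4): probability P_i of failing in the i-th interval
    of length D = t/N, given survival up to age T."""
    D = t / N
    s_T = weibull_survival(T, alpha, beta)
    return [
        (weibull_survival(T + (i - 1) * D, alpha, beta)
         - weibull_survival(T + i * D, alpha, beta)) / s_T
        for i in range(1, N + 1)
    ]


def aging_unavailability(T, t, N, alpha, beta):
    """Equation (7.2): each P_i is weighted by the average time the
    component is unavailable if it fails in interval i."""
    D = t / N
    probs = aging_failure_probs(T, t, N, alpha, beta)
    weighted = sum(p * (t - (2 * i - 1) * D / 2.0)
                   for i, p in enumerate(probs, start=1))
    return weighted / t


# Hypothetical inputs: a 37-year-old component studied over one year.
u_r = repairable_unavailability(f=0.5, mttr_hours=87.6)
u_a = aging_unavailability(T=37.0, t=1.0, N=100, alpha=45.0, beta=5.0)
```

The number of intervals N controls the discretization; a larger N gives a finer approximation at a higher computing cost.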

The total unavailability of the two failure modes is obtained using a union concept:

$$U_t = U_r + U_a - U_r U_a \qquad (7.5)$$

The above equations are general and apply to both AC and HVDC components in the power system.

7.2.2.4 Calculating state capacity probability of HVDC system

The unavailability alone is sufficient to build a two-state model for AC components, whereas an equivalent multiple capacity state model is needed for an HVDC pole with multiple components. The HVDC pole has its full capacity when all HVDC components are available. The probability at the full capacity is calculated as follows:

$$P_{full} = \prod_{i=1}^{K} (1 - U_i) \qquad (7.6)$$

where Ui is the unavailability of Component i and K the number of components in the HVDC pole. A failure of some components leads to the derated state, which can be called the half-pole operation mode. The probability of the derated capacity level is calculated as follows:

$$P_{dr} = \sum_{j=1}^{M} \frac{\prod_{n=1}^{N_j} U_n}{\prod_{n=1}^{N_j} (1 - U_n)} \, P_{full} \qquad (7.7)$$

where M is the number of failure events that lead to the derated capacity level and Nj is the number of failed components in the jth failure event. In most cases, a failure event involves only one critical component (Nj = 1). The probability of the full HVDC pole being down (at zero capacity) is:

$$P_{dw} = 1 - P_{full} - P_{dr} \qquad (7.8)$$

Multiple derated states can be modeled in a similar way if necessary [24].
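Equations (7.5) to (7.8) can be sketched in a few lines of Python. The function names, the representation of derating events as tuples of component indices and the numerical values below are assumptions for illustration, not the actual model data of the HVDC poles.

```python
from math import prod


def total_unavailability(u_r, u_a):
    """Equation (7.5): union of repairable and aging failure modes."""
    return u_r + u_a - u_r * u_a


def pole_state_probabilities(unavail, derating_events):
    """Return (P_full, P_dr, P_dw) for one HVDC pole.

    unavail         -- list of total unavailabilities U_i of the K components
    derating_events -- list of tuples; each tuple holds the indices of the
                       components that are failed in one derating event
    """
    # Equation (7.6): all components available.
    p_full = prod(1.0 - u for u in unavail)
    # Equation (7.7): swap (1 - U_n) for U_n for the failed components.
    p_dr = sum(
        prod(unavail[n] / (1.0 - unavail[n]) for n in event) * p_full
        for event in derating_events
    )
    # Equation (7.8): everything else is the pole-down state.
    p_dw = 1.0 - p_full - p_dr
    return p_full, p_dr, p_dw


# Hypothetical three-component pole in which the failure of component 0
# alone leads to half-pole operation.
p_full, p_dr, p_dw = pole_state_probabilities([0.1, 0.2, 0.05], [(0,)])
```

The division by (1 - U_n) assumes that no component has an unavailability of exactly 1.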

7.2.2.5 Evaluating risk of power system

The purpose is to evaluate the impact of different replacement strategies on the risk exposure of the power system. Generally, it is necessary to evaluate the risk of the composite generation and transmission system that contains the component to be replaced. The procedure and details of composite system risk evaluation can be found in References [1, 3]. However, in some cases, a simplified risk evaluation model can be applied. For the replacement of an HVDC component, the subsystem impacted by the replacement is the region that the HVDC link supplies power to. In this case, a power source-demand system risk model is sufficient for comparison between different replacement strategies. In the risk evaluation model, the HVDC poles and transmission lines supplying the region, as well as the local generators in the region, are all treated as power sources, while the total load with an annual load curve is the demand. The risk evaluation method for such a model is summarized as follows:

1. A multiple level load model is created using chronological hourly load records during one year. All the load levels are considered successively and the resulting indices for each load level are weighted by their probability to obtain annual indices.

2. System states at each load level are selected using Monte Carlo simulation techniques. This includes:

• The HVDC pole states are modeled using a multiple-state random variable (full up, down and derated states)

• Generating unit states are modeled using multiple-state random variables or two-state random variables (up and down states) depending on the generators.

• AC transmission equipment states are modeled using two-state random variables (up and down states).

Take a three-state random variable for an HVDC component as an example. A uniformly distributed random number Rj is drawn from [0, 1] for each power source component. The state of the jth power source component is determined by

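$$s_j = \begin{cases} 0 \ (\text{up}) & \text{if } R_j > (P_{dr})_j + (P_{dw})_j \\ 1 \ (\text{down}) & \text{if } (P_{dr})_j < R_j \le (P_{dr})_j + (P_{dw})_j \\ 2 \ (\text{derated}) & \text{if } 0 \le R_j \le (P_{dr})_j \end{cases} \qquad (7.9)$$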

where Pdw and Pdr are the probabilities of the down and derated states of the jth power source component. For an AC component represented by a two-state random variable, the sampling concept is the same, without the derated state.

3. The capacity of each power source component is determined according to its state so that the total system power capacity can be obtained. For a given load level, the demand not supplied in the kth sampling is calculated by

$$DNS_k = \max\left\{ 0, \; L_i - \sum_{j=1}^{m} G_{jk}(s_j) \right\} \qquad (7.10)$$

where Li is the load at the ith level, Gjk the available capacity of the jth power source in the kth sampling and m the number of power sources supplying the subsystem considered.

If uncertainty of the load is considered, the load level Li is used as the mean with the uncertainty represented by a standard deviation σi. A standard normal distribution random number Xk is created using an approximate inverse transformation method [1, 3]. The sampled value of the load in the kth sampling is given by

$$L_{\sigma i} = X_k \sigma_i + L_i \qquad (7.11)$$

The Lσi is used to replace Li in Equation (7.10) in order to capture the uncertainty of the load.

4. The EENS (Expected Energy Not Supplied), which reflects the system supply risk, is calculated by

$$EENS = \sum_{i=1}^{N_L} \frac{T_i}{S_i} \left( \sum_{k=1}^{S_i} DNS_k \right) \qquad (7.12)$$

where, NL is the number of the load levels in the multiple step model of an annual load curve, Ti the time length of the ith load level and Si the number of samples at the ith load level.
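The four steps above can be sketched as a small Monte Carlo routine. This is a minimal illustration under stated assumptions, not the program used in the study: the data layout (a dictionary per power source with a capacity tuple indexed by state), the function names and the fixed number of samples per load level are all hypothetical, and load uncertainty (Equation 7.11) is left out for brevity.

```python
import random


def sample_state(p_dr, p_dw, rng):
    """Equation (7.9): 0 = up, 1 = down, 2 = derated."""
    r = rng.random()
    if r > p_dr + p_dw:
        return 0
    if r > p_dr:
        return 1
    return 2


def eens(load_levels, sources, samples_per_level, seed=0):
    """Equations (7.10) and (7.12).

    load_levels -- list of (L_i in MW, T_i in hours)
    sources     -- list of dicts with keys 'cap' (capacities in MW for the
                   up, down and derated states), 'p_dr' and 'p_dw'; a
                   two-state component uses p_dr = 0
    """
    rng = random.Random(seed)
    total = 0.0
    for load, hours in load_levels:
        dns_sum = 0.0
        for _ in range(samples_per_level):
            capacity = sum(
                src["cap"][sample_state(src["p_dr"], src["p_dw"], rng)]
                for src in sources
            )
            dns_sum += max(0.0, load - capacity)  # Equation (7.10)
        total += hours / samples_per_level * dns_sum  # Equation (7.12)
    return total


# Hypothetical two-level load model and two power sources.
levels = [(900.0, 2000.0), (600.0, 6760.0)]
sources = [
    {"cap": (476.0, 0.0, 238.0), "p_dr": 0.2, "p_dw": 0.25},  # three-state
    {"cap": (600.0, 0.0, 0.0), "p_dr": 0.0, "p_dw": 0.01},    # two-state
]
estimate_mwh = eens(levels, sources, samples_per_level=5000)
```

In practice the number of samples at each load level would be chosen from a convergence criterion rather than fixed in advance.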

7.2.2.6 Benefit/cost Analysis in Comparison between Replacement Strategies

Different replacement strategies – replacing before the component fails, or replacing after it fails, or not replacing at all – have different system risks and costs. Therefore they can be compared using a benefit/cost analysis approach. The analysis may vary slightly depending on the case. The detail of benefit/cost analysis is illustrated using an actual example in the following subsection.

7.2.3 Actual Example

7.2.3.1 Case description

The Vancouver Island region in the BCTC system is supplied through two 500 kV lines, a bipolar HVDC link and several local generators. The schematic diagram of the island supply system is shown in Figure 7.1. The HVDC link is an aged system with Pole 1 in service for 37 years and Pole 2 for 30 years. The schematic diagram of the HVDC system is shown in Figure 7.2. According to system planning studies, a new 230 kV AC line will be added to the power supply system in 2008 to replace the aged HVDC system. On the other hand, the existing HVDC system must be available at least until the new 230 kV AC line is in service. A recent field inspection (in 2005) found that the cable 1 of HVDC Pole 1 has some armor damage with three broken wire strands [25]. Cable experts estimated that the damaged section (5 km) of the cable 1 has a very high probability of fatal failure within a couple of years. The questions the utility faces are: Should the damaged section be replaced? If yes, should it be replaced before or after it fails?

7.2.3.2 Study Conditions

The main study conditions include:

• A new 230 kV AC line is expected to be in service in 2008. The HVDC system has a much smaller effect on the reliability of the island supply system after the 230 kV line is in service than before.

• The HVDC system is an old system. Once the 230 kV line is in service, the HVDC system will be kept for a transition period and possibly retired around 2010 when the cost of maintenance and repairs exceeds the benefit. The time frame in the study is the 5 years from 2006 to 2010.

• The replacement of the damaged cable section will take about one year because marine work can only be performed in fair weather. Preparation for the replacement also takes a long time to complete.

Figure 7.1 Schematic diagram of Vancouver Island supply system.

• The peak loads in the island region from 2006 to 2010 are based on the recent load forecast. It has been assumed the annual load curves for all the 5 years follow the same shape that is based on the hourly load records in 2005.

• Both Poles 1 and 2 of the HVDC system were modeled using three capacity states (full up, derated to half and full down). If the cable 1 of Pole 1 has the end-of-life failure with no replacement, the maximum capacity of HVDC Pole 1 will be derated to 156 MW from 312 MW whereas the maximum capacity of Pole 2 will be derated to 336 MW from 476 MW according to the HVDC configuration.

• Both repairable and aging failure modes of all the components in the HVDC system are modeled whereas only repairable failure modes for AC transmission components and local generators are considered. The repairable failure data are obtained from historical records.

7.2.3.3 Capacity state probabilities of HVDC system

The capacity state probabilities of the existing HVDC system (Poles 1 and 2), of the HVDC system with the damaged cable section replaced and of the HVDC system with the cable 1 out-of-service without replacement are evaluated using the methods given in Sections 7.2.2.2 and 7.2.2.4. The results are shown in Table 7-1 to Table 7-6. The following observations can be made:

[Figure 7.1 diagram labels: two 500 kV lines; HVDC Pole 1 (312 MW/156 MW); HVDC Pole 2 (476 MW/238 MW); 230 kV AC line (future); local generators ASH, JHT 1-6, PUN, LDR 1-2, SCA 1-2, UCO/Zeballos, ICG and JOR; Vancouver Island load.]

Figure 7.2 Schematic diagram of HVDC system.

• Pole 1 has extremely high failure probability since its age has greatly exceeded its mean life. The failure probability of Pole 2 is also high because the age of major components is close to the mean life.

• Replacing the damaged cable section slightly increases the probabilities of both poles serving the maximum capacity levels. However, the increase is very small because only one 5 km section is replaced while the remaining portion (27.5 km) is still aged cable, so the impact of the cable 1 on the capacity probability distribution of the whole HVDC system is minimal.

• The probabilities of HVDC Poles 1 and 2 performing at the maximum capacity without the cable 1 are slightly higher than those with the cable 1, which results in slightly lower probabilities at the zero and/or derated capacity levels for the case without the cable 1. This is because all the cables are required to reach the maximum capacity; that is, all the cables are logically in series in the reliability model. A basic concept in reliability evaluation is that removing a component from a series logical model leads to a higher success probability (here, at the maximum capacity) or a lower failure probability. In this example, the main impact of the cable 1 being out of service is the reduced capacities of both Poles 1 and 2, not the capacity state probabilities.
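The series-logic argument in the last observation can be made explicit with Equation (7.6). As a sketch, let U_c denote the unavailability of the removed component and P'_full the full-capacity probability of the series model without it (both symbols are introduced here only for illustration):

$$P'_{full} = \prod_{i \ne c} (1 - U_i) = \frac{P_{full}}{1 - U_c} \ge P_{full}, \qquad 0 \le U_c < 1$$

The probability of the highest available capacity level therefore cannot decrease when a component is taken out of the series model, although that capacity level itself is lower.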

Table 7-1 Capacity state probabilities of Pole 1 for the existing HVDC system.

Year    at 312 MW      at 156 MW      at zero MW
2006    0.106243735    0.152434503    0.741321762
2007    0.075725132    0.124754433    0.799520435
2008    0.051009050    0.097306577    0.851684374
2009    0.032753449    0.072326656    0.894919895
2010    0.019887959    0.050931581    0.929180460

[Figure 7.2 diagram labels: Pole 1 and Pole 2 converter equipment (valves, reactors, transformers, filters); submarine cables 1 to 4 and a submarine return cable.]

Table 7-2 Capacity state probabilities of Pole 2 for the existing HVDC system.

Year    at 476 MW      at 238 MW      at zero MW
2006    0.554333069    0.216997424    0.228669507
2007    0.512838492    0.217244321    0.269917187
2008    0.463541606    0.218515517    0.317942876
2009    0.413689862    0.216221708    0.370088431
2010    0.362198344    0.211159543    0.426642113

Table 7-3 Capacity state probabilities of Pole 1 for the cable 1 replaced.

Year    at 312 MW      at 156 MW      at zero MW
2006    0.106944494    0.152709123    0.740346383
2007    0.076228654    0.125058682    0.798712664
2008    0.051351387    0.097602359    0.851046254
2009    0.032975628    0.072585300    0.894439072
2010    0.020024502    0.051138621    0.928836877

Table 7-4 Capacity state probabilities of Pole 2 for the cable 1 replaced.

Year    at 476 MW      at 238 MW      at zero MW
2006    0.557989321    0.214615684    0.227394995
2007    0.516248523    0.215131435    0.268620042
2008    0.466652574    0.216735362    0.316612064
2009    0.416496079    0.214758473    0.368745447
2010    0.364685055    0.210011597    0.425303347

Table 7-5 Capacity state probabilities of Pole 1 for the cable 1 out-of-service.

Year    at 156 MW      at 78 MW       at zero MW
2006    0.122508347    0.147066434    0.730425219
2007    0.087353876    0.123346221    0.789299902
2008    0.059378386    0.098648047    0.841973567
2009    0.038348715    0.074954605    0.886696679
2010    0.023438967    0.053882509    0.922678524

Table 7-6 Capacity state probabilities of Pole 2 for the cable 1 out-of-service.

Year    at 336 MW      at 168 MW      at zero MW
2006    0.578098707    0.201516114    0.22038518
2007    0.535003695    0.203510560    0.261485745
2008    0.483762895    0.206944509    0.309292596
2009    0.431930277    0.206710684    0.361359039
2010    0.378361967    0.203697894    0.417940139

7.2.3.5 Risk evaluation of the power supply system

The risk of the power system supplying the Vancouver Island region was evaluated for three cases: with the existing cable 1, with the cable 1 replaced and with the cable 1 out-of-service. The EENS index (Expected Energy Not Supplied) is used as the indicator of system risk. The EENS indices for the three cases from 2006 to 2010 are shown in Table 7-7. It can be seen that the EENS indices for using the existing damaged cable 1 and for replacing the damaged section of the cable 1 are almost the same, because the two cases have the same state capacities and only very minor differences in capacity probability distributions. The EENS indices for the case with the cable 1 out-of-service are higher than in the other two cases. Note that the EENS indices drop starting in 2008 because the 230 kV AC line is expected to be in service from that year.

Table 7-7 EENS for VI supply system (MWh/year).

Year    With existing Cable 1    With replaced Cable 1    Without Cable 1
2006    4850                     4843                     6097
2007    5655                     5642                     6881
2008    1140                     1138                     1406
2009    1271                     1268                     1504
2010    1542                     1541                     1755

7.2.3.6 Replacement strategy analysis

Using the results in Table 7-7, a replacement strategy analysis for the cable 1 can be performed. The following three options are considered for comparison:

1. Replacing the damaged section of the cable 1 in 2006 before it fails.

2. Replacing the damaged section of the cable 1 after it fails.

3. Not replacing the damaged section of the cable 1 (using it until it fails and operating the HVDC system without it after its failure).

As mentioned earlier, the replacement duration is assumed to be one year and the period of the five years from 2006 to 2010 is considered in the analysis.

1. If the cable 1 is replaced in 2006 before it fails, the HVDC system will be operated without the cable 1 for replacement in 2006 and with it (after replacement) from 2007 to 2010. The total EENS for the period of the 5 years is: 6097+5642+1138+1268+1541 = 15,686 MWh.

2. If the cable 1 is replaced after it fails, there are different possibilities since it can fail in any year from 2006 to 2010. If it fails in 2006 and is replaced right away, the total EENS for the 5-year period is the same as that for Option (1). If it fails in a later year and replacement starts right after its failure, the HVDC will be operated without the cable 1 for that year, with the existing cable 1 for the years before it and with the replaced cable 1 for the years after it. For example, if it fails in 2007, the total EENS for the period of the 5 years is: 4850 + 6881 + 1138 + 1268 + 1541 = 15,678 MWh. The total EENS indices for replacement after the failure in the period of the 5 years for the different failure years are summarized in Table 7-8.

3. If the cable 1 is never replaced, the Vancouver Island supply risk also depends on the year in which it fails. The later it fails, the lower the risk. For example, if it fails in 2008, the total EENS for the period of the 5 years is: 4850 + 5655 + 1406 + 1504 + 1755 = 15,170 MWh. The total EENS indices for not replacing the cable 1 after its failure in the period of the 5 years for the different failure years are also summarized in Table 7-8. Note that if the cable 1 fails in early 2010, the total EENS without replacement in the 5-year period is the same as that with replacement, because the replacement is assumed to take one year and the HVDC will therefore still be operated without the cable 1 during replacement. Performing the replacement in 2010 will only benefit the island reliability after 2010, and this benefit will be minimal. As mentioned earlier, according to the previous planning studies, once the 230 kV line is in service, the HVDC system will be kept for just a few years before its complete retirement.

It can be seen by comparing the EENS indices between Options 1 and 2 that replacing the cable 1 after its failure results in a risk that is no higher than replacing it in advance, and the later the failure occurs, the lower the risk. Between Options 2 and 3, the reduced risk due to replacing the cable 1 should be compared against the cost required to replace it. The reductions of EENS and risk cost due to replacing the cable 1 for different failure years are given in Table 7-9. The reduced risk cost is the product of the reduced EENS and the unit interruption cost. The unit interruption cost is obtained by dividing the Provincial Gross Domestic Product by the electric energy consumption in the province where the utility is located, and is $CAN3.07/kWh.

The cost of replacing the damaged section (5 km) of the cable 1 is estimated to be $8 million. The reduction of risk cost due to replacement is the benefit and the benefit/cost ratios for different failure years are listed in Table 7-10. It can be seen that the benefit/cost ratio for any year in which the cable 1 may fail is less than 1.0. This indicates that not replacing the cable 1 is more cost effective than replacing it.

Table 7-8 Total EENS (MWh) in the 5-year period for Options 2 and 3.

Failure year of Cable 1    Option 2    Option 3
In 2006                    15,686      17,643
In 2007                    15,678      16,396
In 2008                    14,720      15,170
In 2009                    14,690      14,904
In 2010                    14,671      14,671
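The totals in Table 7-8 follow mechanically from the yearly EENS values in Table 7-7. The short Python sketch below reproduces that bookkeeping; the function and variable names are assumptions made for this illustration, while the numbers are those of Table 7-7.

```python
YEARS = [2006, 2007, 2008, 2009, 2010]

# Yearly EENS (MWh/year) from Table 7-7.
EENS = {
    "existing": [4850, 5655, 1140, 1271, 1542],
    "replaced": [4843, 5642, 1138, 1268, 1541],
    "without":  [6097, 6881, 1406, 1504, 1755],
}


def total_eens(failure_index, replace):
    """Total EENS over the 5 years if cable 1 fails (or is taken out for
    replacement) in YEARS[failure_index]. The replacement takes one year."""
    total = 0
    for y in range(len(YEARS)):
        if y < failure_index:
            case = "existing"      # before the failure year
        elif y == failure_index or not replace:
            case = "without"       # during replacement, or never replaced
        else:
            case = "replaced"      # after the replacement is completed
        total += EENS[case][y]
    return total


# Option 1 (replace in 2006 before failure) has the same EENS pattern as
# Option 2 with a failure in 2006.
option_1 = total_eens(0, replace=True)

# Options 2 and 3 for each possible failure year.
table_7_8 = {
    YEARS[k]: (total_eens(k, replace=True), total_eens(k, replace=False))
    for k in range(len(YEARS))
}
```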

Table 7-9 Reduction of EENS (MWh) and risk cost (M$) due to replacing the cable 1.

Failure year of Cable 1    Reduction of EENS (MWh)    Reduction of risk cost (M$)
In 2006                    1,957                      6.008
In 2007                    718                        2.204
In 2008                    450                        1.382
In 2009                    214                        0.657
In 2010                    0                          0.000

Table 7-10 Benefit/cost ratios for replacement of the cable 1.

Failure year of Cable 1    Benefit/cost ratio
In 2006                    0.751
In 2007                    0.276
In 2008                    0.173
In 2009                    0.082
In 2010                    0
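The arithmetic behind Tables 7-9 and 7-10 can be sketched as follows. The inputs are taken from Table 7-8 and from the text (unit interruption cost of $CAN3.07/kWh, replacement cost of $8 million); the names are assumptions for this example. Results computed without intermediate rounding may differ from the tabulated values in the last digit.

```python
UNIT_INTERRUPTION_COST = 3.07   # $CAN per kWh
REPLACEMENT_COST_M = 8.0        # replacement cost in M$

# Total 5-year EENS (MWh) from Table 7-8: (Option 2, Option 3).
TOTAL_EENS = {
    2006: (15686, 17643),
    2007: (15678, 16396),
    2008: (14720, 15170),
    2009: (14690, 14904),
    2010: (14671, 14671),
}


def benefit_cost(option2_mwh, option3_mwh):
    """Return (EENS reduction in MWh, risk cost reduction in M$, B/C ratio)."""
    reduction_mwh = option3_mwh - option2_mwh
    # MWh -> kWh, then dollars -> millions of dollars.
    risk_cost_m = reduction_mwh * 1000 * UNIT_INTERRUPTION_COST / 1e6
    return reduction_mwh, risk_cost_m, risk_cost_m / REPLACEMENT_COST_M


results = {year: benefit_cost(*totals) for year, totals in TOTAL_EENS.items()}

# Replacement is justified only for failure years with a ratio above 1.0.
justified_years = [year for year, (_, _, ratio) in results.items() if ratio > 1.0]
```

With these inputs no failure year reaches a ratio of 1.0, which matches the conclusion drawn from Table 7-10.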

7.2.4 Summary

In this application, a risk evaluation based approach to replacement strategy of aged HVDC components has been presented. The approach includes the following four steps:

1. Estimating average unavailability of individual HVDC components

2. Calculating capacity probability distributions of the HVDC subsystem for different replacement strategies

3. Assessing risks of the power system containing the HVDC subsystem

4. Performing a probabilistic benefit/cost analysis for different replacement strategies

Conceptually, the approach is not limited to the replacement strategy of aged HVDC components but can be applied to the replacement of any other system component.

The replacement strategy for an aged submarine cable of the HVDC link in a power supply system at British Columbia Transmission Corporation has been analyzed as an example to demonstrate the actual application of the presented approach. The procedure of the analysis has been explained in detail through the example. The results show that not replacing the damaged cable is the most cost effective option in this particular case.

7.3 Determination of the Number and Timing of Spare Transformers [1, 26]

7.3.1 Problem description

A sophisticated spare analysis is a challenge in asset management. Most utilities so far use a deterministic method, which is essentially based on engineering judgment. There are several drivers for the need of spares. First of all, a repairable failure of power equipment such as a transformer, reactor, capacitor or generator may often require a relatively long repair time. If the equipment in a system is inadequate due to a lack of spares, the system may experience an extensive loss of energy supply and a financial loss of revenue. Secondly, equipment aging has been a major concern in utilities for years. Aged equipment implies a higher failure probability and thus a greater need for spares. Besides, the policy of common spares shared by an equipment group is becoming popular under the competitive environment in the power industry. Traditionally, for example, the N-1 security principle has been widely used for substation transformers. Each substation is often designed to have two or more transformers in parallel so that the peak load can still be carried when one of the transformers fails. This is a secure but very expensive criterion. Compared to the N-1 security principle in each substation, the common spare transformer strategy can avoid considerable capital expenditure and still assure a sufficient reliability level. The following are two basic questions in the spare analysis:

1. How many spares are needed and when should each of them be in place in order to maintain the system reliability?

2. How can spares be financially justified?

Generally, there are two risk-evaluation based methods for the spare analysis. The first is based on reliability criteria and the second on probabilistic risk cost models. More details of the two methods can be found in Reference [3]. In this section, only the reliability criterion method is discussed and an example of a transformer group is used to demonstrate the application of the method.

7.3.2 Methodology

7.3.2.1 Procedure of the method

Spares are considered for an equipment group. Each component in the group has its failure probability or unavailability and when it fails, a spare must be put in service to assure normal operation of the system. Therefore how many spares are needed depends on the requirement for group reliability. With the unavailability of individual components, a Monte Carlo simulation or state enumeration technique can be used to conduct evaluations of group failure probability with and without spares. The spare analysis for an equipment group includes the following steps:

1. Calculating unavailability of components in the group

2. Evaluating individual failure event probabilities and the total group failure probability

3. Performing spare analysis based on a specified reliability criterion

4. Repeating Steps 1 to 3 for all years in consideration

7.3.2.2 Unavailability of components

There are two failure modes for power system equipment: repairable and aging failures. In many risk evaluations of power systems, only the unavailability due to repairable failures is considered. However, a model for unavailability due to aging failures must be taken into consideration in the spare analysis, as in the replacement strategy analysis given in Section 7.2, since aging failure is one of the reasons why spares are needed, particularly for an aged equipment group.

The unavailability values of components due to both repairable and aging failures are calculated using the same equations as given in Section 7.2.2.2, i.e., Equations (7.1) – (7.5). It should be noted that the input data (scale and shape parameters of the Weibull distribution) for unavailability estimation of different components (cable or transformers) are different and based on respective historical statistics.

7.3.2.3 Group reliability and spare analysis

As mentioned above, the evaluation of group reliability can be conducted using a Monte Carlo or state enumeration technique. The procedure using the state enumeration method is given to explain the concept. Consider a three-component group. It is assumed that the unavailability values of the three components have been calculated and they are U1, U2 and U3. An event probability table is built as shown in Table 7-11.

Table 7-11 Event probability.

No.   Event                 Event probability
1     1 down, 2 & 3 up      U1·(1-U2)·(1-U3)
2     2 down, 1 & 3 up      U2·(1-U1)·(1-U3)
3     3 down, 1 & 2 up      U3·(1-U1)·(1-U2)
4     1 & 2 down, 3 up      U1·U2·(1-U3)
5     1 & 3 down, 2 up      U1·U3·(1-U2)
6     2 & 3 down, 1 up      U2·U3·(1-U1)
7     1, 2 & 3 all down     U1·U2·U3
8     1, 2 & 3 all up       (1-U1)·(1-U2)·(1-U3)

Cumulative failure probabilities for each failure level can be calculated from the table.

Probability of any one failure:
P(a) = U1·(1-U2)·(1-U3) + U2·(1-U1)·(1-U3) + U3·(1-U1)·(1-U2)

Probability of any two failures:
P(b) = U1·U2·(1-U3) + U1·U3·(1-U2) + U2·U3·(1-U1)

Probability of all three components failing:
P(c) = U1·U2·U3
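The state enumeration above can be written generically for any small group. The following Python sketch enumerates all up/down states and accumulates the probability of each failure level; it assumes independent component failures, and the function name and numerical values of U1, U2 and U3 are hypothetical.

```python
from itertools import product

def failure_level_probabilities(unavailabilities):
    """Return levels, where levels[k] is P(exactly k components are down)."""
    n = len(unavailabilities)
    levels = [0.0] * (n + 1)
    # Each state is a tuple of flags; True means the component is down.
    for state in product((False, True), repeat=n):
        prob = 1.0
        for down, u in zip(state, unavailabilities):
            prob *= u if down else (1.0 - u)
        levels[sum(state)] += prob
    return levels

# Hypothetical unavailability values, for illustration only.
u1, u2, u3 = 0.05, 0.10, 0.20
levels = failure_level_probabilities([u1, u2, u3])
p_a, p_b, p_c = levels[1], levels[2], levels[3]
```

Here `p_a`, `p_b` and `p_c` correspond to P(a), P(b) and P(c), and `levels[0]` is the probability of event 8 in Table 7-11. The number of states doubles with every added component, so full enumeration is practical only for small groups.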


Given a system failure criterion, the spare analysis can be conducted. For instance, if the system failure criterion for this example is that any failure of one or more components results in a group failure, the spare analysis is shown in Table 7-12. Note that the reliability values in the column of “Example value” are arbitrarily given here just for the purpose of explanation. If an acceptable group reliability level is specified, the number of spares can be determined. For instance, if the acceptable group reliability level is 0.9, the first spare is needed. If the acceptable level is selected as 0.98, the second one is also needed.

Table 7-12 Spare analysis based on a group reliability criterion.

Spare    Group reliability       Example value   Spare contribution
Zero     1.0-[P(a)+P(b)+P(c)]    0.85
First    1.0-[P(b)+P(c)]         0.95            0.10
Second   1.0-P(c)                0.99            0.04
Third    1.0                     1.00            0.01
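The logic of Table 7-12 can be sketched in Python as below. The failure-level probabilities P(a) = 0.10, P(b) = 0.04 and P(c) = 0.01 are the ones implied by the arbitrary example values in the table; the function names are hypothetical and the sketch assumes that s spares cover up to s simultaneous failures.

```python
def group_reliability_by_spares(levels):
    """levels[k] = P(exactly k failures); returns reliability for 0..n spares."""
    n = len(levels) - 1
    # With s spares the group fails only if more than s components are down.
    return [1.0 - sum(levels[s + 1:]) for s in range(n + 1)]

def spares_needed(levels, target):
    """Smallest number of spares whose group reliability meets the target."""
    for spares, reliability in enumerate(group_reliability_by_spares(levels)):
        if reliability >= target:
            return spares
    return None  # target above 1.0 cannot be met

# P(no failure), P(a), P(b), P(c) implied by the example values in Table 7-12.
example_levels = [0.85, 0.10, 0.04, 0.01]
reliabilities = group_reliability_by_spares(example_levels)
spares_for_090 = spares_needed(example_levels, 0.90)
spares_for_098 = spares_needed(example_levels, 0.98)
```

With these inputs the sketch reproduces the reading given in the text: one spare for an acceptable level of 0.9 and two spares for a level of 0.98.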

7.3.2.4 Reliability criterion

The historical reliability performance metric designated as System Average Interruption Duration Index (SAIDI) has been utilized in BCTC for setting the company performance target [27]. The SAIDI of 2.1 hours/year/delivery point is used as a specified reliability criterion in the actual example given in the next section. Conceptually, the SAIDI can be converted to an unavailability target for a group of substations as shown in the following example:

Assume that 35 substations (delivery points) are considered as a group in the study. Therefore,

Total average interruption duration target for the group is:

SAIDI × (number of delivery points) = 2.1×35 = 73.5 hrs/year

Therefore, the unavailability target = 73.5/8760 = 0.0084

The availability target = 1 – 0.0084 = 0.9916 (or 99.16%)

The above example indicates that the availability of 0.9916 is required as a specified reliability criterion for this substation group in order to maintain the company performance target in SAIDI of 2.1 hours/year/delivery point.
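The conversion used in this example is simple enough to express as a small helper. The Python sketch below follows the arithmetic shown above (8760 hours per year); the function and variable names are illustrative only.

```python
HOURS_PER_YEAR = 8760

def saidi_to_availability(saidi_hours, delivery_points):
    """Convert a SAIDI target (hours/year/delivery point) for a group of
    delivery points into total hours, unavailability and availability."""
    total_hours = saidi_hours * delivery_points      # hrs/year for the group
    unavailability = total_hours / HOURS_PER_YEAR
    availability = 1.0 - unavailability
    return total_hours, unavailability, availability

# The 35-substation group from the example above.
total, unavail, avail = saidi_to_availability(2.1, 35)
```

Rounded to four decimals, `unavail` and `avail` match the 0.0084 and 0.9916 obtained by hand.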

It should be noted that converting the SAIDI target into an availability is not the only way to set the reliability criterion. Other approaches can be used depending on the case or the utility's requirements [1].


7.3.3 Actual example [28]

7.3.3.1 Case description

The 138/25 kV transformers, which have capacities of 10-30 MVA, are considered as a transformer group that is backed up by 138/25 kV 25 MVA spare transformers. Three study scenarios are presented in this example. The first focuses on the fixed turn ratio transformer group, which consists of 34 transformers located in 29 substations. The second focuses on the on-load tap changing (LTC) transformer group, which consists of 16 transformers located in 12 substations. The third combines the fixed turn ratio and LTC transformers into a single group of 50 transformers located in 35 substations. The planning period for the transformer group is 10 years, from 2006 to 2015. The Weibull distribution model for the aging failure of transformers has an estimated mean life of 57.1 years with a standard deviation of 14.5 years. These two parameters were obtained from historical records for the same type of transformers at BCTC. The reliability criterion for each scenario and the results are presented in the following.
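The Weibull model is specified by shape and scale parameters, while the case description quotes a mean life and a standard deviation. One way to move between the two is through the standard Weibull moment relations, mean = scale·Γ(1+1/shape) and variance = scale²·[Γ(1+2/shape) − Γ(1+1/shape)²]. The Python sketch below solves these numerically by bisection. It is an illustration under that assumption only; the text does not state how the parameters were actually derived at BCTC, and the function names are hypothetical.

```python
import math

def weibull_cv(shape):
    """Coefficient of variation (std/mean) of a Weibull distribution."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return math.sqrt(g2 / (g1 * g1) - 1.0)

def weibull_from_mean_std(mean, std, lo=0.1, hi=50.0, tol=1e-10):
    """Find (shape, scale) matching a given mean and standard deviation.

    The coefficient of variation decreases as the shape grows, so the
    shape can be bracketed and found by bisection.
    """
    target = std / mean
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weibull_cv(mid) > target:
            lo = mid   # spread too wide: the shape must be larger
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = mean / math.gamma(1.0 + 1.0 / shape)
    return shape, scale

# Mean life and standard deviation quoted in the case description.
shape, scale = weibull_from_mean_std(57.1, 14.5)
```

The resulting `shape` and `scale` can be checked by substituting them back into the two moment relations.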

7.3.3.2 Fixed turn ratio transformer group

Total average interruption duration target is:

SAIDI × (number of delivery points) = 2.1×29 = 60.9 hrs/year

The unavailability target = 60.9/8760 = 0.007

The availability target = 1 – 0.007 = 0.993 (or 99.3%)

The availability of 0.993 is used as the specified reliability criterion for the 34 fixed turn ratio transformers located in 29 substations. The transformer group reliability must be equal to or above this specified reliability level at all times during the planning period (2006 – 2015). The SPARE program, which was designed for the spare analysis, was used. The results obtained are shown in Table 7-13 and graphically presented in Figure 7.3. Table 7-13 shows the annual availability of the 138/25 kV fixed turn ratio transformer group with zero to three spare transformers. Note that the annual availability decreases over the years because the aging failure probability of the transformers increases with time. Figure 7.3 shows that two fixed turn ratio spare transformers are needed in year 2006, and these two spare transformers are able to meet the specified reliability level (0.993 availability) until the end of the planning period (2015).

Table 7-13 Availability of the 138/25 kV fixed turn ratio transformer group (34 units) for different numbers of spare transformers.

        Number of Spare Transformers
Year    0        1        2        3
2006    0.8757   0.9922   0.9997   1.0000*
2007    0.8651   0.9908   0.9996   1.0000*
2008    0.8537   0.9891   0.9995   1.0000*
2009    0.8417   0.9872   0.9993   1.0000*
2010    0.8289   0.9849   0.9991   1.0000*
2011    0.8154   0.9824   0.9989   0.9999
2012    0.8011   0.9794   0.9986   0.9999
2013    0.7862   0.9761   0.9982   0.9999
2014    0.7706   0.9723   0.9978   0.9999
2015    0.7542   0.9680   0.9972   0.9998

* The values of 1.0000 were obtained by rounding in order to present only 4 digits after decimal.
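For a group of this size, enumerating every up/down state (2^34 combinations) is impractical. If component failures are assumed independent, the distribution of the number of failed units can instead be built up one component at a time. The Python sketch below illustrates that idea; it is not a description of the SPARE program, and the per-unit unavailability of 0.004 is a hypothetical value, not data from the study.

```python
def failure_count_distribution(unavailabilities):
    """Return dist, where dist[k] = P(exactly k components are down)."""
    dist = [1.0]  # with no components, zero failures is certain
    for u in unavailabilities:
        new = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            new[k] += p * (1.0 - u)   # this component is up
            new[k + 1] += p * u       # this component is down
        dist = new
    return dist

def availability_with_spares(unavailabilities, spares):
    """P(number of failed components <= number of spares)."""
    dist = failure_count_distribution(unavailabilities)
    return sum(dist[: spares + 1])

# Hypothetical group: 34 units, each with an unavailability of 0.004.
group = [0.004] * 34
curve = [availability_with_spares(group, s) for s in range(4)]
```

The cost grows with the square of the group size rather than exponentially, and the per-unit values can differ from one component to the next.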


[Figure 7.3 plot: availability (/year) versus year (2006 to 2016); curves: specified reliability criterion, 2 spares]

Figure 7.3 The number of fixed turn ratio spare transformers required to meet the specified reliability level.

7.3.3.3 On-load tap changing (LTC) transformer group

Total average interruption duration target is:

SAIDI × (number of delivery points) = 2.1×12 = 25.2 hrs/year

The unavailability target = 25.2/8760 = 0.0029

The availability target = 1 – 0.0029 = 0.9971 (or 99.71%)

The availability of 0.9971 is used as the specified reliability criterion for the 16 on-load tap changing transformers located in 12 substations. The transformer group reliability must be equal to or above this specified reliability level at all times during the planning period (2006 – 2015). The results obtained using the SPARE program for the 138/25 kV LTC transformers are shown in Table 7-14 and graphically presented in Figure 7.4. It can be seen from Figure 7.4 that one LTC spare transformer is required in year 2006 in order to maintain the specified reliability level (0.9971 availability) for the 138/25 kV LTC transformer group. In year 2012, a single spare transformer will no longer meet the specified reliability criterion, and a second spare LTC transformer will be required in that year.


Table 7-14 Availability of the 138/25 kV LTC transformer group (16 units) for different numbers of spare transformers.

        Number of Spare Transformers
Year    0        1        2
2006    0.9514   0.9989   1.0000*
2007    0.9470   0.9987   1.0000*
2008    0.9422   0.9984   1.0000*
2009    0.9371   0.9981   1.0000*
2010    0.9316   0.9978   1.0000*
2011    0.9257   0.9974   0.9999
2012    0.9194   0.9969   0.9999
2013    0.9127   0.9963   0.9999
2014    0.9055   0.9957   0.9999
2015    0.8979   0.9950   0.9998

* The values of 1.0000 were obtained by rounding in order to present only 4 digits after decimal.
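The number and timing of spares can be read off such a table mechanically. The Python sketch below applies the 0.9971 availability criterion to the values of Table 7-14 and finds, for each year, the smallest number of spares that meets it. The data are the rounded values from the table; the function and variable names are illustrative.

```python
CRITERION = 0.9971  # availability target for the LTC group

# Table 7-14: year -> availability with 0, 1 and 2 spare transformers.
TABLE_7_14 = {
    2006: (0.9514, 0.9989, 1.0000),
    2007: (0.9470, 0.9987, 1.0000),
    2008: (0.9422, 0.9984, 1.0000),
    2009: (0.9371, 0.9981, 1.0000),
    2010: (0.9316, 0.9978, 1.0000),
    2011: (0.9257, 0.9974, 0.9999),
    2012: (0.9194, 0.9969, 0.9999),
    2013: (0.9127, 0.9963, 0.9999),
    2014: (0.9055, 0.9957, 0.9999),
    2015: (0.8979, 0.9950, 0.9998),
}

def spares_required(availabilities, criterion):
    """Smallest spare count whose availability meets the criterion."""
    for spares, availability in enumerate(availabilities):
        if availability >= criterion:
            return spares
    return None  # criterion not met within the tabulated spare counts

schedule = {year: spares_required(row, CRITERION)
            for year, row in TABLE_7_14.items()}
first_year_second_spare = min(y for y, s in schedule.items() if s == 2)
```

The resulting `schedule` requires one spare from 2006 through 2011 and two spares from 2012 onward, in agreement with the reading of Figure 7.4 given in the text.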

[Figure 7.4 plot: availability (/year) versus year (2006 to 2016); curves: specified reliability criterion, 1 spare, 2 spares]

Figure 7.4 The number of LTC spare transformers required to meet the specified reliability level.

7.3.3.4 Combined fixed turn ratio and LTC transformer group

The advantage of an LTC spare transformer is that it can replace either a fixed turn ratio or an LTC transformer. The number of LTC spare transformers needed to back up all the 138/25 kV fixed turn ratio and LTC transformers at BCTC can be determined using the same method.

Total average interruption duration target is:

SAIDI × (number of delivery points) = 2.1×35 = 73.5 hrs/year

The unavailability target = 73.5/8760 = 0.0084

The availability target = 1 – 0.0084 = 0.9916 (or 99.16%)

The availability of 0.9916 is used as the specified reliability criterion for the 50 transformers (fixed turn ratio and LTC) located in 35 delivery points (substations). The transformer group reliability must be equal to or above this specified reliability level at all times during the planning period (2006 – 2015). The results obtained using the SPARE program for this group are shown in Table 7-15 and graphically presented in Figure 7.5. It can be seen from Figure 7.5 that two LTC spare transformers are needed in year 2006 to back up both the fixed turn ratio and LTC transformers, and these two spare transformers are able to maintain the specified reliability level until the end of the planning period (2015).

The results in the three sub-sections above indicate that if the fixed turn ratio spare and LTC spare transformers are considered separately, the system would need 4 spare transformers (2 fixed turn ratio spare transformers and 2 LTC spare transformers) by the end of 2015. However, if the LTC spare transformers are used to back up both the fixed turn ratio and LTC transformers, the system would need only two LTC spare transformers. This strategy leads to a considerable saving in capital investment while still maintaining the specified reliability criterion for the 138/25 kV 25 MVA transformer group.

Table 7-15 Availability of 138/25 kV fixed turn ratio and LTC transformer group (50 units) for different numbers of spare transformers.

        Number of Spare Transformers
Year    0        1        2        3
2006    0.8331   0.9856   0.9992   1.0000*
2007    0.8192   0.9829   0.9989   1.0000*
2008    0.8044   0.9799   0.9986   0.9999
2009    0.7887   0.9764   0.9983   0.9999
2010    0.7722   0.9724   0.9978   0.9999
2011    0.7548   0.9678   0.9972   0.9998
2012    0.7366   0.9626   0.9964   0.9997
2013    0.7175   0.9566   0.9955   0.9997
2014    0.6978   0.9499   0.9944   0.9995
2015    0.6772   0.9423   0.9931   0.9994

* The values of 1.0000 were obtained by rounding in order to present only 4 digits after decimal.

[Figure 7.5 plot: availability (/year) versus year (2006 to 2016); curves: specified reliability criterion, 2 spares]

Figure 7.5 The number of LTC spare transformers required to meet the specified reliability level for the transformer group composed of both fixed turn ratio and LTC transformers.


7.3.4 Summary

In this second application, a reliability based method for spare equipment planning is presented. It can be applied to any power system equipment. The method is very useful in practical planning and decision making processes of utilities in order to minimize the capital investment cost without sacrificing the reliability requirements. The method includes the following main aspects:

• Estimating average unavailability of individual equipment due to both repairable and aging failures
• Evaluating reliability of the equipment group with different numbers of spares
• Selecting a reliability level that the equipment group should meet in the planning period
• Performing the spare analysis to determine the numbers and timing in order to meet the specified reliability criterion.

The 138/25 kV 25 MVA transformer group in the BCTC system is used to illustrate the application procedure of the presented spare equipment analysis method. The reliability criterion in this example is based on the corporate reliability performance target for the SAIDI index at BCTC. The results indicate that two spare LTC transformers are required to meet the specified reliability criterion for the 50 138/25 kV transformers in the 10-year period from 2006 to 2015.

7.4 Further Discussions

This chapter discussed two applications of risk based asset management. The first application is the risk evaluation based approach to the replacement strategy of aged equipment in power systems. The decision on the replacement of an aged HVDC cable in the Vancouver Island supply system at BCTC was used as an example to demonstrate the application procedure. The second application is a risk evaluation based method to determine the number and timing of spare equipment. A 138/25 kV transformer group was used as an example to illustrate the application procedure.

The risk evaluation based techniques can be applied to other aspects of asset management such as preventive maintenance planning, maintenance scheduling, workforce planning in maintenance, equipment retirement strategy and life cycle management. More material can be found in the references. Many risk/reliability evaluation studies have been conducted in probabilistic planning and asset management at BCTC, and 17 technical reports in this area are available at the BCTC website [29].

Traditional asset management focuses on individual equipment, including investigation of its physical condition, operating performance and field environment. A basic fact, which has been more or less ignored in traditional asset management, is that the importance of an individual piece of equipment does not depend on the equipment itself but on the impact that its absence has on overall system reliability. If the absence of a piece of equipment due to maintenance, retirement or failure has little or only marginal impact on system operation risk, it should be given much lower priority in the asset management process. Conversely, if the absence of a piece of equipment from the system has very large effects on system reliability, any issue associated with its maintenance, replacement or retirement should be emphasized. Quantified probabilistic assessment of the impact of equipment unavailability on system reliability is the key idea of the risk evaluation based asset management method presented in this chapter. It should be emphasized that there is no conflict between traditional considerations in asset management and risk evaluation based asset management methods. Both can be performed to enhance the asset management process.


Another important point associated with asset management is the aging failure modeling of equipment. In traditional risk evaluation, only repairable failures are considered, while aging failures are ignored or improperly modeled. Equipment aging is a basic fact in the majority of power systems, and one of the objectives of asset management is to deal with aged system components. Both aging and repairable failure models have been incorporated in the presented risk based asset management method. The input data for the repairable and aging failure models are crucial for risk evaluation based asset management. Collection, processing, storage, reporting and utilization of historical failure records is one of the keys to asset management, and a computerized reliability database management system becomes increasingly important. More information on the data management system can be found in Reference 30.

7.5 References

[1] W. Li, Risk Assessment of Power Systems: Models, Methods, and Applications, IEEE Press and Wiley & Sons, 2005

[2] R. Billinton and R. N. Allan, Reliability Evaluation of Power Systems, Plenum Press, New York, 1996

[3] R. Billinton and W. Li, Reliability Assessment of Electric Power Systems Using Monte Carlo Methods, Plenum Press, New York, 1994

[4] J. Endrenyi, Reliability Modeling in Electric Power Systems, Wiley & Sons, Chichester, 1978

[5] G. J. Anders, Probability Concepts in Electric Power Systems, Wiley & Sons, New York, 1990

[6] A. K. S. Jardine, Maintenance, Replacement and Reliability, Pitman Publishing, London, 1973

[7] N. B. Bloom, Reliability Centered Maintenance, McGraw-Hill, Inc., New York, 2006

[8] IEEE Tutorial Course Text, Electric Delivery System Reliability Evaluation, 05TP175, 2005

[9] IEEE Task Force, “The Present Status of Maintenance Strategies and the Impacts of Maintenance on Reliability”, IEEE Trans. on Power Systems, Vol. 16, No. 4, November 2001, pp. 638-646

[10] J. Endrenyi, G. J. Anders and A. M. Leite da Silva, “Probabilistic Evaluation of the Effect of Maintenance on Reliability – An Application”, IEEE Trans. on Power Systems, Vol. 13, No. 2, May 1998, pp. 576-583

[11] W. Li, E. Vaahedi and P. Choudhury, “Power System Equipment Aging – Assessment, Maintenance and Retirement”, IEEE Power & Energy, Vol. 4, No. 3, May/June 2006, pp. 52-58

[12] W. Li, J. Zhou, J. Lu and W. Yan, “A Probabilistic Analysis Approach to Making Decision on Retirement of Aged Equipment in Transmission Systems”, accepted for publication in IEEE Trans. on Power Delivery

[13] W. Li and J. K. Korczynski, “A Reliability Based Approach to Transmission Maintenance Planning and Its Application in BCTC System”, IEEE Trans. on Power Delivery, Vol. 19, No. 1, January 2004, pp. 303-308

[14] W. Li, “Incorporating Aging Failures in Power System Reliability Evaluation”, IEEE Trans. on Power Systems, Vol. 17, No. 3, August 2002, pp. 918-923

[15] W. Li and S. Pai, “Evaluating Unavailability of Equipment Aging Failures”, IEEE Power Engineering Review, February 2002, pp. 52-54

[16] L. Bertling, Reliability Centered Maintenance for Electric Power Distribution Systems, Ph.D. thesis, Royal Institute of Technology (KTH), Stockholm, 2002

[17] W. Li, P. Choudhury, D. Gillespie and J. Jue, “A Risk Evaluation Based Approach to Replacement Strategy of Aged HVDC Components and Its Application at BCTC”, accepted for publication in IEEE Trans. on Power Delivery

[18] EPRI, High-Voltage Direct Current Handbook, EPRI TR-104166, prepared by GE Industrial and Power Systems, 1994

[19] R. N. Allan, R. Billinton, A. M. Breipohl and C. H. Grigg, “Bibliography on the Application of Probability Methods in Power System Reliability Evaluation: 1992-1996”, IEEE Trans. on Power Systems, Vol. 14, No. 1, 1999, pp. 51-57

[20] R. Billinton, M. Fotuhi-Firuzabad and L. Bertling, “Bibliography on the Application of Probability Methods in Power System Reliability Evaluation 1996-1999”, IEEE Trans. on Power Systems, Vol. 16, No. 4, November 2001, pp. 595-602

[21] EPRI Report, Framework for Stochastic Reliability of Bulk Power System, TR-110048, Palo Alto, California, 1998

[22] CIGRE Task Force 38-03-10, Composite Power System Reliability Analysis, CIGRE Symposium on Electric Power System Reliability, September 16-18, 1991

[23] W. Li, “Evaluating Mean Life of Power System Equipment with Limited End-of-Life Failure Data”, IEEE Trans. on Power Systems, Vol. 19, No. 1, February 2004, pp. 236-242

[24] W. Li, “Probability Distribution of HVDC Capacity Considering Repairable and Aging Failures”, IEEE Trans. on Power Delivery, Vol. 21, No. 1, January 2006, pp. 523-525

[25] BC Hydro Report, Pole 1 and Pole 2 DC Cable - 2005 ROV Inspection (Summary of Results), January 2006

[26] W. Li, E. Vaahedi and Y. Mansour, “Determining Number and Timing of Substation Spare Transformers Using a Probabilistic Cost Analysis Approach”, IEEE Trans. on Power Delivery, Vol. 14, No. 3, July 1999, pp. 934-939

[27] British Columbia Transmission Corporation, “Service Plan: For Fiscal Year 2005/06 to 2007/08”, February 2005, available at: http://www.bcbudget.gov.bc.ca/2005/sp/crownagency/bctc.pdf

[28] W. Wangdee, W. Li, W. Shum and P. Choudhury, “Applying Probabilistic Methods in Determining the Number of Spare Transformers and their Timing Requirements”, IEEE CCECE 2007 conference, Vancouver, April 2007

[29] British Columbia Transmission Corporation, 17 technical reports on reliability assessment, available at: http://www.bctc.com/the_transmission_system/reliability_assessment/

[30] W. Li, H. C. Jonas, S. Yan, B. Corns, P. Choudhury and E. Vaahedi, “Reliability Decision Management System: Experience at BCTC”, IEEE CCECE 2007 conference, Vancouver, April 2007

7.6 Biography

Dr. Wenyuan Li (SM86, F02) is currently a Principal Engineer at BCTC in Canada and an advisory professor of Chongqing University in China. He is an IEEE Fellow. Dr. Li is the author/coauthor of a considerable number of papers in power system planning, operation, optimization, reliability and asset management. He has published four books on power system operation and risk assessment, including “Risk Assessment of Power Systems: Models, Methods, and Applications”, IEEE Press and Wiley & Sons, 2005, and has completed more than sixty technical reports for industry applications. He has also delivered many tutorials and seminars at international conferences (IEEE, PMAPS and CEA) and industrial workshops (EPRI, WECC and NWPP). Dr. Li was the winner of the 1996 “Outstanding Engineer Award” from IEEE Canada and the recipient of the “Significant Reviewer Award” from IEEE PES in 2006. He can be reached at [email protected].