
Paper No. 657

CORROSION 96
The NACE International Annual Conference and Exposition

COMPUTER LEARNING SYSTEMS IN CORROSION

C.P. Sturrock
National Institute of Standards and Technology
Gaithersburg, MD 20899 USA

W.F. Bogaerts
Katholieke Universiteit Leuven - Dept. MTM
B-3001 Leuven, Belgium

ABSTRACT

A collection of data documenting the stress corrosion cracking (SCC) behavior of austenitic stainless steels provides the basis for an automated learning system. Computer learning systems based on classical and non-parametric statistics, connectionist models, machine learning methods, and fuzzy logic are described. An original method for inducing fuzzy rules from input-output data is presented. All of these computer learning systems are used to solve a typical problem of corrosion engineering: determine the likelihood of SCC of austenitic stainless steels given varying conditions of temperature, chloride level, oxygen content, and metallurgical condition in simulated boiling water reactor (BWR) environments. Empirical performance comparisons of the various approaches are summarized, along with the relative intelligibility of the outputs. In both areas the decision tree approach was found to perform very well on the problem investigated.

Keywords: computer learning, decision tree, expert system, fuzzy logic, linear discriminant, nearest neighbor, polynomial network, stainless steel, stress corrosion cracking

Copyright © 1996 by NACE International. Requests for permission to publish this manuscript in any form, in part or in whole, must be made in writing to NACE International, Conferences Division, P.O. Box 218340, Houston, Texas 77218-8340. The material presented and the views expressed in this paper are solely those of the author(s) and are not necessarily endorsed by the Association. Printed in the U.S.A.

INTRODUCTION

Computer learning systems are programs that generate a decision based on the accumulated experience contained within a collection of relevant examples. We contrast computer learning systems with expert systems, which attempt to encode expert knowledge explicitly with rules or some other symbolic form of knowledge representation. The knowledge structures in expert systems are typically drawn from the experience and contribution of experts in a particular domain. Learning systems, by contrast, extract decision criteria directly from samples of solved cases stored in a computer.

Many corrosion processes are poorly or at best incompletely understood. The mechanism of stress corrosion cracking ("SCC") is an excellent example. One author has cited no fewer than seven models of SCC based on various metallurgical and electrochemical considerations [1]. This makes SCC an excellent candidate for computer learning investigations, as very often the most reliable and useful information about SCC has been obtained from empirical experiments. These experiments provide the collection of case data from which computer learning systems can extract useful decision criteria and formulate solutions.

Computer learning systems span several disciplines, including statistics, cognitive science, artificial intelligence, machine learning, psychology, and systems engineering. At one point or another, all of these disciplines have addressed the classification problem, which is the assignment of a particular observation to one of several prespecified categories. The statistics community has been studying the classification problem since the 1930s, when the linear discriminant was first formulated [2]. In the 1960s and 1970s, statisticians and artificial intelligence researchers developed nonparametric methods of statistical pattern recognition [3]. Inspired by neurophysiology, cognitive scientists have developed connectionist models such as artificial neural networks and applied them to a broad range of classification problems [4]. A wide variety of methods for inducing easily understood decision trees or rules from data has been put forth, primarily by the machine learning community [5]. Control systems engineers have formulated continuous or fuzzy logic and developed adaptive fuzzy systems for classification problems, primarily in the area of process control [6]. While in most cases any of these methods can be applied to the same problem, they differ radically in approach, computational requirements, and the final format of their solution.

Despite the abundance of methods for solving the classification problem and the widespread popularity of some of these methods, there have been relatively few comparisons of different methods applied to the same set of data, and none at all applied to corrosion data. In this paper, we report the results of an extensive comparison of computer-learned classification systems built from a database documenting the occurrence of SCC of austenitic stainless steels under widely varying laboratory conditions.

DATA INVESTIGATED

The data investigated consist of 112 records documenting the SCC behavior of austenitic stainless steel specimens in simulated boiling water reactor ("BWR") high-temperature aqueous environments, and are reported fully in Table 1. These data were drawn from a compendium [7] of data compiled from 20 sources documenting related laboratory experiments. Recorded parameters include the temperature, chloride and oxygen levels, the metallurgical condition, and the presence or absence of SCC. A statistical summary of the dataset is provided in Table 2.

METHODS

To a computer, the data in Table 1 are essentially a collection of patterns of features and their correct classification. In keeping with the variation of terminology found in the literature, we shall use the terms case and pattern interchangeably. The process of developing classifiers consists of two phases: training and testing. In the training phase, numerous patterns are used to build a classifier that can predict an output given an acceptable set of inputs. In the testing phase, additional patterns are used to test the classifier's performance by comparing the output reported in the data against that predicted by the classifier. The challenge is to develop classifiers that can classify new patterns correctly, rather than to simply discriminate between the patterns used to build the classifiers.

The most commonly used evaluation metric is the error rate. The true error rate of most classifiers usually cannot be determined with absolute certainty, as it requires precise foreknowledge of the entire population distribution. Lacking such knowledge, the simplest error rate to calculate is the resubstitution error, E_resub, which is obtained by testing the classifier with the same cases used to build the classifier. Unfortunately, the resubstitution error is optimistically biased, meaning that it is lower than the true error rate. Furthermore, the resubstitution error provides no information whatsoever on a classifier's predictive ability.

Estimating error rates

For most problems the best estimate of the true error rate is obtained empirically. Empirical error rate estimation involves testing a classifier on previously unseen sample data. The empirical error rate is the fraction of test cases that are misclassified:

Error rate = (Number of unseen test cases misclassified) / (Number of unseen test cases)    (1)

Holdout method. A straightforward method for determining the error rate is to partition the entire sample into two mutually exclusive sets. One set is used to design the classifier and the other is used to test it. This approach is known as the holdout method. To assure the integrity of the calculated error rate, the two sets should be selected randomly. Furthermore, the cases within these two sets should be independent, meaning that there is no relationship among them other than that they have been drawn from the same sample. Examining the patterns and distributing them in the training and test sets according to any non-random process is insupportable, for the two sets would no longer be independent. A classifier built from anything but a randomly drawn training set could not possibly generalize to the entire population, except in the extremely rare instance in which the distribution of the entire population was known a priori and the training set determined accordingly. A classifier built from such a training set would then be of dubious worth, as a simple and direct reference to the known population distribution would suffice.

The holdout method is computationally inexpensive and straightforward to implement: randomly divide the sample, build a single classifier, and then evaluate the classifier using the previously unseen test data. Unfortunately, this approach results in unreliable estimates of the true error rate except for very large samples, i.e., more than 1000 cases in the test set [8]. The holdout method usually results in either insufficient training or test cases, and generally results in a pessimistic estimate of the true error rate. Not all of the cases are used to design the classifier, and the method is subject to the idiosyncrasies of a single random train-and-test partition.
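A minimal Python sketch of the holdout procedure is given below; the majority-class "classifier" is a stand-in of our own invention (in practice it would be replaced by the method under evaluation), and it computes the empirical error rate of Equation 1 on the held-out cases.

    import random

    def holdout_error(cases, labels, train_fraction=0.7, seed=0):
        """Randomly partition the sample, "train" a stand-in majority-class
        classifier on one part, and return the empirical error rate
        (Equation 1) measured on the previously unseen part."""
        rng = random.Random(seed)
        indices = list(range(len(cases)))
        rng.shuffle(indices)                      # random, independent partition
        n_train = int(train_fraction * len(cases))
        train_idx, test_idx = indices[:n_train], indices[n_train:]

        train_labels = [labels[i] for i in train_idx]
        majority = max(set(train_labels), key=train_labels.count)

        errors = sum(1 for i in test_idx if labels[i] != majority)
        return errors / len(test_idx)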

Instead of relying on a single train-and-test experiment, multiple random train-and-test experiments can be performed. For each random train-and-test partition, a new classifier is constructed. The estimated error rate is the average of the error rates for the classifiers resulting from the independently and randomly generated partitions. Random resampling in this manner can produce better error rate estimates than a single train-and-test partition.

Leave-one-out. A special case of resampling is known as the leave-one-out method. Leave-one-out is an elegant and straightforward technique for estimating classifier error rates. For a given method and sample size n, a classifier is generated using n-1 cases and tested on the remaining case. This is repeated n times, once for each case in the sample. Each case is used as a test case, and each time nearly all the cases are used to design a classifier. The error rate is the number of errors on the single test cases divided by n. Evidence for the superiority of the leave-one-out approach is well documented [9]. The leave-one-out method is computationally expensive, but provides the best estimate of error rate. Since a principal goal of this research was the reliable performance comparison of a variety of classification methods, we applied the leave-one-out method of error estimation, despite its great computational cost.
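The leave-one-out procedure itself can be sketched in a few lines; build_classifier below is a hypothetical placeholder for any of the five methods discussed in this paper, taken to return a predict(case) function.

    def leave_one_out_error(cases, labels, build_classifier):
        """Leave-one-out estimate: n classifiers, each built on n-1 cases and
        tested on the single case held out; returns the fraction of errors."""
        n = len(cases)
        errors = 0
        for i in range(n):
            train_cases = cases[:i] + cases[i + 1:]
            train_labels = labels[:i] + labels[i + 1:]
            predict = build_classifier(train_cases, train_labels)
            if predict(cases[i]) != labels[i]:
                errors += 1
        return errors / n

For the 112-case database this means building 112 classifiers per method, which is the computational cost referred to above.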

Classification methods

Linear discriminant. Consider a collection of cases representing two broad classes C1 and C2. These classes might correspond, for example, to the presence/absence of SCC, as in the cases shown in Table 1. Each case is characterized by a vector of observable features and their recorded values: (x1, x2, ..., xn). If we consider the vector space defined by these features, then cases pertaining to the different classes will fall into different regions of this feature space. Regions where the same class decision prevails are known as decision regions. Separating surfaces, called decision surfaces, can formally be defined as linear surfaces in n dimensions which are used to separate the known cases into their respective classes and to classify unknown cases. Such decision surfaces are called hyperplanes, and are (n-1)-dimensional. When n = 2, the decision surface is a line,

$w_0 + w_1 x_1 + w_2 x_2 = 0$    (2)

where w0, w1, and w2 are constant coefficients. If we represent instances of class C1 as squares and instances of class C2 as circles, then this line is shown bisecting the feature space into separate decision regions in Figure 1.

When n = 3, the decision surface is a plane. When n = 4 or higher, the decision surface is a hyperplane represented by:

$w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + \ldots + w_n x_n = w_0 + \sum_{i=1}^{n} w_i x_i = 0$    (3)

Usually the classes actually overlap and cannot be completely separated by a decision surface described by Equation 3. Cases that are misclassified in this manner are the source of nonzero error rates for the linear discriminant.

The key to using the linear discriminant lies in finding the coefficients w_i in Equation 3. There are potentially unlimited numbers of coefficients that could be tried; the search for the most reasonable possibilities usually involves some assumptions about the data. The most widely used linear discriminant guarantees a reasonable separation between classes by assuming that the data are clustered around a mean, with the density decreasing with increasing variation from the mean. These assumptions may not in fact be valid, as the true distribution is usually unknown; nevertheless, the results for a linear discriminant on training cases often continue to hold on unseen cases. Furthermore, the linear discriminant is quick and readily available in standard statistical packages. These attractive qualities make the results of the linear discriminant a useful baseline from which to compare the performance of more complex classification methods.
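As one concrete, simplified reading of this approach (an illustration, not the particular statistical package used in this research), the sketch below fits a two-class discriminant of the form of Equation 3 by the classical Fisher construction, pooling the class covariances and placing the boundary midway between the projected class means.

    import numpy as np

    def fit_linear_discriminant(X, y):
        """Two-class linear discriminant; X is an (n_cases, n_features) array,
        y an array of 0/1 labels.  Returns (w, w0) such that a case x is
        assigned class 1 when w @ x + w0 > 0."""
        X0, X1 = X[y == 0], X[y == 1]
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        # Pooled within-class covariance: assumes the classes cluster around
        # their means with a roughly common spread.
        cov = (np.cov(X0, rowvar=False) * (len(X0) - 1) +
               np.cov(X1, rowvar=False) * (len(X1) - 1)) / (len(X) - 2)
        w = np.linalg.solve(cov, m1 - m0)         # normal of the separating hyperplane
        w0 = -w @ (m0 + m1) / 2.0                 # boundary midway between projected means
        return w, w0

    def classify(w, w0, x):
        return 1 if w @ x + w0 > 0 else 0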

Nearest neighbor. Nearest neighbor methods attempt to discriminate among cases by minimizing some kind of distance metric calculated using only the feature values as input. The class assignment of an unknown case is determined by comparison with the most similar cases, that is, the most proximate cases in the feature space. Unlike the linear discriminant, nearest neighbor methods assume no a priori knowledge of the underlying distribution of the sample cases and their classifications. Nearest neighbor methods can yield any shape of decision surface to distinguish among the classes, based only on the known cases and their topological relations within the feature space.

Nearest neighbor methods were first described in a U.S. Air Force technical report in the early 1950s [10]. The k-nearest neighbor rule was introduced and used to assign to an unclassified case the class most prominent among its k nearest neighbors. The single nearest neighbor rule assigns to the unclassified case the class value of its single nearest neighbor, and is the method most frequently used. If larger values of k are considered, a majority vote is taken to establish class assignment. For the binary two-class problem, such as the presence/absence of SCC, odd values of k are generally chosen so as to preclude ties.

It has been established that in the limit of an infinite set of sample cases, the error rate for the single nearest neighbor is no worse than twice the optimal error rate [11]. In this sense, it may be said that half of the available information in an infinite collection of classified cases is contained in the nearest neighbor. For this reason, the single nearest neighbor method has been found to perform very well when discriminating among large numbers of cases.

The use of nearest neighbor methods requires selecting a suitable metric, i.e., a means for measuring distances between cases. The distance between two cases A and B is typically calculated using the feature values A_i and B_i of the respective cases, and some variant of the following generalized metric:

$D(A,B) = \left( \sum_{i=1}^{n} w_i \left| A_i - B_i \right|^p \right)^{1/p}$    (4)

The metric is said to be Euclidean when p = 2. The value of p generally has little impact on the results; as a result, the familiar Euclidean metric is by far the most commonly used, and was the method applied in this research. Because Equation 4 involves a summation across features that may vary greatly in scale, the weights w_i are often used to normalize the feature values in order to neutralize the contribution to distance attributable to the choice of units.
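A single-nearest-neighbor classifier based on the weighted metric of Equation 4 can be written in a few lines; the range-based weights below are one plausible normalization choice and are our assumption, not a prescription from this paper.

    def weighted_distance(a, b, weights, p=2):
        """Generalized metric of Equation 4; p = 2 gives the Euclidean case."""
        return sum(w * abs(ai - bi) ** p
                   for w, ai, bi in zip(weights, a, b)) ** (1.0 / p)

    def range_weights(train_cases):
        """Weights that rescale each feature by its observed range, so that the
        choice of units does not dominate the distance."""
        columns = list(zip(*train_cases))
        return [1.0 / ((max(col) - min(col)) or 1.0) for col in columns]

    def nearest_neighbor_predict(train_cases, train_labels, x, weights):
        """Single nearest neighbor rule: label of the closest training case."""
        distances = [weighted_distance(c, x, weights) for c in train_cases]
        return train_labels[distances.index(min(distances))]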

In addition to their widespread use by statisticians, nearest neighbor methods have also been applied by psychologists investigating multidimensional scaling [12], and more recently by artificial intelligence researchers specializing in case-based reasoning (CBR). Nearest neighbor methods in fact lie at the heart of most CBR systems. A recent comprehensive survey of CBR systems and research, presented in a text written by a leading CBR researcher [13], indicated that the nearest neighbor approach is used almost exclusively to assess the similarity of a pair of cases.

Polynomial networks. Polynomial networks are a subclass of the general class of connectionist models, which also includes artificial neural networks. Elements of the architecture and dynamics of polynomial networks and artificial neural networks are very similar, but their origins and synthesis differ markedly. The origins of polynomial networks lie in function approximation, whereas artificial neural networks arose out of efforts to simulate neural dynamics. The architecture of an artificial neural network is generally specified in advance by the investigator, whereas that of a polynomial network is determined adaptively as part of the learning process.

The generation of polynomial networks from a set of input-output data vectors (x1, x2, ..., xd, y) proceeds as follows. First the regression equations

$y = A + B x_i + C x_j + D x_i^2 + E x_j^2 + F x_i x_j$    (5)

are computed for each combination of input variables x_i and x_j and the output y. These regression equations can be of any degree desired. For illustrative purposes we restrict ourselves in the following discussion to quadratic regression equations, i.e., degree 2. Such equations contain only the six terms shown on the right-hand side of Equation 5. The result of the regression computation is d(d-1)/2 higher-order variables for predicting the output y in place of the original d variables x1, x2, ..., xd (see Figure 2).

We select a subset, say d1, of these higher-order variables that best predict the output y. We now use each of the quadratic equations just computed to generate new independent observations (which replace the original observations of the variables x1, x2, ..., xd). We then combine these new independent variables exactly as before, i.e., we compute all of the regression equations of y versus these new variables (see Figure 3).

This will in turn give us a new collection of d1(d1-1)/2 regression equations for predicting y from the new variables, which in turn are estimates of y from the previous equations. We have in essence a collection of fourth-degree polynomials in four variables. We repeat the process until the regression equations begin to have poorer predictive power than did those of a previous generation. In other words, the model will start to become overspecialized. The best quadratic polynomial of the highest generation attained is selected. This optimal polynomial is an estimate of the original function y as a quadratic of two variables, which are themselves quadratics of two more variables, which are themselves quadratics of two more variables, ..., which are quadratics in the original input variables x1, x2, ..., xd. If we were to make the necessary algebraic substitutions, we would arrive at a very complex polynomial of the form:

$y = a_0 + \sum_{i=1}^{d} a_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} a_{ij} x_i x_j + \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} a_{ijk} x_i x_j x_k + \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} \sum_{l=1}^{d} a_{ijkl} x_i x_j x_k x_l + \ldots$    (6)

Having selected at each level the subsets of equations that best predict the output y, many of the coefficients in Equation 6 are in fact zero. This greatly eases the computational burden of synthesizing polynomial networks from all possible combinations of network inputs. The literature refers to Equation 6 as the Kolmogorov-Gabor polynomial [14] or, more recently, as the Ivakhnenko polynomial [15], in honor of A. G. Ivakhnenko, who developed the first algorithms for polynomial network synthesis based on this equation.
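One generation of the synthesis procedure described above might be sketched as follows; this is a simplified GMDH-style illustration using ordinary least squares for the quadratic elements of Equation 5, and the choice of mean squared error for ranking candidate variables is ours.

    import itertools
    import numpy as np

    def fit_quadratic(xi, xj, y):
        """Least-squares fit of Equation 5:
        y = A + B*xi + C*xj + D*xi**2 + E*xj**2 + F*xi*xj."""
        A = np.column_stack([np.ones_like(xi), xi, xj, xi**2, xj**2, xi * xj])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coef

    def evaluate_quadratic(coef, xi, xj):
        A = np.column_stack([np.ones_like(xi), xi, xj, xi**2, xj**2, xi * xj])
        return A @ coef

    def gmdh_generation(columns, y, keep):
        """Fit Equation 5 for every pair of current variables and keep the
        `keep` best-fitting outputs as the next generation's variables."""
        candidates = []
        for i, j in itertools.combinations(range(len(columns)), 2):
            coef = fit_quadratic(columns[i], columns[j], y)
            pred = evaluate_quadratic(coef, columns[i], columns[j])
            candidates.append((float(np.mean((pred - y) ** 2)), pred))
        candidates.sort(key=lambda c: c[0])
        return [pred for _, pred in candidates[:keep]]

In practice each generation would be scored on cases not used for the fits, and synthesis stops when that score starts to worsen, which is the overspecialization test described above.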

An example polynomial network from this research is illustrated in Figure 4. Note that Figure 4 incorporates linear and third-degree regression elements represented by the following shorthand notation:

Linear:  $w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$    (7)

Single:  $w_0 + w_1 x_1 + w_2 x_1^2 + w_3 x_1^3$    (8)

(9)

(10)

Decision trees. Decision trees are built by recursively partitioning the feature space into rectangular regions. Hence the decision tree is essentially a series of the simplest of linear discriminants, in which all of the partitions are parallel to the axes of the feature space. Figure 5 is an example of a binary decision tree, indicative of those investigated in this research.

A decision tree consists of nodes and branches. Each node represents a single test or decision. Depending on the outcome of a test at a given node, the tree will branch to another node. Class assignment is determined at terminal nodes, which are the right-most nodes in Figure 5. The impurity of a terminal node is the degree to which the class assignment is mixed; for example, the last terminal node in Figure 5, labeled "35 SCC", has zero impurity.

Decision trees are grown by testing the discriminative ability of every possible feature value in the database. At any given point in decision tree development, the partition selected is the one that best discriminates between the sample cases, i.e., the one that reduces the impurity the most. Nodes become terminal and are not split further when one of two conditions is met:

(1) All cases in the remaining sample belong to one class. For example, the node in Figure 5 labeled "35 SCC" represents 35 cases in which SCC occurred, so it is not split further.

(2) A stopping criterion has been met. Examples of stopping criteria include a minimum number of cases or a maximum depth of the decision tree. The node is then assigned to the class having the greatest frequency at that node. For example, the first terminal node in Figure 5 would be assigned the class "No SCC", as 35 of the 41 cases satisfying the conditions of the prior nodes were found to be examples where SCC did not occur.

Without a stopping criterion, the partitioning process continues until all nodes consist of samples belonging to a single class. This process inevitably results in trees that are overspecialized to the training data and do not generalize well. The optimal depth of a decision tree is found by varying the stopping criterion until the tree that performs best on previously unseen data is found. Alternatively, the tree may be developed fully and then "pruned back" by removing branches that do not significantly improve the discriminative ability of the tree. For example, if we pruned the tree in Figure 5 by eliminating the partition at Oxygen = 1.7 mg/L, then the resulting terminal node would consist of 19 cases where SCC occurred and 8 cases in which it did not. The slightly higher impurity of this node compared with the subsequent nodes shown in Figure 5 might be offset by the improved generalization ability of the simpler tree on unseen data. As with the stopping criterion, the only way to determine the optimal degree of pruning is by repeated testing of candidate trees using cases not used to build the tree.
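The growth procedure can be summarized in a short sketch; this toy version (our illustration, not the algorithm actually used to produce Figure 5) measures impurity as the fraction of minority-class cases at a node, tries every observed feature value as an axis-parallel threshold, and stops on pure nodes or a minimum node size.

    def impurity(labels):
        """Fraction of cases at a node not belonging to the majority class."""
        majority = max(set(labels), key=labels.count)
        return 1.0 - labels.count(majority) / len(labels)

    def grow_tree(cases, labels, min_cases=5):
        """Recursively split on the (feature, threshold) pair that most
        reduces the weighted impurity of the resulting child nodes."""
        majority = max(set(labels), key=labels.count)
        if impurity(labels) == 0.0 or len(cases) < min_cases:
            return {"class": majority, "n": len(cases)}
        best = None
        for f in range(len(cases[0])):
            for threshold in sorted({c[f] for c in cases}):
                left = [i for i, c in enumerate(cases) if c[f] <= threshold]
                right = [i for i, c in enumerate(cases) if c[f] > threshold]
                if not left or not right:
                    continue
                score = (len(left) * impurity([labels[i] for i in left]) +
                         len(right) * impurity([labels[i] for i in right])) / len(cases)
                if best is None or score < best[0]:
                    best = (score, f, threshold, left, right)
        if best is None:
            return {"class": majority, "n": len(cases)}
        _, f, threshold, left, right = best
        return {"feature": f, "threshold": threshold,
                "left": grow_tree([cases[i] for i in left],
                                  [labels[i] for i in left], min_cases),
                "right": grow_tree([cases[i] for i in right],
                                   [labels[i] for i in right], min_cases)}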

Fuzzy logic. Fuzzy logic is not a computer learning method per se, but rather a method for reasoning with uncertain and/or noisy data, such as that given in Table 1. The principal use of the expression fuzzy logic refers to reasoning with fuzzy sets or with sets of fuzzy rules [16]. A fuzzy set has elements that belong to it to different degrees, whose values are context-dependent. The set of "high chlorides" would be an example. Fuzzy set operations can be embodied in the production rules of an expert system. Fuzzy reasoning occurs when logical inferences are made based on fuzzy sets, rules, and operators. Fuzzy rules closely resemble the production rules of an expert system. However, the process of evaluating fuzzy rules differs considerably from that of conventional rules. In brief, the rules in a traditional expert system are evaluated in series, while those of a fuzzy system are evaluated in parallel.

When a fuzzy rule applies, or "fires", it fires to a degree determined by the belief level in each antecedent condition (statements following "IF") in the premise of the rule. The antecedents are evaluated using membership functions to produce belief levels, which are then combined using fuzzy operators to produce the final output activation level in a process known as defuzzification. There are many ways to effect inference and defuzzification using fuzzy rules. Probably the most common method of inference is to scale the output through a process known as Max-Dot inference; the most common method of defuzzification is the center of gravity, or centroid, method. These methods are best illustrated by example.

Consider the two-input/one-output system illustrated in Figure 6, and suppose that the fuzzy sets, or membership functions, shown in Figure 7 have been defined for each of the variables in this system. Suppose the current values of the inputs are Alpha = A* and Beta = B*, and these values intersect the membership functions as shown in Figure 8. We see that A* is considerably Low (membership value 0.8), but also partly Medium (membership value 0.2). Similarly, B* is mostly High (membership value 0.6), but also somewhat Medium (membership value 0.4). Suppose that the only rules available that are applicable to this particular combination of Alpha and Beta are:

Rule 1: IF (Alpha IS Low) AND (Beta IS High) THEN (Gamma IS Medium)
Rule 2: IF (Alpha IS Medium) AND (Beta IS High) THEN (Gamma IS High)

The premise of Rule 1 is a fuzzy set operation governed by the "AND" operator. This operator specifies fuzzy set intersection, which means taking the minimum value of the operands. The quantity A* has a membership of 0.8 in the Alpha fuzzy set Low, and the quantity B* has a value of 0.6 in the Beta fuzzy set High. The minimum of these two values is 0.6. This value is used to scale the height of the output fuzzy set Medium for the output variable Gamma.

In a similar way we determine the contribution of Rule 2. This rule specifies finding the minimum of 0.2 (the membership value of A* in the Alpha fuzzy set Medium) and 0.6 (the membership value of B* in the Beta fuzzy set High), which is 0.2, and using this value to scale the height of the output fuzzy set High for the output variable Gamma.

Assuming that Rule 1 and Rule 2 are the only known rules that apply, the final value of Gamma is determined by summing the contributions of each rule toward the output and calculating the centroid. This entire process is illustrated in Figure 9, which indicates that the final value assigned to Gamma is mostly Medium but also somewhat High.

The mathematical expression for the inference/defuzzification method illustrated herein is:

$\Omega^{*} = \frac{\sum_{i} P_i \, w_i \, M_i}{\sum_{i} P_i \, w_i \, A_i}$    (11)

where Ω* is the variable value at the centroid of the fuzzy set, P_i is the degree of membership computed for the premise of rule i, w_i is the weight assigned to rule i, M_i is the moment about zero of the membership function assigned to Ω in rule i, and A_i is the area of the membership function assigned to Ω in rule i.
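In code, the two-rule example and the centroid expression of Equation 11 could be realized as in the sketch below; the triangular output sets for Gamma are invented placeholders (the true shapes are those of Figure 7), and the rule weights are taken as 1.0.

    def triangle(left, peak, right):
        """Triangular fuzzy set; returns its full-height area and centroid,
        which is all the Max-Dot/centroid calculation needs."""
        return {"area": 0.5 * (right - left),
                "centroid": (left + peak + right) / 3.0}

    # Placeholder output sets for Gamma (illustrative values only).
    gamma_medium = triangle(2.0, 5.0, 8.0)
    gamma_high = triangle(5.0, 8.0, 11.0)

    def max_dot_centroid(firings):
        """firings: list of (premise degree P_i, weight w_i, output set).
        Implements Equation 11: scaled moments over scaled areas."""
        numerator = sum(p * w * s["area"] * s["centroid"] for p, w, s in firings)
        denominator = sum(p * w * s["area"] for p, w, s in firings)
        return numerator / denominator

    # Rule 1 fires at min(0.8, 0.6) = 0.6 toward Medium;
    # Rule 2 fires at min(0.2, 0.6) = 0.2 toward High.
    gamma = max_dot_centroid([(min(0.8, 0.6), 1.0, gamma_medium),
                              (min(0.2, 0.6), 1.0, gamma_high)])
    print(gamma)   # lands between the Medium and High peaks, closer to Medium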

The ability of a fuzzy system to accurately model a process depends entirely on its rules. The Fuzzy Approximation Theorem [17] states that in theory we can always find fuzzy rules to simulate any process or approximate any function to any desired degree of accuracy. In practice, however, it is often very difficult to find these rules. The following approach for finding a set of fuzzy rules from input-output data was developed as part of this research.

We start with a collection of input-output data, expressed in vector notation (I1, I2, ..., In, O). For each continuous variable we first determine its full range across the entire dataset. Extremely broad ranges are typically scaled by transforming the values with a logarithmic or some other non-decreasing function. We then divide this range into 2M segments of equal length, where M is any odd integer. We then assign M fuzzy membership functions across the range as follows. The middle membership function is triangular, centered about the midpoint, and spans four segments of the range, two on each side of the midpoint. The remaining membership functions are drawn to the left and right of center with 50% overlap, which leaves a "shoulder" in the membership functions governing the extreme values of the range.

The construction of membership functions in this manner for the continuous variables in Table 1 is illustrated in Figure 10, where we have chosen M = 5.
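A sketch of this construction for a single continuous variable follows; the representation of each function as a (left foot, peak, right foot) triple and the clamping of the outermost functions to the ends of the range (to mimic the "shoulders") are our own simplifications.

    def membership_functions(lo, hi, M=5):
        """Divide [lo, hi] into 2M equal segments and place M overlapping
        functions: a triangle centered on the midpoint spanning four segments,
        with neighboring peaks two segments apart (50% overlap) and the
        outermost functions clamped at the range ends."""
        seg = (hi - lo) / (2 * M)
        mid = (lo + hi) / 2.0
        half = (M - 1) // 2                  # functions on each side of the middle
        functions = []
        for k in range(-half, half + 1):
            peak = mid + 2 * k * seg
            left, right = peak - 2 * seg, peak + 2 * seg
            functions.append((max(left, lo), peak, min(right, hi)))
        return functions

    # Example: Temperature in Table 1 spans roughly 241-350 deg C (Figure 10, M = 5).
    for f in membership_functions(241.0, 350.0):
        print(f)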

Because fuzzy systems allow multiple rules to fire simultaneously, we can assign one rule to each case represented in the database. We determine the rules as follows. For each case, we consider each variable, determine which membership function applies to the greatest degree, and record the precise value of that degree; call this value degree_i. We then multiply the degrees of each of the winning membership functions for the continuous variables to find the composite degree of the rule. This composite degree becomes the weight w_i assigned to the rule, introduced in Equation 11. The rule takes the form:

IF I1 IS max(I1 membership function)
AND I2 IS max(I2 membership function)
AND ...
AND In IS max(In membership function)
THEN O IS max(O membership function)

$\text{weight} = \prod_{i=1}^{n+1} \text{degree}_i$    (12)

For the discrete variables, the number of membership functions assigned is determined by the number of discrete values of the variable in question. We choose membership function labels that correspond to these specific values, such as Annealed and Sensitized for the variable Metallurgical Condition, and True and False for the output SCC. Without additional information, discrete variables by their very nature imply crisp (non-fuzzy) membership functions, so they do not influence the weight assigned to the rule, which is a measure of the degree to which the specific combination of variable values applies.

Applying this scheme to the data in Table 1, a sample rule derived from the first case is as follows:

Rule 1: IF Temperature IS Very High (degree 1.0)
        AND Log(Chlorides) IS Medium (degree 0.55)
        AND Log(Oxygen) IS Low (degree 0.86)
        AND Metallurgical Condition IS Annealed
        THEN SCC IS False

weight = 0.47    (13)

The weight is determined by multiplying the degrees of the continuous variables. Each case results in a single rule; hence our fuzzy rulebase for this problem contains 112 rules.
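The rule-generation step can be made concrete with the following sketch (hypothetical code, with triangular sets represented as (left, peak, right) triples as in the earlier sketch); for each continuous input it picks the fuzzy set with the highest membership degree and multiplies the winning degrees into the rule weight of Equation 12.

    def degree(x, left, peak, right):
        """Membership degree of x in a triangular set (left, peak, right)."""
        if x <= left or x >= right:
            return 0.0
        return (x - left) / (peak - left) if x <= peak else (right - x) / (right - peak)

    def induce_rule(continuous_inputs, conclusion):
        """continuous_inputs: list of (variable name, value,
        [(set label, (left, peak, right)), ...]).  Returns one fuzzy rule with
        its weight, the product of the winning degrees (Equation 12)."""
        antecedents, weight = [], 1.0
        for name, value, fuzzy_sets in continuous_inputs:
            best_label, best_degree = max(
                ((label, degree(value, *tri)) for label, tri in fuzzy_sets),
                key=lambda pair: pair[1])
            antecedents.append((name, best_label, round(best_degree, 2)))
            weight *= best_degree
        return {"IF": antecedents, "THEN": conclusion, "weight": round(weight, 2)}

Discrete inputs and the output, having crisp membership (degree 1), are simply appended to the rule without affecting the weight, which reproduces the behavior of Rule 1 above (1.0 x 0.55 x 0.86 = 0.47).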

RESULTS

The performance results for all five methods investigated are summarized in Table 3. For each method the resubstitution error, E_resub, and the leave-one-out error, E_loo, were calculated. Given that there were 112 cases in the database, this meant that well over 500 classifiers were built and tested to arrive at the results shown in Table 3.

Note that the resubstitution error for the nearest neighbor method is always identically zero, as the nearest neighbor to a given case is the case itself. The optimal polynomial network and decision tree built using all available data are shown in Figures 4 and 5, respectively. The default error rates shown in Table 3 are derived from selecting the most frequently occurring output ("No SCC", in this case) without regard to the inputs.

DISCUSSION

The materials performance data examined in this research address a prototypical problem in real-world corrosion. The problem defined has only two classes and is characterized by uncertainty of classification. On the whole, the predictive ability of the features is fairly strong: with only one exception, the default leave-one-out error rate is more than twice that of all of the classifiers examined, which made considerable use of the input data.

The results for the linear discriminant provide the baseline from which to compare the performance of the more complex classifiers. The leave-one-out error rate of 0.205 indicates that 23 of the 112 cases are misclassified when tested against classifiers built without benefit of the test case. Slightly better performance was observed for the nearest neighbor and polynomial network methods, which are capable of implementing decision surfaces of any shape. Of the methods examined, by the measure of the leave-one-out error rate, the decision tree performed best. The highest leave-one-out error rate was observed for the fuzzy logic system. This result was not really surprising, because fuzzy systems are better suited to providing approximate answers to problems in which the inputs are vaguely specified; neither of these conditions applied for this problem.

There are other criteria besides error rate for judging the quality of a computer learning system. Intelligibility is often important, especially if the learning system is to be incorporated within an expert system, which is expected to explain its answers or at least provide some insight into the relationships between or among the output and inputs. The first three classifiers examined perform poorly in this respect. The outputs of the linear discriminant and the polynomial networks are multivariate equations, of potentially very high order in the latter case. To most people, multivariate equations are "black boxes": the only way to investigate the contributions of the individual inputs and assess relationships between them is to solve the equations repeatedly, holding some inputs constant at some arbitrary value and varying the others, a scenario that may have no basis in reality. The output of the nearest neighbor approach is a case or group of cases that is similar to the case in question. While this is more concrete than a multivariate equation, a similar case is simply the best available example from prior experience, and lends very little insight into the problem.

The decision tree and fuzzy logic approaches, by contrast, do provide insights that are accessible immediately upon inspection. For example, Figure 5 indicates that of the four inputs, oxygen is the most important in terms of its ability to predict the likelihood of SCC. Indeed, if we were to prune the tree all the way back to the first partition, made at Oxygen = 0.3 mg/L, so that we considered only oxygen as an input, the resulting decision tree would have a resubstitution error of only 0.170, which is better than that of the linear discriminant based on all available inputs. This result also reveals that additional information does not always improve performance, since a one-level decision tree is identical to a linear discriminant in one variable.

Figure 5 also shows that, next to oxygen content, chloride concentration is the most important predictor of SCC, and that neither the temperature nor the metallurgical condition is an important factor in terms of its ability to predict the likelihood of SCC, at least for these data. The screening out of inputs in this way is known as feature selection, a process whose importance is directly proportional to the dimensionality of the data and becomes paramount in high-dimensional problems. For example, suppose the cases were characterized by twelve inputs instead of four. Some of the features would undoubtedly be noisy (poorly predictive), or at the least redundant. Without feature selection, much more computing resource would be needed to develop 12-dimensional classifiers, and the performance results would almost certainly be worse than with the simpler classifiers built by reducing the dimensionality of the problem using feature selection. While feature selection is also normally a byproduct of the polynomial network development process, for this particular problem all available features were selected and integrated within the optimal network.

Just like the fuzzy rules that were derived from the data according to the method described above, a set of crisp rules can be induced directly from a decision tree and used as input to an expert system. For example, the decision tree in Figure 5 would yield the following five rules:

Rule 1: IF Oxygen ≤ 0.3 mg/L
        AND Chlorides ≤ 800 mg/L
        THEN SCC IS unlikely

Rule 2: IF Oxygen ≤ 0.3 mg/L
        AND Chlorides > 800 mg/L
        THEN SCC IS likely

Rule 3: IF Oxygen > 0.3 mg/L
        AND Chlorides ≤ 1.5 mg/L
        AND Oxygen ≤ 1.7 mg/L
        THEN SCC IS unlikely

Rule 4: IF Oxygen > 0.3 mg/L
        AND Chlorides ≤ 1.5 mg/L
        AND Oxygen > 1.7 mg/L
        THEN SCC IS likely

Rule 5: IF Oxygen > 0.3 mg/L
        AND Chlorides > 1.5 mg/L
        THEN SCC IS likely
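For completeness, these five crisp rules translate directly into executable form; the sketch below simply encodes the Figure 5 thresholds as a plain Python function, as one might do when handing the rules to a rule-based expert system.

    def scc_risk(oxygen_mg_per_l, chlorides_mg_per_l):
        """Crisp rules induced from the decision tree of Figure 5 (valid only
        for austenitic stainless steels in simulated BWR water within the
        conditions spanned by Table 1)."""
        if oxygen_mg_per_l <= 0.3:
            # Rules 1 and 2: the low-oxygen branch splits on chlorides at 800 mg/L.
            return "unlikely" if chlorides_mg_per_l <= 800 else "likely"
        if chlorides_mg_per_l <= 1.5:
            # Rules 3 and 4: higher oxygen, low chlorides; split again on oxygen at 1.7 mg/L.
            return "unlikely" if oxygen_mg_per_l <= 1.7 else "likely"
        # Rule 5: higher oxygen with chlorides above 1.5 mg/L.
        return "likely"

    print(scc_risk(0.1, 1.0))   # case 1 in Table 1 (no SCC observed) -> "unlikely"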

As with the fuzzy rules, we could add weight factors to these rules if we wished to reflect our confidence in their conclusions. For example, we would almost certainly be more confident in Rule 5, derived from a terminal node of zero impurity and 35 supporting cases, than we would be in Rule 2, derived from a terminal node in which the impurity was very high and which represented only nine cases. And like all rules, these rules would be valid only in a particular context, in this case for the risk assessment of SCC of austenitic stainless steels in simulated BWR water environments in the temperature range of 241-350 °C.

These results cannot necessarily be extrapolated to all problems. However, similar findings have been attained by researchers investigating the performance of these methods or related methods. Decision trees were found to perform better overall than statistical methods and artificial neural networks when applied to several real-world databases on a wide range of problems in botanical classification and medical diagnosis [18]. Similar findings were obtained in solving a complex two-class classification problem of predicting the distribution of tsetse flies in Africa, based on the environmental characterizations of the regions in which they occur [19]. And while no known comparison studies have included fuzzy logic among the methods investigated, a recent challenge to the fuzzy logic community to identify cases where fuzzy logic was used in a real-world expert system yielded only one example, and the system cited had not actually been completed as of 1994 [20]. The near-absence of successful expert systems that incorporate fuzzy logic suggests that problems of (crisp) heuristic classification, which are the sine qua non of expert systems, evidently do not lend themselves to the application of fuzzy logic. This suggestion is supported by the mediocre performance of fuzzy logic on the two-class classification problem described herein.

CONCLUSION

We have investigated five approaches to solving a two-class classification problem described by a database on the stress corrosion cracking of austenitic stainless steels in simulated BWR water environments. The approaches examined differ greatly in origin, assumptions about the data, computational requirements, and solution format. For the database examined, the decision tree method was best able to discriminate between the cases, followed by the nearest neighbor and polynomial network approaches, which tied for second in terms of error rate. By this measure an approach utilizing fuzzy logic performed relatively poorly, and was outperformed as a predictor of the likelihood of SCC by all of the other methods, including the computationally much simpler linear discriminant. However, the fuzzy rules induced from the data are probably more readily interpreted by humans than the output of all of the other approaches, with the possible exception of the decision tree. The output of the decision tree approach readily provides insights into the relationships among the data, and can be used to generate (crisp) rules for a rule-based expert system.

REFERENCES

1. Harbiye, A.D.A., Passivity and SCC of AISI 316 SS in Chloride-Containing Solutions, Ph.D. Dissertation, Technische Universiteit Delft, 1991.

2. Fisher, R.A., "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, Vol. 7, pp. 179-188, 1936.

3. Duda, R.O., and Hart, P.E., Pattern Classification and Scene Analysis. New York: Wiley, 1973.

4. McClelland, J., and Rumelhart, D., Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: The MIT Press, 1986.

5. Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J., Classification and Regression Trees. New York: Chapman & Hall, Inc., 1984.

6. McNeill, D., and Freiberger, P., Fuzzy Logic. New York: Simon and Schuster, 1993.

7. Gordon, B.M., "The Effect of Chloride and Oxygen on the Stress Corrosion Cracking of Stainless Steels: Review of Literature," Materials Performance, Vol. 19, No. 4, pp. 29-38, April 1980.

8. Highleyman, W.H., "The Design and Analysis of Pattern Recognition Experiments," Bell System Technical Journal, Vol. 41, pp. 723-744, 1962.

9. Lachenbruch, P., and Mickey, M., "Estimation of Error Rates in Discriminant Analysis," Technometrics, Vol. 10, pp. 1-11, 1968.

10. Fix, E., and Hodges, J.L., Jr., "Discriminatory Analysis, Non-Parametric Discrimination," USAF School of Aviation Medicine, Randolph Field, Texas, Project 21-49-004, Rept. 4, Contract AF41(128)-31, February 1951.

11. Cover, T.M., and Hart, P.E., "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, Vol. IT-13, No. 1, pp. 21-27, January 1967.

12. Shepard, R.N., "Multidimensional Scaling, Tree-Fitting, and Clustering," Science, Vol. 210, No. 24, pp. 390-398, October 1980.

13. Kolodner, J., Case-Based Reasoning. San Mateo, CA: Morgan Kaufmann Publishers, Inc., 1993.

14. Ivakhnenko, A.G., "Polynomial Theory of Complex Systems," IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-1, No. 4, pp. 364-378, October 1971.

15. Hecht-Nielsen, R., Neurocomputing. Reading, MA: Addison-Wesley Publishing Company, Inc., 1990.

16. Zadeh, L.A., "Fuzzy Sets," Information and Control, Vol. 8, pp. 338-353, 1965.

17. Kosko, B., "Fuzzy Systems as Universal Approximators," Proceedings of the 1992 IEEE Conference on Fuzzy Systems, pp. 1153-1162, March 1992.

18. Weiss, S.M., and Kapouleas, I., "An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods," Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Vol. 1, pp. 781-787. San Mateo, CA: Morgan Kaufmann Publishers, Inc., 1989.

19. Ripley, B.D., "Statistical Aspects of Neural Networks," in Barndorff-Nielsen, O.E., Jensen, J.L., and Kendall, W.S. (eds.), Networks and Chaos - Statistical and Probabilistic Aspects, pp. 40-123. London: Chapman and Hall, 1993.

20. Elkan, C., "The Paradoxical Success of Fuzzy Logic," IEEE Expert, Vol. 9, No. 4, pp. 3-8, August 1994.

Table 1 - Data investigated

Case  Temp (°C)  Chlorides (mg/L)  Oxygen (mg/L)  Condition   SCC  |  Case  Temp (°C)  Chlorides (mg/L)  Oxygen (mg/L)  Condition   SCC
  1     350          1              0.1           Annealed     N   |   57    260        100              0.1           Annealed     N
 2-16    …           …              …             …            …   |  58-72   …          …               …             …            …
 17     350        750              0.6           Annealed     Y   |   73     …          …               …             …            …
 18     350        750              0.6           Sensitized   Y   |   74     …          …               …             …            …
 19     350        750              0.1           Annealed     N   |   75     …          …               …             …            …
 20     350        750              0.1           Sensitized   N   |   76     …          …               …             …            …
 21     260          6              8             Annealed     Y   |   77    286          1              0.3           Annealed     N
 22     260         30              8             Annealed     Y   |   78    286          1              0.3           Sensitized   N
 23     300        100              1             Annealed     Y   |   79    286         10              0.3           Annealed     N
 24     260         83             30             Annealed     Y   |   80    286         10              0.3           Sensitized   N
 25     330          0.1           40             Annealed     Y   |   81    286        100              0.3           Annealed     N
 26     330          0.1            3             Annealed     Y   |   82    286        100              0.3           Sensitized   N
 27     330          0.1            0.4           Annealed     N   |   83    286          0.001         38             Annealed     N
 28     330          0.1            0.15          Annealed     N   |   84    286          1             38             Annealed     N
 29     330         10              1             Annealed     Y   |   85    286         10             38             Annealed     Y
 30     330         10              0.4           Annealed     Y   |   86    286        100             38             Annealed     Y
 31     330       1000              0.15          Annealed     N   |   87    286          0.001         38             Sensitized   N
 32     300          4.5            0.4           Annealed     Y   |   88    286          0.1           38             Sensitized   Y
 33     300          1             35             Annealed     Y   |   89    286          0.2           38             Sensitized   Y
 34     288          0.5            0.2           Annealed     N   |   90     …          …               …             …            …
 35     282          0.5            0.2           Sensitized   N   |   91     …          …               …             …            …
 36     288          0.5            1.5           Annealed     N   |   92     …          …               …             …            …
 37     282          0.5            1.7           Sensitized   N   |   93    286          0.001         38             Sensitized   Y
 38     282          0.01           1.7           Sensitized   N   |   94    286          0.1           38             Sensitized   Y
 39     260        600              8             Annealed     Y   |   95    286          0.5           38             Sensitized   Y
 40     260        600            158             Annealed     Y   |   96    286          1             38             Sensitized   Y
 41     260         30            158             Annealed     Y   |   97    286         10             38             Sensitized   Y
 42     280          1.5            1.2           Annealed     N   |   98    278      20000              0.001         Annealed     N
 43     288          0.02           0.2           Sensitized   Y   |   99    252        500              0.5           Annealed     Y
 44     288          0.1          100             Sensitized   Y   |  100    300         20            200             Sensitized   Y
 45     288          0.1           36             Sensitized   Y   |  101    300        200            100             Annealed     Y
 46     288          0.1            8             Sensitized   Y   |  102    300         10              0.01          Annealed     N
 47     260        553            220             Annealed     Y   |  103    300        100              0.01          Annealed     N
 48     260        553             42             Annealed     Y   |  104    300       1000              0.01          Annealed     N
 49     260        553             28             Annealed     Y   |  105    300         10              0.01          Sensitized   …
 50     241        553              0.01          Annealed     N   |  106    300        100              0.01          Sensitized   …
 51     260        553              0.1           Annealed     N   |  107    300       1000              0.01          Sensitized   N
 52     241        553              0.09          Annealed     N   |  108    274          0.02           7             Sensitized   Y
 53     241        553              0.01          Annealed     N   |  109    274          0.02           0.4           Sensitized   Y
 54     241        553              0.02          Annealed     N   |  110    274          0.02           0.08          Sensitized   Y
 55     260        350             42             Annealed     Y   |  111    274          0.02           0.04          Sensitized   N
 56     260        100             42             Annealed     Y   |  112    288          0.01           6             Sensitized   Y

Table 2 - Statistical summary

            Temperature  Chlorides  Oxygen   Annealed   SCC
               (°C)       (mg/L)    (mg/L)    (T/F)    (T/F)
Maximum         350        20000     1200
Minimum         241        0.001     0.001
Average         292          479       41
Median          286           10      0.6
Std. Dev.        27         2118      162
Mode            300          100       38
# True                                          74       65
# False                                         38       47

Table 3 - Results

Method                  E_resub   E_loo
Linear discriminant      0.188    0.205
Nearest neighbor         0.000      …
Polynomial network         …        …
Decision tree              …        …
Fuzzy logic              0.143    0.232
Default                  0.420    0.420

Figure 1 - Linear discriminant in 2 dimensions

Figure 2 - Polynomial network synthesis

Figure 3 - Propagation of variables

Figure 4 - Polynomial network

Root: Oxygen ≤ 0.3 mg/L?
  Yes: Chlorides ≤ 800 mg/L?
    Yes: terminal node: 35 No SCC, 6 SCC
    No:  terminal node: 5 SCC, 4 No SCC
  No (Oxygen > 0.3 mg/L): Chlorides ≤ 1.5 mg/L?
    Yes: Oxygen ≤ 1.7 mg/L?
      Yes: terminal node: 5 No SCC, 1 SCC
      No:  terminal node: 18 SCC, 3 No SCC
    No:  terminal node: 35 SCC

Figure 5 - Binary decision tree

Figure 6 - Simple fuzzy system

Figure 7 - Membership functions for Alpha, Beta, and Gamma

Figure 9 - Scaled membership functions, summed result, centroid defuzzification

Figure 10 - Membership functions determined by data in Table 1 (five functions, Very Low through Very High, over Temperature (°C), Log Chlorides (mg/L), and Log Oxygen (mg/L))