
Advanced Review

Algorithm quasi-optimal (AQ) learning

Guido Cervone,1* Pasquale Franzese2 and Allen P. K. Keesee3

The algorithm quasi-optimal (AQ) is a powerful machine learning methodology aimed at learning symbolic decision rules from a set of examples and counterexamples. It was first proposed in the late 1960s to solve the Boolean function satisfiability problem and further refined over the following decade to solve the general covering problem. In its newest implementations, it is a powerful but as yet little explored methodology for symbolic machine learning classification. It has been applied to solve several problems from different domains, including the generation of individuals within an evolutionary computation framework. The current article introduces the main concepts of the AQ methodology and describes AQ for source detection (AQ4SD), a tailored implementation of the AQ methodology to solve the problem of finding the sources of atmospheric releases using distributed sensor measurements. The AQ4SD program is tested to find the sources of all the releases of the Prairie Grass field experiment. © 2010 John Wiley & Sons, Inc. WIREs Comp Stat 2010 2 218–236 DOI: 10.1002/wics.78

Keywords: AQ learning; machine learning classification; evolutionary computation; source detection; atmospheric emissions

The algorithm quasi-optimal (AQ) learning methodology traces its origin to the Aq algorithm for solving general covering problems of high complexity.1,2 An implementation of the AQ algorithm in combination with the variable-valued logic representation produced the first AQ learning program, AQVAL/1, which pioneered research on general-purpose inductive learning systems.3 An early application of AQ to soybean disease diagnosis was considered one of the first significant achievements of machine learning.4

Subsequent implementations, developed over the span of several decades, added many new features to the original system, and produced a highly versatile learning methodology able to tackle complex and diverse learning problems. In the current implementation, AQ is a multipurpose machine learning methodology that generates rulesets in attributional calculus. Because of a wide range of features and a highly expressive representation language,

*Correspondence to: [email protected]
1Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA 22030, USA
2Center for Earth Observing and Space Research, George Mason University, Fairfax, VA 22030, USA
3Department of Statistics, George Mason University, Fairfax, VA 22030, USA

DOI: 10.1002/wics.78

recent members of the AQ family of programs are among the most advanced symbolic learning systems.

The rapid development of computer technology and high-level programming languages throughout the 1980s and 1990s prompted researchers to port the original Lisp version of the AQ methodology to new programming environments. These developments, however, were performed in an academic environment and primarily for educational purposes, and they often lacked the stability, reliability, and ease of use associated with other commercial or more popular classification programs, such as classification and regression trees (CART)5,6 and C4.5.7

As a result, despite such continuous development, use of AQ programs has been limited, especially outside the main developing group. In addition, limited AQ usage may to some extent be the result of the complexity associated with running the various different AQ implementations (i.e., variations based on different parameter settings available within the basic AQ framework). That is, changing modes, tolerance levels, thresholds, and other parameters that can be adjusted in AQ to address different kinds of data, or to set up AQ to generate different types of output, is an exercise in fine tuning; although correct parameter setting is not overly difficult to learn, it does take some effort and even a modicum of trial and error with specific datasets.

218 © 2010 John Wiley & Sons, Inc. Volume 2, March/April 2010


WIREs Computational Statistics Algorithm quasi-optimal learning

This article describes a complete rewrite of the AQ algorithm, specifically tailored for the problem of detecting the source of an atmospheric pollutant release from limited ground sensor measurements. Before undertaking the task of developing an entirely new AQ program, we analyzed the relevant available existing AQ implementations, such as AQ158 and AQ18.9 Our analysis indicated that the above implementations were optimized for speed, rather than for the extensibility or comprehensibility of the code. Additionally, many of the features included in the existing implementations were not needed in the context of atmospheric source detection, whereas some crucial features were missing. Therefore, the specialized AQ for source detection (AQ4SD) represents a fresh start for the AQ approach. Although AQ4SD can be used as a general machine learning classifier, it was specifically designed to work with large noisy datasets containing a limited number of primarily real-valued attributes but up to hundreds of thousands of cases. The main application of AQ4SD is to generate new candidate solutions in a non-Darwinian evolutionary computation process. It was optimized to run iteratively and includes a new mechanism for incremental learning to refine previously learned patterns or rules using only new events, rather than starting the learning process from scratch each time. Using knowledge acquired from the analysis of the previous codes, the new design aims at making the new implementation reliable, easy to use, and easy to modify and extend, while retaining relevant features previously implemented in AQ rule learning systems.

The article is structured as follows: the first section gives an introduction to the AQ methodology, discusses the implementation of AQ4SD, and discusses its strengths and disadvantages; Section Advantages and Disadvantages of the AQ Methodology presents advantages and disadvantages associated with the AQ methodology compared with other methods; Section Evolutionary Computation Guided by AQ discusses the use of AQ as the main engine of an evolutionary computation process; Section Source Detection of Atmospheric Releases discusses the problem of source detection of atmospheric releases and presents a technique based on AQ learning to identify the source of unknown releases; and Section Results presents the results from the application of AQ to identify the sources of the real-world Prairie Grass experiment.49 Finally, Section Discussion summarizes the main contributions of the article and the results of the experiments.

AQ METHODOLOGY

Overview

This section reviews the main features of AQ-type learning. A detailed description of various aspects of the methodology can be found in Refs 1–3 and 9–12. AQ pioneered the sequential covering (a.k.a. ‘separate and conquer’) approach to concept learning. It is based on an algorithm for determining quasi-optimal (optimal or suboptimal) solutions to general covering problems of high complexity.

AQ is a machine learning classifier that generalizes sets of examples with respect to one or more sets of counterexamples. The input data for AQ is therefore made of labeled data, or in other words data that is already assigned to a particular class or group. Unlike clustering, a form of unsupervised learning whose goal is dividing unlabeled data into distinct classes, AQ is a form of supervised learning, wherein classified data are generalized to identify the characteristics of the entire class.

In its simplest form, given two sets of multivariate descriptions, or events, P1, . . . , Pn and N1, . . . , Nm, AQ finds rules that cover all P examples (a.k.a. positive events) and do not cover any of the N examples (a.k.a. negative events). More generally, each multivariate description is a classified event of type x1, . . . , xk, c, where each x is an attribute value and c is the class it belongs to. For each class c, AQ considers as positive all the events that belong to class c, and as negative all the events belonging to the other classes.
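The per-class positive/negative split just described can be sketched in a few lines; the helper name and event encoding below are illustrative, not part of AQ4SD:

```python
# Sketch: events are tuples (x1, ..., xk, c); for a chosen class c, AQ treats
# events labeled c as positive and all remaining events as negative.

def split_events(events, target_class):
    """Partition labeled events into positives and negatives for one class."""
    positives = [e for e in events if e[-1] == target_class]
    negatives = [e for e in events if e[-1] != target_class]
    return positives, negatives

events = [
    (1.0, 2.0, "A"),
    (1.5, 2.2, "A"),
    (9.0, 0.5, "B"),
    (8.5, 0.7, "C"),
]
P, N = split_events(events, "A")   # events of class A vs. everything else
```

Learning rules for every class simply repeats this split once per class label.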

The algorithm learns from examples (positives) and counterexamples (negatives) patterns (a.k.a. rules) of attribute values that discriminate the characteristics of the positive events with respect to the negative events. Such patterns are generalizations of the individual positive events and, depending on AQ's mode of operation, may vary from being totally complete (covering all positives) and consistent (not covering any of the negatives) to accepting a tradeoff of coverage to gain simplicity of patterns.

The AQ learning process can proceed in one of two modes: (1) the theory formation (TF) mode and (2) the pattern discovery (PD) mode. The PD mode was introduced in AQ18 and was not part of the original methodology. In the TF mode, AQ learns rules that are complete and consistent with regard to the data. In other words, the learned rules cover all the positive examples and do not cover any of the negative examples. This mode is mainly used when the training data can be assumed to contain no errors. The PD mode is used for determining strong patterns in the data. Such patterns may be partially inconsistent or incomplete with respect to the training data. The PD


Advanced Review wires.wiley.com/compstats

mode is particularly useful for mining very large and noisy datasets.

The core of the AQ algorithm is the so-called star generation, a process that can be done in two different ways, depending on the mode of operation (TF or PD). In TF mode, the star generation proceeds by selecting a random positive example (called a seed) and then generalizing it in various ways to create a set of consistent generalizations (that cover the positive example and do not cover any of the negative examples). In PD mode, rules are generated similarly, but the program seeks strong patterns (that may be partially inconsistent) rather than fully consistent rules. This star generation process is repeated until all the positive events are covered. Additionally, when run in PD mode, the generated rules go through an optimization process which aims at generalizing and/or specializing the learned descriptions to simplify the patterns.

AQ4SD Features and Implementation

AQ4SD is, as noted above, a total rewrite of the AQ algorithm, specifically optimized to solve the problem of source detection of atmospheric releases (see Section Source Detection of Atmospheric Releases). It shares many parts with the earlier version AQ20,11 but includes new features and optimization algorithms for the source detection problem. The development of AQ20 was led by the first author in close collaboration with many faculty and student members of the Computer Science Department and Machine Learning Laboratory at George Mason University.

AQ4SD is written in C++ making extensive use of the Standard Template Library (STL)13 and generic design patterns.14 The entire code comprises about 250,000 lines. The goal of the AQ4SD algorithm was to be suitable as the main engine of evolution in a non-Darwinian evolutionary process (see Section Evolutionary Computation Guided by AQ) to find the sources of atmospheric releases, using sensor concentration measurements and forward atmospheric transport and dispersion numerical models.

AQ4SD was thus optimized to be used iteratively because evolutionary computation is based on iterative processes. It is tailored primarily toward real-valued (continuous) attributes, and it uses a novel method that does not discretize real-valued attributes into ordinal attributes during preprocessing. It is also optimized to work with noisy data, as sensor concentration measurements often contain errors and missing values. Finally, as sensors are usually very limited in number but record very long time series,

AQ4SD is optimized to run with a very large number of events with a small number of attributes. Experiments, for example, were performed with up to 1,000,000 training events, each comprising 20 real-valued attributes.

Although a formal analysis of the complexity of the AQ algorithm is beyond the scope of this article, experimental runs showed that AQ complexity is polynomial. In particular, it is a low-order polynomial in the number of positive events and a higher-order polynomial in the number of negative events. The lower complexity increase associated with an increase in positive events is due to the fact that, in optimization during learning, only uncovered positive events are used to evaluate rules (whereas all negatives, or a sample of all the negatives, are used to evaluate the rules).

The following sections describe the algorithms and data types used and implemented in AQ4SD. Because AQ4SD is an implementation of a general methodology, when AQ4SD is specified in the text, it refers to specific features or implementation details of AQ4SD itself, whereas when AQ is specified, it refers to concepts and theories that apply to the general AQ methodology.

AQ Events

The AQ input data consists of a sequence of events. An event is a vector of values, where each value corresponds to a measurement associated with a particular attribute. An event can be seen as a row in a database, with each value an observation of a particular attribute, where columns are the different attributes.

AQ events are a form of labeled data, meaning that they are or can be classified into one of two or more classes. Therefore, each event contains a special attribute class, which identifies which class it belongs to. A sequence of events belonging to the same class is called an eventset.

Additionally, two different types of events can be used by AQ: training and testing. Training events are used by AQ to learn rules. Testing events are used to compute the statistical correctness of the learned rules on events not used during learning.
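The event/eventset structures described above can be sketched as follows; the class names and the train/test split shown are illustrative assumptions, not AQ4SD's internal layout:

```python
# Sketch: an event is a value vector plus the special class attribute;
# an eventset groups the events of one class; training and testing events
# are kept separate so rules are scored on data not used for learning.
from dataclasses import dataclass

@dataclass
class Event:
    values: tuple   # one value per attribute (a "row" in the database view)
    label: str      # the special class attribute

def eventset(events, label):
    """All events belonging to one class."""
    return [e for e in events if e.label == label]

data = [
    Event((10.2, "N"), "cluster1"),
    Event((3.1, "E"), "cluster2"),
    Event((11.0, "NE"), "cluster1"),
]
train_events, test_events = data[:2], data[2:]   # illustrative split
```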

AQ Rules

AQ uses a highly descriptive representation language to represent the learned knowledge. In particular, it uses rules to describe patterns in the data. A prototypical AQ rule is defined in Eq. (1):

Consequent ← Premise ⌊ Exception (1)


where consequent, premise, and exception are conjunctions of conditions. While the premise and consequent are mandatory, the exception is optional and used only in very special circumstances. Although the exception has been implemented in AQ4SD, it is not being used, because it often leads to overfitting in the presence of very noisy data. A condition is simply a relation between an attribute and a set of values it can take:

[Attribute Relation Value(s)] (2)

Depending on the attribute type, different relations may exist. For example, for unordered categorical attributes, the relations < or > cannot be used as they are undefined. A complete set of the relations allowed with each attribute type is given in Section Attribute Types. Typically, the consequent consists of a single condition, whereas the premise consists of a conjunction of several conditions. Equation (3) shows a sample rule relating a particular cluster to a set of input parameters. The annotations p and n indicate the number of positive and negative events covered by this rule.

[Cluster = 1] ← [WindDir = N . . . E] [WindSpeed > 10 m/s] [Temp > 22°C] : p = 11, n = 3 (3)

This type of rule is usually called attributional, to distinguish it from more traditional rules that use a simpler representation language. The main difference from traditional rules is that the referee (attribute), relation, and reference may include internal disjunctions of attribute values, ranges of values, internal conjunctions of attributes, and other constructs. Such a rich representation language means that very complex concepts can be represented using a compact description. However, attributional rules have the disadvantage of being more prone to overfitting with noisy data.
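A rule like Eq. (3) can be checked against an event by testing each condition in turn; the condition encoding (attribute, relation, reference) and function names below are illustrative assumptions:

```python
# Sketch: a condition is an [Attribute Relation Value(s)] triple; the premise
# of an attributional rule is a conjunction of such conditions, and the
# reference may be an internal disjunction of values (a set).

def holds(condition, event):
    attr, rel, ref = condition
    v = event[attr]
    if rel == "=":                 # ref may be a set (internal disjunction)
        return v in ref if isinstance(ref, set) else v == ref
    if rel == ">":
        return v > ref
    if rel == "<=":
        return v <= ref
    raise ValueError(f"unsupported relation: {rel}")

def rule_covers(premise, event):
    """The premise is a conjunction: every condition must hold."""
    return all(holds(c, event) for c in premise)

# Rule (3): [Cluster = 1] <- [WindDir = N..E][WindSpeed > 10][Temp > 22]
premise = [("WindDir", "=", {"N", "NE", "E"}),
           ("WindSpeed", ">", 10),
           ("Temp", ">", 22)]
event = {"WindDir": "NE", "WindSpeed": 12.0, "Temp": 25.0}
```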

Multiple rules are learned for each cluster, and are called a ruleset. A ruleset for a specific consequent is also called a cover. A ruleset is a disjunction of rules, meaning that even if only one rule is satisfied, then the consequent is true. Multiple rules can be satisfied at one time because the learned rules could intersect each other. Equation (4) shows a sample ruleset:

[Cluster = 1] ← [WindDir = N . . . E] [WindSpeed > 10 m/s] [Temp > 22°C] : p = 11, n = 3
             ← [WindDir = E] [Date = July] : p = 5, n = 0
             ← [Pressure > 1010] [Date = Sep] : p = 1, n = 0 (4)

Each rule has a different statistical value. Assuming 13 positive events associated with cluster 1, the first rule in Eq. (4) covers not only most positive events in the cluster (11 of the 13 events) but also three negative events. This means that AQ was run in PD mode, allowing inconsistencies to gain simpler rules. The second rule covers less than 50% of the events and the third covers only one, but both without covering any elements in other clusters. Therefore, there is a tradeoff between completeness, namely the number of events covered out of all the events in the cluster, and consistency, namely the coverage of events from other clusters.
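The p/n bookkeeping and the disjunctive behavior of a ruleset can be sketched as follows; rules are simplified to predicates here, and all names are illustrative:

```python
# Sketch: annotate a rule with the positives (p) and negatives (n) it covers,
# and evaluate a ruleset as a disjunction (the consequent is true if any rule
# in the cover fires).

def covers(premise, event):
    return all(pred(event[attr]) for attr, pred in premise)

def annotate(premise, positives, negatives):
    """Return (p, n): positive and negative events covered by the rule."""
    p = sum(covers(premise, e) for e in positives)
    n = sum(covers(premise, e) for e in negatives)
    return p, n

def ruleset_covers(ruleset, event):
    return any(covers(rule, event) for rule in ruleset)   # disjunction

positives = [{"x": v} for v in (1, 2, 3, 8)]
negatives = [{"x": v} for v in (9, 10)]
rule = [("x", lambda v: v <= 3)]     # complete on 3 of 4 positives, consistent
p, n = annotate(rule, positives, negatives)
```

The tradeoff described in the text is visible here: broadening the rule (e.g. `v <= 8`) would raise p at the risk of also raising n.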

Attribute Types

AQ4SD allows for four different types of attributes: nominal, linear, integer, and continuous. Each attribute type is associated with specific relations that can be used in rule conditions.

Nominal: Unordered categorical attribute for which a distance metric cannot be defined. Nominal attributes do not naturally or necessarily fall into any particular order or rank, such as colors, blood types, or city names. The domain of nominal attributes is thus that of unordered sets. The following relations are allowed in rule conditions: equal (=) and not equal (≠).

Linear: Ordinal categorical attribute that is rankable, but not capable of being arithmetically operated upon. Examples of linear attributes are small, medium, large, or good, better, best. Such attributes can be sorted and ranked but cannot be multiplied or subtracted from one another. The following relations are allowed in rule conditions: equal (=), not equal (≠), lesser (<), greater (>), lesser or equal (≤), and greater or equal (≥).

Integer: Ordinal integer-valued attribute without a prefixed discretization and without decimal values. Integer attributes allow only whole numbers, such as 20 or −77. The following relations are allowed in rule conditions: equal


(=), not equal (≠), lesser (<), greater (>), lesser or equal (≤), and greater or equal (≥).

Continuous: Ordinal real-valued attribute without a prefixed discretization but which contains a decimal point and a fractional portion. The following relations are allowed in rule conditions: equal (=), not equal (≠), lesser (<), greater (>), lesser or equal (≤), and greater or equal (≥). Previous versions of AQ dealt with continuous variables by discretizing them into a number of discrete units and then treating them as linear attributes. AQ4SD does not require such discretization, as it automatically determines ranges of continuous values for each variable occurrence in a rule during the star generation process.
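The four attribute types and the relations each admits, as listed above, can be expressed as a small lookup table (a sketch, not AQ4SD's internals; ASCII `!=`, `<=`, `>=` stand in for ≠, ≤, ≥):

```python
# Sketch: which relations are legal in a rule condition for each AQ4SD
# attribute type. Nominal attributes admit only (in)equality because no
# order or distance metric is defined on their domain.

ORDERED = {"=", "!=", "<", ">", "<=", ">="}

ALLOWED_RELATIONS = {
    "nominal":    {"=", "!="},   # unordered sets: colors, blood types, cities
    "linear":     ORDERED,       # rankable categories: small, medium, large
    "integer":    ORDERED,       # whole numbers, no prefixed discretization
    "continuous": ORDERED,       # real-valued, ranges found during learning
}

def relation_allowed(attr_type, relation):
    return relation in ALLOWED_RELATIONS[attr_type]
```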

AQ Algorithm

The AQ learning process can be divided into four different parts: data preparation, rule learning, postprocessing, and optional testing. The following sections address each part individually.

The input data is made of a definition of the attributes (variables), AQ control parameters for each of the four parts mentioned above, and the raw events. The output of AQ consists of the learned rules, which can be displayed in textual or graphical form. Different versions of AQ used different ways to define the format of the input and output. Because the different methods do not affect learning, they are not discussed in this article. AQ4SD uses the input/output format described in Ref 12.

Data Preparation

The AQ learning process starts with data being read from a file (when used as a stand-alone classifier) or from memory (when embedded in a larger system). The data is processed by the data preparation mechanism, which checks the data format for correctness, corrects or removes ambiguities, selects the relevant attributes (a.k.a. feature selection), and applies rules for incremental learning.

Some versions of AQ can also, automatically or through user input, generate new attributes to change the data representation. This feature, called constructive induction, was first implemented in a specialized version of AQ1715,16 and is not implemented in AQ4SD.

Resolving Ambiguities

An ambiguity is an event that belongs to two or more classes. For the purpose of learning, each event must

be unique and belong to a single class. AQ has four different strategies to resolve ambiguities:

Positives: The ambiguous event is kept in the positive class (the class rules are being learned from) and eliminated from all the other classes.

Negatives: The ambiguous event is eliminated from the positive class.

Eliminate: The ambiguous event is eliminated and not used for learning.

Majority: The ambiguous event is associated with the class in which it appears most often.
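The four strategies can be sketched for a single ambiguous event observed under several labels; the function name and label encoding are illustrative assumptions:

```python
# Sketch: given the list of labels an ambiguous event was observed under,
# return the labels it should keep, per AQ's four resolution strategies.
from collections import Counter

def resolve(labels, strategy, positive_class):
    if strategy == "positives":      # keep in positive class, drop elsewhere
        return [positive_class] if positive_class in labels else sorted(set(labels))
    if strategy == "negatives":      # eliminate from the positive class only
        return sorted(l for l in set(labels) if l != positive_class)
    if strategy == "eliminate":      # drop the event from learning entirely
        return []
    if strategy == "majority":       # keep only the most frequent class
        return [Counter(labels).most_common(1)[0][0]]
    raise ValueError(strategy)

labels = ["A", "B", "A"]   # same event seen twice as A, once as B
```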

Attribute Selection

In general, AQ learns rules to discriminate between classes using only the smallest number of attributes.a

Therefore, AQ performs an automatic attribute selection during the learning phase, selecting the most relevant attributes and disregarding those apparently irrelevant. Unfortunately, especially for large noisy problems, irrelevant attributes can lead to the generation of incorrect rules. To avoid this problem, AQ can be set to create statistics for each of the attribute values, namely a measure of how many positive and negative examples, respectively, each attribute value covers. AQ can then try to keep only those attributes that seem to have more discriminatory information between classes.

This is only a rough approximation, as individual attributes might have little discriminatory information when considered singly but can help the generation of excellent rules in combination with others. Creating such statistics is a quick linear operation that requires a one-time analysis of the entire data or of a statistical sample of the data. Because such statistics are also used by the learning and optimization algorithms, there is no significant computational overhead introduced by this attribute selection method.
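The one-pass statistics described above amount to counting, for each (attribute, value) pair, how many positive and negative events carry that value; a minimal sketch with illustrative names:

```python
# Sketch: one linear pass over the data, accumulating per-attribute-value
# positive/negative counts as a cheap screen for discriminatory attributes.
from collections import defaultdict

def value_statistics(events, labels, positive_label):
    stats = defaultdict(lambda: [0, 0])          # (attr, value) -> [pos, neg]
    for event, label in zip(events, labels):
        idx = 0 if label == positive_label else 1
        for attr, value in event.items():
            stats[(attr, value)][idx] += 1
    return stats

events = [{"wind": "N", "temp": "hot"},
          {"wind": "N", "temp": "cold"},
          {"wind": "S", "temp": "cold"}]
labels = ["c1", "c1", "c2"]
stats = value_statistics(events, labels, "c1")
```

Here `wind = N` covers only positives and `wind = S` only negatives, so `wind` looks discriminatory; `temp = cold` covers one of each, so `temp` looks weaker, illustrating why this screen is only a rough approximation.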

Rules for Incremental Learning

One of the main advantages of AQ (see Section Advantages and Disadvantages of the AQ Methodology) is the ability to refine previously learned rules as new input events become available. The input data can specify rules that describe either a previously learned concept or constraints between attributes. For example, they can specify that a particular combination of attributes cannot appear together in a rule, or that the boundaries of the search space are reduced under particular attribute values.

As described in detail in Section Rule Learning, AQ starts generating rules by comparing positive


and negative events and keeps specializing previously learned rules with new conditions when they cover negative events. In incremental learning mode, the set of rules being specialized does not start with an empty set but with those specified in the input data. No other aspects of the learning are affected except in the case of extreme ambiguities, when the supplied rules do not include any positive examples. In such situations, AQ cannot use the input rules as it is not able to evaluate their positive and negative coverage.

Rule Learning

This is the core of the AQ methodology, where rules are generated from examples and counterexamples. AQ generates rules by an iterative process aimed at identifying generalizations of the positive examples with respect to the negative examples. Recall that positive examples are those labeled for the target class, and negatives are those belonging to all the other classes.

The main algorithm for AQ is illustrated in Algorithm 1. Although several variants and optimization mechanisms have been developed, the core shown holds for the main AQ methodology. AQ requires two nonempty eventsets, one of positive and one of negative events, where at least one positive event and one negative event are not ambiguous. The algorithm starts by making a new list P′ of positive events yet to be covered. The algorithm loops until all positives have been covered.

The algorithm starts by selecting a random positive event, called the seed, from among P′, and then creates a star (Section Star Generation) for that example.b The result of the star is a rule that generalizes the seed and does not cover any of the negatives (TF mode) or can allow an inconsistent coverage in exchange for simpler rules (PD mode).

A lexicographical evaluation function (LEF) is used to evaluate the rules during the star generation (Section Lexicographical Evaluation Functions). Next, all the events covered by rule r are removed from the list P′. Rule r is guaranteed to cover at least

the seed but might cover many if not all the events in P′. Rule r is then added to the list of rules R to be added to the final answer.
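The covering loop just described can be compressed into a few lines; `make_star` and `select_best` are hypothetical stand-ins for the star-generation and LEF steps, and rules are simplified to predicates:

```python
# Sketch of the AQ covering loop (Algorithm 1): pick a random seed from the
# uncovered positives P', build a star for it, keep the best rule, remove the
# positives that rule covers, and repeat until P' is empty. Because each rule
# is guaranteed to cover at least its seed, the loop terminates.
import random

def aq_learn(positives, negatives, make_star, select_best):
    uncovered = list(positives)              # P': positives yet to be covered
    rules = []                               # R: the final answer
    while uncovered:
        seed = random.choice(uncovered)
        star = make_star(seed, negatives)    # alternative generalizations
        rule = select_best(star)             # the LEF's role in real AQ
        rules.append(rule)
        uncovered = [e for e in uncovered if not rule(e)]
    return rules

# Toy usage: events are numbers; each "star" offers one rule covering the seed.
make_star = lambda seed, negs: [lambda e, s=seed: e <= s]
select_best = lambda star: star[0]
rules = aq_learn([1, 2, 3], [10], make_star, select_best)
```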

Star Generation

The central concept of the algorithm is a star, defined as a set of alternative general descriptions (rules) of a particular event (a ‘seed’) that satisfy given constraints, for example, do not cover negative examples, do not contradict prior knowledge, etc.

The star generation is an iterative process (Algorithm 2). First, the seed is extended against each negative example (line 3). The extension-against operator is a pairwise generalization operation between the seed and a negative event aimed at finding the largest possible set of descriptions (rules) that cover the seed but not the negative. Thus, the result of the extension-against operation is a disjunction of single-condition rules, namely one rule for each nonidentical attribute. An identical attribute is simply an attribute that has the same value for both the seed and the negative event, and for which a generalization that covers the seed but not the negative cannot be made. For each dimension, the largest possible description that covers the seed, but not the negative, is the negation of the negative. This definition of the extension-against operator only works for nominal attributes and is implemented by encoding the attribute domain into a binary vector, where each bit represents a particular value, and by negating this vector.

Assuming a nominal variable ||x|| = {red, green, blue}, x = blue is encoded as {0,0,1}. The result of the extension-against operation between a seed with x = blue and a negative event with x = red is the rule [positive] ← [x = green or blue]. Its binary representation is {0,1,1}, which is exactly the negation of the negative event.

The extension-against operation for linear attributes is slightly different, and it involves flipping the bits only up to the value of the negative event, and not any values beyond. Assuming a linear variable ||y|| = {XS, S, M, L, XL, XXL}, the result of the extension-against operation between a seed with y = S and a negative event with y = L is the rule [positive] ← [y = XS . . . M]. Its binary representation is {1,1,1,0,0,0}.

For integer and continuous attributes, the extension-against operator finds a value between the seed and the negative. The degree of generalization can be controlled and is by default set to choose the middle point between the two. The ε parameter, defined in the interval [0, 1], controls the degree of generalization during the extension-against operation, with 0 being most restrictive to the seed, and 1 generalizing up to the negative. Assuming a continuous variable ||z|| = {0..100}, the result of the extension-against operation between a seed with z = 10 and a negative event with z = 30, and ε = 0.5, is the rule [positive] ← [z ≤ 20]. The result of the same extension-against operation with ε = 1 is [positive] ← [z < 30] (note that 30 is not included).
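The three variants of the extension-against operator described above can be sketched as follows. This is an illustrative reconstruction; the function names and the set/tuple return conventions are ours, not AQ's.

```python
def extend_against_nominal(domain, seed_val, neg_val):
    """Nominal attribute: the result is the negation of the negative,
    i.e., every domain value except the negative's."""
    if seed_val == neg_val:
        return None                      # identical attribute: no condition
    return {v for v in domain if v != neg_val}

def extend_against_linear(domain, seed_val, neg_val):
    """Linear (ordered) attribute: extend from the seed's side only up to,
    and not including, the negative's value."""
    i, j = domain.index(seed_val), domain.index(neg_val)
    if i == j:
        return None
    return set(domain[:j]) if i < j else set(domain[j + 1:])

def extend_against_continuous(seed_val, neg_val, eps=0.5):
    """Continuous attribute: place the boundary between seed and negative;
    eps = 0 stays closest to the seed, eps = 1 extends up to the negative
    (where the text uses a strict inequality to exclude the negative)."""
    cut = seed_val + eps * (neg_val - seed_val)
    return ('<=', cut) if seed_val < neg_val else ('>=', cut)
```

With the examples from the text: the nominal case with seed blue and negative red yields {green, blue}; the linear case with seed S and negative L yields {XS, S, M}; and the continuous case with seed 10, negative 30, and ε = 0.5 places the cut at 20.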

The rules from the extension-against operation are then logically multiplied out with all the rules r to form a star (Algorithm 2, line 4), and the best rule (or rules) according to a given multicriterion functional LEF (Section Lexicographical Evaluation Functions) is selected (line 5). The parameter maxstar is central to the star generation process and defines how many rules are kept for each star.

If AQ is run in TF mode, the result from the intersection of the previously learned rules and the new rule is kept. In PD mode, the function Q [Eq. (5)] is used to compute the tradeoff between the completeness and the consistency of the rules

Q = \left(\frac{p}{P}\right)^{w} \left[\frac{P+N}{N}\left(\frac{p}{p+n} - \frac{P}{P+N}\right)\right]^{1-w}    (5)

where p and n are the numbers of positive and negative events covered by the rule, and P and N are the total numbers of positive and negative events in the data. The parameter w is defined between 0 and 1 and controls the tradeoff between completeness and consistency.9
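Eq. (5) translates directly into a short function (a sketch with our own variable names; completeness is p/P and the bracketed term is the normalized consistency gain):

```python
def q_value(p, n, P, N, w=0.5):
    """Q(w) tradeoff of Eq. (5) between completeness (p/P) and
    consistency gain; w in [0, 1] weights the two components."""
    completeness = p / P
    consistency_gain = ((P + N) / N) * (p / (p + n) - P / (P + N))
    return completeness ** w * consistency_gain ** (1 - w)
```

Note that a fully consistent rule (n = 0) has consistency gain 1, so its Q reduces to (p/P)^w; a consistent rule covering all positives therefore scores Q = 1 for any w.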

Lexicographical Evaluation Functions

A lexicographical evaluation function (LEF) is composed of elementary criteria and their tolerances and is used to determine which rules best reflect the needs of the problem at hand. In other words, LEF is used to determine which rules, among those generated, are best suited to be included in the answer. AQ has been described as performing a beam search in the space of rules.8 LEF is the parameter that controls the width of the beam.

LEF works as follows:

1. Sort the rules in the star according to LEF, from the best to the worst.

2. Select the first rule and compute the number of examples it covers. Select the next rule and compute the number of new examples it covers.

3. If the number of new examples covered exceeds a new-example threshold, then the rule is selected, otherwise it is discarded. Continue the process until all rules are inspected.

The result of this procedure is a set of rules selected from a star. The list of positive events to cover is updated by deleting all those events that are covered by these rules.
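The three-step procedure above can be sketched as follows (illustrative only; `covers` and `lef_key` are placeholders for the rule-coverage function and the ordering induced by the LEF criteria):

```python
def lef_select(star, covers, lef_key, new_threshold=1):
    """Select rules from a star following the LEF procedure above.

    star: candidate rules from star generation;
    covers(rule): set of positive events the rule covers;
    lef_key: sort key ordering rules best-first;
    new_threshold: minimum number of newly covered events to keep a rule."""
    selected, covered = [], set()
    for rule in sorted(star, key=lef_key):      # step 1: sort best to worst
        newly = covers(rule) - covered          # step 2: new coverage only
        if len(newly) >= new_threshold:         # step 3: keep or discard
            selected.append(rule)
            covered |= newly
    return selected
```

A rule whose events are already covered by better-ranked rules adds nothing new and is discarded, which is what keeps the selected set small.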

Postprocessing

Postprocessing operations consist of: (1) improvement of the learned rules through generalization and specialization operators, (2) optional generation of alternative covers, and (3) formatting of the output for textual and graphical visualization.

Optimization of Rules

When AQ is run in PD mode, rules can be further optimized during postprocessing. Rules can be generalized by dropping values in the reference of the conditions or by enlarging the ranges for continuous and integer attributes. Rules can be further generalized by dropping conditions altogether. Finally, entire rules can be dropped. The opposite operation of specialization is performed only at the condition level, by adding values in discrete attributes and shrinking domains for integer and continuous attributes.

The optimization operation follows heuristics, and at each step computes what is called in AQ the Q value for the new rule [Eq. (5)]. If the Q value increases, then the modified rule is added to the final answer, otherwise it is disregarded.

Alternative Covers

Some of the rules learned during the star generation process, especially with large maxstar values, might not be required in the final output. The final step of the learning process consists in selecting, from the pool of learned rules, only the minimum set required to cover the positive examples. Thus, some of the rules might not be included in the final answer and can be used to generate alternative solutions. Depending on the presence of multiple strong patterns in the data, alternative covers might be very useful to discriminate between classes.

FIGURE 1 | A sample association graph from an atmospheric pollution problem.

Association Graphs

Association graphs are used to visualize attributional rules that characterize multivariate dependencies between target classes and input attributes. A program called concept association graph (CAG) was developed by the first author to automatically display such graphs. Figure 1 is a graphical illustration of the rules discovered from an atmospheric release problem.17 Representing relationships with nodes and links is neither new nor unique to AQ and has been used in many applications in statistics and mathematics. Each target class is associated only with unique patterns of input parameters. The thickness of the links indicates the weight of a particular parameter-value combination in the definition of the cluster.

ADVANTAGES AND DISADVANTAGES OF THE AQ METHODOLOGY

The AQ methodology has intrinsic advantages and disadvantages with respect to other machine learning classifiers, such as neural networks, decision trees, or decision rules. Some of the original disadvantages have been solved or mitigated with additional components or optimization processes, often at the expense of a much slower or more complex program. Other issues remain unresolved and open to investigation. The following discussion summarizes what are believed to be the main issues to consider when choosing the AQ methodology over other methods, in particular C4.5, which is the closest widely used symbolic machine learning classifier.

Rich Representation Language

One of the main advantages of AQ consists in the ability to generate compact descriptions which are easy to read and understand. Unlike neural networks, which are black boxes and use a representation language that cannot be easily visualized, AQ rules can be inspected and validated by human experts. Although decision tree classifiers, such as C4.5, can convert the learned trees into rules, the resulting descriptions are expressed in a much simpler representation language than AQ's. For example, C4.5 rules only allow for atomic relationships between attributes and possible values and do not allow for internal disjunctions or multiple ranges. Figure 2 shows the respective covers generated by AQ (left) and C4.5 (right). In this example, internal disjunction allows for a simpler and more compact representation due to intersecting patterns.

The cover generated by AQ [Eq. (6)] is composed of two rules with a single condition, each covering 20 positives and no negatives.

[Positives = 1] ← [X ≥ 5] : p = 20, n = 0
                ← [Y ≥ 5] : p = 20, n = 0    (6)


FIGURE 2 | Different covers generated by AQ (left) and C4.5 (right) using the same dataset.

In contrast, the tree [Eq. (7)] and the corresponding rules [Eq. (8)] generated by C4.5 cannot represent the intersecting concept because of the simpler representation language.

Root
├── X ≥ 5 → Positives
└── X < 5
    ├── Y ≥ 5 → Positives
    └── Y < 5 → Negatives    (7)

[Positives = 1] ← [X ≥ 5] : p = 20, n = 0
                ← [X < 5][Y ≥ 5] : p = 10, n = 0    (8)

The C4.5 cover is composed of two rules, one with a single condition, and one with two conditions. The first, identical to the rule learned by AQ, covers 20 positives and no negatives, whereas the second covers only 10 positives and no negatives. Although both covers are complete and consistent, the cover of C4.5 is more complex and cannot represent the intersecting concept.

Speed

AQ is considerably slower than C4.5 because of the underlying differences between the ‘separate and conquer’ learning strategy of AQ and the ‘divide and conquer’ strategy of C4.5. In C4.5, at each iteration, the algorithm recursively divides the search space. This means that at each iteration the algorithm analyzes an ever smaller number of events. In contrast, AQ compares each positive with all of the negatives. Effectively, AQ can be optimized to consider only a portion of the positive examples but still has to consider all the negatives. This is in part due to the ability of representing intersecting concepts, meaning that rules are not bound to prior partitions.

Quality of Decisions

As previously seen, C4.5 performs consecutive splits on a decreasing number of positive and negative examples. This means that at each iteration, decisions are made on a smaller amount of information. In contrast, AQ considers the entire search space at each iteration, meaning that all decisions are made with the maximum amount of information available.

Control Parameters

AQ has a very large number of control variables. Such controls allow for a very fine tuning of the algorithm, which can lead to very high quality descriptions. On the other hand, it is often difficult to determine a priori which set of parameters will generate better rules. Although heuristics on how to set the parameters exist, they are often suboptimal, and user fine tuning is required for optimal descriptions.

Matching of Rules and Events

AQ allows for different methods to test events on a set of learned rules. In C4.5, and most other classifiers which do not allow for intersecting concepts, testing an event usually involves checking whether it is included in a rule or not. This is due to the fact that the entire search space is partitioned into one of the target classes. In AQ, each event can be included in more than one rule or could be in an area of the search space which has not been assigned to any class. Assigning an unclassified event to one of the target classes involves computing different degrees of match between the event and the covers of each of the classes and selecting the class with the highest degree of match. Figure 3 shows the example of an unclassified event that lies in an area of the search space not generalized to any of the classes. The degree of match between the event and the cover of each class is computed, and the event is assigned to the class with the highest score, in this case (d1). Several distance functions can be used to match rules and events; at the top level, they differ in whether they are strict or flexible.

FIGURE 3 | An event that lies in an area of the search space which is not generalized to any of the training classes is assigned to the class it is closest to.

In strict matching, AQ counts how many times a particular event is covered by the rules of each of the classes. An event can be covered multiple times by the rules of a particular class because the rules might be intersecting due to internal disjunctions. It can also be covered multiple times by rules of different classes if AQ was run in PD mode and inconsistent covers were generated.

In flexible matching, AQ computes the ratio of how many of the attributes are covered over the total number of attributes. Assuming an event with three attributes, if a rule for class A matches three of them, and a rule for class B matches two of them, the event is classified as type A because of a higher flexible degree of match. If the degree of match falls below a certain threshold, AQ classifies the event as unknown. In case more than one class has the same degree of match, the classification is uncertain, and multiple classes are output.
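A minimal sketch of flexible matching follows. The data representation is our own for illustration: an event is a dict of attribute values, a rule is a dict mapping each constrained attribute to its set of allowed values (internal disjunction), and a cover is a list of rules per class.

```python
def flexible_match(event, rule):
    """Fraction of the event's attributes consistent with the rule;
    attributes the rule does not constrain count as matching."""
    ok = sum(1 for attr, val in event.items()
             if attr not in rule or val in rule[attr])
    return ok / len(event)

def classify(event, covers, threshold=0.5):
    """Assign the event to the class whose cover matches it best;
    'unknown' below the threshold, multiple classes on a tie."""
    scores = {cls: max(flexible_match(event, r) for r in rules)
              for cls, rules in covers.items()}
    best = max(scores.values())
    if best < threshold:
        return ['unknown']
    return sorted(cls for cls, s in scores.items() if s == best)
```

With a three-attribute event, a class-A rule matching all three attributes (degree 1.0) beats a class-B rule matching two of them (degree 2/3), reproducing the example in the text.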

Multiple Target Classes

Decision tree classifiers have an advantage when learning from data with several target classes. They can learn descriptions for each of the classes using a single iteration of the algorithm, leading to very fast results. AQ, on the other hand, must be run multiple times, each time using the events of the target class as positives and the events of all the other classes as negatives. Such a limitation seriously affects the execution time. Additionally, because rules are learned separately for each class, the resulting covers might be intersecting. Intersecting concepts might lead unseen testing events to be classified as belonging to more than one class.

Alternative Covers

AQ can generate alternative covers for each run. This is because in the postprocessing phase, only a portion of the learned rules are used for the final output. By selecting different rule combinations, it is possible to generate a number of alternative covers. Each cover might differ in completeness and consistency, and in simplicity of patterns.

Incremental Learning

Decision rule learners have the intrinsic advantage of being able to refine previous rules as new training data become available. This is because of the sequential nature of the ‘separate and conquer’ strategy of such algorithms. Refinement of rules involves adding or dropping conditions in previously learned rules or splitting a rule into a number of partially subsumed rules. The main advantage is that modification of a rule in the cover does not affect the coverage of the other rules in the cover (although the overall completeness and consistency of the entire cover might be affected). In contrast, although possible, it is more complicated to update a tree, as it often involves several updates that propagate from the leaves of the tree all the way to the root. Additionally, the resulting tree might be suboptimal and very unbalanced, prompting a complete re-evaluation of each node.

Input Background Knowledge

Because of the ability of AQ to update previously learned rules, it is possible to add background knowledge in the form of input rules. This feature is particularly important when there is existing knowledge of the data, or constraints on the attributes, which can lead to simpler rules and a faster execution.

EVOLUTIONARY COMPUTATION GUIDED BY AQ

The term evolutionary computation was coined in 1991 as an effort to combine the different approaches to simulating evolution to solve computational problems.18–23 Evolutionary computation algorithms are stochastic methods that evolve in parallel a set of potential solutions through a trial and error process. Potential solutions are encoded as vectors of values and evaluated according to an objective function (often called a fitness function). The evolutionary process consists of selecting one or more candidate solutions whose vector values are modified to maximize (or minimize) the objective function. If the newly created solutions better optimize the objective function, they are inserted into the next generation, otherwise they are disregarded. While the methodologies and algorithms that are subsumed by this name are numerous, most of them share one fundamental characteristic: they use nondeterministic operators such as mutation and recombination as the main engine of the evolutionary process.

These operators are semi-blind, and the evolution is not guided by knowledge learned in past generations; rather, it is a form of search process executed in parallel. In fact, most evolutionary computation algorithms are inspired by the principles of Darwinian evolution, defined by ‘. . . one general law, leading to the advancement of all organic beings, namely, multiply, vary, let the strongest live and the weakest die’.24

The Darwinian evolution model is simple and fast to simulate, and it is domain independent. Because of these features, evolutionary algorithms have been applied to a wide range of optimization problems.25

There have been several attempts to extend the traditional Darwinian operators with statistical and machine learning approaches that use history information from the evolution to guide the search process. The main challenges are to avoid local maxima and increase the rate of convergence. The majority of such methods use some form of memory and/or learning to direct the evolution toward particular directions thought more promising.26–31

Because evolutionary computation algorithms evolve a number of individuals in parallel, it is possible to learn from the ‘experience’ of entire populations. There is no similar type of biological evolution, because in nature there is no mechanism to evolve entire species. Estimation of distribution algorithms (EDA) are a form of evolutionary algorithms where an entire population may be approximated with a probability distribution.32 New candidate solutions are not chosen at random but using statistical information from the sampling distribution. The aim is to avoid premature convergence and to provide a more compact representation.

Discriminating between best and worst performing individuals could provide additional information on how to guide the evolutionary process. The learnable evolution model (LEM) methodology was proposed in which a machine learning rule induction algorithm was used to learn attributional rules that discriminate between best and worst performing candidate solutions.33–35 New individuals were then generated according to inductive hypotheses discovered by the machine learning program. The individuals are thus genetically engineered, in the sense that the values of the variables are not randomly or semi-randomly assigned but set according to the rules discovered by the machine learning program.

The basic algorithm of LEM works like Darwinian-type evolutionary methods, that is, it repetitively executes the following main steps:

1. Create a population of individuals (randomly or by selecting them from a set of candidates using some selection method).

2. Apply operators of mutation and/or recombination to selected individuals to create new individuals.

3. Use a fitness function to evaluate the new individuals.

4. Select the individuals which survive into the next generation.

The main difference with Darwinian-type evolutionary algorithms is in the way it generates new individuals. In contrast to Darwinian operators of mutation and/or recombination, AQ conducts a reasoning process in the creation of new individuals. Specifically, at each step (or selected steps) of evolution, a machine learning method generates hypotheses characterizing differences between high-performing and low-performing individuals. These hypotheses are then instantiated in various ways to generate new individuals. The search conducted by LEM for a global solution can be viewed as a progressive partitioning of the search space.

Each time the machine learning program is applied, it generates hypotheses indicating the areas in the search space that are likely to contain high-performing individuals. New individuals are selected from these areas and then classified as belonging to a high-performance or a low-performance group, depending on their fitness value. These groups are then differentiated by a machine learning program, yielding a new hypothesis as to the likely location of the global solution.
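One LEM generation can be sketched as follows. This is an illustrative skeleton only: `learn_rules` stands in for the AQ induction step, and `instantiate` for sampling new individuals that satisfy the learned hypotheses.

```python
import random

def lem_step(population, fitness, learn_rules, instantiate, frac=0.3):
    """One LEM generation: split the population by fitness, learn rules
    discriminating the high group from the low group, and generate new
    individuals inside the region the rules describe (elitist survival)."""
    ranked = sorted(population, key=fitness, reverse=True)
    k = max(1, int(frac * len(ranked)))
    high, low = ranked[:k], ranked[-k:]          # high-/low-performing groups
    rules = learn_rules(high, low)               # AQ-style hypothesis
    children = [instantiate(rules) for _ in population]
    # keep the best individuals among parents and children
    return sorted(population + children, key=fitness, reverse=True)[:len(population)]
```

As a toy stand-in for AQ, learning the bounding interval of the high group and sampling uniformly inside it is already enough to drive a one-dimensional population toward the optimum of f(x) = −(x − 5)².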


To understand the advantage of using AQ to generate new individuals, compared with using the traditional Darwinian operators, it is necessary to take into account both the evolution length, defined as the number of function evaluations needed to determine the target solution, and the evolution time, defined as the execution time required to achieve this solution. The reason for measuring both characteristics is that choosing between the AQ and Darwinian algorithms involves assessing tradeoffs between the complexity of the population-generating operators and the evolution length. The AQ operations of hypothesis generation and instantiation are more computationally costly than operators of mutation and/or crossover, but the evolution length is typically much shorter than that of Darwinian evolutionary algorithms.

Therefore, the use of AQ as the engine of evolution is only advantageous for problems with high objective function evaluation complexity. The problem of source detection of atmospheric pollutants described in this article is an ideal such problem because of the complexity of the function evaluation, which requires running complex numerical simulations.

SOURCE DETECTION OF ATMOSPHERIC RELEASES

When an airborne toxic contaminant is released in the atmosphere, it is rapidly transported by the wind and dispersed by atmospheric turbulence. Contaminant clouds can travel distances of the order of thousands of kilometers within a few days and spread over areas of the order of thousands of square kilometers. A large population can be affected, with serious and long-term consequences depending on the nature of the hazardous material released. Potential atmospheric hazards include toxic industrial chemical spills, forest fires, intentional or accidental releases of chemical and biological agents, nuclear power plant accidents, and release of radiological material. Risk assessment of contamination from a known source can be computed by performing multiple forward numerical simulations for different meteorological conditions and by analyzing the simulated contaminant clouds with clustering and classification algorithms to identify the areas with highest risk.17

However, often the source is unknown, and it must be identified from limited concentration measurements observed on the ground. The likely occurrence of a hazardous release must be inferred from the anomalous levels of contaminant concentration measured by sensors on the ground or by satellite-borne remote sensors.

There are currently no established methodologies for the satisfactory solution of the problem of detecting the sources of atmospheric releases, and there is a great degree of uncertainty with respect to the effectiveness and applicability of existing techniques. One line of research focuses on adjoint transport modeling,36–38 but more general and powerful methodologies are based on Bayesian inference coupled with stochastic sampling.39 Bayesian methods aim at an efficient ensemble run of forward simulations, where statistical comparisons with observed data are used to improve the estimates of the unknown source location.40 This method is general, as it is independent of the type of model used and the type and amount of data, and can be applied to nonlinear processes as well. Senocak et al.41 used a Bayesian inference methodology to reconstruct atmospheric contaminant dispersion. They pair the Bayesian paradigm with Markov chain Monte Carlo (MCMC) to iteratively identify potential candidate sources. A reflected Gaussian plume model is run for each candidate source, and the resulting concentrations are compared with ground observations. The goal of the algorithm is to minimize the error between the simulated and the measured concentrations.

A similar approach was followed by Refs 42–45, which use an iterative process based on genetic algorithms to find the characteristics of unknown sources. They perform multiple forward simulations from tentative source locations and use the comparison of simulated concentrations with sensor measurements to implement an iterative process that converges to the real source. The strength of the approach relies on the domain independence of the genetic algorithm, which can effectively be used with different error functions without major modifications to the underlying methodology. The error functions quantify the difference between simulated and observed values.

The methodology applied in this article is based on this approach, but rather than using a traditional evolutionary algorithm, it uses AQ4SD to generate new individuals. This application is particularly suited for AQ4SD, because the function evaluation is very computationally intensive and requires running a numerical simulation. The main advantage of using AQ4SD to generate new individuals is the reduced number of function evaluations, which in this case translates to a huge improvement in speed.

Transport and Dispersion Simulations

Central to every evolutionary algorithm is the definition of the objective or fitness function. Given a candidate solution, the fitness function evaluates it and gives as feedback which solution is better for the problem at hand. Each candidate solution is comprised of eight variables: x, y, z, θ, U, Q, S, and ψ. x, y, and z are the coordinates of the release in kilometers; θ and U are, respectively, the wind direction and speed in degrees and m s−1; Q is the source strength in g s−1; S is proportional to the area of the release in m2; and ψ describes the atmospheric stability according to Pasquill's stability classes.46,47 The fitness of each candidate solution is computed using a normalized mean square error (NMSQE) function between the observed concentrations and the simulated values:

NMSQE = \sqrt{\overline{(C_o - C_s)^2} \,/\, \overline{C_o}^{\,2}}    (9)

where Co is each sensor's observed value and Cs is the corresponding simulated value. The bar indicates an average over all the observations. The values for Cs are simulated using a three-dimensional (3D) Gaussian dispersion model, that is,

Cs = P1 P2 (P3 + P4)    (10)

where P1, . . . , P4 are defined by

P_1 = \frac{Q}{2\pi U\sqrt{(S+\sigma_y^2)(S+\sigma_z^2)}}    (11)

P_2 = \exp\left[-\frac{(y-y_0)^2}{2(S+\sigma_y^2)}\right]    (12)

P_3 = \exp\left[-\frac{(z-z_0)^2}{2(S+\sigma_z^2)}\right]    (13)

P_4 = \exp\left[-\frac{(z+z_0)^2}{2(S+\sigma_z^2)}\right]    (14)

where σx(x, x0; ψ), σy(x, x0; ψ), and σz(x, x0; ψ) are the dispersion coefficients, which were computed from the tabulated equations of Briggs,48 and S = σy²(x0, x0; ψ).

The result of the simulation is the concentration field generated by the release along an arbitrary wind direction. In order to map each Cs to the corresponding Co, the wind direction θ is taken into account by applying a rotation to the x, y, and z coordinates of each Cs point.
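The fitness evaluation of Eqs (9)–(14) can be sketched as follows. This is an illustrative reconstruction: for simplicity, the dispersion coefficients σy and σz are passed in directly rather than computed from the Briggs tabulated equations, and the wind-direction rotation of the coordinates is omitted.

```python
import math

def plume_concentration(y, z, source, sigma_y, sigma_z):
    """Reflected Gaussian plume of Eqs (10)-(14) at crosswind distance y
    and height z. source = (y0, z0, U, Q, S)."""
    y0, z0, U, Q, S = source
    sy2, sz2 = S + sigma_y ** 2, S + sigma_z ** 2
    p1 = Q / (2 * math.pi * U * math.sqrt(sy2 * sz2))     # Eq. (11)
    p2 = math.exp(-(y - y0) ** 2 / (2 * sy2))             # Eq. (12)
    p3 = math.exp(-(z - z0) ** 2 / (2 * sz2))             # Eq. (13)
    p4 = math.exp(-(z + z0) ** 2 / (2 * sz2))             # Eq. (14)
    return p1 * p2 * (p3 + p4)                            # Eq. (10)

def nmsqe(observed, simulated):
    """Normalized mean square error of Eq. (9)."""
    n = len(observed)
    mse = sum((co - cs) ** 2 for co, cs in zip(observed, simulated)) / n
    mean_obs = sum(observed) / n
    return math.sqrt(mse / mean_obs ** 2)
```

A candidate source that reproduces the observations exactly scores NMSQE = 0, and the simulated concentration decays away from the plume centerline, as expected.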

Prairie Grass Experiment

The current application uses real-world data from the prairie grass field experiment.49 The experiment consisted of 68 consecutive releases of 10 min each from the same source. SO2 was used as a trace gas, and measurements of concentrations were made at sensors positioned along arcs radially located at distances of 50, 100, 200, 400, and 800 m from the source. Only sensors that recorded values above a minimum threshold were considered reliable, and as a result, each experiment has a different number of concentration measurements depending on the atmospheric conditions at the time of the release. The goal of the optimization process is to identify the source and the atmospheric characteristics. The only information used for the fitness evaluation is the values of the concentrations measured at the sensors. Figure 4 shows a summary of the 68 consecutive experiments. The concentration was computed by interpolating all the values measured at the concentric sensors (shown). The main direction of each release, as indicated in the experiment's summary, is shown with the solid lines protruding from the interpolated surface. One of the characteristics of the prairie grass experiment is the detailed information on the atmospheric conditions at the time of the release. It is then possible to classify each experiment as belonging to a different atmospheric type, using Pasquill's stability classes.46,47 Pasquill's classes range from unstable (A) to neutral (D) to stably stratified atmosphere (F).

FIGURE 4 | Summary of the 68 prairie grass experiments.


FIGURE 5 | Different sample prairiegrass releases by atmosphere type.

(Panels: Release 25, type A; Release 7, type B; Release 9, type C; Release 12, type D; Release 42, type E; Release 13, type F.)

Figure 5 shows 6 of the 68 experiments, each having occurred under a different atmospheric type. The figure shows how the atmospheric stability determines the characteristics of the concentration field. Unstable atmosphere (A) enhances the spread, thus reducing the ground-level concentration, whereas stable atmosphere causes much narrower plumes, which result in higher ground concentrations.

Results

Experiments were performed for each of the 68 prairie grass releases. The algorithm started by generating a population of random candidate solutions. Each candidate solution is a potential source and is encoded as a vector of eight variables: x, y, z, θ, U, Q, S, and ψ. For each potential source, the resulting concentration field is computed by Eq. (10). The fitness score of each source is defined as the error between the observed ground concentration and the simulated concentration at the same locations, computed according to Eq. (9).

The algorithm proceeds by dividing the candidate solutions into those with high and low fitness scores, and learning patterns (rules) which characterize the attribute value combinations that discriminate

FIGURE 6 | Errors of AQ4SD divided by atmosphere type. (x axis: atmosphere type A–F; y axis: error, 0.2–1.4.)

between the two groups. New candidate solutions are generated according to the learned patterns. The process continues for 500 iterations. The algorithm was run using a population of 100 candidate solutions. At each step, the top and lowest 30% of the solutions were used as members of the high- and low-performing groups. Each experiment was repeated 10 times to study the sensitivity of the
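The iterative scheme described in the text (split the population by fitness, learn a description of the high-fitness group, and instantiate new candidates from it) can be sketched as below. The "rule learner" here is reduced to axis-aligned interval bounds on the high-fitness group, a drastic simplification of AQ's attributional rule induction; all names and the interval simplification are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def evolve(fitness, bounds, pop_size=100, frac=0.3, iters=500):
    """LEM-style loop sketched from the text: at each step, take the
    top `frac` of the population, learn a (here: per-attribute interval)
    description of it, and sample new candidates from that description.
    `bounds` is an (n_vars, 2) array of search limits; lower fitness
    values are treated as better (error scores)."""
    lo, hi = bounds[:, 0], bounds[:, 1]
    pop = rng.uniform(lo, hi, size=(pop_size, len(lo)))
    for _ in range(iters):
        scores = np.array([fitness(p) for p in pop])
        order = np.argsort(scores)
        k = int(frac * pop_size)
        good = pop[order[:k]]
        # "Rule": per-attribute ranges covering the high-fitness group.
        g_lo, g_hi = good.min(axis=0), good.max(axis=0)
        # New candidates instantiated inside the learned description,
        # with the high-fitness members kept as elites.
        pop = rng.uniform(g_lo, g_hi, size=(pop_size, len(lo)))
        pop[:k] = good
    scores = np.array([fitness(p) for p in pop])
    return pop[np.argmin(scores)]
```

On a simple quadratic error surface this sketch homes in on the minimum; the real AQ4SD replaces the interval bounds with full attributional rules learned from both the high- and low-performing groups.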




FIGURE 7 | Summary of the errors of AQ4SD for each prairie grass experiment. The atmosphere type of each experiment is color coded. (x axis: experiment ID, 001–065; y axis: error, 0.2–1.4.)

algorithm to the initial guess of solutions. The results are also shown in terms of atmospheric type.

A total of 680 AQ4SD source detections were performed, namely 10 for each of the 68 experiments. There is a considerably higher number of experiments of type D, as this was the predominant atmospheric condition at the time of the releases. In order to compensate for the different distributions of experiments, the results are normalized using this information.
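One simple way to carry out the per-type normalization mentioned above is to average within each Pasquill class before averaging across classes, so the heavily sampled class D does not dominate the summary. The error values below are invented for illustration; the paper's actual normalization procedure is not spelled out here.

```python
import numpy as np

# Hypothetical per-run errors tagged by Pasquill class (values invented).
errors = {"A": [0.9, 1.1], "D": [0.3, 0.4, 0.5, 0.35], "F": [1.2]}

# Average within each class first, so a class with many runs (D) carries
# the same weight as a class with few runs in the overall statistic.
per_type_mean = {k: float(np.mean(v)) for k, v in errors.items()}
overall = float(np.mean(list(per_type_mean.values())))
```

Without this step, a plain mean over all 680 runs would mostly reflect the neutral-stability (type D) cases.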

Figure 6 shows a summary of the different errors, defined by Eq. (9), achieved by AQ4SD as a function of the atmospheric type. A threshold of 1.0 was assigned as the minimum fitness value to recognize a source, because such a value indicates that AQ4SD identified the source within 50 m of the correct solution. Considerably better results were achieved for atmospheric type D, and worse results for atmospheric types A and F. This pattern reflects the accuracy of the dispersion model (10) in reproducing the concentration field under different stability conditions. The Gaussian model is expected to perform better in neutral conditions (D), whereas convective turbulence (A) and stable stratification (F) involve more complex dispersion mechanisms which cannot be accounted for, resulting in a lack of accuracy. Figure 6 is consistent with the notion that the algorithm performs better when the fitness of the dispersion model is higher.

Figure 7 shows a summary for all the 68 prairie grass experiments. Each atmosphere type is color coded. With the exception of six experiments (3, 4, 7, 25, 52), each of type A or type F, AQ4SD always achieves a minimum fitness of 1.0, which was the target acceptance threshold for this experiment. The overall average fitness error is 0.6.

Figure 8 shows a summary of the results in terms of x, y, z, θ (called WA = wind angle), and Q. For all experiments, the original source was located at x, y, z = 0, 0, 0. The WA and Q errors are defined, respectively, as the identified angle minus the real angle, and the identified Q minus the real Q. Once again, the atmosphere type is color coded using the same colors as in Figure 7. The ideal solution would be all the points located at 0, for all variables. In the figure, although this is primarily true for most variables, the strong dependence between y and θ is evident. The units for the x, y, and z directions are meters. Therefore, the errors associated with changes in the z values are actually very small, as z only varies 10 m at most. There are larger errors for the alongwind dimension x compared with the crosswind dimension y. This is to be attributed to the concentration field, which has a larger gradient in the crosswind direction compared with the alongwind direction. Note the correlation between the error in θ and the y dimension. Such behavior exemplifies the algorithm's skill at compensating for errors in y through changes in θ. The variable that seems to be harder to optimize is Q. Such results are primarily due to the correlation between Q and U [P1 in Eq. (10)].

DISCUSSION

This article introduces the main concepts of the AQ methodology and discusses its advantages and disadvantages. It describes a new implementation of the AQ methodology, AQ4SD, applied to the problem of source detection of atmospheric releases. In that context, AQ4SD is used as the main engine of evolution




FIGURE 8 | Pairwise plot of different attributes used during the optimization. (Variables: X, Y, Z, Error WA, Error Q; atmosphere types A–F color coded.)



for an evolutionary computation process aimed at finding the source of an atmospheric release, using only the observed ground measurements and a numerical atmospheric dispersion model. Experiments were performed to identify the source of each of the 68 releases of the prairie grass field experiment.

The numerical experiments show that in all but five cases the methodology was able to achieve a fitness score considered acceptable for the correct identification of the source. The performance of the algorithm has been very satisfactory, considering the error intrinsic in the measured data and the approximation of the dispersion model. AQ4SD also proved to be quite efficient in terms of the number of model simulations required for each optimization case. This is one of the main advantages of the proposed methodology compared with traditional evolutionary algorithms, because a fitness evaluation for a complete source detection procedure may require computationally expensive numerical simulations. In particular, for larger scale dispersion problems, more sophisticated and computationally expensive meteorological and dispersion models need to be run concurrently to evaluate the fitness of each candidate solution.45

The proposed methodology has a wide domain of applicability, not restricted only to the source detection problem. It can be used for a variety of optimization problems and is particularly advantageous for those problems where the fitness function evaluation involves a computationally expensive operation.

NOTES

a. Some versions of AQ can also be run to generate rules with the largest number of attributes (called characteristic mode), but such a mode merely consists in generating discriminant rules and adding conditions that include all events in the class but have no discriminatory information.

b. Some versions of AQ sort all or a part of the negative events according to a distance metric. Although such a mechanism has been shown to generate simpler rules in specific cases, because of the additional complexity of defining such distance metrics, which is not always possible, as in the case of nominal attributes, paired with the additional computational resources required, the advantage of such sorting is not clear. AQ4SD can be run with and without sorting, and experiments have shown no or negligible improvements.
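For numeric attributes, the negative-event sorting mentioned in note b can be sketched with a Euclidean metric, ordering negatives by their distance from a positive seed event so that near misses are considered first. This is a minimal sketch under that assumption; as the note points out, no such metric is readily available for nominal attributes.

```python
import numpy as np

def sort_negatives(seed, negatives):
    """Order negative events (rows) by Euclidean distance from the
    positive seed event, nearest first, so rule specialization
    considers the hardest-to-exclude negatives early."""
    d = np.linalg.norm(negatives - seed, axis=1)
    return negatives[np.argsort(d)]
```

The extra cost of computing and sorting these distances is part of why, as the note reports, the benefit of sorting was negligible in the AQ4SD experiments.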

ACKNOWLEDGEMENTS

This material is partly based upon work supported by the National Science Foundation under Grant no: AGS0849191.

REFERENCES

1. Michalski R. On the quasi-minimal solution of the general covering problem. Proceedings of the Fifth International Symposium on Information Processing (FCIP 69), Bled, Yugoslavia, vol. A3; October 3–11, 1969, 125–128.

2. Michalski R. A theory and methodology of inductive learning. Mach Learn 1983, 1:83–134.

3. Michalski R. AQVAL/1 computer implementation of a variable-valued logic system VL1 and examples of its application to pattern recognition. First International Joint Conference on Pattern Recognition, Washington, D.C., 1973, 3–17.

4. Chilausky R, Jacobsen B, Michalski R. An application of variable-valued logic to inductive learning of plant disease diagnostic rules. Proceedings of the Sixth International Symposium on Multiple-valued Logic. Logan, UT: IEEE Computer Society Press, Los Alamitos; 1976, 233–240.

5. Steinberg D, Colla P. CART: Tree-structured Non-parametric Data Analysis. San Diego, CA: Salford Systems; 1995.

6. Quinlan J. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann; 1993.

7. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. Wadsworth International Group; 1984.

8. Michalski R, Mozetic I, Hong J, Lavrac N. The multipurpose incremental learning system AQ15 and its testing application to three medical domains. Proceedings of the 1986 AAAI Conference, Philadelphia, PA, vol. 104; August 11–15, 1986, 1041–1045.

9. Kaufman K, Michalski R. The AQ18 Machine Learning and Data Mining System: An Implementation and User's Guide. MLI Report. Fairfax, VA: Machine Learning and Inference Laboratory, George Mason University; 1999.




10. Mitchell T. Machine Learning. New York: McGraw-Hill; 1997.

11. Cervone G, Panait L, Michalski R. The development of the AQ20 learning system and initial experiments. Proceedings of the Fifth International Symposium on Intelligent Information Systems, June 18–22, 2001, Zakopane, Poland: Physica-Verlag; 2001, 13.

12. Keesee APK. How Sequential-Cover Data Mining Programs Learn. College of Science. Fairfax, VA: George Mason University; 2006.

13. Austern M. Generic Programming and the STL: Using and Extending the C++ Standard Template Library. 1998.

14. Gamma E, Helm R, Johnson R, Vlissides J. Design Patterns: Elements of Reusable Object-Oriented Software. Westford, MA: Addison-Wesley; 1995.

15. Bloedorn E, Wnek J, Michalski R. Multistrategy constructive induction: AQ17-MCI. Rep Mach Learn Infer Lab 1993, 1051:93–4.

16. Wnek J, Michalski R. Hypothesis-driven constructive induction in AQ17-HCI: a method and experiments. Mach Learn 1994, 14:139–168.

17. Cervone G, Franzese P, Ezber Y, Boybeyi Z. Risk assessment of atmospheric emissions using machine learning. Nat Hazards Earth Syst Sci 2008, 8:991–1000.

18. Holland J. Adaptation in Natural and Artificial Systems. Cambridge, MA: The MIT Press; 1975.

19. Goldberg DE. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley; 1989.

20. Back T. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, and Genetic Algorithms. Oxford, NY: Oxford University Press; 1996.

21. Michalewicz Z. Genetic Algorithms + Data Structures = Evolution Programs. 3rd ed. Berlin: Springer-Verlag; 1996.

22. Fogel L. Intelligence Through Simulated Evolution: Forty Years of Evolutionary Programming. Wiley Series on Intelligent Systems. New York: John Wiley & Sons, Inc.; 1999.

23. De Jong K. Evolutionary computation: a unified approach. Proceedings of the 2008 GECCO Conference on Genetic and Evolutionary Computation. New York: ACM; 2008, 2245–2258.

24. Darwin C. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. London: Oxford University Press; 1859.

25. Ashlock D. Evolutionary Computation for Modeling and Optimization. Berlin, Heidelberg: Springer-Verlag; 2006.

26. Grefenstette J. Incorporating problem specific knowledge into genetic algorithms. Genetic Alg Simul Annealing 1987, 4:42–60.

27. Grefenstette J. Lamarckian learning in multi-agent environments. Proceedings of the Fourth International Conference on Genetic Algorithms, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1991.

28. Sebag M, Schoenauer M. Controlling Crossover through Inductive Learning. Lecture Notes in Computer Science. London: Springer-Verlag; 1994, 209–209.

29. Sebag M, Schoenauer M, Ravise C. Inductive learning of mutation step-size in evolutionary parameter optimization. Lecture Notes in Computer Science. London: Springer-Verlag; 1997, 247–261.

30. Reynolds R. Cultural Algorithms: Theory and Applications. McGraw-Hill's Advanced Topics in Computer Science Series. Maidenhead, England: McGraw-Hill Ltd.; 1999, 367–378.

31. Hamda H, Jouve F, Lutton E, Schoenauer M, Sebag M. Compact unstructured representations for evolutionary design. Appl Intell 2002, 16:139–155.

32. Lozano J. Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms. Springer; 2006.

33. Michalski R. Learnable evolution: combining symbolic and evolutionary learning. Proceedings of the Fourth International Workshop on Multistrategy Learning (MSL'98). 1999, 14–20.

34. Cervone G, Michalski R, Kaufman K, Panait L. Combining machine learning with evolutionary computation: recent results on LEM. Proceedings of the Fifth International Workshop on Multistrategy Learning (MSL-2000). Guimaraes, Portugal; 2000, 41–58.

35. Cervone G, Kaufman K, Michalski R. Experimental validations of the learnable evolution model. Proceedings of the 2000 Congress on Evolutionary Computation, La Jolla, CA, vol. 2; July 16–19, 2000.

36. Pudykiewicz J. Application of adjoint tracer transport equations for evaluating source parameters. Atmos Environ 1998, 32:3039–3050.

37. Hourdin F, Issartel JP. Sub-surface nuclear tests monitoring through the CTBT xenon network. Geophys Res Lett 2000, 27:2245–2248.

38. Enting I. Inverse Problems in Atmospheric Constituent Transport. Cambridge, NY: Cambridge University Press; 2002, 392.

39. Gelman A, Carlin J, Stern H, Rubin D. Bayesian Data Analysis. Chapman & Hall/CRC; 2003, 668 pp.

40. Chow F, Kosovic B, Chan T. Source inversion for contaminant plume dispersion in urban environments using building-resolving simulations. Proceedings of the 86th American Meteorological Society Annual Meeting, Atlanta, GA, January 2006, 12–22.




41. Senocak I, Hengartner N, Short M, Daniel W. Stochastic event reconstruction of atmospheric contaminant dispersion using Bayesian inference. Atmos Environ 2008, 42:7718–7727.

42. Haupt SE. A demonstration of coupled receptor/dispersion modeling with a genetic algorithm. Atmos Environ 2005, 39:7181–7189.

43. Haupt SE, Young GS, Allen CT. A genetic algorithm method to assimilate sensor data for a toxic contaminant release. J Comput 2007, 2:85–93.

44. Allen CT, Young GS, Haupt SE. Improving pollutant source characterization by better estimating wind direction with a genetic algorithm. Atmos Environ 2007, 41:2283–2289.

45. Delle Monache L, Lundquist J, Kosovic B, Johannesson G, Dyer K, et al. Bayesian inference and Markov chain Monte Carlo sampling to reconstruct a contaminant source on a continental scale. J Appl Meteor Climatol 2008, 47:2600–2613.

46. Pasquill F. The estimation of the dispersion of windborne material. Meteorol Magazine 1961, 90:33–49.

47. Pasquill F, Smith F. Atmospheric Diffusion. Chichester, UK: Ellis Horwood; 1983.

48. Arya PS. Air Pollution Meteorology and Dispersion. Oxford, NY: Oxford University Press; 1999.

49. Barad M, Haugen D. Project Prairie Grass, A Field Program in Diffusion. United States Air Force, Air Research and Development Command, Air Force Cambridge Research Center; Cambridge, MA, 1958.
