Soft-Ware 2002: Computing in an Imperfect World: First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8–10, 2002 Proceedings


Lecture Notes in Computer Science 2311
Edited by G. Goos, J. Hartmanis, and J. van Leeuwen


Berlin · Heidelberg · New York · Barcelona · Hong Kong · London · Milan · Paris · Tokyo


David Bustard Weiru Liu Roy Sterritt (Eds.)

Soft-Ware 2002: Computing in an Imperfect World

First International Conference, Soft-Ware 2002
Belfast, Northern Ireland, April 8–10, 2002
Proceedings


Series Editors

Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors

David Bustard, Weiru Liu, Roy Sterritt
University of Ulster, Faculty of Informatics
School of Information and Software Engineering
Jordanstown Campus, Newtownabbey, BT37 0QB, Northern Ireland
E-mail: dw.bustard/w.liu/[email protected]

Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme

Computing in an imperfect world : first international conference, Soft-Ware 2002, Belfast, Northern Ireland, April 8–10, 2002 ; proceedings / David Bustard ... (ed.). – Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2002

(Lecture notes in computer science ; Vol. 2311)
ISBN 3-540-43481-X

CR Subject Classification (1998): D.2, K.6, F.1, I.2, J.1, H.2.8, H.3, H.4

ISSN 0302-9743
ISBN 3-540-43481-X Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH

http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2002
Printed in Germany

Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Stefan Sossna
Printed on acid-free paper   SPIN 10846571   06/3142   5 4 3 2 1 0


Preface

This was the first conference of a new series devoted to the effective handling of soft issues in the design, development, and operation of computing systems. The conference brought together contributors from a range of relevant disciplines, including artificial intelligence, information systems, software engineering, and systems engineering. The keynote speakers, Piero Bonissone, Ray Paul, Sir Tony Hoare, Michael Jackson, and Derek McAuley, have interests and experience that collectively span all of these fields.

Soft issues involve information or knowledge that is uncertain, incomplete, or contradictory. Examples of where such issues arise include:

– requirements management and software quality control in software engineering,
– management of conflicting or multiple-source information in information systems,
– decision making/prediction in business management systems,
– quality control in networks and user services in telecommunications,
– traditional human rationality modeling in artificial intelligence,
– data analysis in machine learning and data mining,
– control management in engineering.

The concept of dealing with uncertainty became prominent in the artificial intelligence community nearly 20 years ago, when researchers realized that addressing uncertainty was an essential part of representing and reasoning about human knowledge in intelligent systems. The main methodologies that have emerged in this area are soft computing and computational intelligence.

It was also about 20 years ago that the notion of hard and soft systems thinking emerged from the systems community, articulated by Checkland in his seminal work on Soft Systems Methodology¹. This work has influenced information system research and practice and is beginning to have an impact on systems and software engineering.

The conference gave researchers and practitioners with an interest in soft issues an opportunity to learn from each other and to identify ways of improving the development of complex computing systems.

The conference had a strong industrial focus. In particular, all of the keynote speakers had both industrial and academic experience, and the conference concluded with a session taking an industrial perspective on soft issues. Also, the first day of the conference was integrated with the 2nd European Workshop on Computational Intelligence in Telecommunications and Multimedia, organized by the Technical Committee C of EUNITE, the European Network on Intelligent Technologies for Smart Adaptive Systems, which has a significant industrial membership. There were two EUNITE keynote speakers: Ben Azvine, chairman of the Technical Committee C, and John Bigham, who has many years’ experience in applying computational intelligence in telecommunications.

¹ Peter Checkland, Systems Thinking, Systems Practice, Wiley, 1981

The SS Titanic was chosen as a visual image for the conference because it represents the uncertainty associated with any engineering endeavor and is a reminder that the Titanic was built in Belfast – indeed just beside the conference venue. Coincidentally, the conference took place between the date the Titanic first set sail, 2 April 1912, and its sinking on 15 April 1912. Fortunately, the organizing committee is not superstitious!

A total of 24 papers were selected for presentation at the conference. We are very grateful to all authors who submitted papers and to the referees who assessed them.

We also thank Philip Houston and Paul McMenamin of Nortel Networks (Northern Ireland), whose participation in a collaborative project with the University of Ulster provided initial support and encouragement for the conference. The project, Jigsaw, was funded by the Industrial Research and Technology Unit of the Northern Ireland Department of Enterprise, Trade, and Investment. Further support for the conference was provided by other industry and government collaborators and sponsors, especially Des Vincent (CITU-NI), Gordon Bell (Liberty Technology IT), Bob Barbour and Tim Brundle (Centre for Competitiveness), Dave Allen (Charteris), and Billy McClean (Momentum).

Internally, the organization of the conference benefitted from contributions by Adrian Moore, Pat Lundy, David McSherry, Edwin Curran, Mary Shapcott, Alfons Schuster, and Kenny Adamson. Adrian Moore deserves particular mention for his imaginative design and implementation of the Web site, building on the Titanic theme. We are also very grateful to Sarah Dooley and Pauleen Marshall, whose administrative support and cheery manner were invaluable throughout. Finally, we thank Rebecca Mowat and Alfred Hofmann of Springer-Verlag for their help and advice in arranging publication of the conference proceedings.

February 2002

Dave Bustard
Weiru Liu
Roy Sterritt


Organization

SOFT-WARE 2002, the 1st International Conference on Computing in an Imperfect World, was organized by the Faculty of Informatics, University of Ulster, in cooperation with EUNITE IBA C: Telecommunication and Multimedia Committee.

Organizing Committee

Dave Bustard – General Conference Chair
Weiru Liu – EUNITE Workshop Chair
Philip Houston – Nortel Networks, Belfast Labs
Des Vincent – CITU (NI)
Billy McClean – Momentum
Ken Adamson – Industry
Edwin Curran – Applications and Finance
Pat Lundy – Local Events
David McSherry – Full Submissions
Adrian Moore – Web
Alfons Schuster – Short Submissions
Mary Shapcott – Local Arrangements
Roy Sterritt – Publicity and Proceedings

Program Committee

Behnam Azvine – BT, Ipswich, UK
Salem Benferhat – IRIT, Universite Paul Sabatier, France
Keith Bennett – Durham University, UK
Dan Berry – University of Waterloo, Canada
Jean Bezivin – University of Nantes, France
Prabir Bhattacharya – Panasonic Technologies Inc., USA
Danny Crookes – Queen’s University, UK
Janusz Granat – National Institute of Telecoms, Poland
Rachel Harrison – University of Reading, UK
Janusz Kacprzyk – Warsaw University of Technology, Poland
Stefanos Kollias – National Technical Univ. of Athens, Greece
Rudolf Kruse – Otto-von-Guericke-University of Magdeburg, Germany
Manny Lehman – Imperial College, UK
Paul Lewis – Lancaster University, UK
Xiaohui Liu – Brunel University, UK
Abe Mamdani – Imperial College, UK
Trevor Martin – University of Bristol, UK
Stephen McKearney – University of Bournemouth, UK
Andreas Pitsillides – University of Cyprus, Cyprus
Simon Parsons – University of Liverpool, UK
Henri Prade – IRIT, Universite Paul Sabatier, France
Marco Ramoni – Harvard Medical School, USA
Alessandro Saffiotti – University of Orebro, Sweden
Prakash Shenoy – University of Kansas, USA
Philippe Smets – Universite Libre de Bruxelles, Belgium
Martin Spott – BT Ipswich, UK
Frank Stowell – De Montfort University, UK
Jim Tomayko – Carnegie Mellon University, USA
Athanasios Vasilakos – University of Crete, Greece
Frans Voorbraak – University of Amsterdam, The Netherlands
Didar Zowghi – University of Technology, Sydney, Australia

Additional Reviewers

Werner Dubitzky
Sally McClean
Mike McTear
Gerard Parr
William Scanlon
Bryan Scotney
George Wilkie

Sponsoring Institutions

– University of Ulster
– EUNITE – EUropean Network on Intelligent TEchnologies for Smart Adaptive Systems
– Nortel Networks, Belfast Labs
– IRTU – Industrial Research and Technology Unit
– Liberty IT
– INCOSE
– British Computer Society
– Momentum
– Centre for Competitiveness
– Charteris
– CSPT – Centre for Software Process Technologies

Table of Contents

Technical Session 1

Overview of Fuzzy-RED in Diff-Serv Networks ..... 1
L. Rossides, C. Chrysostomou, A. Pitsillides (University of Cyprus), A. Sekercioglu (Monash University, Australia)

An Architecture for Agent-Enhanced Network Service Provisioning through SLA Negotiation ..... 14
David Chieng (Queen’s University of Belfast), Ivan Ho (University of Ulster), Alan Marshall (Queen’s University of Belfast), Gerard Parr (University of Ulster)

Facing Fault Management as It Is, Aiming for What You Would Like It to Be ..... 31
Roy Sterritt (University of Ulster)

Enabling Multimedia QoS Control with Black-Box Modelling ..... 46
Gianluca Bontempi, Gauthier Lafruit (IMEC, Belgium)

Technical Session 2

Using Markov Chains for Link Prediction in Adaptive Web Sites ..... 60
Jianhan Zhu, Jun Hong, John G. Hughes (University of Ulster)

Classification of Customer Call Data in the Presence of Concept Drift and Noise ..... 74
Michaela Black, Ray Hickey (University of Ulster)

A Learning System for Decision Support in Telecommunications ..... 88
Filip Zelezny (Czech Technical University), Jiri Zidgek (Atlantis Telecom), Olga Stepankova (Czech Technical University)

Adaptive User Modelling in an Intelligent Telephone Assistant ..... 102
Trevor P. Martin, Benham Azvine (BTexact Technologies)

Technical Session 3

A Query-Driven Anytime Algorithm for Argumentative and Abductive Reasoning ..... 114
Rolf Haenni (University of California, Los Angeles)

Proof Length as an Uncertainty Factor in ILP ..... 128
Gilles Richard, Fatima Zohra Kettaf (IRIT, Universite Paul Sabatier, France)

Paraconsistency in Object-Oriented Databases ..... 141
Rajiv Bagai (Wichita State University, US), Shellene J. Kelley (Austin College, US)

Decision Support with Imprecise Data for Consumers ..... 151
Gergely Lukacs (University of Karlsruhe, Germany)

Genetic Programming: A Parallel Approach ..... 166
Wolfgang Golubski (University of Siegen, Germany)

Software Uncertainty ..... 174
Manny M. Lehman (Imperial College, University of London, UK), J.F. Ramil (The Open University, UK)

Technical Session 4

Temporal Probabilistic Concepts from Heterogeneous Data Sequences ..... 191
Sally McClean, Bryan Scotney, Fiona Palmer (University of Ulster)

Handling Uncertainty in a Medical Study of Dietary Intake during Pregnancy ..... 206
Adele Marshall (Queen’s University, Belfast), David Bell, Roy Sterritt (University of Ulster)

Sequential Diagnosis in the Independence Bayesian Framework ..... 217
David McSherry (University of Ulster)

Static Field Approach for Pattern Classification ..... 232
Dymitr Ruta, Bogdan Gabrys (University of Paisley)

Inferring Knowledge from Frequent Patterns ..... 247
Marzena Kryszkiewicz (Warsaw University of Technology, Poland)

Anytime Possibilistic Propagation Algorithm ..... 263
Nahla Ben Amor (Institut Superieur de Gestion, Tunis), Salem Benferhat (IRIT, Universite Paul Sabatier, France), Khaled Mellouli (Institut Superieur de Gestion, Tunis)

Technical Session 5

Macro Analysis of Techniques to Deal with Uncertainty in Information Systems Development: Mapping Representational Framing Influences ..... 280
Carl Adams (University of Portsmouth, UK), David E. Avison (ESSEC Business School, France)

The Role of Emotion, Values, and Beliefs in the Construction of Innovative Work Realities ..... 300
Isabel Ramos (Escola Superior de Tecnologia e Gestao, Portugal), Daniel M. Berry (University of Waterloo, Canada), Joao A. Carvalho (Universidade do Minho, Portugal)

Managing Evolving Requirements Using eXtreme Programming ..... 315
Jim Tomayko (Carnegie Mellon University, US)

Text Summarization in Data Mining ..... 332
Colleen E. Crangle (ConverSpeech, California)

Invited Speakers

Industrial Applications of Intelligent Systems at BTexact ..... 348
Benham Azvine (BTexact Technologies, UK)

Intelligent Control of Wireless and Fixed Telecom Networks ..... 349
John Bigham (University of London, UK)

Assertions in Programming: From Scientific Theory to Engineering Practice ..... 350
Tony Hoare (Microsoft Research, Cambridge, UK)

Hybrid Soft Computing for Classification and Prediction Applications ..... 352
Piero Bonissone (General Electric Corp., Schenectady, NY, US)

Why Users Cannot ‘Get What They Want’ ..... 354
Ray Paul (Brunel University, UK)

Systems Design with the Reverend Bayes ..... 355
Derek McAuley (Marconi Labs, Cambridge, UK)

Formalism and Informality in Software Development ..... 356
Michael Jackson (Consultant, UK)

Industrial Panel

An Industrial Perspective on Soft Issues: Successes, Opportunities, and Challenges ..... 357
Industrial Panel

Author Index ..... 359


D. Bustard, W. Liu, and R. Sterritt (Eds.): Soft-Ware 2002, LNCS 2311, pp. 1–13, 2002.
© Springer-Verlag Berlin Heidelberg 2002

Overview of Fuzzy-RED in Diff-Serv Networks

L. Rossides1, C. Chrysostomou1, A. Pitsillides1, and A. Sekercioglu2

1 Department of Computer Science, University of Cyprus, 75 Kallipoleos Street, P.O. Box 20537, 1678 Nicosia, Cyprus
Phone: +357 2 892230, Fax: +357 2 892240
2 Centre for Telecommunications and Information Engineering, Monash University, Melbourne, Australia
Phone: +61 3 9905 3503, Fax: +61 3 9905 3454

Abstract. The rapid growth of the Internet and increased demand to use the Internet for time-sensitive voice and video applications necessitate the design and utilization of new Internet architectures with effective congestion control algorithms. As a result, the Diff-Serv architecture was proposed to deliver (aggregated) QoS in TCP/IP networks. Network congestion control remains a critical and high-priority issue, even for the present Internet architecture. In this paper we present Fuzzy-RED, a novel approach to Diff-Serv congestion control, and compare it with a classical RIO implementation. We believe that, with the support of fuzzy logic, we are able to achieve better differentiation of packet discarding behaviors for individual flows, and so provide better quality of service to different kinds of traffic, such as TCP/FTP traffic and TCP/Web-like traffic, whilst maintaining high utilization (goodput).

1 Introduction

The rapid growth of the Internet and increased demand to use the Internet for time-sensitive voice and video applications necessitate the design and utilization of new Internet architectures with effective congestion control algorithms. As a result, the Diff-Serv architecture was proposed [1] to deliver (aggregated) QoS in TCP/IP networks. Network congestion control remains a critical and high-priority issue, even for the present Internet architecture.

In this paper, we aim to use the reported strength of fuzzy logic (a Computational Intelligence technique) in controlling complex and highly nonlinear systems to address congestion control problems in Diff-Serv. We draw upon the vast experience, in both theoretical as well as practical terms, of Computational Intelligence Control (Fuzzy Control) in the design of the control algorithm [2]. Nowadays, we are faced with increasingly complex control problems, for which different (mathematical) modeling representations may be difficult to obtain. This difficulty has stimulated the development of alternative modeling and control techniques, which include fuzzy logic based ones. Therefore, we aim to exploit the well known advantages of fuzzy logic control [2]:

• Ability to quickly express the control structure of a system using a priori knowledge.
• Less dependence on the availability of a precise mathematical model.
• Easy handling of the inherent nonlinearities.
• Easy handling of multiple input signals.

Our approach will be to adopt the basic concepts of RED [14], which was proposed to alleviate a number of problems with the current Internet congestion control algorithms, has been widely studied, and has been adapted, in many variants, for use in the Diff-Serv architecture. Despite the good characteristics shown by RED and its variants in many situations, and the clear improvement it presents against classical droptail queue management, it has a number of drawbacks, including problems with the performance of RED under different scenarios of operation, parameter tuning, linearity of the dropping function, and the need for other input signals.

We expect that Fuzzy-RED, the proposed strategy, will be robust with respect to traffic modeling uncertainties and system nonlinearities, yet provide tight control (and as a result offer good service). It is worth pointing out that there is increasing empirical knowledge gathered about RED and its variants, and several ‘rules of thumb’ have appeared in many papers. It will be beneficial to build the Fuzzy Control rule base using this knowledge. However, in this paper we only attempt to highlight the potential of the methodology, and choose a simple Rule Base and simulation examples, but with realistic scenarios.

2 Issues on TCP/IP Congestion Control

As the growth of the Internet increases, it becomes clear that the existing congestion control solutions deployed in the Internet Transport Control Protocol (TCP) [3], [4] are increasingly becoming ineffective. It is also generally accepted that these solutions cannot easily scale up even with various proposed “fixes” [5], [6]. Also, it is worth pointing out that the User Datagram Protocol (UDP), the other transport service offered by IP Internet, offers no congestion control. However, more and more users employ UDP for the delivery of real-time video and voice services. The newly developed (also largely ad-hoc) strategies [7], [8] are also not proven to be robust and effective. Since these schemes are designed with significant non-linearities (e.g. two-phase (slow start and congestion avoidance) dynamic windows, binary feedback, additive-increase multiplicative-decrease flow control, etc.), and they are based mostly on intuition, the analysis of their closed loop behaviour is difficult, if at all possible, even for single control loop networks.
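To make the nonlinearity referred to above concrete, the sketch below shows a generic additive-increase/multiplicative-decrease (AIMD) window update of the kind TCP uses. It is a simplified illustration under assumed parameter values, not TCP's full state machine (slow start, fast retransmit, timeouts and so on are omitted).

```python
# Simplified AIMD congestion-window update (illustrative sketch, not real TCP).
def aimd_update(cwnd: float, loss_detected: bool,
                increase: float = 1.0, decrease: float = 0.5) -> float:
    """Return the new congestion window (in segments) after one RTT."""
    if loss_detected:
        # Multiplicative decrease: cut the window on a congestion indication.
        return max(1.0, cwnd * decrease)
    # Additive increase: grow by roughly one segment per RTT.
    return cwnd + increase

# Example: a single loss event interrupts steady additive growth.
cwnd = 10.0
for rtt, loss in enumerate([False, False, True, False, False]):
    cwnd = aimd_update(cwnd, loss)
    print(f"RTT {rtt}: cwnd = {cwnd:.1f}")
```

The abrupt halving followed by slow linear growth is exactly the kind of piecewise, nonlinear feedback that makes closed-loop analysis of such schemes difficult.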

Even worse, the interaction of additional non-linear feedback loops can produce unexpected and erratic behavior [9]. Empirical evidence demonstrates the poor performance and cyclic behavior of the TCP/IP Internet [10] (also confirmed analytically [11]). This is exacerbated as the link speed increases to satisfy demand (hence the bandwidth-delay product, and thus feedback delay, increases), and also as the demand on the network for better quality of service increases. Note that for wide area networks a multifractal behavior has been observed [12], and it is suggested that this behavior (cascade effect) may be related to existing network controls [13]. Based on all these facts, it is becoming clear that new approaches for congestion control must be investigated.

3 The Inadequacy of RED

The most popular algorithm used for Diff-Serv implementation is RED (Random Early Discard) [14]. RED simply sets some min and max dropping thresholds for a number of predefined classes in the router queues. In case the buffer queue size exceeds the min threshold, RED starts randomly dropping packets based on a probability depending on the queue length. If the buffer queue size exceeds the max threshold, then every packet is dropped (i.e., the drop probability is set to 1) or ECN (Explicit Congestion Notification) marked. The RED implementation for Diff-Serv defines that we have different thresholds for each class. Best effort packets have the lowest min and max thresholds and therefore they are dropped with greater probability than packets of the AF (Assured Forwarding) or EF (Expedited Forwarding) class. Also, there is the option that if an AF class packet does not comply with the rate specified, then it is reclassified as a best-effort class packet. Apart from RED, many other mechanisms such as n-RED, adaptive RED [15], BLUE [16], [17], and Three Color marking schemes were proposed for Diff-Serv queue control.
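A minimal sketch of the RED dropping rule just described is given below. The probability value is an assumption for illustration, and a real RED gateway (as in [14]) operates on an exponentially weighted average queue length and adjusts the probability between successive drops rather than using the raw queue size as done here.

```python
import random

def red_drop(avg_queue: float, min_th: float, max_th: float,
             max_p: float = 0.1) -> bool:
    """Decide whether to drop (or ECN-mark) an arriving packet."""
    if avg_queue < min_th:
        return False                      # below min threshold: never drop
    if avg_queue >= max_th:
        return True                       # above max threshold: always drop (p = 1)
    # Between the thresholds the drop probability grows with queue occupancy.
    p = max_p * (avg_queue - min_th) / (max_th - min_th)
    return random.random() < p

# Example using the thresholds of Scenario 1 later in the paper (min 23, max 69 packets).
print(red_drop(avg_queue=40, min_th=23, max_th=69))
```

In the Diff-Serv variant, each class (EF, AF, Best Effort) would simply call such a rule with its own, progressively lower, threshold pair.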

In Figure 1 we can see a simple Diff-Serv scenario where RED is used for queue control. A leaky bucket traffic shaper is used to check if the packets comply with the SLA (Service Level Agreement). If EF packets do not comply with the SLA, then they are dropped. For AF class packets, if they do not comply, then they are remapped into Best Effort class packets. Both AF and Best Effort packets share a RIO [18] queue. RIO stands for RED In/Out queue, where “In” and “Out” mean packets are in or out of the connection conformance agreement. For the AF and Best Effort classes we have different min and max thresholds. EF packets use a separate high-priority FIFO queue.

Fig. 1. Diff-Serv scenario with RED queue for control (EF, AF, and Best Effort packets pass through checking and traffic shaping; EF traffic goes to a priority queue or is discarded, while AF and Best Effort traffic share a RIO queue with min/max thresholds)

Despite the good characteristics shown by RED in many situations and the clear improvement it presents against classical droptail queue management, it has a number of drawbacks. In cases with extremely bursty traffic sources, the active queue management techniques used by RED, unfortunately, are often defeated, since queue lengths grow and shrink rapidly well before RED can react.

The inadequacy of RED can be understood more clearly by considering the operation of an ideal queue management algorithm. Consider an ideal traffic source sending packets to a sink through two routers connected via a link of capacity L Mbps (see Figure 2).

Fig. 2. Ideal scenario (a source sending at rate L Mbps to a sink through routers A and B connected by an L Mbps link, with a queue at the bottleneck)

An ideal queue management algorithm should try to maintain the correct amount of packets in the queue to keep the sending rate of the sources at L Mbps, thus achieving full 100% throughput utilization. While RED can achieve performance very close to this ideal scenario, it needs a large amount of buffer and, most importantly, correct parameterization to achieve it. The correct tuning of RED implies a “global” parameterization that is very difficult, if not impossible, to achieve, as is shown in [16]. The results presented later in this article show that Fuzzy-RED can provide such desirable performance and queue management characteristics without any special parameterization or tuning.

4 Fuzzy Logic Controlled RED

A novel approach to the RED Diff-Serv implementation is Fuzzy-RED, a fuzzy logic controlled RED queue. To implement it, we removed the fixed max and min queue thresholds from the RED queue for each class, and replaced them with dynamic, network-state-dependent thresholds calculated using a fuzzy inference engine (FIE), which can be considered as a lightweight expert system. As reported in [16], classical RED implementations with fixed thresholds cannot provide good results in the presence of dynamic network state changes, for example, the number of active sources. The FIE dynamically calculates the drop probability behavior based on two network-queue state inputs: the instantaneous queue size and the queue rate of change. In the implementation we add an FIE for each Diff-Serv class of service. The FIE uses separate linguistic rules for each class to calculate the drop probability based on the input from the queue length and queue length growth rate. Two-input FIEs can usually offer a better ability to linguistically describe the system dynamics. Therefore, we can expect that we can tune the system better, and improve the behavior of the RED queue according to our class of service policy. The dynamic way of calculating the drop probability by the FIE comes from the fact that, according to the rate of change of the queue length, the current buffer size, and the class the packet belongs to, a different set of fuzzy rules, and so inference, apply. Based on these rules and inferences, the drop probability is calculated more dynamically than in the classical RED approach.

This point can be illustrated through a visualization of the decision surfaces of the FIEs used in the Fuzzy-RED scheme. An inspection of these surfaces and the associated linguistic rules provides hints on the operation of Fuzzy-RED. The rules for the “assured” class are more aggressive about decreasing the probability of packet drop than increasing it sharply. There is only one rule that results in increasing drop probability, whereas two rules set the drop probability to zero. If we contrast this with the linguistic rules of the “best effort” class packets, we see that more rules lead to an increase in drop probability, and so more packet drops than for the assured traffic class. These rules reflect the particular views and experiences of the designer, and are easy to relate to human reasoning processes.
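The paper does not list its membership functions or rule base, so the following is only a toy sketch of a two-input fuzzy inference step in the spirit described above: queue length and queue growth rate are fuzzified, a few hypothetical per-class rules are fired, and a drop probability is obtained by a weighted average. The fuzzy sets, rules, and "aggressiveness" scaling are assumptions, not the authors' actual FIE.

```python
def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_drop_probability(queue: float, rate: float, assured: bool) -> float:
    """Toy FIE: queue is the normalised queue length in [0,1], rate its rate of change in [-1,1]."""
    # Fuzzify the two inputs (membership degrees in [0,1]).
    q_low, q_high = tri(queue, -0.5, 0.0, 0.6), tri(queue, 0.4, 1.0, 1.5)
    r_neg, r_pos = tri(rate, -2.0, -1.0, 0.2), tri(rate, -0.2, 1.0, 2.0)
    # Hypothetical rule base: (firing strength, output drop level).
    # Best-effort rules are made more aggressive than assured-class rules.
    aggressive = 0.5 if assured else 1.0
    rules = [
        (min(q_low, r_neg), 0.0),                # queue low and falling  -> no drop
        (min(q_high, r_neg), 0.2 * aggressive),  # queue high but falling -> mild drop
        (min(q_high, r_pos), 0.9 * aggressive),  # queue high and growing -> heavy drop
    ]
    num = sum(w * out for w, out in rules)
    den = sum(w for w, _ in rules)
    return num / den if den > 0 else 0.0

print(fuzzy_drop_probability(queue=0.8, rate=0.5, assured=False))  # best effort
print(fuzzy_drop_probability(queue=0.8, rate=0.5, assured=True))   # assured class
```

Even this toy version shows the key property exploited by Fuzzy-RED: for the same queue state, the assured class receives a lower drop probability than best effort, and the output adapts to both queue occupancy and its trend rather than to a fixed threshold pair.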

We expect the whole procedure to be independent of the number of active sources and thus to avoid the problems of fixed thresholds employed by other RED schemes [16]. With Fuzzy-RED, not only do we expect to avoid such situations, but also to generally provide better congestion control and better utilization of the network.

5 Simulation Results

In this section we evaluate, using simulation, the performance of Fuzzy-RED and compare it with other published results. The implementation of the traffic sources is based on the most recent version of the ns simulator (Version 2.1b8a).

Three simulation scenarios are presented. Scenario 1 was a simple scenario used to make an initial evaluation and test of Fuzzy-RED. Scenario 2 compares the behavior of RED and Fuzzy-RED using only TCP/FTP traffic, and finally in Scenario 3 we introduce web traffic, reported to test the ability of RED [19], to evaluate a simple Diff-Serv implementation using RIO and Fuzzy-RED.

5.1 Scenario 1

We have done an initial test of the performance of the Fuzzy-RED queue management using the ns simulation tool with the simple network topology shown in Figure 3 (also used by other researchers for RED performance evaluation [14]). The buffer size was set to 70 packets (max packet size 1000 bytes), the min threshold (minth) for RED was 23 packets, and the max threshold was 69 packets. The link between the two routers was set to 40 Mbps and the simulation lasted for 100 seconds. The scripts used for simulation Scenario 1 (and in all other simulation scenarios presented here) were based on the original scripts written in [14]. The rule files used for Scenario 1 were written without any previous study of how they might affect the performance of Fuzzy-RED, so they can be seen as a random selection of sets and rules for evaluating Fuzzy-RED. After an extensive series of simulations based on the Scenario 2 topology, and analysis of their results, a new set of rule base files was created. This set of files was used in all the remaining simulation scenarios (Scenarios 2 and 3) without any change, in order to show the capabilities of Fuzzy-RED in various scenarios using different parameters.

Fig. 3. Simple network topology used for the initial simulations (Router 0 and Router 1 connected by a 40 Mb/s, 20 ms delay link; all access connections are 100 Mb/s)

From the simulation results shown in Figure 4 and Figure 5, Fuzzy-RED achieves more than 99% utilization, while RED and droptail fail to achieve more than 90%. As one can see from Figure 4 and Figure 5, Fuzzy-RED presents results very close to the ideal (as presented in Figure 2).

Fig. 4. Throughput vs. time (throughput for Fuzzy-RED, RED, and Droptail over 0–100 seconds)

The throughput goes up to 99.7% of the total link capacity (40 Mbps link) and the average queue size is around half the capacity of the buffer, while maintaining a sufficient amount of packets in the queue to achieve this high throughput. While these are results from a very basic scenario (see Fig. 3), they demonstrate the dynamic abilities and capabilities of an FIE RED queue compared to a simple RED queue or a classical droptail queue.

The results presented in the following simulation scenarios show that these characteristics and abilities are maintained under all conditions without changing any parameters of Fuzzy-RED.

near 100% throughput (99.57%), see Figure 8, while packet drops are kept at very low levels (Figure 10) compared to the number of packets sent (Figure 9).

Fig. 7. Scenario 2 – Buffer size vs. time (RED and Fuzzy-RED, over 0–100 seconds)

These results show clearly that Fuzzy-RED manages to adequately control the queue size while keeping a higher throughput than RED (97.11% for RED compared with 99.57% for Fuzzy-RED).

Fig. 8. Scenario 2 – Throughput vs. time (over 0–100 seconds)

5.3 Scenario 3

In Scenario 3 we introduce a new network topology used in [19]. The purpose of this scenario is to investigate how Fuzzy-RED and RIO perform under a Diff-Serv scenario. To simulate a basic Diff-Serv environment we introduce a combination of web-like traffic sources and TCP/FTP sources. Half of the sources are TCP/FTP and the other half TCP/Web-like traffic. Traffic from the Web-like sources is tagged as assured class traffic and the FTP traffic as best effort. We run the simulation three times for 5000 seconds each in order to enhance the validity of the results. The network topology is presented in Figure 11. We use TCP/SACK with a TCP window of 100 packets. Each packet has a size of 1514 bytes. For the Droptail queue we define a buffer size of 226 packets. We use AQM (Fuzzy-RED or RIO) in the queues of the bottleneck link between router 1 and router 2. All other links have a simple droptail queue. The importance of this scenario is that it compares and evaluates not simply the performance of an algorithm but the performance in implementing a new IP network architecture, Diff-Serv. This means that we want to check whether Fuzzy-RED can provide the necessary congestion control and differentiation and ensure acceptable QoS in a Diff-Serv network. We also attempt to compare Fuzzy-RED with RIO (a RED based implementation of Diff-Serv).

Fig. 9. Scenario 2 – Packets transmitted

The choice of distributions and parameters is based on [19] and is summarized in Table 1. The implementation of the traffic sources is based on the most recent version of the ns simulator (Version 2.1b8a). All results presented for this scenario were extracted from the ns simulation trace file.

Table 1. Distributions and Parameters

                     Distribution   Mean     Shape
Inter-page time      Pareto         50 ms    2
Objects per page     Pareto         4        1.2
Inter-object time    Pareto         0.5 ms   1.5
Object size          Pareto         12 KB    1.2


Fig. 10. Scenario 2 – Packet drops (per class over 0–100 seconds; the legend includes the Best Effort class)

Fig. 11. Scenario 3 network topology

Although this scenario is a simple and basic one, it can be the starting point for investigating the abilities of Fuzzy-RED in a Diff-Serv network. The results presented here are limited to just two graphs, since the purpose of this paper is to present the potential of Fuzzy-RED in providing congestion control comparable to RIO in Diff-Serv networks, and not a detailed description of its performance.


From the results we see that although Fuzzy-RED and RIO show similar behaviour, Fuzzy-RED appears to control the flow rate across the network better. From Figure 12 we see the throughput behaviour. By throughput here we mean the rate at which traffic arrives at a link (in this case the link connecting router 1 with router 2) before entering any queue. So it is the total traffic arriving at router 1 (both FTP and web traffic). From the graph, RIO seems to stabilise its throughput around 10.25 Mbps (note that the link speed is limited to 10 Mbit/sec). This means that it cannot effectively control the rate at which the sources are sending traffic (according to the ideal scenario shown in Figure 2). Around t = 2500 sec we see a small ascending step. At that point the traffic increases further, from 10.1 Mbps to 10.25 Mbps. This means that we have an increase in drops and therefore a decrease in goodput. We define goodput as the traffic rate traversing a link minus all dropped packets and all retransmitted packets.
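As a small worked illustration of the goodput definition just given (the counter values are invented, not figures from the paper's trace files):

```python
def goodput_mbps(arrived_mbit: float, dropped_mbit: float,
                 retransmitted_mbit: float, interval_s: float) -> float:
    """Goodput = traffic traversing the link minus drops and retransmissions, per second."""
    return (arrived_mbit - dropped_mbit - retransmitted_mbit) / interval_s

# Hypothetical counters over a 10-second interval on a 10 Mbps bottleneck.
print(goodput_mbps(arrived_mbit=101.0, dropped_mbit=1.5,
                   retransmitted_mbit=0.5, interval_s=10.0))  # -> 9.9 Mbps
```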

Fig. 12. Scenario 3 – Throughput vs. time (Mbps over 0–5000 seconds)

From Figure 13 we see that Fuzzy-RED delivers a steady goodput of around 9.9 Mbps, while RIO shows a decrease from 9.9 to 9.8 Mbps due to dropped packets that create retransmissions. The difference is not as important as the fact that Fuzzy-RED seems to provide a more stable behaviour. This result, along with the previous one, encourages us to proceed with further testing in the future.

6 Conclusions

Current TCP/IP congestion control algorithms cannot efficiently support the new and emerging services needed by the Internet community. RED proposes a solution; however, in cases with extremely bursty sources (as is the case in the Internet) it fails to effectively control congestion. Diff-Serv using RIO (RED In/Out) was proposed to offer differentiation of services and to control congestion. Our proposal of implementing Diff-Serv using a fuzzy logic controlled queue is a novel, effective, robust, and

flexible approach and avoids the necessity of any special parameterization or tuning, apart from linguistic interpretation of the system behavior.

Fig. 13. Scenario 3 – Goodput vs. time (Mbps over 0–5000 seconds)

It can provide similar or better performance compared to RIO without any retuning or parameterization. Specifically, in Scenario 3 we see that Fuzzy-RED, using the same rules and values in the fuzzy sets (i.e. no finer tuning), has achieved equal or better performance than RIO, for which we use the optimal parameterization discussed in paper [19]. From this scenario we see that Fuzzy-RED can perform equally well using homogeneous or heterogeneous traffic sources (in this case TCP/FTP traffic and TCP/Web-like traffic) without any change in the way we define it or any special tuning. We believe that with further refinement of the rule base through mathematical analysis, or self-tuning, our algorithm can achieve even better results.

In future work we will investigate further performance issues such as fairness among traffic classes, packet drops per class (a QoS parameter), utilization, and goodput under more complex scenarios. We expect to see whether Fuzzy-RED can be used to provide the necessary QoS needed in a Diff-Serv network. From these results, and based on our past experience with successful implementations of fuzzy logic control [20], [21], we are very optimistic that this proposal will offer significant improvements in controlling congestion in TCP/IP Diff-Serv networks.

References

1. S. Blake et al., “An Architecture for Differentiated Services”, RFC 2475, December 1998.
2. W. Pedrycz, A. V. Vasilakos (Eds.), Computational Intelligence in Telecommunications Networks, CRC Press, ISBN: 0-8493-1075-X, September 2000.
3. V. Jacobson, Congestion Avoidance and Control, ACM SIGCOMM ’88, 1988.
4. W. Stevens, “TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms”, RFC 2001, January 1997.
5. W. Stevens, TCP/IP Illustrated, Volume 1: The Protocols, Addison-Wesley, 1994.
6. V. Jacobson, R. Braden, D. Borman, TCP Extensions for High Performance, RFC 1323, May 1992.
7. K.K. Ramakrishnan and S. Floyd, A Proposal to Add Explicit Congestion Notification (ECN) to IP, draft-kksjf-ecn-03.txt, October 1998 (RFC 2481, January 1999).

8. Braden et al., Recommendations on Queue Management and Congestion Avoidance in the Internet, RFC 2309, April 1998.
9. C.E. Rohrs, R.A. Berry, and S.J. O’Halek, A Control Engineer’s Look at ATM Congestion Avoidance, IEEE Global Telecommunications Conference GLOBECOM ’95, Singapore, 1995.
10. J. Martin, A. Nilsson, The Evolution of Congestion Control in TCP/IP: From Reactive Windows to Preventive Flow Control, CACC Technical Report TR-97/11, North Carolina State University, August 1997.
11. T.V. Lakshman and U. Madhow, The Performance of TCP/IP for Networks with High Bandwidth-Delay Products and Random Loss, IEEE/ACM Transactions on Networking, vol. 5, pp. 336–350, June 1997.
12. A. Feldmann, A.C. Gilbert, W. Willinger, “Data Networks as Cascades: Investigating the Multifractal Nature of Internet WAN Traffic”, SIGCOMM ’98, Vancouver, 1998.
13. A. Feldmann, A.C. Gilbert, P. Huang, W. Willinger, “Dynamics of IP Traffic: A Study of the Role of Variability and the Impact of Control”, Proceedings of ACM SIGCOMM 2000.
14. S. Floyd and V. Jacobson, “Random Early Detection Gateways for Congestion Avoidance”, IEEE/ACM Transactions on Networking, 1(4): pp. 397–413, August 1993.
15. W. Feng, D. Kandlur, D. Saha, and K. Shin, “A Self-Configuring RED Gateway”, IEEE INFOCOM ’99, New York, March 1999.
16. Wu-chang Feng, “Improving Internet Congestion Control and Queue Management Algorithms”, PhD Dissertation, University of Michigan, 1999.
17. W. Feng, D. Kandlur, D. Saha, and K. Shin, “Blue: A New Class of Active Queue Management Algorithms”, Tech. Rep. UM CSE-TR-387-99, 1999.
18. Clark D. and Fang W., Explicit Allocation of Best Effort Packet Delivery Service, IEEE/ACM Transactions on Networking, Volume 6, No. 4, pp. 362–373, August 1998.
19. G. Iannaccone, C. Brandauer, T. Ziegler, C. Diot, S. Fdida, M. May, Comparison of Tail Drop and Active Queue Management Performance for Bulk-Data and Web-like Internet Traffic, 6th IEEE Symposium on Computers and Communications, Hammamet, July 2001.
20. A. Pitsillides, A. Sekercioglu, G. Ramamurthy, “Effective Control of Traffic Flow in ATM Networks using Fuzzy Explicit Rate Marking (FERM)”, IEEE JSAC, Vol. 15, Issue 2, Feb 1997, pp. 209–225.
21. A. Pitsillides, A. Sekercioglu, “Congestion Control”, in Computational Intelligence in Telecommunications Networks (Ed. W. Pedrycz, A. V. Vasilakos), CRC Press, ISBN: 0-8493-1075-X, September 2000.


D. Bustard, W. Liu, and R. Sterritt (Eds.): Soft-Ware 2002, LNCS 2311, pp. 14–30, 2002.
© Springer-Verlag Berlin Heidelberg 2002

An Architecture for Agent-Enhanced Network Service Provisioning through SLA Negotiation

David Chieng1, Ivan Ho2, Alan Marshall1, and Gerard Parr2

1 The Advanced Telecommunication Research Laboratory, School of Electrical and Electronic Engineering, Queen’s University of Belfast, Ashby Building, Stranmillis Road, Belfast, BT9 5AH, UK
d.chieng, [email protected]
2 Internet Research Group, School of Information and Software Engineering, University of Ulster, Coleraine, BT52 1SA, UK
wk.ho, [email protected]

Abstract. This paper focuses on two main areas. We first investigate various aspects of subscription and session Service Level Agreement (SLA) issues, such as negotiating and setting up network services with Quality of Service (QoS) and pricing preferences. We then introduce an agent-enhanced service architecture that facilitates these services. A prototype system consisting of real-time agents that represent various network stakeholders was developed. A novel approach is presented where the agent system is allowed to communicate with a simulated network. This allows the functional and dynamic behaviour of the network to be investigated under various agent-supported scenarios. This paper also highlights the effects of SLA negotiation and dynamic pricing in a competitive multi-operator network environment.

1 Introduction

The increasing demand to provide Quality of Service (QoS) over Internet-type networks, which are essentially best effort in nature, has led to the emergence of various architectures and signalling schemes such as Differentiated Services (DiffServ) [1], Multi Protocol Label Switching (MPLS) [2], IntServ’s Resource Reservation Protocol (RSVP) [3], and Subnet Bandwidth Management [4]. For example, Microsoft has introduced Winsock2 GQoS (Generic QoS) APIs in their Windows OS that provide RSVP signalling, QoS policy support, and invocation of traffic control [5]. CISCO and other vendors have also built routers and switches that support RSVP, DiffServ, and MPLS capabilities [6]. Over the next few years we are going to witness a rapid transformation in the functions provided by network infrastructures, from providing mere connectivity to a wider range of tangible and flexible services involving QoS.

However, in today’s already complex network environment, creating, provisioning, and managing such services is a great challenge. First, service and network providers have to deal with a myriad of user requests that come with diverse Service Level Agreements (SLAs) or QoS requirements. These SLAs then need to be mapped to respective policy and QoS schemes, traffic engineering protocols, and so on, in accordance with the underlying technologies. This further involves dynamic control and reconfiguration management, monitoring, as well as other higher-level issues such as service management, accounting, and billing. Matching these service requirements to a set of control mechanisms in a consistent manner remains an area of weakness within the existing IP QoS architectures. These processes rely on the underlying network technologies and involve the cooperation of all network layers from top to bottom, as well as every network element from end to end.

Issues regarding SLAs arise due to the need to maximize customer satisfaction and service reliability. According to [7], many end users and providers in general are still unable to specify SLAs in a way that benefits both parties. Very often, the service or network providers will overprovision their networks, which leads to service degradation, or alternatively they may fail to provide services to the best of their networks’ capabilities. Shortcomings cover a wide range of issues such as content queries, QoS preferences, session preferences, pricing and billing preferences, security options, and so on. To set up a desired service from end to end, seamless SLA transactions need to be carried out between users, network providers, and other service providers. We propose an agent-enhanced framework that facilitates these services. Autonomous agents offer an attractive alternative approach for handling these tasks. For example, a service provider agent can play a major role in guiding and deciphering users’ requests, and is also able to respond quickly and effectively. These issues are critical, as the competitiveness of future providers relies not only on the diversity of the services they can offer, but also on their ability to meet customers’ requirements.

The rest of the paper is organized as follows: Section 2 discusses related research in this area. Section 3 investigates various aspects of subscription and session SLA issues for a VLL service, such as guaranteed bandwidth, service start time, session time, and pricing. A generic SLA utility model for multimedia services is also explained. Section 4 describes the architecture and the agent system implementation. Section 5 provides a brief service brokering demonstration using an industrial agent platform. In Section 6 a number of case studies on bandwidth negotiation and dynamic pricing are described.

2 Related Work

Our work is motivated by a number of research themes, especially the Open Signalling community (OPENSIG)’s standard development project, IEEE P1520 [8], and the Telecommunications Information Networking Architecture Consortium (TINA-C) [9] initiatives. IEEE P1520 is driving the concept of open signalling and network programmability. The idea is to establish an open architecture between network control and management functions. This enhances programmability for diverse kinds of networks with diverse kinds of functional requirements, interoperability, and integration with legacy systems. TINA-C promotes interoperability, portability, and reusability of software components so that they are independent of specific underlying technologies. The aim is to share the burden of creating and managing a complex system among different business stakeholders, such as consumers, service providers, and connectivity providers. The authors in [11] proposed a QoS management architecture that employs distributed agents to establish and maintain the QoS requirements for various multimedia applications. Here, QoS specification is categorized into two main abstraction levels: application and system. The authors in [10] introduced Virtual Network Service (VNS), which uses a virtualisation technique to customize and support individual Virtual Private Network (VPN) QoS levels. While this work concentrates on the underlying resource provisioning management and mechanisms, our work focuses on the aspects of service provisioning using agents on top of it, which is complementary. The capability of current RSVP signalling is extended in [12], where users are allowed to reserve bandwidth or connections in advance so that blocking probability can be reduced. The authors in [13] offered a similar idea but use reservation agents to help clients reserve end-to-end network resources. The Resource Negotiation and Pricing Protocol (RNAP) proposed by [14] enables network service providers and users to negotiate service availability, price quotation, and charging information per application. This work generally supports the use of agents and demonstrates how agents can enhance flexibility in provisioning network services. In our work we extend the research by looking at the effects of allowing SLA negotiation and the results of dynamic pricing in a competitive multi-operator network environment.

3 Service Level Agreement

The definition of SLAs or SLSs is the first step towards QoS provisioning. It is essential to specify the SLAs and SLSs between ISPs and their customers, and between their peers, with assurance that these agreements can be met. A Service Level Agreement (SLA) provides a means of quantifying service definitions. In the networking environment, it specifies what an end user wants and what a provider is committing to provide. The definitions of an SLA differ at the business, application, and network level [15]. Business level SLAs involve issues such as pricing schemes and contracts. Application level SLAs are concerned with the issues of server availability (e.g., 99.9% during normal hours and 99.7% during other hours), response time, service session duration, and so on. Network level SLAs (also often referred to as Service Level Specifications (SLSs)) involve packet level flow parameters such as throughput requirements or bandwidth, end-to-end latency, packet loss ratio/percentage, error rate, and jitter. In this work, we concentrate on the SLA issues involved in establishing a Virtual Leased Line (VLL) type of service over IP networks. The following describes some basic SLA parameters considered in our framework during a service request [16].

Guaranteed Bandwidth (bi) for request i is the amount of guaranteed bandwidth desired by this service. b may be the min or mean guaranteed bandwidth depending on the user’s application requirement and the network provider’s policy. This parameter is considered due to the ease of its configuration and also because it is the single most important factor that affects other lower level QoS issues such as delay and jitter. For our prototype system, the guaranteed bandwidth can be quantified in units of 1 kb, 10 kb, and so on.

Reservation Start Time (Tsi) is the time when this service needs to be activated. If a user requires an instant reservation, this parameter is simply assigned to the current time. A user can also schedule the service in advance. Any user who accesses the

network before his or her reservation start time may only be given the best effort service.

Non-preemptable Session (Ti) is the duration required for this guaranteed service. When this duration expires, the connection automatically goes into preemptable mode, where the bandwidth can no longer be guaranteed. The user needs to re-negotiate if he or she wishes to extend the reservation session. This parameter is typically used for Video on Demand (VoD) and news broadcast type services where the service session is known a priori.

Price (Pi). It is believed that value-added services and applications are best delivered and billed on an individual or per-transaction basis. Pi can be the maximum price a user is willing to pay for this service, which may represent the aggregated cost quoted by the various parties involved in setting up this service. For example, the total charge for a VoD service can be the sum of the connection charge imposed by the network provider, the content provider’s video delivery charge, the service provider’s value-added charge, etc. Alternatively the service can also be charged on a usage basis, such as cost per kbps per min.

Rules (Rules_i) contain a user's specified preferences and priorities. This is useful at times when not all the required SLA parameters can be granted. The user can specify which parameter has priority or which parameter is tolerable.
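As a minimal illustration only (not part of the original framework), a service request carrying these parameters could be represented as a simple Java structure; the class and field names below are hypothetical.

// Hypothetical sketch of a VLL service request carrying the basic SLA
// parameters described above (guaranteed bandwidth, reservation start time,
// non-preemptable session, price ceiling and user rules).
import java.util.Date;
import java.util.List;

public class VllRequest {
    final int requestId;
    final int guaranteedBandwidthUnits; // b_i, e.g. in units of 64 kbps
    final Date reservationStartTime;    // Ts_i, the current time for instant reservations
    final long nonPreemptableSessionMs; // T_i, duration of the guaranteed session
    final double maxPrice;              // P_i, maximum acceptable aggregated price
    final List<String> rules;           // user preferences/priorities, e.g. "bandwidth=priority"

    public VllRequest(int requestId, int bw, Date start, long sessionMs,
                      double maxPrice, List<String> rules) {
        this.requestId = requestId;
        this.guaranteedBandwidthUnits = bw;
        this.reservationStartTime = start;
        this.nonPreemptableSessionMs = sessionMs;
        this.maxPrice = maxPrice;
        this.rules = rules;
    }
}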

3.1 SLA Utility Model

The SLA Management utility model proposed by [17] is adopted in this work (Fig. 1).

Fig. 1. General Utility Model. Each service utility u_i(q_i) offered by the service provider is mapped to a resource profile r_i(q_i); the SLA management objective is to maximize user satisfaction, resource usage and profit, i.e. the sum of the service utilities \sum_i u_i(q_i), subject to the resource constraint \sum_i r_i(q_i) \le C.

This is a mathematical model for describing the management and control aspects of SLAs for multimedia services. The utility model formulates the adaptive SLA management problem as integer programming. It provides a unified and computationally feasible approach to making session request/admission control, quality selection/adaptation, and resource request/allocation decisions.

For a user i, a service utility can be defined as u_i(q_i), where q_i is the service quality requested by this session. This utility then needs to be mapped to the amount of resource usage, r_i(q_i). The service quality represents the QoS level requested by a user, which then needs to be mapped to bandwidth, CPU or buffering usage. In this work, however, only the bandwidth resource is considered. The scope of the total utility U is infinite, but the total amount of resource available, i.e. the link capacity C, is finite. The objective of a service provider is to maximize the service utility objective function U:

U = \max \sum_{i=1}^{n} u_i(q_i) = \max \sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} u_{ij}    (1)

\sum_{j=1}^{m} x_{ij} = 1  and  x_{ij} \in \{0, 1\}    (2)

where n = total number of service sessions and m = total number of service quality options. The total resource usage can be represented by:

R = \sum_{i=1}^{n} r_i(q_i) = \sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} r_{ij} \le C    (3)

Equation (2) means that a user can only choose one service quality at a time. Equations (1) and (3) mean that the problem of SLA management for a multimedia service provider is to allocate the resource required by each customer while maximizing the service utility under the resource constraint. From the SLA parameters described in the previous section, a user request i for a VLL service can be represented by the extended utility function [16]:

u_i(b_i, Ts_i, T_i, P_i, Rules_i)    (4)

After being mapped to resource usage, this becomes:

r_i(b_i, Ts_i, T_i)    (5)
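Before turning to the time-dependent quantities, the selection problem in equations (1)-(3) can be made concrete with a small sketch that exhaustively enumerates one quality option per session and keeps the feasible assignment with the highest total utility. The utility values, resource requirements and capacity used here are illustrative and not taken from the paper.

// Minimal sketch of the SLA utility maximisation in equations (1)-(3):
// choose exactly one quality option j per session i (x_ij = 1) so that the
// total utility is maximised subject to the sum of resources <= C.
public class UtilityModelSketch {
    static double bestUtility = -1;

    // u[i][j] = utility of session i at quality j, r[i][j] = resource it needs
    static void search(double[][] u, double[][] r, double capacity,
                       int i, double utilSoFar, double resSoFar) {
        if (resSoFar > capacity) return;            // constraint (3) violated
        if (i == u.length) {                        // all sessions assigned a quality
            if (utilSoFar > bestUtility) bestUtility = utilSoFar;
            return;
        }
        for (int j = 0; j < u[i].length; j++) {     // constraint (2): one option per session
            search(u, r, capacity, i + 1, utilSoFar + u[i][j], resSoFar + r[i][j]);
        }
    }

    public static void main(String[] args) {
        double[][] u = {{1.0, 0.6, 0.3}, {0.9, 0.5, 0.2}};  // illustrative utilities
        double[][] r = {{10, 6, 2},      {8,   5, 2}};       // illustrative bandwidth units
        search(u, r, 12.0, 0, 0.0, 0.0);                     // C = 12 units
        System.out.println("Maximum total utility U = " + bestUtility);
    }
}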

From [12], a pre-emption priority p(t) is introduced. When p(t) = 1 the session bandwidth is guaranteed; when p(t) = 0 the session bandwidth is not guaranteed. This can be illustrated by:

p_i(t) = \begin{cases} 1, & Ts_i < t < Ts_i + T_i \\ 0, & t < Ts_i \text{ or } t > Ts_i + T_i \end{cases}    (6)

If we consider the case of a single link, where C = total link capacity, the total reserved bandwidth resource at time t, R(t), can be represented as:

R(t) = r_1 + r_2 + r_3 + r_4 + ... + r_n = b_1 p_1(t) + b_2 p_2(t) + ... + b_n p_n(t) = \sum_{i=1}^{n} b_i p_i(t)    (7)

where n = number of active VLLs at time t. This allows the network provider to know the reserved bandwidth load both at present and in advance; a minimal sketch of this computation is given below.
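A minimal sketch of equations (6) and (7) follows; the class and method names are hypothetical and the figures in the example are illustrative.

// Sketch of equations (6) and (7): p_i(t) is 1 only inside the
// non-preemptable window (Ts_i, Ts_i + T_i); R(t) sums b_i * p_i(t).
public class ReservedLoad {
    static class Vll {
        double bandwidth; // b_i, guaranteed bandwidth units
        double start;     // Ts_i, reservation start time
        double duration;  // T_i, non-preemptable session length

        Vll(double bandwidth, double start, double duration) {
            this.bandwidth = bandwidth;
            this.start = start;
            this.duration = duration;
        }

        int preemptionPriority(double t) {             // equation (6)
            return (t > start && t < start + duration) ? 1 : 0;
        }
    }

    static double reservedLoad(Vll[] vlls, double t) { // equation (7)
        double r = 0.0;
        for (Vll v : vlls) r += v.bandwidth * v.preemptionPriority(t);
        return r;
    }

    public static void main(String[] args) {
        Vll[] vlls = { new Vll(10, 0, 300), new Vll(30, 100, 600), new Vll(5, 500, 100) };
        System.out.println("R(200) = " + reservedLoad(vlls, 200)); // 10 + 30 = 40 units
    }
}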

In order to provide an end-to-end bandwidth reservation facility (immediate or future), three sets of databases are required.

The User Service Database (USD) stores individual customer profiles and information, particularly the agreed SLAs on a per service session basis. This information is essential for the resource manager to manage individual user sessions, such as when to guarantee and when to preempt resources. This database can also be utilized for billing purposes.

The Resource Reservation Table (RRT) provides a form of lookup table for the network provider to allocate network resources (bandwidth) to new requests. It tells the network provider the current and future available resources on any link; its time granularity can be defined in minutes or seconds, depending on the network provider's policy. Similarly, the minimum unit of guaranteed bandwidth can be defined by the provider's policy.

The Path Table (PT) stores the distinct hop-by-hop information from an ingress point to an egress point and vice versa. If there is an alternative path, another path table is created. These tables are linked to RRTs.

Fig. 2. RRT and Reservation Processes. The reserved bandwidth R(t) is plotted against time up to the link capacity C; request R_x (bandwidth b_x, start time Ts_x, duration T_x) fails because the reservation would exceed the available bandwidth during part of the requested period, whereas request R_y succeeds.

Figure 2 illustrates the reservation table and how reservation processes take place at a link for requests x and y. It can be observed that request u_x's guaranteed bandwidth could not be honoured throughout the requested time period T_x. In this case the requester has a number of options: reduce the amount of bandwidth requested, postpone the reservation start time, decrease the non-preemptable session duration, or accept the compromise that for a short period their bandwidth will drop off. A minimal sketch of this admission check is given below.
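The admission check sketched below works over a discretised Resource Reservation Table (one reserved-bandwidth value per time slot); the class name, slot granularity and example figures are assumptions for illustration only.

// Sketch of an RRT-based admission check: a request (bandwidth, slotStart, slotEnd)
// succeeds only if adding the bandwidth to every slot of the requested window
// keeps the reserved bandwidth at or below the link capacity C.
public class RrtSketch {
    final double capacity;   // link capacity C
    final double[] reserved; // reserved bandwidth per time slot

    RrtSketch(double capacity, int slots) {
        this.capacity = capacity;
        this.reserved = new double[slots];
    }

    boolean reserve(double bandwidth, int slotStart, int slotEnd) {
        for (int s = slotStart; s < slotEnd; s++) {
            if (reserved[s] + bandwidth > capacity) return false; // fail, renegotiate
        }
        for (int s = slotStart; s < slotEnd; s++) reserved[s] += bandwidth;
        return true;                                              // success
    }

    public static void main(String[] args) {
        RrtSketch rrt = new RrtSketch(100.0, 24);   // e.g. 24 slots of a chosen granularity
        System.out.println(rrt.reserve(60, 2, 10)); // true
        System.out.println(rrt.reserve(60, 5, 8));  // false: would exceed C in slots 5-7
        System.out.println(rrt.reserve(30, 5, 8));  // true after the failed attempt
    }
}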

This system offers greater flexibility in resource reservation compared to conventional RSVP requests. In RSVP, the reservation process will fail if even a single criterion, such as the required bandwidth, is not available. In addition, the proposed mechanism does not pose the scalability problems experienced in RSVP, as the reservation states and SLAs are stored in a centralized domain server (the network/resource manager server) and not in the routers.

Unlike existing resource reservation procedures, which are static and normally lack quantitative knowledge of traffic statistics, this approach provides a more accountable resource management scheme. Furthermore, the resource allocated in existing systems is based on initial availability and does not take into account changes in future resource availability.


4 Agent-Enhanced Service Provisioning Architecture

Software agents offer many advantages in this kind of environment. Agents can carry out expertise brokering tasks on behalf of the end users, the service providers and the network providers. Agents can negotiate services at service start-up or even when a service is in progress. Agents are particularly suited to tasks where fast decision-making is critical. These satisfy the two most important aspects of performance in an SLA: availability and responsiveness [15]. The following attributes highlight the capabilities of agents [18].

Autonomy. Agents can be both responsive and proactive. They are able to carry out tasks autonomously on their owners' behalf under pre-defined rules or tasks. The level of their intelligence depends on the given roles or tasks.

Communication. With this ability, negotiations can be established between two agents. FIPA's [19] Agent Communication Language (ACL) has become the standard communication language for agents. An ACL ontology for VPNs has also been defined within FIPA.

Cooperation. Agents can cooperate to achieve a single task. Hence, agents representing end users, service providers and network providers are able to cooperate to set up an end-to-end service, e.g. a VPN that spans multiple domains. Within a domain, agents also enhance coordination between nodes, which is lacking in most current systems.

Mobility. Java-based agents can migrate across heterogeneous networks and platforms. This attribute differentiates mobile agents from other forms of static agents. Mobile agents can migrate their executions and computation processes to remote hosts. This saves shared and local resources such as bandwidth and CPU usage compared to conventional client/server systems. Thus, intensive SLA negotiation processes can be migrated to the service provider's or network provider's domain.

The value-added services provided by the agent system have been developed using the Phoenix v1.3 APIs [20]. Phoenix is a framework for developing, integrating and deploying distributed, multi-platform, and highly scalable Web/Java-based applications. Being 100% Java-based, Phoenix is object-oriented in nature and platform independent. The Phoenix Core Services offer various functions required in our framework, i.e. service/user administration, session management, resource management, event management, service routing, logging, billing, template expansion, and customization. The Phoenix Engine is basically a threaded control program that provides the runtime environment for our servlet-based agents. The simplicity of the Phoenix framework allows fast prototyping and development of multi-tiered, highly customisable applications or services. Figure 3 shows our overall prototype system architecture. This represents a simplified network marketplace consisting of user, network and content provider domains.

For our prototype system, the agents are built using Java 2 (JDK 1.2.2) and the J2EE Servlet API [21]. These agents communicate via HTTP v1.1 [22] between different Phoenix Engines/virtual machines (JVMs), and via local Java method calls within the same Phoenix Engine/VM. Port 8888 is reserved as the Agents' Communication Channel (ACC). HTTP is preferred due to its accessibility in the current web environment. The prototype system also incorporates some components from the Open Agent Middleware (OAM) [23] developed by Fujitsu Laboratories Ltd. that allow dynamic, flexible and robust operation and management of distributed service components via mediator agents. This realizes service plug-and-play, component repository management, service access brokering, dynamic customization according to user preferences, and so on.
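As a rough, hypothetical illustration of a message exchange over the HTTP-based Agent Communication Channel, the following sketch posts a plain-text request to a peer agent listening on port 8888; the host name, servlet path and message format are invented for the example and do not reproduce the Phoenix or FIPA ACL APIs.

// Hypothetical sketch of an agent-to-agent message over HTTP/1.1 on the
// Agent Communication Channel port (8888). The receiving agent would be a
// servlet that parses the body and replies in its HTTP response.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class AgentChannelSketch {
    public static String send(String host, String agentPath, String message) throws Exception {
        URL url = new URL("http", host, 8888, agentPath);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/plain");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(message.getBytes("UTF-8"));
        }
        StringBuilder reply = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) reply.append(line).append('\n');
        }
        return reply.toString();
    }

    public static void main(String[] args) throws Exception {
        // Example: a UA asking an ASPA to negotiate a VLL (hypothetical message format).
        System.out.println(send("aspa.example.net", "/aspa",
                "request=VLL;bandwidth=30;start=now;session=600;maxPrice=20"));
    }
}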

Fig. 3. Agent-Enhanced Architecture. The user domain (user terminal with User GUI, UA and TA), the network domain (network provider with NA, SM, RM and RAs) and the content provider domain (content server with CPA and TA), together with the ASPA, are linked by the HTTP-based Agent Communication Channel and by stream interfaces; the network layer runs in a simulation environment on the Telecom Lab LAN.

In operation, the UA (User Agent) first receives requests from users to establish a service. The ASPA (Access Service Provider Agent) acts as the central contact point for all the authorized UAs, CPAs (Content Provider Agents) and NAs (Network Agents). The tasks undertaken by the ASPA include brokering, scheduling and pricing content delivery with the CPA. It also facilitates connection configuration, reconfiguration and teardown processes with the NA. OAM Mediator components are incorporated within the ASPA to provide the brokering facility. The NA is responsible for mapping user level SLAs into network level SLAs. The SM (Service Manager) manages individual user sessions, i.e. the USD, and the RM (Resource Manager) manages the network resources, which include the collection of PTs and RRTs (Resource Reservation Tables) within its domain.

At the element layer, we introduce an RA (Router Agent) that performs router configurations, flow management and control functions such as packet classification, shaping, policing, buffering, scheduling and performance monitoring according to the required SLA. The TA (Terminal Agent) manages local hardware and software components at the end system, such as display properties, RAM, resources, drivers, MAC/IP addresses, the applications library, etc. In the current prototype, most of the functions residing within the network layer and below are implemented in a simulation environment.


5 Service Brokering Demonstration

Figure 4 illustrates various stakeholders involved in our prototype system scenario. In this framework, a novel approach is taken where the real-time agents are allowed to communicate with a simulated network model via TCP/IP socket interfaces.

Fig. 4. Service Brokering Scenario. Within the Phoenix Engine, the UA servlet, the ASP Agent servlet (with its Mediator), the VoD1, VoD2, Music and Database Agent servlets, the Network Agent servlet and a Manager/Monitoring Agent servlet interact with a web browser front end; the Network Agent connects over TCP/IP, through a BONeS interface module, to the BONeS network manager module and the simulated network elements.

The Video on Demand (VoD), Music, and Database Agents have earlier registered their services with the ASP agent. When first contacted by the UA, the ASPA offers users a range of services to which they can subscribe. For example, if a VoD service is chosen, the ASPA will first look up its database for registered content providers that match the user's preferences. If a user chooses to 'subscribe', the ASPA will propagate the request to the target agent, which then takes the user through a VoD subscription process.

The experimental VoD demonstration system allows users to select a movie title, desired quality, desired movie start time, the maximum price they are willing to pay, tolerances, and so on. Here, the service quality q_i of a video session can be classified as high, medium or low. The chosen service quality will then be mapped to the amount of resource, r_i(q_i), required. The UA then encodes these preferences and sends them to the ASPA. The process is summarised in Figure 5.


Fig. 5. Negotiation and Configuration. The UA submits a request to the ASPA (1: submit request/negotiate), which negotiates with the VOD1 agent (2: submit request/re-negotiate; 3: request accepted/request denied) and with the NA for the connection (4: request/re-negotiate connection; 5: connection granted/connection failed) before acknowledging the user (6: ack user).

5.1 Bandwidth Negotiation Evaluation

For this scenario, we assume there is only one link and only bandwidth is negotiable. The following simulation parameters were applied.

Link capacity, C = 100 Mbps.

User requests, u(t), arrived according to a Poisson distribution with a mean request arrival rate, and all requested immediate connections.

Exponential mean session duration, T = 300 s.

Users' requested VLL bandwidth units, B_r, follow an exponential random distribution with minimum requirement B_r^min = 1 unit and a maximum cut-off limit B_r^max; a request that exceeds B_r^max is redrawn until it satisfies the limit. If 1 unit is assumed to be 64 kbps, the requested bandwidth B_r ranges between 64 and 10000 kbps. This emulates the demand for different qualities of voice, hi-fi music, and video telephony up to high quality video conferencing or VoD sessions.

Users' bandwidth tolerance is varied from 0% (not negotiable) up to 100% (accept any bandwidth, or simply best effort).

Arrival rates, requested bandwidth and session duration are mutually independent. The data was collected over a period of 120000 s, or 33.3 hours, of simulation time. Here, bandwidth negotiations only occurred at the resource limit: when the requested bandwidth exceeds the available bandwidth, r_i(t) + R(t) ≥ C, over the requested session Ts_i < t < Ts_i + T_i, the ASPA has to negotiate the guaranteed bandwidth with the UA. For the following experiments, different levels of mean percentage offered load (requested bandwidth) were applied. These loads are defined as a percentage of link capacity, where a 120% mean offered load means that the demand for bandwidth exceeds the available capacity.
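The request stream described by the parameters above can be sketched as follows; the draws mirror the stated distributions (Poisson arrivals, exponential session durations with a 300 s mean, exponentially distributed bandwidth resampled until it falls below the cut-off), while the arrival rate, mean bandwidth and random seed used here are illustrative assumptions.

// Sketch of the simulated request stream: Poisson arrivals (exponential
// inter-arrival times), exponential session durations with mean 300 s, and
// exponentially distributed bandwidth resampled until it falls within
// [bMin, bMax] units (1 unit = 64 kbps).
import java.util.Random;

public class RequestGenerator {
    static final Random rng = new Random(42);

    static double exponential(double mean) {
        return -mean * Math.log(1.0 - rng.nextDouble());
    }

    public static void main(String[] args) {
        double meanInterArrival = 20.0;   // seconds; illustrative arrival rate
        double meanSession = 300.0;       // seconds, as in the experiment
        double meanBandwidth = 30.0;      // units; illustrative mean
        int bMin = 1, bMax = 156;         // roughly 64 kbps up to about 10 Mbps

        double t = 0.0;
        for (int i = 0; i < 5; i++) {
            t += exponential(meanInterArrival);        // Poisson arrival process
            double session = exponential(meanSession); // exponential holding time
            int bw;
            do {                                       // resample until below the cut-off
                bw = Math.max(bMin, (int) Math.round(exponential(meanBandwidth)));
            } while (bw > bMax);
            System.out.printf("t=%.1fs  bw=%d units  session=%.0fs%n", t, bw, session);
        }
    }
}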

Figure 6 shows the request blocking or rejection probability when bandwidth negotiation at different tolerance levels was applied. It is obvious that if the bandwidth request is tolerable, the rejection probability can be reduced. Here, 100% toleration is equivalent to a best-effort request. Figure 7 shows that the effect on the mean percentage reservation load, or mean R, is almost negligible at 60% mean offered load or lower. This is because only those users whose requested bandwidth is not available are subject to negotiation.


Fig. 6. Request Rejection Probability: blocking probability against % offered load (requested bandwidth) for bandwidth tolerance levels of 0%, 20%, 40%, 60%, 80% and 100%. Fig. 7. Mean % Reservation Load (R) against mean % offered load for the same bandwidth tolerance levels.

Fig. 8. Improvement on Mean % Reservation Load against mean % offered load for bandwidth tolerance levels of 20%, 40%, 60% and 100%. Fig. 9. Bandwidth Index (BWI) against mean % offered load for bandwidth tolerance levels of 0% to 100%.

In Figure 8, we observe a significant improvement in mean R when the bandwidth required is tolerable. An improvement of 1% in R means that an extra 1 Mbps of bandwidth on average was reserved/sold over the 33.3 hours. If 1 Mbps were priced at £2 per hour, this would generate extra revenue of £66.60 for the provider. The Bandwidth Index (BWI) is introduced in Figure 9. It corresponds to the amount of bandwidth granted over the amount of bandwidth originally requested, or b_g/b_r. We can see that at low load most users get what they want (BWI ~ 1). However, at high load, those users who can tolerate their bandwidth requirements will have to accept lower bandwidth if they need the service.

6 Dynamic Bandwidth Pricing Evaluation

A demonstration of dynamic bandwidth pricing was carried out during the technical visit session at the Fujitsu Telecommunications Europe Ltd. headquarters in Birmingham, in conjunction with the World Telecommunication Congress (WTC2000). In each session, three volunteers were invited from the audience to assume the role of future network operators. Their common goal was to maximize their network revenue by dynamically pricing bandwidth within the agent-brokering environment. The scenario presents three competing network providers with identical network topology and resources. Each network consisted of 4 routers with 2x2 links, and all provide connections to a multimedia server.


All the possible paths from an ingress point to an egress point were precomputed and stored in the PTs (path tables). The Network Provider Agent (NPA) always offers the shortest available path to the ASPA. The access service provider agent (ASPA) acts as the mediator that aggregates the resource requirements of the various user requests and searches for the best possible deals in order to satisfy customers' requirements.

A simple LAN was set up for this game/demo. The multi-operator network model, coupled with the internal agent brokering mechanisms, was run on a Sun Ultra 10 workstation. A few PCs were set up for the competing network operators and a separate monitoring screen was provided for the rest of the audience.

In this game we considered some universal customer SLA preferences, such as best QoS (guaranteed bandwidth) and cheapest price. We assumed that most users prefer QoS as their first priority and the cheapest offered price as the second priority. Hence the user's SLA preferences are:

If b_i, Ts_i, T_i = not negotiable (priority) AND P_i < 21 (maximum acceptable price),
then ACCEPT min(NetA || NetB || NetC).

Therefore if the cheapest network provider could not provide the required guaranteed flow bandwidth, the service provider would opt for the second cheapest one. During the game, the ASPA continually negotiated with the NPAs to obtain the required number of VLLs to a multimedia server that offered voice, audio, VoD and data services. In order to show the effects of dynamic pricing, we allowed the invited audience, acting as network operators, to manually change the bandwidth price. For this demo we did not provide different pricing schemes for different user classes (voice, video, etc.), although this is also possible. We also did not provide a bandwidth negotiation facility in this game, as we only wanted to focus on the dynamic pricing competition between the operators. Table 1 describes the billing parameters associated with this demonstrator.
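A minimal sketch of the selection logic implied by these preferences is given below: among the providers that can still supply the required guaranteed bandwidth and whose price is below the user's ceiling, the cheapest is chosen. The Offer representation and the example figures are hypothetical.

// Sketch of the ASPA's provider selection under the stated SLA preferences:
// bandwidth/start/session are not negotiable, the price must be below the
// user's ceiling, and among acceptable offers the cheapest network wins.
import java.util.Arrays;
import java.util.List;

public class ProviderSelection {
    static class Offer {
        final String network;
        final double pricePerUnit;
        final double availableBandwidth;
        Offer(String network, double pricePerUnit, double availableBandwidth) {
            this.network = network;
            this.pricePerUnit = pricePerUnit;
            this.availableBandwidth = availableBandwidth;
        }
    }

    static Offer select(List<Offer> offers, double requiredBandwidth, double maxPrice) {
        Offer best = null;
        for (Offer o : offers) {
            if (o.availableBandwidth < requiredBandwidth) continue; // cannot guarantee b_i
            if (o.pricePerUnit >= maxPrice) continue;               // P_i ceiling (< 21 in the demo)
            if (best == null || o.pricePerUnit < best.pricePerUnit) best = o;
        }
        return best; // null means no provider satisfied the SLA
    }

    public static void main(String[] args) {
        List<Offer> offers = Arrays.asList(
                new Offer("NetA", 12.0, 20), new Offer("NetB", 10.0, 5), new Offer("NetC", 15.0, 40));
        Offer chosen = select(offers, 10, 21);
        System.out.println(chosen == null ? "rejected" : chosen.network); // NetA: cheapest with capacity
    }
}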

Table 1. Billing Parameters

Items | Description
Our Price, p_i | The selling price for a bandwidth unit per minute for VLL i.
Cost Price, θ_i | The cost price for a bandwidth unit per minute for VLL i. This value changes according to the link's reservation load and can loosely represent the management overhead.
Operation Cost, C_op | The overall maintenance cost, hardware cost, labour cost, etc. per minute.
Guaranteed bandwidth, b_i | The amount of guaranteed/reserved bandwidth units allocated to VLL i.
No. of links, l_i | The number of links used by VLL i. Since the number of links is considered in the charging equation, the shortest available path is therefore preferred.
Session, T_i | The session length in minutes subscribed by VLL i.


In this game, each user was billed at the end of their service session. The calculation of the gross revenue earned from each VLL i is therefore based on the following equations:

Rev_i = (p_i - \theta_i) \cdot b_i \cdot l_i \cdot T_i = P_i, the total charge imposed on user i    (8)

The total gross revenue earned by a network operator, Rev_gross, after a period t is then:

Rev_gross(t) = \sum_{i=1}^{n(t)} Rev_i    (9)

where t = simulation time elapsed (0 \le t \le T_stop) in minutes and n(t) = total number of VLLs counted after t. The total net revenue Rev_net after t is then:

Rev_net(t) = \sum_{i=1}^{n(t)} Rev_i - C_op \cdot t    (10)

Each player, or acting network operator, could monitor his or her competitors' offered bandwidth prices from the console. The instantaneous reservation load of a network's links, and the network topology showing current network utilisation, were displayed. The reservation load R(t) is the sum of all the independent active VLLs' reserved bandwidth. The number of active users and the network rejection statistics were also reported. For this game, the operators' revenues were generated solely from VLL sales. A monitoring window displayed the total revenues (profit/loss) generated by each network operator. We associated the link QoS level with Gold, Silver and Bronze classes by referring to the link's reservation load, where 0<R(t)<50% = Gold, 51%<R(t)<75% = Silver and 76%<R(t)<100% = Bronze respectively. This is different from the per-user QoS, as each user's VLL was already assigned an amount of guaranteed bandwidth.
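The per-VLL billing of equation (8) and the Gold/Silver/Bronze labelling of link QoS can be sketched as follows, with symbols as in Table 1; the example figures are illustrative only.

// Sketch of equation (8) and the link QoS labelling used in the demo:
// Rev_i = (p_i - theta_i) * b_i * l_i * T_i, and the QoS class follows the
// link's reservation load R(t) as a percentage of capacity.
public class BillingSketch {
    static double vllRevenue(double price, double costPrice, double bandwidthUnits,
                             int links, double sessionMinutes) {
        return (price - costPrice) * bandwidthUnits * links * sessionMinutes; // equation (8)
    }

    static String linkQosClass(double reservedLoadPercent) {
        if (reservedLoadPercent <= 50) return "Gold";
        if (reservedLoadPercent <= 75) return "Silver";
        return "Bronze";
    }

    public static void main(String[] args) {
        // e.g. price 12, cost 2 per unit per minute, 30 units, 2 links, 60-minute session
        System.out.println("Rev_i = " + vllRevenue(12, 2, 30, 2, 60)); // 36000
        System.out.println(linkQosClass(40));  // Gold
        System.out.println(linkQosClass(90));  // Bronze
    }
}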

Table 2 shows the characteristics set for the different classes of users during the game.

Table 2. Classes of Users

User Class | Request Arrival Rate (per hour) | Guaranteed unit BW Requested Per Flow | Session Time Per VLL (mins) | Example Applications
1 | 70 | 2 | 3-10 | VOIP/Audio
2 | 15 | 30 | 10-60 | VoD/Video Conferencing
3 | 28 | 20 | 1-10 | WWW/FTP

* One bandwidth unit is roughly defined as 64 kbps.

Figure 10 shows the accumulated requested bandwidth (offered load) profile for the different classes of users. The results from one of the sessions were collected and analyzed in Figures 11 through 15. Figure 11 shows the pricing history of the three acting network operators. Here, the network operators were trying to maximise their revenues by setting different bandwidth prices at any one time.


Fig. 10. Offered Load at the Access Node: accumulated requested bandwidth units over time for Class 1, Class 2 and Class 3 users. Fig. 11. Price Bidding History: the bandwidth unit prices set by network operators A, B and C over the 200-minute session.

At t ~ 20 mins, network operator A lowered its bandwidth price to 1 and caused a sharp increase in load over the measured link (see Figure 12). At t > 40 mins, network operator A increased its price dramatically, and soon became much more expensive than the others. As a result, a significant drop in reservation load was observed after t > 75 mins. This was most likely due to video subscribers leaving the network. Note that at 100 mins < t < 110 mins, when network A was still the most costly network, traffic was coming into the network because the other two networks were saturated and unable to provide the required bandwidth. This earned network A a sharp rise in revenue (see Figure 13) and a short lead in the revenue race, as a result of users buying high cost connections in bulk volume.

Fig. 12. Link Load (link 2) vs. Time: the network load in bandwidth units for networks A, B and C. Fig. 13. Revenue Generated vs. Time for networks A, B and C.

In Figure 14, we can observe a close relationship between reservation load and price. In this case, the cheapest provider earned the most revenue. However, it can be observed that network B's average price was only marginally higher than network A's. This means that network B could actually have bid a higher average price and won the game, because network C had a significantly higher average bandwidth price compared to network B. Although this strategy is only applicable to this scenario, it illustrates the basic principle of how network providers can maximise their revenues in such a dynamically changing market.


Fig. 14. Average Load vs. Price for network operators A, B and C. Fig. 15. Revenue per Time Interval (20-minute intervals over the 200-minute session) for networks A, B and C.

Figure 15 shows the importance of setting the right price at the right time. Network A made a loss in the 20-40 minute interval due to its low price offer. However, much of the loss was compensated in the 100-120 minute interval due to bulk VLL sales at a high price. It is also observed from this simple experiment that it is more profitable to obtain revenue from high bandwidth, long session streams such as video conferencing.

7 Conclusion

In this paper we have described an agent-enhanced service provisioning architecture that facilitates SLA negotiations and brokering. An extended utility model is devised to formulate the SLA management problem. A prototype system consisting of different types of real-time agents interacting with a simulated network was developed to demonstrate scenarios and to allow the functional and dynamic behaviour of the network under various agent-supported scenarios to be investigated. Some futuristic scenarios on dynamic SLA negotiations, i.e. bandwidth and price, were demonstrated, particularly for VLL types of service. The results show that the agent-based solution introduces much greater dynamism and flexibility in how services can be provisioned. During the dynamic pricing demonstration, we allowed the audience to manually set the bandwidth price. In the future, a pricing agent could potentially take over this task, setting the right price at the right time based on a more sophisticated pricing mechanism (e.g. different pricing schemes for different service classes). A well-defined pricing structure is not only important as a tool to enable users to match and negotiate services as a function of their requirements; it can also be a traffic control mechanism in itself, as the dynamic setting of prices can be used to control the volume of traffic carried. We learned how the straightforward cheapest bandwidth price-bidding scenario affects the competition between the various network providers. In our current prototype system, our agents only exercise the simplest forms of transactions, such as bandwidth negotiation and price comparison. Hence, the agents' activities within the network are negligible whether client-server based agents or mobile agents are used. Nevertheless, it is anticipated that when agents have acquired a higher level of negotiation capabilities and intelligence, this issue must be further addressed.

Acknowledgement. The authors gratefully acknowledge Fujitsu Telecommunications Europe Ltd. for funding this project, and Fujitsu Teamware Group Ltd. and Fujitsu Laboratories Japan for their software support. Special thanks to Professor Colin South, Dr. Dominic Greenwood, Dr. Keith Jones, Dr. Desmond Maguire and Dr. Patrick McParland for all their invaluable feedback.

References

1. S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, "An Architecture for Differentiated Services", RFC 2475, Network Working Group, IETF, December 1998. (www.ietf.org)
2. E. C. Rosen, A. Viswanathan and R. Callon, "Multiprotocol Label Switching Architecture", Internet Draft, Network Working Group, IETF, July 2000. (www.ietf.org)
3. R. Braden, L. Zhang, S. Berson, S. Herzog, and S. Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification", RFC 2205, Network Working Group, IETF, September 1997. (www.ietf.org)
4. Stardust.com Inc., "White Paper - QoS Protocols & Architectures", 8 July 1999. (www.qosforum.com/white-papers/qosprot_v3.pdf)
5. Microsoft Corporation, "Quality of Services and Windows", June 2000. (www.microsoft.com/hwdev/network/qos/#Papers)
6. Cisco IOS Quality of Service (QoS). (www.cisco.com/warp/public/732/Tech/qos/)
7. Fred Engel, Executive Vice President and CTO of Concord Communications, "Grasping the ASP means service level shakedown", Communications News, Aug 2000, pp. 19-20.
8. Jit Biswas et al., "IEEE P1520 Standards Initiative for Programmable Network Interfaces", IEEE Communications, Vol. 36, No. 10, October 1998, pp. 64-70.
9. Martin Chapman, Stefano Montesi, "Overall Concepts and Principles of TINA", TINA Baseline Document, TB_MDC.018_1.0_94, 17th Feb 1995. (www.tinac.com)
10. L. K. Lim, J. Gao, T. S. Eugene Ng, P. Chandra, "Customizable Virtual Private Network Service with QoS", Computer Networks Journal, Elsevier Science, Special Issue on "Overlay Networks", to appear in 2001.
11. N. Agoulmine, F. Nait-Abdesselam and A. Serhrouchni, "QoS Management of Multimedia Services Based On Active Agent Architecture", Special Issue: QoS Management in Wired & Wireless Multimedia Communication Networks, ICON, Baltzer Science Publishers, Vol. 2/2-4, ISSN 1385 9501, Jan 2000.
12. M. Karsen, N. Beries, L. Wolf, and R. Steinmetz, "A Policy-Based Service Specification for Resource Reservation in Advance", Proceedings of the International Conference on Computer Communications (ICCC'99), September 1999, pp. 82-88.
13. O. Schelen and S. Pink, "Resource sharing in advance reservation agents", Journal of High Speed Networks: Special Issue on Multimedia Networking, Vol. 7, No. 3-4, pp. 213-228, 1998.
14. X. Wang and H. Schulzrinne, "RNAP: A Resource Negotiation and Pricing Protocol", Proc. International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV'99), New Jersey, June 1999.
15. Dinesh Verma, "Supporting Service Level Agreements on IP Networks", Macmillan Technical Publishing.
16. David Chieng, Alan Marshall, Ivan Ho and Gerald Parr, "Agent-Enhanced Dynamic Service Level Agreement In Future Network Environments", IFIP/IEEE MMNS 2001, Chicago, 29 Oct - 1 Nov 2001.
17. S. Khan, K. F. Li and E. G. Manning, "The Utility Model for Adaptive Multimedia Systems", International Conference on Multimedia Modeling, Singapore, Nov 1997.
18. David Chieng, Alan Marshall, Ivan Ho and Gerald Parr, "A Mobile Agent Brokering Environment for The Future Open Network Marketplace", IS&N2000, Athens, 23-25 February 2000, pp. 3-15. (Springer-Verlag LNCS Series, Vol. 1774)
19. Foundation for Intelligent Physical Agents. (www.fipa.org)
20. White Paper, "Phoenix - The Enabling Technology Behind Pl@za", MC0000E, Teamware Group Oy, April 2001. (www.teamware.com)
21. http://java.sun.com/j2ee/tutorial/doc/Overview.html
22. R. Fielding et al., "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2068, Network Working Group, IETF, Jan 1997. (www.ietf.org)
23. http://pr.fujitsu.com/en/news/2000/02/15-3.html


Facing Fault Management as It Is, Aiming for What You Would Like It to Be

Roy Sterritt

University of Ulster, School of Information and Software Engineering, Faculty of Informatics
Jordanstown Campus, Northern Ireland
[email protected]

Abstract. Telecommunication systems are built with extensive redundancy and complexity to ensure robustness and quality of service. Such systems require complex fault identification and management tools. Fault identification and management are generally handled by reducing the number of alarm events (symptoms) presented to the operating engineer through monitoring, filtering and masking. The goal is to determine and present the actual underlying fault. Fault management is a complex task, subject to uncertainty in the symptoms presented. In this paper two key fault management approaches are considered: (i) rule discovery, to attempt to present fewer symptoms with greater diagnostic assistance for the more traditional rule-based system approach, and (ii) the induction of Bayesian Belief Networks (BBNs) for a complete 'intelligent' approach. The paper concludes that the research and development of the two target fault management systems can be complementary.

1 Introduction

It has been proposed that networks will soon be the keystone to all industries [1]. Effective network management is therefore increasingly important for profitability. Network downtime will not only result in the loss of revenue but may also lead to serious financial contractual penalties for the provider.

As the world becomes increasingly reliant on computer networks, the complexity of such networks has grown in a number of dimensions [2]. The phenomenal growth of the Internet is a clear example of the extent to which the use of computer networks is becoming ubiquitous [3]. As users' demands and expectations become more varied and complex, so do the networks themselves. In particular, heterogeneity has become the rule rather than the exception [2]. Data of any form may travel under the control of different protocols through numerous physical devices manufactured and operated by large numbers of different vendors. Thus there is a general consensus that the trend is towards increasing complexity.


Such complexity lies in the accumulation of several factors: the embedded increasing function of network elements, the need for sophisticated services and the heterogeneity challenges of customer networks [4].

This paper explores one aspect of network management in detail: fault identification. Section 2 looks at the complexity in networks, network management and fault management. Section 3 considers techniques for discovering rules for the existing rule-based systems used in fault management. Section 4 discusses other intelligent techniques that may offer solutions to the inherent problems associated with rule-based systems, and finally Section 5 concludes the paper. Throughout the sections, reference is made to a data set of fault management alarms that was gathered from an experiment on an SDH/Sonet network in Nortel Networks.

2 Uncertainty in Fault Management

Network management encompasses a large number of tasks, with various standards bodies specifying a formal organisation of these tasks. The International Standards Organization (ISO) divides network management into six areas as part of the Open Systems Interconnection (OSI) model: configuration management, fault management, performance management, security management, accounting management and directory management; these sit within a seven-layer hierarchical network structure.

However, with the Internet revolution and the convergence of the Telcos and Data Communications, the trend is towards a flatter structure.

2.1 Faults and Fault Management

Essentially, network faults can be classified into hardware and software faults, which cause elements to produce outputs, which in turn cause overall failure effects in the network, such as congestion [5]. A single fault in a complex network can generate a cascade of events, potentially overloading a manager's console with information [6].

The fault management task can be characterised as detecting when network behaviour deviates from normal and formulating a corrective course of action when required. Fault management can be decomposed into three tasks: fault identification, fault diagnosis and fault remediation [2]. A fourth, fault prediction, could be added as a desire or expectation of the fault management task, considered as a natural extension of fault identification [7].

2.2 An Experiment to Highlight Uncertainty in Fault Management

A simple experiment was performed to highlight some of the uncertainty that can be experienced within fault management, specifically looking at the physical network element layer from which other management layer information is derived. The configuration was a basic network with two STM-1 network elements (NEs) and an element manager. A simple test lasting just over 3 minutes and containing 16 commands that exercise the 2M single channel was run on the network 149 times. The network was dedicated to the test and no external configuration changes took place during the experiment. After each run the network was allowed a rest period to ensure no trailing events existed before being reset.

Fig. 1. Simple test repeated 149 times: uncertainty in the frequency of alarms raised (alarm frequency plotted for each of the 149 test runs).

The graph (Figure 1) displays the number of alarms raised, 24 being the lowest (run 124) and 31 being the highest (run 22). The average was 27 alarm events. This experiment on a small network highlights the variability of alarm data under fault conditions. The expectation that the same fault produces the same symptoms each time is unfortunately not true of this domain.

3 Facing Fault Management as It Is

The previous experiment hints at the number of alarm events that may be raised under fault conditions. Modern networks produce thousands of alarms per day, making the task of real-time network surveillance and fault management difficult [8]. Due to the large volume of alarms, it is possible to overlook or misinterpret them. Alarm correlation has become the main technique for reducing the number to be considered.

3.1 Alarm Correlation

Alarm correlation is a conceptual interpretation of multiple alarms, giving them a new meaning [8] and, from that, potentially creating derived higher order alarms [9]. Jakobson and Weissman proposed correlation as a generic process of five types: compression, suppression, count, generalisation and Boolean patterns.

Compression is the reduction of multiple occurrences of an alarm into a single alarm. Suppression is when a low-priority alarm is inhibited in the presence of a higher priority alarm, generally referred to as masking. Count is the substitution of a specified number of occurrences of an alarm with a new alarm. Generalisation is when an alarm is referenced by its super-class. Boolean patterns is the substitution of a set of alarms satisfying a Boolean pattern with a new alarm.
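As a rough illustration (not the authors' implementation), the compression and count operations can be sketched over a simple list of alarm type names; the threshold and the derived alarm name used in the example are hypothetical.

// Sketch of two of the correlation operations described above: compression
// (collapse repeated occurrences of an alarm type into one) and count
// (replace k or more occurrences of a type with a new, derived alarm).
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CorrelationSketch {
    static List<String> compress(List<String> alarms) {
        List<String> out = new ArrayList<>();
        for (String a : alarms) if (!out.contains(a)) out.add(a);
        return out;
    }

    static List<String> count(List<String> alarms, String type, int threshold, String derived) {
        int occurrences = 0;
        for (String a : alarms) if (a.equals(type)) occurrences++;
        List<String> out = new ArrayList<>();
        boolean substituted = false;
        for (String a : alarms) {
            if (occurrences >= threshold && a.equals(type)) {
                if (!substituted) { out.add(derived); substituted = true; } // substitute once
            } else {
                out.add(a);                                                 // keep other alarms
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> alarms = Arrays.asList("PPI-AIS", "PPI-AIS", "LP-PLM", "PPI-AIS");
        System.out.println(compress(alarms));                             // [PPI-AIS, LP-PLM]
        System.out.println(count(alarms, "PPI-AIS", 3, "PPI-AIS-STORM")); // [PPI-AIS-STORM, LP-PLM]
    }
}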

3.2 Alarm Monitoring, Filtering, and Masking

Within current fault management systems, as specified by ITU-T, alarm correlation is generally handled in three sequential transformations: alarm monitoring, alarm filtering and alarm masking. These mean that if the raw state of an alarm instance changes, an alarm event is not necessarily generated.

Alarm monitoring takes the raw state of an alarm and produces a monitored state. Alarm monitoring is enabled/disabled for each alarm instance. If monitoring is enabled, then the monitored state is the same as the raw state; if disabled, then the monitored state is clear.

Alarm filtering is also enabled/disabled for each alarm instance. An alarm may exist in any one of three states: present, intermittent or clear, depending on how long the alarm is raised. Assigning these states, by checking for the presence of an alarm within certain filtering periods, determines the alarm filtering.

Alarm masking is designed to prevent unnecessary reporting of alarms. The masked alarm is inhibited from generating reports if an instance of its superior alarm is active and fits the masking periods. A masking hierarchy determines the priority of each alarm type. Alarm masking is also enabled/disabled for each alarm instance.

When an alarm changes state the network management system must be informed. The combination of alarm monitoring, filtering and masking makes alarm handling within the network elements relatively complex.

3.3 Fault Management Systems

All three stages of fault management (identification, diagnosis and remediation) involve reasoning and decision making based on information about current and past states of the network [2]. Interestingly, much of the work in this area makes use of techniques from Artificial Intelligence (AI), especially expert systems and, increasingly, machine learning. The complexity of computer networks and the time critical nature of management decisions make network management a domain that is difficult for humans [2]. It has been claimed that expert systems have achieved performance equivalent to that of human experts in a number of domains, including certain aspects of network management [10].


Most systems employing AI technologies for fault diagnosis are expert or production rule systems [9]. Some of these classic applications and their theory are discussed in [13,14], containing cases such as ACE (Automated Cable Expertise) [15] and SMART (Switching Maintenance Analysis and Repair Tool) [16]. Details of several other artificial intelligence applications in telecommunications network management may also be found in [17-19].

Many of these systems have proved very successful, yet they have their limitations. Generally speaking, as Gürer outlines, rule-based expert systems: cannot handle new and changing data, since rules are brittle and cannot cope when faced with unforeseen situations; cannot adapt, that is, they cannot learn from past experience; do not scale well to large, dynamic, real-world domains; require extensive maintenance; are not good at handling probability or uncertainty; and have difficulty in analysing large amounts of uncorrelated, ambiguous and incomplete data.

These drawbacks support the use of different AI techniques that can overcome these difficulties, either alone or as an enhancement of expert systems [9]. There is a predicament, however, as there is doubt whether such techniques would be accepted as the engine within the fault management system.

3.4 Facing the Challenge

At the heart of alarm correlation is the determination of the cause. The alarms represent the symptoms and, as such, in the global scheme are not of general interest once the failure is determined [11]. There are two real-world concerns: (i) the sheer volume of alarm event traffic when a fault occurs; and (ii) identifying the cause from the symptoms.

Alarm monitoring, filtering and masking, and their direct application in the form of rule-based systems, address concern (i), which is vital. They focus on reducing the volume of alarms but do not necessarily help with (ii), determining the actual cause; this is left to the operator to resolve from the reduced set of higher priority alarms. Ideally, a technique that can tackle both these concerns would be best.

AI offers that potential and has been, and still is, an active and worthy area of research to assist in fault management. Yet telecommunication manufacturers, understandably, have shown reluctance to incorporate AI techniques, in particular those that have an uncertainty element, directly into their critical systems. Rule-based systems have achieved acceptance largely because the decisions obtained are deterministic; they can be traced and understood by domain experts.

As a step towards automated fault identification, and with the domain challenges in mind, it is useful to use AI to derive rule discovery techniques that present fewer symptoms with greater diagnostic assistance.

A potential flaw in data mining is that it is not user-centered. This may be alleviated by the visualisation of the data at all stages, to enable the user to gain trust in the process and hence have more confidence in the mined patterns. The transformation from data to knowledge requires interpretation and evaluation, which also stands to benefit from multi-stage visualisation of the process [20-22], as human and computer discovery are interdependent.

The aim of computer-aided human discovery is to reveal hidden knowledge, unexpected patterns and new rules from large datasets. Computer handling and visualisation techniques for vast amounts of data make use of the remarkable perceptual abilities that humans possess, such as the capacity to recognise images quickly and to detect the subtlest changes in size, colour, shape, movement or texture, thus potentially discovering new event correlations in the data.

Data mining (discovery algorithms) may reveal hidden patterns and new rules, yet these require human interpretation to transform them into knowledge. The human element attaches a more meaningful insight to decisions, allowing the discovered correlations to be coded as useful rules for fault identification and management.

3.5 Three-Tier Rule Discovery Process

Computer-assisted human discovery and human-assisted computer discovery can be combined in a three-tier process to provide a mechanism for the discovery and learning of rules for fault management [23].

The tiers are:
Tier 1 - Visualisation Correlation
Tier 2 - Knowledge Acquisition or Rule-Based Correlation
Tier 3 - Knowledge Discovery (Data Mining) Correlation

The top tier (visualisation correlation) allows the visualisation of the data in several forms. The visualisation has a significant role throughout the knowledge discovery process, from data cleaning to mining. This allows analysis of the data with the aim of identifying other alarm correlations (knowledge capture). The second tier (knowledge acquisition or rule-based correlation) aims to define correlations and rules using more traditional knowledge acquisition techniques, utilising documentation and experts. The third tier (knowledge discovery correlation) mines the telecommunications management network data to produce more complex correlation candidates.

The application of the 3-tier process is iterative and flexible. The visualisation tier may require the knowledge acquisition tier to confirm its findings. Likewise, visualisation of the knowledge discovery process could facilitate understanding of the patterns discovered.

The next section uses the experimental data to demonstrate how the application of the 3-tier process may be of use in discovering new rules.

3.6 Analysing the Experimental Data: Rule-Based Systems

The experiment demonstrated how a relatively large volume of fault management data can be produced on a simple network and how the occurrence of alarm events under fault conditions is uncertain. Since the data is only concerned with the same simulated fault/test being repeated 149 times, it does not provide the right context for any mined results to be interpreted as typical network behaviour, yet it is sufficient to illustrate the analysis process.

Mining the data can identify rules of alarms that occur together, that is, potential correlations such as: if PPI-AIS on Gate_02 then INT-TU-AIS on Gate_02. Through traditional knowledge acquisition the majority of these may be explained as existing knowledge, for instance with reference to the alarm masking hierarchy from ITU-T. Visualisation may help explain the data that led to the mined discoveries, or allow for human discoveries themselves, as illustrated in Figure 2.

The alarm life spans are displayed as horizontal Gantt bars. In this view, the alarms are listed for each of the two network elements; as such, alarms that occur on the same vertical path (time) are potential correlations. The two highlighted alarms are PPI-Unexpl_Signal and LP-PLM. This matches a discovery found in a different data set [23].

Fig. 2. Network Element view - alarms raised on each element. Highlighted is a possible correlation in time (PPI-Unexpl_Signal and LP-PLM).

On investigating the standards specifications, it is found that a PPI-Unexp-Signal has no impacts or consequent actions. LP-PLM affects traffic and can inject an AIS and an LP-RDI alarm depending on configuration (consequent actions for LP-PLM can be enabled/disabled, the default being disabled). Thus there is no explicit connection defined for these two alarms.

Visualising only these alarms in the alarm view (Figure 3) would tend to confirm this correlation. Each time PPI-Unexp_Signal is raised, LP-PLM becomes active on the other connected network element. This occurs both ways; that is, it is not dependent on which NE PPI-Unexp_Signal is raised.

Since this discovered correlation is an unexpected pattern, it is of interest and can be coded as a rule for the fault management system or another diagnostic tool. This may be automated for the target rule system using, for example, ILOG rules. The rule would state that when PPI-Unexpl_Signal and LP-PLM occur together on the same port but with different connected multiplexers, these alarms are correlated and retracted while a derived alarm specifying the correlation is raised. This derived alarm would be used to trigger diagnostic assistance or be correlated with further alarms to define the fault. This example shows that tools can be developed to semi-automate the rule development process. The rule in this example still has all the problems connected with handling uncertainty, however, since it would be used within existing rule-based fault management systems. The next section considers how the approach can be improved by using uncertainty-handling AI techniques in the discovery process.
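Expressed as plain Java rather than as an actual ILOG rule, the discovered correlation could be encoded roughly as follows; the Alarm fields and the derived alarm name are hypothetical.

// Sketch of the discovered rule: if PPI-Unexpl_Signal and LP-PLM are active
// on the same port but on different (connected) multiplexers, retract both
// and raise a single derived alarm describing the correlation.
import java.util.ArrayList;
import java.util.List;

public class UnexplSignalRule {
    static class Alarm {
        final String type, networkElement;
        final int port;
        Alarm(String type, String networkElement, int port) {
            this.type = type;
            this.networkElement = networkElement;
            this.port = port;
        }
    }

    static List<Alarm> apply(List<Alarm> active) {
        List<Alarm> result = new ArrayList<>(active);
        for (Alarm a : active) {
            if (!a.type.equals("PPI-Unexpl_Signal")) continue;
            for (Alarm b : active) {
                if (b.type.equals("LP-PLM") && b.port == a.port
                        && !b.networkElement.equals(a.networkElement)) {
                    result.remove(a);   // retract the correlated symptoms
                    result.remove(b);
                    result.add(new Alarm("UNEXPL_SIGNAL/PLM-CORRELATION", a.networkElement, a.port));
                    return result;      // derived alarm for diagnostic assistance
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Alarm> active = new ArrayList<>();
        active.add(new Alarm("PPI-Unexpl_Signal", "NE1", 3));
        active.add(new Alarm("LP-PLM", "NE2", 3));
        System.out.println(apply(active).get(0).type); // UNEXPL_SIGNAL/PLM-CORRELATION
    }
}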

Fig. 3. Alarm view - each time PPI-Unexp_Signal becomes active on one NE, LP-PLM becomes active on the other. This occurs on both NEs.

4 Aiming for What You Would Like Fault Management to Be

Fault management is a complex task, subject to uncertainty in the 'symptoms' presented. It is a good candidate for treatment by an AI methodology that handles uncertainty, such as soft computing or computational intelligence [24].


Correlation serves to reduce the number of alarms presented to the operator, and an intelligent fault management system might additionally facilitate fault prediction. The role of the fault management system may be described as: fault identification/diagnosis, the prediction of the fault(s) that have occurred from the alarms present; behaviour prediction, warning the operator beforehand of severe faults from the alarms that are presenting themselves; and estimation of a fault's likely life span.

4.1 Intelligence Research

The technology most commonly used to add significant levels of automation to network management platforms is rule-based expert systems. Yet the inherent disadvantages of such systems, discussed previously, limit how much they can be used, thus encouraging many researchers to seek new approaches. Problems associated with acquiring knowledge from experts and building it into a system before it is out of date (the knowledge acquisition bottleneck), together with the high manual maintenance burden for rules, have led to research into machine learning techniques. Ways of handling partial truths, evidence, causality and uncertainty were sought from statistics. Likewise, the brittleness and rigidity of the rules and their inability to learn from experience have led to research into self-adaptive techniques, such as case-based reasoning.

Increasingly, these and other AI techniques are being investigated for all aspects of network management. Machine learning has been used to detect chronic transmission faults [25] and to dispatch technicians to fix problems in local loops [26, 27]. Neural networks have been used to predict the overall health of the network [28] and to monitor trunk occupancy [29]. Decision trees have also been used for rule discovery [30, 31] as well as data mining [32, 33], and the most recent trend is the use of agents [34, 35].

Although these techniques may address some of the problems of rule-based systems, they have disadvantages of their own. Thus an increasing trend in recent years has been to utilise hybrid systems to maximise the strengths and minimise the weaknesses of the individual techniques.

4.2 Intelligent Techniques for Fault Identification

Neural networks are a key technique in both computational intelligence and soft computing, with a proven predictive performance. They have been proposed, along with case-based reasoning, as a hybrid system for a complete fault management process [9], as well as to identify faults in switching systems [38], for the management of ATMs [36] and for alarm correlation in cellular phone networks [39].

Yet they do not meet one important goal: comprehensibility [38]. This lack of explanation leads to some reluctance to use neural networks in fault management systems [32]. Kohonen self-organising maps [41] and Bayesian belief networks [42] have been offered as alternatives to such a black-box approach.


4.3 Bayesian Belief Networks for Fault Identification

The graphical structure of Bayesian Belief Networks (BBNs) more than meets the need for 'readability'. BBNs consist of a set of propositional variables represented by nodes in a directed acyclic graph. Each variable can assume an arbitrary number of mutually exclusive and exhaustive values. Directed arcs (arrows) between nodes represent the probabilistic relationships between them. The absence of a link between two variables indicates independence between them given that the values of their parents are known. In addition to the network topology, the prior probability of each state of a root node is required. It is also necessary, in the case of non-root nodes, to know the conditional probabilities of each possible value given the states of the parent nodes or direct causes.

The power of the BBN is evident whenever a change is made to one of the marginal probabilities. The effects of the observation are propagated throughout the network and the other probabilities are updated. The BBN can be used for deduction in the fault management domain. For given alarm data, it will determine the most probable cause(s) of the supplied alarms, thus enabling the process to act as an expert system that handles uncertainty.
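For illustration only, the following sketch performs this kind of propagation by exact enumeration on a tiny two-alarm network Fault -> {PPI_AIS, LP_PLM}; the prior and conditional probabilities are invented placeholders, not values from the experiment.

# Minimal sketch (not the paper's implementation): exact posterior over a fault
# node given observed alarms, assuming the alarms are conditionally independent
# given the fault. All probability values below are illustrative placeholders.
FAULTS = ["tributary_unit", "payload_manager", "unstructured_signal"]
prior = {"tributary_unit": 0.2, "payload_manager": 0.3, "unstructured_signal": 0.5}
# P(alarm active | fault) -- assumed numbers, one entry per (alarm, fault)
cpt = {
    "PPI_AIS": {"tributary_unit": 0.1, "payload_manager": 0.2, "unstructured_signal": 0.9},
    "LP_PLM":  {"tributary_unit": 0.7, "payload_manager": 0.8, "unstructured_signal": 0.3},
}

def posterior(evidence):
    """Return P(fault | observed alarms) for evidence like {"PPI_AIS": True, ...}."""
    joint = {}
    for f in FAULTS:
        p = prior[f]
        for alarm, active in evidence.items():
            p_active = cpt[alarm][f]
            p *= p_active if active else (1.0 - p_active)
        joint[f] = p
    z = sum(joint.values())
    return {f: p / z for f, p in joint.items()}

print(posterior({"PPI_AIS": True, "LP_PLM": False}))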

For a discussion on the construction of BBNs for Fault Management see [43]. The next section uses the network experimental data to induce a simple BBN for FM.

4.4 Analysing the Experimental Data with Bayesian Belief Networks

The experiment may be considered as inducing 149 instances of the same simulated fault. In order to develop a BBN for the experiment, each instance of the simulated fault (all 149 sets of data) was assigned a row in the contingency table containing its frequencies of alarm occurrences. This is performed at a much higher level of abstraction than usual. The normal procedure would be to assign a time window, of possibly as little as 1 second, and calculate the frequencies of occurrence for combinations of alarms that are present throughout the data set. It is expected that there will be fewer edges in the graph, but as a whole it should still reflect the significant alarms and the relationships between them for a simulated fault in this experiment.
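A sketch of this tabulation step, under assumed inputs (a list of per-instance alarm logs rather than the actual experimental data), might look as follows:

from collections import Counter

def contingency_table(instances, alarm_types):
    # one row per simulated-fault instance, one column per alarm type, cells = occurrence counts
    table = []
    for log in instances:
        counts = Counter(log)
        table.append([counts.get(a, 0) for a in alarm_types])
    return table

alarm_types = ["PPI-AIS", "PPI-Unexp_Signal", "LP-PLM"]                # illustrative subset
instances = [["PPI-AIS", "PPI-AIS", "LP-PLM"], ["PPI-Unexp_Signal", "LP-PLM"]]
print(contingency_table(instances, alarm_types))                       # [[2, 0, 1], [0, 1, 1]]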

Fig. 4. Induced results from the 149 simulated fault experiment.


The PowerConstructor package [44] incorporates an efficient 3-stage algorithm using a version of the mutual information approach [45]. The BBN in Figure 4 was induced for the simulated fault experiment. There was not enough data for the algorithm to distinguish the direction of the edges, although this is something an expert could provide. Only the structure, and not the values, is depicted, both for simplicity and because the small data set may explain why the figures are not very meaningful.
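The mutual-information score that drives this kind of induction can be computed as in the rough sketch below (an illustration in the spirit of [45], not the PowerConstructor code); data is assumed to be a list of dicts mapping alarm names to 0/1 for each fault instance.

import math
from collections import Counter

def mutual_information(data, x, y):
    # empirical mutual information between two discrete alarm variables
    n = len(data)
    px, py, pxy = Counter(), Counter(), Counter()
    for row in data:
        px[row[x]] += 1
        py[row[y]] += 1
        pxy[(row[x], row[y])] += 1
    mi = 0.0
    for (vx, vy), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[vx] / n) * (py[vy] / n)))
    return mi

data = [{"PPI_Unexp_Signal": 1, "LP_PLM": 1}, {"PPI_Unexp_Signal": 0, "LP_PLM": 0},
        {"PPI_Unexp_Signal": 1, "LP_PLM": 1}, {"PPI_Unexp_Signal": 1, "LP_PLM": 0}]
print(mutual_information(data, "PPI_Unexp_Signal", "LP_PLM"))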

The specifications state that if the signal is configured as unstructured (that is, it does not conform to ITU-T recommendation G.732) an AIS alarm can be considered a valid part of the signal. In this case the strong presence of PPI-AIS can explain the unstructured signal. PPI-AIS also has the consequence of injecting an AIS towards the tributary unit. If the alarms are separated into two data sets and BBNs induced separately, the relationships shown in Figure 5 begin to develop. The relationship between PPI_Unexp_Signal and LP_PLM (discovered earlier by visual inspection of Fig. 2) is confirmed by the induced BBNs.

The variables (nodes) in a BBN can represent faults as well as alarms, which supports the aim of automated fault diagnosis rather than just alarm correlation. Fault nodes are added to this example in Figure 6 and Figure 7. In the first case the fault node has three possible values: faulty tributary unit, faulty payload manager or unstructured signal. Figure 7 includes a fault node that has two levels: faulty tributary unit and cable misconnection.

Fig. 5. Inducing from split data set due to the knowledge that the signal was unstructured.

Fig. 6. Fault node added to AIS relationship.


Once the BBN is part of an expert system, the occurrence of these alarms will cause propagation of this 'evidence' through the network, providing probability updates and predictions of fault identification.

The example induced is somewhat simple but illustrates the potential of BBNs in fault management. It is important to note that even with the ability to induce (machine learn or data mine) the BBN from data, it took human access to expert knowledge to find a more accurate solution. The success of the approach is also dependent on the quality and quantity of the data. To develop a fully comprehensive belief network that covers the majority of possible faults on a network would be a massive undertaking, yet the benefits over rule-based systems suggest that this may be a very worthwhile task.

Fig. 7. Fault node added to Unexplained-signal relationship.

5 Summary and Conclusion

The paper first illustrated the complexity and uncertainty in fault management by showing the variability of alarms raised under the same fault conditions. It then described the standard approach of dealing with fault identification using alarm correlation, via monitoring, filtering and masking, implemented as a rule-based system. This highlighted problems with rule-based systems, and a three-tier rule discovery process was discussed and demonstrated to assist in alleviating some of these problems.

Other AI techniques and methodologies that handle uncertainty, in particular the main techniques in computational intelligence and soft computing, were discussed. Belief networks were proposed and demonstrated as a technique to support automated fault identification.

In each section the approaches were illustrated with data from an experimental simulation of faults on an SDH/SONET network in Nortel Networks.


Acknowledgements. The author is greatly indebted to the Industrial Research and Technology Unit (IRTU) (Start 187 - The Jigsaw Programme, 1999-2002) for funding this work jointly with Nortel Networks. The paper has benefited from discussions with other members of the Jigsaw team and with collaborators at Nortel.

References

1. M. Cheikhrouhou, P. Conti, R. Oliveira, J. Labetoulle: Intelligent agents in network management: A state-of-the-art. Networking and Information Systems Journal, 1, pp. 9-38, Jun. 1998.

2. T. Oates: Fault identification in computer networks: A review and a new approach. Technical Report 95-113, University of Massachusetts at Amherst, Computer Science Department, 1995.

3. C. Bournellis: Internet '95. Internet World, 6(11), pp. 47-52, 1995.

4. M. Cheikhrouhou, P. Conti, J. Labetoulle, K. Marcus: Intelligent Agents for Network Management: Fault Detection Experiment. In: Sixth IFIP/IEEE International Symposium on Integrated Network Management, Boston, USA, May 1999.

5. Z. Wang: Model of network faults. In: B. Meandzija, J. Westcott (Eds.), Integrated Network Management I, North Holland, Elsevier Science Pub. B.V., 1989.

6. T. Oates: Automatically Acquiring Rules for Event Correlation From Event Logs. Technical Report 97-14, University of Massachusetts at Amherst, Computer Science Dept., 1997.

7. R. Sterritt, A.H. Marshall, C.M. Shapcott, S.I. McClean: Exploring Dynamic Bayesian Belief Networks for Intelligent Fault Management Systems. IEEE Int. Conf. Systems, Man and Cybernetics, V, pp. 3646-3652, Sept. 2000.

8. G. Jakobson, M. Weissman: Alarm correlation. IEEE Network, 7(6), pp. 52-59, Nov. 1993.

9. D. Gürer, I. Khan, R. Ogier, R. Keffer: An Artificial Intelligence Approach to Network Fault Management. SRI International, Menlo Park, California, USA.

10. R.N. Cronk, P.H. Callan, L. Bernstein: Rule-based expert systems for network management and operations: An introduction. IEEE Network, 2(5), pp. 7-23, 1988.

11. K. Harrison: A Novel Approach to Event Correlation. HP Intelligent Networked Computing Lab, HP Labs, Bristol, HP-94-68, July, pp. 1-10, 1994.

12. I. Bratko, S. Muggleton: Applications of Inductive Logic Programming. Communications of the ACM, Vol. 38, No. 11, pp. 65-70, 1995.

13. J. Liebowitz (Ed.): Expert System Applications to Telecommunications. John Wiley and Sons, New York, NY, USA, 1988.

14. B. Meandzija, J. Westcott (Eds.): Integrated Network Management I. North Holland/IFIP, Elsevier Science Publishers B.V., Netherlands, 1989.

15. J.R. Wright, J.E. Zielinski, E.M. Horton: Expert Systems Development: The ACE System. In [13], pp. 45-72, 1988.

16. G.M. Slawsky, D.J. Sassa: Expert Systems for Network Management and Control in Telecommunications at Bellcore. In [13], pp. 191-199, 1988.

17. S.K. Goyal, R.W. Worrest: Expert System Applications to Network Management. In [13], pp. 3-44, 1988.

18. C. Joseph, J. Kindrick, K. Muralidhar, C. So, T. Toth-Fejel: MAP fault management expert system. In [14], pp. 627-636, 1989.

19. T. Yamahira, Y. Kiriha, S. Sakata: Unified fault management scheme for network troubleshooting expert system. In [14], pp. 637-646, 1989.


20. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth: From Data Mining to Knowledge Discovery: An Overview. In: Advances in Knowledge Discovery & Data Mining, AAAI Press & The MIT Press, California, pp. 1-34, 1996.

21. R.J. Brachman, T. Anand: The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In: Advances in Knowledge Discovery & Data Mining, AAAI Press & The MIT Press, California, pp. 37-57, 1996.

22. R. Uthurusamy: From Data Mining to Knowledge Discovery: Current Challenges and Future Directions. In: Advances in Knowledge Discovery & Data Mining, AAAI Press & The MIT Press, California, pp. 561-569, 1996.

23. R. Sterritt: Discovering Rules for Fault Management. Proceedings of the IEEE International Conference on the Engineering of Computer Based Systems (ECBS), Washington DC, USA, April 17-20, pp. 190-196, 2001.

24. R. Sterritt: Fault Management and Soft Computing. Proceedings of the International Symposium on Soft Computing and Intelligent Systems for Industry, Paisley, Scotland, UK, June 26-29, 2001.

25. R. Sasisekharan, V. Seshadri, S.M. Weiss: Proactive network maintenance using machine learning. In: Proceedings of the 1994 Workshop on Knowledge Discovery in Databases, pp. 453-462, 1994.

26. A. Danyluk, F. Provost: Small disjuncts in action: Learning to diagnose errors in the telephone network local loop. In: Proceedings of the Tenth International Conference on Machine Learning, 1993.

27. F. Provost, A. Danyluk: A Study of Complications in Real-world Machine Learning. Technical Report, NYNEX, 1996.

28. G. Goldszmidt, Y. Yemini: Evaluating management decisions via delegation. In: H.G. Hegering, Y. Yemini (Eds.), Integrated Network Management, III, pp. 247-257, Elsevier Science Publishers B.V., 1993.

29. R.M. Goodman, B. Ambrose, H. Latin, S. Finnell: A hybrid expert system/neural network traffic advice system. In: H.G. Hegering, Y. Yemini (Eds.), Integrated Network Management, III, pp. 607-616, Elsevier Science Publishers B.V., 1993.

30. R.M. Goodman, H. Latin: Automated knowledge acquisition from network management databases. In: I. Krishnan, W. Zimmer (Eds.), Integrated Network Management, II, pp. 541-549, Elsevier Science Publishers B.V., 1991.

31. S.K. Goyal: Knowledge technologies for evolving networks. In: I. Krishnan, W. Zimmer (Eds.), Integrated Network Management, II, pp. 439-461, Elsevier Science Publishers B.V., 1991.

32. K. Hatonen, M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen: Knowledge Discovery from Telecommunication Network Alarm Databases. Proc. 12th Int. Conf. on Data Engineering (ICDE'96), pp. 115-122, 1996.

33. T. Oates, D. Jensen, P.R. Cohen: Discovering rules for clustering and predicting asynchronous events. In: A. Danyluk (Ed.), Predicting the Future: AI Approaches to Time-Series Problems, Technical Report WS-98-07, AAAI Press, pp. 73-79, 1998.

34. M. Cheikhrouhou, P. Conti, R. Oliveira, J. Labetoulle: Intelligent agents in network management: A state-of-the-art. Networking and Information Systems Journal, 1, pp. 9-38, Jun. 1998.

35. M. Cheikhrouhou, P. Conti, J. Labetoulle, K. Marcus: Intelligent Agents for Network Management: Fault Detection Experiment. In: Sixth IFIP/IEEE International Symposium on Integrated Network Management (IM'99), Boston, USA, May 1999.


36. Y.A. Sekercioglu, A. Pitsillides, A. Vasilakos: Computational Intelligence in Management of ATM Networks: A Survey of Current Research. Proc. ERUDIT Workshop on Application of Computational Intelligence Techniques in Telecommunication, London, 1999.

37. B. Azvine, N. Azarmi, K.C. Tsui: Soft computing - a tool for building intelligent systems. BT Technology Journal, Vol. 14, No. 4, pp. 37-45, Oct. 1996.

38. T. Clarkson: Applications of Neural Networks in Telecommunications. Proc. ERUDIT Workshop on Application of Computational Intelligence Techniques in Telecommunication, London, UK, 1999.

39. H. Wietgrefe, K. Tochs, et al.: Using neural networks for alarm correlation in cellular phone networks. In: International Workshop on Applications of Neural Networks in Telecommunications, 1997.

40. Dorffner: Report for NEuroNet, http://www.kcl.ac.uk/neuronet, 1999.

41. R.D. Gardner, D.A. Harle: Alarm Correlation and Network Fault Resolution using the Kohonen Self-Organising Map. Globecom-97, 1997.

42. R. Sterritt, K. Adamson, M. Shapcott, D. Bell, F. McErlean: Using A.I. for the Analysis of Complex Systems. Proc. Int. Conf. Artificial Intelligence and Soft Computing, pp. 113-116, 1997.

43. R. Sterritt, W. Liu: Constructing Bayesian Belief Networks for Fault Management in Telecommunications Systems. 1st EUNITE Workshop on Computational Intelligence in Telecommunications and Multimedia at EUNITE 2001, pp. 149-154, Dec. 2001.

44. J. Cheng, D.A. Bell, W. Liu: An algorithm for Bayesian network construction from data. Proceedings of the 6th International Workshop on Artificial Intelligence and Statistics (AI&STAT'97), 1997.

45. C.J.K. Chow, C.N. Liu: Approximating discrete probability distributions with dependence trees. IEEE Trans. Information Theory, Vol. 14(3), pp. 462-467, 1968.



Enabling Multimedia QoS Control with Black-Box Modelling

Gianluca Bontempi and Gauthier Lafruit

IMEC/DESICS/MICS, Kapeldreef 75
B-3001 Heverlee, Belgium
Gianluca.Bontempi, [email protected]

http://www.imec.be/mics

Abstract. Quality of Service (QoS) methods aim at trading quality against resource requirements to meet the constraints dictated by the application functionality and the execution platform. QoS is relevant in multimedia tasks since these applications are typically scalable systems. To exploit the scalability property for improving quality, a reliable model of the relation between scalable parameters and quality/resources is required. The traditional QoS approach requires a deep knowledge of the execution platform and a reasonably accurate prediction of the expected configurations. This paper proposes an alternative black-box data analysis approach. The advantage is that it requires no a priori assumptions about the correlation between quality/resources and parameters and it can easily adapt to situations of high complexity, changing platforms and heterogeneous environments. Some preliminary experiments with the QoS modelling of the Visual Texture Coding (VTC) functionality of an MPEG-4 decoder using a local learning technique are presented to support the claim.

1 Introduction

Multimedia applications are dynamic, in the sense that they can dynamically change operational requirements (e.g., workload, memory, processing power), operational environments (e.g., mobile, hybrid), hardware support (e.g., terminals, platforms, networks) and functionality (e.g., compression algorithms). It would be highly valuable if such systems could adapt to all changes of configuration and still provide a dependable service to the user. This means that the application should be provided with a capacity for monitoring and evaluating online its own behaviour and consequently adjusting it in order to meet the agreed-upon goals [24]. This is feasible for scalable systems, where the required resources and the resulting functionality can be controlled and adapted by a number of parameters. A well-known example of a multimedia standard that supports scalability is MPEG-4 [17], where processing and power requirements can be tuned in order to circumvent run-time overloads at a minimal settled quality.

The adaptation of scalable applications is a topic that is typically addressed by the Quality of Service (QoS) discipline [27]. QoS traditionally uses expert knowledge and/or domain-specific information to cope with time-varying working conditions [19, 21]. Consequently, this approach requires a careful analysis of the functionality, a deep knowledge of the execution platform and a reasonably accurate prediction of the expected configurations.

This paper aims to extend the traditional QoS approach with the support of methods, techniques and tools coming from the world of intelligent data analysis [6, 28]. We intend to show that these techniques are promising in a multimedia context since they require few a priori assumptions about the system and the operating conditions, thus guaranteeing wide applicability and improved robustness.

The work proposes a black-box statistical approach which, based on a set of observations, addresses two main issues. The first is defining and identifying which features, within the huge set of parameters characterising a multimedia application, influence the operational requirements (e.g., execution time, memory accesses, processing power) and consequently the quality perceived by the user (e.g., responsiveness, timeliness, signal-to-noise ratio). The second is building a reliable predictive model of the resource requirements, taking as input the set of features selected in the previous step. To this aim, we adopt state-of-the-art linear and non-linear data mining techniques [14]. The resulting predictive model is expected to be an enabling factor for an automated QoS-aware system that should guarantee the required quality by tuning the scalable parameters of the applications [21]. Although the paper limits its experimental contribution to a modelling task, a QoS architecture integrating the proposed black-box procedure with adaptive control will be introduced and discussed.

As a case study, we consider the Visual Texture Coding (VTC) algorithm [20], a wavelet-based image compression algorithm for the MPEG-4 standard. For a fixed platform, we show that it is possible to predict the execution time and the number of memory accesses of the VTC MPEG-4 decoder with reasonable accuracy, once the values of the relevant scalable parameters are known (e.g., the number of wavelet levels). The mapping between resource requirements and scalability parameters is approximated by using a data mining approach [14]. In particular, we compare and assess linear approaches and non-linear machine learning approaches on the basis of a finite amount of data. The measurements come from a set of 21 test images, encoded and decoded by the MoMuSys (Mobile Multimedia Systems) reference code.

The experimental results will show that a non-linear approach, namely the lazy learning approach [10], is able to outperform the conventional linear method. This outcome shows that modelling the resource requirements of a multimedia algorithm is a difficult task that can be tackled successfully with the support of black-box approaches. Note that, although the paper is limited to modelling the resource requirements, we deem that the proposed methodology is generic enough to be applicable to the modelling of quality attributes.

The remainder of the paper is structured as follows. Section 2 presents the quality of service problem in the framework of automatic control problems. Section 3 introduces a statistical data analysis procedure to model the mapping between the scalable parameters and the resource workload. Section 4 analyses how the VTC resource modelling problem can be tackled according to the data-driven procedure described in Section 3. The experimental setting and the results are reported in Section 5. Conclusions and future work are discussed in Section 6.


2 A Control Interpretation of the QoS Problem

This section introduces the QoS problem as a standard problem of adaptive control, where a system (e.g., the multimedia application) has to learn how to adapt to an unknown environment (e.g., a new platform, configuration or running mode) and reach a specific goal (e.g., the quality requirements). Like any adaptive control problem, a QoS problem can be decomposed into two subproblems: a modelling problem and a regulation problem. The following sections discuss these two steps in detail.

2.1 Modelling for Quality of Service

The aim of a QoS procedure is to set the scalable parameters properly in order to meet the quality requirements and satisfy the resource constraints. This requires an accurate description of these two relations:
1. The relation linking the scalable parameters characterising the application (e.g., the number of wavelet levels in an MPEG-4 still texture codec) to the required resources (e.g., power, execution time, memory).
2. The relation linking the same scalable parameters to the perceived quality (e.g., the signal-to-noise ratio).

It is important to remark that these relations are unknown a priori and that a modelling effort is required to characterise them. Two conventional methods for approaching a modelling problem in the literature are:

− A white-box approach, which, starting from expert-based knowledge, aims at defining how the parameters of the algorithm are related to the resource usage and the perceived quality [19, 20].

− A black-box statistical approach, which, based on a sufficient amount of observations, aims at discovering which scalable parameters are effective predictors of the performance of the application [11]. Note that this approach requires less a priori knowledge and allows a continuous refinement and adaptation based on the incoming information. Moreover, a quantitative assessment of the predictive capability can be returned together with the model.

This paper will focus on the second approach and will apply it to the modelling of the relation between scalable parameters and resource requirements. To this aim, we suggest the adoption of supervised learning methods. Supervised learning [13] addresses the problem of modelling the relation between a set of input variables and one or more output variables, which are considered somewhat dependent on the inputs (Fig. 1), on the basis of a finite set of input/output observations. The estimated model is essentially a predictor which, once fed with a particular value of the input variables, returns a prediction of the value of the output.

The goal is to obtain reliable generalisation, i.e. that the predictor, calibrated on the basis of a finite set of observed measures, is able to return an accurate prediction of the dependent variable when a previously unseen value of the independent vector is presented. In other words, this technique aims to discover and to assess, on the basis of observations only, potential correlations between sets of variables and to use these correlations to extrapolate to new scenarios.


Fig. 1. Data-driven modelling of an input/output phenomenon (block diagram: the phenomenon's input/output observations feed a model, whose prediction is compared with the observed output to give the prediction error).

The following section will show the relevance of black-box input/output models in the context of a QoS policy.

2.2 A QoS-Aware Adaptive Control Architecture

The discipline of automatic control offers techniques for developing controllers that adjust the inputs of a given system in order to drive the outputs to some specified reference values [15]. Typical applications include robotics, flight control and industrial plants, where the process to be controlled is represented as an input/output system. Similarly, once we model a multimedia algorithm as an input/output process, it is possible to extend control strategies to multimedia quality of service issues. In this context, the goal is to steer properly the inputs of the algorithm (e.g. the scalable parameters) to drive the resource load and the quality to the desired values.

In formal terms we can represent the QoS control problem by the following notation. Let us suppose that at time t we have A concurrent scalable applications and R constrained resources. Each resource has a finite capacity rmax_i(t), i = 1, ..., R, that can be shared, either temporally or spatially. For example, CPU and network bandwidth would be time-shared resources, while memory would be a spatially shared resource. For each application a = 1, ..., A we assume the existence of the following relations between the scalable parameters s_j^a, j = 1, ..., m, and the resource usage r_i^a, and between the scalable parameters s_j^a and the quality metrics q_k^a, k = 1, ..., K:

    r_i^a(t) = f_i(s_1^a(t), s_2^a(t), ..., s_m^a(t), E(t))
    q_k^a(t) = g_k(s_1^a(t), s_2^a(t), ..., s_m^a(t), E(t))                    (1)

where s_j^a denotes the jth scalable parameter of the ath application, K is the number of quality metrics for each application and E(t) accounts for the remaining relevant factors, like the architecture, the environmental configuration, and so on. We assume that the quantities q_k^a are strictly positive and proportional to the quality perceived by the user. At each time instant the following constraints should be met


    r_i(t) = Σ_{a=1}^{A} r_i^a(t) ≤ rmax_i(t),    i = 1, ..., R
    q_k^a(t) ≥ Q_k^a,                             k = 1, ..., K                (2)

where r_i(t) denotes the amount of the ith resource which is occupied at time t by the A applications and Q_k^a is the lower-bound threshold for the kth quality attribute.

The goal of the QoS control policy can be quantified in the following terms: at each time t maximise the quantity

    Σ_{a=1}^{A} Σ_{k=1}^{K} w_k^a q_k^a(t)                                     (3)

while respecting the constraints in (2). Note that the terms w_k^a in (3) denote the weighted contributions of the different quality metrics in the different applications to the overall quality perceived by the user. In order to achieve this goal a QoS control policy should implement a control law on the scalable parameters, like

    s_1^a(t+1) = u_1^a(s_1^a(t), ..., s_m^a(t), r_1^a(t), ..., r_R^a(t), q_1^a(t), ..., q_K^a(t), E(t))
    ...
    s_m^a(t+1) = u_m^a(s_1^a(t), ..., s_m^a(t), r_1^a(t), ..., r_R^a(t), q_1^a(t), ..., q_K^a(t), E(t))    (4)

This control problem is daunting if we assume generic non-linear relations in (1). To make the resolution more affordable, a common approach is to decompose the control architecture into two levels [21]:

1. A resource management level that sets the target quality q̄_k^a and target resource r̄_i^a for each application in order to maximise the global cost function (3).
2. An application-level controller which acts on the scalable parameters of each application in order to meet the targets q̄_k^a and r̄_i^a fixed by the resource management level.

In the rest of the section we will limit our discussion to the application-level controller. For examples of resource management approaches we refer the interested reader to [21, 12] and the references therein.

The application-level controller can be implemented by adopting conventional control techniques. Traditional control approaches assume a complete knowledge of the system to be controlled, that is, of the relations f and g in (1). When this knowledge is not available, or is too difficult to express in analytical form, an adaptive strategy is required [2]. The idea consists in combining a learning procedure (like the one sketched in Section 2.1) with a regulation strategy. The regulation module can then exploit the up-to-date information coming from the learned model in order to drive the system towards the desired goal (Fig. 2).

In other words, an adaptive controller assumes that the model is a true representation of the process (certainty equivalence principle [2]) and, based on this information, sets the inputs to the values supposed to bring the system to the desired configuration. In a multimedia context, the controller role is played by the QoS policy that, given the current configuration, tunes the scalable parameters in order to adjust the resource usage and/or the perceived quality.
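As an illustration of the certainty-equivalence idea (our own sketch, not the authors' controller), an application-level QoS step could query the learned model for the predicted effect of each candidate parameter setting and apply the one whose predicted resource usage is closest to the target:

def qos_control_step(current_params, current_usage, target_usage,
                     candidate_params, predict_delta):
    # predict_delta(new_params, old_params) -> predicted change in resource usage
    # (an assumed interface to a learned input/output model)
    best_params, best_gap = current_params, abs(current_usage - target_usage)
    for cand in candidate_params:
        predicted_usage = current_usage + predict_delta(cand, current_params)
        gap = abs(predicted_usage - target_usage)
        if gap < best_gap:
            best_params, best_gap = cand, gap
    return best_params   # parameter setting to apply at the next time step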

Fig. 2. Adaptive control system. The adaptive controller exploits the information returned by the learned model in order to drive the output of the I/O phenomenon to the target values.

A large number of methods, techniques and results are available in the automatic control discipline to deal with adaptive control problems [2]. An example of an adaptive control approach for QoS in telecommunications (congestion control) is proposed in [25].

The experimental part of the paper will focus only on the learning module of the QoS adaptive architecture depicted in Fig. 2. In particular, we will propose a data-analysis procedure to learn the relation f in (1) based on a limited number of observations.

3 A Data Analysis Procedure for QoS Modelling

In Section 2.1, we defined a QoS model as an input/output relation. Here, we propose a black-box procedure to model the relation between scalable parameters (inputs) and resource requirements (outputs). There are two main issues in a black-box approach: (i) the selection of the subset of scalable parameters to which the resources are sensitive and (ii) the estimation of the relation f between scalable parameters and resources. In order to address them, we propose a procedure composed of the following steps:

1. Selection of benchmarks. The first step of the procedure aims to select a representative family of benchmarks in order to collect significant measurements. For example, in the case of video multimedia applications this family should cover a large spectrum of streaming formats and contents if we require a high degree of generalisation from the QoS model [28].

2. Definition of the target quantities. The designer must choose the most critical quantities (e.g., resources and/or quality metrics in (1)) to be predicted. In this paper we will focus only on the modelling of the resource requirements for a single application (A=1 in Equation (1)). We then denote with r the vector of target quantities.

3. Definition of the input variables. In multimedia applications the number of parameters characterising the functionality is typically very high, making the procedure extremely complex. We propose a feature selection approach [18] to tackle this problem. This means that we start with a large set of parameters that might reasonably be correlated with the targets and we select among them the ones that are statistically relevant for obtaining sufficient accuracy. We denote with s the selected set of features. Note that in the literature these parameters are also called "knobs" [22] for their capacity of steering the target to the desired values.

4. Data collection. For each sample benchmark, we measure the values of the target quantities obtained by sweeping the input parameters over some predefined ranges. The full set of samples is stored in a dataset D of size N.

5. Modelling of the input/output relation on the basis of the data collected in step 4. The dataset D is used to estimate the input/output relation r = f(s). Note that for simplicity we will assume here that the resource requirements depend only on the scalable parameters, that is, the vector E in (1) is empty. We propose the utilisation and comparison of linear and non-linear models. Linear models assume the existence of linear relations

    r = d_0 + Σ_{j=1}^{m} d_j s_j

between the input s and the output r (a small numerical sketch of this linear case is given after this list). Non-linear models r = f(s) make less strong assumptions about the analytical form of the relationship. The role of the data analysis technique is indeed to select, based on the available observations, the form of f which returns the best approximation of the unknown relation [6].

6. Validation. Once the model is estimated, it is mandatory to evaluate how the prediction performance deteriorates when the scenario changes, or in other terms how well the calibrated model is able to generalise to new scenarios.
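As a small numerical sketch of the linear case mentioned in step 5 (our own synthetic stand-in data, not measurements from the paper), the coefficients d_0 and d_j can be estimated by ordinary least squares:

import numpy as np

rng = np.random.default_rng(0)
S = rng.uniform(0.0, 1.0, size=(200, 3))             # stand-in N x m matrix of scalable parameters
r = 1.5 + 2.0 * S[:, 0] - 0.7 * S[:, 2] + 0.05 * rng.normal(size=200)   # synthetic "resource" measurements

A = np.hstack([np.ones((len(S), 1)), S])              # prepend a column of ones for d_0
d, *_ = np.linalg.lstsq(A, r, rcond=None)             # d[0] = d_0, d[1:] = d_j
print("estimated coefficients:", d)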

4 The VTC/MPEG-4 Modelling Problem

The procedure described in the previous section has been instantiated for a real QoS modelling problem: predicting the resource requirements of the VTC wavelet-based algorithm of the MPEG-4 decoder. VTC is basically a pipeline of a Wavelet Transform (WT), a Quantisation (Q) and a Zero-Tree (ZTR) based entropy coding (arithmetic coder) module. The Wavelet Transform represents a digital image (made of 3 color components: y, u, and v) as a hierarchy of levels, where the first one, the DC image, represents an anti-aliased downsampled version of the original image. All other levels represent additional detail information that enables the correct upsampling of the DC image back to the original one. Quantisation is the process of thresholding wavelet coefficients prior to entropy coding. There are three modes of quantisation in VTC: Single Quantisation (SQ), Multi-level Quantisation (MQ) and Bi-level Quantisation (BQ). For further details on VTC we refer the reader to [20].

These are the different steps of the procedure we followed in order to model the resource requirements:

1. Benchmark selection: we chose 21 image test files in yuv format. Besides the Lena picture, the images are extracted from 4 different AVI videos: Akiyo, IMECnology, Mars, and Mother and Daughter. Table 1 reports all the images' names with the corresponding formats (width x height). The reference software version is the MoMuSys (Mobile Multimedia Systems) reference code. The reference microprocessor is an HP J7000/4 at 440 MHz.

Table 1. Benchmark images (width x height)

1-Akiyo   352x288     7-Imec6    352x288    13-Lena    256x256     19-Mars3   1248x896
2-Imec1   232x190     8-Imec7    433x354    14-Imec12  878x719     20-MoDaug  176x144
3-Imec2   528x432     9-Imec8    352x288    15-Imec13  1056x864
4-Imec3   352x288     10-Imec9   387x317    16-Imec14  1232x1008
5-Imec4   704x576     11-Imec10  317x259    17-Mars1   936x672
6-Imec5   1252x1024   12-Imec11  352x288    18-Mars2   1280x919

2. Target quantities: we define with r the vector of resource requirements composed of
• td: the decoding execution time in seconds,
• rd: the total number of decoding read memory accesses, and
• wd: the total number of decoding write memory accesses.
We will refer with ri to the ith component of the vector r.

3. Input parameters: the initial set of inputs which is assumed to influence the value of the target quantities is made of
• w: Image width.
• h: Image height.
• l: Number of wavelet decomposition levels. In the experiments the value of l ranges over the interval [2..4].
• q: Quantisation type. It assumes three discrete values: 1 for SQ, 2 for MQ, 3 for BQ.
• n: Target SNR level. In the experiments it ranges over the interval [1..3].
• y: Quantisation level QDC_y. This number represents the number of levels used to quantise the DC coefficients of the y-component of the image. It assumes two discrete values: 1 and 6.
• u: Quantisation level QDC_uv. This number represents the number of levels used to quantise the DC coefficients of the u-component and the v-component of the image. It assumes two discrete values: 1 and 6.
• ee: Encoding execution time. This is the time required for encoding the picture with the same reference software. This variable is considered as an input only for predicting td.
• re: Total number of encoding read memory accesses. This variable is considered as an input only for predicting rd and wd.
• we: Total number of encoding write memory accesses. This variable is considered as an input only for predicting rd and wd.


We define with s the vector [w, h, l, q, n, y, u, ee, re, we] and with S the domain of values of the vector s. We will refer with si to the ith component of the vector s. In Section 5.1 we will present a feature selection procedure to reduce the size of the vector s, aiming at taking into consideration only that subset of s which is effectively critical to predict the target quantities.

4. Data collection. For each test image we collect 108 measurements by sweeping the input s over the ranges defined in the previous section. The total number of measurements is N=2484. The execution time is returned by the timex command of the HPUX operating system. The total numbers of read and write memory accesses are measured by instrumenting the code with the Atomium profiling tool [3, 4].

5. Model estimation. Two different models are estimated on the basis of the collected measurements. The first one is an input/output model m1 of the relation f between the input scalable parameters and the value of the target variables:

    r = m1(s)                                     (5)

The second is an input/output model m2 of the relation between the change of the input parameters and the change of the value of the target variables:

    r(t) - r(t-1) = m2(s(t), s(t-1))              (6)

The model m2 is relevant in order to enable a QoS control policy. On the basis of the information returned by the model m2, the QoS controller can test what change ∆s in the parameter configuration may induce the desired change ∆r in the requirements.

Both linear and non-linear models are taken into consideration to model m1 and m2. In particular, we adopt the multiple linear regression technique [13] to identify the linear model. Among the large number of results in statistical non-linear regression and machine learning [6, 28], we propose the adoption of a method of locally weighted regression called lazy learning [10]. Lazy learning is a memory-based technique that, once an input is received, extracts a prediction by interpolating locally the examples which are considered similar according to a distance metric. This method has proved effective in many problems of non-linear data modelling [7, 10] and was successfully applied to the problem of multivariate regression proposed by the NeuroNet CoIL Competition [9]; a minimal sketch of such a memory-based predictor is given after this list.

6. Validation. We adopt a training-and-test procedure, which means that the original data set, made of N samples, is decomposed v times into two non-overlapping sets, namely
− the training set, made of Ntr samples, used to train the prediction model, and
− the test set, composed of Nts = N - Ntr samples, used to validate the prediction model according to some error criterion.
After having trained and tested the prediction model v times, the generalisation performance of the predictor is evaluated by averaging the error criterion over the v test sets. Note that the particular case where v = N, Nts = 1 and Ntr = N - 1 is generally denoted as leave-one-out (LOO) validation in the statistical literature [24].
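Following up on steps 5 and 6, here is a minimal sketch, under our own assumptions, of a distance-weighted local linear predictor in the spirit of lazy learning [10] (it is not the Lazy Learning Toolbox itself) together with a simple leave-one-out loop; the per-image hold-out used in Section 5 replaces the single-sample hold-out in the obvious way.

import numpy as np

def lazy_predict(X, y, x_query, k=15):
    # locally weighted linear regression on the k nearest stored examples
    d = np.linalg.norm(X - x_query, axis=1)
    idx = np.argsort(d)[:k]
    w = np.sqrt(1.0 / (d[idx] + 1e-9))                # closer examples weigh more
    A = np.hstack([np.ones((len(idx), 1)), X[idx]])
    beta, *_ = np.linalg.lstsq(A * w[:, None], y[idx] * w, rcond=None)
    return beta[0] + beta[1:] @ x_query

def leave_one_out_mse(X, y, k=15):
    # hold out one sample at a time, predict it from the rest, average the squared error
    errs = []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        pred = lazy_predict(X[keep], y[keep], X[i], k)
        errs.append((pred - y[i]) ** 2)
    return float(np.mean(errs))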


5 The Experimental Results

5.1 Feature Selection

The first step of the modelling procedure aims at reducing the complexity of the input vector. The goal is to select which variables among the ones contained in the vector s are effectively correlated with the target quantities. There are several popular feature selection algorithms [18]. Here, we adopt an incremental feature selection approach called forward selection [13]. The method starts with an empty set of variables and incrementally adds new variables, testing, for each of them, the predictive accuracy of the model. At each step, the variable that guarantees the largest decrease of the prediction error is added to the set. The procedure stops when the accuracy no longer improves, or deteriorates due to over-parameterisation.
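A sketch of this forward selection loop is given below; loo_error is an assumed helper that returns the leave-one-out error of a model (for example the lazy-learning predictor sketched earlier) trained on a given feature subset.

def forward_selection(all_features, loo_error):
    # grow the feature set greedily, one variable at a time
    selected, best_err = [], float("inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            return selected
        # try adding each remaining feature and keep the one with the lowest error
        errs = {f: loo_error(selected + [f]) for f in candidates}
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:                  # no further improvement: stop
            return selected
        selected.append(f_best)
        best_err = errs[f_best]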

Figure 3 reports the estimated prediction accuracy against the set of features (represented by their numbers) for a non-linear lazy learning model that predicts the variable r1. Note that the prediction accuracy is measured by a leave-one-out procedure [24]. We choose the set made of features no. 1 (i.e., w), no. 2 (i.e., h), no. 3 (i.e., l), no. 4 (i.e., q) and no. 8 (i.e., ee), as according to the figure this set returns the lowest error in predicting r1.

Fig. 3. Feature selection. The x-axis reports the candidate feature subsets, grown incrementally from {8} to {8, 4, 2, 3, 1, 6, 5}, and the y-axis the Mean Squared Error (MSE) estimated by the leave-one-out procedure.

Using the same procedure for the other two target quantities, we obtain that the best feature subsets are w, h, l, q, re for r2 and w, h, l, q, we for r3, respectively.

These feature sets will be used in the rest of the experiments as input vectors of the prediction models.

5.2 Estimation of Model m1

We compare linear and non-linear models m1 (Equation (5)) by adopting a training-and-test setting. This means that we perform 21 experiments where each time the training set contains all the images except one, which is set aside to validate the prediction accuracy of the method. The predictive accuracy of the models is assessed by their percentage error (PE). Table 2 reports the average of the 21 percentage errors for the linear and non-linear models.

Table 2. Predictive accuracy (in percentage error) of the linear and non-linear approaches to modelling the relation m1.

Target                                  Linear PE    Non-linear PE
Decoding execution time r1              15.8 %       3.8 %
Decoding read memory accesses r2        39.9 %       6.0 %
Decoding write memory accesses r3       40.3 %       4.6 %

Figure 4 reports the real values of r1 (decoding execution time) and the predictions returned by the linear model for images no. 2 and no. 6. The y-axis reports the execution time for a specific input configuration; each point on the x-axis represents a different input configuration.

Fig. 4. Predictions of the linear model for image no. 2 (left) and no. 6 (right); each plot shows the measured execution time (sec.) and the predictions over the input configurations.

Similarly, Figure 5 reports the real values of r1 (decoding execution time) and the predictions returned by the non-linear model for images no. 2 and no. 6.

5.3 Estimation of the Model m2

The predictive accuracy of the model m2 in (6) is assessed by counting the number of times that the model is unable to predict the sign of the change of the output value (r(t) - r(t-1)) for a given input change (s(t-1) → s(t)). Given the relevance of this model for the QoS control policy, we deem that the number of incorrect sign-change predictions is more relevant than the average prediction error.

We adopt the same training-and-test validation procedure as in the previous section. We present two different experimental settings: a non-incremental one, where the training set is kept fixed, and an incremental one, where the training set is augmented each time a new observation becomes available. Note that the goal of the incremental experiment is to test the adaptive capability of the modelling algorithms when the set of observations is updated on-line.
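The sign-based error criterion can be computed as in the following sketch (our illustration; the array names are assumptions), where pred_delta and true_delta hold the predicted and measured changes r(t) - r(t-1) over the test pairs.

import numpy as np

def sign_error_rate(pred_delta, true_delta):
    # percentage of test pairs where the predicted and measured changes disagree in sign
    wrong = np.sign(pred_delta) != np.sign(true_delta)
    return 100.0 * float(np.mean(wrong))

print(sign_error_rate(np.array([0.3, -0.1, 0.2]), np.array([0.5, 0.4, 0.1])))   # -> 33.33...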


Fig. 5. Predictions of the non-linear model for image no. 2 (left) and no. 6 (right); each plot shows the measured execution time (sec.) and the predictions over the input configurations.

Table 3 reports the results for the non-incremental case. The prediction error is measured by the percentage of times that the predictive model returns a wrong prediction of the sign of the output change. Again, linear and non-linear predictors are assessed and compared.

Table 3. Non-incremental configuration: the table reports the percentage of times that the model is not able to predict the sign of the change of the output value for a given input change.

Target                                  Linear       Non-linear
Decoding execution time r1              23.5 %       7.5 %
Decoding read memory accesses r2        46.7 %       9.4 %
Decoding write memory accesses r3       19.9 %       10.2 %

Table 4 reports the results for the incremental case with the non-linear approach. It is interesting to note the improvement of the prediction accuracy for all three target quantities, although only a limited number of samples is added on-line to the initial dataset.

Table 4. Incremental configuration: the table reports the percentage of times that the model is not able to predict the sign of the change of the output value for a given input change.

Target                                  Non-linear
Decoding execution time r1              3.4 %
Decoding read memory accesses r2        6.4 %
Decoding write memory accesses r3       4.7 %


6 Conclusions

The paper presented the preliminary results of a modelling approach suitable for an adaptive control implementation of QoS techniques for multimedia applications.

The results presented in Table 2, Table 3 and Table 4, albeit obtained on the basis of a limited number of examples, show that it is possible to estimate a model returning an accurate prediction of the resource requirements, both in terms of execution time and memory accesses. In particular, the experiments suggest that non-linear models outperform linear models in a significant way, showing the intrinsic complexity of the modelling task. Moreover, we show, in concordance with other published results [8], that the non-linear technique we have proposed is also robust in an adaptive setting where the number of samples increases on-line.

Future work could take several directions:
• Extending the work to multiple platforms and architectures.
• Exploring prediction models of some quantitative attributes of the quality.
• Integrating the prediction models in a control architecture responsible for negotiating online the quality demands vs. the resource constraints.
• Integrating the application-level control with a higher system-level control mechanism (e.g., a resource manager).
We are convinced that the promising results of this work will act as a driving force for future black-box approaches to QoS policies.

References

1. Abdelzaher, T.F.: An Automated Profiling Subsystem for QoS-Aware Services. In: Proceedings of the Sixth IEEE Real-Time Technology and Applications Symposium, RTAS 2000 (2000).

2. Astrom, K.J.: Theory and Applications of Adaptive Control - A Survey. Automatica, 19, 5 (1983) 471-486.

3. ATOMIUM: http://www.imec.be/atomium/.

4. Bormans, J., Denolf, K., Wuytack, S., Nachtergaele, L., Bolsens, I.: Integrating System-Level Low Power Methodologies into a Real-Life Design Flow. In: PATMOS'99, Ninth International Workshop on Power and Timing Modeling, Optimization and Simulation (1999) 19-28.

5. Birattari, M., Bontempi, G.: The Lazy Learning Toolbox, For Use with Matlab. Technical Report TR/IRIDIA/99-7, Université Libre de Bruxelles (1999) (http://iridia.ulb.ac.be/~gbonte/Papers.html).

6. Bishop, C.M.: Neural Networks for Statistical Pattern Recognition. Oxford University Press, Oxford, UK (1994).

7. Bontempi, G.: Local Learning Techniques for Modeling, Prediction and Control. PhD dissertation, IRIDIA, Université Libre de Bruxelles, Belgium (1999).

8. Bontempi, G., Birattari, M., Bersini, H.: Lazy learning for modeling and control design. International Journal of Control, 72, 7/8 (1999) 643-658.

9. Bontempi, G., Birattari, M., Bersini, H.: Lazy Learners at Work: The Lazy Learning Toolbox. In: EUFIT '99 - The 7th European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany (1999).

10. Bontempi, G., Birattari, M., Bersini, H.: A model selection approach for local learning. Artificial Intelligence Communications, 13, 1 (2000) 41-48.


11. Bontempi, G., Kruijtzer, W.: A Data Analysis Method for Software Performance Prediction. In: Design Automation and Test in Europe, DATE 2002 (2002).

12. Brandt, S., Nutt, G., Berk, T., Mankovich, J.: A dynamic quality of service middleware agent for mediating application resource usage. In: Proceedings of the 19th IEEE Real-Time Systems Symposium (1998) 307-317.

13. Draper, N.R., Smith, H.: Applied Regression Analysis. John Wiley and Sons, New York (1981).

14. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39, 11 (1996) 27-34.

15. Franklin, G., Powell, J.: Digital Control of Dynamic Systems. Addison Wesley (1981).

16. Jain, R.: Control-theoretic Formulation of Operating Systems Resource Management Policies. Garland Publishing Companies (1979).

17. Koenen, R.: MPEG-4: Multimedia for our time. IEEE Spectrum (1999) 26-33.

18. Kohavi, R., John, G.H.: Wrappers for Feature Subset Selection. Artificial Intelligence, 97, 1-2 (1997) 273-324.

19. Lafruit, G.: Computational Graceful Degradation Methodology. IMEC Technical Report (2000).

20. Lafruit, G., Vanhoof, B.: MPEG-4 Visual Texture Coding: Variform, yet Temperately Complex. In: IWSSIP, the 8th International Workshop on Systems, Signals and Image Processing, Romania (2001) 63-66.

21. Li, B., Nahrstedt, K.: A Control-based Middleware Framework for Quality of Service Adaptations. IEEE Journal on Selected Areas in Communications, Special Issue on Service Enabling Platforms, 17, 9 (1999) 1632-1650.

22. Li, B., Kalter, W., Nahrstedt, K.: A Hierarchical Quality of Service Control Architecture for Configurable Multimedia Applications. Journal of High-Speed Networks, Special Issue on QoS for Multimedia on the Internet, IOS Press, 9 (2000) 153-174.

23. Lu, Y., Saxena, A., Abdelzaher, T.F.: Differentiated caching services; a control-theoretical approach. In: 21st International Conference on Distributed Computing Systems (2001) 615-622.

24. Lu, C., Stankovic, J.A., Abdelzaher, T.F., Tao, G., Sao, S.H., Marley, M.: Performance specifications and metrics for adaptive real-time systems. In: Proceedings of the 21st IEEE Real-Time Systems Symposium (2000) 13-23.

25. Pitsillides, A., Lambert, J.: Adaptive congestion control in ATM based networks: quality of service with high utilisation. Journal of Computer Communications, 20 (1997) 1239-1258.

26. Stone, M.: Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36, 1 (1974) 111-147.

27. Vogel, A., Kerherve, B., von Bochmann, G., Gecsei, J.: Distributed Multimedia and QoS: A Survey. IEEE Multimedia, 2, 2 (1995) 10-19.

28. Weiss, S.M., Kulikowski, C.A.: Computer Systems That Learn. Morgan Kaufmann, San Mateo, California (1991).



Using Markov Chains for Link Prediction in Adaptive Web Sites

Jianhan Zhu, Jun Hong, and John G. Hughes

School of Information and Software Engineering, University of Ulster at Jordanstown, Newtownabbey, Co. Antrim, BT37 0QB, UK

jh.zhu, j.hong, [email protected]

Abstract. The large number of Web pages on many Web sites has raised navigational problems. Markov chains have recently been used to model user navigational behavior on the World Wide Web (WWW). In this paper, we propose a method for constructing a Markov model of a Web site based on past visitor behavior. We use the Markov model to make link predictions that assist new users to navigate the Web site. An algorithm for transition probability matrix compression has been used to cluster Web pages with similar transition behaviors and compress the transition matrix to an optimal size for efficient probability calculation in link prediction. A maximal forward path method is used to further improve the efficiency of link prediction. Link prediction has been implemented in an online system called ONE (Online Navigation Explorer) to assist users' navigation in the adaptive Web site.

1 Introduction

In a Web site with a large number of Web pages, users often have navigational questions, such as: Where am I? Where have I been? and Where can I go? [10]. Web browsers, such as Internet Explorer, are quite helpful. The user can check the URI address field to find where they are. Web pages on some Web sites also have a hierarchical navigation bar, which shows the current Web location. Some Web sites show the user's current position on a sitemap. In IE 5.5, the user can check the history list by date, site, or most visited to find where he/she has been. The history can also be searched by keywords. The user can backtrack where he/she has been by clicking the "Back" button or selecting from the history list attached to the "Back" button. Hyperlinks are shown in a different color if they point to previously visited pages.

We can see that the answers to the first two questions are satisfactory. To answer the third question, all the user can do is look at the links in the current Web page. On the other hand, useful information about Web users, such as their interests indicated by the pages they have visited, could be used to make predictions about the pages that might interest them. This type of information has not been fully utilized to provide a satisfactory answer to the third question. A good Web site should be able to help its users find answers to all three questions. The major goal of this paper is to provide an adaptive Web site [11] that changes its presentation and organization on the basis of link prediction to help users find the answer to the third question.


In this paper, by viewing the Web user's navigation in a Web site as a Markov chain, we can build a Markov model for link prediction based on past users' visit behavior recorded in the Web log file. We assume that the pages to be visited by a user in the future are determined by his/her current position and/or visiting history in the Web site. We construct a link graph from the Web log file, which consists of nodes representing Web pages, links representing hyperlinks, and weights on the links representing the numbers of traversals on the hyperlinks. By viewing the weights on the links as past users' implicit feedback on their preferences for the hyperlinks, we can use the link graph to calculate a transition probability matrix containing one-step transition probabilities in the Markov model.

The Markov model is further used for link prediction by calculating the conditional probabilities of visiting other pages in the future given the user's current position and/or previously visited pages. An algorithm for transition probability matrix compression is used to cluster Web pages with similar transition behaviors together to get a compact transition matrix. The compressed transition matrix makes link prediction more efficient. We further use a method called Maximal Forward Path to improve the efficiency of link prediction by taking into account only a sequence of maximally connected pages in a user's visit [3] in the probability calculation. Finally, link prediction is integrated with a prototype called ONE (Online Navigation Explorer) to assist Web users' navigation in the adaptive Web site.

In Section 2, we describe a method for building a Markov model for link prediction from the Web log file. In Section 3, we discuss an algorithm for transition matrix compression to cluster Web pages with similar transition behaviors for efficient link prediction. In Section 4, link prediction based on the Markov model is presented to assist users' navigation in a prototype called ONE (Online Navigation Explorer). Experimental results are presented in Section 5. Related work is discussed in Section 6. In Section 7, we conclude the paper and discuss future work.

2 Building Markov Models from Web Log Files

We first construct a link structure that represents pages, hyperlinks, and users' traversals on the hyperlinks of the Web site. The link structure is then used to build a Markov model of the Web site. A traditional method for constructing the link structure is Web crawling, in which a Web indexing program is used to build an index by following hyperlinks continuously from Web page to Web page. Weights are then assigned to the links based on users' traversals [14]. This method has two drawbacks. One is that some irrelevant pages and links, such as pages outside the current Web site and links never traversed by users, are inevitably included in the link structure, and need to be filtered out. Another is that the Webmaster can set up the Web site to exclude the crawler from crawling into some parts of the Web site for various reasons. We propose to use the link information contained in an ECLF (Extended Common Log File) [5] format log file to construct a link structure, called a link graph. Our approach has two advantages over crawling-based methods. Only relevant pages and links are used for link graph construction, and all the pages relevant to users' visits are included in the link graph.

Page 73: Soft-Ware 2002: Computing in an Imperfect World: First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8–10, 2002 Proceedings

62 J. Zhu, J. Hong, and J.G. Hughes

2.1 Link Graphs

A Web log file contains rich records of users' requests for documents on a Web site. ECLF format log files are used in our approach, since the URIs of both the requested documents and the referrers indicating where the requests came from are available. An ECLF log file is represented as a set of records corresponding to the page requests, WL = (e_1, e_2, ..., e_m), where e_1, e_2, ..., e_m are the fields in each record. A record in an ECLF log file might look like the one shown in Fig. 1:

177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] "GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 "http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

Fig. 1. ECLF Log File

The records of embedded objects in the Web pages, including graphical, video, and audio files, are treated as redundant requests and removed, since every request of a Web page will automatically initiate a series of requests for all the embedded objects in it. The records of unsuccessful requests are also discarded as erroneous records, since there may be bad links, missing or temporarily inaccessible documents, or unauthorized requests, etc. In our approach, only the URIs of the requested Web page and the corresponding referrer are used for link graph construction. We therefore have a simplified set WL_r = (r, u), where r and u are the URIs of the referrer and the requested page respectively. Since various users may have followed the same links in their visits, the traversals of these links are aggregated to get a set WL_s = (r, u, w), where w is the number of traversals from r to u. In most cases a link is the hyperlink from r to u. When "-" is in the referrer field, we assume there is a virtual link from "-" to the requested page. We call each element (r, u, w) in the set a link pair. Two link pairs l_i = (r_i, u_i, w_i) and l_j = (r_j, u_j, w_j) are said to be connected if and only if r_i = r_j, r_i = u_j, u_i = r_j, or u_i = u_j. A link pair set LS_m = (r_i, u_i, w_i) is said to connect to another link pair set LS_n = (r_j, u_j, w_j) if and only if, for every link pair l_j ∈ LS_n, there exists a link pair l_i ∈ LS_m such that l_i and l_j are connected.
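To make this aggregation step concrete, the following Python sketch (our illustration, not the authors' implementation) parses ECLF-style records and counts traversals per (referrer, requested page) pair to form the weighted link pairs of WL_s. The regular expression and the list of embedded-object suffixes are assumptions and would need adjusting to the actual log format.

import re
from collections import Counter

# Assumed ECLF-style layout: host ident user [time] "request" status bytes "referrer" "agent".
ECLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

EMBEDDED_SUFFIXES = ('.gif', '.jpg', '.jpeg', '.png', '.wav', '.mpg')  # assumed filter list

def link_pairs(log_lines):
    """Return {(referrer, uri): weight} built from successful page requests."""
    weights = Counter()
    for line in log_lines:
        m = ECLF_PATTERN.match(line.strip())
        if not m:
            continue                       # malformed record
        if m.group('status') != '200':
            continue                       # discard unsuccessful requests
        uri = m.group('uri')
        if uri.lower().endswith(EMBEDDED_SUFFIXES):
            continue                       # discard embedded-object requests
        referrer = m.group('referrer') or '-'
        weights[(referrer, uri)] += 1      # aggregate traversals of the same link
    return weights

Requests whose referrer field is "-" are kept, so the virtual links into the site's entrance pages are preserved in the aggregated set.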

Definition 2.1 (Maximally connected Link pair Set) Given a link pair set WL_s = (r_j, u_j, w_j), and a link pair set LS_m = (r_i, u_i, w_i) ⊂ WL_s, we say LS_n = (r_l, u_l, w_l) ⊂ WL_s is the Maximally connected Link pair Set (MLS) of LS_m on WL_s if and only if LS_m connects to LS_n and, for every link pair l_j ∈ (WL_s - LS_n), l_j and LS_m are not connected.

For a Web site with only one major entrance, the homepage, people can come to it in various ways. They might come from a page on another Web site pointing to the homepage, or follow a search result returned by a search engine pointing to the homepage. "-" in the referrer field of a page request record indicates that the user has typed the URI of the homepage directly into the address field of the browser, selected the homepage from his/her bookmarks, or clicked on a shortcut to this homepage. In all these cases the referrer information is not available. We select a set of link pairs LS_0 = (r_i, u_0, w_i), where r_i is "-", the URI of a page on another Web site, or the URI of a search result returned by a search engine, u_0 is the URI of the homepage, and w_i is the weight on the link, as the entrance to the hierarchy. We then look for the Maximally connected Link pair Set (MLS) LS_1 of LS_0 in WL_s - LS_0 to form the second level of the hierarchy. We look for LS_2 of LS_1 in WL_s - LS_0 - LS_1. This process continues until we get LS_k, such that WL_s - Σ_{i=0}^{k} LS_i = ∅ or LS_{k+1} = ∅.

For a Web site with a single entrance, we will commonly finish the link graph construction with (WL_s - Σ_{i=0}^{k} LS_i) = ∅, which means that every link pair has been put onto a certain level in the hierarchy. The levels in the hierarchy are from LS_0 to LS_k. For a Web site with several entrances, commonly found in multi-functional Web sites, the construction will end with LS_{k+1} = ∅ while (WL_s - Σ_{i=0}^{k} LS_i) ≠ ∅. We can then select a link pair set forming another entrance from (WL_s - Σ_{i=0}^{k} LS_i) to construct a separate link graph.

Definition 2.2 (Link Graph) The link graph of WL_s, a directed weighted graph, is a hierarchy consisting of multiple levels, LS_0, ..., LS_i, ..., LS_k, where LS_0 = (r_0, u_0, w_0), LS_i is the MLS of LS_{i-1} in WL_s - Σ_{j=0}^{i-1} LS_j, and WL_s - Σ_{j=0}^{k} LS_j = ∅ or LS_{k+1} = ∅.

We add the "Start" node to the link graph as the starting point for the user's visit to the Web site and the "Exit" node as the ending point of the user's visit. In order to ensure that there is a directed path between any two nodes in the link graph, we add a link from the "Exit" node to the "Start" node. Due to the influence of caching, the total weight on all incoming links of a page might not be the same as the total weight on all outgoing links. To solve this problem, we can either assign extra incoming weight to the link to the start/exit node or distribute extra outgoing weight to the incoming links.

Figure 2 shows a link graph we have constructed using a Web log file from the University of Ulster Web site, in which the title of each page is shown beside the node representing the page.


[Figure: a hierarchical graph whose nodes represent 12 pages (University of Ulster, Department, Information, Student, CS, Science & Arts, International Office, Library, Undergraduate, Graduate, Jobs, Register) plus the Start and Exit nodes, with the number of traversals shown on each link.]

Fig. 2. A Link Graph Constructed from a Web Log File on University of Ulster Web Site

2.2 Markov Models

Each node in the link graph can be viewed as a state in a finite discrete Markov model, which can be defined by a tuple <S, Q, L>, where S is the state space containing all the nodes in the link graph, Q is the probability transition matrix containing one-step transition probabilities between the nodes, and L is the initial probability distribution on the states in S. The user's navigation in the Web site can be seen as a stochastic process X_n, which has S as the state space. If the conditional probability of visiting page j in the next step, P_{i,j}^{(m)}, is dependent only on the last m pages visited by the user, X_n is called an m-order Markov chain [8]. Given that the user is currently at page i and has visited pages i_{n-1}, ..., i_0, P_{i,j}^{(m)} is only dependent on pages i, i_{n-1}, ..., i_{n-m+1}.

P_{i,j}^{(m)} = P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_0 = i_0)
            = P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_{n-m+1} = i_{n-m+1})    (1)


where the conditional probability of X_{n+1} given the states of all the past events is equal to the conditional probability of X_{n+1} given the states of the past m events. When m = 1, X_{n+1} is dependent only on the current state X_n, and P_{i,j} = P_{i,j}^{(1)} = P(X_{n+1} = j | X_n = i) defines a one-order Markov chain, where P_{i,j} is the probability that a transition is made from state i to state j in one step.

We can calculate the one-step transition probability from page i to page j using a link graph as follows, by considering the similarity between a link graph and a circuit chain discussed in [7]. The one-step transition probability from page i to page j, P_{i,j}, can be viewed as the fraction of traversals from i to j over the total number of traversals from i to other pages and the "Exit" node.

P_{i,j} = P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P(X_{n+1} = j | X_n = i) = w_{i,j} / Σ_k w_{i,k}    (2)

where w_{i,j} is the weight on the link from i to j, and w_{i,k} is the weight on a link from i to k. Now a probability transition matrix, which represents the one-step transition probability between any two pages, can be formed. In a probability transition matrix, row i contains the one-step transition probabilities from i to all states; row i sums to 1.0. Column i contains the one-step transition probabilities from all states to i. The transition matrix calculated from the link graph in Figure 2 is shown in Figure 3.
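A minimal sketch of this normalisation step, assuming the weighted link pairs from Section 2.1 are available as a dictionary {(r, u): w} whose outgoing weights for each state include its link to the "Exit" node; this is our illustration, not the authors' code.

from collections import defaultdict

def transition_matrix(weights):
    """Turn weighted link pairs {(r, u): w} into one-step probabilities P[r][u] = w_ru / sum_k w_rk."""
    out_totals = defaultdict(float)
    for (r, _), w in weights.items():
        out_totals[r] += w
    P = defaultdict(dict)
    for (r, u), w in weights.items():
        P[r][u] = w / out_totals[r]     # row r sums to 1.0 by construction
    return P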

[Figure: the 14 × 14 one-step transition probability matrix over the 12 pages of Fig. 2 plus the Exit and Start states; only the non-zero probabilities are shown and each row sums to 1.0.]

Fig. 3. Transition Probability Matrix for the Link Graph in Fig. 2


3 Transition Matrix Compression

An algorithm that can be used to compress a sparse probability transition matrix, while the transition behaviors of the Markov model are preserved, is presented in [15]. States with similar transition behaviors are aggregated together to form new states. In link prediction, we need to raise the transition matrix Q to the nth power. For a large Q this is computationally expensive. Spears' algorithm can be used to compress the original matrix Q to a much smaller matrix Q_c without significant errors, since accuracy experiments on large matrices have shown that Q_c^n and (Q^n)_c are very close to each other. Since the computational complexity of Q^n is O(N^3), by dramatically reducing N, the time taken by compression is compensated for by all subsequent probability computations for link prediction [15]. We have used Spears' algorithm in our approach. The similarity metric for every pair of states is formed to ensure that pairs of states that are more similar yield less error when they are compressed [15]. Based on the similarity metric in [15], the transition similarity of two pages i and j is the product of their in-link and out-link similarities. Their in-link similarity is the weighted sum of the distance between column i and column j at each row. Their out-link similarity is the sum of the distance between row i and row j at each column.

Sim_{i,j} = Sim_{i,j}(out-link) × Sim_{i,j}(in-link)
Sim_{i,j}(out-link) = Σ_y α_{i,j}(y)
Sim_{i,j}(in-link) = Σ_x β_{i,j}(x)
α_{i,j}(y) = |P_{i,y} - P_{j,y}|
β_{i,j}(x) = |m_i × P_{x,j} - m_j × P_{x,i}| / (m_i + m_j)
m_i = Σ_l P_{l,i},  m_j = Σ_l P_{l,j}    (3)

where m_i and m_j are the sums of the probabilities on the in-links of pages i and j respectively, Sim_{i,j}(out-link) is the sum of the out-link probability differences between i and j, and Sim_{i,j}(in-link) is the sum of the in-link probability differences between i and j.

For the transition matrix in Figure 3, the calculated transition similarity matrix is shown in Figure 4.

If the similarity is close to zero, the error resulting from compression is close to zero [15]. We can set a threshold ε, and let Sim_{i,j} < ε, to look for candidate pages for merging.
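The following Python sketch computes this similarity for a dense transition matrix held as a NumPy array (states indexed 0..N-1). It is our reading of equation (3), with the absolute value in the in-link term assumed, and is not taken from the paper.

import numpy as np

def transition_similarity(P, i, j):
    """Sim_{i,j} from equation (3): product of out-link and in-link distances of states i and j."""
    m = P.sum(axis=0)                                  # m_k = sum of in-link probabilities of state k
    sim_out = np.abs(P[i, :] - P[j, :]).sum()          # rows compared column by column
    sim_in = (np.abs(m[i] * P[:, j] - m[j] * P[:, i]) / (m[i] + m[j])).sum()
    return sim_out * sim_in

def merge_candidates(P, eps=0.15):
    """All unordered state pairs whose similarity falls below the threshold eps."""
    n = P.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if transition_similarity(P, i, j) < eps]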

Page 78: Soft-Ware 2002: Computing in an Imperfect World: First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8–10, 2002 Proceedings

Using Markov Chains for Link Prediction in Adaptive Web Sites 67

Page    1     2     3     4     5     6     7     8     9     10    11    12    Exit  Start
1       0.00
2       0.58  0.00
3       1.29  0.21  0.00
4       1.24  0.00  0.36  0.00
5       1.31  0.57  0.74  0.99  0.00
6       1.14  0.53  0.60  0.89  0.00  0.00
7       1.04  0.51  0.81  0.83  0.26  0.24  0.00
8       1.71  0.63  1.17  1.20  1.18  1.04  0.18  0.00
9       1.14  0.53  0.75  0.89  0.88  0.80  0.51  0.00  0.00
10      1.39  0.58  0.87  1.03  1.02  0.91  0.58  0.00  0.00  0.00
11      2.88  0.74  1.61  1.68  1.64  1.38  0.89  2.32  1.39  1.77  0.00
12      2.00  0.67  1.29  1.33  1.31  1.14  0.71  0.00  0.00  0.00  2.88  0.00
Exit    3.25  0.76  1.72  1.79  1.75  1.46  1.31  2.55  1.46  1.90  5.98  3.25  0.00
Start   2.00  0.67  1.29  1.33  1.31  1.14  1.04  1.71  1.14  1.39  2.88  2.00  3.25  0.00

Fig. 4. Transition Similarity Matrix for Transition Matrix in Fig. 3 (Symmetric)

By raising ε we can compress more states, with a commensurate increase in error. Pages sharing more in-links and out-links, and having equivalent weights on them, will meet the similarity threshold. Suppose states i and j are merged together; we then need to assign transition probabilities between the new state i∨j and each remaining state k in the transition matrix. We compute the weighted average of the ith and jth rows and place the results in the row of state i∨j, and sum the ith and jth columns and place the results in the column of state i∨j.

P_{k,i∨j} = P_{k,i} + P_{k,j}
P_{i∨j,k} = (m_i × P_{i,k} + m_j × P_{j,k}) / (m_i + m_j)    (4)
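A small sketch of one merge step under equation (4), again assuming a NumPy transition matrix; it shows a single pair of states being merged and is our illustration rather than Spears' published implementation.

import numpy as np

def merge_states(P, i, j):
    """Merge states i and j into a single state, following equation (4):
    their columns are summed and their rows are combined as an m-weighted average."""
    m = P.sum(axis=0)                       # in-link masses m_k
    Q = P.copy()
    # 1. Sum column j into column i, then drop column j.
    Q[:, i] = Q[:, i] + Q[:, j]
    Q = np.delete(Q, j, axis=1)
    # 2. Replace row i by the weighted average of rows i and j, then drop row j.
    Q[i, :] = (m[i] * Q[i, :] + m[j] * Q[j, :]) / (m[i] + m[j])
    Q = np.delete(Q, j, axis=0)
    return Q                                # rows still sum to 1.0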

For the similarity matrix in Figure 4, we set the similarity threshold ε = 0.10. Experiments indicated that a value of ε between 0.08 and 0.15 yielded good compression with minimal error for our link graph. The compression process is shown in Figure 5. States 2 and 4, and states 5 and 6, are compressed as a result of Sim_{i,j}(in-link) = 0; states 8, 9, 10 and 12 are compressed as a result of Sim_{i,j}(out-link) = 0.

The compressed matrix is shown in Figure 6. The compressed matrix is denser than the original transition matrix.

When either Sim_{i,j}(out-link) = 0 or Sim_{i,j}(in-link) = 0, the compression will result in no error: Error_{i,j} = 0 and Q_c^n = (Q^n)_c [15]. So there is no compression error for the transition matrix in Figure 3 and its compressed matrix in Figure 6. This may not always be the case for a transition matrix calculated from another link graph. When Sim_{i,j} is below a given threshold, the effect of compression on the transition behavior of the states ((Q^n)_c - Q_c^n) will be controlled, the transition property of the matrix is preserved, and the system is compressed to an optimal size for probability computation. The compressed transition matrix is used for efficient link prediction.

Compressed state 4 into state 2 (similarity 0.000000) (states: 2 4)
Compressed state 6 into state 5 (similarity 0.000000) (states: 5 6)
Compressed state 9 into state 8 (similarity 0.000000) (states: 8 9)
Compressed state 12 into state 10 (similarity 0.000000) (states: 10 12)
Compressed state 10 into state 8 (similarity 0.000000) (states: 8 9 10 12)
Finished compression. Have compressed 14 states to 9.

Fig. 5. Compression Process for Transition Matrix in Fig. 3

[Figure: the compressed transition matrix; its states are 1, (2,4), 3, (5,6), 7, (8,9,10,12), 11, Exit, and Start, and each row lists that state's non-zero one-step transition probabilities, summing to 1.0.]

Fig. 6. Compressed Transition Matrix for Transition Matrix in Figure 3

4 Link Prediction Using Markov Chains

When a user visits the Web site, by taking the pages already visited by him/her as a history, we can use the compressed probability transition matrix to calculate the probabilities of his/her visiting other pages or clusters of pages in the future. We view each compressed state as a cluster of pages. The calculated conditional probabilities can be used to estimate the level of interest of other pages and/or clusters of pages to him/her.

4.1 Link Prediction on M-Order N-Step Markov Chains

Sarukkai [14] proposed to use the "link history" of a user to make link prediction. Suppose a user is currently at page i, and his/her visiting history as a sequence of m pages is i_{-m+1}, i_{-m+2}, ..., i_0. We use a vector L_0 = (l_j), where l_j = 1 when j = i and l_j = 0 otherwise, for the current page, and vectors L_k = (l_{j_k}) (k = -1, ..., -m+1), where l_{j_k} = 1 when j_k = i_k and l_{j_k} = 0 otherwise, for the previous pages. These history vectors are used together with the transition matrix to calculate a vector Rec_1 for the probability of each page to be visited in the next step as follows:

Rec_1 = a_1 × L_0 × Q + a_2 × L_{-1} × Q^2 + ... + a_m × L_{-m+1} × Q^m    (5)

where a_1, a_2, ..., a_m are the weights assigned to the history vectors. The values of a_1, a_2, ..., a_m indicate the level of influence the history vectors have on the future. Normally, we let 1 > a_1 > a_2 > ... > a_m > 0, so that the closer a history vector is to the present, the more influence it has on the future. This conforms to the observation of a user's navigation in the Web site. Rec_1 = (rec_j) is normalized, and the pages with probabilities above a given threshold are selected as the recommendations.

We propose a new method as an improvement to Sarukkai's method by calculating the possibilities that the user will arrive at a state in the compressed transition matrix within the next n steps. We calculate the weighted sum of the possibilities of arriving at a particular state in the transition matrix within the next n steps, given the user's history, as his/her overall possibility of arriving at that state in the future. Compared with Sarukkai's method, our method can predict more steps into the future, and thus provides more insight into the future. We calculate a vector Rec_n representing the probability of each page to be visited within the next n steps as follows:

Rec_n = a_{1,1} × L_0 × Q + a_{1,2} × L_0 × Q^2 + ... + a_{1,n} × L_0 × Q^n +
        a_{2,1} × L_{-1} × Q^2 + a_{2,2} × L_{-1} × Q^3 + ... + a_{2,n} × L_{-1} × Q^{n+1} + ... +
        a_{m,1} × L_{-m+1} × Q^m + a_{m,2} × L_{-m+1} × Q^{m+1} + ... + a_{m,n} × L_{-m+1} × Q^{m+n-1}    (6)

where a_{1,1}, a_{1,2}, ..., a_{1,n}, ..., a_{m,1}, a_{m,2}, ..., a_{m,n} are the weights assigned to the history vectors L_0, ..., L_{-m+1} in 1, 2, ..., n, ..., m-1, m, ..., m+n-1 steps into the future, respectively. Normally, we let 1 > a_{k,1} > a_{k,2} > ... > a_{k,n} > 0 (k = 1, 2, ..., m), so that for each history vector, the closer its transition is to the next step, the more important its contribution. We also let 1 > a_{1,l} > a_{2,l} > ... > a_{m,l} > 0 (l = 1, 2, ..., n), so that the closer a history vector is to the present, the more influence it has on the future. Rec_n = (rec_j) is normalized, and the pages with probabilities above a given threshold are selected as the recommendations.
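As a concrete reading of equation (6), the sketch below computes Rec_n with NumPy for a compressed matrix Qc, a list of one-hot history vectors [L_0, L_{-1}, ...], and a weight table; the particular weight values shown are placeholders chosen only to satisfy the ordering constraints above, not values from the paper.

import numpy as np

def rec_n(Qc, history, weights, n):
    """history[k] is the one-hot vector L_{-k} (history[0] is the current page L_0);
    weights[k][l-1] is a_{k+1,l}.  Returns the normalised n-step recommendation vector."""
    N = Qc.shape[0]
    rec = np.zeros(N)
    for k, L in enumerate(history):                 # k = 0 corresponds to L_0
        for l in range(1, n + 1):                   # l-th step ahead for this history vector
            rec += weights[k][l - 1] * (L @ np.linalg.matrix_power(Qc, k + l))
    return rec / rec.sum()

# Example weights for m = 2 history vectors and n = 3 steps, decreasing in both indices.
weights = [[0.5, 0.4, 0.3],
           [0.4, 0.3, 0.2]]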

4.2 Maximal Forward Path Based Link Prediction

A maximal forward path [3] is a sequence of maximally connected pages in a user's visit. Only pages on the maximal forward path are considered as a user's history for link prediction. The effect of some backward references, which are mainly made for ease of travel, is filtered out. In Fig. 3, for instance, a user may have visited the Web pages in the sequence 1 → 2 → 5 → 2 → 6. Since the user has visited page 5 after page 2 and then gone back to page 2 in order to go to page 6, the current maximal forward path of the user is 1 → 2 → 6. Page 5 is discarded in the link prediction.
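One simple way to extract such a path from a click sequence is sketched below; this is our interpretation of the backward-reference rule described above, not code from [3]: a revisit to an earlier page truncates the path back to that page.

def maximal_forward_path(clicks):
    """Return the maximal forward path of a click sequence, e.g. [1, 2, 5, 2, 6] -> [1, 2, 6]."""
    path = []
    for page in clicks:
        if page in path:
            # Backward reference: cut the path back to the revisited page.
            path = path[:path.index(page) + 1]
        else:
            path.append(page)
    return path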

5 Experimental Results

Experiments were performed on a Web log file recorded between the 1st and 14th of October 1999 on the University of Ulster Web site, which is 371 MB in size and contains 2,193,998 access records. After discarding the irrelevant records, we get 423,739 records. In order to rule out the possibility that some links are only interesting to individual users, we set a threshold requiring a minimum of 10 traversals on each hyperlink, made by three or more users. We assume each originating machine corresponds to a different user. This may not always be true when, for example, proxy servers exist, but in the absence of user tracking software the method can still provide rather reliable results. We then construct a link graph consisting of 2175 nodes and 3187 links between the nodes. The construction process takes 26 minutes on a Pentium 3 desktop with a 600 MHz CPU and 128 MB RAM. The maximum number of traversals on a link in the link graph is 101,336, which is on the link from the "Start" node to the homepage of the Web site. The maximum and average numbers of links in a page in the link graph are 75 and 1.47 respectively. The maximum number of in-links of a page in the link graph is 57.

The transition matrix is 2175×2175 and very sparse. By setting six different thresholds for compression, we get the experimental results given in Table 1:

Table 1. Compression Results on a Transition Matrix from a Web Log File

ε      Compression Time (Minutes)   Size after compression   % of states removed
0.03   107                          1627                     25.2
0.05   110                          1606                     26.2
0.08   118                          1579                     27.4
0.12   122                          1549                     28.8
0.15   124                          1542                     29.1
0.17   126                          1539                     29.2

We can see that when ε increases, the matrix becomes harder to compress. For this matrix, we choose ε = 0.15 for a good compression rate without significant error. Experiments in [15] also show that a value of ε = 0.15 yielded good compression with minimal error. Now we calculate Q_c^2 and use the time spent as the benchmark for Q_c^m. Since we can repeatedly multiply Q_c^2 by Q_c to get Q_c^3, ..., Q_c^{m-1}, Q_c^m, the time spent computing Q_c^2, ..., Q_c^{m-1}, Q_c^m can be estimated as m-1 times the time for Q_c^2. Table 2 summarises the experimental results of the computation for Q_c^2. We can see that the time needed for compression is compensated for by the time saved in the computation of Q_c^2. When calculating Q^m, the computational time can be further reduced. Q^2, ..., Q^m can be computed off-line and stored for link prediction, so the response time is not an issue given the fast-developing computational capability of Web servers.
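A minimal sketch of this off-line precomputation, assuming a NumPy matrix and that powers up to m + n - 1 are needed for equation (6); our illustration only.

import numpy as np

def precompute_powers(Qc, max_power):
    """Return {p: Qc**p} for p = 1..max_power by repeated multiplication."""
    powers = {1: Qc}
    for p in range(2, max_power + 1):
        powers[p] = powers[p - 1] @ Qc     # Qc^p = Qc^(p-1) * Qc
    return powers

# With m = 5 history vectors and n = 5 steps, powers up to m + n - 1 = 9 are needed.
# powers = precompute_powers(Qc, 9)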

Table 2. Experimental Results for Q^2 and Q_c^2

Matrix Dimension                              2175   1627   1606   1579   1549   1542   1539
Computation Time for Q^2 or Q_c^2 (Minutes)   1483   618    592    561    529    521    518
Percentage of time saved (%)                  N/A    58.3   60.1   62.1   64.3   64.9   65.1

We then use the compressed transition matrix for link prediction. Link prediction is integrated with a prototype called ONE (Online Navigation Explorer) to assist users' navigation in our university Web site. ONE provides the user with informative and focused recommendations and the flexibility of being able to move around within the history and recommended pages. The average time needed for updating the recommendations is under 30 seconds, so it is suitable for online navigation, given that the response can be speeded up with the current computational capability of many commercial Web sites. We selected m = 5 and n = 5 in link prediction to take into account five history vectors in the past and five steps in the future. We computed Q^2, ..., Q^9 for link prediction.

The initial feedback from our group members is very positive. They spent less time finding the information they were interested in on our university Web site when using ONE than when not using it, and they found useful information more successfully with ONE than without it. So users' navigation has been effectively speeded up using ONE. ONE presents a list of Web pages as the user's visiting history, along with the recommended pages, updated while the user traverses the Web site. Each time a user requests a new page, the probabilities of visiting other Web pages or page clusters within the next n steps are calculated. Then the Web pages and clusters with the highest probabilities are highlighted in the ONE window. The user can browse the clusters and pages as in the Windows Explorer. Icons are used to represent different states of pages and clusters. Like the Windows Explorer, ONE allows the user to activate pages and expand clusters. Each page is shown with its title to describe its contents.

6 Related Work

Ramesh Sarukkai [14] has discussed the application of Markov chains to link prediction. The user's navigation is regarded as a Markov chain for link analysis. The transition probabilities are calculated from the accumulated access records of past users. Compared with his method, we make three major contributions. We have compressed the transition matrix to an optimal size to save the computation time of Q^{m+n-1}, which can save a lot of time and resources given the large number of Web pages on a modern Web site. We have improved the link prediction calculation by taking more steps in the future into account, providing more insight into the future. We have proposed the use of the Maximal Forward Path method to improve the accuracy of link prediction by eliminating the effect of backward references by users.

The "Adaptive Web Sites" approach has been proposed by Perkowitz and Etzioni [11]. Adaptive Web sites are Web sites which can automatically change their presentation and organization to assist users' navigation by learning from Web usage data. Perkowitz and Etzioni proposed the PageGather algorithm, which generates, from Web usage data, index pages composed of the Web pages most often associated with each other in users' visits, in order to evaluate a Web site's organization and assist users' navigation [12].

Our work is in the context of adaptive Web sites. Compared with their work, our approach has two advantages. (1) The index page is based on the co-occurrence of pages in users' past visits and does not take into account a user's visiting history; it is a static recommendation. Our method takes a user's history into account to make link predictions, and the link prediction is dynamic, reflecting the changing interests of the users. (2) In PageGather, it is assumed that each originating machine corresponds to a single user. This assumption can be undermined by proxy servers and dynamic IP allocation, both of which are common on the WWW. Our method treats a user group as a whole without identifying individual users and is thus more robust to these influences. However, computation is needed in link prediction, and the recommendations cannot respond as quickly as an index page, which can be retrieved directly from a Web server. Spears [15] proposed a transition matrix compression algorithm based on the transition behaviors of the states in the matrix. Transition matrices calculated from systems which are being modeled in too much detail can be compressed into smaller state spaces while the transition behaviors of the states are preserved. The algorithm has been used in our work to measure the transition similarities between pages and to compress the probability transition matrix to an optimal size for efficient link prediction.

Pirolli and Pitkow [13] studied Web surfers' traversal paths through the WWW and proposed to use a Markov model for predicting users' link selections based on past users' surfing paths. Albrecht et al. [1] proposed to build three types of Markov models from Web log files for pre-sending documents. Myra Spiliopoulou [16] discussed using navigation pattern and sequence analysis mined from Web log files to personalize a Web site. Mobasher, Cooley, and Srivastava [4, 9] discussed the process of mining Web log files using three kinds of clustering algorithms for site adaptation. Brusilovsky [2] gave a comprehensive review of the state of the art in adaptive hypermedia research. Adaptive hypermedia includes adaptive presentation and adaptive navigation support [2]. Adaptive Web sites can be seen as a kind of adaptive presentation of Web sites to assist users' navigation.

7 Conclusions

Markov chains have proven very suitable for modeling Web users' navigation on the WWW. This paper presents a method for constructing link graphs from Web log files. A transition matrix compression algorithm is used to cluster pages with similar transition behaviors together for efficient link prediction. The initial experiments show that the link prediction results, presented in the prototype ONE, can help users find information in the University of Ulster Web site more efficiently and accurately than simply following hyperlinks.

Our current work has opened up several fruitful directions: (1) The maximal forward path has been used to approximately infer a user's purpose from his/her navigation path, which might not be accurate. Link prediction could be further improved by identifying the user's goal in each visit [6]. (2) Link prediction in ONE needs to be evaluated by a larger user group. We plan to select a group of users, including students and staff in our university as well as people from outside, to use ONE. Their interaction with ONE will be logged for analysis. (3) We plan to use Web log files from a commercial Web site to build a Markov model for link prediction and to evaluate the results on different user groups.

References

1. Albrecht, D., Zukerman, I., Nicholson, A.: Pre-sending Documents on the WWW: A Comparative Study. IJCAI99 (1999)
2. Brusilovsky, P.: Adaptive hypermedia. User Modeling and User Adapted Interaction 11 (1/2). (2001) 87-110
3. Chen, M.S., Park, J.S., Yu, P.S.: Data mining for path traversal in a web environment. In Proc. of the 16th Intl. Conference on Distributed Computing Systems, Hong Kong. (1996)
4. Cooley, R., Mobasher, B., Srivastava, J.: Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems, Vol. 1, No. 1. (1999)
5. Hallam-Baker, P.M., Behlendorf, B.: Extended Log File Format. W3C Working Draft WD-logfile-960323. http://www.w3.org/TR/WD-logfile. (1996)
6. Hong, J.: Graph Construction and Analysis as a Paradigm for Plan Recognition. Proc. of AAAI-2000: Seventeenth National Conference on Artificial Intelligence, (2000) 774-779
7. Kalpazidou, S.L.: Cycle Representations of Markov Processes, Springer-Verlag, NY. (1995)
8. Kijima, M.: Markov Processes for Stochastic Modeling. Chapman & Hall, London. (1997)
9. Mobasher, B., Cooley, R., Srivastava, J.: Automatic Personalization Through Web Usage Mining. TR99-010, Dept. of Computer Science, DePaul University. (1999)
10. Nielsen, J.: Designing Web Usability, New Riders Publishing, USA. (2000)
11. Perkowitz, M., Etzioni, O.: Adaptive web sites: an AI challenge. IJCAI97 (1997)
12. Perkowitz, M., Etzioni, O.: Towards adaptive Web sites: conceptual framework and case study. WWW8. (1999)
13. Pirolli, P., Pitkow, J.E.: Distributions of Surfers' Paths Through the World Wide Web: Empirical Characterization. World Wide Web 1: 1-17. (1999)
14. Sarukkai, R.R.: Link prediction and path analysis using Markov chains. WWW9, (2000)
15. Spears, W.M.: A compression algorithm for probability transition matrices. In SIAM Matrix Analysis and Applications, Volume 20, #1. (1998) 60-77
16. Spiliopoulou, M.: Web usage mining for site evaluation: Making a site better fit its users. Comm. ACM Personalization Technologies with Data Mining, 43(8). (2000) 127-134


Classification of Customer Call Data in the Presence of Concept Drift and Noise

Michaela Black and Ray Hickey

School of Information and Software Engineering, Faculty of Informatics, University of Ulster, Coleraine, BT51 1SA, Northern Ireland

mm.black, [email protected]

Abstract. Many of today's real world domains require online classification tasks in very demanding situations. This work presents the results of applying the CD3 algorithm to telecommunications call data. CD3 enables the detection of concept drift in the presence of noise within real time data. The application detects the drift using a TSAR methodology and applies a purging mechanism as a corrective action. The main focus of this work is to identify, from customer files and call records, whether the profile of customers registering for a 'friends and family' service is changing over a period of time. We begin with a review of the CD3 application and a presentation of the data, and conclude with experimental results.

1 Introduction

On-line learning systems which receive batches of examples on a continual basis and are required to induce and maintain a model for classification have to deal with two substantial problems: noise and concept drift.

The effects of noise have been studied extensively and have led to noise-proofed decision tree learners such as C4.5 [13] and C5 [14], and rule induction algorithms, e.g. CN2 [17]. Although considered by Schlimmer and Grainger [6] and again recently by others, including Widmer and Kubat [10], Widmer [9] and, from a computational learning theory perspective, Hembold and Long [7], concept drift [1], [8], [10] has received considerably less attention. There has also been work on time dependency for association rule mining; see, for example, [11] and [12].

By concept drift we mean, essentially, that concepts or classes are subject to change over time [5], [7], [8], [9], [10]. Such change may affect one or more classes and, within a class, may affect one or more of the rules which constitute the definition of that class. Examples of this phenomenon are common in real world applications. In marketing, the definition of the concept 'likely to buy this product in the next three months' could well change several times during the lifetime of the product or even during a single advertising campaign. The same applies in fraud detection for, say, credit card or mobile phone usage. Here change may be prompted by advances in technology that make new forms of fraud possible, or may be the result of fraudsters altering their behaviour to avoid detection. The consequences of ignoring concept drift when mining for classification models can be catastrophic [1].


As noted by Widmer [10], distinguishing between noise and drift is a difficult task for a learner. When drift occurs, incoming examples can appear to be just imperfect. The dilemma for the algorithm is then to decide between the two: has drift occurred and, if so, how are the learned class definitions to be updated? In the system FLORA4, concept hypothesis description sets are maintained using a statistical evaluation procedure applied when a new example is obtained. Examples deemed to be no longer valid are 'forgotten', i.e. dropped from the window of examples used for forming generalisations. A WAH (window adjustment heuristic) is employed for this purpose. The approach assumes that it is the newer examples which are relevant and thus the older examples are dropped. Although window size is dynamic throughout learning, there is a philosophy of preventing the size from becoming overly large: if, after receiving a new example, the concepts seem stable then an old example is dropped.

Drift may happen suddenly, referred to as revolutionary [1], [2], or may happen gradually over an extended time period, referred to as evolutionary; see [1] and [5]. In the former case we refer to the time at which drift occurred as the drift point. We can regard evolutionary drift as involving a series of separate drift points with very small drift occurring at each one.

In [1], [2] we proposed a new learning system architecture, CD3, to aid the detection and correction of concept drift. Like the METAL systems presented by Widmer [9], CD3 utilizes existing learners, in our case tree or rule induction algorithms; the basic induction algorithm is really just a parameter in the system. We do not, however, seek to determine contextual clues. We provided an alternative strategy to 'windowing', called 'purging', as a mechanism for removing examples that are no longer valid. The technique aims to keep the knowledge base more up-to-date by not offering preference to the newer examples, thus retaining older valid information that has not drifted.

2 TSAR and the CD3 Algorithm

We know data will arrive in batches, where batch size may vary from one example to many. The CD3 algorithm presented in [1], [2] must accept batches of data of various sizes, providing a flexible update regime. Online incoming data may be batched in accordance with time of arrival or in set sizes of examples. Experimental scripts produced in Prolog allow the user to specify how the data is to be processed by CD3. Batch variations may also exist within one induction process.

The central idea is that a time-stamp is associated with examples and treated as an attribute, ts, during the induction process. In effect the learning algorithm is assessing the relevance of the time-stamp attribute. This is referred to as the time-stamp attribute relevance (TSAR) principle, the implication being that if it turns out to be relevant then drift has occurred.

Specifically, the system maintains a set of examples (the example base) deemed to be valid and time-stamps these as ts=current. When a new batch arrives, the examples in it are stamped ts=new. Induction then takes place using a noise-proofed tree or rule-building algorithm, referred to as the base algorithm. The pseudocode for CD3 is presented in Figure 1.

Following the induction step, CD3 will provide us with a pruned induced tree structure that may or may not have the ts attribute present. This induced tree could be used for classification of unseen examples by merely having their description augmented with the ts attribute and setting its value to 'new'. It will be assumed that all unseen examples presented following an induction step are generated from the current model in force when the most recent batch arrived. This assumption is held until another new batch of training examples arrives, invoking an update of the tree.

CD3(Header_file, Batch_parameters, Data_file, Output_file, Purger_parameter, No_of_trials)
repeat for No_of_trials
  load file specification(Header_file);
  load training data(Data_file, Trial);
  begin
    extract first batch and mark as 'current'(Batch_parameters);
    while more batches
      extract next batch and mark as 'new';
      call ID3_Induce_tree;
      prune induced tree;
      extract rules;
      separate drifted rules;
      purge invalid training examples(Purger_parameter);
      append 'new' examples to 'current';
      test rules;
      output results(Output_file);
    end while
  end
end repeat
end CD3

Fig. 1. Pseudocode for CD3 Induction Algorithm
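To illustrate the TSAR principle in a more familiar setting, here is a small Python sketch (our own, using scikit-learn rather than the authors' Prolog implementation) that appends a ts attribute to the combined current and new examples, induces a pruned decision tree, and checks whether ts was retained, which under TSAR signals drift. The pruning parameter is an assumption standing in for the Niblett-Bratko post-pruning used by CD3.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tsar_drift_detected(X_current, y_current, X_new, y_new):
    """Induce a tree on current+new examples with a ts attribute appended;
    drift is signalled if the pruned tree actually splits on ts."""
    X = np.vstack([X_current, X_new]).astype(float)
    ts = np.concatenate([np.zeros(len(X_current)),      # ts = 0 for 'current'
                         np.ones(len(X_new))])          # ts = 1 for 'new'
    X_ts = np.column_stack([X, ts])
    y = np.concatenate([y_current, y_new])
    # ccp_alpha stands in for noise-proof post-pruning (an assumption, not CD3's pruner).
    tree = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_ts, y)
    ts_index = X_ts.shape[1] - 1
    used = tree.tree_.feature                            # split features of internal nodes
    return ts_index in used[used >= 0]                   # ts retained => drift detected

Rules whose paths test ts = current would then be treated as invalid, and the examples they cover purged, as described below.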

However, CD3 offers another method for classification which does not require unseen examples to have their description altered. From the pruned induced tree, CD3 will extract rules, the rules representing all the concept paths from the root to the leaves. Where ts is present in some or all of the paths, CD3 will regard those having the ts value 'current' as out-of-date, and hence invalid, rules. Paths with the ts value 'new' will be viewed as up-to-date and thus valid rules. Finally, paths in which ts is not instantiated correspond to rules which were valid prior to the new batch and are still so, i.e. unchanged valid rules.

These rules can now be separated, i.e. valid from invalid, and the ts attribute dropped from the rule conditions. The TSAR methodology uses the ts attribute to differentiate those parts of the data which have been affected by drift from those that have not. Thus, by applying the TSAR methodology, CD3 is enabled to detect drift. Once the invalid and valid rules have been highlighted and separated, the ts attribute has fulfilled its purpose and can be removed.

Classification requests can be ongoing, but in order to provide the most up-to-date classifier, CD3 must always be ready to accept these updates. This requires maintaining a correct and up-to-date database. Following the recent induction step and rule conversion, CD3's database is temporarily out of step with its knowledge structure. Before CD3 can accept new updates of training examples, it must remove existing examples in the knowledge base that are covered by the most recently identified invalid rules. As presented in [1], [2], CD3 provides a removal technique, i.e. 'purging', which can be applied to the knowledge base following the concept drift detection phase to extract the now out-of-date examples, thus maintaining an up-to-date version of the database. In purging examples that are no longer relevant, the TSAR approach does not take account of age as windowing mechanisms tend to do. Rather, the view is that an example should be removed if it is believed that the underlying rule, of which the example is an instance, has drifted, i.e. if it matches an invalid rule.

The invalid rules can be discarded or stored for trend analysis. The valid rules are used for classification, where an unseen example description can be matched against a rule's conditions to obtain its classification. However, CD3 aims to be an incremental online learning system which can therefore be updated with new information, i.e. new training examples, when available.

A learning system as described above is said to update learning in the presence of noise and possible concept drift using the TSAR principle. The central feature here is the involvement of the ts attribute in the base learning process. Any tree or rule induction algorithm which effectively handles noise (and therefore, in theory, should prune away the ts attribute, as well as other noisy attributes, if there is no drift) can be used at the heart of a TSAR learning system. In particular, if ID3 with post-pruning is used we call the resulting system CD3 (CD = concept drift). We have implemented a version of CD3 which uses the well-known Niblett-Bratko post-pruning algorithm [18].

A TSAR system implements incremental learning as batch re-learning, i.e. the knowledge base is induced again from scratch every time a new batch arrives, with the new batch being added to the existing example base. (In contrast, FLORA4 incrementally learns on receipt of each new example and without re-learning from the existing examples.) Bearing in mind that, depending on the extent and frequency (over time) of drift, many examples could be purged from the example base, this is not as inefficient as it might appear. It may be possible to produce a more genuinely incremental, and hence more computationally efficient, implementation by exploiting the techniques used by [16] in the ITI algorithm.

The TSAR approach does not require the user to manually set parameter values other than those that may be required by the base algorithm. Noise and the presence of pure noise attributes, i.e. those that are never useful for classification, will interfere with the ability of CD3 to decide whether drift has occurred. The ts attribute may be retained in a pruned tree even if there has been no drift, a 'false positive', and, as a consequence, some examples will be wrongfully purged. Conversely, ts may be pruned away when drift has taken place, with the result that invalid examples will remain in the example base and contaminate learning.

CD3 allows classification tasks to continue to work and survive in very demanding environments such as fraud detection in the telecommunications industry, moving marketing targets, and evolving customer profiling. The methodology is simple in principle and allows ease of implementation across a wide spectrum of problem areas and classification methods.

Purging helps CD3 to work in these demanding environments. The data that CD3 works with will be continually updated with the changing environment. The speed at which it can work is also improved by removing data unnecessary for the task at hand. Using the purging mechanism with the TSAR methodology makes it very easy to implement: we basically rely on the ts attribute to provide two sets of rules, valid and invalid, and the purging rules may be simply coded to check all examples accordingly. Having the purging mechanism separate from the drift detection enhances its reusability. If coded as a separate executable object, this would allow it to be reused, not only across a number of different classification tasks, but also across separate classification algorithms.

2.1 Refinement of the TSAR Methodology

We extended the CD3 algorithm presented in [1] to allow refinement of the ts attribute being used, as discussed in [2]. CD3 uses a very simple form of time stamping, relying on just two values, 'current' and 'new'. This is sufficient to allow it to separate valid and invalid rules and to maintain good classification rates using the former. One possible disadvantage, however, is that as mining proceeds, the examples purged in previous rounds are lost to the system. This is the case even though such purges may be false, i.e. may have occurred as a result of errors in the induction process. It was shown in [1] that false purging, even under realistic noise levels, could be kept to a minimum. Nevertheless, it is worth considering how the mining process could be allowed to review and maybe revoke earlier decisions. In [2] we used two versions of time stamp refinement.

The first refinement of the time stamps, as 'batch identifiers', resulted in the CD4 algorithm. Within this algorithm each batch is assigned, and retains indefinitely, its own unique batch identifier. At each round of mining, all the data from the previous batches is used together with the new batch. There is no purging process, as CD4 is implemented using a new base learner, C5 [14], without the purging mechanism.

Instead, the base learning algorithm is able to distinguish valid and invalid rules by appropriate instantiation of the time stamp attribute, possibly revising decisions made in a previous induction. As presented in [2], drift will be located as occurring after a certain batch and will be represented within the knowledge structure as a binary split at the ts value of that batch identifier. This procedure results in a set of invalid and valid rules. The valid rules can then be used for online classification.

The second refinement procedure gives CD5, which removes the effect of batches on mining by using continuous time stamping, in which each training example has its own unique time stamp. This is a numeric attribute. The base learning algorithm is now free to form binary splits as for CD4, but without regard to batch. Thus it can place a split point within a batch (either the new batch or any of the previous batches) and review these decisions at each new round of mining. Again, the procedures for extracting valid and invalid rules and for maintaining a database of currently valid examples are as described above. As with CD4, purging is not an integral part of the incremental mining process.

As demonstrated in [2], both the extension to individual batch time stamps and the further extension to individual example time stamps, in algorithms CD4 and CD5 respectively, appear to produce results comparable to, and possibly slightly superior to, those obtained from CD3. It was also highlighted in [2] that the simple strategy of dealing with drift by ignoring all previously received data and just using the new batch of data is effective only immediately after drift. Elsewhere it prevents growth in ACR through accumulation of data. It also, of course, denies us the opportunity to detect drift should it occur.

With the benefit from the enhanced time stamps turning out to be marginal, the choice of which algorithm to deploy may depend on the characteristics of the domain used and the nature of the on-line performance task. Without the purging process, CD4 and CD5 will produce larger trees than CD3 and may take slightly longer to learn.

Both CD4 and CD5 offer a benefit over CD3 since they induce trees that record the total history of changes in the underlying rules and therefore provide a basis for further analysis. However, the purging mechanism of CD3 allows trend analysis to be applied off-line.

3 Experimental Trial with British Telecom Call Data

Until now we have developed and experimented with artificial data [1], [2]. This has allowed for strategic generation and control of parameters within the data, aiding the development of a simple yet, so far, effective application for detecting concept drift in the presence of noise. We are now able to extend this to an experimental trial on real world call data acquired from British Telecom (BT). The data is in the form of five batches covering a time period of twenty-seven months, batched in accordance with when BT proposed that drift was most likely to occur: March and October. The data reflects all landline calls for 1000 customers over this period, combined with customer information. We aim to prepare and process the data for CD3 in the hope that it will highlight some concept drift over the five batches.

3.1 The Content of the Data

The data was initially presented for the project as two files, a customer file and a call file, as shown in Table 1, Table 2 and Table 3. These were linked via an encoded customer id. One of the main interests of BT was to train on call data and try to identify whether the profile of customers who register for the 'friends and family' (F&F) or 'premier line' (P_L) service has changed over time. The 'friends and family' service is a discount service option offered to BT customers. This experiment involves the induction of a rule set for classification of the F&F indicator. Total usage, i.e. 'revenue', and how it related to the F&F users, was also of great interest. Revenue was calculated from two available fields as number of calls ∗ average cost of calls, which could then be split into discrete band values.
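A minimal sketch of that derived field, assuming per-customer, per-period summary values; the band boundaries and labels used here are placeholders for illustration, since the paper does not state the cut-off values.

def revenue_band(no_of_calls, avg_cost, cuts=(10.0, 50.0, 150.0)):
    """Derive revenue = no. of calls * average cost of calls and map it to a discrete band."""
    revenue = no_of_calls * avg_cost
    for label, upper in zip(('rev_low', 'rev_med', 'rev_high'), cuts):
        if revenue < upper:
            return label
    return 'rev_vhigh'

# e.g. revenue_band(120, 0.35) -> 'rev_med'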

The F&F, P_L and option 15 (O15) indicators each had five separate indicators, one for each of the five time periods, as shown in Table 1. These could then be translated into one indicator for each field, within individual batches, highlighting whether the customer was using the service at that time period. This also allowed for customers to register for a service and, at one of the successive time points, de-register from the service. These indicators are referred to as ffind, plind and O15ind.

The binary fields of particular interest were: the friends and family indicator (F&F); the O15 indicator (O15); the Premier Line indicator (P_L); and the single line indicator (SLIND). Other discrete valued fields of interest were: revenue, the life stage indicator (LSIND) and the acorn code. (More details are available in Table 1.)

For the fields in Table 2 marked with an *, there exist 13 sets of summarised data under the sub-headings shown in Table 3. These summarised sub-groups occur within the call file in the order shown, resulting in 79 columns in total. The first column is unique, the encoded telephone number, and it is followed by 13 batches of the six re-occurring attributes.

Table 1. Customer File

Field               Description                                    Type
encode (telno)      encrypted telephone number                     char
distcode            district code (27 unique)                      char
startdat            customer started using no.                     dd-mon-ccyy time
acorn               residential codes from postcodes               integer
ffind               friends and family indicator                   Y/N
ffdate              first got service                              dd-mon-ccyy time
F&F in Oct 1995     had F&F service at this time                   Y/N
F&F in Mar 1996     "                                              Y/N
F&F in Oct 1996     "                                              Y/N
F&F in Mar 1997     "                                              Y/N
F&F in Oct 1997     "                                              Y/N
F&F in Mar 1998     "                                              Y/N
Plind               premier line indicator                         Y/N
Pldate              first got service                              dd-mon-ccyy time
P&L in Oct 1995     had P&L service at this time                   Y/N
P&L in Mar 1996     "                                              Y/N
P&L in Oct 1996     "                                              Y/N
P&L in Mar 1997     "                                              Y/N
P&L in Oct 1997     "                                              Y/N
P&L in Mar 1998     "                                              Y/N
O15ind              option15 ind. (fixed call amount)              Y/N
O15date             first got service                              dd-mon-ccyy time
O15 in Oct 1995     had O15 service at this time                   Y/N
O15 in Mar 1996     "                                              Y/N
O15 in Oct 1996     "                                              Y/N
O15 in Mar 1997     "                                              Y/N
O15 in Oct 1997     "                                              Y/N
O15 in Mar 1998     "                                              Y/N
Xdir                Xdirectory                                     Y/N
Mps                 Mailing preference scheme                      Y/N
Tps                 Telephone preference scheme                    Y/N
Dontmail            Don’t mail marketing data                      Y/N
Lusind              Low user scheme - code form                    X/H
Ccind               Charge card indicator                          Y/N
Hwind               Hard wired indicator                           Y/N
Lsind               Life stage indicator (from postcode, 1..10)    integer
Slind               More than one line                             Y/N
Postcode            Post code                                      POSTCODE

Table 2. Call File Details

Field                                    Type
Encode (telno)                           char
*no. of calls                            int
*average duration of calls               real
*variance of duration of calls           real
*average cost of calls                   real
*variance of cost of calls               real
*no. of distinct destinations phoned     int


Table 3. Call File Summarised Sub-Groups

All calls
Day-time calls
Directory Enquiry calls
International calls
ISP calls
Local calls
Long calls
Low-call calls
Mobile calls
National calls
Premium Rate calls
Short calls
Week-end calls

3.2 Pre-processing of the Data

Having been accustomed to artificial data, we found that this real data brought with it a completely new set of challenges. The number of fields seemed overwhelming, not to mention the number of discrete values in fields like the acorn code. The initial files were read into Microsoft Access, where they could be joined via the customer id relationship. This also allowed the revenue field to be calculated and inserted. The aim of the field selection process was to greatly reduce the number of fields: only those fields of particular interest for the problem stated above were selected. These are shown in Table 4.

Once the Access files were complete they were passed to Clementine [15] for the second stage of the processing. Clementine allowed analysis of the data for fields like revenue, acorn and LSIND, which needed to be split into an acceptable number of bands. It became clear from this early analysis that there was a clear shift among these customers from non-F&F users to F&F users over the twenty-seven month period. Our initial concerns came from the first batch, where the proportion of F&F users was quite small, raising worries about classes with small coverage. Another concern was that this might be a population shift problem [4] rather than concept drift. The results of the experiments would prove or disprove this.

Initially the acorn field had 55 values. These values are derived from the postcode and categorise communities with respect to their location and the sub-groups of the population within that community. They can be grouped into seven higher classification groups, and these seven bands were used for our experiments. Similarly, the life stage indicator had ten values, spanning from 1, representing young people, to 10, representing retired couples. These were regrouped into five values, combining 1 and 2 to give ‘ls_a’, 3 and 4 to give ‘ls_b’, and so on. The resulting fields and their final values are shown below in Table 4.
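As an illustration, the life-stage regrouping just described can be written as a small lookup, sketched here in Prolog (the predicate name lsind_band is ours and not part of CD3; the seven-band acorn regrouping would be handled in the same way):

:- use_module(library(lists)).   % for member/2 in systems that do not autoload it

% lsind_band(+LifeStage, -Band): collapse the ten life stage values pairwise
% into the five bands used in Table 4.
lsind_band(L, ls_a) :- member(L, [1, 2]).
lsind_band(L, ls_b) :- member(L, [3, 4]).
lsind_band(L, ls_c) :- member(L, [5, 6]).
lsind_band(L, ls_d) :- member(L, [7, 8]).
lsind_band(L, ls_e) :- member(L, [9, 10]).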

The revenue field values represent the six bands that were selected; the number at the end of each value represents the maximum revenue for the category. For example, ‘a_12’ represents all customers with revenue less than £12,000, the value ‘b_28’ represents all customers with revenue from £12,000 up to but not including £28,000, and so on.
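A minimal sketch of the revenue derivation and banding follows, with the band boundaries taken from the naming convention above (the predicate names are ours, and the monetary unit simply follows the description in the text):

% revenue/3 combines the two call-file fields described in Section 3.1.
revenue(NoOfCalls, AvgCostOfCalls, Revenue) :-
    Revenue is NoOfCalls * AvgCostOfCalls.

% revenue_band(+Revenue, -Band): map a revenue value onto the six discrete
% bands of Table 4 (written a12, b28, ... in the Figure 2 specification).
revenue_band(R, a_12)   :- R < 12000.
revenue_band(R, b_28)   :- R >= 12000, R < 28000.
revenue_band(R, c_40)   :- R >= 28000, R < 40000.
revenue_band(R, d_52)   :- R >= 40000, R < 52000.
revenue_band(R, e_70)   :- R >= 52000, R < 70000.
revenue_band(R, f_high) :- R >= 70000.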

Table 4. Selected and Processed Fields for Experimentation

Field       Values
Acorn       acorn_a, acorn_b, acorn_c, acorn_d, acorn_e, acorn_f
F&F         y, n
P_L         y, n
O15         y, n
SLIND       y, n
LSIND       ls_a, ls_b, ls_c, ls_d, ls_e
Revenue     a_12, b_28, c_40, d_52, e_70, f_high

Clementine also enabled the quality of the data to be analysed: we could monitor for missing and erroneous values. Given the quantity of data available, and because accuracy was an issue, we initially removed records with such properties. The remaining records within each batch were then split into training and test batches as shown in Table 5. The flexibility of CD3’s update regime allows the batches to be of various sizes.

Table 5. Training and Test Batches for Experimental run

Month           Total Examples    Training Examples    Test Examples
October 1995          840               840                  -
March 1996            837               558                 279
October 1996          848               566                 282
March 1997            823               549                 274
October 1997          793               529                 264

3.3 The Experimental Trial

For all our previous experiments we had a number of trials of data available; with real data there can only ever be one. We had five batches of data available, spanning a period of twenty-seven months. As before, CD3 uses the first batch as its current batch and then appends the additional batches, checking for drift. A header file must be created to represent the data specification, including the class values, for the algorithm, as shown in Figure 2.

The ‘mark2’ purger was used for this trial due to its prior success in [1]. The first batch, Oct_95, was taken by CD3 as its current batch. The subsequent batches were applied one by one in order of date. The algorithm recorded the ACR of CD3 based on the test set provided, the percentage purged on each iteration and the highest position of the ts attribute within the tree. Our hope was that, by analysing the position of the ts attribute and the percentage purged, we would be able to clearly identify drift between consecutive batches.


univ(btUniv).

attributes([acorn, p_l, o15, slind, lsind, revenue]).

att_values(acorn, [acorn_a, acorn_b, acorn_c, acorn_d, acorn_e, acorn_f]).
att_values(p_l, [y, n]).
att_values(o15, [y, n]).
att_values(slind, [y, n]).
att_values(lsind, [ls_a, ls_b, ls_c, ls_d, ls_e]).
att_values(revenue, [a12, b28, c40, d52, e70, fhigh]).

classes([y, n]).

Fig. 2. BT Call Data Specification

3.4 The Final Test

The classification performance of CD3 starts well, at a peak of 85% on application of the second batch, Mar_96, as shown in Figure 3. It would seem that there is little change between Oct_95 and Mar_96; the position of the ts attribute should confirm this. Application of the next two batches, Oct_96 and Mar_97, shows a slight decline in performance. Could this be an indication of drift? Detailed study of the ts attribute will confirm this.

[Figure 3: line plot of ACR% (y-axis, 40–90%) against Time Point (n).]

Fig. 3. ACR% for CD3 with BT Data

Figure 4 confirms our initial suspicions. The ts attribute only reaches level 2 on the application of the second batch, Mar_96 (0 is the root position: top, 1 is second top, and so on). This could be false purging occurring, or it could be the beginning of concept drift. CD3 still achieves a good ACR.

Detailed analysis of the tree in Figure 5 shows that at this early stage the revenue attribute is the most informative, and in some cases the only attribute required, in determining ‘F&F’ users. The acorn attribute then becomes informative. The drift occurs within acorn value acorn_d, indicating a change in the relevance of lsind, as highlighted in Figure 5.


Again with reference to Figure 4, the position of the ts attribute after applying the next two batches, Oct_96 and Mar_97, confirms our suspicions about drift. The ts attribute climbs to the top of the tree, indicating that all of the data has drifted. This drift continues into the fourth batch, Mar_97.

[Figure 4: plot of the highest ts position (y-axis, 0–2.5) against Time Point (n).]

Fig. 4. Highest ts Position for CD3 with BT Data

revenue
    a12 --> n - [40,735] / 775
    b28 --> n - [37,211] / 248
    c40 --> n - [5,43] / 48
    d52
        acorn
            acorn_a --> n - [1,2] / 3
            acorn_b --> def - n
            acorn_c --> y - [1,1] / 2
            acorn_d
                ts
                    curr
                        lsind
                            ls_a --> def - n
                            ls_b --> y - [1,0] / 1
                            ls_c --> def - n
                            ls_d --> n - [0,3] / 3
                            ls_e --> def - n
                    new --> y - [2,0] / 2

Fig. 5. A Section of the Pruned Tree Output from CD3 after Applying Second Batch

The percentage of examples being purged, as shown in Figure 6, is measured as a percentage of the current number of examples. It shows that for the second batch, i.e. Mar_96, the drift has not really started, resulting in a very low percentage purged of 0.54%. Again, this could be false purging or the beginning of the change. However, things begin to change with the next two batches. After applying the third batch we see an increase in the percentage being purged, to 13%. This increase accelerates to the highest rate of the experiment between the third and fourth batches, Oct_96 and Mar_97, from 13% to 34%.


Following this, as with the other findings above, the drift appears to begin declining after applying the final batch. Although the percentage being purged still increases, it does so at a slower rate. We also see that the ts attribute moves down to position one in Figure 4, in accordance with this.

[Figure 6: plot of the percentage purged (y-axis, 0–50%) against Time Point (n).]

Fig. 6. % Purged for CD3 with BT Data

Close analysis of the tree in Figure 7 makes it clear that at this final stage revenue is once again the most informative attribute and that drift applies only to the lowest band of revenue, ‘a12’. Customers within this revenue band and in acorn categories acorn_d or acorn_f previously all had an F&F indicator of ‘n’: non-‘F&F’ users. (Figure 7 has had some of the binary splits of the acorn attribute under the revenue value ‘a12’ removed to aid readability.) However, after applying the final batch, Oct_97, these two bands undergo a major change, with almost all customers changing to ‘F&F’ users. The reduction in the percentage purged, and presumably in the drift, is also reflected in the increase of the ACR in Figure 3 to 66%.

4 Conclusion

When working with real data it is difficult to determine whether drift exists within the data and, if so, where it occurs. The experiments confirmed the company’s suspicions: within the twenty-seven month period the profile of customers using the service has changed. The TSAR methodology allowed CD3 to locate the drift and highlight the changing properties within the customer profile.

If we look closely at the percentage being purged in Figure 6 we can see that towards the end of the trial CD3 is purging almost 50% of the data. At this stage CD3 has retained 2085 examples out of a total of 2762.

It would be interesting to follow this trial with a few more batches after October 1997 to determine whether the drift reduces and, if so, where. Over a longer period it may reduce and then reappear. By using the TSAR approach the user can analyse the drift at each stage. It is very interesting to study the differences in the knowledge structure between what was current and what is now new; the tree structure naturally offers a very clear and readable interpretation of the drift.


revenue
    a12
        ts
            curr
                acorn
                    acorn_a
                        lsind
                            ls_a --> def - n
                            ls_b --> y - [3,2] / 5
                            ls_c --> y - [9,3] / 12
                            ls_d
                                p_l
                                    y --> y - [1,0] / 1
                                    n --> n - [23,117] / 140
                            ls_e
                                o15
                                    y --> y - [1,0] / 1
                                    n
                                        p_l
                                            y --> y - [1,0] / 1
                                            n --> n - [17,69] / 86
                    acorn_d --> n - [77,322] / 399
                    acorn_f --> n - [39,205] / 244
            new
                p_l
                    y --> y - [27,1] / 28
                    n
                        o15
                            y --> y - [8,1] / 9
                            n
                                acorn
                                    acorn_a
                                        lsind
                                            ls_a --> def - n
                                            ls_b --> y - [2,2] / 4
                                            ls_c --> y - [5,3] / 8
                                    acorn_d
                                        lsind
                                            ls_a --> n - [2,4] / 6
                                            ls_b --> y - [2,1] / 3
                                            ls_c --> n - [6,7] / 13
                                            ls_d --> y - [39,31] / 70
                                            ls_e --> y - [6,5] / 11
                                    acorn_f
                                        lsind
                                            ls_a --> n - [0,2] / 2
                                            ls_b --> y - [8,7] / 15
                                            ls_c --> n - [2,3] / 5
                                            ls_d --> y - [8,3] / 11
                                            ls_e --> y - [11,10] / 21

Fig. 7. The Section of the Pruned Tree Output From CD3 after Applying Final Batch


References

1. Black, M., Hickey, R.J.: Maintaining the Performance of a Learned Classifier under Concept Drift. Intelligent Data Analysis 3 (1999) 453-474
2. Hickey, R.J., Black, M.: Refined Time Stamps for Concept Drift Detection During Mining for Classification Rules. In: Spatio-Temporal Data Mining (TSDM 2000). Lecture Notes in Artificial Intelligence, Vol. 2007. Springer-Verlag
3. Hickey, R.J.: Noise Modelling and Evaluating Learning from Examples. Artificial Intelligence 82 (1996) 157-179
4. Kelly, M.G., Hand, D.J., Adams, N.M.: The Impact of Changing Populations on Classifier Performance. In: Chaudhuri, S., Madigan, D. (eds.): Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York (1999) 367-371
5. Klenner, M., Hahn, U.: Concept Versioning: A Methodology for Tracking Evolutionary Concept Drift in Dynamic Concept Systems. In: Proceedings of the Eleventh European Conference on Artificial Intelligence. Wiley, Chichester, England (1994) 473-477
6. Schlimmer, J.C., Granger, R.H.: Incremental Learning from Noisy Data. Machine Learning 1 (1986) 317-354
7. Helmbold, D.P., Long, P.M.: Tracking Drifting Concepts by Minimising Disagreements. Machine Learning 14 (1994) 27-45
8. Hulten, G., Spencer, L., Domingos, P.: Mining Time-Changing Data Streams. In: Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining (2001)
9. Widmer, G.: Tracking Changes through Meta-Learning. Machine Learning 27 (1997) 259-286
10. Widmer, G., Kubat, M.: Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning 23 (1996) 69-101
11. Chakrabarti, S., Sarawagi, S., Dom, B.: Mining Surprising Patterns Using Temporal Description Length. In: Gupta, A., Shmueli, O., Widom, J. (eds.): Proceedings of the Twenty-Fourth International Conference on Very Large Databases. Morgan Kaufmann, San Mateo, California (1998) 606-61
12. Chen, X., Petrounias, I.: Mining Temporal Features in Association Rules. In: Zytkow, J., Rauch, J. (eds.): Proceedings of the Third European Conference on Principles and Practice of Knowledge Discovery in Databases. Lecture Notes in Artificial Intelligence, Vol. 1704. Springer-Verlag, Berlin Heidelberg New York (1999) 295-300
13. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California (1993)
14. Quinlan, J.R.: See5 (1998). http://www.rulequest.com/
15. Clementine, SPSS. http://www.spss.com/clemintine/
16. Utgoff, P.E.: Decision Tree Induction Based on Efficient Tree Restructuring. Machine Learning 29(1) (1997) 5-44
17. Clark, P., Boswell, R.: Rule Induction with CN2: Some Recent Improvements. In: Proceedings of the European Working Session on Learning (EWSL-91). Springer-Verlag, Berlin (1991) 151-163
18. Bratko, I.: Prolog Programming for Artificial Intelligence. Addison-Wesley, Wokingham (1990)


A Learning System for Decision Support in Telecommunications

Filip Zelezny1, Jiri Zidek2, and Olga Stepankova3

1 Center for Applied Cybernetics, 3 The Gerstner Laboratory
1,3 Faculty of Electrotechnics, Czech Technical University, Prague, Czech Republic
Technicka 2, CZ 166 27, Prague 6
zelezny,[email protected]

2 Atlantis Telecom s.r.o.
Zirovnicka 2389, CZ 106 00, Prague 10

[email protected]

Abstract. We present a system for decision support in telecommunications. History data describing the operation of a telephone exchange are analyzed by the system to reconstruct understandable event descriptions. The event descriptions are processed by an algorithm inducing rules that describe regularities in the events. The rules can be used as decision support rules (for the exchange operator) or directly to automate the operation of the exchange.

1 Introduction

In spite of the explosion of information technologies based on written communication, the most common and most frequently used tool is the telephone. Up-to-date private branch exchanges (PBX) provide comfort in managing telephone traffic, namely regarding calls coming into an enterprise from the outside world. Communication proceeds smoothly provided that the caller knows with whom she wants to communicate and the person is available. In the opposite case, there is a secretary, receptionist, operator or colleague who can, for instance, help to find a substituting person. The operator is a person with no direct product, but with a strong impact on the productivity of other people. Despite that, a wide range of companies have cancelled the post of the telephone operator. The reason is that it is not easy to find a person who is intelligent enough to be a good operator and modest enough to be just an operator. This opens the way for computers - the computer is paid for only once, so no fixed costs set in. Moreover, the machine can work non-stop and provide additional data suitable for analysis, allowing for improvements of the telecommunication traffic.

Currently there are several domains where computers are used in the PBX area (neglecting the fact that a PBX itself is a kind of computer):

– Automated attendant - a device that welcomes a caller in a unified manner and usually allows him to reach a person, or choose a person from a spoken list; in both cases the calling party is required to co-operate.


– Voice mail - a device allowing a spoken message to be left for an unavailable person; some rather sophisticated methods of delivering the messages are available.

– Information service - the machine substitutes a person in providing some basic information, usually organized into an information tree; the calling party is required to co-operate.

The aim of the above listed tools is to satisfy a caller even if there is no human service available at the moment. But all such devices are designed in a static, simple manner - they always act the same way. The reason is simple - they do not consider who is calling nor what they usually want - as opposed to the human operator. Comparing a human operator/receptionist to a computer, we can imagine the following improvements of automated telephony:

1. Considering who is calling (by the identified calling party number) and what number was dialled by the caller, the system can learn to determine the person most probably desired by the caller; knowledge can be obtained either from previous cases (taking into account other data like the time of day, or explicit information such as a long absence of one of the company’s employees) or by ‘observing’ the way the caller was handled by humans before; this could shorten the caller’s way to get the information she needs.

2. The caller can be informed by a machine in spoken language about the state of the call and suggested the most likely alternatives; messages should be ‘context sensitive’.

Naturally, the ultimate goal of computerized telephony is a fully ‘duplex’ machine that can both speak and comprehend spoken language, so that the feedback with the caller can proceed in a natural dialog.

We present a methodology whose aim is to satisfy goal 1. The task was defined by a telecommunication company that installs PBX switchboards in various enterprises. Our experiments are based on the PBX logging data coming from one of these enterprises. The methodology is reflected in a unified system with inductive (learning) capabilities employed to produce decision support rules based on the data describing the previous PBX switching traffic. The system can be naturally adapted to the conditions of a specific company (by including formally defined enterprise-related background knowledge) as well as to a change in the PBX firmware (again via an inductive learning process).

We employ the language of Prolog [3,5] (a subset of the language of first-order logic) as a unified formalism to represent the input data, the background knowledge, the reasoning mechanism and the output decision support rules. The reason for this is the structured nature of the data, with important dependencies between individual records, and the fact that sophisticated paradigms are available for learning in first-order logic. These paradigms are known as Inductive Logic Programming (ILP) [9,7]. The fundamental goal of ILP is the induction of first-order logic theories from logic facts and background knowledge. In recent years, two streams of ILP have developed, called the normal setting (where - roughly - theories with a ‘predictive’ nature are sought) and the non-monotonic setting (where the theories have a ‘descriptive’ character). We employ both of the settings in the system and a brief description of each will be given in the respective sections.

The paper is further organized as follows. The next section describes the data produced by the PBX. In Sections 3 and 4 we deal with the question of how to reconstruct events from the data, i.e. how to find out what actions the callers performed. In Section 5 we describe the way we induce decision support rules from the event database and appropriate background knowledge. Section 6 shows the overall interconnection of the individual learning/reasoning mechanisms into an integrated system.

A rough knowledge of the syntax of Prolog clauses (rules) is needed to understand the presented examples of the learning and reasoning system parts.

2 The Exchange and Its Data

The raw logging file of the PBX (MC 7500) is an ASCII file composed of metering receipts (tickets) describing ‘atomic events’. The structure of such a ticket is e.g.

4AB000609193638V1LO 1 12193650EDILBRDDEX 0602330533 005000 1FEFE

This ticket describes a single unanswered ring from the external number 0602 330533 on the internal line 12. To make the information carried by the ticket accessible to both the human user and the reasoning mechanisms, we convert the ticket descriptions into a relational-table form using the data transformation tool Sumatra TT [2], developed at CTU Prague. A window into the relational table is shown in Figure 1. The numbered columns denote the following attributes extracted from the ticket and related to the corresponding event: 1: date, 2: starting time, 3: monitored line, 4: end time, 5: call type (E - incoming, S - outgoing), 6: release type (LB - event terminated, LI - event continues in another ticket), 7: release cause (e.g. TR - call has been transferred), 8: call setup (D - direct, A - result of a previous transfer), 11: call nature (EX - external, LO - local, i.e. between internal lines, etc.), 12: corresponding party number, 14: PBX port used, 17: unique ticket key. Attributes not mentioned are not crucial for the explanation that follows.

A complete event, i.e. the sequence of actions (e.g. transfers between lines) starting with an answered ring from an outside party and ending with the call termination, is reflected by two or more tickets. For example, a simple event such as an external answered (non-transferred) call will produce two tickets in the database (one for the ring, another for the talk). Figure 1 contains records related to two simultaneous external calls, each of which was transferred to another line after a conversation on the originally called line. The first problem of the data analysis is apparent: tickets related to different events are mixed and not trivially separable.

1      2      3  4      5 6  7  8 9  10 11 12         13 14     15 16 17
000802 085151 32 085151 E LI    D      EX 0405353377     005001 FE FE 17664
000802 085158 10 085201 E LI    D      LO 32                    0  4  17665
000802 085201 10 085205 E LI    D DR 32   LO 32          005001 0  4  17666
000802 085151 32 085205 E LB TR D DR 06   EX 0405353377  005001 FE FE 17667
000802 085158 32 085205 S LB TR D    6    LO 10 10 10           0  0  17668
000802 085207 31 085207 E LI    D         EX 85131111    005009 FE FE 17669
000802 085218 11 085218 E LI    D         LO 31                 0  3  17670
000802 085218 11 085223 E LI    D DR 31   LO 31                 0  3  17671
000802 085207 31 085223 E LB TR D DR 72   EX 85131111    005009 FE FE 17672
000802 085214 31 085223 S LB TR D         LO 11 11              0  0  17673
000802 085223 11 085339 E LB    A DR 31   EX 85131111    005009 FE FE 17674
000802 085205 10 085424 E LB    A DR 32   EX 0400000000  005001 FE FE 17675

Fig. 1. A window into the PBX logging data containing two simultaneous calls.

Moreover, although calls originating from a transfer from a preceding call can be identified (those labelled A in attribute 8), it cannot be immediately seen from which call they originate. We also have to deal with an erroneous way of logging some instances of the external numbers by the PBX: e.g. the number 0400000000 in Figure 1 actually refers to the caller previously identified as 0405353377. This problem will be discussed in Section 4.1.

3 Event Extraction

The table in Figure 1 can be visualized graphically as shown in Figure 2. The figure visually distinguishes the two recorded simultaneous calls, although they are not distinguished by any attribute in the data. The two events are as follows: caller 0405353377 (EX1) is connected to the receptionist on line 32 and asks to be transferred to line 10. After a spoken notification from 32 to 10, the redirection occurs. During the transferred call between EX1 and 10, a similar event proceeds for caller 85131111 (EX2), receptionist 31 and the desired line 11. It can be seen that the duration of each of the two events (transferred calls) is covered by the durations of the external-call tickets related to the event. Furthermore, the answering port (attribute 14) recorded for each of the external-call tickets is constant within one event and different for different simultaneous events. In other words, a single external caller remains connected to one port until she hangs up, whether or not she gets transferred to different internal lines.

This is an expert-formulated, generally valid rule which can be used to delimit the duration of particular events (Figure 3). The sequences of connected external-call tickets are taken as a base for event recognition. We have implemented the event extractor both as a Prolog program and as a set of SQL queries. However, additional tickets (besides the external-call tickets) related to an event have to be found in the data to recognize the event. For instance, both of the events reflected in Figure 2 in fact contain a transfer-with-notification action (e.g. line 32 informs line 10 about the forthcoming transfer before it takes place), which can be deduced from the three tickets related to the internal line communication within each of the events.

23.35% of all tickets in the experimental database fall into one of the extracted events. The rest of the communication traffic thus consists of internal or outgoing calls.
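The port-based delimitation rule can be sketched in Prolog as follows. Here ext_ticket/3 is an assumed, simplified projection of the external-call tickets (ticket key, answering port (attribute 14) and external number (attribute 12)), populated with three tickets from Figure 1; the real extractor also chains tickets through the time fields, which is omitted here.

% Simplified view of three external-call tickets from Figure 1.
ext_ticket(id(17664), port('005001'), number('0405353377')).
ext_ticket(id(17667), port('005001'), number('0405353377')).
ext_ticket(id(17669), port('005009'), number('85131111')).

% Two external-call tickets are grouped into the same event when they were
% answered on the same PBX port.
same_event(T1, T2) :-
    ext_ticket(T1, Port, _),
    ext_ticket(T2, Port, _),
    T1 \= T2.

% ?- same_event(id(17664), id(17667)).   succeeds (same event)
% ?- same_event(id(17664), id(17669)).   fails (different simultaneous events)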


Fig. 2. Visualizing the chronology of the telecommunication traffic contained in the table in Figure 1. Vertical lines denote time instants; labelled horizontal lines denote the duration and attributes given by one ticket. For each such line, the upper-left / upper-right / lower-left / lower-right attributes denote the calling line, called line, call setup attribute and release type + cause, respectively. The abbreviations EX1 and EX2 represent two different external numbers. Thin lines stand for internal call tickets while thick lines represent external call tickets. The vertical position of the horizontal lines reflects the order of the tickets in the database. For ease of insight, tickets represented by dashed lines are related to a different call than those with full lines.

4 Event Reconstruction

Having obtained an event delimitation from the event extractor as a sequence of external-call tickets, we need to look up the database for all other tickets related to that event. From these tickets we can decide what sequence of actions occurred during the event, such as different kinds of call transfer (direct, with notification), their outcome (refusal, no answer, line busy), returns to the previous attendant, etc. The way such actions are reflected in the ticket database depends on the current setting of the PBX firmware, and an appropriate formal mapping events → sets of inter-related tickets is not available.


[Figure 3: dataflow diagram in which TICKETS feed the EVENT EXTRACTOR, which outputs EVENT DELIMITATIONS.]

Fig. 3. The event extraction.

Such a mapping can, however, be obtained via an inductive learning process which will discover ticket patterns for individual actions from classified examples, that is, completely described events. These classified examples were produced by intentionally performing a set of actions on the PBX and storing separately the tickets generated for each of the actions. The discovered (first-order) patterns will then be used to recognize transitions of an automaton formally describing the course of actions within an event.

4.1 Learning Action Patterns

The goal of the action-pattern learner is to discover ticket patterns, that is, the character of the set of tickets produced by the PBX in the logging data as a result of performing a specific action. This examines the occurrence of individual tickets in the set, their mutual order and/or (partial) overlapping in time, and so on.

For this purpose, tickets are represented in the Prolog fact syntax1 as

t(...,an, an+1,...)

where ai are ticket attributes described earlier and the irrelevant date attribute is omitted. The constant empty stands for a blank field in the data. The learner has two inputs: the classified event examples (sets of Prolog facts) and a general background knowledge (GBK). The following is a single instance of the example set, composed of facts bound to a single event, namely two facts representing tickets, and two facts representing the actions in the event: an external call from the number 0602330533 answered by the internal line 12, and the call termination caused by the external number hanging up.2

t(time(19,43,48),[1,2],time(19,43,48),e,li,empty,d,empty,empty,ex,
  [0,6,0,2,3,3,0,5,3,3],empty,anstr([0,0,5,0,0,0]),fe,fe,id(4)).

t(time(19,43,48),[1,2],time(19,43,50),e,lb,e(relcause),d,dr,06,ex,
  [0,6,0,0,0,0,0,0,0,0],empty,anstr([0,0,5,0,0,0]),fe,fe,id(5)).

1 Such a representation is obtained simply by a single pass of the Sumatra TT transformation tool on the original data.

2 Both external and internal line numbers are represented as Prolog lists to allow easy access to their substrings.


ex_ans([0,6,0,2,3,3,0,5,3,3],[1,2]).

hangsup([0,6,0,2,3,3,0,5,3,3]).

The general background knowledge GBK describes certain a priori known properties of the PBX. For example, due to a malfunction, the PBX occasionally substitutes a suffix of an identified external caller number by a sequence of zeros (such as in the second fact above). The correct and substituted numbers have to be treated as identical in the corresponding patterns. Therefore one of the rules (predicate definitions) in GBK is samenum(NUM1,NUM2), which unifies two numbers with identical prefixes and different suffixes, one of which is a sequence of zeros.
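The paper does not list the code of samenum/2; over the digit-list representation of numbers (see footnote 2), a possible definition might look like the following sketch (the helper predicate names are ours):

% samenum(+N1, +N2): the two digit lists agree, or they agree on a prefix and
% one of them is padded with zeros from the point of divergence onwards
% (as with 0405353377 and 0400000000 in Figure 1).
samenum(N1, N2) :- zero_masked(N1, N2).
samenum(N1, N2) :- zero_masked(N2, N1).

% zero_masked(+Full, +Masked): Masked copies Full up to some position and then
% consists only of zeros (both lists have the same length).
zero_masked([D|Fs], [D|Ms]) :- zero_masked(Fs, Ms).
zero_masked(Fs, Zs) :- eq_length(Fs, Zs), all_zeros(Zs).

all_zeros([]).
all_zeros([0|T]) :- all_zeros(T).

eq_length([], []).
eq_length([_|A], [_|B]) :- eq_length(A, B).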

To induce patterns from examples of the above form and the first-order background knowledge GBK, we constructed an ILP system working in the non-monotonic ILP setting (known also as learning from interpretations). The principle of this setting is that given a first-order theory B (background knowledge), a set of interpretations (sets of logic facts) E and a grammar G, we have to find all first-order clauses (rules) c included in the language defined by the grammar G, such that c is true in B&e for all e ∈ E.3 In our case, E is the set of classified events, B = GBK and G is defined so that it produces rules where tickets and their mutual relations are expressed in the rule’s Body and the action is identified in the rule’s Head. To specify G we integrated the freely available DLAB [4] grammar-definition tool into our ILP system. Besides grammar specification, DLAB also provides methods of clausal refinement, so we could concentrate on clause validity evaluation and implementing the (pruning) search through the space of clauses.
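The clause-validity test (footnote 3) translates almost literally into Prolog. A sketch, assuming B and the example e have already been loaded into the database and Head and Body are callable goals:

% valid_clause(+Head, +Body): the clause Head :- Body is true in B&e iff the
% query ?- Body, not Head has no solution.
valid_clause(Head, Body) :-
    \+ ( call(Body), \+ call(Head) ).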

An example of a generated pattern found to be valid for all of the collected examples is the following, describing which combination of tickets, and which relationships between them specified by the equalities in the rule’s body, reflects the action of answering a direct (non-transferred) external call.4

ex_ans(RNCA1,DN1):-
    t(IT1,DN1,ET1,e,li,empty,d,EF1,FI1,ex,RNCA1,empty,ANTR1,CO1,DE1,ID1),
    IT2=ET1,
    ANTR2=ANTR1,
    t(IT2,DN2,ET2,e,lb,RC2,d,EF2,FI2,ex,RNCA2,empty,ANTR2,CO2,DE2,ID2),
    samenum(RNCA1,RNCA2).

The time order of the involved tickets is determined by the equality IT2 = ET1 in the rule (with the variables IT2, ET1 referring to the initial time of the second ticket and the end time of the first ticket, respectively).

3 A clause c = Head :- Body is true in B&e if, with both B and e stored in a Prolog database, the Prolog query ?- Body, not Head against that database does not succeed.

4 Recall that capital letters stand for universally quantified variables in Prolog syntax.


Using the described approach to generate rules for other actions as well, we create a database of action patterns (as shown in Figure 4). Since we had known some of the patterns from experience in the manual data analysis, this process was both theory discovery and theory revision. The final action pattern database is thus a combination of induction results and explicit knowledge representation. The database can be kept static as long as the PBX firmware (i.e. the exact manner of logging) remains unchanged. The process should be repeated when the firmware is modified and the logging procedures change.

[Figure 4: dataflow diagram in which CLASSIFIED ACTIONS and GENERAL BACKGROUND KNOWLEDGE feed the PATTERN LEARNER, which outputs ACTION PATTERNS.]

Fig. 4. Learning action patterns.

4.2 Event Recognizing Automaton

To discover the sequence of actions in an event, we assume that every event (starting with an incoming call) can be viewed as the simple finite-state automaton shown in Figure 5. Each transition corresponds to one or more actions defined in the action pattern database (e.g. ‘Attempt to transfer’ corresponds to several kinds of transfer procedure). The automaton (event reconstructor) is encoded in Prolog. It takes as input an event-delimiting sequence S (produced by the event extractor) and the action-pattern database. In parsing S, the patterns are used to recognize transitions between the states. Since the patterns may refer to GBK and also to tickets not present in S (such as transfer-with-notification patterns - see Figure 2), both GBK and the ticket database must be available to the automaton. This dataflow is depicted in Figure 6.

Regarding the output, one version of the reconstructor produces human-understandable descriptions of the event, such as in the following example.

?- recognize([id(60216),id(60218),id(60224),id(60228),id(60232),id(60239)]).

EVENT STARTS.

648256849 rings on 32 - call accepted,

32 attempts to transfer 0600000000 to 16 with notification, but 16 refused,

32 notifies 12 and transfers 0648256849 to 12,

12 attempts to transfer 0600000000 to 28 with notification, but 28 does not respond,

12 notifies 26 and transfers 0600000000 to 26,

call terminated.

EVENT STOPS.

[Figure 5: state diagram with states Answered RING, TALK, TRN_ATTEMPT, UNAVAILABLE and TERMINATED, connected by Answered, Unanswered, Attempt to transfer and Hang up transitions.]

Fig. 5. The states and transitions of the event automaton. In this representation of the PBX operation, the sequence RING → Unanswered → UNAVAILABLE → Attempt to transfer cannot occur because the caller is not assisted by a person on an internal line.

An alternative version of the reconstructor produces the descriptions in the form of structured (recursive) Prolog facts of the form

incoming(DATE, TIME, CALLER, FIRST_CALLED_LINE, RESULT),

where

RESULT ∈ {talk, unavailable, transfer([t1, t2, ..., tn], RESULT)}     (1)

and t1, ..., tn-1 denote line numbers to which unsuccessful attempts to transfer were made, and the transfer result refers to the last transfer attempt (to tn). According to this syntax, the previous example output will be encoded as

incoming(date(10,18),time(13,37,29),[0,6,4,8,2,5,6,8,4,9],[3,2],
    transfer([[1,6],[1,2]],transfer([[2,8],[2,6]],talk))).     (2)

and in this form is used as the input to the inductive process described in the next section.
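As a small illustration of how these recursive facts can be processed (the predicate names below are ours, not the paper's), the following sketch walks a RESULT term of form (1) and returns the line on which the call was finally answered; applied to the encoded example (2) it yields [2,6], i.e. line 26, and it simply fails when the result is unavailable.

:- use_module(library(lists)).   % for last/2

% final_line(+IncomingFact, -Line)
final_line(incoming(_Date, _Time, _Caller, FirstLine, Result), Line) :-
    final_line_r(FirstLine, Result, Line).

% A 'talk' result means the call ended on the line reached so far; a 'transfer'
% result moves on to the last line in its attempt list.
final_line_r(Line, talk, Line).
final_line_r(_Line, transfer(Attempts, Result), Line) :-
    last(Attempts, NextLine),
    final_line_r(NextLine, Result, Line).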

The effectiveness of the described recognition procedure is illustrated in Figure 7. It can be seen that the method results in a very good coverage (recognition) of events containing more than 2 tickets. For the shorter events, more training examples will have to be produced and employed in learning to improve the coverage.


[Figure 6: dataflow diagram in which EVENT DELIMITATIONS, ACTION PATTERNS, TICKETS and GENERAL BACKGROUND KNOWLEDGE feed the EVENT RECONSTRUCTOR, which outputs EVENTS.]

Fig. 6. The event reconstruction.

5 Decision Support

Having reconstructed the events from the logging file, that is, knowing how different external callers have been handled in the past, the system can find regularities in the data, according to which some future events may be partially predicted (extrapolated) or even automated. For example, it may be found that whenever the caller identified by her number N calls the receptionist, she always desires to be transferred to line L. Then it makes sense to transfer N to L automatically upon the ring, without the assistance of the receptionist (provided that N can be transferred to another line by L if the prediction turns out to be wrong). Or, it may be a regular observation that if the person on line L1 is not available, line L2 is always provided as a substitute - this again offers an automation rule, or at least decision support advice to the receptionist.

A suitable methodology for inducing predictive rules from our data is the normal inductive logic programming setting, where the goal is (typically) the following. Given a first-order theory (background knowledge) B and two sets of logic facts E+, E− (positive and negative examples), find a theory T such that

1. B&T ⊨ e+ for each e+ ∈ E+

2. B&T ⊭ e− for each e− ∈ E−

That is, we require that any positive (negative) example can (cannot) be logically derived from the background knowledge and the resulting theory.

Fig. 7. The proportion of extracted events of each length (number of tickets) that are recognized by the recognition automaton (left). The proportion of tickets classified into a recognized event, out of all tickets extracted into events of each length (right). The complete ticket database contains about 70,000 tickets covering a 3-month operation of the PBX.

In our case, B is composed of the GBK described earlier and an enterprise-related background knowledge (EBK). EBK may describe, for example, the regular (un)availability of employees. The E+ set contains the event descriptions given by the predicate incoming, such as example (2). The negative example set is in our case substituted by integrity constraints that express, for instance, that a call cannot be transferred to two different lines, etc. The resulting theory is bound not to violate the integrity constraints.

Our experiments in the decision support part of the system have been reported in detail elsewhere [10]; therefore we just mention one example of a resulting rule, valid with accuracy 1 on the training set. The rule

incoming(D,T,EX,31,transfer([10|R],RES)):-
    day_is(monday,D), branch(EX,[5,0]).

employs the predicates day_is, whose meaning is obvious, and branch, which identifies external numbers by a prefix. Both predicates are defined in EBK. The rule’s meaning is that if a number starting with 50- calls the (reception) line 31 on a Monday, the caller always desires to be transferred to line 10 (whatever the transfer result is).
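The paper does not show the EBK definitions themselves; branch/2 could, for instance, be a simple prefix test over the digit-list representation of numbers, and day_is/2 a lookup over the dates occurring in the log. A sketch of the former, with an illustrative query:

:- use_module(library(lists)).   % for append/3

% branch(+Number, +Prefix): the external number (a digit list) starts with the
% digit-list Prefix identifying the branch.
branch(Number, Prefix) :-
    append(Prefix, _Rest, Number).

% ?- branch([5,0,1,2,3,4,5,6], [5,0]).   succeeds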

See [10] for a detailed overview of the predictive-rule induction experiments. Figure 8 summarizes the data-flow in the system’s decision support part.

So far, the performance of the decision-support rules has been tested only in the experimental environment; the implementation in the enterprise is currently under construction.

6 System Integration Overview

Figure 9 shows how the previously described individual system parts are integrated. The fundamental cycle is the following: the PBX generates data that are analyzed to produce decision support rules, which then in turn influence the operation of the PBX (with or without human assistance).


[Figure 8: dataflow diagram in which EVENTS, GENERAL BACKGROUND KNOWLEDGE and ENTERPRISE BACKGROUND KNOWLEDGE feed the RULE GENERATOR, which outputs DECISION SUPPORT RULES.]

Fig. 8. Generating decision support rules.

7 Conclusions

We have presented a system for decision support in telecommunications. The system analyzes data stored by a private branch exchange to reconstruct understandable event descriptions. For this purpose, action patterns are learned from classified examples of actions. The event descriptions are processed by an algorithm that induces rules describing event regularities, which can be used as decision support rules (for the exchange operator) or directly to automate the PBX operation.

In other words, we have performed a data-mining task on the input data and integrated the results into a decision support system. The methods of data mining are currently receiving a lot of attention [6], especially those allowing for intelligent mining from multiple-relation databases [9]. By employing the techniques of inductive logic programming, we are in fact conducting a multi-relational data-mining task. Although there is previous work on data mining in telecommunications [8], we are not aware of another published approach utilizing multi-relational data-mining methods in this field. The integration of data mining and decision-support systems is also an emerging and much-discussed topic [1], and research projects are being initiated in the scientific community to lay out a conceptual framework for such integration. We hope to have contributed to that research with this application-oriented paper.

Acknowledgements. This work has been supported by the project MSM 212300013 (Decision Making and Control in Manufacturing, a research programme funded by the Czech Ministry of Education, 1999-2003), the Czech Technical University internal grant No. 3 021 087 333 and the Czech Ministry of Education grant FRVS 23 21036 333.


[Figure 9: system integration diagram. The TELEPHONE EXCHANGE (PBX, MC 7500) produces TICKETS; the EVENT EXTRACTOR (SQL, Prolog or other) and the EVENT RECONSTRUCTOR (a finite-state automaton in Prolog) turn them into EVENT DELIMITATIONS and EVENTS; the PATTERN LEARNER (ILP) derives ACTION PATTERNS from CLASSIFIED ACTIONS and GENERAL BACKGROUND KNOWLEDGE; the RULE GENERATOR (ILP) combines EVENTS with GENERAL and ENTERPRISE BACKGROUND KNOWLEDGE to produce DECISION SUPPORT RULES, which (with or without a HUMAN) feed back to the PBX. Additional inputs are needed when the ENTERPRISE conditions or the PBX FIRMWARE change.]

Fig. 9. The system integration overview. The dataflow over the upper (lower) dashed line is needed only when the enterprise conditions are (the PBX firmware is) modified, respectively. The dotted arrow represents the optional manual formulation of the expert knowledge about the action representation in the logging data. The star-labelled processes are those where learning/induction takes place. The digit in the upper-left corner of a box refers to the section where more detail on the respective process/database is given.


References

1. Workshop papers. In: Integrating Aspects of Data Mining, Decision Support and Meta-Learning. Freiburg, Germany, 2001.
2. Petr Aubrecht. Sumatra Basics. Technical Report GL-121/00 1, Czech Technical University, Department of Cybernetics, Technicka 2, 166 27 Prague 6, December 2000.
3. Ivan Bratko. Prolog: Programming for Artificial Intelligence. Computing Series. Addison-Wesley Publishing Company, 1993. ISBN 0-201-41606-9.
4. L. Dehaspe and L. De Raedt. DLAB: a declarative language bias formalism. In Proceedings of the 10th International Symposium on Methodologies for Intelligent Systems, volume 1079 of Lecture Notes in Artificial Intelligence, pages 613-622. Springer-Verlag, 1996.
5. P.A. Flach. Simply Logical: Intelligent Reasoning by Example. John Wiley, 1994.
6. D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2000.
7. N. Lavrac and S. Dzeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
8. R. Mattison. Data Warehousing and Data Mining for Telecommunications. Artech House, 1997.
9. S. Dzeroski and N. Lavrac, editors. Relational Data Mining. Springer-Verlag, Berlin, September 2001.
10. F. Zelezny, P. Miksovsky, O. Stepankova, and J. Zidek. ILP for automated telephony. In J. Cussens and A. Frisch, editors, Proceedings of the Work-in-Progress Track at the 10th International Conference on Inductive Logic Programming, pages 276-286, 2000.


Adaptive User Modelling in an Intelligent Telephone Assistant

Trevor P. Martin1 and Behnam Azvine2

1University of Bristol, Bristol, BS8 1TR, [email protected]

2BTexact Technologies, Adastral Park, Ipswich, IP5 3RE, [email protected]

Abstract. With the burgeoning complexity and capabilities of modern information appliances and services, user modelling is becoming an increasingly important research area. Simple user profiles already personalise many software products and consumer goods such as digital TV recorders and mobile phones. A user model should be easy to initialise, and it must adapt in the light of interaction with the user. In many cases, a large amount of training data is needed to generate a user model, and adaptation is equivalent to retraining the system. This paper briefly outlines the user modelling problem and work done at BTexact on an Intelligent Personal Assistant (IPA) which incorporates a user profile. We go on to describe FILUM, a more flexible method of user modelling, and show its application to the Telephone Assistant component of the IPA, with tests to illustrate its usefulness.

1 Introduction

We can recognise a strongly growing strand of interest in user modelling arising from research into intelligent interfaces. In this context, we can identify three different outcomes of user modelling:
• Changing the way in which some fixed content is delivered to the user.
• Changing the content that is delivered to the user.
• Changing the way in which the device is used.
Each of these is discussed in turn below.

The first is more a property of the device that displays content to a user. For example, a WAP browser must restrict graphical content. There is little room for user likes and dislikes, although [12] describe a system which implements different interfaces for different users. Those who have more difficulty navigating through the system use a menu-based interface, whereas those with a greater awareness of the system contents are given an interface using a number of shortcut keys.

The second category—improving information content—is perhaps the most common. Examples abound in Internet-related areas, with applications to
• Deliver only “interesting” news stories to an individual’s desktop. The pointcast news delivery systems are a first step (e.g. www.pointcast.com/products/pcn/ and cnn.com/ads/advertiser/pointcast2.0/); see also [11] and IDIoMS [13].

• Remove unwanted emails.


• Identify interesting web pages—for example Syskill & Webert [24] uses an information-theoretic approach to detect “informative” words on web pages. These are used as features, and user ratings of web pages (very interesting, interesting, not interesting, etc.) create a training data set for a naive Bayesian classifier. A similar approach can be used for the retrieval of documents from digital libraries, using term frequency/inverse document frequency [30] to select keywords and phrases as features. A user model can be constructed in terms of these features and used to judge whether new documents are likely to be of interest.

This is a very active area of web development—for example, W3C’s Metadata Activity [32] is concerned with ways to model and encode metadata, that is, information about the kind of information held in a web page, and to document the meaning of the metadata. The primary reason behind this effort is to enable computers to search more effectively for relevant data; however, this presupposes some method for the system to know the user’s interests, that is, some kind of user profile.

With the incorporation of powerful embedded computing devices in consumer products, there is a blurring of boundaries between computers and other equipment, resulting in a convergence to information appliances or information devices. Personalisation, which is equivalent to user modelling, is a key selling point of this technology—for example, to personalise TV viewing (www.tivo.com, 1999):
− “With TiVo, getting your favorite programs is easy. You just teach it what shows you like, and TiVo records them for you automatically.
− As you’re watching TV, press the Thumbs Up or Thumbs Down button on the TiVo remote to teach TiVo what you like
− As TiVo searches for shows you’ve told it to record, it will also look for shows that match your preferences and get those for you as well...”
Sony have implemented a prototype user modelling system [34] which predicts a viewing timetable for a user, on the basis of previous viewing and programme classification. Testing against a database of 606 individuals, 108 programme categories and 45 TV channels gave an average prediction accuracy of 60-70%. We will not discuss social or collaborative filtering systems here. These are used to recommend books (e.g. amazon.com), films, and so on, and are based on clustering the likes and dislikes of a group of users.

The third category—changing the way in which the device is used—can also be illustrated by examples. Microsoft’s Office Assistant is perhaps the best known example of user modelling, and aims to provide appropriate help when required, as well as a “tip of the day” that is intended to identify and remedy gaps in the user’s knowledge of the software. The Office Assistant was developed from the Lumiere [16] project, which aimed to construct Bayesian models for reasoning about the time-varying goals of computer users from their observed actions and queries. Although it can be argued that the Office Assistant also fits into the previous category (changing the content delivered to the user), its ultimate aim is to change the way the user works so that the software is employed more effectively.

The system described by [19] has similar goals but a different approach. User modelling is employed to disseminate expertise in the use of software packages (such as Microsoft Word) within an organisation. By creating an individual user model and comparing it to expert models, the system is able to identify gaps in knowledge and offer individualised tips as well as feedback on how closely the user matches expert use of the package. The key difference from the Office Assistant is that this system monitors all users and identifies improved ways of accomplishing small tasks; this expertise can then be spread to other users. The Office Assistant, on the other hand, has a static view of best practice.

Hermens and Schlimmer [14] implemented a system which aided a user filling in an electronic form, by suggesting likely values for fields in the form based on the values in earlier fields.

The change in system behaviour may not be obvious to the user. Lau and Horvitz [18] outline a system which uses a log of search requests from Yahoo, and classifies users’ behaviour so that their next action can be predicted using a Bayesian net. If it is likely that a user will follow a particular link, rather than refining or reformulating their query, then the link can be pre-fetched to improve the perceived performance of the system. This approach generates canonical user models, describing the behaviour of a typical group of users, rather than individual user models.

There are two key features in all these examples:
• The aim is to improve the interaction between human and machine. This is a property of the whole system, not just of the machine, and is frequently a subjective judgement that can not be measured objectively.
• The user model must adapt in the light of interaction with the user.
Additionally, it is desirable that the user model
− Be gathered unobtrusively, by observation or with minimal effort from the user.
− Be understandable and changeable by the user - both in terms of the knowledge held about the user and in the inferences made from that knowledge.
− Be correct in actions taken as well as in deciding when to act.

2 User Models—Learning, Adaptivity, and Uncertainty

The requirement for adaptation puts user modelling into the domain of machine learning (see [17] and [33]). A user model is generally represented as a set of attribute-value pairs—indeed the W3C proposals [31] on profile exchange recommend this representation. This is ideal for machine learning, as the knowledge representation fits conveniently into a propositional learning framework. To apply machine learning, we need to gather data and identify appropriate features plus the desired attribute for prediction. To make this concrete, consider a system which predicts the action to be taken on receiving emails, using the sender’s identity and words in the title field. Most mail readers allow the user to define a kill file, specifying that certain emails may be deleted without the user seeing them. A set of examples might lead to rules such as

if title includes $ or money then action = delete
if sender = boss then action = read, and subsequently file
if sender = mailing list then action = read, and subsequently delete

This is a conventional propositional learning task, and a number of algorithms exist to create rules or decision trees on the basis of data such as this [4], [5], [7], [26], [27]. Typically, the problem must be expressed in an attribute-value format, as above; some feature engineering may be necessary to enable efficient rules to be induced. Rule-based knowledge representation is better than (say) neural nets due to better understandability of the rules produced - the system should propose rules which the


user can inspect and alter if necessary. See [23] for empirical evidence of the importance of allowing the user to remain in control.
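To make the attribute-value view concrete, the following sketch (not part of the original paper) encodes the kill-file rules above as a small classifier over hypothetical e-mail records; the field names and the example message are illustrative assumptions only.

def classify(email):
    # Hand-written rules of the kind a propositional learner ([26], [27]) might
    # induce from labelled examples of (sender, title words) -> action.
    title_words = email["title"].lower().split()
    if "$" in title_words or "money" in title_words:
        return "delete"
    if email["sender"] == "boss":
        return "read, then file"
    if email["sender"] == "mailing list":
        return "read, then delete"
    return "read"

# A hypothetical incoming message, represented as attribute-value pairs.
print(classify({"sender": "mailing list", "title": "make money fast"}))   # -> "delete"

An induced rule set or decision tree would simply replace the hand-written conditions; the attribute-value representation stays the same.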

One problem with propositional learning approaches is that it is difficult to extract relational knowledge. For example:

if several identical emails arrive consecutively from a listserver, then delete all but one of them

Also, it can be difficult to express relevant background knowledge such as:
if a person has an email address at acme.com then that person is a work colleague

These problems can be avoided by moving to relational learning, such as inductive logic programming [22], although this is not without drawbacks as the learning process becomes a considerable search task.

Possibly more serious issues relate to the need to update the user model, and to incorporate uncertainty. Most machine learning methods are based on a relatively large, static set of training examples, followed by a testing phase on previously unseen data. New training examples can normally be addressed only by restarting the learning process with a new, expanded, training set. As the learning process is typically quite slow, this is clearly undesirable. Additionally, in user modelling it is relatively expensive to gather training data - explicit feedback is required from the user, causing inconvenience. The available data is therefore more limited than is typical for machine learning.

A second problem relates to uncertainty. User modelling is inherently uncertain—as [15] observes, “Uncertainty is ubiquitous in attempts to recognise an agent’s goals from observations of behaviour,” and even strongly logic-based methods [25] acknowledge the need for “graduated assumptions.” There may be uncertainty over the feature definitions. For example:

if the sender is a close colleague then action = read very soon

(where close colleague and very soon are fuzzily defined terms), or over the applicability of rules. For example:

if the user has selected several options from a menu and undone each action, then it is very likely that the user requires help on that menu

where the conclusion is not always guaranteed to follow.
It is an easy matter to say that uncertainty can be dealt with by means of a fuzzy approach, but less easy to implement the system in a way that satisfies the need for understandability. The major problem with many uses of fuzziness is that they rely on intuitive semantics, which a sceptic might translate as “no semantics at all.” It is clear from the fuzzy control literature that the major development effort goes into adjusting membership functions to tune the controller. Bezdek [9], [10] suggests that membership functions should be “adjusted for maximum utility in a given situation.” However, this leaves membership functions with no objective meaning—they are simply parameters to make the software function correctly. For a fuzzy knowledge based system to be meaningful to a human, the membership functions should have an interpretation which is independent of the machine operation—that is, one which does not require the software to be executed in order to determine its meaning. Probabilistic representations of uncertain data have a strictly defined interpretation, and the approach adopted here uses Baldwin’s mass assignment theory and voting model semantics for fuzzy sets [3], [8].
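As a brief illustration of the voting-model reading of a fuzzy set (a sketch under our own simplifying assumptions, not code from [3] or [8]): the membership of an element is read as the proportion of voters who accept it, and the mass assignment distributes probability mass over the nested level sets of the fuzzy set.

def mass_assignment(fuzzy_set):
    # Mass assignment of a fuzzy set given as {element: membership}.
    levels = sorted(set(fuzzy_set.values()), reverse=True)   # distinct membership levels
    masses = {}
    for i, level in enumerate(levels):
        cut = frozenset(x for x, m in fuzzy_set.items() if m >= level)   # level set
        nxt = levels[i + 1] if i + 1 < len(levels) else 0.0
        masses[cut] = level - nxt
    if max(fuzzy_set.values()) < 1.0:
        masses[frozenset()] = 1.0 - max(fuzzy_set.values())  # subnormal set: remaining mass to the empty set
    return masses

# A hypothetical fuzzy set for "very soon" over delay categories:
print(mass_assignment({"immediately": 1.0, "within_the_hour": 0.7, "same_day": 0.2}))
# -> roughly 0.3 on {immediately}, 0.5 on {immediately, within_the_hour},
#    and 0.2 on all three categories

Because the masses are probabilities of crisp sets, the membership function keeps a meaning that does not depend on running the software, which is the point made above.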


3 The Intelligent Personal Assistant

BTexact’s Intelligent Personal Assistant (IPA) [1], [2] is an adaptive software system that automatically performs helpful tasks for its user, helping the user achieve higher levels of productivity. The system consists of a number of assistants specialising in time, information, and communication management:
• The Diary Assistant helps users schedule their personal activities according to their preferences.
• Web and Electronic Yellow Pages Assistants meet the user’s needs for timely and relevant access to information and people.
• The RADAR assistant reminds the user of information pertaining to the current task.
• The Contact Finder Assistant puts the user in touch with people who have similar interests.
• The Telephone and Email Assistants give the user greater control over incoming messages by learning priorities and filtering unwanted communication.
As with any personal assistant, the key to the IPA’s success is an up-to-date understanding of the user’s interests, priorities, and behaviour. It builds this profile by tracking the electronic information that a user reads and creates over time—for example, web pages, electronic diaries, e-mails, and word processor documents. Analysis of these information sources and their timeliness helps IPA understand the user’s personal interests. By tracking diaries, keyboard activity, gaze, and phone usage, the IPA can build up a picture of the habits and preferences of the user.

We are particularly interested in the Telephone and E-mail assistants for communication management, used respectively for filtering incoming calls and prioritising incoming e-mail messages. The Telephone Assistant maintains a set of priorities of the user’s acquaintances, and uses these in conjunction with the caller’s phone number to determine the importance of an incoming call. The E-mail Assistant computes the urgency of each incoming message based on its sender, recipients, size and content. Both assistants use Bayesian networks for learning the intended actions of the user, and importantly, the system continually adapts its behaviour as the user’s priorities change over time.

The telephone assistant handles incoming telephone calls on behalf of the user with the aim of minimising disruption caused by frequent calls. For each incoming call, the telephone assistant determines whether to interrupt the user (before the phone rings) based on the importance of the caller and on various contextual factors such as the frequency of recent calls from that caller and the presence of a related entry in the diary (e.g. a meeting with the caller). When deciding to interrupt the user, the telephone assistant displays a panel indicating that a call has arrived; the user has the option of accepting or declining to answer the call. The telephone assistant uses this feedback to learn an overall user model for how the user weights the different factors in deciding whether or not to answer a call. Although this model has been effective, its meaning is not obvious to a user, and hence it is not adjustable. To address this issue, the FILUM [20], [21] approach has been applied to the telephone assistant.


4 Assumptions for FILUM

We consider an interaction between a user and a software or hardware system in which the user has a limited set of choices regarding his/her next action. For example, given a set of possible TV programmes, the user will be able to select one to watch. Given an email, the user can gauge its importance and decide to read it immediately, within the same day, within a week, or perhaps discard it as unimportant. The aim of user modelling is to be able to predict accurately the user’s decision and hence improve the user’s interaction with the system by making such decisions automatically. Human behaviour is not generally amenable to crisp, logical modelling. Our assumption is that the limited aspect of human behaviour to be predicted is based mainly on observable aspects of the user’s context—for example, in classifying an email the context could include features such as the sender, other recipients of the message, previously received messages, current workload, time of day, and so on. Of course, there are numerous unobservable variables - humans have complex internal states, emotions, external drives, and so on. This complicates the prediction problem and motivates the use of uncertainty modelling—we can only expect to make correct predictions “most” of the time.

We define a set of possible output values

B = {b1, b2, …, bj},

which we refer to as the behaviour, and a set of observable inputs

I = {i1, i2, …, im}.

Our assumption is that the (n+1)th observation of the user’s behaviour is predictable by some function of the current observables and all previous inputs and behaviours:

bn+1 = f(I1, b1, I2, b2, …, In, bn, In+1)

The user model, including any associated processing, is equivalent to the function f. This is assumed to be relatively static; within FILUM, addition of new prototypes would correspond to a change in the function.
We define a set of classes implemented as Fril++ [6], [28] programs,

C = {c1, c2, …, ck}.

A user model is treated as an instance that has a probability of belonging to each class according to how well the class behaviour matches the observed behaviour of the user. The probabilities are expressed as support pairs, and updated each time a new observation of the user’s behaviour is made.
We aim to create a user model m which correctly predicts the behaviour of a user. Each class ci must implement the method Behaviour, giving an output in B (this may be expressed as supports over B). Let Sn(m ∈ ci) be the support for the user model m belonging to the ith class before the nth observation of behaviour. Initially,

S1(m ∈ ci) = [0, 1]

for all classes ci, representing complete ignorance.


Each time an observation is made, every class makes a prediction, and the support for the user model being a member of that class is updated according to the predictive success of the class:

Sn+1(m ∈ ci) = ( n × Sn(m ∈ ci) + S(ci.Behaviourn+1 == bn+1) ) / (n + 1)     (1)

where S(ci.Behaviourn+1 == bn+1) represents the (normalised) support for class ci predicting the correct behaviour (from the set B) on iteration n+1. It is necessary to normalise the support update to counteract the swamping effect that could occur if several prototypes predict the same behaviour.
Clearly, as n becomes large, supports change relatively slowly; [29] discuss an alternative updating algorithm. The accuracy of the user model at any stage is the proportion of correct predictions made up to that point—this metric can easily be changed to use a different utility function, for example, if some errors are more serious than others.
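A minimal sketch of the update in equation (1), applied component-wise to the lower and upper bounds of a support pair (the component-wise treatment and the numbers are our own illustrative assumptions, not the Fril++ implementation):

def update_support(support, n, prediction_support):
    # Equation (1): S_{n+1}(m in ci) = (n * S_n(m in ci) + S(ci.Behaviour_{n+1} == b_{n+1})) / (n + 1)
    lower, upper = support
    s_lo, s_hi = prediction_support   # normalised support for the class having predicted b_{n+1}
    return ((n * lower + s_lo) / (n + 1),
            (n * upper + s_hi) / (n + 1))

support = (0.0, 1.0)                  # S1: complete ignorance
for n in (1, 2, 3):                   # three observations, all predicted correctly
    support = update_support(support, n, (1.0, 1.0))
print(support)                        # -> (0.75, 1.0)

With each correct prediction the necessary support rises towards 1, while, as noted above, the rate of change slows as n grows.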

4.1 Testing

In order to test any user modelling approach, data is needed. Initial studies generated data using an artificial model problem, the n-player iterated prisoner’s dilemma. A Fril++ system was developed to run n-IPD tournaments, allowing a choice of strategies to be included in the environment, with adjustable numbers of players using each strategy. The user model aimed to reproduce the behaviour of a player by means of some simple prototypes. The user models converged after 10-12 iterations, that is, the supports for the models belonging to each class do not change significantly after this. Overall predictive success rates were good, typically 80%-95%, although random strategies were difficult to predict, as would be expected.

4.2 User Models in the Telephone Assistant

The FILUM approach has also been applied to the prediction of user behaviour in the telephone assistant. The following assumptions have been made:
• The user model must decide whether to divert the call to voicemail or pass it through to be answered.
• The user is available to answer calls.
• Adaptive behaviour is based on knowing the correct decision after the call has finished.
• A log of past telephone activity and the current diary are available.
• The identity of all callers is known.
A sample set of user prototypes is shown in Table 1.


Table 1. User Prototypes

Prototype     Identifying Characteristic                                              Behaviour
Talkative     none                                                                    always answer
Antisocial    none                                                                    always divert to voicemail
Interactive   recent calls or meetings involving this caller                          answer
Busy          small proportion of free time in next working day (as shown by diary)   answer if caller is brief, otherwise divert to voicemail
Overloaded    small proportion of free time in next working day (as shown by diary)   divert to voicemail
Selective     none                                                                    answer if caller is a member of a selected group, else divert to voicemail
Regular       large proportion of calls answered at particular times of the day, e.g. early morning   answer if this is a regular time

This approach assumes that all activities are planned and recorded accurately in an electronically accessible format. Other ways of judging a user’s activity would be equally valid and may fit in better with a user’s existing work pattern—for example, the IPA system also investigated the use of keyboard activity, gaze tracking and monitoring currently active applications on a computer. There is a need to model callers using a set of caller prototypes, since a user can react in different ways to different callers in a given set of circumstances. For example, the phone rings when you are due to have a meeting with the boss in five minutes. Do you answer if (a) the caller is the boss, or (b) the caller is someone from the other side of the office who is ringing to talk about last night’s football results while waiting for a report to print? The sample set of caller prototypes is shown in Table 2.

The user and caller prototypes are intended to illustrate the capabilities of the system rather than being a complete set; it is hoped that they are sufficiently close to real behaviour to make detailed explanation unnecessary.
Terms in italics are fuzzy definitions that can be changed to suit a user. Note that support pairs indicate the degree to which a user or caller satisfies a particular prototype - this can range from uncertain (0 1) to complete satisfaction (1 1) or its opposite (0 0), through to any other probability interval.

Table 2. Caller Prototypes

Prototype    Identifying Characteristic
Brief        always makes short calls to user
Verbose      always makes long calls to user
Frequent     calls user frequently
Reactive     calls following a recent voicemail left by user
Proactive    calls prior to a meeting with user
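The sketch below is a deliberately simplified, hypothetical rendering of how prototypes like those in Tables 1 and 2 could drive a prediction: each prototype reports whether its identifying characteristic holds in the current context and which behaviour it proposes, and applicable prototypes are weighted by the midpoint of their current support pair (a crude stand-in for the Fril++ support-pair calculus). All names, thresholds and context fields are illustrative only.

def talkative(ctx):    return True, "answer"
def antisocial(ctx):   return True, "voicemail"
def interactive(ctx):  return ctx["recent_contact_with_caller"], "answer"
def busy(ctx):
    applies = ctx["free_time_tomorrow"] < 0.25          # "small proportion" made crisp here
    return applies, ("answer" if ctx["caller_is_brief"] else "voicemail")

prototypes = {"Talkative": talkative, "Antisocial": antisocial,
              "Interactive": interactive, "Busy": busy}

def predict(ctx, support):
    votes = {}
    for name, proto in prototypes.items():
        applies, behaviour = proto(ctx)
        if applies:
            lo, hi = support[name]                      # current support pair for this prototype
            votes[behaviour] = votes.get(behaviour, 0.0) + (lo + hi) / 2.0
    return max(votes, key=votes.get)

support = {"Talkative": (0.2, 0.4), "Antisocial": (0.0, 0.1),
           "Interactive": (0.6, 0.9), "Busy": (0.3, 0.6)}
ctx = {"recent_contact_with_caller": True, "free_time_tomorrow": 0.1, "caller_is_brief": False}
print(predict(ctx, support))                            # -> "answer"

After the true outcome of the call is known, the support pairs would be updated with equation (1) as sketched earlier.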


A sample diary is shown in Figure 1. Note that the diary is relatively empty at the beginning and end of the week but relatively full in the middle of the week. The busy and overloaded prototypes are written to be applicable when there is a small proportion of free time in the immediate future, that is, during the latter part of Tuesday and Wednesday.

Fig. 1. Sample of diary (activities: design_review, seminar, research, programming, home). The window for the working day has been defined as 7:00 am - 8:00 pm, and diaried activities for each fifteen minute period within the window are shown; unassigned slots represent free time which can be used as appropriate at the time.


Fig. 2. Performance of the user model on individual calls (correct prediction = 1, incorrect prediction = 0) and as a cumulative success rate.

Figure 2 shows the success rate of the user model in predicting whether a call should be answered or diverted to voicemail. The drop in performance on the second day occurs because the busy and overloaded prototypes become active at this time, due to the full diary on Wednesday and Thursday. It takes a few iterations for the system to increase the membership of the user model in the busy and overloaded classes; once this has happened, the prediction rate increases again.
The necessary and possible supports for membership of the user model in the busy class are shown in Figure 3, where the evolution of support can be seen on the third and fourth days where this prototype is applicable. At the start, the identifying characteristics (full diary) are not satisfied and support remains at unknown (0 1). In the middle of the week, conditions are satisfied. Initially the user does not behave as predicted by this prototype and possible support drops (i.e. support against increases);



subsequently, the user behaves as predicted and necessary support increases. At the end of the week, once again the identifying characteristics are not satisfied.

By repeating the week’s data, there is relatively little change in the support pairs—this suggests that the learning has converged, although additional testing is necessary. Evolution of caller models can also be followed within the system, and good convergence to a stable caller model is observed. It should be emphasised that the supports for each prototype can be adjusted by the user at any stage. The user modelling software has been tested in several diary and call log scenarios, with good rates of prediction accuracy. Further testing is needed to investigate the user model evolution over longer periods.


Fig. 3. Evolution of support pair for the “busy” prototype in the user model.

5 Summary

The aim of user modelling is to increase the quality of interaction—this is almost always a subjective judgement, and it can be difficult to discuss the success (or otherwise) of user modelling. We have developed an experimental testbed based on the iterated prisoner’s dilemma, allowing generation of unlimited data. Prediction success rates vary between 80-95% for non-random behaviours in the testbed, and accuracy of over 80% has been obtained in a series of simulated tests of the telephone assistant.

The user model changes with each observation, and there is very little overhead in updating the user model. This approach depends on having a “good” set of prototypes, giving reasonable coverage of possible user behaviour. It is assumed that a human expert is able to provide such a set; however, it is possible that (for example) inductive logic programming could generate new prototype behaviours. This is an interesting avenue for future research.


References

1. Azvine, B., D. Djian, K.C. Tsui and W. Wobcke, The Intelligent Assistant: An Overview. Lecture Notes in Computer Science, 2000(1804): p. 215-238.
2. Azvine, B. and W. Wobcke, Human-centred intelligent systems and soft computing. BT Technology Journal, 1998. 16(3): p. 125-133.
3. Baldwin, J.F., The Management of Fuzzy and Probabilistic Uncertainties for Knowledge Based Systems, in Encyclopedia of AI, S.A. Shapiro, Editor. 1992, John Wiley. p. 528-537.
4. Baldwin, J.F., J. Lawry and T.P. Martin, A mass assignment based ID3 algorithm for decision tree induction. International Journal of Intelligent Systems, 1997. 12(7): p. 523-552.
5. Baldwin, J.F., J. Lawry and T.P. Martin. Mass Assignment Fuzzy ID3 with Applications. in Proc. Fuzzy Logic - Applications and Future Directions. 1997. pp 278-294 London: Unicom.
6. Baldwin, J.F. and T.P. Martin, Fuzzy classes in object-oriented logic programming, in Fuzz-IEEE '96 - Proceedings of the Fifth IEEE International Conference On Fuzzy Systems, Vols 1-3. 1996. p. 1358-1364.
7. Baldwin, J.F. and T.P. Martin. A General Data Browser in Fril for Data Mining. in Proc. EUFIT-96. 1996. pp 1630-1634 Aachen, Germany.
8. Baldwin, J.F., T.P. Martin and B.W. Pilsworth, FRIL - Fuzzy and Evidential Reasoning in AI. 1995, U.K.: Research Studies Press (John Wiley). 391.
9. Bezdek, J.C., Fuzzy Models. IEEE Trans. Fuzzy Systems, 1993. 1(1): p. 1-5.
10. Bezdek, J.C., What is Computational Intelligence?, in Computational Intelligence - Imitating Life, J.M. Zurada, R.J. Marks II, and C.J. Robinson, Editors. 1994, IEEE Press. p. 1-12.

11. Billsus, D. and M.J. Pazzani. A Hybrid User Model for News Story Classification. in Proc. User Modeling 99. 1999. pp 99-108 Banff, Canada.
12. Bushey, R., J.M. Mauney and T. Deelman. The Development of Behavior-Based User Models for a Computer System. in Proc. User Modeling 99. 1999. pp 109-118 Banff, Canada.
13. Case, S.J., N. Azarmi, M. Thint and T. Ohtani, Enhancing e-Communities with Agent-Based Systems. IEEE Computer, 2001. 33(7): p. 64.
14. Hermens, L.A. and J.C. Schlimmer, A Machine-Learning Apprentice for The Completion of Repetitive Forms. IEEE Expert-Intelligent Systems & Their Applications, 1994. 9(1): 28-33.
15. Horvitz, E., Uncertainty, action, and interaction: in pursuit of mixed-initiative computing. IEEE Intelligent Systems & Their Applications, 1999. 14(5): p. 17-20.
16. Horvitz, E., J. Breese, et al. The Lumière Project: Bayesian User Modeling for Inferring the Goals and Needs of Software Users. in Proc. Fourteenth Conference on Uncertainty in Artificial Intelligence. 1998. pp 256-265 (see also http://www.research.microsoft.com/research/dtg/horvitz/lum.htm) Madison, WI.
17. Langley, P. User Modeling in Adaptive Interfaces. in Proc. User Modeling 99. 1999. See http://www.cs.usask.ca/UM99/Proc/invited/Langley.pdf Banff, Canada.
18. Lau, T. and E. Horvitz. Patterns of Search: Analyzing and Modeling Web Query Refinement. in Proc. User Modeling 99. 1999. pp 119-128 Banff, Canada.
19. Linton, F., A. Charron and D. Joy, Owl: A Recommender System for Organization-Wide Learning. 1998, MITRE Corporation. http://www.mitre.org/technology/tech_tats/modeling/owl/Coaching_Software_Skills.pdf.
20. Martin, T.P. Incremental Learning of User Models - an Experimental Testbed. in Proc. IPMU 2000. 2000. pp 1419-1426 Madrid.
21. Martin, T.P. and B. Azvine. Learning User Models for an Intelligent Telephone Assistant. in Proc. IFSA-NAFIPS 2001. 2001. pp 669-674 Vancouver: IEEE Press.
22. Muggleton, S., Inductive Logic Programming. 1992: Academic Press. 565.


23. Nunes, P. and A. Kambil, Personalization? No Thanks. Harvard Business Review, 2001. 79(4): p. 32-34.
24. Pazzani, M. and D. Billsus, Learning and revising user profiles: The identification of interesting Web sites. Machine Learning, 1997. 27(3): p. 313-331.
25. Pohl, W., Logic-based representation and reasoning for user modeling shell systems. User Modeling and User-Adapted Interaction, 1999. 9(3): p. 217-282.
26. Quinlan, J.R., Induction of Decision Trees. Machine Learning, 1986. 1: p. 81-106.
27. Quinlan, J.R., C4.5: Programs for Machine Learning. 1993: Morgan Kaufmann.
28. Rossiter, J.M., T. Cao, T.P. Martin and J.F. Baldwin. A Fril++ Compiler for Soft Computing Object-Oriented Logic Programming. in Proc. IIZUKA2000, 6th International Conference on Soft Computing. 2000. pp 340-345 Japan.
29. Rossiter, J.M., T.H. Cao, T.P. Martin and J.F. Baldwin. User Recognition in Uncertain Object Oriented User Modelling. in Proc. Tenth IEEE International Conference On Fuzzy Systems (Fuzz-IEEE 2001). 2001. Melbourne.
30. Salton, G. and G. Buckley, Term-weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 1988. 24: p. 513-523.
31. W3C, Composite Capability/Preference Profiles (CC/PP): A user side framework for content negotiation. 1999.
32. W3C, Metadata and Resource Description. 1999.
33. Webb, G., Preface to Special Issue on Machine Learning for User Modeling. User Modeling and User-Adapted Interaction, 1998. 8(1): p. 1-3.
34. Yiming, Z. Fuzzy User Profiling for Broadcasters and Service Providers. in Proc. Computational Intelligence for User Modelling. 1999. http://www.fen.bris.ac.uk/engmaths/research/aigroup/martin/ci4umProc/Zhou.pdf Bristol, UK.


A Query-Driven Anytime Algorithm for Argumentative and Abductive Reasoning

Rolf Haenni

Computer Science Department, University of California, Los Angeles, CA [email protected], http://haenni.shorturl.com

Abstract. This paper presents a new approximation method for computing arguments or explanations in the context of logic-based argumentative or abductive reasoning. The algorithm can be interrupted at any time, returning the solution found so far. The quality of the approximation increases monotonically when more computational resources are available. The method is based on cost functions and returns lower and upper bounds.1

1 Introduction

The major drawback of most qualitative approaches to uncertainty management comes from the relatively high time and space consumption of their algorithms. To overcome this difficulty, appropriate approximation methods are needed. In the domain of argumentative and abductive reasoning, a technique called cost-bounded approximation has been developed for probabilistic argumentation systems [16,17]. Instead of computing intractably large sets of minimal arguments for a given query, the idea is that only the most relevant arguments not exceeding a certain cost bound are computed. This is extremely useful and has many applications in different fields [1,3]. In model-based diagnostics, for example, computing arguments corresponds to determining minimal conflict sets and minimal diagnoses. Very often, intractably many conflict sets and diagnoses exist. The method presented in [16] is an elegant solution for such difficult cases. However, the question of choosing appropriate cost bounds and the problem of judging the quality of the approximation remain.

The approach presented in this paper is based on the same idea of computing only the most relevant arguments. However, instead of choosing the cost bound first and then computing the corresponding arguments, the algorithm starts immediately by computing the most relevant arguments. It terminates as soon as no more computational resources (time or space) are available and returns the cost bound reached during its run. The result is a lower approximation that is sound but not complete. Furthermore, the algorithm returns an upper approximation that is complete but not sound. The difference between lower and upper approximation allows the user to judge the quality of the approximation.
1 Research supported by scholarship No. 8220-061232 of the Swiss National Science Foundation.



The algorithm is designed such that the cost bound (and therefore the quality of the approximation) increases monotonically when more resources are available. It can therefore be considered as an anytime algorithm that provides a result at any time. Note that this is very similar to the natural process of how people collect information from their environment. In legal cases, for example, resources (time, money, etc.) are limited, and the investigation therefore focusses on the search for the most relevant and most obvious evidence, such that the corresponding costs remain reasonable.
Another important property of the algorithm is the fact that the actual query is taken into account during its run. This ensures that those arguments which are of particular importance for the user’s actual query are returned first. Such a query-driven behavior corresponds to the natural way in which humans gather relevant information from different sources.

2 Probabilistic Argumentation Systems

The theory of probabilistic argumentation systems is based on the idea of combining classical logic with probability theory [17,15]. It is an alternative approach for non-monotonic reasoning under uncertainty. It allows open questions (hypotheses) about the unknown or future world to be judged in the light of the given knowledge. From a qualitative point of view, the problem is to find arguments in favor of and against the hypothesis of interest. An argument can be seen as a defeasible proof for the hypothesis. It can be defeated by counter-arguments. The strength of an argument is weighted by considering its probability. In this way, the credibility of a hypothesis can be measured by the total probability that it is supported or rejected by such arguments. The resulting degree of support and degree of possibility correspond to (normalized) belief and plausibility in Dempster-Shafer’s theory of evidence [24,27,19]. A quantitative judgement is sometimes more useful and can help to decide whether a hypothesis should be accepted, rejected, or whether the available knowledge does not permit a decision.

The technique of probabilistic argumentation systems generalizes de Kleer’s and Reiter’s original concept of assumption-based truth maintenance systems (ATMS) [6,7,8,23,18,10] by (1) removing the restriction to Horn clauses and (2) adding probabilities in a similar way to Provan [22] or Laskey and Lehner [2]. Approximation techniques for intractably large sets of arguments have been proposed for ATMS by Forbus and de Kleer [14,9], by Collins and de Coste [5], and by Bigham et al. [4].

For the construction of a probabilistic (propositional) argumentation system, consider two disjoint sets A = {a1, . . . , am} and P = {p1, . . . , pn} of propositions. The elements of A are called assumptions. LA∪P denotes the corresponding propositional language. If ξ is an arbitrary propositional sentence in LA∪P, then a triple (ξ, P, A) is called a (propositional) argumentation system. ξ is called the knowledge base and is often specified by a conjunctively interpreted set Σ = {ξ1, . . . , ξr} of sentences ξi ∈ LA∪P or, more specifically, clauses


ξi ∈ DA∪P, where DA∪P denotes the set of all (proper) clauses over A ∪ P. We use Props(ξ) ⊆ A ∪ P to denote all the propositions appearing in ξ.
The assumptions play an important role for expressing uncertain information. They are used to represent uncertain events, unknown circumstances and risks, or possible outcomes. Conjunctions of literals of assumptions are of particular interest. They represent possible scenarios or states of the unknown or future world. CA denotes the set of all such conjunctions. Furthermore, NA = {0, 1}|A| represents the set of all possible configurations relative to A. The elements s ∈ NA are called scenarios. The theory is based on the idea that one particular scenario s ∈ NA is the true scenario.
Consider now the case where a second propositional sentence h ∈ LA∪P called the hypothesis is given. Hypotheses represent open questions or uncertain statements about some of the propositions in A ∪ P. What can be inferred from ξ about the possible truth of h with respect to the given set of unknown assumptions? Possibly, if some of the assumptions are set to true and others to false, then h may be a logical consequence of ξ. In other words, h is supported by certain scenarios s ∈ NA or corresponding arguments α ∈ CA. Note that counter-arguments refuting h are arguments supporting ¬h.
More formally, let ξ←s be the formula obtained from ξ by instantiating all the assumptions according to their values in s. We can then decompose the set of scenarios NA into three disjoint sets

IA(ξ) = {s ∈ NA : ξ←s |= ⊥},     (1)
SPA(h, ξ) = {s ∈ NA : ξ←s |= h, ξ←s ⊭ ⊥},     (2)
RFA(h, ξ) = {s ∈ NA : ξ←s |= ¬h, ξ←s ⊭ ⊥} = SPA(¬h, ξ),     (3)

of inconsistent, supporting, and refuting scenarios, respectively. Furthermore, if NA(α) ⊆ NA denotes the set of models of a conjunction α ∈ CA, then we can define corresponding sets of supporting and refuting arguments of h relative to ξ by

SP(h, ξ) = {α ∈ CA : NA(α) ⊆ SPA(h, ξ)},     (4)
RF(h, ξ) = {α ∈ CA : NA(α) ⊆ RFA(h, ξ)},     (5)

respectively. Often, since SP(h, ξ) and RF(h, ξ) are upward-closed sets, only corresponding minimal arguments are considered.

So far, hypotheses are only judged qualitatively. A quantitative judgment of the situation becomes possible if every assumption ai ∈ A is linked to a corresponding prior probability p(ai) = πi. Let Π = {π1, . . . , πm} denote the set of all prior probabilities. We suppose that the assumptions are mutually independent. This defines a probability distribution p(s) over the set NA of scenarios2. Note that independent assumptions are common in many practical applications [1]. A quadruple (ξ, P, A, Π) is then called a probabilistic argumentation system [17].
2 In cases where no set of independent assumptions exists, the theory may also be defined on an arbitrary probability distribution over NA.


In order to judge h quantitatively, consider the conditional probability that the true scenario s is in SPA(h, ξ) but not in IA(ξ). In the light of this remark,

dsp(h, ξ) = p(s ∈ SPA(h, ξ) | s ∉ IA(ξ))     (6)

is called degree of support of h relative to ξ. It is a value between 0 and 1 that represents quantitatively the support that h is true in the light of the given knowledge. Clearly, dsp(h, ξ) = 1 means that h is certainly true, while dsp(h, ξ) = 0 means that h is not supported (but h may still be true). Note that degree of support is equivalent to the notion of (normalized) belief in the Dempster-Shafer theory of evidence [24,27]. It can also be interpreted as the probability of the provability of h [21,26].

A second way of judging the hypothesis h is to look at the corresponding conditional probability that the true scenario s is not in RFA(h, ξ). It represents the probability that ¬h cannot be inferred from the knowledge base. In such a case, h remains possible. Therefore, the conditional probability

dps(h, ξ) = p(s ∉ RFA(h, ξ) | s ∉ IA(ξ)) = 1 − dsp(¬h, ξ)     (7)

is called degree of possibility of h relative to ξ. It is a value between 0 and 1 that represents quantitatively the possibility that h is true in the light of the given knowledge. Clearly, dps(h, ξ) = 1 means that h is completely possible (there are no counter-arguments against h), while dps(h, ξ) = 0 means that h is false. Degree of possibility is equivalent to the notion of plausibility in the Dempster-Shafer theory. We have dsp(h, ξ) ≤ dps(h, ξ) for all h ∈ LA∪P and ξ ∈ LA∪P. Note that the particular case of dsp(h, ξ) = 0 and dps(h, ξ) = 1 represents total ignorance relative to h.
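The following sketch computes dsp and dps by brute-force enumeration of scenarios and interpretations for a tiny, made-up argumentation system (two assumptions, one ordinary proposition, a single-literal hypothesis); the clauses and prior probabilities are assumptions chosen for illustration, not an example from the paper.

from itertools import product

A = {"a1": 0.7, "a2": 0.4}                    # assumptions with prior probabilities
P = ["x"]                                     # ordinary propositions
KB = [{("a1", False), ("x", True)},           # ¬a1 ∨ x   (a1 -> x)
      {("a2", False), ("x", False)}]          # ¬a2 ∨ ¬x  (a2 -> ¬x)
h = [{("x", True)}]                           # hypothesis: x (a single literal here)

def satisfies(assign, clauses):
    return all(any(assign[v] == sign for v, sign in cl) for cl in clauses)

def p_models(scenario):
    # P-interpretations compatible with the knowledge base once the assumptions are fixed.
    out = []
    for values in product([True, False], repeat=len(P)):
        assign = dict(scenario, **dict(zip(P, values)))
        if satisfies(assign, KB):
            out.append(assign)
    return out

p_inc = p_sup = p_ref = 0.0
for values in product([True, False], repeat=len(A)):
    scenario = dict(zip(A, values))
    p_s = 1.0
    for a, v in scenario.items():
        p_s *= A[a] if v else 1.0 - A[a]
    models = p_models(scenario)
    if not models:                                    # ξ←s |= ⊥ : inconsistent scenario
        p_inc += p_s
    elif all(satisfies(m, h) for m in models):        # ξ←s |= h : supporting scenario
        p_sup += p_s
    elif all(not satisfies(m, h) for m in models):    # ξ←s |= ¬h : refuting scenario
        p_ref += p_s

dsp = p_sup / (1.0 - p_inc)                           # equation (6)
dps = (1.0 - p_ref - p_inc) / (1.0 - p_inc)           # equation (7)
print(round(dsp, 3), round(dps, 3))                   # -> 0.583 0.833

For these priors the supporting scenario a1∧¬a2 has probability 0.42, the inconsistent scenario a1∧a2 has 0.28, and the refuting scenario ¬a1∧a2 has 0.12, giving dsp ≈ 0.58 and dps ≈ 0.83.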

An important property of degree of support and degree of possibility is that they behave non-monotonically when new information is added. More precisely, if ξ′ represents a new piece of information, then nothing can be said about the new values dsp(h, ξ ∧ ξ′) and dps(h, ξ ∧ ξ′). Compared to dsp(h, ξ) and dps(h, ξ), the new values may either decrease or increase; both cases are possible. This reflects a natural property of how a human’s conviction or belief can change when new information is given. Non-monotonicity is therefore a fundamental property for any mathematical formalism for reasoning under uncertainty. Probabilistic argumentation systems show that non-monotonicity can be achieved in classical logic by adding probability theory in an appropriate way. This has already been noted by Mary McLeish in [20].

3 Computing Arguments

From a computational point of view, the main problem of dealing with probabilistic argumentation systems is to compute the set μQS(h, ξ) of minimal quasi-supporting arguments, with QS(h, ξ) = {α ∈ CA : α ∧ ξ |= h}. The term “quasi” expresses the fact that some quasi-supporting arguments of h may be in contradiction with the given knowledge. Knowing the sets μQS(h, ξ), μQS(¬h, ξ), and


μQS(⊥, ξ) then allows supporting and refuting arguments to be derived, as well as degree of support and degree of possibility [17]. We use

QSA(h, ξ) = NA(QS(h, ξ)) = {s ∈ NA : ξ←s |= h}     (8)

to denote corresponding sets of quasi-supporting scenarios.
Suppose that the knowledge base ξ ∈ LA∪P is given as a set of clauses Σ = {ξ1, . . . , ξr} with ξi ∈ DA∪P and ξ = ξ1 ∧ · · · ∧ ξr. Furthermore, let H ⊆ DA∪P be another set of clauses such that ∧H ≡ ¬h. ΣH = μ(Σ ∪ H) then denotes the corresponding minimal clause representation of ξ ∧ ¬h obtained from Σ ∪ H by dropping subsumed clauses.

3.1 Exact Computation

The problem of computing minimal quasi-supporting arguments is closely related to the problem of computing prime implicants. Quasi-supporting arguments for h are conjunctions α ∈ CA for which α ∧ ξ |= h holds. This condition can be rewritten as ξ ∧ ¬h |= ¬α or ΣH |= ¬α, respectively. Quasi-supporting arguments are therefore negations of implicates of ΣH which are in DA. In other words, if δ ∈ DA is an implicate of ΣH, then ¬δ is a quasi-supporting argument for h. We use PI(ΣH) to denote the set of all prime implicates of ΣH. If ¬Ψ is the set of conjunctions obtained from a set of clauses Ψ by negating the corresponding clauses, then

µQS(h, ξ) = ¬(PI(ΣH) ∩ DA). (9)

Since computing prime implicates is known to be NP-complete in general, the above approach is only feasible when ΣH is relatively small. However, when A is small enough, many prime implicates of ΣH are not in DA and are therefore irrelevant for the minimal quasi-support. Such irrelevant prime implicates can be avoided by the method described in [17]. The procedure is based on two operations

ConsQ(Σ) = Consx1 · · · Consxq(Σ),     (10)
ElimQ(Σ) = Elimx1 · · · Elimxq(Σ),     (11)

where Σ is an arbitrary set of clauses and Q = {x1, . . . , xq} a subset of the propositions appearing in Σ. Both operations repeatedly apply more specific operations Consx(Σ) and Elimx(Σ), respectively, where x denotes a proposition in Q.

Let Σx+ denote the clauses of Σ containing x as a positive literal, Σx− the clauses containing x as a negative literal, and Σx0 the clauses not containing x. Furthermore, if

ρ(Σx+, Σx−) = {ϑ1 ∨ ϑ2 : x ∨ ϑ1 ∈ Σx+, ¬x ∨ ϑ2 ∈ Σx−}     (12)

denotes the set of all resolvents of Σ relative to x, then Consx(Σ) = μ(Σ ∪ ρ(Σx+, Σx−)) and Elimx(Σ) = μ(Σx0 ∪ ρ(Σx+, Σx−)).


Thus, ConsQ(Σ) computes all the resolvents (consequences) of Σ relative to the propositions in Q and adds them to Σ. Note that if Q contains all the propositions in Σ, then ConsQ(Σ) = PI(Σ). In contrast, ElimQ(Σ) eliminates all the propositions in Q from Σ and returns a new set of clauses whose set of models corresponds to the projection of the original set of models. Note that from a theoretical point of view, the order in which the propositions are eliminated is irrelevant [17], whereas from a practical point of view, it critically influences the efficiency of the procedure. Note that the elimination process is a particular case of Shenoy’s fusion algorithm [25] as well as of Dechter’s bucket elimination procedure [13].
The set of minimal quasi-supporting arguments can then be computed in two different ways by

µQS(h, ξ) = ¬ConsA(ElimP (ΣH)) = ¬ElimP (ConsA(ΣH)). (13)

Note that in many practical applications, computing the consequences relative to the propositions in A is trivial. In contrast, the elimination of the propositions in P is usually much more difficult and becomes even infeasible as soon as ΣH has a certain size.
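For readers who prefer running code, here is a small sketch of the resolution, subsumption and elimination operations above, using frozensets of signed integers as clauses (+i for a proposition, -i for its negation). It is our own simplified rendering, not the implementation used in [17].

def rho(sigma_pos, sigma_neg, x):
    # All resolvents on x between clauses containing x and clauses containing ¬x (cf. eq. (12)).
    out = set()
    for c1 in sigma_pos:
        for c2 in sigma_neg:
            r = (c1 - {x}) | (c2 - {-x})
            if not any(-l in r for l in r):          # drop tautologies
                out.add(frozenset(r))
    return out

def mu(clauses):
    # Keep only subsumption-minimal clauses (c subsumes d if c ⊆ d).
    return {c for c in clauses if not any(d != c and d <= c for d in clauses)}

def cons_x(sigma, x):
    pos = {c for c in sigma if x in c}
    neg = {c for c in sigma if -x in c}
    return mu(sigma | rho(pos, neg, x))              # Cons_x(Σ)

def elim_x(sigma, x):
    pos = {c for c in sigma if x in c}
    neg = {c for c in sigma if -x in c}
    rest = {c for c in sigma if x not in c and -x not in c}
    return mu(rest | rho(pos, neg, x))               # Elim_x(Σ)

def elim(sigma, props):                              # Elim_Q(Σ), applied in the given order
    for x in props:
        sigma = elim_x(sigma, x)
    return sigma

Eliminating all propositions of P from ConsA(ΣH) in this way and negating the surviving A-clauses corresponds to equation (13), although without the cost bounds introduced in the next subsection the clause sets can grow very quickly.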

3.2 Cost-Bounded Approximation

A possible approximation technique is based on cost functions c : CA → IR+. Conjunctions α with low costs c(α) are preferred and therefore more relevant. We require that α ⊆ α′ implies c(α) ≤ c(α′). This condition is called monotonicity criterion. Examples of common cost functions are:

– the length of the conjunction (number of literals): c(α) = |α|,
– the probability of the negated conjunction: c(α) = 1 − p(s ∈ NA(α)).

The idea of using the length of the conjunction as cost function is that short conjunctions are usually more weighty arguments. Clearly, if α is a conjunction in CA, then an additional literal ℓ is a supplementary condition to be satisfied, and α ∧ ℓ is therefore less probable than α. From this point of view, the length of a conjunction expresses somehow its probability. However, if probabilities are assigned to the assumptions, then it is possible to specify the probability of a conjunction more precisely. That is the idea behind the second suggestion.
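Both cost functions are easy to state in code; the sketch below assumes a conjunction is represented as a collection of signed assumption indices and that the priors πi are given in a dictionary (these representational choices are ours).

from math import prod

def cost_length(alpha):
    # c(α) = |α|, the number of literals in the conjunction
    return len(alpha)

def cost_improbability(alpha, prior):
    # c(α) = 1 − p(s ∈ NA(α)), assuming independent assumptions:
    # a positive literal +i contributes πi, a negative literal -i contributes 1 − πi
    p = prod(prior[abs(l)] if l > 0 else 1.0 - prior[abs(l)] for l in alpha)
    return 1.0 - p

Both satisfy the monotonicity criterion above: adding a literal never decreases the cost.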

Let β ∈ IR+ be a fixed bound for a monotone cost function c(α). A conjunction α ∈ CA is called β-relevant if and only if c(α) < β. Otherwise, α is called β-irrelevant. The set of all β-relevant conjunctions for a fixed cost bound β is denoted by

CβA = {α ∈ CA : c(α) < β}.     (14)

Note that CβA is a downward-closed set. This means that α ∈ CβA implies that every (shorter) conjunction α′ ⊆ α is also in CβA. Evidently, C0A = ∅ and C∞A = CA. An


approximated set of minimal quasi-supporting arguments can then be defined by

µQS(h, ξ, β) = µQS(h, ξ) ∩ CβA . (15)

The corresponding set of scenarios is denoted by QSA(h, ξ, β). Note that μQS(h, ξ, β) is sound but not complete since μQS(h, ξ, β) ⊆ μQS(h, ξ). It can therefore be seen as a lower approximation of the exact set μQS(h, ξ).

In order to compute μQS(h, ξ, β) efficiently, corresponding downward-closed sets are defined over the set of clauses DA∪P. Obviously, every clause ξ ∈ DA∪P can be split into sub-clauses ξA and ξP by

ξ = (ℓ1 ∨ · · · ∨ ℓk) ∨ (ℓk+1 ∨ · · · ∨ ℓm) = ξA ∨ ξP,     (16)

with ℓ1, . . . , ℓk ∈ A± and ℓk+1, . . . , ℓm ∈ P±, where A± and P± are the sets of literals of A and P, respectively. Such a clause can also be written as an implication ¬ξA → ξP where Arg(ξ) = ¬ξA is a conjunction in CA. The set of clauses ξ for which the corresponding conjunction Arg(ξ) is in CβA can then be defined by

DβA∪P = {ξ ∈ DA∪P : Arg(ξ) ∈ CβA}.     (17)

A new elimination procedure called β-elimination can then be defined by

ElimβQ(Σ) = Elimβx1 · · · Elimβxq(Σ),     (18)

where the clauses not belonging to DβA∪P are dropped at each step of the process by Elimβx(Σ) = Elimx(Σ) ∩ DβA∪P. In this way, the approximated set of quasi-supporting arguments can be computed by

μQS(h, ξ, β) = ¬ElimβP(ConsA(ΣH)) = ¬ElimβP(ConsA(ΣH) ∩ DβA∪P).     (19)

See [17] for a more detailed discussion and the proof of the above formula. Two major problems remain. First, it is difficult to choose a suitable cost bound β in advance (if β is too low, then the result may be unsatisfactory; if β is too high, then the procedure risks getting stuck). Second, there is no means to judge the quality of the approximation.

4 Anytime Algorithm

The algorithm introduced below helps to overcome the difficulties mentioned at the end of the previous section. Instead of first choosing the cost bound and then computing the corresponding arguments, the algorithm starts immediately by computing the most relevant arguments and terminates as soon as no more computational resources (usually time) are available. Finally, it returns two minimal sets LB and UB of (potential) minimal arguments with


NA(LB) ⊆ QSA(h, ξ) ⊆ NA(UB) and a cost bound β with LB ⊇ μQS(h, ξ, β). Obviously, the set LB is considered as a lower bound (sound but not complete) while UB is considered as an upper bound (complete but not sound). From this it follows immediately that

p(s∈NA(LB)) ≤ p(s∈QSA(h, ξ)) ≤ p(s∈NA(UB)). (20)

Furthermore, if LB′, UB′, and β′ are the corresponding results for the same input parameters (knowledge base, hypothesis, cost function) but with more computational resources, then NA(LB) ⊆ NA(LB′), NA(UB) ⊇ NA(UB′), and β ≤ β′. The quality of the approximation thus increases monotonically during the execution of the algorithm. If the algorithm terminates before all computational resources are used or if unlimited computational resources are available, then the algorithm returns the exact result μQS(h, ξ) = LB = UB and β = ∞.

The idea for the algorithm comes from viewing the procedure described in the previous section from the perspective of Dechter’s bucket elimination procedure [13]. From this point of view, the clauses contained in ConsA(ΣH) are initially distributed among an ordered set of buckets. There is exactly one bucket for every proposition in P. If a clause contains several propositions from P, then the first appropriate bucket of the sequence is selected. In a second step, the elimination procedure takes place among the sequence of buckets.

The idea now is similar. However, instead of processing the whole set of clauses at once, the clauses are now iteratively introduced one after another, starting with those having the lowest cost. At each step of the process, possible resolvents are computed and added to the list of remaining clauses. Subsumed clauses are dropped. As soon as a clause containing only propositions from A is detected, it is considered as a possible result. For a given sequence of buckets, this produces exactly the same set of resolvents as in the usual bucket elimination procedure, but in a different order. It guarantees that the most relevant arguments are produced first. The algorithm works with different sets of clauses:

– Σ ⇒ the remaining set of clauses, initialized to ConsA(ΣH) \ DA,
– Σ0 ⇒ the results, initialized to ConsA(ΣH) ∩ DA,
– Σ1, . . . , Σn ⇒ the corresponding sequence of buckets for all propositions in P = {p1, . . . , pn}, all initialized to ∅.
The details of the whole procedure are described below. The process terminates as soon as Σ = ∅ or when no more resources are available.

[01] Function Quasi-Support(P, A, Σ, h, c);
[02] Begin
[03]   Select H ⊆ DA∪P such that ∧H ≡ ¬h;
[04]   ΣH ← μ(Σ ∪ H);
[05]   Σ ← ConsA(ΣH) \ DA;
[06]   Σ0 ← ConsA(ΣH) ∩ DA;
[07]   Σi ← ∅ for all i = 1, . . . , n;
[08]   Loop While Σ ≠ ∅ And Resources() > 0 Do


[09]   Begin
[10]     Select ξ ∈ Σ such that c(Arg(ξ)) = min{c(Arg(ξ′)) : ξ′ ∈ Σ};
[11]     Σ ← Σ \ {ξ};
[12]     k ← min{i ∈ {1, . . . , n} : pi ∈ Props(ξ)};
[13]     If pk ∈ ξ
[14]     Then R ← ρ({ξ}, {ξ′ ∈ Σk : ¬pk ∈ ξ′});
[15]     Else R ← ρ({ξ′ ∈ Σk : pk ∈ ξ′}, {ξ});
[16]     Σ ← Σ ∪ (R \ DA);
[17]     Σ0 ← Σ0 ∪ (R ∩ DA);
[18]     Σk ← Σk ∪ {ξ};
[19]     S ← μ(Σ ∪ Σ0 ∪ Σ1 ∪ · · · ∪ Σn);
[20]     Σ ← Σ ∩ S;
[21]     Σi ← Σi ∩ S for all i = 1, . . . , n;
[22]   End;
[23]   LB ← {Arg(ξ) : ξ ∈ Σ0};
[24]   UB ← μ{Arg(ξ) : ξ ∈ Σ0 ∪ Σ};
[25]   β ← min({c(Arg(ξ)) : ξ ∈ Σ} ∪ {∞});
[26]   Return (LB, UB, β);
[27] End.

At each step of the iteration, the following operations take place:

– line [10]: select a clause ξ from Σ such that the corresponding cost c(Arg(ξ)) is minimal;
– line [11]: remove ξ from Σ;
– line [12]: select the first proposition pk ∈ P with pk ∈ Props(ξ) and
  • lines [13]–[15]: compute all resolvents of ξ with Σk,
  • lines [16] and [17]: add the resolvents either to Σ or Σ0,
  • line [18]: add ξ to Σk,
  • lines [19]–[21]: remove subsumed clauses from Σ, Σ0, Σ1, . . . , Σn.

Finally, LB and β are obtained from Σ0. Furthermore, UB can be derived from Σ0 and Σ. Note that the procedure is a true anytime algorithm, giving progressively better solutions as time goes on and also giving a response however little time has elapsed [12,11]. In fact, it satisfies most of the basic requirements of anytime algorithms [28]:

– measurable quality: the precision of the approximate result is known,
– monotonicity: the precision of the result is growing in time,
– consistency: the quality of the result is correlated with the computation time and the quality of the inputs,
– diminishing returns: the improvement of the solution is larger at the early stages of computation and it diminishes over time,
– interruptibility: the process can be interrupted at any time and provides some answer,


– preemptability: the process can be suspended and continued with minimal overhead.
The proofs of correctness will appear in one of the author’s forthcoming technical reports.
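The sketch below is a compact, runnable rendering of the pseudocode above in Python, under several simplifying assumptions of our own: clauses are frozensets of signed integers, the input clause set is taken to already play the role of ConsA(ΣH), and "resources" are modelled by a simple iteration budget. It is meant to clarify the bookkeeping, not to reproduce the authors' implementation.

from math import inf

def resolve(c1, c2, p):
    # Resolvent of c1 (containing p) and c2 (containing -p); None if tautological.
    r = (c1 - {p}) | (c2 - {-p})
    return None if any(-l in r for l in r) else frozenset(r)

def mu(clauses):
    # Drop subsumed clauses (c subsumes d if c ⊆ d).
    return {c for c in clauses if not any(d != c and d <= c for d in clauses)}

def quasi_support(clauses, A, P_order, cost, max_steps):
    def arg(c): return frozenset(l for l in c if abs(l) in A)     # assumption part of a clause
    def only_A(c): return all(abs(l) in A for l in c)
    sigma  = {c for c in clauses if not only_A(c)}                # line [05]
    sigma0 = {c for c in clauses if only_A(c)}                    # line [06]
    buckets = {p: set() for p in P_order}                         # line [07]
    steps = 0
    while sigma and steps < max_steps:                            # line [08]
        steps += 1
        xi = min(sigma, key=lambda c: cost(arg(c)))               # line [10]
        sigma.remove(xi)                                          # line [11]
        pk = next(p for p in P_order if p in xi or -p in xi)      # line [12]
        if pk in xi:                                              # lines [13]-[15]
            R = {resolve(xi, c, pk) for c in buckets[pk] if -pk in c}
        else:
            R = {resolve(c, xi, pk) for c in buckets[pk] if pk in c}
        R.discard(None)
        sigma  |= {r for r in R if not only_A(r)}                 # line [16]
        sigma0 |= {r for r in R if only_A(r)}                     # line [17]
        buckets[pk].add(xi)                                       # line [18]
        keep = mu(sigma | sigma0 | set().union(*buckets.values()))   # line [19]
        sigma &= keep                                             # line [20]
        for p in P_order:                                         # line [21]
            buckets[p] &= keep
    LB = {frozenset(-l for l in c) for c in sigma0}               # line [23]: arguments found so far
    UB = mu({frozenset(-l for l in arg(c)) for c in sigma0 | sigma})   # line [24]
    beta = min((cost(arg(c)) for c in sigma), default=inf)        # line [25]
    return LB, UB, beta

Called with cost=len and an ample step budget, the run completes with β = ∞ and UB equal to the exact set of minimal quasi-supporting arguments; with a small budget it returns the partial lower and upper bounds described above (tie-breaking among equal-cost clauses may differ from the authors' runs).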

4.1 Example

In order to illustrate the algorithm introduced in the previous subsection, consider a communication network with 6 participants (A, B, C, D, E, F) and 9 connections. The question is whether A is able to communicate with F or not. This is expressed by h = A → F. It is assumed that connection 1 (between A and B) works properly with probability 0.1, connection 2 with probability 0.2, etc.

Fig. 1. A simple Communication Network.

The original knowledge base Σ consists of 9 clauses ξ3, ξ5, ξ6, ξ8, ξ10, ξ11, ξ13, ξ17 and ξ18. Furthermore, H = {A, ¬F} = {ξ1, ξ2} and ΣH = μ(Σ ∪ H) = (Σ ∪ H) \ {ξ11}, since ξ11 is subsumed by ξ1. The following table shows all the clauses produced during the process (ordered according to their probabilities).

◦ ξ1    A                                  1.0
◦ ξ2    ¬F                                 1.0
• ξ3    ¬E ∨ F ∨ ¬a9                       0.9
  ξ4    ¬E ∨ ¬a9                           0.9
• ξ5    ¬D ∨ E ∨ ¬a8                       0.8
• ξ6    ¬C ∨ F ∨ ¬a7                       0.7
  ξ7    ¬C ∨ ¬a7                           0.7
• ξ8    ¬B ∨ F ∨ ¬a6                       0.6
  ξ9    ¬B ∨ ¬a6                           0.6
• ξ10   ¬E ∨ B ∨ ¬a5                       0.5
• ξ11×  ¬E ∨ A ∨ ¬a4                       0.4
  ξ12   ¬E ∨ ¬a5 ∨ ¬a6                     0.3
• ξ13   ¬A ∨ D ∨ ¬a3                       0.3
  ξ14   D ∨ ¬a3                            0.3
  ξ15   E ∨ ¬a3 ∨ ¬a8                      0.24
∗ ξ16   ¬a3 ∨ ¬a8 ∨ ¬a9                    0.216
• ξ17   ¬B ∨ C ∨ ¬a2                       0.2
• ξ18   ¬A ∨ B ∨ ¬a1                       0.1
  ξ19   B ∨ ¬a1                            0.1
  ξ20   ¬E ∨ C ∨ ¬a2 ∨ ¬a5                 0.1
∗ ξ21   ¬a3 ∨ ¬a5 ∨ ¬a6 ∨ ¬a8              0.072
  ξ22   ¬E ∨ ¬a2 ∨ ¬a5 ∨ ¬a7               0.07
∗ ξ23   ¬a1 ∨ ¬a6                          0.06
  ξ24   C ∨ ¬a1 ∨ ¬a2                      0.02
∗ ξ25   ¬a2 ∨ ¬a3 ∨ ¬a5 ∨ ¬a7 ∨ ¬a8        0.017
∗ ξ26   ¬a1 ∨ ¬a2 ∨ ¬a7                    0.014


The initial clauses of Σ are marked by •, and the clauses of H by ◦. The minimal quasi-supporting arguments for h = A → F are finally obtained from the clauses ξ16, ξ21, ξ23, ξ25, and ξ26 (marked by ∗):

μQS(h, Σ) = {¬ξ16, ¬ξ21, ¬ξ23, ¬ξ25, ¬ξ26}
          = {a3∧a8∧a9, a3∧a5∧a6∧a8, a1∧a6, a2∧a3∧a5∧a7∧a8, a1∧a2∧a7}.

Note that every minimal quasi-supporting argument in μQS(h, Σ) corresponds to a minimal path from node A to node F in the communication network. If we take A, F, B, D, C, E as elimination sequence (order of the buckets), then the complete run of the algorithm can be described as in the table shown below (where c(α) = 1 − p(s ∈ NA(α)) serves as cost function).

Step | selected ξ | Σ (remaining)                           | pi | Σ1(A) | Σ2(F) | Σ3(B) | Σ4(D) | Σ5(C) | Σ6(E) | R        | Σ0       | 1−β
1    | ξ1         | ξ2, ξ3, ξ5, ξ6, ξ8, ξ10, ξ13, ξ17, ξ18  | A  | ξ1    |       |       |       |       |       |          |          | 1.00
2    | ξ2         | ξ3, ξ5, ξ6, ξ8, ξ10, ξ13, ξ17, ξ18      | F  |       | ξ2    |       |       |       |       |          |          | 0.90
3    | ξ3         | ξ5, ξ6, ξ8, ξ10, ξ13, ξ17, ξ18          | F  |       | ξ3×   |       |       |       |       | ξ4       |          | 0.90
4    | ξ4         | ξ5, ξ6, ξ8, ξ10, ξ13, ξ17, ξ18          | E  |       |       |       |       |       | ξ4    |          |          | 0.80
5    | ξ5         | ξ6, ξ8, ξ10, ξ13, ξ17, ξ18              | D  |       |       |       | ξ5    |       |       |          |          | 0.70
6    | ξ6         | ξ8, ξ10, ξ13, ξ17, ξ18                  | F  |       | ξ6×   |       |       |       |       | ξ7       |          | 0.70
7    | ξ7         | ξ8, ξ10, ξ13, ξ17, ξ18                  | C  |       |       |       |       | ξ7    |       |          |          | 0.60
8    | ξ8         | ξ10, ξ13, ξ17, ξ18                      | F  |       | ξ8×   |       |       |       |       | ξ9       |          | 0.60
9    | ξ9         | ξ10, ξ13, ξ17, ξ18                      | B  |       |       | ξ9    |       |       |       |          |          | 0.50
10   | ξ10        | ξ13, ξ17, ξ18                           | B  |       |       | ξ10   |       |       |       | ξ12      |          | 0.30
11   | ξ12        | ξ13, ξ17, ξ18                           | E  |       |       |       |       |       | ξ12   |          |          | 0.30
12   | ξ13        | ξ17, ξ18                                | A  | ξ13×  |       |       |       |       |       | ξ14      |          | 0.30
13   | ξ14        | ξ17, ξ18                                | D  |       |       |       | ξ14   |       |       | ξ15      |          | 0.24
14   | ξ15        | ξ17, ξ18                                | E  |       |       |       |       |       | ξ15   | ξ16, ξ21 | ξ16, ξ21 | 0.20
15   | ξ17        | ξ18                                     | B  |       |       | ξ17   |       |       |       | ξ20      |          | 0.10
16   | ξ18        | ξ20                                     | A  | ξ18×  |       |       |       |       |       | ξ19      |          | 0.10
17   | ξ19        | ξ20                                     | B  |       |       | ξ19   |       |       |       | ξ23, ξ24 | ξ23      | 0.10
18   | ξ20        | ξ24                                     | C  |       |       |       |       | ξ20   |       | ξ22      |          | 0.07
19   | ξ22        | ξ24                                     | E  |       |       |       |       |       | ξ22   | ξ25      | ξ25      | 0.02
20   | ξ24        |                                         | C  |       |       |       |       | ξ24   |       | ξ26      | ξ26      | −∞

Every row describes a single step. The 2nd column shows the most probable clause in Σ (the one with the lowest cost) that is selected for the next step. The 3rd column contains the remaining clauses in Σ. The 4th column indicates the first proposition in the given sequence that appears in the selected clause. This determines the corresponding bucket Σi into which the selected clause is added (columns 5 to 10). Cross-marked clauses are subsumed by others and can


be dropped. Then, column 11 shows the resolvents produced at the actual step. Resolvents containing only propositions from A are added to column 12. Finally, the last column indicates the current cost bound (shown as 1 − β).

At step 1, ξ1 = A is selected and added to Σ1. There are no resolvents and no subsumed clauses. Σ1 then contains the single clause ξ1 while Σ0 is still empty. At step 3, for example, ξ3 = ¬E ∨ F ∨ ¬a9 is selected and added to Σ2, which already contains ξ2 = ¬F from step 2. A new resolvent ξ4 = ¬E ∨ ¬a9 can then be derived from ξ2 and ξ3. Since ξ3 is subsumed by ξ4, it can be dropped. The new clause ξ4 is then added to Σ. Σ0 is still empty. Later, for example at step 14, ξ15 = E ∨ ¬a3 ∨ ¬a8 is selected and added to Σ6, which contains the two clauses ξ4 = ¬E ∨ ¬a9 and ξ12 = ¬E ∨ ¬a5 ∨ ¬a6 from previous steps. Two new resolvents ξ16 = ¬a3 ∨ ¬a8 ∨ ¬a9 and ξ21 = ¬a3 ∨ ¬a5 ∨ ¬a6 ∨ ¬a8 are produced and added to Σ0. These are the first two results. After step 20, Σ is empty and the algorithm terminates.

Observe that the clauses representing the query H = {A, ¬F} = {ξ1, ξ2} are processed first. This considerably influences the rest of the run of the algorithm and guarantees that those arguments which are of particular importance for the user's actual query are returned first. Such query-driven behavior is an important property of the algorithm. It corresponds to the natural way in which humans gather relevant information from a knowledge base.
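
To make the run above easier to follow, here is a minimal Python sketch of this cost-ordered, bucket-based resolution loop. It is our reconstruction from the description in this section, not Haenni's implementation: clauses are frozensets of string literals, the cost assigned to a resolvent is a placeholder, and subsumption checks are omitted.

def neg(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def first_prop(clause, elim_order):
    # first proposition of the elimination sequence occurring in the clause;
    # None means the clause mentions assumptions only
    for p in elim_order:
        if p in clause or neg(p) in clause:
            return p
    return None

def anytime_resolution(clauses, cost, elim_order):
    # clauses: iterable of frozensets of literals ("A", "~A", "a1", "~a1", ...)
    # cost: dict mapping each clause to its cost c(.)
    sigma = set(clauses)
    buckets = {p: [] for p in elim_order}
    while sigma:
        xi = min(sigma, key=lambda c: cost[c])      # most probable = lowest cost
        sigma.remove(xi)
        p = first_prop(xi, elim_order)
        if p is None:                               # clause over assumptions only:
            yield xi                                # anytime result (one more Sigma_0 clause)
            continue
        if xi in buckets[p]:
            continue                                # already processed
        for other in buckets[p]:
            # resolve xi and other on p if they contain complementary literals
            if p in xi and neg(p) in other:
                res = frozenset((xi - {p}) | (other - {neg(p)}))
            elif neg(p) in xi and p in other:
                res = frozenset((xi - {neg(p)}) | (other - {p}))
            else:
                continue
            cost.setdefault(res, max(cost[xi], cost[other]))   # placeholder cost estimate
            sigma.add(res)
        buckets[p].append(xi)

Iterating over the generator yields the clauses over assumptions only (whose negations are the quasi-supporting arguments) roughly in the cost-driven order of the table above.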

4.2 Experimental Results

This section discusses the results of testing the algorithm on a problem of more realistic size. The discussion focuses on lower and upper bounds and compares them to the exact results. The knowledge base consists of 26 propositions, 39 assumptions and 74 initial clauses. It describes a communication network like the one used in the previous subsection. The exact solution for a certain query consists of 1,008 minimal arguments (shortest paths). For a given elimination sequence, the complete procedure generates 211,828 resolvents. The corresponding degree of support is 0.284. Figure 2 shows how the approximated solution monotonically approaches the exact value during the process.

Fig. 2. Left, the complete run of the algorithm for Example 1; right, the first 2,000 resolvents.

Page 137: Soft-Ware 2002: Computing in an Imperfect World: First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8–10, 2002 Proceedings

126 R. Haenni

Observe that after generating approximately 1,000 resolvents, the algorithm has found the first 8 arguments and returns a numerical lower bound that agrees with the exact solution in the first two digits after the decimal point. The first 8 arguments are found in less than 1 second (instead of approximately 15 minutes for the 211,828 resolutions of the complete solution). The upper bound converges a little more slowly. This is a typical behavior that has been observed for many other examples from different domains [1].

5 Conclusion

This paper introduces a new algorithm for approximated assumption-based reasoning. Its advantages over other existing approximation methods are twofold: (1) it is an anytime algorithm that monotonically increases the quality of the result as more computational resources become available; (2) the algorithm produces not only a lower but also an upper approximation without significant additional computational cost. These two improvements are extremely useful and can be considered an important step towards the practical applicability of logic-based abduction and argumentation in general, and probabilistic argumentation systems in particular.

References

1. B. Anrig, R. Bissig, R. Haenni, J. Kohlas, and N. Lehmann. Probabilistic argumentation systems: Introduction to assumption-based modeling with ABEL. Technical Report 99-1, Institute of Informatics, University of Fribourg, 1999.

2. K. B. Laskey and P. E. Lehner. Assumptions, beliefs and probabilities. Artificial Intelligence, 41(1):65–77, 1989.

3. D. Berzati, R. Haenni, and J. Kohlas. Probabilistic argumentation systems and abduction. In C. Baral and M. Truszczynski, editors, Proceedings of the 8th International Workshop on Non-Monotonic Reasoning, Breckenridge, Colorado, 2000.

4. J. Bigham, Z. Luo, and D. Banerjee. A cost bounded possibilistic ATMS. In Christine Froidevaux and Jurg Kohlas, editors, Proceedings of the ECSQARU Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty, volume 946, pages 52–59, Berlin, 1995. Springer Verlag.

5. J. W. Collins and D. de Coste. CATMS: An ATMS which avoids label explosions. In Kathleen McKeown and Thomas Dean, editors, Proceedings of the 9th National Conference on Artificial Intelligence, pages 281–287. MIT Press, 1991.

6. J. de Kleer. An assumption-based TMS. Artificial Intelligence, 28:127–162, 1986.

7. J. de Kleer. Extending the ATMS. Artificial Intelligence, 28:163–196, 1986.

8. J. de Kleer. Problem solving with the ATMS. Artificial Intelligence, 28:197–224, 1986.

9. J. de Kleer. Focusing on probable diagnoses. In Thomas L. Dean and Kathleen McKeown, editors, Proceedings of the 9th National Conference on Artificial Intelligence, pages 842–848. MIT Press, 1991.

10. J. de Kleer. A perspective on assumption-based truth maintenance. Artificial Intelligence, 59(1–2):63–67, 1993.


11. T. Dean and M. Boddy. An analysis of time-dependent planning. In Proceedings of the Seventh National Conference on Artificial Intelligence (AAAI-88), pages 49–54. MIT Press, 1988.

12. T. L. Dean. Intractability and time-dependent planning. In M. P. Georgeff and A. L. Lansky, editors, Proceedings of the 1986 Workshop on Reasoning about Actions and Plans. Morgan Kaufmann Publishers, 1987.

13. R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113(1–2):41–85, 1999.

14. K. D. Forbus and J. de Kleer. Focusing the ATMS. In Tom M. Mitchell and Reid G. Smith, editors, Proceedings of the 7th National Conference on Artificial Intelligence, pages 193–198, St. Paul, MN, 1988. Morgan Kaufmann.

15. R. Haenni. Modeling uncertainty with propositional assumption-based systems. In S. Parsons and A. Hunter, editors, Applications of Uncertainty Formalisms, Lecture Notes in Artificial Intelligence 1455, pages 446–470. Springer-Verlag, 1998.

16. R. Haenni. Cost-bounded argumentation. International Journal of Approximate Reasoning, 26(2):101–127, 2001.

17. R. Haenni, J. Kohlas, and N. Lehmann. Probabilistic argumentation systems. In J. Kohlas and S. Moral, editors, Handbook of Defeasible Reasoning and Uncertainty Management Systems, Volume 5: Algorithms for Uncertainty and Defeasible Reasoning. Kluwer, Dordrecht, 2000.

18. K. Inoue. An abductive procedure for the CMS/ATMS. In J. P. Martins and M. Reinfrank, editors, Truth Maintenance Systems, Lecture Notes in A.I., pages 34–53. Springer, 1991.

19. J. Kohlas and P. A. Monney. A Mathematical Theory of Hints. An Approach to the Dempster-Shafer Theory of Evidence, volume 425 of Lecture Notes in Economics and Mathematical Systems. Springer, 1995.

20. M. McLeish. Nilsson's probabilistic entailment extended to Dempster-Shafer theory. In L. N. Kanal, T. S. Levitt, and J. F. Lemmer, editors, Uncertainty in Artificial Intelligence, volume 8, pages 23–34. North-Holland, Amsterdam, 1987.

21. J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

22. G. M. Provan. A logic-based analysis of Dempster-Shafer theory. International Journal of Approximate Reasoning, 4:451–495, 1990.

23. R. Reiter and J. de Kleer. Foundations of assumption-based truth maintenance systems: Preliminary report. In Kenneth Forbus and Howard Shrobe, editors, Proceedings of the Sixth National Conference on Artificial Intelligence, pages 183–188. American Association for Artificial Intelligence, AAAI Press, 1987.

24. G. Shafer. The Mathematical Theory of Evidence. Princeton University Press, 1976.

25. P. P. Shenoy. Binary join trees. In Eric Horvitz and Finn Jensen, editors, Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (UAI-96), pages 492–499, San Francisco, 1996. Morgan Kaufmann Publishers.

26. Ph. Smets. Probability of provability and belief functions. In M. Clarke, R. Kruse, and S. Moral, editors, Proceedings of the ECSQARU'93 conference, pages 332–340. Springer-Verlag, 1993.

27. Ph. Smets and R. Kennes. The transferable belief model. Artificial Intelligence, 66:191–234, 1994.

28. S. Zilberstein. Using anytime algorithms in intelligent systems. AI Magazine, Fall:71–83, 1996.


Proof Length as an Uncertainty Factor in ILP

Gilles Richard and Fatima Zohra Kettaf

IRIT - UMR CNRS 5505, 118 Rte de Narbonne, 31062 Toulouse cedex 4

kettaf, [email protected]

Abstract. A popular idea is that the longer the proof, the riskier the truth prediction. In other words, the uncertainty degree over a conclusion is an increasing function of the length of its proof. In this paper, we analyze this idea in the context of Inductive Logic Programming. Some simple probabilistic arguments lead to the conclusion that we need to reduce the length of the clause bodies to reduce the uncertainty degree (or to increase accuracy). Inspired by the boosting technique, we propose a way to implement the proof reduction by introducing weights in a well-known ILP system. Our preliminary experiments confirm our predictions.

1 Introduction

It is a familiar idea that, when a proof is long, it is difficult to verify and the number of potential errors increases. This is our starting point, and we apply this evidence to Inductive Logic Programming (ILP). Recall that the aim of an ILP machine is to produce explanations H (i.e. logical formulas) for observable phenomena described by data E (a set of examples). We hope E is "provable" from H. Following the previous popular assertion, the uncertainty degree concerning the truth of a given fact increases with the length of its proof. So, we decided to measure this uncertainty degree by the length of the proof and to weight a given fact with this number. At the end of a learning process, we have an associated weight for each training example. But what can we do with this weight to improve the learning process? Here we take our inspiration from a machine-learning technique known as boosting [10]. The idea is to complete a full learning process, modifying the weight of each training example between each turn. At the end of the process, we get a finite set of classifiers: the final classifier is a linear combination of the intermediate ones. We follow these lines to improve the behavior of a well-known ILP system: Progol [8].

The mixing of boosting and ILP has already been proposed in [9,6], but our idea is rather different. For each training example, we compute its uncertainty degree (which is not an error degree as in [9]) by using a trick in the Progol machinery. This degree is considered as a weight for the given example. We see, in our experimentation, that the behavior of "Progol with loop" is, of course, better than Progol alone but, and this is more convincing, better than "Progol with randomly chosen weights". In section 2, we briefly describe the ILP framework. In section 3, we use some simple probabilistic arguments to validate the initial idea concerning the relationship between uncertainty and the length of the proofs. Section 4 is devoted to the target ILP system we want to optimize by introducing weighted examples. In section 6, we describe our implementation and we give the results of some practical experiments. Finally, we discuss in section 7 and conclude in section 8.

2 ILP: A Brief Survey

We assume some familiarity with the usual notions of first order logic and logic programming (see for instance [7] for a complete development). In standard inductive logic programming, a concept c is a Herbrand interpretation, that is, a subset of the Herbrand base, the full set of ground atoms. The result of the learning process is a logic program H. The question may be asked, what does it mean to say that we have learned c with H? Informally, this means that the "implicit" information deducible from (or coded by) H "covers" c in some sense. If we stay within the pure setting where programs do not involve negation, it is widely admitted that this implicit information is the least Herbrand model of H, denoted MH, which is again a subset of the Herbrand base. An ideal situation would be c = MH but, generally, MH (or H) is just an "approximation" of c. To be more precise, an ILP machine takes as input:

– a finite proper subset E = 〈E+, E−〉 (the training set in Instance Based Learning terminology), where E+ are the positive examples, i.e. the things known as being true, and is a subset of c, and E− are the negative examples and is a subset of c̄ (the complement of c).

– a logic program usually denoted B (as background knowledge), representing basic knowledge we have concerning the concept to approximate. This knowledge satisfies two natural conditions: it does not explain the positive examples, B ⊭ E+, and it does not contradict the negative ones, B ∪ E− ⊭ ⊥.

To achieve the ILP task, one of the most popular methods is to build H such that H ∪ B |= E+ and H ∪ B ∪ E− ⊭ ⊥. In that case, since E+ ⊆ c and E− ⊆ c̄, it is expected that ∀e ∈ c, H ∪ B |= e, i.e. H ∪ B |= c. Thus H, associated with B, could be considered as an explanation for c. Of course, as explained in the previous section, an ILP machine could behave as a classifier. Back to the introduction, the sample set S = {(x1, y1), . . . , (xi, yi), . . . , (xn, yn)} is represented as a finite set of Prolog facts class(xi, yi) constituting the set E+. The ILP machine will provide a hypothesis H. Given a query q, we get an answer with the program H ∪ B by running the standard Prolog machinery (H ∪ B is in fact a Prolog program). In the simple case of clustering, for instance, we get the class y of a given element x by giving the query class(x, Y)? to a Prolog interpreter, H being previously consulted.

Back to our main purpose, we want to evaluate, in some sense, the uncertainty relative to the answer for a given query q. This is the object of the next section.


3 A Simple Probabilistic Analysis

Our learning machine builds a hypothesis H, which is a logic program and, as such, could be identified with its least Herbrand model MH. To formalize the problem, we follow the lines of [5,1] by considering a probability measure µ over the full set I of all possible interpretations (the set of "worlds" in the [5,1] terminology). Then the degree of validity of a given ground formula F is the measure of its set of models: roughly speaking, the more models F has, the truer F is. Since we have a measure over I, we can consider any ground formula F as a random variable from I with range in {t, f}, by putting F(I) = the truth value of F in interpretation I:

F : I → {t, f},   I ↦ I(F)

Now, µ(F = v) has a standard meaning: µ({I ∈ I | F(I) = v}). To abbreviate, we denote µ(F) = µ(F = t). We may notice that, in the classical bi-valued setting, µ(F = t) = 1 − µ(F = f). In a more general three-valued logic, this equality does not hold. Nevertheless, the remainder of this section would still be valid in such a context. A logic program H can be considered as a conjunction of ground Horn clauses, so µ(H) is defined. Of course, we are interested in the relationship between this probability measure µ and the |= relation. We have the following lemma:

Lemma 1. If H1 |= H2 then

– µ(H1) ≤ µ(H2)
– µ(H2/H1) = 1

Proof: i) By definition of H1 |= H2, if H1(I) = I(H1) = t then H2(I) = I(H2) = t; this implies the first relation.
ii) µ(H2/H1) = µ(H1 ∧ H2) / µ(H1), but µ(H1 ∧ H2) = µ({I ∈ I | I |= H1 ∧ H2}) = µ({I ∈ I | I |= H1} ∩ {I ∈ I | I |= H2}) = µ({I ∈ I | I |= H1}) by hypothesis, and we are done.

Consider now a new instance of our problem, a ∉ E, which is known as being true. Our question is "what is the chance that one of the two hypotheses covers a, i.e. is such that H |= a?". So we are interested in the comparison between the probability that a holds knowing that b → a holds and the probability that a holds knowing that b ∧ c → a holds. We can easily deduce from the previous lemma:

Proposition 1. If m ≤ n:

µ(a | ∧i∈[1,n]bi → a) ≤ µ(a | ∧i∈[1,m]bi → a)

Proof: We lose no generality by dealing with m = 1 and n = 2. Since we have a |= (b → a) and (b → a) |= (b ∧ c → a), we deduce, using the Bayes formula and part ii) of the previous lemma:

µ(a | b → a) = µ(b → a | a) µ(a) / µ(b → a) = µ(a) / µ(b → a)    (1)

and

µ(a | b ∧ c → a) = µ(b ∧ c → a | a) µ(a) / µ(b ∧ c → a) = µ(a) / µ(b ∧ c → a)    (2)

Note that b → a |= b ∧ c → a and we can now use part i) of the previous lemma:

µ(b→ a) ≤ µ(b ∧ c→ a) (3)

So we get the expected result: µ(a | b ∧ c → a) ≤ µ(a | b → a). We can conclude here that it is probably better to justify a with a clause with a short body. If we consider the length of a proof for a given instance a, this number is thus a reasonable measure of the uncertainty we have concerning the fact that a is true. This number can be viewed as a weight for a and, inspired by standard techniques in the machine learning field, we shall try to introduce these weights in the ILP process to guide the search for relevant hypotheses. The remainder of our paper is devoted to taking the previous analysis into account in an ILP process and to reducing uncertainty by reducing the proofs. The target machinery is Progol ([8]), which we shortly describe in the following section.
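
As a quick sanity check of Proposition 1 (our illustration, not part of the paper), the two conditional probabilities can be computed for a uniform measure over the eight interpretations of {a, b, c}:

from itertools import product

worlds = [dict(zip("abc", bits)) for bits in product([False, True], repeat=3)]

def mu(formula):
    # probability of a formula = fraction of interpretations satisfying it
    return sum(1 for w in worlds if formula(w)) / len(worlds)

def conditional(f, g):
    return mu(lambda w: f(w) and g(w)) / mu(g)

a = lambda w: w["a"]
b_implies_a = lambda w: (not w["b"]) or w["a"]
bc_implies_a = lambda w: (not (w["b"] and w["c"])) or w["a"]

print(conditional(a, b_implies_a))     # 4/6: justification with the short body
print(conditional(a, bc_implies_a))    # 4/7: the longer body is indeed riskier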

4 The Progol System

Back to the standard ILP process: instead of searching for consequences, we search for premises, so it is rather natural to reverse standard deductive inference mechanisms. That is the case for Progol, which uses the so-called inverse entailment mechanism ([8]). We only give a simple algorithm1 schematizing its behavior in figure 1.

Initialize: E′ = E (initial set of examples)
            H = ∅ (initial hypothesis)
While E′ ≠ ∅ do
    Choose e ∈ E′
    Compute a covering clause C for e
    H = H ∪ {C}
    Compute Cov = {e′ | e′ ∈ E, B ∪ H |= e′}
    E′ = E′ \ Cov
End While

Fig. 1. General Progol scheme

At a glance, it is clear that no weight is taken into account in the previous algorithm. But let us examine in the next subsection how the covering clause is chosen.

1 See http://www.cs.york.ac.uk/mlg/progol.html where a full and clear description is given.


4.1 The Choice of the Covering Clause

It is clear that there is an infinite number of clauses covering e, and so Progol needs to restrict the search in this set. The idea is thus to compute a clause Ce such that if C covers e, then necessarily C |= Ce (C is more general than Ce). Since, in theory, Ce could be an infinite disjunction, Progol restricts the construction of Ce using mode declarations and some other settings (like the number of resolution inferences allowed, etc.). Mode declarations imply that some variables are considered as input variables and others as output variables: this is a standard way to restrict the search tree for a Prolog interpreter.

Finally, when we have a suitable Ce, it suffices to search for clauses C which θ-subsume Ce, since this is a particular case which validates C |= Ce. Thus, Progol begins to build a finite set of θ-subsuming clauses, C1, . . . , Cn. For each of these clauses, Progol computes a natural number f(Ci) which expresses the quality of Ci: this number measures in some sense how well the clause explains the examples and is combined with some other compression requirement. Given a clause Ci extracted to cover e, we have:

f(Ci) = p(Ci)− (c(Ci) + h(Ci) + n(Ci))

where:

– p(Ci) = #{e | e ∈ E, B ∪ Ci |= e}, i.e. the number of covered examples
– n(Ci) = #{e | e ∈ E, B ∪ Ci ∪ {e} |= ⊥}, i.e. the number of incorrectly covered examples
– c(Ci) is the length of the body of the clause Ci
– h(Ci) is the minimal number of atoms of the body of Ce we have to add to the body of Ci to ensure that output variables have been instantiated.

The evaluation of h(Ci) is done by static analysis of Ce. Then Progol chooses a clause C = Ci0 ≡ arg max_Ci f(Ci) (i.e. such that f(Ci0) = max{f(Cj) | j ∈ [1, n]}). We may notice that, in the formula computing the number f(Ci) for a given clause covering e, there is no distinction between the covered positive examples. So p(Ci) is just the number of covered positive examples. The same remark applies to the computation of n(Ci), and thus success and failure examples can be considered as equally weighted. In the next section, we shall explain how we introduce weights to distinguish between examples.

4.2 A Simulation of the Weights

Given a problem instance a covered by H, there is a deduction, using only the resolution rule, such that H ∪ B |= a. Back to our previous example, to deduce a knowing a ← b, c (using standard Prolog notation), you first have to prove (or to remove, in Prolog terminology) b then c with the standard Prolog strategy. It becomes clear that the number of resolution steps used to prove a is an increasing function of the length of the clauses defining a in H. Starting from this observation, we argue that this number is likely a good approximation of the difficulty to cover an instance of a which would not be in the training set. We infer that training examples with long proofs will likely generate errors in the future, and so we have to focus on such examples during the learning process. So, we decide to give each training example a weight equal to the length of the proof it needs to be covered by the current hypothesis. Among the finite set of possible proofs, we choose the shortest length.

Let us recall that Progol is not designed to readily employ weighted training instances, so we have to simulate the previous weights in some sense. Nevertheless, in the definition of the f function allowing to choose the best clause for the current instance, the parameter we are interested in is p(C): this is the number of training instances in E covered by the clause we are dealing with. Let us suppose that, instead of considering E as a set, we consider E as a multiset (or bag) and we include in the current E multiple occurrences of some existing examples. Then we force the ILP machine to give great importance to these instances and to choose the associated covering clauses just because of the great value of p(C). Let us give an example to highlight our view. E = {e1, e2, e3, e4}. Starting from e1, the Progol machine finds C1, C2 and C3 such that C1 covers e1, e2 and e3, C2 covers e1, e2 and e4, and C3 covers e1, e3 and e4. At this step, we have

p(C1) = p(C2) = p(C3) = 3.

So these clauses could not be distinguished only with the covered positive examples. Let us suppose now that we give to e2 the weight 2 and to e3 the weight 3, e1 and e4 keeping their implicit weight of 1. Now we have

p(C1) = 1 + 2 + 3 = 6, p(C2) = 1 + 2 + 1 = 4

and p(C3) = 1 + 3 + 1 = 5.

From the viewpoint of the positive covered examples, C1 is the best clause and will be added to the current hypothesis H: the clause C2, which does not cover e3, is eliminated, and we choose among the clauses covering e3. We understand that the Progol machinery will then choose the clauses covering the "heavy" instances.
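
The arithmetic of this example can be written down directly; the following tiny sketch (with hypothetical names) reproduces the weighted coverage counts above:

weights = {"e1": 1, "e2": 2, "e3": 3, "e4": 1}          # multiplicities in the bag
covers = {
    "C1": {"e1", "e2", "e3"},
    "C2": {"e1", "e2", "e4"},
    "C3": {"e1", "e3", "e4"},
}

def p(clause):
    # weighted number of covered positive examples
    return sum(weights[e] for e in covers[clause])

scores = {c: p(c) for c in covers}      # {'C1': 6, 'C2': 4, 'C3': 5}
best = max(scores, key=scores.get)      # 'C1', the clause covering the "heavy" e3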

5 Progol with Weights

Now we have to manage the weights in a Progol system: we take inspiration from a technique coming from the field of classification, namely boosting. We give the main lines of this algorithm in the next subsection.

5.1 The Boosting Technique

This technique comes from the field of classification: the main idea is to combine simple classifiers Ct, t ∈ [1, T], into a new one C whose predictive accuracy is better than those of the Ct's. So C = ψ({Ct | t ∈ [1, T]}) and the problem is to cleverly define the aggregating function ψ. We focus here on the boosting technique as initially described in [3,4] and improved in [10]. We suppose given a finite set of training instances S = {(x1, y1), . . . , (xi, yi), . . . , (xn, yn)} where yi denotes the class of xi (to simplify the presentation, we suppose a binary classification where yi ∈ {−1, +1}). The "boosted" system will now repeat T times the construction of a classifier Ct for weighted training instances, starting from a distribution D1 where the initial weight w1(i) of instance i is 1/n. At the end of turn t of the algorithm, the distribution Dt is updated into Dt+1: roughly speaking, the idea is to increase the weight of a misclassified instance, i.e. if Ct(xi) ≠ yi then wt+1(i) > wt(i). Of course, the difficulty is to cleverly define the new weights. One of the main results of [10] is to give choice criteria and also to prove important properties of such boosted algorithms. For our aim, it is not necessary to go further, so we give in figure 2 a simplified version of the boosting algorithm. Generally, the ψ functional is a linear combination of

Initialize t = 1
For i = 1 to n do Dt(i) = 1/n
For t = 1 to T do
    build a classifier Ct with S weighted by Dt
    update Dt
End For
C = ψ({Ct | t ∈ [1, T]})

Fig. 2. General boosting scheme

the Ct’s considered as real-valued functions. To predict the class of a given x, itsuffices to compute C(x): if C(x) > 0 then +1 else −1. So [10] proved that thetraining error of the classifier C has a rather simple upper bound (when Dt isconveniently chosen) and some nice methods to reduce this error are described.

5.2 Introduction of the Weights in Progol

Now it is relatively easy to introduce a kind of weight management in the Progol machinery: it suffices to consider the training set as a multiset, the weight ω(e) of an example e being its multiplicity. So we start with equally weighted examples: each instance appears once in E. Introducing T times a change of the function ω in a learning loop, we will then get distinct clauses and thus distinct final hypotheses in each turn of the loop. To abbreviate, we shall denote by Progol(B, E, f) the output program H currently given by the Progol machine with input B as background knowledge, E as training set (or bag), and using the function f to choose the relevant clauses. Et is the bag associated with the weight ωt during turn t, and figure 3 shows the scheme we use. It clearly appears that, with regard to the behavior of Progol, the program Progol(B, Et, f) will probably be different from the program Progol(B, Et+1, f). This is an exact


Given E = {e1, . . . , em} a set of positive examples,
      B a background knowledge (a set of Horn clauses)

/* Initialization */
E1 = E;

/* main loop */
For t = 1 to T do
    Ht = Progol(B, Et, f);
    For each e ∈ E, compute ωt+1(e) = length of the proof B ∪ Ht |= e;
    Update Et into Et+1 using ωt+1
/* end main loop */

Fig. 3. A boosting-like scheme for Progol

simulation of a weight-handling Progol system where the function f would be defined with:

f(C) = p(C)− (h(C) + c(C) + n(C))

where

p(C) = Σ_{e ∈ Et, B ∪ C |= e} ω(e)

So we get a set of hypotheses {Ht | t ∈ [1, T]} and, since we deal with symbolic values, we aggregate these programs by a plurality vote procedure, i.e. the answer for a query q is just the majority vote over the components Ht. This is exactly the solution of [9]. Doing so, we may notice that we get a classifier, but this is not a simple logic program. Nevertheless, we do not focus on this aspect for classification tasks. We examine in the next section how we proceed and the behavior of our machine from the viewpoint of predictive accuracy.
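
To make the scheme of Fig. 3 and the plurality vote concrete, here is a minimal Python sketch. It is not the authors' Unix-script implementation: run_progol, proof_length and classify are hypothetical stand-ins for the Progol call, the Prolog proof-length measurement and the evaluation of one hypothesis on a query.

from collections import Counter

def boosted_progol(B, E, T, run_progol, proof_length):
    hypotheses = []
    bag = list(E)                            # E1 = E: every example has weight 1
    for _ in range(T):
        H = run_progol(B, bag)               # Progol(B, Et, f) on the current bag
        if H in hypotheses:
            break                            # halting condition mentioned in Sect. 7
        hypotheses.append(H)
        # omega_{t+1}(e) = length of the shortest proof of e from B and H
        weights = {e: proof_length(B, H, e) for e in E}
        bag = [e for e in E for _ in range(weights[e])]    # Et+1 as a multiset
    return hypotheses

def plurality_vote(hypotheses, classify, query):
    votes = Counter(classify(H, query) for H in hypotheses)
    return votes.most_common(1)[0][0]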

6 Implementation and Experiments

Our experiments were made using the UCI databases ([2]). We chose 3 databases, namely the ZOO (102 examples, 18 attributes), HEPATITIS (155 examples, 20 attributes) and MORAL-REASONER (202 examples) databases. The first two are standard classification data using symbolic and numeric attribute values. The third one is fully relational, dealing only with symbolic values. The main boosting loop is implemented as a Unix script with

- a) additional Prolog predicates to generate the weights, and
- b) some C programs to process the different outcome files.

The running time is considerable for data with a large number of attributes: about 2 hours for HEPATITIS on a standard PC (500 MHz). This is the main reason why we chose restricted sets of data. A full C implementation would likely allow us to deal with more complex data and relations. But, as usual in that domain, we are not primarily concerned with real-time constraints. The main difficulty is due to the computation of the weights at the end of a learning turn: we need to run a Prolog interpreter for each example and to keep the length of each successful proof in order to compute the shortest one.

6.1 Results for ZOO and HEPATITIS Databases

Results for ZOO Database (in Accuracy Rate%)

Attribute name | Progol | Random weights | Weights + boosting
eggs           | 88,5   | 90             | 94,2
hair           | 96     | 96             | 96
feathers       | 100    | 100            | 100
airbone        | 76     | 80             | 84
aquatic        | 80     | 86             | 86
toothed        | 72     | 72             | 72
backbone       | 98     | 98             | 98
breathed       | 90     | 93             | 98

We fix the number of loops to 10 (as in [9]). We run standard Progol and our "boosted" system over the previous databases with fixed training sets of size 50, randomly sampled from the full ones. We also compare our machine with a "random boosting" system, i.e. one where the weights are randomly generated between two consecutive turns of the main loop. Then we test the resulting hypotheses over the full set of examples (this is an easy task with Progol since a built-in predicate test/1 does the job) and we compute the accuracy rate simply by dividing the number of correctly classified data by the full number of data. A majority vote is implemented for computing the answers of the "boosted" systems.

That is why we have 4 columns in our tables. The first one indicates the class we are focusing on; the next ones indicate the respective accuracy rates with Progol only, the random weight loop, and our weight function.

Results for HEPATITIS Database (in Accuracy Rate%)

Each line of the tables is identified by the name of the attribute we focus on: in such a line, we consider that the value of the attribute has to be guessed by the system (i.e. this attribute is the concept to learn).


Attribute name | Progol | Random weights | Weights + boosting
alive          | 83,7   | 84             | 86,1
sex            | 88,5   | 90             | 94,2
steroids       | 41     | 41             | 44
antiviral      | 20     | 20             | 25
anorexia       | 80     | 83             | 83
liverBig       | 75     | 76             | 80
liverFirm      | 53,6   | 60             | 63,4
spleenpalpable | 41,6   | 45             | 51,2
spiders        | 65,8   | 70             | 73
ascite         | 87,8   | 90             | 92,6
varices        | 90,2   | 95             | 95,1

The results we get are described in the above tables. We may remark that the random method is, in general, better than standard Progol, and this is an expected result. Moreover, our method is often better than the random one.

6.2 Results for MORAL-REASONER Database

One of the main interests we found in this database is the huge number of predicate symbols involved. Recall that we are interested in the length of the clause bodies, and therefore we focus on theories with a lot of predicate symbols: we thus increase the possibilities for the bodies of clauses. For instance, considering the BALLOON database (UCI), the Progol machine gives only clauses with an empty body (because the concept to learn is rather simple), so there is no possibility to improve the mechanism with our scheme. In the MORAL-REASONER database, we have 49 predicate symbols to define the guilty/1 target predicate. A little more than 4750 clauses constitute the background knowledge. The full training set contains 102 positive instances S+ and 100 negative ones S−. Since this database is rather a logical one, we adopt a different protocol to test our machine. The protocol used in that case is the following: training sets E = E+ ∪ E− of sizes 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100 were randomly sampled from the full training set. Each training set is equally divided into positive and negative instances (i.e. 5 positive examples in E+ and 5 negative examples in E− for the first test, and so on). First of all, we may notice the behavior of standard Progol, where the accuracy rate is not a strictly increasing function of the number of examples: this is due to the fact that the given training sets do not constitute an increasing sequence. But the main observation is the fact that "boosted Progol" has, in all cases except the first one, a better accuracy rate over the full sample set than simple Progol. We guess that this exception is not really informative because of the small size of the sample set. With only 10 training instances, it is clear that we are rather lucky to get a well-performing classifier as first hypothesis (accuracy = 81.19%). So, in the other boosting loops, we get a more standard behavior with a large deviation, and the majority vote demonstrates this fact.

Fig. 4. Comparison Progol/Progol with weight

7 Discussion

Our method for computing weights is generally better than the random weight method, which is itself better than simple Progol. So we can infer that boosting with proof lengths optimizes the behavior of the Progol machine. We may notice that the first hypothesis we get with the boosting scheme is exactly the one produced by the standard Progol machine. During the boosting process, if we get a hypothesis H which has already been produced, this is a halting condition. The reason is that the weights to be updated have already been computed and we will not get more information.

Since we use a plurality vote, we do not have a single logic program at the end of the process, and so we lose an interesting feature of ILP: the intelligibility of the resulting hypothesis, that is, the possibility for a non-expert to understand why we get a given result. Of course, each individual hypothesis remains understandable, but the behavior of the whole system cannot be reduced to one hypothesis unless we exhibit an equivalent logic program. So, instead of using a voting procedure, it would be interesting to combine the hypotheses Pt to get a single full logic program. The simplest solution would be to take the union of the Pt's: P = ⋃_{t∈[1,T]} Pt. From a logical point of view, we are sure to get a consistent theory since we only deal with Horn clauses, so each Pt has only positive consequences. We examine this possibility now.

Let us consider the union P of two logic programs P1 and P2, considered as unordered finite sets of Horn clauses. Then the success set of P, ss(P) (see [7] for a complete description), is a superset of the union of the two success sets: ss(P1) ∪ ss(P2) ⊆ ss(P). So, from the viewpoint of the covered positive examples, we increase the accuracy rate. But let us focus now on the negative examples. We recall that a negative example is covered if and only if it belongs to the finite failure set of P, ff(P) (see [7]). Unfortunately, we have the dual inclusion ff(P) ⊆ ff(P1) ∩ ff(P2), highlighting the fact that negation as failure is a non-monotonic logical rule.

A partial solution would be to consider the accuracy rate over positive examples only. Nevertheless, two other problems occur:

1. First, we increase the non-determinism of the program since we probably get several answers for one given instance. And this is problematic when we deal with a single-class clustering task.

2. Secondly, we probably get a very redundant system where each covered instance leads to a lot of different proofs, thus reducing the time efficiency of the final program.

So a more clever aggregating method has to be found. This is an open issue.

8 Conclusion

In this paper, we provide a way to introduce weights in an ILP machine, Progol. This idea is not completely new. As far as we know, [9] was the first one to build such a mechanism for the FOIL machine, but there are two main differences with our work:

– FOIL is a weight-handling system, so the novelty was only the boosting mechanism. In our case, Progol does not handle weighted examples. As a consequence, we have to search for reasonable weights and to find a method to deal with them.

– in FOIL, the weights are derived from the training error, but our weights rely on the length of proofs, since the training error is always null.


The length of a proof is viewed as a numerical quantification of the uncertainty degree, and a simple probabilistic reasoning justifies this claim. We implement this idea by simulating weights, using bags instead of sets as training bases, the multiplicity of an element being considered as its "weight". This is a way to focus on potentially unfair examples and to force the learning machine, Progol, to favor hypotheses covering these examples. The final aggregation is just a plurality vote to get the answer. Our practical results improve on the standard Progol behavior and validate our idea over a restricted set of data. A more extensive experimentation would be interesting.

Among the different ideas to develop, we have to introduce negative examples, since they are available in such first-order systems. But in that case, matters are not so simple. To deal with negative information, the standard way in logic programming is to start with the Closed World Assumption and to implement a negation-as-failure mechanism. Unfortunately, negation-as-failure is not a logical inference rule but only an operational trick, and in that case, the length of a proof is of no help in designing a weight function.

References

1. F. Bacchus, A. J. Grove, J. Y. Halpern, and D. Koller. From statistical knowledge bases to degrees of belief. In Artificial Intelligence, (87), pp. 75–143, 1997.

2. C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.

3. Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the 13th Int. Conf., pp. 148–156, 1996.

4. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Journal of Computer and System Sciences, vol. 55(1), pp. 119–139, 1997.

5. J. Y. Halpern. An analysis of first-order logics of probability. In Artificial Intelligence, (46), pp. 311–350, 1990.

6. S. Hoche and S. Wrobel. Using constrained confidence-rated boosting. In 11th Int. Conf. on ILP, Strasbourg, France, pp. 51–64, 2001.

7. J. W. Lloyd. Foundations of Logic Programming. Symbolic Computation series. Springer Verlag, 1997 (revised version).

8. S. Muggleton. Inverse entailment and Progol. New Gen. Comput., 12, pp. 245–286, 1994.

9. J. R. Quinlan. Boosting first-order learning. In Proc. 7th Int. Workshop on Algorithmic Learning Theory ALT'96. Springer Verlag, 1996.

10. R. E. Schapire and Y. Singer. Improving boosting algorithms using confidence-rated predictions. In Proc. 11th Ann. Conf. on Computational Learning Theory, 1998.


Paraconsistency in Object-Oriented Databases

Rajiv Bagai1 and Shellene J. Kelley2

1 Department of Computer Science, Wichita State University, Wichita, KS 67260-0083, [email protected]

2 Department of Mathematics & Computer Science, Austin College, Sherman, TX 75090-4440, USA

[email protected]

Abstract. Many database application areas require an ability to handle incomplete and/or inconsistent information. Such information, called paraconsistent information, has been the focus of some recent research. In this paper we present a technique for representing and manipulating paraconsistent information in object-oriented databases. Our technique is based on two new data types for such information. These data types are generalizations of the boolean and bag data types of the Object Data Management Group (ODMG) standard. Algebraic operators with a 4-valued paraconsistent semantics are introduced for the new data types. Also a 4-valued operational semantics is presented for the select expression, with an example illustration of how such a semantics can be used effectively to query an object-oriented database containing paraconsistent information. To our knowledge, our technique is the first treatment of inconsistent information in object-oriented databases.

1 Introduction

Employing classical sets to capture some underlying property of their elements requires the membership status of each possible element to be completely determinable, either positively or negatively. For bags, determining multiplicity of membership of their elements is necessary as well. While it is often possible to determine that underlying property of possible elements of sets or bags, there are numerous applications in which that property can at best be only guessed by employing one or more tests or sensors. By itself, such a test or sensor is usually not foolproof, making it necessary to take into account the outcomes, sometimes contradictory, of several such tests.

As an example, Down's Syndrome is an abnormality in some fetuses, leading to defective mental or physical growth. Whether or not the fetus inside a pregnant woman has Down's Syndrome can only be guessed at by subjecting the mother to certain tests, e.g. serum screening, ultrasound, amniocentesis diagnosis, etc. Such tests are often carried out simultaneously and contradictory results are possible. Decisions are made based on information obtained from all sensors, even contradictory ones. As another example, target identification in a battlefield is often carried out by employing different sensors for studying an observed vehicle's radar image shape, movement pattern, gun characteristics, etc. Gathered information is often incomplete, and sometimes even inconsistent. Astronomical and meteorological databases are other examples of domains rich in such information. For the same reasons, combinations of databases [4,9] or databases containing beliefs of groups of people often have incomplete, or more importantly, inconsistent information.

This research has been partially supported by the National Science Foundation research grant no. IRI 96-28866.

Paraconsistent information is information that may be incomplete or inconsistent. Given the need for handling such information in several application areas, some techniques have recently been developed. Bagai and Sunderraman [1] proposed a data model for paraconsistent information in relational databases. The model, based on paraconsistent logic studied in detail by da Costa [7] and Belnap [5], has been shown by Tran and Bagai in [10] to require handling infinite relations, with an efficient algebraic technique presented in [11]. Also, some elegant model computation methods for general deductive databases have been developed using this model in [2,3].

In the context of object-oriented databases, however, we are not aware of any prior work on handling paraconsistency.

In this paper we present a technique, motivated by the above model, for representing and manipulating paraconsistent information in object-oriented databases. We present two new data types for the object model of Cattell and Barry [6]. Our data types are generalizations of the boolean and bag data types. We also introduce operators over our new data types and provide a 4-valued semantics for the operators. In particular, we provide a richer, 4-valued operational semantics for the select expression on our new data types. Equipped with our semantics, the operators become an effective language for querying object-oriented databases containing paraconsistent information.

The remainder of this paper is organized as follows. Section 2 introduces the two new data types for storing paraconsistent information. This section also gives some simple operators on these data types. Section 3 presents a 4-valued operational semantics for the select expression, followed by an illustrative example of the usage and evaluation of this construct. Finally, Section 4 concludes with a summary of our main contributions and a mention of some of our future work directions.

2 Paraconsistent Data Types

In this section we present two new data types, namely pboolean and pbag<t>, that are fundamental to handling paraconsistent information in an object-oriented database. These data types are, respectively, generalizations of the data types boolean and bag<t> of the ODMG 3.0 object data standard of [6].


2.1 The Data Type pboolean

The data type pboolean (for paraconsistent boolean) is a 4-valued generalization of the ordinary data type boolean. The new generalized type consists of four literals:

true, false, unknown and contradiction.

As in the boolean case, the literal true is used for propositions whose truth value is believed to be true, and the literal false for ones whose truth value is believed to be false. If the truth value of a proposition is not known, we record this fact by explicitly using the literal unknown as its value. Finally, if a proposition has been observed to be both true as well as false (for example, by different sensors), we use the literal contradiction as its value.

Paraconsistent Operators pand, por & pnot

We also generalize the well-known operators and, or and not on the type boolean to their paraconsistent counterparts pand, por and pnot, respectively, on the type pboolean. The following table defines these generalized operators:

P             | Q             | P pand Q      | P por Q       | pnot P
contradiction | contradiction | contradiction | contradiction | contradiction
contradiction | true          | contradiction | true          | contradiction
contradiction | false         | false         | contradiction | contradiction
contradiction | unknown       | false         | true          | contradiction
true          | contradiction | contradiction | true          | false
true          | true          | true          | true          | false
true          | false         | false         | true          | false
true          | unknown       | unknown       | true          | false
false         | contradiction | false         | contradiction | true
false         | true          | false         | true          | true
false         | false         | false         | false         | true
false         | unknown       | false         | unknown       | true
unknown       | contradiction | false         | true          | unknown
unknown       | true          | unknown       | true          | unknown
unknown       | false         | false         | unknown       | unknown
unknown       | unknown       | unknown       | unknown       | unknown

Except for the cases when one of P and Q is contradiction and the other is unknown, all values in the above table are fairly intuitive. For example, false pand unknown should be false, regardless of what that unknown value is. And false por unknown is unknown, as false is the identity of por. The two cases when one of P and Q is contradiction and the other is unknown will become clear later. However, at this stage it is worthwhile to observe that the 4-valued operators pand, por and pnot are monotonic under the no-more-informed lattice ordering (≤), where

unknown < true < contradiction, and
unknown < false < contradiction.
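
For illustration (our sketch, not part of the ODMG standard or of the paper's prototype), the three operators can be implemented by encoding each pboolean value as a pair of evidence flags; with this encoding the definitions below reproduce the truth table above.

TRUE, FALSE, UNKNOWN, CONTRADICTION = (1, 0), (0, 1), (0, 0), (1, 1)

def pand(p, q):
    # conjoin the positive evidence, accumulate the negative evidence
    return (p[0] & q[0], p[1] | q[1])

def por(p, q):
    return (p[0] | q[0], p[1] & q[1])

def pnot(p):
    # negation swaps the two kinds of evidence
    return (p[1], p[0])

assert pand(CONTRADICTION, UNKNOWN) == FALSE
assert por(CONTRADICTION, UNKNOWN) == TRUE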


Also, the duality of pand and por is evident from the above table. Moreover, the following algebraic laws can easily be shown to be exhibited by the above 4-valued operators:

1. Double Complementation Law:
   pnot (pnot P) = P

2. Identity and Idempotence Laws:
   P pand true = P pand P = P
   P por false = P por P = P

3. Commutativity Laws:
   P pand Q = Q pand P
   P por Q = Q por P

4. Associativity Laws:
   P pand (Q pand R) = (P pand Q) pand R
   P por (Q por R) = (P por Q) por R

5. Distributivity Laws:
   P pand (Q por R) = (P pand Q) por (P pand R)
   P por (Q pand R) = (P por Q) pand (P por R)

6. De Morgan Laws:
   pnot (P pand Q) = (pnot P) por (pnot Q)
   pnot (P por Q) = (pnot P) pand (pnot Q)

2.2 The Data Type pbag<t>

If t is any data type, then bag<t> is an unordered collection of objects of type t, with possible duplicates. We define pbag<t> (for paraconsistent bag) to be a collection data type, such that any element p of that type is an ordered pair 〈p+, p−〉, where p+ and p− are elements of type bag<t>. For example,

〈{1, 1, 2, 2, 2, 3}, {1, 3, 3}〉

is an element of type pbag<short>.

Intuitively, just like a bag, a paraconsistent bag captures some underlying property of elements that (may) occur in it. An element a of type t occurs in p+ as many times as there are evidences of it having the underlying property; similarly, a occurs in p− as many times as there are evidences of it not having the underlying property. Since inconsistent information is possible, an element may be simultaneously present (in fact, a multiple number of times) in both the positive as well as the negative part of a paraconsistent bag. Also, an element may be absent from one or both parts.

As a more realistic example, suppose a patient in a hospital is tested for some symptoms s1, s2 and s3. Often the testing methods employed by hospitals are not guaranteed to be foolproof. Thus, for any given symptom, the patient, over a period of time, is tested repeatedly, with possibly contradictory results. A possible collection of test results captured as a paraconsistent bag is:

〈{s1, s2, s2, s3}, {s1, s3, s3}〉.

The above contains the information that this patient tested positive once each for symptoms s1 and s3, and twice for symptom s2. Also, the patient tested negative once for symptom s1 and twice for s3.


Membership of an Element

Membership of an element in a paraconsistent bag is a paraconsistent notion. If a is an expression of type t and p of type pbag<t>, then a pin p is an expression of type pboolean given by:

a pin p =
    true           if a ∈ p+ and a ∉ p−,
    false          if a ∉ p+ and a ∈ p−,
    unknown        if a ∉ p+ and a ∉ p−,
    contradiction  if a ∈ p+ and a ∈ p−.

The above corresponds to the main intuition behind the four literals of the pboolean type.

Paraconsistent Operators punion, pintersect & pexcept

We now define the union, intersection and difference operators on paraconsistent bags as generalizations of the union, intersect and except operators, respectively, on ordinary bags. Let t1 and t2 be compatible types and t their smallest common supertype.

If p1 is of type pbag<t1> and p2 is of type pbag<t2>, then p1 punion p2 is the following paraconsistent bag of type pbag<t>:

〈p1+ union p2+, p1− intersect p2−〉.

Once again, the punion operator is best understood by interpreting bags (both ordinary and paraconsistent) as collections of evidences for elements having their respective underlying properties, and p1 punion p2 as the "either-p1-or-p2" property. Thus, since p1+ and p2+ are the collections of positive evidences for the properties underlying p1 and p2, respectively, the collection of positive evidences for the property "either-p1-or-p2" is clearly p1+ union p2+. Similarly, since p1− and p2− are the collections of negative evidences for the properties underlying p1 and p2, respectively, the collection of negative evidences for the property "either-p1-or-p2" is p1− intersect p2−.

The generalized operators for the intersection and difference of paraconsistent bags should be understood similarly. p1 pintersect p2 is the following paraconsistent bag of type pbag<t>:

〈p1+ intersect p2+, p1− union p2−〉

and p1 pexcept p2 is the following paraconsistent bag of type pbag<t>:

〈p1+ intersect p2−, p1− union p2+〉.
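
A small sketch of how these bag operators could be realized (our illustration, assuming ODMG bag union adds multiplicities and bag intersect takes the minimum multiplicity; a paraconsistent bag is represented as a pair of Counter multisets):

from collections import Counter

def pbag(pos, neg):
    return (Counter(pos), Counter(neg))

def pin(a, p):
    # 4-valued membership as a pair of evidence flags:
    # (1,0)=true, (0,1)=false, (0,0)=unknown, (1,1)=contradiction
    return (a in p[0], a in p[1])

def punion(p1, p2):
    return (p1[0] + p2[0], p1[1] & p2[1])

def pintersect(p1, p2):
    return (p1[0] & p2[0], p1[1] + p2[1])

def pexcept(p1, p2):
    return (p1[0] & p2[1], p1[1] + p2[0])

patient = pbag(["s1", "s2", "s2", "s3"], ["s1", "s3", "s3"])   # the bag from Sect. 2.2
print(pin("s1", patient))       # (True, True): a contradiction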


Type Conversion

For any collection C, we define pbagof(C) as the following paraconsistent bag:

pbagof(C)+ =
    the bag of all elements of C   if C is a list or set,
    C                              if C is a bag,
    C+                             if C is a paraconsistent bag;

pbagof(C)− =
    ∅                              if C is a list, set or bag,
    C−                             if C is a paraconsistent bag.
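
In the same pair-of-Counters representation used in the sketch above, pbagof could look as follows (a hypothetical helper, for illustration only):

from collections import Counter

def pbagof(C):
    if isinstance(C, tuple) and len(C) == 2:       # already a paraconsistent bag
        return (Counter(C[0]), Counter(C[1]))
    return (Counter(C), Counter())                 # list, set or bag: empty negative part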

3 A Paraconsistent Select Expression

We now introduce a paraconsistent select expression construct that can be used to query paraconsistent information from a database. The syntax of our construct is similar to the select construct of OQL, but we provide a paraconsistent operational semantics for evaluating the construct, resulting in a paraconsistent bag.

The general form of the select expression is as follows:

select [distinct] g(v1, v2, . . . , vk, x1, x2, . . . , xn)
from
    x1 pin f1(v1, v2, . . . , vk),
    x2 pin f2(v1, v2, . . . , vk, x1),
    x3 pin f3(v1, v2, . . . , vk, x1, x2),
    · · ·
    xn pin fn(v1, v2, . . . , vk, x1, x2, . . . , xn−1)
[where p(v1, v2, . . . , vk, x1, x2, . . . , xn)]

where v1, v2, . . . , vk are free variables that have to be bound to evaluate the query, and the expressions f1, f2, . . . , fn result in collections of types with extents, say, e1, e2, . . . , en, respectively. (The extent of a type is the set of all current instances of that type.) And p is an expression of type pboolean. The result of the query will be of type pbag<t>, where t is the type of the result of g.

The query is evaluated as follows:

1. The result of the from clause, Φ(n), is a paraconsistent bag of n-tuples, defined recursively as

   Φ(1) = pbagof(f1(v1, v2, . . . , vk))

   and for 2 ≤ i ≤ n,

   Φ(i)+ = ⋃_{〈a1, a2, ..., ai−1〉 ∈ Φ(i−1)+} { 〈a1, a2, . . . , ai−1, t〉 : t ∈ pbagof(fi(v1, v2, . . . , vk, a1, a2, . . . , ai−1))+ }

   Φ(i)− = (Φ(i−1)− × ei) ∪ ⋃_{〈a1, a2, ..., ai−1〉 ∈ Φ(i−1)+} { 〈a1, a2, . . . , ai−1, t〉 : t ∈ pbagof(fi(v1, v2, . . . , vk, a1, a2, . . . , ai−1))− }


The subexpression Φ(i−1)− × ei appears in the definition of Φ(i)− due to the fact that if a tuple does not have the Φ(i−1) property, then no extension of it can have the Φ(i) property.

2. If the where clause is present, obtain from Φ(n) the following paraconsistent bag Θ:

   Θ+ = Φ(n)+ ∩ { 〈a1, a2, . . . , an〉 ∈ e1 × e2 × · · · × en : p(v1, v2, . . . , vk, a1, a2, . . . , an) is true or contradiction }

   Θ− = Φ(n)− ∪ { 〈a1, a2, . . . , an〉 ∈ e1 × e2 × · · · × en : p(v1, v2, . . . , vk, a1, a2, . . . , an) is false or contradiction }

   Intuitively, Θ is obtained from the paraconsistent bag of n-tuples, Φ(n), by performing a pintersect operation with the paraconsistent bag p, also of n-tuples.

3. If g is just "*", keep the result of step (2). Otherwise, replace each tuple 〈a1, a2, . . . , an〉 in it by g(v1, v2, . . . , vk, a1, a2, . . . , an).

4. If the keyword distinct is present, eliminate duplicates from each of the two bag components of step (3).

An Example

Let us now look at an example illustrating some paraconsistent computations for a query that requires them. Consider again a hospital ward where patients are tested for symptoms. Let a class Patient contain the following relationship declaration in its definition:

relationship pbag<Symptom> test;

where Symptom is another class. The above declaration states that, for any patient, the relationship test is essentially a paraconsistent bag containing, explicitly, the symptoms for which that patient has tested positive or negative.

Let the sets {P1, P2} and {s1, s2, s3} be the current extents of the classes Patient and Symptom, respectively. We also suppose that in the current state of the database, P1.test and P2.test are the following relationships:

P1.test = 〈{s1, s1, s2}, {s3}〉,
P2.test = 〈{s1, s3, s3}, {s1, s1}〉.

In other words, P1 was tested positive for s1 (twice), s2 (once), and negative fors3 (once). Also, P2 was tested positive for s1 (once), s3 (twice), and negative fors1 (twice). Now consider the query:

What patients showed contradictory test results for some symptom?


This query clearly acknowledges the presence of inconsistent information in the database. More importantly, it attempts to extract such information for useful purposes.

A select expression for this query is:

select distinct p
from p pin Patient,
     s pin p.test
where pnot(s pin p.test)

In ordinary 2-valued OQL a similar select expression would produce an empty bag of patients, as the stipulations "s in p.test" and "not(s in p.test)" are negations of each other and no patient-symptom combination can thus simultaneously satisfy both of these stipulations. However, our 4-valued semantics of paraconsistent operators like pnot and pin enables us to extract useful, in this case contradictory, information from the database.

We first compute the result of the from clause, Φ(2), a paraconsistent bag of patient-symptom ordered pairs, by performing the following evaluations:

Φ(1) = 〈{〈P1〉, 〈P2〉}, ∅〉,
Φ(2)+ = {〈P1, s1〉, 〈P1, s1〉, 〈P1, s2〉, 〈P2, s1〉, 〈P2, s3〉, 〈P2, s3〉},
Φ(2)− = {〈P1, s3〉, 〈P2, s1〉, 〈P2, s1〉}.

The condition of the where clause is then evaluated for all possible patient-symptom ordered pairs:

〈p, s〉      pnot(s pin p.test)
〈P1, s1〉    false
〈P1, s2〉    false
〈P1, s3〉    true
〈P2, s1〉    contradiction
〈P2, s2〉    unknown
〈P2, s3〉    false

resulting in the following paraconsistent bag Θ:

Θ+ = {〈P2, s1〉},
Θ− = {〈P1, s1〉, 〈P1, s2〉, 〈P1, s3〉, 〈P2, s1〉, 〈P2, s1〉, 〈P2, s1〉, 〈P2, s3〉}.

Finally, projecting the patients from the above and removing duplicates results in the following answer to the select expression:

〈{〈P2〉}, {〈P1〉, 〈P2〉}〉.

The result states that P2 showed a contradictory test result for some symptom (actually s1) but not for all symptoms (for example, not for s3), and P1 did not show a contradictory result for any symptom.
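The truth values in the table above can be reproduced with a small, self-contained sketch of the 4-valued pin and pnot operators (Belnap-style negation is assumed; the Python names and the counter-based bag encoding are our own):

def pin(x, pos, neg):
    # 4-valued membership of x in a paraconsistent bag (pos, neg).
    return {(True, True): 'contradiction', (True, False): 'true',
            (False, True): 'false', (False, False): 'unknown'}[(x in pos, x in neg)]

def pnot(v):
    # 4-valued negation: swaps true and false, fixes unknown and contradiction.
    return {'true': 'false', 'false': 'true',
            'unknown': 'unknown', 'contradiction': 'contradiction'}[v]

tests = {'P1': ({'s1': 2, 's2': 1}, {'s3': 1}),   # positive / negative test counts
         'P2': ({'s1': 1, 's3': 2}, {'s1': 2})}

for p, (pos, neg) in tests.items():
    for s in ('s1', 's2', 's3'):
        print(p, s, pnot(pin(s, pos, neg)))        # matches the table above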


4 Conclusions and Future Work

The existing ODMG 3.0 Object Data Standard [6] is not capable of handling incomplete and/or inconsistent information in an object-oriented database. For many applications that is a severe limitation as they abound in such information, called paraconsistent information. We have presented a technique for representing and manipulating paraconsistent information in object-oriented databases. Our technique is based upon two new data types pboolean and pbag<t> that are generalizations of the boolean and bag<t> data types, respectively, of [6]. We also presented operators over these data types that have a 4-valued semantics. Most importantly, we presented a 4-valued operational semantics for the select expression of OQL, which makes it possible to query contradictory information contained in the database.

We have recently completed a first prototype implementation of our results. Kelley [8] describes in detail an object-oriented database management system capable of storing incomplete and/or inconsistent information and answering queries based on the 4-valued semantics presented in this paper.

To the best of our knowledge, our work is the first treatment of inconsistent information in object-oriented databases. Such information occurs often in applications where, to determine some fact, many sensors may be employed, some of which may contradict each other. Examples of such application areas include medical information systems, astronomical systems, belief systems, meteorological systems, and military and intelligence systems.

Due to space limitations we have presented a minimal but complete extension to the standard of [6] for handling paraconsistent information. An exhaustive extension would have to include generalized versions of other operators and features, such as for all, exists, andthen, orelse, < some, >= any, etc. We have left that as a future extension of the work presented in this paper.

Some other future directions in which we plan to extend this work are to develop query languages and techniques for databases that contain quantitative paraconsistency (a finer notion of paraconsistency with real values for belief and doubt factors) and temporal paraconsistency (dealing with paraconsistent information that evolves with time).

References

1. R. Bagai and R. Sunderraman. A paraconsistent relational data model. International Journal of Computer Mathematics, 55(1):39–55, 1995.

2. R. Bagai and R. Sunderraman. Bottom-up computation of the Fitting model for general deductive databases. Journal of Intelligent Information Systems, 6(1):59–75, 1996.

3. R. Bagai and R. Sunderraman. Computing the well-founded model of deductive databases. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 4(2):157–175, 1996.

4. C. Baral, S. Kraus, and J. Minker. Combining multiple knowledge bases. IEEE Transactions on Knowledge and Data Engineering, 3(2):208–220, 1991.


5. N. D. Belnap. A useful four-valued logic. In G. Epstein and J. M. Dunn, editors, Modern Uses of Many-valued Logic, pages 8–37. Reidel, Dordrecht, 1977.

6. R. G. G. Cattell and D. K. Barry. The Object Data Standard: ODMG 3.0. Morgan Kaufmann Publishers, 2000.

7. N. C. A. da Costa. On the theory of inconsistent formal systems. Notre Dame Journal of Formal Logic, 15:497–510, 1974.

8. S. J. Kelley. A Paraconsistent Object-Oriented Database Management System. MS thesis, Department of Computer Science, Wichita State University, November 2001.

9. V. S. Subrahmanian. Amalgamating knowledge bases. ACM Transactions on Database Systems, 19(2):291–331, 1994.

10. N. Tran and R. Bagai. Infinite relations in paraconsistent databases. Lecture Notes in Computer Science, 1691:275–287, 1999.

11. N. Tran and R. Bagai. Efficient representation and algebraic manipulation of infinite relations in paraconsistent databases. Information Systems, 25(8):491–502, 2000.


Decision Support with Imprecise Data for Consumers

Gergely Lukacs

Universität Karlsruhe, Institute for Program Structures and Data Organization
Am Fasanengarten 5, 76128 Karlsruhe
[email protected]

Abstract. Only imperfect data is available in many decision situations, which therefore plays a key role in the decision theory of economic science. It is also of key interest in computer science, among others when integrating autonomous information systems: the information in one system is often imperfect from the view of another system. The case study for the present work combines the two issues: the goal of the information integration is to provide decision support for consumers, the public. By the integration of an electronic timetable for public transport with a geographically referenced database, for example, with rental apartments, it is possible to choose alternatives, for example, rental apartments from the database, that have a good transport connection to a given location. However, if the geographic references in the database are not sufficiently detailed, the quality of the public transport connections can only be characterized imprecisely. This work focuses on two issues: the representation of imprecise data and the sort operation for imprecise data. The proposed representation combines intervals and imprecise probabilities. When the imprecise data is only used for decision making with the Bernoulli-principle, a more compact representation is possible without restricting the expressive power. The key operation for decision making, the sorting of imprecise data, is discussed in detail. The new sort operation is based on so-called π-cuts, and is particularly suitable for consumer decision support.

1 Introduction

There is great potential in Business-to-Consumer Electronic Commerce (B2C EC) for information-based marketing, and to integrate independent product evaluations of, for example, consumer advisory centres with the offers of Web-shops. In both cases, in fact, a decision support system for consumers has to be built. As an example, it would be possible to show not only the retail price of household machines, but also the cost of the energy they consume over their lifetime. In traditional stores, there is little possibility to communicate the latter to the consumer, but in Web-shops customized calculations could easily be carried out. Interestingly, many important possibilities for climate change mitigation have negative net costs, i.e., although there is usually a higher initial investment required, the reduced energy consumption and energy costs more than make up for these costs¹.

In decision support, however, imperfect data is often unavoidable. Imperfections occur, e.g., because the future cannot be predicted with certainty, or because the decision alternatives or the preferences of the decision maker are not known perfectly. In many cases, non-negligible uncertainties result from these sources. Therefore, the handling of uncertainties is a key issue in the theory and practice of professional decision making. If we want to offer decision support for consumers, appropriate ways to describe and to operate on imperfect data are necessary, and information systems and database management systems have to be extended correspondingly.

The handling of imperfect data in information systems is far from being a straightforward issue; even simple operations cannot easily be extended for imperfect data. It is not surprising, then, that this issue has been studied very extensively in the research literature. A corresponding bibliography [1], almost comprehensive up to the year 1997, lists over 400 papers in this field. Even though there is an obvious correspondence between decision making and imperfect data, only a few papers mention decision making. That is where this paper comes in: we study the handling of imperfect data from a decision-theoretic point of view. Furthermore, our goal is to provide decision support not for professionals but for consumers, i.e., a very wide user group, whose characteristics have to be considered in the way imperfect data is handled.

There are many possible criteria consumers want to consider in their decisions. As opposed to professional decision making, it is now not possible to support all criteria within the system; some of them can only be considered by the users. We call these criteria external criteria. The system thus shall give freedom to the users to consider their external criteria. It shall not select some good alternatives; rather, it shall sort all available alternatives, so that the user can pick an appropriate alternative from the sorted list. For this reason, the sort operation plays a key role in decision support for consumers.

The rest of the paper is organized as follows. In Section 2 the case study and in Section 3 the necessary background in decision theory and probability theory are introduced. Related work is analysed in Section 4. The new contributions of the paper, a very powerful representation of imprecise values and a sort operation for imprecise values, are described in Sections 5 and 6. Conclusions and future work are discussed in Section 7.

2 Case Study

In the long term we envision the applications mentioned previously, i.e., B2C Web-shops, where the consumer can readily consider the (customized) running energy costs of his various consumer decisions, such as the selection of household machines.

¹ IPCC. Climate change 2001: Mitigation. Available at: http://www.ipcc.ch.


Currently, we are working with a seemingly different case study, geographic information retrieval (GIR). There is a database with geographically referenced data, e.g., rental apartments, restaurants or hotels, on the one side, and an electronic timetable for public transport including digital maps on the other side. Queries such as "find rental apartments close by public transport to my new working place in Karlsruhe" can be carried out by the integrated system. A particular challenge is that some data objects only have incomplete geographic references, e.g., only the settlement is known, or the house number is missing from an otherwise complete address, which is very common in classified advertisements.

The current GIR case study and the decision support for consumers in Web-shops may give the impression of not having much in common. However, they share the following features, making them similar for our investigations. First, from a decision-theoretic point of view, in both cases an alternative has to be selected from a set of alternatives, where the evaluations of the alternatives are imprecise (due to the incomplete address or due to, e.g., unknown technical data or imprecisely predictable energy prices). Second, in both cases the user group is potentially very large, such that — as opposed to professional decision making — a solid mathematical background of the decision maker cannot be assumed. (Lastly, the GIR case study also has an energy/cost saving and climate change mitigating potential.)

3 Background Issues

3.1 Decision Making under Imprecision

The consequences of a decision often depend on chance. As an example, there are two alternatives having uncertain outcomes, characterized by probability variables. (For the sake of simplicity we assume over the rest of the paper that a smaller outcome is better.) In a simple decision model one could consider the expected values of the two probability variables, and select the alternative with the lower expected value. As an example, there are two alternatives Alt1 and Alt2 whose outcomes are characterized by the probability variables w1 and w2. Using a simple decision model one shall prefer the first alternative over the second alternative if the expected value of the first outcome is smaller than the expected value of the second outcome:

Alt1 > Alt2 ⇔ E(w1) < E(w2) (1)

However, considering the expected value is often unsatisfactory, because, e.g., it would not explain buying a lottery ticket or signing an insurance contract — both actions reasonable under special circumstances. Decision makers often value potentially very large wins more and potentially very large losses less than the expected values would suggest. The names of the corresponding decision behaviours are "risk-sympathy" and "risk-aversion"; the decision strategy considering only the expected values corresponds to "risk-neutrality".


The risk strategy of a decision maker describes how he behaves when the outcomes of the alternatives are probabilistic variables. The Bernoulli-principle [2] from the decision theory of economic sciences is concerned with decision making under imprecision. It uses a utility function u that describes the risk strategy of the decision maker. The utility function is applied to the (uncertain) outcomes of the alternatives, so that instead of the outcomes the corresponding utilities are considered. In the case of risk-aversion, the utility function associates comparatively smaller utilities with larger outcomes, i.e., it is a concave function. In the case of risk-sympathy, relatively higher utilities are associated with high outcomes, i.e., the utility function is convex. When comparing two alternatives, instead of the expected values of their outcomes, the expected values of their utilities calculated with the utility function u are compared:

Alt1 > Alt2 ⇔ E(u(w1)) < E(u(w2)) (2)

The utility function in fact plays two roles: it describes the risk preference, but also the height preference of the decision maker. The height preference is the relative utility of an outcome in comparison to another outcome, which also plays a role in risk-free decisions. The double role of the utility function does not result, according to many decision theorists, in any limitation of the approach.

If the decision maker prefers a larger outcome to a smaller outcome (or vice versa), which is very often the case, then the height preference function is monotonic. In this case, the utility function considering both the height preference and the risk preference is also monotonic for all reasonable risk behaviors.
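As a toy illustration of Eq. (2), the following sketch shows how a nonlinear utility can reverse the ranking produced by expected values alone; the two sample distributions and the quadratic utility are our own hypothetical choices, not taken from the paper, and which curvature models which risk attitude depends on the sign convention used.

# Two hypothetical alternatives as discrete outcome distributions (outcome, probability);
# smaller outcomes (e.g. travel times in minutes) are better, as assumed in the paper.
alt1 = [(30, 1.0)]                # a certain 30
alt2 = [(1, 0.5), (58, 0.5)]      # a gamble: 1 or 58

def expected(dist, f=lambda x: x):
    return sum(p * f(x) for x, p in dist)

u = lambda x: x ** 2              # illustrative nonlinear utility

print(expected(alt1), expected(alt2))        # 30.0 vs 29.5: alt2 preferred by E(w)
print(expected(alt1, u), expected(alt2, u))  # 900.0 vs 1682.5: alt1 preferred by E(u(w))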

3.2 Probability Theory

Probability theory offers a powerful way of describing imprecise values. Classical probability theory can be defined on the basis of the Kolmogorov axioms (for simplicity we only handle discrete domains):

Definition 1. (K-Probability) A domain of values dom(A) and the power set A of the domain are given. The set-function Pr on A is a K-probability if it satisfies the Kolmogorov axioms:

0 ≤ Pr(A) ≤ 1 A ∈ A (3)

Pr(dom(A)) = 1 (4)

If A1, A2, . . . , Am (Ai ∈ A) satisfy Ai ∩ Aj = ∅ when i ≠ j:

Pr(⋃k Ak) = ∑k Pr(Ak) (5)

An important feature of K-probabilities is that if the function Pr is known for some particular events, i.e., subsets of dom(A), e.g., for the events having only one element, the function can be calculated for all other events, too.


For classical probability theory, precise probabilities are required. This is often seen as a major limitation, since precise probabilities can often not be determined. Therefore, much effort has been put into finding other, more general theories for describing imprecise data. The most widely known such theories are fuzzy measures, possibility and necessity measures [3,4], belief and plausibility measures (also called the Dempster-Shafer theory of evidence) [5,6], lower and upper probabilities [7] and lower and upper previsions [8]. The most powerful and general theory out of these is the theory of lower and upper previsions [9,10]. However, it does not only describe the possible alternatives, but also their utilities to the decision maker. This is unnecessary for our purposes, as we will explain later, because we do not expect the users to make their utility functions explicit. The next most powerful theory is the theory of lower and upper probabilities, which is sufficiently general for our purposes.

Lower and upper probabilities, also called R- and F-probabilities, interval-valued probabilities, imprecise probabilities, or robust probabilities (see e.g., [10], [11], [8]), are very powerful in describing as much, or as little, information as is available. Our brief overview is based on [12] and [10].

Definition 2. (R-Probability) An interval-valued set-function PR on A is an R-probability if it satisfies the following two axioms:

PR(A) = [PrL(A); PrU(A)], A ∈ A
0 ≤ PrL(A) ≤ PrU(A) ≤ 1, ∀A ∈ A (6)

The set M of K-probabilities Pr over A fulfilling the following conditions (also called the structure of the R-probability) is not empty:

PrL(A) ≤ Pr(A) ≤ PrU (A), ∀A ∈ A (7)

The name “R”-probability stands for “reasonable”, meaning that the lower and upper probabilities do not contradict each other, and that there is at least one K-probability function fulfilling the boundary conditions defined by the lower and upper probabilities. An R-probability can very well be redundant, i.e., some of the lower and upper probabilities could be dropped without changing the structure of the R-probability, i.e., the information content of the R-probability. An F-probability is a special type of R-probability:
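Whether a given interval-valued assignment is an R-probability, i.e. whether its structure M is non-empty, can be checked mechanically. The following sketch does this for a hypothetical three-element domain with a small linear program (the event bounds are invented for illustration; scipy.optimize.linprog is used as the solver):

from scipy.optimize import linprog

dom = ['a', 'b', 'c']
# Hypothetical interval assignments PR(A) = [PrL(A), PrU(A)] for a few events A:
bounds_on_events = {('a',): (0.1, 0.4), ('b',): (0.2, 0.5), ('a', 'b'): (0.4, 0.7)}

# Variables: Pr({a}), Pr({b}), Pr({c}), each in [0, 1], summing to 1.
A_ub, b_ub = [], []
for event, (lo, hi) in bounds_on_events.items():
    row = [1.0 if x in event else 0.0 for x in dom]
    A_ub.append(row); b_ub.append(hi)                 #  Pr(A) <= PrU(A)
    A_ub.append([-v for v in row]); b_ub.append(-lo)  # -Pr(A) <= -PrL(A)

res = linprog(c=[0.0, 0.0, 0.0], A_ub=A_ub, b_ub=b_ub,
              A_eq=[[1.0, 1.0, 1.0]], b_eq=[1.0], bounds=[(0, 1)] * 3)
print('structure M non-empty (R-probability):', res.success)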

Definition 3. (F-Probability) An R-probability fulfilling the following axioms is called an F-probability:

inf{Pr(A) : Pr ∈ M} = PrL(A) and sup{Pr(A) : Pr ∈ M} = PrU(A), ∀A ∈ A (8)

In an F-probability the boundaries PrL and PrU are not too wide; they and the corresponding structure M implicate each other. An F-probability is therefore called “representable” [11] or coherent [8]. The name “F”-probability refers to the word “feasible”, meaning that for all lower and upper probabilities there exists a K-probability from the structure realizing those values.


K-, R- and F-probabilities associate probabilities or lower and upper probabilities with every possible subset of dom(A). A partially defined R- or F-probability is defined only on some, but not on all, possible subsets of dom(A). We only give the definition for the partially defined F-probability:

Definition 4. (Partially defined F-probability) There are subsets AL and AU of A defined:

A′ = A \ {dom(A), ∅} (9)

AL ⊆ A′, AU ⊆ A′, AL ∩ AU ⊂ A′, AL ∪ AU ≠ ∅ (10)

A partially defined F-probability associates with all A ∈ AL a lower bound PrL(A) and with all A ∈ AU an upper bound PrU(A), such that a non-empty structure M of K-probabilities Pr, fulfilling the following conditions, exists:

PrL(A) ≤ Pr(A), ∀A ∈ AL
Pr(A) ≤ PrU(A), ∀A ∈ AU
inf{Pr(A) : Pr ∈ M} = PrL(A), ∀A ∈ AL
sup{Pr(A) : Pr ∈ M} = PrU(A), ∀A ∈ AU (11)

A very significant difference between precise (K-) and imprecise (R- and F-) probabilities is that additivity holds for the first, but not for the second theory. This has the following important consequence. A K-probability can be fully described with probabilities on only a relatively small subset of A, e.g., all events with a single element. From these probabilities, using the additivity feature, the probabilities of all other events can be calculated. This does not hold for R- and F-probabilities. Since the handling of all possible events is too costly in most cases, partially determined R- and F-probabilities have to be worked with, and it is a very critical issue which events are selected and associated with lower and upper probabilities.

4 Related Work

Approaches to handle imperfect data in information systems, or more specifically in database management systems, are summarized in this section. We concentrate on the description of imprecise values and the sort operation, the research foci of the presented work.

4.1 Description of Imprecise Values

Commercial database management systems only support imprecise values by “NULL” values [13,14,15,16]. That is, if a value is not precisely known, all partial information shall be ignored, the value shall be declared unknown, and the corresponding attribute shall get the value “NULL”.

The approach presented in [17] supports imprecise values by allowing a description by a set of possible values or by intervals (for ordinal domains). As an example, an attribute can get the value “25−50”, meaning that the actual value is between these two values.

Probability theory based extensions to different data models, such as the relational or the object-oriented, have been researched over the last two decades [18,19,20,21,22,23,24,25,26,27]. Both the description of imprecise values and operations on them are investigated. In the following, we focus on how imprecise values are described, and how the sort operation, essential in decision support, is extended for imprecise values.

The approach in [25] describes a classical (K-)probability in a relation. Each tuple has an associated probability, expressing that the information in the tuple is true. The sum of the probabilities in the relation equals 1. The major drawback of the approach is that for every single imprecise value a separate relation is required. The approach in [19,20] is similar to the previous one, with the difference that the constraint referring to the sum of the probabilities in a relation is dropped. It thus becomes possible to describe several imprecise values in a single relation. The approach in [18,24] supports imprecise probabilities, rather than only precise probabilities. Imprecise probabilities can be associated with any possible set of values from the domain of the attribute in question. However, there are no further criteria set, ensuring e.g., the conditions for an R- or F-probability. Hence, information in these probabilistic relations can be contradictory or redundant, which we consider a major drawback of the approach.

All of the previously introduced approaches associate probabilities with tuples. This has the disadvantage that pieces of information on a single imprecise value are scattered over several tuples. A handier approach is to encapsulate all pieces of information on an imprecise value in a single (compound) attribute value. A corresponding approach with precise probabilities is presented in [28]. The approach in [23,22], too, handles the probabilities inside compound attribute values. It also allows “missing” probabilities, a very special case of partially determined F-probabilities. The argument for supporting “missing” probabilities is that some relational operations can result in them. However, by supporting only this special case of partially determined F-probabilities, the expressive power of the data model suffers. The approach in [21] associates lower and upper probabilities only with single elements of the domain and not with each or any possible subsets of the domain. This approach has therefore, from the point of view of imprecise probabilities, a very limited expressive power, and no explanation justifying this limitation is given.

The approach in [26] supports both an interval description and, optionally, a description with K-probabilities. The approach in [27] supports imperfections both on the tuple and on the attribute levels. The underlying formalism is, however, the Dempster-Shafer theory of evidence, known also as belief and plausibility measures. This has, in comparison to the theory of imprecise probabilities, a much more limited expressive power.

Summarizing the above arguments, there is no approach known where both intervals and imprecise probabilities are supported. Furthermore, all approaches supporting imprecise probabilities allow only very special forms of imprecise probabilities with a limited expressive power, and they do so without giving a plausible explanation for this restriction.

4.2 Sort Operation on Imprecise Data

One major application area of imperfect data is decision support, where the sort operation is essential for ranking the alternatives. Still, surprisingly, there is little work on extending the sort operation, or the closely related comparison, in database management systems for imprecise data.

The approach in [26] gives a proposal for a comparison operation. The result of the comparison is determined by whether it is more probable that the first imprecise value is larger, or that the second imprecise value is larger. This approach results in an unsatisfying semantics, as the following example shows. The reason for this is that the possible values themselves are not considered, only their relation to each other. Consider, e.g., the case that the first alternative has a probability of 51% of being 1 minute better than the second alternative, and the second alternative has a probability of 49% of being 99 minutes better. The comparison operation considers the first alternative better, even though in most decision situations it is more reasonable to consider the second alternative to be better.
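The point can be made with one line of arithmetic (the expected advantage of the first alternative over the second, in minutes, using the numbers of the example above):

print(0.51 * 1 - 0.49 * 99)   # -48.0: on average the second alternative is far better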

The approach in [17] only considers binary criteria. In a comparison of two alternatives, the one for which the probability of fulfilling the binary criterion is larger is considered to be better. The semantics of this approach, too, is often not satisfactory, as the following example shows. Let us consider the binary criterion that the travel time is under 30 minutes. The first alternative fulfills this criterion with a probability of 55%, the second alternative with a probability of 50%. Hence, the first alternative will win over the second one. It is easily possible, however, that this is an unreasonable choice, e.g., when the value of the first alternative is definitely over 29 minutes, and the second alternative has a value of under 10 minutes with a probability of 50%.

Summarizing, the sort operation is a very neglected issue in the proposals for extending database technology for imprecise data, and the few available proposals have an unsatisfactory semantics for decision support purposes.

5 Description of Imprecise Values

Our first task is to specify a formalism for imprecise data with a very high expressive power. The formalism shall allow us to accommodate as much — or as little — information about the actual value as is available. It shall not force us to ignore partially available information or to give the impression that more information is available than really is.

The following description fulfills these requirements. We denote the attribute in question with A and the domain of the possible actual values with dom(A).

The first possibility to specify an imprecise value for attribute A is to specify the set of possible values, i.e., a subset of dom(A). We denote this set of possible values with PV; e.g., if there is an imprecise value w for attribute A, the set of possible actual values is w.PV (where w.PV ⊂ dom(A)).

The second possibility to specify an imprecise value is an imprecise probability (e.g., an R-, F- or partially defined F-probability). In this case, there are sets of possible values and associated lower and upper probabilities. That is, both the subsets of dom(A) for which lower probabilities are available and the lower probabilities themselves have to be specified; this applies to the upper probabilities, too. We introduce the following notation. The subsets of dom(A) for which lower probabilities are available are denoted by SS_PrL. The lower probability for a particular event e ∈ SS_PrL is denoted by Pr^e_L. For the upper probabilities a similar notation is used, only the letter L is replaced by U.

All possibilities to describe an imprecise attribute value (the set of possible actual values, and lower and upper probabilities for some subsets of the possible actual values) are optional. Hence, the expressive power of the formalism is maximal: as much, or as little, information on the actual value can be described as is available.
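A minimal sketch of a data structure for this general description (class and field names are our own; the numbers partially follow the general-form example of Fig. 1a):

from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional

@dataclass
class ImpreciseValue:
    PV: Optional[FrozenSet[int]] = None                                 # set of possible actual values
    lower: Dict[FrozenSet[int], float] = field(default_factory=dict)    # Pr_L per chosen subset
    upper: Dict[FrozenSet[int], float] = field(default_factory=dict)    # Pr_U per chosen subset

w = ImpreciseValue(
    PV=frozenset({10, 20, 40}),
    lower={frozenset({20}): 0.1},
    upper={frozenset({10}): 0.7, frozenset({10, 20}): 0.8, frozenset({40}): 0.8},
)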

5.1 Description for Decision Support

The introduced formalism to describe imprecise values is very general; its expressive power is very high. This has the drawback that supporting it in a real system is expensive in terms of realization effort, storage space and time. It is especially awkward having to describe arbitrary subsets of dom(A), needed both for the set of possible values and for the lower and upper probabilities.

If we have ordinal data, and we only want to use imprecise data for a special purpose, e.g., decision making with the Bernoulli principle, we can define a less general and less costly formalism. By appropriately choosing the restricted formalism, the decisions made will not be affected, and hence the expressive power will be unrestricted from the point of view of decision making. We assume in the following that the utility function u is monotonic, which is reasonable in most cases, as discussed previously.

First, we give a definition of when two imprecise values are considered to be equivalent.

Definition 5. Two imprecise values w1 and w2 are equivalent if the minimum and the maximum of their expected utilities (calculated according to the Bernoulli principle) do not differ from each other for any monotonic utility function.

The rationale of this definition is that if there are no differences in the minimum and the maximum values of the expected utilities, there is no difference in the information content relevant to the decision making: either the one or the other imprecise value is preferred, or no preference relation can be set up between the two imprecise values. The following theorems state the existence of an equivalent imprecise value in a simple form for an arbitrary imprecise value. The first theorem refers to the case when the imprecise values are characterized by sets of possible values only (i.e., no imprecise probabilities are available):


Theorem 1. For any imprecise value w1 characterized by the set of possible values w1.PV, there exists an equivalent imprecise value w2 characterized by a special set of possible values, a closed interval over dom(A).

The proof of the theorem and the construction of the equivalent imprecise value are straightforward. We denote the limits of the closed interval by L and U, e.g., for the value w2 by w2.L and w2.U. We now turn to the case when an imprecise value is characterized by lower and upper probabilities for arbitrary subsets of dom(A), the most general description with imprecise probabilities.

Theorem 2. Let w1 be an imprecise value characterized by lower and upper probabilities for the subsets w1.SS_PrL and w1.SS_PrU. There exists an equivalent imprecise value w2, where w2.SS_PrL = w2.SS_PrU = {{a1}, {a1, a2}, . . . , {a1, . . . , an}} (the ai denote the values in dom(A): dom(A) = {a1, . . . , an} and ai < aj when i < j), and the lower and upper probabilities represent a partially defined F-probability.

The theorem states that for an imprecise value described with a very general imprecise probability, an equivalent imprecise value of a very simple form can be defined, where only special subsets of dom(A) are considered. The proof of the theorem and the algorithm for constructing the equivalent imprecise value are beyond the space limits of this paper. The notation for this special form is as follows. The sets of subsets for which lower and upper probabilities are available follow directly from dom(A) and consequently do not need a special notation. The lower and upper probabilities for the subset {a1, . . . , ai} are denoted by Pr^≤ai_L and Pr^≤ai_U.

Example 1. Figure 1 shows an example of an imprecise value in the general form (a) and the equivalent imprecise value in the form restricted for decision support purposes (b) (the details of the calculation are not presented). The restricted form contains only an interval of the possible values, and lower and upper probabilities for some special events.

(a) Imprecise value in general form: w.PV = {10, 20, 40}, with lower and upper probabilities for some subsets of the domain, e.g. w.Pr^{20}_L = 0.1, w.Pr^{10}_U = 0.7, w.Pr^{10,20}_U = 0.8, w.Pr^{40}_U = 0.8.

(b) Equivalent imprecise value for decision support: w.L = 10, w.U = 40,
w.Pr^≤10_L = 0,    w.Pr^≤10_U = 0.7,
w.Pr^≤20_L = 0.1,  w.Pr^≤20_U = 0.8,
w.Pr^≤30_L = 0.2,  w.Pr^≤30_U = 1.

Fig. 1. Equivalent imprecise values


6 Sorting Imprecise Values

The sort operation has a central role in decision making: the alternatives have to be sorted on the basis of their utility, and the best alternative has to be chosen from the sorted list. If there are external criteria to be considered, rather than just those considered by the system when sorting the alternatives, then it may be reasonable to choose not the first, but another alternative from the sorted list. We only consider single attributes now, i.e., an alternative is described by a single, though potentially imprecise, value, and our goal is to define a decision-theoretically founded sort operation for imprecise values.

On the basis of the Bernoulli-principle an exact value — the expected utility — could be associated with each of the imprecise values, and the sort operation would be straightforward. There are two preconditions for applying this simple approach: (1) the utility function u has to be known; (2) the imprecise values have to be characterized by precise (K-)probabilities. These preconditions do not hold in our case. First, professional decision makers may have the necessary mathematical background and know their utility function explicitly; consumers cannot be expected to describe their preferences by such mathematical means, even if they very well have such preferences — implicitly. Second, our imprecise values are characterized by an interval of possible values and imprecise probabilities, rather than only by precise (K-)probabilities. For these reasons, the straightforward approach for the sort operation cannot be applied. Our sort operation has to be defined in a way that our decision makers, consumers, do not need any special mathematical background and can handle it according to their implicitly available risk strategy, which can be any reasonable risk strategy. Furthermore, the sort operation has to be defined on our general description of imprecise values, optimised for decision making.

There are only two special cases where a preference relation can be set up between two imprecise values, even if the utility function is the same for both values. The first special case is when the intervals of possible values do not overlap with each other, i.e., w1.U < w2.L. In this case, the first actual value is definitely smaller than the second actual value, and the first value (alternative) has to be chosen. The second special case is when the structure of the imprecise probability of the first imprecise value dominates the structure of the imprecise probability of the second imprecise value, i.e., ∀ai ∈ dom(A): w1.Pr^≤ai_U < w2.Pr^≤ai_L. In this case it is reasonable to choose the first value based on the available information, but we are actually not sure which actual value is larger. In most cases, however, these special cases do not hold, and no preference relation can be set up between the two alternatives. Consequently, we can only determine a partial sorting of the values (alternatives), which is complicated to present to the user (in comparison to a sorted list) and gives little support for the decision. Furthermore, these preference relations can only be applied if we assume that the same utility function holds for both imprecise values. However, it is possible that the utility functions are different for different imprecise values.

The key concept of our approach is the so-called π-cut (where “π” stands for probability). A π-cut consists of a qualifier q and a probability value v, where q ∈ {≤, ≥} and 0 ≤ v ≤ 1. A π-cut determines with which value, and, consequently, in which position, an imprecise value should appear in the result of the sort operation. We call this value the π-cut value of the imprecise value for the particular π-cut.

This π-cut value of an imprecise value is the value that is exceeded by the actual value of the imprecise value with the probability specified in the probability value of the π-cut. For example, if the probability value v of a π-cut is 0.1 (or 10%), the π-cut value of an imprecise value will be exceeded by the actual value of the imprecise value with a probability of 0.1.

The qualifier of the π-cut is necessary to handle imprecise (F-)probabilities instead of only precise (K-)probabilities. The qualifier ≤ means that the probability value of the π-cut, v, should not be exceeded by the probability of the imprecise value; the qualifier ≥ means that it should be exceeded. If the difference between the lower and upper probabilities is too large, it may occur that no appropriate π-cut value can be found on the basis of the above principles. In this case the lower and upper ends of the interval of possible values can be used. The formal definition is as follows:

Definition 6. (π-cut value) The π-cut value w.PC_π-c of a given imprecise value w for a given π-cut π-c = (q; v) is as follows:

w.PC_(q;v) =
  aj, where j = max{ i : w.Pr^≤ai_U ≤ v },   if q = “≤” and some w.Pr^≤ai_U ≤ v exists;
  w.L,                                        if q = “≤” and no such ai exists;
  aj, where j = min{ i : w.Pr^≤ai_L ≥ v },   if q = “≥” and some w.Pr^≤ai_L ≥ v exists;
  w.U,                                        if q = “≥” and no such ai exists.   (12)
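A small sketch of this computation under our reading of the definition, applied to the Alt4 data of Fig. 2 in Example 2 below (function and variable names are our own):

def pi_cut_value(L, U, lower, upper, q, v):
    # lower/upper map domain values a_i to Pr^<=ai_L and Pr^<=ai_U.
    if q == '<=':
        ok = [a for a in sorted(upper) if upper[a] <= v]
        return max(ok) if ok else L     # no suitable a_i: fall back to w.L
    else:  # q == '>='
        ok = [a for a in sorted(lower) if lower[a] >= v]
        return min(ok) if ok else U     # no suitable a_i: fall back to w.U

alt4_lower = {20: 0.01, 30: 0.15, 40: 0.15, 50: 0.50}
alt4_upper = {20: 0.15, 30: 0.30, 40: 0.45, 50: 0.70}

print(pi_cut_value(20, 60, alt4_lower, alt4_upper, '<=', 0.1))  # 20
print(pi_cut_value(20, 60, alt4_lower, alt4_upper, '>=', 0.9))  # 60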

An imprecise value occurs as many times in the result of the sort operation as there are π-cuts defined. An imprecise value appears in the result with its π-cut value and — in addition — the corresponding π-cut.

The result of this extended sort operation is indeed very easy to interpret, even for users with very little mathematical background, e.g., “This alternative (imprecise value) has a chance of less than 10% (more than 90%) of reaching this value”. Furthermore, the approach allows the user to use any risk strategy, if there are sufficiently many π-cuts set. In most cases only a few, predefined π-cuts give a sufficiently detailed result for the users. As an example, in many cases a few π-cuts, such as (≤; 0.1) and (≥; 0.9), are sufficient. For advanced users, the setting of arbitrary π-cuts should also be allowed.

Example 2. There are four alternatives available. Each of the alternatives is characterized by a key ID, a value W and a description. Two alternatives have precise and two imprecise values (see Fig. 2).

We use the π-cuts (≤; 0.1) and (≥; 0.9). The π-cut values of alternative Alt3 are PC_(≤;0.1) = 40 and PC_(≥;0.9) = 50, those of Alt4 PC_(≤;0.1) = 20 and PC_(≥;0.9) = 60. The result of the sort operation is represented in Fig. 3.


ID: Alt1;  W: 10;  Description: Karl-W. Str. 5: 2 bedroom, 45 qm, sunny . . .

ID: Alt2;  W: 30;  Description: Klosterweg 28: 3 bedroom, top floor . . .

ID: Alt3;  W: L = 30, Pr^≤30_L = 0, Pr^≤30_U = 0.06, Pr^≤40_L = 0.30, Pr^≤40_U = 0.60, U = 50;  Description: Karlsruhe Waldstadt: 2 bedroom, ground floor . . .

ID: Alt4;  W: L = 20, Pr^≤20_L = 0.01, Pr^≤20_U = 0.15, Pr^≤30_L = 0.15, Pr^≤30_U = 0.30, Pr^≤40_L = 0.15, Pr^≤40_U = 0.45, Pr^≤50_L = 0.50, Pr^≤50_U = 0.70, U = 60;  Description: Karlsruhe: 2 bedroom, 2nd floor . . .

Fig. 2. Alternatives with imprecise values

ID    W.PC_π-c   π-c
Alt1  10
Alt4  20         (≤; 0.1)
Alt2  30
Alt3  40         (≤; 0.1)
Alt3  50         (≥; 0.9)
Alt4  60         (≥; 0.9)

Fig. 3. Result of sorting imprecise values with π-cuts

The alternatives with precise values, Alt1 and Alt2, occur once; the alternatives with imprecise values, Alt3 and Alt4, occur twice in the sorted result list.

The user of the system is now looking for an appropriate alternative. Alt1 and Alt2 are, because of the external criteria, not suitable. Alt4 with a π-cut of ≤ 10% ((≤; 0.1)) is the next item in the list. The user thinks that even though this alternative has some chance to be good, considering the external criteria it is too risky to choose it. So he goes further, and finds that Alt3 is a very good alternative considering the external criteria, and so he decides to choose it.

7 Summary and Future Work

We presented a group of applications where decision support is offered for consumers by integrating information from autonomous information systems, and where the handling of imprecise data is required. The central issues in such a setting are a sufficiently powerful description of the imprecise data and the extension of the join and sort operations for imprecise data.


In the current paper, we presented a formalism with very high expressive power to describe imprecise data. We also presented a restricted formalism, much easier to handle, for the case when imprecise data is used for decision making, in other words when a sort operation has to be defined over imprecise data. The restricted formalism does not lose expressive power in this application. We also defined a sort operation appropriate for consumer decisions on the imprecise data. The main concept of the sort operation is the so-called π-cut.

Future work is required on extending the join operation (known in database management) for handling imprecise data. The join operation has a central role, too, for two reasons. First, imprecise data often occurs when integrating (or joining) data from autonomous information systems. Second, if there are several probabilistically dependent imprecise values, they still preferably have to be described in separate relations, and the dependences have to be considered in the join operation. At this point, the theory of Bayesian networks, intensively researched and applied in Artificial Intelligence, could be adapted to data models.

Acknowledgements. The author is partially supported by the Hungarian National Fund for Scientific Research (T 030586). Earlier discussions with Peter C. Lockemann and Gerd Hillebrand contributed significantly to the ideas presented.

References

1. Curtis E. Dyreson. A bibliography on uncertainty management in information systems. In Amihai Motro and Philippe Smets, editors, Uncertainty Management in Information Systems, pages 413–458. Kluwer Academic Publishers, 1997.

2. J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton, 1944.

3. Didier Dubois and Henri Prade. Possibility Theory: An Approach to Computerized Processing of Uncertainty. Plenum Pr., New York, 1988. Original: Theorie des possibilites.

4. L. A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1:3–28, 1978.

5. A. P. Dempster. Upper and lower probabilities induced by a multivalued mapping. The Annals of Mathematical Statistics, 38:325–339, 1967.

6. Glenn Shafer. A Mathematical Theory of Evidence. Princeton Univ. Pr., Princeton, New Jersey, 1976.

7. C. A. B. Smith. Consistency in statistical inference and decision (with discussion). Journal of the Royal Statistical Society, 1961.

8. Peter Walley. Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, 1991.

9. Peter Walley. Towards a unified theory of imprecise probability. In Proceedings of the First International Symposium on Imprecise Probabilities and their Applications, Ghent, Belgium, 29 June – 2 July, 1999.

10. Kurt Weichselberger. The theory of interval-probability as a unifying concept for uncertainty. In Proceedings of the First International Symposium on Imprecise Probabilities and their Applications, Ghent, Belgium, 29 June – 2 July, 1999.


11. Peter J. Huber. Robust Statistics. John Wiley & Sons, 1981.

12. Kurt Weichselberger. Axiomatic foundations of the theory of interval-probability. In Symposia Gaussiana, Proceedings of the 2nd Gauss Symposium, pages 47–64, 1995.

13. Edgar F. Codd. Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4):397–434, December 1979.

14. Witold Lipski. On semantic issues connected with incomplete information databases. ACM Transactions on Database Systems, 4(3):262–296, 1979.

15. Joachim Biskup. A foundation of Codd's relational maybe-operations. ACM Transactions on Database Systems, 8(4):608–636, December 1983.

16. Tomasz Imielinski and Witold Lipski. Incomplete information in relational databases. Journal of the ACM, 31(4):761–791, October 1984.

17. Joan M. Morrissey. Imprecise information and uncertainty in information systems. ACM Transactions on Information Systems, 8(2):159–180, April 1990.

18. Michael Pittarelli. Probabilistic databases for decision analysis. International Journal of Intelligent Systems, 5:209–236, 1990.

19. Debabrata Dey and Sumit Sarkar. A probabilistic relational data model and algebra. ACM Transactions on Database Systems, 21(3):339–369, September 1996.

20. Debabrata Dey and Sumit Sarkar. PSQL: A query language for probabilistic relational data. Data and Knowledge Engineering, 28:107–120, 1998.

21. Laks V. S. Lakshmanan, Nicola Leone, Robert Ross, and V. S. Subrahmanian. ProbView: A flexible probabilistic database system. ACM Transactions on Database Systems, 22(3):419–469, September 1997.

22. Daniel Barbara, Hector Garcia-Molina, and Daryl Porter. The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering, 4(5):478–502, 1992.

23. Daniel Barbara, Hector Garcia-Molina, and Daryl Porter. A probabilistic relational data model. In F. Bancilhon, C. Thanos, and D. Tsichritzis, editors, Advances in Database Technology, EDBT'90, International Conference on Extending Database Technology, Venice, Italy, March 1990, volume 416 of Lecture Notes in Computer Science, pages 60–74. Springer-Verlag, 1990.

24. Michael Pittarelli. An algebra for probabilistic databases. IEEE Transactions on Knowledge and Data Engineering, 6(2):293–303, April 1994.

25. Roger Cavallo and Michael Pittarelli. The theory of probabilistic databases. In Proceedings of the 13th VLDB Conference, Brighton, pages 71–81, 1987.

26. Curtis E. Dyreson and Richard T. Snodgrass. Supporting valid-time indeterminacy. ACM Transactions on Database Systems, 23(1):1–57, March 1998.

27. Ee-Peng Lim, Jaideep Srivastava, and Shashi Shekhar. An evidential reasoning approach to attribute value conflict resolution in database integration. IEEE Transactions on Knowledge and Data Engineering, 8(5):707–723, 1996.

28. Yoram Kornatzky and Solomon Eyal Shimony. A probabilistic object-oriented data model. Data and Knowledge Engineering, 12(2):143–166, 1994.


Genetic Programming: A Parallel Approach

Wolfgang Golubski

University of Siegen, Department of Electrical Engineering and Computer Science
Hölderlinstr. 3, 57068 Siegen
[email protected]

Abstract. In this paper we introduce a parallel master-worker model for genetic programming where the master and each worker have their own equal-sized populations. The workers execute in parallel starting with the same population and are synchronized after a given interval, where all worker populations are replaced by a new one. The proposed model will be applied to symbolic regression problems. Test results on two test series are presented.

1 Introduction

The importance of regression analysis in economy, social sciences and other fields is well known. The objective of solving a symbolic regression problem is to find an unknown function f which best fits given data (xi, yi), i = 1, . . . , k, for a fixed and finite k ∈ IN. With this function f the output y for arbitrary x not belonging to the data set can be estimated. The genetic programming approach [5] is one of the successful methods to solve regression problems. But genetic programming (abbreviated as GP) can be a very slow process.

Hence parallelization could be a possible alternative to speed up the evaluation process.

Small computer networks consisting of a handful of computers (e.g. five machines) are accessible for and manageable by many people. Therefore the starting point of our investigation was the question whether a simple parallelization model of the GP approach can lead to significant improvements in the solution search process.

The aim of this paper is to answer this question and to present a parallel master-worker model applied to a basic (sequential) genetic programming algorithm in order to find more suitable real functions and thus work better than a basic GP. The most important quality feature is given by the success rate, that is, the measure of how often the algorithm evaluates an acceptable solution.

The parallel model presented here works with multiple populations, where the master and each worker have their own equal-sized populations. The workers execute in parallel starting with the same population and are synchronized after a given interval, where all worker populations are replaced by a new one.

To show the quality of our model we present test results of 6 test series. Our proposed parallel model is applied to 42 different regression problems on polynomial functions of degree not higher than 10. The presented results are promising.

The paper is structured as follows. In the next section we present the basic genetic programming algorithm which is used for finding good fitting functions in the sequential case. In the third section we describe the parallel master-worker model. Then in Section 4 we present our test results. Before finishing the paper with a conclusion we give a literature review in the fifth section.

2 Basic Genetic Programming

Koza [5] used the idea of a genetic algorithm to evolve LISP programs solving a given problem. Genetic algorithms manipulate the encoded problem representation, which is normally a bit string of fixed length. Crossover, mutation and selection are operators of evolution applied to these bit string representations. Over several generations these operators modify the set of bit strings and lead to an optimization with respect to the given fitness function. So genetic algorithms perform a directed stochastic search. In order to apply a genetic algorithm to the problem of regression we would need to know the structure of the solution, which is given by the function itself.

In genetic programming, Koza used tree structures instead of string representations. He wanted to evolve a LISP program itself for a given task. LISP programs are represented by their syntactic tree structure. Starting with a set of LISP programs, new programs are evolved by recombination and selection until a sufficiently good solution has been found. Crossover operators operate on the tree structures by randomly selecting and exchanging sub-trees between individuals. This way a new tree, which stands for a new program, is generated. Each individual is evaluated according to a given fitness function. Out of the set of individuals of the same generation the best programs are chosen to build the offspring of the next generation. This process is also called selection. In contrast to evolutionary or genetic algorithms, no mutation operators are used in the genetic programming approach. So the tree structure representation of each individual in genetic programming overcomes the problem of fixed-length representation in genetic algorithms.

Let us now describe the way of representing a real function as a tree. The prefix notation of a real function can easily be transformed into a tree structure where arithmetic operators such as +, · and − as well as pdiv stand for the nodes of the tree and variables of the function are the leaves of the tree. As an example we consider the function x³ + x² + 2∗x. The corresponding prefix notation has the form +(+(∗(∗(x, x), x), ∗(x, x)), +(x, x)). The corresponding tree is drawn in Figure 1. The application of genetic programming to regression problems is well known, see [5,3]. Therefore we call the approach presented in this section basic genetic programming. The main parts are described next in more detail:

1. Initialization
The genetic algorithm is initialized with a sufficiently large set of randomly generated real functions. The size (depth) of each tree is initially restricted so that only simple functions are in the initial generation. The leaves of each tree are randomly selected out of a given interval.


Fig. 1. Tree representation of the function x³ + x² + 2x


2. Selection
The fitness of each member of the previously generated population is computed, i.e. the total error over all sample points. The functions with the smallest errors are the fittest of their generation.

3. Stopping Criterion
If there exists a real function (tree) with an error smaller than a predefined fitness threshold, the genetic program has found an appropriate solution and stops. Otherwise the algorithm goes on to step 4.

4. Recombination
A recombination process takes place to generate 90% of the population of the next generation. Two of the fittest functions are randomly chosen and recombined. For this, one node of the first tree is randomly selected as its crossover point. In the same way a crossover point of the second function is chosen. Both crossover nodes, including their subtrees, are exchanged, resulting in two new trees; see Figure 2 for an example and the code sketch after this list.

5. Reproduction
10% of the next generation are chosen directly from the fittest trees of the previous generation. With reproduction and recombination the whole new generation is generated.

6. The genetic program continues with step 2 until a given number of generations is reached.

The parameter settings are summarized in Tables 1 and 2.
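The tree representation and the subtree crossover of step 4 can be illustrated with the following minimal sketch in Java (the implementation language mentioned in Section 3). The class and method names are ours, not taken from the paper, and pdiv is assumed to be protected division returning 1 on a zero divisor, a common convention that the paper does not spell out.

    // Minimal sketch of the function-tree representation and subtree crossover (step 4).
    // All names are illustrative; pdiv is assumed to be protected division.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    class Node {
        String op;          // "+", "-", "*", "pdiv" for inner nodes; "x" or a constant for leaves
        Node left, right;   // null for leaves

        Node(String op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }

        double eval(double x) {
            switch (op) {
                case "+":    return left.eval(x) + right.eval(x);
                case "-":    return left.eval(x) - right.eval(x);
                case "*":    return left.eval(x) * right.eval(x);
                case "pdiv": double d = right.eval(x);           // protected division (assumed: 1 on division by zero)
                             return d == 0.0 ? 1.0 : left.eval(x) / d;
                case "x":    return x;
                default:     return Double.parseDouble(op);      // numeric leaf
            }
        }

        // Collect all nodes so a random crossover point can be picked.
        void collect(List<Node> out) {
            out.add(this);
            if (left != null)  left.collect(out);
            if (right != null) right.collect(out);
        }
    }

    class Crossover {
        static final Random RND = new Random();

        // Exchange a randomly chosen subtree of a with a randomly chosen subtree of b (in place).
        static void crossover(Node a, Node b) {
            List<Node> na = new ArrayList<>(); a.collect(na);
            List<Node> nb = new ArrayList<>(); b.collect(nb);
            Node pa = na.get(RND.nextInt(na.size()));
            Node pb = nb.get(RND.nextInt(nb.size()));
            // Swapping the contents of the two crossover nodes exchanges the corresponding subtrees.
            String op = pa.op; Node l = pa.left; Node r = pa.right;
            pa.op = pb.op; pa.left = pb.left; pa.right = pb.right;
            pb.op = op;    pb.left = l;       pb.right = r;
        }
    }

For example, the tree of Figure 1 corresponds to nesting Node("+", Node("+", Node("*", Node("*", x, x), x), Node("*", x, x)), Node("+", x, x)) with x built as new Node("x", null, null).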

3 Parallel Genetic Programming

The starting point of our investigation was the question whether a simple parallelization of the basic genetic programming approach would lead to a significant improvement of the success rate and the solution search.


Table 1. Basic Parameter Settings

Parameter                    Value
# test functions             42
max. generation              50
fitness type                 AbError, see Eq. (2)
fitness threshold            10^-6
recombination probability    90%
reproduction probability     10%
function set                 ×, +, −, pdiv
terminal set                 x

Strongly influenced by the master-worker or master-slave paradigm of parallel and distributed systems [2], we have implemented the genetic programming approach as master-worker programs (processes). The master process fulfills two tasks: (1) the management of the worker communications, including synchronization, and (2) the handling of the fittest individual sets. That means that the master stores the population (more precisely, the fittest individuals). Each worker unit is responsible for two (sub-)processes: (1) handling the communication with the server and (2) executing the basic GP algorithm. That means that each worker stores its own population and executes the GP operations such as recombination, reproduction and fitness evaluation. Without the communication part, a worker behaves like a usual GP, e.g. as described in Section 2.

In more detail, the master-worker parallel genetic programming model works as follows, see Figure 3; the following numbers refer to this figure. The master and the worker processes are started, see 1. The master waits until all workers are ready to work. Then the master sends the parameter settings (i.e. the parameter settings of the basic GP and the synchronization number) to each worker. The synchronization number represents the number of generation steps to be executed on a worker without a break, unless a solution is found; whenever a solution is found, the whole execution process is stopped. Each worker initializes the basic GP algorithm with the received parameter settings and creates its start population. Then the worker performs as many generation steps as given by the synchronization number, see 2. Now the basic GP is interrupted and the current population is transmitted to the master process, see 3. During this time the master waits for the workers' responses in order to collect all workers' fittest individual sets. Once all worker transmissions have been received, the master picks the best individuals, i.e. the workers' fittest individual sets are sorted by fitness value and the size of this set is reduced so that the sizes of the fittest individual sets on master and workers are the same. Next the master broadcasts the new set of fittest individuals to each worker process, see 4. Each worker replaces its fittest individual set by the received one and resumes the GP execution. The steps just described are repeated until one worker finds a successful solution or the prescribed maximal number of generation steps has been reached.


Fig. 2. Crossover between two programs, where the dashed circles are the crossover points (Ci and Di are variables)

Our model is implemented with Java and Java RMI.
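The master's synchronization step can be sketched as follows. This is an illustrative Java sketch only: the paper's implementation uses Java RMI, whereas here a plain interface stands in for the remote worker stubs, and all names (WorkerProxy, Individual, Master) are ours.

    // Sketch of the master's synchronization step (numbers 2 to 4 in Figure 3):
    // let each worker run for the synchronization number of generations, collect the
    // workers' fittest sets, merge and sort them by fitness, truncate to the configured
    // set size, and return the result for broadcasting back to every worker.
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    interface WorkerProxy {
        // Stand-in for a remote stub: run the basic GP for the given number of steps,
        // starting from the supplied fittest set, and return the worker's new fittest set.
        List<Individual> runGenerations(int steps, List<Individual> fittest);
    }

    class Individual {
        Object tree;        // function tree (see the sketch in Section 2)
        double fitness;     // AbError of Eq. (2); smaller is better
    }

    class Master {
        List<Individual> synchronize(List<WorkerProxy> workers, List<Individual> fittest,
                                     int syncSteps, int setSize) {
            List<Individual> merged = new ArrayList<>();
            for (WorkerProxy w : workers) {
                merged.addAll(w.runGenerations(syncSteps, fittest));      // collect fittest sets
            }
            merged.sort(Comparator.comparingDouble((Individual ind) -> ind.fitness)); // best first
            return new ArrayList<>(merged.subList(0, Math.min(setSize, merged.size())));
        }
    }

Each worker would, in turn, replace its own fittest individual set by the broadcast one and resume the basic GP of Section 2, as described above.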

4 Results

In this section we discuss how our parallel genetic programming approach is tested. The system has been applied to symbolic regression problems. We have chosen 42 polynomial functions of degree not higher than 10,

f(x) = Σ_{i=1}^{10} a_i · x^i        (1)

where a_i ∈ Z and Z denotes the set of integers. Two functions (the quintic and sextic polynomials) are taken from the literature [5] and the other ones are randomly generated. A data set is generated for each of these functions by randomly choosing real numbers x_i in a predefined interval [-1,1]. The outputs Y_i are computed as f(x_i) = Y_i for each of the previously chosen functions. In our tests we are using 50 data pairs per function. For each of these functions we run 100 differently initialized genetic programs in order to see how well our method performs. The fitness function is defined as the absolute error, i.e.


Fig. 3. The Master-Worker Genetic Programming Model (master and four workers, each holding its own population and set of fittest individuals)

AbError = Σ_{i=1}^{50} |f(x_i) − Y_i|        (2)
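As a concrete illustration of the test setup and of Eq. (2), the data-set generation and the AbError fitness can be sketched in Java as follows. The names are ours, not the paper's code, and the coefficient range used for the random polynomials is an assumption (the paper only requires integer coefficients).

    // Illustrative sketch: generate 50 sample pairs (x_i, Y_i) for a random polynomial of
    // degree at most 10 with integer coefficients (Eq. (1)), and compute the AbError fitness
    // of a candidate function over these pairs (Eq. (2)).
    import java.util.Random;
    import java.util.function.DoubleUnaryOperator;

    class RegressionData {
        static final Random RND = new Random();

        // Random target polynomial f(x) = sum_{i=1}^{10} a_i * x^i; the range {-9,...,9}
        // for the integer coefficients is an illustrative choice, not stated in the paper.
        static double[] randomCoefficients() {
            double[] a = new double[11];                 // a[0] unused, a[1]..a[10]
            for (int i = 1; i <= 10; i++) a[i] = RND.nextInt(19) - 9;
            return a;
        }

        static double polynomial(double[] a, double x) {
            double sum = 0.0, p = 1.0;
            for (int i = 1; i <= 10; i++) { p *= x; sum += a[i] * p; }
            return sum;
        }

        // 50 data pairs with x_i drawn uniformly from the predefined interval [-1, 1].
        static double[][] samplePairs(double[] a) {
            double[][] pairs = new double[50][2];
            for (int i = 0; i < 50; i++) {
                double x = 2.0 * RND.nextDouble() - 1.0;
                pairs[i][0] = x;
                pairs[i][1] = polynomial(a, x);
            }
            return pairs;
        }

        // AbError of Eq. (2): sum of absolute errors of the candidate over all data pairs.
        static double abError(DoubleUnaryOperator candidate, double[][] pairs) {
            double err = 0.0;
            for (double[] p : pairs) err += Math.abs(candidate.applyAsDouble(p[0]) - p[1]);
            return err;
        }
    }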

If a function has an AbError smaller than the given threshold value, the function represents a successful solution of the problem. The complete parameter settings of the genetic programming algorithm are listed in Tables 1 and 2.

Six parameter configurations, more precisely two parameter settings differing in the population size, have been applied in each case to the basic genetic programming approach, the parallel model with two workers and the parallel model with four workers. The synchronization interval is set to 5 worker generation steps. The obtained results are also listed in Table 2.

Regarding the results, it can be seen so far that our proposed method performs quite well. Both test series (T1xx and T2xx) show the same behavior. The basic version T1 does not have a good success rate. However, the underlying genetic programming algorithm was neither extensively tested to obtain acceptable parameter settings nor optimized as done in [3]. Applying the parallel model to these poor parameter settings already leads to a dramatic improvement of the success rate: the 4-worker version T1-W4 finds a problem solution in 67% of all cases. The T2 test series behaves similarly to T1. The parallel model leads to a significant improvement although the success rate of the basic version was already acceptable (i.e. 75%); in the 4-worker system most runs (i.e. 96%) deliver an acceptable function description. Regarding the total number of generation steps used we see a similar relationship: the basic versions use many more generation steps than the parallel ones.


Table 2. Additional Parameter Settings and Test Results

Parameter                      T1      T1-W2   T1-W4   T2      T2-W2   T2-W4
population size                100     100     100     500     500     500
# fittest individuals          14      14      14      72      72      72
Only for parallel:
# workers                      -       2       4       -       2       4
# synchronization steps        -       5       5       -       5       5
total # executed generations   161773  134385  102130  78400   49345   32450
max. # executed generations    210000  210000  210000  210000  210000  210000
generation rate                77%     64%     48%     37%     23%     15%
successful runs                1378    1985    2678    3158    3685    4028
max. # runs                    4200    4200    4200    4200    4200    4200
success rate                   31%     47%     67%     75%     88%     96%

It is also interesting that, in most of the runs in which a genetic program stopped by reaching the fitness threshold, the algorithm needed only a small number of generations (< 20) to find a suitable real function. So it looks like our method performs quite well. However, we are running more tests in order to verify this for other, more complicated functions as well, e.g. fuzzy functions [4].

5 Comparison to Existing Approaches

Let us now review the literature. There are numerous papers on parallel genetic algorithms (see [1,7] for a good overview with more literature references) but only a few on parallel genetic programming [8,6].

First we summarize in a few words the most important approaches. Usually one can divide parallel genetic algorithms into three main categories: (1) global single-population master-slave algorithms, where the fitness evaluation is parallelized; (2) single-population fine-grained algorithms, suited for massively parallel computers, where each processor holds one individual; and (3) more sophisticated multiple-population coarse-grained algorithms, where each population exchanges individuals with the others at a fixed exchange rate. The connection of populations is strongly influenced by the underlying network topology (e.g. hyper-cubes); a population can only exchange individuals with a population in its neighborhood. The parallel genetic programming approaches [8,6] are both of category (3).

To sum up, it can be said that the parallel model presented in Section 3 can be characterized by (i) the master-worker paradigm, (ii) multiple populations, where master and workers have their own equal-sized populations, (iii) workers executing in parallel starting with the same population (except in the initialization phase), (iv) synchronization after a given interval, at which all worker populations are replaced by a new one, and (v) an overall algorithm that does not behave like the basic GP.


The proposed parallel master-worker model is obviously different from parallel genetic algorithms of kinds (1) and (2). In contrast to (1), each worker has its own population. The massively parallel approach of kind (2) is not comparable at all. In some sense our model can be regarded as an exotic version of the multiple-population approach (3), but there the underlying topology is usually not a star, nor is there central management of the population exchange (performed in our model by the master); that is, the construction of a new worker population is missing. The approach of [8] is the most similar to ours, but the points just stated (i.e. topology and exchange) still apply, and the implementation technique (using MPI) is different. Our model is instead influenced by the client-server and master-worker paradigms known from distributed systems.

6 Conclusions and Further Work

We presented a master-worker model suited for executing genetic programming in parallel. The proposed method shows quite good results in solving regression problems. What has to be done next is to run more tests on more complicated problems.

References

1. E. Cantu-Paz: A survey of parallel genetic algorithms. Calculateurs Paralleles, Reseaux et Systemes Repartis, Vol. 10, No. 2, Paris: Hermes (1998) 141-171
2. G. Coulouris, J. Dollimore and T. Kindberg: Distributed Systems - Concepts and Design, 3rd Edition. Addison-Wesley (2000)
3. J. Eggermont and J.I. van Hemert: Adaptive Genetic Programming Applied to New and Existing Simple Regression Problems. Proceedings of the Fourth European Conference on Genetic Programming (EuroGP 2001), Lecture Notes in Computer Science Vol. 2038, Springer (2001)
4. W. Golubski and T. Feuring: Genetic Programming Based Fuzzy Regression. Proceedings of KES2000, 4th International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, Brighton (2000) 349-352
5. J.R. Koza: Genetic Programming II. Cambridge, MA: MIT Press (1994)
6. M. Oussaidene, B. Chopard, O. Pictet and M. Tomassini: Parallel Genetic Programming and its Application to Trading Model Induction. Parallel Computing, 23 (1997) 1183-1198
7. M. Tomassini: Parallel and Distributed Evolutionary Algorithms: A Review. In: K. Miettinen, M. Makela, P. Neittaanmaki and J. Periaux (eds.): Evolutionary Algorithms in Engineering and Computer Science, J. Wiley and Sons, Chichester (1999) 113-133
8. M. Tomassini, L. Vanneschi, L. Bucher and F. Fernandez: A Distributed Computing Environment for Genetic Programming using MPI. In: J. Dongarra, P. Kacsuk and N. Podhorszki (eds.): Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Vol. 1908, Springer (2000) 322-329


Software Uncertainty

Manny M. Lehman1 and J.F. Ramil2

1 Department of Computing, Imperial College, 180 Queen's Gate, London SW7 2BZ, UK
[email protected]
2 Computing Dept., Faculty of Maths and Computing, The Open University, Walton Hall, Milton Keynes MK7 6AA
[email protected]

Abstract. This paper presents reasoning implying that the outcome of the execution of an E-type program or E-type software system (software for short), of whatever class, is not absolutely predictable. It is intrinsically uncertain. Some of the sources of that uncertainty are identified and it is argued that the phenomenon qualifies as a Principle of Software Uncertainty. The latter represents an example of an assertion in a Theory of Software Evolution, which is ripe for development based on empirical generalisations identified in previous research, most recently in the FEAST projects. The paper briefly discusses some practical implications of uncertainty, and of the other concepts presented, for evolution technology and software processes. Though much of what is presented here has previously been discussed, its presentation as a cohesive whole provides a new perspective.

1 Program Specification

When computers first came into general use it was taken for granted, and may still be in some circles, that, subject to correct implementation, the results of computations not violating the domain limits set by the specification would be absolutely predictable. This paper argues that, as a consequence of the inevitable changes in the real world in which a computing system operates, the outcome of program execution in the real world cannot and should not be taken for granted.

In general, a specification [34] prescribes domain boundaries and defines properties of the required and acceptable program behaviour when executed, and the nature of the inputs to and outputs from the various operations that the program is to perform. It also addresses the quality of the results of execution, in terms of, for example, numerical precision. The need for a specification is generally accepted as basic for the initial development process. Specifications can provide many benefits, particularly when stated in a formal notation [11]. Whether formal or not, they provide developers with an explicit statement of the purpose the program is to address, precisely what is required, and a guide for potential users of the software to the facilities available from the program, its capabilities and limitations. It is also realised by some that a specification is important as a means for facilitating the inevitable evolution, enhancement and extension of the software.


Meeting these expectations requires that the system and software requirements were satisfactorily identified and stated in the first place and as subsequently modified during development, usage and evolution. It is also necessary that the software and the larger system that includes the designated hardware on which it is to run, and other associated software such as an operating system, have been appropriately expressed, adequately validated (for example by comprehensive testing) and documented. The role and influence of any other elements, including humans, involved in system operation must also be taken into account. Finally, it is necessary that the software has been correctly installed and that the hardware is operating without fault. However, in his 1972 Turing Lecture, Dijkstra [6] pointed out that testing can, at best, demonstrate the presence of faults, never their absence. To ensure satisfaction of the specification, the software must be verified by demonstrating its correctness, in the full mathematical sense of the term [9], [10], relative to that specification. This, of course, requires that it has been derived from a formal specification or otherwise constructed in a way that enables and facilitates proofs of correctness or other forms of verification. Only where this is possible is the term correctness meaningful in its mathematical sense.

Specifications may take many forms. To be appropriate in the context of verification a specification must be formal, complete in the sense that it reflects all properties identified as critical for application success, free of inconsistencies, and at a level of granularity and explicitness that permits implementation with a minimum of assumptions; a topic further considered below. It is also desirable that the formalisms used facilitate representation of the domains involved. These prerequisites are not easily met for all types of applications and programs. They are clearly met by programs of type S, as described below. These operate in an isolated, abstract and closed, for example mathematical, domain. The need for software systems of this type is somewhat restricted. They are exemplified by programs used in programming lectures and textbooks and, more significantly, by programs that address completely defined problems or situations in a well understood and fully defined domain [33]. The latter would include mathematical function or model evaluation in the physical sciences, for example, in calculating the focal length of a lens to be used in a prescribed and precisely defined instrument. This class of programs stands in sharp contrast to those that operate in and interact with a real world domain, termed E-type systems. The outputs of the latter are used in some practical application or problem solution within some prescribed domain. Such programs, which have become pervasive at all levels of business and other organisations, are therefore of universal interest. They do, however, present major challenges. In the present discussion, it is their behavioural properties that are of particular relevance. The behaviour of E-type software cannot be fully specified and formalised as required for demonstrations of correctness. Aspects of their behaviour can be formally specified. This produces, at best, a partial specification. Moreover, when such formal partial specifications are available, obstacles to the demonstration of correctness are likely to arise as a result of system size and various sources of complexity, constraints that cannot be further considered here. In discussing uncertainty in software, what is of interest is the obstacle to complete specification and formalisation that arises from the fact that the application domains involved are essentially unbounded in the number of their attributes and that many of these attributes are subject to change. Moreover, when humans are involved in system operation, the unpredictability of individual human behaviour also makes appropriate formalisation at best difficult, more generally impossible.


Thus correctness, in the formal sense, is meaningless in the E-type context. One can merely consider how successful the system is in meeting its intended purpose. But need this be a matter for concern? Demonstrations of correctness, even of those parts of a system whose specification can be adequately formalised, are likely to be of little interest to real world users or other stakeholders, who are, by the way, often likely to hold inconsistent viewpoints [8]. Some increase in confidence that the software as a whole will satisfy its stakeholders may result from a demonstration of correctness against a partial specification and/or that parts of the program are correct with respect to their specification. But the concern of stakeholders will generally be with the results of execution. It is those that are assessed and applied in the context of the problem being solved or the application being pursued in the operational domain. What matters is the validity and applicability of the results obtained and the consequential behaviour when the outputs of execution are used. Whether system execution is successful will be judged in terms of whatever criteria had previously been explicitly or implicitly determined or the behaviour that was desired and expected. This is the ultimate criterion of software system acceptability in the real world. A program classification briefly described below is useful for clarification of these concepts.

2 SPE Classification Scheme

2.1 Type S

The SPE software classification scheme has been described and discussed many times [15], [17], [30]. Two views of S-type programs have been given in the various discussions of the term. The first considers S-type programs as those operating in an abstract and closed, for example mathematical, domain. Their behaviour may be fully and absolutely specified [33]. As already suggested, this situation has only limited practical relevance. The second view is, however, of wider relevance for computing in the real world. It identifies S-type programs as those for which the sole criterion for accepting satisfactory completion of a development contract (whatever its nature) is that the completed product satisfies a specification reflecting the behavioural properties expected from program execution. As stated above, this presupposes the existence of an appropriate formal specification, accepted by the customer as completely defining the need to be fulfilled or the computational solution desired. Such a specification becomes a contract between a supplier and the representative of prospective users. Whether the results of execution are useful in some sense, whether they provide a solution to the specified problem, will be of concern to both users and producers. However, once the specification has been contractually accepted and the product has been shown to satisfy it, the contract has been fulfilled, by definition. Thus, if the results do not live up to expectations or need to be adjusted, that is, if their properties need to be redefined, rectification of the situation requires a new, revised specification to be drawn up and a new program to be developed. Depending on the details of the changes required, new versions may be developed from scratch or obtained by modification of those rejected. However achieved, it is a new program.


2.2 Type E

Type E software has been briefly described in Section 1. As, by definition, a program used to solve real world problems or support real world activity, it is the type of most general interest and importance. Conventionally, its development is initiated by means of eliciting requirements [e.g. 27]. At best, only parts of the statement of requirements, mathematical functions for example, can be fully described, defined and formalised. The criterion of correctness is replaced by validity or acceptability of the results of execution in the real world application. If there are any results that do not meet the need for which the development was triggered in the first place, the system is likely to be sent back for modification. Whether the source of dissatisfaction originates in system conception, specification, design and/or implementation is, in the context of the present discussion, irrelevant, though different procedures to identify the source of dissatisfaction and to implement the necessary changes or take other action must be adopted in each situation. Stakeholders have several choices. For example, other software may be acquired, the activity to be supported or its procedures may be changed to be compatible with the system as supplied, or the latter may be changed. If the third approach is taken, one may be faced with the alternatives of changing code or documentation or both.

Since the real world and activities within it are dynamic, always changing, such changes are not confined to the period when the system is first accepted and used. As stated in the first law of software evolution [12], [14], [15], briefly described as “the law of continuing change”, similar choices and the need for action will arise throughout the system's active lifetime. The system must be evolved, adapted to changing needs, opportunities and operational domain characteristics. The latter may be the result of exogenous changes or of installation and use of the system, actions that change the operational domain and are likely to change the application implemented or the activity supported. Evolution is, therefore, inevitable, intrinsic to the very being of an E-type system. This is illustrated by Figure 1, which presents a simplified, high level view of the software evolution process. Note that the number and nature of the steps in the figure merely typifies a sequence of activities that make up a software process and is not intended to be definitive. A more detailed view would show it to be a multi-level, multi-loop, multi-agent feedback system [2], [14], [22]. We briefly return to this point later in the paper.

2.3 Type P

P-type software was included in the original schema to make the latter as inclusive as possible. It covers programs intended to provide a solution to a problem that can be formally stated even though approximation and consideration of real world issues are essential for its solution. It was suggested that this type could, in general, be treated as one or other of the other two. But there are classes of problems, chess players and some decision support programs, for example, that do not satisfactorily fit into the other classes, and for completeness the type P is retained. However, when type P programs are used in the real world they acquire type E properties and, with regard to the issue of uncertainty, for example, they inherit the properties of that type. Hence they need not be separately considered in the present paper.


Fig. 1. Stepwise program development omitting feedback loops

2.4 S-type as a System Brick

S-type programs are clearly of theoretical importance in exposing concepts such as complete specification, correctness and verification. Examples studied, and even used, in isolation are often described as toy programs. Their principal value is pedagogical. However, where the behaviour and properties of a large system component can be formally specified, it may, in isolation, be defined to be of type S, giving it a property that can play an important role in achieving desired system properties. The fact that its properties are fully defined and verifiable indicates the contribution it can make to overall system behaviour. The specification, though complete in itself, is generally based on assumptions about the domain within which it executes. However, when so integrated into a host E-type system, that system becomes the execution domain. As is discussed further below, this leads to a degree of uncertainty in system behaviour which is, in turn, reflected in component behaviour. A component of type S when considered in isolation acquires the properties of type E once it is integrated into and executed within an E-type system and domain.

Clearly the S-type property can make a contribution to achieving required software behaviour. It cannot guarantee such behaviour. It has a vital role to play in system building that will become ever more important as the use of component based architectures, COTS and reuse becomes more widespread. Knowledge of the properties of a component and of any assumptions made, explicitly or implicitly, during its conception, definition, design and implementation, whether in ab initio development for direct use, as a COTS offer or in development for organisational reuse, is vital for subsequent integration into a real world system. It has long been recognised that formal specification is a powerful tool for recording, revealing and understanding the content and behaviour of a program [11], provided that the latter is not too complex or large.


It would appear essential for COTS and reuse units to be so specified, considered and processed as S-type programs. The most general view of S-type programs sees them as bricks in the construction of large systems.

The S-type concept is significant for another reason, one that has immediate practical implications. Development of elemental units for integration into a system is normally assigned to one individual, or to a small group. The activity they undertake generally involves assumptions, conscious or unconscious, that resolve on-the-spot issues that arise as the design and code or documentation text is evolved. The resolutions of issues that are adopted will generally be based on time- and space-local views of the problem and/or the solution being implemented and may well be at odds with the design or implementation of other parts of the system. Even if justified at the time, the assumption(s) that a resolution requires will be reflected in the system and can become a source of failure at a later time when changes in, for example, the application domain invalidate it (them). Assumptions become a seed for uncertainty. Basing work assignments to an individual or small group on a formal specification strictly limits their freedom of choice and the uncertainty in the assignment that forces local decisions, decisions that are candidates for subsequent invalidation. Specifications of practical real world systems, such as those already in general use or to be developed in the future, cannot, in general, be fully formalised. Those of many of the low level elements, the modules that are the bricks from which the system is to be constructed, can. The principle that follows is that, wherever possible, work assignments to individuals, implementers or first line managers should be of type S. Application of the S-type program concept can simplify management of the potential conflicts and the seeds of invalid assumptions that must arise when groups of people implement large systems concurrently, with individuals taking decisions in relative isolation.

But even S-type bricks are not silver bullets [3]. No matter how detailed the definition when an S-type task is assigned, issues demanding implementation-time decisions will still be identified by the implementer(s) during the development process. In principle, they must be resolved by revision of the specification and/or documentation. However, explicit or implicit, conscious or unconscious adoption of assumptions by omission or commission will inevitably happen, may remain unrecorded and may eventually be forgotten. Even when conscious, they are adopted by implementers who conclude that the assumptions are justified on the basis of a local view of the unit, the system as a whole and its intended application. With the passage of time, however, and the application and domain changes that must be expected, some of these assumptions are likely to become invalid. Thus even S-type bricks carry seeds of uncertainty that can cause unanticipated behaviour or invalid results. The use of S-type bricks minimises the likelihood of failure, the uncertainty at the lowest level of implementation, where sources of incorrect behaviour are most difficult to identify. It does not reduce it to zero. As described below, the assumption issues raised are inescapable, a major source of program execution uncertainty.

It is worth noting that in practice the formal specification of each of the S-type bricks does not tend to be a one-off activity. As implementation progresses and new understanding emerges, the specification itself must be updated to reflect properties (e.g., bounds of variables) that only become apparent when the low-level-of-abstraction issues are tackled and/or when emergent properties of the larger system are identified. This issue is recognised at the S-type level, for example, by a recently proposed retrenchment method [1], which acknowledges the fact that not only the program but also its formal specification is likely to require change and adaptation as implementation progresses and usage and subsequent program evolution take over, and provides tools to help address it.



2.5 The Wider SPE Concept

As described so far, the SPE classification relates only to programs and to integrated and interacting assemblies of such programs into what has been loosely termed systems. The first step in generalising the concept, particularly in relation to E-type systems, is to extend the class to include embedded real-world systems, that is, total systems which include hardware that is tightly coupled to and usually controlled by software. In such systems the software must be regarded as, at least part of, the means whereby the system achieves its purpose. Neither the term real world nor embedded has been defined here, but they are used as in common usage.

This suggests that, whatever type of software is installed in the system, once embedded, the hardware/software system as such is also appropriately designated as of type E. Its hardware elements operate as part of the real world supporting other real world activity. Thus such systems must be expected to share at least some of the properties of E-type software and some of its evolutionary characteristics [17]. They will evolve, but the pattern of evolution, rates of change and other time dependent behaviour are likely to be quite different since, unlike software, system evolution is not achieved by applying change upon change to a uniquely identifiable soft artefact (code and/or documentation) but by replacement of elements or parts of the system. Moreover, its physical parts are subject to physical laws, and material characteristics such as size, weight, energy consumption, processing speed, memory size, and fitness in the proposed environments rapidly become constraints for the evolution of the hardware parts of the embedded system. Given this extension, it is then appropriate to refer to E-type systems in general, even where software does not play a dominant, or even any, role in driving and controlling system behaviour and evolution.

One of the major sources of the intrinsic need of E-type software for continual evolution stems from the fact that the software evolution process is a feedback system [2], [14], [22]. This is true for real world systems and, in particular, for those in which human organisations or individuals are involved, whether as drivers, as driven or as controllers, as exemplified by social organisations such as cities [12]. Hence the concept of E-type can be further extended to include this wider class.

Whether such generalisation is useful and what, other than intrinsic evolution, are the behavioural similarities between these various types of systems remains to be investigated. If successful, such investigation should contribute to understanding of how artificial systems evolve and to mastery of their design and evolution [32]. It is, however, certainly appropriate to include general systems such as computer applications in the E-type class. As indicated above, and at the centre of the arguments being presented in this paper, a computer application and the software implementing and supporting it are interdependent; the evolution of one triggers that of the other, as illustrated by the closed loop of Figure 2. Hence, the concept of E-type applications is also useful.


Fig. 2. Installation, operation and evolution

2.6 S-type as Part of a Real World System

So far the concept of S-type has been treated as a property of a program in isolation. However, programs are seldom developed to stand alone except, perhaps, when studying programming methodology or by students of programming. In so far as industry is concerned, a program is, in general, used in association with other software or, in the case of, for example, embedded systems, hardware. It is either integrated into such software and becomes part of the larger system or it provides output used elsewhere. And even if, in rare instances, a program is used in total isolation, the results it produces are, in general, intended for application in some other activity. In all these situations, the S-type program has become part of, an element in, an E-type system, application or activity. It will require the same attention as would any E-type element in the same system. The specification of the S-type program may, for example, need to be evolved to adapt it to the many sources of change already discussed. The S-type categorisation applies only to the program in isolation, a matter of major significance when the concepts of reuse, COTS and component-based architectures are considered [23].

3 Computer Applications and Software, Their Operational Domain, and Specifications Linking Them

The remainder of this paper considers one aspect of the lifetime behaviour of E-type software and applications. Underlying the phenomena to be considered is a view, illustrated in Figure 3, of the relationships between a program, its specification and the application it implements or supports.



Fig. 3. An early view [16]: application concept, formal specification and operational system, related by abstraction and reification across the abstract and concrete (real world) levels

This view is a direct application of the mathematical concepts of a theory and its models in the software context [16], [33], [34], [35]. A more recent and detailed depiction is provided by Figure 4.

Fig. 4. A more recent and detailed view: specification, E-type program and application in its real world domain, linked by abstraction, reification, verification and validation

A specification can be seen as a theory of the application in its real world operational domain [33], obtained by abstraction and successive refinement [36] from knowledge and understanding of both. It may, for example, be presented as a statement of requirements, with formal parts where possible. Conversely, the real world and the program are both models of the theory or, equivalently, of the specification [33]. The executable program is obtained by reification based on successive refinement or an equivalent process. Program elements reflecting formal specification elements should be verified and the program, in part and as a whole, validated against the specification. This can ensure that, within bounds set by the specification and to the extent to which the validation process covers the possible states of the operational environment, the program will meet the purpose of the intended application as expressed in the specification.


These checks are, however, not sufficient. The system must also be validated under actual operational conditions since it will have properties not addressed in the specification. The checks should be repeated, in whole or in part, whenever changes are subsequently made to the program, to maintain the validity of the real world/specification/program relationship.

4 Inevitability of Assumptions in Real World Computing

The real world per se and the bounded real world operational sub-domain have an unbounded number of properties. Since a statement of requirements and a program specification are, of necessity, finite, an unbounded number of properties will have been excluded. Such exclusions will include properties that were unrecognised or assumed irrelevant with respect to the sought-for solution. Every exclusion implies one or more assumptions that, explicitly or implicitly, by omission or commission, knowingly or unknowingly, become part of the specification. It will be up to the subsequent verification and validation processes to confirm the validity of the set of known assumptions in the context of the application and its operational domain. The conscious and unconscious bounding of the latter, of functional content and of its detailed properties during the entire development process determines, to a great extent, the resultant unbounded assumption set.

Validation of the specification with respect to the real world domain is required to ensure that it satisfies the needs of the application in its operational domain. Since the real world is dynamic, validation, in whole or in part, must be periodically repeated to ensure that, as far as possible, assumptions invalidated by changes in the external world or the system itself are corrected by the appropriate changes. Assumptions revealed by such changes must be recorded and included in the known set for subsequent validation. Figure 5 expresses the need for the program to be periodically (ideally continually) validated with respect to the real world domain over the system lifetime. A desirable complementary goal is to maintain the specification as an abstraction of the real world and the program. We term this goal maintaining the system (program, specification, documentation) as a model-like reflection of the real world.

The issue of assumptions does not arise only from the relationship between the specification and the intended application in its domain. The reification process also adds assumptions. This is exemplified by decisions taken to resolve issues not addressed in the specification, which implies that they were overlooked or considered irrelevant when generating the specification. Moreover, the abstraction and reification processes are, generally, carried out independently and assumptions adopted in the two phases may well be incompatible. Real world validation of the completed program is, therefore, necessary. Testing over a wide range of input parameters and conditions is a common means of establishing the validity of the program. But the conclusions reached from tests are still subject to Dijkstra's stricture [6] and, in a dynamic world, are, at most, valid at the time when the testing is undertaken or in relation to the future as then foreseen. Hence, the overall validity of the assumption set relates to real world properties at the time of validation.

As indicated in Figure 5, assumption relationships are mutual. The specification is necessarily based on assumptions about the application and its real world operational domain, because the latter are unbounded in the number of their properties while the specification is bounded.


Moreover, in general, the application has potential for many more features than can be included based on available budgets and the time available to some designated completion point. Hence, the specification also reflects assumptions about the application and implementation. But users, in general, cannot be fully aware of the details and complexities of the specification.

Fig. 5. The role of assumptions: the specification, the E-type program and the application in its real world domain make assumptions about one another, must remain compatible with one another, and require continual validation

Even a disciplined and well-managed application will be based on assumptions about the system. Similarly, as they pursue their various responsibilities, system implementers will make interpretations and assumptions about the specification, particularly its informally expressed parts. And, probably to a lesser extent, those responsible for developing the specification will make assumptions about the program, its content, performance and the system domain (hardware and human) in which it executes. Assumption relationships between the three entities, the specification, the application in its real world domain and the program, are clearly mutual. To the extent that they are recognised and fully understood, both individual assumptions and the mutual compatibility of the set will be validated. But those that have been adopted unconsciously or arise from omission will not knowingly have been validated. Nor will validity bounds have been determined or even recognised.

All these factors represent sources of invalid assumptions reflected in the program. Moreover, the real world is always changing. Hence even previously valid assumptions, and their reflection in program and documentation texts, structure and timing, may become invalid.

5 A Principle of Software Uncertainty

5.1 A Principle

Given the above background, we are now in a position to introduce a Principle of Software Uncertainty in a revised formulation of a statement first published some years ago [18], [19], [20], [21] on the basis of insights developed during earlier studies of software evolution [17].


The Principle may be stated in short as: “The outcome of the execution of E-type software entails a degree of uncertainty; the outcome of execution cannot be absolutely predicted”, or more fully, “Even if the outcome of past execution of an E-type program has previously been admissible, the outcome of further execution is inherently uncertain; that is, the program may display inadmissible behaviour or invalid results”. This statement makes no reference to the source of the uncertainty. Clearly the presence in the software of assumptions, known or implied, that may be invalid is sufficient to create uncertainty. There may be other sources. As stated, the principle refers explicitly to E-type software; its relevance with respect to S-type programs requires additional discussion and is not pursued here.

Use of the terms admissible and inadmissible in the statement is not intended to reflect individual or collective human judgement about the results of execution within the scope of the principle, though such judgement is certainly an issue, a possible source of uncertainty. Such use addresses the issue of whether the results of execution fulfil the objective purpose of the program. The principle is stated in terms of admissibility rather than satisfaction to avoid any ambiguity which might arise from the mathematical meaning of satisfy [29], which, in the Computer Science context, is used to address the relationship between a formal specification and a program which has been verified with respect to it.

5.2 Interpretations

Though the above statement about uncertainty of the admissibility of a program execution was not intended to include uncertainty about human judgement, the principle still applies if the views, opinions, desires or expectations of human stakeholders are considered in determining the admissibility of E-type programs. Since the circumstances of execution and other exogenous factors are likely to affect those views, and all are subject to change [13], the level of uncertainty increases in circumstances where they have a role.

Admissible execution is possible only when the critical assumption set reflected in the program is valid. For E-type programs, the validity of the latter cannot be proven, if only because one cannot identify all members of that set. And even those that can be identified may become invalid. Thus, if changes have occurred in the application or the domain and rectifying changes have not been made in the system, the program may display unsuccessful or unacceptable behaviour no matter how admissible executions were in the past. Uncertainty may even be present in a program verified as correct with respect to a formal specification since the latter may cease to be an adequate abstraction of the real world as a consequence of changes in the execution domain.

Finally, consider effort-related interpretations of admissibility. Sustained admissibility of results requires that the critical assumption set is maintained valid. This requires human effort, analysis and implementation. Now in all organisations human effort is bounded. Since organisations responsible for maintaining the assumption set and the software have their own priorities, one can never be certain that enough effort will be available to achieve timely adjustments that keep critical assumptions valid. Thus, even when invalid assumptions are identified, it is not certain that changes will be in place in time to guarantee continual admissibility of program execution. This further aspect of uncertainty, however, is a prediction about future behaviour rather than being relevant to current execution, so it need not be regarded as being part of the principle.



There may well be other sources of uncertainty. In terms of understanding of software evolution and the principle, the most important sources are those illustrated by Figure 5 and qualified by the observation of the unboundedness of the number of properties of real world domains and, hence, of the assumption set. It is the only one that is not, in a sense, self-evident, at least until after having been stated. Nevertheless, the others must be explicitly stated, since uninformed purchasers of software systems need to consciously accept that future inadmissible behaviour of one sort or another cannot be prevented. This reasoning is more than sufficient to justify identification of software uncertainty as a principle. But the likelihood and impact of such uncertainty can be reduced if it is recognised that its sources relate to the inevitability of change in the real world and in the humans that populate it. This implies that satisfaction (in the conventional sense) with future computational results cannot be guaranteed, however pleasing past results have been. Though perhaps self-evident to the informed, for a society increasingly dependent on correctly behaving computers it is vital that this fact is more widely recognised, not least by politicians and other policy makers.

5.3 Relationship to Heisenberg’s Uncertainty Principle?

It is natural to enquire whether this principle is related in any way to Heisenberg's Principle of Uncertainty. The present authors have not been able to make any connection and see it, at most, as an analogue. The late Gordon Scarrott, however, saw it in a different light and left behind a series of papers in which he argued, inter alia, that the Software Principle was an instance of the Heisenberg Principle. His thesis is presented in several papers including one entitled “Physical Uncertainty, Life and Practical Affairs” [31]. It appears that the authors' copy is an early version of a paper presented by Scarrott in a Royal Society lecture.

6 Empirical Evidence for the Software Uncertainty Principle?

6.1 General Observation

Recent studies [7] have indirectly reinforced the reasoning that led to the formulation of the Principle of Software Uncertainty as here stated. Unfortunately, sufficient relevant data, such as histories of fault reports relating to our collaborators' systems, were not available to permit even an initial meaningful estimate of the degree of satisfaction or acceptability of the individual stakeholders involved with the systems we studied. Nor was such a study part of the identified FEAST studies. It was, however, clear that continual change was present in all the systems observed, and that a portion of such changes addressed progressive invalidation of reflected assumptions. More work is needed to assess, for example, the proportion of changes due to invalidation of assumptions versus changes triggered by other reasons.

The evolution process is a feedback system in which the deployment of successive versions or releases and the consequent stakeholder reaction, learning and familiarisation, or even dislike, for example, play a role.


Such studies are, therefore, not straightforward, and ascertaining the ultimate trigger of a particular change request is difficult; it may not even be possible. Despite these difficulties and the fact that direct investigation is required, the FEAST studies supported the conclusions that underlie the present discussion and, in particular, that continual evolution is inevitable for real world applications. This, in turn, indicates the need to maintain the validity of the assumption set and its reflections in the software. That is believed to be an important component of software maintenance activity.

6.2 Approach to a Theory of Evolution

As indicated above, the FEAST projects have reinforced the reasoning that led to the formulation of the Principle of Software Uncertainty. For practical reasons, the studies were limited to traditional paradigms, that is, for example, non object-oriented and non component-based ones. Plans are underway to extend the FEAST studies to the latter. However, subject to the preceding limitation, the overall results of the projects included the identification of behavioural invariants from data on the evolution of a wide spectrum of industrially developed and evolved systems.

Previous discussions of the basic concepts and insights presented in this paper can be found in earlier publications [7], [17]. The contribution made here is to bring together, explicitly and implicitly, to the extent that this can be done in a conference paper, the observations, models, laws and empirical generalisations [5] that have been accumulated over a period of more than 30 years into a cohesive whole. It now appears that these provide observations at the immediate behavioural level for further study of software evolution by providing a wide phenomenological base. The latter is an important input for theory formation, whose eventual output would be a formal theory of software evolution. If it can be formed, such a theory would, inter alia, help clarify formulations such as that of the Principle of Software Uncertainty and put them on a more solid, coherent and explicit foundation.

It would appear that the Principle of Software Uncertainty, the focus of the present paper in the context of the Soft-Ware conference, is a candidate for a theorem in the proposed theory. An informal outline proof of the principle has been developed in [24]. It is, at best, an outline, since definitions of many of the terms used need refinement, steps in the reasoning need to be filled in, and some may not conform to the style accepted by those more experienced than the present authors in theory development and its formalisation. It does, however, convey the intent of what is planned in the SETh project [25]. It had been intended to include an improved, though not yet complete, version in the present paper, but time has not permitted preparation of a proof that satisfies us. We hope, however, that our presentation at the workshop, together with the concepts presented here and in other FEAST publications, will encourage others to work in this area.

7 Practical Implications

Given the reasoning that underlies its formulation, it might be thought that the Principle of Software Uncertainty is a curio of theoretical interest but of little practical value. Consider, for example, the statement that a real world domain has an uncountable number of properties and cannot, therefore, be totally covered by an information base of finite capacity. The resultant incompleteness represents an uncountable number of assumptions that underlie use of the system and impact the results achieved. The reader may well think, "so what?" Clearly, the overwhelming majority of these assumptions are not relevant in the context of the software in use or contemplated. Neither do they have any impact on behaviour or on the results of execution. It is, for example, quite safe to assume that one may ignore the existence of black holes in outer space when developing the vast majority of programs. It will certainly not lead to problems unless one is working in the area of cosmology or, perhaps, astronomy. On the other hand, after a painful search for the cause of errors during the commissioning of a particle accelerator, a tacit assumption that one may ignore the influence of the moon on earth-bound problems was shown to be wrong. It was discovered that, as a result of the increased size of the accelerator, the gravitational pull of the moon was the basic source of experimental error [4]. Many more examples (see, for example, [28] for a discussion of the Ariane 501 destruction) could be cited of essentially software failures due to the invalidation of reflected assumptions as a result of changes external to the software. The above should convince the reader that degradation of the quality, precision or completeness of the assumption set reflected by a program, its specification and other documents represents a major societal threat as the ever wider, more penetrating and integrated use of such systems spreads. There is room here for serious methodological and process research, development and improvement.

At least one exercise has been undertaken to show that the observations, inferences and conclusions reached in the FEAST studies, as exemplified by the examples in the present paper, are more than of just theoretical interest. At the request of the FEAST collaborators a report was prepared to present some of the direct practical measures suggested by the FEAST results. That report has been, or is shortly to be, published [26]. There follows a list of some of the practical recommendations that address, directly or indirectly, the implications of the Principle (a sketch of one possible structured assumption record follows the list):
• when developing a computer application and associated systems, estimate and document the likelihood of change in the various areas of the application domains and their spread through the system, to simplify subsequent detection of assumptions that may have become invalid as a result of changes,
• seek to capture, by all means available, assumptions made in the course of program development or change,
• store the appropriate information in a structured form, related possibly to the likelihood of change as in (a), to facilitate detection in periodic review of any assumptions that have become invalid,
• assess the likelihood or expectation of change in the various categories of catalogued assumptions, as reflected in the database structure, to facilitate such review,
• review the assumptions database by categories as identified in (c), and as reflected in the database structure, at intervals guided by the expectation or likelihood of change or as triggered by events,
• develop and provide methods and tools to facilitate all of the above,
• when possible, separate validation and implementation teams to improve questioning and control of assumptions,
• provide for ready access by the evolution teams to all appropriate domain specialists with in-depth knowledge and understanding of the application domain.
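As a purely illustrative sketch of how such recommendations might be realised, the fragment below shows one possible structured assumption record; all names and fields (AssumptionRecord, change_likelihood, review_due, and so on) are our own assumptions and are not prescribed by the FEAST reports.

    from dataclasses import dataclass
    from datetime import date, timedelta

    @dataclass
    class AssumptionRecord:
        """One catalogued assumption reflected in the software (illustrative only)."""
        identifier: str                 # e.g. "ASSUME-042"
        statement: str                  # the assumption, stated in plain language
        domain_area: str                # area of the application domain it relates to
        affected_components: list       # where it is reflected in the system
        change_likelihood: str          # e.g. "low", "medium", "high"
        last_reviewed: date
        review_interval_days: int       # shorter interval for higher likelihood of change

        def review_due(self, today: date) -> bool:
            # A review is due when the chosen interval has elapsed.
            return today >= self.last_reviewed + timedelta(days=self.review_interval_days)

    # Example: periodic review driven by the assessed likelihood of change.
    catalogue = [
        AssumptionRecord("ASSUME-001", "Exchange rates change at most daily", "finance",
                         ["pricing-engine"], "high", date(2001, 9, 1), 90),
    ]
    due = [a.identifier for a in catalogue if a.review_due(date(2002, 4, 8))]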

A more general consequence of the Principle is that, just as computer users, in the broadest sense of the term, must learn to treat the results of computation with care, so must software users. It must never be assumed that information is correct simply because its source is a computer. This realisation calls for careful thought and a general educational process in government, industry and the educational system that maintains faith in the computer but, at the same time, ensures adequate care in how the results of its use are managed. It should not be beyond the wit of society to achieve this.

References – * indicates reprint as a chapter in Lehman and Belady 1985

1. Banach R and Poppleton MR, Retrenchment, Refinement and Simulation, in Bowen JP, Dunne SE, Galloway A and King S (eds.), ZB2000: Formal Specification and Development in Z and B, Springer, 2000, 525 pp
2.* Belady LA and Lehman MM, An Introduction to Program Growth Dynamics, in Freiburger W (ed.), Statistical Computer Performance Evaluation, Academic Press, New York, pp. 503 - 511
3. Brooks FP, No Silver Bullet - Essence and Accidents of Software Engineering, Information Processing 86, Proc. IFIP Congress 1986, Dublin, Sept. 1-5, Elsevier Science Publishers (BV), North Holland, pp. 1069 - 1076
4. CERN, The Earth Breathes on LEP and LHC, CERN Bulletin 09/98, 23 February 1998, http://bulletin.cern.ch/9809/art1/Text_E.html <as of Nov. 2001>
5. Coleman JS, Introduction to Mathematical Sociology, The Free Press of Glencoe, Collier-Macmillan Limited, London, 1964, 554 pp
6. Dijkstra EW, The Humble Programmer, ACM Turing Award Lecture, CACM, v. 15, n. 10, Oct. 1972, pp. 859 - 866
7. FEAST projects web site: http://www.doc.ic.ac.uk/~mml/feast/ includes a list of Project FEAST and the authors' papers, and PDF versions of those recent papers not restricted by copyright transfers
8. Finkelstein A, Gabbay D, Hunter A, Kramer J and Nuseibeh B, Inconsistency Handling in Multi-Perspective Specifications, IEEE Trans. on Softw. Eng., v. 20, n. 8, Aug. 1994, pp. 569 - 578
9. Hoare CAR, An Axiomatic Basis for Computer Programming, CACM, v. 12, n. 10, Oct. 1969, pp. 576 - 583
10. id., Proof of a Program: FIND, CACM, v. 14, n. 1, Jan. 1971
11. Van Lamsweerde A, Formal Specification: a Roadmap, in Finkelstein A (ed.), The Future of Software Engineering, 22nd ICSE, Limerick, Ireland, 2000, ACM Order N. 592000-1, pp. 149 - 159
12.* Lehman MM, Programs, Cities, Students - Limits to Growth, Imp. Col. Inaug. Lect. Ser., v. 9, 1970 - 1974, pp. 211 - 229; also in Gries, 1978
13.* id, Human Thought and Action as an Ingredient of System Behaviour, in Duncan R and Weston-Smith M (eds.), The Encyclopædia of Ignorance, Pergamon Press, London, 1977, pp. 347 - 354
14.* id, Laws of Program Evolution - Rules and Tools for Programming Management, Proc. Infotech State of the Art Conference, Why Software Projects Fail, April 9-11, 1978, pp. 1V1 - 1V25
15.* id, Program Life Cycles and Laws of Software Evolution, Proc. IEEE, Spec. Iss. on Softw. Eng., Sept. 1980, pp. 1060 - 1076


16. id, A Further Model of Coherent Programming Process, Proc. Softw. Proc. Worksh., Egham, Surrey, 6 - 8 Feb. 1984, IEEE Cat. no. 84 CH 2044-6, pp. 27 - 35
17. Lehman MM and Belady LA, Program Evolution - Processes of Software Change, Academic Press, London, 1985
18. Lehman MM, Uncertainty in Computer Application and its Control Through the Engineering of Software, J. of Software Maintenance: Research and Practice, v. 1, n. 1, Sept. 1989, pp. 3 - 27
19. Lehman MM, Software Engineering as the Control of Uncertainty in Computer Application, SEL Software Engineering Workshop, Goddard Space Centre, MD, 29 Nov. 1989, publ. 1990
20. id, Uncertainty in Computer Application, CACM, v. 33, n. 5, May 1990, pp. 584 - 586
21. id, Uncertainty in Computer Applications is Certain - Software Engineering as a Control, Proc. CompEuro 90, Int. Conf. on Computer Systems and Software Engineering, Tel Aviv, 7 - 9 May 1990, IEEE Comp. Soc. Press, n. 2041, pp. 468 - 474
22. id, Feedback in the Software Evolution Process, Keynote Address, CSR Eleventh Annual Workshop on Software Evolution: Models and Metrics, Dublin, 7 - 9 Sept. 1994, Workshop Proc., Information and Software Technology, sp. iss. on Software Maintenance, v. 38, n. 11, Elsevier, 1996, pp. 681 - 686
23. Lehman MM and Ramil JF, Software System Maintenance and Evolution in an Era of Reuse, COTS and Component Based Systems, Joint Keynote Lecture, International Conference on Software Maintenance and Int. Workshop on Empirical Studies of Software Maintenance WESS 99, Oxford, 3 Sept. 1999
24. id, Towards a Theory of Software Evolution - And Its Practical Impact, invited talk, Proc. ISPSE 2000, Intl. Symposium on the Principles of Software Evolution, Kanazawa, Japan, Nov. 1-2, 2000, IEEE CS Press, pp. 2 - 11
25. Lehman MM, SETh - Towards a Theory of Software Evolution, EPSRC Proposal, Case for Support Part 2, Dept. of Comp., ICSTM, 5 Jul. 2001
26. Lehman MM and Ramil JF, Rules and Tools of Software Evolution Planning, Management and Control, Annals of Software Engineering, Spec. Iss. on Softw. Management, v. 11, issue 1, 2001, pp. 15 - 44
27. Nuseibeh B, Kramer J and Finkelstein A, A Framework for Expressing the Relationships Between Multiple Views in Requirements Specification, IEEE Trans. on Software Engineering, v. 20, n. 10, Oct. 1994, pp. 760 - 773
28. Nuseibeh B, Ariane 5: Who Dunnit?, IEEE Software, May/June 1997, pp. 15 - 16
29. The Compact Oxford English Dictionary, 2nd, Micrographically Reduced Edition, Oxford Univ. Press, 1989
30. Pfleeger S, Software Engineering - The Production of Quality Software, Macmillan Pub. Co., 1987
31. Scarrott G, Copies of various relevant papers, published and unpublished, including the one referenced, can be obtained from one of the authors (mml) of this paper
32. Simon HA, The Sciences of the Artificial, M.I.T. Press, Cambridge, MA, 1969; 2nd ed., 1981
33. Turski WM, Specification as a Theory with Models in the Computer World and in the Real World, System Design, Infotech State of the Art Rep. (P Henderson ed.), ser. 9, n. 6, 1981, pp. 363 - 377
34. Turski WM and Maibaum TSE, The Specification of Computer Programs, Addison-Wesley, Wokingham, 1987
35. Turski WM, An Essay on Software Engineering at the Turn of the Century, in Maibaum T (ed.), Fundamental Approaches to Software Engineering, Proc. Third Int. Conf. FASE 2000, March/April 2000, LNCS 1783, Springer-Verlag, Berlin, pp. 1 - 20
36. Wirth N, Program Development by Stepwise Refinement, CACM, v. 14, n. 4, Apr. 1971, pp. 221 - 227


Temporal Probabilistic Concepts from Heterogeneous Data Sequences

Sally McClean, Bryan Scotney, and Fiona Palmer

School of Information and Software Engineering, Faculty of Informatics,
University of Ulster, Cromore Road, Coleraine, BT52 1SA, Northern Ireland
si.mcclean, [email protected]; [email protected]

Abstract. We consider the problem of characterisation of sequences of heterogeneous symbolic data that arise from a common underlying temporal pattern. The data, which are subject to imprecision and uncertainty, are heterogeneous with respect to classification schemes, where the class values differ between sequences. However, because the sequences relate to the same underlying concept, the mappings between values, which are not known ab initio, may be learned. Such mappings relate local ontologies, in the form of classification schemes, to a global ontology (the underlying pattern). On the basis of these mappings we use maximum likelihood techniques to handle uncertainty in the data and learn local probabilistic concepts represented by individual temporal instances of the sequences. These local concepts are then combined, thus enabling us to learn the overall temporal probabilistic concept that describes the underlying pattern. Such an approach provides an intuitive way of describing the temporal pattern while allowing us to take account of inherent uncertainty using probabilistic semantics.

1 Background

It is frequently the case that data mining is carried out in an environment that contains noisy and missing data, and the provision of tools to handle such imperfections in data has been identified as a challenging area for knowledge discovery in databases. Generalised databases have been proposed to provide intelligent ways of storing and retrieving data. Frequently, data are imprecise, i.e. we are uncertain about the specific value of an attribute, knowing only that it takes a value that is a member of a set of possible values. Such data have been discussed previously as a basis of attribute-oriented induction for data mining [12, 13]. This approach has been shown to provide a powerful methodology for the extraction of different kinds of patterns from relational databases. It is therefore important that appropriate functionality is provided for database systems to handle such information.

A database model that is based on partial values [3, 4, 28] has been proposed to handle such imprecise data. Partial values may be thought of as a generalisation of null values: rather than not knowing anything about a particular attribute value, as is the case for null values, we may be more specific and identify the attribute value as belonging to a set of possible values. A partial value is therefore a set such that exactly one of the values in the set is the true value.

Most previous work has concentrated on providing functionality that extends relational algebra with a view to executing traditional queries on uncertain or imprecise data. However, for such imperfect data, we often require aggregation operators that provide information on patterns in the data. Thus, while traditional query processing is tuple-specific, where we need to extract individual tuples of interest, processing of uncertain data is often attribute-driven, where we need to use aggregation operators to discover properties of attributes of interest. Thus we might want to aggregate over individual tuples to provide summaries which describe relationships between attributes. The derivation of such aggregates from imprecise data is a difficult task for which, in our approach, we rely on the EM algorithm [5] for aggregation of the partial value model. Such a facility is an important requirement in providing a database with the capability to perform the operations necessary for knowledge discovery in an imprecise and uncertain environment.

In this paper we are concerned, in particular, with identifying temporal probabilistic concepts from heterogeneous data sequences. Such heterogeneity is common in distributed databases, which typically have developed independently. Here, for a common concept we may have heterogeneous classification schemes. Local ontologies, in the form of such classification schemes, may be mapped onto a global ontology. The resolution of such conflicts remains an important problem for database integration in general [16, 25, 26] and for database clustering in particular [21]. More recently, learning database schema for distributed data has become an active research topic in the database literature [9].

A problem area of significant current research interest in which heterogeneous data sequences occur is that of gene expression data, where each sequence corresponds to a different gene [7]. Previous work, e.g. [23], has clustered discretised gene expression sequences using mutual entropy. However, such an approach is unable to capture the full temporal semantics of dynamic sequences [20]. Our aim here is to develop a method that captures the temporal semantics of sequences via temporal probabilistic concepts. In the context of gene expression data this approach therefore provides a means of associating a set of genes, via their expression sequences, with an underlying temporal concept along with its accompanying dynamic behaviour. For example, we may associate a gene sequence cluster with an underlying growth process that develops through a number of stages that are paralleled by the associated genes. We return to the context of gene expression data in the illustrative example in Section 5.

The general solution that we propose involves several distinct tasks. We assume that the data comprise heterogeneous sequences that have an underlying similar temporal pattern. Such data may have been produced by the use of a prior clustering algorithm, e.g. using mutual entropy. Since we are concerned with temporally clustering heterogeneous sequences, we must first determine the mappings between the states of each sequence in a cluster and the global concept; for this we compute the possible mappings. Then, on the basis of these mappings, we use maximum likelihood techniques to learn the probabilistic description of local probabilistic concepts represented by individual temporal instances of the expression sequences. This stage is followed by one in which we learn the global temporal concept. Finally, we use the concept description to determine the most probable pathway for each concept. Such an approach has a number of advantages:
• it provides an intuitive way of describing the underlying shape of the process by explicitly modelling the temporal aspects of the data; such segmental models have considerable potential for sequence data;
• it provides a way of mapping heterogeneous sequences;
• it allows us to take account of natural variability via probabilistic semantics;
• it allows sequences to be characterised in a temporal probabilistic concept model; concepts may then be matched with known processes in the data environment.
We build on our previous work on integration [19, 25, 26] and clustering [21] of multinomial data that are heterogeneous with respect to classification schemes. In our previous papers we have assumed that the schema mappings are made available by the data providers. In this paper the novelty partly resides in the fact that we must now learn the schema mappings as well as the underlying concepts. Such schema mapping problems for heterogeneous data are becoming increasingly important as more databases become available on the Internet, providing opportunities for knowledge discovery from open data sources. An additional novel aspect of this paper is the learning of temporal probabilistic concepts (TPCs) from such sequences.

2 The Problem

In Table 1 we present four such sequences, identified on the basis of mutual entropy clustering as having similar temporal semantics. We note that the codes (0, 1, or 2) should be regarded as symbolic rather than numerical data, and we re-label them in Table 2 to emphasise this.

Table 1. Raw sequence data

time         t1 t2 t3 t4 t5 t6 t7 t8 t9
Sequence 1   0  0  1  1  1  1  1  1  0
Sequence 2   0  0  0  1  1  1  1  2  2
Sequence 3   0  0  0  2  2  2  2  1  1
Sequence 4   0  0  0  2  2  1  1  1  1

Table 2. Re-labelled sequence data

Sequence 1   A  A  B  B  B  B  B  B  A
Sequence 2   C  C  C  D  D  D  D  E  E
Sequence 3   F  F  F  H  H  H  H  G  G
Sequence 4   I  I  I  K  K  J  J  J  J

Examples of possible mappings are presented in Table 3, where L, M and N are the global labels that we are learning. The global sequence (L, L, L, M, M, M, M, N, N) could therefore characterise the temporal behaviour of the global ontology underpinning these data. We note that these mappings are not exact in all cases; e.g., in Sequence 1 the ontology is coarser than the underlying global ontology, and neither Sequence 1 nor Sequence 4 exactly maps onto the global sequence. This highlights the necessity for building probabilistic semantics into the temporal concept. Although, in some circumstances, such schema mappings may be known to the domain expert, typically they are unknown and must be discovered by an algorithm.

Table 3. Schema mappings for Table 2

A → L    C → L    F → L    I → L
B → M    D → M    G → N    J → N
A → N    E → N    H → M    K → M

Definition 2.1: We define a sequence to be a set S = {s_1, ..., s_L}, where L is the (variable) length of the sequence and s_i, i = 1, ..., L, are members of a set A comprising the letters of a finite alphabet.

In what follows we refer to such letters as the values of the sequence.

Malvestuto [17] has discussed classification schemes that partition the values of an attribute into a number of categories. A classification P is defined to be finer than a classification Q if each category of P is a subset of a category of Q; Q is then said to be coarser than P. Such classification schemes may be specified by the database schema or may be identified by appropriate algorithms. The relationship between two classification schemes is described by a correspondence graph [17, 18], where nodes represent classes and arcs indicate that associated classes overlap.

For heterogeneous distributed data it is frequently the case that there is a shared ontology that specifies how the local semantics correspond to the global meaning of the data; these ontologies are encapsulated in the classification schemes. The mappings between the heterogeneous local and global schema are then described by a correspondence graph represented by a correspondence table. In our case we envisage data with heterogeneous schema that may have arisen because either the sequences represent different variables that are related through a common latent variable, or the data may be physically distributed and related through a common ontology.

We define a correspondence table for a set of sequences to be a representation of the schema mappings between the sequence and hidden variable ontologies. The correspondence table for Table 3 is presented in Table 4; the symbolic values in each sequence are numbered alphabetically. It is this table that we must learn in order to determine the mappings between heterogeneous sequences.

Table 4. The correspondence table for the sequence data in Table 1

Global ontology   Sequence 1   Sequence 2   Sequence 3   Sequence 4
1                 1            1            1            1
2                 2            2            3            3
3                 3            3            2            2
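For concreteness, the raw sequences of Table 1 and the correspondence table of Table 4 could be held in memory along the following lines; this is only an illustration, and the variable names are our own.

    # Raw symbolic sequences from Table 1 (codes treated as symbols, not numbers).
    sequences = {
        "Sequence 1": ["0", "0", "1", "1", "1", "1", "1", "1", "0"],
        "Sequence 2": ["0", "0", "0", "1", "1", "1", "1", "2", "2"],
        "Sequence 3": ["0", "0", "0", "2", "2", "2", "2", "1", "1"],
        "Sequence 4": ["0", "0", "0", "2", "2", "1", "1", "1", "1"],
    }

    # Correspondence table of Table 4: for each sequence, the locally numbered value
    # (1, 2, 3, counted alphabetically) that corresponds to each global value.
    correspondence = {
        "Sequence 1": {1: 1, 2: 2, 3: 3},
        "Sequence 2": {1: 1, 2: 2, 3: 3},
        "Sequence 3": {1: 1, 2: 3, 3: 2},
        "Sequence 4": {1: 1, 2: 3, 3: 2},
    }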

3 Clustering Heterogeneous Sequence Data

The general solution that we propose involves several distinct tasks. We assume that the data comprise heterogeneous sequences that have an underlying similar temporal pattern. Since we are concerned with clustering heterogeneous sequences, the first step is to determine the mappings between the values of each sequence in a cluster and the global ontology; for this we compute the possible mappings.

We are trying to find mappings between the heterogeneous sequences in order to identify homogeneous clusters; this involves identification of where the symbols in the respective (sequence) alphabets co-occur. Finding these schema mappings involves searching over the possible set of mappings. Such a search may be carried out using a heuristic approach, for example a genetic algorithm, to minimise the divergence between the mapped sequences. In order to restrict the search space, we may limit the types of mapping that are permissible. For example, we may allow only order-preserving mappings; the fitness function may also be penalised to prohibit trivial mappings, e.g. where every symbol in the sequence is mapped onto the same symbol of the global ontology.

The schema mappings from each local ontology to the global ontology envisaged in this paper may serve one, or both, of the following two functions:
• re-labelling the symbols of a local ontology to the symbols of the global ontology;
• changing the granularity, because the granularity of a local ontology is coarser than that of the global ontology.
Each value of the local ontology is mapped to a set of values of the global ontology; these may be singleton sets having one element, or they may be sets of more than one element, referred to as partial values. Partial values are defined formally in Section 4. In this paper we present a simple algorithm in which each local ontology value is mapped to the global ontology value to which it most frequently corresponds. Where the most frequently corresponding global ontology value is not unique, the local ontology value is mapped to the set of most frequently corresponding values, i.e. a partial value. This mapping may be generalised by using fuzzy logic. The most frequently corresponding global ontology value may be considered to be a fuzzy concept, resulting in the use of partial values where a local ontology value maps to a set of global ontology values to each of which it frequently corresponds.

Since the search space for optimal mappings is potentially very large, we propose an ad hoc approach that can be used to initialise a heuristic hill-climbing method such as a genetic algorithm. Our objective is to minimise the distance between local sequences and the global sequence once the mapping has been carried out. However, since the global sequence is a priori unknown, we propose to approximate this function by the distance between mapped sequences. Our initialisation method then finds a mapping, as summarised in Figure 1. If we wish to provide a number of solutions, say to form the initial mating pool for a genetic algorithm, we can choose a number of different sequences to act as proxies for the global ontology in the first step of the initialisation method.

Example 3.1: We consider the data in Table 2. Here we select Sequence 2 as the proxy for the global ontology, as it is at the finest granularity of any of the sequences.


Choose one of the sequences whose number of symbols is maximal (S* say); these symbols act as a proxy for the values of the global ontology. For each remaining sequence S_i, of length L, let

    h^{ij}_{ru} = 1 if s_{ij} = r and s*_j = u, and 0 otherwise, for j = 1, ..., L.

Here s_{ij}, the j'th value in the local sequence, is a value in the alphabet of S_i; s*_j, the corresponding value in the global ontology, is a value in the alphabet of S*.

Then, for each i and r, compute h^{i}_{ru} = Σ_{j=1}^{L} h^{ij}_{ru}, and find u^{i}_{r} such that h^{i}_{r u^{i}_{r}} = max_u ( h^{i}_{ru} ).

In the i'th sequence, the value r is then mapped to u^{i}_{r}. If u^{i}_{r} is not unique, r is mapped to the partial value {u^{i}_{r}}.

Fig. 1. Algorithm for mapping allocation

Using the algorithm of Fig. 1 to determine mappings from Sequence 1 to Sequence 2 gives h_AC = 2, h_AE = 1; h_BC = 1, h_BD = 4, h_BE = 1. We thus induce the mapping A → C and B → D. Similarly, from Sequence 3 to Sequence 2 we have h_FC = 3, h_GE = 2, h_HD = 4, and we thus induce the mapping F → C, G → E and H → D. From Sequence 4 to Sequence 2 we have h_IC = 3; h_JD = 2, h_JE = 2; h_KD = 2, and we thus induce the mapping I → C, J → {D, E} and K → D.
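A minimal Python sketch of the mapping-allocation procedure of Fig. 1 follows; the function and variable names (allocate_mapping, counts) are our own, and the sketch handles one local sequence against the chosen proxy.

    from collections import Counter

    def allocate_mapping(local_seq, proxy_seq):
        # Map each local value to the proxy value(s) with which it co-occurs most often (Fig. 1).
        counts = {}   # counts[r][u] = number of positions with local value r and proxy value u
        for r, u in zip(local_seq, proxy_seq):
            counts.setdefault(r, Counter())[u] += 1
        mapping = {}
        for r, c in counts.items():
            best = max(c.values())
            # A unique maximiser gives a singleton; ties give a partial value (a set).
            mapping[r] = frozenset(u for u, n in c.items() if n == best)
        return mapping

    # Example 3.1, with Sequence 2 acting as proxy for the global ontology:
    seq1 = list("AABBBBBBA")
    seq2 = list("CCCDDDDEE")
    print(allocate_mapping(seq1, seq2))   # A maps to {C}, B maps to {D}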

4 Concept Learning

4.1 Concept Definitions

A concept is defined over a set X of instances; training examples are members x of X [24]. Given a set of training examples T of a target concept C, the learner must estimate C from T. The concepts we are concerned with here may be thought of as symbolic objects which are described in terms of discrete-valued features X_i, i = 1, ..., n, where X_i has domain D_i = {v_i1, ..., v_im_i}. A symbolic object is then defined in terms of feature values as O = {[X_1 = v_1]; ...; [X_n = v_n]} [27]. A logical extension of this definition is to define the object attribute values to lie in subsets of the domain, that is, O = {[X_1 ∈ S_1]; ...; [X_n ∈ S_n]}, where S_i ⊆ D_i for i = 1, ..., n. Each set S_i represents a partial value of the set of domain values D_i [4, 28].

Definition 4.1: A partial value is determined by a set of possible attribute values of an attribute X, of which one and only one is the true value. We denote a partial value by η = (a_r, ..., a_s), corresponding to a set of h possible values a_r, ..., a_s of the same domain, in which exactly one of these values is the true value of η. Here, h is the cardinality of η; (a_r, ..., a_s) is a subset of the domain set {a_1, ..., a_k} of attribute X, and h ≤ k.

Example 4.1: Consider the features expression level and function, with respective domains {low, medium, high} and {growth, control}. Then examples of concepts are:
C1 = {[expression level = high]; [function = growth]}
C2 = {[expression level = (low, medium)]; [function = control]}
In concept C2, (low, medium) is an example of a partial value. In this case we know that the expression level is either low or medium.

We have previously defined a partial probability distribution, which assigns probabilities to each partial value of a partition of the set of possible values [20].

Definition 4.2: A partial probability distribution is a vector of probabilities π = (p_1, ..., p_r) which is associated with a partition formed by partial values (η_1, ..., η_r) of attribute A. Here p_i is the probability that the attribute value is a member of partial value η_i, and Σ_{i=1}^{r} p_i = 1.

Example 4.2: An example of a partial probability distribution on the partition of the domain values of expression level given by ((low, medium), (high)) is then:
π((low, medium), (high)) = (0.99, 0.01).
This distribution means that the probability of having either a low or medium expression level is 0.99, and the probability of having a high expression level is 0.01.

Probabilistic concepts have been used to extend the definition of a concept to uncertain situations where we must associate a probability with the values of each feature vector [10, 11, 27]. For example, a probabilistic concept might be:
C3 = {[expression level = high]:0.8, [expression level = medium]:0.2, [function = growth]:1.0}.
This means that the concept is characterised by having a high expression level with probability 0.8, a medium expression level with probability 0.2, and function growth.

We are concerned with learning concepts that encapsulate both probabilistic and temporal semantics from heterogeneous data sequences. Some previous work has addressed the related problem of identifying concept drift in response to changing contexts [6, 14, 15]. However, our current problem differs in that, rather than seeking to learn concepts which then change with time, our focus is on learning temporal concepts where the temporal semantics are an intrinsic part of the concept. This is achieved by regarding time as one of the attributes of the symbolic object that represents the concept.

We regard time as being measured at discrete time points T = {t_1, ..., t_k}. A time interval is then a subset S of T such that the elements of S are contiguous. A local probabilistic concept (LPC) may then be defined on a time interval of T.

Example 4.3: Let S = {t_1, t_2} be a time interval of T. Then we may have a local probabilistic concept
C4 = {[Time = S], [expression level = high]:0.8, [expression level = medium]:0.2}.


That is, during time interval S there is a high expression level with probability 0.8 and a medium expression level with probability 0.2.

From these local probabilistic concepts we must then learn the global temporal concept. A global temporal concept is the union of local probabilistic concepts that relate to time intervals that form a partition of the time domain T.

Definition 4.3: A temporal probabilistic concept (TPC) is defined in terms of a time attribute with domain T = {t_1, ..., t_k} and discrete-valued features X_j, where X_j has domain D_j = {v_j1, ..., v_jm_j}, for j = 1, ..., n. We define a partition of T as T_1, ..., T_q, where T_i ⊆ T, T_i ∩ T_j = ∅ for i ≠ j, and ∪_{i=1}^{q} T_i = T. For each feature X_j, let S_ij = {S_ij1, ..., S_ijr_ij} be a partition of D_j in time interval T_i, where S_iju ⊆ D_j for u = 1, ..., r_ij. A local probabilistic concept for interval T_i is then defined as

LPC_i = {T_i, S_i1:(p_i11, ..., p_i1r_i1), ..., S_in:(p_in1, ..., p_inr_in)}

where Σ_{s=1}^{r_ij} p_ijs = 1 for j = 1, ..., n. The TPC is then defined as TPC = ∪_{i=1}^{q} LPC_i.

Example 4.4: Let Time have domain T = {t_1, t_2, t_3}, partitioned into two time intervals T_1 = {t_1, t_2} and T_2 = {t_3}; X_1 = expression level has domain D_1 = {low, medium, high}. Typical local probabilistic concepts are LPC_1 = {T_1, D_1:(0, 0.2, 0.8)} and LPC_2 = {T_2, D_1:(0.9, 0.1, 0)}. The corresponding TPC is then the union LPC_1 ∪ LPC_2. Hence, during time interval T_1 there is a high expression level with probability 0.8 and a medium expression level with probability 0.2; during time interval T_2 there is a low expression level with probability 0.9 and a medium expression level with probability 0.1.
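Example 4.4 could be encoded along the following lines; this is purely illustrative and the dictionary layout is our own choice rather than anything prescribed by the paper.

    # One possible encoding of Example 4.4. Each LPC pairs a time interval with a
    # partial probability distribution over the feature's domain; the TPC is the
    # collection of LPCs over a partition of the time domain T.
    LPC1 = {"interval": {"t1", "t2"},
            "expression level": {"low": 0.0, "medium": 0.2, "high": 0.8}}
    LPC2 = {"interval": {"t3"},
            "expression level": {"low": 0.9, "medium": 0.1, "high": 0.0}}
    TPC = [LPC1, LPC2]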

4.2 Learning Local Probabilistic Concepts

In this section we describe an algorithm for learning local probabilistic concepts which takes account of the fact that the schema mappings discussed in Section 3 may map a value in the local ontology onto a set of possible values in the global ontology.

We have previously developed an approach that allows us to aggregate attribute values for any attribute with values that are expressed as a partial probability distribution [20, 22]. Such partial probability distributions correspond in our current context to local probabilistic concepts.

Notation 4.1: We consider an attribute X_j with corresponding global ontology domain D_j = {v_1, ..., v_k}, which has instances x_1, ..., x_m. The value of the r'th instance of X_j is then a (partial value) set given by S^{(j)}_r, for r = 1, ..., m.

We further define:

    q^{(j)}_{ir} = 1 if v_i ∈ S^{(j)}_r, and 0 otherwise, for i = 1, ..., k.


Definition 4.4: The partial value aggregate of a number of partial values on attribute X_j for time interval T_i, denoted pvagg(A_ij), is defined as a vector-valued function pvagg(A_ij) = (p_ij1, ..., p_ijk), where the p_ij's are computed from the iterative scheme:

    p^{(n)}_{ijs} = p^{(n-1)}_{ijs} Σ_{r=1}^{m} [ q^{(j)}_{sr} / ( Σ_{u=1}^{k} p^{(n-1)}_{iju} q^{(j)}_{ur} ) ] / m,   for s = 1, ..., k.

Here p^{(n)}_{ijs} is the value of p_ijs at the n'th iteration, and the p_ijs are the probabilities associated with the respective values v_1, ..., v_k of attribute X_j in time interval T_i. The formula may be regarded as an iterative scheme which, at each step, apportions the data to the (partial) values according to the current values of the probabilities. The initial values are taken from the uniform distribution.

We can show [20, 22] that this formula produces solutions for the p_ijs which minimise the Kullback-Leibler information divergence; this is equivalent to maximising the likelihood. It is in fact an application of the EM algorithm [5, 29]. We illustrate the algorithm using the data presented in Table 1 and the mappings learned in Example 3.1.
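A short Python sketch of this EM-style apportionment, for a single time point, is given below; it is only an illustration of Definition 4.4, and the function name pvagg and its arguments are assumptions of ours.

    def pvagg(observations, domain, iterations=200):
        # EM-style aggregation of partial values at one time point (Definition 4.4).
        # observations: list of sets, each the partial value recorded for one sequence.
        # domain: list of possible global values. Returns a probability per value.
        m = len(observations)
        p = {v: 1.0 / len(domain) for v in domain}           # uniform initial values
        for _ in range(iterations):
            new_p = {}
            for v in domain:
                total = 0.0
                for S in observations:
                    if v in S:
                        denom = sum(p[u] for u in S)          # mass currently given to S
                        total += p[v] / denom                 # share of S apportioned to v
                new_p[v] = total / m
            p = new_p
        return p

    # Eighth time point of Table 5 (Example 4.5): D, E, E and the partial value {D, E}.
    print(pvagg([{"D"}, {"E"}, {"E"}, {"D", "E"}], ["C", "D", "E"]))
    # converges to approximately p_C = 0, p_D = 1/3, p_E = 2/3, as in Example 4.5

As the usage example indicates, the iteration reproduces the solution quoted in Example 4.5 for the eighth time point.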

Example 4.5: Applying the mappings learned in Example 3.1, we induce the mapped sequences presented in Table 5.

Table 5. Mapped sequence data

Sequence 1   C  C  D  D  D  D    D    D    C
Sequence 2   C  C  C  D  D  D    D    E    E
Sequence 3   C  C  C  D  D  D    D    E    E
Sequence 4   C  C  C  D  D  D,E  D,E  D,E  D,E

Then, for example, using only the data at the eighth time point (column 9 of Table 5) we obtain:

    p^{(n)}_C = 0 / 4,
    p^{(n)}_D = ( 1 + p^{(n-1)}_D / (p^{(n-1)}_D + p^{(n-1)}_E) ) / 4,
    p^{(n)}_E = ( 2 + p^{(n-1)}_E / (p^{(n-1)}_D + p^{(n-1)}_E) ) / 4.

Iteration yields the solution p_C = 0, p_D = 1/3, p_E = 2/3. Similarly, using only the data at the ninth time point (column 10 of Table 5) we obtain:

    p^{(n)}_C = 1 / 4,
    p^{(n)}_D = ( p^{(n-1)}_D / (p^{(n-1)}_D + p^{(n-1)}_E) ) / 4,
    p^{(n)}_E = ( 2 + p^{(n-1)}_E / (p^{(n-1)}_D + p^{(n-1)}_E) ) / 4.

In this case, iteration yields the solution p_C = 1/4, p_D = 0, p_E = 3/4. These are both examples of local probabilistic concepts.


4.3 Learning Temporal Probabilistic Concepts

Once we have learned the local probabilistic concepts, the next task is to learn the temporal probabilistic concept for the combined sequences. This is carried out using temporal clustering. The algorithm is described in Figure 2.

The similarity metric used here for clustering, measuring the distance between two local probabilistic concepts, is given by d_12 = ℓ_1 + ℓ_2 - ℓ_12, where ℓ is the log-likelihood function for a local probabilistic concept, given by Definition 4.5. Here ℓ_1 is the log-likelihood for LPC_1, ℓ_2 is the log-likelihood for LPC_2, and ℓ_12 is the log-likelihood for LPC_1 and LPC_2 combined. We can then use a chi-squared test to decide whether LPC_1 and LPC_2 can be combined to form a new LPC.

Definition 4.5: The log-likelihood of the probabilistic partial value (p_1, ..., p_k) in a local probabilistic concept, as defined in Section 4.1, is given by:

    ℓ = Σ_{r=1}^{m} log ( Σ_{i=1}^{k} q_{ir} p_i ),   subject to Σ_{i=1}^{k} p_i = 1.

Here the p_i's are first found using the iterative algorithm in Section 4.2.

Input:
A set of sequences that have been aligned using the schema mappings

Clustering contiguous time periods:
Beginning at the first time point, test for similarity of contiguous local probabilistic concepts (LPCs).
If LPCs are similar then combine;
else, compare with previous LPC clusters and combine if similar.
If the LPC is not similar to any previous cluster, then start a new cluster.

Characterisation of temporal clusters:
For each cluster: identify the local probabilistic concept.
Combine optimal clusters to provide the temporal probabilistic concept (TPC).

Output:
Temporal probabilistic concept (TPC)

Fig. 2. Temporal clustering for mapped heterogeneous sequences
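The following Python sketch outlines the clustering loop of Fig. 2 together with the likelihood-ratio distance of Definition 4.5. It is a simplified, single-feature illustration under our own naming (log_lik, lpc_distance, cluster_time_points), re-using the pvagg function sketched in Section 4.2 above, and is not the authors' implementation.

    import math

    def log_lik(observations, p):
        # Log-likelihood of Definition 4.5: each observation is a partial-value set,
        # p maps each global value to its fitted probability (e.g. from pvagg above).
        return sum(math.log(sum(p[v] for v in S)) for S in observations)

    def lpc_distance(obs1, obs2, domain):
        # Likelihood-ratio distance d = l1 + l2 - l12 between two local probabilistic concepts.
        l1 = log_lik(obs1, pvagg(obs1, domain))
        l2 = log_lik(obs2, pvagg(obs2, domain))
        l12 = log_lik(obs1 + obs2, pvagg(obs1 + obs2, domain))
        return l1 + l2 - l12

    def cluster_time_points(columns, domain, threshold=5.99):
        # Greedy clustering of contiguous time points (simplified from Fig. 2: a new
        # time point is only compared with the most recent cluster).
        clusters = [[columns[0]]]
        for col in columns[1:]:
            pooled = [s for c in clusters[-1] for s in c]    # observations in the last cluster
            if 2 * lpc_distance(pooled, col, domain) < threshold:
                clusters[-1].append(col)                     # not significantly different: merge
            else:
                clusters.append([col])                       # significantly different: new cluster
        return clusters

The full procedure of Fig. 2 also compares a dissimilar time point with earlier clusters before opening a new one, which is how P14 and A rejoin the cluster containing E18 in Section 5; that step is omitted here for brevity.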

Example 4.6: We consider clustering for the LPCs in Example 4.5. Here the values for the first two time points (columns) are identical, so the distance d_12 is zero and we combine LPC_1 and LPC_2 to form LPC_12. We must now decide whether LPC_12 should be combined with LPC_3 or whether LPC_3 is part of a new local probabilistic concept. In this case:

    ℓ_12 = 8 log p_C, where p_C = 1, p_D = p_E = 0, so ℓ_12 = 0;
    ℓ_3 = 3 log p_C + log p_D, where p_C = 3/4, p_D = 1/4, p_E = 0, so ℓ_3 = -2.249;
    ℓ_123 = 11 log p_C + log p_D, where p_C = 11/12, p_D = 1/12, p_E = 0, so ℓ_123 = -3.442.


The distance between LPC_12 and LPC_3 is then 0 - 2.249 + 3.442 = 1.193. Since twice this value is inside the chi-squared threshold with 1 degree of freedom (3.84), we therefore decide to combine LPC_12 and LPC_3.

5 An Illustrative Example

We illustrate our discussion using sequences of gene expression data. These data are analysed in a number of papers, e.g. Michaels et al. [23] and D'haeseleer et al. [8], and are available at: http://stein.cshl.org/genome_informatics/expression/somogyi.txt.

The data contain sequences of 112 gene expressions for rat cervical spinal cord measured at nine developmental time points (E11, E13, E15, E18, E21, P0, P7, P14, A), where E = Embryonic, P = Postnatal and A = Adult. The continuous gene expressions were discretised by partitioning the expressions into three equally sized bins. This effects a smoothing of the time series without (hopefully) masking the underlying pattern. Associations between these gene expression time series were then identified using mutual entropy; the clusters based on this distance metric are described in detail in Michaels et al. [23].
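As an aside, the discretisation described above could be carried out along the following lines. This is our own sketch (using numpy), assuming that "equally sized" means equal-frequency (tertile) bins; it is not the preprocessing code actually used in [23].

    import numpy as np

    def discretise_equal_frequency(values, bins=3):
        # Assign each continuous expression value to one of `bins` equal-frequency bins,
        # returning symbolic codes 0, 1, 2, ... (0 = lowest third, 2 = highest third).
        values = np.asarray(values, dtype=float)
        edges = np.quantile(values, np.linspace(0, 1, bins + 1)[1:-1])
        return np.digitize(values, edges)

    # A hypothetical expression profile over the nine time points:
    profile = [0.1, 0.2, 1.4, 3.0, 3.2, 3.1, 2.0, 3.3, 3.4]
    print(discretise_equal_frequency(profile))   # array([0, 0, 0, 1, 2, 1, 1, 2, 2])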

We now use the cluster to learn the mappings that characterise it. Once we have succeeded in mapping the local sequence ontologies to a global ontology, we can derive local probabilistic concepts and a temporal probabilistic concept for each cluster. We consider cluster 2 to illustrate the approach. In this case the gene expression sequences are presented in Table 6. We note that codes 0, 1 and 2 are local references and may have different meanings in different genes.

Table 6. Gene expression sequences for Cluster 2.

Gene                   E11 E13 E15 E18 E21 P0 P7 P14 A
MgluR7 RNU06832        0   0   0   2   2   2  1  2   2
L1 S55536              0   0   1   2   2   2  1  1   1
GRa2 (Ý)               0   0   1   2   2   2  2  2   2
GRa5 (#)               0   0   1   2   2   2  2  2   2
GRg3 RATGABAA          0   0   1   2   2   2  2  2   1
MgluR3 RATMGLURC       0   0   1   2   2   2  1  2   2
NMDA2B RATNMDA2B       0   0   1   2   2   2  1  2   2
Statin RATPS1          0   0   1   2   2   2  2  2   2
MAP2 RATMAP2           0   0   2   2   2   2  2  2   2
Pre-GAD67 RATGAD67     0   0   2   2   2   2  2  2   2
GAT1 RATGABAT          0   0   2   2   2   2  2  2   2
NOS RRBNOS             0   0   2   2   2   2  2  2   1
GRa3 RNGABAA           0   0   2   2   2   2  1  2   2
GRg2 (#)               0   0   2   2   2   1  1  1   2
MgluR8 MMU17252        0   0   2   2   2   2  1  1   1
TrkB RATTRKB1          0   0   2   1   1   1  1  1   1
Neno RATENONS          0   1   2   2   2   2  2  2   2
GRb3 RATGARB3          0   1   2   2   2   2  2  2   1
TrkC RATTRKCN3         0   1   2   2   2   2  2  2   2
GAP43 RATGAP43         1   1   2   2   2   2  2  2   2
NAChRd RNZCRD1         1   2   0   0   0   0  0  0   0
Keratin RNKER19        2   0   0   0   1   0  0  0   0
Ins1 RNINS1            2   0   0   0   0   0  0  0   0
GDNF RATGDNF           2   2   0   0   1   1  1  0   0
SC6 RNU19140           2   2   0   0   0   0  0  0   0
Brm (I I)              2   2   1   1   1   1  1  1   0


The mappings for the data in Table 6 were then learned using the algorithm in Figure 1 in Section 3, and these mappings are shown in Table 7. Using these mappings the sequences were transformed to the global ontology, as illustrated in Table 8.

Using the iteration algorithm in Definition 4.4, partial value aggregates may be computed for each of the nine time points, and these are shown in Table 9, along with the log-likelihood values computed as in Definition 4.5.

The clustering algorithm described in Figure 2 is then applied, and the likelihood ratio test used at each stage to determine the similarity of the local probabilistic concepts. The first LPCs considered are E11 and E13. The combined data for E11 and E13 give the probabilistic partial value (p_0, p_1, p_2) = (0.961, 0, 0.039) and a corresponding log-likelihood value of ℓ_12 = -8.438. The distance between E11 and E13 is then measured as d_12 = ℓ_1 + ℓ_2 - ℓ_12, where ℓ_1 = 0 and ℓ_2 = -6.969, giving 2 d_12 = 2.937 < 5.99 (the critical value for the chi-squared test with 2 degrees of freedom at a 95% significance level). Hence E11 and E13 are found to be not significantly different from each other, and are therefore combined to form a new cluster.

Table 7. The mappings to the global ontology for the sequences in Table 6.

Local ontology value:   0     1     2     (entries are the corresponding global ontology values;
rows follow the gene order of Table 6; * denotes a local value that does not occur in that sequence)
0   1    2
0   2    2
0   0    2
0   0    2
0   0,2  2
0   0,1  2
0   0,1  2
0   0    2
0   *    2
0   *    2
0   *    2
0   2    2
0   1    2
0   2    2
0   2    2
0   2    0
0   0    2
0   0,2  2
0   0    2
*   0    2
2   0    0
2   2    0
2   *    0
2   2    0
2   *    0
2   2    0

At the next stage, E15 is compared with this new cluster. In this case we must compute the probabilistic partial value for the combined data from E11, E13 and E15, giving (p_0, p_1, p_2) = (0.743, 0, 0.257) and a corresponding log-likelihood value of ℓ_(12)3 = -43.808. The distance between the cluster {E11, E13} and E15 is then d = ℓ_12 + ℓ_3 - ℓ_(12)3 = -8.438 - 14.824 + 43.808, giving 2d = 41.093 > 5.99. Hence E15 is found to be significantly different from the cluster {E11, E13} and is not combined. At the next stage, therefore, E18 is compared with E15 and found to be significantly different. E18 is also significantly different from the {E11, E13} cluster, and hence forms the start of a new cluster. Subsequently, {E18, E21, P0, P14, A} and {P7} are found to be clusters, along with {E11, E13} and {E15}. These clusters are then characterised by the local probabilistic concepts {{E11, E13}, (0.961, 0, 0.039)}, {{E15}, (0.28, 0, 0.72)}, {{E18, E21, P0, P14, A}, (0, 0, 1)} and {{P7}, (0, 0.154, 0.846)} respectively.

6 Summary and Further Work

We have described a methodology for describing and learning temporal concepts from heterogeneous sequences that have the same underlying temporal pattern. The data are heterogeneous with respect to classification schemes where the class values differ between sequences. However, because the sequences relate to the same underlying concept, the mappings between values may be learned. On the basis of these mappings we use statistical learning methods to describe the local probabilistic concepts. A temporal probabilistic concept that describes the underlying pattern is then learned. This concept may be matched with known genetic processes and pathways.

Table 8. The sequences in Table 6 mapped using the transformations in Table 7

Gene                   E11 E13 E15  E18 E21 P0 P7   P14 A
mGluR7_RNU06832        0   0   0    2   2   2  1    2   2
L1_S55536              0   0   2    2   2   2  2    2   2
GRa2_(Ý)               0   0   0    2   2   2  2    2   2
GRa5_(#)               0   0   0    2   2   2  2    2   2
GRg3_RATGABAA          0   0   0,2  2   2   2  2    2   0,2
mGluR3_RATMGLURC       0   0   0,1  2   2   2  0,1  2   2
NMDA2B_RATNMDAB        0   0   0,1  2   2   2  0,1  2   2
statin_RATPS1          0   0   0    2   2   2  2    2   2
MAP2_RATMAP2           0   0   2    2   2   2  2    2   2
pre-GAD67_RATGAD67     0   0   2    2   2   2  2    2   2
GAT1_RATGABAT          0   0   2    2   2   2  2    2   2
NOS_RRBNOS             0   0   2    2   2   2  2    2   2
GRa3_RNGABAA           0   0   2    2   2   2  1    2   2
GRg2_(#)               0   0   2    2   2   2  2    2   2
mGluR8_MMU17252        0   0   2    2   2   2  2    2   2
trkB_RATTRKB1          0   0   0    2   2   2  2    2   2
neno_RATENONS          0   0   2    2   2   2  2    2   2
GRb3_RATGARB3          0   0,2 2    2   2   2  2    2   0,2
trkC_RATTRKCN3         0   0   2    2   2   2  2    2   2
GAP43_RATGAP43         0   0   2    2   2   2  2    2   2
nAChRd_RNZCRD1         0   0   2    2   2   2  2    2   2
keratin_RNKER19        0   2   2    2   2   2  2    2   2
Ins1_RNINS1            0   2   2    2   2   2  2    2   2
GDNF_RATGDNF           0   0   2    2   2   2  2    2   2
SC6_RNU19140           0   0   2    2   2   2  2    2   2
Brm_(I_I)              0   0   2    2   2   2  2    2   2


Table 9. Partial probability aggregates and corresponding log likelihood values

                 E11  E13     E15      E18  E21  P0   P7       P14  A
p0               1    0.92    0.28     0    0    0    0        0    0
p1               0    0       0        0    0    0    0.154    0    0
p2               0    0.08    0.72     1    1    1    0.846    1    1
Log likelihood   0    -6.969  -14.824  0    0    0    -11.162  0    0

The approach is illustrated using data on gene expression sequences. Although this is a modest dataset, it serves to explain our approach and demonstrates the necessity of considering the possibility of a temporal concept for such problems, where there is an underlying temporal process involving staged development.

For the moment we have not considered performance issues, since the problem we have identified is both novel and complex. Our focus, therefore, has been on defining terminology and providing a preliminary methodology. In addition to addressing such performance issues, future work will also investigate the related problem of associating clusters with explanatory data; for example, our gene expression sequences could be related to the growth process.

Acknowledgement. This work was partially supported by the MISSION (Multi-agent Integration of Shared Statistical Information over the (inter)Net) project, IST project number 1999-10655, which is part of Eurostat's EPROS initiative funded by the European Commission.

References

1. Bassett, D.E. Jr., Eisen, M.B., Boguski, M.S.: Gene Expression Informatics - It's All in Your Mine. Nature Genetics Supplement 21 (1999) 51-55
2. Cadez, I., Gaffney, S., Smyth, P.: A General Probabilistic Framework for Clustering Individuals. In: Proc. ACM SIGKDD (2000) 140-149
3. Chen, A.L.P., Tseng, F.S.C.: Evaluating Aggregate Operations over Imprecise Data. IEEE Transactions on Knowledge and Data Engineering 8 (1996) 273-284
4. Demichiel, L.G.: Resolving Database Incompatibility: An Approach to Performing Relational Operations over Mismatched Domains. IEEE Transactions on Knowledge and Data Engineering 4 (1989) 485-493
5. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion). J. R. Statist. Soc. B 39 (1977) 1-38
6. Devaney, M., Ram, A.: Dynamically Adjusting Concepts to Accommodate Changing Contexts. In: Proc. ICML-96 Pre-Conference Workshop on Learning in Context-Sensitive Domains, Bari, Italy (1996)
7. D'haeseleer, P., Wen, X., Fuhrman, S., Somogyi, R.: Mining the Gene Expression Matrix: Inferring Gene Relationships from Large-Scale Gene Expression Data. In: Paton, R.C., Holcombe, M. (eds.): Information Processing in Cells and Tissues. Plenum Publishing (1998) 203-323
8. D'haeseleer, P., Liang, S., Somogyi, R.: Gene Expression Data Analysis and Modelling. In: Tutorial at the Pacific Symposium on Biocomputing (1999)
9. Doan, A.H., Domingues, P., Levy, A.: Learning Mappings between Data Schemes. In: Proc. AAAI Workshop on Learning Statistical Models from Relational Data, AAAI '00, Austin, Texas, Technical Report WS00006 (2000) 1-6


10. Fisher, D.H.: Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning 2 (1987) 139-172
11. Fisher, D.: Iterative Optimisation and Simplification of Hierarchical Clusterings. Journal of AI Research 4 (1996) 147-179
12. Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B., Zaiane, O.: DBMiner: A System for Mining Knowledge in Large Relational Databases. In: Simoudis, E., Han, J., Fayyad, U.M. (eds.): Proc. 2nd International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon (1996) 250-255
13. Han, J., Fu, Y.: Exploration of the Power of Attribute-oriented Induction in Data Mining. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusay, R. (eds.): Advances in Knowledge Discovery, AAAI Press / The MIT Press (1996) 399-421
14. Harries, M., Horn, K., Sammut, C.: Extracting Hidden Context. Machine Learning 32(2) (1998) 101-126
15. Harries, M., Horn, K.: Learning Stable Concepts in a Changing World. In: Antoniou, G., Ghose, A., Truczszinski, M. (eds.): Learning and Reasoning with Complex Representations. Lecture Notes in AI, Vol. 1359. Springer-Verlag (1998) 106-122
16. Lim, E.-P., Srivastava, J., Shekher, S.: An Evidential Reasoning Approach to Attribute Value Conflict Resolution in Database Management. IEEE Transactions on Knowledge and Data Engineering 8 (1996) 707-723
17. Malvestuto, F.M.: The Derivation Problem for Summary Data. In: Proc. ACM-SIGMOD Conference on Management of Data, New York, ACM (1998) 87-96
18. Malvestuto, F.M.: A Universal-Scheme Approach to Statistical Databases containing Homogeneous Summary Tables. ACM Transactions on Database Systems 18 (1993) 678-708
19. McClean, S.I., Scotney, B.W., Shapcott, C.M.: Aggregation of Imprecise and Uncertain Information for Knowledge Discovery in Databases. In: Proc. 4th International Conference on Knowledge Discovery in Databases (KDD'98) (1998) 269-273
20. McClean, S.I., Scotney, B.W., Shapcott, C.M.: Incorporating Domain Knowledge into Attribute-oriented Data Mining. Journal of Intelligent Systems 6 (2000) 535-548
21. McClean, S.I., Scotney, B.W., Greer, K.R.C.: Clustering Heterogeneous Distributed Databases. In: Kargupta, H., Ghosh, J., Kumar, V., Obradovic, Z. (eds.): Proc. KDD Workshop on Knowledge Discovery from Parallel and Distributed Databases (2000) 20-29
22. McClean, S.I., Scotney, B.W., Shapcott, C.M.: Aggregation of Imprecise and Uncertain Information in Databases. Accepted, IEEE Trans. Knowledge and Data Engineering (2001)
23. Michaels, G.S., Carr, D.B., Askenazi, M., Fuhrman, S., Wen, X., Somogyi, R.: Cluster Analysis and Data Visualisation of Large-Scale Gene Expression Data. Pacific Symposium on Biocomputing 3 (1998) 42-53
24. Mitchell, T.: Machine Learning. New York: McGraw Hill (1997)
25. Scotney, B.W., McClean, S.I.: Efficient Knowledge Discovery through the Integration of Heterogeneous Data. Information and Software Technology (Special Issue on Knowledge Discovery and Data Mining) 41 (1999) 569-578
26. Scotney, B.W., McClean, S.I., Rodgers, M.C.: Optimal and Efficient Integration of Heterogeneous Summary Tables in a Distributed Database. Data and Knowledge Engineering 29 (1999) 337-350
27. Talavera, L., Béjar, J.: Generality-based Conceptual Clustering with Probabilistic Concepts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2) (2001) 196-206
28. Tseng, F.S.C., Chen, A.L.P., Yang, W.-P.: Answering Heterogeneous Database Queries with Degrees of Uncertainty. Distributed and Parallel Databases 1 (1993) 281-302
29. Vardi, Y., Lee, D.: From Image Deblurring to Optimal Investments: Maximum Likelihood Solutions for Positive Linear Inverse Problems (with discussion). J. R. Statist. Soc. B (1993) 569-612


Handling Uncertainty in a Medical Study of Dietary Intake during Pregnancy

Adele Marshall1, David Bell2, and Roy Sterritt2

1 Department of Applied Mathematics and Theoretical Physics,
Queen's University of Belfast, Belfast, BT9 5AH, UK
[email protected]
2 School of Information and Software Engineering, Faculty of Informatics,
University of Ulster, Jordanstown Campus, Newtownabbey, BT37 0QB
{da.bell, r.sterritt}@ulst.ac.uk

Abstract. This paper is concerned with handling uncertainty as part of the analysis of data from a medical study. The study is investigating connections between the birth weight of babies and the dietary intake of their mothers. Bayesian belief networks were used in the analysis. Their perceived benefits include (i) an ability to represent the evidence emerging from the evolving study, dealing effectively with the inherent uncertainty involved; (ii) providing a way of representing evidence graphically to facilitate analysis and communication with clinicians; (iii) helping in the exploration of the data to reveal undiscovered knowledge; and (iv) providing a means of developing an expert system application.

1 Introduction

This paper is concerned with handling uncertainty as part of the analysis of data from a medical study. The study is recording and analysing details of pregnant women at 28 weeks gestation to examine the relationship between the dietary intake of a mother and the birth weight of her baby. The study is an extension of a major new medical research programme, HAPO (Hyperglycaemia and Adverse Pregnancy Outcome), currently underway at the Royal Victoria Hospital, Belfast.

This paper describes the development of a decision support system (expert system) that will deal with (i) uncertainty in the dietary and associated data; (ii) preliminary analysis of the data; and (iii) collection of the data in the medical study itself.

1.1 Background and Previous Work

Consumption of a nutritionally adequate diet is of particular importance during pregnancy and has a considerable influence on birth outcome. Previous research has shown that sudden and severe restriction of energy and protein intake during pregnancy reduces birth weight by as much as 300g [1]. However, studies have also shown that an adequate energy and protein consumption may not be sufficient without the accompaniment of vitamins, minerals [2] and even fatty acids [3].

There is substantial evidence in the literature of a strong association between low birth weight and both an increased incidence of neonatal mortality [4] and higher neonatal morbidity [5]. Impairment of foetal growth and low birth weight due to inadequate maternal and foetal nutrition can also increase the risk of chronic diseases in adulthood [6]. Maternal diet not only influences the immediate outcome of pregnancy but also the longer-term health of her offspring [7], possibly influencing susceptibility to diseases such as ischaemic heart disease, hypertension and Type 2 diabetes.

Additionally, there is evidence in the literature to suggest a relationship between high birth weight, caused by excessive foetal growth (macrosomia), and potential risks such as difficulties in childbirth and intensive postnatal care. These problems are most commonly associated with diabetic pregnancy. This has led to numerous attempts to relate complications in pregnancy and birth outcome to the level of maternal glycaemia [8], [9]. However, these studies have not been able to identify a possible threshold level of glycaemia above which there is a high risk of macrosomia.

1.2 The HAPO Medical Study

The US National Institutes of Health (NIH) has recently approved an extensive international HAPO (Hyperglycaemia and Adverse Pregnancy Outcome) study in 16 key centres around the world. The Royal Victoria Hospital is one of these centres. Each centre will recruit 1500 pregnant women to the study over a two-year period.

Research is primarily concerned with the following hypothesis, used to investigate the association between maternal blood glucose control and pregnancy outcome.

Hypothesis: Hyperglycaemia in pregnancy is associated with an increased risk of adverse maternal, foetal and neonatal outcomes.

In addition to the specified HAPO variables, the Royal Victoria Hospital will be collecting information on the dietary intake of the pregnant women in the study. Information gathered will include: anthropometric measures, socio-economics, family history, metabolic status, food frequency and diet, and measures of pregnancy outcomes. The details are recorded at 28 weeks gestation during an interview where a food frequency questionnaire is completed. Additional information is gathered about the pregnancy outcome and details recorded on the newborn baby.

The calendar of events for the HAPO study is shown in Figure 1. It shows the initial recruitment and data collection using the food frequency questionnaire followed by the pregnancy outcome variables and home visit follow-up. A follow-up is also scheduled to record information on the general health of the baby when 2-3 years old.

Fig. 1. HAPO event calendar.

The food frequency questionnaire was designed to gather relevant information on the dietary intake of the pregnant mothers. It is responsible for collecting information on the frequency with which pregnant women eat food in each of the various food groups. There is a high level of complexity and uncertainty involved in determining the nutritional value and content of various different foods, in addition to capturing the effect of variations between manufacturers and products. Following consultation with experts, it was decided that the food frequency questionnaire should deal with this problem by focusing on groups of food types such as cereals, meat/fish and so on, as summarised in Figure 2.

Fig. 2. Food groups analysed in the HAPO study.

The food groups may be associated with different nutritional levels. For example, meat and fish will have a strong association with the nutritional influence of proteins, whereas cereals will be strongly associated with folic acid, a nutrient that plays an important role in the prevention of neural tube defects.

There are five parts to the study, evaluating:
1. Dietary intake in pregnancy compared to non-pregnant women of childbearing age.
2. Links between diet and lifestyle/socio-economic factors.
3. Links between diet and maternal/foetal glucose.
4. Links between dietary intake and pregnancy outcome.
5. Possible follow-up of links between dietary intake in pregnancy and both maternal and baby outcomes at two years.
Part 1 is a control to assess whether there is a significant difference in the diet of pregnant women in relation to other women. Parts 2 to 5 can be supported with the introduction and development of a decision support (expert) system for assessing the dietary variables for pregnant women.

1.3 Expert Systems

An expert system is a computer program that is designed to solve problems at a level comparable to that of a human expert in a given domain [10]. Rule-based expert systems have achieved particular success in medical applications [11], largely because the decisions that are made can be traced and understood by domain experts. However, their development has suffered due to the difficulty of gathering relevant information, an inability to handle uncertainty adequately, and the maintenance problem of keeping the stored knowledge up-to-date [12].

Another possible AI technique that can be used within an expert system is neural networks. Unfortunately, clinicians and healthcare providers have been reluctant to accept this approach because neural networks provide little insight into how they draw their conclusions, leading to a lack of confidence in the decisions made [13].

Bayesian belief networks (BBNs) are another possibility. A key advantage is their ability to represent potentially causal relationships between variables, making them ideal for representing causal medical knowledge. The graphical nature of BBNs also makes them attractive for communication between medical and non-medical researchers. BBNs allow clinicians a better insight into the workings of the model, thereby improving their confidence in the process and the decisions made.

Examples of relevant work in this area include:
- Research into the graphical conditional independence representation of BBNs as an expert system for medical applications [14].
- The application of BBNs in the treatment of kidney stones [15].
- The diagnosis of patients suffering from dyspnoea (shortness of breath) [16].
- The representation of medical knowledge [17].
- Assistance in the assessment of breast cancer [21].
- Assessing the likelihood of a baby suffering from congenital heart disease [22].
In addition, various other applications have been developed in biology for prediction and diagnosis [13].

2 Research and Methodology: Handling Uncertainty

2.1 Bayesian Belief Networks

Bayesian belief networks (BBNs) are probabilistic models which represent potential causal relationships in the form of a graphical structure [18], [19]. There has been much discussion and concern over the use of the term causality. In the context of BBNs, causal is used in the representation of relationships which have the potential to be causal in nature, that is, where one variable directly influences another. [20] explain this reasoning by stating that it is rare that firm conclusions about causality can be drawn from one study; rather, the objective is to provide representations of data that are potentially causal, those that are consistent with, or suggestive of, a causal interpretation. For the potential causal relationships to be considered truly causal, external intervention is required to attach an overall understanding to the whole process of generating the data and the deeper concepts involved.

A BBN consists of a set of variables and a set of relationships between them, where each variable is a set of mutually exclusive events. Relationships are represented in graphical form. The graphical models use Bayes’ Theorem, which gives the probability of an event occurring given that some previous event has already taken place. Uncertainty in the data can therefore be handled appropriately by the probabilistic nature of the BBN.

The structure of the BBN is formed by nodes and arrows that represent the variables and causal relationships, presented as a directed acyclic graph. An arrow or directed arc, in which two nodes are connected by an edge, indicates that one directly influences the other. Attached to these nodes are the probabilities of various events occurring. This ability to capture potential causality, supported by the probabilities of Bayes’ Theorem, makes the BBN appealing to use.
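As an illustration of how such a network encodes uncertainty, the sketch below represents the single relationship of Figure 3 (protein intake directly influencing birth weight) as a two-node BBN and inverts it with Bayes’ theorem. The probability values are illustrative placeholders only, not figures from the study.

```python
# Minimal sketch of a two-node BBN: protein intake -> birth weight.
# All probabilities are invented for illustration.

# Prior over the parent node.
p_protein = {"low": 0.3, "adequate": 0.7}

# Conditional probability table for the child node given the parent.
p_weight_given_protein = {
    "low":      {"low_birth_weight": 0.40, "normal": 0.60},
    "adequate": {"low_birth_weight": 0.10, "normal": 0.90},
}

def posterior_protein(observed_weight):
    """p(protein level | observed birth weight) by Bayes' theorem."""
    joint = {
        level: p_protein[level] * p_weight_given_protein[level][observed_weight]
        for level in p_protein
    }
    evidence = sum(joint.values())
    return {level: joint[level] / evidence for level in joint}

print(posterior_protein("low_birth_weight"))
# -> {'low': 0.63..., 'adequate': 0.36...}
```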

2.2 Development Process of the BBN

There are many ways of constructing a BBN, according to the amount of expert knowledge and data available. If there is significant uncertainty in the expert knowledge available, learning of the BBNs may only occur through induction from the data. Alternatively, a high contribution of expert advice may lead to the initial development of the BBN structure originating entirely from the experts, with probabilities attached from the data.

A purely expert-driven approach is unattractive because of the basic difficulty in acquiring knowledge from experts. Such a problem can be alleviated by supplying experts with data-induced relationships. A pure data approach would often seem ideal, providing the opportunity of discovering new knowledge along with the potential for full automation. However, the data approach will not always capture every possible circumstance and still demands expert or human assistance for interpretation purposes: a discovery only becomes knowledge with human interpretation. It is between these two extremes that most developments lie, but that in turn makes the development process unclear.

BBN development therefore combines a mixture of expert consultation and learning from data. The objective is to achieve the best of both worlds and minimise the disadvantages of each source, thus reducing the uncertainty. Generally, the approach consists of three main stages:

HUMAN: Probabilities and structure defined by consultation with experts and literature.
SYSTEM: Learning of structure and probabilities from data.
COMBINE: Knowledge base amended with discoveries and probabilities obtained from the data, or the structure induced from data adjusted to include expert reasoning.


As the intention is to investigate the nature of BBNs as a research development technique, one possible approach is to run the system and human stages in parallel without any collaboration and compare the outcomes in the combine stage. Alternatively, BBNs may be developed using any combination of the three development components on the first data set. Then, as the data collection continues, the process of developing BBNs can also continue, with various BBNs developed for the growing data set. This exercise will indicate the benefits of using BBNs as a research development technique as more and more discoveries on the data are obtained.

The focus of this paper is the system component of the development process, in which the structure and probabilities are derived from analysing the study questionnaires for implicit causal relationships. This will be followed by a second-stage derivation of a BBN from the experts involved in the study, with the final outcome being a comparison of the resulting BBNs.

In reality, the process, whether human-system-combine or system-human-combine, is similar to an evolutionary development process, since a re-examination or re-learning of the system will be required at significant stages throughout the study to include new data. This process should facilitate fine-tuning in the development of the causal network.

2.3 Expected BBN Formulated from Literature

The research literature reveals various potential causal relationships that can be represented in a BBN structure. For example, one simple causal relationship [1] is the direct influence of the level of proteins in the diet of the mother on the final birth weight of the baby. This may be represented as part of a BBN model as shown in Figure 3.

Fig. 3. BBN representation of a causal relationship.

The hypothetical BBN in Figure 4 represents potential causal relationships induced from consultation of the literature and the study domain. The model considers the foods classified into their basic food type. The arrows indicate direct influences. For example, in this model, the food groups proteins, vitamins and minerals, and fatty acids will all have potential causal influences on the birth weight of the baby.


Fig. 4. A hypothetical BBN for the HAPO study.

The birth weight of the baby will in turn have a direct influence on its survival. For example, a baby with a very low birth weight may be more likely to die. Also taken into account in the BBN structure is the influence of baby weight on variables obtained from the follow-up study, for example the occurrence of adult chronic diseases such as ischaemic heart disease. At the current stage in the data collection, it is only possible to hypothesise about such relationships, as follow-up variables will not be available to the research team until at least 2003 and throughout the babies’ development into children and adulthood.

3 Preliminary Results

The medical study has now started, with the first set of participants’ details being recorded. Currently, the study has recruited and interviewed 294 women, 108 of whom have had their babies. Preliminary statistical analysis has been carried out on this initial data set. The pregnancy outcomes include measures recorded for the new-born baby along with additional information such as delivery type and the mother’s condition. Baby information includes the baby weight, length, head circumference and sex. In this sample of data, the birth weights of the 108 babies ranged from 1.9kg to 4.93kg (4.18-10.85 lbs.) with an average of 3.51kg (7.7 lbs.). Variables identified as having a direct influence on baby weight include the frequencies at which pasta, bread and potatoes are consumed. Other potential influencing factors are the number of cigarettes smoked, the number of children that the mother has already, and whether there are any relatives who have diabetes. The statistical analysis performed on the data set has indicated some useful observations on the variables that have a possible influence on the baby outcomes. To investigate these relationships further, it would be useful to construct a BBN.


3.1 Resulting BBNs

The focus of this paper is the development of an initial BBN in which the structure and probabilities are derived from analysing the data collected from the study questionnaires. It is hoped that this will be followed by a second-stage derivation of a BBN from the experts involved in the study.

The BBNs are constructed using the PowerConstructor package [23]. PowerConstructor takes advantage of Chow and Liu’s algorithm [24], which uses mutual information for learning causal relationships, and enhances the method with the addition of further procedures to form a three-stage process of structure learning from the data. The first phase (drafting) of the PowerConstructor software utilises the Chow-Liu algorithm for identifying strong dependencies between variables by the calculation of mutual information. The second stage (thickening) performs conditional independence (CI) tests on pairs of nodes that were not included in the first stage. Stage 3 (thinning) then performs further CI tests to ensure that all edges that have been added are necessary. This three-stage approach manages to keep to one CI test per decision on an edge throughout each stage and has a favourable time complexity of O(N²), unlike many of its competitors, which have exponential complexity.
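To make the drafting phase concrete, the sketch below computes the pairwise mutual information on which that phase ranks candidate edges. It is an illustrative reimplementation of the general idea only, not the PowerConstructor code; the record format and the threshold parameter are our own assumptions.

```python
# Pairwise mutual information, the quantity used in the drafting phase to
# rank candidate edges. `data` is a list of records mapping attribute -> value.
from collections import Counter
from math import log2

def mutual_information(data, a, b):
    n = len(data)
    pa = Counter(r[a] for r in data)
    pb = Counter(r[b] for r in data)
    pab = Counter((r[a], r[b]) for r in data)
    mi = 0.0
    for (va, vb), n_ab in pab.items():
        p_ab = n_ab / n
        mi += p_ab * log2(p_ab / ((pa[va] / n) * (pb[vb] / n)))
    return mi

def draft_edges(data, attributes, threshold=0.01):
    """Rank attribute pairs by mutual information and keep the strongest
    dependencies as a first draft of the network's edges."""
    pairs = [(a, b) for i, a in enumerate(attributes) for b in attributes[i + 1:]]
    scored = [(mutual_information(data, a, b), a, b) for a, b in pairs]
    return sorted((s for s in scored if s[0] > threshold), reverse=True)
```

The thickening and thinning stages would then add or remove edges on the basis of conditional independence tests, as described above.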

Preliminary analysis was carried out using the PowerConstructor package on the food frequency variables for the first 108 pregnant mothers, along with the outcome variable, the baby’s birth weight. The BBN in Figure 5 was induced from the data.

Fig. 5. Initial BBN for the birth weight outcome and food frequency variables using [23].

As before, the variables are represented by ovals in the structure, while the edges between the ovals represent potential causal relationships between the variables. From inspection of the BBN, it is apparent that there are many inter-relationships between the dietary intake variables, but already at this early stage in the data collection a relationship is emerging for the baby’s birth weight. This may be made clear by removing some of the less significant variables and repeating the induction of the BBN. The resulting BBN is shown in Figure 6.

Fig. 6. BBN representing the birth weight outcome along with food type variables using [23].

The relationship emerging from the BBN is that the baby’s birth weight seems to be directly influenced by the variety of bread the mother consumes during pregnancy. In fact, if the probabilities are considered, it is evident that the greater the variety of bread consumed, the greater the probability of a larger baby. In addition to this, the variable bread is in turn influenced by the consumption of puddings, which is in turn influenced by the consumption of fruit and vegetables, potatoes, pasta and rice, which are influenced by meat and cereal.

The BBN in Figure 6 captures some of the significant relationships affecting baby weight. However, as the data set grows, it is expected that the number of edges in the BBN will also increase. It is hoped that the BBNs will serve not only as a tool for the representation of the evolving model but also as a research development technique to aid discussion as the study progresses, helping to identify further causal relationships.

4 Conclusions and Observations

The paper has discussed the use of Bayesian belief networks (BBNs) for handling uncertainty in a medical study. The study is concerned with modelling the influencing factors of a mother’s dietary intake during pregnancy on the final birth outcomes of the baby. In particular, the baby’s birth weight is important as this may cause further complications for both the mother and baby.


A food frequency questionnaire designed to record the frequencies of consumption of various different foods is being used to collect dietary information. There is a high level of complexity and uncertainty involved in determining the nutritional value and content of various different foods and in capturing the effect of variations between manufacturers and products. In addition to developing a system that can handle such uncertainty, the development project itself is a research project with undiscovered knowledge and unproven hypotheses.

The objectives are to develop a system that can represent relationships between dietary and outcome variables and handle uncertainty, while also producing results in a way that can be understood by clinicians. BBNs seem to be an appropriate technique for such a challenge. Advantages of using them include their ability to represent potentially causal relationships, their visual graphical representation and their capability of dealing with uncertainty using probability theory.

The systems or software aspect of this project is to engineer an intelligent system. Ideally this would involve acquiring or learning from an environment with proven hypotheses; however, the medical research in this project is running in parallel, so there is uncertainty in the expert knowledge available. Thus, the process to engineer the system should also assist in expressing the evidence contained in the evolving study to the medical experts and the system designers, as well as exploring the study data as it is gathered for undiscovered knowledge.

The study is currently at the data collection stage and further evolution of the BBNs will follow. It is hoped that an analysis of this evolution will also provide interesting insights into the BBN development process.

Acknowledgements. The authors are greatly indebted to our collaborators Mrs. A. Hill (Nutrition and Dietetics, Royal Hospital Trust), Professor P.R. Flatt and Dr. M.J. Eaton-Evans (Faculty of Science, University of Ulster, Coleraine), Dr. D. McCance and Professor D.R. Hadden (Metabolic Unit, Royal Victoria Hospital).

References

1. Stein, Z., Susser, M.: The Dutch Famine, 1944-45, and the Reproductive Process. Pediatric Research 9 (1975) 70-76
2. Doyle, W., Crawford, M.A., Wynn, A.H.A., Wynn, S.W.: The Association Between Maternal Diet and Birth Dimensions. Journal of Nutrition and Medicine 1 (1990) 9-17
3. Olsen, S.F., Olsen, J., Frische, G.: Does Fish Consumption During Pregnancy Increase Fetal Growth? International Journal of Epidemiology 19 (1990) 971-977
4. Bakketeig, L.S., Hoffman, H.J., Titmuss Oakley, A.R.: Perinatal Mortality. In: Bracken, M.B. (ed.): Perinatal Epidemiology, Oxford University Press, New York, Oxford (1984) 99-151
5. Walther, F.J., Raemaekers, L.H.J.: Neonatal Morbidity of SGA Infants in Relation to their Nutritional Status at Birth. Acta Paediatrica Scandinavica 71 (1982) 437-440
6. Barker, D.J.P.: The Fetal and Infant Origins of Adult Disease. British Medical Journal, London (1992)
7. Mathews, F., Neil, H.A.W.: Nutrient Intakes During Pregnancy in a Cohort of Nulliparous Women. Journal of Human Nutrition and Dietetics 11 (1998) 151-161
8. Pettitt, D.J., Knowler, W.C., Baird, H.R., Bennett, P.H.: Gestational Diabetes: Infant and Maternal Complications in Relation to 3rd Trimester Glucose Tolerance in Pima Indians. Diabetes Care 3 (1980) 458-464
9. Sermer, M., Naylor, C.D., Gare, D.J.: Impact of Increasing Carbohydrate Intolerance on Maternal-fetal Intrauterine Growth Retardation. Human Nutrition and Clinical Nutrition 41C (1995) 193-197
10. Cooper, G.F.: Current Research Directions in the Development of Expert Systems Based on Belief Networks. Applied Stochastic Models and Data Analysis 5 (1989) 39-52
11. Miller, R.A.: Medical Diagnostic DSS - Past, Present and Future. JAMIA 1 (1994) 8-27
12. Bratko, I., Muggleton, S.: Applications of Inductive Logic Programming. Comm. ACM 38(11) (1995) 65-70
13. Lisboa, P.J.G., Ifeachor, E.C., Szczepaniak, P.S. (eds.): Artificial Neural Networks in Biomedicine. Springer (2000)
14. Andersen, L.R., Krebs, J.H., Damgaard, J.: STENO: An Expert System for Medical Diagnosis Based on Graphical Models and Model Search. Journal of Applied Statistics 18 (1991) 139-153
15. Madigan, D.: Bayesian Graphical Models for Discrete Data. Technical Report, University of Washington, Seattle (1993)
16. Lauritzen, S.L., Spiegelhalter, D.J.: Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems. Journal of the Royal Statistical Society B 50(2) (1988) 157-224
17. Korver, M., Lucas, P.J.F.: Converting a Rule-Based Expert System into a Belief Network. Med. Inform. 18(3) (1993) 219-241
18. Buntine, W.: Graphical Models for Discovering Knowledge. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.): Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press (1996) 59-82
19. Ramoni, M., Sebastiani, P.: Bayesian Methods for Intelligent Data Analysis. In: Berthold, M., Hand, D.J. (eds.): Intelligent Data Analysis: An Introduction, Springer, New York (1999)
20. Cox, D.R., Wermuth, N.: Multivariate Dependencies. Chapman and Hall (1996)
21. Hojsgaard, S., Thiesson, B., Greve, J., Skjoth, F.: BIFROST - A Program for Inducing Block Recursive Models from a Complete Database. Institute of Electronic Systems, Department of Mathematics and Computer Science, Aalborg University, Denmark (1992)
22. Spiegelhalter, D.J., Dawid, A.P., Lauritzen, S.L., Cowell, R.G.: Bayesian Analysis in Expert Systems. Statistical Science 8(3) (1993) 219-283
23. Cheng, J., Bell, D.A., Liu, W.: An Algorithm for Bayesian Network Construction from Data. In: Proc. 6th International Workshop on AI and Statistics (1997) 83-90
24. Chow, C.K., Liu, C.N.: Approximating Discrete Probability Distributions with Dependence Trees. IEEE Trans. Information Theory 14(3) (1968) 462-467


D. Bustard, W. Liu, and R. Sterritt (Eds.): Soft-Ware 2002, LNCS 2311, pp. 217–231, 2002. © Springer-Verlag Berlin Heidelberg 2002

Sequential Diagnosis in the Independence Bayesian Framework

David McSherry

School of Information and Software Engineering, University of Ulster, Coleraine BT52 1SA, Northern Ireland
[email protected]

Abstract. We present a new approach to test selection in sequential diagnosis (or classification) in the independence Bayesian framework that resembles the hypothetico-deductive approach to test selection used by doctors. In spite of its relative simplicity in comparison with previous models of hypothetico-deductive reasoning, the approach retains the advantage that the relevance of a selected test can be explained in strategic terms. We also examine possible approaches to the problem of deciding when there is sufficient evidence to discontinue testing, and thus avoid the risks and costs associated with unnecessary tests.

1 Introduction

In sequential diagnosis, tests are selected on the basis of their ability to discriminate between competing hypotheses that may account for an observed symptom or fault [1], [2], [3], [4]. In spite of the strong assumptions on which it is based, the independence Bayesian approach to diagnosis (also known as Naïve Bayes) often works well in practice, for example in domains such as the diagnosis of acute abdominal pain [5], [6]. Early approaches to test selection in the independence Bayesian framework were based on the method of minimum entropy, in which priority is given to the test that minimises the expected entropy of the distribution of posterior probabilities [2]. A similar strategy is the basis of the theory of measurements developed by de Kleer and Williams, who combined model-based reasoning with sequential diagnosis in a system for localising faults in digital circuits [1]. However, the absence of a specific goal in the minimum entropy approach means that the relevance of selected tests can be difficult to explain in terms that are meaningful to users.

In previous work, we have argued that for the relevance of tests to be explained in terms that are meaningful to users, the evidence-gathering process should reflect the approach used by human experts [7]. It was this aim that motivated the development of Hypothesist [3], [8], an intelligent system for sequential diagnosis in the independence Bayesian framework in which test selection is based on the evidence-gathering strategies used by doctors. Doctors are known to rely on hypothetico-deductive reasoning, selecting tests on the basis of their ability to confirm a target diagnosis, eliminate an alternative diagnosis, or discriminate between competing diagnoses [9], [10], [11].

Hypothesist is goal-driven in that the test it selects at any stage depends on the target hypothesis it is currently pursuing, which is continuously revised in the light of new evidence. In order of priority, its main test-selection strategies are confirm (confirm the target hypothesis in a single step), eliminate (eliminate the likeliest alternative hypothesis), and validate (increase the probability of the target hypothesis).

A major advantage of the approach is that the relevance of a test can be explained in terms of the strategy it is selected to support. However, there is considerable complexity associated with the co-ordination of Hypothesist’s test-selection strategies. For example, one reason for giving priority to the confirm strategy is that the measure used to select the most useful test in the validate strategy can be applied only to tests that do not support the confirm strategy. A potential drawback of this policy is that the system may sometimes be forced to select from tests that only weakly support the confirm strategy when strategies of lower priority may be more strongly supported by the available tests. Similar problems have been identified in a multiple-strategy approach to attribute selection in decision-tree induction [12].

In this paper, we present a new approach to test selection in sequential diagnosis and classification that avoids the complexity of Hypothesist’s multiple-strategy approach while retaining the advantage that the relevance of a selected test can be explained in strategic terms.

We also examine possible approaches to the problem of deciding when to discontinue the testing process. The ability to recognise when there is sufficient evidence to support a working diagnosis, and thus avoid the risks and costs associated with unnecessary tests, is an important aspect of diagnostic reasoning [10]. Other good reasons for minimising the length of problem-solving dialogues in intelligent systems include avoiding frustration for the user, minimising network traffic in Web-based applications, and reducing the length of explanations of how a conclusion was reached [13], [14], [15]. On the other hand, minimising dialogue length in sequential diagnosis must be balanced against the risk of accepting a diagnosis before it is fully verified, an error sometimes referred to as premature closure [10], [16].

In Section 2, we present an intelligent system prototype called VERIFY in which Hypothesist’s multiple-strategy approach to test selection is replaced by the single strategy of increasing the probability of a target hypothesis. In Section 3, we examine VERIFY’s ability to explain the relevance of tests when applied to a well-known classification task. In Section 4, we describe how the trade-off between unnecessary prolongation of the testing process and avoiding the risk of premature closure is addressed in VERIFY. Related work is discussed in Section 5 and our conclusions are presented in Section 6.

2 Sequential Diagnosis in VERIFY

The model of hypothetico-deductive reasoning on which VERIFY is based can be applied to any domain of diagnosis (or classification) for which the following are available: a list of hypotheses to be discriminated (i.e. possible diagnoses or outcome classes) and their prior probabilities, a list of relevant tests (or attributes), and the conditional probabilities of their results (or values) in each hypothesis. VERIFY repeats a diagnostic cycle of selecting a target hypothesis, selecting the most useful test, asking the user for the result of the selected test, and updating the probabilities of all the competing hypotheses in the light of the new evidence.

The approach is based on the usual assumptions of the independence Bayesian framework. The hypotheses H1, H2, ..., Hn to be discriminated are assumed to be mutually exclusive and exhaustive, and the results of tests are assumed to be conditionally independent in each hypothesis. The test selected by VERIFY at any stage of the evidence-gathering process depends on the target hypothesis (diagnosis or outcome class) it is currently pursuing. The target diagnosis is the one that is currently most likely based on the evidence previously reported. As in Hypothesist, the target diagnosis is continually revised in the light of new evidence. As new evidence is obtained, the probability of the target hypothesis Ht is sequentially updated according to the independence form of Bayes’ theorem:

$$p(H_t \mid E_1, E_2, \ldots, E_r) = \frac{p(E_1 \mid H_t)\, p(E_2 \mid H_t) \cdots p(E_r \mid H_t)\, p(H_t)}{\sum_{i=1}^{n} p(E_1 \mid H_i)\, p(E_2 \mid H_i) \cdots p(E_r \mid H_i)\, p(H_i)}$$

where E1 is the most recently reported test result and E2, ..., Er are previously reported test results. The probability of each competing hypothesis is similarly updated.
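A minimal sketch of this sequential update, assuming the data structures shown (a dictionary of current probabilities and a dictionary of likelihoods for the newly reported result), is given below; it is not VERIFY's implementation. The example numbers anticipate the contact lenses priors of Table 1 in Section 2.2 and reproduce the revised probabilities quoted there.

```python
# Sequential naive Bayes update: hypotheses are mutually exclusive and
# exhaustive, and test results are conditionally independent given each
# hypothesis.

def update(probabilities, likelihoods):
    """Revise p(H | evidence so far) after one new test result.

    probabilities: dict hypothesis -> current probability
    likelihoods:   dict hypothesis -> p(new result | hypothesis)
    """
    joint = {h: probabilities[h] * likelihoods[h] for h in probabilities}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}

# Observing a normal tear production rate in the contact lenses domain:
priors = {"none": 0.63, "soft": 0.21, "hard": 0.17}
p_normal_tpr = {"none": 0.20, "soft": 1.00, "hard": 1.00}
print(update(priors, p_normal_tpr))
# -> roughly {'none': 0.25, 'soft': 0.42, 'hard': 0.33}
```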

2.1 Evidential Power

The measure of attribute usefulness used in VERIFY to select the attribute that most strongly supports its strategy of increasing the probability of the target hypothesis is called evidential power.

Definition 1. The evidential power of an attribute A in favour of a target hypothesis Ht is

$$\Phi(A, H_t, \varepsilon) = \sum_{i=1}^{n} p(A = v_i \mid H_t)\; p(H_t \mid A = v_i, \varepsilon)$$

where v1, v2, ..., vn are the values of A and ε is the evidence, if any, provided by previous test results.

Of course, we do not suggest that such a probabilistic measure is used by doctors. Where no test results have yet been reported, we will simply write Φ(A, Ht) instead of Φ(A, Ht, ε). At each stage of the evidence-gathering process, the attribute selected by VERIFY is the one whose evidential power in favour of the target hypothesis is greatest. While increasing the probability of the target hypothesis is one of the attribute-selection strategies used in Hypothesist, the latter uses a different measure of attribute usefulness.
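A sketch of how this selection step could be implemented is shown below. The data structures (a dictionary of priors and a nested conditional probability table) and the function names are our own assumptions, not VERIFY's interface; the posterior is computed with the independence form of Bayes' theorem given earlier.

```python
# Evidential power (Definition 1) and greedy attribute selection.
# priors[h] = p(h); cpt[h][a][v] = p(a = v | h); evidence maps attribute -> value.

def posterior(hyp, priors, cpt, evidence):
    """p(hyp | evidence) under the independence assumptions."""
    def joint(h):
        p = priors[h]
        for a, v in evidence.items():
            p *= cpt[h][a][v]
        return p
    return joint(hyp) / sum(joint(h) for h in priors)

def evidential_power(attr, target, priors, cpt, evidence=None):
    """Phi(attr, target, evidence): sum over values v of
    p(attr = v | target) * p(target | attr = v, evidence)."""
    evidence = dict(evidence or {})
    return sum(
        cpt[target][attr][v] * posterior(target, priors, cpt, {**evidence, attr: v})
        for v in cpt[target][attr]
    )

def select_attribute(candidates, target, priors, cpt, evidence=None):
    """The candidate attribute with the greatest evidential power."""
    return max(candidates,
               key=lambda a: evidential_power(a, target, priors, cpt, evidence))
```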


2.2 Example Domain

The contact lenses data set is based on a simplified version of the optician’s real-world problem of selecting a suitable type of contact lenses, if any, for an adult spectacle wearer [17]. A point we would like to emphasise is that the problem of contact lens selection is used here merely as an example, and alternatives to Bayesian classification such as decision-tree induction may well give better results on the contact lenses data set, for example in terms of predictive accuracy. Outcome classes in the data set are no contact lenses, soft contact lenses, and hard contact lenses. Attributes in the data set and conditional probabilities of their values are shown in Table 1. The prior probabilities of the outcome classes are also shown.

Table 1. Prior probabilities of outcome classes and conditional probabilities of attribute values in the contact lenses data set

Contact lens type:          None    Soft    Hard
Prior probability           0.63    0.21    0.17

Age of patient
  young                     0.27    0.40    0.50
  pre-presbyopic            0.33    0.40    0.25
  presbyopic                0.40    0.20    0.25

Astigmatism
  present                   0.53    0.00    1.00
  absent                    0.47    1.00    0.00

Spectacle prescription
  hypermetrope              0.53    0.60    0.25
  myope                     0.47    0.40    0.75

Tear production rate
  normal                    0.20    1.00    1.00
  reduced                   0.80    0.00    0.00

Figure 1 below shows an example consultation in VERIFY in which the evidence-gathering process is allowed to continue until any hypothesis is confirmed or no further attributes remain. Initially the target hypothesis is no contact lenses, and the user is informed when it changes to soft contact lenses. In Section 4, we describe techniques for recognising when there is sufficient evidence to discontinue the evidence-gathering process, thus reducing the number of tests required, on average, to reach a solution.

When VERIFY is applied to the example domain, the target hypothesis, with a probability of 0.63, is initially no contact lenses. From Table 1,

p(age = young | no contact lenses) = 0.27
p(age = pre-presbyopic | no contact lenses) = 0.33
p(age = presbyopic | no contact lenses) = 0.40


By Bayes’ theorem,

$$p(\text{no contact lenses} \mid \text{age} = \text{young}) = \frac{0.27 \times 0.63}{0.27 \times 0.63 + 0.40 \times 0.21 + 0.50 \times 0.17} = 0.50$$

Similarly,

p(no contact lenses | age = pre-presbyopic) = 0.63

and

p(no contact lenses | age = presbyopic) = 0.75

The evidential power of age in favour of the target hypothesis is therefore:

Φ(age, no contact lenses) = 0.27 × 0.50 + 0.33 × 0.63 + 0.40 × 0.75 = 0.64
_____________________________________________________________________

VERIFY: The target hypothesis is no contact lenses

What is the tear production rate?

User: normal

VERIFY: The target hypothesis is soft contact lenses

Is astigmatism present?

User: no

VERIFY: What is the age of the patient?

User: young

VERIFY: What is the spectacle prescription?

User: hypermetrope

VERIFY: The surviving hypotheses and their posterior probabilities are: soft contact lenses (0.86), no contact lenses (0.14)

_____________________________________________________________________

Fig. 1. Example consultation in VERIFY

Similarly, the evidential powers of astigmatism, spectacle prescription and tear production rate in favour of no contact lenses are 0.63, 0.63, and 0.85 respectively. According to VERIFY, the most useful attribute is therefore tear production rate. When the user reports that the tear production rate is normal in the example consultation, the revised probabilities of no contact lenses, soft contact lenses, and hard contact lenses are 0.25, 0.42, and 0.33 respectively. For example,

$$p(\text{soft contact lenses} \mid \text{tear production rate} = \text{normal}) = \frac{1 \times 0.21}{0.20 \times 0.63 + 1 \times 0.21 + 1 \times 0.17} = 0.42$$

The target hypothesis therefore changes to soft contact lenses. VERIFY now chooses the most useful among the three remaining attributes. For example, the probabilities required to compute the evidential power of astigmatism in favour of soft contact lenses are now:


p(astigmatism = present | soft contact lenses) = 0
p(astigmatism = absent | soft contact lenses) = 1
p(soft contact lenses | astigmatism = present, tear production rate = normal) = 0

$$p(\text{soft contact lenses} \mid \text{astigmatism} = \text{absent}, \text{tear production rate} = \text{normal}) = \frac{1 \times 1 \times 0.21}{0.47 \times 0.20 \times 0.63 + 1 \times 1 \times 0.21 + 0 \times 1 \times 0.17} = 0.78$$

So,

Φ(astigmatism, soft contact lenses, tear production rate = normal) = 0 × 0 + 1 × 0.78 = 0.78

Similarly, the evidential powers of age and spectacle prescription in favour of soft contact lenses are now 0.43 and 0.45 respectively, so the next question that VERIFY asks the user is whether astigmatism is present. The remaining questions in the example consultation (age and spectacle prescription) are similarly selected on the basis of their evidential powers in favour of soft contact lenses.
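The hand calculations above can be checked with a few lines of code using the Table 1 figures; this is a verification of the worked example only, and the small rounding differences (for example 0.62 rather than 0.63 for the pre-presbyopic posterior) are artefacts of rounding in the text.

```python
# Reproducing p(no contact lenses | age = v) and Phi(age, no contact lenses)
# from the Table 1 probabilities.
priors = {"none": 0.63, "soft": 0.21, "hard": 0.17}
p_age = {
    "young":          {"none": 0.27, "soft": 0.40, "hard": 0.50},
    "pre-presbyopic": {"none": 0.33, "soft": 0.40, "hard": 0.25},
    "presbyopic":     {"none": 0.40, "soft": 0.20, "hard": 0.25},
}

def posterior(target, finding_probs):
    """p(target | finding) by Bayes' theorem, with no other evidence."""
    joint = {h: priors[h] * finding_probs[h] for h in priors}
    return joint[target] / sum(joint.values())

posteriors = {v: posterior("none", p_age[v]) for v in p_age}
phi_age_none = sum(p_age[v]["none"] * posteriors[v] for v in p_age)
print({v: round(p, 2) for v, p in posteriors.items()})  # {'young': 0.5, ...}
print(round(phi_age_none, 2))                           # 0.64
```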

___________________________________________________________________

tear production rate = reduced : no contact lenses (1)
tear production rate = normal
    astigmatism = present
        spectacle prescription = myope
            age of patient = young : hard contact lenses (0.88)
            age of patient = pre-presbyopic : hard contact lenses (0.75)
            age of patient = presbyopic : hard contact lenses (0.72)
        spectacle prescription = hypermetrope
            age of patient = young : hard contact lenses (0.69)
            age of patient = pre-presbyopic : no contact lenses (0.53)
            age of patient = presbyopic : no contact lenses (0.58)
    astigmatism = absent
        age of patient = young
            spectacle prescription = myope : soft contact lenses (0.82)
            spectacle prescription = hypermetrope : soft contact lenses (0.86)
        age of patient = pre-presbyopic
            spectacle prescription = myope : soft contact lenses (0.79)
            spectacle prescription = hypermetrope : soft contact lenses (0.83)
        age of patient = presbyopic
            spectacle prescription = myope : soft contact lenses (0.6)
            spectacle prescription = hypermetrope : soft contact lenses (0.67)
___________________________________________________________________

Fig. 2. Consultation tree for the example domain


2.3 Consultation Trees

The questions sequentially selected by VERIFY, the user’s responses, and the final solution can be seen to generate a path in a virtual decision tree of all possible consultations. The complete decision tree, shown in Fig. 2, provides an overview of VERIFY’s problem-solving behaviour in the example domain. The posterior probabilities of the solutions reached by VERIFY are shown in brackets. An important point to note is that such an explicit decision tree, which we refer to as a consultation tree, is not constructed by VERIFY. The example consultation tree was constructed off line by a process that resembles top-down induction of decision trees [18], except that there is no partitioning of the data set.
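A sketch of such an off-line construction process is given below: it recursively enumerates every question-answer path that the selection policy could generate. The helpers select_attribute(evidence) and posteriors(evidence) are assumed to exist (for example, built on the evidential power and Bayes update sketched earlier); they are not part of any published VERIFY interface.

```python
# Off-line construction of a consultation tree by enumerating all possible
# question-answer sequences. `values` maps each attribute to its possible values.

def build_consultation_tree(select_attribute, posteriors, values, evidence=None):
    evidence = dict(evidence or {})
    probs = posteriors(evidence)
    leading, p = max(probs.items(), key=lambda item: item[1])
    remaining = [a for a in values if a not in evidence]
    # Stop when a hypothesis is confirmed or no attributes remain.
    if p == 1.0 or not remaining:
        return {"solution": leading, "probability": round(p, 2)}
    attr = select_attribute(evidence)
    return {
        "question": attr,
        "branches": {
            v: build_consultation_tree(select_attribute, posteriors, values,
                                       {**evidence, attr: v})
            for v in values[attr]
        },
    }
```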

It can be seen from Fig. 2 that, in contrast to Cendrowska’s rule-based approach to classification in the example domain [17], only no contact lenses can be confirmed with certainty in the independence Bayesian framework. Test results on the path followed by VERIFY in the example consultation are shown in italics. Although the solution is soft contact lenses, its probability is only 0.86. It can also be seen from Fig. 2 that this is the maximum probability for soft contact lenses over all possible consultations. Similarly, the maximum probability of hard contact lenses over all consultations is 0.88. When applied to the contact lenses data set, VERIFY correctly classifies all but one example, a case of no contact lenses which it misclassifies as soft contact lenses with a probability of 0.6.

2.4 Findings That Always Support a Given Hypothesis

An important phenomenon in probabilistic reasoning is that certain test results may sometimes increase and sometimes decrease the probability of a given hypothesis, depending on the evidence provided by other test results [7], [16], [19]. For example, although the likelier spectacle prescription in no contact lenses is hypermetrope, it does not follow that this finding always increases the probability of no contact lenses. In the absence of other evidence, a spectacle prescription of hypermetrope does in fact increase the probability of no contact lenses from its prior probability of 0.63, since by Bayes’ theorem:

p(no contact lenses | spectacle prescription = hypermetrope) = 0.66

On the other hand, if astigmatism is known to be absent, then a spectacle prescription of hypermetrope decreases the probability of no contact lenses:

p(no contact lenses | astigmatism = absent) = 0.58
p(no contact lenses | spectacle prescription = hypermetrope, astigmatism = absent) = 0.55

The relevance of a test result whose effect on the probability of a target hypothesis varies not only in magnitude but also in sign is difficult to explain in other than case-specific terms. However, it is known that a test result always increases the probability of a target hypothesis, regardless of the evidence provided by other test results, if its conditional probability in the target hypothesis is greater than in any competing hypothesis [7], [16]. Such a test result is called a supporter of the target hypothesis [16]. The supporters of the hypotheses in the example domain can easily be identified from Table 1. For example, spectacle prescription = hypermetrope is a supporter of soft contact lenses, while spectacle prescription = myope always increases the probability of hard contact lenses.

Table 2 shows the supporters of each hypothesis in the example domain. As we shall see in Section 3, the ability to identify findings that always increase the probability of a given hypothesis helps to improve the quality of explanations in VERIFY. Certain test results may have more dramatic effects, such as confirming a target hypothesis in a single step or eliminating a competing hypothesis. A finding will confirm a hypothesis in a single step (and always has this effect) if it occurs only in that hypothesis. In medical diagnosis, such findings are said to be pathognomonic for the diseases they confirm [11]. It can be seen from Table 1 that while a reduced tear production rate always confirms no contact lenses, neither of the other hypotheses in the example domain can be confirmed in a single step. A finding eliminates a given hypothesis (and always has this effect) if it never occurs in that hypothesis. The presence of astigmatism can be seen to eliminate soft contact lenses, while the absence of astigmatism eliminates hard contact lenses.

Table 2. Supporters of the hypotheses in the example domain

no contact lenses        age = presbyopic
                         tear production rate = reduced

soft contact lenses      age = pre-presbyopic
                         astigmatism = absent
                         spectacle prescription = hypermetrope

hard contact lenses      age = young
                         astigmatism = present
                         spectacle prescription = myope
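Given a table of conditional probabilities such as Table 1, supporters, single-step confirmers and eliminators can be read off mechanically. The sketch below shows one way of doing so; the function names and the cpt structure (cpt[h][a][v] = p(a = v | h)) are our own, not VERIFY's.

```python
def is_supporter(cpt, finding, target):
    """A finding (attribute, value) always increases p(target) if its
    conditional probability in target exceeds that in every competitor."""
    a, v = finding
    return all(cpt[target][a][v] > cpt[h][a][v] for h in cpt if h != target)

def confirms_in_one_step(cpt, finding, target):
    """Pathognomonic finding: it occurs only in the target hypothesis."""
    a, v = finding
    return cpt[target][a][v] > 0 and all(
        cpt[h][a][v] == 0 for h in cpt if h != target)

def eliminates(cpt, finding, hypothesis):
    """A finding that never occurs in a hypothesis rules that hypothesis out."""
    a, v = finding
    return cpt[hypothesis][a][v] == 0
```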

3 Explanation in VERIFY

Before answering any question, the user can ask why it is relevant. VERIFY differs from Hypothesist in the way it responds to such requests for explanation. The explanation provided by Hypothesist depends on the strategy that the attribute (or test) was selected to support. In the confirm, eliminate and validate strategies, the user is shown the value of the attribute that will confirm the target hypothesis, eliminate the likeliest alternative hypothesis, or increase the probability of the target hypothesis [3].

Though having only a single strategy on which to base its explanations of attribute relevance, VERIFY is capable of providing explanations that closely resemble its predecessor’s. It always shows the user the value of the selected attribute that maximises the probability of the target hypothesis. If this value happens to be one that always confirms the target hypothesis, the user is informed of this effect. The user is similarly informed if the value that maximises the probability of the target hypothesis is one that always eliminates a competing hypothesis. If the maximising value has neither of these effects, VERIFY simply informs the user that the value increases the probability of the target hypothesis and whether it always has this effect; that is, whether it is a supporter of the target hypothesis.
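The priority ordering of these explanation messages can be sketched as follows; the wording mirrors the dialogue in Fig. 3, while the argument names and pre-computed flags are our own simplification rather than VERIFY's actual interface.

```python
def explain(attribute, value, target, confirms_target=False,
            eliminated_competitor=None, is_supporter=False):
    """Choose an explanation for asking about `attribute`, given facts about
    the value that maximises the probability of the target hypothesis."""
    if confirms_target:
        return f"If {attribute} is {value}, this finding always confirms {target}"
    if eliminated_competitor is not None:
        return (f"If {attribute} is {value}, this finding always eliminates "
                f"{eliminated_competitor}")
    if is_supporter:
        return (f"If {attribute} is {value}, this finding always increases "
                f"the probability of {target}")
    return (f"If {attribute} is {value}, this will increase the probability "
            f"of {target}")

# e.g. explain("tear production rate", "reduced", "no contact lenses",
#              confirms_target=True)
```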

Fig. 3 shows an example consultation in which VERIFY explains the relevance of its first two questions by showing the answers that will confirm the target hypothesis or eliminate a competing hypothesis. In the second half of the example consultation, VERIFY identifies two findings (age = young and spectacle prescription = hypermetrope) that will increase the probability of soft contact lenses, but only the latter can always be guaranteed to increase the probability of soft contact lenses. This distinction is reflected in the explanations that VERIFY provides.

___________________________________________________________________________
VERIFY: The target hypothesis is no contact lenses

What is the tear production rate?

User: why

VERIFY: If tear production rate is reduced, this finding always confirms no contact lenses

What is the tear production rate?

User: normal

VERIFY: The target hypothesis is soft contact lenses

Is astigmatism present?

User: why

VERIFY: If astigmatism is absent, this finding always eliminates hard contact lenses

Is astigmatism present?

User: no

VERIFY: What is the age of the patient?

User: why

VERIFY: If the age of the patient is young, this will increase the probability of soft contact lenses

What is the age of the patient?

User: young

VERIFY: What is the spectacle prescription?

User: why

VERIFY: If the spectacle prescription is hypermetrope, this finding always increases the probability of soft contact lenses

What is the spectacle prescription?

User: hypermetrope

VERIFY: The surviving hypotheses and their posterior probabilities are: soft contact lenses (0.86), no contact lenses (0.14)

____________________________________________________________________

Fig. 3. Explanation of reasoning in VERIFY


4 When Can Testing Be Discontinued?

In this section we examine possible approaches to the problem of deciding when there is sufficient evidence to discontinue the testing process in VERIFY. One simple strategy is to discontinue the evidence-gathering process when the probability of the leading (most likely) hypothesis reaches a predetermined level. However, a potential drawback in this approach is that the evidence provided by omitted tests may dramatically reduce the probability of the leading hypothesis [7], [16].

To illustrate this problem, Fig. 4 shows the consultation tree for the example domain that results if testing is discontinued in VERIFY when the probability of the leading hypothesis reaches 0.70. This simple criterion has the advantage that the user is asked at most two questions before a solution is reached. However, it can be seen from Fig. 2 that if testing is allowed to continue at the node shown in italics, there is one combination of results of the omitted tests (spectacle prescription = hypermetrope, age = presbyopic) that would reduce the probability of hard contact lenses to 0.42, making no contact lenses the most likely classification with a posterior probability of 0.58. Another combination of results reduces the probability of hard contact lenses to 0.47, once again with the additional evidence favouring no contact lenses. The trade-off for the reduction in consultation length is therefore that two cases of no contact lenses in the contact lenses data set are now misclassified as hard contact lenses.
____________________________________________________________________

tear production rate = reduced : no contact lenses (1)
tear production rate = normal
    astigmatism = present : hard contact lenses (0.71)
    astigmatism = absent : soft contact lenses (0.78)

Fig. 4. Consultation tree for the example domain with testing discontinued when the probability of the leading hypothesis reaches 0.70

As this example illustrates, a more reliable approach may be to discontinue testing only when the evidence in favour of the leading hypothesis is such that its probability cannot be less than an acceptable level regardless of the evidence that additional tests might provide [16]. In practice, what is regarded as an acceptable level is likely to depend on the cost associated with an incorrect diagnosis. The approach we have implemented in VERIFY is to discontinue testing only when both of the following conditions are satisfied:

(a) the probability of the leading hypothesis is at least 0.5;
(b) its probability can never be less than 0.5 if the consultation is allowed to continue.

Although a higher threshold may be appropriate in certain domains, (b) ensures that no competing hypothesis can ever be more likely, regardless of the results of additional tests that might be selected by VERIFY if the consultation is allowed to continue. Finding the minimum probability of the leading hypothesis involves a breadth-first search of the space of all possible question-answer sequences that can occur if the consultation is allowed to continue. Condition (a) ensures that this look-ahead search is attempted only if the leading hypothesis is a sufficiently promising candidate. As soon as a node is reached at which the probability of the target hypothesis falls below 0.5, the search is abandoned and the next most useful question is selected in the usual way. On the other hand, if the search reveals that the probability of the leading hypothesis can never fall below 0.5, VERIFY selects this hypothesis as the solution to the problem presented by the user. It shows the user the current probability of the leading hypothesis as well as its minimum possible probability as determined by look-ahead search.
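One way to realise this check is sketched below. It asks whether the leading hypothesis can fall below a threshold on any question-answer path the consultation might still follow; posteriors(evidence) and select_attribute(evidence) are assumed helpers, and a recursive search is used here for brevity where the text describes a breadth-first search.

```python
def can_fall_below(threshold, leading, posteriors, select_attribute,
                   values, evidence):
    """True if some continuation of the consultation reduces p(leading)
    below the threshold. `values` maps each attribute to its possible values."""
    p = posteriors(evidence)[leading]
    if p < threshold:
        return True                      # this continuation already defeats it
    remaining = [a for a in values if a not in evidence]
    if not remaining:
        return False
    attr = select_attribute(evidence)    # the question VERIFY would ask next
    return any(
        can_fall_below(threshold, leading, posteriors, select_attribute,
                       values, {**evidence, attr: v})
        for v in values[attr]
    )

# Testing can be discontinued when the leading hypothesis has probability
# >= 0.5 and can_fall_below(0.5, ...) is False.
```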

Figure 5 shows an example consultation illustrating VERIFY’s ability to recognise when there is sufficient evidence to discontinue testing. Although the leading hypothesis is initially no contact lenses with a prior probability of 0.63, VERIFY need only look ahead a single step to determine that its probability will be reduced to 0.25 if the tear production rate is normal. Look-ahead search is not attempted when the user reports that the tear production rate is in fact normal, as soft contact lenses, now the leading hypothesis, has a probability of only 0.42. Following the user’s answer to the next question, the leading hypothesis changes to hard contact lenses as its probability has increased to 0.71 (as we can see from Fig. 4).

Look-ahead search for the minimum possible probability of hard contact lenses is therefore initiated and reveals that its probability will fall to 0.42 if the spectacle prescription is hypermetrope and the age is presbyopic (as we can see from Fig. 2). As there is not yet sufficient evidence to discontinue the testing process, VERIFY now asks the user about the spectacle prescription. When the user reports that the spectacle prescription is myope, this increases the probability of hard contact lenses to 0.80, and look-ahead search now reveals that the age of the patient can at worst reduce its probability to 0.72. VERIFY therefore ends the consultation and reports that the likeliest solution is hard contact lenses.
____________________________________________________________________

VERIFY: The target hypothesis is no contact lenses
        What is the tear production rate?
User:   normal
VERIFY: The target hypothesis is soft contact lenses
        Is astigmatism present?
User:   yes
VERIFY: The target outcome class is hard contact lenses
        What is the spectacle prescription?
User:   myope
VERIFY: The surviving hypotheses and their probabilities are: hard contact lenses (0.80), no contact lenses (0.20). The probability of hard contact lenses cannot be less than 0.72.

____________________________________________________________________

Fig. 5. Recognising when there is sufficient evidence to discontinue testing

Figure 6 shows the consultation tree for the example domain with testing discontinued in VERIFY only when conditions (a) and (b) are satisfied. The tree was constructed off-line by a process that uses look-ahead search to determine if further testing may affect the solution reached by VERIFY. The reduced consultation tree has an average path length of 3 nodes compared with 3.8 for the full consultation tree.



5 Related Work

PURSUE is an algorithm for strategic induction of decision trees in which attribute selection is based on evidential power in the current subset of the data set [20]. However, the conditional probabilities on which an attribute's evidential power is based are continuously revised in PURSUE as the data set is partitioned. In contrast, there is no partitioning of the data set in VERIFY and no updating of conditional probabilities. VERIFY also differs from PURSUE in requiring no access to the data, if any, from which the prior and conditional probabilities were derived. It can therefore be applied to diagnosis and classification tasks in which the only available probabilities are subjective estimates provided by a domain expert.
____________________________________________________________________

tear production rate = reduced : no contact lenses (1)
tear production rate = normal
    astigmatism = present
        spectacle prescription = myope : hard contact lenses (0.8)
        spectacle prescription = hypermetrope
            age of patient = young : hard contact lenses (0.69)
            age of patient = pre-presbyopic : no contact lenses (0.53)
            age of patient = presbyopic : no contact lenses (0.58)
    astigmatism = absent : soft contact lenses (0.78)

____________________________________________________________________

Fig. 6. Consultation tree for the example domain with testing discontinued when the probability of the leading hypothesis cannot be less than 0.5

It is interesting to note the analogy between deciding when to discontinue the evidence-gathering process in sequential diagnosis and pruning a decision tree in the context of decision-tree induction. The trade-off between accuracy and simplicity of induced decision trees has been the subject of considerable research effort [13], [21], and lessons learned from this research may have important implications for sequential diagnosis in the independence Bayesian framework. For example, loss of accuracy resulting from drastic reductions of decision-tree size is often surprisingly small, while more conservative reductions can sometimes produce small but worthwhile improvements in accuracy [13]. Though tending to favour conservative reductions in dialogue length, VERIFY's policy of maintaining consistency with the solutions obtained in full-length dialogues also ensures that only reductions which have no effect on accuracy are allowed. Whether improvements in accuracy may be possible by relaxing the consistency requirement is an important issue to be addressed by further research.

In decision-tree induction, the choice of splitting criterion is known to affect the size of the induced decision tree before simplification, for example as measured by its average path length [13], [15]. Similarly, the test-selection strategy used in sequential diagnosis is likely to affect the number of tests required, on average, to reach a solution. Another issue to be addressed by further research is how evidential power compares with alternative measures such as entropy [2] in terms of average dialogue length before any reduction based on look-ahead search.

There is increasing awareness of the need for intelligent systems to support mixed-initiative dialogue [4], [22]. For example, intelligent systems are unlikely to be acceptable to doctors if they insist on asking the questions and ignore the user's opinion as to which symptoms are most relevant. In VERIFY, the user can volunteer data at any stage of the consultation without waiting to be asked. VERIFY updates the probabilities of each hypothesis in the light of the reported evidence, revises its target hypothesis if necessary, and proceeds to select the next most useful test in the usual way. Another issue to be addressed by system designers is the problem of incomplete data. Often in real-world applications there are questions that the user is unable to answer, for example because the answer requires an expensive or difficult test that the user is reluctant or unable to perform [4]. As in Hypothesist, a simple solution to the problem of incomplete data in VERIFY is to select the next most useful test if the user is unable (or declines) to answer any question.

Efficiency of test selection in VERIFY is unlikely to deteriorate significantly when the system is applied to domains with larger numbers of tests than the example domain. However, a practical limitation of the proposed strategy for discontinuing the testing process is that look-ahead search is likely to be feasible in real time only if the number of available tests is relatively small. In previous work we have presented techniques for reducing the computational effort in a search for the minimum probability of a given hypothesis [16]. However, the ability to recognise combinations of test results that can be eliminated in a search for the minimum probability depends on the assumption that p(E | H) > 0 for every test result E and hypothesis H. For example, this condition ensures that no combination of test results that minimises the probability of a hypothesis can include a supporter of the hypothesis [16]. Similar techniques could be used to reduce the complexity of look-ahead search in VERIFY when applied to domains in which this condition is satisfied (or imposed). In this paper, we have avoided this assumption as it means that no hypothesis can be confirmed or eliminated with certainty.

It is also worth noting that the need for look-ahead search at problem-solving time could be eliminated by constructing and simplifying an explicit consultation tree off-line and using the simplified tree to guide test selection in future consultations. However, this would compromise the system's abilities to support mixed-initiative dialogue and tolerate incomplete data, both of which rely on freedom from commitment to an explicit decision tree.

By selecting tear production rate (the most expensive of the available tests) as the most useful test in the example domain, VERIFY reveals its unawareness of the relative costs of tests. One way to address this limitation would be to constrain the selection of tests so that priority is given to those with negligible cost. A more challenging issue to be addressed by further research, though, is the system's insensitivity to differences in the costs associated with different misclassification errors. As shown by research in cost-sensitive learning, an approach to classification that takes account of such differences is often essential for optimal decision making [23].



6 Conclusions

We have presented a new approach to test selection in sequential diagnosis in the independence Bayesian framework that resembles the hypothetico-deductive approach to evidence gathering used by doctors. The approach avoids the complexity associated with the co-ordination of multiple test-selection strategies in previous models of hypothetico-deductive reasoning, while retaining the advantage that the relevance of a selected test can be explained in strategic terms. We have also examined two approaches to the problem of deciding when to discontinue the evidence-gathering process. In the first approach, testing is discontinued when the probability of the leading hypothesis reaches a predetermined threshold. In the second approach, testing is discontinued only if the solution would remain the same if testing were allowed to continue. The second approach tends to favour more conservative reductions in the length of problem-solving dialogues, while ensuring that only reductions that can have no effect on accuracy are accepted. Whether improvements in accuracy on unseen test cases may be possible by relaxing this constraint is one of the issues to be addressed by further research.

References

1. de Kleer, J., Williams, B.C.: Diagnosing Multiple Faults. Artificial Intelligence 32, 97-130, 1987.

2. Gorry, G.A., Kassirer, J.P., Essig, A., Schwartz, W.B.: Decision Analysis as the Basis for Computer-Aided Management of Acute Renal Failure. American Journal of Medicine 55, 473-484, 1973.

3. McSherry, D.: Hypothesist: a Development Environment for Intelligent Diagnostic Systems. In: Keravnou, E., Garbay, C., Baud, R., Wyatt, J. (eds): Artificial Intelligence in Medicine. LNAI, Vol. 1211. Springer-Verlag, Berlin Heidelberg, 223-234, 1997.

4. McSherry, D.: Interactive Case-Based Reasoning in Sequential Diagnosis. Applied Intelligence 14, 65-76, 2001.

5. Adams, I.D., Chan, M., Clifford, P.C. et al.: Computer Aided Diagnosis of Acute Abdominal Pain: a Multicentre Study. British Medical Journal 293, 800-804, 1986.

6. Provan, G.M., Singh, M.: Data Mining and Model Simplicity: a Case Study in Diagnosis. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, California, 57-62, 1996.

7. McSherry, D.: Intelligent Dialogue Based on Statistical Models of Clinical Decision Making. Statistics in Medicine 5, 497-502, 1986.

8. McSherry, D.: Dynamic and Static Approaches to Clinical Data Mining. Artificial Intelligence in Medicine 16, 97-115, 1999.

9. Elstein, A.S., Schulman, L.A., Sprafka, S.A.: Medical Problem Solving: an Analysis of Clinical Reasoning. Harvard University Press, Cambridge, Massachusetts, 1978.

10. Kassirer, J.P., Kopelman, R.I.: Learning Clinical Reasoning. Williams and Wilkins, Baltimore, Maryland, 1991.

11. Shortliffe, E.H., Barnett, G.O.: Medical Data: Their Acquisition, Storage and Use. In: Shortliffe, E.H., Perreault, L.E. (eds): Medical Informatics: Computer Applications in Health Care. Addison-Wesley, Reading, Massachusetts, 37-69, 1990.



12. McSherry, D.: A Case Study of Strategic Induction: the Roman Numerals Data Set. In: Bramer, M., Preece, A., Coenen, F. (eds): Research and Development in Intelligent Systems XVII. Springer-Verlag, London, 48-61, 2000.

13. Breslow, L.A., Aha, D.W.: Simplifying Decision Trees: a Survey. Knowledge Engineering Review 12, 1-40, 1997.

14. Doyle, M., Cunningham, P.: A Dynamic Approach to Reducing Dialog in On-Line Decision Guides. In: Blanzieri, E., Portinale, L. (eds.): Advances in Case-Based Reasoning. LNAI, Vol. 1898. Springer-Verlag, Berlin Heidelberg, 49-60, 2000.

15. McSherry, D.: Minimizing Dialog Length in Interactive Case-Based Reasoning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 993-998, 2001.

16. McSherry, D.: Avoiding Premature Closure in Sequential Diagnosis. Artificial Intelligence in Medicine 10, 269-283, 1997.

17. Cendrowska, J.: PRISM: an Algorithm for Inducing Modular Rules. International Journal of Man-Machine Studies 27, 349-370, 1987.

18. Quinlan, J.R.: Induction of Decision Trees. Machine Learning 1, 81-106, 1986.

19. Szolovits, P., Pauker, S.G.: Categorical and Probabilistic Reasoning in Medical Diagnosis. Artificial Intelligence 11, 115-144, 1978.

20. McSherry, D.: Explanation of Attribute Relevance in Decision-Tree Induction. In: Bramer, M., Coenen, F., Preece, A. (eds.): Research and Development in Intelligent Systems XVIII. Springer-Verlag, London, 39-52, 2001.

21. Bohanec, M., Bratko, I.: Trading Accuracy for Simplicity in Decision Trees. Machine Learning 15, 223-250, 1994.

22. Berry, D.C., Broadbent, D.E.: Expert Systems and the Man-Machine Interface. Part Two: The User Interface. Expert Systems 4, 18-28, 1987.

23. Elkan, C.: The Foundations of Cost-Sensitive Learning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 973-978, 2001.



Static Field Approach for Pattern Classification

Dymitr Ruta and Bogdan Gabrys

Applied Computational Intelligence Research Unit
Division of Computing and Information Systems, University of Paisley
High Street, Paisley PA1 2BE, UK
ruta-ci0, [email protected]

Abstract. Recent findings in pattern recognition show that dramatic improvement of the recognition rate can be obtained by application of fusion systems utilizing many different and diverse classifiers for the same task. Apart from a good performance of the individual classifiers, the most important factor is the useful diversity they exhibit. In this work we present an example of a novel non-parametric classifier design, which shows a substantial level of diversity with respect to other commonly used classifiers. In our approach, inspiration for the new classification method has been found in the physical world. Namely, we considered training data as particles in the input space and exploited the concept of a static field acting upon the samples. Specifically, every single data point used for training was a source of a central field, curving the geometry of the input space. The classification process is presented as a translocation in the input space along the local gradient of the field potential generated by the training data. The label of the training sample to which it converged during the translocation determines the eventual class label of the new data point. Based on selected simple fields found in nature, we show extensive examples and visual interpretations of the presented classification method. The practical applicability of the new model is examined and tested using well-known real and artificial datasets.

1 Introduction

Research in pattern recognition proves that no individual method can be shown to be the best for all classification problems [1], [2], [3]. Instead, an interesting alternative is to construct a number of diverse, generally well performing classification methods, and combine them at different levels of the classification process. Combining classifiers has been shown to offer a significant classification improvement for some non-trivial pattern classification problems [2], [3]. However, the highest improvement of a multiple classifier system is subject to the diversity exhibited among the component classifiers [4], [5], [6]. In this paper we propose an example of such a novel classifier that performs well individually and can be shown to be diverse with respect to other commonly used classifiers. In designing the new classifier we exploit the notion of a static field generated by a set of samples treated as physical particles. Our approach is closely related to the idea of an information field, recently emerging from studies in information theory, where increasingly deep analogies are drawn with the physical world [7], [8]. Shannon entropy, representing a probabilistic interpretation of information content, is an example of a direct counterpart to the thermodynamic entropy related to physical particles. Information, or its uncertainty, is quite often compared to energy, with all its aspects [8]. The latest findings have even led to the formulation of quantum information theory based on well-developed quantum physics [7]. The mathematical concept of a field, so commonly observed in nature, has hardly been exploited for data analysis. In [9], Hochreiter and Mozer apply an electric field metaphor to the Independent Component Analysis (ICA) problem, where joint and factorial density estimates are treated as distributions of positive and negative charges. In [10], Principe et al. introduce the concept of information potentials and forces arising from the interactions between the samples, which Torkkola and Campbell [11] used further for transformations of the data attempting to maximize their mutual information.

Inspired by these tendencies we adapt directly the concept of the field to the data, which can be seen as particles and field sources. The type of the field is uniquely defined by the definition of the potential and can be absolutely arbitrary, chosen according to the purpose of the data processing. For classification purposes the idea is to assign a previously unseen sample to one of the classes of the data, the topology of which should be learnt from the training data. As a response to this demand we propose to use an attracting action between the samples similar to the result of gravity acting among masses. Training data are fixed to their positions and represent the sources of the static field acting dynamically on loose testing samples. The field measured at a particular point of the input space is the result of a superposition of the local fields coming from all the sources. Thus the positions of the training data uniquely determine the field in the whole input space and in this way define the paths of the translocations of the testing data along the forces arising from the local changes of the field. The ending point of such a translocation has to be one of the training samples, which in turn determines the label of the classified sample. Such a static field classification (SFC) resembles, to a certain degree, non-parametric density estimation based approaches for classification [1].

The remainder of the paper is organized as follows. Section 2 explains the way in which the data is used as sources of the field, including implementation details. In Section 3 we show how the field generated upon the labeled data can be used for classification of previously unseen data. The next section provides the results from the experiments with real datasets. Finally, conclusions and plans for the future work in this area are presented briefly.

2 Data as Field Sources

Inspired by the field properties of the physical world, one can consider each data point as a source of a certain field affecting the other data in the input space. In general the choice of a field definition is virtually unrestricted. However, for the classification purposes considered in this paper, we use a central field with a negative potential increasing with the distance from a source. An example of such a field is the omnipresent gravity field or the electrostatic field. Given the training data acting as field sources, every point of the input space can be uniquely described by the field properties measured as a superposition of the influences from all field sources. In this paper we consider a static field in the sense that the field sources are fixed to their initial positions in the input space. All dynamic aspects imposed by the field are ignored in this work.

Given a training set of N_S m-dimensional data points X_S = \{x_1, x_2, \ldots, x_{N_S}\}, let each sample be the source of a field defined by a potential:

U_{ij} = -c\, s_i\, f(r_{ij})    (1)

where c represents the field constant, s_i stands for a source charge of the data point x_i, and f(r_{ij}) is a certain non-negative function decreasing with an increasing length of the vector \mathbf{r}_{ij} describing the distance between the source x_i and the point x_j in the input space. Note that the potential is always negative, which decides about the attracting properties of the data. In this work we adopt the gravitational field, for which:

f(r_{ij}) = \frac{1}{r_{ij}}    (2)

For notational simplicity we assume r_{ij} = \|\mathbf{r}_{ij}\|. The overall potential at a certain point x_j of the input space is a superposition of the potentials coming from all the sources:

U_j = -c \sum_{i=1}^{N_S} \frac{s_i}{r_{ij}}    (3)
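As a concrete illustration of (3), the short NumPy sketch below evaluates the superposed potential at a single point. This is not the authors' Matlab implementation; the function name, the unit default for the charges, and the small distance floor (anticipating the threshold d introduced in Section 2.1) are our own assumptions.

```python
import numpy as np

def overall_potential(x_j, X_S, s=None, c=1.0, d=0.01):
    """Superposed potential U_j of eq. (3) at point x_j, generated by the sources
    in X_S (one source per row) with charges s (unit charges if None).
    Distances are floored at d to avoid division by zero (cf. Section 2.1)."""
    s = np.ones(len(X_S)) if s is None else np.asarray(s)
    r = np.maximum(np.linalg.norm(X_S - x_j, axis=1), d)   # r_ij for every source
    return -c * np.sum(s / r)
```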

Considering a new data point in such a field, we can immediately associate with it the energy equal to:

E_j = -c\, s_j \sum_{i=1}^{N_S} \frac{s_i}{r_{ij}}    (4)

Simplifying the model further, we can assume that all data points are equally important and have the same source charge equal to unity, s_i = 1, thus eliminating the source charges from equations (3) and (4). Another crucial field property is its intensity, which is simply a gradient of the potential and can be formally expressed by:

E_j = -\nabla U_j = -\left[ \frac{\partial U_j}{\partial x_{j1}}, \frac{\partial U_j}{\partial x_{j2}}, \ldots, \frac{\partial U_j}{\partial x_{jm}} \right]    (5)

Solution of equation (5) leads to the following form of the field vector:

E_j = -c \left[ \sum_{i=1}^{N_S} \frac{x_{j1} - x_{i1}}{r_{ij}^3}, \ldots, \sum_{i=1}^{N_S} \frac{x_{jm} - x_{im}}{r_{ij}^3} \right] = -c \sum_{i=1}^{N_S} \frac{x_j - x_i}{r_{ij}^3}    (6)
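A vectorized sketch of (6) is given below; again this is a NumPy illustration rather than the authors' Matlab code, with the same hypothetical distance floor d.

```python
import numpy as np

def field_vectors(X, X_S, c=1.0, d=0.01):
    """Field intensity E of eq. (6) at every row of X, generated by the sources X_S."""
    diff = X[:, None, :] - X_S[None, :, :]             # (N, N_S, m): x_j - x_i
    r = np.maximum(np.linalg.norm(diff, axis=2), d)    # (N, N_S): pairwise distances
    return -c * np.sum(diff / r[:, :, None] ** 3, axis=1)   # (N, m) field vectors
```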



A vector of field intensity, or shortly a field vector, shows the direction and the magnitude of the maximum fall of the field potential. By analogy, in the presence of a new data point we can introduce the force, which is the result of the field acting upon the data. Due to our simplifications, the force vector and the field vector have identical values and directions:

F_j = E_j    (7)

The only difference between them is the physical unit, which is of no importance in our application. The concept of field forces will be directly exploited for the classification process described in Section 3. The field constant c does not affect the directions of the forces but only decides about their magnitudes. As previously, without any loss of generality we can assume its unit value and in that way free all the field equations from any parameters, apart from the definition of the distance itself.

2.1 Field Generation

From the perspective of classification, the generation of the field could represent the training process of a classifier. However, as the training data uniquely determine the field and all its properties, the training process may be omitted. All the calculations required to classify new data are carried out online during the classification process, avoiding any imprecision caused by approximations that might otherwise have been applied. This is very similar to the generation and operation of the well-known k-nearest neighbor classifier. In the case of a large amount of data to be classified, another option is available, although not as precise as the previous one. Namely, one can split the input space into small hyper-boxes and calculate all the required field properties at the centre of each hyper-box. The field can then be approximated at any point just by local aggregation procedures. The training process would be substantially prolonged, but the classification phase would require calculations related to just one or a couple of points from the neighborhood and could therefore be drastically shortened. For both methods the critical factor is the calculation of the distances between the examined points and all the sources. Using a matrix formulation of the problem and mathematical software, this task can be performed rapidly even for thousands of sources. In our case, using a P3 processor and Matlab 5, the calculation of all distances between 1000 10-D points and 1000 10-D sources took less than 1 second.
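The hyper-box option described above might be sketched as follows. This is a hypothetical illustration only: the grid resolution, the bounds and the local aggregation step at query time are not specified by the authors.

```python
import numpy as np

def potential_on_grid(X_S, lo, hi, bins, c=1.0, d=0.01):
    """Precompute the potential of eq. (3) at the centre of every hyper-box of a
    regular grid spanning [lo, hi] per dimension; query time can reuse these values."""
    axes = [np.linspace(l, h, n, endpoint=False) + (h - l) / (2 * n)
            for l, h, n in zip(lo, hi, bins)]              # bin centres per dimension
    centres = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, len(lo))
    r = np.maximum(np.linalg.norm(centres[:, None, :] - X_S[None, :, :], axis=2), d)
    return centres, -c * np.sum(1.0 / r, axis=1)           # (centres, potentials)
```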

Let X (of size N×m) denote the matrix of the N m-dimensional data points at which we want to examine the field, and X_S (of size N_S×m) the matrix of the N_S training data – the field sources. The task is to obtain the matrix D (of size N×N_S) of all the distances between the examined data and the training data. As opposed to a time-consuming double-loop implementation, introducing a matrix formulation leads to significant savings in terms of code length and processing time. Namely, D can be calculated as simply as:

D = (X \circ X) \bullet \mathbf{1}(m, N_S) - 2\, X \bullet X_S^T + \mathbf{1}(N, m) \bullet (X_S \circ X_S)^T    (8)



where “∘” denotes the operator of element-wise matrix multiplication (multiplication of corresponding elements), “•” represents standard matrix multiplication, and 1(n,m) stands for a matrix of size (n,m) with all elements equal to one. A Matlab implementation of the above rule results in a calculation time about 20 times shorter than that of the double-loop algorithm. Given the distance matrix D, all the properties of the field can be obtained by simple algebra performed on D. To avoid numerical problems, the distances have been limited from below by an arbitrary threshold d, preventing division by zero.
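The same computation can be written in NumPy as a sketch (not the authors' Matlab code). Note that (8) as written yields squared distances; whether the original implementation subsequently took a square root is not stated, so the final line is our assumption.

```python
import numpy as np

def distance_matrix(X, X_S):
    """All pairwise distances between the N examined points and the N_S sources,
    computed without loops, following the structure of eq. (8)."""
    ones_m_ns = np.ones((X.shape[1], X_S.shape[0]))
    ones_n_m = np.ones(X.shape)
    D2 = (X * X) @ ones_m_ns - 2.0 * X @ X_S.T + ones_n_m @ (X_S * X_S).T
    return np.sqrt(np.maximum(D2, 0.0))   # assumed square root; clip tiny negatives
```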

2.2 Numerical Example

As an example we generated 50 random 2-D points from the range (0,1) in both dimensions. Figures 1 and 2 provide a visualization of the field arising from the training data. Not surprisingly, the potential is lowest at the maximum data concentrations and generally in the middle of the regions occupied by the data. This phenomenon nicely correlates with the classification objective, as the highly concentrated regions should have a greater chance of data interception. The same applies to the dramatic local decrease of the potential around the field sources. The presence of the field can also be interpreted as a specific curvature of the input space imposed by the presence of the field sources, that is, the data. Each point in such a curved input space will be forced to move along the force vectors, ultimately ending up at the position of one of the field sources. In this way the field built upon the data has the ability to uniquely transform the input space. For classification purposes, such a transformation leads to the split of the whole input space into subspaces labeled according to the labels of the field sources intercepting the data from these subspaces.

Figures 1 and 2 show a visualization of the static data field generated by 50 random 2-D points from the range (0,1) in each dimension.

Fig. 1. 3-D visualization of the potential generated by the 2-D data.



Fig. 2. Vector plot of the field pseudo-intensity (‘pseudo’ as the vectors point only in the true direction of the field but their lengths are fixed here for visualization clarity).

3 Classification Process

Given the field, the classification process is very straightforward and reduces to gradient-descent source finding. The slide begins from the position of a new data point to be classified. For algorithm stability we ignore the actual values of the field or force vectors, following just the direction of maximum decrease of the potential. The sample is shifted along this direction by a small step d and the whole field is recalculated at the new position. This procedure is repeated until the unknown sample approaches any source at a distance lower than or equal to d. If this happens, the sample is labeled according to the source that intercepted it. The parameter d corresponds to the length of the shift vector and, if fixed, could cause different relative shifts for differently scaled dimensions. To avoid this problem we normalize the input space so that all the training data fall within the range (0,1) in all dimensions, and set the step to the fixed value d = 0.01. During the classification process the new data are transformed to the normalized input space, and even if a position falls outside the range (0,1) in some dimension, the step d remains well scaled. The step d has been deliberately denoted by the same symbol as the lower limit of the distance introduced in the previous section. This ensures that the sample never misses a source on its trajectory, and additionally two parameters are reduced to just one. To speed up the classification process one can enlarge the step as long as the data size is small enough and the distances between the sources remain larger than d.



3.1 Matrix Implementation

Rather than classifying samples one by one, we used Matlab's capabilities to classify samples simultaneously. Given the distance matrix D obtained by (8), the matrix of forces F (of size N×m) can be immediately obtained by (6). Exploiting the triangular relation between forces and shifts, and given the constant step d, the matrix of shifts ΔX (of size N×m) can be calculated by the formula:

\Delta X = d \cdot F \oslash \left( \sqrt{(F \circ F) \bullet \mathbf{1}(m,1)} \bullet \mathbf{1}(1,m) \right)    (9)

where ⊘ denotes element-wise division. The full SFC algorithm can be expressed as the following sequence of steps (a code sketch of this loop is given at the end of this section):

1. Given the sources – the training data X_S – and the data to be classified X, calculate the matrix of distances D according to (8).
2. Calculate the matrix of field forces at the positions of the unlabeled data to be classified.
3. Given a fixed step, calculate the shifts of the samples according to (9).
4. Shift the samples to the new locations calculated in step 3.
5. For all samples, check if the distance to any source is less than or equal to the step d. If so, classify these samples with the same labels as the sources they were intercepted by, and remove them from the matrix X.
6. If the matrix X is empty, finish; otherwise go to step 1.

The transformation presented above leads to the split of the whole input space into subspaces labeled according to the labels of the field sources intercepting the data from these subspaces. Figure 3 presents a graphical interpretation of the classification process for an artificial dataset with 8 classes. One can notice that the information about the labels of the training data is not used until the very end of the classification process. This property makes the method an interesting candidate for an unsupervised clustering technique. The class boundary diagram reveals an interesting effect of the presented classification method. Namely, occasionally one can observe a narrow strip of one class reaching deep into the area of another class. This is the case of a potential ridge, which is balanced from both sides by the data, causing the field vector to go in between, sometimes even reaching another class. Although this phenomenon is not particularly desirable for an individual classifier, as we show in the experiments it contributes to the satisfactory level of diversity the SFC classifier exhibits with other classifiers.
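The following is a minimal sketch of the six-step loop above, written as a NumPy illustration under our own simplifications (fixed maximum number of iterations, unit charges, integer class labels, and the distance floor d); it is not the authors' Matlab implementation.

```python
import numpy as np

def sfc_classify(X, X_S, y_S, c=1.0, d=0.01, max_iter=100000):
    """Shift every unlabelled sample along the field gradient in steps of length d
    (eqs. 6, 7, 9) until it comes within d of a source, then copy that source's label."""
    X = np.array(X, dtype=float)
    X_S, y_S = np.asarray(X_S, dtype=float), np.asarray(y_S)
    labels = np.full(len(X), -1, dtype=int)                # -1 marks unclassified samples
    active = np.arange(len(X))
    for _ in range(max_iter):
        if active.size == 0:
            break
        diff = X[active, None, :] - X_S[None, :, :]        # (n_active, N_S, m)
        r = np.maximum(np.linalg.norm(diff, axis=2), d)    # (n_active, N_S)
        hit = r.min(axis=1) <= d                           # step 5: intercepted samples
        labels[active[hit]] = y_S[r[hit].argmin(axis=1)]
        active, diff, r = active[~hit], diff[~hit], r[~hit]
        if active.size == 0:
            break
        F = -c * np.sum(diff / r[:, :, None] ** 3, axis=1)               # steps 1-2
        X[active] += d * F / np.linalg.norm(F, axis=1, keepdims=True)    # steps 3-4
    return labels
```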

3.2 Comparison with Other Classifiers

The static field classification presented in this paper shares some similarities with other established classifier designs. The process of field generation can in fact be seen as an indirect parametric estimation of the data density where the kernels are defined by the potentials generated by each training data point. Although technically similar, the two approaches have diametrically different meanings. Rather than data density we calculate potential, which in our approach imposes a specific curvature of the input space used further for classification purposes. This part of our method represents a purely unsupervised approach, as the information about the labels is not used at any point.


Fig. 3. Visualization of the static field based classification process performed on the 8-class 2-D artificial data of 160 samples. 3a: Scatter plot of the training data. 3b: Vector plot of the field pseudo-intensity. 3c: Trajectories of exemplary testing data sliding down along potential gradients. 3d: Class boundaries diagram.

The classification process of falling into the potential well of a single source resembles k-nearest neighbor classification with k = 1. However, rather than joining based on the least distance, we apply a specific translocation leading to the nearest field source met on the trajectory determined by the maximum fall of the field potential. Some similarities can also be found with Bayesian classifiers, which pick the class with maximum a posteriori probability calculated on the basis of assumed or estimated class probability density functions. In our approach, instead of probability and probability distributions we apply the potential, which we use in a less restrictive manner. Furthermore, all training data, regardless of class, take part in forming the decision landscape for each data point to be classified. No comparison is made after the matching procedure, as a result of which a classifier designed in this way is able to produce only binary outputs.

It is worth mentioning that the labels of the data are not used during the classification process either, as the testing samples are intercepted purely on the basis of the field built upon the training data and not their labels. One can say that the classification process is hidden until the labels of the sources are revealed. What the SFC classifier does afterwards is pass the labels from the field generators to the captured data. Once the labels of the sources are known, any data can be classified according to these labels. If they are not known, the method simply matches the data with sources according to the descending potential rule, which could potentially be exploited for clustering algorithms.

4 Diversity

Diversity among classifiers is the notion describing the level to which classifiers vary in data representation, concepts, strategy, etc. This multidimensional perception of diversity nevertheless results in a simple effect observed at the outputs of classifiers: they tend to make errors for different input data. This phenomenon has been shown to be crucial for effective and robust combining methods [3], [4], [5].

Diversity can be measured in a variety of ways [4], [5], [6], but the most effective turned out to be measures that directly evaluate the disagreement on errors among classifiers [5], [6]. In [6] we investigated the usefulness and potential applicability of a variety of pairwise and non-pairwise diversity measures operating on binary outputs (correct/incorrect). For the purposes of this paper we will be using the Double Fault (DF) measure, which turned out to be the best among the analyzed pairwise measures. Recalling the definition of the DF measure, the idea is to calculate the ratio of the number of samples misclassified by both classifiers, n_{11}, to the total number of samples n:

F = n_{11} / n    (10)

Using this simple measure one can effectively assess the diversity between all pairs of classifiers as well as quite reasonably evaluate the combined performance [6].
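For reference, the DF measure of (10) amounts to a one-line computation over the binary (correct/incorrect) outputs; the boolean-vector representation used here is our own convention.

```python
import numpy as np

def double_fault(errors_i, errors_j):
    """DF measure (10): fraction of samples misclassified by both classifiers.
    errors_i, errors_j are boolean vectors marking each classifier's errors."""
    errors_i, errors_j = np.asarray(errors_i), np.asarray(errors_j)
    return np.mean(errors_i & errors_j)   # n_11 / n
```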

5 Experiments

The static field based classification method requires a number of evaluation procedures. First of all, one needs to check its performance on typical real and artificial datasets and compare it against other classifiers. Secondly, as we mentioned above, we intended to develop a classifier that would be diverse with respect to other commonly used classifiers. It is important for a good classifier to meet both of these conditions at a satisfactory level to be successfully used in combining schemes.

5.1 Experiment 1

In this experiment we used 4 well-established datasets to check the individual performance of the SFC and compare it against the performances of another 9 typically used classifiers. Table 1 shows the details of the datasets picked and a list of the other classifiers used. For all but one case the datasets have been split into two equal parts used for training and testing respectively. For the Phoneme dataset, due to its large size, we used 200 samples for training and 1000 for testing. Classification runs have been repeated 10 times for different random splits. Table 2 shows the averaged individual performances of the classifiers from this experiment. Although the performance of the SFC classifier is never the best result, it remains close to the best. This makes the SFC a valid candidate for combining, provided it exhibits a satisfactory level of diversity.

Table 1. Description of the datasets and classifiers used for the experiments.


Table 2. Individual performances of the classifiers applied to the datasets: Iris, Phoneme, Conetorus and Synthetic. The results are obtained for 10 random splits and averaged.

5.2 Experiment 2

The diversity properties of the SFC classifier are evaluated in the next experiment. As mentioned in Section 4, we decided to apply the DF measure to examine the diversity among all pairs of classifiers. The results have been obtained for all the datasets mentioned above. Due to the large number of DF measures obtained for all pairs of the considered classifiers, we present the results graphically in the form of diversity diagrams, as shown in Figure 4.

Fig. 4. Diversity diagrams obtained for the 4 considered datasets (Iris, Phoneme, Conetorus, Mix of Gaussians).



The coordinates of each small square correspond to the indices of the classifiers for which the DF measure is calculated. Note that the diagrams are diagonally symmetrical. The shade of a square reflects the magnitude of the DF measure value: the lower the DF measure, the lighter the square and the more diverse the corresponding pair of classifiers. To obtain a single value reflecting the diversity properties of an individual classifier, we averaged the DF measures between the considered classifier and all remaining classifiers. This is shown numerically in Table 3. Both the diagrams from Figure 4 and the averaged results from Table 3 show very good diversity properties of the SFC classifier. Only for the Phoneme dataset does the SFC demonstrate a quite average diversity level. For the remaining 3 datasets the SFC turned out to be the most diverse despite its quite average individual performance.

Table 3. Averaged values of the DF measure between individual classifiers and all the remaining classifiers. The DF values are expressed as percentages of pairwise coincident errors relative to the total number of samples.

Dataset [%]   loglc   fisherc   ldc     nmc     persc   pfsvc   qdc     parzenc   rbnc    sfc
Iris          0.48    0.36      0.45    0.44    0.12    0.21    0.53    0.47      0.37    0.11
Phoneme       5.61    6.98      6.08    7.85    6.11    6.80    6.95    6.81      5.88    6.70
Conetorus     10.82   11.42     12.39   11.05   11.46   10.99   12.62   12.40     12.70   10.80
Gaussians     2.73    2.33      3.44    2.55    3.73    3.18    2.68    3.27      2.83    1.74

5.3 Experiment 3

In the last experiment we investigated the parametric variability of the presented classifier. Recalling the force definition (7), the only parameter of the field having a potential influence on the classification results is the type of distance function appearing in the potential definition (1). For the Conetorus dataset we applied the SFC with different powers of the distance appearing in the denominator in (3). Additionally, we examined a simple exponential function with one parameter as an alternative definition of the potential. Table 4 shows all the configurations of the SFC examined in this experiment, as well as the individual performances obtained for the Conetorus dataset for a single split. Visual results, including field images and class boundaries, are shown in Figures 5 and 6. For both functions the results reveal a clear meaning of the parameter a. Namely, it accounts for the balance between local and global interactions among the samples.

Table 4. Individual performances of various configurations of the SFC classifier. The results were obtained for the Conetorus dataset using a single random split (50% for training, 50% for testing).

Potential definition:  Rational, U(x) = -\sum_{i=1}^{N} 1/r_i^a    |  Exponential, U(x) = -\sum_{i=1}^{N} e^{-a r_i}
Parameter a:           0.1     0.6     2       5                   |  10      20      50      100
Performance:           72.86   76.88   82.41   81.42               |  69.85   76.38   80.90   81.91
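The two potential families of Table 4 can be written down directly; the sketch below is an illustration only, with the parameter a playing the local-versus-global role discussed in the text that follows.

```python
import numpy as np

def rational_potential(x, X_S, a):
    """U(x) = -sum_i 1 / r_i^a  (rational family in Table 4)."""
    r = np.linalg.norm(X_S - x, axis=1)
    return -np.sum(1.0 / r ** a)

def exponential_potential(x, X_S, a):
    """U(x) = -sum_i exp(-a * r_i)  (exponential family in Table 4)."""
    r = np.linalg.norm(X_S - x, axis=1)
    return -np.sum(np.exp(-a * r))
```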



Fig. 5. Potential with a rational function of the distance for different values of the parameter a (a = 0.1, 0.6, 2, 5).

The larger the value of a, the more local the field, so that virtually only the nearest neighbors influence the field at a particular point of the input space. For smaller a the field becomes more global, and below a certain critical value some training samples are no longer able to intercept any testing samples.



Fig. 6. Exponential dependence of the potential on the distance for different values of the parameter a (a = 10, 20, 50, 200).

Technically, this is the case when a single source cannot curve the geometry strongly enough to create a closed enclave of higher potential around itself. The presented SFC classifier will not be able to classify such samples, which may just find a local minimum of the potential rather than a field source. The critical value of the parameter a seems to be a function of the number of samples, and its optimization could be included in the classifier design. However, the results in Table 4 suggest that more local fields tend to result in better performance, and therefore it is safer to apply larger values of a.

6 Conclusions

In this paper, we introduced a novel non-parametric classification method based on a static data field adopted from physical field phenomena. The meaning of the training data has been reformulated as sources of a central static field with a negative potential increasing with the distance from the source. The attracting force among the data defines a specific complex potential landscape resembling joint potential wells. The classification process has been proposed as a gradient-descent translocation of the unlabelled sample, ultimately forced to approach one of the sources and inherit its label. The Static Field Classifier (SFC) has been implemented using an efficient matrix formulation suitable for Matlab. Extensive graphical content has been used to depict different geometrical interpretations of the SFC as well as to fully visualize the classification process.

The presented SFC has been evaluated in a number of ways. Individual performance has been examined on a number of datasets and compared to other well performing classifiers. The results showed a relatively average performance of the SFC when applied individually. However, it showed the highest level of diversity with other classifiers for 3 out of 4 datasets, making it a very good candidate for classifier combination purposes. Various types of fields have also been examined within the general SFC definition. The conducted experiments suggested the use of local fields for the best performance as well as for boundary invariability, but further experiments are required for a fuller interpretation of these results.

The properties mentioned above, as well as the results from the presented experiments, allow the SFC to be considered as an alternative non-parametric approach to classification, particularly useful for combining with other classifiers.

References

1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons, New York (2001).

2. Bezdek, J.C.: Fuzzy Models and Algorithms for Pattern Recognition and Image Processing. Kluwer Academic, Boston (1999).

3. Sharkey, A.J.C.: Combining Artificial Neural Nets: Ensemble and Modular Multi-net Systems. Springer-Verlag, Berlin Heidelberg New York (1999).

4. Sharkey, A.J.C., Sharkey, N.E.: Combining Diverse Neural Nets. The Knowledge Engineering Review 12(3) (1997) 231-247.

5. Kuncheva, L.I., Whitaker, C.J.: Ten Measures of Diversity in Classifier Ensembles: Limits for Two Classifiers. IEE Workshop on Intelligent Sensor Processing, Birmingham (2001) 10/1-10/6.



6. Ruta, D., Gabrys, B.: Analysis of the Correlation Between Majority Voting Errors and the Diversity Measures in Multiple Classifier Systems. International Symposium on Soft Computing, Paisley (2001).

7. Zurek, W.H.: Complexity, Entropy and the Physics of Information. Proc. of the Workshop on Complexity, Entropy, and the Physics of Information, Santa Fe (1989).

8. Klir, G.J., Folger, T.A.: Fuzzy Sets, Uncertainty, and Information. Prentice-Hall International Edition (1988).

9. Hochreiter, S., Mozer, M.C.: An Electric Approach to Independent Component Analysis. Proc. of the Second International Workshop on Independent Component Analysis and Signal Separation, Helsinki (2000) 45-50.

10. Principe, J., Fisher III, Xu, D.: Information Theoretic Learning. In: S. Haykin (ed.): Unsupervised Adaptive Filtering. New York, NY (2000).

11. Torkkola, K., Campbell, W.: Mutual Information in Learning Feature Transformations. Proc. of the International Conference on Machine Learning, Stanford, CA (2000).



Inferring Knowledge from Frequent Patterns

Marzena Kryszkiewicz

Institute of Computer Science, Warsaw University of Technology
Nowowiejska 15/19, 00-665 Warsaw, Poland

[email protected]

Abstract. Many knowledge discovery problems can be solved efficiently by means of frequent patterns present in the database. Frequent patterns are useful in the discovery of association rules, episode rules, sequential patterns and clusters. Nevertheless, there are cases when a user is not allowed to access the database and can deal only with a provided fraction of knowledge. Still, the user hopes to find new interesting relationships. In the paper, we offer a new method of inferring new knowledge from the provided fraction of patterns. Two new operators of shrinking and extending patterns are introduced. Surprisingly, a small number of patterns can be considerably extended into the knowledge base. Pieces of the new knowledge can be either exact or approximate. In the paper, we introduce a concise lossless representation of the given and derivable patterns. The introduced representation is exact regardless of the character of the derivable patterns it represents. We show that the discovery process can be carried out mainly as an iterative transformation of the patterns representation.

1 Introduction

Let us consider the following scenario that is typical for collaborating (e.g. telecom) companies: a company T1 requests some services offered by a company T2. To this end, T2 must collect some knowledge about T1. T1, however, may not wish to unintentionally reveal some facts to T2. Therefore, it is important for T1 to be aware of all the consequents derivable from the required information. Awareness of what can be derived from a fraction of knowledge can be crucial for the security of the company. On the other hand, methods that enable reasoning about knowledge can also be very useful in the case when the information available within the company is incomplete.

The problem of inducing knowledge from a provided set of association rules was first addressed in [4]. It was shown there how to use the cover operator (see [3]) and the extension operator (see [4]) in order to augment the original knowledge. The cover operator does not require any information on the statistical importance (support) of rules and produces rules at least as good as the original ones; the extension operator requires information on the support of the original rules.

In [5] a different method of indirectly inducing knowledge from a given rule set was proposed. It was proved there that it is better first to transform the provided rule set into corresponding patterns P, then to augment P with new patterns (bounded by patterns in P), and finally to apply the old and new patterns to association rule discovery. The set of rules obtained this way is guaranteed to be a superset of the rules obtained by indirect “mining around rules”. Additionally, it was shown in [5] how to test the consistency of the rule set and patterns, as well as how to extract a consistent subset of rules and patterns.

In this paper we follow the idea formulated in [5] that patterns are a better form for knowledge derivation than association rules. We treat patterns as an intermediate form for the derivation of other forms of knowledge. The more patterns we are able to derive from itemsets, the more association rules we are able to discover. This paper considerably extends the original method of deriving new patterns from the provided fraction of patterns. In particular, we propose here how to obtain extended and shrunken patterns that are not bounded by known itemsets. In addition, we define a lossless concise representation of patterns and show how to use it in order to obtain all derivable itemsets. The introduced representation is an adapted version of the generators and closed itemsets representation developed recently as an efficient representation of all frequent patterns present in the database [6,7]. The original representation of patterns is defined in terms of database transactions, while the adapted version is defined in terms of available patterns. Though both representations are conceptually close, the related problems differ considerably – in the case of the original representation the main problem is its discovery from the transactional database; in the case of our representation, the issue we mainly address is the restriction of the intended pattern augmentation to the representation transformation.

The layout of the paper is as follows: In Section 2 we introduce the basic notions of patterns (itemsets) and association rules. Section 3 recalls the results obtained in [5] related to discovering association rules from a given rule set by means of bounded itemsets. In Section 4 we propose a new method of augmenting the number of patterns by means of two new operators of shrinking and extending. In Section 5 we offer a notion of a concise lossless representation of available and derivable patterns. Section 6 shows how to restrict the overall pattern derivation process by applying the concise representation of patterns. Section 7 concludes the results.

2 Frequent Itemsets, Association Rules, Closed Itemsets, and Generators

The problem of association rules was introduced in [1] for sales transaction databases. The association rules identify sets of items that are purchased together with other sets of items. For example, an association rule may state that 90% of customers who buy butter and bread buy milk as well. Let us recollect the problem more formally:

Let I = {i1, i2, ..., im} be a set of distinct literals, called items. In general, any set of items is called an itemset. Let D be a set of transactions, where each transaction T is a subset of I. An association rule is an expression of the form X ⇒ Y, where ∅ ≠ X, Y ⊂ I and X ∩ Y = ∅. Support of an itemset X is denoted by sup(X) and defined as the number (or the percentage) of transactions in D that contain X.

Property 1 [1]. Let X,Y⊆I. If X⊂Y, then sup(X)≥sup(Y).

The property below is an immediate consequence of Property 1.

Property 2. Let X,Y,Z⊆I. If X⊆Y⊆Z and sup(X)=sup(Z), then sup(Y)=sup(X).



Property 3 [6]. Let X,Y,Z ⊆ I. If X ⊆ Z and sup(X) = sup(Z), then X∪Y ⊆ Z∪Y and sup(X∪Y) = sup(Z∪Y).

Support of the association rule X ⇒ Y is denoted by sup(X ⇒ Y) and defined as sup(X ∪ Y). Confidence of X ⇒ Y is denoted by conf(X ⇒ Y) and defined as sup(X ∪ Y) / sup(X). (In terms of sales transactions, conf(X ⇒ Y) determines the conditional probability of purchasing items Y when items X are purchased.)

The problem of mining association rules is to generate all rules that have support greater than a user-defined threshold minSup and confidence greater than a threshold minConf. Association rules that meet these conditions are called strong. Discovery of strong association rules is usually decomposed into two subprocesses [1], [2]:

Step 1. Generate all itemsets whose support exceeds the minimum support minSup. The itemsets with this property are called frequent.

Step 2. From each frequent itemset generate association rules as follows: Let Z be a frequent itemset and ∅ ≠ X ⊂ Z. Then X ⇒ Z\X is a strong association rule provided sup(Z)/sup(X) > minConf.

The number of both frequent itemsets and strong association rules can be huge. In order to alleviate this problem, several concise representations of knowledge have been proposed in the literature (see e.g. [3], [6], [7]). In particular, frequent closed itemsets, which constitute a closure system, are one of the basic lossless representations of frequent itemsets [7]. The closure of an itemset X (denoted by γ(X)) is defined as the greatest (w.r.t. set inclusion) itemset that occurs in all transactions in D in which X occurs. The itemset X is defined closed if γ(X) = X.

Another basic representation of frequent itemsets is based on the notion of a generator. A generator of an itemset X (denoted by G(X)) can be defined as a minimal (w.r.t. set inclusion) itemset that occurs in all transactions in D in which X occurs.

3 Known Itemsets, Bounded Itemsets, and Association Rules

In the previous section we outlined how to calculate association rules from a set of allfrequent itemsets occurring in the database. Here, we assume that the database is notavailable to a user, however supports of some itemsets are known to him/her. In thesequel, the itemsets the supports of which are known will be called known itemsetsand will be denoted by K. The purpose of the user is to generate as many strongassociation rules as possible based on known itemsets.

The simplest way to generate strong association rules from K is to apply theprocedure described as Step 2 in Section 2. The set of derivable association rules maybe increased considerably if we know how to construct new itemsets and estimatetheir supports based on K. Here we remind the approach proposed in [5]:

A pair of itemsets Y,Z in K will be called bounding for X, if Y⊆X⊆Z. The itemset Xwill be called bounded (also called derivable in [5]) in K if there is a pair of boundingitemsets Y, Z∈K for X. The support of any (unknown) itemset X can be estimated bysupports of bounding itemsets as follows:

min{sup(Y) | Y∈K ∧ Y⊆X} ≥ sup(X) ≥ max{sup(Z) | Z∈K ∧ X⊆Z}.



The set of all bounded itemsets in K will be denoted by BIS(K), that is:

BIS(K) = {X⊆I | ∃Y,Z∈K, Y⊆X⊆Z}.

Obviously, BIS(K) ⊇ K. The pessimistic support (pSup) and the optimistic support (oSup) of an itemset X∈BIS(K) w.r.t. K are defined as follows:

pSup(X,K) = max{sup(Z) | Z∈K ∧ X⊆Z},

oSup(X,K) = min{sup(Y) | Y∈K ∧ Y⊆X}.

The real support of X∈BIS(K) belongs to [pSup(X,K), oSup(X,K)]. Clearly, if X∈K, then sup(X)=pSup(X,K)=oSup(X,K). In fact, it may happen for a bounded itemset X not present in K that its support can be precisely determined. This happens when pSup(X,K)=oSup(X,K); then sup(X)=pSup(X,K). In the sequel, we will call such itemsets exact bounded ones. The set of all exact bounded itemsets will be denoted by EBIS(K),

EBIS(K) = {X∈BIS(K) | pSup(X,K)=oSup(X,K)}.

Example 1. Let K = {ac[20], acf[20], d[20], ad[20], cd[20], f[30], cf[30], aef[15], def[15]} (the values provided in square brackets in the subscript denote the supports of the itemsets). We note that the itemset df∉K is bounded by two known subsets, d and f, and by the known superset def. Hence, pSup(df,K) = max{sup(def)} = 15, and oSup(df,K) = min{sup(d), sup(f)} = 20. In our example, BIS(K) = K ∪ {de[15,20], df[15,20], ef[15,30]} (the values provided in square brackets in the subscript denote the pessimistic and optimistic supports of the itemsets, respectively). Since no newly derived itemset (i.e. itemset in BIS(K) \ K) has equal pessimistic and optimistic supports, EBIS(K) = K.
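The following sketch (ours, not the paper's) computes pSup and oSup over a dictionary of known itemsets and reproduces the bounds of Example 1; the data structure (frozensets mapped to supports) is merely an implementation choice.

def p_sup(X, K):
    # Pessimistic support: largest support among known supersets of X (None if no known superset).
    vals = [s for Z, s in K.items() if X <= Z]
    return max(vals) if vals else None

def o_sup(X, K):
    # Optimistic support: smallest support among known subsets of X (None if no known subset).
    vals = [s for Y, s in K.items() if Y <= X]
    return min(vals) if vals else None

# Known itemsets of Example 1
K = {frozenset("ac"): 20, frozenset("acf"): 20, frozenset("d"): 20,
     frozenset("ad"): 20, frozenset("cd"): 20, frozenset("f"): 30,
     frozenset("cf"): 30, frozenset("aef"): 15, frozenset("def"): 15}

print(p_sup(frozenset("df"), K), o_sup(frozenset("df"), K))  # 15 20 -> df is bounded but not exact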

Property 4. Let X,Y∈BIS(K) and X⊂Y. Then:
a) pSup(X,K) ≥ pSup(Y,K),
b) oSup(X,K) ≥ oSup(Y,K).

Property 5. Let X∈BIS(K). Then:
a) max{pSup(Z,K) | Z∈BIS(K) ∧ X⊆Z} = pSup(X,K),
b) min{oSup(Y,K) | Y∈BIS(K) ∧ Y⊆X} = oSup(X,K).

Knowing BIS(K) one can induce (approximate) rules X⇒Y provided X∪Y ∈ BIS(K) and X ∈ BIS(K). The pessimistic confidence (pConf) of induced rules is defined as follows:

pConf(X⇒Y,K) = pSup(X∪Y,K) / oSup(X,K).

The approximate association rules derivable from K are called the theory for K and are denoted by T:

T(K) = {X⇒Y | X, X∪Y ∈ BIS(K)}.



It is guaranteed for every rule r∈T(K) that its real support is not less than pSup(r,K) and its real confidence is not less than pConf(r,K). (Please see [5] for the GenTheory algorithm computing T(K).)

Now, let us assume that the user is not provided with known itemsets, but with association rules R whose supports and confidences are known. According to [5], the knowledge on rules should first be transformed into knowledge on itemsets as follows:

Let r∈R be a rule under consideration. Then the support of the itemset from which r is built equals sup(r), and the support of the itemset that is the antecedent of r equals sup(r) / conf(r). All such itemsets are defined as known itemsets for R and are denoted by KIS(R), that is, KIS(R) = {X∪Y | X⇒Y∈R} ∪ {X | X⇒Y∈R}.
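A small illustrative sketch of this transformation (ours, not from the paper), assuming rules are given as tuples (X, Y, sup, conf):

def known_itemsets_from_rules(R):
    # KIS(R): for each rule X => Y, record sup(X u Y) = sup(r) and sup(X) = sup(r)/conf(r).
    K = {}
    for X, Y, sup_r, conf_r in R:
        K[frozenset(X) | frozenset(Y)] = sup_r
        K[frozenset(X)] = sup_r / conf_r
    return K

# Hypothetical rule a => c with support 20 and confidence 0.8
print(known_itemsets_from_rules([({"a"}, {"c"}, 20, 0.8)]))  # sup(ac) = 20, sup(a) = 25.0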

Now, it is sufficient to extract the frequent known itemsets K from KIS(R), and then to compute T(K).

4 Deriving Unbounded Itemsets

In this section we will investigate if and under which conditions the given set of known itemsets K can be augmented by itemsets that are not bounded in K. Let us start with two fundamental propositions:

Proposition 1. Let X,Y,Z⊆I such that Z,Y⊇X and sup(X)=sup(Z). Then:

Y’ ⊆ Y ⊆ Y” and sup(Y’) = sup(Y) = sup(Y”),

where Y’ = X∪(Y\Z), Y” = Y∪Z.

Proof: Let X,Y,Z⊆I, Z,Y⊇X, sup(X) = sup(Z), and V = Y\Z. By Property 3, X∪V ⊆ Z∪V and sup(X∪V) = sup(Z∪V). Since Y⊇X and V = Y\Z, then X∪V = X∪(Y\Z) ⊆ Y ⊆ Y∪Z = Z∪(Y\Z) = Z∪V and sup(X∪(Y\Z)) = sup(Y∪Z). Now, by Property 2, sup(Y) = sup(X∪(Y\Z)).

Proposition 1 states that each itemset Y in K can be shrunk to a subset Y’ with the same support as sup(Y) and extended to a superset Y” with the same support as sup(Y), if there is a pair of itemsets X,Z∈K such that Z,Y⊇X and sup(X)=sup(Z).

Example 2. Let X = f[30], Z = cf[30], Y = acf[20], Y’ = aef[15]. Then, by Proposition 1, we can shrink Y into the new exact subset X∪(Y\Z) = af[20], and we can extend Y’ into the new exact superset Y’∪Z = acef[15].

The proposition below shows how to perform even stronger itemset shrinking than that proposed in Proposition 1.

Proposition 2. Let X,Y,Z⊆I such that Z,Y⊇X and sup(X)=sup(Z). ∀V∈K such that V⊆Y and sup(Y)=sup(V), the following holds:

Y’ ⊆ Y and sup(Y’) = sup(Y),

where Y’ = X∪(V\Z).



Proof: Let X,Y,V,Z⊆I, Z,Y⊇X, sup(X)=sup(Z), V⊆Y, sup(Y)=sup(V). As X∪V ⊇ X, then by Proposition 1, X∪((X∪V)\Z) ⊆ X∪V and sup(X∪((X∪V)\Z)) = sup(X∪V). In addition, by Property 3, X∪V ⊆ X∪Y and sup(X∪V) = sup(X∪Y). Taking into account that (X∪V)\Z = V\Z (since Z⊇X) and X∪Y = Y (since Y⊇X), we obtain: X∪(V\Z) = X∪((X∪V)\Z) ⊆ X∪V ⊆ X∪Y = Y and sup(X∪(V\Z)) = sup(X∪((X∪V)\Z)) = sup(X∪V) = sup(X∪Y) = sup(Y).

Proposition 2 states that each itemset Y in K can be shrunk to a subset Y’=X∪(V\Z) with the same support as sup(Y), if X,Z,V∈K and Z,Y⊇X, sup(X)=sup(Z), V⊆Y, sup(Y)=sup(V).

The extended and shrunken itemsets w.r.t. K will be denoted by EIS(K) and SIS(K), respectively, and are defined as follows:

EIS(K) = {Y∪Z | ∃X,Y,Z∈K such that Z,Y⊇X and sup(X)=sup(Z)},

SIS(K) = {X∪(V\Z) | ∃X,Y,V,Z∈K such that Z,Y⊇X, sup(X)=sup(Z), V⊆Y and sup(Y)=sup(V)}.
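Both operators admit a direct, if naive, implementation. The sketch below is ours and simply enumerates the tuples required by Propositions 1 and 2 over a dictionary K of known itemsets with their supports (reusing the representation of the earlier sketch); applied to the K of Example 1, its new itemsets coincide with those listed in Example 3.

def eis(K):
    # Extended itemsets Y | Z, whose support equals sup(Y) (Proposition 1).
    out = {}
    for X, sx in K.items():
        for Z, sz in K.items():
            if X <= Z and sx == sz:
                for Y, sy in K.items():
                    if X <= Y:
                        out[Y | Z] = sy
    return out

def sis(K):
    # Shrunken itemsets X | (V - Z), whose support equals sup(Y) (Proposition 2).
    out = {}
    for X, sx in K.items():
        for Z, sz in K.items():
            if X <= Z and sx == sz:
                for Y, sy in K.items():
                    if X <= Y:
                        for V, sv in K.items():
                            if V <= Y and sv == sy:
                                out[X | (V - Z)] = sy
    return out

# With K from Example 1:
# new = {I: s for I, s in {**eis(K), **sis(K)}.items() if I not in K}
# -> af[20], acd[20], acef[15], adef[15], cdef[15], as in Example 3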

The support of each shrunken and extended itemset equals the support of some known itemset in K. However, K augmented with such shorter and longer itemsets derived by EIS(K) and SIS(K) will bound a greater number of itemsets, i.e., EBIS(K∪EIS(K)∪SIS(K)) ⊇ EBIS(K) and BIS(K∪EIS(K)∪SIS(K)) ⊇ BIS(K).

Example 3. Let K be the known itemsets from Example 1. Then the newly derived itemsets are (EIS(K) ∪ SIS(K)) \ K = {af[20], acd[20], acef[15], adef[15], cdef[15]}.

If (EIS(K) ∪ SIS(K)) \ K ≠ ∅, then further augmentation of the (exact) bounded itemsets is feasible. In order to obtain maximal knowledge on itemsets, we should apply the operators EIS and SIS repeatedly until no new exact bounded itemsets can be derived.

By EBIS*(K) we will denote all exact bounded itemsets that can be derived from K by multiple use of the EIS and SIS operators. More formally,

EBIS*(K) = E^k(K), where

• E^1(K) = EBIS(K),
• E^n(K) = EBIS(E^(n-1)(K) ∪ EIS(E^(n-1)(K)) ∪ SIS(E^(n-1)(K))), for n ≥ 2,
• k is the least value of n such that E^n(K) = E^(n+1)(K).

Example 4. Let K be the known itemsets from Example 1 and K’ = EBIS*(K) (see Fig. 1 for K and K’). In our example, the number of known itemsets increased from 9 itemsets in K to 23 exact itemsets in K’. Some of the newly derived itemsets could be found as bounded earlier (e.g. de[15,20], df[15,20] ∈ BIS(K)); however, the process of shrinking and extending itemsets additionally enabled precise determination of their supports (e.g. sup(de)=15, sup(df)=20).
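A brute-force sketch of the E^n iteration (ours; exponential in the size of the gaps between bounding itemsets, so suitable only for small examples such as Example 1). It reuses p_sup, o_sup, eis and sis from the earlier sketches.

from itertools import chain, combinations

def between(Y, Z):
    # All itemsets X with Y subset of X subset of Z.
    extra = list(Z - Y)
    subsets = chain.from_iterable(combinations(extra, r) for r in range(len(extra) + 1))
    return (Y | frozenset(c) for c in subsets)

def ebis(K):
    # Exact bounded itemsets: bounded itemsets whose pessimistic and optimistic supports coincide.
    exact = dict(K)
    for Y in K:
        for Z in K:
            if Y <= Z:
                for X in between(Y, Z):
                    if p_sup(X, K) == o_sup(X, K):
                        exact[X] = p_sup(X, K)
    return exact

def ebis_star(K):
    # Iterate E^n(K) = EBIS(E^(n-1)(K) u EIS(E^(n-1)(K)) u SIS(E^(n-1)(K))) until a fixpoint.
    E = ebis(K)                       # E^1(K)
    while True:
        nxt = ebis({**E, **eis(E), **sis(E)})
        if nxt == E:
            return E
        E = nxt

# For the K of Example 1, ebis_star(K) should contain the 23 exact itemsets of Fig. 1.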



Fig. 1. All exact itemsets K’=EBIS*(K) derivable from the known itemsets K (underlined in the original figure) by multiple use of the SIS and EIS operators; closed itemsets and generators in K’ are bolded in the original. Layered by size, K’ consists of:
acdef[15]
acdf[20], acef[15], adef[15], cdef[15]
acf[20], acd[20], adf[20], cdf[20], ace[15], aef[15], ade[15], def[15], cde[15]
ac[20], af[20], ad[20], cd[20], df[20], cf[30], de[15]
d[20], f[30]

By BIS*(K) we will denote all bounded itemsets that can be derived from K by multiple use of the EIS and SIS operators. More formally,

BIS*(K) = BIS(EBIS*(K)).

5 Lossless Representation of Known and Bounded Itemsets

In this section we offer a concise representation of the known and bounded itemsets in K that allows deriving each itemset in BIS(K) and determining its pessimistic and optimistic support without error. In our approach we follow the idea of applying the concepts of closed itemsets and generators of itemsets. We will, however, extend these concepts in order to cover the additional aspect of possible imprecision of support determination for bounded itemsets, which has not been considered so far in the context of closures and generators.

A closure of an itemset X∈BIS(K) in K is defined to be a maximal (w.r.t. set inclusion) known superset of X that has the same support as pSup(X,K). The set of all closures of X in K is denoted by γ(X,K), that is:

γ(X,K) = MAX{Y∈K | Y⊇X ∧ sup(Y)=pSup(X,K)}.

Let B be a subset of bounded itemsets such that K⊆B⊆BIS(K). The union of the closures of the itemsets B in K will be denoted by C(B,K), that is:

C(B,K) = ∪_{X∈B} γ(X,K).

For B=K, C(B,K) will be denoted briefly by C(K). An itemset X∈K is defined closed in K iff γ(X,K)={X}.

A generator of an itemset X∈BIS(K) in K is defined to be a minimal (w.r.t. set inclusion) known subset of X that has the same support as oSup(X,K). The set of all generators of X in K is denoted by G(X,K), that is:

G(X,K) = MIN{Y∈K | Y⊆X ∧ sup(Y)=oSup(X,K)}.

Let B be a subset of bounded itemsets such that K⊆B⊆BIS(K). The union of the generators of the itemsets B in K will be denoted by G(B,K), that is:



G(B,K) = ∪_{X∈B} G(X,K).

For B=K, G(B,K) will be denoted briefly by G(K). An itemset X∈K is defined to be a key generator in K iff G(X,K)={X}.

Example 5. Let K’ be the set of exact itemsets in Fig. 1. Let X = acd (note acd∈K’). Then: γ(X,K’) = {acdf[20]} and G(X,K’) = {ac[20], d[20]}. Now, let X’=ef (note ef∈BIS(K’)\K’). Then pSup(X’,K’) = 15 and oSup(X’,K’) = 30. Thus: γ(X’,K’) = {acdef[15]}, G(X’,K’) = {f[30]}.

The union of the closures of the itemsets in K’ is C(K’) = {cf[30], acdf[20], acdef[15]} and the union of the generators of the itemsets in K’ is G(K’) = {f[30], d[20], ac[20], af[20], de[15], ace[15], aef[15]} (see Fig. 1). Thus, G(K’) ∪ C(K’) consists of 10 itemsets out of the 23 itemsets present in K’.

Clearly, G(K) ∪ C(K) ⊆ K. The supports of known itemsets can be determined either from the supports of the closures in K or from the supports of the generators in K as follows:

Property 6. Let X∈K. Then:
a) sup(X) = max{sup(Z) | Z∈C(K) ∧ X⊆Z},
b) sup(X) = min{sup(Y) | Y∈G(K) ∧ Y⊆X}.

Proof: Ad. a) For each Z∈γ(X,K): Z is a known superset of X in C(K) and sup(Z)=pSup(X,K)=sup(X). On the other hand, the supports of known supersets of X are not greater than sup(X) (by Property 1). Thus, sup(X) = max{sup(Z) | Z∈C(K) ∧ X⊆Z}.
Ad. b) Analogous to that for the case a.

The next property states that the pessimistic (optimistic) support of a bounded itemset is the same when calculated w.r.t. K and w.r.t. C(K) (w.r.t. G(K)). In addition, the same closures (generators) of a bounded itemset will be found both in K and in C(K) (in G(K)).

Property 7. Let X∈BIS(K). Then:
a) pSup(X,K) = pSup(X, C(K)),
b) γ(X,K) = γ(X, C(K)),
c) oSup(X,K) = oSup(X, G(K)),
d) G(X,K) = G(X, G(K)).

Proof: Ad. a) pSup(X,K) = max{sup(Z) | Z∈K ∧ X⊆Z} = /* by Property 6a */ = max{sup(Z) | Z∈C(K) ∧ X⊆Z} = pSup(X, C(K)).
Ad. b) By definition, γ(X,K) = MAX{Y∈K | Y⊇X ∧ sup(Y)=pSup(X,K)} = /* by Property 7a */ = MAX{Y∈K | Y⊇X ∧ sup(Y)=pSup(X,C(K))} = MAX{Y∈C(K) | Y⊇X ∧ sup(Y)=pSup(X,C(K))} = γ(X,C(K)).

Ad. c-d) Analogous to those for the cases a-b, respectively.

The lemma below states that each closure in K is a closed itemset and each generator in K is a key generator.

Lemma 1. Let X∈BIS(K). Then:
a) If Y∈γ(X,K), then γ(Y,K)={Y},



b) If Y∈G(X,K), then G(Y,K)={Y}.

Proof: Ad. a) Let Y∈γ(X,K). By the definition of closure, Y is a maximal known superset of X such that sup(Y)=pSup(X,K). Hence, no known proper superset of Y has the same support as Y. In addition, as Y is a known itemset, sup(Y)=pSup(Y,K). Thus we conclude that Y is the only known maximal (improper) superset of Y such that sup(Y)=pSup(Y,K). Therefore, and by the definition of closure, γ(Y,K)={Y}.
Ad. b) Analogous to that for the case a.

Property 8.
a) C(K) = {X∈K | γ(X,K)={X}},
b) G(K) = {X∈K | G(X,K)={X}}.
Proof: Ad. a) We will prove an equivalent statement: X∈C(K) iff γ(X,K)={X}.
(⇒) By the definition of C, X∈C(K) implies ∃Y∈K such that X∈γ(Y,K). Hence, by Lemma 1a, γ(X,K)={X}.
(⇐) By the definition of C, C(K) ⊇ γ(X,K). As γ(X,K) = {X}, then X∈C(K).
Ad. b) Analogous to that for the case a.

By Property 8, the closures in K are exactly the closed itemsets in K, and the generators in K are exactly the key generators in K. Therefore, further on we will use the notions of closed itemsets and closures in K interchangeably. Similarly, we will use the notions of key generators and generators in K interchangeably.

The proposition below specifies a way of determining the closed itemsets and the generators based on the supports of the known itemsets.

Proposition 3. Let X∈K.
a) C(K) = {X∈K | ∀Y∈K, if Y⊃X, then sup(X)≠sup(Y)},
b) G(K) = {X∈K | ∀Y∈K, if Y⊂X, then sup(X)≠sup(Y)}.
Proof: Ad. a) Let X∈K. By Property 8a, X∈C(K) iff γ(X,K) = {X} iff MAX{Y∈K | Y⊇X ∧ sup(Y)=pSup(X,K)} = {X} iff MAX{Y∈K | Y⊇X ∧ sup(Y)=sup(X)} = {X} iff ∀Y∈K, if Y⊃X, then sup(X)≠sup(Y).
Ad. b) Analogous to that for the case a.
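Proposition 3 gives a direct test that is easy to implement. The sketch below (ours) computes C(K), G(K) and R(K) from a dictionary of known itemsets and, applied to the 23 exact itemsets K’ of Fig. 1, should return the 10 itemsets listed in Example 5.

def closed_in(K):
    # C(K): known itemsets with no known proper superset of equal support (Proposition 3a).
    return {X: s for X, s in K.items()
            if not any(X < Y and s == sy for Y, sy in K.items())}

def generators_in(K):
    # G(K): known itemsets with no known proper subset of equal support (Proposition 3b).
    return {X: s for X, s in K.items()
            if not any(Y < X and s == sy for Y, sy in K.items())}

def representation(K):
    # R(K) = G(K) union C(K), a lossless representation of EBIS(K) and BIS(K).
    return {**generators_in(K), **closed_in(K)}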

The lemma below states an interesting fact: the union of the closures (generators) of any subset of bounded itemsets containing all known itemsets is equal to the closed itemsets (key generators) in K.

Lemma 2. Let B be a subset of bounded itemsets such that K⊆B⊆BIS(K).
a) C(B,K) = C(K),
b) G(B,K) = G(K).

Proof: Ad. a) C(B,K) = ∪_{X∈B} γ(X,K) ⊇ ∪_{X∈K} γ(X,K) = C(K). On the other hand, C(B,K) = ∪_{X∈B} γ(X,K) = {Y∈K | ∃X∈B, Y∈γ(X,K)} = /* by Lemma 1a */ = {Y∈K | ∃X∈B, Y∈γ(X,K), γ(Y,K)={Y}} ⊆ {Y∈K | γ(Y,K)={Y}} = /* by Property 8a */ = C(K). Since C(B,K) ⊇ C(K) and C(B,K) ⊆ C(K), then C(B,K) = C(K).
Ad. b) Analogous to that for the case a.



The immediate conclusion from Lemma 2 is that the union of the closures (generators) of all exact bounded itemsets, and of all bounded itemsets, is equal to the closed itemsets (key generators) in K.

Proposition 4.
a) C(EBIS(K)) = C(K),
b) C(BIS(K)) = C(K),
c) C(C(K)) = C(K),
d) G(EBIS(K)) = G(K),
e) G(BIS(K)) = G(K),
f) G(G(K)) = G(K).

Proof: Ad. a,b) Immediate by Lemma 2a.
Ad. c) By Property 8a, C(K) = {X∈K | γ(X,K)={X}}. Now, the following equation is trivially true: C(K) = {X∈C(K) | γ(X,K)={X}}. In addition, by Property 7b, we have γ(X, C(K))=γ(X, K) for any X∈C(K). Hence, C(K) = {X∈C(K) | γ(X, C(K))={X}} = /* by Property 8a */ = C(C(K)).
Ad. d-f) Analogous to those for the cases a-c, respectively.

Proposition 5 claims that each itemset in BIS(K) is bounded by a subset that is a key generator in K and by a superset that is a closed itemset in K.

Proposition 5.
a) BIS(K) = {X⊆I | ∃Y∈G(K), Z∈C(K), Y⊆X⊆Z},
b) EBIS(K) = {X⊆I | ∃Y∈G(K), Z∈C(K), Y⊆X⊆Z, pSup(X,K)=oSup(X,K)}.

Proof: Ad. a) Let X⊆I. By the definition of bounded itemsets, X∈BIS(K) iff ∃Y,Z∈K, Y⊆X⊆Z iff ∃Y,Z∈K such that Y⊆X⊆Z and ∃Y’∈G(K), ∃Z’∈C(K) such that Y’∈G(Y,K) ∧ Z’∈γ(Z,K) iff ∃Y,Z∈K, ∃Y’∈G(K), ∃Z’∈C(K), such that Y’⊆Y⊆X⊆Z⊆Z’ iff /* by the facts: G(K)⊆K and C(K)⊆K */ ∃Y’∈G(K), ∃Z’∈C(K), such that Y’⊆X⊆Z’.

Ad. b) Immediate by definition of EBIS and Proposition 5a.

By Proposition 5 and Property 7a,c, the pair (G(K),C(K)) allows the determination of all exact bounded itemsets and their supports, as well as the determination of all bounded itemsets and their pessimistic and optimistic supports. Hence, (G(K),C(K)) is a lossless representation of EBIS(K) and BIS(K). In the sequel, the union of G(K) and C(K) will be denoted by R(K), that is:

R(K) = G(K) ∪ C(K).

Proposition 6.
a) C(R(K)) = C(K),
b) G(R(K)) = G(K),
c) R(R(K)) = R(K),
d) R(EBIS(K)) = R(K),
e) R(BIS(K)) = R(K),
f) EBIS(K) = EBIS(R(K)),
g) BIS(K) = BIS(R(K)).



Proof: Ad. a) By the definition of C, C(C(K)) = ∪{γ(X, C(K)) | X∈C(K)}, C(R(K)) = ∪{γ(X, R(K)) | X∈R(K)}, and C(K) = ∪{γ(X,K) | X∈K}. Since C(K) ⊆ R(K) ⊆ K and γ(X, C(K)) = γ(X, R(K)) = γ(X, K), then C(C(K)) ⊆ C(R(K)) ⊆ C(K). However, we have C(C(K)) = C(K) (by Proposition 4c). Hence, C(R(K)) = C(K).
Ad. b) Analogous to that for the case a.
Ad. c) R(R(K)) = G(R(K)) ∪ C(R(K)) = /* by Proposition 6a-b */ = G(K) ∪ C(K) = R(K).
Ad. d) R(EBIS(K)) = G(EBIS(K)) ∪ C(EBIS(K)) = /* by Proposition 4a,d */ = G(K) ∪ C(K) = R(K).
Ad. e) Follows by Proposition 4b,e; analogous to that for the case d.
Ad. f) By Proposition 5b.
Ad. g) By Proposition 5a.

As follows from Proposition 6a-b, G(K) and C(K) are determined uniquely by R(K), hence R(K) is a lossless representation of EBIS(K) and BIS(K). Clearly, if G(K) ∩ C(K) ≠ ∅, then R(K) is less numerous than the (G(K),C(K)) representation.

6 Lossless Representation of Derivable Itemsets

Anticipating that the concise lossless representation of BIS*(K) and EBIS*(K) can be significantly less numerous than the corresponding (exact) bounded itemsets, we will investigate how to determine BIS*(K) and EBIS*(K) efficiently by manipulating mainly the concise lossless representation of the known itemsets.

Let K’ be the set of known itemsets augmented by extending and shrinking, i.e. K’ = K ∪ EIS(K) ∪ SIS(K). The proposition below claims that seeking C(K’) among the shrunken itemsets SIS(K) is useless. Similarly, seeking G(K’) among the extended itemsets EIS(K) is useless.

Proposition 7.
a) C(K ∪ EIS(K) ∪ SIS(K)) = C(K ∪ EIS(K)),
b) G(K ∪ EIS(K) ∪ SIS(K)) = G(K ∪ SIS(K)),
c) R(K ∪ EIS(K) ∪ SIS(K)) = C(K∪EIS(K)) ∪ G(K∪SIS(K)).

Proof: Ad. a) Let K’ = K∪EIS(K)∪SIS(K) and K” = K∪EIS(K). By Proposition 2, each itemset Y’∈SIS(K) is a subset of some itemset Y∈K and sup(Y’)=sup(Y). Hence, if Y’⊂Y, then Y’ is closed neither in K’ nor in K” (by Proposition 3a). If, however, Y’=Y, then Y’∈K. Hence, the itemsets in SIS(K) either are not closed in K’ and in K” or belong to K.

Ad. b) Follows from the definition of EIS(K), Proposition 1, and Proposition 3b; it can be proved analogously to the proof for the case a.
Ad. c) Follows immediately from Proposition 7a-b.

Now, we will address the following issue: is the representation R(K ∪ EIS(K) ∪ SIS(K)) derivable from the representation R(K) without referring to the knowledge on the supports of the itemsets in K \ R(K)?



The lemma below states that the closed itemsets in the union of known itemsets K1 and K2 are equal to the closed itemsets in the union of C(K1) and C(K2), and the key generators in K1 ∪ K2 are equal to the key generators in the union of G(K1) and G(K2).

Lemma 3. Let K1,K2 be subsets of known itemsets.
a) C(K1 ∪ K2) = C(C(K1) ∪ C(K2)),
b) G(K1 ∪ K2) = G(G(K1) ∪ G(K2)).

Proof: In Appendix.

Lemma 3 implies that the knowledge on the closed itemsets and generators of the known, extended, and shrunken itemsets can be directly applied for determining C(K∪EIS(K)) and G(K∪SIS(K)).

Corollary 1.
a) C(K ∪ EIS(K)) = C(C(K) ∪ C(EIS(K))),
b) G(K ∪ SIS(K)) = G(G(K) ∪ G(SIS(K))).

Proof: Ad. a-b) Follow immediately from Lemma 3a,b, respectively.

Thus, by Proposition 7c and Corollary 1, R(K ∪ EIS(K) ∪ SIS(K)) is directly derivable from the representation R(K) and the sets C(EIS(K)) and G(SIS(K)). The proposition below states further that the two sets C(EIS(K)) and G(SIS(K)) can also be determined directly from R(K).

Proposition 8.
a) C(EIS(K)) = C({Y∪Z | ∃X∈G(K), ∃Y,Z∈C(K) such that Z,Y⊇X and sup(X)=sup(Z)}),
b) C(EIS(K)) = C(EIS(R(K))),
c) G(SIS(K)) = G({X∪(V\Z) | ∃X,V∈G(K), ∃Y,Z∈C(K) such that Z,Y⊇X, sup(X)=sup(Z), V⊆Y and sup(Y)=sup(V)}),
d) G(SIS(K)) = G(SIS(R(K))).

Proof: In Appendix.

Hence, we conclude that R(K ∪ EIS(K) ∪ SIS(K)) is directly derivable from R(K) - the knowledge on the itemsets in K \ R(K) is superfluous.

Let us define the operator R* as follows:

R*(K) = R^k(K), where
• R^1(K) = R(K),
• R^n(K) = R(R^(n-1)(K) ∪ EIS(R^(n-1)(K)) ∪ SIS(R^(n-1)(K))), for n ≥ 2,
• k is the least value of n such that R^n(K) = R^(n+1)(K).

R* iteratively transforms only the concise lossless representation of the known itemsets by means of the EIS and SIS operators. The lemma below shows a direct correspondence between the auxiliary operators R^k (used for defining R*(K)) and E^k (used for defining EBIS*(K) in Section 4), namely that R^k is a concise lossless representation of E^k.



Lemma 4.
a) E^k(K) = EBIS(R^k(K)),
b) R(E^k(K)) = R^k(K).

Proof: In Appendix.

The immediate conclusion from Lemma 4 is that R*(K) is a lossless representation of EBIS*(K).

Proposition 9.
a) EBIS*(K) = EBIS(R*(K)),
b) BIS*(K) = BIS(R*(K)),
c) R(EBIS*(K)) = R*(K),
d) R(BIS*(K)) = R*(K).

Proof: Ad. a) Follows immediately by Lemma 4a.
Ad. b) BIS*(K) = /* by definition */ = BIS(EBIS*(K)) = /* by Proposition 9a */ = BIS(EBIS(R*(K))). Let K’ = R*(K). Then, BIS*(K) = BIS(EBIS(K’)) = /* by Proposition 6g */ = BIS(R(EBIS(K’))) = /* by Proposition 6d */ = BIS(R(K’)) = BIS(R(R*(K))) = /* by the definition of R* */ = BIS(R*(K)).
Ad. c) Follows immediately by Lemma 4b.
Ad. d) R(BIS*(K)) = /* by Proposition 9b */ = R(BIS(R*(K))) = /* by Proposition 6e */ = R(R*(K)) = /* by the definition of R* */ = R*(K).

Proposition 9 states that R*(K) is not only a lossless representation of EBIS*(K), but also a lossless representation of BIS*(K).

7 Conclusions

In this paper we proposed how to generate a maximal amount of knowledge, in the form of frequent patterns and association rules, from a given set of known itemsets or association rules. Unlike the earlier work in [5], we enabled the generation of patterns that are not bounded by the given sample of known itemsets. We determined the cases in which the original set of known itemsets can be augmented with new patterns that are supersets (EIS(K)) and subsets (SIS(K)) of the known itemsets K and whose support can be precisely determined. The procedure of augmenting the set of known or exact derivable itemsets can be repeated until all exact itemsets are discovered. In order to avoid superfluous calculations, we introduced a lossless concise representation (R(K)) of a given sample of itemsets. We proposed a method of performing the pattern augmentation procedure as an iterative transformation of a concise lossless pattern representation into a new one. When no changes occur as a result of a consecutive transformation, the obtained representation is considered final (R*(K)). We proved that all derivable bounded (BIS*(K)) and exact bounded (EBIS*(K)) patterns can be derived by bounding only the itemsets present in the final representation.



References

1. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases. In: Proc. of the ACM SIGMOD Conference on Management of Data. Washington, D.C. (1993) 207-216

2. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast Discovery of Association Rules. In: Advances in Knowledge Discovery and Data Mining. AAAI, Menlo Park, California (1996) 307-328

3. Kryszkiewicz, M.: Representative Association Rules. In: Proc. of PAKDD '98. Melbourne, Australia. LNAI 1394. Springer-Verlag (1998) 198-209

4. Kryszkiewicz, M.: Mining with Cover and Extension Operators. In: Proc. of PKDD '00. Lyon, France. LNAI 1910. Springer-Verlag (2000) 476-482

5. Kryszkiewicz, M.: Inducing Theory for the Rule Set. In: Proc. of RSCTC '00. Banff, Canada (2000) 353-360

6. Kryszkiewicz, M.: Concise Representation of Frequent Patterns Based on Disjunction-free Generators. In: Proc. of ICDM '01. IEEE Computer Society Press (2001)

7. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient Mining of Association Rules Using Closed Itemset Lattices. Information Systems 24 (1999) 25-46

Appendix: Proofs

In the appendix we prove Lemma 3, Proposition 8, and Lemma 4. In the proof of Lemma 3, we will apply the lemma below.

Lemma 5.
a) (∀Y∈K, if Y⊃X, then sup(X)≠sup(Y)) iff (∀Y∈C(K), if Y⊃X, then sup(X)≠sup(Y)).
b) (∀Y∈K, if Y⊂X, then sup(X)≠sup(Y)) iff (∀Y∈G(K), if Y⊂X, then sup(X)≠sup(Y)).

Proof: Ad. a) (⇒) Trivial, as K ⊇ C(K).
(⇐) (by contradiction). Assume that ∀Y∈C(K), if Y⊃X, then sup(X)≠sup(Y). Let Y’∈K be an itemset such that Y’⊃X and sup(X)=sup(Y’). Let Y∈γ(Y’,K). Then Y∈C(K), Y’⊆Y and sup(Y)=pSup(Y’,K)=sup(Y’). Hence, Y⊇Y’⊃X and sup(Y)=sup(Y’)=sup(X), which contradicts the assumption.

Ad. b) Analogous to that for the case a.

Proof of Lemma 3: Ad. a) Let W(X,Y) denote the condition: Y⊃X implies sup(X)≠sup(Y). By Proposition 3a:
• C(K1∪K2) = {X∈K1∪K2 | ∀Y∈K1∪K2, W(X,Y)} = {X∈K1 | ∀Y∈K1∪K2, W(X,Y)} ∪ {X∈K2 | ∀Y∈K1∪K2, W(X,Y)}.

Now, {X∈K1 | ∀Y∈K1∪K2, W(X,Y)} = {X∈K1 | ∀Y∈K1, W(X,Y)} ∩ {X∈K1 | ∀Y∈K2, W(X,Y)} = /* by Proposition 3a */ = C(K1) ∩ {X∈K1 | ∀Y∈K2, W(X,Y)} = /* by Lemma 5a */ = C(K1) ∩ {X∈K1 | ∀Y∈C(K2), W(X,Y)} = C(K1) ∩ {X∈C(K1) | ∀Y∈C(K2), W(X,Y)} = /* by Proposition 4c */ C(C(K1)) ∩ {X∈C(K1) | ∀Y∈C(K2), W(X,Y)} = /* by Proposition 3a */ = {X∈C(K1) | ∀Y∈C(K1), W(X,Y)} ∩ {X∈C(K1) | ∀Y∈C(K2), W(X,Y)} = {X∈C(K1) | ∀Y∈C(K1)∪C(K2), W(X,Y)}.

Thus, we proved that:
(*) {X∈K1 | ∀Y∈K1∪K2, W(X,Y)} = {X∈C(K1) | ∀Y∈C(K1)∪C(K2), W(X,Y)}.

Similarly, one can prove that:
(**) {X∈K2 | ∀Y∈K1∪K2, W(X,Y)} = {X∈C(K2) | ∀Y∈C(K1)∪C(K2), W(X,Y)}.

By (*) and (**) we obtain: C(K1∪K2) = {X∈C(K1) | ∀Y∈C(K1)∪C(K2), W(X,Y)} ∪ {X∈C(K2) | ∀Y∈C(K1)∪C(K2), W(X,Y)} = {X∈C(K1)∪C(K2) | ∀Y∈C(K1)∪C(K2), W(X,Y)} = /* by Proposition 3a */ = C(C(K1)∪C(K2)).

Ad. b) Analogous to that for the case a.

Proof of Proposition 8: Ad. a) By definition, an itemset W∈EIS(K) provided W=Y∪Z, where Y,Z∈K, and ∃X∈K such that Z,Y⊇X and sup(X)=sup(Z). Let us assume X,Y,Z,W are such itemsets and, in addition, W∈C(EIS(K)). We will show that there are itemsets X’∈G(K) and Y’,Z’∈C(K) such that Z’,Y’⊇X’, sup(X’)=sup(Z’), and W=Y’∪Z’.

Let X’∈G(X,K), Y’∈γ(Y,K), Z’∈γ(Z,K). Hence, X’∈G(K), Y’,Z’∈C(K), X’⊆X, Y’⊇Y, Z’⊇Z, and sup(X’)=sup(X)=sup(Z)=sup(Z’), sup(Y’)=sup(Y). Now, since Z,Y⊇X, we conclude Z’,Y’⊇X’. We deduce further: Y’∪Z’∈EIS(K).

In addition, W = Y∪Z ⊆ Y∪Z’ ⊆ Y’∪Z’ and sup(W) = sup(Y∪Z) = /* by Property 3 */ = sup(Y∪Z’) = /* by Property 3 */ = sup(Y’∪Z’). Hence, W ⊆ Y’∪Z’ and sup(W) = sup(Y’∪Z’). Since W∈C(EIS(K)), then by Proposition 3a: ∀V∈EIS(K), V⊃W implies sup(W)≠sup(V). Thus, there is no proper superset of W in EIS(K) whose support would be equal to sup(W). This implies W = Y’∪Z’.

As W was chosen arbitrarily, we have proved that any closed itemset in C(EIS(K)) can be built solely from R(K).

Ad. b) Follows from (the proof of) Proposition 8a.

Ad. c-d) Analogous to those for the cases a-b, respectively.

Proof of Lemma 4: Ad. a) (by induction) Let k=1.
• E^1(K) = EBIS(K) = /* by Proposition 6f */ = EBIS(R(K)) = EBIS(R^1(K)).

Hence, the lemma is satisfied for k=1. Now, we apply the following induction hypothesis for k>1: for every i<k, E^i(K) = EBIS(R^i(K)).
• E^k(K) = EBIS(E^(k-1)(K) ∪ EIS(E^(k-1)(K)) ∪ SIS(E^(k-1)(K))) = /* by Proposition 6f */ = EBIS( R(E^(k-1)(K) ∪ EIS(E^(k-1)(K)) ∪ SIS(E^(k-1)(K))) ) = /* by Proposition 7c */ = EBIS( G( E^(k-1)(K) ∪ SIS(E^(k-1)(K)) ) ∪ C( E^(k-1)(K) ∪ EIS(E^(k-1)(K)) ) ).

Let K’ = R^(k-1)(K). Now, G( E^(k-1)(K) ∪ SIS(E^(k-1)(K)) ) = /* by the induction hypothesis */ = G( EBIS(K’) ∪ SIS(EBIS(K’)) ) = /* by Lemma 3b */ = G( G(EBIS(K’)) ∪ G(SIS(EBIS(K’))) ) = /* by Proposition 8d */ = G( G(EBIS(K’)) ∪ G(SIS(R(EBIS(K’)))) ) = /* by Propositions 4d, 6d */ = G( G(K’) ∪ G(SIS(R(K’))) ) = /* by Proposition 8d */ = G( G(K’) ∪ G(SIS(K’)) ) = /* by Lemma 3b */ = G( K’ ∪ SIS(K’) ) = G( R^(k-1)(K) ∪ SIS(R^(k-1)(K)) ).

Thus, we have proved that:
(*) G( E^(k-1)(K) ∪ SIS(E^(k-1)(K)) ) = G( R^(k-1)(K) ∪ SIS(R^(k-1)(K)) ).

Analogously, one can prove that:
(**) C( E^(k-1)(K) ∪ EIS(E^(k-1)(K)) ) = C( R^(k-1)(K) ∪ EIS(R^(k-1)(K)) ).

By (*) and (**) we obtain:
• E^k(K) = EBIS( G( R^(k-1)(K) ∪ SIS(R^(k-1)(K)) ) ∪ C( R^(k-1)(K) ∪ EIS(R^(k-1)(K)) ) ) = /* by Proposition 7c */ = EBIS(R( R^(k-1)(K) ∪ SIS(R^(k-1)(K)) ∪ EIS(R^(k-1)(K)) )) = EBIS(R^k(K)).

Ad. b) R(E^k(K)) = /* by Lemma 4a */ = R(EBIS(R^k(K))) = /* by Proposition 6d */ = R(R^k(K)) = /* by the definition of R* */ = R^k(K).


Anytime Possibilistic Propagation Algorithm

Nahla Ben Amor1, Salem Benferhat2, and Khaled Mellouli1

1 Institut Superieur de Gestion de Tunis
nahla.benamor, [email protected]

2 Institut de Recherche en Informatique de Toulouse (I.R.I.T)
[email protected]

Abstract. This paper proposes a new anytime possibilistic inference algorithm for min-based directed networks. Our algorithm departs from a direct adaptation of probabilistic propagation algorithms since it avoids the transformation of the initial network into a junction tree, which is known to be a hard problem. The proposed algorithm is composed of several local stabilization procedures. Stabilization procedures aim to guarantee that the local distributions defined on each node are coherent with respect to the ones of its parents. We provide experimental results which, for instance, compare our algorithm with the ones based on a direct adaptation of probabilistic propagation algorithms.

1 Introduction

In possibility theory there are two different ways to define the counterpart of Bayesian networks. This is due to the existence of two definitions of possibilistic conditioning: product-based and min-based conditioning [5] [7] [14]. When we use the product form of conditioning, we get a possibilistic network close to the probabilistic one, sharing the same features and having the same theoretical and practical results [4]. However, this is not the case with min-based networks [10] [12].

This paper focuses on min-based possibilistic directed graphs, by proposing a new algorithm for propagating uncertain information in a possibility theory framework. Our algorithm is an anytime algorithm. It is composed of several steps, which progressively get close to the exact possibility degrees of a variable of interest. The first step is to transform the initial possibilistic graph into an equivalent undirected graph. Each node in this graph contains a node from the initial graph and its parents and will be quantified by their local joint distribution instead of the conditional one. Then, different stability procedures are used in order to guarantee that the joint distributions on a given node are in agreement with those of its parents.

The algorithm successively applies one-parent stability, two-parents stability, ..., n-parents stability, which respectively check the stability with respect to only one parent, two parents, ..., all parents. We show that the more parents we consider, the better the results, compared with the exact possibility degrees of the variables of interest. We also provide experimental results showing the merits of our algorithm.




Section 2 gives a brief background on possibility theory. Section 3 introduces min-based possibilistic graphs and briefly recalls a standard propagation algorithm. Section 4 presents our new algorithm. Section 5 considers the case where new evidence is taken into account. Lastly, Section 6 gives some experimental results.

2 Basics of Possibility Theory

Let V = {A1, A2, ..., AN} be a set of variables. We denote by DA = {a1, .., an} the domain associated with the variable A. By a we denote any instance of A. Ω = ×_{Ai∈V} DAi denotes the universe of discourse, which is the Cartesian product of all variable domains in V. Each element ω ∈ Ω is called a state of Ω. ω[A] denotes the instance in ω of the variable A. In the following, we only give a brief recall of possibility theory; for more details see [7].
A possibility distribution π is a mapping from Ω to the interval [0, 1]. It represents a state of knowledge about a set of possible situations, distinguishing what is plausible from what is less plausible. Given a possibility distribution π defined on the universe of discourse Ω, we can define a mapping grading the possibility measure of an event φ ⊆ Ω by Π(φ) = max_{ω∈φ} π(ω). A possibility distribution π is said to be α-normalized if h(π) = max_ω π(ω) = α. If α = 1, then π is simply said to be normalized. h(π) is called the consistency degree of π.
Possibilistic conditioning: In the possibilistic setting, conditioning consists in modifying our initial knowledge, encoded by a possibility distribution π, by the arrival of a new sure piece of information φ ⊆ Ω. In possibility theory there are two well-known definitions of conditioning:
- min-based conditioning, proposed in an ordinal setting [7] [14]:

Π(ψ | φ) = Π(ψ ∧ φ) if Π(ψ ∧ φ) < Π(φ), and 1 otherwise. (1)

- product-based conditioning, proposed in a numerical setting, which is a direct counterpart of probabilistic conditioning:

Π(ψ |p φ) = Π(ψ ∧ φ)/Π(φ) if Π(φ) ≠ 0, and 0 otherwise. (2)
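To make the two definitions concrete, here is a small sketch (ours, not the authors') that applies them pointwise to a possibility distribution stored as a dict from states to degrees; the event phi is a set of states, and the state names are invented.

def poss(event, pi):
    # Possibility measure of an event: max of pi over the states in the event.
    return max((pi[w] for w in event), default=0.0)

def cond_min(pi, phi):
    # Min-based conditioning (1), applied state by state.
    p_phi = poss(phi, pi)
    return {w: (1.0 if w in phi and pi[w] == p_phi else pi[w] if w in phi else 0.0)
            for w in pi}

def cond_prod(pi, phi):
    # Product-based conditioning (2), applied state by state.
    p_phi = poss(phi, pi)
    return {w: (pi[w] / p_phi if w in phi and p_phi != 0 else 0.0) for w in pi}

pi = {"w1": 0.4, "w2": 1.0, "w3": 0.7}
print(cond_min(pi, {"w1", "w3"}))   # {'w1': 0.4, 'w2': 0.0, 'w3': 1.0}
print(cond_prod(pi, {"w1", "w3"}))  # {'w1': 0.571..., 'w2': 0.0, 'w3': 1.0}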

Possibilistic independence: There are several definitions of independence relations in the possibilistic framework [2] [5] [11]. In particular, two definitions have been used in the perspective of possibilistic networks:
- the Non-Interactivity independence [16], defined by:

Π(x ∧ y | z) = min(Π(x | z), Π(y | z)), ∀x, y, z. (3)

- the Product-based independence relation defined by:

Π(x |p y ∧ z) = Π(x |p z), ∀x, y, z. (4)



3 Min-Based Possibilistic Graphs

This section defines min-based possibilistic graphs and briefly recalls the direct adaptation of the probabilistic propagation algorithm in a possibility theory framework.

3.1 Basic Definitions

A min-based possibilistic graph over a set of variables V, denoted by ΠG, consists of:
- a graphical component, which is a DAG (Directed Acyclic Graph) where nodes represent variables and edges encode the links between the variables. The parent set of a node A is denoted by UA.
- a numerical component, which quantifies the different links. For every root node A (UA = ∅), uncertainty is represented by the a priori possibility degree Π(a) of each instance a ∈ DA, such that max_a Π(a) = 1. For the rest of the nodes (UA ≠ ∅), uncertainty is represented by the conditional possibility degree Π(a | uA) of each instance a ∈ DA and uA ∈ DUA. These conditional distributions satisfy the following normalization condition: max_a Π(a | uA) = 1, for any uA. The set of a priori and conditional possibility degrees induces a unique joint possibility distribution defined by:

Definition 1 Given the a priori and conditional possibilities, the joint distribution, denoted by πm, is expressed by the following min-based chain rule:

πm(A1, .., AN) = min_{i=1..N} Π(Ai | UAi) (5)

Example 1 Let us consider the min-based possibilistic network ΠG composed of the DAG of Figure 1 and the initial distributions given in Tables 1 and 2.

Table 1. Initial distributions

a Π(a):   a1 1,  a2 0.9
b a Π(b | a):   b1 a1 1,  b1 a2 0,  b2 a1 0.4,  b2 a2 1
c a Π(c | a):   c1 a1 0.3,  c1 a2 1,  c2 a1 1,  c2 a2 0.2

These a priori and conditional possibilities encode the joint distribution relative to A, B, C and D using (5) as follows: ∀a, b, c, d, πm(a ∧ b ∧ c ∧ d) = min(Π(a), Π(b | a), Π(c | a), Π(d | b ∧ c)). For instance, πm(a1 ∧ b2 ∧ c2 ∧ d1) = min(1, 0.4, 1, 1) = 0.4. Moreover, we can check that h(πm) = 1.
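A direct encoding of this chain rule for the network of Example 1 (our sketch; the dictionaries simply transcribe Tables 1 and 2):

P_A = {"a1": 1.0, "a2": 0.9}
P_B_given_A = {("b1", "a1"): 1.0, ("b1", "a2"): 0.0, ("b2", "a1"): 0.4, ("b2", "a2"): 1.0}
P_C_given_A = {("c1", "a1"): 0.3, ("c1", "a2"): 1.0, ("c2", "a1"): 1.0, ("c2", "a2"): 0.2}
P_D_given_BC = {("d1", "b1", "c1"): 1.0, ("d1", "b1", "c2"): 1.0, ("d1", "b2", "c1"): 1.0, ("d1", "b2", "c2"): 1.0,
                ("d2", "b1", "c1"): 1.0, ("d2", "b1", "c2"): 0.0, ("d2", "b2", "c1"): 0.8, ("d2", "b2", "c2"): 1.0}

def joint(a, b, c, d):
    # Min-based chain rule (5) for the DAG of Figure 1 (A -> B, A -> C, B -> D, C -> D).
    return min(P_A[a], P_B_given_A[(b, a)], P_C_given_A[(c, a)], P_D_given_BC[(d, b, c)])

print(joint("a1", "b2", "c2", "d1"))  # 0.4, as in Example 1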



Table 2. Initial distributions

d b c Π(d | b ∧ c):   d1 b1 c1 1,  d1 b1 c2 1,  d1 b2 c1 1,  d1 b2 c2 1,  d2 b1 c1 1,  d2 b1 c2 0,  d2 b2 c1 0.8,  d2 b2 c2 1

Fig. 1. Example of a multiply connected DAG over the variables A, B, C and D (with edges A→B, A→C, B→D and C→D, as used in Example 1)

3.2 Possibilistic Propagation in Junction Trees

This section summarizes a direct adaptation of the probabilistic propagation algorithm [15] in the possibilistic framework. For more details see [4]. The principle of this propagation method is to transform the initial DAG into a junction tree and then to perform the propagation on this new graph.

Given a min-based possibilistic network, the construction of its corresponding junction tree is performed in the same manner as in the probabilistic case [15]. In a first step, the DAG is moralized by adding undirected edges between the parents and by dropping the direction of existing edges. Then, the moral graph is triangulated, which means that every cycle of length four or greater contains an edge that connects two non-adjacent nodes in the cycle. It is possible to have different triangulations of a moral graph. In particular, we can simply construct a unique cluster containing all the variables. However, such a triangulation is not interesting since it does not allow local computations. The task of finding an optimal triangulation is an NP-complete problem [6]. Finally, the triangulated graph is transformed into a junction tree where each node represents a cluster of variables and each edge is labeled with a separator corresponding to the intersection of its adjacent clusters.

Once the junction tree is constructed, it is initialized using the initial conditional distributions and the observed nodes. Then, the propagation process starts via a message passing mechanism between the different clusters, after choosing an arbitrary cluster to be the pivot of the propagation. Similarly to probabilistic networks, the message flow is divided into two phases:
- a collect-evidence phase, in which each cluster passes a message to its neighbor in the pivot direction, beginning with the clusters farthest from the pivot;
- a distribute-evidence phase, in which each cluster passes messages to its neighbors away from the pivot direction, beginning with the pivot itself.

These two message passes ensure that the potential of each cluster corresponds to its local distribution. Thus, we can compute the possibility measure of any variable of interest by simply marginalizing any cluster potential containing it.

4 Anytime Possibilistic Propagation Algorithm

4.1 Basic Ideas

The product-based possibilistic networks are very close to Bayesian networks since conditioning is defined in the same way in the two frameworks. This is not the case for min-based networks, since the minimum operator has different properties, such as idempotency. Therefore, we propose a new propagation algorithm for such networks which is not a direct adaptation of probabilistic propagation algorithms. In particular, we avoid the transformation of the initial network into a junction tree.

Given a min-based possibilistic network ΠG, the proposed algorithm locally computes, for any instance a of a variable of interest A, the possibility degree Πm(a) inferred from ΠG. Note that computing Πm(a) corresponds to possibilistic inference with no evidence. The more general problem of computing Πm(a | e), where e is the total evidence, is addressed in Section 5. The basic steps of the propagation algorithm are:

– Initialization. Transforms the initial network into an equivalent secondary structure, also called here for simplicity the moral graph, composed of clusters of variables obtained by adding to each node its parents. Then it quantifies the graph using the initial conditional distributions. Lastly, it incorporates the instance a of the variable of interest A.

– One-parent stability. Ensures that any cluster agrees with each of its parents on the distributions defined on common variables.

– n-parents stability. Ensures that any cluster agrees on the distributions defined on common variables computed from 2, 3, .., n parents.

– n-best-parents stability. Ensures that only the best instances in the distribution of each cluster agree with the best instances in the distribution computed from the parent set.

The proposed algorithm is an anytime algorithm: the longer it runs, the closer we get to the exact marginals.

4.2 Initialization

From conditional to local joint distributions. The first step in the initialization procedure is to transform the initial network into an equivalent secondary structure, also called the moral graph, and denoted MG. Moral graphs are obtained by adding to each node its parent set and by dropping the direction of existing edges. Each node in MG is called a cluster and is denoted by Ci. Each edge in MG is labeled with the intersection of its clusters Ci and Cj, called a separator and denoted by Sij. Note that, contrary to the classical construction of separators, a link is not necessary between clusters sharing the same parents. The initial conditional distributions are transformed into local joint distributions. Namely, to each cluster Ci of MG we assign a local joint distribution relative to its variables, called its potential and denoted by πCi.

We denote by ci and sij the possible instances of the cluster Ci and the separator Sij, respectively. ci[A] denotes the instance in ci of the variable A. The outline of this first phase of the initialization procedure is as follows:

Algorithm 1 From conditional to local joint distributions
Begin
1. Build the moral graph:
- For each variable Ai, form a cluster Ci = Ai ∪ UAi.
- For each edge connecting two nodes Ai and Aj, form an undirected edge in the moral graph between the clusters Ci and Cj, labeled with a separator Sij corresponding to their intersection.
2. Quantify the moral graph:
For each cluster Ci: πCi(Ai ∧ UAi) ← Π(Ai | UAi)
End
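A possible Python rendering of Algorithm 1 (ours, with invented data structures: parents maps each variable to the tuple of its parents, cpt maps each variable to a table indexed by the value of the variable followed by the values of its parents, and domains lists the values of each variable). For Example 1 one would take parents = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C")} with the tables keyed accordingly.

from itertools import product

def build_moral_graph(parents, cpt, domains):
    # One cluster Ci = {Ai} union U_Ai per variable, quantified by pi_Ci(Ai, U_Ai) <- Pi(Ai | U_Ai).
    clusters, potentials = {}, {}
    for A, U in parents.items():
        scope = (A,) + tuple(U)
        clusters[A] = scope
        potentials[A] = {values: cpt[A][values]
                         for values in product(*(domains[V] for V in scope))}
    # One separator per DAG edge parent -> child: the intersection of the two clusters.
    separators = {(P, A): tuple(set(clusters[P]) & set(clusters[A]))
                  for A, U in parents.items() for P in U}
    return clusters, potentials, separators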

From MG we can associate a unique possibility distribution, defined by:

Definition 2 The joint distribution associated with MG, denoted πMG, is expressed by:

πMG(A1, .., AN) = min_{i=1..N} πCi (6)

Example 2 Let us consider the ΠG given in Example 1. The moral graph corresponding to ΠG is represented in Figure 2. The initial distributions are transformed into joint ones as shown in Table 3.

Table 3. Initialized potentials of A, AB, AC and BCD

a πA:   a1 1,  a2 0.9
a b πAB:   a1 b1 1,  a1 b2 0.4,  a2 b1 0,  a2 b2 1
a c πAC:   a1 c1 0.3,  a1 c2 1,  a2 c1 1,  a2 c2 0.2
b c d πBCD:   b1 c1 d1 1,  b1 c1 d2 1,  b1 c2 d1 1,  b1 c2 d2 0,  b2 c1 d1 1,  b2 c1 d2 0.8,  b2 c2 d1 1,  b2 c2 d2 1



Fig. 2. Moral graph of the DAG in Figure 1 (clusters A, AB, AC and BCD; the separator between A and AB and between A and AC is A, the separator between AB and BCD is B, and the separator between AC and BCD is C)

Incorporating the variable of interest. Let A be the variable of interest and let a be any of its instances. We are interested in computing Πm(a). We first define a new possibility distribution πa from πm as follows:

πa(ω) = πm(ω) if ω[A] = a, and 0 otherwise. (7)

Proposition 1 Let πm be a possibility distribution obtained by (5) and πa be a possibility distribution computed from πm using (7). Then,

Πm(a) = h(πa) = max_ω πa(ω).

This proposition means that the possibility degree Πm(a) is equal to the consistency degree of πa.

The incorporation of the instance a in MG should be such that the possibility distribution obtained from the moral graph is equal to πa. This can be obtained by modifying the potential of the cluster Ci as follows:

πCi(ci) ← πCi(ci) if ci[A] = a, and 0 otherwise.

Example 3 Suppose that we are interested in the value of Πm(D = d2). Table 4 represents the potential of the cluster BCD after incorporating this variable.

Table 4. Initialized potential of BCD after incorporating the evidence D = d2

b c d πBCD:   b1 c1 d1 0,  b1 c1 d2 1,  b1 c2 d1 0,  b1 c2 d2 0,  b2 c1 d1 0,  b2 c1 d2 0.8,  b2 c2 d1 0,  b2 c2 d2 1



Proposition 2 shows that the moral graph obtained by incorporating the variable of interest A leads, indeed, to the possibility distribution πa.

Proposition 2 Let ΠG be a min-based possibilistic network. Let MG be the moral graph corresponding to ΠG given by the initialization procedure.

Let πa be the joint distribution given by (7) (which is obtained after incorporating the instance a of the variable of interest A). Let πMG be the joint distribution encoded by MG (given by (6)) after the initialization procedure. Then πa = πMG.

The following subsections present several stabilization procedures which aim to approach the exact value of h(πa) (hence Πm(a)). They are based on the notion of stability, which means that adjacent clusters agree on the marginal distributions defined on common variables.

4.3 One-Parent Stability

One-parent stability means that any cluster agrees with each of its parents on the distributions defined on common variables. More formally,

Definition 3 Let Ci and Cj be two adjacent clusters and let Sij be their separator. The separator Sij is said to be one-parent stable if:

max_{Ci\Sij} πCi = max_{Cj\Sij} πCj (8)

where max_{Ci\Sij} πCi (resp. max_{Cj\Sij} πCj) is the marginal distribution of Sij defined from πCi (resp. πCj).

A moral graph MG is said to be one-parent stable if all of its separators are one-parent stable.

The one-parent stability procedure is performed via a message passing mechanism between the different clusters. Each separator collects information from its corresponding clusters, then diffuses it to each of them, in order to update them by taking the minimum between their initial potential and the one diffused by their separator. This operation is repeated until there is no modification of the cluster potentials. The potentials of any adjacent clusters Ci and Cj (with separator Sij) are updated as follows:

– Collect evidence (Update separator):

πSij ← min( max_{Ci\Sij} πCi , max_{Cj\Sij} πCj ) (9)

– Distribute evidence (Update clusters):

πCi ← min(πCi , πSij ) (10)

πCj ← min(πCj , πSij ) (11)
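The collect/distribute steps can be sketched as follows (our code, again on the invented structures of the earlier sketches); the loop repeats equations (9)-(11) over all separators until no cluster potential changes.

def marginalize(pot, scope, sep):
    # Max-project a cluster potential onto a separator.
    idx = [scope.index(V) for V in sep]
    out = {}
    for values, degree in pot.items():
        key = tuple(values[i] for i in idx)
        out[key] = max(out.get(key, 0.0), degree)
    return out

def one_parent_stability(potentials, clusters, separators):
    changed = True
    while changed:
        changed = False
        for (Ci, Cj), sep in separators.items():
            mi = marginalize(potentials[Ci], clusters[Ci], sep)
            mj = marginalize(potentials[Cj], clusters[Cj], sep)
            pi_sep = {k: min(mi[k], mj[k]) for k in mi}          # equation (9)
            for C in (Ci, Cj):                                   # equations (10) and (11)
                scope, pot = clusters[C], potentials[C]
                idx = [scope.index(V) for V in sep]
                for values, degree in pot.items():
                    new = min(degree, pi_sep[tuple(values[i] for i in idx)])
                    if new < degree:
                        pot[values] = new
                        changed = True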



These two steps are repeated until one-parent stability is reached on all clusters. At each level of the stabilization procedure, the moral graph encodes the same joint distribution:

Proposition 3 Let MG be a moral graph and let MG′ be the resulting moral graph after the modification of two adjacent clusters Ci and Cj using equations (9), (10) and (11). Then πMG = πMG′.

It can be shown that one-parent stability is reached after a finite number of message passes, and hence it is a polynomial procedure. The following proposition shows that if a moral graph is stabilized at one parent, then the maximum value of all its cluster potentials is the same.

Proposition 4 Let MG be a stabilized moral graph. Then ∀Ci,

α = max πCi.

From Propositions 2 and 3 we deduce that, from the initialization to the one-parent stability level, the moral graph encodes the same joint distribution, i.e. πa = πMG.

Example 4 Let us consider the moral graph initialized in Examples 2 and 3. Note first that this moral graph is not one-parent stabilized. For instance, the separator A between the two clusters AB and A is not one-parent stable since max_{AB\A} πAB(a2) = 1 ≠ πA(a2) = 0.9.

At one-parent stability, reached after two message passes, we obtain the potentials given in Table 5. The maximum potential is the same in the four clusters, i.e. max πA = max πAB = max πAC = max πBCD = 0.9.

Table 5. Stabilized potentials

a πA:   a1 0.9,  a2 0.9
a b πAB:   a1 b1 0.9,  a1 b2 0.4,  a2 b1 0,  a2 b2 0.9
a c πAC:   a1 c1 0.3,  a1 c2 0.9,  a2 c1 0.9,  a2 c2 0.2
b c d πBCD:   b1 c1 d1 0,  b1 c1 d2 0.9,  b1 c2 d1 0,  b1 c2 d2 0,  b2 c1 d1 0,  b2 c1 d2 0.8,  b2 c2 d1 0,  b2 c2 d2 0.9

Note that one-parent stability does not guarantee that the degree α corresponds to the exact degree Πm(a) = h(πa), since the equality h(πMG) = α is not always verified. Indeed, we can check in the previous example that h(πMG) = 0.8 ≠ 0.9. Nevertheless, as we will see later, experiments show that, in general, this equality holds.



4.4 N-Parents Stability

As noted before, one-parent stability does not always guarantee local computations (from clusters) of the possibility measure Πm(A) of any variable of interest A. Thus, our idea is to improve the resulting possibility degree by considering stability with respect to a greater number of parents. Therefore, we will increase the number of parents by first considering two parents, then three parents, until reaching n parents, where n is the cardinality of the parent set relative to each cluster. To illustrate this procedure we only present the two-parents stability. The principle of this procedure is to ensure, for each cluster having at least two parents, its stability with respect to each pair of parents. More formally:

Definition 4 Let Ci be a cluster in a moral graph MG, and let Cj and Ck be two parents of Ci. Let Sij be the separator between Ci and Cj and Sik the separator between Ci and Ck. Let C = Cj ∪ Ck and let S = Sij ∪ Sik. Let πC be the joint distribution computed from πCj and πCk. The cluster Ci is said to be stable with respect to its two parents Cj and Ck if: max_{Ci\S} πCi = max_{C\S} πC.¹

In a similar way, a cluster Ci is said to be two-parents stable if it is stable with respect to each pair of its parents. Then, a moral graph MG is said to be two-parents stable if all of its clusters are two-parents stable.

The following procedure ensures the stability of Ci with respect to Cj and Ck:

Algorithm 2 Stabilize a cluster Ci with respect to two parents Cj and Ck
Begin
- Compute πC using πCj and πCk: πC ← min(πCj , πCk)
- Compute πS using πC: πS ← max_{C\S} πC
- Update πCi using πS: πCi ← min(πCi , πS)
End

The two-parents stability preserves the joint distribution encoded by the moral graph:

Proposition 5 Let πa be the joint distribution given by (7). Let πMG be the joint distribution encoded by MG after the two-parents stability procedure. Then, πa = πMG.

The following proposition shows that the two-parents stability improves the one-parent stability:

¹ where max_{Ci\S} πCi (resp. max_{C\S} πC) is the marginal distribution of S defined from πCi (resp. πC).



Proposition 6 Let α1 be the maximal degree generated by the one-parent stability (which is unique, cf. Proposition 4). Let α2 be the maximal degree generated by the two-parents stability. Then,

α1 ≥ α2 ≥ Πm(a).

Example 5 Let us consider the inconsistent stabilized moral graph of Example 4. The two-parents stabilized potential of the cluster BCD with respect to its two parents AB and AC is given in Table 6. Note, for instance, that the potential of c2 ∧ b2 ∧ d2 decreases from 0.9 to 0.4. Thus, we should re-stabilize the moral graph at one parent (see Table 7). We can check that the resulting moral graph is two-parents stabilized. Moreover, we have h(πMG) = 0.8; in other terms, we have reached the consistency degree of πa.

Table 6. Two-parents stabilized potential of BCD

b c d πBCD:   b1 c1 d1 0,  b1 c1 d2 0.3,  b1 c2 d1 0,  b1 c2 d2 0,  b2 c1 d1 0,  b2 c1 d2 0.8,  b2 c2 d1 0,  b2 c2 d2 0.4

Table 7. One-parent re-stabilized potentials

a πA:   a1 0.4,  a2 0.8
a b πAB:   a1 b1 0.3,  a1 b2 0.4,  a2 b1 0,  a2 b2 0.8
a c πAC:   a1 c1 0.3,  a1 c2 0.4,  a2 c1 0.8,  a2 c2 0.2
b c d πBCD:   b1 c1 d1 0,  b1 c1 d2 0.3,  b1 c2 d1 0,  b1 c2 d2 0,  b2 c1 d1 0,  b2 c1 d2 0.8,  b2 c2 d1 0,  b2 c2 d2 0.4

4.5 N-Best-Parents Stability

Ideally, we want to perform an n-parents stability where n is the cardinality of the parent set relative to each cluster. In other terms, each cluster would be stabilized with respect to the whole set of its parents. However, this can be impossible, especially when a cluster has a large number of parents, since we should compute their Cartesian product. In order to avoid this problem, we relax the n-parents stability by only computing the best instances in this Cartesian product, called the best global instances. The main motivation of n-best-parents stability is that our aim is to compute the exact value of h(πa), and not the whole distribution πa. The idea is to cover, for any cluster Ci, its n parents by only saving the best instances (i.e. those having the maximum degree) of each cluster and by combining them while eliminating the incoherent instances. Once the best global instances are constructed, we can compute the best instances relative to the n separators existing between Ci and its parents and compare them with the ones obtained from Ci. If some instances in Ci are incoherent with those computed from the parents, then we decrease their degrees. This is illustrated by the following example.

Example 6 Let us consider the cluster CEFG in Figure 3, having three parents ABC, CDE and F. The figure shows the best instances in each cluster (for instance, the best instance in the cluster F is f1). From the Cartesian product of the best instances (i.e. the best global instances) we can check that the best instances relative to the three separators C, E and F are c1 ∧ e1 ∧ f1 and c1 ∧ e2 ∧ f1. However, from the cluster CEFG, the best instances relative to the separators are c1 ∧ e1 ∧ f1 and c2 ∧ e1 ∧ f2. Thus, we should decrease the degree of the instance c2 ∧ e1 ∧ f2 ∧ g1 from α to the next degree in ABC, CDE and F and re-stabilize the moral graph at one parent.

Fig. 3. Example of n-best-parents stability (cluster CEFG with its three parents ABC, CDE and F, their best instances, and the resulting best global instances)



5 Handling the Evidence

The proposed propagation algorithm can be easily extended in order to take into account new evidence e, which corresponds to a set of instantiated variables. The computation of Πm(a | e) is performed via two calls of the above propagation algorithm, in order to compute successively Πm(e) and Πm(a ∧ e). Then, using min-based conditioning, we get:

Πm(a | e) = Πm(a ∧ e) if Πm(a ∧ e) < Πm(e), and 1 otherwise.

The computation of Πm(e) needs a slight transformation of the initialization procedure, since evidence can be obtained on several variables. More precisely, the phase of incorporation of the instance of interest is replaced by: incorporating the instance a1 ∧ .. ∧ aM of the variables of interest A1, .., AM, i.e.:

∀i ∈ {1, ..,M}, πCi(ci) ← πCi(ci) if ci[Ai] = ai, and 0 otherwise.

6 Experimental Results

The experimentation is performed on random possibilistic networks generated as follows.
Graphical component: we used two DAG structures, generated as follows:

– In the first structure, the DAGs are generated randomly by varying three parameters: the number of nodes, their cardinality and the maximum number of parents.

– In the second one, we choose special cases of DAGs where nodes are partitioned into levels such that nodes of level i only receive arcs either from nodes of the same level, or from level i − 1. For instance, the DAG of Figure 4 has 4 levels: the first contains 5 nodes, the second 7 nodes, the third 3 nodes and the fourth 5 nodes. Note that if we consider only two levels, by omitting the intra-level links, this structure corresponds to the QMR (Quick Medical Reference) network [13].

Numerical component: once the DAG structure is fixed, we generate random conditional distributions for each node in the context of its parents. Then, we generate a random variable of interest.
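A possible reconstruction of this generator is sketched below (Python, our own illustration; the parameter names, the use of the standard random module and the normalisation choice of giving at least one value the degree 1 per parent configuration are assumptions, not the authors' exact procedure):

```python
import random

def random_leveled_dag(nodes_per_level, max_parents=3):
    """Generate a DAG whose nodes are partitioned into levels; a node of
    level i may only receive arcs from level i (earlier nodes) or level i-1."""
    levels, parents, node_id = [], {}, 0
    for size in nodes_per_level:
        levels.append([node_id + k for k in range(size)])
        node_id += size
    for i, level in enumerate(levels):
        for node in level:
            # candidate parents: earlier nodes of the same level, or level i-1
            candidates = [m for m in level if m < node]
            if i > 0:
                candidates += levels[i - 1]
            k = random.randint(0, min(max_parents, len(candidates)))
            parents[node] = random.sample(candidates, k)
    return parents

def random_conditional_possibilities(parents, cardinality=2):
    """Random conditional possibility distributions: for every node and every
    parent configuration, at least one value gets possibility degree 1."""
    tables = {}
    for node, pa in parents.items():
        table = []
        for _ in range(cardinality ** len(pa)):
            degrees = [round(random.random(), 2) for _ in range(cardinality)]
            degrees[random.randrange(cardinality)] = 1.0   # normalisation
            table.append(degrees)
        tables[node] = table
    return tables

parents = random_leveled_dag([5, 7, 3, 5])   # e.g. the 4-level DAG of Figure 4
tables = random_conditional_possibilities(parents)
```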

6.1 Stability vs. Consistency

Fig. 4. Example of a DAG with 4 levels (20 nodes; the levels contain 5, 7, 3 and 5 nodes)

In the first experiment we test the quality of the stability with respect to the consistency degree h(πa) (i.e. Πm(a)). Regarding the first structure, we noted that one-parent stability and two-parents stability provide, respectively, 99% and 99.999% exact results. For this reason, we tested the second structure, considering 19 levels, from 2 to 20. At each level we generate 300 networks with a number of nodes varying between 40 and 60, since we are limited, in some cases, by the junction tree algorithm2, which is unable to handle complex networks with a great number of nodes. Table 8 gives the parameters of this experiment.

Table 8. Parameters of the experiment on stability vs. consistency

levels  nodes  links      levels  nodes  links
2       45     68         12      40     83
3       40     80         13      49     106
4       45     88         14      50     107
5       40     90         15      49     102
6       40     85         16      48     99
7       40     81         17      51     105
8       40     86         18      54     112
9       40     85         19      57     120
10      40     84         20      60     125
11      40     85

Figure 5 shows the results of this experiment. At each level (from 2 to 20), the first (resp. second, third, fourth) bar from the left represents the percentage of networks where one-parent (resp. two-parents, three-parents, n-best-parents) stability leads to consistency (i.e. generates the exact marginals). It is clear that the higher the number of parents considered in the stability procedure, the better the quality of the results. Moreover, this figure shows that the stability degree, even at one parent, is a good estimation of the consistency degree (96.11%). In addition, we note that the quality of the estimation depends on the number of levels in the DAG: with a small number of levels (2, 3 and 4), one-parent stability is sufficient to reach consistency. These results are interesting since they show that for networks with complex structures and a great number of nodes, we can use one-parent stability, which is a polynomial procedure. Indeed, as we will see later, in such cases the exact algorithm can generate huge clusters where local computations are impossible.

2 The junction tree algorithm (cf. Subsection 3.2) is used to provide exact values of Π(a).

Figure 6 compares the running times of the different stability procedures for DAGs of 50 nodes and 100 links on average. It is clear that one-parent stability is the fastest, while n-best-parents stability is the slowest.

Fig. 5. Stability vs Consistency (for each number of levels from 2 to 20, the bars give the percentage of networks where stability reaches consistency; overall: one-parent stability 96.11%, two-parents stability 99.18%, three-parents stability 99.46%, n-best parents stability 99.51%)

Fig. 6. Running time (in seconds) of the different stability procedures: one-parent, two-parents, three-parents and n-best parents


6.2 Correlation between Exact Marginals and Stability Degrees

We are now interested in the correlation between the exact marginals and the ones generated by the stabilization procedure. This experiment is performed on 100 random networks with 20 levels and 60 nodes. For each network, the evidence, the variable of interest and the instance of interest are fixed randomly. Then, we compare the possibility degree of the instance of interest generated by the junction tree algorithm (exact marginals) with the one generated by the one-parent stability procedure.

Figure 7 shows the results of this experiment. Again, we confirm that one-parent stability is a good estimation of consistency. Indeed, even in the cases where equality between the exact marginals and those obtained from the stability procedure does not hold, the gap is small.

Fig. 7. Correlation plot of exact marginals (x-axis, 0 to 1) against one-parent stability degrees (y-axis, 0 to 1)

6.3 Comparing Junction Tree Algorithm with One-Parent Stability

We have also compared experimentally the junction tree algorithm with one-parent stability. In this experiment, using the first structure, we vary the ratio links/nodes in order to test the limits of the junction tree algorithm. For instance, with networks containing 40 (resp. 50, 60) nodes, the junction tree algorithm is blocked from a ratio of 3.55 (resp. 2.72, 1.78), while one-parent stability provides a result in a few seconds. When the junction tree algorithm is not blocked, it is faster than one-parent stability; however, the difference does not exceed a few seconds.


7 Conclusion

This paper has proposed an anytime propagation algorithm for min-based directed networks. The stability procedures improve those presented in [1], since we use more than one-parent stability. Moreover, this paper reports experimental results which are very encouraging, since they show that consistency is reached in most cases and that our algorithm can be used in situations where the junction tree algorithm is limited.
For lack of space, we have not presented the consistency procedure, which provides exact values. A first version of this procedure is sketched in [1]. Future work will be to improve it and to compare our algorithm with the ones used in possibilistic logic [8] and in FCSP [9].

References

1. N. Ben Amor, S. Benferhat, K. Mellouli, A new propagation algorithm for min-based possibilistic causal networks, Procs. of ECSQARU'2001, 2001.

2. N. Ben Amor, S. Benferhat, D. Dubois, H. Geffner and H. Prade, Independence in qualitative uncertainty frameworks, Procs. of KR'2000, 2000.

3. S. Benferhat, D. Dubois, L. Garcia and H. Prade, Possibilistic logic bases and possibilistic graphs, Procs. of UAI'99, 1999.

4. C. Borgelt, J. Gebhardt and R. Kruse, Possibilistic graphical models, Procs. of ISSEK'98 (Udine, Italy), 1998.

5. L.M. de Campos and J.F. Huete, Independence concepts in possibility theory, Fuzzy Sets and Systems, 1998.

6. G.F. Cooper, Computational complexity of probabilistic inference using Bayesian belief networks, Artificial Intelligence, 393-405, 1990.

7. D. Dubois and H. Prade, Possibility Theory: An Approach to Computerized Processing of Uncertainty, Plenum Press, New York, 1988.

8. D. Dubois, J. Lang and H. Prade, Possibilistic logic, in Handbook of Logic in Artificial Intelligence and Logic Programming, Oxford University Press, Vol. 3, 439-513, 1994.

9. H. Fargier, Problèmes de satisfaction de contraintes flexibles: application à l'ordonnancement de production, Thèse de l'Université P. Sabatier, Toulouse, France, 1994.

10. P. Fonck, Propagating uncertainty in a directed acyclic graph, IPMU'92, 17-20, 1992.

11. P. Fonck, Conditional independence in possibility theory, Uncertainty in Artificial Intelligence, 221-226, 1994.

12. J. Gebhardt and R. Kruse, Background and perspectives of possibilistic graphical models, Procs. of ECSQARU/FAPR'97, Berlin, 1997.

13. D. Heckerman, A tractable inference algorithm for diagnosing multiple diseases, Procs. of UAI'89, 1989.

14. E. Hisdal, Conditional possibilities independence and non interaction, Fuzzy Sets and Systems, Vol. 1, 1978.

15. F.V. Jensen, An Introduction to Bayesian Networks, UCL Press, 1996.

16. L.A. Zadeh, Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets and Systems, 1, 3-28, 1978.


Macro Analysis of Techniques to Deal with Uncertainty in Information Systems Development: Mapping Representational Framing Influences

Carl Adams1 and David E. Avison 2

1 Department of Information Systems, University of Portsmouth, UK
2 Department SID, ESSEC Business School, Cergy-Pontoise, France

Abstract. Development methods and techniques provide structure, directed tasks and cognitive tools with which to collect, collate, analyze and represent information about system requirements and attributes. These methods and techniques provide support for developers when learning about a system. Each development technique has its own unique set of characteristics distinguishing it from other techniques. Consequently, different development techniques can represent the same set of requirements or a problem situation differently. A new classification of techniques is developed based on representational characteristics. To understand whether these different representations are likely to impact problem and requirement understanding, this paper draws upon the framing effect of prospect theory. The classification is applied to works from the cognitive psychology literature which indicate how specific technique attributes may influence problem understanding. This classification is applied to approximately 100 development techniques.

1 Introduction

Development methods and techniques provide structure, directed tasks and cognitive tools with which to collect, collate, analyze and represent information about system requirements and attributes. These methods and techniques provide support for developers when learning about system requirements and dealing with many of the uncertainties of development. Each development technique has its own unique set of characteristics, which distinguishes it from other techniques. Consequently, different development techniques can represent, or frame, the same set of requirements or a problem situation in a different way. According to the framing effect [46], [47], [48], suggested by prospect theory [28], different representations of essentially the same situation will result in a different preferred 'prospect' or choice: people's understanding of a problem is profoundly influenced by how the problem is presented. This paper aims to map framing influences of techniques used within information systems development.

The focus of this paper is on the effects on problem cognition due to distinct characteristics of development methods, or more typically the component techniques. Drawing on the cognitive psychology literature enables an analysis of how specific characteristics of techniques may influence problem understanding. A new classification is developed based on a 'natural' grouping of representational characteristics [51]. The classification also defines the problem/solution space for different types of techniques. This classification is applied to approximately 100 development techniques (see Appendix).

The structure of the rest of this paper is as follows. First there is a discussion of methodologies and techniques and an examination of the main characteristics of techniques used in information systems development. These characteristics are used to develop a 'natural' classification based on the representational attributes of techniques. The paper then examines the background to prospect theory and the framing effect, and further work on framing influences. These framing influences are applied to the developed classification to indicate how particular aspects of techniques may influence problem cognition.

1.1 Development Methods and Techniques

Wynekoop and Russo [55] assess the use of development methods and conclude that 'there is little empirical insight into why some methodologies might be better than others in certain situations' (p69). Interestingly, Wynekoop and Russo cite several studies indicating that development methods are adapted considerably by organizations and even individual projects. Keyes maintains that there are no methods, just techniques; this is not a common view within the IS literature, but it highlights the prominent role of 'techniques' in development practice. In Wynekoop and Russo's work, development techniques are seen as component parts of methodologies, collected together within a particular philosophical framework. The selection and use of techniques distinguish one development method or approach from another. For instance, in Fitzgerald's [18] postal survey investigating the use of methodologies, 60% of respondents were not using a formalised commercial development methodology and very few (6%) rigorously followed a development methodology. Many of the respondents from Fitzgerald's survey who were using a formal development methodology tended to adapt the methodology to specific development needs. In a later study, Fitzgerald [19] found that considerable tailoring of methodologies was common practice, with the tailoring involving the use of additional techniques or method steps and/or missing out specific techniques and/or method steps. From this discussion, development techniques may be classed as a 'lowest common denominator' between methodologies, and they play an influential role in how an information system is developed.

When examining the use of techniques in information systems development, one is struck by the variety of 'different' techniques available. For general problem and business analysis (an integral part of many information systems development approaches) there is a wealth of available techniques: for instance, Jantsch [27] examined over 100 techniques for general business and technological forecasting; in the Royal Society's work on risk assessment [39] numerous techniques from several business areas are examined; Bicheno [6] examined 50 techniques and business tools focusing on improving quality; Couger [10], Adams [2] and de Bono [12], [13], [14] between them examined many techniques to improve creativity, innovation and lateral thinking in problem solving; and Obolensky [34] examined a range of techniques suitable for business re-engineering. There is also a range of techniques aimed at specific information systems development activities, for instance, techniques to help conduct a feasibility study, analyse requirements, design a system, and develop, test and monitor systems (e.g. [3], [5], [15], [16], [20], [22], [26], [56]). New technologies and applications give rise to new techniques and new tools to support development (e.g. [36]).

Seemingly, therefore, there is an abundance of different techniques available to developers. However, there is much similarity between many of the different techniques. A closer examination of the items listed by Bicheno [6] reveals that they are often heavily based on previous ones, with many newly-claimed techniques being adaptations or compilations of other techniques.

1.2 What Techniques Offer

Given that development techniques play such an influential role in how an information system is developed, it would be useful to consider what is gained from using a development technique. An initial list may include the following:
• Reduces the 'problem' to a manageable set of tasks.
• Provides guidance on addressing the problem.
• Adds structure and order to tasks.
• Provides focus and direction to tasks.
• Provides cognitive tools to address, describe and represent the 'problem'.
• Provides the basis for further analysis or work.
• Provides a communication medium between interested parties.
• Provides an output of the problem-solving activity.
• Provides general support for problem-solving activities.
These items can be considered as aiding developers in understanding the problems and requirements of an information system. This is supported by Wastell [52], [53] who, examining the use of development techniques, identified two concepts which describe this learning support behaviour: 'social defence' against the unknown and 'transitional objects and space'. 'We argue that the operation of these defences can come to paralyse the learning processes that are critical to effective IS development ... These social defences refer to modes of group behaviour that operate primarily to reduce anxiety, rather than reflecting genuine engagement with the task at hand ... (Transitional) spaces have two important aspects: a supportive psychological climate and a supply of appropriate transitional objects (i.e. entities that provide a temporary emotional support)'. [53, p3]

The social defences concept is used to describe how developers follow methods, techniques and other rituals of development as a means to cope with the stresses and uncertainties of the development environment. A negative aspect of these social defences is the potential for the rules of the methods and techniques to become paramount, rather than addressing the 'real' problems of information systems development.

This concept of supporting the learning process is consistent with the findings of Fitzgerald's [19, p342] study, which found that there was considerable tailoring of development methodologies and that tailoring was more likely to be conducted by experienced developers. Inexperienced developers tend to rely more heavily on following a development method or technique rigorously. Inexperienced developers require more guidance and support in the development process and look to the method, or collection of techniques, for that support.

Key elements here are that techniques play an important role within information systems development, influencing how developers learn about and understand the information system requirements, and that there are potential negative influences when developers engage in the rituals of a technique at the expense of problem understanding. The next section examines more closely the characteristics of techniques to identify further possible influences on problem cognition.

1.3 Characteristics of Techniques

By examining a variety of techniques, certain attributes become apparent, for example:
• Visual attributes, e.g. visual representation and structure of technique output.
• Linguistic attributes, e.g. terminology and language used – not just English language, but also others such as mathematical and diagrammatical [2, p103].
• Genealogy attributes, e.g. history of techniques, related techniques.
• Process/procedure attributes, e.g. description and order of tasks.
• People attributes, e.g. roles of people involved in tasks.
• Goal attributes, e.g. aims and focus of techniques.
• Paradigm attributes, e.g. discourse, taken-for-granted elements, cultural elements.
• Biases, e.g. particular emphasis, items to consider, items not considered.
• Technique or application-specific attributes.
Some characteristics of a technique are explicit, for instance where a particular visual representation is prescribed. Other characteristics might be less obvious, such as the underlying paradigm. Many of the characteristics are interwoven; for instance, the visual and linguistic attributes might be closely aligned with the genealogy of a technique. The next section will develop a classification based on the main representational characteristics of techniques. Literature from the cognitive psychology field will be used to examine how specific visual attributes of techniques are likely to affect cognition.

2 Classification of Techniques by Representational Characteristics

This section develops an initial classification of techniques, based on Waddington's [51] 'natural' attributes for grouping items. A similar classification of techniques by natural attributes is described in [1]. Waddington discusses our 'basic', or natural, methods of ordering complex systems, the most basic of which relies on identifying simple relationships, hierarchies, patterns and similarities of characteristics. The natural grouping for development techniques is based on the linguistic attributes (e.g. generic names) of a technique and on the final presentation of a technique (i.e. grouping techniques with similar looking presentations together). The result is six groups: (i) Brainstorming Approaches, (ii) Relationship Approaches, (iii) Scenario Approaches, (iv) Reductionist Approaches, (v) Matrix Approaches and (vi) Conflict Approaches.
• Brainstorming Approaches. This group is defined by the generic name 'brainstorming'. Representations for brainstorming techniques vary, but usually contain lists of items and/or some relationship diagram (e.g. a mind map). Brainstorming is probably the most well known, well used and most modified of the techniques. Brainstorming is often associated with de Bono (e.g. [13], [14]), who covered it as one of a set of lateral thinking techniques, though others seem to have earlier claims, such as [8, p262]. It is a group activity to generate a cross stimulation of ideas.

• Relationship Approaches. This group is defined mainly by the final presentation of the techniques, which is typically based on diagrams representing a defined structure or relationships between component parts. Included in this grouping are Network Diagrams (e.g. [6, p40], [32]) and Cognitive Mapping [17], which some might argue are quite different techniques; however, the final output presentations are topologically very similar. A further characteristic is the use of a diagram to present and model the situation.

• Scenario Approaches. This group is defined by generic linguistic attributes based around scenarios. These techniques involve getting participants to consider different possible futures for a particular area of interest. Representation in these approaches can vary from lists of items to diagrams.

• Reductionist Approaches. This group is defined by the use of generic linguistic attributes (i.e. similar terminology revolving around reducing the 'problem' area into smaller component parts) and visual attributes based on well-defined structures. Once a problem has been 'reduced', the component parts are addressed in turn before scaling back up to the whole problem again.

• Matrix Approaches. This group is defined by the final presentation, that of a matrix or list structure, though often the generic name 'matrix' is also used. Using some form of matrix or list approach for structuring and making decisions is widely known and frequently used (e.g. [27, p211]). A list of factors is compared or analysed against another list of factors.

• Conflict Approaches. This grouping is defined by the generic name 'conflict'. It underlines an approach to view the problem from different and conflicting perspectives.

Each group can be considered in terms of 'social defence' against the unknown [52]. As discussed earlier, 'social defence' in this context represents organisational or individual activities and rituals that are used to deal with anxieties and uncertainties. It is argued that the more quantitatively rigorous and detailed (in depth of study) the technique, the higher the potential for it being a social defence mechanism. Another useful concept for considering techniques is the problem/solution space [38], which can be used to represent the scope of possible solutions offered by a technique. These concepts are developed and applied to the six groups; a summary is presented in Table 1, and problem/solution space diagrams are presented in Figure 1. This initial classification is applied to approximately 100 techniques, the results of which are shown in the Appendix.

Page 296: Soft-Ware 2002: Computing in an Imperfect World: First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8–10, 2002 Proceedings

Macro Analysis of Techniques to Deal with Uncertainty 285

Fig. 1. Problem solution space for each of the natural groups

Though providing a different vista on the classification of techniques, with possible social defence attributes and problem/solution spaces mapped out, this grouping proves too simplistic in that it does not address 'how' techniques in a particular group would affect problem cognition. The next section draws upon psychology literature to inform how distinct representational attributes will affect problem cognition.

Page 297: Soft-Ware 2002: Computing in an Imperfect World: First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8–10, 2002 Proceedings

286 C. Adams and D.E. Avison

Table 1. Characteristics of Natural Grouping for Techniques to Deal with Uncertainty

Group           Quantitatively rigorous /    Potential for          Area of problem/solution
                depth of study               social defence         space covered
Brainstorming   LOW                          LOW                    SCATTERED
Relationship    HIGH                         MEDIUM-HIGH            LOCALISED CLUSTERS
Scenario        MEDIUM                       MEDIUM                 SCATTERED CLUSTERS
Reductionist    VERY HIGH                    HIGH                   LOCALISED
Matrix          HIGH                         MEDIUM-HIGH or HIGH    LOCALISED CLUSTERS
Conflict        MEDIUM-HIGH                  MEDIUM-HIGH            VERY LOCALISED

3 Impact on Problem Understanding: Lessons from Cognitive Psychology

3.1 Prospect Theory and the Framing Effect

For understanding cognitive influences on problem understanding we are initially drawn to prospect theory [28], which was developed as a descriptive model of decision-making under risk. Prospect theory was also presented as a critique of expected utility theory (EUT) [50] and collated the major violations of EUT for choices between risky 'prospects' with a small number of outcomes. The main characteristics of Kahneman and Tversky's [28] original prospect theory are the framing effect, a hypothetical 'S' shaped value function with a corresponding weighting function, and a two-phase process consisting of an editing phase and an evaluation phase. The focus of this paper is on the framing effect, a concept that was described as 'preferences may be altered by different representations of the probabilities' (p273), i.e. different representations of essentially the same situation will result in a different preferred 'prospect' or choice. This was made more explicit as a framing effect in their later work [48]. They, along with others (e.g. [41]), have demonstrated several different types of framing influences. There are some limitations of prospect theory, particularly regarding the implied cognitive processes. The theory seems to be good at describing what decisions people will make and what items may influence those decisions; however, it is lacking in describing how people reach these decisions. (See [11] for a description of some of the weaknesses and alternative cognitive process models.) Another possible limitation centres on the laboratory-based research methods and artificial scenarios used to develop the theory. However, there is considerable support for prospect theory, and its cornerstone, the framing effect, is robust and likely to represent some key influences on decision-making [41]. The next section explores some key areas of framing influences relevant to information systems development techniques.

Page 298: Soft-Ware 2002: Computing in an Imperfect World: First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8–10, 2002 Proceedings

Macro Analysis of Techniques to Deal with Uncertainty 287

3.2 Visual Influences: Gestalt Psychologists

One of the earliest and most influential movements of cognitive psychology was that of the Gestalt psychologists, initiated by Max Wertheimer, Wolfgang Kohler and Kurt Koffka [23], [25], [54]. 'In Gestalt theory, problem representation rests at the heart of problem solving – the way you look at the problem can affect the way you solve the problem. ... The Gestalt approach to problem solving has fostered numerous attempts to improve creative problem solving by helping people represent problems in useful ways.' [31, p68]

The key element here is that the way in which a problem is represented will affect the understanding of the problem, which is consistent with prospect theory. Relating this to techniques, one can deduce that the visual, linguistic and other representations imposed by a technique will impact on problem cognition.

The Gestalt movement in cognitive psychology has a (comparatively) long history and has had a big impact on the understanding of problem solving. The movement has spawned various strands of techniques such as lateral thinking and some creative techniques. Gillam [23] gives a more current examination of Gestalt theorists and works, particularly in the area of perceptual grouping (i.e. how people understand and group items). Gillam shows that perceptual coherence (i.e. grouping) is not the outcome of a single process (as originally proposed by Gestalt theory) but may be best regarded as a domain of perception (i.e. the grouping process is likely to be more complex, influenced by context and other aspects) (ibid p161).

The Gestalt psychologists indicate a potentially strong influence on problem understanding, that of functional fixedness: 'prior experience can have negative effects in certain new problem-solving situations ... the idea that the reproductive application of past habits inhibits problem solving' [31]. The implication is that habits 'learnt' using previous techniques and problems would bias the application of new techniques and problems. This could be particularly relevant given the glut of 'new' techniques and may explain why many techniques are in reality rehashes of older techniques.

Another major way in which a technique can influence cognition can be deduced from support theory, which indicates that support for an option will increase the more the option is broken down into smaller component parts, with each part being considered separately. The more specific the description of an event, the more likely the event will seem. The implication is that the more a technique breaks a situation down into component parts or alternatives, the more apparent the situation will become.

3.3 Structure Influences

A prescriptive structure is also likely to exert influence on problem cognition. For instance, hierarchy and tree structures are likely to exert some influence on problem cognition in binding attributes together (e.g. on the same part of a tree structure) and limiting items to the confines of the imposed structure. In cognitive psychology this is known as category inclusion. 'One enduring principle of rational inference is category inclusion: categories inherit the properties of their superordinates' [42]. The implication is that techniques dictating hierarchical structures will force a (self-perpetuating) category inclusion bias. An element in one branch of a hierarchical structure will automatically have different properties to an element in another branch of the hierarchical structure. For instance, take a functional breakdown of an organization (such as that described in [56]). One might conclude from category inclusion that a task in an accounting department will always be different to a task in a personnel department, which clearly may not be the case, as both departments will have some similar tasks, such as ordering the stationery.

However, this category inclusion is not universally the case. Sloman [42] found that the process is likely to be more complex. In his study, participants frequently did not apply the category inclusion principle: 'instead, judgments tended to be proportional to the similarity between premise and conclusion', and he concluded that 'arbitrary hierarchies can always be constructed to suit a particular purpose. But those hierarchies are apparently less central to human inference than logic suggests' (p31). The initial premise surrounding a situation is likely to be related to the underlying paradigm. Dictating a hierarchical structure in itself may not result in category inclusion biases. However, coupled with an underlying paradigm of closed hierarchical properties, it is more likely to result in category inclusion biases. Along the same theme are proximity influences and biases: the understanding of items can be influenced by the characteristics of other items represented in close proximity.

3.4 Order and Discourse Influences

Perceptual processing is profoundly influenced by the order of the information presented and by the relational constructs of the information [33]. The order and number of items in a list will influence how people understand (and recall) items and how people categorize them. The implications are that the language and order used to describe a problem situation, the questions asked and how they are asked, and the implied relationships (all of which are usually prescribed by a technique) will bias problem understanding, e.g. by forcing 'leading questions' or 'leading processes'.

The discourse and language used to describe a problem is likely to play a role in problem understanding. Adams [2] discusses various different types of 'languages of thought' used in representing and solving problems. People can view problems using mathematical symbols and notation, drawings, charts, pictures and a variety of natural verbal language constructs such as analogies and scenarios. Further, people switch consciously and unconsciously between different modes of thought using the different languages of thought (p72). The information systems development environment is awash with technical jargon and language constructs. In addition, different application areas have their own set of jargon and specific language. Individual techniques have their own peculiar discourse consisting of particular language, jargon and taken-for-granted constructs, all of which may exert influence. For instance, the initial discourse used affects understanding of a problem situation, particularly in resolving ambiguities [30], by setting the context with which to consider the situation. Resolving ambiguous requirements is a common task in information systems development [21]. Effectively, techniques have the potential for leading questions and processes. In addition, the cognitive psychology literature indicates that there will be a different weight attached to normative than to descriptive representations and results of techniques. The basis for this is the 'understanding/acceptance principle' [43], which states that 'the deeper the understanding of a normative principle, the greater the tendency to respond in accordance with it' [44, p349].


Language aspects highlight another set of possible influences, that of communication between different groups of people (such as between analysts and users). Differences of perspective between different groups of people in the development process have been discussed within the IS field under the heading of 'softer' aspects or as organizational or people issues (e.g. [7], [21], [29], [40]). Identifying differences and inconsistencies can be classed as a useful task in identifying and dealing with requirements [21]. From cognitive psychology there are also other considerations. Teigen's [45] work on the language of uncertainty shows that there is often more than the literal meaning implied in the use of a term, such as contextual and relational information or some underlying 'other' message. The use of language is very complex. The implication is that even if a technique prescribes a set of 'unambiguous' language and constructs, there may well be considerable ambiguity when it is used.

3.5 Preference Influences

There are also likely to be individual preferences, and corresponding biases, for some techniques or specific tasks within techniques, as Puccio [37, p171] relates: 'The creative problem solving process involves a series of distinct mental operations (i.e. collecting information, defining problems, generating ideas, developing solutions, and taking action) and people will express different degrees of preference for these various operations'. Couger [9, p5] has noted similar preferences: 'It is not surprising that technical people are predisposed towards the use of analytical techniques and behaviorally orientated people towards the intuitive techniques'. In addition, there may be some biases between group and individual tasks, a point taken up by Poole [35], who notes that group interaction on such tasks is likely to be complex with many influences. The theme was also taken up by Kerr et al. [57], who investigated whether individual activities are better than group activities (i.e. have fewer errors or less bias), but their findings were inconclusive: 'the relative magnitude of individual and group bias depends upon several factors, including group size, initial individual judgement, the magnitude of bias among individuals, the type of bias, and most of all, the group-judgment process ... It is concluded that there can be no simple answer to the question "which are more biased, individuals or groups?"' [57, p687]. To address the potential individual/group biases, many authors suggesting techniques recommend some consideration of the make-up of the different groups using them (e.g. [6], [9]), though they give limited practical guidance on doing so.

3.6 Goal Influences

Goal or aim aspects also profoundly influence problem understanding by providing direction and focus for knowledge compilation [3]. Goals influence the strategies people undertake to acquire information and solve problems. Further, when there is a lack of clear goals, people are likely to take support from a particular learning strategy, which will typically be prescribed by the technique:

'The role of general methods in learning varies with both the specificity of the problem solver's goal and the systematicity of the strategies used for testing hypotheses about rules. In the absence of a specific goal people are more likely to use a rule-induced learning strategy, whereas provision of a specific goal fosters use of difference reduction, which tends to be a non-rule-induction strategy' [49].

The implications are that techniques with clear task goals will impact the focus and form of information collection (e.g. what information is required and where it comes from, along with what information is not deemed relevant) and how the information is to be processed. Further, if there are no clear goals then people are likely to rely more heavily on the learning method prescribed by the technique. Technique attributes are likely to dictate the representations used.

3.7 Potential Blocks to Problem Cognition

In addition to specific representational attributes, framing can be considered in terms of providing conceptual blocking. From creative, innovative and lateral thinking perspectives, Groth and Peters [24, p183] examined barriers to creative problem solving amongst managers. They identified a long list of perceived barriers to creativity including: fear of failure, lack of confidence, environmental factors, fear of success and its consequences, fear of challenge, routines, habits, paradigms, pre-conceived notions, rules, standards, tunnel sight, internal barriers, structure, socialization, external barriers, money, rebellion, health and energy, mood, attitudes, desire, time. They grouped the perceived barriers into 'self imposed', 'professional environment' and 'environmentally imposed' categories. Fear of some sort seems to be the predominant barrier, at least for these managers.

For more general barriers, Adams [2] identifies four main areas of conceptual blocks; these are represented in Table 2. These 'blocks' indicate that techniques could have a variety of adverse influences on problem cognition, including 'blinkered' perception from a particular perspective, lack of emotional support as a transitional object, providing a flawed approach and logic, and not providing appropriate cognitive tools [2].

3.8 Summary of Representational Framing Influences

The framing influences discussed above indicate that any framing effect due to the characteristics of a technique is likely to be complex and interwoven. However, there are some main themes that emerge. The visual, structure and linguistic aspects can be combined under a general 'representational' heading. Arguably, the more prescribed and structured a technique is, the more likely it is that 'predictable' framing influences can be ascribed. Overall, the works from the cognitive psychology field give several indications of how the characteristics of a technique are able to exert some influence on problem cognition.


Table 2. Four main areas of conceptual blocks.

Perceptual Blocks
• Seeing what you expect to see – stereotyping
• Difficulty in isolating the problem
• Tendency to delimit the problem area too closely (i.e. imposing too many constraints on the problem)
• Inability to see the problem from various viewpoints
• Saturation (e.g. disregarding seemingly unimportant or less 'visible' aspects)
• Failure to utilize all sensory inputs

Cultural and Environmental Blocks
Cultural blocks could include:
• Taboos
• Seeing fantasy and reflection as a waste of time
• Seeing reasons, logic, numbers, utility, practicality as good; and feeling, intuition, qualitative judgments as bad
• Regarding tradition as preferable to change
Environmental blocks could include:
• Lack of cooperation and trust among colleagues
• Having an autocratic boss
• Distractions

Emotional Blocks
• Fear of taking risks
• No appetite for chaos
• Judging rather than generating ideas
• Inability to incubate ideas
• Lack of challenge and excessive zeal
• Lack of imagination

Intellectual and Expressive Blocks
• Use of appropriate cognitive tools and problem solving language

4 Applying Framing Influences

As the previous discussion shows, the literature from cognitive psychology indicates that framing influences are likely to be complex and involved. Attributing the likelihood of framing influences to techniques is likely to be somewhat subjective. In addition, the earlier discussion on the use and adaptation of methodologies and techniques indicates that there is likely to be considerable variation in applying a technique. However, combining the identified framing influences with the main representational attributes of techniques enables some likely framing effects to be identified. These are summarised in Table 3. In addition, there are likely to be further influences, such as individual biases towards different types of techniques (or tasks within them), negative versus positive framing, and a range of perceptual blocks.

4.1 Summary

This paper has contended that techniques influence problem understanding during information systems development. The influences can be considered under certain representational attributes, and the cognitive psychology literature indicates how these attributes are likely to affect problem understanding. In prospect theory this is known as the framing effect. By classifying the characteristics of techniques, this paper has tried to indicate how different types of technique are likely to influence problem cognition, and in doing so has tried to map the framing effect of techniques.


Some potential biases and blocks to cognition were identified. These biases become more prominent when one considers that the results of a technique (i.e. diagrams, tables etc.) may be used by different groups of people to those that produced them (e.g. analysts may produce some charts and tables which will be used by designers), and this is likely to perpetuate such biases throughout the development process.

Table 3. Potential for framing influences applied to natural classification of techniques

                 Potential framing influences
Group           Structure influences          Order        Discourse    Prescribed goal influences   Normative/
                (e.g. functional fixedness)   influences   influences   (e.g. rule-induced learning) Analytical biases
Brainstorming   Low                           Low-Medium   Medium       Low                          Low
Relationship    High                          High         Med-High     Medium                       Med-High
Scenario        Low-Medium                    Medium       Medium       Medium-High                  Low-Medium
Reductionist    High                          High         High         High                         High
Matrix          High                          High         Low-Medium   Low                          Medium
Conflict        Low-Medium                    Low          High         High                         Medium

References

1. Adams C. (1996) Techniques to deal with uncertainty in information systems development, 6th Annual Conference of Business Information Systems (BIT'96), Manchester Metropolitan University.
2. Adams J. (1987) Conceptual blockbusting, a guide to better ideas. Penguin, Harmondsworth, Middlesex.
3. Anderson J.R. (1987) Skill acquisition: compilation of weak-method problem solutions. Psychological Review, 94, pp192-210.
4. Anderson R.G. (1974) Data processing and management information systems. MacDonald and Evans, London.
5. Avison D.E. and Fitzgerald G. (1995) Information systems development: methodologies, techniques and tools, 2nd ed., McGraw-Hill, Maidenhead.
6. Bicheno J. (1994) The quality 50, a guide to gurus, tools, wastes, techniques and systems. PICSIE, Buckingham.
7. Checkland P. (1981) Systems thinking, systems practice. Wiley, Chichester.
8. Clark C. (1958) Brainstorming - the dynamic new way to create successful ideas. Doubleday, Garden City, NY.
9. Couger D., Higgins L. and McIntyre S. (1993) (Un)Structured creativity in information systems organisations, MIS Quarterly, December, pp375-397.



10. Couger D. (1995) Creative problem solving and opportunity. Boyd and Fraser, Massachusetts.
11. Crozier R. and Ranyard R. (1999) Cognitive process models and explanations of decision-making. In: Decision-making cognitive models, Ranyard R., Crozier R. and Svenson O. (eds), Routledge, London.
12. de Bono E. (1969) The mechanism of mind. Penguin, Harmondsworth, Middlesex.
13. de Bono E. (1970) The use of lateral thinking. Penguin, Harmondsworth, Middlesex.
14. de Bono E. (1977) Lateral thinking: a textbook of creativity. Penguin, Harmondsworth, Middlesex.
15. de Marco T. (1979) Structured analysis and systems specification, Prentice Hall, Englewood Cliffs, NJ.
16. Downs E., Clare P. and Coe I. (1988) Structured Systems Analysis and Design Method, Prentice Hall, London.
17. Eden C. (1992) Using cognitive mapping for strategic options development and analysis (SODA). In: Rosenhead J. (ed) (1992) Rational analysis for a problematic world, problem structuring methods for complexity, uncertainty and conflict. Wiley, Chichester.
18. Fitzgerald B. (1996) An investigation of the use of systems development methodologies in practice. In: Coelho J. et al. (eds) Proceedings of the 4th ECIS, Lisbon, pp143-162.
19. Fitzgerald B. (1997) The nature of usage of systems development methodologies in practice. In: Avison D.E. (ed), Key Issues in Information Systems, McGraw-Hill, Maidenhead.
20. Flynn D.J. (1992) Information systems requirements: determination and analysis. McGraw-Hill, London.
21. Gabbay D. and Hunter A. (1991) Making inconsistency respectable: a logical framework for inconsistency reasoning, Lecture Notes in Artificial Intelligence, 535, Imperial College, London, pp19-32.
22. Gane C. and Sarson T. (1979) Structured systems analysis, Prentice Hall, Englewood Cliffs, NJ.
23. Gillam B. (1992) The status of perceptual grouping: 70 years after Wertheimer, Australian Journal of Psychology, 44, 3, pp157-162.
24. Groth J. and Peters J. (1999) What blocks creativity? A managerial perspective. 8, 3, pp179-187.
25. Honderich T. (ed) (1995) The Oxford companion to philosophy, OUP, Oxford.
26. Jackson M.A. (1983) Systems development, Prentice-Hall, Englewood Cliffs, NJ.
27. Jantsch E. (1967) Technological forecasting in perspective. A report for the Organisation for Economic Co-operation and Development (OECD).
28. Kahneman D. and Tversky A. (1979) Prospect theory: an analysis of decision under risk. Econometrica, 47, pp263-291.
29. Lederer A. and Nath R. (1991) Managing organizational issues in information system development, Journal of Systems Management, 42, 11, pp23-39.
30. Martin C., Vu H., Kellas G. and Metcalf K. (1999) Strength of discourse context as a determinant of the subordinate bias effect. The Quarterly Journal of Experimental Psychology, 52A, 4, pp813-839.
31. Mayer R.E. (1996) Thinking, problem solving, cognition, 2nd ed. Freeman, NY.
32. Mizuno S. (ed) (1988) Management for quality improvement: the 7 new QC tools. Productivity Press.
33. Mulligan N.W. (1999) The effects of perceptual inference at encoding on organization and order: investigating the roles of item-specific and relational information. Journal of Experimental Psychology, 25, 1, pp54-69.
34. Obolensky N. (1995) Practical business re-engineering; tools and techniques for achieving effective change. Kogan Page, London.
35. Poole M.S. (1990) Do we have any theories of group communication? Communication Studies, 41, 3, pp237-247.


36. Proctor T. (1995) Computer produced mind-maps, rich pictures and charts as aids to creativity. Creativity and Innovation Management, 4, pp43-50.
37. Puccio G. (1999) Creative problem solving preferences: their identification and implications, Creativity and Innovation Management, 8, 3, pp171-178.
38. Rosenhead J. (ed) (1992) Rational analysis for a problematic world, problem structuring methods for complexity, uncertainty and conflict. Wiley, Chichester.
39. Royal Society (1992) Risk analysis perception and management, Royal Society, London.
40. Sauer C. (1993) Why information systems fail: a case study approach, Alfred Waller, Henley.
41. Schneider S.L. (1992) Framing and conflict: aspiration level contingency, the status quo and current theories of risky choice. Journal of Experimental Psychology: Learning, Memory and Cognition, 18, pp104-57.
42. Sloman S.A. (1998) Categorical inference is not a tree: the myth of inheritance hierarchies, Cognitive Psychology, 35, pp1-33.
43. Slovic P. and Tversky A. (1974) Who accepts Savage's axiom? Behaviour Science, 19, pp368-373.
44. Stanovich K.E. and West R.F. (1999) Discrepancies between normative and descriptive models of decision-making and the understanding/acceptance principle, Cognitive Psychology, 38, pp349-385.
45. Teigen K.H. (1988) The language of uncertainty. Acta Psychologica, 68, pp27-38.
46. Tversky A. and Kahneman D. (1973) Availability: a heuristic for judging frequency and probability. Cognitive Psychology, 5, pp207-232.
47. Tversky A. and Kahneman D. (1974) Judgement under uncertainty: heuristics and biases. Science, 185, pp1124-1131.
48. Tversky A. and Kahneman D. (1981) The framing of decisions and the psychology of choice. Science, 211, pp453-458.
49. Vollmeyer R., Burns B.D. and Holyoak K.J. (1996) The impact of goal specificity on strategy use and the acquisition of problem structure. Cognitive Science, 20, pp75-100.
50. Von Neumann J. and Morgenstern O. (1944) Theory of games and economic behaviour. Princeton University Press, Princeton.
51. Waddington C.H. (1977) Tools for thought. Paladin Frogmore, St Albans.
52. Wastell D. (1996) The fetish of technique: methodology as a social defence, Information Systems Journal, 6, 1, pp25-40.
53. Wastell D. (1999) Learning dysfunctions in information systems development: overcoming the social defences with transitional objects, MIS Quarterly.
54. Wertheimer M. (1923) Untersuchungen zur Lehre von der Gestalt. Psychologische Forschung, 4, pp301-350.
55. Wynekoop J.L. and Russo N.L. (1995) Systems development methodologies. Journal of Information Technology, Summer, pp65-73.
56. Yourdon E. and Constantine (1979) Structured design: fundamentals of a discipline of computer program and systems design, Prentice-Hall, Englewood Cliffs, NJ.

Appendix: Developed Classification of Techniques

The classification has six groups:
1. Brainstorming Approaches
2. Relationship Approaches
3. Scenario Approaches
4. Reductionist Approaches
5. Matrix Approaches
6. Conflict Approaches


The classification has been applied to approximately 100 techniques, the results of which are represented in the following table.

Technique | Description | Classification group(s) (1–6, marked *)

Affinity Diagram | This is a brainstorming technique aimed at aiding idea generation and grouping. It seems to be particularly good at identifying commonalities in thinking within the group. Relies heavily on a facilitator to run the session. | *

Analytic Hierarchy Process | Saaty's AHP uses a (3 level) hierarchy to represent the relationships. Used in analysis of spare parts for manufacturing. | *

Attribute Association | Works from the premise that all ideas originate from previous ideas (i.e. they are just modified ideas). Based on lists of characteristics or attributes of a problem or product. Each characteristic is changed and the result discussed. (A cross between brainstorming and matrix?) | *

Association / Images technique | Tries to link and find associations between processes (& items). | *

Boundary Examination | Defining and stating assumptions about the problem boundary. | *

Brainstorming | Aimed at idea generation. See also lateral thinking. | *

Brainwriting - Shared Enhancements Variation | Similar to brainstorming, but gets participants to record ideas themselves. | *

Bug List | Gets participants to list items that 'bug' them about the system. Aims to get a consensus on what the problem areas are. | * *

Cognitive Mapping | Develops a model of inter-relationships between different features. | *

Common Cause Failures (CCFs) | More an engineering tool to identify common causes for possible failures. | *

Critical Path Analysis (CPA), Critical Path Method (CPM) | See network techniques. | *

Critical Success Factors (CSF) | Looks at the critical factors which will influence the success of an IS or, from a strategic view, all the organisation's IS. It is a matrix type technique with the characteristics of the technique down one axis and the factors on the other axis. Note: this looks like it may also be appropriate to examine Critical Failure Factors. | *

Cross-Impact Matrices | See Matrix techniques. | *

Decision Matrices | See Matrix techniques. | *

Decision Trees | See tree techniques. | *

Decomposable Matrices | The components of each sub-system are listed and arranged within a matrix and the interactions between elements are weighted. Relationships between components can then be focused on. [A cross between matrix and relationship.] | *

Page 307: Soft-Ware 2002: Computing in an Imperfect World: First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8–10, 2002 Proceedings

296 C. Adams and D.E. Avison

Classification groupsTechnique Description

1 2 3 4 5 6Delphi Aims to get a consensus view, or long term forecast,

from a group of experts by iteratively polling them.Developed by Helmer & Dalkey at the RANDcorporation in the 1960s.

*

DimensionalAnalysis

Aims to explore and clarify the dimensions and limitsof a problem /situation. It examines five elements of aproblem: substantive, spacial, temporal, qualitativeand quantitative dimensions.

*

ExternalDependencies

A summary list of external items that affect theproject. Oblensky states, these "need not be plannedof detailed. However, they do need to be summarisedto remind the project team that there are activitiesoutside of the project which they need to be awareof".

*

Fagan Reviews Effectively just getting a group of peers to criticallyreview an analysis, design or code module.

*

Failure Modes &Effect Analysis(FMEA)

Examines the various ways a product, or system, canfail and analyses what the effect of each fail modewould be.

*

Fault Tree Analysis A tree approach to relating potential fault causes. *Five ’Cs’ and ’Ps’ Checklists of thinks to consider. (The Cs are Context,

Customers, Company, Competition & Costs; The Psare Product, Place, Price, Promotion, People).

*

Five Whys Invented by Toyota, it is basically developing aquestioning attitude, to probe behind the initial givenanswers. There is also a ’Five Hows’ along the sameprinciples. These are very similar to the earlier lateralthinking ’Challenging Assumptions’ and the examinestage of a Method Study.

*

Five Ws and the H Who-what-where-when-why and how. Brainstormingtechniques answering these questions.

*

Force FieldAnalysis

Idea generation and list technique to identify ’forces’pulling or pushing towards an ideal situation.

* *

Future Analysis A technique specifically aimed at IS development, itexamines possible future scenarios which an ISwould have to operate in.

*

Gaming, GameTheory

Several gaming techniques to deal with competitiveor conflict situation.See also Metagames and Hyper games.

*

Hazard andOperability Studies(HAZOP)

A systematic technique to assess the potential hazardsof a project, system or process. Usually associatedwith the chemical industry.

*

Hazards Analysisand Critical ControlPoints (HACCP)

Identifies critical points in the work processing whichneed controls or special attention. Usually associatedwith production, particularly food production.

*

Hypergames A variation on game theory which develops a ’game’from the prospective of the different stakeholders.

*

Influence Diagrams,InterrelationshipDiagrams

Similar to Cognitive mapping, it generates logicalrelationships between events or activities.

*

’Johari’ window ofknowledge

The technique named after inventors (Joe Luff &Harry Ingham), tries to identify areas ofunderstanding and lack of understanding.

*

Lateral ThinkingTechniques:-Generation of

Several techniques, including:- The Generation ofAlternatives, Challenging assumptions, Suspendedjudgement, Dominant ideas and crucial factors,

* *

Page 308: Soft-Ware 2002: Computing in an Imperfect World: First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8–10, 2002 Proceedings

Macro Analysis of Techniques to Deal with Uncertainty 297

Classification groupsTechnique Description

1 2 3 4 5 6Alternatives,Challengingassumptions,Suspendedjudgement,Dominant ideas andcrucial factors,Fractionation, Thereversal method,Brainstorming,Analogies and,Randomstimulation.

Fractionation, The reversal method, Brainstorming,Analogies and, Random stimulation. Arguably thesetypes of techniques would be suited to early analysisand problem identification. Equally, some of thetechniques could be used in the later stages ofsystems development. For instance, Fractionation andChallenging assumptions could be used in a designsituation.Many of these lateral thinking techniques have beenmodified and combined to make ’new’ techniques.

MaintainabilityAnalysis

Examines the component parts of a system andanalyses them, in probability terms, for easy ofmaintenance. Usually associated with engineeringproduct design.

*

Markov Chains,Markov Analysis

Uses probability to model different states within asystem.

*

Matrix Techniques,Matrix Analysis

There are several ’matrix’ techniques which aim torepresent and compare requirements or feature in amatrix format. Some techniques weighting or rankingof the requirements or features.

*

McKinsey 7 SFramework

A diagnostic tool to identify the interactions withinan organisation.

*

Metagames A variation of game theory which attempts to analysethe processes of cooperation and conflict betweendifferent ’actors’.

*

MorphologicalApproaches

This takes a systematic approach to examiningsolutions to a problem. It does this by identifying theimportant problem characteristics and looks at thesolutions for each of those characteristics. Firstdeveloped by Zwicky, a Swiss astronomer, in 1942.

*

NetworkTechniques

There are several diagramming techniques that can beclassed at network techniques. Some, like CPM andPERT are very quantitative relying heavily onnumbers. Others like Interrelationship diagrams relymore on subjective logical relationships orconnections.

*

Nominal GroupTechnique (NGT)

Similar to Brainstorming. *

Opposition-SupportMap

A representation of opposition and support forparticular actions.

*

Options Matrix See matrix techniques. *Planning AssistanceThrough TechnicalEvaluation ofRelevance Numbers(PATTERN)

Developed by Honeywell, it is the first large scaleapplication of Relevance Trees to numerical analysis,and makes use of computing support.

*

PrecedenceDiagrammingMethod (PDM)network

Similar to PERT, but has 4 relationships (FS finish -start, SS start - start, FF finish - finish, SF start –finish)

*

Preliminary HazardAnalysis (PHA)

*

Program Evaluation A networking technique similar to Critical Path *

Page 309: Soft-Ware 2002: Computing in an Imperfect World: First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8–10, 2002 Proceedings

298 C. Adams and D.E. Avison

Classification groupsTechnique Description

1 2 3 4 5 6and ReviewTechnique (PERT)

Analysis, but addresses uncertainty in calculating thetask times.

Rapid Ranking Techniques aims to list and rank the important issuesto a problem.

*

RBO - RationalBargainingOverlaps

Technique used in negotiating situations. *

Relevance Trees,Reliance Trees

Relevance Trees (or Reliance Trees) (sometimesreferred to as hierarchical models or systems,probably first proposed by Churchman, Ackoff &Arnoff (1957).

ReliabilityNetworks

These are representations of the reliabilitydependencies between components of a system.Similar to CPA/PERT type networks, but represent’dependencies’ rather than order of events. Once thenetworks are drawn then estimates for failure rates ofeach component can be evaluated.Similar to Relevance/Reliance Trees.

*

Requirements,Needs and Priorities(RNP)

Based on lists and matrices, aims to understand theimpact of an application on the organisation prior todevelopment. Top management play a key role.

*

Risk Assessment/Engineering /Management

Attempts to identify and, where possible, quantify therisks in a project. Usually associated with large scaleengineering projects but principles can be appropriateto smaller scale situations.

*

RobustnessAnalysis

The aim is to "keep the options open". It does thisidentifying and analysing a range of scenarios andexamining actions are most ’robust’ in thosescenarios.

*

Scenario Writing /Analysis

Scenario planning gets participants to considerdifferent possible futures for a particular area ofinterest.

Shareholder ValueAnalysis (SVA)

Tries to identify the key values and needs of theshareholders and how those needs are currently beingmet.

*

StakeholderAnalysis

Tries to understand the needs of the key stakeholdersand how those needs are currently being met. It iseffectively a range of techniques where differenttechniques are used for different stakeholder groups,eg. use VCA for analysing supplier stakeholder groupand SVA for analysing the shareholder stakeholders.

*

Simulation The features and workings of a complex situation aresimulated. The model can then be changed (either theinputs or workings of the model) to observe what willhappen. Good for developing a deeper understandingof the problem area.

*

Strategic Choice Aims to deal with the interconnections ofdecisions/problems. Focuses attention on alternativeways of managing uncertainty

*

Strategic OptionsDevelopment andAnalysis (SODA)

Though it has a ’strategic’ title it is aimed at gettingconsensus actions in messy situation.

*

Soft SystemsMethods (SSM)

A well known method aimed at problemidentification and representing views of a problemform different stakeholders perspectives - a themecommon in many of the subjective techniques.

*

Page 310: Soft-Ware 2002: Computing in an Imperfect World: First International Conference, Soft-Ware 2002 Belfast, Northern Ireland, April 8–10, 2002 Proceedings

Macro Analysis of Techniques to Deal with Uncertainty 299

Classification groupsTechnique Description

1 2 3 4 5 6SIL -suggestedintegration ofproblem elements

A German-developed brainstorming technique, getsparticipants to write down ideas, then pairs of ideasare compared to integrate and interrogate the ideas.

*

SynergisticContingencyEvaluation andReview Technique(SCERT)

Risk assessment technique used in oil processinginstallations, power plants and large engineeringprojects.

*

Systems FailureMethod (SFM)

Looks at three level of influence: organisation, teamand individual. Examines potential failure from thesethree levels.

* *

SWAT analysis(Strengths,Weaknesses,Opportunities &Threats)

Generates perceptions of how customers (or others)view the organisation (or problem situation).

*

Tree Analysis See decision trees. *Value Engineering/Management

Similar to Value Chain Analysis. *

Value ChainAnalysis

Analyses the supply chain within an organisation andtries to identify (and usually quantify) when extra’value’ is added to a product or service.

* *

Wildest Idea Tries to get people to come up with a wild idea toaddress a problem. With this as a starting point thegroup continue to generate ideas.

*
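To make one of the more quantitative entries above concrete, the short sketch below works through the standard PERT three-point estimate mentioned under the network techniques: expected time (O + 4M + P)/6 and standard deviation (P - O)/6 for optimistic, most likely and pessimistic estimates. The task names and figures are hypothetical, and the code is only an illustration, not part of the classification.

# Illustrative sketch: PERT three-point estimates for hypothetical tasks.
# Expected time E = (O + 4M + P) / 6, standard deviation = (P - O) / 6,
# where O, M and P are optimistic, most likely and pessimistic estimates in days.

def pert_estimate(optimistic: float, most_likely: float, pessimistic: float):
    """Return (expected duration, standard deviation) under the PERT assumptions."""
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    std_dev = (pessimistic - optimistic) / 6
    return expected, std_dev

# Invented task estimates (O, M, P) in days, for illustration only.
tasks = {
    "gather requirements": (2.0, 4.0, 9.0),
    "design database": (3.0, 5.0, 8.0),
    "build prototype": (5.0, 8.0, 15.0),
}

total_expected = 0.0
for name, (o, m, p) in tasks.items():
    expected, std_dev = pert_estimate(o, m, p)
    total_expected += expected
    print(f"{name}: expected {expected:.1f} days (sigma {std_dev:.1f})")

print(f"Expected duration of the whole chain: {total_expected:.1f} days")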


The Role of Emotion, Values, and Beliefs in the Construction of Innovative Work Realities

Isabel Ramos1, Daniel M. Berry2, and Joao A. Carvalho3

1 Escola Superior de Tecnologia e Gestao de Viana do Castelo, Viana do Castelo, [email protected]

2 Department of Computer Science, University of Waterloo, Waterloo, ON, [email protected]

3 Departamento de Informatica, Universidade do Minho, Guimaraes, [email protected]

Abstract. Traditional approaches to requirements elicitation stress systematic and rational analysis and representation of organizational context and system requirements. This paper argues that (1) for an organization, a software system implements a shared vision of a future work reality and that (2) understanding the emotions, feelings, values, beliefs, and interests that drive organizational human action is needed in order to invent the requirements of such a software system. This paper debunks some myths about how organizations transform themselves through the adoption of Information and Communication Technology; describes the concepts of emotion, feeling, value, and belief; and presents some constructionist guidelines for the process of eliciting requirements for a software system that helps an organization to fundamentally change its work patterns.

1 Introduction

Before the 90s, software systems were used mainly for automating existing tasks or for collecting or delivering information. With the rapid development of Information and Communication Technology (ICT), software systems became a driver for innovative work practices and for new models of management and organization [12].

Terms like "globalization", "knowledge management", "organizational learning", "collaborative work", "value creation", "extended enterprise", "client relationship management", and "enterprise resource planning", among others, are creating a new understanding of human action in organizations. We are learning that action is enabled, empowered, or extended by ICT. As a consequence, individuals and organizations can now be more creative, flexible, and adaptive. We have more complex and volatile environments and organizations. Change is presented as inevitable. Holistic approaches to change management and the development of software systems are seen as imperative in order to cope with organizational complexity.

Nearly everyone seems to accept that environments in which change emerges or is induced are incredibly complex. Thus, the software systems that are supposed to help organizations adapt or transform themselves are inherently complex. Their essence, their requirements, defy rapid or systematic understanding. Yet, the development of these software systems is still expected to occur in a more orderly systematic fashion, at a well defined point in time, and to be as cheap and quick as possible.

The goals that drive the process are often economic and structural. These goals include the improvement of organizational efficiency and effectiveness, the reduction of costs, and the improvement of individual or group performance. The lofty goals notwithstanding, it is very difficult to get these software systems to be used successfully and effectively [27], [18]. People in organizations resist the changes. They resist using the systems, misuse them, or reject them. As a result, the goals are not achieved, intended changes are poorly implemented, and development budgets and schedules are not respected. Misplaced emotions, values, and beliefs are often offered as the causes of these problems.

Accordingly, this paper

– debunks some myths about how organizations transform themselves through the adoption of ICT applications;

– describes the concepts of emotion, value, and belief and how they affect development and acceptance of software systems; and

– presents some constructionist guidelines for the process of eliciting requirements for software systems that help organizations to fundamentally change their work patterns.

2 Organizational Transformation Supported by the Adoption of Innovative Software Systems

Organizational transformation (OT) is the process of fundamentally changing an organization's processes in order to allow it to better meet new challenges. It is often accompanied by the introduction of new software systems that make the new process possible. OT in an organization is often prompted when it begins to consider how it might automate its process. The organization realizes that just automating current processes is a waste of computing resources. The current processes were designed over the years to allow the organization to function in an unautomated, paper-driven environment. Data on paper are often accurate only to the day or longer. Automating current processes maintains these manual, paper-driven processes, when a computer and its software have the potential of providing a highly dynamic, automated, paper-free process with information accurate to the second rather than to the day or longer.

OT is connected not only to automating an organization's processes. Even an organization with fully automated processes may engage in OT. Other triggers for OT include implementing a new management or business model, adopting a new best practice, desiring to satisfy clients better, creating a new internal or external image, promoting a new social order, obeying environmental rules, fostering collaborative practices, etc. Sometimes, OT is triggered as a consequence of internal political fights.

We must explain the use, in this paper, of the word "transformation" instead of "change". Both words mean change, in the general sense, but the technical term "transformation" means radical change and the technical term "change" means evolutionary change. Evolutionary change refers to efficiency improvements, local quality improvements, change in procedures, and all kinds of localized change that have minor impact on the overall organization. Transformation implies fundamental changes of meanings and practices relevant to individual workers, groups, or the organization. Transformation means a change of identity. In whatever social order it occurs, it will have a big internal and environmental impact on the organization.

2.1 Rhetoric and Myths about Organizational Transformation

Some authors [7,16,19,23,5,8] present OT as a process that can be planned, managed, and controlled. According to these authors, OT is a rational and controllable process that can be systematically implemented using well-tested methods and techniques to guide it. Consequently, OT can be made predictable, quick, and reasonably cheap. OT is best led by consulting firms that are experts in the field. OT is often directed to the organizational structure: goals and strategies, processes, tasks and procedures, formal communication channels, co-ordination and control of activities, work needs, and authority levels. Finally, OT is expected to have impact on relevant concepts and practices and on political relations.

Resistance to transformation of meanings and practices is often expected. This resistance is seen as a problem to solve or minimise as soon as possible. Individuals are expected to adhere to values such as flexibility, creativity, collaboration, and continuous learning. They are expected to be motivated to immediately, effectively, and creatively use the delivered system.

Every planned OT is seen initially as positive. In the end, the OT may fail. Since the OT is often justified by economic or political reasons, the failure is considered critical to the organization. Thus, there must be blame for the failure. The failure is often blamed on the leaders of the failed process, the consulting firms that failed to implement it, or the individuals and groups that failed to change. Ethical and moral considerations about the way the process was led and about the obtained results are rarely considered, let alone reported.

ICT applications are often seen as drivers of the intended OTs. They are adopted to foster collaborative work, improve organizational learning, make knowledge management effective, and so on.

This brief description of the rhetoric surrounding OT processes implicitly exposes several myths about the process and about people as agents and beneficiaries or victims of the transformation.


2.2 Organizations as Separate Entities

We tend to see an organization as a separate entity with its own goals, strategies, potentialities, and constraints. However, an organization is the people that bring it to existence [13]. Goals and strategies emerge from the sense-making processes that continually reshape how an individual perceives herself and the others in the organization. This understanding leads to two main insights:

1. The idea of an organization being a separate entity with its own goals and strategies serves mainly management interests. Traditionally, management responsibilities involve the co-ordination and control of individual and sub-group efforts, in order to guarantee the economic, social, and political success of the organization. The strategy is to limit emotions, interests, values, and beliefs that could reduce the probability of achieving the goals and to implement strategies that management has defined as the best for the organization.

2. Each of us has interests, beliefs, and, sometimes, values that are not in tune with the organizational identity that, maybe, someone else is trying to solidify. Of course, this potential conflict is why participation in decision processes is so important a theme in the social sciences. Nevertheless, when consensus is not possible, there is the possibility of negotiation. There is always the possibility of giving up some interests and beliefs in exchange for other advantages. The imposition of decisions by powerful individuals or groups should be the last resort. Both negotiation and imposed decisions may lead to the emergence of negative emotions such as frustration, fear, anger, and depression. They may appear on the surface, or they may be held in silence. They may have unpredictable consequences for the development of organizational identity and for organizational success.

2.3 Emotions, Values and Beliefs, and Change

It is useful to define the three concepts used to construct the core ideas in this paper: (1) emotions, (2) values and beliefs, and (3) change.

Emotions. According to Damasio [9], there are two types of emotions, (1) background emotions and (2) social emotions. Background emotions include the sensations of well-being and malaise; calmness and tension; pain and pleasure; enthusiasm and depression. The social emotions include shame; jealousy; guilt; and pride.

There is a biological foundation that is shared by all these emotions:

– Emotions are complex sets of chemical and neuronal responses that emerge in patterns. Their purpose is to help preserve the life of the organism.

– Even if learning processes and culture are responsible for different expressions of emotions and for attaching different social meanings to them, emotions are biologically determined. They depend on cerebral devices that are innate and founded in the long evolutionary history of life on Earth.


– The cerebral devices upon which emotions depend may be activated without awareness of the stimulus or without the exercise of will.

– Emotions are responsible for profound modifications of body and mind. These modifications give rise to neuronal patterns that are at the basis of the feelings of emotion.

– When someone experiences an emotion and expresses and transforms it into an image, it can be said that she is feeling the emotion. She will know that the feeling is an emotion when the process of consciousness is added to the processes of emoting and feeling.

This complex notion of what emotions, feelings, and awareness of feelings are helps us to understand why OT will never be instrumental, quick, or without high costs. The notions of value and belief also reinforce this understanding.

Values and Beliefs. Human values are socially constructed concepts of right and wrong that we use to judge the goodness or badness of concepts, objects, and actions and their outcomes [17]. The beliefs that a person holds about the reality in which he lives define for him the nature of that reality, his position in it, and the range of possible relationships to that reality and its parts [25].

As is easily seen, the physiological and sociological nature of emotions and the fact that values and beliefs are deeply rooted in personal and human history challenge the myth that OTs can be fully planned, managed, and controlled, i.e., instrumentally implemented. OT concepts and practices often lead to radical changes in cherished and long-held beliefs and values. A radical change of the way in which we understand our reality and our roles and actions in it will trigger background and social emotions that need to be carefully dealt with by creating trustful spaces of interaction, patiently over time.

An anecdote illustrates this issue. Suppose that Joe dislikes the color blue. No one can force him to like it. He can be forced to show some appearance of liking it, but then there is no transformation in his color preferences, and the forcing only increases his dislike for blue. He could be brainwashed, but brainwashing would hardly be considered an enlightened technique. Joe may be convinced of the advantages of liking blue, thus ensuring his motivation to cooperate with the transformation process. However, not even Joe can guarantee the transformation of his color preferences. Nevertheless, if Joe is motivated to cooperate there are, in effect, some strategies to improve the chances of a successful transformation:

– by conjuring emotionally positive experiences from Joe's past involving blue sensations, e.g., a peaceful, leisurely sunny summer afternoon with a crystal-clear blue sky spent with his girlfriend wearing a blue bathing suit, or

– by constructing pleasant views of Joe's future involving blue sensations, e.g., a peaceful, leisurely sunny summer afternoon with a crystal-clear blue sky spent with his girlfriend wearing a blue bathing suit.

This anecdote also shows that an OT process can never be without a high expenditure of the resources needed to improve the chances of making it a success [29].


From everything said so far, it becomes clear also that resistance to change is natural in human beings. Because a transformation's implementers are as human as the target group, they also need to find the roots of their own resistance. That is, when the implementers are trying to minimise resistance, they may end up resisting the arguments by the process subjects that suggest changes in the implementers' thinking, strategies, and plans.

Change. Nowadays, there are many mythological OTs fostered by management and ICT gurus. In the name of so-called best practices, there is little consideration for their ethical and moral implications. The implementation of complex systems, such as Enterprise Resource Planning systems, is rarely preceded by considerations about [4], [30], [41], [34], [20]:

– the system's degradation of the employees' quality of work life, by reducing job security and by increasing stress and uncertainty in pursuing task and career interests;

– the system's impact on the informal communication responsible for friendship, trust, feeling of belonging, and self respect;

– the power imbalances the system will cause; and

– the employees' loss of work and life meaning, which leads to depression.

2.4 Summary

In summary, this section has addressed some myths about organizational transformation in order to advance the idea that OT that challenges meaning structures is difficult, resource consuming, and influenced by emotions in situations that require trust between the participants of the OT process [6], [3], [24]. Because most actual OTs draw with them the adoption of complex software systems that support new work concepts and practices, the elicitation of the requirements of those systems must include the understanding of the involved emotions, values, beliefs, and interests.

The next section presents a constructionist perspective [21], [33] of requirements elicitation that takes into account emotions, values and beliefs, and change. Some general guidelines are offered to understand the structural, social, political, and symbolic work dimensions [4] in which values, beliefs, and interests are expressed. The section also includes guidelines for reading the emotions that elicitors and participants express in the informal and formal dialogues that occur during the process.

3 Requirements Elicitation

Traditionally, requirements engineering assumes a strong reality [11], [10], [39], [28], [40], [2], [22], [37], [42], [31]. The requirements engineer elicits information from this strong reality and proceeds systematically to a requirements specification.


3.1 Socially Constructed Reality

Deviating from this tradition and viewing reality as socially constructed implies several epistemological and methodological assumptions, including [38], [1]:

1. Reality is constructed through purposeful human action and interaction.

2. The aim of knowledge creation is to understand the individual and shared meanings that define the purpose of human action.

3. Knowledge creation is informed by a variety of social, intellectual, and theoretical explorations. Tools and techniques used to support this activity should foster such explorations.

4. Valid knowledge arises from the relationship between the members of some stakeholding community. Agreements on validity may be the subject of community negotiations regarding what will be accepted as truth.

5. To make our experience of the world meaningful, we invent concepts, models, and schemes, and we continually test and modify these constructions in the light of new experience. This construction is historically and socio-culturally informed.

6. Our interpretations of phenomena are constructed upon shared understandings, practices, and language.

7. The meaning of knowledge representations is intimately connected with the authors' and the readers' historical and social contexts.

8. Representations are useful if they emerge out of the process of questioning the status quo, in order to create a genuinely new way of thinking and acting.

9. The criteria by which to judge the validity of knowledge representations include that the representations [26]
– are plausible for those who were involved in the process of creating them,
– can be related to the individual and shared interpretations from which they emerged,
– express the views, perspectives, claims, concerns, and voices of all stakeholders,
– raise awareness of one's own and others' mental constructions,
– prompt action on the part of people involved in the process of knowledge creation, and
– empower that action.

The social construction of reality emerges from four main social processes: subjectification, externalization, objectification, and internalization [1].

Subjectification is the process by which an individual creates her own experiences. How an individual interprets what is happening is related to the reality she perceives. This reality is shaped by her subjective conceptual structures of meaning.

Externalization is the process by which people communicate their subjectifications to others, through a common language. By making something externally available, we enable others to react to our previously subjective experiences and thoughts. By means of this communication, humans may transform the original content of a thought and formulate another that is new, refined, changed or developed. The mutual relation with others is dialectical and leads to continuous reinterpretation and change of meanings. Surrounding reality is created by externalization.

Objectification is the process by which an externalized human act might attain the characteristic of objectivity. Objectification happens after several reflections, reinterpretations, and changes in the original subjective thoughts, when the environment has generally started to accept the externalization as meaningful. This process can be divided into phases: institutionalization and legitimization.

Internalization is the process by which humans become members of the society. It is a dialectic process that enables humans to take over the world in which others already live. This is achieved through socialization occurring during childhood, and in learning role-specific knowledge and the professional language associated with it.

3.2 A Constructionist Perspective of Requirements Elicitation

These core ideas have implications for the practice of requirements engineering. Specifically for requirements elicitation, which is the focus of this paper, these implications are summarized in Table 1, found after the bibliographical references. This table works on three subprocesses of requirements elicitation:

1. the creation of knowledge about the current work situation, perceived problems or expectations, and the vision of a new work situation that includes the use of a software system that supports or implements innovative work concepts and practices;

2. the representation of the created knowledge; and

3. the joint invention by all stakeholders of requirements for a system that acceptably meets all stakeholders' needs, expectations, or interests.

These subprocesses, of course, are interconnected processes that are described here independently to simplify their analyses. The table has one column for each of these subprocesses. The rows represent the constructionist perspective on project goals; the process structure; the final product; the use of theoretical frameworks; methods, techniques, and tools; the role of the participants; and the reuse of previous product.

According to the constructionist perspective, knowledge is a social product, actively constructed and reconstructed through direct interaction with the environment. In this sense, knowledge is a real-life experience. As such, it is personal, sharable through interaction, and its nature is both rational and emotional. Knowledge representation is intimately connected with the knower–teacher and the learner. Knowledge representation is never complete or accurate since it can never replace the experience from which it is derived. However, a knowledge representation can be useful if it makes ideas tangible and enables communication and the negotiation of meanings. A system requirement is a specific form of knowledge representation.


Table 1. Practical implications for elicitation of constructionist assumptions

Goals
– Knowledge Creation: Understand (1) human action and interaction that will be supported by the software system and (2) the meanings behind that action. Question and re-create those meanings.
– Knowledge Representation: Express a multivoiced account of the reality that we construct socially. It includes the voice of the elicitor and all stakeholders of the system.
– Requirements Invention: Reinvent the work reality through the adoption of a software system.

Process structure (all three subprocesses)
– Process structure is the result of the joint effort of system's stakeholders and elicitors for emancipation, fairness, and community empowerment. Its shape is situational, i.e., it varies with organizational history and culture, and resources involved.

Product
– Knowledge Creation: Reformulation of mental constructions, recreation of shared meanings, awareness of contradictions and paradoxes of concepts and practices. Development of a common and local language to express feelings, perceptions and conceptions.
– Knowledge Representation: Expression of individual and shared experience.
– Requirements Invention: Shared interpretations of adequate support of work. Cannot be disconnected from historical and social contexts of requirements creators.

Theoretical frameworks (all three subprocesses)
– Inform the process with the values and beliefs held by elicitors and the system's stakeholders.

Methods, techniques, and tools
– Knowledge Creation: Inform the process with the values held by elicitors and the system's stakeholders. Help create graphical and textual elements of a common language.
– Knowledge Representation: Define the organization of knowledge representations.
– Requirements Invention: Define the format in which requirements are expressed.
– All three subprocesses: Have the potential of bias towards some stakeholders' voices and of forcing a foreign language.

Role of participants
– Knowledge Creation: Co-creators of knowledge, jointly nominate the questions of interest.
– Knowledge Representation: Co-creators of a language to represent knowledge, jointly design outlets for knowledge to be shared more widely within and outside the site.
– Requirements Invention: Co-inventors of a common future.

Reuse of product
– Knowledge Creation: Created knowledge is local, transferable only for sites where people have similar experiences and beliefs.
– Knowledge Representation: Representations are connected with the context in which they were created. If transposed to a different location, they may invoke different mental constructions in readers.
– Requirements Invention: Reuse of stakeholders' requirements is problematic because of their historical and sociocultural dimension.

3.3 Integration of Organizational Theory into Requirements Elicitation

Recently, a number of authors, e.g., Bolman and Deal [4], Morgan [30], and Palmer and Hardy [32], attempted systematizations of organizational theory. Ramos investigated the usefulness of integrating this organizational theory into the requirements engineering process [35]. She described the importance of the structural, social, political, and symbolic dimensions of work in determining requirements. One result of this work is a set of guidelines for understanding the meaning of human action and interaction. These guidelines are summarized in Tables 2 and 3, found after the bibliographical references.

Table 2. Work aspects that should guide the choice of participants

Structural – Participants should be representative of: Formal roles; Tasks; Skills; Levels of authority; Accessed/produced information.

Social – Participants should be representative of: Communication skills; Negotiation skills; Informal roles; Degrees of motivation to change work practices; Participation in the shaping of organizational history; Willingness and experience in decision making processes; Professional status; Knowledge.

Political – Participants should be representative of: Individual interests; Form of power held (organizational authority, control of scarce resources, control of the definition of formal arrangements, restricted access to key information, control of organizational borders, control of core activities, member of a strong coalition, charisma).

Symbolic – Participants should be representative of: Use of jargon; Use of proverbs, slogans or metaphors; Relevant beliefs and superstitions; Use of humor; Story telling; Responsibilities for symbolic events; Ways of instigating social routines and taken-for-granted techniques to perform a task; Ways of conceiving the work space.

Table 2 helps decide which stakeholders should be consulted during elicitation, that is, which participants should be chosen to represent the various work dimensions. For each dimension of work, the table lists the properties of the chosen individuals that must be considered.

Table 3 shows, for each dimension of work, the human actions and interactions that can be relevant to requirements.


Table 3. Dimensions of human action in organizations

Structural – Relevant organizational goals, objectives, and strategies; Tasks, processes, rules, regulations, and procedures; Communication channels and exchanged information; Coordination and control; Formal roles; How authority is distributed; Needs of system support to work; Relevant organizational and technological knowledge to be able to perform tasks.

Social – Shared goals and objectives; Performance expectations; Rewards or punishments for performance; Motivation factors; Informal roles and communication; Personal knowledge and its impact on work concepts, practices, and relationships; Fostered participation in decision making; Use of individual and group skills.

Political – Personal interests relating to performed tasks, career progression, and private life; Coalitions; Individual or group power plays; Conflict of interests; Negotiation processes (concepts and practices).

Symbolic – Symbols used to deal with ambiguity and uncertainty; Shared values and beliefs; Common language; Relevant myths, stories, and metaphors; Rituals and ceremonies; Relevant messages to organizational, work, or system stakeholders; Legitimized ways of expressing emotions.

3.4 Towards a Constructionist Requirements Elicitation Process

During requirements elicitation, all created knowledge should be represented and continually consulted about how previous and actual historical, social, and cognitive experiences have been shaping the process of its creation.

While creating the knowledge elements included in Tables 2 and 3, elicitors and the system's stakeholders participate in conversations. In these conversations, the processes of subjectification, externalization, objectification, and internalization are occurring continually, and their interplay creates a common reality for elicitors and stakeholders.

Logic and emotion, awareness and unawareness, explicit and tacit are ever-present elements in the interactions, shaping thinking and action. Emotions, feelings, unconscious experience, and knowledge can be accessed only indirectly through the actions and reactions of the participants in requirements elicitation and through their use of language in its most general sense [20]:

– vocal characterizers (noises one talks through, e.g., laughing, whispering, yelling, crying);
– vocal segregates (sounds used to signal agreement, confusion, understanding, e.g., "hmm-hmm", "Huh?!", "Ah!", "Nu?!");
– voice qualities (modifications of language and vocalizations, e.g., pitch, articulation, rhythm, resonance, tempo);
– idiom (dialect, colloquialism, jargon, slang);
– discourse markers ("well", "but", "so", "okay");
– stylistic devices (use of repetition, formulaic expressions, parallelism, figurative language, onomatopoeia, pauses, silences, mimicry);
– facial expressions (smile, frown, furrowed brow);
– gestures (nodding, arm motions, hand movements);
– shifts in posture;
– alterations in positioning from intimacy (touching) to social or public distance;
– performance spaces (an allocated room or impromptu meeting in a corridor, rearranged seating, etc.);
– props (especially for ceremonial oratory); and
– clothing, cosmetics, and even fragrance.

During the knowledge construction process, the elicitors should reflect critically on themselves as practitioners. This reflection has mainly three dimensions:

1. What theories and practical experience have been shaping our practice as elicitors? What are the alternatives? Why should we stick to our usual ways of thinking and acting?

2. What frameworks will we be using to guide our actions in the present situation? Why? What goals will guide the interaction with members of this community? What ethical considerations are we assuming?

3. How effective is our communication with the system's stakeholders? What feelings have been present in interactions with them? What have we learned? In which way are our and others' understandings and practice changing?

These guidelines are derived from case studies carried out by the first author for her Ph.D. dissertation [35], [36]. Two case studies were carried out in order to identify what needs, expectations, and beliefs were sustaining specific OTs in which ICT applications (1) were being adopted to foster use of practices that the senior management thought to be the best and (2) were being locally developed in opposition to work concepts and practices that senior management thought were best. In each case, the OT was carried out successfully.

4 Conclusions and Future Work

This paper was written with the primary aim of addressing the implications of emotions, values, beliefs, and interests in the conception and adoption of software systems that support new work realities. The secondary aim of the paper is to advance some general guidelines to understand the emotions, values, beliefs, and interests relevant to requirements elicitation.

The approach to requirements elicitation implicit in the guidelines is lengthy and resource intensive. The transformation of values, beliefs and interests, and the emotions and feelings attached to them is difficult and uncertain. It requires patience and trust. At the end of a successful OT process that includes the adoption of ICT applications, stakeholders and requirements engineers will find themselves transformed in some way. In a joint effort, they will have conceived the support of a new work reality that will be implemented. This new reality must be nurtured until it solidifies close to the way it was originally envisioned.

Addressing only the structural, political, and economic aspects of the process would mean ignoring that emotions and feelings are present even in our most rational and objective decisions [9]. In the elicitation of requirements, emotions and feelings are present in the choice of the problem to address, the choice of techniques and tools to gather information about business goals and work practices, the choice of stakeholders and the needs they express, the abstractions and partitions of reality, the knowledge we find relevant, the requirements we elicit, and the formats in which we choose to represent knowledge and requirements.

In future research, the guidelines will be made more detailed so that engineers can choose the ones they will integrate into their preferred methods for elicitation. It is already planned to do several case studies in which, by studying the implementation of the same ready-to-use package of software in different organizations, the differences in the historical and socio-cultural backgrounds will be mapped into differences in the implementations.

The basic assumptions of the constructionist perspective, from which Table 1 is derived, are already implicitly integrated into the Soft Systems Methodology (SSM). Authors in requirements engineering have been emphasising the interconnectedness of science, society, and technology [15] and the relevance of ethnographic techniques for eliciting requirements in their context [28], [14]. However, few specific guidelines have been provided to deal with the impact of emotions, beliefs, and values of the whole team involved in a requirements elicitation. There is also a shortage of guidelines to help elicit emotions, beliefs, and values from the visible and shared constructions of human action and interaction that occur in organizations. Finally, almost no ideas have been provided to structure requirements elicitation around the social dynamics of a learning process.

In the future, the authors intend to develop an approach that will structure requirements elicitation around the four processes that mold socially created realities and that will make use of the above guidelines and of strategies to effectively influence the transformation of emotions, values, and beliefs. An initial version of this approach has already been developed and tested in the field, but it needs to be improved in future action research projects. The authors do not intend to invent new techniques or a new method to guide requirements elicitation. Rather, they intend to provide a general framework in which existing methods and techniques could be integrated or reconstructed.

References

1. Arbnor, I., Bjerke, B.: Methodology for Creating Business Knowledge. Sage, Thousand Oaks, CA (1997)

2. Berry, D.M., Lawrence, B.: Requirements Engineering. IEEE Software 15:2 (March 1998) 26–29

3. Boje, D.M., Gephardt, R., Thatchenkery, T.J.: Postmodern Management and Organization Theory. Sage, Thousand Oaks, CA (1997)

4. Bolman, L.G., Deal, T.E.: Reframing Organizations: Artistry, Choice, and Leadership. Second Edition. Jossey-Bass, San Francisco, CA (1997)

5. Burke, W.W.: Organization Change: What We Know, What We Need to Know. Journal of Management Inquiry 4:2 (1995) 158–171

6. Cialdini, R.B.: Influence: Science and Practice. Harper Collins College, New York, NY (1993)

7. Cummings, T.G., Worley, C.G.: Essentials of Organization Development and Change. South-Western College Press, Mason, OH (2000)

8. Dahlbom, B., Mathiassen, L.: Computers in Context: The Philosophy and Practice of Systems Design. Blackwell, Oxford, UK (1993)

9. Damasio, A.: The Feeling of What Happens: Body and Emotion in the Making of Consciousness. Harcourt Brace, New York, NY (1999)

10. Davis, A., Hsia, P.: Giving Voice to Requirements. IEEE Software 11:2 (March 1994) 12–16

11. Davis, A.M.: Software Requirements: Analysis and Specification. Prentice-Hall, Englewood Cliffs, NJ (1990)

12. Dickson, G.W., DeSanctis, G.: Information Technology and the Future Enterprise: New Models for Managers. Prentice Hall, Englewood Cliffs, NJ (2000)

13. Espejo, R., Schuhmann, W., Schwaninger, M., Bilello, U.: Organizational Transformation and Learning: A Cybernetic Approach to Management. Jossey-Bass, Chichester, UK (1996)

14. Goguen, J.A., Jirotka, M.: Requirements Engineering: Social and Technical Issues. Academic Press, London, UK (1994)

15. Goguen, J.A.: Towards a Social, Ethical Theory of Information. In: Bowker, G., Gasser, L., Star, L., Turner, W.: Social Science Research, Technical Systems and Cooperative Work. Erlbaum, Mahwah, NJ (1997) 27–56

16. Greenwood, R., Hinings, C.R.: Understanding Strategic Change: the Contribution of Archetypes. Academy of Management Journal 36:5 (1993) 1052–1081

17. Hirschheim, R., Klein, H.K., Lyytinen, K.: Information Systems Development and Data Modeling: Conceptual and Philosophical Foundations. Cambridge University Press, Cambridge, UK (1995)

18. Iivari, J., Hirschheim, R., Klein, H.K.: A Paradigmatic Analysis Contrasting Information Systems Development Approaches and Methodologies. Information Systems Research 9:2 (1998) 164–193

19. Jick, T.D.: Accelerating change for competitive advantage. Organizational Dynamics 24:1 (1995) 77–82

20. Jones, M.O.: Studying Organizational Symbolism. Sage, Thousand Oaks, CA (1996)

21. Kafai, Y., Resnick, M.: Constructionism in Practice: Designing, Thinking, and Learning in a Digital World. Erlbaum, Mahwah, NJ (1996)

22. Kotonya, G., Sommerville, I.: Requirements Engineering. John Wiley & Sons, West Sussex, UK (1998)

23. Kotter, J.P.: Leading Change. Harvard Business School Press, Cambridge, MA (1996)

24. Kramer, R.M., Neale, M.A.: Power and Influence in Organizations. Sage, Thousand Oaks, CA (1998)

25. Lincoln, Y.S., Guba, E.G.: Competing Paradigms in Qualitative Research. In: Denzin, N.K., Lincoln, Y.S.: Handbook of Qualitative Research. Sage, Thousand Oaks, CA (1994) 105–117

26. Lincoln, Y.S., Guba, E.G.: Paradigmatic Controversies, Contradictions, and Emerging Confluences. In: Denzin, N.K., Lincoln, Y.S.: Handbook of Qualitative Research. Sage, Thousand Oaks, CA (2000) 163–188

27. Lyytinen, K., Mathiassen, L., Ropponen, J.: Attention Shaping and Software Risk - A Categorical Analysis of Four Classical Risk Management Approaches. Information Systems Research 9:3 (1998) 233–255

28. Macaulay, L.A.: Requirements Engineering. Springer, London, UK (1996)

29. Marion, R.: The Edge of Organization: Chaos and Complexity Theories of Formal Social Systems. Sage, Thousand Oaks, CA (1999)

30. Morgan, G.: Images of Organization. Sage, Thousand Oaks, CA (1997)

31. Nuseibeh, B., Easterbrook, S.: Requirements Engineering: A Roadmap. In: Finkelstein, A.: The Future of Software Engineering 2000. ACM, Limerick, Ireland (June 2000)

32. Palmer, I., Hardy, C.: Thinking about Management. Sage, Thousand Oaks, CA (2000)

33. Papert, S.: Introduction. In: Harel, I.: Constructionist Learning. MIT Media Laboratory, Cambridge, MA (1990)

34. Parker, S., Wall, T.: Job and Work Design: Organizing Work to Promote Well-Being and Effectiveness. Sage, Thousand Oaks, CA (1998)

35. Ramos, I.M.P.: Aplicacoes das Tecnologias de Informacao que suportam as dimensoes estrutural, social, politica e simbolica do trabalho [Applications of information technologies that support the structural, social, political and symbolic dimensions of work]. Ph.D. Dissertation, Departamento de Informatica, Universidade do Minho, Guimaraes, Portugal (2000)

36. Santos, I., Carvalho, J.A.: Computer-Based Systems that Support the Structural, Social, Political and Symbolic Dimensions of Work. Requirements Engineering 3:2 (1998) 138–142

37. Robertson, S., Robertson, J.: Mastering the Requirements Process. Addison-Wesley, Harlow, England (1999)

38. Schwandt, T.A.: Three Epistemological Stances for Qualitative Inquiry: Interpretivism, Hermeneutics, and Social Constructionism. In: Denzin, N.K., Lincoln, Y.S.: Handbook of Qualitative Research. Sage, Thousand Oaks, CA (2000) 189–213

39. Siddiqi, J., Shekaran, M.C.: Requirements Engineering: The Emerging Wisdom. IEEE Software 9:2 (March 1996) 15–19

40. Sommerville, I., Sawyer, P.: Requirements Engineering, A Good Practice Guide. John Wiley & Sons, Chichester, UK (1997)

41. Spector, P.E.: Job Satisfaction: Application, Assessment, Causes, and Consequences. Sage, Thousand Oaks, CA (1997)

42. van Lamsweerde, A.: Requirements Engineering in the Year 00: A Research Perspective. Proceedings of the 22nd International Conference on Software Engineering. ACM, Limerick, Ireland (June 2000)


Managing Evolving Requirements Using eXtreme Programming

Jim Tomayko

Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA

[email protected]

Abstract. One problem of moving development at "Internet speed" is the volatility of requirements. Even in a supposedly stable project like that described here, requirements change as the client sees "targets of opportunity." That is one of the unintended side effects of having the client on-site frequently, although it does increase user satisfaction because they are not prevented from adding functionality. This paper is an account of using an agile method, eXtreme Programming, to survive and manage rapid requirement changes without sacrificing quality.

1 Introduction

One of the most prevalent problems in software development is changing requirements, either because all of the requirements are unknown at the beginning of a project, or the clients simply changed their minds during its course, or some combination of the two. The way that requirements are managed in eXtreme Programming (XP), and other "agile" or "lightweight" development processes can ameliorate the effects of requirements uncertainty. In fact, the strongest undercurrent of these methods is the phrase "embrace change." As Jim Highsmith and Martin Fowler have written [3]: "For a start, we don't expect a detailed set of requirements to be signed off at the beginning of the project; rather, we see a high-level view of requirements that is subject to frequent change."

This paper tries to show how XP can adjust the development process to keep up with most changes.

2 eXtreme Programming

XP is one of a growing number of lightweight methods now becoming popular for software development [1]. An initial glance at XP reveals places where its processes can be extended by the addition of selected practices from more heavyweight methods (Table 1). Note that the only XP practice that cannot be extended is the 40-hour week. Perhaps the greater than 40-hour week is a direct result of requirements evolution. Actually, the “40-hour week” is a metaphor for “the developers are alert and rested.”


The other requirements-oriented practices of XP can essentially be the means of preserving a normal working load.

Table 1. XP processes and related standard practices.

XP Topic                  Additional Practice
Planning Game             Iterative estimation; COCOMO II
Small Releases            Rapid Application Development
Metaphor                  Problem Frames, Prototypes
Simple Design             Software Architectural Styles
Testing                   Statistical Testing
Refactoring               Software Architectural Styles
Pair Programming          Inspections
Collective Ownership      Open Source
Continuous Integration    Continuous Verification
40-hour Week              (none)
On-site Customer          Use Cases
Coding Standard           Personal Software Process

Requirements per se are not mentioned in the list of XP practices. However, the XP practices of metaphor, simple design, refactoring, on-site customer, testing, collective ownership, and continuous integration are all requirements related. This paper discusses how these XP practices can be used to control requirements evolution. Along the way we will point out where we can use the fundamental XP values of simplicity, communication, feedback, and courage.

2.1 Metaphor

Metaphor n. a figure of speech containing an implied comparison, in which a word or phrase ordinarily and primarily used of one thing is applied to another [12].

This definition is expanded in XP to encompass the entire initial customer and developers’ understanding of the system’s story. As such, it is a substitute for the architecture, which keeps development focused [1]. Therefore, as the metaphor is better understood, so are the requirements.

For example, let the “Voyager probe computer network” be a metaphor for the system we want to build. We will not implement the system as a duplicate of Voyager’s configuration. We will just match its functionality.

Similarly, the use of Michael Jackson’s Problem Frames is a way of fleshing out the metaphor [5]. An example is given in [10] and one section is paraphrased here.

Let us say that we have the following user story:

A probe has a Command Computer, Attitude Control Computer and a Data Processing Computer. It also has three experiment computers, each of which has a limited version of the software residing in a primary computer as a backup. If one of the primary computers fails, the backup software will act as the primary until it can determine if there are sufficient resources to either run a more robust version of the software, or that version will have to be spread over several processors.


There is a problem frame for a controller that implements some Required Behavior. There is another that commands Controlled Behavior. In an earlier version of Jackson’s work, these two were combined in the Control Frame. However, in the spacecraft example, they are more effective separated (Figures 1 and 2).

Fig. 1. Required behavior. [Problem frame diagram: the Attitude Control Computer is the control machine, the Attitude Jets and Star Tracker are the controlled domain, and 3-axis Inertial Orientation is the required behavior.]

The important thing to note when applying these problem frames is that their proper use requires filling out the metaphor “Voyager computer system.” There is no mention of thrusters, star trackers, or 3-axis inertial orientation in the original user story. It is only when completing the domains that these come up. However, these are hardly requirement changes. They are more like requirement refinements. But, they have the same effect, and coupling problem frames to the metaphor discovers them earlier and simplifies the evolution of requirements (Figure 1).

Fig. 2. Controlled behavior. [Problem frame diagram: the Command Computer is the control machine; the Maneuver Thrusters, Experiment Processors, and Heartbeat are the controlled domain; the Ground Controller is the operator; and the commanded behavior covers Course Corrections, Experiment On/Off, and Fault Tolerance Management.]

In contrast, consider naming a metaphor that does not match: financial software is called a “checkbook,” but it maintains a budget, investments, and several accounts, besides allowing check writing. Perhaps “accountant” would be better. The original metaphor is quite limiting.


2.2 Simple Design and Refactoring

Simple design and refactoring are discussed together because following the intent of the former makes the latter easier. Regardless of whether refactoring (redesigning) is used, simple designs fit the primary XP goal of simplicity. This enables the value of courage, as it gives the client the courage to ask for something new that may have just come to light and it gives the developers the courage to add functionality. Basically, this is the courage to change, a central tenet of requirements evolution.

Simple designs are the product of much work at the front end. When XP developers start exploring the solution space, they derive simple designs by keeping modularity and abstraction prevalent through object-orientation. Refactoring for the long view (i.e. global variables versus local, variables versus constants, classes and object instantiations versus individual objects, etc.) results in simplifications and reuse. All are capable of making the addition of requirements graceful.

“Simplicity” is not another word for “poor” or “haphazard.” Simple designs facilitate change. Fowler and Highsmith again: “Agile approaches emphasize quality of design, because design quality is essential to maintaining agility.” [3]
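To make the kind of “variables versus constants” refactoring mentioned above concrete, here is a minimal before/after sketch in Java (the class and value names are hypothetical, invented for this illustration, not code from the project described in this paper):

// Before: a hard-coded value is buried in the logic, so a requirement
// change means hunting for and editing every place it appears.
class AttitudeCommandBefore {
    int targetAngle() {
        return 45;   // magic number
    }
}

// After: the value is held in one named, changeable place, so an evolving
// requirement touches a single line (or a setter call) instead of the logic.
class AttitudeCommandAfter {
    private int defaultAngle = 45;

    void setDefaultAngle(int degrees) {
        this.defaultAngle = degrees;
    }

    int targetAngle() {
        return defaultAngle;
    }
}

The point of the sketch is only that a simple, well-factored design localizes the places a new or changed requirement has to touch.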

2.3 On-Site Customer

Perhaps the strongest positive influence on evolving requirements and uncertainty is the XP practice of having the customer present. Many times their presence can be handy in simplifying the product. Once when a team was implementing the spacecraft software discussed above, they were having trouble re-booting processors that had failed. The client pointed out that a failed processor would have a low likelihood of a restart in space conditions. This seemingly obvious information changed the requirement for fault detection and tolerance when the team was about to go into the agony of implementing the requirement as they saw it.

It can be that the chief means of controlling uncertainty over requirements is to have a representative of the group that will use the software present to say yea or nay before any change. This avoids travel down false paths and shortens the development lead time of new functions.

2.4 Testing

One of the XP values is feedback. This is assured to be accomplished because XP software development is predicated on testing, and passed tests tell the developers whether they have successfully implemented a function. Testing is also a way of evolving requirements, as a test must exist before a function is added or changed. If it is impossible to develop the test, it is probably impossible to implement the change. Therefore, one check upon requirements expansion is this need to develop a test for any implementation. If it all works out and there is an adequate code/test pair that proves the added functionality and does not add complexity, as long as there is budget to cover it, who will care?
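As an illustration of the test-first discipline described above, here is a minimal JUnit sketch in the style used later in the appendix (the AttitudeController class and its methods are hypothetical, invented for this example, not part of the project): the test is written first, fails while the behaviour is missing, and passes only once the new requirement has actually been implemented.

import junit.framework.TestCase;

// Written before the behaviour exists: running it documents the new
// requirement and fails until the requirement is implemented.
public class TestAttitudeChange extends TestCase {
    public void testCommandedAttitudeIsApplied() {
        AttitudeController controller = new AttitudeController();
        controller.command(45);                          // new requirement: accept a commanded attitude
        assertEquals(45, controller.currentAttitude());  // passes only once implemented
    }
}

// Minimal implementation added afterwards, just enough to make the test pass.
class AttitudeController {
    private int attitude = 0;
    public void command(int degrees) { attitude = degrees; }
    public int currentAttitude() { return attitude; }
}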


2.5 Collective Ownership

One problem of evolving requirements is distributing their implementation according to developer expertise. Often the wrong engineer is assigned an implementation due to management ignorance. Collective ownership prevents this by permitting developers to choose to build components and avoid building others. This also prevents developers from learning new things, but training is not an essential part of the process. The principle of collective ownership allows engineers both to gravitate toward their areas of expertise and to fix a naïve implementation. In this way they can contribute to keeping the effects of evolving requirements under control. Collective ownership means more and hopefully better refactorings, and a simpler design, since all must be able to understand them.

2.6 Continuous Integration

This is a powerful technique not only of XP; Microsoft practices it and other organizations have tried to copy its success. Microsoft reportedly has a rule that the day is not over until the software under development is successfully compiled. That is difficult to believe in the case of operating systems, but not for most applications using modern compilers. The point is that the software is in a constant state of pseudo-completion. If it was developed in the XP fashion, it already represents some value to the customer. Adding or changing a requirement does not affect its availability to do the baseline application.

3 A Case Study

In order to see how these XP practices and values are applied to evolving requirements, the story of how a team developed an application in an atmosphere of uncertainty is illustrative. The XP team was two pairs. Its job was to build a simulation of a deep space craft’s computer systems for a study of fault tolerance. Previously, fault tolerance was often accomplished by redundancy [9]. The problem with that method is that the redundant hardware constitutes an additional drain. The chief constraints on spacecraft are size, power, and weight. Eliminating redundant hardware would benefit all three.

During a meeting of the High Dependability Computing Consortium (HDCC) in early 2001 at the National Aeronautics and Space Administration’s (NASA) Ames Research Center, it occurred to me that, thanks to Moore’s Law, even the relatively ancient processors likely to be chosen for a probe are more powerful than is truly necessary for the experiments. Therefore, a kernel process running in the experiment computers, instantiated if one or more of the primary computers fail, could back up the primary command, attitude, and data computers. Even on spacecraft that have been in flight for decades, the primary computers have never failed. Thus it makes sense to try this scheme. The problem was that of a Voyager-type spacecraft instead of the reconfigurable software on Galileo [11].

Every detail of the requirements was not delivered prior to beginning development. The first thing that the XP team did was to use the metaphor and problem frames to


make a reasonably correct prototype. The prototype was related to communications between computers and processes. This turned out to be possible, so the substance of the metaphor was all right: the primary computers could communicate with the others.

Until this was determined to be possible, nothing could really be done. The overall story was written and divided into a series of small stories on cards. The client, who was available throughout development, ordered the stories into something that would deliver value with each cycle of development. Essentially, this was three cycles: all the functionality of the command computer, all that of the attitude control computer, and finally that of the data computer.

The developers seemed relieved by this ordering, as the most difficult task of the entire system is redundancy management. The pairs initially thought that, since redundancy management is the job of the command computer, any logic for that purpose must reside on that computer. It turned out to be quite easy to send a “heartbeat” to the other computers. If they did not respond to it in a fixed amount of time, then the node was considered failed. However, a failed computer could not be restarted unless it was completely rebooted.

The client saw this as a misunderstanding. Since this was a simulation of a deep space craft, there would be no possibility for repair, so the reboot capability was not needed. When the client removed this functionality, the team could reconsider the direction of the fault tolerance signal. Now the peripheral computers could send a heartbeat to the command computer. When a time-out occurred, the offending computer was declared failed.
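The heartbeat scheme that emerged can be pictured with a minimal Java sketch (class, method, and node names here are illustrative assumptions, not code from the project): each peripheral computer reports in periodically, and the command computer declares a node failed when its last report is older than the allowed interval.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the heartbeat-with-timeout idea described above.
public class HeartbeatMonitor {
    // The appendix's test plan expects a heartbeat every ten seconds.
    private static final long TIMEOUT_MS = 10000L;

    private final Map<String, Long> lastBeat = new HashMap<String, Long>();

    // Called whenever an "I am alive" message arrives from a peripheral computer.
    public synchronized void heartbeat(String node) {
        lastBeat.put(node, Long.valueOf(System.currentTimeMillis()));
    }

    // Polled by the command computer; true means the node has missed its
    // window and should be declared failed (and its kernel instantiated
    // on an experiment processor, as the case study goes on to describe).
    public synchronized boolean hasFailed(String node) {
        Long last = lastBeat.get(node);
        return last == null || System.currentTimeMillis() - last.longValue() > TIMEOUT_MS;
    }
}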

The kernel process for the command computer only did the heartbeat. The developers realized that once the command computer and its kernel were figured out, development was essentially over. Near that point, they offered the code to the client for refactoring.

Normally, the client would have been involved directly in helping to shape the functionality in general, not the details. However, this client had some expertise, and was allowed to exercise the privileges of refactoring and common ownership. This sort of relationship can be common in aerospace, since the prime contractor usually has some experience in the field. It turned out the refactorings were delivered to the developers as suggestions for change, so the client did not have to learn the development environment or configuration management system. They just let the developers handle the changes. Most were minor, such as replacing constants with variables. Looking at the code, the client noted that the developers were preparing to have the attitude control computer accept an orientation value from the ground. This was not explicitly part of the requirements at first, but the developers seemed impressed with the elucidation of the three-axis inertial orientation requirement. In this way, both client and developers can contribute to requirement refinement.

The client discovered this additional requirement as part of the refactoring/open ownership of the code. Usually, implementing a requirement that a client has not specified has been brushed off as a “feature” by the developers and “gold plating” by the clients. The usual rationale is that since the additional code does not affect the actual requirements, if the client gets something extra, so much the better. This ignores both added complexity and added difficulty in maintenance. Specific to this case, it violates the XP value of simplicity if both do not agree.

In this case, allowing the attitude control computer to align the spacecraft (virtually) along its zero axis added a reasonable function. The client was worried that


the attitude and data computers only had the “I’m alive” function of fault detection. By adding this functionality to the software, the developers got the attitude control computer more involved with the system. The client then introduced a requirements change to make acknowledgement of the attitude change command be routed through the data computer, thus giving it an additional function. Later, three additional and specific functions were added to simulate load.

Note that in the paradigmatic waterfall-like development life cycle, the client would most probably have seen this feature at the acceptance test phase. This would be much too late for an elegant addition to the code, or to veto it. As it turned out, refactoring the code to remove the attitude change functionality would be simple, as the developers used abstraction and information hiding well. But the client wanted to keep the additional function and add one to the data computer. Therefore, this combination of techniques enabled requirements evolution, not only a reaction to it. Having the client on-site was the most important practice contributing to this result.

3.1 Further Results

At about this time, the client asked for a written description of the JUnit test system for possible later use. The development team took this opportunity to produce a test plan for integration as well (see the Appendix). The team had done well independently using JUnit for component testing, as there was no need for interaction among them. However, integration would be different. When the time came for integration, they developed the plan to avoid interfering with each other. Note that the team restated the requirements as they understood them after problem frames analysis.

Also at about this time in development, the team had the client sit down and try the software, hopefully to get direct experiential feedback. One of the areas that was not given any emphasis was the user interface. This is because the main purpose of the interface would be to turn on and off computing resources, so the client did not specify anything special, essentially leaving the format up to the development team. Now, even though integration was not finished, the interface was. The developers had made a simple interface of selections from a hierarchical set of menus. The messages at the lower levels were identical. So, with the speed of the processor, they would come up in such a way that their overwriting would be indiscernible, so that one perceived that they did not change. Small evolutions like this were found in the process of the client starting and stopping the three main spacecraft computers using the software.

Finally, the team finished the original cards. They were asked by the client to add some experiment computer functions. This was done. When the client asked for additional capabilities, the team turned him down because their previous experience with the software indicated that they could not finish before the allocated time ran out.

4 Summary

The claims of XP advocates, that the approach is relatively resilient to requirements changes, seem to have been justified in this project [1]. One developer shared


reasons why: “Having short release cycles and testing continuously aided us, the XP-team, to find defects quickly and also gave me a sense of accomplishment once we finish a requirement on the targeted date.” [2]

A reason why the client shared the developers’ enthusiasm for the method’s effectiveness with requirements changes was stated by another developer: “Everything that is developed serves the interests of the customer.” [8]

We made the assertion that quality remained high using agile methods. According to Jerry Weinberg, “Quality is value to some person.” What we value is lack of defects. A corresponding team using the Team Software Process (TSP)1 on the same problem reported making 19.15 defects/KLOC [4], while a contemporary team with the same process had 20.07 defects/KLOC [6]. The XP team injected 9.56 defects/KLOC, or about half the relative number. The TSP team was measured at formal inspections and integration test. The XP team was measured during integrations and while executing their test plan.

TSP is just not built for speed. Productivity figures bear this out: the XP team wrote 5,334 lines of Java that implemented all the requirements. The TSP team wrote 2,133 lines that implemented roughly half the requirements. This is because the XP team was implementing almost from the beginning, while the TSP team was writing the vision document, design document, and so on. The XP team was almost documentless, except for the coding standard and the integration test plan. If the project had continued, other documents would probably have been added.

There was no desire for documentation of requirements change, but rather of the resulting design. A third developer mentioned: “I believe that formal design documentation should be more emphasized so that even if developers come in and go there is a point of reference.” [7]

Recapturing the design is one of the biggest problems of maintenance, and this comment represents good insight on the part of the developer. Therefore, we can see that in a project trying to manage asynchronous requirements change, agile methods are useful. They enable the client to change their mind due to business pressure yet still have the developers easily respond to change.

References

1. Kent Beck. eXtreme Programming Explained. Addison-Wesley, 2000.
2. Maida Felix. Personal Experiences in an XP Project, manuscript, Carnegie Mellon University, August 2001.
3. Martin Fowler and Jim Highsmith. The Agile Manifesto, Software Development, V.9 No.8, 2001.
4. Watts Humphrey. Team Software Process. Addison-Wesley, Boston, 2000.
5. Michael Jackson. Problem Frames. Addison-Wesley, 2001.
6. Michelle Krysztopik. Quality Report, manuscript, Carnegie Mellon University, 2001.
7. Azwifarwi Mahwasane. Personal Experiences in an XP Project, manuscript, Carnegie Mellon University, August 2001.
8. Beryl Mbeki. Personal Experiences in an XP Project, manuscript, Carnegie Mellon University, August 2001.


9. James Tomayko. Achieving Reliability: The Evolution of Redundancy in American Manned Spacecraft Computers, Journal of the British Interplanetary Society, V.38, 1985.
10. Jim Tomayko. Adapting Problem Frames to eXtreme Programming, manuscript, Carnegie Mellon University, 2001.
11. James Tomayko. Computers in Spaceflight: The NASA Experience, Contractor Report 182505, NASA, Washington, DC, Ch. 5 and 6, 1988.
12. Webster’s Dictionary, 3rd College Ed., Webster’s New World, 1988.

1 The Team Software Process and TSP are service marks of Carnegie Mellon University.

Appendix: XP Project Test Plan

Test Plan
Version 1.0
July 3, 2001

Maida Felix
Azwifarwi Mahwasane
Beryl Mbeki
Thembakazi Zola

Introduction

This section introduces the Space Probe Simulation System including its requirements, assumptions, and scenarios. It also describes the objective of this test plan.

Purpose
An objective of this test plan is to provide guidance for the Space Probe Simulation System Test. The goal is to provide a framework to developers and testers in order for them to plan and execute the necessary tests in a timely and cost-effective manner. This system test plan outlines test procedures to verify system level operations through user scenarios. This system testing is intended to test all portions of the Space Probe System.

This test plan sample follows the steps discussed in the book “Object-Oriented Software Engineering” by Bernd Bruegge and Allen Dutoit.

System Overview
A space probe has three primary computers: Command Computer, Attitude Control Computer and the Data Processing Computer. Each primary computer has backup software that resides in an experiment computer. The space probe interacts with the ground station computer in order to get the commands. The computers communicate with each other over a LAN.

The Command Computer receives and decodes the commands from the ground computer in order to change the attitude of the space probe or turn on experiments. It also receives the heartbeat of each computer that is connected to it. The Command Computer is able to tell if any of the computers has failed and shall instantiate the


kernel of a failed computer. The experiment computers shall be on standby until they get a “turn-on” message from the Command Computer.

The Attitude Control Computer has an interface to the Command Computer for changing the attitude of the space probe and for fault tolerance. It generates a status report and sends it to the Data Processing Computer for formatting. The kernel of the Attitude Control Computer’s functionality shall reside on an experiment processor.

The Data Processing Computer gets raw data from each experiment processor. Once the data is in the Data Processing Computer, it will be translated into a format that is acceptable to the ground station computer. The kernel of the Data Processing Computer’s functionality shall reside on an experiment processor.

Requirements
The requirements of the Space Probe Simulation System consist of the following:
1. The Command computer shall be able to try to restore a failed computer’s functionality.
2. The Command computer shall be able to instantiate the kernel of a failed computer, even itself.
3. The Command computer shall be able to turn on and off experiments.
4. If one of the primary computers has failed, then the kernel of the failed machine’s functionality shall run as a second job in an experiment processor.
5. The Attitude computer shall keep the spacecraft pointing to earth unless commanded otherwise.
6. The Command computer will send messages to the Attitude computer to change the position in which the spacecraft is pointing.
7. The Attitude Control computer shall have a sensor that senses the position of the spacecraft. Once the Command computer has issued a command, the Attitude Computer responds to the command by first finding out what position the spacecraft is at. If it is not at an acceptable position, then following the command will rectify the position.
8. By loading the functionality software on the Attitude Experiment computer it will be possible to have the functionality of the Attitude Control computer on the kernel of the Experiment computer.
9. The Attitude computer will send power codes as a “heartbeat” to inform the Command computer of its on or off status.
10. The Attitude and its Experiment shall gather data and pass it to the Data computer.
11. The Data Computer gets raw data from each processor. Once the data is in the Data computer, it translates the data into a format that is acceptable to the Ground computer.
12. The Data Computer shall format data for downlink.
13. A kernel of the Data computer’s functionality shall reside on an experiment processor.

Assumptions
The Space Probe simulation holds the following assumptions:
1. The software shall run on seven (7) virtual machines within a single LAN. This is an enclosed environment and no future changes are envisioned.


2. The main computers, except the Ground computer, are in space and once a failure occurs they cannot be rebooted (reinstantiated).

3. All commands are sent from the Ground computer only.

System Scenarios
The following steps describe how the system functions. We have described the details of establishing a connection between the Ground computer and the Command computer as well as a connection between the other computers (Attitude, Data, Attitude Experiment, Command Experiment and Data Experiment).

1. Establish a connection:
   Assumption: All computers are on standby waiting for commands.
   - A user starts the Ground computer, stating which computer to connect to, e.g. MSEPC 26. This computer represents the Command computer.
   - Once a connection has been established, a menu appears giving the user options to choose from. The user can choose one of the following options:
     o 1. Switch on a computer.
     o 2. Switch off a computer.
     o 3. Change the attitude.
     Or type “hangup” to exit the system.
2. Sending Commands:
   2.1 Switch on a computer
       - User enters “1” to switch on a computer.
       - A submenu appears asking which one of the computers should be switched on.
       - User enters her choice and Command switches on the specified computer.
   2.2 Switch off a computer
       - User enters “2” to switch off a computer.
       - A submenu appears asking which one of the computers should be switched off.
       - User enters her choice and Command switches off the specified computer.
   2.3 Exit
       - The user types “hangup” to exit the system.

Testing Information

This section provides information on the features to be tested, pass and fail criteria, approach, testing materials, test cases, and testing schedules.

Unit Testing
All code will be tested to ensure that the individual unit (class) performs the required functions and outputs the proper results and data. Proper results will be determined by system requirements, design specification, and input from the on-site client. Unit


testing is typical white-box testing. This testing will help ensure proper operation of a module because tests are generated with knowledge of the internal workings of the module. The developers will perform unit testing on all components.

Tests based on the requirements were written before coding using JUnit. JUnit is a small testing framework written in Java by Kent Beck and Erich Gamma. Each method in a test case exercises a different aspect of the tested software. The aim, as is customary in software testing, is not to show that the software works fine, but to show that it doesn’t. Interaction with the JUnit framework is achieved through assertions, which are method calls that check (assert) that given conditions hold at specified points in the test execution.

JUnit allows for several ways to run the different tests, either individually or in batches, but the simplest one by far is to let the Test Suite collect the tests using Java introspection:

import junit.framework.*;

public class TestOptionMatch extends TestCase {

    Optionmatch M;

    // constructor
    public TestOptionMatch(String name) {
        super(name);
    }

    // Initialize the fixture state
    public void setUp() {
        M = new Optionmatch();
    }

    // add all test methods to the test suite
    public static Test suite() {
        return new TestSuite(TestOptionMatch.class);
    }

    // performing all tests
    public void test()  { assertTrue(M.choices(1) == 0); }
    public void test1() { assertTrue(M.choices(2) == 45); }
    public void test2() { assertTrue(M.choices(3) == 90); }

    // This method starts the text interface and runs all the tests.
    public static void main(String[] args) {
        junit.textui.TestRunner.run(suite());
    }
}


Integration Testing
There are two levels of integration testing. One level is the process of testing a software capability, e.g. being able to send a message upon completion of the integration of the developed system. During this level, each module is treated as a black box, while conflicts between functions or classes are resolved. Test cases must provide unexpected parameter values when design documentation does not explicitly specify calling requirements for client functions. A second level of integration testing occurs when sufficient modules have been integrated to demonstrate a scenario, e.g. the ability to queue and receive commands. Both hardware and software coding is fixed or documentation is reworked as necessary. In the case of the Space Probe Simulation System, the developers performed integration testing to ensure that the test cases work as desired on a specific computer. If a computer cannot communicate with its corresponding partner, and depending on the problem, the developers may have to fix the code to ensure that the system works properly.

System Testing
System testing will test communication functionalities between the Ground computer and the space probe computers. The purpose of system testing is to validate and verify the system in order to assure a quality product. This is the responsibility of the system developers. The developers and the client will ensure that the system tests are completed according to the specified test cases listed below. The developers will ensure that the results of the system tests are addressed if the results show that the functionality does not meet requirements. System testing is actually a series of different tests intended to fully exercise the computer-based system. Each test may have a different purpose, but all work to expose system limitations. System testing will follow formal test procedures based on hardware and software requirements.

Testing Features/Test Cases
Displaying Menu options on the Ground computer is a feature that will be tested. The following lists the test case scenarios that are most applicable to the Space Probe Simulation System.
1. Invalid Entries.
2. Verify that a computer is on before sending a command to it.
3. Verify that the Command computer receives heartbeats from other designated computers every 10 seconds.
4. When a primary computer fails, the backup (Experiment) computer is instantiated and takes over the functionalities of the failed computer.
5. Verify that the Attitude computer and its Experiment receive and display the proper attitude.
6. The Data computer formats and sends attitude to the Ground computer.

Pass/Fail Criteria
The test is considered a pass if all of the following criteria are met:
- The Command computer is connected to the Ground computer; Attitude, Data and the Experiments are connected to the Command computer.
- The Command Experiment computer is connected to the Ground computer.


- Experiments are capable of taking over the functionality of the failed computer when one of the primary computers fails.
- All computers in “space” perform their functions automatically without any user interference.
- The Ground computer shall be user-friendly so that it is easy to use.
The test is considered a fail if:
- The different computers have difficulty connecting or communicating with each other.
- The Experiments do not perform the functionality of a failed computer when one of the primary computers fails.
- The computers cannot perform all their functions automatically.

Testing Approach
This section describes the scenarios before the system testing and the step-by-step instructions on how to conduct the system testing.

Before System Testing
The following compile and run procedures should be followed on all seven (7) virtual machines.
To compile:
- Open DOS.
- Change the working directory to the directory containing all of the system’s source files, by typing cd directory name.
- Type javac filename.java
To run: type java filename

Conducting System Testing
The system testing consists of 7 user scenarios corresponding to the descriptions below. Each of the following sections provides test procedures for the Space Probe Simulation System. Each table lists the test procedure for a given user scenario. All the test procedures assume that a user has already established a connection.

Table 2 shows a typical test procedure for checking invalid host name entry.

Table 2. Case 1 – Invalid Hostname

Step    Purpose/Action                           View/Result
Purp.   To test invalid host name
1       Start the Ground Computer                Show host name and IP address and request for hostname
2       Enter host name in host label.           Confirmation of connection and first menu.
        Example: Type “MSEPC XX”

Table 3 shows a typical test procedure for invalid Menu entry. Assume that the Ground computer is on and a menu is being displayed on the screen.


Table 3. Test Case 2 – Invalid Menu Entry

Step    Purpose/Action                           View/Result
Purp.   To test invalid Menu entry
1       Enter Menu option, e.g. 1, 2, 3          A corresponding submenu appears with more options.
2       Enter submenu option, e.g. A, B, C etc.  A command is sent and a confirmation is received.
Result  User enters wrong choice                 Error message is displayed and user gets another chance to enter an option.

Table 4 shows a test procedure for exit.

Table 4. Test Case 3 – Exit

Step    Purpose/Action                                         View/Result
Purp.   To test if user can end the session without problems
1       Type “hangup”                                          Exit

Table 5 shows a typical test procedure for checking if a computer is on when a command is sent to it.

Table 5. Test Case 4 – Verify if Attitude computer is on.

Step    Purpose/Action                           View/Result
Purp.   To verify if Attitude computer is on.
1       Enter Menu option “3”                    Submenu with the possible attitudes is displayed.
2       Choose an attitude                       Command sent to Attitude Computer.
Result  Attitude is off                          Send notification to Ground computer to switch on Attitude computer first.

Table 6 shows the test procedure for checking if the Data computer is on when a command is sent to it.

Table 6. Test Case 5 – Verify if Data computer is on.

Step    Purpose/Action                               View/Result
Purp.   To verify if Data computer is on.
1       Attitude computer sends message to Data      Data formats the message and sends it to Ground Computer.
        computer.
Result  Data computer is off                         Send notification to Ground computer to switch on Data computer before the command can be sent.


Table 7 shows a test procedure to verify that the Command computer receives heartbeats.

Table 7. Test Case 6 – Command receives heartbeats

Step    Purpose/Action                               View/Result
Purp.   To verify that Command receives heartbeats every ten (10) seconds.
1       Send heartbeat to Command computer.          Command computer receives an “I am alive” message.
Result  No heartbeat                                 The specified computer has failed; instantiate the kernel of the failed computer.

Table 8 shows a typical test procedure for instantiating an experiment computer.

Table 8. Test Case 7 – Instantiate backup computer

Step    Purpose/Action                               View/Result
Purp.   To test whether an experiment computer switches on when its primary has failed.
1       Send a “SWITCH ON” message                   Experiment takes over functionality of the failed computer.
Result  Experiment does not instantiate or           System failure.
        experiment computer fails.

Table 9. Test Case 8 – Attitude receives and displays proper attitude.

Step    Purpose/Action                               View/Result
Purp.   To test whether Attitude computer and Attitude Experiment receive and display the right attitude.
1       Send attitude message, e.g. “B”.             Attitude and Attitude Experiment display “Move 45 degrees”.
Result  No attitude command received                 No change in attitude.
        Attitude and experiment display wrong        Attitude data will be wrong.
        attitude.


Table 10. Test Case 9 – Data computer formatting and sending attitude message

Step    Purpose/Action                               View/Result
Purp.   To test whether the Data computer sends and formats the attitude properly.
1       Attitude computer or the Attitude            Data computer receives, formats the attitude and sends it to the Ground computer.
        Experiment sends attitude change to
        Data computer
Result  No attitude sent to Data from Attitude       No Data sent to Ground
        computer
        No attitude sent to Ground from Data
        computer

Testing Schedules
The developers must ensure that the test data are accurate and the product should not be deployed until unit testing, integration testing, and system testing are properly performed.

As part of the extreme programming testing strategy we will run the tests we wrote beforehand frequently. Therefore we are always in the testing stage, but our final testing will be done as scheduled below:

Testing Items                    Test Date              Responsible Person
Space Probe Simulation System    7/30/2001–8/03/2001    Maida Felix, Azwifarwi Mahwasane, Nonzaliseko Mbeki, Thembakazi Zola


D. Bustard, W. Liu, and R. Sterritt (Eds.): Soft-Ware 2002, LNCS 2311, pp. 332–347, 2002.
© Springer-Verlag Berlin Heidelberg 2002

Text Summarization in Data Mining

Colleen E. Crangle

ConverSpeech LLC, 60 Kirby Place, Palo Alto, California 94301, [email protected]

www.converspeech.com

Abstract. Text summarizers automatically construct summaries of a natural-language document. This paper examines the use of text summarization within data mining, identifying the potential summarizers have for uncovering interesting and unexpected information. It describes the current state of the art in commercial summarization and current approaches to the evaluation of summarizers. The paper then proposes a new model for text summarization and suggests a new form of evaluation. It argues that for summaries to be truly useful within data mining, they must include concepts abstracted from the text in addition to sentences extracted from the text. The paper uses two news articles to illustrate its points.

1 Introduction

To summarize a piece of writing is to present the main points in a concise form. Work on automated text summarization began over 40 years ago [1]. The growth of the Internet invigorated this work in recent years [2], and summarization systems are beginning to be applied in areas such as healthcare and digital libraries [3]. Several commercially available text summarizers are now on the market. Examples include Capito from Semiotis, Inxight’s summarizer, the Brevity summarizer from LexTek International, the Copernic summarizer, TextAnalyst from Megaputer, and Whiskey™ from Converspeech. These programs work by automatically extracting selected sentences from a piece of writing.

A true summary succinctly expresses the gist of a document, revealing the essence of its content. This paper examines the use of text summarization within data mining for uncovering interesting and unexpected information. It describes the current state of the art in summarization systems and current approaches to the evaluation of summarizers. The paper then proposes a new model for text summarization and suggests a new form of evaluation. It argues that for summaries to be truly useful within data mining, they must include concepts abstracted from the text in addition to sentences extracted. Such summarizers offer a potential not yet exploited in data mining.

2 Summarizers in Data Mining

Much of the information crucial to an organization exists in the form of unstructured text data. That is, the information does not reside in a database with well-defined


methods of organization and access, but is expressed in natural language and is contained within various documents such as web pages, e-mail messages, and other electronic documents. The process of identifying and extracting valuable information from such data repositories is known as text data mining. Tools to do the job must go beyond simple keyword indexing and searching. They must determine, at some level, what a document is about.

2.1 Text Data Mining

Keyword indexing and searching can provide a specific answer to a specific question, such as “What is the deepest lake in the United States?” with the answer being found in a piece of text such as: “Crater Lake, at 1,958 feet (597 meters) deep, is the seventh deepest lake in the world and the deepest in the United States.”

Keyword indexing and searching can also provide answers to more complex questions, such as “What geological processes formed the three deepest lakes in the world?” Several sources will probably have to be consulted, their information fused, and interpretations made (what counts as a geological process versus a human intervention), and conclusions drawn. But standard keyword indexing and searching will probably suffice to find the pieces of text needed.

Text data mining goes beyond question answering. It seeks to uncover interesting and useful patterns in large repositories of text, answering questions that may not yet have been posed. The focus is on discovery, not simply finding what is sought. The focus is on uncovering unexpected content in text and unexpected relationships between pieces of text, not simply the text itself.

2.2 Summaries in Text Data Mining

Summaries aid text data mining in at least the following ways:
- An information analyst—whether a social scientist, a member of the intelligence community, or a market researcher—uses summaries to guide her examination of data repositories that are so large she cannot possibly read everything or even browse the repository adequately. Summaries suggest what documents should be read in their entirety, which should be read together or in sequence, and so on.
- Summaries of the individual documents in a collection can reveal similarities in their content. The summaries then form the basis for clustering the documents or categorizing them into specified groups. Applications include Internet portal management, evaluating free-text responses to survey questions, help-desk automation for responses to customer queries, and so on. The very process of categorizing or clustering two document summaries into the same group can reveal an unexpected relationship between the documents.
- The summary of a collection of related documents taken together can reveal aggregated information that exists only at the collection level. In biomedicine, for example, Swanson has used summaries together with additional information-extraction techniques to form a new and interesting clinical hypothesis [4].

An interesting and significant form of indeterminacy creeps into summarization. It results from the inherent indeterminacy of meaning in natural language. Summaries,


whether produced by a human abstractor or a machine, are generally thought to be good if they capture the author’s intent, that is, succinctly present the main points the author intended to make. (There are other kinds of summarization in which sentences are extracted relative to a particular topic, a technique that is a form of information extraction.) So-called neutral summaries, however, those that aim to capture the author’s intent, can succeed only to the extent that the author had a clear intent and expressed it adequately. What if the author had no clear intent or was an inadequate writer? Poor writers abound, and most short written communications, such as e-mail messages or postings to electronic bulletin boards, are messy in content and execution. Do automated text summarizers reveal anything useful in these cases? If the summarization technique is itself valid, the answer is that the summary reveals what the piece of text is really about, whether the author intended it or not.

Various studies have explored the indeterminacy of meaning in language, and the extent to which meaning depends on the context in which language is used [5, 6]. Author’s intent does not bound meaning nor fully determine the content of a document. When documents are pulled together and their collective content is examined, there generally is no single author anyway whose intent could dominate.

An automated summarizer that reveals what a text is really about, independent of authorial intent, is a powerful tool in data mining. It has the potential to reveal new and interesting information in a document or a collection of documents.

The pressing and practical concern is how to evaluate any given summarizer; that is, how do we know whether or not it produces good summaries? What counts as a good summary, and does that judgment depend on the purpose the summary is to serve? Within data mining, for example, summaries that revealed unexpected content or unexpected relationships between documents would be of the greatest value. The next section looks at current work in summarization evaluation.

3 Evaluating Summarizers

A group representing academic, U.S. government, and commercial interests has been working over the past few years to draw up guidelines for the design and evaluation of summarization systems. This work arose out of the TIDES Program (Translingual Information Detection, Extraction, and Summarization) sponsored by the Information Technology Office (ITO) of the U.S. Defense Advanced Research Projects Agency (DARPA). In a related effort, the National Institute of Standards and Technology (NIST) of the U.S. government has initiated a new evaluation series in text summarization. Called the Document Understanding Conference (DUC), this initiative has resulted in the production of reference data—documents and summaries—for summarizer training and testing. (See http://www-nlpir.nist.gov/projects/duc/ for further information.)

A key task accomplished by these initiatives was the compilation of sets of test documents. In these sets the important sentences (or sentence fragments) within each document are annotated by human evaluators, and/or for each document, human-generated abstracts of various lengths are provided.

An early example of a summary data set consisted of eight news articles published by seven news providers, New York Times, CNN, CBS, Fox News, BBC, Reuters,


and Associated Press, on June 3rd, 2000, the eve of the meeting between Presidents Clinton and Putin. One of these articles is used below.

Several approaches to summarizer evaluation have been identified. They include:
- Using the annotated sentences. For each document, counting how many of the annotated (i.e., important) sentences are included in the summary. A simple measure of percent agreement can be applied, or the traditional measures of recall and precision.1
- Using the abstracts. Counting how many of the sentences in the human-generated abstracts are represented by sentences in the summaries. A simple measure of percent agreement can be applied, or the traditional measures of recall and precision.
- Using a question-answering task. For a given set of pre-determined questions, counting how many of the questions can be answered using the summary. The more questions that can be answered, the better the summary.
- Using the utility method of Radev [7].
- Using the content-based measures of Donaway [8].
To illustrate a simple evaluation, consider the following test document. The underlined sentences are those considered important by the human evaluators.

CLINTON TAKES STAR WARS PLAN TO RUSSIA
US president Bill Clinton has arrived in Moscow for his first meeting with Russia’s new president Vladmir Putin. The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday.
High on the agenda will be the United State’s plans to build a missile shield in Alaska. Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 which bans any anti-missile devices.
Clinton—in his last few months of office and keen to make his mark in American history—will be seeking to secure some sort of concession from Putin.
The Russian leader has said that he will suggest an alternative to the US system. Kremlin officials said Putin would propose a system that would shoot down the missiles with interceptors shortly after they were fired rather than high in their trajectory.
“We’ll talk about it in Russia,” Clinton told reporters before leaving Berlin for Moscow. “It won’t be long now.” Accompanying the President is US Secretary of State Madeline Albright. “What’s new is that Putin is signalling that he is open to discuss it, that he is ready for talks,” she said. “We will discuss it.”
Arms control will not be the only potentially troublesome issue. US National Security Adviser Sandy Berger said last week Clinton would raise human rights and press freedom.

1 Recall refers to the number of annotated sentences correctly included in the summary, divided by the total number of annotated sentences. Precision refers to the number of annotated sentences correctly included in the summary, divided by the total number of sentences (correctly or incorrectly) included in the summary.

Here is an automatically generated summary of this text:


CLINTON TAKES STAR WARS PLAN TO RUSSIA
US president Bill Clinton has arrived in Moscow for his first meeting with Russia’s new president Vladmir Putin.
The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday.
High on the agenda will be the United State’s plans to build a missile shield in Alaska.
Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 which bans any anti-missile devices.
Clinton—in his last few months of office and keen to make his mark in American history—will be seeking to secure some sort of concession from Putin.

This extraction summary has three of the five important sentences, and of its six sentences (including the heading) three are considered important. Simple recall and precision figures of 60% and 50% result.
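Spelled out with the recall and precision definitions given in the footnote above (this is just the arithmetic behind the figures quoted, written in LaTeX notation):

\[
\text{recall} = \frac{\text{annotated sentences included in the summary}}{\text{total annotated sentences}} = \frac{3}{5} = 60\%,
\qquad
\text{precision} = \frac{\text{annotated sentences included in the summary}}{\text{total sentences in the summary}} = \frac{3}{6} = 50\%.
\]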

All the current evaluation approaches assume that a summary is produced by extracting sentences. Are there other ways to think about summarization? Are there also new ways to think about evaluating summarizers? This author would argue yes, particularly in the context of data mining. In data mining, we are interested in discovering, not merely finding, information. We may need to dig beneath the surface of a text to make such discoveries. §6 returns to these questions, after a brief review of the state of the art in text summarization and presentation of a new model for summarization.

4 Text Summarization: The State of the Art

Current summarizers work by extracting key sentences from a document. As yet, there is no summarizer on the market or even within the research community that truly fuses information to create a set of new sentences to represent the document’s content. In general, summarizers simply extract sentences. They differ in the methods they use to select those sentences. There are two main kinds of methods involved, which may be used separately or in combination:
1. Heuristic methods, based largely on insight into how human, professional abstractors work. Many of these heuristics exploit document organization. So, for example, sentences in the opening and closing paragraphs are more likely to be in the summary. Some heuristics exploit the occurrence of cue phrases such as “in conclusion” or “important.”
2. Methods based on identifying key words, phrases, and word clusters. The document is analyzed using statistical and/or linguistic techniques to identify the words, phrases or word clusters that by their frequency and co-occurrence are thought to represent the content of the document. Then sentences containing or related to these words and phrases are selected (a minimal sketch of this kind of frequency-based selection appears below).
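The sketch below, in Java, shows the simplest possible version of the second kind of method: score every sentence by how often its words occur in the whole document and extract the top-scoring sentences. It is an illustrative assumption of how such a method can work, not the algorithm of any of the commercial products named above.

import java.util.*;

// Minimal frequency-based extraction summarizer (illustrative only).
public class FrequencySummarizer {

    public static List<String> summarize(String document, int sentencesWanted) {
        String[] sentences = document.split("(?<=[.!?])\\s+");

        // Count how often each longer word occurs in the whole document.
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String word : document.toLowerCase().split("\\W+")) {
            if (word.length() > 3) {                   // crude stand-in for stop-word removal
                Integer n = freq.get(word);
                freq.put(word, n == null ? 1 : n + 1);
            }
        }

        // Score each sentence as the sum of its words' document frequencies.
        final Map<String, Integer> score = new HashMap<String, Integer>();
        for (String s : sentences) {
            int total = 0;
            for (String word : s.toLowerCase().split("\\W+")) {
                Integer n = freq.get(word);
                if (n != null) total += n;
            }
            score.put(s, total);
        }

        // Keep the highest-scoring sentences, presented in their original order.
        List<String> ranked = new ArrayList<String>(Arrays.asList(sentences));
        Collections.sort(ranked, new Comparator<String>() {
            public int compare(String a, String b) { return score.get(b) - score.get(a); }
        });
        Set<String> top = new HashSet<String>(ranked.subList(0, Math.min(sentencesWanted, ranked.size())));
        List<String> summary = new ArrayList<String>();
        for (String s : sentences) {
            if (top.contains(s)) summary.add(s);
        }
        return summary;
    }
}

Even a scorer this naive shows why key-word methods differ from concept identification: the selection is driven entirely by surface word counts, so a concept that is never named in the text can never surface in the summary.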

The techniques commercial summarizers use to identify key words and phrases are often proprietary and can only be inferred from the extracted sentences. What is readily seen, however, is whether or not the method identifies concepts in the text. Concepts are expressed using words and phrases that may or may not appear within the text. Concept identification as opposed to key word and phrase identification is a


cial differentiating factor between summarizers. Summaries that contain true abstrac-tions from the text are more likely to reveal unexpected, sometimes hidden, informa-tion within documents and surprising relationships between documents. A true ab-straction summarizer can be a powerful tool for text data mining.

It is important from a scientific point of view to devise objective measures to evaluate summarizers. However, given that the output of a summarizer is itself natural-language text, some human judgment is inescapable. The DUC initiative relies heavily on human evaluators.

Based on informal testing of several dozen documents of various kinds—business and marketing documents (regulatory filing, product description, business news article), personal communications (fax, e-mail, letter), non-technical pieces (long essay, short information piece, work of fiction), scientific articles, and several documents that pose specific challenges (threaded bulletin board messages, enumerations in text, program code in text)—what follows is an intuitive judgment of the state of commercially available summarizers.

Current summarizers are able to produce adequate sentence-extraction summaries of articles that have the following characteristics:

- The article is well written and well organized.
- It is on one main topic.
- It is relatively short (600-2,000 words).
- It is informational, for example, a newspaper article or a technical article in an academic journal. It is not a work of the imagination, such as fiction, or an opinion piece or general essay.
- It is devoid of layout features such as enumerations, indented quotations, or blocks of program code. (Although some summarizers use heuristics that take headings into account, for example, summarizers typically ignore or strip a document of most of its layout features.)

Some summarizers perform limited post-processing “smoothing” on the sentences they list in an attempt to give coherence and fluency to the summary. This post-processing includes:

Removing inappropriate connecting words and phrases. If a sentence in the document begins with a connecting phrase—for example, “Furthermore” or “Although”—and that sentence is selected for the summary, the connecting phrase must be removed from the summary because it probably no longer plays the connecting role it was meant to. (A sketch of this step follows the discussion below.)

Resolving anaphora and co-reference. When a sentence is selected for inclusion in the summary, the pronouns (and other referring phrases) in it have to be resolved. That is, the summarizer has to make clear what that pronoun (or referring phrase) refers to. For example, suppose a document contains the following sentences: “Newcompany Inc. has recently reported record losses. If it continues to lose money, it risks strong shareholder reaction. The company yesterday announced new measures to…”

If the second sentence is selected for the summary, the word “it” has to be resolved to refer to Newcompany Inc.; otherwise, the reader will have no idea what “it” is, and will naturally relate “it” to whatever entity is named in the preceding sentence of the summary. Ideally, the summary sentence should appear as: “If it [Newcompany Inc.] continues to lose money, it risks strong shareholder reaction.”

If the third sentence is selected for the summary, the phrase “the company” should similarly be identified as referring to Newcompany Inc. and not any other company that may be named in a preceding sentence of the summary. Anaphoric and co-reference resolution is very difficult; not surprisingly, current commercial summarizers incorporate very few, if any, of these techniques.

Current research into summarization has a strong emphasis on post-processing techniques.
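A minimal sketch of the first kind of smoothing, removing a leading connecting phrase from an extracted sentence; the connective list is illustrative only, and nothing here attempts the much harder anaphora resolution step:

```python
import re

# Connecting phrases that lose their role once a sentence is lifted out of context.
LEADING_CONNECTIVES = ("furthermore", "however", "moreover", "in addition", "although")

def smooth_extracted_sentence(sentence: str) -> str:
    """Drop a leading connective (and any trailing comma) and re-capitalize."""
    for connective in LEADING_CONNECTIVES:
        pattern = rf"^{connective}\b,?\s+"
        if re.match(pattern, sentence, flags=re.IGNORECASE):
            rest = re.sub(pattern, "", sentence, count=1, flags=re.IGNORECASE)
            return rest[:1].upper() + rest[1:]
    return sentence

print(smooth_extracted_sentence("Furthermore, the company reported record losses."))
# -> The company reported record losses.
```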

5 A New Model for Summarization

The standard model of summary production is represented by the sequence shown in Figure 1.

Input the text document → Identify key words and phrases in the text → Extract sentences containing those words and phrases → Perform post-processing “smoothing” on the extracted sentences

Fig. 1. Standard model of summary production

What if the summarizer is able to identify key concepts and not just key words and phrases? Not only can the key concepts by themselves stand as an encapsulated summary of the document, concepts can provide a better basis for selecting sentences to be extracted. An enhanced model results, as depicted in Figure 2.

A summary that provides information not immediately evident from a surface reading of the text is of potentially great value in data mining. To test the assumption that concepts can provide a better basis for selecting sentences to be extracted, and to understand the significance of this enhanced model, the action of three different commercially available summarizers on the news article in Appendix I is considered. The first summarizer simply produces sentences. The second additionally displays key words and phrases from the text along with the extracted sentences. The third, ConverSpeech’s Whiskey, abstracts concepts and uses those concepts to extract sentences.

Input the text document → Identify key concepts expressed in the text → (a) Present key concepts as encapsulated summary; (b) Extract sentences and do post-processing “smoothing”

Fig. 2. Enhanced, concept-based, model of summary production
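A minimal sketch of the enhanced pipeline of Fig. 2, assuming a small hypothetical concept lexicon that maps surface words onto shared concepts; it is not ConverSpeech's method, which the text does not disclose:

```python
import re

# Hypothetical lexicon: surface words mapped to the concept they express.
CONCEPT_LEXICON = {
    "company": "business", "firm": "business", "corporation": "business",
    "stock": "capital", "share": "capital", "money": "capital",
}

def summarize_with_concepts(text: str, top_n: int = 3):
    """Return (concepts, extracted sentences), the two outputs shown in Fig. 2."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    concepts = {CONCEPT_LEXICON[w] for w in words if w in CONCEPT_LEXICON}

    def score(sentence: str) -> int:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(1 for t in tokens if CONCEPT_LEXICON.get(t) in concepts)

    top = sorted(sentences, key=score, reverse=True)[:top_n]
    return sorted(concepts), [s for s in sentences if s in top]
```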

The first summarizer produced the following sentences extracted from the text:


About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area’s high-tech companies.

Five local companies were already bumped off last year, and a sixth – PlanetRx.com Inc., a former South San Francisco health care company – was just delisted.

“The whole Internet market crashed down, and we’re rolling with it,” says Peter Friedman, CEO of Talk City Inc., a company that could get kicked off Nasdaq if it doesn’t boost its stock price soon.

With stock prices down and the economy slowing, companies are falling short of the standards Nasdaq sets for its some 3,802 companies.

While the listing standards are arcane, the most obvious cardinal sin in the eyes of Nasdaq’s regulators is simple: The fall of a company’s stock price below $1 for 30 consecutive trading days.

The second summarizer produced the following key words and phrases and sentences extracted from the text:

Nasdaq, stock, delisting, firms, investors, stock price, stock exchange, San, officer, Edison

It’s the company version of the pink slip in the mail – get your act together, or you’re fired from Nasdaq.

About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area’s high-tech companies.

Five local companies were already bumped off last year, and a sixth – PlanetRx.com Inc., a former South San Francisco health care company – was just delisted.

While the delisting doesn’t have to mean the game is over, it relegates companies to the junior and less reputable leagues of the stock exchange world, where it’s much harder to raise money.

“The whole Internet market crashed down, and we’re rolling with it,” says Peter Friedman, CEO of Talk City Inc., a company that could get kicked off Nasdaq if it doesn’t boost its stock price soon.

Once booted, companies usually end up in the netherworlds of the stock market, where only a few brave investors venture.

This exchange doesn’t require firms to register with the Securities and Exchange Commission or even file financial statements.

“We’re working on strategic partnerships that will have a major impact on the stock,” says Nadyne Edison, chief marketing officer for the company.

The third summarizer produced a high-level abstraction, a listing of the key concepts expressed in the text, and a list of extracted sentences. The number after each sentence is its score, calculated on the basis of how many occurrences of the words in the concept list appear in the sentence, optionally normalized for sentence length. Those sentences that receive the top 75% of scores are selected for inclusion. This percentage is set as a parameter.
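A minimal sketch of this scoring scheme, assuming an illustrative concept word list; the cut-off used here (keep sentences scoring at least 75% of the best score) is only one plausible reading of the top-75% parameter described above:

```python
import re

def score_and_select(sentences, concept_words, cutoff=0.75, normalize=False):
    """Score sentences by occurrences of concept words, optionally normalized
    for sentence length, and keep the high-scoring ones."""
    scored = []
    for s in sentences:
        tokens = re.findall(r"[a-z']+", s.lower())
        score = sum(1 for t in tokens if t in concept_words)
        if normalize and tokens:
            score = score / len(tokens)
        scored.append((score, s))
    best = max((score for score, _ in scored), default=0)
    return [s for score, s in scored if best and score >= cutoff * best]
```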

Notice that the concept of business has been extracted from the text even though the word “business” appears only once in the text, in the last sentence. Note also that the word “time” also does not occur frequently in the text but the concept of time does.

time, business, capital, company, Nasdaq, stock, day, working, share

About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area’s high-tech companies. (378)

Five local companies were already bumped off last year, and a sixth—PlanetRx.com Inc., a former South San Francisco health care company—was just delisted. (352)

Nationwide, Nasdaq has either sent notices or is close to notifying at least 200 other companies, many of whom offered stocks to the public for the first time last year. (368)

While the delisting doesn’t have to mean the game is over, it relegates companies to the junior and less reputable leagues of the stock exchange world, where it’s much harder to raise money. (400)

“The whole Internet market crashed down, and we’re rolling with it,” says Peter Friedman, CEO of Talk City Inc., a company that could get kicked off Nasdaq if it doesn’t boost its stock price soon. (438)

While the listing standards are arcane, the most obvious cardinal sin in the eyes of Nasdaq’s regulators is simple: the fall of a company’s stock price below $1 for 30 consecutive trading days. (404)

Autoweb.com Inc., a Santa Clara Internet company that specializes in auto consumer services, has about 40 days left under the 90-day rule, but is busy scrambling to avoid a hearing. (390)

The three summaries have three sentences in common, and the third summary has one additional sentence in common with each of the first and second summaries.

6 A New Evaluation Method

How good are these three summaries and the summarizers that produced them? Any of the evaluation methods mentioned in §3 could be applied to assess the value of the extracted sentences. However, in the context of data mining, a new evaluation method for summarizers is proposed here. It asks the following question: How sensitive is a summarizer to surface perturbations in the text, such as in word choice or sentence order?

Specifically, this method asks what happens if synonyms are substituted for words and phrases in the text. Does the summarizer give a different summary, selecting sentences that differ markedly in content from the previously selected ones? Similarly, if the order of some of the sentences is changed, does that markedly alter what gets identified as key sentences?

This test gives a good indication of the robustness of the summarizer and the soundness of the methods used to identify the content of the document. If simple changes in word choice or sentence order produce different summaries, it could be argued that the summarizer is not getting at the core of the document’s content.
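A minimal sketch of this perturbation test, assuming some summarize(text) function (such as the extractors sketched earlier) and an illustrative synonym map; the stability measure is a simple sentence overlap, not a measure prescribed by the paper:

```python
import re

def perturb(text: str, substitutions: dict) -> str:
    """Apply whole-word synonym substitutions to the text."""
    for old, new in substitutions.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    return text

def stability(summarize, text: str, substitutions: dict) -> float:
    """Fraction of originally extracted sentences still selected after perturbation.
    The original sentences are perturbed too, so the comparison is like for like."""
    before = {perturb(s, substitutions) for s in summarize(text)}
    after = set(summarize(perturb(text, substitutions)))
    return len(before & after) / len(before) if before else 1.0

# Example, mirroring the first test on the Nasdaq article:
# stability(extract_summary, article_text, {"company": "firm", "companies": "firms"})
```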

The news article in Appendix I uses the words “firm” and “company” interchangeably, with 23 occurrences of “company” (the more familiar word) and four occurrences of “firm.” If we substitute “firm” for “company” in key sentences in the text, what happens? Two tests were performed. In the first, two substitutions were made (shown here as “original word → substitute”): About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area’s high-tech companies → firms. Five local companies → firms were already bumped off last year, and a sixth—PlanetRx.com Inc., a former South San Francisco health care company—was just delisted.

In the second test there were three additional substitutions: “The whole Internet market crashed down, and we’re rolling with it,” says Peter Friedman, CEO of Talk City Inc., a company → firm that could get kicked off Nasdaq if it doesn’t boost its stock price soon. … With stock prices down and the economy slowing, companies → firms are falling short of the standards Nasdaq sets for its some 3,802 companies → firms.

The results obtained were as follows:

First Summarizer. With the first round of substitutions, one sentence from the original summary was removed and a different sentence was inserted.

Out: Five local firms were already bumped off last year, and a sixth – PlanetRx.com Inc., a former South San Francisco health care company – was just delisted.

In: When that happens, Nasdaq sends a notice giving the company 90 calendar days to get the stock price up again.

With the second round of substitutions, another sentence from the original summary was removed and a different sentence from the text inserted.

Out: With stock prices down and the economy slowing, firms are falling short of the standards Nasdaq sets for its some 3,802 firms.

In: If a company sold things on the Web—cars, pet food, you name it—it was almost guaranteed a spot on the stock exchange.

Second Summarizer. The two rounds of substitutions produced only one change—the removal of the following sentence after the second round of substitutions:

Out: This exchange doesn’t require firms to register with the Securities and Exchange Commission or even file financial statements.

Third Summarizer. The two rounds of substitutions produced the same sentences. The word “firm” was added to the list of concepts for both rounds.

To further test the second and third summarizers, which appeared somewhat equally robust, they were run on two more versions of the article with several further substitutions of “firm” for “company.” Both summarizers produced stable sets of sentences for these changes: the second summarizer retained the same altered set of sentences as for the other substitutions, and the third summarizer continued to select the same sentences throughout.

These two summarizers were also run on the Clinton/Putin test article given earlier, and on two variations of that article. The first variation was obtained by substituting “anti-missile device” for the following phrases which, in the context, were synonymous with “anti-missile device”: “missile shield,” “shield,” and “system.”

High on the agenda will be the United State’s plans to build a missile shield → an anti-missile device in Alaska. Russia opposes the shield → anti-missile device as it contravenes a pact signed by the two countries in 1972 that bans any anti-missile devices. … The Russian leader has said that he will suggest an alternative to the US system → anti-missile device.

A second variation was obtained by further substituting the presidents’ last names (“Putin” and “Clinton” respectively) for the referring expressions “the Russian leader” and “the President” in the following sentences:

The Russian leader → Putin has said that he will suggest an alternative to the US anti-missile device. … Accompanying the President → Clinton is US Secretary of State Madeline Albright.

These were the results obtained.

Second Summarizer. For the original article, the following words and phrases and extracted sentences had been produced:

Clinton, Putin, Russia, president, Moscow, STAR WARS PLAN, missile shield, business, informal dinner, heads

CLINTON TAKES STAR WARS PLAN TO RUSSIA
US president Bill Clinton has arrived in Moscow for his first meeting with Russia’s new president Vladmir Putin.
The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday.
High on the agenda will be the United State’s plans to build a missile shield in Alaska.
Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 which bans any anti-missile devices.
Clinton—in his last few months of office and keen to make his mark in American history—will be seeking to secure some sort of concession from Putin.

The following lists of key words and phrases were produced for the two altered versions of the article:

Clinton, Putin, Russia, president, anti-missile device, Moscow, STAR WARS PLAN, business, informal dinner, heads

Clinton, Putin, Russia, anti-missile device, Moscow, president Bill Clinton, STAR WARS PLAN, business, informal dinner, heads

For both of the two altered versions, the following sentence was dropped from the summary, with no other sentence being substituted:

Out: High on the agenda will be the United State’s plans to build an anti-missile device in Alaska.

Third Summarizer. The same set of sentences was extracted for the original article and the two variations. The following listing of abstracted concepts preceded the original summary. It is notable that the concept of country was identified as significant in the article even though the word “country” does not itself appear in the text.

state, business, president, Putin (Vladmir Putin), US, Clinton (Bill Clinton), country, missile (missile shield, missile devices), system


The concepts abstracted for both of the two altered versions of the article were the following:

state, business, president, Putin (Vladmir Putin), US, Clinton (Bill Clinton), device, missile (missile shield, missile devices)

The sentences extracted for all three versions of the article were as follows (with scores omitted):

US president Bill Clinton has arrived in Moscow for his first meeting with Russia’s new president Vladmir Putin.

The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday.

High on the agenda will be the United State’s plans to build a missile shield in Alaska.

Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 which bans any anti-missile devices.

Clinton – in his last few months of office and keen to make his mark in American history – will be seeking to secure some sort of concession from Putin.

Kremlin officials said Putin would propose a system that would shoot down the missiles with interceptors shortly after they were fired rather than high in their trajectory.

“What’s new is that Putin is signalling that he is open to discuss it, that he is ready for talks,” she said.

US National Security Adviser Sandy Berger said last week Clinton would raise human rights and press freedom.

Once again, the second summarizer, while not as stable as the third, concept-based, summarizer, did perform with relative robustness. Only one sentence was eliminated from the summaries for the two versions of the article containing substitutions for synonymous terms.

However, the second and third summarizers differed markedly in their behavior when they were tested on articles that had re-ordered sentences. To illustrate, the same Clinton/Putin article was used. (Similar results were obtained with the news story on the Nasdaq delistings.) The Clinton/Putin article was rearranged to begin at the following sentence, with the displaced first two paragraphs tacked on at the end (see Appendix II): Clinton—in his last few months of office and keen to make his mark in American history—will be seeking to secure some sort of concession from Putin.

These were the results obtained.

Second Summarizer. It selected the following key words and phrases for the permuted article. There were seven in common with the original summary, and three that were different (“Albright,” “Russian leader,” “concession”):

Clinton, Putin, Russia, president, STAR WARS PLAN, missile shield, State Madeline Albright, Moscow, Russian leader, concession.

The real limitation of the summarizer, however, is revealed in the sentences it selected for extraction. It had only four sentences in common with the original summary, eliminating two and adding three different ones:


Out: The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday.
Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 which bans any anti-missile devices.

In: The Russian leader has said that he will suggest an alternative to the US system.
“We’ll talk about it in Russia,” Clinton told reporters before leaving Berlin for Moscow.
Accompanying the President is US Secretary of State Madeline Albright.

This summarizer most likely uses a heuristic that is commonly employed in summarizing algorithms. The heuristic gives greater weight to a sentence the nearer it is to the beginning of the article. (A variation of this heuristic assigns a greater weight only to the first sentence of the article or the sentences in the first paragraph.) However, there is something fundamentally mistaken about over-reliance on this heuristic, even though it may improve results under some of the other evaluation methods. A sentence is placed at the beginning of an article because it is important. It is not important because it is at the beginning of an article. Over-reliance on the heuristic confuses these two points.
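A minimal sketch of the kind of position heuristic described here; the decay rate is an illustrative choice, not taken from any particular summarizer:

```python
def position_weight(sentence_index: int, decay: float = 0.05) -> float:
    """Weight that shrinks the further a sentence sits from the start of the article."""
    return 1.0 / (1.0 + decay * sentence_index)

def combined_score(content_score: float, sentence_index: int) -> float:
    """Combine a content score with the position weight, favouring early sentences."""
    return content_score * position_weight(sentence_index)
```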

Third Summarizer. In marked contrast, the concept-based summarizer produced exactly the same results for the permuted article (and all other permuted articles it was tested on).

The alterations in the summaries produced by the second summarizer, resulting simply from sentence reordering, suggest that the summarizing technique lacks robustness. Similarly, the alterations in the summaries produced by the first summarizer, resulting simply from synonym substitution, also suggest a lack of robustness. What is essentially different about the third summarizer is that it abstracts from the words and phrases that appear in the text, and relies on those abstracted concepts to extract sentences.

7 Conclusion

To capture the essence of a document, regardless of authorial intent, a summarizer must do more than identify key words and phrases in the text and extract sentences on that basis. It must also identify concepts expressed in the text. Summarizers that offer this level of abstraction appear to get at the essence of a text more reliably, showing a greater tolerance for superficial changes in the input text. Such summarizers are potentially powerful tools in data mining, uncovering information that lies beneath the surface of the words and phrases of the text.


Appendix I: News Article

20 area firms face delisting by Nasdaq, by Matt Marshall, Jan. 24, 2001. Copyright © 2001 San Jose Mercury News. All rights reserved. Reproduced with permission. Use of this material does not imply endorsement of the San Jose Mercury News.

It’s the company version of the pink slip in the mail—get your act together, or you’re fired from Nasdaq.

About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area’s high-tech companies. Five local companies were already bumped off last year, and a sixth—PlanetRx.com Inc., a former South San Francisco health care company—was just delisted.

Nationwide, Nasdaq has either sent notices or is close to notifying at least 200 other companies, many of whom offered stocks to the public for the first time last year.

While the delisting doesn’t have to mean the game is over, it relegates companies to the junior and less reputable leagues of the stock exchange world, where it’s much harder to raise money. For shareholders, a Nasdaq delisting sounds like a chilling death knell – the value of their stock could all but implode. Some delisted companies, like Pets.com, simply close their doors.

“The whole Internet market crashed down, and we’re rolling with it,” says Peter Friedman, CEO of Talk City Inc., a company that could get kicked off Nasdaq if it doesn’t boost its stock price soon. “The emotion was too much. Things just snapped.”

This round of delistings is the ignominious end to a year of decadence now coming back to haunt us.

Most of these companies had no profits, and many had hardly any sales, when investor enthusiasm created a wave of new stock offerings last year. If a company sold things on the Web—cars, pet food, you name it—it was almost guaranteed a spot on the stock exchange.


But in less than a year, many of the same investors have abandoned their former darlings. With stock prices down and the economy slowing, companies are falling short of the standards Nasdaq sets for its some 3,802 companies.

While the listing standards are arcane, the most obvious cardinal sin in the eyes of Nasdaq’s regulators is simple: The fall of a company’s stock price below $1 for 30 consecutive trading days.

When that happens, Nasdaq sends a notice giving the company 90 calendar days to get the stock price up again. If it fails to do so—for 10 consecutive days—the firm has one last resort: an appeal to Nasdaq.

That involves a trek to Washington, D.C., and a quick hearing at a room in the St. Regis Hotel, where Nasdaq’s three-person panel grills executives. Unless there’s good reason to prolong the struggle, the company’s Nasdaq days are over.

Once booted, companies usually end up in the netherworlds of the stock market, where only a few brave investors venture.

First, it’s the Over The Counter Bulletin Board, which is considerably more risky and yields lower return to investors. However, even the OTCBB has requirements.

Failing that, the next step down is the so-called Pink Sheets, named for the color of the paper they used to be traded on. This exchange doesn’t require firms to register with the Securities and Exchange Commission or even file financial statements.

“They’re the wild, wild West,” says Nasdaq spokesman Mark Gundersen.

Autoweb.com Inc., a Santa Clara Internet company that specializes in auto consumer services, has about 40 days left under the 90-day rule, but is busy scrambling to avoid a hearing.

“We’re working on strategic partnerships that will have a major impact on the stock,” says Nadyne Edison, chief marketing officer for the company. On Tuesday, Edison was in Detroit, busy opening a new office near the nation’s auto capital. Edison says the firm is considering moving its headquarters to Detroit to be nearer its clients.

Other companies that got delisting notices are trying layoffs. Take Mountain View-based Network Computing Devices, which provides networking hardware and software to large companies. Its sales have been pinched as the personal computer industry slows down, so it has laid off people.

“We’ve had to downsize, downsize, downsize,” says Chief Financial Officer Michael Garner.

Women.com, a San Mateo-based Internet site devoted to women, has laid off 25 percent of the workforce recently to avoid delisting. Becca Perata-Rosati, vice president of communications, says the site isn’t being fairly rewarded by Wall Street. The company is the 29th most heavily visited Web site in the world, she says.

One trick that doesn’t seem to work is the so-called “reverse stock split,” which PlanetRx.com tried on Dec. 1. By converting every eight shares into one, PlanetRx.com hoped each share price would be boosted eightfold. But the move was seen by investors as a sign of desperation, and the stock plunged from $1 to 53 cents.

Out of alternatives, PlanetRx didn’t even show up for its hearing with Nasdaq. It is now trading on the OTCBB after a recent move to Memphis and faces an uncertain future.

At least one executive says he doesn’t mind the prospect of going to the OTCBB. Talk City’s Friedman says his company is growing, and expects its 9 million in service fee revenue to double this year. Even if he’s forced off the Nasdaq, he has hopes of returning.


“I’d like to stay on the Nasdaq,” he says. “If we get off, we’ll build a business. Then we’ll go back on.”

Contact Matt Marshall at [email protected] or (408)920-5920.

Appendix II: Permuted Clinton/Putin News Article

CLINTON TAKES STAR WARS PLAN TO RUSSIA

Clinton—in his last few months of office and keen to make his mark in American history—will be seeking to secure some sort of concession from Putin.

The Russian leader has said that he will suggest an alternative to the US system.

Kremlin officials said Putin would propose a system that would shoot down the missiles with interceptors shortly after they were fired rather than high in their trajectory.

“We’ll talk about it in Russia,” Clinton told reporters before leaving Berlin for Moscow. “It won’t be long now.” Accompanying the President is US Secretary of State Madeline Albright. “What’s new is that Putin is signalling that he is open to discuss it, that he is ready for talks,” she said. “We will discuss it.”

Arms control will not be the only potentially troublesome issue. US National Security Adviser Sandy Berger said last week Clinton would raise human rights and press freedom.

US president Bill Clinton has arrived in Moscow for his first meeting with Russia’s new president Vladmir Putin. The two heads of state will meet on Saturday night for an informal dinner before getting down to business on Sunday.

High on the agenda will be the United State’s plans to build a missile shield in Alaska. Russia opposes the shield as it contravenes a pact signed by the two countries in 1972 that bans any anti-missile devices.


Industrial Applications of Intelligent Systems at BTexact

Keynote Address

Behnam Azvine

BTexact—BT Advanced Communication Technology Centre, Adastral Park, Martlesham, Ipswich, IP5 3RE, UK

[email protected]

Abstract. Soft computing techniques are beginning to penetrate into new application areas such as intelligent interfaces, information retrieval and intelligent assistants. The common characteristic of all these applications is that they are human-centred. Soft computing techniques are a natural way of handling the inherent flexibility with which humans communicate, request information, describe events or perform actions.

Today, people use computers as useful tools to search for information and communicate electronically. There have been a number of ambitious projects in recent years, including one at BTexact known as the Intelligent Personal Assistant (IPA), with the aim of revolutionising the way we use computers in the near future. The aim remains to build systems that act as our assistants and go beyond being just useful tools. The IPA is an integrated system of intelligent software agents that helps the user with communication, information and time management. The IPA includes specialist assistants for e-mail prioritisation and telephone call filtering (communication management), Web search and personalisation (information management), and calendar scheduling (time management). Each such assistant is designed to have a model of the user and a learning module for acquiring user preferences. In this talk I focus on the IPA, its components and how we used computational intelligence techniques to develop the system.

Biography. Behnam Azvine holds a BSc in Mechanical Engineering, an MSc and a PhD in Control Systems, all from the University of Manchester. After a number of academic appointments, he joined BTexact Technologies (formerly BT Labs) in 1995 to set up and lead a research programme in intelligent systems and soft computing, and is currently the head of the Computational Intelligence Group at BTexact. He holds the British Computer Society medal for IT for his team’s work on the digital personal assistant project and is a visiting fellow at Bristol University. He has edited a book, contributed to more than 60 publications, has 15 international patents, and regularly gives presentations at international conferences and workshops on the application of AI in telecommunications. Currently, he is the co-chairman of the European Network of Excellence for Uncertainty Techniques (EUNITE). His research interests include the application of soft computing and AI techniques to human-centred computing and adaptive software systems.


Intelligent Control of Wireless and Fixed Telecom Networks

Keynote Address

John Bigham

Department of Electronic Engineering, Queen Mary College, University of London, UK

[email protected]

Abstract. Different agent systems that have been developed for the control of resources in telecoms are studied and assessed. The applications considered include the control of admission to ATM networks, resource management in the wider market place promised by 3G mobile networks, and location-aware services. The designs of these systems have a common pattern that has significant differences from the traditional management and control planes used in telecommunications. Rather, columns of control with communication between them are built. The degree of communication depends on the layers and the corresponding response time. In this talk it will also be shown that such a design provides increased flexibility. This allows the best policy to be applied according to the current demand.

Biography. Dr. Bigham has many years’ experience in applying computational intelligence in telecommunications. He is currently a Reader at Queen Mary College, University of London, researching Artificial Intelligence and Reasoning Under Uncertainty within the Intelligent Systems & Multimedia research group.


Assertions in Programming: From Scientific Theory to Engineering Practice

Keynote Address

Sir Tony Hoare

Microsoft Research, Cambridge, UK

Abstract. An assertion in a computer program is a logical formula (Boolean expression) which the programmer expects to evaluate to true on every occasion that program control reaches the point at which it is written. Assertions can be used to specify the purpose of a program, and to define the interfaces between its major components. An early proponent of assertions was Alan Turing (1948), who suggested their use in establishing the correctness of large routines. In 1967, Bob Floyd revived the idea as the basis of a verifying compiler that would automatically prove the correctness of the programs that it compiled. After reading his paper, I became a member of a small research school devoted to exploring the idea as a theoretical foundation for a top-down design methodology of program development. I did not expect the research to influence industrial practice until after my retirement from academic life, thirty years ahead. And so it has been.

In this talk, I will describe some of the ways in which assertions are now used in Microsoft programming practice. Mostly they are used as test oracles, to detect the effects of a program error as close as possible to its origin. But they are beginning to be exploited also by program analysis tools and even by compilers for optimisation of code. One purpose that they are never actually used for is to prove the correctness of programs. This story is presented as a case study of the way in which scientific research into ideals of accuracy and correctness can find unexpected application in the essentially softer and more approximative tasks of engineering.
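A minimal illustration, not drawn from the talk, of assertions used as internal checks and as a test oracle inside an ordinary routine (written here in Python):

```python
def binary_search(items, target):
    """Return an index of target in the sorted list items, or -1 if absent."""
    assert items == sorted(items), "precondition: items must be sorted"
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == target:
            assert items[mid] == target  # test oracle: the result points at the target
            return mid
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1
```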

Biography. Professor Hoare studied Philosophy, Latin, and Greek at Oxford University in the early fifties, Russian during his National Service in the Royal Navy, and the machine translation of languages as a graduate student at Moscow State University (1959). One outcome of the latter work was the discovery of the Quicksort algorithm. On returning to England in 1960, he worked as a programmer for Elliott Brothers, and led a team in the development of the first commercial compiler for the programming language Algol 60. In 1968, he took up a Chair in Computing Science at the Queen’s University, Belfast. There his output included a series of papers on the use of assertions in program proving. In 1977, he moved to Oxford University, where ‘provable correctness’ was again a focus of his research. Well-known results of this work included the Z specification language, and the CSP concurrent-programming model. Recently, he has been investigating the unification of a diverse range of theories that apply to different programming languages, paradigms, and implementation technologies. Throughout his academic career, Tony has maintained strong contacts with industry, through consultancy, teaching, and collaborative research projects. He has taken a recent interest in legacy systems, where assertions can play an important role in system testing. In 1999, on reaching retirement age at Oxford, Tony moved back to industry as a Senior Researcher with Microsoft Research in Cambridge, England. In March 2000, he received a knighthood from the Queen for services to Computing Science.


Hybrid Soft Computing for Classification and Prediction Applications

Keynote Address

Piero Bonissone

General Electric Corporation, Research and Development Centre, Schenectady, New York

Abstract. Soft computing (SC) is an association of computing methodologies that includes as its principal members fuzzy logic (FL), neural computing (NC), evolutionary computing (EC), and probabilistic computing (PC). These methodologies allow us to deal with imprecise, uncertain data, and incomplete domain knowledge that are encountered in real-world applications. We will describe the advantages of using SC techniques, and in particular we will focus on the synergy derived from the use of hybrid SC systems. This hybridization allows us to integrate knowledge-based and data-driven methodologies to construct models for classification, prediction, and control applications. In this presentation we will describe three real-world SC applications: the prediction of time-to-break margins in paper machines; the automated underwriting of insurance applications; and the development and tuning of raw-mix proportioning controllers for cement plants.

The first application is based on a model that periodically predicts the amount of time left before an unscheduled break of the web in a paper machine. The second application is based on a discrete classifier, which assigns a vector of real-valued and attribute-value inputs, representing an insurance applicant’s vital data, to a rate class, representing the correct insurance premium. The third application is based on a hierarchical fuzzy controller, which determines the correct proportion of the raw material to maintain certain properties in a cement plant.

The similarity among these applications is the common process with which their models were constructed. In all three cases, we held knowledge engineering sessions (to capture the expert knowledge) and we collected, scrubbed and aggregated process data (to define the inputs for the models). Then we encoded the expert domain knowledge using fuzzy rule-based or case-based systems. Finally, we tuned the fuzzy system parameters using either local or global search methods (NC and EC, respectively) to determine the parameter values that minimize prediction, classification, and control errors.

Biography. Dr Bonissone has a BS in Electrical and Mechanical Engineering from the University of Mexico City (1975) and an MS and PhD in Electrical Engineering and Computer Sciences from UC Berkeley (1976 and 1979). He has been a computer scientist at General Electric Corporate Research and Development Centre since 1979, carrying out research in expert systems, approximate reasoning, pattern recognition, decision analysis, and fuzzy sets. In 1993, he received the Coolidge Fellowship Award from General Electric for overall technical accomplishments. In 1996, he became a Fellow of the American Association for Artificial Intelligence (AAAI) and has been the Editor-in-Chief of the International Journal of Approximate Reasoning for the past seven years. He has co-edited four books, including the Handbook of Fuzzy Computation (1998), published over one hundred articles, and registered nineteen patents. He is the 2001 President-Elect of the IEEE Neural Network Council.


Why Users Cannot ‘Get What They Want’

Keynote Address

Ray Paul

Brunel University, UK

Abstract. The notion that users can ‘get what they want’ has caused a planning blight in information systems development, with the resultant plethora of information slums that require extensive and expensive maintenance. This paper will outline why the concept of ‘user requirements’ has led to a variety of false paradigms for information systems development, with the consequent creation of dead systems that are supposed to work in a living organisation. It is postulated that what is required is an architecture for information systems that is designed for breathing, for adapting to inevitable and unknown change. Such an architecture has less to do with ‘what is wanted’, and more to do with the creation of a living space within the information system that enables the system to live.

Biography. Professor Paul spent twenty one years at the London School of Economics (1971-92), as a Lecturer in Operational Research and Senior Lecturer in Information Systems, before taking up a Chair in Simulation Modelling at Brunel University in 1992. He was Head of Department for five years (1993-98) and then appointed Dean of the Faculty of Science in 1999. He has been a Visiting Professor in the Department of Community Medicine, Hong Kong University since 1992 and an Associate Research Fellow in the Centre for Research into Innovation, Culture and Technology (CRICT), Brunel University. He is a Director and Founder of the Centre for Living Information Systems Thinking (LIST) and the Centre for Applied Simulation Modelling (CASM) at Brunel. He has acted as a consultant for various government departments, including Health and Defence, as well as a plethora of commercial organisations and charitable bodies. Ray is co-founder of the European Journal of Information Systems, launched in January 1991, and Chair of the Editorial Board from 1991 to 1999. He is currently on the editorial board of Computers and Information Technology, Journal of Intelligent Systems, Logistics and Information Management, and Journal of Simulation Systems Science and Technology.


Systems Design with the Reverend Bayes

Keynote Address

Derek McAuley

Marconi Research Laboratories, Cambridge, UK

Abstract. A computer viewed as a technological encapsulation of logic is a fine ideal. However, given that the laws of physics actually tell us that computers will make mistakes, roll in “to err is human” in software development and we’re some way from the ideal. Perhaps we might review how probabilistic reasoning, originally expounded in the early 18th century by the Revd Bayes and adopted now as the underpinning of machine learning, could be brought to bear on software and systems design. Tales from the trenches and some views for the future.

Biography. Professor McAuley joined Marconi in January 2001 to establish the new Marconi Labs in Cambridge. He obtained his B.A. in Mathematics from the University of Cambridge in 1982 and his Ph.D., addressing issues in interconnecting heterogeneous ATM networks, in 1989. After a further five years at the University of Cambridge Computer Laboratory as a lecturer, he moved in 1995 to a chair at the University of Glasgow Department of Computing Science. He returned to Cambridge in July 1997 to help found the Cambridge Microsoft Research facility. His research interests include networking, distributed systems and operating systems. Recent work has concentrated on the support of time-dependent mixed media types in both networks and operating systems.


Formalism and Informality in Software Development

Keynote Address

Michael Jackson

Consultant, UK

Abstract. Because the machines we build are essentially formal, we are obliged to formalise every problem to which we apply them. But the world in which the problem exists is rarely, if ever, formal. Formalisation of an informal reality is therefore a fundamental—though somewhat neglected—activity in software development. In this talk some aspects of the formalisation task are discussed, some techniques for finding better approximations to reality are sketched, and some inevitable limitations of formal models are asserted.

Biography. Professor Jackson has over forty years’ experience in the software development industry. He created the JSD method of system development and the JSP method of program design, which became a government standard. He has held visiting chairs at several universities and received various honours, including the Stevens Award, the IEE Achievement Medal, the BCS Lovelace Medal and the ACM SIGSOFT Award for Outstanding Research. He is on the editorial board of four journals: Requirements Engineering, Automated Software Engineering, Science of Computer Programming and ACM Transactions on Software Engineering and Methodology. He now works as an independent consultant and as a part-time researcher at AT&T Research.


An Industrial Perspective on Soft Issues: Successes, Opportunities and Challenges

Panel

Gordon Bell (Chair), Managing Director, Liberty Information Technology

Bob Barbour, Chief Executive, Centre for Competitiveness

Paul McMenamin, Vice President (Belfast Labs), Nortel Networks

Dave Allen, Principal Consultant, Charteris

Maurice Mulvenna, Co-Founder and Chief Executive, LUMIO

Summary. Panel members have been invited to share their experiences of handling soft issues in system design, development and operation—either personal experiences, or those of their organisation. In addition, they have been encouraged to identify a few ‘grand challenge’ areas for future research.


Author Index

Adams, Carl 280
Amor, Nahla Ben 263
Avison, David E. 280
Azvine, Benham 102, 348

Bagai, Rajiv 141
Bell, David 206
Benferhat, Salem 263
Berry, Daniel M. 300
Bigham, John 349
Black, Michaela 74
Bonissone, Piero 352
Bontempi, Gianluca 46

Carvalho, Joao A. 300
Chieng, David 14
Chrysostomou, C. 1
Crangle, Colleen E. 332

Gabrys, Bogdan 232
Golubski, Wolfgang 166

Haenni, Rolf 114
Hickey, Ray 74
Ho, Ivan 14
Hoare, Tony 350
Hong, Jun 60
Hughes, John G. 60

Jackson, Michael 356

Kelley, Shellene J. 141
Kettaf, Fatima Zohra 128
Kryszkiewicz, Marzena 247

Lafruit, Gauthier 46
Lehman, Manny M. 174
Lukacs, Gergely 151

Marshall, Adele 206
Marshall, Alan 14
Martin, Trevor P. 102
McAuley, Derek 355
McClean, Sally 191
McSherry, David 217
Mellouli, Khaled 263

Palmer, Fiona 191
Parr, Gerard 14
Paul, Ray 354
Pitsillides, A. 1

Ramil, J.F. 174
Ramos, Isabel 300
Richard, Gilles 128
Rossides, L. 1
Ruta, Dymitr 232

Scotney, Bryan 191
Sekercioglu, A. 1
Stepankova, Olga 88
Sterritt, Roy 31, 206

Tomayko, Jim 315

Zelezny, Filip 88
Zidek, Jiri 88
Zhu, Jianhan 60