

1

Using Bayesian Networks for Water Quality Prediction in Sydney Harbour

Ann Nicholson

Shannon Watson, Honours 2003

Charles Twardy, Research Fellow

School of Computer Science and Software Engineering

Monash University

2

Overview

Representing uncertainty
Introduction to Bayesian networks
» Syntax, semantics, examples
The knowledge engineering process
Sydney Harbour Water Quality Project 2003
Summary of other BN research

3

Sources of Uncertainty

Ignorance
Inexact observations
Non-determinism
AI representations
» Probability theory
» Dempster-Shafer
» Fuzzy logic

4

Probability theory for representing uncertainty

Assigns a numerical degree of belief between 0 and 1 to facts
» e.g. “it will rain today” is T/F
» P(“it will rain today”) = 0.2 is a prior (unconditional) probability
» P(“it will rain today” | “rain is forecast”) = 0.8 is a posterior (conditional) probability
Bayes’ rule: P(H|E) = P(E|H) P(H) / P(E)
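A quick numeric check of the rule (a minimal sketch in Python; the slide gives P(rain) = 0.2 and a posterior of 0.8, while the forecast likelihood and evidence probability below are invented to match):

    def posterior(p_e_given_h, p_h, p_e):
        # Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)
        return p_e_given_h * p_h / p_e

    # Assumed-for-illustration values: P(forecast|rain) = 0.88, P(forecast) = 0.22
    print(posterior(0.88, 0.2, 0.22))  # -> 0.8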

5

Bayesian networks

A Bayesian network (BN) represents a probability distribution graphically, as a directed acyclic graph
Nodes: random variables
» R: “it is raining”, discrete values T/F
» T: temperature, continuous or discrete variable
» C: colour, discrete values {red, blue, green}
Arcs indicate conditional dependencies between variables
e.g. a joint distribution P(A,S,T) can be decomposed into P(A)P(S|A)P(T|A)

6

Bayesian networks

Conditional Probability Distribution (CPD)
» associated with each variable
» probability of each state given parent states

[Figure: three-node network Flu → Te → Th, where Flu = “Jane has the flu”, Te = “Jane has a high temp”, Th = “thermometer temp reading”. The Flu → Te arc models a causal relationship; the Te → Th arc models possible sensor error.]

P(Flu=T) = 0.05
P(Te=High|Flu=T) = 0.4, P(Te=High|Flu=F) = 0.01
P(Th=High|Te=High) = 0.95, P(Th=High|Te=Low) = 0.1

7

BN inference

Evidence: observation of a specific state
Task: compute the posterior probabilities for query node(s) given the evidence

[Figure: the Flu/Te/Th network queried four ways, illustrating diagnostic, predictive, intercausal, and mixed inference.]
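A concrete sketch of diagnostic inference in Python: compute P(Flu=T | Th=High) by brute-force enumeration over the factorisation P(Flu)P(Te|Flu)P(Th|Te), using the CPT numbers from the previous slide (illustrative only; real BN tools use smarter algorithms that exploit the graph structure):

    # CPTs from the Flu -> Te -> Th example
    p_flu = {True: 0.05, False: 0.95}      # P(Flu)
    p_te_high = {True: 0.4, False: 0.01}   # P(Te=High | Flu)
    p_th_high = {True: 0.95, False: 0.1}   # P(Th=High | Te=High/Low)

    def p_flu_and_th_high(flu):
        # Sum the joint P(flu, Te, Th=High) over the hidden node Te
        total = 0.0
        for te_high in (True, False):
            p_te = p_te_high[flu] if te_high else 1 - p_te_high[flu]
            total += p_flu[flu] * p_te * p_th_high[te_high]
        return total

    evidence = p_flu_and_th_high(True) + p_flu_and_th_high(False)  # P(Th=High)
    print(p_flu_and_th_high(True) / evidence)  # P(Flu=T | Th=High), about 0.18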

8

BN software

Commercial packages: Netica, Hugin, Analytica (all with demo versions)
Free software: SMILE, GeNIe, JavaBayes; see Appendix B, Korb & Nicholson, 2004
Example: running the Netica software

9

Decision networks

Extension to basic BN for decision making
» decision nodes
» utility nodes
EU(Action|E) = Σ_o P(o|Action,E) U(o)
» choose the action with the highest expected utility

Example
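A minimal sketch of the expected-utility calculation in Python (the actions, outcomes, probabilities, and utilities below are invented for illustration, not taken from the slides):

    # EU(action | E) = sum over outcomes o of P(o | action, E) * U(o)
    p_outcome = {  # hypothetical P(o | action, E)
        "close_beach": {"illness": 0.01, "no_illness": 0.99},
        "open_beach":  {"illness": 0.20, "no_illness": 0.80},
    }
    utility = {"illness": -100, "no_illness": 10}  # hypothetical utilities

    def expected_utility(action):
        return sum(p * utility[o] for o, p in p_outcome[action].items())

    best = max(p_outcome, key=expected_utility)  # action with highest expected utility
    print(best, expected_utility(best))  # -> close_beach 8.9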

10

Elicitation from experts

Variables
» important variables? values/states?
Structure
» causal relationships?
» dependencies/independencies?
Parameters (probabilities)
» quantify relationships and interactions?
Preferences (utilities)

11

Expert Elicitation Process

These stages are done iteratively
Stops when further expert input is no longer cost-effective
Process is difficult and time-consuming
Current BN tools
» inference engine
» GUI
Next generation of BN tools?

[Figure: iterative interaction between the domain expert, the BN expert, and the BN tools.]

12

Knowledge discovery

There is much interest in automated methods for learning BNs from data
» parameters, structure (causal discovery)
Computationally complex problem, so current methods have practical limitations
» e.g. limit the number of states, require variable ordering constraints, do not specify all arc directions

Evaluation methods

13

Knowledge Engineering for Bayesian Networks (KEBN)

1. Building the BN
» variables, structure, parameters, preferences
» combination of expert elicitation and knowledge discovery
2. Validation/Evaluation
» case-based evaluation, sensitivity analysis, accuracy testing
3. Field Testing
» alpha/beta testing, acceptance testing
4. Industrial Use
» collection of statistics
5. Refinement
» updating procedures, regression testing

14

The KEBN process

15

Quantitative KE process

16

Water Quality for Sydney Harbour

Water Quality for recreational use

Beachwatch / Harbourwatch Programs

Bacteria samples used as pollution indicators

Many variables influencing bacterial levels: rainfall, tide, wind, sunlight, temperature, pH, etc.

17

Past studies

Hose et al. used a multidimensional scaling model of Sydney Harbour
» low predictive accuracy; unable to handle the noisy bacteria samples; explained 63% of bacteria variability (Port Jackson)
Ashbolt and Bruno:
» agree with Hose et al., plus wind effects, sunlight hours, tide
Crowther et al. (UK):
» rainfall, tide, sampling times, sunshine, wind
» explained 53% of bacteria variability
Other models developed by the US EPA to model estuaries:
» QUAL2E – steady-state receiving water model
» WASP – time-varying dispersion model
» EFDC – 3D hydrodynamic model
The EPA in Sydney is interested in a model applying the causal knowledge of the domain

18

EPA Guidelines

T = rainfall today, Y = rainfall yesterday, DBY = rainfall the day before yesterday (threshold 4 mm); predict whether pollution is Likely or Unlikely:

IF T > 4 THEN Likely
ELSE IF T ≤ 4 AND Y ≤ 4 AND DBY ≤ 4 THEN Unlikely
ELSE IF T ≤ 4 AND Y ≤ 4 AND DBY > 4 THEN Unlikely for 24h flushing, but Likely for 48h flushing
ELSE Likely for all other results
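The guideline reads as a simple decision rule over three days of rainfall. A Python sketch (the 4 mm threshold is from the slide; the ≤/> splits and the flushing-time parameter are reconstructions, since the comparison operators were garbled in the transcript):

    def pollution_guideline(t, y, dby, flushing_hours=24):
        """t, y, dby: rainfall (mm) today, yesterday, and the day before yesterday."""
        if t > 4:
            return "Likely"
        if y <= 4 and dby <= 4:
            return "Unlikely"
        if y <= 4:  # dby > 4: rain two days ago matters only for slow-flushing beaches
            return "Unlikely" if flushing_hours <= 24 else "Likely"
        return "Likely"  # all other results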

19

Stages of Project

Preparation of EPA data (rainfall only)
Hand-craft simple networks for rainfall data
Comparison of hand-crafted networks with a range of learners (using the Weka software)
Using CaMML to learn a BN on the extended data set

[Timeline: 2003 Honours project; 2003/04 summer vacation project.]

20

EPA Data

Database 1:
» E. coli, Enterococci (cfu/100mL), thresholds 150 & 35
» 60 water samples each year since 1994 at 27 sites in Sydney Harbour
» fields: Enterococci, E. coli, raining, sunny, drain running, temperature, time of sample, direction of sampling run, date, site name, beach code
Database 2:
» rainfall readings (mm) at 40 locations around Sydney

21

Data Preparation

New file format: Date, BeachCode, Entc, Ecoli, D1, D2, D3, D4, D5, D6
» D1 = rainfall on the day of collection
» D6 = rainfall 5 days previously
Rainfall data had many missing entries
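A sketch of how the D1..D6 lag columns could be assembled with pandas (the column names are from the slide; the input structures and the gauge-averaged rainfall series are assumptions):

    import pandas as pd

    # samples: one row per water sample, with columns Date, BeachCode, Entc, Ecoli
    # rain: Series of rainfall (mm) indexed by date, already averaged over gauges
    def add_rainfall_lags(samples: pd.DataFrame, rain: pd.Series) -> pd.DataFrame:
        out = samples.copy()
        for lag in range(6):  # D1 = day of collection, ..., D6 = 5 days previously
            out[f"D{lag + 1}"] = out["Date"].map(
                lambda d: rain.get(d - pd.Timedelta(days=lag)))
            # dates with no gauge reading come back empty, as in the original data
        return out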

22

Rainfall BNs

Hand-crafted BNs to predict bacteria using rainfall only

Started with deterministic BN that implemented EPA guidelines

Looked at varying the number of previous days’ rainfall used to predict bacteria

Investigated various discretisations of variables

23

EPA Guidelines as BN

24

Davidson BN: 1 day rainfall

25

Davidson BN: 6 days rainfall

26

Evaluation

Split data 50-50 training/testing; 10-fold cross-validation
Measures: predictive accuracy & information reward
Also looked at ROC curves (correct classifications vs false positives)
Using Weka: Java environment for machine learning tools and techniques
Small data set, 4 beaches: Chinamans, Edwards, Balmoral (all Middle Harbour), Clifton (Port Jackson)
Using 6 days’ rainfall averaged over all rain gauges
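The original evaluation used Weka; a roughly analogous setup in Python with scikit-learn, on synthetic stand-in data (everything here is illustrative, not the project’s actual features or learners):

    import numpy as np
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    X = rng.gamma(1.0, 5.0, size=(400, 6))       # stand-in for 6 days' rainfall per sample
    y = (X[:, :3].sum(axis=1) > 15).astype(int)  # stand-in exceedance labels

    clf = GaussianNB()
    print(cross_val_score(clf, X, y, cv=10).mean())  # 10-fold cross-validated accuracy

    # 50-50 train/test split, then ROC area from predicted probabilities
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    scores = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(roc_auc_score(y_te, scores))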

27

Predictive accuracy

Examine each joint observation in the sample
Add any available evidence for the other nodes
Update the network
Use the value with the highest probability as the predicted value
Compare the predicted value with the actual value
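In code, this procedure amounts to taking the most probable state from each posterior and comparing it with the observed value (a minimal sketch with hypothetical posteriors):

    def predictive_accuracy(posteriors, actuals):
        # posteriors: one dict of state -> probability per test case, computed
        # after adding the available evidence and updating the network
        correct = 0
        for post, actual in zip(posteriors, actuals):
            predicted = max(post, key=post.get)  # value with highest probability
            correct += predicted == actual
        return correct / len(actuals)

    # e.g. two test cases for a binary "exceeds threshold" node
    print(predictive_accuracy(
        [{"high": 0.7, "low": 0.3}, {"high": 0.2, "low": 0.8}],
        ["high", "high"]))  # -> 0.5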

28

Information Reward

Rewards calibration of probabilities
Zero reward for just reporting priors
Unbounded below for a bad prediction
Bounded above by a maximum that depends on the priors

    import math

    def information_reward(p, prior, correct_state):
        # p[i] = predicted probability of state i; prior[i] = its prior.
        # Prior-relative scoring reconstructed so reporting the priors scores zero.
        ir = 0.0
        for i in range(len(p)):
            if i == correct_state:
                ir += math.log(p[i] / prior[i])
            else:
                ir += math.log((1 - p[i]) / (1 - prior[i]))
        return ir

29

Evaluation: Weka learners

Naïve Bayes
J48 (a version of C4.5)
CaMML – causal BN learner, using an MML metric
AODE
TAN
Logistic
“Davidson” BN – 6 days’ previous rainfall
» with and without adaptation of parameters (case learning)
“Guidelines” BN – 3 days’ previous rainfall
» deterministic rule
» with adaptation of parameters (case learning)


30

Results

Learner            Pred. Accuracy   Info. Reward
Prior              0.758             0
Naïve Bayes        0.760            -0.729
J48                0.791             0.125
CaMML              0.764             0.122
AODE               0.769             0.128
TAN                0.775            -1.459
Logistic           0.787             0.128
Davidson           0.757            -0.272
Davidson CL        0.776             0.033
Guidelines (det)   0.530            -2.318
Guidelines CL      0.776             0.058

31

Results: ROC Curves

32

Results: area under ROC Curves

Learner        Area under ROC
Perfect        0.999
AODE           0.733
Logistic       0.729
CaMML          0.718
J48            0.689
Naïve          0.679
Davidson CL    0.645
Guidelines CL  0.643
Guidelines     0.637
Davidson       0.620
TAN            0.561
Prior          0.496

33

Results: ROC Curves

For ~20% false positives, can catch ~60% of events
For ~45% false positives, can catch ~75% of events
For ~60% false positives, can catch ~80% of events
Implications?
» Using the current guidelines, accepting 45% false positives gives a 60% hit rate
» Can either keep that false-positive rate and catch an extra 15% of events
» Or keep the same hit rate at half the false-positive rate

34

Example of CaMML BN

35

Future Directions?

36

37

Early BN-related projects

DBNs for discrete monitoring (PhD, 1992)
Approximate BN inference algorithms based on a mutual information measure for relevance (with Nathalie Jitnah, 1996-1999)
Plan recognition: DBNs for predicting users’ actions and goals in an adventure game (with David Albrecht, Ingrid Zukerman, 1997-2000)

DBNs for ambulation monitoring and fall diagnosis (with biomedical engineering, 1996-2000)

Bayesian Poker (with Kevin Korb, 1996-2003)

38

Knowledge Engineering with BNs

Seabreeze prediction: joint project with the Bureau of Meteorology
» comparison of an existing simple rule, an expert-elicited BN, and BNs learned by Tetrad-II and CaMML
ITS for decimal misconceptions
Methodology and tools to support the knowledge engineering process
» Matilda: visualisation of d-separation
» support for sensitivity analysis
Written a textbook:
» Bayesian Artificial Intelligence, Kevin B. Korb and Ann E. Nicholson, Chapman & Hall / CRC, 2004. www.csse.monash.edu.au/bai/book

39

Current BN-related projects

BNs for Epidemiology (with Kevin Korb, Charles Twardy)

» ARC Discovery Grant, 2004
» looking at coronary heart disease data sets
» learning hybrid networks: continuous and discrete variables
BNs for supporting the meteorological forecasting process (DSS’2004) (with PhD student Tal Boneh, K. Korb, BoM)
» building a domain ontology (in Protege) from expert elicitation
» automatically generating BN fragments
» case studies: fog, hailstorms, rainfall
Ecological risk assessment
» Goulburn Water, native fish abundance
» Sydney Harbour water quality

40

Open Research Questions

Methodology for combining expert elicitation and automated methods
» expert knowledge used to guide the search
» automated methods provide alternatives to be presented to experts
Evaluation measures and methods
» may be domain dependent
Improved tools to support elicitation
» reduce reliance on the BN expert
» e.g. visualisation of d-separation

Industry adoption of BN technology
