Top Banner
1 Discovering Robust Knowledge from Databases that Change Chun-Nan Hsu Craig A. Knoblock Arizona State University University of Southern California Journal of Data Mining and Knowledge Discovery, Volume 2, 1998 Presented & Edited by: Mark Stebbins CS331/295 Data Mining – Spring 2009
44

1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

Dec 19, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

1

Discovering Robust Knowledge from Databases that Change

Chun-Nan Hsu Craig A. KnoblockArizona State University University of Southern

California

Journal of Data Mining and Knowledge Discovery, Volume 2, 1998

Presented & Edited by: Mark StebbinsCS331/295 Data Mining – Spring 2009

Page 2: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

2

Outline

Introduction Preliminaries Robustness of Knowledge Applying Robustness in Knowledge

Discovery Experimental Results Conclusions

Page 3: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

3

Introduction: Importance Many applications of knowledge discovery and data

mining require the knowledge to be consistent with data.

Databases usually changes over time and make machine-discovered knowledge inconsistent.

How can we detect, check & update these inconsistent rules without high maintenance costs?

Useful knowledge should be robust against database changes so that it is unlikely to become inconsistent after database changes.

Defines a notion of robustness in the context of relational databases

Describes how robustness can be estimated and applied in knowledge discovery.

Page 4: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

4

Introduction: Definition of Robustness

Why do we need robustness? It is important to know whether a discovered knowledge is robust against database changes. Robustness: A measure of the quality of discovered knowledge.

General definition of robustness: Robustness of discovered knowledge can be defined as the probability that the knowledge is consistent with a database state. Or estimated in ratio form as:

Page 5: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

5

Introduction: Applications of Robustness Estimation

Minimize the maintenance cost of inconsistent rules in the presence of database changes.

When an inconsistent rule is detected, a system may: remove the rule: repeatedly invoke discovery system repair the rule: repaired frequently after data

changes

Apply to rule discovery for semantic query optimization (SQO).

Rule: all Maltese seaports have railroad access. Query: find all Maltese seaports with railroad access and

2,000 sq.ft of storage.

Page 6: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

6

Applications of Robustness Estimation (cont.)

Apply to guide the data mining and evaluation of semantic rules. Two stages:

Data mining stage: uses a rule pruning approach to prune antecedents of a discovered rule to increase its robustness.

Evaluation stage: eliminates rules when their estimated robustness values are below a given threshold.

Apply to rule maintenance. Similar to rule pruning approach.

Use the estimated robustness of the resulting partially repaired rules to search for the best sequence of repair operators so that the repaired rule is more robust than the original one.

Page 7: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

7

Introduction: Robustness Comparison

Comparison with Support count: Support count for an association rule expresses the

probability that a data instance satisfies a rule. While the robustness expresses the probability that

an entire database state is consistent with a rule.

Comparison with predictive accuracy: Predictive accuracy for classification rule measures

the probability that knowledge is consistent with randomly selected unseen data instead of with database states.

The difference is significant in databases that are interpreted using the closed-world assumption (CWA).

Page 8: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

8

Introduction: Closed-world Assumption

Information that is not explicitly present in the database is taken to be false.

Closed-world databases are widely used in relational db’s, deductive db’s and rule-based information systems because of

The characteristics of application domains. The limitation of the representation systems.

Databases change by deletions, insertions and updates, and in a close-world database, they may affect the validity of a rule.

Closed-world data tends to be dynamic, important for knowledge discovery systems to handle dynamic and closed world data.

An instance of closed-world data usually represents a dynamic state in the world. E.g: an instance of employee information in a personnel database.

Page 9: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

9

Example of a relational database

Schema: ship_class(class_name, ship_type,max_draft, length,container_cap),ship(ship_name, ship_class, status, fleet, year, built),geoloc(name, glc_cd, country, latitude, longitude),seaport( name, glc_code, storage, rail, road, anch_offshore),wharf(wharf_id, glc_code, depth, length, crane_qty).

Rules: R1:The latitude of a Maltese geographic location is greater than or equal to 35.89.

geoloc(_,_,?country,?latitude,_) ^ ?country = ”Malta” ?latitude >= 35.89

R2: All Maltese geographic locations are seaports.

geoloc(_,?glc_cd, ?country,_,_) ^ ?country = ”Malta” seaport(_, ?glc_cd, _, _,_,_)

Page 10: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

10

Outline

Introduction Preliminaries Robustness of Knowledge Applying Robustness in Knowledge

Discovery Experimental Results Conclusions

Page 11: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

11

Preliminaries Relational database: a set of relations

A relation contains a set of instances (tuples) of attribute-value vectors

Two types of literals: Database literal: literal defined on database relations (e.g.:

seaport(_,?glc_cd,?storage,_,_,_) Built-in literal: literal on built-in relations (e.g.: latitude >

35.89)

Two types of rules: Range rule: rules with a positive built-in literal (e.g.: R1) Relational rule: rules with a database literal as their

consequent (e.g.: R2)

Page 12: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

12

Preliminaries (cont.)

Database state at time t: the collection of the instances presents in the database at time t.

Consistent rule: given a database state, if all variable instantiations that satisfy the antecedents of the rule also satisfy the consequent of the rule.

Page 13: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

13

Outline

Introduction Preliminaries Robustness of Knowledge Applying Robustness in Knowledge

Discovery Experimental Results Conclusions

Page 14: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

14

Definitions of Robustness

Robust rule: does not become inconsistent (invalid) after database changes

Definition 1: (Robustness for all states) Given a rule r, let D be the event that a database is in a state that is consistent with r.

testa b a se stao ssib le d a# o f a ll p

t w ith r co n sistena se sta tes# o f d a ta b(r)R o b u st 1

(D) (r) Robust Pr1

Page 15: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

15

Definitions of Robustness (cont.)

Intuitively, a rule is robust if it is unlikely that the transactions will invalidate the rule.

Definition 2: (Robustness for accessible states) Given a rule r, and a database in a state d, in which r

is consistent. Let t denote the event of performing a transaction on d that results in new database states inconsistent with r. The robustness of r in accessible states from the current state d is defined as

(t|d)- t|d) ( ) Robust(r|d Pr1Pr

Page 16: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

16

Definitions of Robustness (cont.)

Observation: If all transactions are equally probable, then

The robustness of a rule could be different in different database states. E.g.: Suppose there are two db states d1 and d2 of a given db. To reach a state inconsistent with r, we need to delete 10 tuples in d1 and only 1 tuple in d2.

Robust(r|d1) > Robust(r|d2)

)|()(1 drR o b u strR o b u st

Page 17: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

17

Estimate Robustness Key idea: estimates the probabilities of data

changes, rather than the number of possible database states.

Estimate the robustness of a rule based on the probability of transactions that may invalidate the rule.

Decompose data changing transactions and estimate their probabilities using the Laplace law of succession.

Page 18: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

18

Estimate Robustness (cont.)

Laplace law of Succession: Given a repeatable experiment with an outcome of one of any of k classes. Suppose we have conducted this experiment n times, r of which have resulted in some outcome C, in which we are interested. The probability that the outcome of the next experiment will be C can be estimated as

kn

r

1

Page 19: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

19

Examples to Estimate Robustness

R1:The latitude of a Maltese geographic location is greater than or equal to 35.89.geoloc(_,_,?country,?latitude,_) ^ ?country = “Malta” ?latitude >= 35.89

Find the robustness of R1 ?

T1 (Update a satisfied tuple): One of the existing tuples of geoloc with its ?country = “Malta” is updated such that its ?latitude < 35.89

T2 (Insert a new tuple): A new tuple of geoloc with its ?country = “Malta” and ?latitude < 35.89 is inserted to the database.

T3 (Update an unsatisfied tuple): One of the existing tuples of geoloc with its ?latitude < 35.89 and ?country != “Malta” is updated such that its ?country = “Malta”.

Page 20: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

20

Examples to Estimate Robustness (cont.)

T1, T2, and T3: transactions that invalidate R1. Since T1, T2, and T3 are mutually exclusive, we have

The robustness of R1 can be estimated from the probabilities of T1, T2, and T3.

Requirements for Transactions Classes Transaction classes must be mutually exclusive. Minimal, no redundant conditions are specified.

?country = ‘Malta’ ?latitude < 35.89 and ?longitude > 130.00

)3P r()2P r()1P r()321P r( TTTTTT

Page 21: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

21

Decomposition of Transactions

How to estimate Pr(T1), Pr(T2), Pr(T3)?

Decompose the transaction into more primitive statements and estimate their local probabilities first

Use Bayesian network model to decompose the transaction (next slide).

Page 22: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

22

Bayesian network model of transaction

Nodes in the network represent the random variables involved in the transaction.

The arc from node x_i to x_j indicates that x_j is dependent on x_i.

X1:type of transaction?

X3: on which tuples?

X4:on which attributes?

X5:what new attribute

value?

X2: on which relation?

Page 23: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

23

Decomposition of Transaction

The probability of a transaction can be estimated as the joint probability of all variables

Example for the transaction T1, their semantics are as follows:

x1: a tuple is updated x2: a tuple of geoloc is updated x3: a tuple of geoloc, whose ?country = ‘Malta’ is updated x4: a tuple of geoloc whose ?latitude is updated x5: a tuple of geoloc whose ?latitude is updated to a new value

less than 35.89

)124|5P r()12|4P r()12|3P r()1|2P r()1P r(

)54321P r()1P r(

xxxxxxxxxxxxx

xxxxxT

Page 24: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

24

Estimation of Probability of T1

Apply the Laplace law to estimate each local conditional probability.

x1: a tuple is updated:

x2|x1: a tuple of geoloc is updated, given that a

tuple is updated:

x3|x2^x1: a tuple of geoloc whose ?country=“Malta” is updated, given that a tuple of geoloc is updated:

Page 25: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

25

Estimation of Probability of T1

x4|x2^x1: a tuple of geoloc whose ?latitude is updated

x5|x4^x2^x1: a tuple of geoloc whose ?latitude is updated to a new value less than 35.89

Page 26: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

26

Estimation of Probability of T1

Assume the size of the relation geoloc is 616, ten of them with ?country=“Malta”

Pr(T1) = (1/3)*(1/5)*(10/616)*(1/5)*(1/2) = 0.000108

Similarly, we can estimate Pr(T2) and Pr(T3).

Suppose, that Pr(T2) = 0.000265 and Pr(T3) = 0.00002

Then the robustness of the rule can be estimated as 1- (0.000108 + 0.000265 + 0.00002 ) = 0.999606

Page 27: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

27

Estimate Robustness Estimation accuracy depends on available

information. However, even given only database schemas, the method can still come up with some estimates.

How to derive transactions that invalidate an arbitrary logic statement? It is not a trivial problem!!!

Most knowledge discovery systems have strong restrictions on the syntax of discovered knowledge.

Hence, the invalidating transactions can be manually generalized into a small sets of transaction templates, as well as templates of probability estimates for robustness estimation.

Page 28: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

28

Outline

Introduction Preliminaries Robustness of Knowledge Applying Robustness in

Knowledge Discovery Experimental Results Conclusions

Page 29: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

29

Applying Robustness in Knowledge Discovery

A rule pruning algorithm which can increase the robustness and applicability of machine discovered rules by pruning their antecedent literals.

Specification of rule pruning problem:Take a machine-generated rule as input, which is consistent with a database but potentially overly-specific, and remove antecedent literals of the rule so that it remains consistent but is short and robust.

Basic idea of pruning algorithm: To search for a subset of antecedent literals to remove until any

further removal will yield an inconsistent rule.

However the search space can be exponentially large with respect to the number of literals in rule.

A beam-search algorithm was presented to trim the search space.

Page 30: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

30

Pruning Algorithm

Beam-search algorithm Apply the robustness estimation approach to

estimate the robustness of a partially pruned rule and guide the pruning search.

Two properties to optimize: robustness and length.

For each set of equally short rules, the algorithm searches for the rule that is as robust as possible while still being consistent.

Page 31: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

31

Antecedent pruning algorithm

Pruning rule antecedents

1. INPUT R = rules (initially the rule to be pruned), B = beam size;

2. LET O = results;( initially empty); 3. WHILE (R is not empty) DO4. Move the first rule r in R to O; 5. Prune r, LET R’ = resulting rules; 6. Remove visited dangling or inconsistent rules in R’; 7. Estimate and sort on the robustness of rules in R’;

8. Retain top B rules in R’ and remove the rest; 9. Merge sorted R’ into R in sorted order of the

robustness; 10. RETURN O ;

Page 32: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

32

Empirical Demonstration of Rule Pruning

A detailed empirical study has been conducted on rule R3.wharf(_,?code,?depth,?length,?crane) ^ seaport(?name,?code,_,_,_,_) ^ geoloc(?name,_,?country,_,_) ^?country = “Malta” ^?depth <= 50 ^ ?crane > 0 ?length >= 1200

The relationship between length and robustness of the rules is shown on the next slide.

Pruned rules R7: wharf(_,?code,?depth,?length,?crane) ^

seaport(?name,?code,_,_,_,_) ^ geoloc(?name,_,?country,_,_) ^ ?crane > 0

?length >= 1200

Page 33: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

33

Empirical Demonstration of Rule Pruning

Page 34: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

34

Pruned rules and their estimated robustness

Page 35: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

35

Experimental Results

Experiment setups: Use the rule discovery system BASIL to derive rules

from two large ORACLE relations databases BASIL part of SIMS information mediator developed by

authors Synthesize 123 sample transactions to represent

possible transactions of the experimental databases (27 updates, 29 deletions and 67 insertions)

Steps of the experiments: Train BASIL to discover a set of rules and estimate their

robustness, generates 355 rules Use another set of 202 sample transactions to assist the

robustness estimation. Apply the set of 123 transactions to the two relational

database and check consistency of all 355 rules

Page 36: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

36

Experimental results

Page 37: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

Experimental results

Evaluation of predictive power of the robustness estimation Recall = |I L| / |I | Precision = |I L| / |L | I is set of inconsistent rules L is set of rules likely to become

inconsistent Consistency threshold = 0.75

Page 38: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

Experimental results

Page 39: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

39

Conclusions Formalize the notion of robustness of a rule r in a

given database state d

Estimate probabilities of rule invalidating transactions

Decomposes the probability of a transactions into local probabilities that can be estimated using Laplace law.

No need to provide additional information for the estimation.

Apply in a rule discovery system Pruning antecedents of a discovered rule so that the rule

will be highly robust and widely applicable. Beam-search algorithm

Conduct empirical experiments to demonstrate the algorithm.

Page 40: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

Real-world Applications

Face recognition software Envirionment is the “data set” – constantly

changing Rules must resist variability in lighting and

pose Systems prognostics

Data set continuously updated with diagnostic or calibration information

Predictive failure rules used to dictate system resource planning

Page 41: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

41

Questions for final exam prep.

• Define robustness. (Slides 4, 14, 15)– the probability that the knowledge is

consistent with a database state– Robustness for all states:

– Robustness for accessible states

testa b a se stao ssib le d a# o f a ll p

t w ith r co n sistena se sta tes# o f d a ta b(r)R o b u st 1

(t|d)- t|d) ( ) Robust(r|d Pr1Pr

Page 42: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

42

Questions for final exam prep.

• What is the closed world assumption and the significance of CWA? Where is it used in data mining? (Slide 8)

– Information that is not explicitly present in the database is taken to be false.

– Significant because the data tends to be dynamic

– Used in relational db’s, deductive db’s and rule-based information systems

Page 43: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

43

Questions for final exam prep.

• Compare and contrast robustness, support count and predicative accuracy (Slide 7)

– Support count for an association rule expresses the probability that a data instance satisfies a rule.

– Predictive accuracy for classification rule measures the probability that knowledge is consistent with randomly selected unseen data

– Robustness expresses the probability that an entire database state is consistent with a rule.

Page 44: 1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.

44

Thanks and questions