Incremental Maintenance of Ontology-Exploiting Association Rules
Ming-Cheng Tseng (1), Wen-Yang Lin (2) and Rong Jeng (3)
(1, 3) Institute of Information Engineering, I-Shou University, Taiwan
(2) Dept. of Comp. Sci. & Info. Eng., National University of Kaohsiung, Taiwan
ICMLC2007, Aug. 19~22, 2007, Hong Kong
August 20, 2007
Outline
Introduction
Problem description
The proposed algorithm
Performance evaluation
Conclusions
Introduction
Motivation
- In general, there exist many semantic relationships (domain knowledge) among items
- It is natural to incorporate domain ontology into the process of data mining to explore more innovative rules
- The source databases change over time
  - E.g., insertion, deletion, modification
- The discovered knowledge (rules) has to be updated to reflect the new situation
Introduction (cont.)
Association rules
- Given:
  - A database of customer transactions
  - Each transaction is a set of items
- Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
- Example: Sony VAIO => HP LaserJet 1300 (Sup. 30%, Conf. 60%)
Introduction (cont.)
Strong association rules
- Given: the user's specified constraints
  - Minimum support (min_sup)
  - Minimum confidence (min_conf)
- Find rules X => Y with support and confidence larger than the user-specified minimum values
- Example: min_sup = 25%, min_conf = 50%
  - Sony VAIO => HP LaserJet 1300 (Sup. 30%, Conf. 60%)
Introduction (cont.)
Frequent itemsets (patterns) mining
- The association mining problem can be reduced to the problem of mining frequent itemsets, i.e., itemsets with support larger than min_sup
- Example: min_sup = 25%, min_conf = 50%
  - Sony VAIO => HP LaserJet 1300 (Sup. 30%, Conf. 60%)
  - sup({Sony VAIO, HP LaserJet 1300}) = 30%
  - sup({Sony VAIO}) = 50%
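The confidence in this example follows from the two supports: conf(X => Y) = sup(X and Y together) / sup(X) = 30% / 50% = 60%. A minimal Python sketch of that check is given here; the function names are illustrative and do not come from the slides.

```python
from fractions import Fraction

def confidence(sup_xy, sup_x):
    """conf(X => Y) = sup(X union Y) / sup(X)."""
    return sup_xy / sup_x

def is_strong(sup_xy, sup_x, min_sup, min_conf):
    # The slides ask for support and confidence larger than the minimums.
    return sup_xy > min_sup and confidence(sup_xy, sup_x) > min_conf

sup_xy = Fraction(30, 100)  # sup({Sony VAIO, HP LaserJet 1300})
sup_x = Fraction(50, 100)   # sup({Sony VAIO})

conf = confidence(sup_xy, sup_x)
strong = is_strong(sup_xy, sup_x, Fraction(25, 100), Fraction(50, 100))
print(conf, strong)
```

Exact fractions are used so that the comparison against the thresholds is not affected by floating-point rounding.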
Introduction (cont.)
Ontology
- W3C Web Ontology Working Group: “An ontology formally defines a common set of terms that are used to describe and represent a domain knowledge.”
- E.g., taxonomy: a kind of ontology presenting classification relationships among objects
[Figure: example taxonomy with the nodes Vegetable, Non-root Vegetable, Tomato, Carrot, Kale, Pickle, Fruit, Apple, Papaya]
Introduction (cont.)
Ontology-exploiting association rules
[Figure: item ontology with classification and composition relationships; nodes: PC, Desktop PC, Notebook, Sony VAIO, IBM TP, Gateway GE, Memory, RAM 256MB, RAM 512MB, Hard Disk, S 60GB, IBM 60GB, Printer, HP DeskJet, Epson EPL, Ink Cartridge, Photo Conductor, Toner Cartridge]
Example: IBM 60GB HD => HP DeskJet
Problem Description
Incremental maintenance of ontology-exploiting association rules
- Given:
  - A database of customer transactions, DB
  - An incremental database, db
  - An item ontology, T
  - Discovered frequent itemsets in DB, L
  - Minimum support, ms, and minimum confidence, mc
- Find all frequent itemsets in UD = DB + db w.r.t. ms
- Construct all strong rules from the frequent itemsets w.r.t. mc
Problem Description (cont.) -- Example
Customer transactions DB
TID  Purchased Items
1    IBM TP, Epson EPL, Toner Cartridge
2    Sony VAIO, IBM TP, Epson EPL
3    IBM TP, HP DeskJet, Ink Cartridge
4    HP DeskJet
5    IBM TP, HP DeskJet, Ink Cartridge
6    Sony VAIO, Ink Cartridge
Item ontology G
[Figure: classification and composition relationships among PC, Sony VAIO, IBM TP, RAM 256MB, S 60GB, IBM 60GB, Printer, HP DeskJet, Epson EPL, Ink Cartridge, Photo Conductor, Toner Cartridge]
Discovered frequent itemsets L
minsup = 70% (algorithms AROC, AROS)
L1                                        Count
{Printer}                                 5
{PC}                                      5
{IBM TP}                                  4
{RAM 256MB*}                              5
{IBM 60GB*}                               4
L2 & L3                                   Count
{Printer, PC}                             4
{Printer, IBM TP}                         4
{Printer, RAM 256MB*}                     4
{Printer, IBM 60GB*}                      4
{RAM 256MB*, IBM 60GB*}                   4
{Printer, RAM 256MB*, IBM 60GB*}          4
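The counts in the table can be reproduced by extending each transaction with its generalized items and its components, then counting containment. The sketch below is a rough illustration, not the AROC or AROS algorithm. The ontology edges in `PARENT` and `PARTS` are assumptions read off the flattened diagram, and the trailing `*` marks an item that appears in its component role, as in the table.

```python
# Assumed ontology edges (the original figure did not survive extraction).
PARENT = {"Sony VAIO": "PC", "IBM TP": "PC",
          "HP DeskJet": "Printer", "Epson EPL": "Printer"}
PARTS = {"Sony VAIO": ["RAM 256MB", "S 60GB"],
         "IBM TP": ["RAM 256MB", "IBM 60GB"],
         "HP DeskJet": ["Ink Cartridge"],
         "Epson EPL": ["Photo Conductor", "Toner Cartridge"]}

DB = {
    1: {"IBM TP", "Epson EPL", "Toner Cartridge"},
    2: {"Sony VAIO", "IBM TP", "Epson EPL"},
    3: {"IBM TP", "HP DeskJet", "Ink Cartridge"},
    4: {"HP DeskJet"},
    5: {"IBM TP", "HP DeskJet", "Ink Cartridge"},
    6: {"Sony VAIO", "Ink Cartridge"},
}

def extend(transaction):
    """Return the transaction plus generalized and component items."""
    extended = set(transaction)
    for item in transaction:
        if item in PARENT:
            extended.add(PARENT[item])
        for part in PARTS.get(item, []):
            extended.add(part + "*")  # '*' = component role
    return extended

ED = {tid: extend(items) for tid, items in DB.items()}

def count(itemset):
    """Number of extended transactions containing every item of itemset."""
    return sum(1 for items in ED.values() if set(itemset) <= items)

for itemset in [{"Printer"}, {"PC"}, {"IBM TP"}, {"RAM 256MB*"},
                {"IBM 60GB*"}, {"Printer", "RAM 256MB*", "IBM 60GB*"}]:
    print(sorted(itemset), count(itemset))
```

Under these assumed edges the six printed counts are 5, 5, 4, 5, 4 and 4, matching the slide.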
Problem Description (cont.)
Example
Customer transactions DB
TID  Purchased Items
1    IBM TP, Epson EPL, Toner Cartridge
2    Sony VAIO, IBM TP, Epson EPL
3    IBM TP, HP DeskJet, Ink Cartridge
4    HP DeskJet
5    IBM TP, HP DeskJet, Ink Cartridge
6    Sony VAIO, Ink Cartridge
Incremental transactions db
TID  Items Purchased
7    Toner Cartridge
8    IBM TP, HP DeskJet, IBM 60GB, Toner Cartridge
9    IBM 60GB, Toner Cartridge
Item ontology G
[Figure: classification and composition relationships among PC, Sony VAIO, IBM TP, RAM 256MB, S 60GB, IBM 60GB, Printer, HP DeskJet, Epson EPL, Ink Cartridge, Photo Conductor, Toner Cartridge]
minsup = 70%
Updated frequent itemsets L’: ??
The Proposed Algorithm – IMARO
Basic scheme
- An Apriori-based maintenance algorithm
- Employing a bottom-up, level-wise searching strategy
  - Starting from frequent 1-itemsets, L1, then L2, …, Lk, etc.
[Figure: itemset lattice over the items A, B, C, D]
  A  B  C  D
  AB  AC  AD  BC  BD  CD
  ABC  ABD  ACD  BCD
  ABCD
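The level-wise search moves up the lattice by building size-k candidates from the frequent (k-1)-itemsets. The following is a minimal sketch of the standard Apriori join-and-prune step, not IMARO itself; the function name `apriori_gen` and the sample `L2` (with CD assumed infrequent) are illustrative.

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Build k-candidates from frequent (k-1)-itemsets.

    Join: merge two itemsets that share their first k-2 items.
    Prune: drop a candidate if any (k-1)-subset is not frequent.
    """
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    prev_set = set(prev)
    candidates = []
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            subsets = combinations(cand, len(cand) - 1)
            if all(sub in prev_set for sub in subsets):
                candidates.append(cand)
    return candidates

# Hypothetical L2 over the lattice items; CD is assumed infrequent.
L2 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]
C3 = apriori_gen(L2)
print(C3)  # ACD and BCD are pruned because CD is not in L2
```

Only ABC and ABD survive: ACD and BCD each contain the infrequent subset CD, so they are never counted.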
The Proposed Algorithm – IMARO (cont.)
Terminology
Notation  Definition
DB        Original database
db        Incremental database
UD        Updated database, UD = DB + db
T         Item ontology
ED        Extension of DB with extended items in T
ed        Extension of db with extended items in T
UE        Updated extended database, UE = ED + ed
The Proposed Algorithm – IMARO (cont.)
Example
The Proposed Algorithm – IMARO (cont.)
Note on database extension
- A component item may exist as a primitive item itself
- To clarify the meaning of associations involving such an item, we have to differentiate the role this item plays
- E.g., IBM TP => Ink Cartridge can mean either:
  - buy an IBM TP notebook, also buy an Ink Cartridge
  - buy an IBM TP notebook, also buy a product composed of Ink Cartridge
TID  Purchased Items
5    IBM TP, HP DeskJet, Ink Cartridge
TID  Primitive Items                       Extended Items
5    IBM TP, HP DeskJet, Ink Cartridge*    PC, RAM 256MB, IBM 60GB, Printer, Ink Cartridge
The Proposed Algorithm – IMARO (cont.)
Process flow for updating frequent k-itemsets
[Figure: flow chart with four numbered steps (1 to 4); labels: L_k^ED, L_{k-1}^ED, candidate generating, C_k, mining (e.g., AROC or AROS), db, T, L_k^ed, freq. or infreq. in UE, determined, undetermined, scan DB with T, count, L_k^UE]
The Proposed Algorithm – IMARO (cont.)
Frequent/infrequent itemsets inference
[Figure: an itemset is large or small relative to the minimum support in DB + T and in db + T, which gives Cases 1 to 4 for UD + T]
Conditions              Results
L^ED      L^ed          UE        Action                       Case
in        in            freq.     no                           1
in        not in        undetd.   compare sup_UD(A) with ms    2
not in    in            undetd.   scan DB                      3
not in    not in        infreq.   no                           4
The Proposed Algorithm – IMARO (cont.)
Optimization 1: Candidate pruning Any candidate itemset that contains both an item and anyo
ne of its extensions (generalized item or component) is pruned.
PhotoConductor
TonerCartridge
HPDeskJet
Printer
EpsonEPL
- -
InkCartridge
- - - -
RAM256MB
IBM60GB
SonyVAIO
PC
IBMTP
S60GB
- -
{Epson EPL, Printer}
{Epson EPL, Toner Cartridge*}
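A possible way to express this pruning test is sketched below. The ontology edges in `PARENT` and `PARTS` are assumptions read off the flattened diagram, the helper names are hypothetical, and only direct components are considered, which is a simplification.

```python
# Assumed ontology edges (classification parents and component parts).
PARENT = {"Sony VAIO": "PC", "IBM TP": "PC",
          "HP DeskJet": "Printer", "Epson EPL": "Printer"}
PARTS = {"Sony VAIO": ["RAM 256MB", "S 60GB"],
         "IBM TP": ["RAM 256MB", "IBM 60GB"],
         "HP DeskJet": ["Ink Cartridge"],
         "Epson EPL": ["Photo Conductor", "Toner Cartridge"]}

def extensions(item):
    """Generalized items and starred direct components of an item."""
    result = set()
    node = item
    while node in PARENT:  # walk up the classification hierarchy
        node = PARENT[node]
        result.add(node)
    for part in PARTS.get(item, []):
        result.add(part + "*")  # '*' = component role
    return result

def is_pruned(candidate):
    """True if the candidate holds an item together with one of its extensions."""
    items = set(candidate)
    return any(extensions(item) & items for item in items)

print(is_pruned({"Epson EPL", "Printer"}))
print(is_pruned({"Epson EPL", "Toner Cartridge*"}))
print(is_pruned({"Printer", "IBM TP"}))
```

The first two candidates are the slide's own examples and are pruned; {Printer, IBM TP} is kept because neither item is an extension of the other.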
The Proposed Algorithm – IMARO (cont.)
Optimization 2: Extension filtering
- The extension of an item can be added only if that item appears in at least one candidate itemset currently being counted
[Figure: item ontology with classification and composition relationships among PC, Sony VAIO, IBM TP, RAM 256MB, S 60GB, IBM 60GB, Printer, HP DeskJet, Epson EPL, Ink Cartridge, Photo Conductor, Toner Cartridge]
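The filtering rule can be sketched as follows, reading "that item" as the extended item itself: an extended item is kept in a transaction only when some current candidate contains it. This reading is an assumption, as are the function name and the sample candidates; the transaction is TID 5 from the running example.

```python
def filter_extensions(extended_items, candidates):
    """Keep only the extended items that occur in some current candidate."""
    wanted = set()
    for candidate in candidates:
        wanted |= set(candidate)
    return {item for item in extended_items if item in wanted}

# TID 5 from the running example; '*' marks the component role.
primitive = {"IBM TP", "HP DeskJet", "Ink Cartridge"}
extended = {"PC", "RAM 256MB*", "IBM 60GB*", "Printer", "Ink Cartridge*"}

# Hypothetical candidates being counted in the current pass.
candidates = [{"Printer", "PC"}, {"Printer", "IBM TP"}]

kept = filter_extensions(extended, candidates)
transaction_for_counting = primitive | kept
print(sorted(kept))
```

Here only PC and Printer are added, so the transaction that is matched against the candidates stays smaller than its full extension.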
Performance Evaluation
- Compared with applying our proposed algorithms, AROC and AROS, to the whole database DB + db with T
- Test data: a synthetic dataset generated by the IBM data generator with an artificially built ontology
Parameter  Description                        Default value
|DB|       Number of original transactions    200,000
|t|        Average size of transactions       20
N          Number of items                    362
R          Number of groups                   30
L          Number of levels                   4
F          Fanout                             5
Performance Evaluation (cont.)
Varying minimum supports
[Figure: run time (sec., log scale) versus ms (%) from 1 to 3.5 for AROC, AROS, and IMARO; |db| = 40,000]
Performance Evaluation (cont.)
Varying incremental transaction size
[Figure: run time (sec.) versus number of incremental transactions (x 10,000) from 2 to 20 for AROC, AROS, and IMARO; ms = 1.5%]
Conclusions
- We have investigated the problem of updating ontology-exploiting association rules when new transactions are inserted into the database
- An Apriori-based algorithm is proposed
- Other issues
  - More complicated semantic relationships and knowledge
  - Non-uniform minimum support
    - Generalized item or composite item occurs more frequently
  - Towards a total solution for evolving environments
    - Ontology evolution, database update
    - Interactive refinement of support constraints
    - …
Thanks for your attention!
Conclusions (cont.)
Taxonomy of semantic relationships
*source: 1993, Veda C. Storey, VLDB journal
Related Work
Comparison with previous work: model of incremental maintenance of association rules
Contributors              Type of database update                 Type of ontology
Srikant & Agrawal, 1995   none                                    classification
Han & Fu, 1995            none                                    classification
Cheung et al., 1996       insertion                               classification
Cheung et al., 1997       insertion, deletion and modification    none
Jea et al., 2003          none                                    composition
Chien et al., 2005        none                                    classification & composition