Association Rule Mining In Partitioned Databases:
Performance Evaluation and Analysis
A DISSERTATION
Submitted in partial fulfillment of the requirements
for the award of the degree of
MASTER OF TECHNOLOGY
in
INFORMATION TECHNOLOGY (Specialization: SOFTWARE ENGINEERING)
By
Pankaj Kandpal
Under the Guidance of:
Prof. M. Radhakrishna Mr. Manish Kumar
IIIT-Allahabad
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY, ALLAHABAD
(A Centre of Excellence in Information Technology Established by Govt. of India)
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY ALLAHABAD
(Deemed University)
(A centre of excellence in IT, established by Ministry of HRD, Govt. of India)
Date:
We do hereby recommend that the thesis work prepared under our supervision
by Pankaj Kandpal entitled Association Rule Mining in Partitioned
Databases: Performance Evaluation and Analysis be accepted in partial
fulfillment of the requirements of the degree of Master of Technology in
Information Technology (Software Engineering) for examination.
Table 5.4: Description of different samples for BMSWEBVIEW1 dataset
Table 5.5: Description of different samples for T10I4D100K dataset
Table 5.6: Candidate itemsets in different samples for BMSWEBVIEW1 dataset
Table 5.7: Frequent itemsets generated for BMSWEBVIEW1 dataset for 0.15% support
Table 5.8: Frequent itemsets generated for BMSWEBVIEW1 dataset for 0.30% support
Table 5.9: Percentage error for BMSWEBVIEW1 dataset
Table 5.10: Frequent itemsets generated for T10I4D100K dataset for 0.45% support
Table 5.11: Frequent itemsets generated for T10I4D100K dataset for 0.60% support
Table 5.12: Percentage error for T10I4D100K dataset
Chapter 1 Introduction
There are two main reasons that data mining has attracted a great deal of attention in
recent years. First, our capability to collect and store huge amounts of data is increasing
rapidly: owing to the falling cost of storage devices and the growing processing power of
computers, it is now possible to store and process enormous volumes of organizational
data. The second, and more important, reason is the need to turn such data into useful
information and knowledge. The knowledge acquired through data mining can be applied
in various areas such as business management, retail and market analysis, engineering
design and scientific exploration [1].
Data mining, or knowledge discovery in databases (KDD), is the process of discovering
previously unknown patterns from the huge amounts of data stored in flat files, databases,
data warehouses or any other type of information repository. Database mining deals with
data stored in database management systems (e.g. Oracle).
Being data rich does not necessarily make us information rich, because useful
information is often hidden in the data. Data mining tools and techniques are used to
generate information from the data that we have stored in our repositories over the years.
To gain an advantage over competitors in the market, decision makers and managers need
to mine the knowledge hidden in the data collected over the years and use that
information effectively.
1.1 Data Mining Functionalities
The process of mining is often driven by the requirements of the users. A user may
be a business analyst or a marketing manager, and different users have different
information needs. Depending on the requirements, we can use different data mining
techniques. The different types of data mining functionalities and the patterns they
discover are described below:

1.1.1 Association Analysis [1]
Association rule mining is a data mining technique used to find interesting patterns or
associations among the data items stored in a database. Support and confidence are two
measures of the interestingness of the mined patterns; they are user-supplied parameters
and differ from user to user. Association rule mining is mainly used in market basket
analysis or retail data analysis. In market basket analysis we identify the buying habits of
customers and analyze them to find associations among the items they purchase. Items
that are frequently purchased together can be identified. Association analysis helps
retailers plan marketing, item placement and inventory management strategies.
When we perform association rule mining in a relational database management system
we generally transform the database into (tid, item) format, where tid stands for
transaction ID and item stands for an item purchased by a customer. There will be
multiple entries for a given transaction ID, because one transaction ID represents the
purchase of one particular customer and a customer can purchase as many items as he
wants. An association rule can look like this:
buys(X, computer) ⇒ buys(X, Windows OS CD) [support = 1%, confidence = 50%]
Where:
Support = (the number of transactions that contain both computer and Windows OS CD) / (the total number of transactions)
Confidence = (the number of transactions that contain both computer and Windows OS CD) / (the number of transactions that contain computer)
The above rule holds if its support and confidence are equal to or greater than the user-specified minimum support and confidence.
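The two measures above can be computed directly from a transaction list. A minimal Python sketch, with toy transactions invented for illustration:

```python
# Support and confidence for a rule X -> Y over a list of transactions,
# where each transaction is a set of purchased items.
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, x, y):
    """P(Y | X): among transactions containing X, the fraction also containing Y."""
    x_count = sum(1 for t in transactions if x <= t)
    xy_count = sum(1 for t in transactions if (x | y) <= t)
    return xy_count / x_count

# Toy data (invented): 4 customers, 2 bought both items.
transactions = [
    {"computer", "windows_os_cd"},
    {"computer", "printer"},
    {"computer", "windows_os_cd", "mouse"},
    {"printer"},
]
print(support(transactions, {"computer", "windows_os_cd"}))      # 0.5
print(confidence(transactions, {"computer"}, {"windows_os_cd"}))  # 2/3
```

Here the rule computer ⇒ Windows OS CD has support 50% (2 of 4 transactions) and confidence about 67% (2 of the 3 computer buyers also bought the CD).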
1.1.2 Clustering analysis [1]
In clustering we group data items in such a way that items within a cluster are more
similar to one another while items in different clusters are more dissimilar. These data
items are sometimes called data points. The focus of clustering is to maximize the
intra-class similarity and minimize the inter-class similarity. The main clustering methods
are: partitioning methods, hierarchical methods, density-based methods, grid-based
methods and model-based methods. With the help of clustering we can, for example, plan
a marketing strategy by dividing market areas into different zones according to climate or
customer behavior, so that each group is targeted differently.
1.1.3 Classification analysis [1]
In classification, with the help of the analysis of training data we develop a model which
is then used to predict the class of objects whose class label is unknown. The model is
trained so that it can distinguish between different data classes. The training data consists
of data objects whose class labels are known in advance. There are various representation
methods for the derived model, such as IF-THEN rules, decision trees, neural networks
and mathematical formulae.
The major difference between classification and clustering is that classification is
supervised and clustering is unsupervised. That means in classification the class label is
known in advance, while clustering does not assume any knowledge of clusters.
1.1.4 Deviation analysis [2]
Deviations are differences between the current data and previously defined normal
values. Deviation analysis is used to detect anomalies in datasets. It is very useful for
time-related data analysis, where we need to identify deviations that occur over time.
Deviation analysis tools are helpful in security systems, where authorities can be warned
about deviations in resource utilization by a particular user.
1.2 Architectures - Integrating mining with DBMS
There are various architectures [3] available for integrating the data mining process with
database management systems. These architectures are depicted in Figure 1.1 and
described briefly below:
1.2.1 Loose coupling (or cache mining)
This is an example of a multi-tier architecture. Mining applications are integrated into
the client or into the application server, depending on the architecture; the mining kernel
can be considered the application server. Data is first fetched from the database
management system into the mining kernel and then mined according to the user's needs.
Finally, the results are sent back to the DBMS; any intermediate results generated are
also stored back into the DBMS. In this approach the DBMS runs in a different address
space from the mining process. Cache-based mining is another type of loose coupling, in
which the data is read only once from the DBMS and cached into flat files on the local
disk for future processing.
1.2.2 Stored procedures and user defined functions
Mining logic is embedded as an application on the database server. There are two ways in
which a mining application can be stored on the database server side: stored procedures
and user-defined functions. The mining application and the DBMS execute in the same
address space. For example, in Oracle we can create PL/SQL stored procedures or Java
stored procedures for our mining algorithms and store these procedures in the database.
In IBM DB2 we can implement a mining algorithm with the help of user-defined
functions.
Figure 1.1: Different architectures for integrating mining within DBMS [3]
1.2.3 SQL based approach
Here the mining algorithm is presented in the form of SQL queries to the DBMS query
engine, where they are executed by the SQL query processor. A mining-aware optimizer
can be used to optimize these SQL queries. The DBMS provides support for
checkpointing and space management, which is very useful for such long-running
queries.
1.2.4 Integrated approach
In the integrated approach querying and mining are treated similarly. There is no
distinction between OLTP, OLAP or mining; the main goal is to get information from the
database in the most effective way. Here mining operators are an essential part of the
database query engine, and these mining operators or extended SQL are used for mining.
1.3 Database Partitioning and PLSQL

1.3.1 Database partitioning [15]
A database partition is a logical division of a database or its constructs, such as tables or
indexes, into distinct independent parts. Database partitioning is done mainly for the
following reasons:
• Performance
• Manageability
• Availability
A database can be partitioned in two ways:
• Building several smaller databases
• Splitting selected elements (splitting a table into various tables)
Partitioning itself can be done in two manners:
• In horizontal partitioning we put different rows in different tables (row-wise
partitioning).
• In vertical partitioning we put different columns in different tables (column-wise
partitioning).
In this thesis we are concerned with database mining, in which data is stored in
relational database management systems (e.g. Oracle). An RDBMS provides various
additional benefits that are lacking in file-based mining. SQL and PL/SQL stored
procedures [15, 20] are used for the implementation. For the experiments one synthetic
and two real-life datasets [21, 22] are used.
The goal of the thesis is to evaluate the performance of association rule mining
algorithms in the context of database partitioning. The thesis focuses on the apriori,
partition and sampling algorithms for frequent itemset mining when the data is
partitioned into a given number of segments. The apriori algorithm scans the database
multiple times to count the support of itemsets. The partitioning approach partitions the
database for mining frequent itemsets. The sampling algorithm uses a small sample
instead of the entire database for mining.
1.5 Thesis Organization
The structure of the rest of the thesis is as follows:
Chapter 2 presents the background of the various association rule mining approaches
developed so far. It covers in detail the association analysis and association rule mining
algorithms discussed in the thesis.
Chapter 3 discusses the performance analysis of the apriori algorithm when it is applied
with the partitioned approach.
Chapter 4 presents the performance analysis of the partition algorithm. It discusses the
TIDLIST approach for support counting and the K-way join second pass optimization.
Chapter 5 presents the sampling approach for frequent itemset mining.
Chapter 6 discusses the conclusions and future directions of the work done in the
thesis.
Chapter 2 Association Analysis
In this chapter the background of various association rule mining algorithms is discussed.
The chapter also covers in detail the association analysis and association rule mining
algorithms used in the thesis.
2.1 Background
Association rule mining was first introduced with the AIS algorithm [4] and was later
refined in [5]. Since the development of the AIS algorithm, various algorithms have been
proposed to improve the performance of association rule mining. Apriori [5] is the most
basic and most popular association rule mining algorithm, and most association rule
mining algorithms are based on it. The apriori algorithm scans the database multiple
times. The FP-tree (frequent-pattern tree) algorithm [6] builds a special tree structure in
main memory so that it can avoid multiple scans over the database. The turbo-charging
algorithm [7] improves performance with the help of data compression techniques.
The partition algorithm [8] is based on the apriori algorithm. It first partitions the data
into a number of non-overlapping partitions and processes each partition separately to
generate frequent itemsets local to that partition; finally it combines all the local frequent
itemsets to generate the global frequent itemsets. It reduces the number of complete
database scans to two and hence improves the performance of the mining algorithm.
The incremental mining algorithm [9] is another useful technique for speeding up the
mining process when new data is added to the database. The sampling algorithm [1, 10]
is also based on the apriori algorithm. Rather than mining the entire database, we draw a
random sample of data from the database and then find the frequent itemsets in that
sample instead of the entire database. Finally, the rest of the database is used to compute
the actual support of the frequent itemsets found in the sample.
Because we search for frequent itemsets in a sample, it is possible to miss some globally
frequent itemsets. To reduce this risk we use a support threshold lower than the minimum
support when mining the sample. In this way we trade off some degree of accuracy
against efficiency. There are various mechanisms for finding the frequent itemsets that
are missed in the sample.
Most of these algorithms are in-memory algorithms, in which data is read directly from
flat files, or first extracted from the database into flat files and then processed in main
memory. Most of them build specialized data structures and implement their own buffer
management schemes.
Since then, relatively few attempts have been made to build database-based mining
approaches. Various extensions to standard SQL have also been proposed; these
extensions allow the inclusion of mining operators in SQL. The data mining query
language (DMQL) [11] includes such mining operators for various types of mining tasks.
The work in [12] shows various architectural alternatives for coupling data mining with
relational database systems. The authors of [3] have also compared various SQL-based
approaches for association rule mining: SQL-92 based approaches, which use standard
SQL for mining, and SQL-OR based approaches, which use the object-relational
extensions to SQL. They have also implemented the apriori algorithm in the form of SQL
queries.
The work in [13] deals with the partitioned and incremental approaches for association
rule mining; the authors evaluated the basic k-way join algorithm in the context of
multiple databases and proposed two optimizations of the partitioned approach for
multi-database mining.
2.2 Association Rule Mining Algorithms

2.2.1 Terminology and Concepts [1]
Let I be the set of all items in the database D. Database D contains user transactions;
each transaction T contains a set of items such that T ⊆ I. Let X and Y be sets of items.
An association rule is of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = φ. Support
and confidence are two measures of rule interestingness.
The rule X ⇒ Y holds in the database D with support s, where s is the percentage of
transactions in D that contain X ∪ Y. The rule has confidence c if c is the percentage of
transactions in D containing X which also contain Y, i.e.
Support (X ⇒ Y) = P (X ∪ Y)
Confidence (X ⇒ Y) = P (Y | X)
The rules that satisfy both the user-specified minimum support and minimum confidence
are said to be strong association rules.
A set of items is called an itemset [1]. An itemset that contains k items is called a
k-itemset. The occurrence frequency of an itemset is the number of transactions that
contain the itemset; this is also known as the frequency or support count of the itemset.
An itemset satisfies minimum support if its occurrence frequency is greater than or equal
to the product of the minimum support and the total number of transactions in the
database. The number of transactions required for an itemset to satisfy minimum support
is referred to as the minimum support count. An itemset that satisfies minimum support is
called a frequent (or large) itemset.
An association rule mining algorithm is divided into two steps:
• Frequent itemset generation, i.e. finding all the itemsets having support greater
than or equal to the user-specified minimum support.
• Rule generation, in which the frequent itemsets generated in step 1 are used to
generate association rules that satisfy the user-specified minimum confidence.
The first step is more complex and requires more effort. Once the frequent itemsets are
generated, strong association rule generation is simple. Strong association rules satisfy
both minimum support and minimum confidence.
Confidence (X ⇒ Y) = P (Y | X) = support-count (X ∪ Y) / support-count (X)
where support-count (X ∪ Y) is the total number of transactions containing the itemset
{X, Y} and support-count (X) is the total number of transactions containing the itemset {X}.
Association rules are generated as follows:
• For every frequent itemset x, generate all non-empty proper subsets of x.
• For every non-empty subset s of x, output the rule s ⇒ (x − s) if
support-count (x) / support-count (s) is greater than or equal to the minimum
confidence.
Since the association rules are generated directly from frequent itemsets, each rule
automatically satisfies minimum support.
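The two-step rule generation above can be sketched in Python. The `frequent` dictionary is assumed to map each frequent itemset to its support count (the counts below are reconstructed from the confidence values listed in the example of Section 2.2.2); since every subset of a frequent itemset is itself frequent, the confidence denominator is always available:

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Generate strong rules s -> (x - s) from frequent itemsets.

    `frequent` maps frozenset itemsets to their support counts. Every
    non-empty proper subset of a frequent itemset is itself frequent,
    so its count is available for the confidence denominator.
    """
    rules = []
    for x, x_count in frequent.items():
        if len(x) < 2:
            continue  # 1-itemsets yield no rules
        for r in range(1, len(x)):
            for s in map(frozenset, combinations(x, r)):
                conf = x_count / frequent[s]
                if conf >= min_conf:
                    rules.append((set(s), set(x - s), conf))
    return rules

# Support counts reconstructed from the Section 2.2.2 example.
frequent = {
    frozenset({1}): 4, frozenset({2}): 6, frozenset({3}): 3,
    frozenset({1, 2}): 4, frozenset({1, 3}): 2, frozenset({2, 3}): 3,
    frozenset({1, 2, 3}): 2,
}
for s, rhs, conf in generate_rules(frequent, min_conf=0.66):
    print(s, "->", rhs, round(conf, 2))
```

Note that the sketch also emits rules derived from the frequent 2-itemsets, not only from {1, 2, 3}, since the procedure applies to every frequent itemset of size two or more.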
2.2.2 Example of association rules
Table 2.1 depicts an example transaction database and Table 2.2 shows that
{1, 2, 3} and {1, 2, 5} are frequent 3-itemsets. The non-empty proper subsets of {1, 2, 3}
are {1}, {2}, {3}, {1, 2}, {1, 3} and {2, 3}. The association rules generated are:
{1, 2} ⇒ {3} confidence = 2/4 = 50%
{1, 3} ⇒ {2} confidence = 2/2 = 100%
{2, 3} ⇒ {1} confidence = 2/3 = 66%
{1} ⇒ {2, 3} confidence = 2/4 = 50%
{2} ⇒ {1, 3} confidence = 2/6 = 33%
{3} ⇒ {1, 2} confidence = 2/3 = 66%
If the minimum confidence is equal to 66%, then the following rules are strong rules:
{1, 3} ⇒ {2}, {2, 3} ⇒ {1} and {3} ⇒ {1, 2}.
Table 2.4: Tidlists for 2-itemsets

2.2.6 Sampling Algorithm [10, 14, 16]
Various sampling algorithms for association rule mining have been proposed in
[10, 14, 16]. Among them the sampling algorithm proposed in [10] has the best
performance. The algorithm picks a random sample from the database and then finds the
frequent itemsets in the sample, using a support threshold lower than the user-specified
minimum support for the database. These frequent itemsets are denoted by S. The
algorithm then finds the negative border [10] of these itemsets, denoted NBd (S). The
negative border is the set of itemsets that are candidate itemsets but did not satisfy
minimum support; simply, NBd (Fk) = Ck − Fk. After that, for each itemset X in
S ∪ NBd (S), it checks whether X is a frequent itemset in the entire database by scanning
the database [1, 17].
If NBd (S) contains no frequent itemsets, then all the frequent itemsets have been found.
If NBd (S) contains frequent itemsets, then the algorithm constructs a set of candidate
itemsets CG by expanding the negative border of S ∪ NBd (S) until the negative border is
empty. For each itemset X in CG the algorithm then scans the database a second time. In
the best case, when all the frequent itemsets are found in the sample, the algorithm
requires only one scan over the database; in the worst case it requires two scans [1, 17].
The performance of the sampling algorithm relies on the quality of the sample chosen. If
a bad sample is chosen, the number of candidates generated for the second scan may be
very large, making the second scan inefficient. The sample can also be a partition of the
database; in that case the partition is treated just like a randomly chosen sample.
The sampling algorithm [10] is depicted below:

s = Draw_random_sample (D);
// generate frequent itemsets for the sample drawn
S = generate_frequent_itemsets (s, low_support);
// count support in the database D for the itemsets generated in the sample
// together with their negative border
F = {X ∈ S ∪ NBd (S) | X.count >= minsup};
// if NBd (S) contains frequent itemsets, expand the border
Repeat
    S = S ∪ NBd (S);
Until S does not grow;
// another scan of D
F = {X ∈ S | X.count >= minsup};
Output F; // frequent itemsets in the database D
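The common, best-case path of the algorithm (one full scan) can be sketched in Python under the assumption that the transactions fit in memory. The brute-force `frequent_itemsets` stands in for a real apriori implementation, and `sample_frac`, `lowering` and `seed` are illustrative choices; the worst-case second scan with expanded candidates is not shown:

```python
import random
from itertools import combinations

def count_itemsets(transactions, itemsets):
    """One database scan: support count for each candidate itemset."""
    counts = {x: 0 for x in itemsets}
    for t in transactions:
        for x in counts:
            if x <= t:
                counts[x] += 1
    return counts

def frequent_itemsets(transactions, min_count, max_size=3):
    """Brute-force level-wise mining (stands in for apriori here)."""
    freq = {}
    for k in range(1, max_size + 1):
        counts = {}
        for t in transactions:
            for c in combinations(sorted(t), k):
                x = frozenset(c)
                counts[x] = counts.get(x, 0) + 1
        level = {x: n for x, n in counts.items() if n >= min_count}
        if not level:
            break
        freq.update(level)
    return set(freq)

def negative_border(S, items, max_size=3):
    """NBd(S): itemsets outside S whose every non-empty proper subset is in S."""
    border = set()
    for k in range(1, max_size + 1):
        for c in combinations(sorted(items), k):
            x = frozenset(c)
            subs = [frozenset(s) for s in combinations(c, k - 1) if s]
            if x not in S and all(s in S for s in subs):
                border.add(x)
    return border

def sampling_mine(transactions, minsup, sample_frac=0.5, lowering=0.8, seed=1):
    """Best-case path: mine the sample at a lowered support, then one full
    scan of S union NBd(S). If NBd(S) turned out to contain frequent
    itemsets, a second scan with expanded candidates would be needed."""
    sample = random.Random(seed).sample(transactions,
                                        int(len(transactions) * sample_frac))
    low_count = max(1, int(len(sample) * minsup * lowering))
    S = frequent_itemsets(sample, low_count)
    items = {i for t in transactions for i in t}
    candidates = S | negative_border(S, items)
    counts = count_itemsets(transactions, candidates)
    min_count = minsup * len(transactions)
    return {x for x, n in counts.items() if n >= min_count}
```

On a bad sample the candidate set `S | NBd(S)` can miss globally frequent itemsets, which is exactly the accuracy/efficiency trade-off described above.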
Chapter 3
Apriori Algorithm for Association Rule Mining
This chapter presents the performance analysis of the apriori algorithm [5] for association
rule mining in the context of the partitioning approach. For support counting the K-way
join approach [2] is used. The algorithm executes in two phases. In the first phase the
database (or dataset) is partitioned into a given number of partitions and the local
frequent itemsets for each partition are generated using the minimum support count for
that partition. All the local frequent itemsets are then combined into the following two
sets:
• Global frequent itemsets
• Global candidate itemsets
In the second phase the support of the global candidate itemsets is counted over the
entire database. Itemsets meeting the minimum support are frequent in the entire database
and are therefore added to the set of global frequent itemsets. The algorithm scans the
database multiple times; the TIDLIST data structure is not used for support counting.
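The two phases can be sketched as follows. This is an illustrative reading of how the local results split into the two sets (an itemset that is frequent in every partition is already globally frequent, since its counts sum to at least the global minimum; the rest become global candidates), with a brute-force miner standing in for a full apriori pass:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count, max_size=2):
    """Stand-in for an apriori pass over one partition."""
    freq = {}
    for k in range(1, max_size + 1):
        counts = {}
        for t in transactions:
            for c in combinations(sorted(t), k):
                x = frozenset(c)
                counts[x] = counts.get(x, 0) + 1
        freq.update({x: n for x, n in counts.items() if n >= min_count})
    return freq

def partitioned_apriori(transactions, minsup, n_parts=2):
    # Phase 1: mine each partition with a proportional minimum support count.
    size = len(transactions) // n_parts
    parts = [transactions[i * size:(i + 1) * size] for i in range(n_parts - 1)]
    parts.append(transactions[(n_parts - 1) * size:])
    local = [set(frequent_itemsets(p, minsup * len(p))) for p in parts]
    # Frequent in every partition -> already globally frequent;
    # frequent in only some partitions -> global candidate, must be re-counted.
    global_frequent = set.intersection(*local)
    candidates = set().union(*local) - global_frequent
    # Phase 2: scan the full database for the remaining candidates.
    min_count = minsup * len(transactions)
    for x in candidates:
        if sum(1 for t in transactions if x <= t) >= min_count:
            global_frequent.add(x)
    return global_frequent
```

The sketch uses equal-sized contiguous partitions; the experiments in this chapter likewise partition the loaded dataset into the given number of segments.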
The experiments are done on Oracle 10g RDBMS installed on Microsoft Windows XP
with 1 GB of RAM and a 2.40 GHz processor. Each experiment is performed several
times and the best result is taken.
3.1 Datasets for Experiments
Datasets are needed for the experiments. Some synthetic and real-life datasets [21, 22]
have been collected from the internet. Synthetic datasets are generated through a
synthetic dataset generation utility. Real-life datasets consist of real transactions on retail
items, collected over the years for the purpose of analysis. These datasets were stored in
flat files and had to be transferred into database tables. To load them into the Oracle
database, the SQL*Loader utility [15] provided by Oracle RDBMS was used; this
exploits the functionality provided by the DBMS and saves the unnecessary effort that
would otherwise have been spent writing a program to load the data. After the data had
been loaded into the database, it was converted into the format suitable for the
algorithms.
The details of the datasets used in the thesis are given in Table 3.1.
3.2 Performance analysis
Table 3.2 shows the total number of records, the total number of distinct transactions and
the distinct items contained in the partitions of the BMSWEBVIEW1 dataset.

Figure 3.1: Performance of apriori for BMSWEBVIEW1 dataset (2 partitions)

Figure 3.1 shows the performance of the apriori algorithm on the BMSWEBVIEW1
dataset for two partitions. It is evident from the figure that as the support value increases,
the time taken by the algorithm decreases.
Table 3.3 and Table 3.4 show the total number of input records, the total number of
distinct transactions and the distinct items contained in the different partitions of the
BMSWEBVIEW1 dataset for 3 and 4 partitions respectively.
Figure 3.6 depicts the performance of the apriori algorithm on the MUSHROOM dataset
for 2 partitions and for different support values. As the support increases from 0.15% to
1.0%, the time taken by the algorithm decreases, except at the 0.60% support value;
however, the decrease in time is not significant compared to the decrease in support.
Table 3.8 shows the candidate and frequent itemsets generated for the first partition of the
MUSHROOM dataset. Figure 3.6 shows the time only up to frequent 3-itemsets:
candidate 4-itemsets were not generated even after running the algorithm for two
additional hours after the generation of the frequent 3-itemsets. In later passes, larger
candidate and frequent itemsets are generated, which require more time for support
counting than earlier passes. For example, in the case of the 2.0% support value the size
of C4 is more than 5 times the size of C3 (C3 = 25113, C4 = 127227); generating C4 from
F3 and counting the support of C4 to produce F4 is more time consuming than the
previous pass.
Figure 3.6: Performance of apriori for MUSHROOM dataset (2 partitions)
support   F1    C2     F2     C3      F3      C4       F4
0.15%     115   6555   3255   49907   44244   -        -
0.30%     111   6105   3023   45400   39846   -        -
0.45%     104   5356   2810   41832   36324   -        -
1.0%      94    4371   2413   34091   28475   -        -
2.0%      89    3916   2008   25113   20487   127227   119150

Table 3.8: Itemsets in MUSHROOM dataset (1st partition)
The main reason for the poor performance on the MUSHROOM dataset is the average
number of items per transaction. As the average number of items per transaction
increases, the number of frequent itemsets generated in each pass also increases, and
hence support counting requires more time.
Chapter 4 Partitioned Algorithm for Association Rule
Mining
In this chapter we present the performance analysis of the partition algorithm [8]. The
partition algorithm finds all frequent itemsets in just two scans over the database. In the
first scan it partitions the database into a given number of partitions and finds all the
local frequent itemsets. It then merges all the local frequent itemsets to form the global
candidate itemsets. In the second scan over the database it counts the support of the
global candidate itemsets in the entire database and outputs the global frequent itemsets.
The experiments are done on Oracle 10g RDBMS installed on Microsoft Windows XP
with 1 GB of RAM and a 2.40 GHz processor. Each experiment is performed several
times and the best result is taken.
4.1 Performance analysis of Partition Algorithm
For the purpose of support counting the partition algorithm builds a special data structure
called a Tidlist, created as a CLOB (character large object). Table 2.3 and Table 2.4 show
examples of Tidlists for 1-itemsets and 2-itemsets respectively. Figure 4.1 shows the
Tidlist creation time for the different datasets.
Figure 4.2 compares the time taken by the partition algorithm on the BMSWEBVIEW1
dataset with 2 partitions for different support values on Oracle RDBMS. The time shown
includes the Tidlist creation time plus the time taken by the partition algorithm for
frequent itemset generation. For lower support values the algorithm takes more time
because it generates too many candidate itemsets, which are then tested for minimum
support. The algorithm takes two scans over the database.
Figure 4.1: Tidlist creation time for different datasets
Figure 4.2: Performance of partition for BMSWEBVIEW1 dataset (2 partitions)
Table 3.2 shows the total number of records, the total number of distinct transactions and
the distinct items contained in the partitions of the BMSWEBVIEW1 dataset. Figure 4.3
and Figure 4.4 show the performance of the algorithm for 3 and 4 partitions respectively
on the BMSWEBVIEW1 dataset, and Table 3.3 and Table 3.4 describe the
BMSWEBVIEW1 dataset with 3 and 4 partitions respectively. For 0.15% support, the
time taken by the partition algorithm for 3 partitions is more than that for 2 partitions.
Figure 4.3: Performance of partition for BMSWEBVIEW1 dataset (3 partitions)
Figure 4.4: Performance of partition for BMSWEBVIEW1 dataset (4 partitions)
The partition algorithm [13] uses Tidlists for support counting. For a particular partition,
in the first pass the Tidlists for all the local frequent 1-itemsets are created; frequent
2-itemsets are then generated by intersecting the Tidlists of the corresponding frequent
1-itemsets. The intersection of two Tidlists is a very time-consuming process, and the
overall performance of the partition algorithm depends mainly on the second pass:
because the set of candidate 2-itemsets C2 is very large, the time taken by the support
counting process is very high. Table 4.1 shows the candidate 2-itemsets generated for
BMSWEBVIEW1 and T10I4D100K with 4 partitions at 0.45% minimum support.
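Representing each Tidlist as a set of transaction IDs makes the intersection-based support counting easy to sketch; the item names and IDs below are invented:

```python
# Tidlist-based support counting: the support count of an itemset is the
# size of the intersection of its members' tidlists. Toy tidlists (invented).
tidlists = {
    "A": {1, 2, 3, 5, 7},
    "B": {2, 3, 5, 8},
    "C": {1, 3, 5, 9},
}

def support_count(itemset, tidlists):
    """Intersect the tidlists of every item in `itemset`."""
    ids = set.intersection(*(tidlists[i] for i in itemset))
    return len(ids)

print(support_count({"A", "B"}, tidlists))       # 3  -> tids {2, 3, 5}
print(support_count({"A", "B", "C"}, tidlists))  # 2  -> tids {3, 5}
```

Each candidate 2-itemset costs one such intersection, which is why a very large C2, as in Table 4.1, dominates the running time; on the CLOB-based Tidlists used here the same operation is far more expensive than on in-memory sets.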
              Partition-1   Partition-2   Partition-3   Partition-4
T10I4D100K    180300        178503        172578        182106
BMSWEBVIEW1   13530         13203         13695         13530

Table 4.1: Candidate 2-itemsets (C2) for 0.45% support for 4 partitions

The size of the MUSHROOM dataset is about 3 MB, equal to the size of the
BMSWEBVIEW1 dataset. However, for the partition algorithm the MUSHROOM
dataset takes more time than BMSWEBVIEW1, because the average number of items per
transaction in MUSHROOM is much higher; hence the Tidlist intersections for the
second-pass support counting (generation of F2) are more time consuming in this case.
The MUSHROOM results are not shown in the thesis because they are not acceptable for
any number of partitions.
Table 3.6 shows the total number of records, the total number of distinct transactions and
the distinct items contained in the partitions of the T10I4D100K dataset for 4 partitions.
Figure 4.5 shows the performance of the partition algorithm on the T10I4D100K dataset
for four partitions. It is evident from the figure that the time taken by the algorithm is
very high and is not acceptable. The very poor performance on T10I4D100K is due to
the fact that a very large number of candidate 2-itemsets are generated and their support
counting takes too much time.
The partition algorithm does not scale well in the case of relational database systems. It
is mainly a main-memory algorithm and works fine in the case of main-memory
databases.
[Bar chart: time in seconds (0–70,000) vs. minimum support (0.30%, 0.45%, 0.60%)]
Figure 4.5: Performance of partition for T10I4D100K dataset (4 partitions)
4.2 Partition algorithm with second optimization (SPO)
In this section we discuss the combination of the partition algorithm and the second-pass
optimization of the K-Way join approach.
4.2.1 Second pass optimization of K-Way Join approach for support counting [2, 13]
As we have seen earlier, the size of C2 is very large, and second-pass support counting is
therefore the most time consuming compared to the other passes. The process of generating
C2 from F1 and then counting support for C2 to produce F2 can be replaced by generating F2
directly with a single read of the database.
This is shown below:
Insert into F2 select T1.item, T2.item, count (*)
From I_Table T1, I_Table T2
Where T1.tid = T2.tid and T1.item < T2.item
Group by T1.item, T2.item
Having count (*) > minsup;
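The SQL above can be tried end to end with Python's built-in sqlite3 module standing in for Oracle; this is a minimal sketch, with invented data and a made-up support threshold, using the thesis's I_Table(tid, item) schema.

```python
# Runnable sketch of the second-pass optimization: F2 is produced directly
# from a self-join on tid, skipping explicit C2 generation and Tidlist
# intersection. sqlite3 stands in for Oracle; data and minsup are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE I_Table (tid INTEGER, item INTEGER)")
conn.executemany(
    "INSERT INTO I_Table VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 2), (3, 3)],
)

minsup = 2
rows = conn.execute(
    """
    SELECT T1.item, T2.item, COUNT(*)
    FROM I_Table T1, I_Table T2
    WHERE T1.tid = T2.tid AND T1.item < T2.item
    GROUP BY T1.item, T2.item
    HAVING COUNT(*) >= ?
    """,
    (minsup,),
).fetchall()
print(rows)  # each row is (item1, item2, support)
```

The `T1.item < T2.item` condition ensures each 2-itemset is counted once, in canonical order, rather than twice.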
4.2.2 The Approach
The partition algorithm is implemented in combination with the second-pass optimization of
the K-Way join approach. In this approach, too, the database is scanned twice: first when
the Tidlist is generated for each partition, and second when F2 is generated by the
second-pass optimization in each partition. The rest of the partition process remains the
same. All the local frequent itemsets are combined to generate the global candidate
itemsets. In the second phase the database is not scanned for the final support counting;
instead, the counts generated in the first phase are used along with the Tidlists. The
approach gives better results than the partition algorithm because in the second-pass
support counting (F2 generation) for each partition the Tidlists are not intersected,
which is the most time-consuming step in the entire support counting process.
The SQL script that generates the global frequent 3-itemsets for two partitions is shown below:
Insert into Global_F3
(
Select item1, item2, item3, sum (count) from
(
Select item1, item2, item3, count
From tidt_c3
Union all
Select item1, item2, item3, count
From tidt_cc3
)
Group by item1, item2, item3
Having sum (count)>=179
);
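The merge performed by that script (UNION ALL of the per-partition counts, then SUM and a global threshold) can be expressed compactly in plain Python. The itemsets and counts below are invented; only the threshold of 179 is taken from the script above.

```python
# Sketch of the global-count merge: per-partition support counts of candidate
# 3-itemsets are summed, and itemsets meeting the global minimum support count
# survive. The two dicts stand in for tidt_c3 and tidt_cc3; data is invented.
from collections import Counter

partition1 = {(1, 2, 3): 100, (1, 2, 5): 90, (2, 3, 5): 60}
partition2 = {(1, 2, 3): 85, (1, 2, 5): 80, (2, 3, 5): 40}

min_count = 179
global_counts = Counter(partition1)
global_counts.update(partition2)  # equivalent of UNION ALL + SUM(count)

global_f3 = {k: v for k, v in global_counts.items() if v >= min_count}
print(global_f3)
```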
4.3 Performance Comparisons
Figure 4.6 shows the performance comparison between the partition algorithm and the
partition algorithm with the second-pass optimization for BMSWEBVIEW1 for 2 partitions.
The details of the partitions are described in Table 3.2.
[Bar chart: time in seconds (0–4,000) vs. minimum support (0.15%, 0.30%, 0.45%, 0.60%); series: Partition, Partition with SPO]
Figure 4.6: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (2 partitions)
Figure 4.7 shows the performance comparison between the partition algorithm and the
partition algorithm with the second-pass optimization for BMSWEBVIEW1 for 3 partitions.
The details of the partitions are described in Table 3.3.
[Bar chart: time in seconds (0–4,500) vs. minimum support (0.15%, 0.30%, 0.45%, 0.60%); series: Partition, Partition with SPO]
Figure 4.7: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (3 partitions)
Figure 4.8 shows the performance comparison between the partition algorithm and the
partition algorithm with the second-pass optimization for BMSWEBVIEW1 for 4 partitions.
The details of the partitions are described in Table 3.4.
[Bar chart: time in seconds vs. minimum support (0.15%, 0.30%, 0.45%, 0.60%); series: Partition, Partition with SPO]
Figure 4.8: Performance comparisons of partition and partition with SPO algorithm for BMSWEBVIEW1 dataset (4 partitions)
Table 4.2 shows the improvement in performance of the partition algorithm with SPO over
the partition algorithm for the BMSWEBVIEW1 dataset. In general, for all partition counts,
the performance gain increases tremendously as we move from 0.15% to 0.60% support.
Support    2 Partitions        3 Partitions        4 Partitions
0.15%      Approx. 5 times     Approx. 2.8 times   Approx. 3 times
0.30%      Approx. 7 times     Approx. 6.5 times   Approx. 6.5 times
0.45%      Approx. 9.5 times   Approx. 8 times     Approx. 7.5 times
0.60%      Approx. 9 times     Approx. 9.4 times   Approx. 8 times
Table 4.2: Performance comparison
[Bar chart: time in seconds (0–70,000) vs. minimum support (0.30%, 0.45%, 0.60%); series: Partition, Partition with SPO]
Figure 4.9: Performance comparisons of partition and partition with SPO algorithm for T10I4D100K dataset (4 partitions)
Figure 4.9 shows the performance comparison for the T10I4D100K dataset. Table 3.6 shows
the total number of records, the number of distinct transactions and the number of distinct
items contained in the partitions of the T10I4D100K dataset for 4 partitions.
The T10I4D100K dataset showed very poor performance with the partition algorithm (Figure
4.5). With the second-pass optimization, the algorithm performs approximately 17 times,
30 times and 45 times better than the partition algorithm for the 0.30%, 0.45% and 0.60%
support values respectively.
Chapter 5 Sampling Algorithm for Association Rule Mining
This chapter presents a performance analysis of the Sampling algorithm [10] for
association rule mining.
The sampling algorithm is implemented in the context of the partition algorithm. The
algorithm first partitions the database into a number of partitions and then takes one
partition as a sample. It finds all the local frequent itemsets in the sample using a
reduced minimum support for that sample. These local frequent itemsets, along with their
negative border, are then tested against the entire dataset using the full minimum
support. Itemsets satisfying the minimum support are frequent itemsets in the entire
database. Only if the negative border of the local frequent itemsets contains itemsets
that are frequent in the entire database does the algorithm scan the database a second
time, to find the frequent itemsets that were missed.
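The sample-then-verify flow just described can be sketched as follows. This is a high-level illustration with a brute-force toy miner and invented data, not the thesis implementation, and the negative-border check that may trigger a second database scan is omitted for brevity.

```python
# Sketch of the sampling algorithm: mine one partition at a reduced support,
# then verify the locally frequent itemsets against the whole database.
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """All itemsets (as frozensets) meeting min_count, by brute force."""
    result = {}
    items = sorted({i for t in transactions for i in t})
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count >= min_count:
                result[frozenset(combo)] = count
                found = True
        if not found:          # no frequent itemset of this size: stop growing
            break
    return result

database = [
    {"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "c"},
    {"a", "b", "c"}, {"a"}, {"b"},
]
sample = database[:4]          # one partition used as the sample

# Mine the sample with a reduced minimum support ...
local_frequent = frequent_itemsets(sample, min_count=2)

# ... then count each locally frequent itemset over the entire database
# and keep those that satisfy the full minimum support.
min_count_db = 3
global_frequent = {s: sum(1 for t in database if s <= t) for s in local_frequent}
global_frequent = {s: c for s, c in global_frequent.items() if c >= min_count_db}
print(global_frequent)
```

In the full algorithm the negative border of `local_frequent` would be counted in the same pass; a second scan is needed only if some border itemset turns out to be globally frequent.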
The experiments are performed on the Oracle 10g relational database management system
installed on Microsoft Windows XP with 1 GB of RAM and a 2.40 GHz processor. Each
experiment is performed several times and the best result is taken.
5.1 The Negative Border
For any pass k, the negative border [10] is the set of candidate itemsets that are not
frequent in that pass; that is, for any pass k the negative border NBd(Fk) equals Ck - Fk,
where Ck and Fk are the set of candidate k-itemsets and the set of frequent k-itemsets
respectively. The tables below show the candidate itemsets, frequent itemsets and negative
border for the second pass.
Table 5.1, Table 5.2 and Table 5.3 show examples of the candidate 2-itemsets C2, the
negative border of the frequent 2-itemsets NBd(F2) and the frequent 2-itemsets F2
respectively.
ITEM1  ITEM2
1      2
1      3
1      5
2      3
3      5
3      4
2      4
Table 5.1: Candidate itemset C2

ITEM1  ITEM2
1      5
3      5
3      4
2      4
Table 5.2: Negative Border NBd(F2)

ITEM1  ITEM2  SUPPORT
1      2      4
1      3      4
2      3      4
Table 5.3: Frequent itemset F2
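The definition NBd(Fk) = Ck - Fk can be checked directly on this example; the sets below are a transcription of Tables 5.1 and 5.3, not new data.

```python
# Negative border NBd(F2) = C2 - F2, computed for the example of
# Tables 5.1-5.3 (pairs transcribed from the tables above).
c2 = {(1, 2), (1, 3), (1, 5), (2, 3), (3, 5), (3, 4), (2, 4)}
f2 = {(1, 2), (1, 3), (2, 3)}   # each with support 4 in Table 5.3

nbd_f2 = c2 - f2
print(sorted(nbd_f2))
```

The result matches Table 5.2: the four candidate pairs that failed the support threshold.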
5.2 Performance analysis of the Sampling algorithm
The tables below show the different partition sizes taken as samples for the analysis of
the sampling algorithm for the BMSWEBVIEW1 and T10I4D100K datasets respectively. They also
show the number of distinct transactions and distinct items contained in each sample. The
last column of each table shows the size of the sample as a percentage of the dataset.