A New Incremental Data Mining Algorithm Using Pre-large Itemsets*

Tzung-Pei Hong**
Department of Information Management, I-Shou University
Kaohsiung, 84008, Taiwan, R.O.C.
[email protected]
http://www.nuk.edu.tw/tphong

Ching-Yao Wang
Institute of Computer and Information Science, National Chiao-Tung University
Hsinchu, 300, Taiwan, R.O.C.
[email protected]

Yu-Hui Tao
Department of Information Management, I-Shou University
Kaohsiung, 84008, Taiwan, R.O.C.
[email protected]
Due to the increasing use of very large databases and data warehouses, mining
useful information and helpful knowledge from transactions is evolving into an
important research area. In the past, researchers usually assumed databases were static
to simplify data mining problems. Thus, most of the classic algorithms proposed
focused on batch mining, and did not utilize previously mined information in
incrementally growing databases. In real-world applications, however, developing a
mining algorithm that can incrementally maintain discovered information as a
database grows is quite important. In this paper, we propose the concept of pre-large
itemsets and design a novel, efficient, incremental mining algorithm based on it. Pre-
large itemsets are defined by a lower support threshold and an upper support
threshold. They act as buffers to avoid the movement of itemsets directly from large to small and vice versa. The proposed algorithm does not need to rescan the original database until a certain number of new transactions have been inserted. If the database has grown larger, then the number of new transactions allowed before a rescan will be larger too.
Keywords: data mining, association rule, large itemset, pre-large itemset,
incremental mining.
1. Introduction
Years of effort in data mining have produced a variety of efficient techniques.
Depending on the type of databases processed, these mining approaches may be
classified as working on transaction databases, temporal databases, relational
databases, and multimedia databases, among others. On the other hand, depending on
the classes of knowledge derived, the mining approaches may be classified as finding
association rules, classification rules, clustering rules, and sequential patterns [4],
among others. Among them, finding association rules in transaction databases is most
commonly seen in data mining [1][3][5][9][10][12][13][15][16].
In the past, many algorithms for mining association rules from transactions were
proposed, most of which were executed in level-wise processes. That is, itemsets
containing single items were processed first, then itemsets with two items were
processed, then the process was repeated, continuously adding one more item each
time, until some criteria were met. These algorithms usually considered the database
size static and focused on batch mining. In real-world applications, however, new
records are usually inserted into databases, and designing a mining algorithm that can
maintain association rules as a database grows is thus critically important.
When new records are added to databases, the original association rules may
become invalid, or new implicitly valid rules may appear in the resulting updated
databases [7][8][11][14][17]. In these situations, conventional batch-mining
algorithms must re-process the entire updated databases to find final association rules.
Two drawbacks may exist for conventional batch-mining algorithms in maintaining
database knowledge:
(a) Nearly the same computation time as that spent in mining from the original
database is needed to cope with each new transaction. If the original database
is large, much computation time is wasted in maintaining association rules
whenever new transactions are generated.
(b) Information previously mined from the original database, such as large
itemsets and association rules, provides no help in the maintenance process.
Cheung and his co-workers proposed an incremental mining algorithm, called
FUP (Fast UPdate algorithm) [7], for incrementally maintaining mined association
rules and avoiding the shortcomings mentioned above. The FUP algorithm modifies
the Apriori mining algorithm [3] and adopts the pruning techniques used in the DHP
(Direct Hashing and Pruning) algorithm [13]. It first calculates large itemsets mainly
from newly inserted transactions, and compares them with the previous large itemsets
from the original database. According to the comparison results, FUP determines
whether re-scanning the original database is needed, thus saving some time in
maintaining the association rules. Although the FUP algorithm can indeed improve
mining performance for incrementally growing databases, original databases still need
to be scanned when necessary. In this paper, we thus propose a new mining algorithm
based on two support thresholds to further reduce the need for rescanning original
databases. Since rescanning a database consumes considerable computation time, the proposed algorithm can thus greatly reduce maintenance cost.
The remainder of this paper is organized as follows. The data mining process is
introduced in section 2. The maintenance of association rules is described in section 3.
The FUP algorithm is reviewed in section 4. A new incremental mining algorithm is
proposed in section 5. An example is also given there to illustrate the proposed
algorithm. Conclusions are summarized in section 6.
2. The Data Mining Process Using Association Rules
Data mining plays a central role in knowledge discovery. It involves applying
specific algorithms to extract patterns or rules from data sets in a particular
representation. Because of its importance to KDD, data mining has attracted many researchers in the database and machine-learning fields, since it offers opportunities to discover useful information and important relevant patterns in
large databases, thus helping decision-makers analyze data easily and make good
decisions regarding the domains in question.
One application of data mining is to induce association rules from transaction
data, such that the presence of certain items in a transaction will imply the presence of
certain other items. To achieve this purpose, Agrawal and his co-workers proposed
several mining algorithms based on the concept of large itemsets to find association
rules in transaction data [1][3][5]. They divided the mining process into two phases.
In the first phase, candidate itemsets were generated and counted by scanning the
transaction data. If the count of an itemset appearing in the transactions was larger
than a pre-defined threshold value (called the minimum support), the itemset was
considered a large itemset. Itemsets containing only one item were processed first.
Large itemsets containing only single items were then combined to form candidate
itemsets containing two items. This process was repeated until all large itemsets had
been found. In the second phase, association rules were induced from the large
itemsets found in the first phase. All possible association combinations for each large
itemset were formed, and those with calculated confidence values larger than a
predefined threshold (called the minimum confidence) were output as association
rules. We may summarize the data mining process we focus on as follows:
1. Determine user-specified thresholds, including the minimum support
value and the minimum confidence value.
2. Find large itemsets in an iterative way. The support of a large itemset must exceed or equal the minimum support value.
3. Utilize the large itemsets to generate association rules, whose confidence
must exceed or equal the minimum confidence value.
Below, we use a simple example to illustrate the mining process. Suppose a
database with five transactions shown in Table 1 is to be mined. The database has two
features, transaction identification (TID) and transaction description (Items).
Table 1. An example of a transaction database
TID    Items
100    BE
200    ABD
300    AD
400    BCE
500    ABDE
Assume the user-specified minimum support and minimum confidence are 40%
and 80%, respectively. The transaction database is first scanned to count the candidate
1-itemsets. The results are shown in Table 2.
Table 2. Candidate 1-itemsets
Item    Count
A       3
B       4
C       1
D       3
E       3
Since the counts of the items A, B, D and E are no less than 2 (5 × 40% = 2), they are put into the set of large 1-itemsets. The candidate 2-itemsets are then formed from
these large 1-itemsets as shown in Table 3.
Table 3. Candidate 2-itemsets with counts
Items    Count
AB       2
AD       3
AE       1
BD       2
BE       3
DE       1
AB, AD, BD and BE then form the set of large 2-itemsets. In a similar way,
ABD can be found to be a large 3-itemset.
Next, the large itemsets are used to generate association rules. According to the conditional probabilities, the possible association rules generated are shown in Table 4.
Table 4. Possible association rules
Rule             Confidence
IF AB, Then D    Count(ABD)/Count(AB) = 1
IF AD, Then B    Count(ABD)/Count(AD) = 2/3
IF BD, Then A    Count(ABD)/Count(BD) = 1
IF A, Then B     Count(AB)/Count(A) = 2/3
IF B, Then A     Count(AB)/Count(B) = 2/4
IF A, Then D     Count(AD)/Count(A) = 1
IF D, Then A     Count(AD)/Count(D) = 1
IF B, Then D     Count(BD)/Count(B) = 2/4
IF D, Then B     Count(BD)/Count(D) = 2/3
IF B, Then E     Count(BE)/Count(B) = 3/4
IF E, Then B     Count(BE)/Count(E) = 1
Since the user-specified minimum confidence is 80%, the final association rules
are shown in Table 5.
Table 5. The final association rules for this example
Rule             Confidence
IF AB, Then D    Count(ABD)/Count(AB) = 1
IF BD, Then A    Count(ABD)/Count(BD) = 1
IF A, Then D     Count(AD)/Count(A) = 1
IF D, Then A     Count(AD)/Count(D) = 1
IF E, Then B     Count(BE)/Count(E) = 1
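To make the two-phase process concrete, the following minimal Python sketch reproduces the computation above. It is an illustrative reconstruction of a plain Apriori-style search, not the authors' code, and the function and variable names are our own; exact fractions are used so that boundary comparisons such as "count = 2 versus 5 × 40%" are not disturbed by floating-point rounding.

```python
from fractions import Fraction
from itertools import combinations

def find_large_itemsets(transactions, min_support):
    """Phase 1: level-wise search for all large itemsets and their counts."""
    min_count = min_support * len(transactions)
    candidates = list({frozenset([i]) for t in transactions for i in t})
    counts, large = {}, []
    while candidates:
        # Count every candidate by scanning the transactions once.
        level = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level_large = [c for c in candidates if level[c] >= min_count]
        counts.update({c: level[c] for c in level_large})
        large.extend(level_large)
        # Join step: combine large k-itemsets into candidate (k+1)-itemsets.
        k = len(candidates[0]) + 1
        candidates = list({a | b for a in level_large for b in level_large
                           if len(a | b) == k})
    return large, counts

def generate_rules(large, counts, min_confidence):
    """Phase 2: form rules X -> Y from each large itemset and keep those
    whose confidence reaches the minimum confidence."""
    rules = []
    for itemset in large:
        for r in range(1, len(itemset)):
            for x in map(frozenset, combinations(itemset, r)):
                conf = Fraction(counts[itemset], counts[x])
                if conf >= min_confidence:
                    rules.append((set(x), set(itemset - x), conf))
    return rules

# The five transactions of Table 1, minimum support 40%, confidence 80%:
data = [set("BE"), set("ABD"), set("AD"), set("BCE"), set("ABDE")]
large, counts = find_large_itemsets(data, Fraction(2, 5))
print(generate_rules(large, counts, Fraction(4, 5)))
```

Run on the data of Table 1, the sketch finds exactly the large itemsets of Tables 2 and 3 and outputs the five rules of Table 5.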
3. Maintenance of Association Rules
In real-world applications, transaction databases grow over time and the
association rules mined from them must be re-evaluated because new association rules
may be generated and old association rules may become invalid when the new entire
databases are considered.
Conventional batch-mining algorithms, such as Apriori [1] and DHP [13], solve
this problem by re-processing entire new databases when new transactions are
inserted into the original databases. These algorithms do not, however, use previously
mined information and require nearly the same computational time they needed to
mine from the original databases. If new transactions appear often and the original
databases are large, these algorithms are thus inefficient in maintaining association
rules.
Considering an original database and newly inserted transactions, the following
four cases (illustrated in Figure 1) may arise:
Case 1: An itemset is large in the original database and in the newly inserted
transactions.
Case 2: An itemset is large in the original database, but is not large in the newly
inserted transactions.
Case 3: An itemset is not large in the original database, but is large in the newly
inserted transactions.
Case 4: An itemset is not large in either the original database or the newly inserted transactions.
Figure 1: Four cases arising from adding new transactions to existing databases. (The figure is a 2×2 grid whose rows indicate whether an itemset is large or small in the original database and whose columns indicate whether it is large or small in the new transactions, defining Cases 1 to 4.)
Since itemsets in Case 1 are large in both the original database and the new
transactions, they will still be large after the weighted average of the counts.
Similarly, itemsets in Case 4 will still be small after the new transactions are inserted.
Thus Cases 1 and 4 will not affect the final association rules. Case 2 may remove
existing association rules, and case 3 may add new association rules. A good rule-
maintenance algorithm should thus accomplish the following.
1. Evaluate large itemsets in the original database and determine whether they
are still large in the updated database;
2. Find out whether any small itemsets in the original database may become
large in the updated database;
3. Seek itemsets that appear only in the newly inserted transactions and
determine whether they are large in the updated database.
These are accomplished by the FUP algorithm and by our proposed algorithm.
4. Review of the Fast Update Algorithm (FUP)
Cheung et al. proposed the FUP algorithm to incrementally maintain association
rules when new transactions are inserted [7][8]. Using FUP, large itemsets with their
counts in preceding runs are recorded for later use in maintenance. As new
transactions are added, FUP first scans them to generate candidate 1-itemsets (only for
these transactions), and then compares these itemsets with the previous ones. FUP
partitions candidate 1-itemsets into two parts according to whether they are large for
the original database. If a candidate 1-itemset from the newly inserted transactions is
also among the large 1-itemsets from the original database, its new total count for the
entire updated database can easily be calculated from its current count and previous
count since all previous large itemsets with their counts are kept by FUP. Whether an
original large itemset is still large after new transactions are inserted is determined
from its support ratio, computed as its total count over the total number of transactions. By
contrast, if a candidate 1-itemset from the newly inserted transactions does not exist
among the large 1-itemsets in the original database, one of two possibilities arises. If
this candidate 1-itemset is not large for the new transactions, then it cannot be large
for the entire updated database, which means no action is necessary. If this candidate
1-itemset is large for the new transactions but not among the original large 1-itemsets,
the original database must be re-scanned to determine whether the itemset is actually
large for the entire updated database. Using the processing tactics mentioned above,
FUP is thus able to find all large 1-itemsets for the entire updated database. After that,
candidate 2-itemsets from the newly inserted transactions are formed and the same
procedure is used to find all large 2-itemsets. This procedure is repeated until all large
itemsets have been found.
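The level-wise decision logic just described can be summarized in the following Python sketch. This is our illustrative reconstruction rather than Cheung et al.'s implementation; in particular, rescan_count is a hypothetical callback standing for whatever routine recounts an itemset in the original database.

```python
def fup_level(candidates_new, large_old, d, t, min_support, rescan_count):
    """Resolve one level of FUP: return the large itemsets (with counts)
    for the updated database of d + t transactions.

    candidates_new : {itemset: count in the t new transactions}
    large_old      : {itemset: recorded count in the original database}
    rescan_count   : callable itemset -> count in the original database
    """
    min_count = min_support * (d + t)
    large_updated = {}
    # Cases 1 and 2: originally large itemsets; their totals follow from
    # the recorded counts plus the counts in the new transactions.
    for itemset, old_count in large_old.items():
        total = old_count + candidates_new.get(itemset, 0)
        if total >= min_count:
            large_updated[itemset] = total
    # Cases 3 and 4: itemsets not among the originally large ones.
    for itemset, new_count in candidates_new.items():
        if itemset in large_old:
            continue
        if new_count >= min_support * t:
            # Case 3: large for the new transactions only; the original
            # database must be rescanned to obtain the missing count.
            total = rescan_count(itemset) + new_count
            if total >= min_count:
                large_updated[itemset] = total
        # Case 4: small in both parts, so it can never be large overall.
    return large_updated
```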
Below, we use a simple example to illustrate the FUP algorithm. Suppose a database with eight transactions, as shown in Table 6, is to be mined. Next, assume two new transactions, as shown in Table 8, appear.
Table 8. New transactions for the example
TID     Items
900     ABCD
1000    DEF
The FUP algorithm processes them as follows. First, the final large 1-itemsets for
the entire updated database are found. This process is shown in Figure 2. The same
process is then repeated until no new candidate itemsets are generated.
Figure 2: The FUP process of finding large 1-itemsets. The two new transactions (900: ABCD; 1000: DEF) are first scanned to find all candidate 1-itemsets and their counts (A 1, B 1, C 1, D 2, E 1, F 1). The counts of the 1-itemsets that were originally large (A, B, C, E) are added to their recorded counts, giving A 6, B 7, C 7 and E 7. The originally small 1-itemsets (D 2, F 1) are then examined: those large for the new transactions are resolved by rescanning the original database, which yields a total count of 5 for D. The large 1-itemsets for the updated database are thus A 6, B 7, C 7, D 5 and E 7.
A summary of the four cases and their FUP results is given in Table 9.
Table 9. Four cases and their FUP results
Case      Original – New    Result
Case 1    Large – Large     Always large
Case 2    Large – Small     Determined from existing information
Case 3    Small – Large     Determined by rescanning the original database
Case 4    Small – Small     Always small
FUP is thus able to handle cases 1, 2 and 4 more efficiently than conventional
batch mining algorithms. It must, however, reprocess the original database to handle
case 3.
5. Maintenance of Association Rules Based on Pre-large Itemsets
Although the FUP algorithm focuses on the newly inserted transactions and thus
saves much processing time by incrementally maintaining rules, it must still scan the
original database to handle case 3, in which a candidate itemset is large for the new
transactions but is not recorded in large itemsets already mined from the original
database. This situation may occur frequently, especially when the number of new
transactions is small. In an extreme situation, if only one new transaction is added
each time, then all items in this transaction are large since their support ratios are
100% for the new transaction. Thus, if case 3 could be efficiently handled, the
maintenance time could be further reduced.
5.1 Definition of Pre-large Itemsets
In this paper, we propose the concept of pre-large itemsets to solve the problem
represented by case 3. A pre-large itemset is not truly large, but promises to be large
in the future. A lower support threshold and an upper support threshold are used to
realize this concept. The upper support threshold is the same as that used in the
conventional mining algorithms. The support ratio of an itemset must be no less than the upper support threshold in order for the itemset to be considered large. On the other hand, the
lower support threshold defines the lowest support ratio for an itemset to be treated as
pre-large. An itemset with its support ratio below the lower threshold is thought of as
a small itemset. Pre-large itemsets act like buffers in the incremental mining process
and are used to reduce the movements of itemsets directly from large to small and
vice-versa.
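Concretely, the two thresholds partition itemsets into three classes; the short sketch below (with names of our own choosing) illustrates the classification:

```python
def classify(count, total, s_l, s_u):
    """Classify an itemset by its support ratio under the two thresholds."""
    support = count / total
    if support >= s_u:
        return "large"
    if support >= s_l:
        return "pre-large"
    return "small"

# With Sl = 30% and Su = 50%, an itemset appearing in 3 of 8 transactions
# (support 37.5%) is neither large nor small:
assert classify(3, 8, 0.3, 0.5) == "pre-large"
```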
Considering an original database and transactions newly inserted using the two
support thresholds, itemsets may thus fall into one of the following nine cases
illustrated in Figure 3.
Figure 3: Nine cases arising from adding new transactions to existing databases. (The figure is a 3×3 grid whose rows indicate whether an itemset is large, pre-large, or small in the original database and whose columns indicate the same status in the new transactions, defining Cases 1 to 9.)
Cases 1, 5, 6, 8 and 9 above will not affect the final association rules according
to the weighted average of the counts. Cases 2 and 3 may remove existing association
rules, and cases 4 and 7 may add new association rules. If we retain all large and pre-
large itemsets with their counts after each pass, then cases 2, 3 and 4 can be handled easily. Also, in the maintenance phase, the ratio of new transactions to old transactions is usually very small. This is more apparent when the database grows larger. An itemset in case 7 cannot possibly be large for the entire updated database as long as the number of new transactions is small compared to the number of transactions in
the original database. This point is proven below. A summary of the nine cases and
their results is given in Table 10.
Table 10. Nine cases and their results

Case      Original – New           Result
Case 1    Large – Large            Always large
Case 2    Large – Pre-large        Large or pre-large, determined from existing information
Case 3    Large – Small            Large, pre-large or small, determined from existing information
Case 4    Pre-large – Large        Pre-large or large, determined from existing information
Case 5    Pre-large – Pre-large    Always pre-large
Case 6    Pre-large – Small        Pre-large or small, determined from existing information
Case 7    Small – Large            Pre-large or small when the number of new transactions is small
Case 8    Small – Pre-large        Small or pre-large
Case 9    Small – Small            Always small
5.2 Notation
The notation used in this paper is defined below.
D: the original database;
T: the set of new transactions;
U: the entire updated database, i.e., $D \cup T$;
d: the number of transactions in D;
t: the number of transactions in T;
$S_l$: the lower support threshold for pre-large itemsets;
$S_u$: the upper support threshold for large itemsets, $S_u > S_l$;
$L_k^D$: the set of large k-itemsets from D;
$L_k^T$: the set of large k-itemsets from T;
$L_k^U$: the set of large k-itemsets from U;
$P_k^D$: the set of pre-large k-itemsets from D;
$P_k^T$: the set of pre-large k-itemsets from T;
$P_k^U$: the set of pre-large k-itemsets from U;
$C_k$: the set of all candidate k-itemsets from T;
I: an itemset;
$S^D(I)$: the number of occurrences of I in D;
$S^T(I)$: the number of occurrences of I in T;
$S^U(I)$: the number of occurrences of I in U.
5.3 Theoretical Foundation
As mentioned above, if the number of new transactions is small compared to the
number of transactions in the original database, an itemset that is small (neither large
nor pre-large) in the original database but is large in the newly inserted transactions
cannot possibly be large for the entire updated database. This is proven in the
following theorem.
Theorem 1: Let $S_l$ and $S_u$ be respectively the lower and the upper support thresholds, and let d and t be respectively the numbers of the original and new transactions. If

$$t \le \frac{(S_u - S_l)d}{1 - S_u}, \qquad (1)$$

then an itemset that is small (neither large nor pre-large) in the original database but is large in the newly inserted transactions is not large for the entire updated database.

Proof:

The following derivation can be obtained from Formula (1):

$$t(1 - S_u) \le (S_u - S_l)d$$
$$t - tS_u \le dS_u - dS_l$$
$$t + dS_l \le S_u(d + t)$$
$$\frac{t + dS_l}{d + t} \le S_u.$$

If an itemset I is small (neither large nor pre-large) in the original database D, then its count $S^D(I)$ must be less than $S_l \cdot d$; therefore,

$$S^D(I) < dS_l.$$

If I is large in the newly inserted transactions T, then

$$tS_u \le S^T(I) \le t.$$

The entire support ratio of I in the updated database U is $\frac{S^D(I) + S^T(I)}{d + t}$, which can be further bounded as:

$$\frac{S^D(I) + S^T(I)}{d + t} < \frac{dS_l + t}{d + t} \le S_u.$$

I is thus not large for the entire updated database. This completes the proof.
Example 1: Assume d = 100, $S_l$ = 50% and $S_u$ = 60%. The number of new transactions within which the original database need not be scanned for rule maintenance is:

$$\frac{(S_u - S_l)d}{1 - S_u} = \frac{(0.6 - 0.5) \times 100}{1 - 0.6} = 25.$$

Thus, if the number of newly inserted transactions is equal to or less than 25, then I cannot be large for the entire updated database.
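This bound can be checked mechanically. The small sketch below (our own illustration) computes the safety number with exact fractions, since floating-point arithmetic can misplace the boundary; the floor matches Step 1 of the algorithm in Section 5.4.

```python
from fractions import Fraction
from math import floor

def safety_number(d, s_l, s_u):
    """Largest number of new transactions for which no rescan is needed."""
    return floor((s_u - s_l) * d / (1 - s_u))

# Example 1: d = 100, Sl = 50%, Su = 60%  ->  f = 25.
assert safety_number(100, Fraction(1, 2), Fraction(3, 5)) == 25
```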
From Theorem 1, the number of new transactions within which case 7 can be handled efficiently is determined by $S_l$, $S_u$, and d. It can easily be seen from Formula (1) that if d
grows larger, then t can grow larger too. Therefore, as the database grows, our
proposed approach becomes increasingly efficient. This characteristic is especially
useful for real-world applications.
From Theorem 1, the ratio of new transactions to previous transactions within which the proposed approach works can easily be derived as follows.
Corollary 1: Let r denote the ratio of the number of new transactions t to the number of old transactions d. If

$$r \le \frac{S_u - S_l}{1 - S_u},$$

then an itemset that is small (neither large nor pre-large) in the original database but is large in the newly inserted transactions cannot be large for the entire updated database.
Example 2: Assume $S_l$ = 50% and $S_u$ = 60%. The ratio of new transactions to old transactions within which the original database need not be scanned for rule maintenance is:

$$\frac{S_u - S_l}{1 - S_u} = \frac{0.6 - 0.5}{1 - 0.6} = \frac{1}{4}.$$
Thus, if the number of newly inserted transactions is equal to or less than 1/4 of
the number of original transactions, then I cannot be large for the entire updated
database.
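The corollary admits the same quick check; exact fractions again keep the boundary arithmetic precise:

```python
from fractions import Fraction

s_l, s_u = Fraction(1, 2), Fraction(3, 5)   # Sl = 50%, Su = 60%
r_max = (s_u - s_l) / (1 - s_u)             # maximum safe ratio t/d
assert r_max == Fraction(1, 4)              # one quarter, as in Example 2
```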
It is easily seen from corollary 1 that if the range between Sl and Su is large, then
the ratio r can also be large, meaning that the number of new transactions will be large
for a fixed d. However, a large range between Sl and Su will also create a large set of
pre-large itemsets, which will represent an additional overhead in maintenance.
5.4 Presentation of the Algorithm
In the proposed algorithm, the large and pre-large itemsets with their counts in
preceding runs are recorded for later use in maintenance. As new transactions are
added, the proposed algorithm first scans them to generate candidate 1-itemsets (only
for these transactions), and then compares these itemsets with the previously retained
large and pre-large 1-itemsets. It partitions candidate 1-itemsets into three parts
according to whether they are large or pre-large for the original database. If a
candidate 1-itemset from the newly inserted transactions is also among the large or
pre-large 1-itemsets from the original database, its new total count for the entire
updated database can easily be calculated from its current count and previous count
since all previous large and pre-large itemsets with their counts have been retained.
Whether an originally large or pre-large itemset is still large or pre-large after new
transactions have been inserted is determined from its new support ratio, as derived
from its total count over the total number of transactions. On the contrary, if a
candidate 1-itemset from the newly inserted transactions does not exist among the
large or pre-large 1-itemsets in the original database, then it cannot be large for
the entire updated database as long as the number of newly inserted transactions is
within the safety threshold derived from Theorem 1. In this situation, no action is
needed. When transactions are incrementally added and the total number of new
transactions exceeds the safety threshold, the original database is re-scanned to find
new pre-large itemsets in a way similar to that used by the FUP algorithm. The
proposed algorithm can thus find all large 1-itemsets for the entire updated database.
After that, candidate 2-itemsets from the newly inserted transactions are formed and
the same procedure is used to find all large 2-itemsets. This procedure is repeated
until all large itemsets have been found. The details of the proposed maintenance
algorithm are described below. A variable, c, is used to record the number of new
transactions since the last re-scan of the original database.
The proposed maintenance algorithm:
INPUT: A lower support threshold Sl, an upper support threshold Su, a set of large
itemsets and pre-large itemsets in the original database consisting of (d+c)
transactions, and a set of t new transactions.
OUTPUT: A set of final association rules for the updated database.
STEP 1: Calculate the safety number f of new transactions according to Theorem 1 as follows:

$$f = \left\lfloor \frac{(S_u - S_l)d}{1 - S_u} \right\rfloor.$$
STEP 2: Set k =1, where k records the number of items in itemsets currently being
processed.
STEP 3: Find all candidate k-itemsets $C_k$ and their counts from the new transactions.
STEP 4: Divide the candidate k-itemsets into three parts according to whether they are
large, pre-large or small in the original database.
STEP 5: For each itemset I in the originally large k-itemsets $L_k^D$, do the following
substeps:
Substep 5-1: Set the new count $S^U(I) = S^T(I) + S^D(I)$.
Substep 5-2: If $S^U(I)/(d+t+c) \ge S_u$, then assign I as a large itemset, set $S^D(I) = S^U(I)$ and keep I with $S^D(I)$;
otherwise, if $S^U(I)/(d+t+c) \ge S_l$, then assign I as a pre-large itemset, set $S^D(I) = S^U(I)$ and keep I with $S^D(I)$;
otherwise, neglect I.
STEP 6: For each itemset I in the originally pre-large k-itemsets $P_k^D$, do the following substeps:
Substep 6-1: Set the new count $S^U(I) = S^T(I) + S^D(I)$.
Substep 6-2: If $S^U(I)/(d+t+c) \ge S_u$, then assign I as a large itemset, set $S^D(I) = S^U(I)$ and keep I with $S^D(I)$;
otherwise, if $S^U(I)/(d+t+c) \ge S_l$, then assign I as a pre-large itemset, set $S^D(I) = S^U(I)$ and keep I with $S^D(I)$;
otherwise, neglect I.
STEP 7: For each itemset I in the candidate itemsets $C_k$ that is not in the originally large itemsets $L_k^D$ or pre-large itemsets $P_k^D$, do the following substeps:
Substep 7-1: If I is in the large itemsets $L_k^T$ or pre-large itemsets $P_k^T$ from the new transactions, then put it in the rescan-set R, which is used
when rescanning in Step 8 is necessary.
Substep 7-2: If I is small for the new transactions, then do nothing.
STEP 8: If $t + c \le f$ or R is null, then do nothing; otherwise, rescan the original
database to determine whether the itemsets in the rescan-set R are large or
pre-large.
STEP 9: Form candidate (k+1)-itemsets $C_{k+1}$ from the final large and pre-large k-itemsets ($L_k^U \cup P_k^U$) that appear in the new transactions.
STEP 10: Set k = k+1.
STEP 11: Repeat STEPs 4 to 10 until no new large or pre-large itemsets are found.
STEP 12: Modify the association rules according to the modified large itemsets.
STEP 13: If $t + c > f$, then set $d = d + t + c$ and $c = 0$; otherwise, set $c = t + c$.
After Step 13, the final association rules for the updated database have been
determined.
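The following condensed Python sketch illustrates Steps 4 through 8 for a single level k. It is a schematic rendering under our own naming conventions, not the authors' implementation; rescan_count again stands for a hypothetical routine that recounts an itemset in the original database when Step 8 requires a rescan.

```python
def maintain_level(candidates_new, large_old, prelarge_old,
                   d, t, c, f, s_l, s_u, rescan_count):
    """One level of the proposed algorithm (Steps 4-8).

    candidates_new : {itemset: count in the t new transactions}
    large_old      : {itemset: S_D(I)} for the retained large k-itemsets
    prelarge_old   : {itemset: S_D(I)} for the retained pre-large k-itemsets
    """
    total = d + t + c
    large, prelarge, rescan_set = {}, {}, []
    # Steps 5 and 6: originally large or pre-large itemsets are resolved
    # entirely from the retained counts plus their counts in the new data.
    for retained in (large_old, prelarge_old):
        for itemset, count_d in retained.items():
            count_u = count_d + candidates_new.get(itemset, 0)
            if count_u / total >= s_u:
                large[itemset] = count_u
            elif count_u / total >= s_l:
                prelarge[itemset] = count_u
            # otherwise the itemset has become small and is neglected
    # Step 7: itemsets seen only in the new transactions enter the
    # rescan-set R if they are large or pre-large there.
    for itemset, count_t in candidates_new.items():
        if itemset not in large_old and itemset not in prelarge_old:
            if count_t / t >= s_l:
                rescan_set.append(itemset)
    # Step 8: within the safety number f no rescan is needed; beyond it,
    # the original database supplies the missing counts.
    if t + c > f and rescan_set:
        for itemset in rescan_set:
            count_u = rescan_count(itemset) + candidates_new[itemset]
            if count_u / total >= s_u:
                large[itemset] = count_u
            elif count_u / total >= s_l:
                prelarge[itemset] = count_u
    return large, prelarge
```

Steps 9 to 13 then iterate this level procedure for increasing k and update d and c as described above.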
5.5 An Example
In this section, an example is given to illustrate the proposed incremental data
mining algorithm. Assume the initial data set includes 8 transactions, which are the
same as those shown in Table 6. For Sl=30% and Su=50%, the sets of large itemsets
and pre-large itemsets for the given data are shown in Tables 11 and 12, respectively.
Table 11. The large itemsets for the original database
1 item    Count    2 items    Count    3 items    Count
A         5        BC         4        BCE        4
B         6        BE         6
C         6        CE         4
E         6
Table 12. The pre-large itemsets for the original database