A Novel Algorithm for Cross Level Frequent Pattern Mining in Multi datasets:
Table of Contents:
Abstract
List of Keywords
Introduction
Literature Survey
Existing System
Drawbacks in Existing System
Proposed System
System Design
Advantage of Proposed System
Requirement Specification
Modules
Modules description
Conclusion
References
Abstract
We consider the problem of discovering association rules between items
in a large database of sales transactions. We present two new algorithms
for solving this problem that are fundamentally different from the known
algorithms. Empirical evaluation shows that these algorithms outperform
the known algorithms by factors ranging from three for small problems
to more than an order of magnitude for large problems. Frequent pattern
mining has become one of the most popular data mining approaches for
the analysis of purchasing patterns. Techniques such as Apriori and
FP-Growth were typically restricted to a single concept level. We extend
our research to discover cross-level frequent patterns in multi-level
environments. Unfortunately, little attention has been paid to this
research area. Mining cross-level frequent patterns may lead to the
discovery of patterns that span different levels of the hierarchy. In
this study, a transaction reduction technique with an FP-tree based
bottom-up approach is used for mining cross-level patterns. The method
uses the concept of reduced support.
Introduction
This section discusses the theories and algorithms for the maintenance of
the frequent pattern space. "Frequent patterns", also known as frequent
item sets, refer to patterns that appear frequently in a particular dataset.
Frequent patterns are defined based on a user-defined threshold, called the
"support threshold". Given a dataset, we say a pattern is a frequent
pattern if and only if its occurrence frequency is greater than or equal to
the support threshold. We also define the collection of all frequent patterns
as the "frequent pattern space" or the "space of frequent patterns".
Frequent patterns are a very important type of pattern in data mining.
Frequent patterns play an essential role in various knowledge discovery
tasks, such as the discovery of association rules, correlations, causality,
sequential patterns, partial periodicity, emerging patterns, etc. In the last
decade, the discovery of frequent patterns has attracted tremendous
research attention, and a phenomenal number of discovery algorithms
have been proposed. The maintenance of the frequent pattern space is
as crucial as the discovery of the pattern space. This is because data is
dynamic in nature. Due to advances in data generation and collection
technologies, databases are constantly updated with newly collected
data. Data updates are also used as a tool in interactive data mining, to
gauge the impact caused by hypothetical changes to the data and to
detect the emergence and disappearance of trends. When a database is
often updated or modified for interactive mining, repeating the pattern
discovery process from scratch causes significant computational and I/O
overheads. Therefore, effective maintenance algorithms are needed to
update and maintain the frequent pattern space. This thesis focuses on
the maintenance of the frequent pattern space for transactional datasets.
We observe that most of the prior works in frequent pattern maintenance
are proposed as extensions of certain frequent pattern discovery
algorithms or the data structures they use. Unlike the prior works, this
thesis lays a theoretical foundation for the development of effective
maintenance algorithms by analyzing the evolution of the frequent pattern
space in response to data changes. We study the evolution of the pattern
space using the concept of equivalence classes. Inspired by the evolution
analysis, novel maintenance algorithms are proposed to handle various
data updates.
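The definition of a frequent pattern given above can be illustrated with a minimal sketch. The function names (support, is_frequent) and the sample dataset are assumptions made for illustration only; the support threshold is taken here as an absolute transaction count.

```python
from typing import FrozenSet, Iterable, List


def support(pattern: Iterable[str], dataset: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of `pattern`."""
    items = frozenset(pattern)
    return sum(1 for transaction in dataset if items <= transaction)


def is_frequent(pattern: Iterable[str], dataset: List[FrozenSet[str]],
                min_support: int) -> bool:
    """A pattern is frequent iff its support reaches the threshold."""
    return support(pattern, dataset) >= min_support


# Hypothetical sample dataset of four sales transactions.
dataset = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

print(support({"bread", "milk"}, dataset))
print(is_frequent({"milk", "butter"}, dataset, min_support=2))
```

With a threshold of 2, the pattern {bread, milk} belongs to the frequent pattern space of this sample, while {milk, butter} does not.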
Apriori-based algorithms
Apriori is the most influential algorithm for frequent pattern discovery.
Many discovery algorithms are inspired by Apriori. Apriori employs a
"candidate-generation-verification" framework. The algorithm generates
its candidate patterns using a "level-wise" search. The essential idea of
the level-wise search is to iteratively enumerate the set of candidate
patterns of length (k + 1) from the set of frequent patterns of length k.
The support of the candidate patterns is then counted by scanning the
dataset. One major drawback of Apriori is that it leads to the
enumeration of a huge number of candidate patterns. For example, if a
dataset has 100 items, Apriori may need to generate up to 2^100 - 1
(roughly 10^30) candidates. Another
drawback of Apriori is that it requires multiple scans of the dataset to
count the support of candidate patterns. Different variations of Apriori
have been proposed to address these limitations. One variation uses a
hash-based technique to reduce the number of candidate patterns. Another
speeds up the support counting process by reducing the number of
transactions scanned in future iterations. The idea is that a transaction
that does not contain any frequent pattern of length k cannot contain any
frequent pattern with length greater than k. Therefore, such transactions
can be ignored for subsequent iterations.
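The level-wise search and the transaction reduction idea described above can be sketched as follows. This is a simplified illustration, not the implementation of any particular published variant; the function name apriori and the absolute min_support parameter are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise search: frequent k-itemsets -> candidate (k+1)-itemsets."""
    active = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in active:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {p: c for p, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 1
    while frequent:
        # Transaction reduction: a transaction without any frequent
        # k-itemset cannot contain a longer frequent itemset, so drop it.
        active = [t for t in active if any(p <= t for p in frequent)]

        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets
        # and prune candidates that have an infrequent k-subset.
        candidates = set()
        for a, b in combinations(frequent, 2):
            c = a | b
            if len(c) == k + 1 and all(
                frozenset(s) in frequent for s in combinations(c, k)
            ):
                candidates.add(c)

        # Verification: count support over the remaining transactions.
        counts = {c: sum(1 for t in active if c <= t) for c in candidates}
        frequent = {p: n for p, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"d"}]
    for pattern, count in apriori(sample, min_support=2).items():
        print(sorted(pattern), count)
```

In the sample, the transaction {d} is dropped after the first level because it contains no frequent item, so later scans touch only four transactions.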
FP-tree-based algorithms:
To address the shortcomings of the candidate-generation-verification
framework, FP-tree-based algorithms, which involve no candidate
generation, have been proposed. Examples of FP-tree-based algorithms
include FP-growth, which is the state-of-the-art FP-tree-based discovery
algorithm. FP-growth mines frequent patterns based on a structure called
the Frequent Pattern Tree (FP-tree). The FP-tree is a compact
representation of all relevant frequency information in a database. Every
branch of the FP-tree represents a "projected transaction" and also a
candidate pattern. The nodes along the branches are stored in descending
order of the support values of the corresponding items, so leaves
represent the least frequent items. Compression is achieved by building
the tree in such a way that overlapping transactions share prefixes of the
corresponding branches. To construct the FP-tree for a dataset given a
support threshold, the dataset is first transformed into
the "projected dataset". With the FP-tree, FP-growth generates frequent
patterns using a "fragment growth technique". The fragment growth
technique enumerates frequent patterns based on the support information
stored in the FP-tree, which effectively avoids the generation of
unnecessary candidate patterns. Inspired by the idea of divide-and-conquer,
the fragment growth technique decomposes the mining task into subtasks
that mine frequent patterns for conditional datasets, which greatly
reduces the search space.
Details of the technique are given in the FP-growth literature. FP-growth
significantly outperforms both the Apriori-based and partition-based
algorithms. The advantages of FP-growth are: first, the FP-tree effectively
compresses and summarizes the dataset so that multiple scans of the dataset
are no longer needed to obtain the support of patterns; second, the
fragment growth technique ensures that no unnecessary candidate patterns
are enumerated; lastly, the search task is simplified with a
divide-and-conquer method. However, FP-growth, like other prefix-tree based
algorithms, still suffers from the undesirably large size of the frequent
pattern space. To break this bottleneck, algorithms have been proposed to
discover concise representations of the frequent pattern space.
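The construction of the FP-tree from projected transactions can be sketched as below. The names FPNode, build_fp_tree and prefix_paths are illustrative assumptions, and ties in support are broken alphabetically only to make the item order deterministic. The prefix_paths helper collects the prefix paths of one item, which is the kind of conditional dataset that fragment growth works on; the recursive mining step itself is not shown.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_support):
    # First scan: count item supports and keep only the frequent items.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)
    header = {item: [] for item in frequent}  # item -> nodes holding it

    # Second scan: insert each projected transaction as a path.
    for t in transactions:
        # Projected transaction: frequent items in descending support order.
        projected = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in projected:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # shared prefixes only increase counters
            node = child
    return root, header


def prefix_paths(item, header):
    """(prefix path, count) pairs for every node that holds `item`."""
    paths = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        paths.append((path[::-1], node.count))
    return paths


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "e"]]
    root, header = build_fp_tree(data, min_support=2)
    print(prefix_paths("c", header))
```

In this sample the infrequent item e never enters the tree, and the first three transactions share the node for item a.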
Data mining Technology:
Generally, data mining (sometimes called data or knowledge discovery)
is the process of analyzing data from different perspectives and
summarizing it into useful information - information that can be used to
increase revenue, cut costs, or both. Data mining software is one of a
number of analytical tools for analyzing data. It allows users to analyze
data from many different dimensions or angles, categorize it, and
summarize the relationships identified. Technically, data mining is the
process of finding correlations or patterns among dozens of fields in
large relational databases.
While large-scale information technology has been evolving separate
transaction and analytical systems, data mining provides the link
between the two. Data mining software analyzes relationships and
patterns in stored transaction data based on open-ended user queries.
Several types of analytical software are available: statistical, machine
learning, and neural networks. Generally, any of four types of
relationships are sought:
• Classes: Stored data is used to locate data in predetermined
groups. For example, a restaurant chain could mine customer
purchase data to determine when customers visit and what they
typically order. This information could be used to increase traffic
by having daily specials.
• Clusters: Data items are grouped according to logical relationships
or consumer preferences. For example, data can be mined to
identify market segments or consumer affinities.
• Associations: Data can be mined to identify associations. The
beer-diaper example is an example of associative mining.
• Sequential patterns: Data is mined to anticipate behavior patterns
and trends. For example, an outdoor equipment retailer could
predict the likelihood of a backpack being purchased based on a
consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major elements:
• Extract, transform, and load transaction data onto the data
warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information
technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
• Artificial neural networks: Non-linear predictive models that
learn through training and resemble biological neural networks in
structure.
• Genetic algorithms: Optimization techniques that use processes
such as genetic combination, mutation, and natural selection in a
design based on the concepts of natural evolution.
• Decision trees: Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification of a
dataset. Specific decision tree methods include Classification and
Regression Trees (CART) and Chi Square Automatic Interaction
Detection (CHAID). CART and CHAID are decision tree
techniques used for classification of a dataset. They provide a set
of rules that you can apply to a new (unclassified) dataset to
predict which records will have a given outcome. CART segments
a dataset by creating 2-way splits while CHAID segments using
chi square tests to create multi-way splits. CART typically requires
less data preparation than CHAID.
• Nearest neighbor method: A technique that classifies each record
in a dataset based on a combination of the classes of the k record(s)
most similar to it in a historical dataset (where k ≥ 1). Sometimes
called the k-nearest neighbor technique.
• Rule induction: The extraction of useful if-then rules from data
based on statistical significance.
• Data visualization: The visual interpretation of complex
relationships in multidimensional data. Graphics tools are used to
illustrate data relationships.
EXISTING SYSTEM:
In the existing system, a top-down approach is used. The
existing system finds the large (frequent) 1-patterns
for all levels using the CCB-tree method.
Drawbacks in Existing System:
1) It uses a top-down approach.
2) The algorithm cannot reduce the search space
without losing patterns.
3) There is no reduction-based frequent pattern
mining for a single concept level.
PROPOSED SYSTEM:
Level-Crossing:
One approach to multilevel mining would be to directly exploit the
standard algorithms in this area, Apriori and FP-Growth, by iteratively
applying them in a level-by-level manner to each concept level. In this
paper, we introduce a new approach to the discovery of frequent patterns
based on the FP-tree. Our approach differs from the FP-Growth algorithm,
which recursively generates conditional FP-trees and therefore requires a
large amount of memory.
Our approach minimizes I/O costs by applying a transaction
reduction technique and feeding the resulting transactions into the
FP-tree as input to subsequent iterations of the mining process. Our
method adopts a bottom-up approach, with a leaf-to-root traversal, so as
to identify frequent patterns existing between arbitrary classification
levels. Our method reduces the search space without losing any patterns.
A new approach to mining frequent patterns for multi datasets has to
be considered. Work has been done on adapting approaches originally
made for single-level datasets into techniques usable on multilevel
datasets.
In this work, we attempt to remove unwanted patterns and
transactions using a transaction reduction technique and feed the
resulting transactions into the FP-tree as input to subsequent iterations
of the mining process. Our method adopts a bottom-up approach, with a
leaf-to-root traversal and a single FP-tree generation, so as to identify
frequent patterns existing between arbitrary classification levels. Our
method reduces the I/O costs and search space without losing any patterns.
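To make the notion of a cross-level pattern concrete, the sketch below counts item pairs whose members may come from different concept levels. It uses the simple technique of extending each transaction with the ancestors of its items, which is a stand-in chosen for illustration and not the FP-tree based method proposed here. The taxonomy PARENT and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical two-level taxonomy: leaf item -> parent category.
PARENT = {
    "skim milk": "milk",
    "whole milk": "milk",
    "white bread": "bread",
    "wheat bread": "bread",
}


def ancestors(item):
    """All higher-level concepts of `item`, nearest first."""
    chain = []
    while item in PARENT:
        item = PARENT[item]
        chain.append(item)
    return chain


def extend(transaction):
    """Add every ancestor so that patterns can mix concept levels."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return frozenset(extended)


def cross_level_pairs(transactions, min_support):
    counts = {}
    for t in map(extend, transactions):
        for a, b in combinations(sorted(t), 2):
            # Skip an item paired with its own ancestor: such a pair is
            # trivially as frequent as the item itself.
            if a in ancestors(b) or b in ancestors(a):
                continue
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return {p: c for p, c in counts.items() if c >= min_support}


if __name__ == "__main__":
    sales = [
        {"skim milk", "white bread"},
        {"whole milk", "white bread"},
        {"skim milk", "wheat bread"},
    ]
    print(cross_level_pairs(sales, min_support=2))
```

In this sample no pair of leaf items reaches a support of 2, but the cross-level pair (bread, skim milk) does, which is exactly the kind of pattern a single-level miner would miss.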
ADVANTAGES:
1) It uses a bottom-up approach.
2) The algorithm reduces the search space
without losing any patterns.
3) It introduces a new algorithm for transaction
reduction based frequent pattern mining at a
single concept level.
SYSTEM DESIGN
[Figure: system design flow diagram. Block labels: Data set; Find support and count; Apply association rule by Apriori; Find frequent item set; Ordered item set; Delete min support count; CCB tree construction; Reduced transaction table; FP tree generation; Finding the frequent pattern tree; Apply cross level set analysis algorithm; Frequent pattern generation; Extraction; Performance evaluation.]
Requirement Specification:
Hardware Requirements:
• System : Pentium IV 2.4 GHz
• Hard Disk : 160 GB
• Monitor : 15 VGA color
• Mouse : Logitech
• Keyboard : 110 keys enhanced
• RAM : 1 GB
Software Requirements:
• OS : Windows XP, 7
• Language : .NET
• Database : SQL Server 2005
Modules
Multilevel Association mining
Find frequent item set (Apriori algorithm)
CCB-Tree mining
FP-Tree generation
Frequent pattern generation
Performance evaluation
Modules Description
Multilevel Association mining:
In data mining, association rule learning is a popular and well
researched method for discovering interesting relations between
variables in large databases. It is intended to identify strong rules
discovered in databases using different measures of interestingness.
Such rules can serve as the basis for marketing decisions such as
promotional pricing or product placements. In addition to market
basket analysis, association rules are employed
today in many application areas including Web usage mining, intrusion
detection, continuous production and bioinformatics. As opposed
to sequence mining, association rule learning typically does not consider
the order of items either within a transaction or across transactions.
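Confidence is one widely used measure of interestingness for a rule A => B: the fraction of transactions containing A that also contain B. The sketch below is a minimal illustration; the function names and the sample baskets are assumptions. Transactions are treated as sets, which reflects the point above that item order is not considered.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for t in transactions if wanted <= frozenset(t))


def confidence(antecedent, consequent, transactions):
    """conf(A => B) = support(A and B together) / support(A)."""
    base = support_count(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never applies in this dataset
    both = support_count(set(antecedent) | set(consequent), transactions)
    return both / base


if __name__ == "__main__":
    # Hypothetical market baskets.
    baskets = [
        {"beer", "diapers"},
        {"beer", "diapers", "chips"},
        {"beer"},
        {"diapers"},
    ]
    print(confidence({"beer"}, {"diapers"}, baskets))
    print(confidence({"chips"}, {"beer"}, baskets))
```

A rule is usually reported as strong only when both its support and its confidence exceed user-chosen thresholds.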
Find frequent item set
Apriori is the most influential algorithm for frequent pattern discovery.
Many discovery algorithms are inspired by Apriori. Apriori employs a
"candidate-generation-verification" framework. The algorithm generates
its candidate patterns using a "level-wise" search. The essential idea of
the level-wise search is to iteratively enumerate the set of candidate
patterns of length (k + 1) from the set of frequent patterns of length k.
Using this search, we can find the frequent item sets.
CCB- Tree mining :
The CCB-Tree algorithm is used to find the multilevel frequent
1-patterns for all levels. The CCB-Tree starts from the leftmost initial
node and deletes the entries that fall below the minimum support count
to produce the reduced transaction table.
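The two outcomes of this module, frequent 1-patterns at every level and a reduced transaction table, can be sketched without the CCB-Tree itself, whose internal structure is not detailed here. In this hypothetical encoding, each item is a tuple giving its path in the concept hierarchy, so every prefix of the tuple is a higher-level concept. An infrequent item is generalized to its nearest frequent ancestor, and dropped only when no ancestor is frequent, so that no frequent concept is lost.

```python
from collections import Counter


def frequent_1_patterns(transactions, min_support):
    """Count every concept (each prefix of an item path) once per transaction."""
    counts = Counter()
    for t in transactions:
        concepts = {
            item[:level] for item in t for level in range(1, len(item) + 1)
        }
        counts.update(concepts)
    return {c: n for c, n in counts.items() if n >= min_support}


def reduce_transactions(transactions, frequent):
    """Build the reduced transaction table from the frequent concepts."""
    reduced = []
    for t in transactions:
        kept = set()
        for item in t:
            # Walk up from the leaf until a frequent concept is found.
            for level in range(len(item), 0, -1):
                if item[:level] in frequent:
                    kept.add(item[:level])
                    break
        if kept:  # transactions with nothing frequent are removed
            reduced.append(sorted(kept))
    return reduced


if __name__ == "__main__":
    table = [
        [("milk", "skim"), ("bread", "white")],
        [("milk", "skim"), ("bread", "wheat")],
        [("milk", "whole"), ("egg", "brown")],
    ]
    frequent = frequent_1_patterns(table, min_support=2)
    print(frequent)
    print(reduce_transactions(table, frequent))
```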
FP-Tree generation:
An FP-tree is a compact data structure that represents the data set in
tree form. Each transaction is read and then mapped onto a path in
the FP-tree. This is done until all transactions have been read.
Different transactions that have common subsets allow the tree to
remain compact because their paths overlap.
The best-case scenario occurs when all transactions have exactly the
same item set; the FP-tree is then only a single branch of nodes.
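The effect of overlapping paths on tree size can be checked with a small sketch. It uses nested dictionaries as a bare prefix tree and sorts items alphabetically instead of by support, both simplifying assumptions; support counters are omitted because only the number of nodes matters here.

```python
def count_nodes(transactions):
    """Insert transactions into a prefix tree and count the nodes created."""
    root = {}
    nodes = 0
    for t in transactions:
        node = root
        for item in sorted(t):  # fixed item order so shared prefixes overlap
            if item not in node:
                node[item] = {}
                nodes += 1
            node = node[item]
    return nodes


if __name__ == "__main__":
    identical = [{"a", "b", "c"}] * 4          # best case: one branch
    disjoint = [{"a", "b"}, {"c", "d"}, {"e", "f"}]  # no sharing at all
    print(count_nodes(identical))
    print(count_nodes(disjoint))
```

Four identical transactions of three items need only three nodes, whereas transactions with no items in common need one node per item occurrence.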
Frequent pattern generation:
Once the FP-tree has been built, the next phase is to generate
candidate item sets and find frequent patterns. Cross-level frequent
pattern mining with the bottom-up approach starts from the leaf nodes
of an existing FP-tree and traverses each branch upwards until it
reaches the root. We begin by scanning the tree and identifying its
leaf nodes. A pointer to each leaf is then inserted into the leaf node
array. We now perform a bottom-up scan from each leaf node until we
reach the root. Meanwhile, each node visited that carries a support
count is saved in a temporary buffer, which records the path
traversed. Candidate generation keeps the path from the starting node.
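The leaf node array and the leaf-to-root scan can be sketched as follows. The Node class, the function names and the small hand-built tree are assumptions for illustration; the sketch only records the paths and leaves out the step that turns each recorded path into candidate item sets.

```python
class Node:
    def __init__(self, item, count, parent=None):
        self.item, self.count, self.parent = item, count, parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def leaf_nodes(root):
    """Scan the tree and collect pointers to its leaves (the leaf node array)."""
    leaves, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(reversed(node.children))
        elif node is not root:
            leaves.append(node)
    return leaves


def bottom_up_paths(root):
    """For each leaf, walk up to the root and buffer the nodes visited."""
    paths = []
    for leaf in leaf_nodes(root):
        buffer, node = [], leaf
        while node is not root:
            buffer.append((node.item, node.count))
            node = node.parent
        paths.append(buffer)
    return paths


if __name__ == "__main__":
    # Hand-built example tree: root -> a -> b -> c, a -> d, root -> e.
    root = Node(None, 0)
    a = Node("a", 3, root)
    b = Node("b", 2, a)
    Node("c", 1, b)
    Node("d", 1, a)
    Node("e", 1, root)
    for path in bottom_up_paths(root):
        print(path)
```

Each buffered path starts at a leaf and ends just below the root, so the least frequent item of a branch comes first.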
Performance evaluation:
In this module, the performance of the process is evaluated and the
results are presented as a graph. The graph summarizes the results
and makes them easy for users to analyze.
LITERATURE SURVEY:
Mining frequent item sets without candidate generation
Implements
In many cases, the Apriori algorithm significantly reduces the
size of candidate sets using the Apriori principle. However, it can
suffer from two nontrivial costs:
(1) generating a huge number of candidate sets, and
(2) repeatedly scanning the database and checking the
candidates by pattern matching.
To avoid these costs, an FP-growth method was devised that
mines the complete set of frequent item sets without
candidate generation. FP-growth works in a
divide-and-conquer way. The first scan of the database
derives a list of frequent items in which items are ordered
in descending order of frequency.
Algorithm for Efficient Multilevel Association Rule Mining
Implements
Over the years, a variety of algorithms for finding frequent item sets in very large transaction databases have been developed. Because finding frequent item sets is fundamental to multilevel association rule mining, fast algorithms for solving this problem are needed. This paper presents an efficient version of the Apriori algorithm for mining multilevel association rules in large databases, finding the maximum frequent item sets at lower levels of abstraction. We propose a new, fast and efficient algorithm (SC-BF Multilevel) that mines the complete frequent item sets with a single scan of the database, which reduces the execution time and increases throughput. Our proposed algorithm works well in comparison with the general approach to multilevel association rules.
An Efficient Algorithm for Mining Multilevel Association Rule Based on Pincer Search
Implements
Discovering frequent item sets is a key problem in significant data mining
applications, such as the discovery of association rules, strong rules, episodes, and
minimal keys. The problem of developing models and algorithms for multilevel
association mining poses new challenges for mathematics and computer
science. In this paper, we present a model of mining multilevel association rules
which satisfies a different minimum support at each level. We have employed
Pincer search concepts, a multilevel taxonomy and different minimum supports to
find multilevel association rules in a given transaction data set. This search is used
only for maintaining and updating a new data structure. It is used to prune early
candidates that would normally be encountered in the top-down search. A main
characteristic of the algorithm is that it does not require explicit examination of
every frequent item set. An example is also given to demonstrate that
the proposed mining algorithm can derive the multiple-level association rules
under different supports in a simple and effective manner.
Fast Algorithm for Mining Multi-Level Association Rules in Large Databases
Implements
Association rule mining finds interesting associations among a large set of
data items. With massive amounts of data continuously being collected and stored,
many industries are becoming interested in mining association rules from their
databases. The discovery of interesting association relationships among huge
amounts of business transaction records can help in many business decision-making
processes, such as catalogue design, cross marketing and loss leader analysis.
An Efficient Approach for Incremental Association Rule Mining
Implements
We study the issue of maintaining association rules in a large database of
sales transactions. The maintenance of association rules can be mapped into the
problem of maintaining large itemsets in the database. Because the mining of
association rules is time-consuming, we need an efficient approach to maintain the
large itemsets when the database is updated. In this paper, we present efficient
approaches to solve the problem. Our approaches store the itemsets that are not
large at present but may become large itemsets after updating the database, so that
the cost of processing the updated database can be reduced. Moreover, we discuss
the cases where the large itemsets can be obtained without scanning the original
database. Experimental results show that our algorithms outperform other
algorithms, especially when the original database need not be scanned in our
algorithms.
Conclusion
Transaction databases in many applications contain data that has built-in
hierarchy information. In such databases, users are often interested in finding
associations among items only at the same level; we extended the scope of study
to the mining of level-crossing association rules from large databases. A
transaction reduction technique is used to remove unwanted candidates and
transactions, and the resulting transactions are fed into the FP-tree as input to
subsequent iterations of the mining process. We adopted a bottom-up approach,
with a leaf-to-root traversal and a single FP-tree generation, so as to identify
frequent patterns existing between arbitrary classification levels. Our method
reduces the I/O costs and search space without losing any patterns. Performance
evaluation demonstrates the viability of our new method. In future, an efficient
algorithm can be developed to reduce the redundancy in cross-level association
rules.
References
[1] T. Eavis and Xi Zheng, Multi-Level Frequent Pattern Mining, Springer-Verlag Berlin Heidelberg, 2009, pp. 369-383.
[2] K. Duraiswamy and B. Jayanthi, A Novel Preprocessing Algorithm for Frequent Pattern Mining in Multidatasets, International Journal of Data Engineering, Vol. 2, No. 3, Aug 2011.
[3] J. Han and Y. Fu, Discovery of Multiple-Level Association Rules from Large Databases, in Proceedings of the 21st Very Large Data Bases Conference, Morgan Kaufmann, pp. 420-431, 1995.
[4] Yinbo Wan, Yong Liang, Liya Ding, Mining Multilevel Association Rules from Primitive Frequent Item Sets, Journal of Macau University of Science and Technology, Vol. 3, No. 1, 2009.
[5] R. S. Thakur, R. C. Jain, K. R. Pardasani, Mining Level-Crossing Association Rules from Large Databases, Journal of Computer Science 2(1), pp. 76-81, 2006.
[6] R. E. Thevar, R. Krishnamoorthy, A New Approach of Modified Transaction Reduction Algorithm for Mining Frequent Item Set, Proceedings of IEEE Workshop on Data Mining and Artificial Intelligence, 2008.
[7] Rajkumar.N, Karthik.M.R, Sivanada.S.N, Fast Algorithm for Mining Multilevel Association Rules, IEEE Trans. Knowledge and Data Engg., Vol. 2, pp. 688-692, 2003.
[8] Pratima Gautham, K. R. Pardasani, Algorithm for Efficient Multilevel Association Rule Mining, International Journal of Computer Science and Engineering, Vol. 2, pp. 1700-1704, 2010.