Data Mining 1

Data Mining is one aspect of Database Query Processing (on the "what if" or pattern-and-trend end of Query Processing, rather than the "please find" or straightforward end). To say it another way, data mining queries are on the ad hoc or unstructured end of the query spectrum, rather than the standard report generation, "retrieve all records matching a criterion", or SQL side.
Still, Data Mining queries ARE queries and are processed (or will eventually be processed) by a Database Management System the same way queries are processed today, namely:
1. SCAN and PARSE (SCANNER-PARSER): A Scanner identifies the tokens or language elements of the DM query. The Parser checks for syntax or grammar validity.
2. VALIDATE: The Validator checks for valid names and semantic correctness.
3. CONVERT: The Converter converts the query to an internal representation.
4. QUERY OPTIMIZE: The Optimizer devises a strategy for executing the DM query (chooses among alternative internal representations).
5. CODE GENERATION: generates code to implement each operator in the selected DM query plan (the optimizer-selected internal representation).
6. RUNTIME DATABASE PROCESSING: run the plan code.
Developing new, efficient, and effective Data Mining Query (DMQ) processors is the central need and issue in DBMS research today (far and away!).
These notes concentrate on step 5, i.e., generating code (algorithms) to implement operators (at a high level), namely operators that do: Association Rule Mining (ARM), Clustering (CLU), and Classification (CLA).
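The six stages above can be sketched end to end. Everything in this sketch is an invented toy stand-in (the stage names come from the notes, but the sample query syntax and every function body are illustrative only, nothing like a real DBMS):

```python
# Toy sketch of the six query-processing stages; all implementations and the
# "ARM minsupp=... minconf=..." query syntax are hypothetical.

def scan(text):                      # 1. SCANNER: split the query into tokens
    return text.split()

def parse(tokens):                   # 1. PARSER: trivial grammar check
    if not tokens or tokens[0] not in {"ARM", "CLU", "CLA"}:
        raise SyntaxError("unknown DM operator")
    return {"op": tokens[0], "args": tokens[1:]}

def validate(tree, known_ops):       # 2. VALIDATOR: name/semantic check
    assert tree["op"] in known_ops

def convert(tree):                   # 3. CONVERTER: internal representation
    return ("dm_plan", tree["op"], tuple(tree["args"]))

def optimize(internal):              # 4. OPTIMIZER: choose among plans
    return internal                  # only one alternative in this toy

def generate_code(plan):             # 5. CODE GENERATION: code per operator
    return lambda: f"executing {plan[1]} with {plan[2]}"

def run(code):                       # 6. RUNTIME PROCESSOR: run the plan code
    return code()

tree = parse(scan("ARM minsupp=0.5 minconf=0.75"))
validate(tree, {"ARM", "CLU", "CLA"})
print(run(generate_code(optimize(convert(tree)))))
```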
Machine Learning is almost always based on Near Neighbor Set(s), NNS.
Clustering, even density-based, identifies near neighbor cores first (round NNSs, about a center).
Classification is continuity based, and Near Neighbor Sets (NNS) are the central concept in continuity:
∀ε>0 ∃δ>0 : d(x,a)<δ ⇒ d(f(x),f(a))<ε, where f assigns a class to a feature vector; or: for each ε-NNS of f(a), there is a δ-NNS of a in its pre-image. If f(Dom) is categorical, then ∃δ>0 : d(x,a)<δ ⇒ f(x)=f(a).
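A minimal sketch of ε-NNS classification (the function names `eps_nns` and `classify` and the sample points are illustrative, not from the notes): the ε-NNS of a sample a is every training point within distance ε of a, and classification takes a majority vote over that set.

```python
import math

def eps_nns(a, labeled, eps):
    """All (point, class) pairs whose point lies within eps of a."""
    return [(p, c) for p, c in labeled if math.dist(a, p) < eps]

def classify(a, labeled, eps):
    """Majority class vote over the eps-NNS of a (None if the NNS is empty)."""
    votes = {}
    for _, c in eps_nns(a, labeled, eps):
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get) if votes else None

labeled = [((0.0, 0.0), "red"), ((0.1, 0.0), "red"), ((1.0, 1.0), "blue")]
print(classify((0.05, 0.0), labeled, eps=0.5))  # red
```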
Data Mining can be broken down into 2 areas, Machine Learning and Assoc. Rule Mining
Machine Learning can be broken down into 2 areas, Clustering and Classification.
Clustering can be broken down into 2 types, Isotropic (round clusters) and Density-based
Classification can be broken down into 2 types, Model-based and Neighbor-based.
Database analysis can be broken down into 2 areas, Querying and Data Mining.
Caution: For classification, boundary analysis may also be needed to see the class (done by projecting?).
Finding an NNS in a lower dimension may still be the first step. E.g., suppose points 1–8 all lie near an unclassified sample a, with 1,2,3,4 red-class and 5,6,7,8 blue-class. Any ε that yields votes yields a tie vote (0-to-0, then 4-to-4). But projecting onto the vertical subspace, then taking ε/2, we see that the ε/2-disk about a contains only blue-class votes (5 and 6).
Using horizontal data, NNS derivation requires ≥1 database scan (O(n)). L∞ ε-NNSs can be derived using vertical data in O(log₂n) (though Euclidean disks are usually preferred). (Euclidean and L∞ disks coincide on binary data sets.)
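One way to see the coincidence claim on binary data: any differing bit contributes a full 1 to both metrics, so for ε ≤ 1 the Euclidean and L∞ ε-disks contain exactly the same points. A brute-force check over all 4-bit binary vectors (illustrative code, not from the notes):

```python
from itertools import product

def d_inf(x, y):
    """L-infinity (max) distance."""
    return max(abs(a - b) for a, b in zip(x, y))

def d_euc(x, y):
    """Euclidean (L2) distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

eps = 1.0
cube = list(product([0, 1], repeat=4))           # all 4-bit binary vectors
for x in cube:
    disk_inf = {y for y in cube if d_inf(x, y) < eps}
    disk_euc = {y for y in cube if d_euc(x, y) < eps}
    assert disk_inf == disk_euc                  # identical eps-NNS
print("disks agree on all", len(cube), "binary points")
```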
Association Rule Mining (ARM)

Assume a relationship between two entities, T (e.g., a set of Transactions an enterprise performs) and I (e.g., a set of Items which are acted upon by those transactions).
In Market Basket Research (MBR), a transaction is a checkout transaction and an item is an item in that customer's market basket going through checkout.
An I-Association Rule, A→C, relates 2 disjoint subsets of I (I-itemsets) and has 2 main measures, support and confidence (A is called the antecedent, C is called the consequent).
There are also the dual concepts of T-association rules (just reverse the roles of T and I above). Examples of Association Rules include: In MBR, the relationship between customer cash-register transactions, T, and purchasable items, I (t is related to i iff i is bought by that customer during cash-register transaction t).
In Software Engineering (SE), the relationship between Aspects, T, and Code Modules, I (t is related to i iff module, i, is part of the aspect, t).
In Bioinformatics, the relationship between experiments, T, and genes, I (t is related to i iff gene, i, expresses at a threshold level during experiment, t).
In ER diagramming, any "part of" relationship in which i∈I is part of t∈T (t is related to i iff i is part of t); and any "ISA" relationship in which i∈I ISA t∈T (t is related to i iff i IS A t) . . .
The support of an I-set, A, is the fraction of T-instances related to every I-instance in A. E.g., if A={i1,i2} and C={i4}, then supp(A) = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5. Note: | | means set size, the count of elements in the set. I.e., t2 and t4 are the only transactions from the total transaction set, T={t1,t2,t3,t4,t5}, that are related to both i1 and i2 (buy i1 and i2 during the pertinent T-period of time).
The support of the rule, A→C, is defined as supp(A→C) = supp(A∪C) = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5.
The confidence of the rule, A→C, is conf(A→C) = supp(A∪C) / supp(A) = (2/5) / (2/5) = 1.
DM queriers typically want STRONG RULES: supp ≥ minsupp, conf ≥ minconf (minsupp and minconf are threshold levels).
Note that conf(A→C) is also just the conditional probability of t being related to C, given that t is related to A.
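The worked numbers can be checked directly. The item lists for t1, t3, and t5 below are invented filler (the notes only fix that t2 and t4 are the transactions containing {i1,i2,i4}; that is all that matters for the arithmetic):

```python
# transaction -> set of related items (t1/t3/t5 contents are illustrative)
relation = {
    "t1": {"i1", "i3"},
    "t2": {"i1", "i2", "i4"},
    "t3": {"i3"},
    "t4": {"i1", "i2", "i4"},
    "t5": {"i2"},
}

def supp(itemset):
    """Fraction of transactions related to every item in itemset."""
    return sum(itemset <= items for items in relation.values()) / len(relation)

A, C = {"i1", "i2"}, {"i4"}
print(supp(A))                  # 0.4  (= 2/5)
print(supp(A | C))              # 0.4  (support of the rule A -> C)
print(supp(A | C) / supp(A))    # 1.0  (confidence of A -> C)
```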
[Diagram: the T-to-I relationship as a bipartite graph, transactions t1–t5 on one side and items i1–i4 on the other, with the antecedent A={i1,i2} and the consequent C={i4} marked.]
Finding Strong Association Rules

The relationship between Transactions and Items can be expressed in a Transaction Table where each transaction is a row containing its ID and the list of the items that are related to that transaction:

TID   A B C D E F
2000  1 1 1 0 0 0
1000  1 0 1 0 0 0
4000  1 0 0 1 0 0
5000  0 1 0 0 1 1

If minsupp is set by the querier at .5 and minconf at .75, to find frequent or Large itemsets (support ≥ minsupp):
PseudoCode (assume the items in each itemset of Lk-1 are ordered):

Step 1: self-join Lk-1 with itself
  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1=q.item1, ..., p.itemk-2=q.itemk-2, p.itemk-1<q.itemk-1

Step 2: pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck
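The two-step candidate generation above can be made runnable as follows (a sketch; `apriori_gen` and the sample L2 are illustrative). Itemsets are kept as sorted tuples, matching the pseudocode's "items are ordered" assumption:

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Candidate k-itemsets from the set of large (k-1)-itemsets."""
    k = len(next(iter(L_prev))) + 1
    # Step 1: self-join -- p and q agree on their first k-2 items
    Ck = {p + (q[-1],)
          for p in L_prev for q in L_prev
          if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2: prune any candidate having an infrequent (k-1)-subset
    return {c for c in Ck
            if all(s in L_prev for s in combinations(c, k - 1))}

L2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")}
# ('B','C','D') is joined but pruned, since ('C','D') is not in L2
print(sorted(apriori_gen(L2)))  # [('A', 'B', 'C')]
```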
The Transaction Bitmap Table can be expressed using "Item bit vectors" (one bit vector per item, one bit per transaction).
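A sketch of the bit-vector idea, using Python ints as bit vectors (the bits come from the transaction table above; taking the most significant bit as transaction 2000 is an arbitrary choice here). The support count of an itemset is the 1-count of the AND of its item bit vectors:

```python
# bits over transactions 2000, 1000, 4000, 5000 (MSB = transaction 2000)
item_vec = {
    "A": 0b1110, "B": 0b1001, "C": 0b1100,
    "D": 0b0010, "E": 0b0001, "F": 0b0001,
}

def support_count(itemset):
    """1-count of the AND of the item bit vectors."""
    v = 0b1111                  # all-ones mask over the 4 transactions
    for i in itemset:
        v &= item_vec[i]
    return bin(v).count("1")

print(support_count({"A"}))       # 3
print(support_count({"A", "C"}))  # 2  (transactions 2000 and 1000)
```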
Inheritance property: Any subset of a large itemset is large. Why? Every transaction containing an itemset also contains each of its subsets, so a subset's support can only be at least as large (e.g., if {A, B} is large, {A} and {B} must be large).
APRIORI METHOD: Iteratively find the large k-itemsets, k=1...
Find all association rules supported by each large Itemset.
Ck denotes candidate k-itemsets generated at each step.
Lk denotes Large k-itemsets.
Start by finding large 1-itemsets.

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

1-itemset supports: A:3, B:2, C:2, D:1, E:1, F:1. Large (supp ≥ 2): A:3, B:2, C:2.
No subset antecedent can yield a strong rule either (i.e., no need to check conf({2}→{3,5}) or conf({5}→{2,3}), since both denominators will be at least as large and therefore both confidences will be at least as low).
No need to check conf({3}→{2,5}) or conf({5}→{2,3}). DONE!
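Putting the pieces together, a compact Apriori loop over the example transactions, with minsupp = .5 (a count of 2 of the 4 transactions). This is an illustrative sketch, not the course's P-tree implementation:

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_count = 2                       # minsupp = .5 over 4 transactions

def count(itemset):
    """Number of transactions containing every item of itemset."""
    return sum(set(itemset) <= t for t in transactions)

# L1: large 1-itemsets
Lk = {(i,) for t in transactions for i in t if count((i,)) >= min_count}
large = [Lk]
while Lk:
    # candidate generation: self-join, then prune by the inheritance property
    Ck = {tuple(sorted(set(p) | set(q)))
          for p in Lk for q in Lk
          if p[:-1] == q[:-1] and p[-1] < q[-1]}
    Ck = {c for c in Ck if all(s in Lk for s in combinations(c, len(c) - 1))}
    Lk = {c for c in Ck if count(c) >= min_count}
    if Lk:
        large.append(Lk)

print([sorted(L) for L in large])   # [[('A',), ('B',), ('C',)], [('A', 'C')]]
```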
[P-tree figure annotations: a 0 at an upper level makes the entire left branch 0; these 0s make this node 0; these 1s and these 0s make this 1. The 2^1 level holds the only 1-bit, so the 1-count = 1 × 2^1 = 2.]
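The 1-count arithmetic can be sketched with a toy binary-fan-out tree (real P-trees use a different layout and compression; `build` and `one_count` are illustrative names): a pure-1 node at level l covers 2^l bit positions and contributes 1 × 2^l ones without touching individual bits.

```python
def build(bits):
    """Return 1 (pure-1), 0 (pure-0), or a (left, right) mixed node."""
    if all(bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (build(bits[:mid]), build(bits[mid:]))

def one_count(node, level):
    """Count 1-bits from the tree: a pure-1 node at level l holds 2**l ones."""
    if node == 1:
        return 2 ** level
    if node == 0:
        return 0
    left, right = node
    return one_count(left, level - 1) + one_count(right, level - 1)

bits = [0, 0, 0, 0, 1, 1, 0, 0]     # the only 1-run sits at a 2^1-level node
print(one_count(build(bits), 3))    # 2  (= 1 * 2**1)
```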
Processing Efficiencies? (prefixed leaf-sizes have been removed)
Ptree-ARM versus Apriori on aerial photo (RGB) data together with yield data.
Scalability with support threshold
• 1320×1320 pixel TIFF-Yield dataset (total number of transactions is ~1,700,000).
[Chart: run time (sec., 0–800) vs. support threshold (10%–90%) for P-ARM and Apriori.]
P-ARM compared to Horizontal Apriori (classical) and FP-growth (an improvement of it). In P-ARM, we find all frequent itemsets, not just those containing Yield (for fairness). Aerial TIFF images (R, G, B) with synchronized yield (Y).
Scalability with number of transactions
[Chart: time (sec., 0–1200) vs. number of transactions (100K–1700K) for Apriori and P-ARM.]
Identical results. P-ARM is more scalable for lower support thresholds, and the P-ARM algorithm is more scalable to large spatial datasets.
P-ARM versus FP-growth (see literature for definition)
Scalability with support threshold
[Chart: run time (sec., 0–800) vs. support threshold (10%–90%) for P-ARM and FP-growth; 17,424,000 pixels (transactions).]
Scalability with number of transactions: [Chart: time (sec., 0–1200) vs. number of transactions (100K–1700K) for FP-growth and P-ARM.]
FP-growth is an efficient, tree-based frequent pattern mining method (details later). For a dataset of 100K bytes, FP-growth runs very fast. But for images of large size, P-ARM achieves better performance; P-ARM also achieves better performance in the case of a low support threshold.
Other methods (other than FP-growth) to improve Apriori's efficiency (see the literature or the html notes 10datamining.html in Other Materials for more detail):
• Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Sampling: mining on a subset of given data, lower support threshold + a method to determine the completeness
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
The core of the Apriori algorithm:
• Use only large (k–1)-itemsets to generate candidate large k-itemsets
• Use database scan and pattern matching to collect counts for the candidate itemsets
The bottleneck of Apriori: candidate generation.
1. Huge candidate sets: 10^4 large 1-itemsets may generate ~10^7 candidate 2-itemsets. To discover a large pattern of size 100, e.g., {a1…a100}, we need to generate 2^100 ≈ 10^30 candidates.
2. Multiple scans of the database: needs (n+1) scans, where n is the length of the longest pattern.