The Apriori Algorithm and its Extension by the Application of DeMorgan's Laws
Mitch Fernandez, Giri Narasimhan
Aug 07, 2015
The only way to verify you’ve found all of the item sets in a database is to check all of the possible item sets.
[Figure: Processing Time (seconds) vs. Max Items (1 to 9). Fitted curve: f(x) = 0.00167030278004725 exp(2.1736345740781 x), R² = 0.994652089677174. Annotation: 11.3 hours.]
[Figure: Item Sets Produced vs. Max Items (1 to 9). Fitted curve: f(x) = 0.00131257744766744 x^11.7821784802445, R² = 0.999883665008539.]
An NP-Hard Problem
Maximum Size  Item Sets    Time (secs) - Ermine  Time (secs) - Laptop
2             5            0.162                 0.150
3             537          0.642                 0.810
4             14,858       10.456                10.590
5             212,681      98.549                102.000
6             1,927,513    1,135.844
7             12,385,790   7,992.549
8             60,727,444   40,870.235
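To give a feel for why exhaustive checking becomes expensive, the following minimal sketch counts how many distinct item sets of bounded size can be formed from a fixed pool of items. The function name is hypothetical, and the pool size of 289 is borrowed from the lupus data set described later in these slides; the counts are an upper bound on candidates, not the "Item Sets" column of the table above.

```python
from math import comb


def count_possible_item_sets(n_items: int, max_size: int) -> int:
    """Number of non-empty item sets containing at most max_size items."""
    return sum(comb(n_items, k) for k in range(1, max_size + 1))


# 289 is the item count reported for the lupus data set in these slides.
for max_size in range(1, 9):
    print(max_size, count_possible_item_sets(289, max_size))
```

Each extra unit of maximum size multiplies the number of possible sets many times over, which is in line with the steep growth in the timing table.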
The Apriori Algorithm
First pass
• Count all item occurrences to calculate support
Subsequent kth pass
• Use item sets from the (k – 1)th pass to generate candidate item sets
• Calculate support for candidate item sets
• Prune candidates with support below threshold
• Proceed to the (k + 1)th pass
The Apriori Algorithm
         Hot Dogs  Buns  Mustard  Chips
Trans01  1         1     1        1
Trans02  0         1     0        1
Trans03  1         1     0        0
Trans04  0         1     0        1
Trans05  1         0     0        0
Set minimum support to 0.25
1st Pass
• {Hot Dogs} = 0.60
• {Buns} = 0.80
• {Mustard} = 0.20
• {Chips} = 0.60
2nd Pass
• {Hot Dogs, Buns} = 0.40
• {Hot Dogs, Chips} = 0.20
• {Buns, Chips} = 0.60
3rd Pass
• {Hot Dogs, Buns, Chips} = 0.20
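The passes can be sketched in a few lines of Python and run on the five transactions above. This is a minimal illustration rather than the implementation used for the timings in these slides; the function name `apriori` and the choice of frozensets are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(items): support} for every frequent item set."""
    n = len(transactions)

    def support(item_set):
        # Fraction of transactions that contain every item in item_set.
        return sum(item_set <= t for t in transactions) / n

    # First pass: support of every single item.
    current = {}
    for item in sorted(set().union(*transactions)):
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Join: unions of two frequent (k-1)-sets that contain exactly k items.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count support and drop candidates below the threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


transactions = [
    frozenset({"Hot Dogs", "Buns", "Mustard", "Chips"}),  # Trans01
    frozenset({"Buns", "Chips"}),                          # Trans02
    frozenset({"Hot Dogs", "Buns"}),                       # Trans03
    frozenset({"Buns", "Chips"}),                          # Trans04
    frozenset({"Hot Dogs"}),                               # Trans05
]
result = apriori(transactions, min_support=0.25)
for item_set, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(item_set), round(s, 2))
```

With the subset-pruning step included, {Hot Dogs, Buns, Chips} is never counted, because {Hot Dogs, Chips} already fell below the threshold in the second pass; its support of 0.20 would fail the threshold in any case.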
Creating a Complement Set
         Item01  Item02  Item03  Item04
Trans01  1       0       1       1
Trans02  0       1       0       1
Trans03  1       0       0       1
Trans04  0       1       1       0
Trans05  1       0       0       1
Trans06  0       1       1       1
Creating a Complement Set
         Item01  Item02  Item03  Item04
Trans01  0       1       0       0
Trans02  1       0       1       0
Trans03  0       1       1       0
Trans04  1       0       0       1
Trans05  0       1       1       0
Trans06  1       0       0       0
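Taking the complement is a single pass over a 0/1 table. A minimal sketch, assuming the data is held as a list of rows (the function name `complement` is hypothetical):

```python
def complement(data):
    """Flip every 0/1 entry of a binary transaction table."""
    return [[1 - value for value in row] for row in data]


original = [
    [1, 0, 1, 1],  # Trans01
    [0, 1, 0, 1],  # Trans02
    [1, 0, 0, 1],  # Trans03
    [0, 1, 1, 0],  # Trans04
    [1, 0, 0, 1],  # Trans05
    [0, 1, 1, 1],  # Trans06
]
flipped = complement(original)
for row in flipped:
    print(row)
```

Applying `complement` twice returns the original table, which is a convenient sanity check.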
What DeMorgan Means for Apriori
If A ∨ B has 90% support, then ¬(A ∨ B) must have 10% support. Since ¬(A ∨ B) = ¬A ∧ ¬B, ¬A ∧ ¬B must have 10% support.
What DeMorgan Means for Apriori
If ¬A ∧ ¬B has 10% support, then ¬(A ∨ B) must have 10% support, since ¬A ∧ ¬B = ¬(A ∨ B). Therefore, A ∨ B must have 90% support.
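The relationship can be written as a one-line identity. The symbols are assumptions introduced for illustration: A and B stand for two items, and supp(X) denotes the fraction of transactions in which X holds.

```latex
\begin{align*}
\operatorname{supp}(A \lor B)
  &= 1 - \operatorname{supp}\bigl(\lnot (A \lor B)\bigr) \\
  &= 1 - \operatorname{supp}(\lnot A \land \lnot B)
\end{align*}
```

With supp(¬A ∧ ¬B) = 0.10 this gives supp(A ∨ B) = 1 − 0.10 = 0.90, matching the 10% and 90% figures on the slide.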
Putting it Into Practice
1. Take the complement of the original data set by changing all the 1s to 0s and all the 0s to 1s
2. Set minimum support as low as possible and set an upper limit for maximum support
3. Run Apriori – Resulting item sets will be ORs
4. Correct the reported support by subtracting it from 100%
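But what if…
What if your data set looks like this?
         Item01  Item02  Item03  Item04
Trans01  1       0       1       1
Trans02  0       1       0       1
Trans03  1       0       0       1
Trans04  0       1       1       0
Trans05  1       0       0       1
Trans06  0       1       1       1
Item01 or Item02 is found in every patient, so the set has 100% support. But Apriori can’t find that! In the complement, Item01 and Item02 never occur together, so the corresponding item set has 0% support and falls below any minimum support threshold.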
… then
Add a dummy transaction to ensure all OR sets can be found
         Item01  Item02  Item03  Item04
Trans01  1       0       1       1
Trans02  0       1       0       1
Trans03  1       0       0       1
Trans04  0       1       1       0
Trans05  1       0       0       1
Trans06  0       1       1       1
Dummy01  1       1       1       1
Now Apriori can find it!
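The effect of the dummy row can be checked directly by counting. The sketch below makes two assumptions that the slides do not spell out: the all-ones dummy is appended to the complemented table (following the step order in the complete procedure), and its contribution is subtracted again before the support is converted back to an OR support. The helper name `count_all_present` is hypothetical.

```python
def count_all_present(data, columns):
    """Number of rows in which every listed column is 1."""
    return sum(all(row[c] for c in columns) for row in data)


original = [
    [1, 0, 1, 1],  # Trans01
    [0, 1, 0, 1],  # Trans02
    [1, 0, 0, 1],  # Trans03
    [0, 1, 1, 0],  # Trans04
    [1, 0, 0, 1],  # Trans05
    [0, 1, 1, 1],  # Trans06
]
complemented = [[1 - v for v in row] for row in original]
pair = [0, 1]  # column indices of Item01 and Item02

# Without a dummy the pair never co-occurs in the complement.
count_without_dummy = count_all_present(complemented, pair)

# Assumption: the dummy row is appended to the complemented data.
with_dummy = complemented + [[1, 1, 1, 1]]
count_with_dummy = count_all_present(with_dummy, pair)

# Assumption: remove the dummy's contribution, then subtract from 100%.
n_real, n_dummy = len(original), 1
or_support = 1 - (count_with_dummy - n_dummy) / n_real

print(count_without_dummy, count_with_dummy, or_support)
```

Subtracting the dummy before converting keeps the reported OR support at 100% rather than letting the extra row dilute it.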
The Complete Procedure
1. Take the complement of the original data set by changing all the 1s to 0s and all the 0s to 1s
2. Add dummy transactions to ensure all item sets can be found
3. Set minimum support as low as possible and set an upper limit for maximum support
4. Run Apriori – Resulting item sets will be ORs
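The steps above can be strung together as follows. This is a sketch under stated assumptions, not the authors' implementation: a brute-force enumeration of column combinations stands in for Apriori to keep the example short, the function and parameter names are hypothetical, and the final support correction (taken from the "Putting it Into Practice" slide) removes the dummy rows before subtracting from 100%, which is an assumption of this example.

```python
from itertools import combinations


def or_item_sets(data, names, min_support, max_support, max_size=2, n_dummy=1):
    """Find OR item sets by mining AND item sets in the complemented data."""
    n_real = len(data)
    # Step 1: complement the 0/1 table.
    comp = [[1 - v for v in row] for row in data]
    # Step 2: append all-ones dummy transactions.
    comp += [[1] * len(names) for _ in range(n_dummy)]
    n_total = len(comp)

    results = {}
    # Steps 3 and 4: enumerate item sets (stand-in for Apriori) and keep
    # those whose support lies between the minimum and maximum.
    for size in range(2, max_size + 1):
        for cols in combinations(range(len(names)), size):
            count = sum(all(row[c] for c in cols) for row in comp)
            support = count / n_total
            if min_support <= support <= max_support:
                # Correction: drop the dummies, then subtract from 100%.
                or_support = 1 - (count - n_dummy) / n_real
                results[tuple(names[c] for c in cols)] = or_support
    return results


names = ["Item01", "Item02", "Item03", "Item04"]
data = [
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
]
results = or_item_sets(data, names, min_support=0.1, max_support=0.5)
for item_set, s in results.items():
    print(" OR ".join(item_set), round(s, 4))
```

The upper limit on support in the complement corresponds to a lower limit on the OR support that is reported, so it controls how weak an OR set is still worth listing.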
Lupus
289 Items to Analyze
134 Subjects
• 17 Muscular Symptoms
• 4 Neurological Symptoms
• 16 Dermatological Symptoms
• 10 Inflammatory Symptoms
• 23 Major Organ Symptoms
• 61 Miscellaneous Symptoms
• 158 Laboratory Tests
Lupus – Symptoms Needed for Full Coverage
[Figure labels: Raynaud’s Syndrome, Photosensitivity, Nasal or Oral Ulcers, Joint Swelling, Swelling of 3 or More Joints, Hand Swelling, Joint Pain, Swelling of Lower Extremities, Alopecia, Swelling of Hand Muscles, Fever, Malar Rash]
Lupus – Full Coverage for SLE vs MCTD
[Figure with two groups, SLE and MCTD. Labels, in the order extracted: Joint Pain, Hand Swelling, Nasal or Oral Ulcers, Malar Rash, Raynaud’s Syndrome, Hand Swelling, Muscle Weakness]
Chronic Obstructive Pulmonary Disease
9939 Items to Analyze
55 Subjects
[Figure: Subjects with COPD. Groups: Active Smokers, Former Smokers, Never Smokers]
Metagenomic Data Set
           Species01  Species02  Species03  Species04
Patient01  12,347     400        1,254      845
Patient02  8,457      523        1,632      698
Patient03  9,322      1,351      1,324      1,709
Patient04  4,252      451        1,155      821
Patient05  7,453      1,625      959        58
Patient06  8,255      54         1,434      457
Metagenomic Data Set
           Species01  Species02  Species03  Species04
Patient01  High       Low        Med        Low
Patient02  High       Low        Med        Low
Patient03  High       Med        Med        Med
Patient04  High       Low        Med        Low
Patient05  High       Med        Low        Low
Patient06  High       Low        Med        Low
Distribution of Normalized Read Counts
[Figure: two histograms of Frequency vs. Z-score, with the Z-score axis divided into Low, Med, and High regions]
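Turning read counts into Low/Med/High items might look like the sketch below. The slides do not state which values are normalized together or where the cutoffs sit, so both are assumptions here: z-scores are computed over one patient's row, and the cutoffs of -1.0 and 1.0 are hypothetical. The output is illustrative and is not expected to reproduce the table above. The "name = level" item format follows the item set details shown in the results table of these slides.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Z-score of each value relative to the mean and std. dev. of the list."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def discretize(zs, low_cut=-1.0, high_cut=1.0):
    """Map z-scores to levels; the cutoffs are hypothetical."""
    return ["Low" if z < low_cut else "High" if z > high_cut else "Med" for z in zs]


def to_items(names, levels):
    """Build 'name = level' items so each row becomes a transaction."""
    return {f"{name} = {level}" for name, level in zip(names, levels)}


species = ["Species01", "Species02", "Species03", "Species04"]
counts = [12347, 400, 1254, 845]  # Patient01 from the read-count table

levels = discretize(z_scores(counts))
items = to_items(species, levels)
print(sorted(items))
```

Once every patient is expressed as a set of such items, the data has the same transaction form that Apriori expects.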
Significant Frequent Item Sets
Item Set ID  Support - Active  Support - Former  Support - Never  Max - Min  No. of Items  p-value  Item Set Details
Active_0003  50.00%            4.17%             44.44%           45.83%     2             0.0242   Family.Pseudomonadaceae.001 = High, Rhizobium.001 = Med
Active_0009  50.00%            8.33%             22.22%           41.67%     3             0.0702   Staphylococcus.001 = High, Prevotella.002 = High, Porphyromonas.001 = Med
Active_0026  50.00%            12.50%            11.11%           38.89%     4             0.0889   Prevotella.002 = High, Mogibacterium.001 = Med, Family.Carnobacteriaceae.001 = Med, Order.Actinomycetales.001 = Med
Active_0028  50.00%            20.83%            0.00%            50.00%     4             0.1033   Staphylococcus.001 = High, Prevotella.002 = High, Order.Actinomycetales.001 = Med, Mogibacterium.001 = Med
Active_0020  50.00%            25.00%            0.00%            50.00%     4             0.1469   Staphylococcus.001 = High, Prevotella.002 = High, Family.Prevotellaceae.001 = Med, Mycoplasma.001 = Med
Active_0029  50.00%            25.00%            0.00%            50.00%     4             0.1469   Family.Pseudomonadaceae.001 = High, Order.Actinomycetales.001 = Med, Prevotella.002 = High, Mogibacterium.001 = Med
Active_0010  50.00%            12.50%            22.22%           37.50%     3             0.1524   Family.Pseudomonadaceae.001 = High, Prevotella.002 = High, Porphyromonas.001 = Med
Active_0021  50.00%            12.50%            22.22%           37.50%     4             0.1524   Neisseria.001 = High, Prevotella.002 = High, Order.Actinomycetales.001 = Med, Treponema.001 = Med
Active_0014  50.00%            16.67%            11.11%           38.89%     3             0.1636   Family.Prevotellaceae.001 = Med, Prevotella.002 = High, Order.Actinomycetales.001 = Med
Former_0003  13.64%            50.00%            22.22%           36.36%     3             0.1868   Neisseria.001 = High, Porphyromonas.001 = High, Order.Clostridiales.004 = Med
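The "Max - Min" column is the gap between the largest and smallest of the three group supports. A minimal sketch using two rows from the table (the function name `support_spread` is hypothetical; the statistical test behind the p-value column is not described in the slides and is not reproduced here):

```python
def support_spread(supports):
    """Largest minus smallest support across groups, in percentage points."""
    return max(supports.values()) - min(supports.values())


rows = {
    "Active_0003": {"Active": 50.00, "Former": 4.17, "Never": 44.44},
    "Former_0003": {"Active": 13.64, "Former": 50.00, "Never": 22.22},
}
spreads = {item_set_id: round(support_spread(s), 2) for item_set_id, s in rows.items()}
for item_set_id, spread in spreads.items():
    print(item_set_id, spread)
```

A large spread flags item sets whose frequency differs most between the smoker groups, which is presumably why the column sits next to the p-value.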