Catalog Integration R. Agrawal, R. Srikant: WWW-10
Jan 17, 2016
Catalog Integration Problem
Integrate products from new catalog into master catalog.
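[Figure: master catalog with root ICs, categories Logic, Mem., and DSP, and products a, b, c, d, e, f; new catalog with root ICs, categories Cat 1 and Cat 2, and products x, y, z.]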
The Problem (cont.)
After integration:
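[Figure: master catalog with root ICs and categories Logic, Mem., and DSP, now holding products a, b, c, d, e, f together with x, y, z.]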
Desired Solution
Automatically integrate products:
– Little or no effort on the part of the user
– Domain independent
Problem size:
– Million products
– Thousands of categories
Model
Product descriptions consist of words.
Products live in the leaf-level categories.
Basic Algorithm
Build classification model using product descriptions in master catalog.
Use classification model to predict categories for products in the new catalog.
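[Figure: product x from the new catalog scored against the master categories Logic and DSP, with predicted probabilities of 5% and 95%.]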
National Semiconductor Files
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Part_Id: DS14185
Manufacturer: national
Title: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Description: The DS14185 is a three driver, five receiver device which conforms to the EIA/TIA-232-E standard. The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.
...
Part: LM3940 1A Low Dropout Regulator
...
Part: Wide Adjustable Range PNP Voltage Regulator
...
Part: LM2940/LM2940C 1A Low Dropout Regulator
...
National Semiconductor Files with Categories
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Pangea Category:
– Choice 1: Transceiver
– Choice 2: Line Receiver
– Choice 3: Line Driver
– Choice 4: General-Purpose Silicon Rectifier
– Choice 5: Tapped Delay Line
...
Part: LM3940 1A Low Dropout Regulator
Pangea Category:
– Choice 1: Positive Fixed Voltage Regulator
– Choice 2: Voltage-Feedback Operational Amplifier
– Choice 3: Voltage Reference
– Choice 4: Voltage-Mode SMPS Controller
– Choice 5: Positive Adjustable Voltage Regulator
...
Accuracy on Pangea Data
B2B portal for electronic components:
– 1200 categories, 40K training documents
– 500 categories with < 5 documents
Accuracy:
– 72% for top choice
– 99.7% for top 5 choices
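The top-choice and top-5 figures above are both instances of top-k accuracy. A minimal sketch of that metric follows; the function name `top_k_accuracy` and the sample rankings and true labels are illustrative assumptions, not data from the Pangea experiment.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of items whose true label is among the first k ranked choices."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("inputs must have the same length")
    if not true_labels:
        raise ValueError("need at least one item")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Made-up rankings and true labels, for illustration only.
ranked = [
    ["Transceiver", "Line Receiver", "Line Driver"],
    ["Voltage Reference", "Positive Fixed Voltage Regulator", "Line Driver"],
]
truth = ["Transceiver", "Positive Fixed Voltage Regulator"]

top1 = top_k_accuracy(ranked, truth, 1)  # only the first item is a top-1 hit
top3 = top_k_accuracy(ranked, truth, 3)  # both true labels appear in the top 3
```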
Enhanced Algorithm: Intuition
Use affinity information in the catalog to be integrated (new catalog):
– Products in the same category are similar.
– Bias the classifier to incorporate this information.
Accuracy boost depends on the quality of the new catalog:
– Use a tuning set to determine the amount of bias.
Algorithm
Extension of the Naive-Bayes classification to incorporate affinity information
Naive Bayes Classifier
$\Pr(C_i \mid d) = \Pr(C_i)\,\Pr(d \mid C_i) / \Pr(d)$ // Bayes' rule
– $\Pr(d)$: same for all categories (ignore)
– $\Pr(C_i) = \#\text{docs in } C_i \,/\, \#\text{total docs}$
$\Pr(d \mid C_i) = \prod_{w \in d} \Pr(w \mid C_i)$
– Words occur independently (unigram model)
$\Pr(w \mid C_i) = (n(C_i, w) + \lambda) \,/\, (n(C_i) + \lambda |V|)$
– Maximum likelihood estimate smoothed with Lidstone's law of succession
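The basic algorithm (train on the master catalog, then predict categories for new products) can be sketched with the formulas above. This is a minimal illustration, not the authors' implementation: the function names, the toy `master` data, and the default `lam=1.0` are assumptions, and $n(C_i, w)$ is taken here to be the number of occurrences of word w in category $C_i$.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, lam=1.0):
    """docs: list of (category, list_of_words). lam must be > 0 (assumed default 1.0)."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for category, words in docs:
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocab.update(words)
    total_docs = sum(doc_counts.values())
    return {
        # log Pr(Ci) = log(#docs in Ci / #total docs)
        "log_prior": {c: math.log(n / total_docs) for c, n in doc_counts.items()},
        "word_counts": word_counts,
        "total_words": {c: sum(word_counts[c].values()) for c in doc_counts},
        "vocab_size": len(vocab),
        "lam": lam,
    }


def log_likelihood(model, category, words):
    """log Pr(d|Ci) under the unigram model with Lidstone smoothing."""
    lam, vocab_size = model["lam"], model["vocab_size"]
    denom = model["total_words"][category] + lam * vocab_size
    counts = model["word_counts"][category]
    return sum(math.log((counts[w] + lam) / denom) for w in words)


def classify(model, words):
    """Return the category maximizing log Pr(Ci) + log Pr(d|Ci); Pr(d) is ignored."""
    scores = {
        c: log_prior + log_likelihood(model, c, words)
        for c, log_prior in model["log_prior"].items()
    }
    return max(scores, key=scores.get)


# Toy master catalog (made-up descriptions, for illustration only).
master = [
    ("Logic", ["nand", "gate", "logic"]),
    ("Logic", ["inverter", "gate"]),
    ("Mem.", ["dram", "memory", "chip"]),
    ("DSP", ["signal", "processor", "fixed", "point"]),
]
model = train_naive_bayes(master)
predicted = classify(model, ["dram", "memory"])
```

Working in log space avoids underflow when descriptions contain many words.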
Enhanced Algorithm
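$\Pr(C_i \mid d, S)$ // d existed in category S
$= \Pr(C_i, d, S) \,/\, \Pr(d, S)$
– $\Pr(C_i, d, S) = \Pr(d, S)\,\Pr(C_i \mid d, S)$
$= \Pr(C_i)\,\Pr(S, d \mid C_i) \,/\, \Pr(d, S)$
$= \Pr(C_i)\,\Pr(S \mid C_i)\,\Pr(d \mid C_i) \,/\, \Pr(S, d)$
– Assuming d, S independent given $C_i$
$= \Pr(S)\,\Pr(C_i \mid S)\,\Pr(d \mid C_i) \,/\, \Pr(S, d)$
– $\Pr(S \mid C_i)\,\Pr(C_i) = \Pr(C_i \mid S)\,\Pr(S)$
$= \Pr(C_i \mid S)\,\Pr(d \mid C_i) \,/\, \Pr(d \mid S)$
– $\Pr(S, d) = \Pr(S)\,\Pr(d \mid S)$
Same as NB except $\Pr(C_i \mid S)$ instead of $\Pr(C_i)$
– Ignore $\Pr(d \mid S)$ as it is the same for all classes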
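The final line of the derivation says the enhanced score is the Naive Bayes likelihood multiplied by a source-conditioned prior. A minimal sketch of that combination is below; `enhanced_posterior` and the example numbers are hypothetical. Normalizing over categories plays the role of dividing by Pr(d|S).

```python
import math


def enhanced_posterior(log_likelihoods, prior_given_source):
    """Combine log Pr(d|Ci) with Pr(Ci|S) and normalize over categories.

    log_likelihoods: dict category -> log Pr(d|Ci)
    prior_given_source: dict category -> Pr(Ci|S)
    """
    log_scores = {}
    for category, log_lik in log_likelihoods.items():
        prior = prior_given_source.get(category, 0.0)
        if prior > 0.0:  # categories with zero prior get zero posterior
            log_scores[category] = math.log(prior) + log_lik
    if not log_scores:
        raise ValueError("no category has a positive prior")
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: u / total for c, u in unnormalized.items()}


# Made-up likelihoods: the text alone slightly favors Logic.
log_lik = {"Logic": math.log(0.02), "DSP": math.log(0.01)}
flat = enhanced_posterior(log_lik, {"Logic": 0.5, "DSP": 0.5})
biased = enhanced_posterior(log_lik, {"Logic": 0.1, "DSP": 0.9})
```

In this toy case the flat prior keeps Logic on top, while a prior that says most of S belongs in DSP flips the decision.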
Computing Pr(Ci|S)
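$\Pr(C_i \mid S) = \dfrac{|C_i| \times (\#\text{docs in } S \text{ predicted to be in } C_i)^{w}}{\sum_{j \in [1,n]} |C_j| \times (\#\text{docs in } S \text{ predicted to be in } C_j)^{w}}$
$|C_i|$ = #docs in $C_i$ in the master catalog
w determines the weight of the new catalog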
– Use a tune set of documents in the new catalog for which the correct categorization in the master catalog is known
– Choose one weight for the entire new catalog or different weights for different sections
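The estimate of Pr(Ci|S) can be sketched directly from the formula. The function name, the category sizes, and the predicted counts below are illustrative assumptions. Note that with w = 0 every count is raised to the power zero, so the estimate reduces to |Ci| divided by the total, which is the basic Naive Bayes prior.

```python
def prior_given_source(master_sizes, predicted_counts, w):
    """Estimate Pr(Ci|S) for one new-catalog category S.

    master_sizes: dict category -> |Ci|, #docs in Ci in the master catalog
    predicted_counts: dict category -> #docs in S predicted to be in Ci
    w: weight given to the new catalog's grouping
    """
    # Python evaluates 0 ** 0 as 1, so w = 0 yields the plain master prior.
    weights = {
        category: size * (predicted_counts.get(category, 0) ** w)
        for category, size in master_sizes.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all category weights are zero")
    return {category: value / total for category, value in weights.items()}


# Made-up sizes and first-pass predictions for one source category S.
sizes = {"Logic": 100, "Mem.": 50, "DSP": 50}
predicted = {"Logic": 1, "DSP": 9}

basic = prior_given_source(sizes, predicted, w=0)
weighted = prior_given_source(sizes, predicted, w=1)
strong = prior_given_source(sizes, predicted, w=2)
```

Larger w pushes more of the probability mass toward the categories that most documents of S were predicted into.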
Superiority of the Enhanced Algorithm
Theorem: The highest possible accuracy achievable with the enhanced algorithm is no worse than what can be achieved with the basic algorithm.
Catch: The optimum value of the weight for which enhanced achieves highest accuracy is data dependent.
The tune set method attempts to select a good value for weight, but there is no guarantee of success.
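One plausible way to realize the tune set method is a simple grid search over candidate weights. The sketch below assumes a hypothetical callback `classify_with_weight(doc, w)` that runs the enhanced classifier at weight w; the toy classifier and tune set are invented for illustration, and the candidate grid mirrors the weights shown on the accuracy plots.

```python
def choose_weight(candidate_weights, tune_set, classify_with_weight):
    """Pick the candidate weight with the highest accuracy on the tune set.

    tune_set: list of (doc, true_master_category)
    classify_with_weight: hypothetical callable (doc, w) -> predicted category
    Ties are resolved in favor of the earliest candidate.
    """
    if not tune_set:
        raise ValueError("tune set must not be empty")
    best_weight, best_accuracy = None, -1.0
    for w in candidate_weights:
        correct = sum(
            1 for doc, truth in tune_set if classify_with_weight(doc, w) == truth
        )
        accuracy = correct / len(tune_set)
        if accuracy > best_accuracy:
            best_weight, best_accuracy = w, accuracy
    return best_weight, best_accuracy


def toy_classifier(doc, w):
    """Stand-in classifier: follows the source grouping once w is large enough."""
    return doc["source_hint"] if w >= doc["threshold"] else doc["text_guess"]


# Invented tune set: each doc has a text-only guess and a source-based hint.
tune_set = [
    ({"text_guess": "Logic", "source_hint": "DSP", "threshold": 5}, "DSP"),
    ({"text_guess": "Mem.", "source_hint": "Logic", "threshold": 50}, "Mem."),
    ({"text_guess": "DSP", "source_hint": "DSP", "threshold": 1}, "DSP"),
]
weights = [1, 2, 5, 10, 25, 50, 100, 200]
best_weight, best_accuracy = choose_weight(weights, tune_set, toy_classifier)
```

As the slide cautions, a weight that scores best on a small tune set need not be the best for the whole catalog.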
Empirical Evaluation
Start with a real catalog M.
Remove n products from M to form the new catalog N.
In the new catalog N:
– Assign f*n products to the same category as in M
– Assign the rest to other categories as per some distribution (but remember their true category)
Accuracy: fraction of products in N assigned to their true categories
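The experimental setup can be sketched as follows. The slides leave the distribution for misassigned products unspecified, so this sketch assumes a uniform choice over the other categories; the function names, the `seed` parameter, and the toy catalog are likewise assumptions.

```python
import random


def make_new_catalog(master, n, f, categories, seed=0):
    """Split a catalog M into the remaining master and a synthetic new catalog N.

    master: list of (product, true_category)
    Returns (remaining_master, new_catalog), where new_catalog holds
    (product, assigned_category, true_category) triples.
    Requires at least two categories so a "wrong" category always exists.
    """
    if n > len(master):
        raise ValueError("cannot remove more products than the catalog holds")
    rng = random.Random(seed)
    shuffled = list(master)
    rng.shuffle(shuffled)
    removed, remaining = shuffled[:n], shuffled[n:]
    keep = round(f * n)  # number of products that keep their true category
    new_catalog = []
    for index, (product, true_category) in enumerate(removed):
        if index < keep:
            assigned = true_category
        else:
            # Assumption: uniform over the other categories.
            others = [c for c in categories if c != true_category]
            assigned = rng.choice(others)
        new_catalog.append((product, assigned, true_category))
    return remaining, new_catalog


def accuracy(predicted, new_catalog):
    """Fraction of products in N predicted to be in their true category."""
    correct = sum(
        1 for guess, (_, _, truth) in zip(predicted, new_catalog) if guess == truth
    )
    return correct / len(new_catalog)


# Toy catalog with made-up product ids.
categories = ["Logic", "Mem.", "DSP"]
master = [("p%d" % i, categories[i % 3]) for i in range(10)]
remaining, new_catalog = make_new_catalog(master, n=5, f=0.8, categories=categories)
```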
Improvement in Accuracy (Pangea)
[Figure: accuracy versus weight on Pangea. x-axis: Weight (1, 2, 5, 10, 25, 50, 100, 200); y-axis: Accuracy (65 to 100); series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]
Improvement in Accuracy (Reuters)
[Figure: accuracy versus weight on Reuters. x-axis: Weight (1, 2, 5, 10, 25, 50, 100, 200); y-axis: Accuracy (82 to 100); series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]
Improvement in Accuracy (Google.Outdoors)
[Figure: accuracy versus weight on Google.Outdoors. x-axis: Weight (1, 5, 25, 100, 400, 1000); y-axis: Accuracy (50 to 100); series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]
Tune Set Size (Pangea)
[Figure: accuracy versus tune set size on Pangea. x-axis: Tune Set Size (0, 5, 10, 20, 35, 50); y-axis: Accuracy (70 to 95); series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]
Similar results for Reuters and Google.
Empirical Results
[Figure: error rate versus purity. x-axis: Purity (no. of classes and their distribution): 71-22-6, 79-21, 100; y-axis: % Errors (0 to 20); series: Standard, Enhanced.]
Summary
Classification accuracy can be improved by factoring in the affinity information implicit in the data to be categorized.
How to apply these ideas to other types of classifiers?