Machine Learning and Data Mining 15-381 3-April-2003 Jaime Carbonell
Data Mining in eCommerce Web-Based Information Architectures

Nov 30, 2014

Page 1: Data Mining in eCommerce Web-Based Information Architectures

Machine Learning and Data Mining 15-381

3-April-2003

Jaime Carbonell

Page 2: Data Mining in eCommerce Web-Based Information Architectures

General Topic: Data Mining

• Typology of Machine Learning

• Data Bases (brief review/intro)

• Data Mining (DM)

• Supervised Learning Methods in DM

• Evaluating ML/DM Systems

Page 3: Data Mining in eCommerce Web-Based Information Architectures

Typology of Machine Learning Methods (1)

• Learning by caching
  – What/when to cache
  – When to use/invalidate/update cache

• Learning from Examples (aka "Supervised" learning)
  – Labeled examples for training
  – Learn the mapping from examples to labels
  – E.g.: Naive Bayes, Decision Trees, ...
  – Text Categorization (using kNN or other means) is a learning-from-examples task

Page 4: Data Mining in eCommerce Web-Based Information Architectures

Typology of Machine Learning Methods (2)

• "Speedup" Learning
  – Tuning search heuristics from experience
  – Inducing explicit control knowledge
  – Analogical learning (generalized instances)

• Optimization "policy" learning
  – Predicting continuous objective function
  – E.g. Regression, Reinforcement, ...

• New Pattern Discovery (aka "Unsupervised" Learning)
  – Finding meaningful correlations in data
  – E.g. association rules, clustering, ...

Page 5: Data Mining in eCommerce Web-Based Information Architectures

Data Bases in a Nutshell (1)

Ingredients
• A Data Base is a set of one or more rectangular tables (aka "matrices", "relational tables").
• Each table consists of m records (aka "tuples").
• Each of the m records consists of n values, one for each of the n attributes.
• Each column in the table consists of all the values for the attribute it represents.

Page 6: Data Mining in eCommerce Web-Based Information Architectures

Data Bases in a Nutshell (2)

Ingredients
• A data-table scheme is just the list of table column headers in their left-to-right order. Think of it as a table with no records.

• A data-table instance is the content of the table (i.e. a set of records) consistent with the scheme.

• For real data bases: m >> n.

Page 7: Data Mining in eCommerce Web-Based Information Architectures

Data Bases in a Nutshell (3)

A Generic DB table

            Attr1,     Attr2,     ...,  Attrn
Record-1    t_{1,1},   t_{1,2},   ...,  t_{1,n}
Record-2    t_{2,1},   t_{2,2},   ...,  t_{2,n}
   .           .
   .           .
   .           .
Record-m    t_{m,1},   t_{m,2},   ...,  t_{m,n}

Page 8: Data Mining in eCommerce Web-Based Information Architectures

Example DB tables (1)

Customer DB Table

Customer-Schema = (SSN, Name, YOB, DOA, user-id)

SSN          Name    YOB   DOA       user-id
110-20-3003  Smith   1954  12-07-99  asmith
034-67-1188  Jones   1962  11-02-99  jjones
404-10-1111  Suzuki  1948  24-04-00  suzuki
333-10-0066  Smith   1972  24-04-00  asmith2
…            …       …     …         …
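The scheme/instance distinction can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the names `customer_scheme`, `customer_instance`, `is_consistent` and `column` are hypothetical, and the records are the four customer rows from the slide.

```python
# Scheme: the ordered column headers. Instance: records consistent with it.
customer_scheme = ("SSN", "Name", "YOB", "DOA", "user-id")
customer_instance = [
    ("110-20-3003", "Smith", 1954, "12-07-99", "asmith"),
    ("034-67-1188", "Jones", 1962, "11-02-99", "jjones"),
    ("404-10-1111", "Suzuki", 1948, "24-04-00", "suzuki"),
    ("333-10-0066", "Smith", 1972, "24-04-00", "asmith2"),
]


def is_consistent(scheme, instance):
    """Every record must supply exactly one value per attribute."""
    return all(len(record) == len(scheme) for record in instance)


def column(scheme, instance, attribute):
    """All values of one attribute, in record order."""
    j = scheme.index(attribute)
    return [record[j] for record in instance]


print(is_consistent(customer_scheme, customer_instance))
print(column(customer_scheme, customer_instance, "YOB"))
```

An empty instance list paired with the same scheme corresponds to the "table with no records" view of a scheme.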

Page 9: Data Mining in eCommerce Web-Based Information Architectures

Example DB tables (2)

Transaction DB table

Transaction-Schema = (user-id, DOT, product, help, tcode, price)

user-id  DOT       product    help  tcode  price
asmith2  24-04-00  book-2241  N     10001  23.95
asmith2  25-04-00  CD-1129    N     10002  18.95
suzuki   25-04-00  book-5011  Y     10003  44.50
asmith2  30-04-00  CD-1129    N     10004  18.95
asmith2  30-04-00  CD-1131    N     10005  19.95
jjones   01-05-00  *err*      Y     10006  0.00
suzuki   05-05-00  book-7702  N     10007  39.95
jjones   05-05-00  CD-2380    Y     10008  12.95
asmith2  06-05-00  CD-2380    N     10009  21.95
jjones   09-05-00  book-1922  Y     10010  7.95
…        …         …          …     …      …

Page 10: Data Mining in eCommerce Web-Based Information Architectures

Data Bases Facts (1)

DB Tables

• m ≤ O(10^6), n ≤ O(10^2)

• matrix T_{i,j} (a DB "table") is dense

• Each t_{i,j} is any scalar data type (real, integer, boolean, string, ...)

• All entries in a given column of a DB-table must have the same data type.

Page 11: Data Mining in eCommerce Web-Based Information Architectures

Data Bases Facts (2)

DB Query

• Relational algebra query system (e.g. SQL)

• Retrieves individual records, subsets of tables, or information linked across tables (DB joins on unique fields)

• See DB optional textbook for details
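As a rough illustration of a join on a unique field, the sketch below loads a few rows from the example customer and transaction tables into an in-memory SQLite database and joins them on the user id. The table names `customer` and `txn`, the underscored column names, and the helper `purchases_for` are assumptions made for this example; the slides do not prescribe any particular SQL system.

```python
import sqlite3

# In-memory database with a few rows copied from the example tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (ssn TEXT, name TEXT, yob INTEGER, doa TEXT, user_id TEXT)"
)
conn.execute(
    "CREATE TABLE txn (user_id TEXT, dot TEXT, product TEXT, help TEXT,"
    " tcode INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?, ?)",
    [
        ("404-10-1111", "Suzuki", 1948, "24-04-00", "suzuki"),
        ("333-10-0066", "Smith", 1972, "24-04-00", "asmith2"),
    ],
)
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("asmith2", "24-04-00", "book-2241", "N", 10001, 23.95),
        ("suzuki", "25-04-00", "book-5011", "Y", 10003, 44.50),
        ("suzuki", "05-05-00", "book-7702", "N", 10007, 39.95),
    ],
)


def purchases_for(user_id):
    """Join customer and txn on user_id and list that customer's purchases."""
    query = (
        "SELECT c.name, t.product, t.price "
        "FROM customer AS c JOIN txn AS t ON c.user_id = t.user_id "
        "WHERE c.user_id = ? ORDER BY t.tcode"
    )
    return conn.execute(query, (user_id,)).fetchall()


print(purchases_for("suzuki"))
```

The join works because user-id identifies exactly one customer record; joining on Name would mix up the two customers called Smith.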

Page 12: Data Mining in eCommerce Web-Based Information Architectures

Data Base Design Issues (1)

Design Issues
• What additional table(s) are needed?
• Why do we need multiple DB tables? Why not encode everything into one big table?
• How do we search a DB table? How about the full DB?
• How do we update a DB instance? How do we update a DB schema?

Page 13: Data Mining in eCommerce Web-Based Information Architectures

Data Base Design Issues (2)

Unique keys
• Any column can serve as search key
• Superkey = unique record identifier
  user-id and SSN for customer
  tcode for product
• Sometimes superkey = 2 or more keys
  e.g.: nationality + passport-number
• Candidate Key = minimal superkey = unique key

Update
Used for cross-products and joins
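The superkey and candidate-key definitions can be checked mechanically against a table instance. The following is a minimal sketch using the four customer records from the example table; `is_superkey` and `is_candidate_key` are hypothetical helper names. Note the limitation: a test over one instance can refute a key, but it cannot prove that a column set stays unique for all future records.

```python
from itertools import combinations

customers = [
    {"SSN": "110-20-3003", "Name": "Smith", "YOB": 1954, "DOA": "12-07-99", "user-id": "asmith"},
    {"SSN": "034-67-1188", "Name": "Jones", "YOB": 1962, "DOA": "11-02-99", "user-id": "jjones"},
    {"SSN": "404-10-1111", "Name": "Suzuki", "YOB": 1948, "DOA": "24-04-00", "user-id": "suzuki"},
    {"SSN": "333-10-0066", "Name": "Smith", "YOB": 1972, "DOA": "24-04-00", "user-id": "asmith2"},
]


def is_superkey(records, columns):
    """True if the chosen columns take a distinct value combination in every record."""
    keys = [tuple(rec[c] for c in columns) for rec in records]
    return len(set(keys)) == len(keys)


def is_candidate_key(records, columns):
    """True if the columns form a superkey and no proper subset does (minimality)."""
    if not is_superkey(records, columns):
        return False
    return not any(
        is_superkey(records, subset)
        for size in range(1, len(columns))
        for subset in combinations(columns, size)
    )


print(is_superkey(customers, ["Name"]))              # two Smiths share a Name
print(is_candidate_key(customers, ["user-id"]))
print(is_candidate_key(customers, ["Name", "YOB"]))  # superkey, but not minimal here
```

In this small instance YOB alone happens to be unique, which is exactly why (Name, YOB) fails the minimality test; in a realistic customer table YOB would repeat.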

Page 14: Data Mining in eCommerce Web-Based Information Architectures

Data Base Design Issues (3)

Drops and errors

• Missing data -- always happens

• Erroneously entered data (type checking, range checking, consistency checking, ...)
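A small sketch of what type, range, and consistency checks might look like for one transaction record. The specific rules (a positive numeric price, a product that is neither empty nor `*err*`, a user-id that exists in the customer table) are assumptions chosen for illustration, and `check_transaction` is a hypothetical helper.

```python
def check_transaction(txn, known_users):
    """Return a list of problems found in one transaction record (empty if clean)."""
    problems = []
    # Missing-data check: the product field must carry a real value.
    if txn.get("product") in (None, "", "*err*"):
        problems.append("missing product")
    # Type check, then range check, on the price.
    price = txn.get("price")
    if not isinstance(price, (int, float)):
        problems.append("price has wrong type")
    elif price <= 0:
        problems.append("price out of range")
    # Consistency check across tables: the buyer must be a known customer.
    if txn.get("user-id") not in known_users:
        problems.append("unknown user-id")
    return problems


known_users = {"asmith", "jjones", "suzuki", "asmith2"}
good = {"user-id": "suzuki", "product": "book-5011", "price": 44.50}
bad = {"user-id": "jjones", "product": "*err*", "price": 0.00}

print(check_transaction(good, known_users))
print(check_transaction(bad, known_users))
```

Checks like these catch entry errors cheaply; they do nothing for values that are wrong but plausible.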

Page 15: Data Mining in eCommerce Web-Based Information Architectures

Data Base Design Issues (4)

Text Mining

• Rows in T_{m,n} are document vectors

• n = vocabulary size = O(10^5)

• m = documents = O(10^5)

• T_{m,n} is sparse

• Same data type for every cell t_{i,j} in T_{m,n}
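Because most words never occur in a given document, a document-term table is usually stored sparsely. The sketch below uses two made-up toy documents and stores each row as a mapping from column index to count, with missing entries read as zero; the names `docs`, `vocab`, `col`, `sparse_rows` and `cell` are illustrative only.

```python
from collections import Counter

# Two toy documents (invented for this example).
docs = [
    "data mining finds patterns in data",
    "decision trees are classifiers",
]

# Vocabulary = one table column per distinct word.
vocab = sorted({word for doc in docs for word in doc.split()})
col = {word: j for j, word in enumerate(vocab)}

# Sparse row: only nonzero cells are stored, keyed by column index.
sparse_rows = [
    {col[word]: count for word, count in Counter(doc.split()).items()}
    for doc in docs
]


def cell(i, j):
    """Value of t_{i,j}; absent entries are implicit zeros."""
    return sparse_rows[i].get(j, 0)


stored = sum(len(row) for row in sparse_rows)
print(stored, "stored cells out of", len(docs) * len(vocab))
```

Even in this tiny case half of the cells are zero; with a vocabulary of order 10^5 the fraction of nonzero cells per document is far smaller.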

Page 16: Data Mining in eCommerce Web-Based Information Architectures

DATA MINING [Supervised] (1)

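Given:
• A data base table T_{m,n}
• Predictor attributes: t_j
• Predicted attributes: t_k (k ≠ j)

Find Predictor Functions:
F_k: t_j --> t_k, such that, for each k:

F_k = argmin_{F_{l,k}} Error_{L2}[F_{l,k}(t_j), t_k]

where F_{l,k} ranges over the candidate functions and the error is measured with the L2 norm (or L1, or L-infinity norm, ...)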
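The argmin above can be read literally as a search over a set of candidate functions. The sketch below does exactly that for a tiny, hand-picked candidate set and made-up data; it is an illustration of the selection criterion, not a practical learning algorithm. The names `l2_error`, `best_predictor`, and the three candidates are hypothetical.

```python
def l2_error(f, xs, ys):
    """L2 norm of the residual vector (f(x) - y)."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) ** 0.5


def best_predictor(candidates, xs, ys, error=l2_error):
    """Return the name of the candidate function with the smallest error."""
    return min(candidates, key=lambda name: error(candidates[name], xs, ys))


# Made-up predictor values t_j and predicted values t_k.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

candidates = {
    "double": lambda x: 2 * x,
    "plus_one": lambda x: x + 1,
    "square": lambda x: x * x,
}

print(best_predictor(candidates, xs, ys))
```

Passing a different `error` function (for example a sum of absolute residuals for L1) changes the criterion without changing the search.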

Page 17: Data Mining in eCommerce Web-Based Information Architectures

DATA MINING [Supervised] (2)

Where typically:
• There is only one t_k of interest and therefore only one F_k(t_j)
• t_k may be boolean
  => F_k is a binary classifier
• t_k may be nominal (finite set)
  => F_k is an n-ary classifier
• t_k may be a real number
  => F_k is an approximating function
• t_k may be an arbitrary string
  => F_k is hard to formalize

Page 18: Data Mining in eCommerce Web-Based Information Architectures

DATA MINING APPLICATIONS (1)

FINANCE:
• Credit-card & Loan Fraud Detection
• Time Series Investment Portfolio
• Credit Decisions & Collections

HEALTHCARE:
• Decision Support: optimal treatment choice
• Survivability Predictions
• Medical facility utilization predictions

Page 19: Data Mining in eCommerce Web-Based Information Architectures

DATA MINING APPLICATIONS (2)

MANUFACTURING:
• Numerical Controller Optimizations
• Factory Scheduling optimization

MARKETING & SALES:
• Demographic Segmentation
• Marketing Strategy Effectiveness
• New Product Market Prediction
• Market-basket analysis

Page 20: Data Mining in eCommerce Web-Based Information Architectures

Simple Data Mining Example (1)

                      Tot Num  Max Num
Acct.  Income   Job   Delinq   Delinq   Owns   Credit  Final
numb.  in K/yr  Now?  accts    cycles   home?  years   disp.
------------------------------------------------------------
1001   25       Y     1        1        N      2       Y
1002   60       Y     3        2        Y      5       N
1003   ?        N     0        0        N      2       N
1004   52       Y     1        2        N      9       Y
1005   75       Y     1        6        Y      3       Y
1006   29       Y     2        1        Y      1       N
1007   48       Y     6        4        Y      8       N
1008   80       Y     0        0        Y      0       Y
1009   31       Y     1        1        N      1       Y
1011   45       Y     ?        0        ?      7       Y
1012   59       ?     2        4        N      2       N
1013   10       N     1        1        N      3       N
1014   51       Y     1        3        Y      1       Y
1015   65       N     1        2        N      8       Y
1016   20       N     0        0        N      0       N
1017   55       Y     2        3        N      2       N
1018   40       N     0        0        Y      1       Y

Page 21: Data Mining in eCommerce Web-Based Information Architectures

Simple Data Mining Example (2)

                      Tot Num  Max Num
Acct.  Income   Job   Delinq   Delinq   Owns   Credit  Final
numb.  in K/yr  Now?  accts    cycles   home?  years   disp.
------------------------------------------------------------
1019   80       Y     1        1        Y      0       Y
1021   18       Y     0        0        N      4       Y
1022   53       Y     3        2        Y      5       N
1023   0        N     1        1        Y      3       N
1024   90       N     1        3        Y      1       Y
1025   51       Y     1        2        N      7       Y
1026   20       N     4        1        N      1       N
1027   32       Y     2        2        N      2       N
1028   40       Y     1        1        Y      1       Y
1029   31       Y     0        0        N      1       Y
1031   45       Y     2        1        Y      4       Y
1032   90       ?     3        4        ?      ?       N
1033   30       N     2        1        Y      2       N
1034   88       Y     1        2        Y      5       Y
1035   65       Y     1        4        N      5       Y
1036   12       N     1        1        N      1       N

Page 22: Data Mining in eCommerce Web-Based Information Architectures

Simple Data Mining Example (3)

                      Tot Num  Max Num
Acct.  Income   Job   Delinq   Delinq   Owns   Credit  Final
numb.  in K/yr  Now?  accts    cycles   home?  years   disp.
------------------------------------------------------------
1037   28       Y     3        3        Y      2       N
1038   66       ?     0        0        ?      ?       Y
1039   50       Y     2        1        Y      1       Y
1041   ?        Y     0        0        Y      8       Y
1042   51       N     3        4        Y      2       N
1043   20       N     0        0        N      2       N
1044   80       Y     1        3        Y      7       Y
1045   51       Y     1        2        N      4       Y
1046   22       ?     ?        ?        N      0       N
1047   39       Y     3        2        ?      4       N
1048   70       Y     0        0        ?      1       Y
1049   40       Y     1        1        Y      1       Y
------------------------------------------------------------

Page 23: Data Mining in eCommerce Web-Based Information Architectures

Trend Detection in DM (1)

Example: Sales Prediction

2003 Q1 sales = 4.0M
2003 Q2 sales = 3.5M
2003 Q3 sales = 3.0M
2003 Q4 sales = ??

Page 24: Data Mining in eCommerce Web-Based Information Architectures

Trend Detection in DM (2)

Now if we knew last year:
2002 Q1 sales = 3.5M
2002 Q2 sales = 3.1M
2002 Q3 sales = 2.8M
2002 Q4 sales = 4.5M

And if we knew previous year:
2001 Q1 sales = 3.2M
2001 Q2 sales = 2.9M
2001 Q3 sales = 2.5M
2001 Q4 sales = 3.7M

Page 25: Data Mining in eCommerce Web-Based Information Architectures

Trend Detection in DM (3)

What will 2001 Q4 sales be?

What if Christmas 2000 was cancelled?

What will 2002 Q4 sales be?

Page 26: Data Mining in eCommerce Web-Based Information Architectures

Trend Detection in DM II (1)

Methods
• Numerical series extrapolation
• Cyclical curve fitting
  – Find period of cycle
  – Fit curve for each period (often with L2 or L-infinity norm)
  – Find translation (series extrapolation)
  – Extrapolate to estimate desired values
• Preclassify data first (e.g. "recession" and "expansion" years)
• Combine with "standard" data mining
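To make the contrast between plain series extrapolation and a cycle-aware estimate concrete, here is a deliberately simple sketch using quarterly sales figures of the kind shown in the sales-prediction example (4.0M, 3.5M, 3.0M for 2003 Q1 to Q3, plus two complete earlier years). The seasonal-ratio method below is one simple stand-in chosen for illustration, not the curve-fitting procedure the slide outlines, and the function names are hypothetical.

```python
# Quarterly sales in millions; 2003 Q4 is the unknown to estimate.
sales = {
    2001: [3.2, 2.9, 2.5, 3.7],
    2002: [3.5, 3.1, 2.8, 4.5],
    2003: [4.0, 3.5, 3.0],
}


def naive_linear_q4(quarters):
    """Extend the within-year straight-line trend, ignoring any yearly cycle."""
    step = quarters[2] - quarters[1]
    return quarters[2] + step


def seasonal_ratio_q4(sales, target_year):
    """Scale the target year's Q3 by the average Q4/Q3 ratio of complete years."""
    ratios = [
        q[3] / q[2]
        for year, q in sales.items()
        if year != target_year and len(q) == 4
    ]
    seasonal_ratio = sum(ratios) / len(ratios)
    return sales[target_year][2] * seasonal_ratio


print(round(naive_linear_q4(sales[2003]), 2))
print(round(seasonal_ratio_q4(sales, 2003), 2))
```

The within-year trend alone points to a further drop to 2.5M, while the seasonal estimate lands at roughly 4.6M because both earlier years show a strong Q4 rebound. Two years of history is very little evidence, so the second number should be read as a rough guide only.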

Page 27: Data Mining in eCommerce Web-Based Information Architectures

Trend Detection in DM II (2)

Thorny Problems
• How to use external knowledge* to make up for limitations in the data?
• How to make longer-range extrapolations?
• How to cope with corrupted data?
  – Random point errors (easy)
  – Systematic error (hard)
  – Malicious errors (impossible)

Page 28: Data Mining in eCommerce Web-Based Information Architectures

Methods for Supervised DM (1)

Classifiers
• Linear Separators (regression)
• Naive Bayes (NB)
• Decision Trees (DTs)
• k-Nearest Neighbor (kNN)
• Decision rule induction
• Support Vector Machines (SVMs)
• Neural Networks (NNs)
  ...

Page 29: Data Mining in eCommerce Web-Based Information Architectures

Methods for Supervised DM (2)

Points of Comparison
• Hard vs Soft decisions
  (e.g. DTs and rules vs kNN, NB)
• Human-interpretable decision rules
  (best: rules, worst: NNs, SVMs)
• Training data needed (less is better)
  (best: kNNs, worst: NNs)
• Graceful data-error tolerance
  (best: NNs, kNNs, worst: rules)
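The hard-versus-soft distinction can be seen in a small kNN sketch: the fraction of nearest neighbors voting for a class is a soft score, and thresholding it gives a hard decision. The training points below are (income in K/yr, total delinquent accounts) pairs with their final disposition, taken from seven rows of the credit example; using only these two unscaled attributes, k = 3, and a 0.5 threshold are all simplifying assumptions.

```python
import math

# (income in K/yr, total delinquent accounts) -> final disposition
training = [
    ((25, 1), "Y"),  # acct 1001
    ((60, 3), "N"),  # acct 1002
    ((52, 1), "Y"),  # acct 1004
    ((75, 1), "Y"),  # acct 1005
    ((29, 2), "N"),  # acct 1006
    ((48, 6), "N"),  # acct 1007
    ((80, 0), "Y"),  # acct 1008
]


def knn_soft(query, training, k=3):
    """Fraction of the k nearest neighbors labeled 'Y' (a soft decision)."""
    def distance(item):
        point = item[0]
        return math.hypot(query[0] - point[0], query[1] - point[1])

    nearest = sorted(training, key=distance)[:k]
    return sum(1 for _, label in nearest if label == "Y") / k


def knn_hard(query, training, k=3):
    """Threshold the soft score to obtain a hard Y/N decision."""
    return "Y" if knn_soft(query, training, k) > 0.5 else "N"


print(knn_soft((50, 1), training), knn_hard((50, 1), training))
print(knn_soft((78, 0), training), knn_hard((78, 0), training))
```

Because income spans a much larger numeric range than the delinquency count, it dominates the Euclidean distance here; a real application would normally rescale the attributes first.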

Page 30: Data Mining in eCommerce Web-Based Information Architectures

Symbolic Rule Induction (1)

General idea

• Labeled instances are DB tuples

• Rules are generalized tuples

• Generalization occurs at term in tuple

• Generalize on new E+ not predicted

• Specialize on new E- not predicted

• Ignore predicted E+ or E-

Page 31: Data Mining in eCommerce Web-Based Information Architectures

Symbolic Rule Induction (2)

Example term generalizations

• Constant => disjunction

e.g. if small portion of value set seen

• Constant => least-common-generalizer class

e.g. if large portion of value set seen

• Number (or ordinal) => range

e.g. if dense sequential sampling
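Two of these term generalizations are easy to sketch: numeric values collapse to a closed range, and symbolic constants collapse to a disjunction of the values seen. The function below is a hypothetical illustration; it does not attempt the least-common-generalizer case (which needs a class hierarchy), and treating an attribute with no known values as "any" is an assumption. The sample values are invented.

```python
def generalize_term(values):
    """Generalize the observed values of one attribute across positive examples."""
    known = [v for v in values if v is not None]  # None marks a missing value
    if not known:
        return ("any",)
    if all(isinstance(v, (int, float)) for v in known):
        # Number (or ordinal) => range
        return ("range", min(known), max(known))
    # Constant => disjunction of the values seen
    return ("one_of", frozenset(known))


# Invented attribute values from a handful of positive examples.
temps = [101, 102, 103, 100, 101, 102]
skins = ["normal", "normal", "flush", "normal"]
cultures = [0.23, 0.00, 0.23, 0.10, None]

print(generalize_term(temps))
print(generalize_term(skins))
print(generalize_term(cultures))
```

Whether a range is a safe generalization depends on how densely the interval was sampled, which is the condition the slide attaches to it.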

Page 32: Data Mining in eCommerce Web-Based Information Architectures

Symbolic Rule Induction (3)

Example term specializations

• class => disjunction of subclasses

• Range => disjunction of sub-ranges

Page 33: Data Mining in eCommerce Web-Based Information Architectures

Symbolic Rule Induction Example (1)

Age  Gender  Temp  b-cult  c-cult  loc  Skin    disease
65   M       101   +       .23     USA  normal  strep
25   M       102   +       .00     CAN  normal  strep
65   M       102   -       .78     BRA  rash    dengue
36   F       99    -       .19     USA  normal  *none*
11   F       103   +       .23     USA  flush   strep
88   F       98    +       .21     CAN  normal  *none*
39   F       100   +       .10     BRA  normal  strep
12   M       101   +       .00     BRA  normal  strep
15   F       101   +       .66     BRA  flush   dengue
20   F       98    +       .00     USA  rash    *none*
81   M       98    -       .99     BRA  rash    ec-12
87   F       100   -       .89     USA  rash    ec-12
12   F       102   +       ??      CAN  normal  strep

14   F       101   +       .33     USA  normal
67   M       102   +       .77     BRA  rash

Page 34: Data Mining in eCommerce Web-Based Information Architectures

Symbolic Rule Induction Example (2)

Candidate Rules:

IF   age    = [12,65]
     gender = *any*
     temp   = [100,103]
     b-cult = +
     c-cult = [.00,.23]
     loc    = *any*
     skin   = (normal,flush)
THEN: strep

IF   age    = (15,65)
     gender = *any*
     temp   = [101,102]
     b-cult = *any*
     c-cult = [.66,.78]
     loc    = BRA
     skin   = rash
THEN: dengue

Disclaimer: These are *not* real medical records
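The two candidate rules can be encoded as generalized tuples and applied to records, as in the sketch below. The encoding is hypothetical: attributes marked *any* are simply omitted, bracketed terms are closed ranges, and the dengue age term (15,65) is read as a closed range, which is an assumption. The four test records are one strep case, one dengue case, and the two unlabeled records from the example table.

```python
STREP_RULE = {
    "age": ("range", 12, 65),
    "temp": ("range", 100, 103),
    "b_cult": ("one_of", {"+"}),
    "c_cult": ("range", 0.00, 0.23),
    "skin": ("one_of", {"normal", "flush"}),
}
DENGUE_RULE = {
    "age": ("range", 15, 65),  # assumption: (15,65) read as a closed range
    "temp": ("range", 101, 102),
    "c_cult": ("range", 0.66, 0.78),
    "loc": ("one_of", {"BRA"}),
    "skin": ("one_of", {"rash"}),
}
RULES = [("strep", STREP_RULE), ("dengue", DENGUE_RULE)]


def term_matches(term, value):
    """Check one attribute value against a range or a disjunction."""
    if term[0] == "range":
        return term[1] <= value <= term[2]
    return value in term[1]


def classify(record):
    """Return the label of the first rule whose terms all match, else None."""
    for disease, rule in RULES:
        if all(term_matches(term, record[attr]) for attr, term in rule.items()):
            return disease
    return None


def make(age, gender, temp, b_cult, c_cult, loc, skin):
    return {"age": age, "gender": gender, "temp": temp, "b_cult": b_cult,
            "c_cult": c_cult, "loc": loc, "skin": skin}


known_strep = make(25, "M", 102, "+", 0.00, "CAN", "normal")
known_dengue = make(65, "M", 102, "-", 0.78, "BRA", "rash")
unlabeled_1 = make(14, "F", 101, "+", 0.33, "USA", "normal")
unlabeled_2 = make(67, "M", 102, "+", 0.77, "BRA", "rash")

for rec in (known_strep, known_dengue, unlabeled_1, unlabeled_2):
    print(classify(rec))
```

Neither rule fires on the two unlabeled records (a c-cult of .33 falls outside both ranges, and age 67 falls outside both age terms), so under this reading the rules would need further generalization before they could label them.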

Page 35: Data Mining in eCommerce Web-Based Information Architectures

Evaluation of ML/DM Methods

• Split labeled data into training & test sets
• Apply ML (d-tree, rules, NB, …) to training
• Measure accuracy (or P, R, F1, …) on test
• Alternatives:
  – K-fold cross-validation
  – Jackknifing (aka "leave one out")

• Caveat: distributional equivalence• Problem: temporally-sequenced data (drift)
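A minimal sketch of the evaluation machinery mentioned above: an index splitter for k-fold cross-validation (leave-one-out is the special case k = n) and a precision/recall/F1 computation for one positive class. Function names and the interleaved fold assignment are choices made for this example, and the labels are invented.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


def precision_recall_f1(actual, predicted, positive="Y"):
    """Precision, recall and F1 for the given positive label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 3-fold split of six items, then leave-one-out on four items.
print(list(k_fold_splits(6, 3)))
print([test for _, test in k_fold_splits(4, 4)])

# Invented labels for one test fold.
print(precision_recall_f1(["Y", "Y", "N", "N"], ["Y", "N", "Y", "N"]))
```

Interleaved folds like these mix early and late records in every training set, so for temporally-sequenced data a split that trains only on earlier records and tests on later ones is the safer design.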