Data Mining in eCommerce Web-Based Information Architectures MSEC 20-760 Mini II Jaime Carbonell


Dec 17, 2015


Data Mining in eCommerce
Web-Based Information Architectures

MSEC 20-760
Mini II

Jaime Carbonell


General Topic: Data Mining

• Typology of Machine Learning

• Data Bases (review/intro)

• Data Mining (DM)

• Supervised methods for DM

• Applications (e.g. Text Mining)


Machine Learning

• Discovering useful patterns in data
  – Data: DB tables, text, time-series, …
  – Patterns: generalizable and predictive

• Learning methods are:
  – Deductive (e.g. cache implications)
  – Inductive (e.g. rules to summarize data)
  – Abductive (e.g. generative models)


Typology of Machine Learning Methods

• Learning by caching (remember key results)
• Learning from examples (“supervised learning”)
• Learning by experimentation (“active learning”)
• Learning from experience (“reinforcement and speedup learning”)
• Learning from time-series data
• Learning by discovery (“unsupervised learning”)


Data Bases in a Nutshell (1)

Ingredients
• A Data Base is a set of one or more rectangular tables (aka "matrices", "relational tables").
• Each table consists of m records (aka "tuples").
• Each of the m records consists of n values, one for each of the n attributes.
• Each column in the table consists of all the values for the attribute it represents.


Data Bases in a Nutshell (2)

Ingredients
• A data-table scheme is just the list of table column headers in their left-to-right order. Think of it as a table with no records.
• A data-table instance is the content of the table (i.e. a set of records) consistent with the scheme.
• For real data bases: m >> n.


Data Bases in a Nutshell (3)

A Generic DB table

           Attr_1     Attr_2     ...   Attr_n
Record-1   t_{1,1}    t_{1,2}    ...   t_{1,n}
Record-2   t_{2,1}    t_{2,2}    ...   t_{2,n}
   .          .          .                .
   .          .          .                .
   .          .          .                .
Record-m   t_{m,1}    t_{m,2}    ...   t_{m,n}


Example DB tables (1)

Customer DB Table

Customer-Schema = (SSN, Name, YOB, DOA, user-id)

SSN          Name    YOB   DOA       user-id
110-20-3003  Smith   1954  12-07-99  asmith
034-67-1188  Jones   1962  11-02-99  jjones
404-10-1111  Suzuki  1948  24-04-00  suzuki
333-10-0066  Smith   1972  24-04-00  asmith2
…            …       …     …         …


Example DB tables (2)

Transaction DB table

Transaction-Schema = (user-id, DOT, product, help, tcode, price)

user-id  DOT       product    help  tcode  price
asmith2  24-04-00  book-2241  N     10001  23.95
asmith2  25-04-00  CD-1129    N     10002  18.95
suzuki   25-04-00  book-5011  Y     10003  44.50
asmith2  30-04-00  CD-1129    N     10004  18.95
asmith2  30-04-00  CD-1131    N     10005  19.95
jjones   01-05-00  *err*      Y     10006  0.00
suzuki   05-05-00  book-7702  N     10007  39.95
jjones   05-05-00  CD-2380    Y     10008  12.95
asmith2  06-05-00  CD-2380    N     10009  21.95
jjones   09-05-00  book-1922  Y     10010  7.95
…        …         …          …     …      …


Data Bases Facts (1)

DB Tables

• m ≤ O(10^6), n ≤ O(10^2)

• matrix T_{i,j} (a DB "table") is dense

• Each t_{i,j} is any scalar data type (real, integer, boolean, string, ...)

• All entries in a given column of a DB-table must have the same data type.


Data Bases Facts (2)

DB Queries:

• Relational algebra query system (SQL)

• Retrieves individual records, subsets of tables, or information linked across tables (DB joins on unique fields)

• See the optional DB textbook for details
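
A minimal sketch of a join on a unique field, using Python's standard sqlite3 module and a few rows copied from the two example tables. The table name txn, the lowercase column names, and the price filter are illustrative assumptions (TRANSACTION is a reserved word in SQLite, so the table cannot be called that without quoting).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (ssn TEXT, name TEXT, yob INTEGER, doa TEXT,
                       user_id TEXT PRIMARY KEY);
CREATE TABLE txn (user_id TEXT, dot TEXT, product TEXT, help TEXT,
                  tcode INTEGER PRIMARY KEY, price REAL);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?)", [
    ("110-20-3003", "Smith", 1954, "12-07-99", "asmith"),
    ("034-67-1188", "Jones", 1962, "11-02-99", "jjones"),
    ("404-10-1111", "Suzuki", 1948, "24-04-00", "suzuki"),
    ("333-10-0066", "Smith", 1972, "24-04-00", "asmith2"),
])
conn.executemany("INSERT INTO txn VALUES (?, ?, ?, ?, ?, ?)", [
    ("asmith2", "24-04-00", "book-2241", "N", 10001, 23.95),
    ("suzuki", "25-04-00", "book-5011", "Y", 10003, 44.50),
    ("suzuki", "05-05-00", "book-7702", "N", 10007, 39.95),
    ("jjones", "05-05-00", "CD-2380", "Y", 10008, 12.95),
])

# Join the two tables on user_id, which is unique in customer.
rows = conn.execute("""
    SELECT c.name, c.yob, t.product, t.price
    FROM customer AS c JOIN txn AS t ON c.user_id = t.user_id
    WHERE t.price > 20
    ORDER BY t.tcode
""").fetchall()

for row in rows:
    print(row)
```

Joining on user_id rather than Name matters here: the two customers named Smith would otherwise be confused with each other.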


Data Base Design Issues (1)

Design Issues
• What additional table(s) are needed?
• Why do we need multiple DB tables?
  Why not encode everything into one big table?
• How do we search a DB table?
  How about the full DB?
• How do we update a DB instance?
  How do we update a DB schema?


Data Base Design Issues (2)

Unique keys
• Any column can serve as search key
• Superkey = unique record identifier
  user-id and SSN for customer
  tcode for product
• Sometimes superkey = 2 or more keys
  e.g.: nationality + passport-number
• Candidate Key = minimal superkey = unique key

Update Used for cross-products and joins
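
The superkey and candidate-key definitions can be checked mechanically on a table instance. The sketch below is a brute-force illustration over the four Customer rows shown earlier; the function names are hypothetical. Note the caveat it exposes: uniqueness in one instance does not prove a key for the schema (YOB happens to be unique in these four rows, but clearly is not a real identifier).

```python
from itertools import combinations

def is_superkey(records, columns):
    """True if no two records agree on all of the given columns."""
    seen = set()
    for rec in records:
        key = tuple(rec[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True

def candidate_keys(records, schema):
    """All minimal superkeys of this instance, by brute force over column subsets."""
    found = []
    for size in range(1, len(schema) + 1):
        for cols in combinations(schema, size):
            if any(set(k) <= set(cols) for k in found):
                continue  # contains a smaller key, so it is not minimal
            if is_superkey(records, cols):
                found.append(cols)
    return found

schema = ("SSN", "Name", "YOB", "DOA", "user-id")
customers = [dict(zip(schema, row)) for row in [
    ("110-20-3003", "Smith", 1954, "12-07-99", "asmith"),
    ("034-67-1188", "Jones", 1962, "11-02-99", "jjones"),
    ("404-10-1111", "Suzuki", 1948, "24-04-00", "suzuki"),
    ("333-10-0066", "Smith", 1972, "24-04-00", "asmith2"),
]]

print(candidate_keys(customers, schema))
```

On these rows the composite (Name, DOA) also comes out as a candidate key, which mirrors the nationality + passport-number case: neither column is unique alone, but the pair is.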


Data Base Design Issues (3)

Drops and errors

• Missing data -- always happens

• Erroneously entered data (type checking, range checking, consistency checking, ...)
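
As a rough illustration of the checks named above, here is a sketch that screens one transaction record for missing values and for type, range, and consistency problems. The product-id pattern, the "?" missing-value marker, and the rule that a price must be positive are assumptions made for the example, not part of the original slides.

```python
import re

PRODUCT_RE = re.compile(r"^(book|CD)-\d+$")  # assumed product-id format
MISSING = (None, "", "?")                    # assumed missing-value markers

def check_transaction(rec):
    """Return a list of problems found in one transaction record (a dict)."""
    problems = []
    for field in ("user-id", "product", "help", "price"):
        if rec.get(field) in MISSING:
            problems.append(f"missing {field}")

    price = rec.get("price")
    if isinstance(price, (int, float)):
        if price <= 0:                                   # range check
            problems.append("price out of range")
    elif price not in MISSING:
        problems.append("price is not a number")          # type check

    product = rec.get("product")
    if isinstance(product, str) and product not in MISSING:
        if not PRODUCT_RE.match(product):                # consistency check
            problems.append("malformed product id")

    if rec.get("help") not in ("Y", "N") + MISSING:
        problems.append("help must be Y or N")
    return problems

good = {"user-id": "asmith2", "product": "book-2241", "help": "N", "price": 23.95}
bad = {"user-id": "jjones", "product": "*err*", "help": "Y", "price": 0.00}
print(check_transaction(good))
print(check_transaction(bad))
```

The second record is transaction 10006 from the example table, which such checks would flag at entry time.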


Data Base Design Issues (4)

Comparing DBs with Text (IR) vectors:

• Rows in T_{m,n} are document vectors

• n = vocabulary size = O(10^5)

• m = documents = O(10^5)

• T_{m,n} is sparse

• Same data type for every cell t_{i,j} in T_{m,n}


Supervised Machine Learning

Given:
• A data base table T_{m,n}
• Predictor attributes: t_{j1}, t_{j2}, …
• To-be-predicted attributes: t_{k1}, t_{k2}, … (k ≠ j)

Find Predictor Functions:
F_{k1}: t_{j1}, t_{j2}, … → t_{k1},  F_{k2}: t_{j1}, t_{j2}, … → t_{k2}, …

such that, for each k_i:

F_{ki} = argmin_f Error[f(t_{j1}, t_{j2}, …), t_{ki}]
with the L1 norm (or L2, L-Chebyshev)
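
The argmin formulation can be illustrated with a toy search over a finite set of candidate functions. This is only a sketch: real learners search a much larger function class, and the candidate functions and data points here are invented for the example. It does show that the chosen predictor can depend on the error norm.

```python
def l1(errors):
    return sum(abs(e) for e in errors)

def l2(errors):
    return sum(e * e for e in errors) ** 0.5

def chebyshev(errors):
    return max(abs(e) for e in errors)

def best_predictor(candidates, rows, targets, norm=l1):
    """Name of the candidate f that minimizes norm(f(row) - target) over all rows."""
    def total_error(f):
        return norm([f(*row) - t for row, t in zip(rows, targets)])
    return min(candidates, key=lambda name: total_error(candidates[name]))

# Hypothetical data: one predictor attribute, one target, with one outlier.
rows = [(1,), (2,), (3,), (4,)]
targets = [2, 4, 6, 12]
candidates = {
    "double": lambda x: 2 * x,
    "double_plus_1": lambda x: 2 * x + 1,
}

for norm in (l1, l2, chebyshev):
    print(norm.__name__, best_predictor(candidates, rows, targets, norm))
```

The L1 norm tolerates the single outlier and prefers the function that fits the other three points exactly, while L2 and Chebyshev trade small errors everywhere for a smaller worst-case error.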


DATA MINING [Supervised] (2)

Where typically:
• There is only one t_k of interest and therefore only one F_k(t_j)
• t_k may be boolean
  => F_k is a binary classifier
• t_k may be nominal (finite set)
  => F_k is an n-ary classifier
• t_k may be a real number
  => F_k is an approximating function
• t_k may be an arbitrary string (rare case)
  => F_k is hard to formalize


DATA MINING APPLICATIONS (1)

FINANCE:
• Credit-card & Loan Fraud Detection
• Time Series Investment Portfolio
• Credit Decisions & Collections

HEALTHCARE:
• Decision Support: optimal treatment choice
• Survivability Predictions
• Medical facility utilization predictions


DATA MINING APPLICATIONS (2)

MANUFACTURING:
• Numerical Controller Optimizations
• Factory Scheduling optimization

MARKETING & SALES:
• Demographic Segmentation
• Marketing Strategy Effectiveness
• New Product Market Prediction
• Market-basket analysis


Simple Data Mining Example (1)

                       Tot Num  Max Num
Acct.  Income   Job    Delinq   Delinq   Owns   Credit  Final
numb.  in K/yr  Now?   accts    cycles   home?  years   disp.
------------------------------------------------------------
1001   25       Y      1        1        N      2       Y
1002   60       Y      3        2        Y      5       N
1003   ?        N      0        0        N      2       N
1004   52       Y      1        2        N      9       Y
1005   75       Y      1        6        Y      3       Y
1006   29       Y      2        1        Y      1       N
1007   48       Y      6        4        Y      8       N
1008   80       Y      0        0        Y      0       Y
1009   31       Y      1        1        N      1       Y
1011   45       Y      ?        0        ?      7       Y
1012   59       ?      2        4        N      2       N
1013   10       N      1        1        N      3       N
1014   51       Y      1        3        Y      1       Y
1015   65       N      1        2        N      8       Y
1016   20       N      0        0        N      0       N
1017   55       Y      2        3        N      2       N
1018   40       N      0        0        Y      1       Y


Simple Data Mining Example (2)

                       Tot Num  Max Num
Acct.  Income   Job    Delinq   Delinq   Owns   Credit  Final
numb.  in K/yr  Now?   accts    cycles   home?  years   disp.
------------------------------------------------------------
1019   80       Y      1        1        Y      0       Y
1021   18       Y      0        0        N      4       Y
1022   53       Y      3        2        Y      5       N
1023   0        N      1        1        Y      3       N
1024   90       N      1        3        Y      1       Y
1025   51       Y      1        2        N      7       Y
1026   20       N      4        1        N      1       N
1027   32       Y      2        2        N      2       N
1028   40       Y      1        1        Y      1       Y
1029   31       Y      0        0        N      1       Y
1031   45       Y      2        1        Y      4       Y
1032   90       ?      3        4        ?      ?       N
1033   30       N      2        1        Y      2       N
1034   88       Y      1        2        Y      5       Y
1035   65       Y      1        4        N      5       Y
1036   12       N      1        1        N      1       N


Simple Data Mining Example (3)

                       Tot Num  Max Num
Acct.  Income   Job    Delinq   Delinq   Owns   Credit  Final
numb.  in K/yr  Now?   accts    cycles   home?  years   disp.
------------------------------------------------------------
1037   28       Y      3        3        Y      2       N
1038   66       ?      0        0        ?      ?       Y
1039   50       Y      2        1        Y      1       Y
1041   ?        Y      0        0        Y      8       Y
1042   51       N      3        4        Y      2       N
1043   20       N      0        0        N      2       N
1044   80       Y      1        3        Y      7       Y
1045   51       Y      1        2        N      4       Y
1046   22       ?      ?        ?        N      0       N
1047   39       Y      3        2        ?      4       N
1048   70       Y      0        0        ?      1       Y
1049   40       Y      1        1        Y      1       Y
------------------------------------------------------------
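
One way to get a feel for this table is to score a hand-written rule against the recorded final disposition. The sketch below does that for the twelve rows of slide (3), with "?" encoded as None. The rule itself (approve when total delinquent accounts is at most 1) is a hypothetical guess for illustration, not a rule from the lecture, and records with a missing predictor are simply skipped.

```python
# (acct, income, job, delinq_accts, delinq_cycles, owns_home, credit_years, final_disp)
ROWS = [
    (1037, 28, "Y", 3, 3, "Y", 2, "N"),
    (1038, 66, None, 0, 0, None, None, "Y"),
    (1039, 50, "Y", 2, 1, "Y", 1, "Y"),
    (1041, None, "Y", 0, 0, "Y", 8, "Y"),
    (1042, 51, "N", 3, 4, "Y", 2, "N"),
    (1043, 20, "N", 0, 0, "N", 2, "N"),
    (1044, 80, "Y", 1, 3, "Y", 7, "Y"),
    (1045, 51, "Y", 1, 2, "N", 4, "Y"),
    (1046, 22, None, None, None, "N", 0, "N"),
    (1047, 39, "Y", 3, 2, None, 4, "N"),
    (1048, 70, "Y", 0, 0, None, 1, "Y"),
    (1049, 40, "Y", 1, 1, "Y", 1, "Y"),
]

def predict(row):
    """Hypothetical rule: approve ('Y') when total delinquent accounts <= 1."""
    delinq_accts = row[3]
    if delinq_accts is None:
        return None  # cannot decide when the predictor is missing
    return "Y" if delinq_accts <= 1 else "N"

def accuracy(rows):
    """Return (number correct, number scored), skipping undecidable rows."""
    scored = [(predict(r), r[7]) for r in rows if predict(r) is not None]
    correct = sum(1 for predicted, actual in scored if predicted == actual)
    return correct, len(scored)

print(accuracy(ROWS))
```

The two misses (accounts 1039 and 1043) suggest that a single attribute is not enough, which is the motivation for the learning methods that follow.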


Supervised Learning Methods

• Naïve Bayes:
  f(t_{j1}, t_{j2}, …) = f(p(t_k|t_{j1}), p(t_k|t_{j2}), …)

• K-Nearest Neighbors (kNN):
  ∑ sim_n(d_new, d+) - ∑ sim_n(d_new, d-)   [d_new, d_old in k]

• Support Vector Machines (SVM)
• Decision trees (with/without boosting)
• Neural Nets
… & many more
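
The kNN scoring expression can be sketched as follows: take the k labeled examples most similar to the new one, add the similarities of the positive neighbors, and subtract those of the negative neighbors. Cosine similarity and the toy two-dimensional vectors are assumptions here; the slide does not say which similarity function is intended.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_score(d_new, labeled, k=3):
    """Sum of sim(d_new, d+) minus sum of sim(d_new, d-) over the k nearest examples.

    labeled is a list of (vector, label) pairs with label '+' or '-'.
    A positive score suggests the '+' class.
    """
    nearest = sorted(labeled, key=lambda ex: cosine(d_new, ex[0]), reverse=True)[:k]
    return sum(cosine(d_new, vec) if label == "+" else -cosine(d_new, vec)
               for vec, label in nearest)

labeled = [
    ((1.0, 0.0), "+"), ((0.9, 0.1), "+"),
    ((0.0, 1.0), "-"), ((0.1, 0.9), "-"),
]
print(knn_score((1.0, 0.2), labeled))
print(knn_score((0.2, 1.0), labeled))
```

Because the output is a signed score rather than a hard label, this is one of the "soft decision" methods: the magnitude can be thresholded or used as a confidence.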


Tradeoffs among Inductive Methods

• Hard vs Soft decisions
  (e.g. DTs and rules vs kNN, NB)
• Human-interpretable decision rules
  (best: rules, worst: NNs, SVMs)
• Training data needed (less is better)
  (best: kNNs, worst: NNs)
• Graceful data-error tolerance
  (best: NNs, kNNs, worst: rules)


Trend Detection in DM (1)

Example: Sales Prediction

2002 Q1 sales = 4.0M
2002 Q2 sales = 3.5M
2002 Q3 sales = 3.0M
2002 Q4 sales = ??


Trend Detection in DM (2)

Now if we knew last year:

2001 Q1 sales = 3.5M,

2001 Q2 sales = 3.1M

2001 Q3 sales = 2.8M

2001 Q4 sales = 4.5M

And if we knew previous year:

2000 Q1 sales = 3.2M,

2000 Q2 sales = 2.9M

2000 Q3 sales = 2.5M

2000 Q4 sales = 3.7M


Trend Detection in DM (3)

What will 2002 Q4 sales be?

What if Christmas 2002 was cancelled?

What will 2003 Q4 sales be?
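
A small worked sketch of why the earlier years matter, using only the sales figures from these slides (in millions). It contrasts a naive continuation of the 2002 decline with a seasonal estimate that scales 2002 Q3 by the average Q4/Q3 ratio of 2000 and 2001. The ratio method is one simple choice made for illustration; it is not presented as the lecture's method.

```python
sales = {
    2000: [3.2, 2.9, 2.5, 3.7],
    2001: [3.5, 3.1, 2.8, 4.5],
    2002: [4.0, 3.5, 3.0, None],  # Q4 is the value to estimate
}

def q4_to_q3_ratio(quarters):
    """How much Q4 exceeded Q3 in a completed year."""
    return quarters[3] / quarters[2]

history = [sales[2000], sales[2001]]
mean_ratio = sum(q4_to_q3_ratio(q) for q in history) / len(history)

q1, q2, q3, _ = sales[2002]
naive_forecast = q3 + (q3 - q2)      # continue the quarter-to-quarter decline
seasonal_forecast = q3 * mean_ratio  # apply the historical Q4 jump

print(f"naive: {naive_forecast:.2f}M, seasonal: {seasonal_forecast:.2f}M")
```

The naive estimate keeps falling to 2.5M, while the seasonal estimate is roughly 4.6M. The "Christmas was cancelled" question amounts to asking what happens when the seasonal ratio no longer applies.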


Time-Series Analysis

• Numerical series extrapolation
• Cyclical curve fitting
  – Find period of cycle (and super-cycle, …)
  – Fit curve for each period (often with L2 or L∞ norm)
  – Find translation (series extrapolation)
  – Extrapolate to estimate desired values
• But, better to pre-classify data first (e.g. "recession" and "expansion" years)
• Combine with "standard" data mining
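
The cyclical-fitting steps can be sketched under strong simplifying assumptions: the period is given rather than discovered (4 quarters), the "curve" for each phase of the cycle is a straight line fitted by ordinary least squares (the L2 norm), and extrapolation just evaluates that line at the next time step. The series is the quarterly sales data from the trend-detection slides, 2000 Q1 through 2002 Q3.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x; needs at least two distinct x values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - b * mean_x, b

def forecast(series, period, steps_ahead=1):
    """Fit one line per phase of the cycle and extrapolate the next values."""
    n = len(series)
    predictions = []
    for t in range(n, n + steps_ahead):
        phase = t % period
        xs = list(range(phase, n, period))   # earlier time steps in the same phase
        ys = [series[x] for x in xs]
        a, b = fit_line(xs, ys)
        predictions.append(a + b * t)
    return predictions

# Quarterly sales in millions: 2000 Q1..Q4, 2001 Q1..Q4, 2002 Q1..Q3.
series = [3.2, 2.9, 2.5, 3.7, 3.5, 3.1, 2.8, 4.5, 4.0, 3.5, 3.0]
print(forecast(series, period=4))
```

With only two earlier Q4 values the fitted line passes through both exactly, so the estimate is a pure linear continuation of year-over-year Q4 growth. That is fragile, which is one reason to pre-classify years or bring in other data-mining evidence.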


Trend Detection in DM II (2)

Thorny Problems
• How to use external knowledge to make up for limitations in the data?
• How to make longer-range extrapolations?
• How to cope with corrupted data?
  – Random point errors (easy)
  – Systematic error (hard)
  – Malicious errors (impossible)


Methods for Supervised DM (1)

Classifiers (used in text categorization too)
• Linear Separators (regression)
• Naive Bayes (NB)
• Decision Trees (DTs)
• k-Nearest Neighbor (kNN)
• Decision rule induction
• Support Vector Machines (SVMs)
• Neural Networks (NNs)
...


Methods for Supervised DM (2)

Points of Comparison
• Hard vs Soft decisions
  (e.g. DTs and rules vs kNN, NB)
• Human-interpretable decision rules
  (best: rules, worst: NNs, SVMs)
• Training data needed (less is better)
  (best: kNNs, worst: NNs)
• Graceful data-error tolerance
  (best: NNs, kNNs, worst: rules)


Symbolic Rule Induction (1)

General idea

• Labeled instances are DB tuples

• Rules are generalized tuples

• Generalization occurs at term in tuple

• Generalize on new E+ not predicted

• Specialize on new E- not predicted

• Ignore predicted E+ or E-


Symbolic Rule Induction (2)

Example term generalizations

• Constant => disjunction

e.g. if small portion of value set seen

• Constant => least-common-generalizer class

e.g. if large portion of value set seen

• Number (or ordinal) => range

e.g. if dense sequential sampling
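
The three term generalizations above can be sketched as two small functions. Everything named here is hypothetical: the max_disjuncts threshold standing in for "small portion of value set seen", and the one-level parent_of mapping standing in for a class hierarchy (a real least-common-generalizer would walk a deeper hierarchy).

```python
ANY = "*any*"

def generalize_number(term, value):
    """Number (or existing range) => smallest closed range that also covers value."""
    low, high = term if isinstance(term, tuple) else (term, term)
    return (min(low, value), max(high, value))

def generalize_constant(term, value, parent_of, max_disjuncts=2):
    """Constant => disjunction while few values are seen, else => common class.

    term is a set of constants already covered, or a class name (string).
    parent_of maps each constant to its class (one-level hierarchy, assumed).
    """
    if isinstance(term, str):  # already generalized to a class name (or ANY)
        return term if parent_of.get(value) == term else ANY
    values = set(term) | {value}
    if len(values) <= max_disjuncts:
        return frozenset(values)
    classes = {parent_of.get(v, ANY) for v in values}
    return classes.pop() if len(classes) == 1 else ANY

# Hypothetical class hierarchy for a location attribute.
parent_of = {"USA": "N-America", "CAN": "N-America", "MEX": "N-America",
             "BRA": "S-America"}

print(generalize_number(101, 103))
print(generalize_constant(frozenset({"USA"}), "CAN", parent_of))
print(generalize_constant(frozenset({"USA", "CAN"}), "MEX", parent_of))
print(generalize_constant(frozenset({"USA", "CAN"}), "BRA", parent_of))
```

Each call widens the term by the least amount that covers the new positive example, which is the sense in which generalization happens one term at a time.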


Symbolic Rule Induction (3)

Example term specializations

• class => disjunction of subclasses

• Range => disjunction of sub-ranges


Symbolic Rule Induction Example (1)

Age  Gender  Temp  b-cult  c-cult  loc  Skin    disease
65   M       101   +       .23     USA  normal  strep
25   M       102   +       .00     CAN  normal  strep
65   M       102   -       .78     BRA  rash    dengue
36   F       99    -       .19     USA  normal  *none*
11   F       103   +       .23     USA  flush   strep
88   F       98    +       .21     CAN  normal  *none*
39   F       100   +       .10     BRA  normal  strep
12   M       101   +       .00     BRA  normal  strep
15   F       101   +       .66     BRA  flush   dengue
20   F       98    +       .00     USA  rash    *none*
81   M       98    -       .99     BRA  rash    ec-12
87   F       100   -       .89     USA  rash    ec-12
12   F       102   +       ??      CAN  normal  strep

14   F       101   +       .33     USA  normal
67   M       102   +       .77     BRA  rash


Symbolic Rule Induction Example (2)

Candidate Rules:

IF   age = [12,65]
     gender = *any*
     temp = [100,103]
     b-cult = +
     c-cult = [.00,.23]
     loc = *any*
     skin = (normal,flush)
THEN: strep

IF   age = (15,65)
     gender = *any*
     temp = [101,102]
     b-cult = *any*
     c-cult = [.66,.78]
     loc = BRA
     skin = rash
THEN: dengue

Disclaimer: These are *not* real medical records
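
The two candidate rules can be encoded and applied mechanically. In this sketch, square brackets are read as closed numeric ranges and parentheses as disjunctions of constants; that reading of the slide notation is an assumption, as are the function names. The last two records are the unlabeled rows from the example table.

```python
ANY = "*any*"

def covers(condition, value):
    """Does one rule term cover one attribute value?"""
    if condition == ANY:
        return True
    if isinstance(condition, tuple):             # closed range [low, high]
        low, high = condition
        return low <= value <= high
    if isinstance(condition, (set, frozenset)):  # disjunction of constants
        return value in condition
    return condition == value                    # single constant

RULES = [
    ({"age": (12, 65), "gender": ANY, "temp": (100, 103), "b-cult": "+",
      "c-cult": (0.00, 0.23), "loc": ANY, "skin": {"normal", "flush"}}, "strep"),
    ({"age": {15, 65}, "gender": ANY, "temp": (101, 102), "b-cult": ANY,
      "c-cult": (0.66, 0.78), "loc": "BRA", "skin": "rash"}, "dengue"),
]

def classify(record):
    """Return the conclusion of the first rule covering the record, else None."""
    for conditions, disease in RULES:
        if all(covers(cond, record[attr]) for attr, cond in conditions.items()):
            return disease
    return None

def make(age, gender, temp, b_cult, c_cult, loc, skin):
    return {"age": age, "gender": gender, "temp": temp, "b-cult": b_cult,
            "c-cult": c_cult, "loc": loc, "skin": skin}

labeled_strep = make(25, "M", 102, "+", 0.00, "CAN", "normal")
labeled_dengue = make(65, "M", 102, "-", 0.78, "BRA", "rash")
unlabeled_1 = make(14, "F", 101, "+", 0.33, "USA", "normal")
unlabeled_2 = make(67, "M", 102, "+", 0.77, "BRA", "rash")

for rec in (labeled_strep, labeled_dengue, unlabeled_1, unlabeled_2):
    print(classify(rec))
```

Neither unlabeled record is covered by either candidate rule as written: the first has a c-cult value outside both ranges, and the second fails the age terms. Deciding them would require generalizing a term further.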