Data Mining and Machine Learning with EM

Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Mar 29, 2015


Elian Birchett
Page 1: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Data Mining and Machine Learning with EM

Page 2: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Data Mining and Machine Learning are Ubiquitous!

• Netflix
• Amazon
• Wal-Mart
• Algorithmic Trading/High Frequency Trading
• Banks (Segmint)
• Google/Yahoo/Microsoft/IBM
• CRM/Consumer Behavior Profiling
• Consumer Review
• Mobile Ads
• Social Network (Facebook/Twitter/Google+)
• Voting Behaviors
• …

Page 3: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Data Mining

• Non-trivial extraction of implicit, previously unknown and potentially useful information from data

• Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Page 4: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Data Mining Tasks

• Prediction Methods
– Use some variables to predict unknown or future values of other variables.

• Description Methods
– Find human-interpretable patterns that describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Page 5: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Data Mining Tasks...

• Classification [Predictive]

• Clustering [Descriptive]

• Association Rule Discovery [Descriptive]

• Sequential Pattern Discovery [Descriptive]

• Regression [Predictive]

• Deviation Detection [Predictive]

Page 6: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.
Page 7: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Association Rule Discovery: Definition

• Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Page 8: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Association Rule Discovery: Application 1

• Marketing and Sales Promotion:
– Let the rule discovered be {Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!

Page 9: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Definition: Frequent Itemset

• Itemset
– A collection of one or more items
  • Example: {Milk, Bread, Diaper}
– k-itemset
  • An itemset that contains k items

• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

• Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
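The definitions on this slide can be checked directly against its five-transaction table. The following is a minimal sketch; the function names `support_count`, `support`, and `is_frequent` are illustrative choices, not names from the slides.

```python
# The five market-basket transactions from the slide's table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset (sigma)."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset (s)."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent when its support is at least minsup."""
    return support(itemset, transactions) >= minsup


print(support_count({"Milk", "Bread", "Diaper"}, transactions))
print(support({"Milk", "Bread", "Diaper"}, transactions))
```

For {Milk, Bread, Diaper} this reproduces the slide's values: a support count of 2 and a support of 2/5.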

Page 10: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Frequent Itemsets Mining

TID Transactions

100 { A, B, E }

200 { B, D }

300 { A, B, E }

400 { A, C }

500 { B, C }

600 { A, C }

700 { A, B }

800 { A, B, C, E }

900 { A, B, C }

1000 { A, C, E }

• Minimum support level 50%
– Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}
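The result on this slide can be reproduced by exhaustively counting every candidate itemset. This is only a brute-force sketch for a table this small; the function name `frequent_itemsets_bruteforce` is an assumption made for illustration.

```python
from itertools import combinations

# The ten transactions (TID 100 to 1000) from the slide.
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"},
]


def frequent_itemsets_bruteforce(transactions, minsup):
    """Count every non-empty candidate itemset and keep those with support >= minsup."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= minsup:
                frequent[candidate] = count
    return frequent


for itemset, count in frequent_itemsets_bruteforce(transactions, 0.5).items():
    print(itemset, count)
```

With a minimum support of 50% the sketch keeps exactly {A}, {B}, {C}, {A,B}, and {A,C}, matching the slide.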

Page 11: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Frequent Itemset Generation

[Figure: itemset lattice over items A to E]

null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Given d items, there are 2^d possible candidate itemsets

Page 12: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Frequent Itemset Generation

• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw) => Expensive since M = 2^d !!!
  (N = number of transactions, M = number of candidates, w = maximum transaction width)

Transactions (N rows, width up to w):

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

List of Candidates (M entries)

Page 13: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Reducing Number of Candidates

• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

• Apriori principle holds due to the following property of the support measure:

$$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \ge s(Y)$$

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support

Page 14: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Illustrating Apriori Principle

[Figure: the itemset lattice over items A to E, shown twice. Once an itemset is found to be infrequent, all of its supersets are pruned.]

Page 15: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.
Page 16: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Apriori

R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, 487-499, 1994
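A level-wise search in the spirit of Apriori can be sketched as follows. This is a simplified illustration rather than the algorithm exactly as published: candidates of size k are joined from frequent (k-1)-itemsets, and any candidate with an infrequent subset is pruned before counting. The function name `apriori` and the use of the ten-transaction table from the frequent-itemset mining slide are choices made here.

```python
from itertools import combinations

# The ten-transaction example table (TID 100 to 1000) from the earlier slide.
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"},
]


def apriori(transactions, minsup):
    """Return {frozenset: support count} for all itemsets with support >= minsup."""
    n = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = set().union(*transactions)
    current = {frozenset([i]) for i in items}
    current = {c for c in current if count(c) / n >= minsup}

    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = count(c)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Only the surviving candidates are counted against the database.
        current = {c for c in candidates if count(c) / n >= minsup}
    return frequent


for itemset, support_count in sorted(apriori(transactions, 0.5).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support_count)
```

On this data the candidate {A,B,C} is never counted, because its subset {B,C} is already known to be infrequent.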

Page 17: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.
Page 18: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.
Page 19: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized

Intra-cluster distances are minimized

Page 20: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Applications of Cluster Analysis

• Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

• Summarization
– Reduce the size of large data sets

Discovered Clusters and Industry Group:

Cluster 1 (Industry Group: Technology1-DOWN)
  Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN

Cluster 2 (Industry Group: Technology2-DOWN)
  Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN

Cluster 3 (Industry Group: Financial-DOWN)
  Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN

Cluster 4 (Industry Group: Oil-UP)
  Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

Clustering precipitation in Australia

Page 21: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Notion of a Cluster can be Ambiguous

How many clusters?

Four Clusters Two Clusters

Six Clusters

Page 22: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Types of Clusterings

• A clustering is a set of clusters

• Important distinction between hierarchical and partitional sets of clusters

• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree

Page 23: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Partitional Clustering

Original Points A Partitional Clustering

Page 24: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Hierarchical Clustering

[Figure: points p1, p2, p3, p4 shown four ways: Traditional Hierarchical Clustering, Traditional Dendrogram, Non-traditional Hierarchical Clustering, Non-traditional Dendrogram]

Page 25: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

K-means Clustering

• Partitional clustering approach
– Each cluster is associated with a centroid (center point)
– Each point is assigned to the cluster with the closest centroid

• Number of clusters, K, must be specified
• The basic algorithm is very simple

Page 26: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

K-means Clustering – Details

• Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.

• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
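The basic algorithm described on these slides can be sketched in a few lines. This is a minimal illustration under stated assumptions: Euclidean distance, randomly chosen data points as initial centroids, and stopping when no point changes cluster. The function name `kmeans` and its parameters are illustrative.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids chosen randomly
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:  # no point changed cluster
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

Because the initial centroids are random, cluster labels (and, on harder data, the clusters themselves) can differ between seeds, as the slide notes.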

Page 27: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.
Page 28: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.
Page 29: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

K-means Clustering – Details

• K-means will converge for common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘Until relatively few points change clusters’

• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

Page 30: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.
Page 31: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

K-Means Clustering


Page 32: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

How to MapReduce K-Means?

• Given K, assign the first K random points to be the initial cluster centers

• Assign subsequent points to the closest cluster using the supplied distance measure

• Compute the centroid of each cluster and iterate the previous step until the cluster centers converge within delta

• Run a final pass over the points to cluster them for output

Page 33: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

K-Means Map/Reduce Design

• Driver
– Runs multiple iteration jobs using mapper+combiner+reducer
– Runs final clustering job using only mapper

• Mapper
– Configure: Single file containing encoded Clusters
– Input: File split containing encoded Vectors
– Output: Vectors keyed by nearest cluster

• Combiner
– Input: Vectors keyed by nearest cluster
– Output: Cluster centroid vectors keyed by “cluster”

• Reducer (singleton)
– Input: Cluster centroid vectors
– Output: Single file containing Vectors keyed by cluster

Page 34: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Mapper - mapper has k centers in memory.
  Input: key-value pair (each input data point x).
  Find the index of the closest of the k centers (call it iClosest).
  Emit: (key, value) = (iClosest, x)

Reducer(s)
  Input: (key, value)
    Key = index of center
    Value = iterator over input data points closest to the ith center
  At each key value, run through the iterator and average all the corresponding input data points.
  Emit: (index of center, new center)
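The mapper and reducer above can be simulated in a single process to see how one iteration fits together. This sketch does not use any real MapReduce framework API; the names `kmeans_map`, `kmeans_reduce`, and `kmeans_iteration`, and the in-memory "shuffle" dictionary, are assumptions made for illustration.

```python
import math
from collections import defaultdict


def kmeans_map(x, centers):
    """Mapper: emit (iClosest, x) for one input data point x."""
    i_closest = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
    return i_closest, x


def kmeans_reduce(index, points):
    """Reducer: average all points that share a center index."""
    points = list(points)
    new_center = tuple(sum(c) / len(points) for c in zip(*points))
    return index, new_center


def kmeans_iteration(data, centers):
    """One K-means iteration: map, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for x in data:
        key, value = kmeans_map(x, centers)
        groups[key].append(value)
    new_centers = list(centers)  # a center with no points keeps its old position
    for key, values in groups.items():
        index, center = kmeans_reduce(key, values)
        new_centers[index] = center
    return new_centers


data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
print(kmeans_iteration(data, [(1.0, 1.0), (9.0, 9.0)]))
```

A driver would call `kmeans_iteration` repeatedly until the centers stop moving by more than a chosen delta.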

Page 35: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Improved Version: Calculate partial sums in mappers

Mapper - mapper has k centers in memory.
  Run through one input data point at a time (call it x).
  Find the index of the closest of the k centers (call it iClosest).
  Accumulate the sum of inputs, segregated into K groups depending on which center is closest.
  Emit: ( , partial sum), or Emit: (index, partial sum)

Reducer – accumulate partial sums and emit with index or without
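A single-process sketch of the partial-sum variant follows. One assumption is added beyond the slide: each mapper also emits a point count next to its partial sum, because the reducer needs the total count to turn summed coordinates into a mean. The function names are illustrative and no real MapReduce API is used.

```python
import math
from collections import defaultdict


def map_partial_sums(split, centers):
    """Mapper: scan one input split and emit (index, (partial sum, count)) per center."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for x in split:
        i_closest = min(range(k), key=lambda j: math.dist(x, centers[j]))
        counts[i_closest] += 1
        for d in range(dim):
            sums[i_closest][d] += x[d]
    return [(i, (tuple(sums[i]), counts[i])) for i in range(k) if counts[i]]


def reduce_partial_sums(index, partials):
    """Reducer: add up the partial sums and counts, then emit the new center."""
    partials = list(partials)
    total = sum(count for _, count in partials)
    summed = [sum(c) for c in zip(*(s for s, _ in partials))]
    return index, tuple(v / total for v in summed)


def iteration_with_partial_sums(splits, centers):
    """Run every mapper, group its output by center index, then reduce."""
    groups = defaultdict(list)
    for split in splits:
        for index, partial in map_partial_sums(split, centers):
            groups[index].append(partial)
    new_centers = list(centers)
    for index, partials in groups.items():
        _, new_centers[index] = reduce_partial_sums(index, partials)
    return new_centers


splits = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
print(iteration_with_partial_sums(splits, [(1.0, 1.0), (9.0, 9.0)]))
```

Each mapper now sends at most K records to the reducers instead of one record per data point, which is the point of the improvement.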

Page 36: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Issues and Limitations for K-means

• How to choose initial centers?
• How to choose K?
• How to handle Outliers?
• Clusters different in
– Shape
– Density
– Size

Page 37: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Two different K-means Clusterings

[Figure: three scatter plots (x vs. y): Original Points, Optimal Clustering, Sub-optimal Clustering]

Page 38: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Importance of Choosing Initial Centroids

[Figure: six scatter plots (x vs. y) showing the K-means clustering at Iteration 1 through Iteration 6]

Page 39: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Importance of Choosing Initial Centroids

[Figure: six scatter plots (x vs. y) showing the K-means clustering at Iteration 1 through Iteration 6]

Page 40: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Importance of Choosing Initial Centroids …

[Figure: five scatter plots (x vs. y) showing the K-means clustering at Iteration 1 through Iteration 5]

Page 41: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Importance of Choosing Initial Centroids …

[Figure: five scatter plots (x vs. y) showing the K-means clustering at Iteration 1 through Iteration 5]

Page 42: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Solutions to Initial Centroids Problem

• Multiple runs
– Helps, but probability is not on your side

• Sample and use hierarchical clustering to determine initial centroids

• Select more than k initial centroids and then select among these initial centroids
– Select most widely separated

• Postprocessing

• Bisecting K-means
– Not as susceptible to initialization issues
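One way to read "select more than k initial centroids and then select the most widely separated" is a greedy farthest-first pass over a random candidate pool. The greedy rule is an assumption made here, not something the slide specifies, and the name `widely_separated_centroids` is illustrative.

```python
import math
import random


def widely_separated_centroids(points, k, n_candidates, seed=0):
    """Sample n_candidates > k points, then greedily keep k that are far apart."""
    rng = random.Random(seed)
    candidates = rng.sample(points, n_candidates)
    chosen = [candidates[0]]
    while len(chosen) < k:
        # Pick the candidate whose nearest already-chosen centroid is farthest away.
        farthest = max(candidates, key=lambda c: min(math.dist(c, s) for s in chosen))
        chosen.append(farthest)
    return chosen


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(widely_separated_centroids(points, k=2, n_candidates=6))
```

Farthest-first selection tends to pick outliers, so in practice it is usually applied to a sample rather than to the full data set.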

Page 43: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

EM-Algorithm

Page 44: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

What is MLE?

• Given
– A sample $X = \{X_1, \ldots, X_n\}$
– A vector of parameters θ

• We define
– Likelihood of the data: P(X | θ)
– Log-likelihood of the data: L(θ) = log P(X | θ)

• Given X, find

$$\theta_{ML} = \arg\max_{\theta} L(\theta)$$

Page 45: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

MLE (cont)

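• Often we assume that the $X_i$ are independently and identically distributed (i.i.d.)

$$
\begin{aligned}
\theta_{ML} &= \arg\max_{\theta} L(\theta) \\
&= \arg\max_{\theta} \log P(X \mid \theta) \\
&= \arg\max_{\theta} \log P(X_1, \ldots, X_n \mid \theta) \\
&= \arg\max_{\theta} \log \prod_{i} P(X_i \mid \theta) \\
&= \arg\max_{\theta} \sum_{i} \log P(X_i \mid \theta)
\end{aligned}
$$

• Depending on the form of p(x|θ), solving the optimization problem can be easy or hard.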

Page 46: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

An easy case

• Assuming
– A coin has a probability p of being heads, 1-p of being tails.
– Observation: We toss a coin N times, and the result is a set of Hs and Ts, and there are m Hs.

• What is the value of p based on MLE, given the observation?

Page 47: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

An easy case (cont)

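$$
\begin{aligned}
L(\theta) &= \log P(X \mid \theta) = \log\left(p^{m}(1-p)^{N-m}\right) \\
&= m \log p + (N-m)\log(1-p)
\end{aligned}
$$

$$
\frac{dL}{dp} = \frac{d\left(m \log p + (N-m)\log(1-p)\right)}{dp} = \frac{m}{p} - \frac{N-m}{1-p} = 0
$$

$$\Rightarrow\; p = m/N$$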
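The closed form p = m/N can be sanity-checked numerically by maximizing the same log-likelihood over a grid of candidate values. This is only an illustrative check; the function names and the example counts (m = 7 heads in N = 10 tosses) are assumptions.

```python
import math


def log_likelihood(p, m, n):
    """L(p) = m log p + (N - m) log(1 - p) for m heads in N tosses."""
    return m * math.log(p) + (n - m) * math.log(1 - p)


def mle_closed_form(m, n):
    """The analytical maximizer derived above: p = m / N."""
    return m / n


def mle_grid_search(m, n, steps=1000):
    """Brute-force maximizer over the grid 1/steps, 2/steps, ..., (steps-1)/steps."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: log_likelihood(p, m, n))


print(mle_closed_form(7, 10), mle_grid_search(7, 10))
```

Both routes agree on the same estimate, which is what setting the derivative to zero predicts.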

Page 48: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

EM: basic concepts

Page 49: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Basic setting in EM

• X is a set of data points: observed data
• θ is a parameter vector.
• EM is a method to find $\theta_{ML}$ where

$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$

• Calculating P(X | θ) directly is hard.
• Calculating P(X,Y|θ) is much simpler, where Y is “hidden” data (or “missing” data).

Page 50: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

The basic EM strategy

• Z = (X, Y)
– Z: complete data (“augmented data”)
– X: observed data (“incomplete” data)
– Y: hidden data (“missing” data)

Page 51: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

The log-likelihood function

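• L is a function of θ, while holding X constant:

$$L(\theta \mid X) = L(\theta) = P(X \mid \theta)$$

$$
\begin{aligned}
l(\theta) &= \log L(\theta) = \log P(X \mid \theta) \\
&= \log \prod_{i=1}^{n} P(x_i \mid \theta) \\
&= \sum_{i=1}^{n} \log P(x_i \mid \theta) \\
&= \sum_{i=1}^{n} \log \sum_{y} P(x_i, y \mid \theta)
\end{aligned}
$$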

Page 52: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

The iterative approach for MLE

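$$
\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} l(\theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log \sum_{y} p(x_i, y \mid \theta)
$$

In many cases, we cannot find the solution directly.

An alternative is to find a sequence

$$\theta^{0}, \theta^{1}, \ldots, \theta^{t}, \ldots$$

s.t.

$$l(\theta^{0}) \le l(\theta^{1}) \le \ldots \le l(\theta^{t}) \le \ldots$$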

Page 53: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

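$$
\begin{aligned}
l(\theta) - l(\theta^{t})
&= \log P(X \mid \theta) - \log P(X \mid \theta^{t}) \\
&= \sum_{i=1}^{n} \log \sum_{y} P(x_i, y \mid \theta) - \sum_{i=1}^{n} \log \sum_{y'} P(x_i, y' \mid \theta^{t}) \\
&= \sum_{i=1}^{n} \log \frac{\sum_{y} P(x_i, y \mid \theta)}{\sum_{y'} P(x_i, y' \mid \theta^{t})} \\
&= \sum_{i=1}^{n} \log \sum_{y} \frac{P(x_i, y \mid \theta)}{\sum_{y'} P(x_i, y' \mid \theta^{t})} \\
&= \sum_{i=1}^{n} \log \sum_{y} \frac{P(x_i, y \mid \theta^{t})}{\sum_{y'} P(x_i, y' \mid \theta^{t})} \cdot \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^{t})} \\
&= \sum_{i=1}^{n} \log \sum_{y} P(y \mid x_i, \theta^{t}) \, \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^{t})} \\
&= \sum_{i=1}^{n} \log E_{P(y \mid x_i, \theta^{t})}\left[\frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^{t})}\right] \\
&\ge \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}\left[\log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^{t})}\right]
\qquad \text{(Jensen's inequality)}
\end{aligned}
$$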

Page 54: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Jensen’s inequality

$$\text{if } f \text{ is convex, then } E[f(g(x))] \ge f(E[g(x)])$$

$$\text{if } f \text{ is concave, then } E[f(g(x))] \le f(E[g(x)])$$

log is a concave function:

$$E[\log p(x)] \le \log(E[p(x)])$$
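Both directions of the inequality can be checked numerically on a small discrete distribution. The values and probabilities below are arbitrary illustrative choices, and `expectation` is a helper defined here for the check.

```python
import math


def expectation(values, probs):
    """E[g(x)] for a discrete distribution given as parallel lists."""
    return sum(v * p for v, p in zip(values, probs))


values = [1.0, 2.0, 4.0]   # g(x) for three outcomes
probs = [0.5, 0.25, 0.25]  # their probabilities (sum to 1)

# Concave f = log: E[log g(x)] <= log E[g(x)]
e_log = expectation([math.log(v) for v in values], probs)
log_e = math.log(expectation(values, probs))

# Convex f = square: E[g(x)^2] >= (E[g(x)])^2
e_sq = expectation([v * v for v in values], probs)
sq_e = expectation(values, probs) ** 2

print(e_log <= log_e, e_sq >= sq_e)
```

The concave case is the one EM relies on: moving the log inside the expectation can only lower the value, which is what produces a lower bound.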

Page 55: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Maximizing the lower bound

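$$
\begin{aligned}
\theta^{(t+1)}
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}\left[\log \frac{p(x_i, y \mid \theta)}{p(x_i, y \mid \theta^{t})}\right] \\
&= \arg\max_{\theta} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t}) \log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^{t})} \\
&= \arg\max_{\theta} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t}) \log P(x_i, y \mid \theta) \\
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}\left[\log P(x_i, y \mid \theta)\right]
\end{aligned}
$$

The last expression being maximized is the Q function.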

Page 56: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

The Q-function

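• Define the Q-function (a function of θ):

$$
\begin{aligned}
Q(\theta, \theta^{t})
&= E\left[\log P(X, Y \mid \theta) \mid X, \theta^{t}\right] = E_{P(Y \mid X, \theta^{t})}\left[\log P(X, Y \mid \theta)\right] \\
&= \sum_{Y} P(Y \mid X, \theta^{t}) \log P(X, Y \mid \theta) = \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}\left[\log P(x_i, y \mid \theta)\right] \\
&= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t}) \log P(x_i, y \mid \theta)
\end{aligned}
$$

– Y is a random vector.
– $X = (x_1, x_2, \ldots, x_n)$ is a constant (vector).
– $\theta^{t}$ is the current parameter estimate and is a constant (vector).
– θ is the normal variable (vector) that we wish to adjust.

• The Q-function is the expected value of the complete-data log-likelihood log P(X,Y|θ) with respect to Y, given X and $\theta^{t}$.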

Page 57: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

The inner loop of the EM algorithm

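• E-step: calculate

$$Q(\theta, \theta^{t}) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t}) \log P(x_i, y \mid \theta)$$

• M-step: find

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{t})$$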

Page 58: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

L(θ) is non-decreasing at each iteration

• The EM algorithm will produce a sequence

$$\theta^{0}, \theta^{1}, \ldots, \theta^{t}, \ldots$$

• It can be proved that

$$l(\theta^{0}) \le l(\theta^{1}) \le \ldots \le l(\theta^{t}) \le \ldots$$

Page 59: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

The inner loop of the Generalized EM algorithm (GEM)

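• E-step: calculate

$$Q(\theta, \theta^{t}) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t}) \log P(x_i, y \mid \theta)$$

• M-step: instead of the full maximization $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{t})$, find any $\theta^{t+1}$ such that

$$Q(\theta^{t+1}, \theta^{t}) \ge Q(\theta^{t}, \theta^{t})$$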

Page 60: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Recap of the EM algorithm

Page 61: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Idea #1: find θ that maximizes the likelihood of training data

$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$

Page 62: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Idea #2: find the θt sequence

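No analytical solution, so use an iterative approach: find $\theta^{0}, \theta^{1}, \ldots, \theta^{t}, \ldots$ s.t.

$$l(\theta^{0}) \le l(\theta^{1}) \le \ldots \le l(\theta^{t}) \le \ldots$$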

Page 63: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

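Idea #3: find θt+1 that maximizes a tight lower bound of $l(\theta) - l(\theta^{t})$

$$
l(\theta) - l(\theta^{t}) \ge \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}\left[\log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^{t})}\right]
$$

The right-hand side is a tight lower bound.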

Page 64: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Idea #4: find θt+1 that maximizes the Q function

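$$
\begin{aligned}
\theta^{(t+1)}
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}\left[\log \frac{p(x_i, y \mid \theta)}{p(x_i, y \mid \theta^{t})}\right]
&& \text{(lower bound of } l(\theta) - l(\theta^{t})\text{)} \\
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})}\left[\log P(x_i, y \mid \theta)\right]
&& \text{(the Q function)}
\end{aligned}
$$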

Page 65: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

The EM algorithm

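• Start with initial estimate, $\theta^{0}$

• Repeat until convergence
– E-step: calculate

$$Q(\theta, \theta^{t}) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^{t}) \log P(x_i, y \mid \theta)$$

– M-step: find

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{t})$$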

Page 66: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Important classes of EM problem

• Products of multinomial (PM) models
• Exponential families
• Gaussian mixture
• …
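As a concrete instance of the E-step/M-step loop, here is a minimal sketch of EM for a one-dimensional Gaussian mixture, one of the classes listed above. The hidden data y is the component that generated each point. The update formulas are the standard ones for Gaussian mixtures rather than something derived in these slides, and the function names, the variance floor of 1e-6, and the toy data are assumptions.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(xs, weights, mus, variances):
    """l(theta) = sum_i log sum_y P(x_i, y | theta)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)))
        for x in xs
    )


def em_step(xs, weights, mus, variances):
    k, n = len(weights), len(xs)
    # E-step: posterior P(y | x_i, theta_t) for every point and component.
    resp = []
    for x in xs:
        joint = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
        total = sum(joint)
        resp.append([q / total for q in joint])
    # M-step: parameters that maximize Q(theta, theta_t).
    new_w, new_mu, new_var = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mu = sum(r[j] * x for r, x in zip(resp, xs)) / nj
        var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nj
        new_w.append(nj / n)
        new_mu.append(mu)
        new_var.append(max(var, 1e-6))  # assumed floor to avoid a collapsing variance
    return new_w, new_mu, new_var


def fit_gmm(xs, weights, mus, variances, n_iter=50):
    history = [log_likelihood(xs, weights, mus, variances)]
    for _ in range(n_iter):
        weights, mus, variances = em_step(xs, weights, mus, variances)
        history.append(log_likelihood(xs, weights, mus, variances))
    return weights, mus, variances, history


xs = [0.0, 0.5, 1.0, 1.5, 9.0, 9.5, 10.0, 10.5]
weights, mus, variances, history = fit_gmm(xs, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
print(mus, history[0], history[-1])
```

The recorded `history` is the sequence l(θ0), l(θ1), …, which should never decrease as long as the variance floor is not triggered.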

Page 67: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Probabilistic Latent Semantic Analysis (PLSA)

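• PLSA is a generative model for the co-occurrence of documents d∈D={d1,…,dD} and terms w∈W={w1,…,wW}, which are associated through a latent variable z∈Z={z1,…,zZ}.

• The generative process is:

[Figure: documents d1, d2, …, dD are linked to latent variables z1, z2, …, zZ, which are linked to terms w1, w2, …, wW. A document is chosen with probability P(d), a latent variable with P(z|d), and a term with P(w|z).]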

Page 68: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Model

• The generative process can be expressed by:

$$P(d, w) = P(d)\,P(w \mid d), \quad \text{where } P(w \mid d) = \sum_{z \in Z} P(w \mid z)\,P(z \mid d)$$

Two independence assumptions:
1) Each pair (d,w) is assumed to be generated independently, corresponding to ‘bag-of-words’
2) Conditioned on z, words w are generated independently of the specific document d.

Page 69: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Model

• Following the likelihood principle, we determine P(z), P(d|z), and P(w|z) by maximizing the log-likelihood function

$$L(\theta \mid d, w, z) = \sum_{d \in D} \sum_{w \in W} n(d, w) \log P(d, w)$$

$$\text{where } P(d, w) = \sum_{z \in Z} P(w \mid z)\,P(z \mid d)\,P(d) = \sum_{z \in Z} P(w \mid z)\,P(d \mid z)\,P(z)$$

Slide annotations: n(d, w) is the number of co-occurrences of d and w (observed data); unobserved data: P(d), P(z|d), and P(w|d).

Page 70: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Maximum-likelihood

• Definition
– We have a density function P(x|Θ) that is governed by the set of parameters Θ, e.g., P might be a set of Gaussians and Θ could be the means and covariances
– We also have a data set X={x1,…,xN}, supposedly drawn from this distribution P, and assume these data vectors are i.i.d. with P.
– Then the likelihood function is:

$$P(X \mid \Theta) = \prod_{i=1}^{N} P(x_i \mid \Theta) = L(\Theta \mid X)$$

– The likelihood is thought of as a function of the parameters Θ where the data X is fixed. Our goal is to find the Θ that maximizes L. That is

$$\Theta^{*} = \arg\max_{\Theta} L(\Theta \mid X)$$

Page 71: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

Jensen’s inequality

$$
\prod_{j} g(j)^{a_j} \le \sum_{j} a_j\, g(j)
\quad \text{provided} \quad
\sum_{j} a_j = 1, \;\; a_j \ge 0, \;\; g(j) \ge 0
$$

Page 72: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

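Estimation-using EM

$$
\max L(\theta \mid d, w, z) = \max \sum_{d \in D} \sum_{w \in W} n(d, w) \log \sum_{z \in Z} P(z)\,P(w \mid z)\,P(d \mid z)
$$

difficult!!!

Idea: start with a guess $\theta^{t}$, compute an easily computed lower-bound $B(\theta; \theta^{t})$ to the function $\log P(\theta \mid U)$ and maximize the bound instead

By Jensen’s inequality:

$$
\sum_{z \in Z} P(z)\,P(w \mid z)\,P(d \mid z)
= \sum_{z \in Z} P(z \mid d, w)\,\frac{P(z)\,P(w \mid z)\,P(d \mid z)}{P(z \mid d, w)}
\ge \prod_{z \in Z} \left[\frac{P(z)\,P(w \mid z)\,P(d \mid z)}{P(z \mid d, w)}\right]^{P(z \mid d, w)}
$$

$$
\begin{aligned}
\max B(\theta, \theta^{t})
&= \max \sum_{d \in D} \sum_{w \in W} n(d, w) \log \prod_{z} \left[\frac{P(z)\,P(w \mid z)\,P(d \mid z)}{P(z \mid d, w)}\right]^{P(z \mid d, w)} \\
&= \max \sum_{d \in D} \sum_{w \in W} n(d, w) \sum_{z} \left[\log P(z)\,P(w \mid z)\,P(d \mid z) - \log P(z \mid d, w)\right] P(z \mid d, w)
\end{aligned}
$$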

Page 73: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

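(1)Solve P(w|z)

• We introduce Lagrange multiplier λ with the constraint that ∑wP(w|z)=1, and solve the following equation:

$$
\frac{\partial}{\partial P(w \mid z)} \left[ \sum_{d \in D} \sum_{w \in W} n(d, w) \sum_{z} \left[\log P(z)\,P(w \mid z)\,P(d \mid z) - \log P(z \mid d, w)\right] P(z \mid d, w) + \lambda \left(\sum_{w} P(w \mid z) - 1\right) \right] = 0
$$

$$
\begin{aligned}
&\Rightarrow \sum_{d \in D} \frac{n(d, w)\,P(z \mid d, w)}{P(w \mid z)} + \lambda = 0, \\
&\Rightarrow P(w \mid z) = -\frac{\sum_{d \in D} n(d, w)\,P(z \mid d, w)}{\lambda}, \\
&\sum_{w} P(w \mid z) = 1
\;\Rightarrow\; \lambda = -\sum_{w \in W} \sum_{d \in D} n(d, w)\,P(z \mid d, w), \\
&\Rightarrow P(w \mid z) = \frac{\sum_{d \in D} n(d, w)\,P(z \mid d, w)}{\sum_{w \in W} \sum_{d \in D} n(d, w)\,P(z \mid d, w)}
\end{aligned}
$$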

Page 74: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

(2)Solve P(d|z)

• We introduce Lagrange multiplier λ with the constraint that ∑dP(d|z)=1, and get the following result:

$$
P(d \mid z) = \frac{\sum_{w \in W} n(d, w)\,P(z \mid d, w)}{\sum_{d \in D} \sum_{w \in W} n(d, w)\,P(z \mid d, w)}
$$

Page 75: Data Mining and Machine Learning with EM. Data Mining and Machine Learning are Ubiquitous! Netflix Amazon Wal-Mart Algorithmic Trading/High Frequency.

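(3)Solve P(z)

• We introduce Lagrange multiplier λ with the constraint that ∑zP(z)=1, and solve the following equation:

$$
\frac{\partial}{\partial P(z)} \left[ \sum_{d \in D} \sum_{w \in W} n(d, w) \sum_{z} \left[\log P(z)\,P(w \mid z)\,P(d \mid z) - \log P(z \mid d, w)\right] P(z \mid d, w) + \lambda \left(\sum_{z} P(z) - 1\right) \right] = 0
$$

$$
\begin{aligned}
&\Rightarrow \sum_{d \in D} \sum_{w \in W} \frac{n(d, w)\,P(z \mid d, w)}{P(z)} + \lambda = 0, \\
&\Rightarrow P(z) = -\frac{\sum_{d \in D} \sum_{w \in W} n(d, w)\,P(z \mid d, w)}{\lambda}, \\
&\sum_{z} P(z) = 1
\;\Rightarrow\; \lambda = -\sum_{d \in D} \sum_{w \in W} \sum_{z} n(d, w)\,P(z \mid d, w) = -\sum_{d \in D} \sum_{w \in W} n(d, w), \\
&\Rightarrow P(z) = \frac{\sum_{d \in D} \sum_{w \in W} n(d, w)\,P(z \mid d, w)}{\sum_{w \in W} \sum_{d \in D} n(d, w)}
\end{aligned}
$$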


(4) Solve P(z|d,w)

• We introduce a Lagrange multiplier λ_{d,w} for each pair (d, w) with the constraint that ∑z P(z|d,w) = 1, and solve the following equation:

$$\frac{\partial}{\partial P(z|d,w)} \left[ \sum_{d \in D} \sum_{w \in W} n(d,w) \sum_{z} \left[\log P(z)P(w|z)P(d|z) - \log P(z|d,w)\right] P(z|d,w) + \sum_{d \in D} \sum_{w \in W} \lambda_{d,w} \Big(\sum_{z} P(z|d,w) - 1\Big) \right] = 0$$

$$\Rightarrow n(d,w)\left[\log P(z)P(w|z)P(d|z) - \log P(z|d,w) - 1\right] + \lambda_{d,w} = 0$$

Absorbing the constant factor n(d,w) into λ_{d,w}:

$$\Rightarrow \log P(z|d,w) = \log P(z)P(w|z)P(d|z) - 1 + \lambda_{d,w}$$

$$\Rightarrow P(z|d,w) = P(z)P(w|z)P(d|z)\, e^{\lambda_{d,w} - 1}$$

$$\sum_{z} P(z|d,w) = 1 \;\Rightarrow\; e^{\lambda_{d,w} - 1} \sum_{z} P(z)P(w|z)P(d|z) = 1 \;\Rightarrow\; \lambda_{d,w} = 1 - \log \sum_{z} P(z)P(w|z)P(d|z)$$

$$\Rightarrow P(z|d,w) = \frac{P(z)P(w|z)P(d|z)}{\sum_{z'} P(z')P(w|z')P(d|z')}$$


(4) Solve P(z|d,w) -2

$$P(z|d,w) = \frac{P(d,w,z)}{P(d,w)} = \frac{P(w,d|z)\,P(z)}{P(d,w)} = \frac{P(w|z)\,P(d|z)\,P(z)}{\sum_{z' \in Z} P(w|z')\,P(d|z')\,P(z')}$$
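The posterior P(z|d,w) for one (d, w) pair can be computed directly from the three parameter arrays. The following is a minimal sketch under assumed names: the class `PlsaEStep`, the method `posterior`, and the array layout (`pwz[w][z]`, `pdz[d][z]`) are illustrative choices, and the uniform fallback for a zero denominator is an added assumption rather than part of the derivation.

```java
final class PlsaEStep {
    /**
     * Returns P(z|d,w) for one (d, w) pair, where
     * pz[z] = P(z), pwz[w][z] = P(w|z) and pdz[d][z] = P(d|z).
     */
    static double[] posterior(int d, int w, double[] pz, double[][] pwz, double[][] pdz) {
        int numTopics = pz.length;
        double[] post = new double[numTopics];
        double denominator = 0.0;
        for (int z = 0; z < numTopics; z++) {
            post[z] = pwz[w][z] * pdz[d][z] * pz[z]; // numerator of the posterior
            denominator += post[z];                  // sum over all topics
        }
        for (int z = 0; z < numTopics; z++) {
            // Assumption: fall back to a uniform distribution if every topic
            // assigns this pair zero probability.
            post[z] = denominator > 0.0 ? post[z] / denominator : 1.0 / numTopics;
        }
        return post;
    }
}
```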


The final update equations

• E-step:

$$P(z|d,w) = \frac{P(w|z)\,P(d|z)\,P(z)}{\sum_{z' \in Z} P(w|z')\,P(d|z')\,P(z')}$$

• M-step:

$$P(w|z) = \frac{\sum_{d \in D} n(d,w)\,P(z|d,w)}{\sum_{w \in W} \sum_{d \in D} n(d,w)\,P(z|d,w)}$$

$$P(d|z) = \frac{\sum_{w \in W} n(d,w)\,P(z|d,w)}{\sum_{d \in D} \sum_{w \in W} n(d,w)\,P(z|d,w)}$$

$$P(z) = \frac{\sum_{d \in D} \sum_{w \in W} n(d,w)\,P(z|d,w)}{\sum_{w \in W} \sum_{d \in D} n(d,w)}$$
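One EM iteration built from these update equations might look like the sketch below. It assumes a dense count matrix `n[d][w]` (a real corpus would normally be stored sparsely), and the class name `PlsaEm` and its methods are hypothetical. The log-likelihood method is included so that successive iterations can be compared.

```java
final class PlsaEm {
    final double[] pz;      // P(z),   length |Z|
    final double[][] pwz;   // P(w|z), |W| x |Z|
    final double[][] pdz;   // P(d|z), |D| x |Z|

    PlsaEm(double[] pz, double[][] pwz, double[][] pdz) {
        this.pz = pz;
        this.pwz = pwz;
        this.pdz = pdz;
    }

    /** Log-likelihood: sum over (d, w) of n(d,w) * log sum_z P(z) P(w|z) P(d|z). */
    double logLikelihood(int[][] n) {
        double ll = 0.0;
        for (int d = 0; d < n.length; d++) {
            for (int w = 0; w < n[d].length; w++) {
                if (n[d][w] == 0) continue;
                double mix = 0.0;
                for (int z = 0; z < pz.length; z++) mix += pz[z] * pwz[w][z] * pdz[d][z];
                ll += n[d][w] * Math.log(mix);
            }
        }
        return ll;
    }

    /** One EM iteration over the count matrix n[d][w]; returns the re-estimated model. */
    PlsaEm step(int[][] n) {
        int numDocs = pdz.length, numWords = pwz.length, numTopics = pz.length;
        double[] zAcc = new double[numTopics];
        double[][] wzAcc = new double[numWords][numTopics];
        double[][] dzAcc = new double[numDocs][numTopics];
        double[] post = new double[numTopics];
        double total = 0.0;
        for (int d = 0; d < numDocs; d++) {
            for (int w = 0; w < numWords; w++) {
                if (n[d][w] == 0) continue;
                // E-step: unnormalized P(z|d,w) and its normalizer.
                double norm = 0.0;
                for (int z = 0; z < numTopics; z++) {
                    post[z] = pz[z] * pwz[w][z] * pdz[d][z];
                    norm += post[z];
                }
                if (norm <= 0.0) continue; // pair has zero probability under the model
                // Accumulate n(d,w) * P(z|d,w) into the three M-step numerators.
                for (int z = 0; z < numTopics; z++) {
                    double c = n[d][w] * post[z] / norm;
                    wzAcc[w][z] += c;
                    dzAcc[d][z] += c;
                    zAcc[z] += c;
                }
                total += n[d][w];
            }
        }
        // M-step: divide each numerator by the denominator from the update equations.
        for (int z = 0; z < numTopics; z++) {
            double topicMass = zAcc[z]; // sum_d sum_w n(d,w) P(z|d,w)
            if (topicMass > 0.0) {
                for (int w = 0; w < numWords; w++) wzAcc[w][z] /= topicMass;
                for (int d = 0; d < numDocs; d++) dzAcc[d][z] /= topicMass;
            }
            zAcc[z] = total > 0.0 ? topicMass / total : 0.0;
        }
        return new PlsaEm(zAcc, wzAcc, dzAcc);
    }
}
```

Because P(w|z) and P(d|z) share the same denominator, the unnormalized P(z) accumulator doubles as the normalizer for both.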


Coding Design

• Variables:
– double[][] p_dz_n // p(d|z), |D|*|Z|
– double[][] p_wz_n // p(w|z), |W|*|Z|
– double[] p_z_n // p(z), |Z|

• Processing steps:
1. Read dataset from file
   ArrayList<DocWordPair> doc; // all the docs
   DocWordPair – (word_id, word_frequency_in_doc)
2. Parameter initialization
   Assign each element of p_dz_n, p_wz_n and p_z_n a random double value, satisfying ∑d p_dz_n[d][z] = 1, ∑w p_wz_n[w][z] = 1, and ∑z p_z_n[z] = 1
3. Estimation (iterative processing)
   1. Update p_dz_n, p_wz_n and p_z_n
   2. Calculate the log-likelihood and check whether |Log-likelihood – old_Log-likelihood| < threshold
4. Output p_dz_n, p_wz_n and p_z_n
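Steps 2 and 3.2 can be sketched as follows. This is an illustration under assumed names (`PlsaInit`, `randomColumnStochastic`, `randomDistribution`, `converged`); the small positive offset added to each random draw is an extra assumption that keeps every parameter strictly positive, since a parameter initialized to zero stays zero under the EM updates.

```java
import java.util.Random;

final class PlsaInit {
    /** Random rows x cols matrix whose columns each sum to 1, e.g. p_dz_n or p_wz_n. */
    static double[][] randomColumnStochastic(int rows, int cols, Random rng) {
        double[][] m = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            double sum = 0.0;
            for (int r = 0; r < rows; r++) {
                m[r][c] = rng.nextDouble() + 1e-3; // offset keeps entries strictly positive
                sum += m[r][c];
            }
            for (int r = 0; r < rows; r++) {
                m[r][c] /= sum;
            }
        }
        return m;
    }

    /** Random vector that sums to 1, e.g. p_z_n. */
    static double[] randomDistribution(int size, Random rng) {
        double[] v = new double[size];
        double sum = 0.0;
        for (int i = 0; i < size; i++) {
            v[i] = rng.nextDouble() + 1e-3;
            sum += v[i];
        }
        for (int i = 0; i < size; i++) {
            v[i] /= sum;
        }
        return v;
    }

    /** Stopping rule from step 3.2: the log-likelihood changed by less than the threshold. */
    static boolean converged(double logLikelihood, double oldLogLikelihood, double threshold) {
        return Math.abs(logLikelihood - oldLogLikelihood) < threshold;
    }
}
```

Passing a seeded `Random` makes a run reproducible, which helps when comparing different numbers of topics.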


Coding Design

• Update p_dz_n

// tfwd = n(d,w), the frequency of word w in doc d
For each doc d {
  For each word w included in d {
    denominator = 0;
    nominator = new double[Z];
    For each topic z {
      nominator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
      denominator += nominator[z];
    } // end for each topic z
    For each topic z {
      P_z_condition_d_w = nominator[z] / denominator;
      nominator_p_dz_n[d][z] += tfwd * P_z_condition_d_w;
      denominator_p_dz_n[z] += tfwd * P_z_condition_d_w;
    } // end for each topic z
  } // end for each word w included in d
} // end for each doc d

For each doc d {
  For each topic z {
    p_dz_n_new[d][z] = nominator_p_dz_n[d][z] / denominator_p_dz_n[z];
  } // end for each topic z
} // end for each doc d


Coding Design

• Update p_wz_n

// tfwd = n(d,w), the frequency of word w in doc d
For each doc d {
  For each word w included in d {
    denominator = 0;
    nominator = new double[Z];
    For each topic z {
      nominator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
      denominator += nominator[z];
    } // end for each topic z
    For each topic z {
      P_z_condition_d_w = nominator[z] / denominator;
      nominator_p_wz_n[w][z] += tfwd * P_z_condition_d_w;
      denominator_p_wz_n[z] += tfwd * P_z_condition_d_w;
    } // end for each topic z
  } // end for each word w included in d
} // end for each doc d

For each word w {
  For each topic z {
    p_wz_n_new[w][z] = nominator_p_wz_n[w][z] / denominator_p_wz_n[z];
  } // end for each topic z
} // end for each word w


Coding Design

• Update p_z_n

// tfwd = n(d,w), the frequency of word w in doc d
For each doc d {
  For each word w included in d {
    denominator = 0;
    nominator = new double[Z];
    For each topic z {
      nominator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
      denominator += nominator[z];
    } // end for each topic z
    For each topic z {
      P_z_condition_d_w = nominator[z] / denominator;
      nominator_p_z_n[z] += tfwd * P_z_condition_d_w;
    } // end for each topic z
    denominator_p_z_n += tfwd; // scalar: total count over all (d, w)
  } // end for each word w included in d
} // end for each doc d

For each topic z {
  p_z_n_new[z] = nominator_p_z_n[z] / denominator_p_z_n;
} // end for each topic z
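The three update routines recompute the same posterior P(z|d,w) for every (d, w) pair, so they can share one pass over the data. The sketch below shows that variant on a sparse document representation. The nested `DocWordPair` class is a hypothetical counterpart of the `DocWordPair` mentioned in the coding design, and the class and method names are assumptions.

```java
import java.util.List;

final class PlsaSinglePass {
    /** Hypothetical counterpart of DocWordPair: (word_id, word_frequency_in_doc). */
    static final class DocWordPair {
        final int wordId;
        final int frequency;

        DocWordPair(int wordId, int frequency) {
            this.wordId = wordId;
            this.frequency = frequency;
        }
    }

    /** One EM iteration; p_dz_n, p_wz_n and p_z_n are overwritten with the new estimates. */
    static void update(List<List<DocWordPair>> docs,
                       double[][] p_dz_n, double[][] p_wz_n, double[] p_z_n) {
        int numTopics = p_z_n.length;
        double[][] dzAcc = new double[p_dz_n.length][numTopics];
        double[][] wzAcc = new double[p_wz_n.length][numTopics];
        double[] zAcc = new double[numTopics];
        double[] nominator = new double[numTopics];
        double totalCount = 0.0;

        for (int d = 0; d < docs.size(); d++) {
            for (DocWordPair pair : docs.get(d)) {
                int w = pair.wordId;
                double denominator = 0.0;
                for (int z = 0; z < numTopics; z++) {
                    nominator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
                    denominator += nominator[z];
                }
                if (denominator <= 0.0) continue; // pair has zero probability; skip it
                for (int z = 0; z < numTopics; z++) {
                    // tfwd * P(z|d,w), shared by all three accumulators
                    double weighted = pair.frequency * nominator[z] / denominator;
                    dzAcc[d][z] += weighted;
                    wzAcc[w][z] += weighted;
                    zAcc[z] += weighted;
                }
                totalCount += pair.frequency;
            }
        }
        if (totalCount == 0.0) return; // no usable data: leave the parameters unchanged

        // The old parameters are only overwritten here, after the full pass.
        for (int z = 0; z < numTopics; z++) {
            if (zAcc[z] > 0.0) {
                for (int d = 0; d < p_dz_n.length; d++) p_dz_n[d][z] = dzAcc[d][z] / zAcc[z];
                for (int w = 0; w < p_wz_n.length; w++) p_wz_n[w][z] = wzAcc[w][z] / zAcc[z];
            }
            p_z_n[z] = zAcc[z] / totalCount;
        }
    }
}
```

Accumulating into separate arrays and writing the parameters only at the end keeps every posterior in the pass based on the old estimates, as the E-step requires.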


Apache Mahout

Industrial Strength Machine Learning
May 2008


Current Situation

• Large volumes of data are now available
• Platforms now exist to run computations over large datasets (Hadoop, HBase)
• Sophisticated analytics are needed to turn data into information people can use
• Active research community and proprietary implementations of “machine learning” algorithms
• The world needs scalable implementations of ML under open license - ASF


History of Mahout

• Summer 2007
– Developers needed scalable ML
– Mailing list formed
• Community formed
– Apache contributors
– Academia & industry
– Lots of initial interest
• Project formed under Apache Lucene
– January 25, 2008


Current Code Base

• Matrix & Vector library
– Memory resident sparse & dense implementations
• Clustering
– Canopy
– K-Means
– Mean Shift
• Collaborative Filtering
– Taste
• Utilities
– Distance Measures
– Parameters


Under Development

• Naïve Bayes
• Perceptron
• PLSI/EM
• Genetic Programming
• Dirichlet Process Clustering
• Clustering Examples
• Hama (Incubator) for very large arrays


Appendix

• From Mahout Hands on, by Ted Dunning and Robin Anil, OSCON 2011, Portland


Step 1 – Convert dataset into a Hadoop Sequence File

• http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz

• Download (8.2 MB) and extract the SGML files.
– $ mkdir -p mahout-work/reuters-sgm
– $ cd mahout-work/reuters-sgm && tar xzf ../reuters21578.tar.gz && cd .. && cd ..

• Extract content from SGML to text file
– $ bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters mahout-work/reuters-sgm mahout-work/reuters-out


Step 1 – Convert dataset into a Hadoop Sequence File (continued)

• Use the seqdirectory tool to convert text files into a Hadoop Sequence File
– $ bin/mahout seqdirectory \
      -i mahout-work/reuters-out \
      -o mahout-work/reuters-out-seqdir \
      -c UTF-8 -chunk 5


Hadoop Sequence File

• Sequence of records, where each record is a <Key, Value> pair
– <Key1, Value1>
– <Key2, Value2>
– …
– <Keyn, Valuen>

• Key and Value need to be of class org.apache.hadoop.io.Text
– Key = Record name or file name or unique identifier
– Value = Content as UTF-8 encoded string

• TIP: Dump data from your database directly into Hadoop Sequence Files (see next slide)


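Writing to Sequence Files

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("testdata/part-00000");
// Key and value are both Text: the document id and the document content.
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, Text.class);
// MAX_DOCS and documents(i) stand for the caller's own document source.
for (int i = 0; i < MAX_DOCS; i++) {
  writer.append(new Text(documents(i).Id()), new Text(documents(i).Content()));
}
writer.close();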


Generate Vectors from Sequence Files

• Steps
1. Compute dictionary
2. Assign integers for words
3. Compute feature weights
4. Create vector for each document using word-integer mapping and feature-weight

Or

• Simply run $ bin/mahout seq2sparse
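The four manual steps can be illustrated in plain Java. This is a minimal sketch of the idea, not what seq2sparse does internally: the class `SimpleVectorizer`, its methods, the regex tokenizer, and the use of raw term frequency as the feature weight are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class SimpleVectorizer {
    /** Lower-cases the text and splits it on runs of non-word characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Steps 1-2: dictionary that assigns each distinct word the next free integer id. */
    static Map<String, Integer> buildDictionary(List<String> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String doc : docs) {
            for (String token : tokenize(doc)) {
                dict.putIfAbsent(token, dict.size());
            }
        }
        return dict;
    }

    /** Steps 3-4: sparse vector (word id -> weight), here with raw term frequency as weight. */
    static Map<Integer, Double> termFrequencyVector(String doc, Map<String, Integer> dict) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : tokenize(doc)) {
            Integer id = dict.get(token);
            if (id != null) vector.merge(id, 1.0, Double::sum); // words outside the dictionary are ignored
        }
        return vector;
    }
}
```

For the two documents "Iran strike" and "strike union strike", the dictionary maps iran to 0, strike to 1 and union to 2, and the second document becomes the sparse vector {1: 2.0, 2: 1.0}.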


Generate Vectors from Sequence Files

• $ bin/mahout seq2sparse \
      -i mahout-work/reuters-out-seqdir/ \
      -o mahout-work/reuters-out-seqdir-sparse-kmeans

• Important options
– Ngrams
– Lucene Analyzer for tokenizing
– Feature Pruning
  • Min support
  • Max Document Frequency
  • Min LLR (for ngrams)
– Weighting Method
  • TF v/s TFIDF
  • lp-Norm
  • Log normalize length
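To make the weighting options concrete, the sketch below shows a textbook TF-IDF weight and lp-normalization. The formula tf · log(N / df) is a common textbook variant and is an assumption here; Mahout's own weighting may differ in detail. The class and method names are hypothetical.

```java
final class Weighting {
    /** Textbook TF-IDF: term frequency times log(numDocs / docFreq); docFreq must be >= 1. */
    static double tfIdf(double termFrequency, int docFreq, int numDocs) {
        return termFrequency * Math.log((double) numDocs / docFreq);
    }

    /** Returns a copy of v divided by its lp-norm (p >= 1); a zero vector is returned unchanged. */
    static double[] lpNormalize(double[] v, double p) {
        double sum = 0.0;
        for (double x : v) {
            sum += Math.pow(Math.abs(x), p);
        }
        double norm = Math.pow(sum, 1.0 / p);
        double[] out = v.clone();
        if (norm == 0.0) return out;
        for (int i = 0; i < out.length; i++) {
            out[i] /= norm;
        }
        return out;
    }
}
```

A term that occurs in every document gets weight 0 under this TF-IDF variant, which is why very frequent words contribute little to the resulting vectors.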


Start K-Means clustering

• $ bin/mahout kmeans \
      -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
      -c mahout-work/reuters-kmeans-clusters \
      -o mahout-work/reuters-kmeans \
      -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 \
      -x 10 -k 20 -ow

• Things to watch out for
– Number of iterations
– Convergence delta
– Distance Measure
– Creating assignments
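The distance measure and the convergence delta interact: with cosine distance, a delta of 0.1 means iteration stops once no centroid has moved by more than 0.1 in cosine distance. The sketch below illustrates both pieces in plain Java. The names are hypothetical, this is not Mahout's implementation, and treating a zero vector as maximally distant is an assumption of the example.

```java
final class KMeansPieces {
    /** Cosine distance: 1 - (a . b) / (|a| |b|). */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) return 1.0; // assumption: zero vector is maximally distant
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Assignment step: index of the centroid closest to the point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int k = 0; k < centroids.length; k++) {
            double distance = cosineDistance(point, centroids[k]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = k;
            }
        }
        return best;
    }

    /** Converged when no centroid moved farther than delta between two iterations. */
    static boolean converged(double[][] oldCentroids, double[][] newCentroids, double delta) {
        for (int k = 0; k < oldCentroids.length; k++) {
            if (cosineDistance(oldCentroids[k], newCentroids[k]) > delta) return false;
        }
        return true;
    }
}
```

Cosine distance ignores vector length, so a long and a short document with the same word proportions count as identical.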


Inspect clusters

• $ bin/mahout clusterdump \
      -s mahout-work/reuters-kmeans/clusters-9 \
      -d mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
      -dt sequencefile -b 100 -n 20

Typical output:
VL-21438{n=518 c=[0.56:0.019, 00:0.154, 00.03:0.018, 00.18:0.018, …

Top Terms:
iran => 3.1861672217321213
strike => 2.567886952727918
iranian => 2.133417966282966
union => 2.116033937940266
said => 2.101773806290277
workers => 2.066259451354332
gulf => 1.9501374918521601
had => 1.6077752463145605
he => 1.5355078004962228