Collaborative Filtering
Matrix Factorization Approach

Jeff Howbert, Introduction to Machine Learning, Winter 2012
Transcript
Page 1: Collaborative Filtering Matrix Factorization Approach


Collaborative Filtering

Matrix Factorization Approach

Page 2: Collaborative Filtering Matrix Factorization Approach


Collaborative filtering algorithms

Common types:

– Global effects

– Nearest neighbor

– Matrix factorization

– Restricted Boltzmann machine

– Clustering

– Etc.

Page 3: Collaborative Filtering Matrix Factorization Approach


Optimization is an important part of many machine learning methods.

What we are usually optimizing is the model's loss function.

– For a given set of training data X and outcomes y, we want to find the model parameters w that minimize the total loss over all X, y.

Optimization

Page 4: Collaborative Filtering Matrix Factorization Approach


Suppose target outcomes come from set Y

– Binary classification: Y = { 0, 1 }

– Regression: Y = ℝ (real numbers)

A loss function maps decisions to costs:

– $L(\hat{y}_i, y_i)$ defines the penalty for predicting $\hat{y}_i$ when the true value is $y_i$.

Standard choice for classification: 0/1 loss (same as misclassification error)

$$L_{0/1}(\hat{y}_i, y_i) = \begin{cases} 0 & \text{if } \hat{y}_i = y_i \\ 1 & \text{otherwise} \end{cases}$$

Standard choice for regression: squared loss

$$L(\hat{y}_i, y_i) = (\hat{y}_i - y_i)^2$$

Loss function
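As a small illustration (mine, not from the slides), the two standard loss functions can be written directly in Python; `y_hat` and `y` are hypothetical predicted and true values:

```python
def zero_one_loss(y_hat, y):
    """0/1 loss for classification: 0 if the prediction is correct, 1 otherwise."""
    return 0 if y_hat == y else 1

def squared_loss(y_hat, y):
    """Squared loss for regression: the penalty grows quadratically with the error."""
    return (y_hat - y) ** 2

# Example: a misclassified label costs 1; a regression miss of 0.5 costs 0.25.
print(zero_one_loss(1, 0))      # 1
print(squared_loss(2.5, 3.0))   # 0.25
```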

Page 5: Collaborative Filtering Matrix Factorization Approach


Calculate the sum of squared loss (SSL) over the training data and determine w:

One can prove that this method of determining w minimizes the SSL.

Least squares linear fit to data

$$\mathrm{SSL} = \sum_{j=1}^{N} \left( y_j - \sum_{i=0}^{d} w_i x_{ji} \right)^2 = (\mathbf{y} - X\mathbf{w})^\mathsf{T} (\mathbf{y} - X\mathbf{w})$$

$\mathbf{y}$ = vector of all training responses
$X$ = matrix of all training samples

$$\mathbf{w} = (X^\mathsf{T} X)^{-1} X^\mathsf{T} \mathbf{y}$$

$$\hat{y} = \mathbf{w}^\mathsf{T} \mathbf{x} \quad \text{for test sample } \mathbf{x}$$
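A minimal NumPy sketch of the closed-form solution above (my own illustration, not code from the lecture); it assumes `X` already contains any bias column you want:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))                  # matrix of all training samples
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)    # vector of all training responses

# Closed-form least squares: w = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over forming the explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ y)

x_test = rng.normal(size=d)
y_hat = w @ x_test                           # prediction for a test sample
print(w, y_hat)
```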

Page 6: Collaborative Filtering Matrix Factorization Approach


Simplest example - quadratic function in 1 variable:

$f(x) = x^2 + 2x - 3$

Want to find value of x where f( x ) is minimum

Optimization

Page 7: Collaborative Filtering Matrix Factorization Approach


This example is simple enough that we can find the minimum directly:
– Minimum occurs where the slope of the curve is 0

– First derivative of function = slope of curve

– So set first derivative to 0, solve for x

Optimization

Page 8: Collaborative Filtering Matrix Factorization Approach


$f(x) = x^2 + 2x - 3$

$df(x)/dx = 2x + 2$

Set $2x + 2 = 0$

$x = -1$ is the value of $x$ where $f(x)$ is at its minimum

Optimization

Page 9: Collaborative Filtering Matrix Factorization Approach


Another example - quadratic function in 2 variables:

$$f(\mathbf{x}) = f(x_1, x_2) = x_1^2 + x_1 x_2 + 3x_2^2$$

f( x ) is minimum where gradient of f( x ) is zero in all directions

Optimization

Page 10: Collaborative Filtering Matrix Factorization Approach


Gradient is a vector

– Each element is the slope of the function along the direction of one of the variables

– Each element is the partial derivative of the function with respect to one of the variables

– Example:

Optimization

$$\nabla f(\mathbf{x}) = \nabla f(x_1, x_2, \ldots, x_d) = \left( \frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d} \right)$$

$$f(\mathbf{x}) = f(x_1, x_2) = x_1^2 + x_1 x_2 + 3x_2^2$$

$$\nabla f(\mathbf{x}) = \left( \frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2} \right) = (2x_1 + x_2,\; x_1 + 6x_2)$$
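To make the example concrete, here is a small sketch (my own, not from the slides) that evaluates the analytic gradient $(2x_1 + x_2,\; x_1 + 6x_2)$ and checks it against finite differences:

```python
import numpy as np

def f(x):
    return x[0]**2 + x[0]*x[1] + 3*x[1]**2

def grad_f(x):
    # analytic gradient: (df/dx1, df/dx2)
    return np.array([2*x[0] + x[1], x[0] + 6*x[1]])

x = np.array([1.0, -2.0])
eps = 1e-6
numeric = np.array([
    (f(x + eps*np.eye(2)[i]) - f(x - eps*np.eye(2)[i])) / (2*eps)
    for i in range(2)
])
print(grad_f(x), numeric)   # the two should agree closely
```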

Page 11: Collaborative Filtering Matrix Factorization Approach


The gradient vector points in the direction of steepest ascent of the function

Optimization

[Figure: surface plot of $f(x_1, x_2)$ with its partial derivatives $\partial f(x_1, x_2)/\partial x_1$ and $\partial f(x_1, x_2)/\partial x_2$; the gradient $\nabla f(x_1, x_2)$ points uphill.]

Page 12: Collaborative Filtering Matrix Factorization Approach


This two-variable example is still simple enough that we can find the minimum directly

– Set both elements of gradient to 0

– Gives two linear equations in two variables

– Solve for x1, x2

Optimization

$$f(x_1, x_2) = x_1^2 + x_1 x_2 + 3x_2^2$$

$$\frac{\partial f(x_1, x_2)}{\partial x_1} = 2x_1 + x_2 = 0$$

$$\frac{\partial f(x_1, x_2)}{\partial x_2} = x_1 + 6x_2 = 0$$

$$\Rightarrow \quad x_1 = 0, \quad x_2 = 0$$
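Equivalently, the two linear equations can be solved numerically; a tiny sketch (my own illustration):

```python
import numpy as np

# 2*x1 + 1*x2 = 0
# 1*x1 + 6*x2 = 0
A = np.array([[2.0, 1.0],
              [1.0, 6.0]])
b = np.zeros(2)
print(np.linalg.solve(A, b))   # [0. 0.] -> minimum at x1 = 0, x2 = 0
```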

Page 13: Collaborative Filtering Matrix Factorization Approach


Finding the minimum directly by a closed-form analytical solution is often difficult or impossible.

– Quadratic functions in many variables: the system of equations for the partial derivatives may be ill-conditioned (example: a linear least squares fit where redundancy among features is high)

– Other convex functions: a global minimum exists, but there is no closed-form solution (example: the maximum likelihood solution for logistic regression)

– Nonlinear functions: the partial derivatives are not linear (examples: $f(x_1, x_2) = x_1 \sin(x_1 x_2) + x_2^2$; a sum of transfer functions in a neural network)

Optimization

Page 14: Collaborative Filtering Matrix Factorization Approach


Many approximate methods for finding minima have been developed

– Gradient descent

– Newton method

– Gauss-Newton

– Levenberg-Marquardt

– BFGS

– Conjugate gradient

– Etc.

Optimization

Page 15: Collaborative Filtering Matrix Factorization Approach


Simple concept: follow the gradient downhill.

Process (a code sketch follows below):

1. Pick a starting position: $\mathbf{x}_0 = (x_1, x_2, \ldots, x_d)$

2. Determine the descent direction: $-\nabla f(\mathbf{x}_t)$

3. Choose a learning rate: $\alpha$

4. Update your position: $\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \nabla f(\mathbf{x}_t)$

5. Repeat from 2) until stopping criterion is satisfied

Typical stopping criteria:

– $\nabla f(\mathbf{x}_{t+1}) \approx 0$

– some validation metric is optimized

Gradient descent optimization
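The sketch below (my own illustration of the procedure, not the MATLAB demo referenced later) runs these five steps on the two-variable quadratic $f(x_1, x_2) = x_1^2 + x_1 x_2 + 3x_2^2$; the learning rate and starting point are arbitrary choices:

```python
import numpy as np

def grad_f(x):
    # gradient of f(x1, x2) = x1^2 + x1*x2 + 3*x2^2
    return np.array([2*x[0] + x[1], x[0] + 6*x[1]])

x = np.array([3.0, -2.0])   # 1. starting position
alpha = 0.1                 # 3. learning rate

for t in range(1000):
    g = grad_f(x)           # 2. descent direction is -g
    if np.linalg.norm(g) < 1e-8:   # stopping criterion: gradient ~ 0
        break
    x = x - alpha * g       # 4. update position

print(x)   # converges toward the minimum at (0, 0)
```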

Page 16: Collaborative Filtering Matrix Factorization Approach


Slides thanks to Alexandre Bayen

(CE 191, Univ. California, Berkeley, 2006)

http://www.ce.berkeley.edu/~bayen/ce191www/lecturenotes/lecture10v01_descent2.pdf

Gradient descent optimization

Page 17: Collaborative Filtering Matrix Factorization Approach


Example in MATLAB

Find the minimum of a function in two variables:

$$y = x_1^2 + x_1 x_2 + 3x_2^2$$

http://www.youtube.com/watch?v=cY1YGQQbrpQ

Gradient descent optimization

Page 18: Collaborative Filtering Matrix Factorization Approach


Problems:

– Choosing the step size: too small and convergence is slow and inefficient; too large and it may not converge

– Can get stuck on “flat” areas of function

– Easily trapped in local minima

Gradient descent optimization

Page 19: Collaborative Filtering Matrix Factorization Approach


Stochastic (definition):
1. involving a random variable

2. involving chance or probability; probabilistic

Stochastic gradient descent

Page 20: Collaborative Filtering Matrix Factorization Approach


Application to training a machine learning model (a minimal sketch follows below):
1. Choose one sample from the training set

2. Calculate loss function for that single sample

3. Calculate gradient from loss function

4. Update model parameters a single step based on gradient and learning rate

5. Repeat from 1) until stopping criterion is satisfied

Typically the entire training set is processed multiple times before stopping.

The order in which samples are processed can be fixed or random.

Stochastic gradient descent
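As a minimal illustration (my own, not from the slides), here is this loop applied to linear regression with squared loss; one random sample is used per parameter update:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.05 * rng.normal(size=N)

w = np.zeros(d)     # model parameters
alpha = 0.01        # learning rate

for epoch in range(20):                 # entire training set processed multiple times
    for i in rng.permutation(N):        # order of samples is random here
        err = y[i] - X[i] @ w           # loss for this single sample is err**2
        w += alpha * 2 * err * X[i]     # gradient step based on that one sample

print(w)   # should be close to w_true
```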

Page 21: Collaborative Filtering Matrix Factorization Approach


Matrix factorization in action

[Figure: the training data is a sparse ratings matrix with rows user 1 … user 480189 and columns movie 1 … movie 17770; only some cells contain ratings. The factorization (training process) produces two dense factor matrices filled with numbers: a user matrix with one row per user and columns feature 1 … feature 5, and a movie matrix with one column per movie and rows feature 1 … feature 5.]

Page 22: Collaborative Filtering Matrix Factorization Approach


[Figure: the same ratings matrix, now with an unknown cell (marked "?") to fill in. To predict the rating for the desired ⟨ user, movie ⟩ pair, multiply and add the corresponding features from the two factor matrices (the dot product of that user's feature row and that movie's feature column).]

Matrix factorization in action

Page 23: Collaborative Filtering Matrix Factorization Approach


Notation:
– Number of users = I

– Number of items = J

– Number of factors per user / item = F

– User of interest = i

– Item of interest = j

– Factor index = f

User matrix U has dimensions I × F
Item matrix V has dimensions J × F

Matrix factorization

Page 24: Collaborative Filtering Matrix Factorization Approach


Prediction for user, item pair $i, j$:

$$\hat{r}_{ij} = \sum_{f=1}^{F} U_{if} V_{jf}$$

Loss for prediction $\hat{r}_{ij}$ where the true rating is $r_{ij}$:

$$L(\hat{r}_{ij}, r_{ij}) = (r_{ij} - \hat{r}_{ij})^2 = \left( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \right)^2$$

– Using squared loss; other loss functions are possible

– The loss function contains F model variables from U and F model variables from V

Matrix factorization
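A small sketch (my own illustration) of the prediction and squared loss for one ⟨ user, item ⟩ pair, using hypothetical factor matrices `U` and `V`:

```python
import numpy as np

I, J, F = 4, 6, 3                        # users, items, factors (toy sizes)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(I, F))   # user factor matrix, I x F
V = rng.normal(scale=0.1, size=(J, F))   # item factor matrix, J x F

i, j, r_ij = 1, 4, 4.0                   # observed rating for user i, item j

r_hat = U[i] @ V[j]                      # prediction: sum over f of U[i,f] * V[j,f]
loss = (r_ij - r_hat) ** 2               # squared loss for this pair
print(r_hat, loss)
```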

Page 25: Collaborative Filtering Matrix Factorization Approach


Gradient of loss function for sample i, j :

– for f = 1 to F

Matrix factorization

$$\frac{\partial L(\hat{r}_{ij}, r_{ij})}{\partial U_{if}} = \frac{\partial}{\partial U_{if}} \left( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \right)^2 = -2 \left( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \right) V_{jf}$$

$$\frac{\partial L(\hat{r}_{ij}, r_{ij})}{\partial V_{jf}} = \frac{\partial}{\partial V_{jf}} \left( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \right)^2 = -2 \left( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \right) U_{if}$$

Page 26: Collaborative Filtering Matrix Factorization Approach


Let’s simplify the notation:

– for f = 1 to F

Matrix factorization

Let $e_{ij} = r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf}$ (the prediction error). Then:

$$\frac{\partial L(\hat{r}_{ij}, r_{ij})}{\partial U_{if}} = -2\, e_{ij} V_{jf}$$

$$\frac{\partial L(\hat{r}_{ij}, r_{ij})}{\partial V_{jf}} = -2\, e_{ij} U_{if}$$

Page 27: Collaborative Filtering Matrix Factorization Approach


Matrix factorization

Set the learning rate $= \alpha$. Then the factor matrix updates for sample $i, j$ are:

– for f = 1 to F

$$U_{if} \leftarrow U_{if} + 2\alpha\, e_{ij} V_{jf}$$

$$V_{jf} \leftarrow V_{jf} + 2\alpha\, e_{ij} U_{if}$$

Page 28: Collaborative Filtering Matrix Factorization Approach


SGD for training a matrix factorization:

1. Decide on F = dimension of factors

2. Initialize factor matrices with small random values

3. Choose one sample from training set

4. Calculate loss function for that single sample

5. Calculate gradient from loss function

6. Update the 2F model parameters a single step using the gradient and learning rate

7. Repeat from 3) until stopping criterion is satisfied (a minimal end-to-end sketch follows below)

Matrix factorization
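Putting the pieces together, here is a minimal sketch of the whole procedure (my own illustration; the toy ratings, matrix sizes, learning rate, and stopping rule are all hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)
F = 5                                    # 1. dimension of factors
I, J = 100, 50                           # users, items (toy sizes)
U = 0.1 * rng.normal(size=(I, F))        # 2. initialize factor matrices
V = 0.1 * rng.normal(size=(J, F))        #    with small random values

# toy training set of (user, item, rating) triples
train = [(rng.integers(I), rng.integers(J), rng.integers(1, 6)) for _ in range(5000)]

alpha = 0.01                             # learning rate
for epoch in range(10):                  # stopping criterion: fixed number of passes
    for i, j, r in train:                # 3. choose one sample at a time
        e = r - U[i] @ V[j]              # 4. prediction error for this sample
        Ui_old = U[i].copy()             # keep old U[i] so both updates use old values
        U[i] += 2 * alpha * e * V[j]     # 5./6. gradient step on the 2F parameters
        V[j] += 2 * alpha * e * Ui_old

print(U[0] @ V[0])                       # predicted rating for user 0, item 0
```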

Page 29: Collaborative Filtering Matrix Factorization Approach


Must use some form of regularization (usually L2):

$$L(\hat{r}_{ij}, r_{ij}) = \left( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \right)^2 + \lambda \left( \sum_{f=1}^{F} U_{if}^2 + \sum_{f=1}^{F} V_{jf}^2 \right)$$

Update rules become:

– for f = 1 to F

$$U_{if} \leftarrow U_{if} + 2\alpha\, (e_{ij} V_{jf} - \lambda U_{if})$$

$$V_{jf} \leftarrow V_{jf} + 2\alpha\, (e_{ij} U_{if} - \lambda V_{jf})$$

Matrix factorization
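Inside a training loop like the sketch on the previous page, the regularized updates would replace the plain ones; a small sketch (my own, with hypothetical values for the learning rate and regularization strength):

```python
def sgd_step_regularized(U, V, i, j, r, alpha=0.01, lam=0.05):
    """One L2-regularized SGD update for a single (user i, item j, rating r) sample.

    U and V are NumPy factor matrices of shape (I, F) and (J, F); they are
    modified in place. alpha and lam are hypothetical example values.
    """
    e = r - U[i] @ V[j]                           # prediction error e_ij
    Ui_old = U[i].copy()                          # keep old U[i] for the V update
    U[i] += 2 * alpha * (e * V[j] - lam * U[i])
    V[j] += 2 * alpha * (e * Ui_old - lam * V[j])
    return e
```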

Page 30: Collaborative Filtering Matrix Factorization Approach


Random thoughts …

– Samples can be processed in small batches instead of one at a time (mini-batch / batch gradient descent)

– We’ll see stochastic / batch gradient descent again when we learn about neural networks (as back-propagation)

Stochastic gradient descent