Mining of Massive Datasets Leskovec, Rajaraman, and Ullman Stanford University

Mining of Massive Datasets · lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-24

Jul 04, 2020


dariahiddleston

• Consider user x
• Find a set N of other users whose ratings are "similar" to x's ratings
• Estimate x's ratings based on the ratings of users in N

• Consider users x and y with rating vectors $r_x$ and $r_y$
• We need a similarity metric sim(x, y)
• It should capture the intuition that sim(A,B) > sim(A,C)

• Jaccard similarity: $\mathrm{sim}(A,B) = |r_A \cap r_B| / |r_A \cup r_B|$
• sim(A,B) = 1/5; sim(A,C) = 2/4
  - sim(A,B) < sim(A,C)
• Problem: ignores rating values!
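The set-overlap (Jaccard) numbers above can be reproduced with a short Python sketch. The ratings table for users A, B and C did not survive in this transcript, so the `ratings` dictionary below is an assumption: illustrative values chosen to be consistent with the similarities quoted on the slides. The movie labels (`HP1`, `TW`, `SW1`, ...) are hypothetical.

```python
# Illustrative ratings (an assumption; the slide's table is not in the transcript).
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}

def jaccard(u, v):
    """Jaccard similarity of the sets of items rated by users u and v."""
    items_u, items_v = set(ratings[u]), set(ratings[v])
    union = items_u | items_v
    # Only which items were rated matters; the rating values are ignored.
    return len(items_u & items_v) / len(union) if union else 0.0

print(jaccard("A", "B"))  # 0.2
print(jaccard("A", "C"))  # 0.5
```

With these assumed values, A and C come out as more similar than A and B even though A rates the shared movies high and C rates them low (or the reverse), which is exactly the weakness the slide points out.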

• Cosine similarity: $\mathrm{sim}(A,B) = \cos(r_A, r_B)$
• sim(A,B) = 0.38, sim(A,C) = 0.32
  - sim(A,B) > sim(A,C), but not by much
• Problem: treats missing ratings as negative
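A minimal Python sketch of the cosine computation, with unrated items treated as 0. As before, the `ratings` values are an assumption (the slide's table is missing from the transcript); they are illustrative numbers chosen so that the results agree with the 0.38 and 0.32 quoted above.

```python
import math

# Illustrative ratings (an assumption; the slide's table is not in the transcript).
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}

def cosine(u, v):
    """Cosine similarity of two rating vectors; missing ratings count as 0."""
    ru, rv = ratings[u], ratings[v]
    # Only co-rated items contribute to the dot product.
    dot = sum(ru[i] * rv[i] for i in ru.keys() & rv.keys())
    norm_u = math.sqrt(sum(x * x for x in ru.values()))
    norm_v = math.sqrt(sum(x * x for x in rv.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(round(cosine("A", "B"), 2))  # 0.38
print(round(cosine("A", "C"), 2))  # 0.32
```

Because a missing rating is encoded as 0, which is below the lowest real rating, it acts like a strong dislike. That is the "missing ratings as negative" problem.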

• Normalize ratings by subtracting the row mean (here, each user's average rating)

• Centered cosine: $\mathrm{sim}(A,B) = \cos(r_A, r_B) = 0.09$; sim(A,C) = -0.56
  - sim(A,B) > sim(A,C)
• Captures intuition better
  - Missing ratings are treated as "average"
  - Handles "tough raters" and "easy raters"
• Also known as Pearson correlation
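The normalization step and the resulting similarities can be sketched as follows. The `ratings` values are again an assumption (illustrative numbers consistent with the 0.09 and -0.56 quoted above), and `centered` and `centered_cosine` are hypothetical helper names.

```python
import math

# Illustrative ratings (an assumption; the slide's table is not in the transcript).
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}

def centered(user_ratings):
    """Subtract the user's mean rating from each of their known ratings."""
    mean = sum(user_ratings.values()) / len(user_ratings)
    return {item: x - mean for item, x in user_ratings.items()}

def centered_cosine(u, v):
    """Cosine of mean-centered vectors; a missing rating is 0, i.e. 'average'."""
    cu, cv = centered(ratings[u]), centered(ratings[v])
    dot = sum(cu[i] * cv[i] for i in cu.keys() & cv.keys())
    norm_u = math.sqrt(sum(x * x for x in cu.values()))
    norm_v = math.sqrt(sum(x * x for x in cv.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(round(centered_cosine("A", "B"), 2))  # 0.09
print(round(centered_cosine("A", "C"), 2))  # -0.56
```

After centering, a 0 means "this user's average", so a generous rater and a harsh rater with the same relative preferences end up with similar vectors.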

• Let $r_x$ be the vector of user x's ratings
• Let N be the set of k users most similar to x who have also rated item i
• Prediction for user x and item i
• Option 1: $r_{xi} = \frac{1}{k} \sum_{y \in N} r_{yi}$
• Option 2: $r_{xi} = \frac{\sum_{y \in N} s_{xy} \, r_{yi}}{\sum_{y \in N} s_{xy}}$, where $s_{xy} = \mathrm{sim}(x,y)$
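The two prediction options can be sketched in a few lines of Python. The function names and the neighborhood values below are hypothetical, chosen only to show how the plain average and the similarity-weighted average differ.

```python
def predict_mean(neighbor_ratings):
    """Option 1: plain average of the neighbors' ratings r_yi."""
    return sum(neighbor_ratings) / len(neighbor_ratings)

def predict_weighted(neighbors):
    """Option 2: similarity-weighted average over (s_xy, r_yi) pairs."""
    total_weight = sum(s for s, _ in neighbors)
    if total_weight == 0:
        raise ValueError("similarities sum to zero; prediction is undefined")
    return sum(s * r for s, r in neighbors) / total_weight

# Hypothetical neighborhood N for user x and item i: (s_xy, r_yi) pairs.
neighbors = [(0.9, 5), (0.5, 3), (0.1, 1)]

print(predict_mean([r for _, r in neighbors]))  # 3.0
print(round(predict_weighted(neighbors), 2))    # 4.07
```

In this made-up neighborhood the most similar user gave the highest rating, so the weighted prediction is pulled above the plain average.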

• So far: user-user collaborative filtering
• Another view: item-item
  - For item i, find other similar items
  - Estimate the rating for item i based on ratings for similar items
  - Can use the same similarity metrics and prediction functions as in the user-user model

$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$

where
  $s_{ij}$: similarity of items i and j
  $r_{xj}$: rating of user x on item j
  $N(i;x)$: set of items rated by x that are similar to i

Utility matrix (rows: movies, columns: users; "-" is an unknown rating, numbers are ratings between 1 and 5):

          users
movie     1  2  3  4  5  6  7  8  9 10 11 12
  1       1  -  3  -  -  5  -  -  5  -  4  -
  2       -  -  5  4  -  -  4  -  -  2  1  3
  3       2  4  -  1  2  -  3  -  4  3  5  -
  4       -  2  4  -  5  -  -  4  -  -  2  -
  5       -  -  4  3  4  2  -  -  -  -  2  5
  6       1  -  3  -  3  -  -  2  -  -  4  -

Task: estimate the rating of movie 1 by user 5 (the cell marked "?" in the utility matrix).

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

movie m    sim(1,m)
   1         1.00
   2        -0.18
   3         0.41
   4        -0.10
   5        -0.31
   6         0.59
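The two steps above can be checked with a short Python sketch. The cell positions in the matrix `R` are an assumption: the transcript preserves the rating values of each row and the centered form of row 1, but not the column of every rating, so the layout below is a reconstruction chosen to be consistent with those values. `center` and `cosine` are hypothetical helper names.

```python
import math

U = None  # marks an unknown rating

# Rows are movies 1..6, columns are users 1..12.
# Cell positions are an assumption (reconstructed, not given in the transcript).
R = [
    [1, U, 3, U, U, 5, U, U, 5, U, 4, U],
    [U, U, 5, 4, U, U, 4, U, U, 2, 1, 3],
    [2, 4, U, 1, 2, U, 3, U, 4, 3, 5, U],
    [U, 2, 4, U, 5, U, U, 4, U, U, 2, U],
    [U, U, 4, 3, 4, 2, U, U, U, U, 2, 5],
    [1, U, 3, U, 3, U, U, 2, U, U, 4, U],
]

def center(row):
    """Step 1: subtract the row mean from known ratings; unknowns become 0."""
    known = [x for x in row if x is not None]
    mean = sum(known) / len(known)
    return [0.0 if x is None else x - mean for x in row]

def cosine(a, b):
    """Step 2: cosine similarity between two centered rows."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

centered = [center(row) for row in R]
sims = [round(cosine(centered[0], row), 2) for row in centered]
print(sims)  # [1.0, -0.18, 0.41, -0.1, -0.31, 0.59]
```

Under the assumed layout, the printed values match the sim(1,m) column listed on the slide.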

Compute similarity weights for the selected neighbors, movies 3 and 6: s_13 = 0.41, s_16 = 0.59

Predict by taking the weighted average of user 5's ratings for the neighbor movies:

r_15 = (0.41*2 + 0.59*3) / (0.41 + 0.59) = 2.6

The "?" cell for movie 1 and user 5 is filled in with 2.6.
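The whole item-item prediction for movie 1 and user 5 can be sketched as below. The similarity values are the sim(1,m) numbers listed on the slides. User 5's ratings of movies 3 and 6 come from the weighted-average formula above, while the ratings of movies 4 and 5 and the neighborhood size `k=2` are assumptions. `predict` is a hypothetical helper name.

```python
# sim(1, m) for m = 2..6, as listed on the slides.
sim_to_movie1 = {2: -0.18, 3: 0.41, 4: -0.10, 5: -0.31, 6: 0.59}

# User 5's known ratings by movie. The values for movies 3 and 6 appear in the
# slide's formula; those for movies 4 and 5 are an assumption.
user5 = {3: 2, 4: 5, 5: 4, 6: 3}

def predict(sims, user_ratings, k=2):
    """Weighted average over the k most similar items the user has rated."""
    rated = [(sims[j], r) for j, r in user_ratings.items() if j in sims]
    # N(i;x): keep the k rated items with the highest similarity to item i.
    neighbors = sorted(rated, reverse=True)[:k]
    total_weight = sum(s for s, _ in neighbors)
    if total_weight == 0:
        raise ValueError("similarities sum to zero; prediction is undefined")
    return sum(s * r for s, r in neighbors) / total_weight

print(round(predict(sim_to_movie1, user5), 1))  # 2.6
```

With `k=2` the neighbors are movies 6 and 3, so the computation reduces to the weighted average shown on the slide.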

• In theory, user-user and item-item are dual approaches
• In practice, item-item outperforms user-user in many use cases
• Items are "simpler" than users
  - Items belong to a small set of "genres", users have varied tastes
  - Item similarity is more meaningful than user similarity