Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
5/19/17

Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
Graph data: PageRank, SimRank; Community Detection; Spam Detection
Infinite data: Filtering data streams; Web advertising; Queries on streams
Machine learning: SVM; Decision Trees; Perceptron, kNN
Apps: Recommender systems; Association Rules; Duplicate document detection

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

13 Rec Sys - SJTU

Feb 12, 2022



• Customer X
  - Buys Metallica CD
  - Buys Megadeth CD
• Customer Y
  - Does search on Metallica
  - Recommender system suggests Megadeth from data collected about customer X

[Figure: items reach users through search and through recommendations. Examples of items: products, web sites, blogs, news items, …]


• Shelf space is a scarce commodity for traditional retailers
  - Also: TV networks, movie theaters, …
• Web enables near-zero-cost dissemination of information about products
  - From scarcity to abundance
• More choice necessitates better filters
  - Recommendation engines
  - How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html

Source: Chris Anderson (2004)

Read http://www.wired.com/wired/archive/12.10/tail.html to learn more!

• Editorial and hand curated
  - List of favorites
  - Lists of “essential” items
• Simple aggregates
  - Top 10, Most Popular, Recent Uploads
• Tailored to individual users
  - Amazon, Netflix, …


• X = set of Customers
• S = set of Items
• Utility function u: X × S → R
  - R = set of ratings
  - R is a totally ordered set
  - e.g., 0-5 stars, real number in [0,1]

[Figure: utility matrix with rows Alice, Bob, Carol, David and columns Avatar, LOTR, Matrix, Pirates; only some cells hold a rating in [0,1]]


• (1) Gathering “known” ratings for matrix
  - How to collect the data in the utility matrix
• (2) Extrapolate unknown ratings from the known ones
  - Mainly interested in high unknown ratings
  - We are not interested in knowing what you don’t like but what you like
• (3) Evaluating extrapolation methods
  - How to measure success/performance of recommendation methods

• Explicit
  - Ask people to rate items
  - Doesn’t work well in practice: people can’t be bothered
• Implicit
  - Learn ratings from user actions
  - E.g., purchase implies high rating
  - What about low ratings?


• Key problem: Utility matrix U is sparse
  - Most people have not rated most items
  - Cold start:
    - New items have no ratings
    - New users have no history
• Three approaches to recommender systems:
  - 1) Content-based
  - 2) Collaborative
  - 3) Latent factor based

Today!


• Main idea: Recommend items to customer x similar to previous items rated highly by x

Example:
• Movie recommendations
  - Recommend movies with same actor(s), director, genre, …
• Websites, blogs, news
  - Recommend other sites with “similar” content

[Figure: plan of action. The items a user likes (e.g., red items, circles, triangles) are described by item profiles; from these a user profile is built; the user profile is matched against item profiles to recommend new items.]


• For each item, create an item profile
• Profile is a set (vector) of features
  - Movies: author, title, actor, director, …
  - Text: Set of “important” words in document
• How to pick important features?
  - Usual heuristic from text mining is TF-IDF (Term frequency * Inverse Doc Frequency)
  - Term … Feature
  - Document … Item

$f_{ij}$ = frequency of term (feature) i in doc (item) j

$TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}$

Note: we normalize TF to discount for “longer” documents

$n_i$ = number of docs that mention term i
N = total number of docs

$IDF_i = \log \frac{N}{n_i}$

TF-IDF score: $w_{ij} = TF_{ij} \times IDF_i$
Doc profile = set of words with highest TF-IDF scores, together with their scores
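The TF-IDF scoring above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the textbook definitions TF_ij = f_ij / max_k f_kj and IDF_i = log2(N / n_i), that every document is non-empty, and the function name `tf_idf_profiles`, the `top_n` cutoff, and the toy documents are all made up for the example.

```python
import math
from collections import Counter

def tf_idf_profiles(docs, top_n=3):
    """docs maps an item name to its list of terms; returns item -> {term: score}."""
    n_docs = len(docs)  # N: total number of docs
    # n_i: number of docs that mention term i (each doc counted once per term)
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    profiles = {}
    for name, terms in docs.items():
        counts = Counter(terms)            # f_ij
        max_count = max(counts.values())   # max_k f_kj, normalizes for doc length
        scores = {
            term: (f / max_count) * math.log2(n_docs / doc_freq[term])
            for term, f in counts.items()
        }
        # Doc profile: the top_n highest-scoring terms, together with their scores
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        profiles[name] = dict(top)
    return profiles

# Hypothetical toy corpus
docs = {
    "d1": ["space", "alien", "space", "war"],
    "d2": ["romance", "war", "paris"],
    "d3": ["alien", "romance", "space", "robot"],
}
profiles = tf_idf_profiles(docs)
```

A term that appears in a single document ("paris", "robot") gets the largest IDF, so it dominates that document's profile.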


• User profile possibilities:
  - Weighted average of rated item profiles
  - Variation: weight by difference from average rating for item
  - …
• Prediction heuristic:
  - Given user profile x and item profile i, estimate $u(\mathbf{x}, \mathbf{i}) = \cos(\mathbf{x}, \mathbf{i}) = \frac{\mathbf{x} \cdot \mathbf{i}}{\|\mathbf{x}\| \cdot \|\mathbf{i}\|}$
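One way to read the profile and prediction heuristic is sketched below. It assumes item profiles are equal-length numeric feature vectors and uses the raw ratings as weights for the average; the feature names, items, and ratings are hypothetical.

```python
import math

def weighted_profile(item_profiles, ratings):
    """User profile = average of rated item profiles, weighted by the ratings."""
    total = sum(ratings.values())
    dims = len(next(iter(item_profiles.values())))
    profile = [0.0] * dims
    for item, rating in ratings.items():
        for d, value in enumerate(item_profiles[item]):
            profile[d] += rating * value / total
    return profile

def cosine(a, b):
    """cos(a, b) = a.b / (|a| |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0

# Hypothetical features: [action, comedy, sci-fi]
item_profiles = {
    "m1": [1.0, 0.0, 1.0],
    "m2": [0.0, 1.0, 0.0],
    "m3": [1.0, 0.0, 0.0],
    "m4": [0.0, 1.0, 1.0],
}
ratings = {"m1": 5, "m2": 1}  # the user liked m1 and disliked m2

user = weighted_profile(item_profiles, ratings)
# Score only the items the user has not rated yet
scores = {m: cosine(user, p) for m, p in item_profiles.items() if m not in ratings}
```

Here the action-only movie m3 scores higher than the comedy/sci-fi movie m4, because the profile is dominated by the highly rated m1.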

• +: No need for data on other users
  - No cold-start or sparsity problems
• +: Able to recommend to users with unique tastes
• +: Able to recommend new & unpopular items
  - No first-rater problem
• +: Able to provide explanations
  - Can provide explanations of recommended items by listing content-features that caused an item to be recommended


• -: Finding the appropriate features is hard
  - E.g., images, movies, music
• -: Recommendations for new users
  - How to build a user profile?
• -: Overspecialization
  - Never recommends items outside user’s content profile
  - People might have multiple interests
  - Unable to exploit quality judgments of other users

Harnessing quality judgments of other users

• Consider user x
• Find set N of other users whose ratings are “similar” to x’s ratings
• Estimate x’s ratings based on ratings of users in N

• Let $r_x$ be the vector of user x’s ratings
• Jaccard similarity measure
  - Problem: Ignores the value of the rating
• Cosine similarity measure
  - $sim(x, y) = \cos(r_x, r_y) = \frac{r_x \cdot r_y}{\|r_x\| \cdot \|r_y\|}$
  - Problem: Treats missing ratings as “negative”
• Pearson correlation coefficient
  - $S_{xy}$ = items rated by both users x and y
  - $sim(x, y) = \frac{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)^2} \, \sqrt{\sum_{s \in S_{xy}} (r_{ys} - \bar{r}_y)^2}}$
  - $\bar{r}_x, \bar{r}_y$ … avg. rating of x, y

Example:
r_x = [*, _, _, *, ***]
r_y = [*, _, **, **, _]

r_x, r_y as sets:
r_x = {1, 4, 5}
r_y = {1, 3, 4}

r_x, r_y as points:
r_x = {1, 0, 0, 1, 3}
r_y = {1, 0, 2, 2, 0}
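The three measures can be tried on the example vectors r_x and r_y from this slide. The sketch below makes two assumptions: a 0 marks a missing rating, and each user's average is taken over all items that user rated (not only the common ones). The function names are illustrative.

```python
import math

r_x = [1, 0, 0, 1, 3]  # 0 marks a missing rating
r_y = [1, 0, 2, 2, 0]

def jaccard(a, b):
    """Treat each user as the set of item positions they rated."""
    sa = {i for i, v in enumerate(a) if v}
    sb = {i for i, v in enumerate(b) if v}
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Treat each user as a point; missing ratings count as 0."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def pearson(a, b):
    """Correlation over S_xy, the items rated by both users."""
    common = [i for i in range(len(a)) if a[i] and b[i]]
    rated_a = [v for v in a if v]
    rated_b = [v for v in b if v]
    mean_a = sum(rated_a) / len(rated_a)  # avg. rating of x
    mean_b = sum(rated_b) / len(rated_b)  # avg. rating of y
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = (math.sqrt(sum((a[i] - mean_a) ** 2 for i in common))
           * math.sqrt(sum((b[i] - mean_b) ** 2 for i in common)))
    return num / den if den else 0.0

sim_jaccard = jaccard(r_x, r_y)  # |{1,4}| / |{1,3,4,5}| with 1-based item ids
sim_cosine = cosine(r_x, r_y)
sim_pearson = pearson(r_x, r_y)
```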


• Intuitively we want: sim(A, B) > sim(A, C)
• Jaccard similarity: 1/5 < 2/4
• Cosine similarity: 0.380 > 0.322
  - Considers missing ratings as “negative”
  - Solution: subtract the (row) mean
  - sim A,B vs. A,C: 0.092 > -0.559
  - Notice cosine sim. is correlation when data is centered at 0

Cosine sim:
$sim(x, y) = \frac{\sum_i r_{xi} \cdot r_{yi}}{\sqrt{\sum_i r_{xi}^2} \cdot \sqrt{\sum_i r_{yi}^2}}$
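The ratings table behind these numbers is not part of this transcript. The sketch below assumes the ratings of users A, B, C from the corresponding example in the Mining of Massive Datasets book (Harry Potter, Twilight, and Star Wars titles); with those assumed values, the code reproduces the Jaccard, cosine, and centered-cosine figures quoted above.

```python
import math

# Assumed ratings (taken from the textbook example, not from this transcript)
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}

def jaccard(u, v):
    return len(u.keys() & v.keys()) / len(u.keys() | v.keys())

def cosine(u, v):
    # Items missing from either user contribute 0 to the dot product
    dot = sum(u[i] * v[i] for i in u.keys() & v.keys())
    norm = (math.sqrt(sum(r * r for r in u.values()))
            * math.sqrt(sum(r * r for r in v.values())))
    return dot / norm

def centered(u):
    """Subtract the user's (row) mean from each of their ratings."""
    mean = sum(u.values()) / len(u)
    return {i: r - mean for i, r in u.items()}

A, B, C = ratings["A"], ratings["B"], ratings["C"]
jac_ab, jac_ac = jaccard(A, B), jaccard(A, C)
cos_ab, cos_ac = cosine(A, B), cosine(A, C)
cen_ab = cosine(centered(A), centered(B))
cen_ac = cosine(centered(A), centered(C))
```

Only the centered cosine ranks the pairs the intuitive way by a wide margin: after subtracting the row mean, A's low Star Wars rating and C's high one pull in opposite directions.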

From similarity metric to recommendations:
• Let $r_x$ be the vector of user x’s ratings
• Let N be the set of k users most similar to x who have rated item i
• Prediction for item i of user x:
  - $r_{xi} = \frac{1}{k} \sum_{y \in N} r_{yi}$
  - $r_{xi} = \frac{\sum_{y \in N} s_{xy} \cdot r_{yi}}{\sum_{y \in N} s_{xy}}$
  - Other options?
• Many other tricks possible…

Shorthand: $s_{xy} = sim(x, y)$
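Both prediction options can be sketched as one function. The data layout (nested dicts), the function name, and the toy similarities are assumptions made for illustration; the plain average divides by the number of neighbors actually found, which equals k whenever at least k users rated the item.

```python
def predict_user_user(sims, ratings, x, item, k=2, weighted=True):
    """Predict r_xi from the k users most similar to x who have rated `item`.

    sims[x][y] is s_xy; ratings[y][item] is r_yi.
    """
    raters = [y for y in ratings if y != x and item in ratings[y]]
    neighbors = sorted(raters, key=lambda y: sims[x][y], reverse=True)[:k]  # N
    if not neighbors:
        return None  # nobody rated the item: no prediction possible
    if not weighted:
        return sum(ratings[y][item] for y in neighbors) / len(neighbors)
    total = sum(sims[x][y] for y in neighbors)
    if total == 0:
        return None
    return sum(sims[x][y] * ratings[y][item] for y in neighbors) / total

# Hypothetical similarities and ratings
sims = {"x": {"a": 0.9, "b": 0.3, "c": 0.1}}
ratings = {"x": {}, "a": {"i": 5}, "b": {"i": 3}, "c": {"i": 1}}

plain = predict_user_user(sims, ratings, "x", "i", k=2, weighted=False)
weighted = predict_user_user(sims, ratings, "x", "i", k=2, weighted=True)
```

With k = 2 the neighbors are a and b; the weighted version leans toward a's rating because a is three times as similar to x as b is.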


• So far: User-user collaborative filtering
• Another view: Item-item
  - For item i, find other similar items
  - Estimate rating for item i based on ratings for similar items
  - Can use same similarity metrics and prediction functions as in user-user model

$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$

$s_{ij}$ … similarity of items i and j
$r_{xj}$ … rating of user x on item j
N(i;x) … set of items rated by x similar to i

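Utility matrix (rows: movies, columns: users; "-" = unknown rating, otherwise a rating between 1 and 5):

          users
movie     1  2  3  4  5  6  7  8  9 10 11 12
  1       1  -  3  -  -  5  -  -  5  -  4  -
  2       -  -  5  4  -  -  4  -  -  2  1  3
  3       2  4  -  1  2  -  3  -  4  3  5  -
  4       -  2  4  -  5  -  -  4  -  -  2  -
  5       -  -  4  3  4  2  -  -  -  -  2  5
  6       1  -  3  -  3  -  -  2  -  -  4  -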


Estimate rating of movie 1 by user 5 (unknown in the utility matrix).

Neighbor selection: Identify movies similar to movie 1, rated by user 5

Here we use Pearson correlation as similarity:
1) Subtract mean rating $m_i$ from each movie i
   $m_1$ = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

movie m     1      2      3      4      5      6
sim(1,m)    1.00  -0.18   0.41  -0.10  -0.31   0.59
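The two steps can be sketched as follows. Row 1 of the matrix is given by the centered row quoted above; the column positions of the ratings in rows 2 to 6 are an assumption (a layout that reproduces the sim(1,m) values on the slide), since the extracted text only preserves the order of the ratings within each row.

```python
import math

# Rows: movies 1..6, columns: users 1..12, 0 = unknown rating.
# Cell positions in rows 2-6 are assumed, see the note above.
R = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
]

def center(row):
    """Step 1: subtract the movie's mean rating; unknown entries stay at 0."""
    rated = [r for r in row if r]
    mean = sum(rated) / len(rated)
    return [r - mean if r else 0.0 for r in row]

def cosine(a, b):
    """Step 2: cosine similarity between two centered rows."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

centered = [center(row) for row in R]
sims = [round(cosine(centered[0], row), 2) for row in centered]  # sim(1, m), m = 1..6
```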


Compute similarity weights: $s_{1,3} = 0.41$, $s_{1,6} = 0.59$

Predict by taking weighted average:

$r_{1,5} = (0.41 \cdot 2 + 0.59 \cdot 3) / (0.41 + 0.59) = 2.6$

$r_{ix} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{jx}}{\sum_{j \in N(i;x)} s_{ij}}$
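The neighbor selection and weighted average can be sketched as below. The similarity values and user 5's ratings of movies 3 and 6 come from the slides; user 5's ratings of movies 4 and 5, the choice k = 2, and the function name are assumptions made for the example.

```python
def predict_item_item(sims_to_target, user_ratings, k=2):
    """Weighted average over the k rated items most similar to the target item.

    sims_to_target[j] is s_ij; user_ratings[j] is r_jx for items j rated by x.
    """
    candidates = [j for j in user_ratings if j in sims_to_target]
    neighbors = sorted(candidates, key=lambda j: sims_to_target[j], reverse=True)[:k]
    total = sum(sims_to_target[j] for j in neighbors)
    if not neighbors or total == 0:
        return None
    return sum(sims_to_target[j] * user_ratings[j] for j in neighbors) / total

# sim(1, m) for m = 2..6, as listed on the slide
sims_to_movie1 = {2: -0.18, 3: 0.41, 4: -0.10, 5: -0.31, 6: 0.59}
# Ratings by user 5 (movies 4 and 5 are assumed values)
user5 = {3: 2, 4: 5, 5: 4, 6: 3}

r_15 = predict_item_item(sims_to_movie1, user5, k=2)
```

With k = 2 the neighbors are movies 6 and 3, the only rated movies with positive similarity to movie 1, which gives the 2.6 shown above after rounding.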


• Define similarity $s_{ij}$ of items i and j
• Select k nearest neighbors N(i;x)
  - Items most similar to i, that were rated by x
• Estimate rating $r_{xi}$ as the weighted average:

$r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$

Before:

$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$

Baseline estimate for $r_{xi}$:

$b_{xi} = \mu + b_x + b_i$

• μ = overall mean movie rating
• $b_x$ = rating deviation of user x = (avg. rating of user x) - μ
• $b_i$ = rating deviation of movie i
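A minimal sketch of the baseline-adjusted estimate follows. The deviations b_x and b_i are passed in as precomputed values; the data layout, the fallback to the bare baseline when no neighbor is available, and all numbers are assumptions for illustration.

```python
def predict_with_baseline(x, i, ratings, sims, mu, b_user, b_item, k=2):
    """r_xi = b_xi + sum_j s_ij * (r_xj - b_xj) / sum_j s_ij over j in N(i;x).

    ratings[x][j] is r_xj; sims[(i, j)] is s_ij; b_xj = mu + b_user[x] + b_item[j].
    """
    b_xi = mu + b_user[x] + b_item[i]
    rated = [j for j in ratings[x] if j != i and (i, j) in sims]
    neighbors = sorted(rated, key=lambda j: sims[(i, j)], reverse=True)[:k]  # N(i;x)
    total = sum(sims[(i, j)] for j in neighbors)
    if not neighbors or total == 0:
        return b_xi  # assumption: fall back to the baseline alone
    adjustment = sum(
        sims[(i, j)] * (ratings[x][j] - (mu + b_user[x] + b_item[j]))
        for j in neighbors
    ) / total
    return b_xi + adjustment

# Hypothetical numbers
mu = 3.7
b_user = {"joe": -0.2}                       # joe rates 0.2 below average
b_item = {"A": 0.5, "B": 0.0, "C": -0.3}     # movie A is rated 0.5 above average
ratings = {"joe": {"B": 4, "C": 3}}
sims = {("A", "B"): 0.6, ("A", "C"): 0.4}

pred = predict_with_baseline("joe", "A", ratings, sims, mu, b_user, b_item)
```

The baseline for joe on A is 4.0; joe rated B above and C below his baselines for them, and the similarity-weighted mix of those deviations (+0.22) shifts the estimate to 4.22.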

[Figure: utility matrix with rows Alice, Bob, Carol, David and columns Avatar, LOTR, Matrix, Pirates]

• In practice, it has been observed that item-item often works better than user-user
• Why? Items are simpler, users have multiple tastes


• +: Works for any kind of item
  - No feature selection needed
• -: Cold Start:
  - Need enough users in the system to find a match
• -: Sparsity:
  - The user/ratings matrix is sparse
  - Hard to find users that have rated the same items
• -: First rater:
  - Cannot recommend an item that has not been previously rated
  - New items, Esoteric items
• -: Popularity bias:
  - Cannot recommend items to someone with unique taste
  - Tends to recommend popular items

• Implement two or more different recommenders and combine predictions
  - Perhaps using a linear model
• Add content-based methods to collaborative filtering
  - Item profiles for new item problem
  - Demographics to deal with new user problem
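A linear combination of two recommenders might look like the sketch below. The function name and the fixed weights are hypothetical; in practice the weights would be fit on held-out ratings. Falling back to the content-based score when collaborative filtering has no prediction is one simple way to cover the new-item case.

```python
def hybrid_predict(cf_pred, cb_pred, w_cf=0.7, w_cb=0.3):
    """Blend a collaborative-filtering and a content-based prediction.

    Either input may be None when that recommender cannot score the item
    (e.g. a new item with no ratings has no collaborative prediction).
    """
    if cf_pred is None:
        return cb_pred
    if cb_pred is None:
        return cf_pred
    return w_cf * cf_pred + w_cb * cb_pred

blended = hybrid_predict(4.0, 3.0)      # both recommenders available
new_item = hybrid_predict(None, 3.0)    # only the content-based score exists
```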


• Evaluation
• Error metrics
• Complexity / Speed

[Figure: utility matrix of users by movies, with known ratings between 1 and 5]


[Figure: the same utility matrix of users by movies, with some known ratings withheld (shown as “?”) to form the Test Data Set]

• Compare predictions with known ratings
  - Root-mean-square error (RMSE)
    - $\sqrt{\sum_{xi} (r_{xi} - r_{xi}^*)^2}$ where $r_{xi}$ is predicted, $r_{xi}^*$ is the true rating of x on i
  - Precision at top 10:
    - % of those in top 10
  - Rank Correlation:
    - Spearman’s correlation between system’s and user’s complete rankings
• Another approach: 0/1 model
  - Coverage:
    - Number of items/users for which system can make predictions
  - Precision:
    - Accuracy of predictions
  - Receiver operating characteristic (ROC)
    - Tradeoff curve between false positives and false negatives
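The first three metrics can be sketched for a single user as below. Several choices here are assumptions: the RMSE divides by the number of test ratings, as the word "mean" implies; precision at k is read as the share of the top-k predicted items whose true rating is at least a threshold; and the Spearman formula used is the simple one that assumes no ties and at least two items. The data is hypothetical.

```python
import math

def rmse(pred, truth):
    """Root of the mean squared difference over items present in both dicts."""
    keys = pred.keys() & truth.keys()
    return math.sqrt(sum((pred[k] - truth[k]) ** 2 for k in keys) / len(keys))

def precision_at_k(pred, truth, k=10, threshold=4):
    """Share of the k highest-predicted items that the user truly rates >= threshold."""
    top = sorted(pred, key=pred.get, reverse=True)[:k]
    hits = sum(1 for item in top if truth.get(item, 0) >= threshold)
    return hits / len(top)

def spearman(pred, truth):
    """Spearman rank correlation, 1 - 6*sum(d^2) / (n*(n^2-1)); assumes no ties."""
    items = list(pred.keys() & truth.keys())

    def ranks(scores):
        order = sorted(items, key=lambda i: scores[i], reverse=True)
        return {item: pos for pos, item in enumerate(order, start=1)}

    rank_pred, rank_true = ranks(pred), ranks(truth)
    n = len(items)
    d2 = sum((rank_pred[i] - rank_true[i]) ** 2 for i in items)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical test ratings for one user
pred = {"a": 4.5, "b": 3.0, "c": 2.0, "d": 4.0}
truth = {"a": 5, "b": 3, "c": 1, "d": 4}

err = rmse(pred, truth)
p_at_2 = precision_at_k(pred, truth, k=2)
rho = spearman(pred, truth)
```

In this toy case the ranking is perfect (precision and rank correlation are both 1.0) even though the RMSE is not zero, which illustrates why the metrics can disagree.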


• Narrow focus on accuracy sometimes misses the point
  - Prediction Diversity
  - Prediction Context
  - Order of predictions
• In practice, we care only to predict high ratings:
  - RMSE might penalize a method that does well for high ratings and badly for others

• Expensive step is finding k most similar customers: O(|X|)
• Too expensive to do at runtime
  - Could pre-compute
• Naïve pre-computation takes time O(k·|X|)
  - X … set of customers
• We already know how to do this!
  - Near-neighbor search in high dimensions (LSH)
  - Clustering
  - Dimensionality reduction


• Leverage all the data
  - Don’t try to reduce data size in an effort to make fancy algorithms work
  - Simple methods on large data do best
• Add more data
  - e.g., add IMDB data on genres
• More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html