13 Rec Sys - SJTU

5/19/17


Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

High dim. data
- Locality sensitive hashing
- Clustering
- Dimensionality reduction

Graph data
- PageRank, SimRank
- Community Detection
- Spam Detection

Infinite data
- Filtering data streams
- Web advertising
- Queries on streams

Machine learning
- SVM
- Decision Trees
- Perceptron, kNN

Apps
- Recommender systems
- Association Rules
- Duplicate document detection

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


- Customer X
  - Buys Metallica CD
  - Buys Megadeth CD
- Customer Y
  - Does search on Metallica
  - Recommender system suggests Megadeth from data collected about customer X


Items reach users through two channels: search and recommendations.

Examples of items: products, web sites, blogs, news items, …


- Shelf space is a scarce commodity for traditional retailers
  - Also: TV networks, movie theaters, …
- Web enables near-zero-cost dissemination of information about products
  - From scarcity to abundance
- More choice necessitates better filters
  - Recommendation engines
  - How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html


Source: Chris Anderson (2004)



Read http://www.wired.com/wired/archive/12.10/tail.html to learn more!

- Editorial and hand curated
  - List of favorites
  - Lists of “essential” items
- Simple aggregates
  - Top 10, Most Popular, Recent Uploads
- Tailored to individual users
  - Amazon, Netflix, …



- X = set of Customers
- S = set of Items
- Utility function u: X × S → R
  - R = set of ratings
  - R is a totally ordered set
  - e.g., 0-5 stars, real number in [0,1]


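[Figure: sparse utility matrix with rows Alice, Bob, Carol, David and columns Avatar, LOTR, Matrix, Pirates; a few cells hold known ratings in [0,1], the rest are unknown]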



- (1) Gathering “known” ratings for matrix
  - How to collect the data in the utility matrix
- (2) Extrapolate unknown ratings from the known ones
  - Mainly interested in high unknown ratings
    - We are not interested in knowing what you don’t like but what you like
- (3) Evaluating extrapolation methods
  - How to measure success/performance of recommendation methods


- Explicit
  - Ask people to rate items
  - Doesn’t work well in practice: people can’t be bothered
- Implicit
  - Learn ratings from user actions
    - E.g., purchase implies high rating
  - What about low ratings?



- Key problem: Utility matrix U is sparse
  - Most people have not rated most items
  - Cold start:
    - New items have no ratings
    - New users have no history
- Three approaches to recommender systems:
  - 1) Content-based
  - 2) Collaborative
  - 3) Latent factor based


Today!


- Main idea: Recommend items to customer x similar to previous items rated highly by x

Example:
- Movie recommendations
  - Recommend movies with same actor(s), director, genre, …
- Websites, blogs, news
  - Recommend other sites with “similar” content


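[Figure: plan of action. From the items a user likes, build item profiles (e.g., red, circles, triangles) and from them a user profile; match the user profile against item profiles to recommend new items]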



- For each item, create an item profile
- Profile is a set (vector) of features
  - Movies: author, title, actor, director, …
  - Text: Set of “important” words in document
- How to pick important features?
  - Usual heuristic from text mining is TF-IDF (Term frequency * Inverse Doc Frequency)
    - Term … Feature
    - Document … Item


f_ij = frequency of term (feature) i in doc (item) j

$TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}$

Note: we normalize TF to discount for “longer” documents

n_i = number of docs that mention term i
N = total number of docs

$IDF_i = \log \frac{N}{n_i}$

TF-IDF score: $w_{ij} = TF_{ij} \times IDF_i$

Doc profile = set of words with highest TF-IDF scores, together with their scores
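The TF-IDF item profile can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the usual definitions TF_ij = f_ij / max_k f_kj and IDF_i = log(N / n_i), uses the natural logarithm, and assumes every document is non-empty. The function name `tf_idf_profiles`, the `top_k` parameter, and the toy documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf_profiles(docs, top_k=3):
    """docs: dict mapping item id -> non-empty list of terms (features)."""
    n_docs = len(docs)  # N
    # n_i: number of docs that mention term i (each doc counted once per term)
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    profiles = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)            # f_ij
        max_count = max(counts.values())   # max_k f_kj (normalizes for length)
        scores = {
            term: (f / max_count) * math.log(n_docs / doc_freq[term])
            for term, f in counts.items()
        }
        # Keep the top_k highest-scoring terms together with their scores.
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        profiles[doc_id] = dict(top)
    return profiles


# Hypothetical toy corpus.
docs = {
    "d1": ["space", "alien", "space", "war"],
    "d2": ["love", "war", "love"],
    "d3": ["space", "love", "robot"],
}
profiles = tf_idf_profiles(docs, top_k=2)
```

In this toy corpus "alien" outranks "space" in d1 even though "space" occurs twice, because "alien" appears in only one document and so has the larger IDF.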


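- User profile possibilities:
  - Weighted average of rated item profiles
  - Variation: weight by difference from average rating for item
  - …
- Prediction heuristic:
  - Given user profile x and item profile i, estimate $u(\mathbf{x}, \mathbf{i}) = \cos(\mathbf{x}, \mathbf{i}) = \frac{\mathbf{x} \cdot \mathbf{i}}{\|\mathbf{x}\| \cdot \|\mathbf{i}\|}$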
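As a rough sketch of the two steps above, the following Python builds a user profile as a rating-weighted average of item profiles and scores candidate items by cosine similarity. The feature layout, the helper names `user_profile` and `cosine`, and the sample data are assumptions made for illustration; returning 0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math


def cosine(x, i):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(x, i))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_i = math.sqrt(sum(b * b for b in i))
    if norm_x == 0 or norm_i == 0:
        return 0.0  # convention chosen here for empty profiles
    return dot / (norm_x * norm_i)


def user_profile(rated_items):
    """Rating-weighted average of item profiles.

    rated_items: list of (rating, item_vector) pairs with positive ratings.
    """
    total = sum(rating for rating, _ in rated_items)
    dim = len(rated_items[0][1])
    return [
        sum(rating * vec[d] for rating, vec in rated_items) / total
        for d in range(dim)
    ]


# Hypothetical features: [action, comedy, sci-fi]
rated = [(5, [1, 0, 1]), (1, [0, 1, 0])]
profile = user_profile(rated)
candidates = {"space_opera": [1, 0, 1], "sitcom": [0, 1, 0]}
scores = {name: cosine(profile, vec) for name, vec in candidates.items()}
best = max(scores, key=scores.get)
```

Because the user rated the action/sci-fi item much higher, the profile leans toward those features and the matching candidate gets the higher score.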


- +: No need for data on other users
  - No cold-start or sparsity problems
- +: Able to recommend to users with unique tastes
- +: Able to recommend new & unpopular items
  - No first-rater problem
- +: Able to provide explanations
  - Can provide explanations of recommended items by listing content-features that caused an item to be recommended



- –: Finding the appropriate features is hard
  - E.g., images, movies, music
- –: Recommendations for new users
  - How to build a user profile?
- –: Overspecialization
  - Never recommends items outside user’s content profile
  - People might have multiple interests
  - Unable to exploit quality judgments of other users


Harnessing quality judgments of other users


- Consider user x
- Find set N of other users whose ratings are “similar” to x’s ratings
- Estimate x’s ratings based on ratings of users in N



- Let r_x be the vector of user x’s ratings
- Jaccard similarity measure
  - Problem: Ignores the value of the rating
- Cosine similarity measure
  - $sim(x, y) = \cos(r_x, r_y) = \frac{r_x \cdot r_y}{\|r_x\| \cdot \|r_y\|}$
  - Problem: Treats missing ratings as “negative”
- Pearson correlation coefficient
  - S_xy = items rated by both users x and y
  - $sim(x, y) = \frac{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)^2} \, \sqrt{\sum_{s \in S_{xy}} (r_{ys} - \bar{r}_y)^2}}$
  - $\bar{r}_x$, $\bar{r}_y$ … avg. rating of x, y

Example:
r_x = [*, _, _, *, ***]
r_y = [*, _, **, **, _]

r_x, r_y as sets:
r_x = {1, 4, 5}
r_y = {1, 3, 4}

r_x, r_y as points:
r_x = (1, 0, 0, 1, 3)
r_y = (1, 0, 2, 2, 0)
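The three measures can be compared on the example vectors r_x = (1, 0, 0, 1, 3) and r_y = (1, 0, 2, 2, 0). The sketch below is illustrative only: it encodes a missing rating as 0, and it assumes that the "avg. rating" of a user is taken over all items that user rated (the slides do not say whether the mean is over all rated items or only the co-rated ones).

```python
import math


def jaccard(rx, ry):
    """Jaccard similarity of the sets of rated items (0 = not rated)."""
    sx = {i for i, r in enumerate(rx) if r}
    sy = {i for i, r in enumerate(ry) if r}
    return len(sx & sy) / len(sx | sy)


def cosine(rx, ry):
    """Cosine similarity, treating missing ratings as 0."""
    dot = sum(a * b for a, b in zip(rx, ry))
    norm_x = math.sqrt(sum(a * a for a in rx))
    norm_y = math.sqrt(sum(b * b for b in ry))
    return dot / (norm_x * norm_y)


def pearson(rx, ry):
    """Pearson correlation summed over S_xy, the items rated by both users."""
    # Assumption: each user's mean is over all items that user rated.
    mean_x = sum(rx) / sum(1 for r in rx if r)
    mean_y = sum(ry) / sum(1 for r in ry if r)
    common = [i for i in range(len(rx)) if rx[i] and ry[i]]
    num = sum((rx[i] - mean_x) * (ry[i] - mean_y) for i in common)
    den_x = math.sqrt(sum((rx[i] - mean_x) ** 2 for i in common))
    den_y = math.sqrt(sum((ry[i] - mean_y) ** 2 for i in common))
    return num / (den_x * den_y)


rx = [1, 0, 0, 1, 3]
ry = [1, 0, 2, 2, 0]
sim_jaccard = jaccard(rx, ry)   # 2 shared items out of 4 rated by either user
sim_cosine = cosine(rx, ry)
sim_pearson = pearson(rx, ry)
```

Jaccard sees only which items were rated, so it gives 2/4 here regardless of the star counts; cosine and Pearson both use the rating values.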


- Intuitively we want: sim(A,B) > sim(A,C)
- Jaccard similarity: 1/5 < 2/4
- Cosine similarity: 0.380 > 0.322
  - Considers missing ratings as “negative”
  - Solution: subtract the (row) mean
- After subtracting the row mean, sim A,B vs. A,C: 0.092 > -0.559
- Notice cosine sim. is correlation when data is centered at 0

Cosine sim:

$sim(x, y) = \frac{\sum_i r_{xi} \cdot r_{yi}}{\sqrt{\sum_i r_{xi}^2} \cdot \sqrt{\sum_i r_{yi}^2}}$
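The numbers quoted above can be reproduced with a short script. The ratings table for users A, B and C did not survive in this text, so the values below are an assumption: they are taken from the corresponding running example in the Mining of Massive Datasets textbook (items HP1, HP2, HP3, TW, SW1, SW2, SW3), and they do yield 0.380, 0.322, 0.092 and -0.559. A 0 stands for a missing rating.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def center(row):
    """Subtract the row mean from the known ratings; missing ratings stay 0."""
    rated = [r for r in row if r != 0]
    mean = sum(rated) / len(rated)
    return [r - mean if r != 0 else 0.0 for r in row]


# Assumed ratings (columns: HP1, HP2, HP3, TW, SW1, SW2, SW3), 0 = missing.
A = [4, 0, 0, 5, 1, 0, 0]
B = [5, 5, 4, 0, 0, 0, 0]
C = [0, 0, 0, 2, 4, 5, 0]

raw_ab = cosine(A, B)                       # about 0.380
raw_ac = cosine(A, C)                       # about 0.322
centered_ab = cosine(center(A), center(B))  # about 0.092
centered_ac = cosine(center(A), center(C))  # about -0.559
```

Plain cosine barely separates B from C because the zeros for missing ratings act like low ratings. After centering, a missing rating sits at the user's average, and A and C (who disagree on TW and SW1) come out negatively correlated.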

From similarity metric to recommendations:
- Let r_x be the vector of user x’s ratings
- Let N be the set of k users most similar to x who have rated item i
- Prediction for item i of user x:
  - $r_{xi} = \frac{1}{k} \sum_{y \in N} r_{yi}$
  - $r_{xi} = \frac{\sum_{y \in N} s_{xy} \cdot r_{yi}}{\sum_{y \in N} s_{xy}}$
  - Other options?
- Many other tricks possible…

Shorthand: $s_{xy} = sim(x, y)$
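Both prediction rules can be sketched as follows. This is a minimal illustration under stated assumptions: ratings are stored as nested dictionaries, similarities s_xy are precomputed and positive for the selected neighbors, and the user names, item names and numbers are invented for the example.

```python
def predict(ratings, sims, x, i, k=2):
    """Return (simple average, similarity-weighted average) for user x on item i.

    ratings: dict user -> dict item -> rating
    sims:    dict (x, y) -> s_xy, assumed precomputed
    """
    # N: the k users most similar to x among those who have rated item i.
    raters = [y for y in ratings if y != x and i in ratings[y]]
    neighbors = sorted(raters, key=lambda y: sims[(x, y)], reverse=True)[:k]

    simple = sum(ratings[y][i] for y in neighbors) / len(neighbors)
    weighted = (
        sum(sims[(x, y)] * ratings[y][i] for y in neighbors)
        / sum(sims[(x, y)] for y in neighbors)
    )
    return simple, weighted


# Hypothetical data.
ratings = {
    "x": {"a": 4},
    "y1": {"a": 5, "i": 4},
    "y2": {"a": 3, "i": 2},
    "y3": {"i": 5},
}
sims = {("x", "y1"): 0.9, ("x", "y2"): 0.3, ("x", "y3"): 0.1}
simple, weighted = predict(ratings, sims, "x", "i", k=2)
```

With k = 2 the neighbors are y1 and y2. The simple average is 3.0, while the weighted average is pulled toward y1's rating (3.5) because y1 is far more similar to x.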


- So far: User-user collaborative filtering
- Another view: Item-item
  - For item i, find other similar items
  - Estimate rating for item i based on ratings for similar items
  - Can use same similarity metrics and prediction functions as in user-user model

$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x similar to i

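Utility matrix of movies × users (known ratings are between 1 and 5; "." marks an unknown rating):

         users
         1  2  3  4  5  6  7  8  9 10 11 12
movie 1  1  .  3  .  .  5  .  .  5  .  4  .
movie 2  .  .  5  4  .  .  4  .  .  2  1  3
movie 3  2  4  .  1  2  .  3  .  4  3  5  .
movie 4  .  2  4  .  5  .  .  4  .  .  2  .
movie 5  .  .  4  3  4  2  .  .  .  .  2  5
movie 6  1  .  3  .  3  .  .  2  .  .  4  .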



Goal: estimate the rating of movie 1 by user 5 (the cell marked ? in the rating matrix).

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as similarity:
1) Subtract mean rating m_i from each movie i
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

sim(1,m) for m = 1, …, 6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59


Compute similarity weights: s_{1,3} = 0.41, s_{1,6} = 0.59

Predict by taking weighted average:

r_{1,5} = (0.41*2 + 0.59*3) / (0.41 + 0.59) = 2.6

$r_{ix} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{jx}}{\sum_{j \in N(i;x)} s_{ij}}$
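The worked example can be checked end to end with the sketch below. The rating matrix is an assumption in one respect: the cell positions were reconstructed from the garbled slide text, guided by the centered row 1 and the similarity values quoted above, which this matrix reproduces. A 0 marks an unknown rating, and k = 2 neighbors are used as in the example.

```python
import math

# Rows are movies 1-6, columns are users 1-12; 0 marks an unknown rating.
# Cell positions are reconstructed (see the note above), not copied verbatim.
R = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
]


def center(row):
    """Subtract the movie's mean rating from its known ratings."""
    rated = [r for r in row if r != 0]
    mean = sum(rated) / len(rated)
    return [r - mean if r != 0 else 0.0 for r in row]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def predict(R, movie, user, k=2):
    """Item-item estimate using the k most similar movies the user has rated."""
    centered = [center(row) for row in R]
    sims = [cosine(centered[movie], row) for row in centered]
    rated = [j for j in range(len(R)) if j != movie and R[j][user] != 0]
    neighbors = sorted(rated, key=lambda j: sims[j], reverse=True)[:k]
    num = sum(sims[j] * R[j][user] for j in neighbors)
    den = sum(sims[j] for j in neighbors)
    return sims, neighbors, num / den


# Movie 1 and user 5 are index 0 and index 4.
sims, neighbors, estimate = predict(R, movie=0, user=4)
```

The selected neighbors are movies 6 and 3 (indices 5 and 2), and the estimate comes out at roughly 2.59, which rounds to the 2.6 shown above. The small difference from a hand calculation comes from using unrounded similarities.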


- Define similarity s_ij of items i and j
- Select k nearest neighbors N(i;x)
  - Items most similar to i, that were rated by x
- Estimate rating r_xi as the weighted average:

Before:

$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$

In practice, with a baseline estimate:

$r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$

Baseline estimate for r_xi:

$b_{xi} = \mu + b_x + b_i$

- μ = overall mean movie rating
- b_x = rating deviation of user x = (avg. rating of user x) – μ
- b_i = rating deviation of movie i
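A small numeric sketch may help show what the baseline term does. All numbers below are invented for illustration, and the deviations b_x and b_i are simply given rather than estimated from data; the function names are hypothetical.

```python
def baseline(mu, b_x, b_i):
    """b_xi = mu + b_x + b_i."""
    return mu + b_x + b_i


def predict_with_baseline(b_xi, neighbors):
    """Baseline plus the similarity-weighted average of neighbor residuals.

    neighbors: list of (s_ij, r_xj, b_xj) for each item j in N(i; x)
    """
    num = sum(s * (r - b) for s, r, b in neighbors)
    den = sum(s for s, _, _ in neighbors)
    return b_xi + num / den


mu = 3.5    # overall mean rating (made up)
b_x = 0.5   # user x rates 0.5 above average (made up)
b_i = -0.3  # movie i is rated 0.3 below average (made up)
b_xi = baseline(mu, b_x, b_i)  # 3.7

# Two neighbor items j rated by x: (s_ij, r_xj, b_xj)
neighbors = [
    (0.4, 5.0, baseline(mu, b_x, 0.5)),   # x rated it 0.5 above its baseline
    (0.6, 3.0, baseline(mu, b_x, -0.5)),  # x rated it 0.5 below its baseline
]
estimate = predict_with_baseline(b_xi, neighbors)
```

Here the weighted residual is 0.4·0.5 + 0.6·(−0.5) = −0.1, so the estimate is 3.7 − 0.1 = 3.6. The neighbors only adjust the baseline, rather than determining the rating on their own.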

[Figure: utility matrix with rows Alice, Bob, Carol, David and columns Avatar, LOTR, Matrix, Pirates]


- In practice, it has been observed that item-item often works better than user-user
- Why? Items are simpler, users have multiple tastes


- (+) Works for any kind of item
  - No feature selection needed
- (-) Cold Start:
  - Need enough users in the system to find a match
- (-) Sparsity:
  - The user/ratings matrix is sparse
  - Hard to find users that have rated the same items
- (-) First rater:
  - Cannot recommend an item that has not been previously rated
  - New items, Esoteric items
- (-) Popularity bias:
  - Cannot recommend items to someone with unique taste
  - Tends to recommend popular items

- Implement two or more different recommenders and combine predictions
  - Perhaps using a linear model
- Add content-based methods to collaborative filtering
  - Item profiles for new item problem
  - Demographics to deal with new user problem



- Evaluation
- Error metrics
- Complexity / Speed


[Figure: utility matrix of known ratings, users × movies]

[Figure: the same matrix with some known ratings withheld (shown as ?) to form the Test Data Set]


- Compare predictions with known ratings
  - Root-mean-square error (RMSE)
    - $\sqrt{\frac{1}{N} \sum_{xi} (r_{xi} - r^*_{xi})^2}$, where r_xi is predicted, r*_xi is the true rating of x on i, and N is the number of test ratings
  - Precision at top 10:
    - % of those in top 10
  - Rank Correlation:
    - Spearman’s correlation between system’s and user’s complete rankings
- Another approach: 0/1 model
  - Coverage:
    - Number of items/users for which system can make predictions
  - Precision:
    - Accuracy of predictions
  - Receiver operating characteristic (ROC)
    - Tradeoff curve between false positives and false negatives
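RMSE over a held-out test set can be computed as below. This is a minimal sketch: the dictionary layout keyed by (user, item) and the sample ratings are assumptions for illustration, and the error is averaged over the number of test ratings before taking the square root.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error over the (user, item) pairs in the test set.

    predicted, actual: dicts keyed by (user, item); every key of `actual`
    is assumed to be present in `predicted`.
    """
    squared_errors = [(predicted[key] - actual[key]) ** 2 for key in actual]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical withheld ratings and the system's predictions for them.
actual = {("u1", "m1"): 4.0, ("u1", "m2"): 2.0, ("u2", "m1"): 5.0}
predicted = {("u1", "m1"): 3.5, ("u1", "m2"): 2.5, ("u2", "m1"): 4.0}
error = rmse(predicted, actual)
```

The squared errors are 0.25, 0.25 and 1.0, so the single one-star miss contributes two thirds of the total, which illustrates how RMSE weights large errors more heavily than small ones.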



- Narrow focus on accuracy sometimes misses the point
  - Prediction Diversity
  - Prediction Context
  - Order of predictions
- In practice, we care only to predict high ratings:
  - RMSE might penalize a method that does well for high ratings and badly for others


- Expensive step is finding k most similar customers: O(|X|)
- Too expensive to do at runtime
  - Could pre-compute
- Naïve pre-computation takes time O(k·|X|)
  - X … set of customers
- We already know how to do this!
  - Near-neighbor search in high dimensions (LSH)
  - Clustering
  - Dimensionality reduction



- Leverage all the data
  - Don’t try to reduce data size in an effort to make fancy algorithms work
  - Simple methods on large data do best
- Add more data
  - e.g., add IMDB data on genres
- More data beats better algorithms
  - http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

