Collaborative Filtering and Pagerank in a Network. Qiang Yang, HKUST. Thanks: Sonny Chee

Dec 20, 2015

Transcript
Page 1:

Collaborative Filtering and Pagerank in a Network

Qiang Yang, HKUST

Thanks: Sonny Chee

Page 2:

Motivation

Question: A user has already bought some products. What other products should we recommend to that user?

Collaborative Filtering (CF) automates the “circle of advisors”.

Page 3:

Collaborative Filtering

“..people collaborate to help one another perform filtering by recording their reactions...” (Tapestry)

Finds users whose tastes are similar to yours and uses them to make recommendations.

Complementary to IR/IF: IR/IF finds similar documents, whereas CF finds similar users.

Page 4:

Example

Which movie would Sammy watch next? Ratings 1--5.

Users     Starship Trooper (A)  Sleepless in Seattle (R)  MI-2 (A)  Matrix (A)  Titanic (R)
Sammy     3                     4                         3         ?           ?
Beatrice  3                     4                         3         1           1
Dylan     3                     4                         3         3           4
Mathew    4                     2                         3         2           5
John      4                     3                         4         4           4
Basil     5                     1                         5         ?           ?

• If we just use the average of the other users who rated these movies, then we get

• Matrix = 10/4 = 2.5; Titanic = 14/4 = 3.5

• Recommend Titanic!

• But is this reasonable?
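The averaging baseline on this slide can be checked with a few lines of Python. This is only a sketch: the names `TITLES`, `RATINGS`, and `title_average` are made up for illustration, and the "?" entries from the table are encoded as `None`.

```python
# Ratings from the example table; None marks a missing rating ("?").
TITLES = ["Starship Trooper", "Sleepless in Seattle", "MI-2", "Matrix", "Titanic"]
RATINGS = {
    "Sammy":    [3, 4, 3, None, None],
    "Beatrice": [3, 4, 3, 1, 1],
    "Dylan":    [3, 4, 3, 3, 4],
    "Mathew":   [4, 2, 3, 2, 5],
    "John":     [4, 3, 4, 4, 4],
    "Basil":    [5, 1, 5, None, None],
}


def title_average(title):
    """Plain mean of every known rating for `title` (missing ratings are skipped)."""
    col = TITLES.index(title)
    known = [row[col] for row in RATINGS.values() if row[col] is not None]
    return sum(known) / len(known)


for title in ("Matrix", "Titanic"):
    print(title, title_average(title))
```

The plain average gives every rater the same say, no matter how closely that rater's tastes match Sammy's, which is the concern behind the slide's closing question.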

Page 5:

Types of Collaborative Filtering Algorithms

Collaborative Filters

Open Problems: Sparsity, First Rater, Scalability

Page 6:

Statistical Collaborative Filters

Users annotate items with numeric ratings.

Users who rate items “similarly” become mutual advisors.

Recommendation computed by taking a weighted aggregate of advisor ratings.

[Figure: a users-by-items rating matrix (rows U1 ... Un, columns I1 ... Im) next to a users-by-users matrix (rows and columns U1 ... Un).]

Page 7:

Basic Idea

Nearest Neighbor Algorithm

Given a user a and item i:

First, find the most similar users to a; let these be Y.

Second, find how these users (Y) rated i.

Then, calculate a predicted rating of a on i based on some average over all these users Y.

How do we calculate the similarity and the average?
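The three steps can be sketched in Python with the similarity measure left as a plug-in. Everything here is an illustrative assumption rather than the slide's method: the function names, the toy ratings, the placeholder similarity (negative mean absolute difference on co-rated items), the plain unweighted mean, and the choice to consider only users who have actually rated item i.

```python
def overlap_similarity(ra, ru):
    """Placeholder similarity (an assumption): negative mean absolute
    difference over the items both users rated; higher means more similar."""
    common = set(ra) & set(ru)
    if not common:
        return float("-inf")
    return -sum(abs(ra[j] - ru[j]) for j in common) / len(common)


def nearest_neighbor_predict(ratings, a, i, similarity, k=2):
    """Predict user a's rating of item i from the k most similar users."""
    # Only users who rated item i can contribute a rating for it.
    candidates = [u for u in ratings if u != a and i in ratings[u]]
    # Step 1: the k users most similar to a (the set Y).
    Y = sorted(candidates,
               key=lambda u: similarity(ratings[a], ratings[u]),
               reverse=True)[:k]
    if not Y:
        return None
    # Steps 2 and 3: look up how Y rated i and take a plain average.
    return sum(ratings[u][i] for u in Y) / len(Y)


# Hypothetical toy data: user -> {item: rating}.
toy = {
    "a": {"x": 3, "y": 4},
    "b": {"x": 3, "y": 4, "z": 5},
    "c": {"x": 1, "y": 1, "z": 1},
    "d": {"x": 3, "y": 5, "z": 4},
}
print(nearest_neighbor_predict(toy, "a", "z", overlap_similarity, k=2))
```

Swapping in a different `similarity` function or a weighted average changes only the two marked steps, which is why the slide's closing question about similarity and averaging matters.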

Page 8:

Statistical Filters

GroupLens [Resnick et al 94, MIT]

Filters UseNet News postings

Similarity: Pearson correlation

Prediction: Weighted deviation from mean

$$P_{a,i} = \bar{r}_a + \frac{\sum_{u \in Y} (r_{u,i} - \bar{r}_u)\, w_{a,u}}{\sum_{u \in Y} |w_{a,u}|}$$

Page 9:

Pearson Correlation

[Figure: ratings (axis 0 to 7) of Item 1 through Item 5 by User A, User B, and User C.]

Pearson correlation between users:

User   A    B    C
A      1    1   -1
B      1    1   -1
C     -1   -1    1

Page 10:

Pearson Correlation

Weight between users a and u:

Compute the similarity matrix between users.

Use Pearson correlation, with values ranging from -1 through 0 to 1.

Let items be all items that both users rated.

$$w_{a,u} = \frac{\sum_{i \in \text{items}} (r_{a,i} - \bar{r}_a)(r_{u,i} - \bar{r}_u)}{\sqrt{\sum_{i \in \text{items}} (r_{a,i} - \bar{r}_a)^2 \, \sum_{i \in \text{items}} (r_{u,i} - \bar{r}_u)^2}}$$

Pearson correlation between users:

User   A    B    C
A      1    1   -1
B      1    1   -1
C     -1   -1    1
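A minimal Python sketch of the Pearson weight follows. The function name and the data layout are illustrative. One assumption is worth flagging: the means are taken over the co-rated items only, a choice that happens to reproduce the -0.87 and 0.21 entries of the correlation table in the Page 13 example.

```python
from math import sqrt


def pearson(ra, ru):
    """Pearson correlation between two users over the items both have rated.

    ra, ru: dicts mapping item -> rating. Means are computed over the
    co-rated items only (an assumption, see the note above).
    """
    common = [j for j in ra if j in ru]
    if len(common) < 2:
        return 0.0  # too little overlap to say anything
    mean_a = sum(ra[j] for j in common) / len(common)
    mean_u = sum(ru[j] for j in common) / len(common)
    num = sum((ra[j] - mean_a) * (ru[j] - mean_u) for j in common)
    den = sqrt(sum((ra[j] - mean_a) ** 2 for j in common)
               * sum((ru[j] - mean_u) ** 2 for j in common))
    return num / den if den else 0.0  # flat ratings give no signal


# Ratings from the Page 4 table; missing ratings are simply left out.
SAMMY = {"Starship Trooper": 3, "Sleepless in Seattle": 4, "MI-2": 3}
DYLAN = {"Starship Trooper": 3, "Sleepless in Seattle": 4, "MI-2": 3,
         "Matrix": 3, "Titanic": 4}
MATHEW = {"Starship Trooper": 4, "Sleepless in Seattle": 2, "MI-2": 3,
          "Matrix": 2, "Titanic": 5}

for name, other in (("Dylan", DYLAN), ("Mathew", MATHEW)):
    print("Sammy vs", name, round(pearson(SAMMY, other), 2))
```

Returning 0.0 when the overlap is too small or a user's ratings are all equal is a design choice of this sketch; it treats such pairs as uncorrelated rather than raising an error.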

Page 11:

Prediction Generation

Predicts how much user a likes an item i (a stands for the active user).

Make predictions using the weighted deviation from the mean:

$$P_{a,i} = \bar{r}_a + \frac{\sum_{u \in Y} (r_{u,i} - \bar{r}_u)\, w_{a,u}}{\sum_{u \in Y} |w_{a,u}|} \qquad (1)$$

The denominator $\sum_{u \in Y} |w_{a,u}|$ is the sum of all weights, taken in absolute value.
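Equation (1) translates into a short Python function. This is a sketch under assumptions: the function name and argument layout are invented here, each user's mean is taken over all of that user's known ratings, and the weights 1 and -0.87 are copied from the correlation table in the Page 13 example rather than recomputed.

```python
def predict(active_mean, advisors):
    """Weighted deviation from the mean, as in equation (1).

    active_mean: mean rating of the active user a.
    advisors: list of (rating_on_item, advisor_mean, weight) triples,
              one per advisor u in Y.
    """
    norm = sum(abs(w) for _, _, w in advisors)
    if norm == 0:
        return active_mean  # no usable advisors: fall back to a's mean
    return active_mean + sum((r - m) * w for r, m, w in advisors) / norm


# Sammy's known ratings are 3, 4, 3; Dylan's mean is 3.4, Mathew's is 3.2.
sammy_mean = (3 + 4 + 3) / 3
matrix = predict(sammy_mean, [(3, 3.4, 1.0), (2, 3.2, -0.87)])
titanic = predict(sammy_mean, [(4, 3.4, 1.0), (5, 3.2, -0.87)])
print(round(matrix, 2), round(titanic, 2))
```

With these inputs the sketch gives roughly 3.68 for Matrix and 2.82 for Titanic, close to the 3.6 and 2.8 in the Page 13 example, which rounds Sammy's mean to 3.3 before adding the correction. Note how Mathew's negative weight turns his low Matrix rating into a positive contribution.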

Page 12:

Error Estimation

Mean Absolute Error (MAE) for user a, over the N items predicted for a:

$$MAE_a = \frac{1}{N} \sum_{i=1}^{N} |P_{a,i} - r_{a,i}|$$

Standard deviation of the errors, over K users:

$$\sigma = \sqrt{\frac{\sum_{a=1}^{K} (MAE_a - \overline{MAE})^2}{K}}$$
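Both error measures are easy to compute with Python's standard `statistics` module. The sketch below uses the predicted and actual ratings for Sammy and Basil from the Page 13 example; the helper name `mae` and the dictionary layout are assumptions made for illustration.

```python
from statistics import mean, pstdev


def mae(predicted, actual):
    """Mean absolute error between paired predictions and actual ratings."""
    return mean(abs(p - r) for p, r in zip(predicted, actual))


# (Matrix, Titanic) predictions and actual ratings from the Page 13 example.
per_user = {
    "Sammy": mae([3.6, 2.8], [3, 4]),
    "Basil": mae([4.6, 4.1], [4, 5]),
}
overall = mean(per_user.values())
# pstdev divides by K (population form), matching the slide's formula.
spread = pstdev(list(per_user.values()))
print(per_user, overall, spread)
```

`pstdev` is used rather than `stdev` because the slide's formula divides by K, not K-1.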

Page 13:

Example

Correlation between users ($w_{a,u}$):

Users    Sammy   Dylan   Mathew
Sammy    1       1       -0.87
Dylan    1       1       0.21
Mathew   -0.87   0.21    1

Predictions, actual ratings, and MAE:

         Prediction        Actual
Users    Matrix  Titanic   Matrix  Titanic   MAE
Sammy    3.6     2.8       3       4         0.9
Basil    4.6     4.1       4       5         0.75

Average MAE = 0.83

$$P_{Sammy,Matrix} = \bar{r}_{Sammy} + \frac{(r_{Dylan,Matrix} - \bar{r}_{Dylan})\, w_{Sammy,Dylan} + (r_{Mathew,Matrix} - \bar{r}_{Mathew})\, w_{Sammy,Mathew}}{|w_{Sammy,Dylan}| + |w_{Sammy,Mathew}|}$$

$$= 3.3 + \frac{(3 - 3.4) \cdot 1 + (2 - 3.2) \cdot (-0.87)}{1 + 0.87} = 3.6$$

Page 14:

Open Problems in CF

“Sparsity Problem”: CFs have poor accuracy and coverage in comparison to population averages at low rating density [GSK+99].

“First Rater Problem” (cold start problem): The first person to rate an item receives no benefit. CF depends upon altruism. [AZ97]

Page 15:

Open Problems in CF

“Scalability Problem”: CF is computationally expensive.

Fastest published algorithms (nearest-neighbor) are n^2.

Is there any indexing method for speeding this up? The question has received relatively little attention.

Page 16:

The PageRank Algorithm

Fundamental question to ask: what is the importance level I(P) of a page P?

Information Retrieval: Cosine + TF-IDF does not take the related hyperlinks into account.

Link based: Important pages (nodes) have many other pages linking to them. Important pages also point to other important pages.

Page 17:

The Google Crawler Algorithm

“Efficient Crawling Through URL Ordering”, Junghoo Cho, Hector Garcia-Molina, Lawrence Page, Stanford. http://www.www8.org http://www-db.stanford.edu/~cho/crawler-paper/

“Modern Information Retrieval”, BY-RN, pages 380-382.

Lawrence Page, Sergey Brin. The Anatomy of a Search Engine. The Seventh International WWW Conference (WWW 98). Brisbane, Australia, April 14-18, 1998. http://www.www7.org

Page 18:

Page Rank Metric

$$IR(P) = (1 - d) + d \sum_{i=1}^{N} \frac{IR(T_i)}{C_i}$$

[Figure: web page P with in-links from pages T1, T2, ..., TN; annotations C=2 and d=0.9.]

• Let 1-d be the probability that a user randomly jumps to page P.

• “d” is the damping factor. (1-d) is the likelihood of arriving at P by random jumping.

• Let N be the in-degree of P.

• Let Ci be the number of out-links (out-degree) of each Ti.
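A single evaluation of the metric for one page can be written as a small Python function. This is a sketch: the name `ir_update` and the list-of-pairs input are assumptions, and the sample inputs are made-up numbers rather than values read off the figure.

```python
def ir_update(in_links, d=0.9):
    """One evaluation of IR(P) = (1 - d) + d * sum(IR(T_i) / C_i).

    in_links: list of (IR(T_i), C_i) pairs, one per page T_i linking to P,
              where C_i is the number of out-links of T_i.
    d: damping factor.
    """
    return (1 - d) + d * sum(ir_t / c for ir_t, c in in_links)


# A page with no in-links keeps only the random-jump share (1 - d).
print(ir_update([]))
# One in-link from a page with IR 1/3 and a single out-link.
print(ir_update([(1 / 3, 1)]))
# Two in-links (hypothetical values): 0.4 split over 2 out-links, 0.1 over 1.
print(ir_update([(0.4, 2), (0.1, 1)]))
```

Dividing by C_i means a page that links to many others passes on only a fraction of its importance to each of them.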

Page 19:

How to compute page rank?

For a given network of web pages:

Initialize the page rank of all pages (to one).

Set the parameter d (d=0.90).

Iterate through the network L times.

Page 20:

Example: iteration K=1

[Figure: a network of three nodes A, B, and C.]

IR(P)=1/3 for all nodes, d=0.9

node  IR
A     1/3
B     1/3
C     1/3

Page 21:

Example: k=2

[Figure: the network of nodes A, B, and C.]

node  IR
A     0.4
B     0.1
C     0.55

$$IR(P) = 0.1 + 0.9 \sum_{i=1}^{l} \frac{IR(T_i)}{C_i}$$

l is the in-degree of P.

Note: A, B, C’s IR values are updated in the order A, then B, then C. Use the new value of A when calculating B, etc.

Page 22:

Example: k=2 (normalize)

[Figure: the network of nodes A, B, and C.]

node  IR
A     0.38
B     0.095
C     0.52
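The example can be replayed in Python. The figure with the actual links is not part of this transcript, so the edge set below (A links to C, B links to C, C links to A) is an assumption, chosen because it is consistent with the numbers on the slides. The function names are likewise illustrative.

```python
def iterate_page_rank(out_links, ir, d=0.9, passes=1):
    """Update IR values in place, node by node, reusing fresh values.

    out_links: dict node -> list of nodes it links to.
    ir: dict node -> starting IR value; nodes are updated in this dict's order.
    """
    ir = dict(ir)  # work on a copy
    for _ in range(passes):
        for p in ir:
            incoming = [t for t, targets in out_links.items() if p in targets]
            ir[p] = (1 - d) + d * sum(ir[t] / len(out_links[t]) for t in incoming)
    return ir


def normalize(ir):
    """Rescale the IR values so they sum to 1."""
    total = sum(ir.values())
    return {p: v / total for p, v in ir.items()}


# Assumed edges (not given in the transcript): A -> C, B -> C, C -> A.
OUT_LINKS = {"A": ["C"], "B": ["C"], "C": ["A"]}
start = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

after = iterate_page_rank(OUT_LINKS, start)
print({p: round(v, 3) for p, v in after.items()})
print({p: round(v, 3) for p, v in normalize(after).items()})
```

Under this assumed edge set, one pass gives A=0.4, B=0.1, C=0.55, and normalizing by their sum (1.05) gives about 0.38, 0.095, and 0.52, matching the two tables. C's value depends on the already updated A and B, which is the in-order update the note on the k=2 slide describes.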

Page 23:

Crawler Control

All crawlers maintain several queues of URLs to pursue next.

Google initially maintains 500 queues.

Each queue corresponds to a web site being crawled.

Important considerations:

Limited buffer space

Limited time

Avoid overloading target sites

Avoid overloading network traffic

Page 24:

Crawler Control

Thus, it is important to visit important pages first.

Let G be a lower bound threshold on IR(P).

Crawl and Stop: select only pages with IR>G to crawl, and stop after K pages have been crawled.
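One way to sketch Crawl and Stop in Python is with a priority queue ordered by importance. This is an illustration under assumptions, not Google's crawler: the link graph is an in-memory dictionary instead of real fetches, the IR value of every page is assumed to be known up front (a real crawler would have to estimate it), and all names and toy values are hypothetical.

```python
import heapq


def crawl_and_stop(seeds, links, importance, G, K):
    """Crawl at most K pages, most important first, skipping pages with IR <= G.

    seeds: starting URLs.
    links: dict url -> list of URLs found on that page (stands in for fetching).
    importance: dict url -> IR value, assumed known in advance.
    """
    # heapq is a min-heap, so store negated importance to pop the largest first.
    frontier = [(-importance.get(u, 0.0), u) for u in seeds
                if importance.get(u, 0.0) > G]
    heapq.heapify(frontier)
    seen = {u for _, u in frontier}
    crawled = []
    while frontier and len(crawled) < K:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen and importance.get(nxt, 0.0) > G:
                seen.add(nxt)
                heapq.heappush(frontier, (-importance[nxt], nxt))
    return crawled


# Hypothetical toy web.
LINKS = {"a": ["b", "c", "d"], "b": ["e"], "c": [], "d": [], "e": []}
IMPORTANCE = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.05, "e": 0.6}
print(crawl_and_stop(["a"], LINKS, IMPORTANCE, G=0.1, K=3))
```

In this toy run, page "d" is never queued because its IR is below G, and page "e" is discovered but not crawled because the budget of K pages is used up first.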

Page 25:

Test Result: 179,000 pages

[Figure: percentage of the Stanford Web crawled vs. PST, the percentage of hot pages visited so far.]

Page 26:

Google Algorithm (very simplified)

First, compute the page rank of each page on the WWW. This step is query independent.

Then, in response to a query q, return the pages that contain q and have the highest page ranks.

A problem/feature of Google: it favors big commercial sites.
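The two-step scheme can be sketched as a filter followed by a sort. This mirrors only the very simplified description on the slide: the function name, the case-insensitive substring match, and the toy pages and ranks are all assumptions made for illustration.

```python
def search(query, pages, page_rank, top_n=10):
    """Return up to top_n URLs whose text contains the query, best rank first.

    pages: dict url -> page text.
    page_rank: dict url -> precomputed, query-independent page rank.
    """
    q = query.lower()
    hits = [url for url, text in pages.items() if q in text.lower()]
    return sorted(hits, key=lambda url: page_rank[url], reverse=True)[:top_n]


# Hypothetical toy index.
PAGES = {
    "u1": "Collaborative filtering tutorial",
    "u2": "Filtering water at home",
    "u3": "PageRank explained",
}
RANKS = {"u1": 0.2, "u2": 0.5, "u3": 0.3}
print(search("filtering", PAGES, RANKS))
```

Because the ranks are computed before any query arrives, the ordering here ignores how well a page matches the query: in the toy index, the water-filter page outranks the tutorial for the query "filtering" simply because its rank is higher.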