PageRank and Similar Ideas - Stanford University (snap.stanford.edu/class/cs246-2011/slides/11-trustrank.pdf)

Jul 10, 2020

dariahiddleston

PageRank and Similar Ideas

• Topic-Sensitive PageRank
• Spam: TrustRank, Spam Mass
• SimRank
• HITS (Hubs and Authorities)

Topic-Sensitive PageRank

• Random Walkers
• Teleport Sets
• Deducing Relevant Topics

The Walkers

[Figure, five frames: random walkers on a three-page graph (Yahoo, M’soft, Amazon). The first four frames are titled “The Walkers”; the last is titled “In the Limit …”.]

Topic-Specific PageRank

• Goal: evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g., “sports” or “history.”
• Allows search queries to be answered based on the interests of the user.
  • Example: the query “Trojan” should return different pages depending on whether you are interested in sports or history.

Teleport Sets

• Assume each walker has a small probability of “teleporting” at any tick.
• Teleport can go to:
  1. Any page with equal probability.
     • To avoid dead-end and spider-trap problems.
  2. A topic-specific set of “relevant” pages (teleport set).
     • For topic-sensitive PageRank.

Example: Topic = Software

• Only Microsoft is in the teleport set.
• Assume 20% “tax”: each walker teleports with probability 0.2 at every tick, i.e., β = 0.8.

Only Microsoft in Teleport Set

[Figure, seven frames: walkers on the graph Yahoo, M’soft, Amazon, with teleporting only to M’soft. The first frame is annotated “Dr. Who’s phone booth.”]

Matrix Formulation

• $A_{ij} = \beta M_{ij} + (1-\beta)/|S|$ if $i$ is in $S$; $A_{ij} = \beta M_{ij}$ otherwise. Here $S$ is the teleport set.
• Compute as for regular PageRank:
  • Multiply by M, then add a vector.
  • Maintains sparseness.
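
The "multiply by M, then add a vector" step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's implementation: the function name `topic_sensitive_pagerank`, the lowercase page names, and the link structure (borrowed from the adjacency matrix in the HITS example later in these slides, because the walker figures did not survive extraction) are all hypothetical. β = 0.8 corresponds to the 20% "tax."

```python
def topic_sensitive_pagerank(links, teleport_set, beta=0.8, iters=100):
    """Power iteration for r = beta * M r + (1 - beta) * e_S / |S|."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # Teleport step: only pages in the teleport set S receive the "tax".
        new = {p: (1.0 - beta) / len(teleport_set) if p in teleport_set else 0.0
               for p in pages}
        # Link-following step: each page splits beta * rank over its out-links.
        for src, outs in links.items():
            for dst in outs:
                new[dst] += beta * rank[src] / len(outs)
        rank = new
    return rank


# Hypothetical graph (assumption): y -> y, a, m; a -> y, m; m -> a.
links = {
    "yahoo": ["yahoo", "amazon", "msoft"],
    "amazon": ["yahoo", "msoft"],
    "msoft": ["amazon"],
}
ranks = topic_sensitive_pagerank(links, teleport_set={"msoft"})
```

Because the teleport contribution is added as a vector after the sparse multiplication, the dense matrix A is never built, which is the sense in which the computation maintains sparseness.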

Discovering the Topic

• Create different PageRanks for different topics.
  • E.g., the 16 DMOZ top-level nodes.
• Several ways to guess what topic the queryer is interested in:
  • Words in previous pages viewed.
  • Bookmarked pages.
  • Expressed preferences.

Link Spam

• History of Spam
• Spam Farms
• TrustRank
• Spam Mass

What is Web Spam?

• Spamming = any deliberate action taken solely to boost a Web page’s position in search engine results, incommensurate with the page’s real value.
• Spam = pages created for spamming.
  • SEO industry might disagree!
  • SEO = search engine optimization.
• Approximately 10-15% of Web pages are spam.

Early Search Engines

1.  Crawl the Web (follow links from page to page, finding and copying as many pages as they could).

2.  Index pages by the words they contained.

3.  Respond to search queries (lists of words) with the pages containing those words.

Early Page Ranking

• Attempt to order pages matching a search query by “importance.”
• First search engines considered:
  1. Number of times query words appeared.
  2. Prominence of word position, e.g., title, header.

The First Spammers

• As people began to use search engines to find things on the Web, those with commercial interests tried to exploit search engines to bring people to their own sites, whether they wanted to be there or not.
• Example: a shirt-seller might pretend to be about “movies.”

The First Spammers – (2)

• How do you make your page appear to be about movies?
• Add the word “movie” 1000 times to your page.
• Set its color to the background color, so only search engines would see it.

The First Spammers – (3)

• Or, run the query “movie” on your target search engine.
• See what page came first in the listings.
• Copy it into your page, invisibly.
• These and similar techniques are term spam.

The First Spammers – (4)

• Rapidly, the promise of search engines disappeared.
• Spam dominated the listings to the extent that responses to search queries were useless.

The Google Solution to Term Spam

1. Believe what people say about you, rather than what you say about yourself.
   • Consider words in the anchor text (words that appear underlined to represent the link) and its surrounding text.
2. PageRank as a tool to measure the “importance” of Web pages.

Why Google Works

• Our hypothetical shirt-seller loses.
  • His page isn’t very important, so it won’t be ranked high for shirts or movies.
• Saying he is about movies doesn’t help, because others don’t say he is about movies.

Simple Spam Techniques Fail

• Example: shirt-seller creates 1000 pages, each of which links to his with “movie” in the anchor text.
• These pages have no links in, so they get little PageRank.
• So the shirt-seller can’t beat truly important movie pages like IMDB.

Round 2: Link Spam

• Once Google became the dominant search engine, spammers began to work out ways to fool Google.
• Spam farms were developed to concentrate PageRank on a single page.

Structure of a Typical Spam Farm

[Figure labels: target page; links from outside; millions of farm pages.]

Farm Pages

• Even with taxation, farm pages can preserve most of the PageRank that the farm starts with.
• The farm also amplifies externally supplied PageRank by a significant factor.

External Links

• Where do external links come from?
  • Blog pages allow spammers to add comments, e.g., “I agree. See www.mySpamFarm.com.”

Combating Link Spam

1. Detection and blacklisting of structures that look like spam farms.
   • Leads to another war: hiding and detecting spam farms.
2. TrustRank = topic-specific PageRank with a teleport set of “trusted” pages.
   • Example: .edu domain, plus similar domains for non-US schools.

Web-Spam Taxonomy

• We follow the treatment by Gyongyi and Garcia-Molina [2004].
• Boosting techniques
  • Techniques for achieving high relevance/importance for a Web page
• Hiding techniques
  • Techniques to hide the use of boosting from humans and Web crawlers

Boosting Techniques

• Term spamming
  • Manipulating the text of Web pages in order to appear relevant to queries
• Link spamming
  • Creating link structures that boost PageRank or hubs-and-authorities scores

Term Spamming

• Repetition
  • of one or a few specific terms, e.g., free, cheap, Viagra
• Dumping
  • of a large number of unrelated terms, e.g., copying entire dictionaries
• Weaving
  • Copy legitimate pages and insert spam terms at random positions (to hide the spamming)
• Phrase Stitching
  • Glue together sentences and phrases from different sources (also hides spamming)

Detecting Term Spam

• Analyze text using statistical methods, e.g., Naïve Bayes classifiers
  • Similar to email spam filtering
• Also useful: detecting approximate duplicate pages

Link Spam

• Three kinds of Web pages from a spammer’s point of view:
  • Inaccessible pages
  • Accessible pages
    • e.g., blog comment pages
    • spammer can post links to his pages
  • Own pages
    • Completely controlled by spammer
    • May span multiple domain names

Link Farms

• Spammer’s goal
  • Maximize the PageRank of target page t
• Technique
  • Get as many links from accessible pages as possible to target page t
  • Construct “link farm” to get PageRank multiplier effect

Link Farms

[Figure: three groups of pages labeled Inaccessible, Accessible, and Own, with target page t and farm pages 1, 2, ..., M.]

One of the most common and effective organizations for a link farm.

Analysis

• Suppose rank contributed by accessible pages = $x$.
• Let PageRank of target page = $y$.
• Rank of each “farm” page = $\beta y/M + (1-\beta)/N$.
• $y = x + \beta M\,[\beta y/M + (1-\beta)/N] + (1-\beta)/N$
  $\;\; = x + \beta^2 y + \beta(1-\beta)M/N + (1-\beta)/N$
  • The last term, $(1-\beta)/N$, is very small; ignore it.
• $y = x/(1-\beta^2) + cM/N$, where $c = \beta/(1+\beta)$.

[Figure: link farm with Inaccessible, Accessible, and Own pages; target page t and farm pages 1, 2, ..., M.]

Analysis

• $y = x/(1-\beta^2) + cM/N$, where $c = \beta/(1+\beta)$.
• For $\beta = 0.85$, $1/(1-\beta^2) = 3.6$.
  • Multiplier effect for “acquired” PageRank.
• By making $M$ large, we can make $y$ as large as we want.
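
The closed form can be checked numerically. The sketch below iterates the two farm equations to their fixed point and compares the result with $x/(1-\beta^2) + cM/N$. The function names and the values chosen for x, M, and N are hypothetical, and N is assumed to be the total number of pages on the Web, which the slide uses without defining.

```python
def farm_target_rank(x, M, N, beta=0.85, iters=200):
    """Iterate the farm equations until the target's rank y settles."""
    y = 0.0
    for _ in range(iters):
        farm_page = beta * y / M + (1 - beta) / N       # rank of each farm page
        y = x + beta * M * farm_page + (1 - beta) / N   # rank of target page t
    return y


def closed_form(x, M, N, beta=0.85):
    """y = x / (1 - beta^2) + c * M / N, dropping the small (1 - beta) / N term."""
    c = beta / (1 + beta)
    return x / (1 - beta ** 2) + c * M / N


# Hypothetical values (assumptions): external rank x, farm size M, Web size N.
x, M, N = 1e-6, 1_000_000, 1_000_000_000
exact = farm_target_rank(x, M, N)
approx = closed_form(x, M, N)
multiplier = 1 / (1 - 0.85 ** 2)   # about 3.6, as on the slide
```

The two results differ only by the ignored term, which is on the order of 1/N.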

TrustRank Idea

• Basic principle: approximate isolation
  • It is rare for a “good” page to point to a “bad” (spam) page
• Sample a set of “seed pages” from the Web
• Have an oracle (human) identify the good pages and the spam pages in the seed set
  • Expensive task, so must make seed set as small as possible

Trust Propagation

• Call the subset of seed pages that are identified as “good” the “trusted pages.”
• Perform a topic-sensitive PageRank with teleport set = trusted pages.
• Use a threshold value and mark all pages below the trust threshold as spam.
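
The three steps above can be sketched as follows. This is an illustration under assumptions rather than the published TrustRank code: the five-page graph, the page names, the single trusted seed, and the threshold of 0.01 are all hypothetical. The graph respects approximate isolation (no good page links to a spam page), while spam page `s1` links to a good page.

```python
def trustrank(links, trusted, beta=0.85, iters=100):
    """Topic-sensitive PageRank whose teleport set is the trusted pages."""
    score = {p: (1.0 / len(trusted) if p in trusted else 0.0) for p in links}
    for _ in range(iters):
        new = {p: ((1 - beta) / len(trusted) if p in trusted else 0.0)
               for p in links}
        for src, outs in links.items():
            for dst in outs:
                new[dst] += beta * score[src] / len(outs)
        score = new
    return score


# Hypothetical graph (assumption): g* are good pages, s* are spam pages.
links = {
    "g1": ["g2", "g3"],
    "g2": ["g1"],
    "g3": ["g1", "g2"],
    "s1": ["s2", "g1"],   # spam may link to good pages
    "s2": ["s1"],
}
trust = trustrank(links, trusted={"g1"})

THRESHOLD = 0.01          # hypothetical cut-off
flagged = sorted(p for p, t in trust.items() if t < THRESHOLD)
```

No trust ever flows into `s1` or `s2`, so they end up below any positive threshold. On a real Web graph, good pages far from the seed set would also score low, which is why the choice of seed set matters.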

Picking the Seed Set

• Two conflicting considerations:
  • A human has to inspect each seed page, so the seed set must be as small as possible.
  • Must ensure every “good page” gets adequate TrustRank, so we need to make all good pages reachable from the seed set by short paths.

Approaches to Picking Seed Set

1. Pick top pages by PageRank.
   • Theory is that you can’t get a bad page’s rank really high.
2. Use domains whose membership is controlled, like .edu, .mil, .gov.

Spam Mass

• In the TrustRank model, we start with good pages and propagate trust.
• Complementary view: what fraction of a page’s PageRank comes from “spam” pages?
• In practice, we don’t know all the spam pages, so we need to estimate.

Spam Mass Estimation

• $r(p)$ = PageRank of page $p$
• $r^+(p)$ = PageRank of $p$ with teleport into “good” pages only = TrustRank
• $r^-(p) = r(p) - r^+(p)$
• Spam mass of $p$ = $r^-(p)/r(p)$
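
A minimal sketch of the estimate, under stated assumptions: the graph and the set of known good pages are hypothetical, and the good-only teleport vector is assumed to keep the same per-page weight 1/N as ordinary PageRank so that $r^+(p) \le r(p)$. The slide does not spell out this scaling.

```python
def pagerank(links, teleport, beta=0.85, iters=200):
    """Power iteration for r = beta * M r + (1 - beta) * teleport."""
    rank = dict(teleport)
    for _ in range(iters):
        new = {p: (1 - beta) * teleport[p] for p in links}
        for src, outs in links.items():
            for dst in outs:
                new[dst] += beta * rank[src] / len(outs)
        rank = new
    return rank


# Hypothetical graph (assumption): g* are known good pages, s* are spam pages.
links = {
    "g1": ["g2", "g3"],
    "g2": ["g1"],
    "g3": ["g1", "g2"],
    "s1": ["s2", "g1"],
    "s2": ["s1"],
}
good = {"g1", "g2", "g3"}
n = len(links)

r = pagerank(links, {p: 1.0 / n for p in links})                        # r(p)
r_plus = pagerank(links, {p: (1.0 / n if p in good else 0.0) for p in links})  # r+(p)
spam_mass = {p: (r[p] - r_plus[p]) / r[p] for p in links}               # r-(p) / r(p)
```

In this toy graph the spam pages get spam mass 1, because none of their PageRank originates from good pages, while `g1` gets a small positive value from the link `s1` points at it.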

SimRank

Random Walks from a Fixed Node on k-Partite Graphs

SimRank

• Setting: a k-partite graph with k types of nodes.
  • Example: picture nodes and tag nodes.
• Perform a random walk with restart from a particular node N.
  • I.e., teleport set = {N}.
• Resulting probability distribution measures similarity to N.

SimRank – (2)

• Problem: must be done once for each node of one type.
• But suitable for sub-Web-scale applications.
• Example: CleverSense measures similarity of the 400K US restaurants by key phrases in their reviews.
  • Startup based on CS345A.

Example: Similarity to Pict. 1

[Figure, ten frames on a graph of picture nodes (Pict. 1, Pict. 2, Pict. 3) and tag nodes (“Sky”, “Tree”): “Similarity to Pict. 1”, then alternating “Walk One Step” / “Tax 20%” frames through the fourth step, ending with “In the Limit”.]

Example: In the Limit

Node      Limit probability
Pict. 1   .345
Pict. 2   .066
Pict. 3   .145
“Sky”     .249
“Tree”    .196

Pict. 3 is more similar to Pict. 1 than Pict. 2 is.
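
The walk-then-tax iteration can be sketched as a random walk with restart. The edge set below is an assumption: the figure did not survive extraction, so the graph was reconstructed as the one in which Pict. 1 and Pict. 3 carry both tags and Pict. 2 carries only “Sky”. Under that assumption the iteration reproduces the limit values on the slide to three decimals. The function name is hypothetical.

```python
def random_walk_with_restart(neighbors, start, beta=0.8, iters=200):
    """Walk one step along an edge, then return the 20% "tax" to the start node."""
    prob = {n: (1.0 if n == start else 0.0) for n in neighbors}
    for _ in range(iters):
        new = {n: 0.0 for n in neighbors}
        new[start] = 1.0 - beta                  # teleport set = {start}
        for node, nbrs in neighbors.items():
            for nbr in nbrs:
                new[nbr] += beta * prob[node] / len(nbrs)
        prob = new
    return prob


# Assumed undirected picture/tag graph (reconstructed, not given in the text).
neighbors = {
    "Pict. 1": ["Sky", "Tree"],
    "Pict. 2": ["Sky"],
    "Pict. 3": ["Sky", "Tree"],
    "Sky": ["Pict. 1", "Pict. 2", "Pict. 3"],
    "Tree": ["Pict. 1", "Pict. 3"],
}
sim = random_walk_with_restart(neighbors, "Pict. 1")
```

Pict. 3 shares both tags with Pict. 1, so it receives more probability than Pict. 2, which shares only one.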

Hubs and Authorities

• Matrix Formulation
• Bipartite Cores and Secondary Cores

Hubs and Authorities

• HITS (Hypertext-Induced Topic Selection) is a measure of importance of pages or documents, similar to PageRank.
  • Proposed at approximately the same time (1998).
  • But never changed the world.

HITS Model

• Interesting documents fall into two classes:
  • Authorities are pages containing useful information.
    • E.g., course home pages.
  • Hubs are pages that link to authorities.
    • E.g., an on-line list of links to CS courses.

Idealized view

[Figure: idealized view with two groups of pages labeled Hubs and Authorities.]

Mutually Recursive Definition

• A good hub links to many good authorities.
• A good authority is linked from many good hubs.
• Model using two scores for each node:
  • Hub score and Authority score
  • Represented as vectors h and a

Transition Matrix A

• HITS uses a matrix A[i, j] = 1 if page i links to page j, 0 if not.
• $A^T$, the transpose of A, is similar to the PageRank matrix M, but $A^T$ has 1’s where M has fractions.

Example

[Figure: graph on Yahoo, Amazon, M’soft.]

        y  a  m
    y   1  1  1
A = a   1  0  1
    m   0  1  0

Hub and Authority Equations

• The hub score of page P is proportional to the sum of the authority scores of the pages it links to.
  • $h = \lambda A a$
  • Constant $\lambda$ is a scale factor.
• The authority score of page P is proportional to the sum of the hub scores of the pages it is linked from.
  • $a = \mu A^T h$
  • Constant $\mu$ is a second scale factor.

Iterative Algorithm

• Initialize h to all 1’s.
• $a = A^T h$
• Scale a so that its max entry is 1.0.
• $h = A a$
• Scale h so that its max entry is 1.0.
• Continue until h, a converge.

Example

      1 1 1
A =   1 0 1
      0 1 0

      1 1 0
A^T = 1 0 1
      1 1 0

Successive values of a:

a(yahoo)  = 1   1     1     ...  1
a(amazon) = 1   4/5   0.75  ...  0.732
a(m’soft) = 1   1     1     ...  1

Successive values of h:

h(yahoo)  = 1   1     1     1     ...  1.000
h(amazon) = 1   2/3   0.71  0.73  ...  0.732
h(m’soft) = 1   1/3   0.29  0.27  ...  0.268
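
The iterative algorithm applied to this example can be sketched in plain Python. The matrix A is the one from the slides (rows and columns ordered yahoo, amazon, m’soft); the function name `hits` and the fixed iteration count are assumptions made for illustration.

```python
def hits(A, iters=50):
    """Alternate a = A^T h and h = A a, scaling each so its max entry is 1.0."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        # a = A^T h: authority of j sums the hub scores of pages linking to j.
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        top = max(a)
        a = [v / top for v in a]
        # h = A a: hub score of i sums the authority scores of pages i links to.
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        top = max(h)
        h = [v / top for v in h]
    return h, a


A = [
    [1, 1, 1],   # yahoo  -> yahoo, amazon, m'soft
    [1, 0, 1],   # amazon -> yahoo, m'soft
    [0, 1, 0],   # m'soft -> amazon
]
h, a = hits(A)
# Slide's limit values: h = (1.000, 0.732, 0.268), a = (1, 0.732, 1)
```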

Existence and Uniqueness

• $h = \lambda A a$
• $a = \mu A^T h$
• $h = \lambda\mu A A^T h$
• $a = \lambda\mu A^T A a$

Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:

• h* is the principal eigenvector of the matrix $A A^T$
• a* is the principal eigenvector of the matrix $A^T A$
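
As a numerical check of these statements on the three-page example, the sketch below runs power iteration directly on $A A^T$ and $A^T A$. The helper names are hypothetical, and pure-Python lists stand in for a matrix library to keep the example self-contained.

```python
def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    Yt = transpose(Y)
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt] for row in X]


def principal_eigenpair(S, iters=100):
    """Power iteration with max-entry scaling, as in the iterative algorithm."""
    v = [1.0] * len(S)
    eigenvalue = 0.0
    for _ in range(iters):
        w = [sum(s * x for s, x in zip(row, v)) for row in S]
        eigenvalue = max(w)   # approaches the eigenvalue as v converges (max entry of v is 1)
        v = [x / eigenvalue for x in w]
    return v, eigenvalue


A = [[1, 1, 1], [1, 0, 1], [0, 1, 0]]   # example matrix from the slides
h_star, lam_h = principal_eigenpair(matmul(A, transpose(A)))   # A A^T
a_star, lam_a = principal_eigenpair(matmul(transpose(A), A))   # A^T A
```

Both matrices yield the same largest eigenvalue, and `h_star` and `a_star` agree with the limits of the dual iteration in the example, (1, 0.732, 0.268) and (1, 0.732, 1).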

Bipartite Cores

[Figure: hubs and authorities grouped into two bipartite cores.]

• Most densely-connected core (primary core)
• Less densely-connected core (secondary core)

Secondary Cores

• A single topic can have many bipartite cores
  • corresponding to different meanings, or points of view
    • abortion: pro-choice, pro-life
    • evolution: Darwinian, intelligent design
    • jaguar: auto, Mac, NFL team, Panthera onca
• How to find such secondary cores?

Non-Primary Eigenvectors

• $A A^T$ and $A^T A$ have the same set of eigenvalues.
  • An eigenpair is the pair of eigenvectors with the same eigenvalue.
  • The primary eigenpair (largest eigenvalue) is what we get from the iterative algorithm.
• Non-primary eigenpairs correspond to other bipartite cores.
  • The eigenvalue is a measure of the density of links in the core.

Finding Secondary Cores

• Once we find the primary core, we can remove its links from the graph.
• Repeat HITS algorithm on residual graph to find the next bipartite core.
• Technically, not exactly equivalent to non-primary eigenpair approach.
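
A toy sketch of the remove-and-repeat idea follows. Everything specific in it is an assumption: the slides do not say how to decide which nodes belong to a core, so a hypothetical score threshold of 0.5 is used, and the graph consists of two disjoint complete bipartite cores (3 hubs by 3 authorities, and 2 hubs by 2 authorities) chosen so that the outcome is easy to follow.

```python
def hits(links, nodes, iters=50):
    """HITS on an edge list; returns (hub, authority) dicts scaled to max 1."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 0.0 for n in nodes}
    for _ in range(iters):
        auth = {n: 0.0 for n in nodes}
        for src, dst in links:
            auth[dst] += hub[src]
        top = max(auth.values()) or 1.0
        auth = {n: v / top for n, v in auth.items()}
        hub = {n: 0.0 for n in nodes}
        for src, dst in links:
            hub[src] += auth[dst]
        top = max(hub.values()) or 1.0
        hub = {n: v / top for n, v in hub.items()}
    return hub, auth


def core(hub, auth, threshold=0.5):
    """Hypothetical core rule: nodes whose score exceeds the threshold."""
    hubs = {n for n, v in hub.items() if v > threshold}
    auths = {n for n, v in auth.items() if v > threshold}
    return hubs, auths


# Two disjoint bipartite cores (assumed toy graph).
primary_links = [(f"H{i}", f"A{j}") for i in (1, 2, 3) for j in (1, 2, 3)]
secondary_links = [(f"H{i}", f"A{j}") for i in (4, 5) for j in (4, 5)]
links = primary_links + secondary_links
nodes = {n for edge in links for n in edge}

first = core(*hits(links, nodes))
# Remove the primary core's links, then rerun HITS on the residual graph.
residual = [(s, d) for s, d in links if not (s in first[0] and d in first[1])]
second = core(*hits(residual, nodes))
```

On the full graph the scores of the smaller core decay toward zero, so only the larger core is reported; once its links are removed, the second run surfaces the smaller one.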