VIPAS: Virtual Link Powered Authority Search in the Web Chi-Chun Lin and Ming-Syan Chen Network Database Laboratory National Taiwan University
Jan 22, 2016

Page 1: VIPAS: Virtual Link Powered Authority Search in the Web

VIPAS: Virtual Link Powered Authority Search in the Web

Chi-Chun Lin and Ming-Syan Chen
Network Database Laboratory
National Taiwan University

M.-S. Chen NTU 2

Outline
- Motivation and Goal
- Preliminaries and Related Work
- Introduction to Link-analysis
- Defects of Traditional Link-analysis and Ideas for Improvement
- System Framework and Algorithms
- Implementation and Experimental Results
- Conclusions

Motivation and Goal
- To find the most relevant pages satisfying the user’s information need in the Web
- Traditional means for this task
  - Keyword-based search engines
  - Problem: some relevant pages do not contain the keywords in the page text
- An alternative method
  - Analyze the links contained in Web pages instead of ranking by keywords

HITS (1/3)
- Authority pages: a page pointed to by many other pages
- Hub pages: a page pointing to many other pages
- Mutual reinforcement
  - An authority pointed to by many hub pages is an even better authority
  - A hub pointing to many authority pages is an even better hub
  - Based on this argument, the goal of HITS is to find the set of best authority pages

HITS (2/3)
Let $x_p$ and $y_p$ denote the authority and hub score of page $p$, respectively.
- Authority score (pages q1, q2, q3 point to page p): $x_p := \sum_{q : q \to p} y_q$
- Hub score (page p points to pages q1, q2, q3): $y_p := \sum_{q : p \to q} x_q$

HITS (3/3)
Iterative algorithm
1. Obtain a set of Web pages using a keyword-based query and expand it to form a base set
2. Assign each page of the base set an initial authority and hub score of 1
3. According to its links, update the scores of each page
4. Normalize the scores so that $\sum_p (x_p)^2 = 1$ and $\sum_p (y_p)^2 = 1$, summing over all p in the base set
5. Do steps 3 and 4 iteratively until the scores converge
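The iteration above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `hits`, the `links` dictionary format, the stopping tolerance, and the toy graph are all assumptions, and step 1 (building the base set from a keyword query) is left out. The sketch also assumes that hub scores are computed from the freshly updated authority scores, which the slides do not specify.

```python
import math

def hits(links, max_iter=100, tol=1e-9):
    """Run the HITS iteration on `links`: {page: [pages it points to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    # Step 2: every page starts with authority and hub score 1.
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        # Step 3a: x_p := sum of y_q over all q that point to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Step 3b: y_p := sum of x_q over all q that p points to
        # (assumption: uses the authority scores just computed).
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        # Step 4: scale each score vector to unit sum of squares.
        for scores in (new_auth, new_hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
        # Step 5: stop once the scores no longer change noticeably.
        delta = max(abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p])
                    for p in pages)
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return auth, hub

# Hypothetical toy base set: three hubs pointing at two candidate authorities.
toy_links = {"h1": ["a", "b"], "h2": ["a"], "h3": ["a", "b"]}
auth, hub = hits(toy_links)
print(sorted(auth, key=auth.get, reverse=True)[:2])
```

In the toy graph, page `a` is pointed to by all three hubs and so ends up with the highest authority score.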

The Problem with HITS
- Links in Web pages only reflect page creators’ judgment
- Sometimes a link will not be put in the page even though its destination is very relevant
  - e.g., a company’s homepage will not link to its competitors in the same industry
- We argue: page readers’ consideration should be of equal importance

The Notion of Virtual Links
- The basic idea
  - Identify pages that are heavily accessed within a period, and form a “hot set” from these pages
  - Create “virtual links” for pages in the hot set and incorporate them into the computation of authority scores
- Design a Web warehouse for this task and utilize it to identify authoritative Web pages

System Framework
[Figure: system architecture. Components: Page Archive, Keyword & Ranking Database, Authority Evaluator, Query Interface, Clickstream Database, Clicking Observer, Virtual Link Creator. Input: Web Pages. Data-flow labels: page content & links, keywords, scores, virtual links, query results.]

Creating Virtual Links
- Scenario: a user interested in Java-related Web pages came to our system
  - She submitted a query with the keyword “java”
  - Assume that the query result contains 100 URLs
  - She clicked the top 1-10 of the 100 URLs except the 6th
- The hot set consists of the 9 URLs clicked

Creating Virtual Links (cont’d)
[Figure: 2 criteria for creating virtual links to the URLs in the result list (nodes shown: URL 1, URL 2, URL 5, URL 6, URL 7, URL 10): links from a single Virtual Hub, or links from Hub 1, Hub 2, ..., Hub n.]

Algorithm VIPAS (Virtual Link Powered Authority Search)
Initialization Phase
1. For a query term, perform the regular HITS analysis
2. Collect a base set of pages with computed authority and hub scores and store them in the database
Virtual Link Collection Phase
3. Monitor the user behavior to see whether a URL in the list is clicked by the user or not
4. After a period of user behavior observation, put URLs that are often accessed into the “hot set”
5. Create virtual links for pages in the hot set

Algorithm VIPAS (cont’d)
Refinement Phase
6. For each page in the hot set, compute its new authority and hub scores
7. Run several iterations of score updating for pages in the base set
2 flavors
- VIPAS-VH (VIPAS with virtual links from a Virtual Hub)
- VIPAS-TH (VIPAS with virtual links from Top Hubs)

Finding Hot Sets
1. In an observing period, pay attention to clicks of continuous URLs in the list
2. When a user continuously clicks several URLs and then skips some of the URLs that follow, we mark those that have been skipped
3. Exclude pages marked with a frequency greater than a given threshold from the forming of hot sets
4. Among the pages left, those that are accessed by at least a given percentage of users are put into the hot set
Note: some relevant URLs that have already been browsed by the user will be skipped

Finding Hot Sets (cont’d)
Example 1: URL 4 is marked
1. http://java.sun.com/ (clicked)
2. http://www.sun.com/java/ (clicked)
3. http://www.javaworld.com/ (clicked)
4. http://java.oreilly.com/ (skipped)
5. http://www.jars.com/ (clicked)
6. …………..

Example 2: URL 4 is marked, but URL 1 is not
1. http://java.sun.com/ (skipped)
2. http://www.sun.com/java/ (clicked)
3. http://www.javaworld.com/ (clicked)
4. http://java.oreilly.com/ (skipped)
5. http://www.jars.com/ (clicked)
6. …………..
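The marking and filtering steps can be sketched as follows. This is one possible reading, offered under explicit assumptions: "several" continuous clicks is taken to mean at least two (`min_run`), an unclicked URL counts as skipped only if a later URL in the list was clicked, and frequencies are measured as fractions of observed sessions. The names `marked_urls`, `find_hot_set`, `max_mark_freq`, and `min_access_frac`, and the threshold values in the example, are hypothetical and do not come from the slides.

```python
from collections import Counter

def marked_urls(ranked_urls, clicked, min_run=2):
    """URLs skipped right after a run of at least `min_run` consecutive clicks.

    Assumption: an unclicked URL counts as skipped only if some later URL
    in the list was clicked; URLs below the last click are ignored.
    """
    hits = [url in clicked for url in ranked_urls]
    if not any(hits):
        return set()
    last_click = max(i for i, hit in enumerate(hits) if hit)
    marked, run, in_skipped_block = set(), 0, False
    for i in range(last_click + 1):
        if hits[i]:
            run += 1
            in_skipped_block = False
        else:
            # Mark the skip if it follows a long enough run of clicks,
            # or continues a block of skips that already did.
            if run >= min_run or in_skipped_block:
                marked.add(ranked_urls[i])
                in_skipped_block = True
            run = 0
    return marked

def find_hot_set(ranked_urls, sessions, max_mark_freq, min_access_frac):
    """Apply steps 3 and 4; `sessions` holds one set of clicked URLs per user."""
    if not sessions:
        return []
    mark_counts, click_counts = Counter(), Counter()
    for clicked in sessions:
        mark_counts.update(marked_urls(ranked_urls, clicked))
        click_counts.update(set(clicked) & set(ranked_urls))
    n = len(sessions)
    return [u for u in ranked_urls
            if mark_counts[u] / n <= max_mark_freq      # step 3
            and click_counts[u] / n >= min_access_frac]  # step 4

# The two click patterns from the examples above.
urls = ["http://java.sun.com/", "http://www.sun.com/java/",
        "http://www.javaworld.com/", "http://java.oreilly.com/",
        "http://www.jars.com/"]
session_1 = {urls[0], urls[1], urls[2], urls[4]}
session_2 = {urls[1], urls[2], urls[4]}
print(marked_urls(urls, session_2))
print(find_hot_set(urls, [session_1, session_2], 0.5, 0.5))
```

In both sessions only URL 4 is marked: in the second one URL 1 is unclicked too, but no run of clicks precedes it.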

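Assigning Weights to Virtual Links
n pages in the hot set: t1, t2, …, tn

Clickstream 1: (t1, t2, t3, t4, x1, x2)
$w_{1,1} = \frac{4}{6} \cdot \frac{4}{1+2+3+4} \; (= 0.267)$
$w_{1,2} = \frac{4}{6} \cdot \frac{3}{1+2+3+4} \; (= 0.200)$
$w_{1,3} = \frac{4}{6} \cdot \frac{2}{1+2+3+4} \; (= 0.133)$
$w_{1,4} = \frac{4}{6} \cdot \frac{1}{1+2+3+4} \; (= 0.067)$
$w_{1,5} = w_{1,6} = \dots = w_{1,n} = 0$

Clickstream 2: (t3, x1, t1)
$w_{2,3} = \frac{2}{3} \cdot \frac{2}{1+2} \; (= 0.444)$
$w_{2,1} = \frac{2}{3} \cdot \frac{1}{1+2} \; (= 0.222)$
$w_{2,2} = w_{2,4} = w_{2,5} = \dots = w_{2,n} = 0$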
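The two worked clickstreams suggest a general rule, sketched below. The rule is inferred from those examples rather than stated on the slide: if a clickstream of length L contains m hot-set pages, the j-th hot-set page clicked receives (m / L) * (m - j + 1) / (1 + 2 + ... + m), and hot-set pages that were not clicked receive 0. The function name `clickstream_weights` is hypothetical, and the sketch assumes no page is repeated within a clickstream.

```python
def clickstream_weights(clickstream, hot_set):
    """Weight of every hot-set page within a single clickstream."""
    hot_clicked = [page for page in clickstream if page in hot_set]
    m, length = len(hot_clicked), len(clickstream)
    weights = {page: 0.0 for page in hot_set}  # unclicked hot pages get 0
    if m == 0:
        return weights
    rank_total = m * (m + 1) // 2  # 1 + 2 + ... + m
    for position, page in enumerate(hot_clicked):
        # Earlier clicks get the larger rank share; m / length is the
        # fraction of the clickstream that falls inside the hot set.
        weights[page] = (m / length) * ((m - position) / rank_total)
    return weights

hot = {"t1", "t2", "t3", "t4", "t5"}
w1 = clickstream_weights(["t1", "t2", "t3", "t4", "x1", "x2"], hot)
w2 = clickstream_weights(["t3", "x1", "t1"], hot)
print({page: round(w, 3) for page, w in sorted(w1.items())})
print({page: round(w, 3) for page, w in sorted(w2.items())})
```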

Assigning Weights to Virtual Links (cont’d)
Final weight:
$w_h(T_1) = \frac{1}{N(T_1)} \sum_{k=1}^{N(T_1)} w_{k,h}$
$w_1(T_1) = \frac{0.267 + 0.222}{2} = 0.245$
$w_2(T_1) = \frac{0.200 + 0}{2} = 0.100$

For period $T_i$ where $i \ge 2$:
$w'_h(T_i) = \frac{1}{N(T_i)} \sum_{k=1}^{N(T_i)} w_{k,h}$
$w_h(T_i) = \frac{1}{3} w_h(T_{i-1}) + \frac{2}{3} w'_h(T_i)$
(1/3 is the degeneration factor)
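A small sketch of the per-period aggregation follows. It assumes that N(T_i) is the number of clickstreams observed in period T_i, which matches the worked example (two clickstreams, division by 2) but is not spelled out on the slide. The function names `period_average` and `update_weight` are hypothetical.

```python
def period_average(per_clickstream_weights, page):
    """Average weight of `page` over all clickstreams of one period.

    `per_clickstream_weights` is a list of {page: weight} dictionaries,
    one per clickstream; a missing page counts as weight 0.
    """
    total = sum(w.get(page, 0.0) for w in per_clickstream_weights)
    return total / len(per_clickstream_weights)

def update_weight(previous_weight, current_average, degeneration=1 / 3):
    """Blend the previous period's weight with the current period's average."""
    return degeneration * previous_weight + (1 - degeneration) * current_average

# Rounded per-clickstream weights from the two example clickstreams.
period_1 = [
    {"t1": 0.267, "t2": 0.200, "t3": 0.133, "t4": 0.067},
    {"t3": 0.444, "t1": 0.222},
]
w1_t1 = period_average(period_1, "t1")  # (0.267 + 0.222) / 2
w2_t1 = period_average(period_1, "t2")  # (0.200 + 0) / 2
print(w1_t1, w2_t1)
```

With the default degeneration factor of 1/3, a page whose weight was 0.245 in the first period and whose average in the second period is 0.100 would get `update_weight(0.245, 0.100)`, that is, one third of the old value plus two thirds of the new one.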

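Computing the New Scores
Let $x_p$ and $y_p$ denote the authority and hub score of page $p$, respectively.
- For each page $p$, we update $p$’s authority score by
  $x_p := \sum_{q : (q,p) \in E} y_q + \square_A \sum_{q' : (q',p) \in E'} w_{q',p} \, y_{q'}$
- Similarly, we update $p$’s hub score by
  $y_p := \sum_{q : (p,q) \in E} x_q + \square_H \sum_{q' : (p,q') \in E'} w_{p,q'} \, x_{q'}$
($\square_A$ and $\square_H$ mark coefficients whose base symbol is missing from this transcript; only the subscripts A and H remain.)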
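One update step with virtual links might look like the sketch below. Several things here are assumptions rather than facts from the slides: E is read as the set of ordinary links and E' as the set of weighted virtual links; the two coefficients are called `coef_a` and `coef_h` (hypothetical names) and default to 10 on the assumption that they are the A- and H-subscripted parameters listed with value 10 in the experiments; hub scores are computed from the freshly updated authority scores; and normalization is left to the caller. The virtual hub node `VH` and its initial hub score of 1 are also illustrative.

```python
def vipas_update(pages, links, virtual_links, auth, hub, coef_a=10.0, coef_h=10.0):
    """One authority/hub update over real links and weighted virtual links.

    links:         iterable of (source, target) pairs, the set E
    virtual_links: {(source, target): weight}, the set E' with its weights
    """
    new_auth = {p: 0.0 for p in pages}
    for q, p in links:
        new_auth[p] += hub[q]                  # ordinary in-links
    for (q, p), w in virtual_links.items():
        new_auth[p] += coef_a * w * hub[q]     # weighted virtual in-links

    new_hub = {p: 0.0 for p in pages}
    for p, q in links:
        new_hub[p] += new_auth[q]              # ordinary out-links
    for (p, q), w in virtual_links.items():
        new_hub[p] += coef_h * w * new_auth[q]  # weighted virtual out-links
    return new_auth, new_hub

# Hypothetical example: a virtual hub "VH" links to hot-set page "a"
# with weight 0.245; "h" is an ordinary hub pointing to "a" and "b".
pages = ["h", "a", "b", "VH"]
links = [("h", "a"), ("h", "b")]
virtual = {("VH", "a"): 0.245}
ones = {p: 1.0 for p in pages}
new_auth, new_hub = vipas_update(pages, links, virtual, ones, dict(ones))
print(new_auth["a"], new_auth["b"])
```

In this toy case the virtual link lifts page `a` above page `b`, although both have exactly one ordinary in-link.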

User-behavior Observation
- Use an ASP script
- In the query result page, the plain URL http://java.sun.com/ is replaced by wrapper.asp?URL=http://java.sun.com/
- When the link is clicked, the script will
  1. Increment the click count of http://java.sun.com/
  2. Record the time
  3. Redirect the user to http://java.sun.com/

[Figure: query result page for keyword “Java”, listing: 1. The Source of Java(TM) Technology, http://java.sun.com/; 2. …, http://…; 3. …, http://…]
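The slides use an ASP script for this; the sketch below shows the same three steps in Python only to illustrate the idea. Everything in it is an assumption: the in-memory `click_counts` and `click_log` stand in for the clickstream database, `wrap_url` and `handle_click` are hypothetical names, and the target URL is percent-encoded in the query string, whereas the slide shows it unencoded.

```python
import time
from collections import Counter
from urllib.parse import parse_qs, quote

click_counts = Counter()  # stand-in for the per-URL click counter
click_log = []            # stand-in for the clickstream store: (time, url)

def wrap_url(url, wrapper="wrapper.asp"):
    """Rewrite a plain result URL so that clicks pass through the wrapper."""
    return f"{wrapper}?URL={quote(url, safe='')}"

def handle_click(query_string, now=None):
    """Process one wrapper request and return (status, headers)."""
    url = parse_qs(query_string)["URL"][0]
    click_counts[url] += 1                                       # 1. count the click
    click_log.append((time.time() if now is None else now, url))  # 2. record the time
    return 302, {"Location": url}                                 # 3. redirect the user

wrapped = wrap_url("http://java.sun.com/")
status, headers = handle_click(wrapped.split("?", 1)[1])
print(wrapped, status, headers["Location"])
```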

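Implementation and Experiments
- Experimental testbed: NTUEE website (http://www.ee.ntu.edu.tw/)
- Data collection: 03/28/’02 ~ 05/31/’02
- Parameters (the parameter symbols are missing from this transcript; only the subscripts A and H of the last two remain)

Parameter | Value
(missing) | 20%
(missing) | 40%
(missing), subscript A | 10
(missing), subscript H | 10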

Evaluation Method
- For a keyword, we manually select a list of authority pages and compare it with the output of each algorithm

SN | URL (H denotes http://www.ee.ntu.edu.tw) | Title
5633 | H/www/faculty/rb-wu/rb-wu.htm | Homepage of professor Ruey-Beei Wu
7228 | H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html | [no title]
8682 | H/html_2000/www/faculty/rb-wu/rb-wu.htm | Homepage of professor Ruey-Beei Wu

- Discrepancy coefficient: $\frac{1}{n} \sum_{k=1}^{n} (R_k - k)$

Discrepancy Coefficient – Regular HITS

Rank | SN | URL (H denotes http://www.ee.ntu.edu.tw) | Title
1 | 5633 | H/www/faculty/rb-wu/rb-wu.htm | Homepage of professor Ruey-Beei Wu
2 | 93 | H/professor_c.html | Faculty members of NTUEE
3 | 34 | H/prodata_c.html | Faculty members of NTUEE
4 | 94 | H/professor_e.html | Faculty members of NTUEE
5 | 8682 | H/html_2000/www/faculty/rb-wu/rb-wu.htm | Homepage of professor Ruey-Beei Wu
6 | 7229 | H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html | [no title]
7 | 7269 | H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html | [no title]
8 | 5892 | H/html_2000/WWW/faculty/NoSort.html | [no title]
9 | 4959 | H/content/chinese/required/differential_equations.html | Engineering Mathematics I: Diff….
10 | 8904 | H/html_2000/content/chinese/required/differential_equations.html | Engineering Mathematics I: Diff….
41 | 7228 | H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html | [no title]

R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 41 (SN 7228)

$\frac{(1-1) + (5-2) + (41-3)}{3} = 13.67$

Discrepancy Coefficient – VIPAS-VH

Rank | SN | URL (H denotes http://www.ee.ntu.edu.tw) | Title
1 | 5633 | H/www/faculty/rb-wu/rb-wu.htm | Homepage of professor Ruey-Beei Wu
2 | 93 | H/professor_c.html | Faculty members of NTUEE
3 | 34 | H/prodata_c.html | Faculty members of NTUEE
4 | 94 | H/professor_e.html | Faculty members of NTUEE
5 | 8682 | H/html_2000/www/faculty/rb-wu/rb-wu.htm | Homepage of professor Ruey-Beei Wu
6 | 7228 | H/html_2000/WWW/faculty/english/Wu-Rei-Bei.html | [no title]
7 | 7229 | H/html_2000/WWW/faculty/english/Cao-Heng-Wei.html | [no title]
8 | 7269 | H/html_2000/WWW/faculty/english/Chen-Qiu-Lin.html | [no title]
9 | 5892 | H/html_2000/WWW/faculty/NoSort.html | [no title]
10 | 4959 | H/content/chinese/required/differential_equations.html | Engineering Mathematics I: Diff….

R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 6 (SN 7228)

$\frac{(1-1) + (5-2) + (6-3)}{3} = 2$
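The two worked values can be checked with a short sketch. It reads R_k as the rank at which the algorithm lists the k-th manually selected page, with the ranks given in ascending order, which is how the R1, R2, R3 lines above use it. The function name `discrepancy_coefficient` is hypothetical.

```python
def discrepancy_coefficient(ranks):
    """Mean of (R_k - k), where ranks[k - 1] holds R_k."""
    n = len(ranks)
    return sum(r - k for k, r in enumerate(ranks, start=1)) / n

hits_ranks = [1, 5, 41]     # regular HITS
vipas_vh_ranks = [1, 5, 6]  # VIPAS-VH
print(round(discrepancy_coefficient(hits_ranks), 2))      # 13.67
print(round(discrepancy_coefficient(vipas_vh_ranks), 2))  # 2.0
```

A value of 0 would mean the n selected pages occupy exactly the top n positions of the output.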

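Evaluation Method
- Grouping coefficient: $\sqrt{\frac{1}{n} \sum_{k=1}^{n} \left[ (R_k - k) - D \right]^2}$, where $D$ denotes the discrepancy coefficient
- Stability: the standard deviation of each algorithm’s discrepancy coefficients for all of the keywords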

Grouping Coefficient – Regular HITS

R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 41 (SN 7228)

$\sqrt{\frac{[(1-1) - 13.67]^2 + [(5-2) - 13.67]^2 + [(41-3) - 13.67]^2}{3}} = 17.25$

Grouping Coefficient – VIPAS-VH

R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 6 (SN 7228)

$\sqrt{\frac{[(1-1) - 2]^2 + [(5-2) - 2]^2 + [(6-3) - 2]^2}{3}} = 1.41$

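The grouping coefficient in the two examples is the population standard deviation of the offsets R_k - k around their mean, the discrepancy coefficient. A minimal sketch that reproduces both values, with the hypothetical function name `grouping_coefficient`:

```python
import math

def grouping_coefficient(ranks):
    """Population standard deviation of (R_k - k); ranks[k - 1] holds R_k."""
    n = len(ranks)
    offsets = [r - k for k, r in enumerate(ranks, start=1)]
    mean_offset = sum(offsets) / n  # this is the discrepancy coefficient
    return math.sqrt(sum((d - mean_offset) ** 2 for d in offsets) / n)

print(round(grouping_coefficient([1, 5, 41]), 2))  # regular HITS: 17.25
print(round(grouping_coefficient([1, 5, 6]), 2))   # VIPAS-VH: 1.41
```

A small value means the selected pages sit close together in the output list, even if they do not start at rank 1.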

Experimental Results
[Figure: two charts comparing HITS, VIPAS-VH, and VIPAS-TH for keywords 1 to 8. First chart: Discrepancy Coefficient (y-axis 0 to 25). Second chart: Grouping Coefficient (y-axis 0 to 20); x-axis: Keyword.]

Experimental Results (cont’d)
[Figure: Stability (y-axis 0 to 9) of HITS, VIPAS-VH, and VIPAS-TH.]

Conclusions
- Link-analysis algorithms are popular in Web information retrieval
  - But they need further improvement
- In our work, we built a Web warehouse
  - Incorporate user feedback into the identification of authoritative resources (Algorithm VIPAS)
- Experimental results show that VIPAS is very effective and the warehouse is able to retrieve much more valuable information for users