Keywords Selection Problem in Hidden Web Crawling Ka Cheung Sia, Richard March 15 2004

Dec 19, 2015

Page 1

Keywords Selection Problem in Hidden Web Crawling

Ka Cheung Sia, Richard

March 15 2004

Page 2

Agenda
- What is Hidden Web?
- How to crawl the Hidden Web?
- Problem formalization
- Searching for “best” keyword
  - Greedy
  - Tree searching
  - Pruning
- Experiments & results
- Conclusion

Page 3

What is Hidden Web?
- Hidden
  - Unreachable by following hyperlinks
  - Dynamically generated
  - Accessible only through a search interface
- Informative
- Examples
  - http://citeseer.ist.psu.edu/ – CS research papers
  - http://www.pubmed.org – medical research papers
  - http://catalog.loc.gov – Library of Congress

Page 4

What is Hidden Web? Search interface

http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1

Page 5

What is Hidden Web? Result

Page 6

What is Hidden Web? Document

Page 7

How to crawl the Hidden Web
http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1

[Diagram labels: Figure out a keyword; Hidden Web; Query; Result; Our task]
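
As a minimal sketch of how a crawler could turn a chosen keyword into a request, the snippet below rebuilds the example URL from its parts. The base URL and the parameter names `q`, `submit` and `cs` are copied from the slide; the function name `build_query_url` is hypothetical, and nothing here is claimed to be the authors' crawler.

```python
from urllib.parse import urlencode

# Base URL and parameter names are taken from the example URL on the slide.
BASE_URL = "http://citeseer.ist.psu.edu/cis"


def build_query_url(keyword: str) -> str:
    """Return the search URL a crawler would request for one keyword query."""
    params = {"q": keyword, "submit": "Search Documents", "cs": "1"}
    # urlencode encodes spaces as '+', matching the slide's URL.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url("heuristic search"))
```

Only the value of `q` changes between queries, so choosing keywords is the whole crawling problem.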

Page 8

Problem formalization
- Set-cover
  - Vertex – documents
  - Hyper-edges – query words

Page 9

Goal
- Maximize the number of unique documents retrieved with minimum number of query words
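
The set-cover view can be illustrated with a toy hypergraph: each query word is a hyper-edge over the documents that contain it. The index contents and the name `covered` are hypothetical, made up only for this sketch.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy collection: query word (hyper-edge) -> document ids (vertices).
INDEX: Dict[str, Set[int]] = {
    "linux": {1, 2, 3, 4},
    "debian": {1, 2},
    "redhat": {3, 5},
    "privacy": {6},
}


def covered(index: Dict[str, Set[int]], words: Iterable[str]) -> Set[int]:
    """Return the unique documents retrieved by issuing every word in `words`."""
    docs: Set[int] = set()
    for word in words:
        docs |= index.get(word, set())
    return docs


if __name__ == "__main__":
    # Two queries each, but different numbers of unique documents.
    print(len(covered(INDEX, ["debian", "redhat"])))
    print(len(covered(INDEX, ["linux", "privacy"])))
```

The goal on the slide corresponds to making `len(covered(...))` as large as possible while keeping the word list short.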

Page 10

Problem formalization
- $P(q_i)$: portion of unique documents retrieved by issuing query word $q_i$ (portion of documents containing “$q_i$”)
- $P(q_i \lor q_j)$: portion of unique documents retrieved by issuing query words $q_i$ and $q_j$ (portion of documents containing $q_i$ or $q_j$)
- $P(q_i \mid q_j)$: portion of documents containing $q_i$ in the set of documents retrieved by issuing query word $q_j$
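
The three quantities can be computed directly on a small collection. This is a sketch under assumed data: the collection size `N_DOCS`, the index contents, and the function names `p`, `p_or` and `p_given` are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical collection of 10 documents; word -> ids of documents containing it.
N_DOCS = 10
INDEX: Dict[str, Set[int]] = {
    "linux": {0, 1, 2, 3},
    "debian": {0, 1},
    "redhat": {2, 4},
}


def p(word: str) -> float:
    """P(q): portion of all documents containing the word."""
    return len(INDEX[word]) / N_DOCS


def p_or(a: str, b: str) -> float:
    """P(a or b): portion of all documents containing a or b."""
    return len(INDEX[a] | INDEX[b]) / N_DOCS


def p_given(a: str, b: str) -> float:
    """P(a | b): portion of the documents retrieved by b that also contain a."""
    return len(INDEX[a] & INDEX[b]) / len(INDEX[b])


if __name__ == "__main__":
    print(p("linux"), p_or("debian", "redhat"), p_given("linux", "redhat"))
```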

Page 11

Problem formalization
What is the next “best” query word?

$$
\begin{aligned}
P((q_1 \lor \dots \lor q_{i-1}) \lor q_i)
&= P(q_1 \lor \dots \lor q_{i-1}) + P(q_i) - P((q_1 \lor \dots \lor q_{i-1}) \land q_i) \\
&= P(q_1 \lor \dots \lor q_{i-1}) + P(q_i) - P(q_1 \lor \dots \lor q_{i-1})\, P(q_i \mid q_1 \lor \dots \lor q_{i-1})
\end{aligned}
$$

- $P(q_1 \lor \dots \lor q_{i-1})$ – known
- $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$ – known
- $P(q_i)$ – unknown
- Approximate $P(q_i)$ using $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$
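
To see what the approximation implies, write $B$ as shorthand (introduced here, not on the slide) for $q_1 \lor \dots \lor q_{i-1}$ and substitute $P(q_i) \approx P(q_i \mid B)$ into the expression for the new coverage:

```latex
\begin{aligned}
P(B \lor q_i) &= P(B) + P(q_i) - P(B)\,P(q_i \mid B) \\
              &\approx P(B) + P(q_i \mid B) - P(B)\,P(q_i \mid B) \\
              &= P(B) + \bigl(1 - P(B)\bigr)\,P(q_i \mid B)
\end{aligned}
```

Under this approximation, and since $P(B)$ is fixed when the next word is chosen, the estimated coverage grows with $P(q_i \mid B)$, so the word with the largest conditional frequency gives the largest estimate.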

Page 12

Search for best query word
- Greedy: choose the most frequently occurring word so far to be the query
  - Choose $q_i$ with maximum $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$
- For the set-cover problem, greedy is proven to obtain a solution within a logarithmic factor of the optimum
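
A minimal sketch of the greedy rule, assuming a tiny in-memory collection stands in for the hidden-web site. The documents, the `search` stand-in, and the name `greedy_crawl` are hypothetical; ties are broken alphabetically only to keep the example deterministic.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical toy collection: document id -> words it contains.
DOCS: Dict[int, Set[str]] = {
    0: {"linux", "debian"},
    1: {"linux", "redhat"},
    2: {"linux", "privacy"},
    3: {"privacy", "map"},
    4: {"map"},
}


def search(word: str) -> Set[int]:
    """Stand-in for the search interface: ids of documents containing the word."""
    return {doc_id for doc_id, words in DOCS.items() if word in words}


def greedy_crawl(seed: str, max_queries: int) -> Tuple[List[str], Set[int]]:
    """Repeatedly issue the most frequent not-yet-issued word seen so far."""
    issued = [seed]
    retrieved = search(seed)
    while len(issued) < max_queries:
        # Document frequency of each unissued word within the retrieved documents.
        counts = Counter(
            w for doc_id in retrieved for w in DOCS[doc_id] if w not in issued
        )
        if not counts:
            break
        # Sorting first makes the tie-break alphabetical.
        best = max(sorted(counts), key=lambda w: counts[w])
        issued.append(best)
        retrieved |= search(best)
    return issued, retrieved


if __name__ == "__main__":
    print(greedy_crawl("debian", 4))
```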

Page 13

Search for best query word
- Can we do better?
- Intuition
  - Correlation of keywords
  - E.g.
    - linux
    - debian, redhat, suse, knoppix, fedora, etc.
  - We might save the query word “linux”!

Page 14

Search for best query word

[Diagram labels: Whole document collection; Already retrieved documents; Documents retrieved by $q_i$; Documents retrieved by $q_j$; Documents retrieved by $q_k$]

Page 15

Search for best query word

[Diagram labels: linux; debian; redhat]

f(x) = number of documents we get by issuing the queries linux, debian, redhat, minus the overlap between “redhat, linux”, “debian, linux”, and “redhat, debian”
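
The evaluation function can be sketched as a sum of result sizes minus the pairwise overlaps. The result sets and the name `f` are hypothetical. Note that subtracting only pairwise overlaps matches the exact number of unique documents only when no document matches all three words, which is the case in this toy data.

```python
from itertools import combinations
from typing import Dict, Sequence, Set

# Hypothetical result sets: keyword -> ids of documents it retrieves.
RESULTS: Dict[str, Set[int]] = {
    "linux": {1, 2, 3, 4},
    "debian": {1, 2, 5},
    "redhat": {3, 6},
}


def f(words: Sequence[str]) -> int:
    """Total documents retrieved minus the overlap of every pair of keywords."""
    total = sum(len(RESULTS[w]) for w in words)
    overlap = sum(
        len(RESULTS[a] & RESULTS[b]) for a, b in combinations(words, 2)
    )
    return total - overlap


if __name__ == "__main__":
    words = ["linux", "debian", "redhat"]
    exact = len(set().union(*(RESULTS[w] for w in words)))
    print(f(words), exact)
```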

Page 16

Search for best query word
- The search tree is huge (branching factor)
  - We look ahead for the 10 most frequent keywords
  - We only search up to depth 6
- Pruning

Page 17

Search for best query word
- DFBnB (depth-first branch and bound)
  - Prune any sub-tree where the sum of documents retrieved, assuming no overlap between keywords, is less than the current best solution
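
The pruning rule can be sketched as a small depth-first branch and bound over keyword sequences. This is an illustration under assumptions, not the authors' implementation: the candidate sets and the name `dfbnb` are hypothetical, the tree is tiny (4 candidates, depth 2), and the sketch prunes when the optimistic bound is less than or equal to the best so far, since such a sub-tree cannot improve the result.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical candidates: keyword -> ids of documents it would retrieve.
CANDIDATES: Dict[str, Set[int]] = {
    "info": {1, 2, 3, 4, 5},
    "main": {4, 5, 6, 7},
    "enter": {1, 2, 8},
    "map": {9},
}


def dfbnb(cands: Dict[str, Set[int]], depth: int) -> Tuple[List[str], int, int]:
    """Return (best keyword sequence, documents covered, nodes examined)."""
    best_words: List[str] = []
    best_count = 0
    nodes = 0
    # Try larger result sets first so a good solution is found early.
    order = sorted(cands, key=lambda w: -len(cands[w]))

    def visit(chosen: List[str], covered: Set[int]) -> None:
        nonlocal best_words, best_count, nodes
        nodes += 1
        if len(covered) > best_count:
            best_words, best_count = list(chosen), len(covered)
        remaining = depth - len(chosen)
        if remaining == 0:
            return
        unused = [w for w in order if w not in chosen]
        # Optimistic bound: the largest unused sets add all their documents,
        # i.e. no overlap between keywords is assumed.
        bound = len(covered) + sum(len(cands[w]) for w in unused[:remaining])
        if bound <= best_count:
            return  # prune: this sub-tree cannot beat the current best
        for w in unused:
            visit(chosen + [w], covered | cands[w])

    visit([], set())
    return best_words, best_count, nodes


if __name__ == "__main__":
    print(dfbnb(CANDIDATES, depth=2))
```

Because the bound never underestimates what a sub-tree can reach, pruning does not change the best coverage found; it only reduces the number of nodes examined.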

Page 18

Experiment
- Document collection: ~100K front pages of randomly selected websites
- Query interface: an inverted index (a program that returns documents containing the given query word)
- Methods
  - Greedy
  - DFS search (look ahead for 10 words, up to depth 6)
  - DFS search with pruning (DFBnB)
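
A query interface of this kind can be sketched as a simple inverted index. The whitespace tokenization, the sample pages, and the names `build_index` and `query` are assumptions made for illustration; the experiment's actual program is not described in the slides.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(pages: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lower-cased word to the ids of the pages containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def query(index: Dict[str, Set[int]], word: str) -> List[int]:
    """Return the sorted ids of documents containing the query word."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    # Hypothetical front pages.
    pages = {1: "Privacy info", 2: "site map info", 3: "Enter"}
    idx = build_index(pages)
    print(query(idx, "info"))
```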

Page 19

Results
Does searching help?

provide 51, work 159, privacy 144, years 172, world 344, list 205, info 1467, map 184, want 57, order 87, people 85, read 56, main 2270, high 95, designed 240, latest 36, events 132, looking 46, send 80, right 380, enter 1285, local 77, browser 1216, questions 77, real 77

provide 51, work 159, privacy 144, years 172, read 101, main 2364, designed 291, info 1455, latest 53, looking 60, send 101, right 402, local 99, world 239, list 142, map 150, want 42, order 69, people 67, high 85, events 126, questions 85, enter 1272, browser 1216, real 77

Page 20

Results
Does searching help?

Page 21

Results
How much does pruning save?

- Without pruning – 187300 nodes are examined
  187300 = (10) + (10*9) + (10*9*8) + (10*9*8*7) + (10*9*8*7*6) + (10*9*8*7*6*5)
- With pruning – 5558 nodes are examined on average (when we choose the most frequent keyword to expand)
- DFBnB examines roughly 30 times fewer nodes
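
The node count and the savings factor can be checked with a few lines of arithmetic. The figures 10, 6 and 5558 come from the slides; the function name `unpruned_nodes` is hypothetical, and `math.perm` requires Python 3.8 or later.

```python
import math


def unpruned_nodes(branching: int, depth: int) -> int:
    """Nodes in a tree of keyword sequences without repetition, levels 1..depth."""
    return sum(math.perm(branching, level) for level in range(1, depth + 1))


if __name__ == "__main__":
    total = unpruned_nodes(10, 6)
    pruned_average = 5558  # average reported on the slide
    print(total)
    print(round(total / pruned_average, 1))
```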

Page 22

Conclusion
- Searching helps little “in this problem”
- DFBnB is “really effective” in pruning the search tree

Page 23

End

Page 24

More results
Prior information helps

Page 25

Results

Page 26

Results

Page 27

Search & Greedy

Page 28

Search with prune & Greedy

Page 29

Search for best query word

base = $q_1 \lor \dots \lor q_i$

$$
P(\text{base} \lor q_{i+1} \lor q_{i+2}) = P(\text{base} \lor q_{i+1}) + P(q_{i+2}) - P((\text{base} \lor q_{i+1}) \land q_{i+2})
$$

$$
\begin{aligned}
P((\text{base} \lor q_{i+1}) \land q_{i+2})
&= P((\text{base} \land q_{i+2}) \lor (q_{i+1} \land q_{i+2})) \\
&= P(\text{base} \land q_{i+2}) + P(q_{i+1} \land q_{i+2}) - P(\text{base} \land q_{i+1} \land q_{i+2})
\end{aligned}
$$
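
The two-word look-ahead expansion can be checked numerically on small sets. The sets, the collection size `N`, and the function names are hypothetical; `Fraction` is used so the comparison is exact.

```python
from fractions import Fraction
from typing import Set

N = 12  # hypothetical collection size


def P(docs: Set[int]) -> Fraction:
    """Portion of the collection covered by a set of document ids."""
    return Fraction(len(docs), N)


def expanded(base: Set[int], q1: Set[int], q2: Set[int]) -> Fraction:
    """P(base or q1 or q2) computed through the expansion of the overlap term."""
    overlap = P(base & q2) + P(q1 & q2) - P(base & q1 & q2)
    return P(base | q1) + P(q2) - overlap


if __name__ == "__main__":
    base, q1, q2 = {0, 1, 2, 3, 4}, {3, 4, 5, 6}, {0, 4, 6, 7}
    print(expanded(base, q1, q2), P(base | q1 | q2))
```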

Page 30

2 words overlapping

Page 31

3 words overlapping