Searching the Hidden Web Donghui Xu Spring 2011, COMS E6125 Prof. Gail Kaiser
Page 1:

Searching the Hidden Web

Donghui Xu
Spring 2011, COMS E6125
Prof. Gail Kaiser

Page 2:

Outline

• What is the hidden Web
• Two approaches to searching the hidden Web
  ◦ Browsing a Yahoo!-like Web directory
  ◦ Crawling the hidden Web
• Conclusion

Page 3:

The Surface Web

The surface Web
◦ reachable via hyperlinks

Page 4:

What is the Hidden Web

The hidden Web
◦ no static hyperlink points to the webpage
◦ accessed via a query interface
◦ dynamically generated based on the query submitted
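The distinction between link-reachable pages and query-generated pages can be illustrated with a small sketch. Everything here is hypothetical: the site layout in `STATIC_LINKS`, the backend `RECORDS`, and the `/result?q=...` URL scheme are made up for illustration and do not come from the slides.

```python
from collections import deque

# Hypothetical toy site: static pages and the hyperlinks between them.
STATIC_LINKS = {
    "/index": ["/about", "/search"],
    "/about": ["/index"],
    "/search": [],  # the query form itself links to no result pages
}

# Hypothetical backend records, exposed only through the query interface.
RECORDS = {
    "climate": ["satellite archive"],
    "patent": ["patent full text", "trademark record"],
}


def follow_links(start):
    """Breadth-first crawl that only follows static hyperlinks."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for target in STATIC_LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def submit_query(keyword):
    """Simulated query interface: result URLs are generated per query."""
    hits = RECORDS.get(keyword, [])
    return [f"/result?q={keyword}&id={i}" for i in range(len(hits))]


surface_pages = follow_links("/index")  # pages a link-following crawler finds
hidden_pages = submit_query("patent")   # pages that exist only after a query
```

In this sketch the link-following crawl never sees any `/result` page, because no static hyperlink points to one; those pages appear only when a keyword is submitted.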

Page 5:

What is the Hidden Web

Page 6:

Size of the Hidden Web

About 500 times larger than the surface Web
◦ The surface Web: 1 billion pages
◦ The hidden Web: over 550 billion pages

The sixty largest deep Web sites together are about 40 times larger than the surface Web.

The deep Web vs. the surface Web (from Bergman)

Page 7:

Quality of the Hidden Web

Name | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
NASA EOSDIS | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
MP3.com | http://www.mp3.com/ | 4,300
US PTO - Trademarks + Patents | http://www.uspto.gov/tmdb/, http://www.uspto.gov/patft/ | 2,440
Informedia (Carnegie Mellon Univ.) | http://www.informedia.cs.cmu.edu/ | 1,830
UC Berkeley Digital Library Project | http://elib.cs.berkeley.edu/ | 766
US Census | http://factfinder.census.gov | 610
NCI CancerNet Database | http://cancernet.nci.nih.gov/ | 488
Amazon.com | http://www.amazon.com/ | 461
IBM Patent Center | http://www.patents.ibm.com/boolquery | 345
NASA Image Exchange | http://nix.nasa.gov/ | 337

Some of the largest hidden Web sites (from Bergman)

Page 8:

Two Approaches to Access the Hidden Web

Browsing a Yahoo!-like Web directory
Crawling the hidden Web

Page 9:

Browsing a Yahoo!-like Web Directory

Manually populate a Yahoo!-like directory
Classify collections of text databases into categories and subcategories

Page 10:

Browsing a Yahoo!-like Web Directory

Pros
◦ Intuitive
◦ Easy to use

Cons
◦ Labor intensive
  The Yahoo! Directory contains 200, 0000 categories, and millions of databases are searchable online
◦ Accurate classification is not an easy task

Page 11:

Crawling the Hidden Web

Main challenge in searching the hidden Web
◦ How to automatically generate meaningful queries as input to the query interface

The query generation problem
◦ Assume that a Web site contains a set of pages, S.
◦ Each query q_i issued returns a subset of S, S_i.
◦ The task is to select a set of queries that returns the maximum number of unique pages in the database at minimum cost.
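The query generation problem stated on this slide resembles a budgeted set-cover problem, so one common heuristic is a greedy choice: repeatedly issue the query that returns the most not-yet-seen pages per unit cost. The sketch below is only an illustration under a strong assumption, namely that the result set of every candidate query is known in advance, which a real crawler cannot know before issuing the query. The function name `greedy_query_selection`, the `cost` and `budget` parameters, and the sample data are all hypothetical.

```python
def greedy_query_selection(results, cost, budget):
    """Greedily pick queries that add the most new pages per unit cost.

    results: dict mapping query -> set of page ids it returns (assumed known).
    cost:    dict mapping query -> positive cost of issuing that query.
    budget:  total cost the crawler is allowed to spend.
    """
    covered, chosen, spent = set(), [], 0.0
    remaining = set(results)
    while remaining:
        affordable = [q for q in remaining if spent + cost[q] <= budget]
        if not affordable:
            break
        # New pages per unit cost; ties are broken by query text for determinism.
        best = max(affordable,
                   key=lambda q: (len(results[q] - covered) / cost[q], q))
        if not results[best] - covered:
            break  # no affordable query would add a new page
        chosen.append(best)
        covered |= results[best]
        spent += cost[best]
        remaining.remove(best)
    return chosen, covered, spent


# Hypothetical candidate queries and the page ids each one returns.
results = {
    "web": {1, 2, 3, 4},
    "data": {3, 4, 5},
    "deep": {5, 6},
    "query": {1, 6},
}
cost = {q: 1.0 for q in results}  # assume every query costs the same

chosen, covered, spent = greedy_query_selection(results, cost, budget=2.0)
```

With these made-up inputs the first pick is "web" (four new pages) and the second is "deep" (two new pages), which covers all six pages within the budget of two queries.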

Page 12:

Query Selection Algorithms

Random - select the query randomly from a list of keywords (e.g., a random word from an English dictionary).

Generic frequency - select the most frequent keywords from a generic document corpus.

Adaptive - select promising keywords from the documents downloaded by previously issued queries.
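The three policies can be sketched against a toy in-memory database. This is a simplified illustration, not the method evaluated by Ntoulas et al.: the `DATABASE` contents, the `search` function standing in for a query interface, and all function names are hypothetical, and the adaptive policy here uses raw keyword frequency in the downloaded pages as a stand-in for a proper estimate of how promising a keyword is.

```python
import random
from collections import Counter

# Hypothetical toy text database: page id -> page text.
DATABASE = {
    1: "hidden web crawler query",
    2: "hidden web database",
    3: "query interface form",
    4: "medical database record",
    5: "medical journal abstract",
}


def search(keyword):
    """Simulated query interface: ids of the pages containing the keyword."""
    return {pid for pid, text in DATABASE.items() if keyword in text.split()}


def random_policy(dictionary, issued, rng):
    """Random: pick any not-yet-issued word from a keyword list."""
    return rng.choice(sorted(set(dictionary) - issued))


def generic_frequency_policy(corpus_counts, issued):
    """Generic frequency: most frequent unissued word in a generic corpus."""
    return max((w for w in corpus_counts if w not in issued),
               key=lambda w: (corpus_counts[w], w))


def adaptive_policy(downloaded, issued):
    """Adaptive: most frequent unissued word in the pages downloaded so far."""
    counts = Counter(w for pid in downloaded for w in DATABASE[pid].split())
    candidates = [w for w in counts if w not in issued]
    if not candidates:
        return None
    return max(candidates, key=lambda w: (counts[w], w))


def crawl_adaptive(seed, max_queries):
    """Issue the seed query, then let the adaptive policy choose the rest."""
    issued, downloaded = set(), set()
    query = seed
    while query is not None and len(issued) < max_queries:
        issued.add(query)
        downloaded |= search(query)
        query = adaptive_policy(downloaded, issued)
    return issued, downloaded


issued, downloaded = crawl_adaptive("hidden", max_queries=3)
```

In this run the second query, "web", returns only pages that were already downloaded, which shows why a more careful adaptive policy would try to estimate the number of new pages a keyword returns rather than rely on raw frequency alone.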

Page 13:

Evaluation of Query Selection Algorithms

[Figure: comparison of policies for dmoz (modified from Ntoulas et al.)]

Page 14:

Evaluation of Query Selection Algorithms

[Figure: comparison of policies for PubMed (modified from Ntoulas et al.)]

Page 15:

Conclusion

The surface Web is the tip of the iceberg.
Beneath it is an even vaster hidden Web.
Two main approaches to access the hidden Web
◦ Yahoo!-like Web directory
◦ Crawling the hidden Web
Much work needs to be done.
Hidden Web searching technology would enable us to connect different data sources and allow businesses to use data in new ways.

Page 16:

References

[1] Michael K. Bergman. "The Deep Web: Surfacing Hidden Value." The Journal of Electronic Publishing, August 2001.

[2] Alex Wright. "Exploring a 'Deep Web' That Google Can't Grasp." New York Times, February 3, 2009.

[3] S. Raghavan and H. Garcia-Molina. "Crawling the Hidden Web." In Proceedings of the International Conference on Very Large Databases (VLDB), 2001.

[4] Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano. "Modeling and Managing Content Changes in Text Databases." ACM Transactions on Database Systems, 32(3), June 2007.

[5] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[6] Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho. "Downloading Textual Hidden Web Content by Keyword Queries." In Proceedings of the Joint Conference on Digital Libraries (JCDL), June 2005.

[7] J. P. Callan and M. E. Connell. "Query-Based Sampling of Text Databases." ACM Transactions on Information Systems, 19(2):97–130, 2001.

Page 17:

Thanks!