Beyond the Visible Web: Understanding & Exploring the Deep Web
Page 1: Deep Web

Beyond the Visible Web: Understanding & Exploring the Deep Web

Page 2: Deep Web

Visible, Deep Web, defined

Visible Web: HTML web pages that search engines include in their indices

Deep Web: Content excluded from traditional search engines (Google, AltaVista, etc.) and web directories (Lycos, Looksmart, etc.)

Page 3: Deep Web

Tools to search the Visible Web

• Traditional Search Engine (Google, AltaVista, Clusty)

• Targeted Directory (Electronic Resources for Classicists)

• Focused Crawler (SearchEdu.com, FirstGov.gov)

• Meta Search Engine (Dogpile, Metacrawler)

• Value-Added Search Services (HighBeam Research)

Page 4: Deep Web

What a search engine does

• Conventional web sites, such as the IHS library site, contain static HTML files, known as web pages.

• Search engines use spiders* (crawlers) to crawl the web and find these files by navigating hyperlinks.

*Spider: a program that visits web sites and reads their pages and other information in order to create entries for a search engine index.
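
As a rough illustration of the crawling described above, the sketch below walks a tiny in-memory link graph breadth-first. The page names and the `LINKS` table are hypothetical, invented for this example; a real spider would fetch pages over HTTP and parse their HTML rather than read a dictionary.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "index.html": ["library.html", "about.html"],
    "library.html": ["science.html", "index.html"],
    "about.html": [],
    "science.html": [],
    "orphan.html": [],  # exists on the server, but no page links to it
}

def crawl(start):
    """Return the set of pages reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

if __name__ == "__main__":
    print(sorted(crawl("index.html")))
```

In this toy graph `orphan.html` never enters the index, because the spider only discovers pages that some other page links to.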

Page 5: Deep Web

Myth: Search Engines are Comprehensive

• Most search engines include only 20%-50% of the VISIBLE web.

• Spiders (crawlers) can’t keep up with the rapid growth of the web

• Spiders can’t find all pages, for several possible reasons:
– There are no direct links to the files (any online database)
– Spiders can’t read some file formats (PDF*, Flash, etc.)
– Webmasters may voluntarily exclude pages (Robots Exclusion Protocol)
– Search engines sometimes drop pages, either deliberately to make room or inadvertently

*This is changing… check out Google Scholar (scholar.google.com)
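
To make the Robots Exclusion Protocol item concrete, here is a minimal sketch using Python's standard `urllib.robotparser` module. The `robots.txt` rules, the `example.org` URLs, and the `ExampleSpider` user agent are all assumptions made up for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a webmaster might publish at the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(url, user_agent="ExampleSpider"):
    """Return True if the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(allowed("http://example.org/library/science.html"))
    print(allowed("http://example.org/private/grades.html"))
```

A well-behaved spider checks these rules before fetching, so anything under the disallowed path stays out of its index even though the pages are publicly reachable.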

Page 6: Deep Web


The Haystack Problem

• The haystack problem (defined by Dr. Matthew Koll): Finding a needle in a haystack can mean:
– A known needle in a known haystack
– A known needle in an unknown haystack
– An unknown needle in an unknown haystack
– Any needle in a haystack
– The sharpest needle in a haystack
– Most of the sharpest needles in a haystack
– All the needles in a haystack
– Affirmation of no needles in the haystack
– Things like needles in any haystack
– Let me know whenever a new needle shows up
– Where are the haystacks?
– Needles, haystacks - whatever.

Page 7: Deep Web


Meta Search Engines and Limitations of “Multiple Haystack” Searching

• Meta search engines search multiple haystacks

• Meta search engines are only capable of broad, shallow searches
– Limitation of total results returned
– Not applying Boolean operators to all engines
– Illusion of greater coverage: lots of redundancy
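
The redundancy and result-limit points can be sketched with a toy merge. The engine names, result lists, and the `per_engine_limit` parameter below are hypothetical; this is one simple way a meta search engine might combine results, not a description of how Dogpile or Metacrawler work.

```python
# Hypothetical ranked results from three engines (names and URLs invented).
ENGINE_RESULTS = {
    "engine_a": ["a.example/1", "b.example/2", "c.example/3", "d.example/4"],
    "engine_b": ["b.example/2", "a.example/1", "e.example/5", "f.example/6"],
    "engine_c": ["a.example/1", "c.example/3", "b.example/2", "g.example/7"],
}

def meta_search(results_by_engine, per_engine_limit):
    """Merge the top results from each engine, dropping duplicates."""
    merged = []
    fetched = 0
    for results in results_by_engine.values():
        for url in results[:per_engine_limit]:
            fetched += 1
            if url not in merged:
                merged.append(url)
    return merged, fetched

if __name__ == "__main__":
    merged, fetched = meta_search(ENGINE_RESULTS, per_engine_limit=3)
    print(f"{fetched} results fetched, {len(merged)} unique")
```

With a limit of three results per engine, nine results are fetched but only four are distinct, and the pages ranked fourth by each engine are never seen at all.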

Page 8: Deep Web

Beyond the Visible Web for Academic Research:

Value-Added Search Services

• Value-Added Search Services such as HighBeam Research (highbeam.com, formerly elibrary.com) provide periodical articles to customers for a price, typically $1 to $4 per article.

Page 9: Deep Web

Beyond the Visible Web for Academic Research:

Online Subscription Periodical Databases (Proprietary Web)

• Most libraries (public, academic, and school) subscribe to multiple online subscription databases, such as ProQuest, Electric Library, LexisNexis, Britannica, Grolier’s, etc. These databases are used for the vast majority of periodical research in school libraries today.

Page 10: Deep Web

Beyond the Visible Web: Online Databases in General

• 54% of Deep Web content consists of information in topic databases.

• The vast majority of these topic databases are free.

• There is currently NO free search engine that can find this information.

Page 11: Deep Web

Database content is hidden from spiders…

• Dynamic databases don’t use the conventional structure of web pages with hyperlinks. Dynamic database content can only be found by conducting an internal query—there are no hyperlinks to this content.
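
A minimal sketch of why such content is invisible to a link-following spider, using Python's built-in `sqlite3` module. The table, the article titles, and the search-form HTML are all invented for illustration; a real database-backed site would be far more elaborate.

```python
import sqlite3

# Hypothetical article database; the table and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles (title) VALUES (?)",
    [("Coral reef decline",), ("Deep sea vents",), ("Reef fish surveys",)],
)

# The only static page a spider can reach: a search form, no article links.
SEARCH_PAGE = "<html><form action='/search'><input name='q'></form></html>"

def search(term):
    """Build a result list on demand, as a database-backed site would."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE title LIKE ? ORDER BY id",
        (f"%{term}%",),
    ).fetchall()
    return [title for (title,) in rows]

if __name__ == "__main__":
    print("href" in SEARCH_PAGE)  # the form page offers no hyperlinks to follow
    print(search("reef"))
```

The article list exists only as the answer to a query typed into the form, so a spider that navigates by hyperlinks has nothing to follow.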

Page 12: Deep Web

What’s in a name?

Files in online databases aren’t static, and don’t have logical naming structures. For example:

URL for a page on the library web site:

http://www.icsd.k12.ny.us/highschool/library/science.html

URL for article in the ProQuest database:

http://proquest.umi.com/pqdweb?index=1&did=358807701&SrchMode=1&sid=2&Fmt=3&VInst=PROD&VType=PQD&RQT=309&VName=PQD&TS=1106156199&clientId=7065
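
The difference between the two URLs on this slide can be seen by taking them apart with Python's standard `urllib.parse` module. The `describe` helper is a hypothetical name introduced here, and what each ProQuest parameter means is not stated in the source; reading `did` as a document identifier would be a guess.

```python
from urllib.parse import urlparse, parse_qs

STATIC_URL = "http://www.icsd.k12.ny.us/highschool/library/science.html"
DYNAMIC_URL = (
    "http://proquest.umi.com/pqdweb?index=1&did=358807701&SrchMode=1&sid=2"
    "&Fmt=3&VInst=PROD&VType=PQD&RQT=309&VName=PQD&TS=1106156199&clientId=7065"
)

def describe(url):
    """Split a URL into its path and its query-string parameters."""
    parts = urlparse(url)
    return parts.path, parse_qs(parts.query)

if __name__ == "__main__":
    for url in (STATIC_URL, DYNAMIC_URL):
        path, params = describe(url)
        print(path, len(params), "query parameters")
```

The library page is named by a readable file path with no query string, while the ProQuest article is named only by a generic script path plus a set of query parameters.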

Page 13: Deep Web

Search Engines drag a net…


Page 14: Deep Web

The Deep Web: much deeper than what Google can skim


Page 15: Deep Web

7,500 terabytes of info!

• The deep web is estimated to contain 7,500 terabytes of information.

• The surface web is estimated to contain 19 terabytes of information.

• The deep web is approximately 400 to 550 times larger than the surface web!
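
As a quick check on these figures, the ratio of the two storage estimates can be computed directly. The variable names are only illustrative. The result sits near the low end of the quoted range; the upper end cannot be derived from these two numbers alone and presumably rests on other measures in the cited source.

```python
# Storage estimates quoted on this slide, in terabytes.
deep_web_tb = 7_500
surface_web_tb = 19

# How many times larger the deep web is, by stored volume.
ratio = deep_web_tb / surface_web_tb

if __name__ == "__main__":
    print(round(ratio))  # 395
```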

Page 16: Deep Web

Content of the Deep Web


Page 17: Deep Web

Cumulative Original Content


Page 18: Deep Web

4 Types of Invisible Web Content

• Opaque Web

• Private Web

• Proprietary Web

• The Truly Deep Web

Page 19: Deep Web

When to Use the Deep Web

• When you’re familiar with a subject

• When you’re familiar with specific search tools

• When you’re looking for a precise answer

• When timeliness of content matters

Page 20: Deep Web

Changing Research Habits

• When going online to do work for a course, 46.5% of students reported being more likely to use an Internet search engine than a library-sponsored electronic resource (21.9%).

EPIC Online Use and Cost Evaluation Program research findings by Kate Wittenberg, director of Columbia University’s Electronic Publishing Initiative. Wittenberg studied 1,233 US college students’ research habits over a three-year period.

Page 21: Deep Web

Changing Research Habits, cont’d

• 31.5% of students surveyed learn about school-related electronic resources primarily through their library website (and another 27.1% get that information through their professors). Only 13.7% of students reported using Internet search engines to find academic electronic databases.

Page 22: Deep Web

Changing Research Habits, cont’d

• 40% of faculty respondents somewhat or strongly agreed that they would rather settle for what they can find online than make a trip to the library.

Page 23: Deep Web

Changing Research Habits, cont’d

• 61% of faculty reported that credibility/reliability is a key concern with online research resources. About half of faculty reported difficulties in assessing online source credibility.

Page 24: Deep Web

Changing Research Habits, cont’d

• In the focus group, librarians confirmed that library users increasingly demand electronic resources. In most cases, when there is a choice, library users overwhelmingly prefer electronic resources to print. Some undergraduates use electronic sources exclusively. Relative ease of use, availability at all hours, and ease of repurposing text drive this demand.

Page 25: Deep Web

CompletePlanet.com

• A comprehensive listing of dynamic searchable databases. Find databases with highly relevant documents that cannot be crawled or indexed by surface web search engines.

Page 26: Deep Web

Bibliography

Sherman, Chris and Gary Price. The Invisible Web: Uncovering Information Sources Search Engines Can’t See. Medford, New Jersey: Information Today, Inc., 2002.

Bergman, Michael K. “White Paper: The Deep Web: Surfacing Hidden Value.” The Journal of Electronic Publishing. Vol. 7, Issue 1. Aug. 2001. <http://www.press.umich.edu/jep/07-01/bergman.html>.

Koll, Matthew. “Annual Meeting Coverage: Track 3: Information Retrieval.” Bulletin of the American Society for Information Science. Vol. 26, No. 2. Dec./Jan. 2000. <http://www.asis.org/Bulletin/Jan-00/track_3.html>.

Hafner, Katie. “Old Search Engine, the Library, Tries to Fit Into a Google World.” The New York Times. A1. 21 July 2004. LexisNexis Scholastic Edition. 1 Feb. 2005. <http://web.lexis-nexis.com>.

Clyde, Laurel A. “Search Engines are Improving but they Still Can’t Find Everything.” Teacher Librarian. 1 June 2003. Electric Library. 1 Feb. 2005. <http://elibrary.bigchalk.com>.