Crawling the Web

Web pages
•A few thousand characters long
•Served over the Internet using the Hypertext Transfer Protocol (HTTP)
•Viewed at the client end using "browsers"

Crawler
•Fetches pages to the computer
•At the computer, automatic programs can analyze hypertext documents
Mining the Web Chakrabarti and Ramakrishnan 2
HTML (HyperText Markup Language)

Lets the author
•specify layout and typeface
•embed diagrams
•create hyperlinks

A hyperlink is expressed as an anchor tag with an HREF attribute. HREF names another page using a Uniform Resource Locator (URL):
•URL = protocol field ("http") + server hostname ("www.cse.iitb.ac.in") + file path ("/", the root of the published file system)
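The three URL fields listed above can be pulled apart with Python's standard `urllib.parse` (the hostname is the slide's own example):

```python
from urllib.parse import urlparse

# Split a URL into the three fields described above.
url = "http://www.cse.iitb.ac.in/"
parts = urlparse(url)

print(parts.scheme)   # protocol field: http
print(parts.netloc)   # server hostname: www.cse.iitb.ac.in
print(parts.path)     # file path: / (the root)
```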
HTTP (Hypertext Transfer Protocol)

Built on top of the Transmission Control Protocol (TCP)

Steps (from the client end)
•Resolve the server host name to an Internet address (IP)
–Use the Domain Name Service (DNS)
–DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
•Contact the server using TCP
–Connect to the default HTTP port (80) on the server
–Send the HTTP request header (e.g.: GET)
–Fetch the response header
–MIME (Multipurpose Internet Mail Extensions): a meta-data standard for email and Web content transfer
–Fetch the HTML page
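The request/response exchange above can be sketched offline in Python: build the GET request a client would send after connecting, then parse a hand-written sample response header (the response text is illustrative, not a real server's output):

```python
# Build the GET request a client sends after the TCP connect.
request = (
    "GET / HTTP/1.0\r\n"
    "Host: www.cse.iitb.ac.in\r\n"
    "\r\n"
).encode("ascii")

# A sample response; the Content-Type line carries the MIME type.
sample_response = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html>...</html>"
)

# Split the response header from the HTML page that follows it.
header, _, body = sample_response.partition("\r\n\r\n")
status_line, *fields = header.split("\r\n")
headers = dict(f.split(": ", 1) for f in fields)

print(status_line)              # HTTP/1.0 200 OK
print(headers["Content-Type"])  # text/html
```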
Crawl "all" Web pages?

Problem: there is no catalog of all accessible URLs on the Web.

Solution:
•Start from a given set of URLs
•Progressively fetch and scan them for new outlinking URLs
•Fetch these pages in turn…
•Submit the text in each page to a text indexing system
•and so on…
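The fetch-scan-repeat loop above can be sketched as a breadth-first traversal over a toy in-memory link graph (the page names and links are made up for illustration; a real crawler fetches over HTTP):

```python
from collections import deque

# Toy in-memory "Web": page -> outlinks (hypothetical data).
WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def crawl(seeds):
    """Start from seed URLs; fetch, scan for outlinks, repeat."""
    frontier = deque(seeds)
    seen = set(seeds)
    while frontier:
        url = frontier.popleft()
        for out in WEB.get(url, []):   # "scan page for outlinks"
            if out not in seen:
                seen.add(out)
                frontier.append(out)   # fetch these pages in turn
    return seen

print(sorted(crawl(["a"])))  # ['a', 'b', 'c', 'd']
```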
Crawling procedure

Simple in principle
•A great deal of engineering goes into industry-strength crawlers
•Industry crawlers crawl a substantial fraction of the Web
•E.g.: AltaVista, Northern Light, Inktomi

No guarantee that all accessible Web pages will be located in this fashion

The crawler may never halt…
•pages will be added continually even as it is running
Crawling overheads

Delays involved in
•Resolving the host name in the URL to an IP address using DNS
•Connecting a socket to the server and sending the request
•Receiving the requested page in response

Solution: overlap the above delays by
•fetching many pages at the same time
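One way to overlap these delays, sketched with a thread pool and a simulated fetch (the 0.1 s sleep stands in for the combined DNS + connect + receive latency; the URLs are placeholders):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for one page fetch; the sleep models the
    network latency that the pool overlaps."""
    time.sleep(0.1)
    return url, "<html>...</html>"

urls = ["u1", "u2", "u3", "u4"]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = dict(pool.map(fetch, urls))
elapsed = time.monotonic() - start

# Four overlapped fetches finish in about the time of one
# serial fetch (~0.1 s) instead of four (~0.4 s).
print(len(pages))  # 4
```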
Anatomy of a crawler

Page fetching threads
•Start with DNS resolution
•Finish when the entire page has been fetched

Each page is
•stored in compressed form to disk/tape
•scanned for outlinks

Work pool of outlinks
•maintains network utilization without overloading it
•dealt with by a load manager

Continue till the crawler has collected a sufficient number of pages.
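The compress-and-store and outlink-scan steps can be illustrated with `zlib` and a regular expression (the page bytes and the `href` pattern are simplified for the sketch; a real scanner would use a proper HTML parser):

```python
import re
import zlib

# Hypothetical fetched page; a real crawler receives this from
# a page fetching thread once the transfer completes.
page = b'<html><a href="http://x.org/a">A</a> <a href="http://y.org/b">B</a></html>'

# Store in compressed form (to disk/tape in the slide; bytes here).
stored = zlib.compress(page)

# Scan the (decompressed) page for outlinks to feed the work pool.
outlinks = re.findall(rb'href="([^"]+)"', zlib.decompress(stored))

print(outlinks)  # [b'http://x.org/a', b'http://y.org/b']
```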
Typical anatomy of a large-scale crawler.
Large-scale crawlers: performance and reliability considerations

Need to fetch many pages at the same time
•to utilize the network bandwidth
•a single page fetch may involve several seconds of network latency

Highly concurrent and parallelized DNS lookups

Use of asynchronous sockets
•Explicit encoding of the state of a fetch context in a data structure
•Polling sockets to check for completion of network transfers
•Multi-processing or multi-threading: impractical

Care in URL extraction
•Eliminating duplicates to reduce redundant fetches
•Avoiding "spider traps"
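Duplicate elimination depends on canonicalizing URLs before comparing them. A minimal sketch, using one possible normalization (lowercasing, dropping the default port and the fragment) rather than any standard rule set:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """One illustrative normalization: lowercase scheme and host,
    drop port 80 and the fragment, ensure a non-empty path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port and parts.port != 80:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

seen = set()
for u in ["HTTP://Example.com:80/index.html#top",
          "http://example.com/index.html",
          "http://example.com/other.html"]:
    c = canonicalize(u)
    if c in seen:
        continue          # duplicate: skip the redundant fetch
    seen.add(c)

print(len(seen))  # 2
```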
DNS caching, pre-fetching and resolution

A customized DNS component with…
1. A custom client for address resolution
2. A caching server
3. A prefetching client
Custom client for address resolution

Tailored for concurrent handling of multiple outstanding requests
•allows issuing of many resolution requests together
•polling at a later time for completion of individual requests

Facilitates load distribution among many DNS servers.
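The issue-many-then-poll pattern can be sketched with futures; a hypothetical name-to-IP table stands in for real DNS servers so the example runs offline:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical name-to-IP table standing in for real DNS servers.
FAKE_DNS = {"a.org": "10.0.0.1", "b.org": "10.0.0.2"}

def resolve(host):
    return FAKE_DNS[host]

pool = ThreadPoolExecutor(max_workers=8)

# Issue many resolution requests together...
pending = {host: pool.submit(resolve, host) for host in FAKE_DNS}

# ...then poll later for completion of individual requests.
addresses = {}
while pending:
    for host, fut in list(pending.items()):
        if fut.done():
            addresses[host] = fut.result()
            del pending[host]

pool.shutdown()
print(addresses["a.org"])  # 10.0.0.1
```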
Caching server
•With a large cache, persistent across DNS restarts
•Residing largely in memory if possible
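A minimal sketch of such a caching server: an in-memory table that is written to disk so it survives restarts (a real server would also honor DNS TTLs; the JSON file format is an arbitrary choice for the sketch):

```python
import json
import os
import tempfile

class DNSCache:
    """In-memory name-to-IP cache persisted to disk across restarts."""

    def __init__(self, path):
        self.path = path
        self.table = {}
        if os.path.exists(path):
            with open(path) as f:
                self.table = json.load(f)   # reload after a restart

    def lookup(self, host, resolver):
        if host not in self.table:          # miss: ask upstream
            self.table[host] = resolver(host)
            with open(self.path, "w") as f:
                json.dump(self.table, f)    # persist the new mapping
        return self.table[host]

path = os.path.join(tempfile.mkdtemp(), "dns.json")
cache = DNSCache(path)
cache.lookup("a.org", lambda h: "10.0.0.1")

# A "restarted" cache reads the saved table; no resolver call needed.
restarted = DNSCache(path)
print(restarted.lookup("a.org", lambda h: "unreached"))  # 10.0.0.1
```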
Prefetching client

Steps
1. Parse a page that has just been fetched
2. Extract host names from HREF targets
3. Make DNS resolution requests to the caching server

Usually implemented using UDP (User Datagram Protocol)
•connectionless, packet-based communication protocol
•does not guarantee packet delivery
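The three steps, with UDP for the final hop, can be sketched over the loopback interface (the page fragment and the "caching server" socket are stand-ins; note the client fires datagrams without waiting for replies):

```python
import re
import socket
from urllib.parse import urlsplit

# Steps 1-2: parse a just-fetched page, extract HREF host names.
page = '<a href="http://x.org/a">A</a> <a href="http://y.org/b">B</a>'
hosts = [urlsplit(u).hostname for u in re.findall(r'href="([^"]+)"', page)]

# Stand-in caching server: a UDP socket bound on loopback.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server.settimeout(2.0)
addr = server.getsockname()

# Step 3: one datagram per host name -- connectionless,
# fire-and-forget, with no delivery guarantee.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for h in hosts:
    client.sendto(h.encode(), addr)

received = sorted(server.recvfrom(64)[0].decode() for _ in hosts)
client.close()
server.close()
print(received)  # ['x.org', 'y.org']
```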
Design of the core components: the Crawler class

Purpose: to copy bytes from network sockets to storage media

Three methods express the Crawler's contract with the user:
•pushing a URL to be fetched to the Crawler (fetchPush)
•a termination callback handler (fetchDone), called with the same URL
•a method (start) which starts the Crawler's event loop

Implementation of the Crawler class
•needs two helper classes, called DNS and Fetch
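A minimal sketch of this contract in Python (the real crawler drives asynchronous sockets through its DNS and Fetch helpers, both elided here; the fetch itself is faked so the sketch runs standalone):

```python
from collections import deque

class Crawler:
    """Minimal sketch of the fetchPush / fetchDone / start contract."""

    def __init__(self):
        self.queue = deque()
        self.done = []

    def fetchPush(self, url):
        """Push a URL to be fetched."""
        self.queue.append(url)

    def fetchDone(self, url, page):
        """Termination callback, invoked with the same URL."""
        self.done.append(url)

    def start(self):
        """Event loop: run until the work queue is empty."""
        while self.queue:
            url = self.queue.popleft()
            page = f"<html>{url}</html>"   # stand-in for a real fetch
            self.fetchDone(url, page)

c = Crawler()
c.fetchPush("http://x.org/")
c.fetchPush("http://y.org/")
c.start()
print(c.done)  # ['http://x.org/', 'http://y.org/']
```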