Page 1:

WEB CRAWLING HEIDI JAUHIAINEN

[email protected]

Page 2:

Loading a webpage

https://www.helsinki.fi/en/research
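As a minimal sketch of what loading a webpage means to a crawler, here is a plain HTTP GET in Python using only the standard library; the User-Agent name is an assumption, not something from the slides:

```python
from urllib.request import urlopen, Request

# Fetch one page; a real crawler would also handle redirects,
# errors, and non-HTML content types.
req = Request("https://www.helsinki.fi/en/research",
              headers={"User-Agent": "example-bot/0.1"})  # assumed bot name
with urlopen(req, timeout=10) as response:
    html = response.read().decode("utf-8", errors="replace")
print(html[:200])  # first 200 characters of the page source
```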

Page 3:

Web crawler

Page 4:

Web crawling used by:

•  Search engines

•  Internet Archive, national libraries, etc.

•  Common Crawl

Page 5:

Crawling

•  For each URL in the queue:
  •  download the file
  •  parse links from the file
  •  for each link found:
    •  add it to the end of the queue
  •  handle the file
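A minimal Python sketch of the loop above, using only the standard library; the bot name is an assumption, link extraction is simplified, and the politeness rules of the next slide are left out:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen, Request

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def handle(url, html):
    """'Handle the file': store, index, or analyse the page."""
    print(url, len(html))

def crawl(seed, max_pages=10):
    queue = deque([seed])              # the frontier of URLs to fetch
    seen = {seed}                      # never queue the same URL twice
    while queue and max_pages > 0:
        url = queue.popleft()          # take from the front of the queue
        try:
            req = Request(url, headers={"User-Agent": "example-bot/0.1"})
            with urlopen(req, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                   # skip unreachable pages
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:      # each link found ...
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute) # ... goes to the end of the queue
        handle(url, html)
        max_pages -= 1

crawl("https://www.helsinki.fi/en/research", max_pages=5)
```

Taking from the front and appending to the back makes this breadth-first; slide 9 lists alternative orderings.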

Page 6:

Good spider

•  Is polite
  •  Does not strain a server
  •  Obeys robots.txt (sketched below)
•  Avoids traps
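A hedged sketch of the politeness rules, using Python's standard urllib.robotparser; the bot name and the one-second fallback delay are assumptions:

```python
import time
from urllib.robotparser import RobotFileParser

AGENT = "example-bot/0.1"                      # assumed bot name

rp = RobotFileParser("https://www.helsinki.fi/robots.txt")
rp.read()                                      # fetch and parse robots.txt

url = "https://www.helsinki.fi/en/research"
if rp.can_fetch(AGENT, url):                   # obey robots.txt
    # Do not strain the server: wait between requests to the same host.
    time.sleep(rp.crawl_delay(AGENT) or 1.0)   # assumed 1 s fallback
    ...                                        # fetch as in the sketch above
# Traps (e.g. endless calendar pages) are usually avoided by capping
# URL length, path depth, and the number of pages fetched per host.
```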

Page 7:

Top 50 open source web crawlers for data mining:
http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/

Page 8:

Crawls

•  Time
  •  Periodic crawls, snapshots
  •  Continuous crawls
•  Scope
  •  Universal crawling
  •  Focused crawling (see the filter sketch below)
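For focused crawling, the frontier accepts only URLs inside the chosen scope; a minimal sketch of such a filter (the domain and path keyword are illustrative assumptions):

```python
from urllib.parse import urlparse

def in_scope(url):
    """Focused crawl: keep only pages on one site, about one topic."""
    parsed = urlparse(url)
    return (parsed.netloc.endswith("helsinki.fi")   # assumed target site
            and "research" in parsed.path)          # assumed topic cue

# In the crawl loop: queue.append(url) only if in_scope(url).
```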

Page 9:

Crawl order

•  Breadth-first
•  Page importance/relevance (sketched below)
  •  Backlink count
  •  PageRank
  •  Forward link count
  •  Location metrics
  •  OPIC (On-line Page Importance Computation)
•  Larger-sites-first
•  FICA
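Breadth-first needs only the FIFO queue from the earlier sketch; importance-based orderings replace it with a priority queue. A sketch using backlink count as the score (PageRank, OPIC, etc. would plug in the same way); this is an illustration, not a prescribed implementation:

```python
import heapq

class Frontier:
    """Priority frontier: URLs with more known backlinks come out first."""
    def __init__(self):
        self.heap = []
        self.backlinks = {}                 # url -> backlinks seen so far

    def add(self, url):
        count = self.backlinks.get(url, 0) + 1
        self.backlinks[url] = count
        # heapq is a min-heap, so negate the score to pop the best first.
        heapq.heappush(self.heap, (-count, url))

    def pop(self):
        while self.heap:
            score, url = heapq.heappop(self.heap)
            if -score == self.backlinks.get(url):
                return url                  # current entry for this URL
            # otherwise the entry is stale (the count grew); skip it
        return None
```

A real crawler would also track which URLs have already been fetched, so a popped URL is not re-queued later.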

Page 10:

Distributing
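The slide text was not transcribed beyond the title. One common way to distribute a crawl, sketched here as an assumption rather than the slide's content, is to assign each host to one worker by hashing the hostname, so every worker can enforce per-host politeness on its own:

```python
import hashlib
from urllib.parse import urlparse

NUM_WORKERS = 4                            # assumed cluster size

def worker_for(url):
    """All URLs of one host go to the same worker."""
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode()).hexdigest()
    return int(digest, 16) % NUM_WORKERS
```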

Page 11:

Mohr, Gordon, et al. An Introduction to Heritrix.

Page 12:

Page 13:

Page 14:

Page 15:

Crawling for research data

•  Choose a crawler that suits your needs
•  Make sure to obey time limits and robots.txt
•  Add information about the crawl on the (project’s) home page
  •  e.g. www.pagename/webmasters
  •  If possible, add this address to the crawler’s information (see the sketch below)
•  Give your bot a name
•  If doing intensive crawling:
  •  Inform your internet provider / university IT department
  •  Inform Funet CERT
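A sketch of how the bot name and the info-page address can be put into the User-Agent string; the name, address, and e-mail below are placeholders, not values given in the slides:

```python
from urllib.request import urlopen, Request

# Placeholder identity: replace with your project's real name,
# info page, and contact address.
USER_AGENT = ("my-research-bot/1.0 "
              "(+http://www.pagename/webmasters; [email protected])")

req = Request("https://www.helsinki.fi/en/research",
              headers={"User-Agent": USER_AGENT})
with urlopen(req, timeout=10) as resp:
    html = resp.read().decode("utf-8", errors="replace")
```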