Page 1:

WEB CRAWLING HEIDI JAUHIAINEN

[email protected]

Page 2:

Loading a webpage

https://www.helsinki.fi/en/research
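As a minimal sketch of what loading a webpage means to a crawler, here is a plain HTTP GET in Python using only the standard library; the User-Agent name is an assumption, not something from the slides:

```python
from urllib.request import urlopen, Request

# Fetch one page; a real crawler would also handle redirects,
# errors, and non-HTML content types.
req = Request("https://www.helsinki.fi/en/research",
              headers={"User-Agent": "example-bot/0.1"})  # assumed bot name
with urlopen(req, timeout=10) as response:
    html = response.read().decode("utf-8", errors="replace")
print(html[:200])  # first 200 characters of the page source
```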

Page 3:

Web crawler

Page 4:

Web crawling used by:

•  Search engines

•  Internet Archive, national libraries, etc.

•  Common Crawl

Page 5:

Crawling

•  For each URL in the queue:
  •  download the file
  •  parse links from the file
  •  for each link found:
    •  add it to the end of the queue
  •  handle the file
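A minimal Python sketch of the loop above, using only the standard library; the bot name is an assumption, link extraction is simplified, and the politeness rules of the next slide are left out:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen, Request

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def handle(url, html):
    """'Handle the file': store, index, or analyse the page."""
    print(url, len(html))

def crawl(seed, max_pages=10):
    queue = deque([seed])              # the frontier of URLs to fetch
    seen = {seed}                      # never queue the same URL twice
    while queue and max_pages > 0:
        url = queue.popleft()          # take from the front of the queue
        try:
            req = Request(url, headers={"User-Agent": "example-bot/0.1"})
            with urlopen(req, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                   # skip unreachable pages
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:      # each link found ...
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute) # ... goes to the end of the queue
        handle(url, html)
        max_pages -= 1

crawl("https://www.helsinki.fi/en/research", max_pages=5)
```

Taking from the front and appending to the back makes this breadth-first; slide 9 lists alternative orderings.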

Page 6:

Good spider

•  Is polite
  •  Does not strain a server
  •  Obeys robots.txt (sketched below)
•  Avoids traps
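A hedged sketch of the politeness rules, using Python's standard urllib.robotparser; the bot name and the one-second fallback delay are assumptions:

```python
import time
from urllib.robotparser import RobotFileParser

AGENT = "example-bot/0.1"                      # assumed bot name

rp = RobotFileParser("https://www.helsinki.fi/robots.txt")
rp.read()                                      # fetch and parse robots.txt

url = "https://www.helsinki.fi/en/research"
if rp.can_fetch(AGENT, url):                   # obey robots.txt
    # Do not strain the server: wait between requests to the same host.
    time.sleep(rp.crawl_delay(AGENT) or 1.0)   # assumed 1 s fallback
    ...                                        # fetch as in the sketch above
# Traps (e.g. endless calendar pages) are usually avoided by capping
# URL length, path depth, and the number of pages fetched per host.
```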

Page 7:

Top 50 open source web crawlers for data mining:
http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/

Page 8:

Crawls

•  Time
  •  Periodic crawls, snapshots
  •  Continuous crawls
•  Scope
  •  Universal crawling
  •  Focused crawling (see the filter sketch below)
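For focused crawling, the frontier accepts only URLs inside the chosen scope; a minimal sketch of such a filter (the domain and path keyword are illustrative assumptions):

```python
from urllib.parse import urlparse

def in_scope(url):
    """Focused crawl: keep only pages on one site, about one topic."""
    parsed = urlparse(url)
    return (parsed.netloc.endswith("helsinki.fi")   # assumed target site
            and "research" in parsed.path)          # assumed topic cue

# In the crawl loop: queue.append(url) only if in_scope(url).
```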

Page 9:

Crawl order

•  Breadth-first
•  Page importance/relevance (sketched below)
  •  Backlink count
  •  PageRank
  •  Forward link count
  •  Location metrics
  •  OPIC (On-line Page Importance Computation)
•  Larger-sites-first
•  FICA
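Breadth-first needs only the FIFO queue from the earlier sketch; importance-based orderings replace it with a priority queue. A sketch using backlink count as the score (PageRank, OPIC, etc. would plug in the same way); this is an illustration, not a prescribed implementation:

```python
import heapq

class Frontier:
    """Priority frontier: URLs with more known backlinks come out first."""
    def __init__(self):
        self.heap = []
        self.backlinks = {}                 # url -> backlinks seen so far

    def add(self, url):
        count = self.backlinks.get(url, 0) + 1
        self.backlinks[url] = count
        # heapq is a min-heap, so negate the score to pop the best first.
        heapq.heappush(self.heap, (-count, url))

    def pop(self):
        while self.heap:
            score, url = heapq.heappop(self.heap)
            if -score == self.backlinks.get(url):
                return url                  # current entry for this URL
            # otherwise the entry is stale (the count grew); skip it
        return None
```

A real crawler would also track which URLs have already been fetched, so a popped URL is not re-queued later.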

Page 10:

Distributing
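The slide text was not transcribed beyond the title. One common way to distribute a crawl, sketched here as an assumption rather than the slide's content, is to assign each host to one worker by hashing the hostname, so every worker can enforce per-host politeness on its own:

```python
import hashlib
from urllib.parse import urlparse

NUM_WORKERS = 4                            # assumed cluster size

def worker_for(url):
    """All URLs of one host go to the same worker."""
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode()).hexdigest()
    return int(digest, 16) % NUM_WORKERS
```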

Page 11:

Mohr, Gordon, et al. An Introduction to Heritrix.

Page 12:

Page 13:

Page 14:

Page 15:

Crawling for research data

•  Choose a crawler that suits your needs
•  Make sure to obey time limits and robots.txt
•  Add information about the crawl on the (project’s) home page
  •  e.g. www.pagename/webmasters
  •  If possible, add this address to the crawler’s information (see the sketch below)
•  Give your bot a name
•  If doing intensive crawling:
  •  Inform your internet provider / university IT department
  •  Inform Funet CERT
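A sketch of how the bot name and the info-page address can be put into the User-Agent string; the name, address, and e-mail below are placeholders, not values given in the slides:

```python
from urllib.request import urlopen, Request

# Placeholder identity: replace with your project's real name,
# info page, and contact address.
USER_AGENT = ("my-research-bot/1.0 "
              "(+http://www.pagename/webmasters; [email protected])")

req = Request("https://www.helsinki.fi/en/research",
              headers={"User-Agent": USER_AGENT})
with urlopen(req, timeout=10) as resp:
    html = resp.read().decode("utf-8", errors="replace")
```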