Apache Nutch: Web Crawling and Data Gathering
Steve Watt - @wattsteve
IBM Big Data Lead
Data Day Austin
May 08, 2015
Topics
Introduction
The Big Data Analytics Ecosystem
Load Tooling
How is Crawl data being used?
Web Crawling - Considerations
Apache Nutch Overview
Apache Nutch Crawl Lifecycle, Setup and Demos
The Offline (Analytics) Big Data Ecosystem
Load Tooling brings Web Content and Your Content into Hadoop
On top of Hadoop: Data Catalogs (Find), Analytics Tooling (Analyze, Visualize), Export Tooling (Consume)
Load Tooling - Data Gathering Patterns and Enablers
Web Content
– Downloading – Amazon Public DataSets / InfoChimps
– Stream Harvesting – Collecta / Roll-your-own (Twitter4J)
– API Harvesting – Roll your own (Facebook REST Query)
– Web Crawling – Nutch
Your Content
– Copy from FileSystem
– Load from Database - Sqoop
– Event Collection Frameworks - Scribe and Flume
How is Crawl data being used?
Build your own search engine
– Built-in Lucene indexes for querying
– Solr integration for multi-faceted search
Analytics
– Selective filtering and extraction with data from a single provider
– Joining datasets from multiple providers for further analytics
– Event Portal example: Is Austin really a startup town?
Extension of the mashup paradigm - “Content Providers cannot predict how their data will be re-purposed”
Web Crawling - Considerations
Robots.txt
Facebook lawsuit against API Harvester
“No Crawling without written approval” in Mint.com Terms of Use
What if the web had as many crawlers as Apache Web Servers?
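Honoring robots.txt is the first of these considerations, and Nutch does so by default. The sketch below is not Nutch's own robots handling (which supports user-agent groups, Allow rules and more); it is a hypothetical, minimal illustration of the core idea that "User-agent: *" Disallow lines are prefix rules a polite crawler must check before fetching.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt Disallow matcher, for illustration only.
// Only "User-agent: *" groups and Disallow prefix rules are handled.
public class RobotsSketch {
    private final List<String> disallowed = new ArrayList<>();

    public RobotsSketch(String robotsTxt) {
        boolean inStarGroup = false;
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                // A new group starts; track whether it applies to all agents
                inStarGroup = line.substring(11).trim().equals("*");
            } else if (inStarGroup && line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) disallowed.add(path);
            }
        }
    }

    // A path is fetchable unless it falls under a disallowed prefix
    public boolean canFetch(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```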
Apache Nutch – What is it?
Apache Nutch Project – nutch.apache.org
– Hadoop + Web Crawler + Lucene
A Hadoop-based web crawler? How does that work?
Apache Nutch Overview
Seeds and Crawl Filters
Crawl Depths
Fetch Lists and Partitioning
Segments - Segment Reading using Hadoop
Indexing / Lucene
Web Application for Querying
Apache Nutch - Web Application
Crawl Lifecycle
Inject → Generate → Fetch → CrawlDB Update → LinkDB → Index → Dedup → Merge
(Generate, Fetch and CrawlDB Update repeat once per crawl depth)
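The lifecycle above can be sketched as a single-machine loop over an in-memory link graph. This is a hypothetical toy model, not Nutch code: "inject" seeds a crawl DB, "generate" builds a fetch list of unfetched URLs, "fetch" looks up outlinks, and "update" folds discoveries back into the crawl DB, once per depth.

```java
import java.util.*;

// Toy model of the Nutch crawl loop over an in-memory "web"
// (a map from URL to its outlinks). Illustration only.
public class CrawlLoopSketch {
    public static Set<String> crawl(Map<String, List<String>> web,
                                    List<String> seeds, int depth) {
        Set<String> crawlDb = new LinkedHashSet<>(seeds); // inject
        Set<String> fetched = new LinkedHashSet<>();
        for (int d = 0; d < depth; d++) {
            List<String> fetchList = new ArrayList<>();   // generate
            for (String url : crawlDb)
                if (!fetched.contains(url)) fetchList.add(url);
            for (String url : fetchList) {                // fetch
                fetched.add(url);
                for (String out : web.getOrDefault(url, List.of()))
                    crawlDb.add(out);                     // crawldb update
            }
        }
        return fetched;
    }
}
```

In real Nutch each of these phases is a MapReduce job over the crawl DB and segments rather than an in-memory loop.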
Single Process Web Crawling
- Create the seed file and copy it into a “urls” directory
- Export JAVA_HOME
- Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (Usually via domain)
- Edit the conf/nutch-site.xml and specify an http.agent.name
- bin/nutch crawl urls -dir crawl -depth 2
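The nutch-site.xml edit in the steps above needs only one property; a minimal example (the agent name value here is a placeholder, pick your own):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Identifies your crawler to web servers; Nutch will refuse
       to crawl until http.agent.name is set. -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>
```

In conf/crawl-urlfilter.txt, a line such as `+^http://([a-z0-9]*\.)*example.com/` is the usual way to constrain the crawl to a single domain.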
D E M O
Distributed Web Crawling
- The full Nutch distribution is overkill if you already have a Hadoop cluster. It's also not how you would really integrate with Hadoop these days, but there is some history to consider. The Nutch wiki has a distributed setup guide.
- Why orchestrate your crawl?
- How?
– Create the seed file and copy it into a “urls” directory, then copy the directory up to HDFS
– Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (usually via domain)
– Copy conf/nutch-site.xml, conf/nutch-default.xml, conf/nutch-conf.xml and conf/crawl-urlfilter.txt to the Hadoop conf directory
– Restart Hadoop so the new files are picked up in the classpath
Distributed Web Crawling
- Code Review: org.apache.nutch.crawl.Crawl
- Orchestrated Crawl Example (Step 1 - Inject):
bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Injector crawl/crawldb urls
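Inject is only the first step. An orchestrated crawl then repeats generate/fetch/update once per depth, mirroring what org.apache.nutch.crawl.Crawl does internally. A sketch of that loop, assuming a running Hadoop cluster and the Nutch 1.2 job classes (verify the class names and arguments against your Nutch version; this is an outline, not a tested script):

```shell
# One generate/fetch/update round per crawl depth
for depth in 1 2; do
  # Generate a fetch list into a new segment
  bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Generator crawl/crawldb crawl/segments
  # Pick up the segment directory that was just created
  segment=$(bin/hadoop fs -ls crawl/segments | tail -1 | awk '{print $NF}')
  # Fetch the pages on the fetch list
  bin/hadoop jar nutch-1.2.0.job org.apache.nutch.fetcher.Fetcher "$segment"
  # Fold newly discovered links back into the crawl DB
  bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.CrawlDb crawl/crawldb "$segment"
done
```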
D E M O
Segment Reading
Segment Readers
The SegmentReader class is not all that useful. But here it is anyway:
– bin/nutch readseg -list crawl/segments/20110128170617
– bin/nutch readseg -dump crawl/segments/20110128170617 dumpdir
What you really want to do is process each crawled page in M/R as an individual record
– SequenceFileInputFormat over Nutch HDFS segments FTW
– The RecordReader returns Content objects as values
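A sketch of such a job, assuming the Nutch 1.x segment layout (each segment's content subdirectory is a SequenceFile of Text keys and org.apache.nutch.protocol.Content values) and the classic org.apache.hadoop.mapred API that Nutch 1.x itself uses. It needs the Nutch and Hadoop jars on the classpath, so treat it as an outline rather than a drop-in program:

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.protocol.Content;

// Maps over the <url, Content> records in a segment's "content"
// directory and emits url -> content type, as an example of
// per-page processing. Sketch only.
public class SegmentContentJob {
    public static class PageMapper extends MapReduceBase
            implements Mapper<Text, Content, Text, Text> {
        public void map(Text url, Content page,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // page.getContent() gives the raw fetched bytes;
            // here we just emit the content type per URL
            out.collect(url, new Text(page.getContentType()));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(SegmentContentJob.class);
        job.setInputFormat(SequenceFileInputFormat.class);
        // e.g. crawl/segments/20110128170617/content
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(PageMapper.class);
        job.setNumReduceTasks(0); // map-only: one output record per page
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        JobClient.runJob(job);
    }
}
```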
Code Walkthrough
D E M O
Thanks
Questions?
Steve Watt - [email protected]
Twitter: @wattsteve
Blog: stevewatt.blogspot.com
austinhug.blogspot.com