Dive into Scrapy
Juan Riaza
@juanriaza
CHAPTER 1
- THE FANTABULOUS WORLD OF DATA -
Sources of Data
RSS
Email
Internet
Documents
APIs
Tradeoffs
Most of the world hasn't embraced API-centric development
Most of the world's interesting data isn't API accessible
API Tradeoffs
Throttling
Limited Data
Availability
They know you
The web is thoroughly broken
tl;dr
Web Scraping
“is a computer software technique of extracting information from websites”
CHAPTER 2
- BASIC TOOLSET FOR THE CURIOUS -
HTTP
Methods: GET, POST, PUT, HEAD…
Status Codes: 2XX, 3XX, 4XX, 418, 5XX, 999
Headers, Query String: Accept-Language, User-Agent…
Persistence: Cookies
Developer Tools
Elements: search by XPath
Network Inspector: filter by XHR
Resources: Cookies
Emulate mobile devices, mobile sites
Extensions: Hola, JS Switch…
HTTP Libraries
urllib2 (stdlib)
requests-oauthlib
python-requests
requestb.in
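A quick sketch with python-requests that touches the HTTP basics above (the URL and header values are just illustrative):

import requests

# A Session persists cookies across requests
session = requests.Session()
resp = session.get('http://httpbin.org/get',
                   headers={'Accept-Language': 'en',
                            'User-Agent': 'curious-crawler/0.1'})
print(resp.status_code)              # 2XX on success, 4XX/5XX on errors
print(resp.headers['Content-Type'])  # response headers
print(session.cookies)               # cookies persisted by the session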
HTML is not a regular language
HTML Parsers
lxml: pythonic binding for the C libraries libxml2 and libxslt
BeautifulSoup: pluggable parsers (html.parser, lxml, html5lib)
Those who don't understand XPath are cursed to reinvent it, poorly.
# -*- coding: utf-8 -*-
import requests
import lxml.html

req = requests.get('https://fosdem.org/2015/schedule/events/')
tree = lxml.html.fromstring(req.text)
for tr in tree.xpath('//tr'):
    content = tr.xpath('./td[1]/a/text()')
    name = tr.xpath('./td[2]/a/text()')
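The same page with BeautifulSoup, as a rough sketch (the html.parser backend and the table layout are assumptions carried over from the lxml example):

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

req = requests.get('https://fosdem.org/2015/schedule/events/')
soup = BeautifulSoup(req.text, 'html.parser')
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds) >= 2 and tds[0].a and tds[1].a:
        content = tds[0].a.get_text()
        name = tds[1].a.get_text()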
CHAPTER 3
- TOOLSET FOR THE ADVENTUROUS -
Scrapy-ify early on
Maybe you'll need multiple HTTP requests.
Maybe you'll just want testable code.
“An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.”
Healthy Community
6.3k stars, 1.6k forks, 500 watchers on GitHub
@scrapyproject: 1.6k followers
2.7k questions
2k members on the mailing list
Start a project
$ scrapy startproject <name>

fosdem
├── fosdem
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/',
    ]

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
Spiders
Generate the initial Requests: start_urls, start_requests()
In the callback function, you parse the response and return Item objects, Request objects, or an iterable of both.
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        for h3 in response.xpath('//h3/text()').extract():
            yield {'title': h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Interactive Shell
Invaluable tool for developing and debugging your spiders
$ scrapy shell <url>
Interactive Shell
IPython
Invoking the shell from spiders to inspect responses (scrapy.shell.inspect_response)
Available Scrapy objects: spider, request, sel…
Available shortcuts: shelp(), fetch(), view()
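A minimal sketch of inspect_response inside a spider callback (the spider name and the XPath guard are just illustrative):

import scrapy
from scrapy.shell import inspect_response

class DebugSpider(scrapy.Spider):
    name = 'debug'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        if not response.xpath('//td'):
            # Drops you into the interactive shell with this response preloaded
            inspect_response(response, self)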
Avoid getting banned
Rotate your user agent
Disable cookies
Download delays
Use a pool of rotating IPs
Crawlera
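Most of these map to a handful of settings; a settings.py sketch (the values are illustrative, and the middleware path is hypothetical; Crawlera ships its own middleware):

# settings.py
COOKIES_ENABLED = False   # disable cookies
DOWNLOAD_DELAY = 2        # seconds between requests to the same site
# User-agent rotation and IP pools are handled by downloader middlewares, e.g.:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateUserAgentMiddleware': 400,  # hypothetical
}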
Everything else
Items, ItemLoaders, Middlewares, Pipelines, Stats
Feed Exports: JSON, CSV, XML, S3…
DjangoItem
Testing: Contracts
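Contracts live in the callback's docstring and are run with scrapy check; a small sketch (the URL and expected counts are placeholders):

def parse(self, response):
    """ Checked by `scrapy check` against a live request.

    @url http://www.example.com/
    @returns items 1
    @scrapes title
    """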
from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=255)
    age = models.IntegerField()

from scrapy.contrib.djangoitem import DjangoItem

class PersonItem(DjangoItem):
    django_model = Person
DjangoItem
scrapinghub/pycon-speakers
CHAPTER 4
- DEPLOYMENT -
Scrapyd
Provides a JSON web service to upload new project versions (as eggs) and schedule spiders
$ scrapy deploy
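Once deployed, runs can be scheduled through Scrapyd's JSON API; a sketch in Python (the project and spider names carry over from the examples above, and Scrapyd's default port is 6800):

import requests

# Schedule a spider run through Scrapyd's JSON API
resp = requests.post('http://localhost:6800/schedule.json',
                     data={'project': 'fosdem', 'spider': 'example.com'})
print(resp.json())  # e.g. a status and a job id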
Scrapy Cloud
Scrapy Cloud, our platform-as-a-service offering, allows you to easily build crawlers, deploy them instantly and scale them on demand. Watch your Scrapy spiders as they run and collect data, and review that data through our beautiful frontend.
CHAPTER 5
- ABOUT US -
TONS of Open Source
Mandatory Sales Slide
Professional Services
Products: Scrapy Cloud, Crawlera
We’re hiring!