Dive into Scrapy
Juan Riaza
@juanriaza
CHAPTER 1
- THE FANTABULOUS WORLD OF DATA -
Sources of Data
RSS
Email
Internet
Documents
APIs
Tradeoffs
Most of the world hasn't embraced API-centric development
Most of the world's interesting data isn't API accessible
API Tradeoffs
Throttling
Limited Data
Availability
They know you
The web is thoroughly broken
tl;dr
Web Scraping
“is a computer software technique of extracting information from websites”
CHAPTER 2
- BASIC TOOLSET FOR THE CURIOUS -
HTTP
Methods: GET, POST, PUT, HEAD…
Status Codes: 2XX, 3XX, 4XX, 418, 5XX, 999
Headers, Query String: Accept-Language, User-Agent…
Persistence: Cookies
Developer Tools
Elements: search by XPath
Network Inspector: filter by XHR
Resources: Cookies
Emulate mobile devices, mobile sites
Extensions: Hola, JS Switch…
HTTP Libraries
urllib2 (stdlib)
requests-oauthlib
python-requests
requestb.in
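A quick sketch with python-requests that touches the HTTP basics above (the URL and header values are just illustrative):

import requests

# A Session persists cookies across requests
session = requests.Session()
resp = session.get('http://httpbin.org/get',
                   headers={'Accept-Language': 'en',
                            'User-Agent': 'curious-crawler/0.1'})
print(resp.status_code)              # 2XX on success, 4XX/5XX on errors
print(resp.headers['Content-Type'])  # response headers
print(session.cookies)               # cookies persisted by the session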
HTML is not a regular language
HTML Parsers
lxml: pythonic binding for the C libraries libxml2 and libxslt
BeautifulSoup: pluggable parsers (html.parser, lxml, html5lib)
Those who don't understand XPath are cursed to reinvent it, poorly.
# -*- coding: utf-8 -*-
import requests
import lxml.html

req = requests.get('https://fosdem.org/2015/schedule/events/')
tree = lxml.html.fromstring(req.text)
for tr in tree.xpath('//tr'):
    content = tr.xpath('./td[1]/a/text()')
    name = tr.xpath('./td[2]/a/text()')
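The same page with BeautifulSoup, as a rough sketch (the html.parser backend and the table layout are assumptions carried over from the lxml example):

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

req = requests.get('https://fosdem.org/2015/schedule/events/')
soup = BeautifulSoup(req.text, 'html.parser')
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds) >= 2 and tds[0].a and tds[1].a:
        content = tds[0].a.get_text()
        name = tds[1].a.get_text()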
CHAPTER 3
- TOOLSET FOR THE ADVENTUROUS -
Scrapy-ify early on
Maybe you'll need multiple HTTP requests.
Maybe you'll just want testable code.
“An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.”
Healthy Community
6.3k stars, 1.6k forks, 500 watchers on GitHub
@scrapyproject: 1.6k followers
2.7k questions
2k members on the mailing list
Start a project
$ scrapy startproject <name>

fosdem
├── fosdem
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/',
    ]

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
Spiders
Generate the initial Requests: start_urls, start_requests()
In the callback function, you parse the response and return Item objects, Request objects, or an iterable of both.
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        for h3 in response.xpath('//h3/text()').extract():
            yield {'title': h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Interactive Shell
Invaluable tool for developing and debugging your spiders
$ scrapy shell <url>
Interactive Shell
IPython
Invoking the shell from spiders to inspect responses (scrapy.shell.inspect_response)
Available Scrapy objects: spider, request, sel…
Available shortcuts: shelp(), fetch(), view()
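A minimal sketch of inspect_response inside a spider callback (the spider name and the XPath guard are just illustrative):

import scrapy
from scrapy.shell import inspect_response

class DebugSpider(scrapy.Spider):
    name = 'debug'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        if not response.xpath('//td'):
            # Drops you into the interactive shell with this response preloaded
            inspect_response(response, self)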
Avoid getting banned
Rotate your user agent
Disable cookies
Download delays
Use a pool of rotating IPs
Crawlera
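Most of these map to a handful of settings; a settings.py sketch (the values are illustrative, and the middleware path is hypothetical; Crawlera ships its own middleware):

# settings.py
COOKIES_ENABLED = False   # disable cookies
DOWNLOAD_DELAY = 2        # seconds between requests to the same site
# User-agent rotation and IP pools are handled by downloader middlewares, e.g.:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateUserAgentMiddleware': 400,  # hypothetical
}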
Everything else
Items, ItemLoaders, Middlewares, Pipelines, Stats
Feed Exports: JSON, CSV, XML, S3…
DjangoItem
Testing: Contracts
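Contracts live in the callback's docstring and are run with scrapy check; a small sketch (the URL and expected counts are placeholders):

def parse(self, response):
    """ Checked by `scrapy check` against a live request.

    @url http://www.example.com/
    @returns items 1
    @scrapes title
    """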
from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=255)
    age = models.IntegerField()

from scrapy.contrib.djangoitem import DjangoItem

class PersonItem(DjangoItem):
    django_model = Person
DjangoItem
scrapinghub/pycon-speakers
CHAPTER 4
- DEPLOYMENT -
Scrapyd
Provides a JSON web service to upload new project versions (as eggs) and schedule spiders
$ scrapy deploy
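Once deployed, runs can be scheduled through Scrapyd's JSON API; a sketch in Python (the project and spider names carry over from the examples above, and Scrapyd's default port is 6800):

import requests

# Schedule a spider run through Scrapyd's JSON API
resp = requests.post('http://localhost:6800/schedule.json',
                     data={'project': 'fosdem', 'spider': 'example.com'})
print(resp.json())  # e.g. a status and a job id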
Scrapy Cloud
Scrapy Cloud, our platform-as-a-service offering, allows you to easily build crawlers, deploy them instantly and scale them on demand. Watch your Scrapy spiders as they run and collect data, and review that data through our beautiful frontend.
CHAPTER 5
- ABOUT US -
TONS of Open Source
Mandatory Sales Slide
Professional Services
Products: Scrapy Cloud, Crawlera
We’re hiring!