Transcript
Page 1: Web Scraping Tools

Overview of Python web scraping tools

Maik Röder, Barcelona Python Meetup Group

17.05.2012

Friday, May 18, 2012

Page 2: Web Scraping Tools

Data Scraping

• Automated Process

• Explore and download pages

• Grab content

• Store in a database or in a text file
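The steps above can be sketched end to end with today's standard library (Python 3); the HTML string stands in for a downloaded page, and `links.txt` is an illustrative output file:

```python
from html.parser import HTMLParser

# Stand-in for a page downloaded with urllib (illustrative content)
page = '<html><body><a href="/a">A</a> <a href="/b">B</a></body></html>'

class LinkGrabber(HTMLParser):
    """Grab the content we care about: every link target."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

grabber = LinkGrabber()
grabber.feed(page)

# Store the grabbed content in a text file
with open('links.txt', 'w') as f:
    f.write('\n'.join(grabber.links))
```

The same pattern scales up: swap the literal string for a body fetched with urllib, and the flat file for a database.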

Page 3: Web Scraping Tools

urlparse

• Manipulate URL strings

urlparse.urlparse()
urlparse.urljoin()
urlparse.urlunparse()
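In Python 3 these functions moved into urllib.parse; a minimal sketch of the same three calls (the URLs are illustrative):

```python
from urllib.parse import urlparse, urljoin, urlunparse

# Split a URL into its components
parts = urlparse('http://www.wunderground.com/history/index.html?day=17#top')
print(parts.netloc)   # www.wunderground.com
print(parts.path)     # /history/index.html

# Resolve a relative link against the page it appeared on
print(urljoin('http://example.com/a/b.html', '../c.html'))
# -> http://example.com/c.html

# Rebuild a URL from its six components
print(urlunparse(('http', 'example.com', '/x', '', 'q=1', '')))
# -> http://example.com/x?q=1
```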

Page 4: Web Scraping Tools

urllib

• Download data through different protocols

• HTTP, FTP, ...

urllib.urlopen()
urllib.urlretrieve()
urllib.urlencode()
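In Python 3 this module became urllib.request (with urlencode in urllib.parse). A sketch of the modern equivalents, using a file:// URL so it runs without a network connection; the temp file and query parameters are illustrative:

```python
import tempfile
from urllib.parse import urlencode
from urllib.request import urlopen, urlretrieve

# Build a query string for a GET request
query = urlencode({'q': 'Barcelona', 'units': 'metric'})
url = 'http://example.com/search?' + query

# urlopen/urlretrieve speak several protocols; file:// keeps
# the sketch self-contained
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write('<html>hello</html>')
    path = f.name

with urlopen('file://' + path) as resp:
    data = resp.read().decode()
print(data)  # -> <html>hello</html>

# Download straight to a local file
copy, headers = urlretrieve('file://' + path, path + '.copy')
```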

Page 5: Web Scraping Tools

Scrape a web site

• Example: http://www.wunderground.com/

Page 7: Web Scraping Tools

BeautifulSoup

• HTML/XML parser

• designed for quick turnaround projects like screen-scraping

• http://www.crummy.com/software/BeautifulSoup

Page 8: Web Scraping Tools

BeautifulSoup

from BeautifulSoup import BeautifulSoup

a = BeautifulSoup(d).findAll('a')

[x['href'] for x in a]

Page 9: Web Scraping Tools

Faster BeautifulSoup

from BeautifulSoup import BeautifulSoup, SoupStrainer

p = SoupStrainer('a')

a = BeautifulSoup(d, parseOnlyThese=p)

[x['href'] for x in a]

Page 10: Web Scraping Tools

Inspect the Element

• Inspect the Maximum temperature

Page 11: Web Scraping Tools

Find the node

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class': 'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23

Page 12: Web Scraping Tools

htmllib.HTMLParser

• Interesting only for historical reasons

• based on sgmllib
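htmllib and sgmllib were later removed in Python 3; the surviving stdlib descendant of this event-driven style is html.parser. A minimal sketch (the TitleParser class and sample markup are my own illustration):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside <title> as the events stream by."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

p = TitleParser()
p.feed('<html><head><title>Weather</title></head></html>')
print(p.title)  # -> Weather
```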

Page 13: Web Scraping Tools

html5lib

• Using the custom simpletree format

• a built-in DOM-ish tree type (pythonic idioms)

from html5lib import parse
from html5lib import treebuilders
e = treebuilders.simpletree.Element
i = parse(d)
a = [x for x in i if isinstance(x, e) and x.name == 'a']
[x.attributes['href'] for x in a]

Page 14: Web Scraping Tools

lxml

• Library for processing XML and HTML

• Based on C libraries

sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev

• Extends the ElementTree API

• e.g. with XPath

Page 15: Web Scraping Tools

lxml

from lxml import etree
t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()
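Because lxml extends the ElementTree API, much of this traversal can also be sketched against the stdlib xml.etree.ElementTree (the XML string here is illustrative). Note the stdlib version supports only a limited XPath subset via findall() and has no getparent():

```python
import xml.etree.ElementTree as etree

xml = '<doc><a href="/x">X</a><p><a href="/y">Y</a></p></doc>'
root = etree.fromstring(xml)

# './/a' finds all <a> descendants, the stdlib analogue of
# lxml's t.xpath('//a')
for node in root.findall('.//a'):
    print(node.tag, node.get('href'), node.text)
```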

Page 16: Web Scraping Tools

twill

• Simple

• No JavaScript

• http://twill.idyll.org

• Some more interesting concepts

• Pages, Scenarios

• State Machines

Page 17: Web Scraping Tools

twill

• Commonly used methods:

go()
code()
show()
showforms()
formvalue() (or fv())
submit()

Page 18: Web Scraping Tools

Twill

>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()

Page 20: Web Scraping Tools

Twill

>>> twill.config("acknowledge_equiv_refresh", "false")
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'

Page 21: Web Scraping Tools

mechanize

• Stateful programmatic web browsing

• navigation history

• HTML form state

• cookies

• ftp:, http: and file: URL schemes

• redirections

• proxies

• Basic and Digest HTTP authentication

Page 24: Web Scraping Tools

Selenium

• http://seleniumhq.org

• Support for JavaScript

Page 25: Web Scraping Tools

Selenium

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time

Page 27: Web Scraping Tools

PhantomJS

• http://www.phantomjs.org/
