Overview of Python web scraping tools
Maik Röder
Barcelona Python Meetup Group
17.05.2012
Friday, May 18, 2012
Data Scraping
• Automated Process
• Explore and download pages
• Grab content
• Store in a database or in a text file
urlparse
• Manipulate URL strings
urlparse.urlparse()
urlparse.urljoin()
urlparse.urlunparse()
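A quick sketch of the three calls. In Python 3 the `urlparse` module was renamed `urllib.parse`, which is used here so the example runs on a current interpreter:

```python
from urllib.parse import urlparse, urljoin, urlunparse  # Python 3 home of urlparse

# Split a URL into its named components.
parts = urlparse("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
print(parts.netloc)  # www.wunderground.com
print(parts.path)    # /history/airport/BCN/2007/5/17/DailyHistory.html

# Resolve a relative link against a page URL, as a crawler would.
print(urljoin("http://www.wunderground.com/history/", "../about.html"))
# http://www.wunderground.com/about.html

# Rebuild a URL from its six components (scheme, netloc, path, params, query, fragment).
print(urlunparse(("http", "example.org", "/index.html", "", "q=1", "")))
# http://example.org/index.html?q=1
```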
urllib
• Download data through different protocols
• HTTP, FTP, ...
urllib.urlencode()
urllib.urlopen()
urllib.urlretrieve()
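A no-network sketch of the two download calls. Python 3 moved them to `urllib.request`; a temporary local file and the file: scheme stand in for a web server here:

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen, urlretrieve  # Python 3 locations

# Create a small local file so the demo needs no network access.
fd, src = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w") as f:
    f.write("<title>demo</title>")
url = Path(src).as_uri()  # file:///... URL for the temp file

# urlopen() returns a file-like response object for any supported scheme.
with urlopen(url) as resp:
    page = resp.read().decode()

# urlretrieve() downloads straight to a local path.
dest, _headers = urlretrieve(url, src + ".copy")

print(page)  # <title>demo</title>
os.remove(src)
os.remove(dest)
```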
Scrape a web site
• Example: http://www.wunderground.com/
Preparation
>>> from StringIO import StringIO
>>> from urllib2 import urlopen
>>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
>>> p = f.read()
>>> d = StringIO(p)
>>> f.close()
BeautifulSoup
• HTML/XML parser
• Designed for quick-turnaround projects like screen scraping
• http://www.crummy.com/software/BeautifulSoup
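The same extraction in today's package (a minimal sketch; assumes the modern `bs4` distribution is installed — the project renamed the import from `BeautifulSoup` to `bs4` and `findAll` to `find_all`):

```python
from bs4 import BeautifulSoup  # modern package name; the slides use BeautifulSoup 3

html = '<p><a href="/one">1</a> <a href="/two">2</a></p>'
# Explicitly pick the stdlib parser; bs4 can also use lxml or html5lib.
soup = BeautifulSoup(html, "html.parser")

# Collect the href attribute of every anchor, in document order.
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)  # ['/one', '/two']
```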
BeautifulSoup
from BeautifulSoup import *
a = BeautifulSoup(d).findAll('a')
[x['href'] for x in a]
Faster BeautifulSoup
from BeautifulSoup import *
p = SoupStrainer('a')
a = BeautifulSoup(d, parseOnlyThese=p)
[x['href'] for x in a]
Inspect the Element
• Inspect the Maximum temperature
Find the node
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class': 'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23
htmllib.HTMLParser
• Interesting only for historical reasons
• based on sgmllib
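For comparison, the same link-collecting idea with `html.parser`, the Python 3 successor to htmllib (a minimal sketch):

```python
from html.parser import HTMLParser  # stdlib replacement for htmllib in Python 3

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

p = LinkCollector()
p.feed('<p><a href="/one">1</a> <a href="/two">2</a></p>')
print(p.links)  # ['/one', '/two']
```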
html5lib
• Using the custom simpletree format
• A built-in DOM-ish tree type (pythonic idioms)
from html5lib import parse
from html5lib import treebuilders
e = treebuilders.simpletree.Element
i = parse(d)
a = [x for x in i if isinstance(x, e) and x.name == 'a']
[x.attributes['href'] for x in a]
lxml
• Library for processing XML and HTML
• Based on C libraries:
sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev
• Extends the ElementTree API
• e.g. with XPath
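A minimal sketch of the XPath idea using the stdlib `xml.etree.ElementTree` that lxml extends (the stdlib supports only a subset of XPath; the full language needs lxml itself):

```python
import xml.etree.ElementTree as ET  # the stdlib API that lxml extends

doc = ET.fromstring('<div><a href="/x">x</a><p><a href="/y">y</a></p></div>')

# .// finds matching elements at any depth, in document order.
hrefs = [a.get("href") for a in doc.findall(".//a")]
print(hrefs)  # ['/x', '/y']
```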
lxml
from lxml import etree
t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()
twill
• Simple
• No JavaScript
• http://twill.idyll.org
• Some more interesting concepts
• Pages, Scenarios
• State Machines
twill
• Commonly used methods:
go()
code()
show()
showforms()
formvalue() (or fv())
submit()
Twill
>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()
Twill - acknowledge_equiv_refresh
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
...
twill.errors.TwillException: infinite refresh loop discovered; aborting.
Try turning off acknowledge_equiv_refresh...
Twill
>>> twill.config("acknowledge_equiv_refresh", "false")
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'
mechanize
• Stateful programmatic web browsing
• navigation history
• HTML form state
• cookies
• ftp:, http: and file: URL schemes
• redirections
• proxies
• Basic and Digest HTTP authentication
mechanize - robots.txt
>>> import mechanize
>>> browser = mechanize.Browser()
>>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
mechanize - robots.txt
• Do not handle robots.txt:
browser.set_handle_robots(False)
• Do not handle <meta http-equiv="refresh">:
browser.set_handle_equiv(False)
browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
Selenium
• http://seleniumhq.org
• Support for JavaScript
Selenium
from selenium import webdriver
from selenium.common.exceptions \
    import NoSuchElementException
from selenium.webdriver.common.keys \
    import Keys
import time
Selenium
>>> browser = webdriver.Firefox()
>>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
>>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
>>> browser.close()
>>> print a
23
PhantomJS
• http://www.phantomjs.org/