feedparser (http://www.feedparser.org/): Because RSS is Hairy, by Lindsey Smith (@turbodog)

Feedparser

Jan 14, 2015

Lindsey Smith

Brief overview of the Python feedparser module. Feedparser is a robust tool for parsing RSS feeds of all types.
Transcript
Page 1: Feedparser

feedparser

http://www.feedparser.org/

Because RSS is Hairy

Lindsey Smith

@turbodog

Page 2: Feedparser

feedparser: because RSS is hairy

RSS formats bundle HTML

User input via HTML is hairy

There are several syndication formats and versions (RSS, Atom, etc.)

[Diagram labels: RSS, HTML, Micro-format]

Page 3: Feedparser

feedparser: because rss is hairy

Download and parse just about any feed type, including:

Various flavors of Atom and RSS

Format extensions (iTunes)

Micro-formats (GeoRSS, hcard)

Ensures that you can treat all feeds the same way, regardless of format or version

Page 4: Feedparser

feedparser: because rss is hairy

Digests whatever crap you throw at it

Sanitizes HTML

Date normalization

Resolving relative links

Feed type, version and encoding detection

Bozo detection of non-well-formed feeds without blowing up

Page 5: Feedparser

feedparser: because rss is hairy

Parse URL, local file or string data

304 Not Modified HTTP return code

HTTP basic auth

Custom request headers

Custom handlers

Captures response headers

Page 6: Feedparser

feedparser: the good ol’ days

Created circa 2002 by Mark Pilgrim of Dive Into Python fame

Powers feedvalidator.org

v4.1 released in 2007

Open source

Well-documented

3000 unit tests

Available in popular Linux distros

Page 7: Feedparser

feedparser: the lean years

Development slows to a trickle

No official releases

Atom & RSS continue to evolve

iTunes enclosures

v4.1 released in 2007

Still available in popular Linux distros

Page 8: Feedparser

feedparser 5.0: a new hope

Small group of developers start working on feedparser

v5.0 released January 2011

Supports Python 3

Micro-formats

CSS & HTML5 sanitization

Bug fixes, bug fixes, bug fixes

Page 9: Feedparser

>>> import feedparser

>>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")

>>> d['feed']['title'] # feed data is a dictionary

u'Sample Feed'

>>> d.feed.title # get values attr-style or dict-style

u'Sample Feed'

>>> d.channel.title # use RSS or Atom terminology anywhere

u'Sample Feed'

>>> d.feed.link # resolves relative links

u'http://example.org/'

>>> d.feed.subtitle # parses escaped HTML

u'For documentation <em>only</em>'

Page 10: Feedparser

>>> len(d['entries']) # entries are a list

1

>>> d['entries'][0]['title'] # each entry is a dictionary

u'First entry title'

>>> d.entries[0].title # attr-style works here too

u'First entry title'

>>> d['items'][0].title # RSS terminology works here too

u'First entry title'

>>> e = d.entries[0]

>>> e.link # easy access to alternate link

u'http://example.org/entry/3'

>>> e.links[1].rel # full access to all Atom links

u'related'

>>> e.links[0].href # resolves relative links here too

u'http://example.org/entry/3'

Page 11: Feedparser

>>> e.updated_parsed # parses all date formats

time.struct_time(tm_year=2005, tm_mon=11, tm_mday=9, tm_hour=11, tm_min=56, tm_sec=34, tm_wday=2, tm_yday=313, tm_isdst=0)

>>> e.content[0].value # sanitizes dangerous HTML

u'<div>Watch out for <em>nasty tricks</em></div>'

>>> d.version # reports feed type and version

u'atom10'

>>> d.encoding # auto-detects character encoding

u'utf-8'

>>> d.headers.get('Content-type') # full access to all HTTP headers

u'application/xml'

>>> d.bozo # well-formed?

0

Page 12: Feedparser

feedparser: caveats

Fairly slow and CPU intensive

Friendfeed rolled their own and fell back on feedparser

Team is looking at ways to speed it up

Page 13: Feedparser

feedparser: the project details

Home page: http://www.feedparser.org

Discussion: http://code.google.com/p/feedparser