Feedparser

feedparser: http://www.feedparser.org/

Because RSS is Hairy

Lindsey Smith, @turbodog

feedparser: because RSS is hairy

RSS formats bundle HTML

User input via HTML is hairy

There are several syndication formats and versions (RSS, Atom, etc.)

RSS

HTML

Micro-format

feedparser: because RSS is hairy

Download and parse just about any feed type, including:

Various flavors of Atom and RSS

Format extensions (iTunes)

Micro-formats (GeoRSS, hcard)

Ensures that you can treat all feeds the same way, regardless of format or version
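To make the "treat all feeds the same way" point concrete, here is a minimal standard-library sketch of what a normalizing layer has to do for just two formats. It is not feedparser's implementation; the `normalize` helper and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 namespace in ElementTree notation

# Two hypothetical feeds carrying the same kind of data in different shapes.
RSS_SAMPLE = """<rss version="2.0"><channel>
<title>RSS Sample</title>
<item><title>First RSS item</title></item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
<title>Atom Sample</title>
<entry><title>First Atom entry</title></entry>
</feed>"""


def normalize(xml_text):
    """Return {'title': ..., 'entries': [titles]} for RSS 2.0 or Atom 1.0 input."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        channel = root.find("channel")
        return {
            "title": channel.findtext("title"),
            "entries": [item.findtext("title") for item in channel.findall("item")],
        }
    if root.tag == ATOM + "feed":
        return {
            "title": root.findtext(ATOM + "title"),
            "entries": [e.findtext(ATOM + "title") for e in root.findall(ATOM + "entry")],
        }
    raise ValueError("unrecognized feed root: %r" % root.tag)


for sample in (RSS_SAMPLE, ATOM_SAMPLE):
    print(normalize(sample))
```

Even this toy version needs one branch per format, which is the work a feed library takes off your hands across many more formats and versions.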


Digests whatever crap you throw at it

Sanitizes HTML

Date normalization

Resolving relative links

Feed type, version and encoding detection

Bozo detection of non-well-formed feeds without blowing up
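Three of the chores listed above can be sketched with the standard library alone. This is a simplified illustration, not feedparser's code: the helper names are hypothetical, only two date formats are handled, and the lenient parser merely flags malformed XML instead of recovering data from it.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def normalize_date(text):
    """Parse an RFC 822 (RSS) or RFC 3339 (Atom) date into an aware UTC datetime."""
    text = text.strip()
    try:
        parsed = parsedate_to_datetime(text)  # RSS style: "Wed, 09 Nov 2005 11:56:34 GMT"
    except (TypeError, ValueError):
        # Atom style: "2005-11-09T11:56:34Z"; older Pythons do not accept the "Z" suffix.
        parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive dates are UTC
    return parsed.astimezone(timezone.utc)


def resolve_link(base_url, href):
    """Turn a relative link found in a feed into an absolute URL."""
    return urljoin(base_url, href)


def parse_leniently(xml_text):
    """Return (root_or_None, bozo, error) instead of raising on malformed XML."""
    try:
        return ET.fromstring(xml_text), 0, None
    except ET.ParseError as exc:
        return None, 1, exc


print(normalize_date("Wed, 09 Nov 2005 11:56:34 GMT"))
print(normalize_date("2005-11-09T11:56:34Z"))
print(resolve_link("http://example.org/feeds/atom.xml", "/entry/3"))
print(parse_leniently("<rss><channel></rss>")[1])
```

The `bozo` flag mirrors the idea on the slide: the caller gets a signal that the feed was not well-formed rather than an exception.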


Parse URL, local file or string data

304 Not Modified HTTP return code

HTTP basic auth

Custom request headers

Custom handlers

Captures response headers
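The HTTP features above boil down to a few request headers. The sketch below builds such a request with `urllib` only, to show what goes over the wire; the `build_feed_request` and `fetch_if_changed` helpers are hypothetical names, not part of feedparser, which documents its own `etag` and `modified` arguments to `parse()` for the same purpose.

```python
import base64
import urllib.error
import urllib.request


def build_feed_request(url, etag=None, last_modified=None,
                       username=None, password=None, extra_headers=None):
    """Build a request that supports conditional GET, basic auth and custom headers."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)  # value of the ETag from the last fetch
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    if username is not None and password is not None:
        credentials = ("%s:%s" % (username, password)).encode("utf-8")
        token = base64.b64encode(credentials).decode("ascii")
        request.add_header("Authorization", "Basic " + token)
    for name, value in (extra_headers or {}).items():
        request.add_header(name, value)
    return request


def fetch_if_changed(request):
    """Return the response body, or None when the server answers 304 Not Modified."""
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # nothing new, so skip the expensive parse entirely
        raise


req = build_feed_request(
    "http://example.org/feed.xml",
    etag='"abc123"',
    extra_headers={"User-Agent": "demo/0.1"},
)
print(req.header_items())
```

Skipping unchanged feeds this way also matters for performance, since a feed that returns 304 never has to be parsed at all.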

feedparser: the good ol’ days

Created circa 2002 by Mark Pilgrim of Dive Into Python fame

Powers feedvalidator.org

v4.1 released in 2007

Open source

Well-documented

3000 unit tests

Available in popular Linux distros

feedparser: the lean years

Development slows to a trickle

No official releases

Atom & RSS continue to evolve

iTunes enclosures

v4.1 released in 2007

Still available in popular Linux distros

feedparser 5.0: a new hope

Small group of developers start working on feedparser

v5.0 released January 2011

Supports Python 3

Micro-formats

CSS & HTML5 sanitization

Bug fixes, bug fixes, bug fixes

>>> import feedparser

>>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")

>>> d['feed']['title'] # feed data is a dictionary

u'Sample Feed'

>>> d.feed.title # get values attr-style or dict-style

u'Sample Feed'

>>> d.channel.title # use RSS or Atom terminology anywhere

u'Sample Feed'

>>> d.feed.link # resolves relative links

u'http://example.org/'

>>> d.feed.subtitle # parses escaped HTML

u'For documentation <em>only</em>'

>>> len(d['entries']) # entries are a list

1

>>> d['entries'][0]['title'] # each entry is a dictionary

u'First entry title'

>>> d.entries[0].title # attr-style works here too

u'First entry title'

>>> d['items'][0].title # RSS terminology works here too

u'First entry title'

>>> e = d.entries[0]

>>> e.link # easy access to alternate link

u'http://example.org/entry/3'

>>> e.links[1].rel # full access to all Atom links

u'related'

>>> e.links[0].href # resolves relative links here too

u'http://example.org/entry/3'

>>> e.updated_parsed # parses all date formats

time.struct_time(tm_year=2005, tm_mon=11, tm_mday=9, tm_hour=11, tm_min=56, tm_sec=34, tm_wday=2, tm_yday=313, tm_isdst=0)

>>> e.content[0].value # sanitizes dangerous HTML

u'<div>Watch out for <em>nasty tricks</em></div>'

>>> d.version # reports feed type and version

u'atom10'

>>> d.encoding # auto-detects character encoding

u'utf-8'

>>> d.headers.get('Content-type') # full access to all HTTP headers

u'application/xml'

>>> d.bozo # well-formed?

0
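The session above mixes `d['feed']`, `d.feed` and `d.channel` freely. A rough sketch of how a dictionary can support that is shown below; `AliasDict` is a hypothetical stand-in written for this example, much simpler than the result object feedparser actually returns.

```python
class AliasDict(dict):
    """dict with attribute access and a couple of RSS-to-Atom key aliases."""

    # RSS terminology on the left, the key the data is stored under on the right.
    aliases = {"channel": "feed", "items": "entries"}

    def __getitem__(self, key):
        return dict.__getitem__(self, self.aliases.get(key, key))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so real dict methods
        # such as .items() keep working; use d['items'] for the alias.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


d = AliasDict(
    feed=AliasDict(title="Sample Feed"),
    entries=[AliasDict(title="First entry title")],
)
print(d["feed"]["title"], d.feed.title, d.channel.title)
print(d["items"][0].title)
```

This also explains why the session writes `d['items'][0]` rather than `d.items[0]`: attribute access would collide with the built-in `dict.items` method.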

feedparser: caveats

Fairly slow and CPU intensive

FriendFeed rolled their own parser and fell back on feedparser

Team is looking at ways to speed it up

feedparser: the project details

Home page: http://www.feedparser.org

Discussion: http://code.google.com/p/feedparser
