Some assembly required: June 30, 2006

Problem:

Bloggers and other feed providers that don't treat their readers to a full feed.

Solution:

Convert a partial feed to a full feed.

Method:

Scrape a partial feed for URLs to full posts.

Cut out the portion of the page that contains the post.

Encode it into a new feed that mirrors the partial feed in every other respect.
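The three steps above could be sketched roughly as follows. This is a minimal sketch, not a finished translator: it assumes an Atom feed, assumes every post page is well-formed XHTML (real-world tag soup would need a forgiving HTML parser), and is limited to the XPath subset that Python's `xml.etree.ElementTree` understands. The names `expand_feed` and `fetch` are made up for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize Atom elements without a prefix


def expand_feed(feed_xml, post_xpath, fetch):
    """Return a copy of an Atom feed in which each entry's content is the
    node that post_xpath selects from the page the entry links to.

    fetch is any callable that maps a URL to the page source as a string
    (an assumption of this sketch, so the network layer stays swappable).
    """
    feed = ET.fromstring(feed_xml)
    for entry in feed.findall(f"{{{ATOM}}}entry"):
        # Step 1: scrape the entry for the URL of the full post.
        # Simplification: take the first <link>, whatever its rel attribute.
        link = entry.find(f"{{{ATOM}}}link")
        if link is None or not link.get("href"):
            continue
        # Step 2: cut out the node that contains the post.
        page = ET.fromstring(fetch(link.get("href")))
        node = page.find(post_xpath)
        if node is None:
            continue  # selector missed; leave the partial entry untouched
        # Step 3: encode it into the entry, replacing the partial text.
        for tag in ("summary", "content"):
            for old in entry.findall(f"{{{ATOM}}}{tag}"):
                entry.remove(old)
        content = ET.SubElement(entry, f"{{{ATOM}}}content", type="html")
        content.text = ET.tostring(node, encoding="unicode", method="html")
    return ET.tostring(feed, encoding="unicode")
```

Everything the loop does not touch (feed title, entry titles, ids, dates) passes through unchanged, which is what "mirrors the partial feed" amounts to here.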

Requirements:

(partial) Feed URL.

(per-feed) XPath selector that slices out the post content node.

Web server to run the feed-to-feed translator.

Interface:

http://[base url of f2f translator]?feed=[URL encoded feed URL]&post=[URL encoded XPath expression that slices out the full post content node from a full post page]

...producing a new feed to the specs of the two parameters.
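Both parameters need URL encoding, since a feed URL and an XPath expression are full of characters that mean something in a query string. A small sketch of both ends of the interface, where the base URL and the function names `translator_url` and `read_params` are placeholders of my own:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical location of the translator; substitute your own.
BASE_URL = "http://example.org/f2f"


def translator_url(feed_url, post_xpath, base_url=BASE_URL):
    """Build the URL of the translated feed; urlencode escapes both values."""
    return base_url + "?" + urlencode({"feed": feed_url, "post": post_xpath})


def read_params(request_url):
    """Recover the (feed URL, XPath expression) pair on the server side."""
    query = parse_qs(urlsplit(request_url).query)
    return query["feed"][0], query["post"][0]
```

A subscriber would then point the feed reader at something like `translator_url("http://example.com/atom.xml", "//div[@class='post']")`.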

Notes:

It's probably a good idea to limit the feed handling to one feed format.

I would adopt Atom, and put the Google Reader backend to use for the conversion step.

It might prove useful to adopt a caching strategy, so that each URL + XPath combination is fetched and extracted only once rather than every time a subscriber fetches the feed, with new posts processed as they show up.

OTOH, that would probably also call for cacheability checks. Cacheability metadata from the feed and the HTTP headers of the post URL itself should provide enough guidance.
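One way to picture such a cache is sketched below, keyed on the (post URL, XPath) pair and taking its lifetime from the post's `Cache-Control` header. The class name, the `extract` callable and the one-hour default are all assumptions of mine, and the header handling is deliberately crude: it treats `no-cache` like `no-store` and ignores `Expires`, `ETag` and the feed's own metadata.

```python
import re
import time


class ExtractionCache:
    """Remember extracted posts per (post URL, XPath) pair."""

    def __init__(self, extract, default_ttl=3600, clock=time.time):
        self.extract = extract          # callable: (url, xpath) -> (html, headers)
        self.default_ttl = default_ttl  # seconds, used when no max-age is given
        self.clock = clock              # injectable so expiry can be tested
        self.entries = {}               # (url, xpath) -> (expires_at, html)

    def get(self, url, xpath):
        key = (url, xpath)
        hit = self.entries.get(key)
        if hit and hit[0] > self.clock():
            return hit[1]  # still fresh: no fetch, no extraction
        html, headers = self.extract(url, xpath)
        ttl = self.ttl_from(headers)
        if ttl > 0:
            self.entries[key] = (self.clock() + ttl, html)
        else:
            self.entries.pop(key, None)  # the post asked not to be cached
        return html

    def ttl_from(self, headers):
        """Lifetime in seconds according to a dict of response headers."""
        cache_control = headers.get("Cache-Control", "").lower()
        if "no-store" in cache_control or "no-cache" in cache_control:
            return 0
        match = re.search(r"max-age=(\d+)", cache_control)
        return int(match.group(1)) if match else self.default_ttl
```

With this in place, a hundred subscribers polling the same translated feed cost one fetch per post per lifetime rather than a hundred.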

It's likely that Google Reader (or another online aggregator) could be put to good use as the republication and caching layer, provided that feed subscribers are routed through that service (via HTTP redirects), while the Google Reader spider is the one that invokes the fetch-and-compose engine.

This isn't an advocacy post; the web is simply mine to consume however I please. I pick my browser, my feed reader, and, true to my habits, my way of browsing - including when and why I opt to visit a site to read a post, and when I stay in my feed reader.

Command-line tools for XPath extraction of data from a given HTML file or URL might prove useful for quick prototyping; are any available?

Tags:

Some assembly required


Friday, June 30, 2006

Roll your own full post feed
