Some assembly required

This blog is about unimplemented ideas. At least until they get ticked off; I suppose a few eventually will have implementations too, but fresh posts never will. Because that's the primary purpose of this blog: keeping track of ideas I'd like to dive into, or problems I'd like to see solved. Feel free to join me in implementing, or further developing these ideas. I don't mind working solo, but it's a whole lot more fun working in concert!

Saturday, January 27, 2007

Track changed URLs via subversion

Problem:

Online resources don't provide a standardized way of tracking changes, much less one well integrated with feed readers and other technology suited for news coverage.

Solution:

Set up an agent that polls tracked URLs for changes and commits any it finds to a subversion repository. Provide feeds that link to the diffs and note their sizes.

Elevator pitch:

Ever tried to track changes to some online API, TOS or similar by hand? Of course not; it's way too messy and laborious. And there are just too many of them, anyway; staying on top of each is impossible. It's not a task for humans.

Ever thought of how neat it would be if you could just point some automated agent at your URL of choice, whenever you find the need, and have it magically track that URL, committing new changes to a subversion repository as they appear?

And get a changes timeline you can tune in to instead, to peruse the diffs at your leisure? Well, that is the basic idea of this service. Coupled with feeds for your feed reader, of course, so you can forget about the URLs while nothing happens and be alerted when there is some action.

Without having to subscribe to lots of blogs and the like, which may or may not announce the news, and which will most certainly not announce it in any standard format that is easy to discover and digest. See, that's what you need Swwwoon for.

Method:

Per host, or perhaps otherwise intelligently grouped cluster of URLs, iterate:

New URL:

  1. Fetch URL and Content-type header

  2. Store as ${repository}/${URL hostname}/${URL path} and svn add the new file

  3. svn propset svn:mime-type ${Content-type} ${path}

Old URL:

  1. Fetch URL and Content-type header

  2. Update file contents and mime-type in working copy.

  3. svn diff ${path} | format_commit_message

svn ci -m ${commit message} ${paths}
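
The steps above can be sketched in Python. This is only an illustration under assumptions of my own: the function names, the `_index` placeholder for URLs ending in a slash, and the `svn add --parents` step (a file has to be versioned before `svn propset` applies) are not part of the outline. The commands are returned as argument lists rather than executed.

```python
import posixpath
from urllib.parse import urlsplit


def repository_path(repository: str, url: str) -> str:
    """Map a tracked URL to ${repository}/${URL hostname}/${URL path}."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"not an absolute URL: {url!r}")
    path = parts.path or "/"
    if path.endswith("/"):
        # Assumption: directory-style URLs need some file name to live under.
        path += "_index"
    # normpath collapses "." and ".." so a URL cannot escape the repository.
    safe = posixpath.normpath(path).lstrip("/")
    return posixpath.join(repository, parts.hostname, safe)


def svn_commands(path: str, content_type: str, is_new: bool) -> list[list[str]]:
    """Return the svn invocations for one fetched URL, as argument lists."""
    commands = []
    if is_new:
        # Assumption: version the file first so that propset has a target.
        commands.append(["svn", "add", "--parents", path])
    commands.append(["svn", "propset", "svn:mime-type", content_type, path])
    if not is_new:
        # The diff output feeds the commit message formatter.
        commands.append(["svn", "diff", path])
    return commands
```

Each argument list could be handed to `subprocess.run` once the fetched body has been written to the working copy; a single `svn ci` for all touched paths then closes the iteration.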

Requirements:

  • Interface to add a URL to track (web page, bookmarklet)

  • Change tracker agent, run at regular intervals

  • Feed generator: per user, per repository (optional), and an aggregate of everything

  • A database of:
    • urls

    • users

    • which users track which urls

    • optional: urls & credentials to remote repositories
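
One possible shape for that database, sketched with SQLite. All table and column names here are my own assumptions, as is the `track` helper; the point is only that a URL is stored once no matter how many users follow it.

```python
import sqlite3

SCHEMA = """
CREATE TABLE urls (
    id  INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE
);
CREATE TABLE users (
    id         INTEGER PRIMARY KEY,   -- assumed to double as the public id
    private_id TEXT NOT NULL UNIQUE   -- opaque token, kept in a cookie
);
CREATE TABLE tracking (
    user_id INTEGER NOT NULL REFERENCES users(id),
    url_id  INTEGER NOT NULL REFERENCES urls(id),
    PRIMARY KEY (user_id, url_id)
);
CREATE TABLE remote_repositories (    -- the optional part
    user_id     INTEGER NOT NULL REFERENCES users(id),
    url         TEXT NOT NULL,
    credentials TEXT
);
"""


def track(db: sqlite3.Connection, user_id: int, url: str) -> None:
    """Record that a user tracks a URL, reusing the url row if it exists."""
    db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    (url_id,) = db.execute(
        "SELECT id FROM urls WHERE url = ?", (url,)
    ).fetchone()
    db.execute(
        "INSERT OR IGNORE INTO tracking (user_id, url_id) VALUES (?, ?)",
        (user_id, url_id),
    )
```

Calling `db.executescript(SCHEMA)` on a fresh connection creates the tables; `track` can then be called repeatedly without creating duplicates.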

Interface:

Add URL:

http://swwwoon/add?url=[URL encoded URL]
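
A bookmarklet or web form would have to percent-encode the tracked URL before appending it. A minimal sketch, where the helper names are hypothetical and the endpoint is the one written above:

```python
from urllib.parse import parse_qs, quote, urlsplit

ADD_ENDPOINT = "http://swwwoon/add"  # host name as written in this post


def add_link(url: str) -> str:
    """Build the add request for a URL to track."""
    # safe="" also escapes "/", "?" and "&", so the tracked URL's own
    # query string cannot be mistaken for parameters of the add request.
    return f"{ADD_ENDPOINT}?url={quote(url, safe='')}"


def tracked_url(add_request: str) -> str:
    """Recover the tracked URL on the receiving side."""
    return parse_qs(urlsplit(add_request).query)["url"][0]
```

Encoding with `safe=""` is what keeps a tracked URL such as `http://example.com/api?v=2&x=1` in one piece on the way through.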

Feeds:

  • http://swwwoon/feeds/user/${user public id}
    - all the user's tracked URLs

  • http://swwwoon/feeds/url/${repository path}
    - changes to a single URL

  • http://swwwoon/feeds/subtree/${partial repository path}
    - changes to all URLs in the subtree rooted at the given prefix

  • http://swwwoon/feeds/domain/google.com
    - anything covered at *.google.com

  • Perhaps combinations of the above
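
Since files are stored under hostname and path, most of these feeds come down to a filter over repository paths. A rough sketch follows; the function name and the feed kind labels are mine, and it assumes a domain feed should include the bare domain as well as its subdomains. The per-user feed is left out, as it needs the database rather than the path.

```python
def feed_matches(kind: str, arg: str, repo_path: str) -> bool:
    """Decide whether a changed file belongs in a feed.

    repo_path is "hostname/path..." relative to the repository root.
    """
    host = repo_path.partition("/")[0]
    if kind == "url":
        return repo_path == arg
    if kind == "subtree":
        prefix = arg.rstrip("/")
        # Compare whole path segments, so "a/b" does not match "a/bc".
        return repo_path == prefix or repo_path.startswith(prefix + "/")
    if kind == "domain":
        # Require a dot before the domain, so other hosts that merely
        # end in the same letters are not swept in.
        return host == arg or host.endswith("." + arg)
    raise ValueError(f"unknown feed kind: {kind!r}")
```

A combined feed could then be expressed as several (kind, arg) pairs, keeping a change if any of them matches.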

Notes:

  • Each URL is fetched and stored once, however many users track it, so the cost does not grow in proportion to the number of users. This is a very good property.

  • Some normalization scheme may prove necessary to cover URLs with query strings too. The alternative is to plainly disallow them.

  • Making use of HTTP/1.1 If-Modified-Since, when possible, might prove a worthwhile optimization. It is likely, though, that it won't catch changes of Content-type, due to bugs in web servers or their configuration, so it is probably worth verifying with a HEAD request even when given a 304 Not Modified.

  • Good commit messages aren't in ready supply, but listing content type (for new files, and from/to when changed), file lengths (in lines, too, for text/* and some XML formats) and, for text variants, diff sizes (+20/-3 lines) is a good start. A machine-readable format is good, so the feed generator can present a set of nicely annotated links to online diff browsers (via Code Librarian or Trac, for instance).

  • The add URL request passes a private user id cookie to associate the URL with your account.

  • Good user ids are chunks of opaque random ASCII, for instance a cryptographic hash of a private key and an auto_increment user number. Forcing logins and passwords on users is annoying; invent good private keys instead.

  • A public id can be shared with others without issues. Given the open nature suggested here, allowing anyone to see anyone else's tracked URLs, plain integers would work.
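
For the note on URL queries, one candidate normalization is to sort the query parameters, lowercase scheme and host, and drop the fragment. This is just one scheme among several, and it assumes parameter order carries no meaning for the server, which is not always true.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Return a canonical form of a URL for use as a tracking key."""
    parts = urlsplit(url)
    # Sorting makes "?b=2&a=1" and "?a=1&b=2" map to the same key.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(pairs),
        "",  # the fragment never reaches the server, so it is dropped
    ))
```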
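
The If-Modified-Since note might look roughly like this with Python's standard library. The function names and the return labels are assumptions for illustration; the requests are only built here, not sent.

```python
import urllib.request
from email.utils import formatdate
from typing import Optional


def conditional_get(url: str, last_fetched: float) -> urllib.request.Request:
    """Build a GET that lets the server skip the body if nothing changed."""
    request = urllib.request.Request(url)
    # HTTP dates are expressed in GMT; last_fetched is seconds since epoch.
    request.add_header(
        "If-Modified-Since", formatdate(last_fetched, usegmt=True)
    )
    return request


def head(url: str) -> urllib.request.Request:
    """Build the follow-up HEAD request used to double-check Content-type."""
    return urllib.request.Request(url, method="HEAD")


def change_kind(status: int, stored_type: str,
                reported_type: str) -> Optional[str]:
    """Classify a poll result; reported_type comes from the GET or the HEAD."""
    if status == 200:
        # A body was sent, so it must be compared with the working copy.
        return "content"
    if status == 304 and reported_type != stored_type:
        return "mime-type"
    return None
```

When sending these with `urllib.request.urlopen`, a 304 arrives as an `HTTPError`, so the status would be read from the exception's `code` attribute.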
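
As for commit messages, here is a sketch of one machine-readable layout: a human-readable summary line followed by a JSON body. The field names are invented for this example, and the line counts come from Python's difflib rather than from `svn diff` output.

```python
import difflib
import json


def commit_message(url, old_text, new_text, old_type, new_type):
    """Summarise a change as "+N/-M lines" plus a JSON block of details."""
    old_lines, new_lines = old_text.splitlines(), new_text.splitlines()
    added = removed = 0
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" counts as lines removed and lines added.
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    details = {
        "url": url,
        "content_type": {"from": old_type, "to": new_type},
        "lines": {"old": len(old_lines), "new": len(new_lines)},
        "diff": {"added": added, "removed": removed},
    }
    summary = f"{url}: +{added}/-{removed} lines"
    return summary + "\n\n" + json.dumps(details, sort_keys=True)
```

The feed generator could split the message on the first blank line and parse the remainder as JSON to annotate its links.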
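
Finally, the private user id. The note suggests hashing a private key together with the user number; the sketch below swaps in HMAC-SHA256, a keyed hash built for exactly this kind of combination, instead of hashing a plain concatenation. The function names are hypothetical.

```python
import hashlib
import hmac


def private_user_id(server_key: bytes, user_number: int) -> str:
    """Derive an opaque ASCII id from a server-side key and a user number."""
    message = str(user_number).encode("ascii")
    return hmac.new(server_key, message, hashlib.sha256).hexdigest()


def cookie_matches(server_key: bytes, user_number: int, cookie: str) -> bool:
    """Check a presented cookie without leaking timing information."""
    expected = private_user_id(server_key, user_number)
    return hmac.compare_digest(expected, cookie)
```

The id is only as secret as the server key, so that key would have to be long, random, and kept out of the repository.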