As some of you probably know, I've been working a lot on getting together a compressed full history dump of Wikipedia which can be randomly accessed for reading and writing. Inspired by the article entitled Building a (fast) Wikipedia offline reader, my focus has been on using bzip2 compression - at least up until today.
For a case study, let's take Wikipedia's article on "anarchism". As dumped in the full history dump from May, this file takes up 902 megs of space. Bzip2 compression brings this down to 51 megs, which is a huge savings. Now, bzipping (-9) the entire file takes about 13 minutes, and bunzipping it takes about 1.75 minutes, so this is clearly not a solution in itself. But "bzip -9" uses a block size of 900K, so it's actually possible to access those 1000 or so individual bzipped blocks.
This is promising, and I was working on building indexes and recompressing things so that I can deal with this, however, the process of decompressing, manipulating, and recompressing, was taking longer than I had hoped for. I ran the script overnight and in the morning not as much progress had been made as I had hoped for (it turned out to be 2% done). While trying to figure just how big the problem was, I googled for "wikipedia number of article revisions" and one of the articles I read was Will Wikipedia collapse under its own weight? I finished my search of the number of article revisions and came up with my estimate that I was 2% done. Not terrible, but slow enough that I decided to hit Ctrl-C and look deeper into the problem.
In doing this, and while thinking about that article, I decided to look into RCS. I whipped up a script to put all the revisions of "Anarchism" into an RCS file. It started out fast, but then as the revisions went up and up it got slower and slower. (I think I have a way to solve this, but let's skip that for now.) I let it finish, and the RCS file wound up being 42 megs. I then bzipped the RCS file. 3.3 megs!
Even just standing alone, the RCS files, once created, are bearable for random access. On that 42 megabyte RCS file, the current revision can be accessed in less than 0.1 seconds, and the earliest revision (which should be the slowest) can be accessed in under 3 seconds. A new revision can be added in about 3-6 seconds [1]. This is clearly too slow for production use, but wouldn't be so bad for an offline copy. At that compression ratio you could probably fit a full history dump on a single blue-ray DVD which could be randomly accessed!
But it gets better, because RCS uses a needlessly inefficient process for adding new revisions. Basically, when you add a new addition using the "ci" command, the entire (42 megabyte) file gets rewritten. It's ironic, because RCS actually makes the exact opposite mistake as Wikipedia: it puts all the metadata in text files, instead of a database (whereas Wikipedia puts all the text in a database, instead of in text files). But in theory this is completely unnecessary. All that's needed is to throw out the old "current" revision, to save the new "current" revision, and to save the reverse delta. This should be possible on the order of that 0.1 seconds. Prior revision access is, in my opinion, already bearable, but this could be brought down through checkpointing. Putting checkpoints at every 100 revisions ("Anarchism" has 13350) would only incur a small space penalty, and would likely get the access time down to less than a quarter second. Creative placement of checkpoints (at page blankings and other major revisions where the delta is huge anyway) could probably even eliminate most of the space penalty. The use of skip-deltas can do even better than that. And intelligent caching could accomplish even more. Less than a dozen servers with 16 gigs of memory each, could hold every single uncompressed RCS file, covering the entire history of the English Wikipedia in memory.
- I'm not sure which, because this is the time for the locking checkout and the checkin, and I'd imagine there must be a way to combine the two into one operation.