Sunday, October 12, 2008

Anarchism

As some of you probably know, I've been working a lot on getting together a compressed full history dump of Wikipedia which can be randomly accessed for reading and writing. Inspired by the article entitled Building a (fast) Wikipedia offline reader, my focus has been on using bzip2 compression - at least up until today.

For a case study, let's take Wikipedia's article on "anarchism". As dumped in the full history dump from May, this file takes up 902 megs of space. Bzip2 compression brings this down to 51 megs, which is a huge saving. Now, bzipping (-9) the entire file takes about 13 minutes, and bunzipping it takes about 1.75 minutes, so whole-file compression is clearly not a solution in itself. But "bzip2 -9" uses a block size of 900K, so it's actually possible to access those 1000 or so bzipped blocks individually.
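To make the block idea concrete, here is a minimal Python sketch. It is not the script I was running. It makes one simplifying assumption: instead of locating the blocks inside a single bzip2 stream, it compresses each 900K chunk as its own independent bzip2 stream and keeps a small index of where each one starts. The names compress_chunked and read_range are made up for this example.

```python
import bz2

CHUNK_SIZE = 900_000  # mirrors the 900K block size of "bzip2 -9" (assumption of this sketch)


def compress_chunked(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Compress data as independent bz2 streams and return (blob, index).

    index[i] is (compressed_offset, compressed_length) for chunk i, which
    covers uncompressed bytes [i * chunk_size, (i + 1) * chunk_size).
    """
    parts = []
    index = []
    offset = 0
    for start in range(0, len(data), chunk_size):
        packed = bz2.compress(data[start:start + chunk_size], compresslevel=9)
        index.append((offset, len(packed)))
        parts.append(packed)
        offset += len(packed)
    return b"".join(parts), index


def read_range(blob: bytes, index, start: int, length: int,
               chunk_size: int = CHUNK_SIZE) -> bytes:
    """Return uncompressed bytes [start, start + length).

    Only the chunks that overlap the requested range are decompressed.
    """
    if length <= 0:
        return b""
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    pieces = []
    for i in range(first, min(last, len(index) - 1) + 1):
        off, size = index[i]
        pieces.append(bz2.decompress(blob[off:off + size]))
    joined = b"".join(pieces)
    skip = start - first * chunk_size
    return joined[skip:skip + length]
```

Reading a few kilobytes out of the middle then costs one or two chunk decompressions instead of a 1.75 minute bunzip of the whole thing. Writing is the harder half: changing a chunk changes its compressed length, so everything after it in the index has to shift.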

This is promising, and I was working on building indexes and recompressing things so that I could deal with this. However, the process of decompressing, manipulating, and recompressing was taking longer than I had hoped for. I ran the script overnight, and in the morning not as much progress had been made as I had hoped. While trying to figure out just how big the problem was, I googled for "wikipedia number of article revisions", and one of the articles I read was Will Wikipedia collapse under its own weight? I finished my search for the number of article revisions and came up with my estimate that I was 2% done. Not terrible, but slow enough that I decided to hit Ctrl-C and look deeper into the problem.

In doing this, and while thinking about that article, I decided to look into RCS. I whipped up a script to put all the revisions of "Anarchism" into an RCS file. It started out fast, but then as the revisions went up and up it got slower and slower. (I think I have a way to solve this, but let's skip that for now.) I let it finish, and the RCS file wound up being 42 megs. I then bzipped the RCS file. 3.3 megs!
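For a sense of scale, the ratios work out as follows. This is just arithmetic on the sizes quoted in this post, with made-up dictionary keys for labels.

```python
# Sizes in megabytes, as quoted in this post for the "Anarchism" history.
sizes_mb = {
    "raw dump": 902,
    "bzip2 of raw dump": 51,
    "RCS file": 42,
    "bzip2 of RCS file": 3.3,
}


def ratio(name: str) -> float:
    """How many times smaller than the raw dump the named form is."""
    return sizes_mb["raw dump"] / sizes_mb[name]


for name in ("bzip2 of raw dump", "RCS file", "bzip2 of RCS file"):
    print(f"{name}: {ratio(name):.1f}x smaller")
```

So plain bzip2 gets roughly 18x, RCS alone gets about 21x while staying uncompressed text, and the two together get about 273x.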

Even just standing alone, the RCS files, once created, are bearable for random access. On that 42 megabyte RCS file, the current revision can be accessed in less than 0.1 seconds, and the earliest revision (which should be the slowest) can be accessed in under 3 seconds. A new revision can be added in about 3-6 seconds [1]. This is clearly too slow for production use, but wouldn't be so bad for an offline copy. At that compression ratio you could probably fit a full history dump on a single Blu-ray disc which could be randomly accessed!

But it gets better, because RCS uses a needlessly inefficient process for adding new revisions. Basically, when you add a new revision using the "ci" command, the entire (42 megabyte) file gets rewritten. It's ironic, because RCS actually makes the exact opposite mistake from Wikipedia: it puts all the metadata in text files, instead of a database (whereas Wikipedia puts all the text in a database, instead of in text files). But in theory this rewrite is completely unnecessary. All that's needed is to throw out the old "current" revision, to save the new "current" revision, and to save the reverse delta. This should be possible on the order of that 0.1 seconds. Prior revision access is, in my opinion, already bearable, but this could be brought down through checkpointing. Putting checkpoints at every 100 revisions ("Anarchism" has 13350) would only incur a small space penalty, and would likely get the access time down to less than a quarter second. Creative placement of checkpoints (at page blankings and other major revisions where the delta is huge anyway) could probably even eliminate most of the space penalty. The use of skip-deltas can do even better than that. And intelligent caching could accomplish even more. Less than a dozen servers with 16 gigs of memory each could hold every single uncompressed RCS file, covering the entire history of the English Wikipedia in memory.
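Here is a rough Python sketch of the scheme described above: the newest revision is stored in full, older revisions as reverse deltas, and a full checkpoint every N revisions. It is an illustration, not RCS's actual file format or delta algorithm. The class ReverseDeltaStore and the helpers make_delta and apply_delta are hypothetical names, and the line diffs come from Python's difflib rather than from diff.

```python
import difflib


def make_delta(source, target):
    """Return operations that rebuild `target` (a list of lines) from `source`."""
    delta = []
    matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))           # reuse source[i1:i2]
        else:
            delta.append(("insert", target[j1:j2]))  # lines only in target (empty for deletions)
    return delta


def apply_delta(source, delta):
    """Rebuild the target lines from `source` and a delta from make_delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(source[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class ReverseDeltaStore:
    """Newest revision in full, older ones as reverse deltas, plus checkpoints."""

    def __init__(self, checkpoint_every=100):
        self.checkpoint_every = checkpoint_every
        # entries[n] is ("full", lines) or ("delta", ops that rebuild n from n + 1)
        self.entries = []

    def add(self, text):
        lines = text.splitlines(keepends=True)
        if self.entries:
            prev = len(self.entries) - 1
            # Only the old head is touched; it stays full if it is a checkpoint.
            if prev % self.checkpoint_every != 0:
                old_lines = self.entries[prev][1]
                self.entries[prev] = ("delta", make_delta(lines, old_lines))
        self.entries.append(("full", lines))

    def get(self, n):
        # Walk forward to the nearest full copy, then apply reverse deltas back to n.
        m = n
        while self.entries[m][0] != "full":
            m += 1
        lines = self.entries[m][1]
        for k in range(m - 1, n - 1, -1):
            lines = apply_delta(lines, self.entries[k][1])
        return "".join(lines)
```

Adding a revision computes one diff against the old head and leaves everything else alone, which is the point of the complaint about "ci" above. With checkpoint_every=100, fetching any old revision applies at most 99 deltas, where the earliest of 13350 revisions would otherwise need 13349. Skip-deltas and smarter checkpoint placement are not shown here.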

  1. I'm not sure which, because this is the time for the locking checkout and the checkin, and I'd imagine there must be a way to combine the two into one operation.

Wednesday, October 8, 2008

Obama-Ayers controversy part 2

Ben Yates commented on my previous blog post about Wikipedia's Obama–Ayers controversy article. He basically told me there's nothing to worry about, that "the article has been assessed a lot", that "Wikipedia articles on political subjects tend to be pretty accurate", and that "as soon as anyone inserts misinformation, someone else removes it, or puts a 'citation needed' tag on it."

The last part of that is certainly not the case. Gregory Kohs has in fact compiled a list of counter-examples among US Senator BLPs. (There is some extra commentary about this spreadsheet which I won't link to because I'm not sure whether the site it's on is currently an OMG BADSITE.)

But most of that vandalism I can deal with, as a reader. It certainly must suck for the subject of the article, but that's a debate for another blog post. The stuff which really ticks me off as a reader, and really makes Wikipedia articles on these sorts of subjects almost useless, is what's not there.

Well, it didn't take me too long (*) to find a good example of what I'm talking about, in the Obama–Ayers controversy article. I checked a few interesting looking history entries and found this one, "rv section to 13:29, 30 August 2008 version -- too many BLP vio / NPOV edits to properly process - discuss on talk page and propose slowly please". The diff is confusing, and there seem to be multiple edits involved, but my beef is this. According to ABC News:

Ayers admitted planting bombs at a number of government installations in the 1960s as part of protests against the Vietnam War, but he was never convicted for any crime related to these activities and no one was hurt in the incidents. In a New York Times article that, coincidentally, happened to be published Sept. 11, 2001, Ayers said "I don't regret setting bombs. I feel we didn't do enough."


and

Obama insists that he barely knows Ayers or his wife, Bernardine Dohrn, who is now a professor at Northwestern University's School of Law. Dohrn was also a member of the Weather Underground, and was once on the FBI's Top 10 Most Wanted List for inciting to riot. The couple live in Chicago and have long been politically active there.


Whereas, according to Wikipedia:

Ayers and Dohrn are fixtures of their Chicago neighborhood, "embraced, by and large, in the liberal circles dominating Hyde Park politics", according to Ben Smith, a writer for The Politico. Ayers has been described as "very respected and prominent in Chicago [with] a national reputation as an educator." But they have not been embraced everywhere due to their past leadership of the Weather Underground, a 1960s radical organization that placed bombs at a number of government institutions, causing damage, but no deaths or injuries.


Both factual? Probably. I haven't delved into the references enough to say for sure, but c'mon, the Wikipedia article is exceptionally kind with its "they have not been embraced everywhere" and no mention at all of the fact that "Ayers admitted planting bombs", that he said "I don't regret setting bombs", and that Dohrn "was once on the FBI's Top 10 Most Wanted List for inciting to riot." It's great that Wikipedia mentions the quote from Ben Smith, and in fact I'd consider the article biased if it didn't include something like that, but for this article to not even mention those three facts I listed above is incredibly biased.

Maybe it's just fear over BLP issues causing this excessive kindness. But you know what, if Wikipedia can't write a nonbiased article concerning living people due to BLP concerns, it shouldn't write one at all.

(*) Well, not too long for the sake of finding something to blog about. It took too long for me to want to do this every time I read a Wikipedia article on a controversial topic.

P.S. Before you accuse me of being biased on this, please know that my initial reaction to hearing Palin say that Obama "pals around with terrorists" was, and I quote: “Unless she has some serious evidence to back up this claim, Palin is being outrageously unethical here.” (see my shared Google Reader notes) And, frankly, I still do think she was being untruthful and unethical with that statement. But I read the Wikipedia article and came out with the impression that Ayers was merely someone who got mixed up with the wrong group in his youth.

I just checked my Google Web History. I first searched Google on "ayers wikipedia". Then, when I saw the results, since I didn't know the first name of Ayers, I changed the search to "ayers wikipedia obama". At the time, I believe [[Obama-Ayers controversy]] was the first hit for that search. In any case, that's the link I clicked on. In hindsight, I should have gone to the [[Bill Ayers]] page, which is more thorough (though I cannot vouch for its factualness or lack of bias). Still, I stand by my poor assessment of [[Obama-Ayers controversy]]. Sure, it didn't have to include everything about Ayers, but the paragraph I read on him was downright biased.

Tuesday, October 7, 2008

Sunday, October 5, 2008

Obama-Ayers controversy

Wikipedia is powerful. I just read an AP story that "Sarah Palin defended her claim that Barack Obama 'pals around with terrorists,'" and found myself typing "William Ayers" into Wikipedia and then clicking over to Obama–Ayers controversy. I have no idea how accurate and complete this article is, and a quick browse through the talk page doesn't really tell me too much.

If I didn't know that any idiot capable of using a keyboard (which apparently excludes John McCain) could have come along and screwed around with this article, it'd be by far the greatest resource on stuff like this. Long-breaking news stories are something Wikipedia excels at. But as it stands, the usefulness certainly goes down. I'm not sure how much. The fact that the article proudly advertises that "This article or section has been nominated to be checked for its neutrality." doesn't really affect my opinion one way or another, because I realize that anyone can add or remove that tag for essentially any reason.

Is this something that could even be solved? I guess in theory it's possible. If a particular version of this article had been rated a "good article", and I saw the names of a few people I trusted signing off on it, I'd feel a whole lot more comfortable about it, anyway. Is this a job Wikipedians are capable of? Or is this something better offered as a value-add by a mirror site (such as Veropedia)?

On a somewhat related point, I've turned on a cool gadget called "Article assessment" which puts at the top of the article "An unassessed article from Wikipedia, the free encyclopedia". So I quickly know that no version of this article has been assessed at all.

Wednesday, October 1, 2008

Wikipedia is not an encyclopedia

Wikipedia is not an encyclopedia. It is a website where people are trying to create an encyclopedia.

There are lots of long arguments that could be engaged in to prove this point, but there is also a relatively simple one. Pick a page in the talk namespace, or the Wikipedia namespace, or the Wikipedia talk namespace, or the User namespace, or the User talk namespace. These pages are all part of Wikipedia. But they are surely not part of an encyclopedia.