Monday, February 16, 2009

Be Careful!

Akahele has launched.

What if there were a place on the Internet where the Internet itself could be carefully and cautiously reviewed and critiqued? What if the voices you heard came from real, identifiable people who backed their musings and words with their real-life credentials and experiences? What if facts trumped speculation? Imagine a calm, rational place on the Web where complex Internet issues detrimentally impacting our society are presented for examination — without unnecessary jargon, without "insider" metaphors, without confusing (or boring!) the average citizen who doesn’t know the meaning of an "open proxy" or a "DoS attack".

The Akahele blog will strive to be that place.

Saturday, January 24, 2009

Hyperphysically existing "stuff"

I recently was looking for a good definition of "universe". Google naturally brought me to Wikipedia, and I found its definition: "everything that physically exists" (the definition went on to specify "the entirety of space and time, all forms of matter, energy and momentum, and the physical laws and physical constants that govern them", but the list of examples seems unnecessary). The word "physically" also seemed out of place, and the fact that it linked to the article on physics made it hard for me to understand what was meant by it.

I decided to go through the history to see how this definition had formed. A quick check revealed that the definition had been, for some while, "everything that exists". This seems a much better definition, and I managed to trace the change back to a single anonymous user. On December 19, 2007, an anonymous user editing from 72.151.50.172 added the word "physically", with the edit summary 'scientific multiverse hypotheses include the possibility of hyperphysically existing "stuff"'. In the year since this edit, no one seems to have thought to challenge it.

A little bit of investigation into this IP suggests that this editor is the same as User:Standonbible, a college freshman who relies heavily on the Bible and admits that editors often assume he is "a POV-pushing religious fanatic". He gives a list of reasons he "often get[s] labeled a lunatic", which I'll let you go to his user page to read. But for all of 2008, and continuing to this day, he helped write the lead sentence on the Wikipedia article on the universe.

By the way, the definition of "universe" in Wiktionary is "The sum of everything that exists in the cosmos, including time and space itself". Again with the unnecessary examples, and this time with the unnecessary "in the cosmos". The latter addition is particularly silly, since the Wiktionary definition of "cosmos" is "the universe". At least there doesn't seem to be any pseudoscientific POV pushing in this definition, though.

Sunday, October 12, 2008

Anarchism

As some of you probably know, I've been working a lot on getting together a compressed full history dump of Wikipedia which can be randomly accessed for reading and writing. Inspired by the article entitled Building a (fast) Wikipedia offline reader, my focus has been on using bzip2 compression - at least up until today.

For a case study, let's take Wikipedia's article on "anarchism". As dumped in the full history dump from May, this file takes up 902 megs of space. Bzip2 compression brings this down to 51 megs, which is a huge savings. Now, bzipping (-9) the entire file takes about 13 minutes, and bunzipping it takes about 1.75 minutes, so this is clearly not a solution in itself. But "bzip2 -9" uses a block size of 900K, so it's actually possible to access those 1000 or so individual bzipped blocks.
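To make the block idea concrete, here is a minimal sketch in Python. It does not scan for block boundaries inside one existing .bz2 file (that takes bit-level searching for the block header); instead it assumes you are free to recompress the text as independent 900K streams and keep an index of where each one starts. The names compress_blocks, read_range and BLOCK_SIZE are made up for the illustration.

```python
import bz2

BLOCK_SIZE = 900 * 1000  # uncompressed bytes per block, mirroring bzip2 -9


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress data as independent bz2 streams.

    Returns (blob, index), where index[i] is the (offset, length)
    of compressed block i inside blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        chunk = bz2.compress(data[start:start + block_size], 9)
        index.append((len(blob), len(chunk)))
        blob += chunk
    return bytes(blob), index


def read_range(blob: bytes, index, start: int, length: int,
               block_size: int = BLOCK_SIZE) -> bytes:
    """Return data[start:start + length], decompressing only the
    blocks that overlap the requested range."""
    if length <= 0:
        return b""
    first = start // block_size
    last = (start + length - 1) // block_size
    out = bytearray()
    for i in range(first, min(last, len(index) - 1) + 1):
        offset, size = index[i]
        out += bz2.decompress(blob[offset:offset + size])
    skip = start - first * block_size
    return bytes(out[skip:skip + length])
```

Reading a few kilobytes out of the middle then costs one or two block decompressions rather than a pass over the whole file. The price is a slightly worse compression ratio than a single stream, plus the index itself.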

This is promising, and I was working on building indexes and recompressing things so that I could deal with this. However, the process of decompressing, manipulating, and recompressing was taking longer than I had hoped. I ran the script overnight, and in the morning less progress had been made than I had hoped for. While trying to figure out just how big the problem was, I googled for "wikipedia number of article revisions", and one of the articles I read was Will Wikipedia collapse under its own weight? I finished my search for the number of article revisions and came up with my estimate that I was 2% done. Not terrible, but slow enough that I decided to hit Ctrl-C and look deeper into the problem.

In doing this, and while thinking about that article, I decided to look into RCS. I whipped up a script to put all the revisions of "Anarchism" into an RCS file. It started out fast, but then as the revisions went up and up it got slower and slower. (I think I have a way to solve this, but let's skip that for now.) I let it finish, and the RCS file wound up being 42 megs. I then bzipped the RCS file. 3.3 megs!

Even just standing alone, the RCS files, once created, are bearable for random access. On that 42 megabyte RCS file, the current revision can be accessed in less than 0.1 seconds, and the earliest revision (which should be the slowest) can be accessed in under 3 seconds. A new revision can be added in about 3-6 seconds [1]. This is clearly too slow for production use, but wouldn't be so bad for an offline copy. At that compression ratio you could probably fit a full history dump on a single Blu-ray disc which could be randomly accessed!

But it gets better, because RCS uses a needlessly inefficient process for adding new revisions. Basically, when you add a new revision using the "ci" command, the entire (42 megabyte) file gets rewritten. It's ironic, because RCS actually makes the exact opposite mistake from Wikipedia: it puts all the metadata in text files, instead of a database (whereas Wikipedia puts all the text in a database, instead of in text files). But in theory this is completely unnecessary. All that's needed is to throw out the old "current" revision, to save the new "current" revision, and to save the reverse delta. This should be possible on the order of that 0.1 seconds. Prior revision access is, in my opinion, already bearable, but this could be brought down through checkpointing. Putting checkpoints at every 100 revisions ("Anarchism" has 13350) would only incur a small space penalty, and would likely get the access time down to less than a quarter second. Creative placement of checkpoints (at page blankings and other major revisions where the delta is huge anyway) could probably even eliminate most of the space penalty. The use of skip-deltas can do even better than that. And intelligent caching could accomplish even more. Fewer than a dozen servers with 16 gigs of memory each could hold every single uncompressed RCS file, covering the entire history of the English Wikipedia, in memory.
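Here is a rough sketch, in Python, of the reverse-delta-plus-checkpoints scheme described above. It is not how RCS lays out its files and it ignores disk storage entirely; it keeps everything in memory, uses difflib for the deltas, and the names (ReverseDeltaStore, make_delta, apply_delta, checkpoint_every) are invented for the illustration. The point is only that adding a revision touches the current text plus one new delta, and that reading an old revision never replays more than checkpoint_every deltas.

```python
import difflib


def make_delta(source, target):
    """Return instructions that rebuild target (a list of lines) from source."""
    matcher = difflib.SequenceMatcher(None, source, target, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))           # reuse source[i1:i2]
        elif tag != "delete":
            delta.append(("insert", target[j1:j2]))  # lines only in target
    return delta


def apply_delta(source, delta):
    """Rebuild the target lines from source and a delta from make_delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(source[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class ReverseDeltaStore:
    def __init__(self, checkpoint_every=100):
        self.checkpoint_every = checkpoint_every
        self.current = None    # newest revision, stored in full
        self.deltas = []       # deltas[i] rebuilds revision i from revision i + 1
        self.checkpoints = {}  # revision number -> full copy of that revision

    def add(self, lines):
        """Append a revision: one new reverse delta, nothing else is rewritten."""
        lines = list(lines)
        if self.current is not None:
            n = len(self.deltas)  # number of the revision being superseded
            self.deltas.append(make_delta(lines, self.current))
            if n % self.checkpoint_every == 0:
                self.checkpoints[n] = self.current
        self.current = lines

    def get(self, rev):
        """Return revision rev, starting from the nearest full copy at or after it."""
        newest = len(self.deltas)
        if not 0 <= rev <= newest:
            raise IndexError(rev)
        k = self.checkpoint_every
        start = -(-rev // k) * k  # first multiple of k that is >= rev
        if start in self.checkpoints:
            text = self.checkpoints[start]
        else:
            start, text = newest, self.current
        for i in range(start - 1, rev - 1, -1):
            text = apply_delta(text, self.deltas[i])
        return list(text)
```

Fixed spacing is the simplest checkpoint policy. Placing checkpoints where the delta is large anyway, or using skip-deltas, would change only the test in add and the starting point chosen in get.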

  1. I'm not sure which, because this is the time for the locking checkout and the checkin, and I'd imagine there must be a way to combine the two into one operation.

Wednesday, October 8, 2008

Obama-Ayers controversy part 2

Ben Yates commented on my previous blog post about Wikipedia's Obama–Ayers controversy article. He basically told me there's nothing to worry about, that "the article has been assessed a lot", that "Wikipedia articles on political subjects tend to be pretty accurate", and that "as soon as anyone inserts misinformation, someone else removes it, or puts a 'citation needed' tag on it."

The last part of that is certainly not the case. Gregory Kohs has in fact compiled a list of counter-examples among US Senator BLPs. (There is some extra commentary about this spreadsheet which I won't link to because I'm not sure whether or not the site it's on is currently an OMG BADSITE or not.)

But most of that vandalism I can deal with, as a reader. It certainly must suck for the subject of the article, but that's a debate for another blog post. The stuff which really ticks me off as a reader, and really makes Wikipedia articles on these sorts of subjects almost useless, is what's not there.

Well, it didn't take me too long (*) to find a good example of what I'm talking about, in the Obama–Ayers controversy article. I checked a few interesting looking history entries and found this one, "rv section to 13:29, 30 August 2008 version -- too many BLP vio / NPOV edits to properly process - discuss on talk page and propose slowly please". The diff is confusing, and there seem to be multiple edits involved, but my beef is this. According to ABC News:

Ayers admitted planting bombs at a number of government installations in the 1960s as part of protests against the Vietnam War, but he was never convicted for any crime related to these activities and no one was hurt in the incidents. In a New York Times article that, coincidentally, happened to be published Sept. 11, 2001, Ayers said "I don't regret setting bombs. I feel we didn't do enough."


and

Obama insists that he barely knows Ayers or his wife, Bernardine Dohrn, who is now a professor at Northwestern University's School of Law. Dohrn was also a member of the Weather Underground, and was once on the FBI's Top 10 Most Wanted List for inciting to riot. The couple live in Chicago and have long been politically active there.


Whereas, according to Wikipedia:

Ayers and Dohrn are fixtures of their Chicago neighborhood, "embraced, by and large, in the liberal circles dominating Hyde Park politics", according to Ben Smith, a writer for The Politico. Ayers has been described as "very respected and prominent in Chicago [with] a national reputation as an educator." But they have not been embraced everywhere due to their past leadership of the Weather Underground, a 1960s radical organization that placed bombs at a number of government institutions, causing damage, but no deaths or injuries.


Both factual? Probably. I haven't delved into the references enough to say for sure, but c'mon, the Wikipedia article is exceptionally kind with its "they have not been embraced everywhere" and no mention at all of the fact that "Ayers admitted planting bombs", that he said "I don't regret setting bombs", and that Dohrn "was once on the FBI's Top 10 Most Wanted List for inciting to riot." It's great that Wikipedia mentions the quote from Ben Smith, and in fact I'd consider the article biased if it didn't include something like that, but for this article to not even mention those three facts I listed above is incredibly biased.

Maybe it's just fear over BLP issues causing this excessive kindness. But you know what, if Wikipedia can't write a nonbiased article concerning living people due to BLP concerns, it shouldn't write one at all.

(*) Well, not too long for the sake of finding something to blog about. It took too long for me to want to do this every time I read a Wikipedia article on a controversial topic.

P.S. Before you accuse me of being biased on this, please know that my initial reaction to hearing Palin say that Obama "pals around with terrorists" was, and I quote: “Unless she has some serious evidence to back up this claim, Palin is being outrageously unethical here.” (see my shared Google Reader notes) And, frankly, I still do think she was being untruthful and unethical with that statement. But I read the Wikipedia article and came out with the impression that Ayers was merely someone who got mixed up with the wrong group in his youth.

I just checked my Google Web History. I first searched Google on "ayers wikipedia". Then, when I saw the results, since I didn't know the first name of Ayers, I changed the search to "ayers wikipedia obama". At the time, I believe [[Obama-Ayers controversy]] was the first hit for that search. In any case, that's the link I clicked on. In hindsight, I should have gone to the [[Bill Ayers]] page, which is more thorough (though I cannot vouch for its factualness or lack of bias). Still, I stand by my poor assessment of [[Obama-Ayers controversy]]. Sure, it didn't have to include everything about Ayers, but the paragraph I read on him was downright biased.

Tuesday, October 7, 2008

Sunday, October 5, 2008

Obama-Ayers controversy

Wikipedia is powerful. I just read an AP story that "Sarah Palin defended her claim that Barack Obama 'pals around with terrorists,'" and found myself typing "William Ayers" into Wikipedia and then clicking over to Obama–Ayers controversy. I have no idea how accurate and complete this article is, and a quick browse through the talk page doesn't really tell me too much.

If I didn't know that any idiot capable of using a keyboard (which apparently excludes John McCain) could have come along and screwed around with this article, it'd be by far the greatest resource on stuff like this. Long-breaking news stories are something Wikipedia excels at. But as it stands, the usefulness certainly goes down. I'm not sure how much. The fact that the article proudly advertises that "This article or section has been nominated to be checked for its neutrality." doesn't really affect my opinion one way or another, because I realize that anyone can add or remove that tag for essentially any reason.

Is this something that could even be solved? I guess in theory it's possible. If a particular version of this article had been rated a "good article", and I saw the names of a few people I trusted signing off on it, I'd feel a whole lot more comfortable about it, anyway. Is this a job Wikipedians are capable of? Or is this something better offered as a value-add by a mirror site (such as Veropedia)?

On a somewhat related point, I've turned on a cool gadget called "Article assessment" which puts at the top of the article "An unassessed article from Wikipedia, the free encyclopedia". So I quickly know that no version of this article has been assessed at all.

Wednesday, October 1, 2008

Wikipedia is not an encyclopedia

Wikipedia is not an encyclopedia. It is a website where people are trying to create an encyclopedia.

There are lots of long arguments that could be engaged in to prove this point, but there is also a relatively simple one. Pick a page in the talk namespace, or the Wikipedia namespace, or the Wikipedia talk namespace, or the User namespace, or the User talk namespace. These pages are all part of Wikipedia. But they are surely not part of an encyclopedia.