It has been reported elsewhere (Google research blog, New York Times, Language Log, IBM Research) that Fred Jelinek passed away on September 14th, 2010. I heard Fred (at a talk in Prague I think) repeat his famous 'quote' about the accuracy of his MT system going up when the linguists left the room!
And the other book I read on the beach (two books in a week!) was David Lodge's Deaf Sentence. Although the UK cover quotes the Guardian "Very funny" and the New Statesman "Seriously funny", I'd probably pick out the following key words: death, suicide and obviously deafness. Okay, so corpus linguistics gets a mention (page 32 in my edition).
While sitting on the beach recently, I read Feersum Endjinn by Iain M. Banks, recommended by Stuart. Okay, it did have something to do with work: the thoughts of one of the main characters are written in SMS-style shortenings and abbreviations.
And here's the key concept cloud from the Labour manifesto ... there seems to be much more variety of concepts here and hence the cloud is much bigger. I had to shrink it further to get it all in one screenshot.
In addition to key words, Wmatrix can produce key concepts by comparing a frequency list of the semantic fields automatically tagged in the data with the same list from a reference corpus, here again the BNC written sampler. This shows the statistically key concepts in the Conservative manifesto.
Lou Burnard spotted some conversion errors in the Libdem manifesto (extra spaces after ligatures e.g. 'fi') and Martin has now fixed smart quotes to straight ones. The new text version of the Libdem manifesto is at http://ucrel.lancs.ac.uk/wmatrix/ukmanifestos2010/ and here is the updated key word cloud. You'll notice the main difference is that "Britains" is no longer key: it was actually "Britain's", and now that the apostrophe is fixed its frequency is combined with that of "Britain".
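For anyone wanting to do this kind of clean-up automatically, here is a minimal Python sketch of the idea. It is not what Martin actually did (the files were edited by hand as far as this post goes); the function name and the character table are my own assumptions. It maps smart quotes to straight ones and expands ligature characters to plain letters. It does not try to remove the stray spaces after ligatures, since telling a stray space from a real word boundary needs a dictionary check.

```python
# Hypothetical helper: the name and the character table are illustrative only.
CHAR_MAP = str.maketrans({
    "\u2018": "'",   # left single smart quote
    "\u2019": "'",   # right single smart quote / apostrophe
    "\u201c": '"',   # left double smart quote
    "\u201d": '"',   # right double smart quote
    "\ufb00": "ff",  # ligature ff
    "\ufb01": "fi",  # ligature fi
    "\ufb02": "fl",  # ligature fl
    "\ufb03": "ffi", # ligature ffi
    "\ufb04": "ffl", # ligature ffl
})


def normalise_text(text):
    """Replace smart quotes and ligature characters with plain ASCII."""
    return text.translate(CHAR_MAP)


if __name__ == "__main__":
    print(normalise_text("Britain\u2019s \ufb01nances"))
```

With the apostrophe normalised, a tokeniser that splits on the straight apostrophe can count "Britain's" together with "Britain" rather than as a separate "Britains".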
Unlike tag clouds or those produced by Wordle, where the size of a word depends on its frequency, the Wmatrix key word clouds size each word by its statistical keyness, i.e. how different its frequency is from what would be expected (based on a large reference corpus). Here's the Conservative key word cloud.
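To make "keyness" a bit more concrete, here is a small Python sketch. It assumes the log-likelihood (G2) statistic, which is a common choice for keyness in corpus linguistics; the post itself doesn't say which measure is used, and the function name and all the counts below are made up for illustration.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) keyness of one word.

    freq_* are the word's frequencies, size_* the total word counts
    of the study corpus and the reference corpus.
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Frequencies we would expect if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    g2 = 0.0
    if freq_study > 0:
        g2 += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * g2


if __name__ == "__main__":
    # Hypothetical counts: a word seen 120 times in a 27,562-word manifesto
    # and 300 times in a 1,000,000-word reference corpus.
    print(log_likelihood(120, 27562, 300, 1000000))
```

A word with the same relative frequency in both corpora scores zero, and the score grows as the observed frequency moves away from the expected one, so a key word cloud sized by this value highlights words that are unusually frequent for the text rather than merely frequent.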
This week I've been reading the UK election manifestos, or rather I've set Wmatrix to read them for me. First, you have to convert the online versions in PDF or HTML into plain text. Saving automatically from Acrobat as plain text leaves unwanted headers and footers and some lost capitalisation. Thanks to Martin Wynne for editing the Libdem and Conservative files. I've edited the Labour manifesto by taking the HTML version from their website and marking the chapter boundaries with a pseudo-XML tag. The edited full plain text versions are available to download at: http://ucrel.lancs.ac.uk/wmatrix/ukmanifestos2010/
Labour's manifesto is 29,508 words long. The Conservative manifesto is 27,562 words and the LibDem one is shorter at 18,433 words.
Next, I loaded the files into Wmatrix and compared them to a general reference corpus for written British English. Key word clouds coming up ...
If you're used to Winzip or the built-in Windows option to add a password to a zip file, the same thing can be done on OSX with "zip -e zipfile.zip /dir/files*" on the command line. You're then prompted for a password. To unzip, I've found it best to use Stuffit Expander rather than the default Archive Utility.
It seems that this is happening a lot with third party apps on the iPod Touch. It begins with one app crashing as soon as you open it, and eventually all the other third party apps stop working. There are a lot of solutions offered on the forums, but the only thing that worked for me was a complete reset.