Automatic Spelling Dictionary Selection
By Adrian Sutton
David Welton described a frustration he had with FireFox’s spell checker which piqued my interest:
I write most of my email (in gmail) and submit most web site content in English, however, a significant portion is also done in Italian. I leave the spell checker in English, because Italian is, in general, quite easy to spell, so that even as a native speaker, a helping hand is occasionally welcome. However, it isn’t as if I write Italian perfectly either, so the help there would be nice as well. I find it quite annoying to go change the language in the spell checker option each time, especially when, as an example, I’m responding to email and do 2 in English, one in Italian, another in English, and so on. On the face of it, identifying what language an author is using looks a lot like a natural language processing problem and thus requires a PhD and many years of research to tackle. Looking a little closer though, you begin to realize that the problem domain is dramatically reduced in this case:
- We’re dealing with a preselected list of languages that the user can actually speak. For the majority of people who’d use this feature, that would be 2 languages and it would be very rare to make double digits.
- We don’t really want to know the language that’s supposed to be in use, we just want to find the spelling dictionary that most accurately checks the language that’s being used.
- We obviously have a spelling dictionary for each of the candidate languages around or it’s pointless detecting the language in the first place.
Given the first restriction of a limited number of languages, we can use a pretty brute force algorithm since we don’t need to check every language on earth. Given the third restriction, we have a big list of valid words in each language. Finally, given the second constraint, we just want to find the spelling dictionary that marks the fewest words as errors and we’re very likely to have the right language.
So all we do is check the document with each dictionary, count the errors for each and pick the dictionary that gives the fewest errors. I implemented this and it works amazingly well for such a stupid algorithm, but there’s a problem…
Dictionaries are really big. Downloading a dictionary for each possible language isn’t a lot of fun if you happen to be writing a browser based editor. Worse, loading 5 dictionaries into RAM takes an awful long time and consumes way too much RAM to cache.
Fortunately, the spell checker I happen to have handy breaks words up uses a simple optimization to avoid having to search through the entire word list every time – it includes a separate file of the most common words in the language. For the spelling checker that allows it to find a match for most words from a small hash table. For our purposes though, the most common words are exactly what we need and they reduce the size of the dictionary to something extremely manageable.
In fact, the most common words from the 12 languages EditLive! supports amount to amount to about 40k in total. Easily small enough to download them all and they load fast enough that there’s no need to cache them in RAM at all.
I was going to post the code I wrote but it depends on a proprietary, third party spelling checker and is frankly so basic that it’s really not worth sharing. I was going to put it up as a servlet for people to play around with but Google has one that’s way better anyway. Still, it was fun to do up and nice to know that something that seems to complex is actually ridiculously simple.