Friday, April 12, 2013

What's the straight dope on the I before E rule?

So I was minding my own business on Pinterest the other night (I read it for the articles?) and came across this monstrosity.


And... I was hooked. It can't be that I memorized a rule that actually does more harm than good, right? So, I got all nerdy on it.

I've written a game about regular expressions, and a lot of the levels end up being word games (e.g., match all the words that start with pre, like present, but don't match words that merely contain pre, like unpredictable).
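For the curious, that kind of level boils down to an anchor. Here's a minimal sketch, assuming a made-up handful of sample words (not taken from the game) and grep -E, which is the current spelling of egrep:

```shell
# Hypothetical sample words, one per line, standing in for a real word list.
words='present
prefix
unpredictable
express
pretend'

# ^pre anchors the match to the start of the line, so words that only
# contain "pre" somewhere in the middle (unpredictable, express) are skipped.
starts_with_pre=$(printf '%s\n' "$words" | grep -E '^pre')

printf '%s\n' "$starts_with_pre"
```

Drop the ^ and all five words match, which is exactly the trap the level is built around.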

Writing the game, I needed a copy of all the dictionary words that I could mess with on the command line. I used the Ispell list from this list of wordlists. It breaks the whole language up into a set of files: general English (cromulent in America and England and Australia) in some, things only acceptable within dialects in others. The file numbering also roughly hints at use frequency (e.g., cat and the are in a different file than amidship). So the most commonly used, globally acceptable file is english.0.

Let's use regular expressions to test the rule.

Words that follow the rule: I before E when there's no C in front: (like piece)
$ egrep "[^c]ie" english.0 |wc -l
    1867
Words that follow the rule: E before I after C: (like inconceivable)
$ egrep "cei" english.0 |wc -l
    62

Words that break the rule by having I before E after C: (like science)
$ egrep "cie" english.0 |wc -l
     116
Words that break the rule: E before I and no C around: (like "feisty heist on a weird beige foreign neighbor")
$ egrep "[^c]ei" english.0 |wc -l
     250
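
If you want to poke at the four patterns without downloading english.0, here's a rough sketch that runs them against a tiny made-up word list. The sample words and the count helper are my own inventions for illustration, so the numbers it prints say nothing about the real file. It uses grep -E with -c (count matching lines) in place of egrep piped to wc -l.

```shell
# Hypothetical mini word list, one word per line (NOT the real english.0).
words='piece
friend
receive
ceiling
science
ancient
weird
heist
eight
cat'

# count PATTERN: print how many sample words match the extended regex.
# grep -E is the current spelling of egrep; -c prints the matching-line count.
count() {
    printf '%s\n' "$words" | grep -Ec "$1"
}

echo "follows the rule, ie with no c: $(count '[^c]ie')"   # piece, friend
echo "follows the rule, cei:          $(count 'cei')"      # receive, ceiling
echo "breaks the rule, cie:           $(count 'cie')"      # science, ancient
echo "breaks the rule, ei with no c:  $(count '[^c]ei')"   # weird, heist

# [^c] has to consume a character, so a word that STARTS with ei (eight)
# slips past '[^c]ei'. Allowing start-of-line as an alternative catches it.
echo "ei with no c, incl. word start: $(count '(^|[^c])ei')"  # weird, heist, eight
```

That last pattern is worth a look: because [^c] must match some character, words that begin with ei or ie are not counted by the [^c] versions at all.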

What have we learned?

Jeremy is a lunatic.

OK, but really, the base of the rule is sound: when in doubt, i before e is right 1867 times out of 2194 (in base, international English), or 85% of the time. I like those odds.

The problem is, the extension everyone remembers, "except after C", sucks. There are 327 exceptional words (including weird and inconceivable) but "after C" only accounts for roughly 20% of the exceptions.
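
The percentages are easy to double-check on the command line. A quick sketch, assuming the counts quoted above (the variable names are made up for illustration) and using awk for the division, since plain shell arithmetic is integer-only:

```shell
# Counts copied from the text above; variable names are illustrative only.
ie_no_c=1867   # words matching [^c]ie
total=2194     # total used for the 85% figure
cei=62         # words matching cei

# Exceptions are whatever is left once the plain i-before-e words are removed.
exceptions=$(( total - ie_no_c ))

# awk does the floating-point division and rounds to one decimal place.
base_pct=$(awk -v n="$ie_no_c" -v d="$total" 'BEGIN { printf "%.1f", n / d * 100 }')
after_c_pct=$(awk -v n="$cei" -v d="$exceptions" 'BEGIN { printf "%.1f", n / d * 100 }')

echo "exceptions: $exceptions"
echo "plain i before e is right: $base_pct%"
echo "exceptions that come after C: $after_c_pct%"
```

With these inputs the after-C share lands a shade under 20%, so "about a fifth" is a fair summary.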

So, the hell with it. Let's have a laugh with Brian Regan, instead.