I Love Parser Generators, I Hate Parser Generators

By Adrian Sutton

June 22, 2009

Parsers drive me mad. I was reminded on the weekend of how much I like working with parser generators – they’re just so pure and clean. You really feel like you’re working with a grammar and all those CS lectures come flooding back. Writing code to parse the same content by hand just never has that feel. Plus they create incredibly accurate parsers in very little time at all.

I was also reminded of how much I hate parser generators. They generate very accurate code which is great when you have very accurate input. In the real world, it just means the parser craps out an awful lot on very minor syntax problems. So then you try to make the grammar more flexible to accept that input and the generator just complains that the language is no longer LL(1).
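To make that complaint concrete, here is a rough sketch of the kind of check that trips. It is a deliberately simplified stand-in for what a real generator does: it assumes every alternative starts with a terminal and ignores empty productions, so it only catches the most obvious LL(1) clash. The grammar, the token names and the `ll1_conflicts` function are all made up for illustration.

```python
def ll1_conflicts(grammar):
    """Report alternatives of one rule that start with the same token.

    Simplifying assumption: every alternative begins with a terminal,
    so its FIRST set is just its first symbol.
    """
    conflicts = {}
    for rule, alternatives in grammar.items():
        by_first_token = {}
        for alternative in alternatives:
            if not alternative:
                continue  # empty productions are ignored in this sketch
            by_first_token.setdefault(alternative[0], []).append(alternative)
        # With one token of lookahead, two alternatives that start with the
        # same token cannot be told apart.
        clashes = {tok: alts for tok, alts in by_first_token.items() if len(alts) > 1}
        if clashes:
            conflicts[rule] = clashes
    return conflicts


# Hypothetical strict grammar: a declaration is always IDENT COLON VALUE.
strict = {"declaration": [["IDENT", "COLON", "VALUE"]]}

# Hypothetical "more flexible" grammar: also accept a bare IDENT as junk.
tolerant = {"declaration": [["IDENT", "COLON", "VALUE"], ["IDENT"]]}

print(ll1_conflicts(strict))    # {}
print(ll1_conflicts(tolerant))  # both alternatives clash on IDENT
```

Adding the one extra alternative for junk input is enough to create the clash, which is roughly the moment the pen and paper come out.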

Off we go into the deep dark depths of those CS lectures. Now all of a sudden you find yourself with pen and paper out, drawing states and the paths between them. Pretty soon you want to migrate to A3 paper and then on to butcher’s paper. Eventually you find yourself writing on the wall.

Real world content just isn’t sane. You can’t tokenize it first and then just use those tokens – characters in different places have all kinds of different meanings. You don’t really want to validate the content as you read it¹; you just want to do the most brain-dead simple thing to get that content in and in a form that you can work with².

I run into this every time I work with parser generators: I spend so much time making the grammar fully tolerant that it winds up being easier to just write the entire thing by hand. I just can’t help but think that there should be a better way though.
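For a sense of what "by hand" can look like, here is a minimal sketch of a tolerant parser for the inside of a CSS declaration block. It is not the code I actually wrote, just an illustration: the function name `parse_declarations` is made up, and it assumes no strings, comments or nested brackets contain a `;`, which real CSS does not guarantee. The recovery rule is the brain-dead one: anything that doesn't look like `name: value` is dropped and parsing carries on at the next semicolon.

```python
def parse_declarations(block: str) -> list[tuple[str, str]]:
    """Leniently parse 'name: value; name: value' into (name, value) pairs.

    Assumption for this sketch: ';' never appears inside a value.
    """
    declarations = []
    for chunk in block.split(";"):
        # Split on the first colon only, so values such as url(http://...)
        # keep their own colons.
        name, colon, value = chunk.partition(":")
        name, value = name.strip(), value.strip()
        # Error recovery: no colon, no name or no value means the chunk is
        # junk, so skip it instead of failing the whole parse.
        if not colon or not name or not value:
            continue
        declarations.append((name.lower(), value))
    return declarations


print(parse_declarations("color: red; bogus; width 10px; : blue; MARGIN: 0"))
# [('color', 'red'), ('margin', '0')]
```

There is no grammar to keep LL(1) here, which is exactly why the tolerance is cheap. The price is that nothing checks the rules for you, so values still need to be treated as untrusted by whatever code uses them.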

1 – Beyond making sure you’re avoiding buffer overflows and that the resulting model isn’t dangerous, etc. Often, though, those kinds of checks are best done in the code that actually does the work (i.e. assume every value is user supplied rather than assuming that the content is all nice and safe).

2 – This, of course, is situation-dependent. I happened to be parsing CSS, where tolerance is the key to success. Parsing configuration files, on the other hand, should be strict and fail fast.