14 September 2015

Around four years ago, when I had first decided to start at Mozilla research, I had planned to write an LR(1) parser generator. It seemed like a good way to get to know Rust. However, I found that newborns actually occupy somewhat more time than anticipated (read: I was lucky to squeeze in a shower), and hence that never came to pass.

Well, I’m happy to say that, four years later, I’ve finally rectified that. For a few months now I’ve been working on a side project while I have my morning coffee: LALRPOP (pronounced like some sort of strangely accented version of “lollypop”). LALRPOP is an LR(1) parser generator that emits Rust code. It is designed for ease of use, so it includes a number of features that many parser generators are missing:

  • Regular-expression-like notation, so you can write Id* for “any number of Id” or Id? for “an optional Id”.
  • User-defined macros, so you can make a macro like Comma<Id> that means “comma separated list of Id with optional trailing comma”.
  • Conditional macros, so you can easily generate a subset of your grammar for some particular context (like, all expressions that don’t end in {).
  • Support for synthesizing tokenizers (currently somewhat limited, but sufficient for many uses) as well as external tokenizers (very flexible). If you’re using an external tokenizer, you don’t even need to be parsing input strings at all really, any iterator of “matchable values” will do.
  • Easy to get access to positional information.
  • Easy to write fallible rules where the action code can generate a parse error.

If you’d like to learn more about LALRPOP, I recently started a tutorial that introduces LALRPOP in more depth and walks through most of its features. The tutorial doesn’t cover everything yet, but I’ll try to close the gaps.

Why LR(1)? After all, aren’t LR(1) generators kind of annoying, what with those weird shift/reduce errors? Well, after teaching compiler design for so many years, I think I may have developed Stockholm syndrome – I kind of enjoy diagnosing and solving shift/reduce failures. ;) But more seriously, I personally like that once I get my grammar working with an LR(1) generator, I know that it is unambiguous and will basically work. When I’ve used PEG generators, I usually find that they work great in the beginning, but once in a while they will just mysteriously fail to parse something, and figuring out why is a horrible pain. This is why with LALRPOP I’ve tried to take the approach of adding tools to make handling shift/reduce errors relatively easy – basically automating the workarounds that one typically has to do by hand.

That said, eventually I would like LALRPOP to support a bunch of algorithms. In particular, I plan to add something that can handle universal CFGs, though other deterministic techniques, like LL(k), would be nice as well.

Performance. Another advantage of LR(1), of course, it that it offers linear performance. That said, I’ve found that in practice, parsing based on a parsing table is not particularly speedy. If you think about it, it’s more-or-less interpreting your grammar – you’ve basically got a small loop that’s loading data from a table and then doing an indirect jump based on the results, which happen to be the two operations that CPUs like least. In my experience, rewriting to use a recursive descent parser is often much faster.

LALRPOP takes a different approach. The idea is that instead of a parsing table, we generate a function for every state. This ought to be quite speedy; it also plays very nicely with Rust’s type system, since types in Rust don’t have uniform size, and using distinct functions lets us store the stack in local variables, rather than using a Vec. At first, I thought maybe I had invented something new with this style of parsing, but of course I should have known better: a little research revealed that this technique is called recursive ascent.

Now, as expected, recursive ascent is supposed to be quite fast. In fact, I was hoping to unveil some fantastic performance numbers with this post, but I’ve not had time to try to create a fair benchmark, so I can’t – since I haven’t done any measurements, LALRPOP’s generated code may in fact be quite slow. I just don’t know. Hopefully I’ll find some time to rectify that in the near future.

100% stable Rust. It’s probably worth pointing out that LALRPOP is 100% stable Rust, and I’m committed to keeping it that way.

Other parser generators. Should LALRPOP or LR(1) not be too your fancy, I just want to point out that the Rust ecosystem has grown quite a number of parser combinator and PEG libraries: nom, oak, peg, nailgun, peggler, combine, parser-combinators, and of course my own rusty-peg (and probably some others I’ve missed, sorry). I’m not aware of any other LR(1) (or GLL, GLR, etc) generators for Rust, but there may well be some. There are also two LALR parser generators for Rust you may want to check out, racc and and lemon_rust.

Future plans. I’ve got some plans for LALRPOP. There are a host of new features I’d like to add, with the aim of eliminating more boilerplate. I’d also like to explore adding new parser algorithms, particularly universal algorithms that can handle any CFG grammar, such as GLL, GLR, or LL(*). Finally, I’m really interesting in exploring the area of error recovery, and in particular techniques to find the minimum damaged area of a parse tree if there is an incorrect parse. (There is of course tons of existing work here.)

Plea for help. Of course, if I wind up doing all of this myself, it might take quite some time. So if you’re interested in working on LALRPOP, I’d love to hear from you! I’d also love to hear some other suggestions for things to do with LALRPOP. One of the things I plan to do over the next few weeks is also spend some more time writing up plans in LALRPOP’s issue database, as well as filling out its wiki.