Scannerless Parsers and Compile-Time Evaluation

April 08, 2005, at 12:00 AM

I have an implementation of Tomita's GLR Algorithm, which is able to parse any context-free grammar, including ambiguous ones. Earley's Algorithm seems to be getting more attention lately, and I'm not sure why, since as far as I know it's strictly less efficient. What I am attempting to do is expand my framework to be a scannerless parser, which would allow rapid development of parsers for just about anything, in native Lisp.

Tomita's Algorithm, like the traditional yacc LALR(1) algorithm and its variants, precomputes a parse table which drives the actual parse. The difference is that the parse stack isn't really a stack at all, it's a DAG, which means that when you there are shift-reduce or reduce-reduce conflicts, the parser is able to nondeterministically try both branches. A backtracking recursive-descent parser can do that, too - but here there's no exponential blow-up.

A scannerless parser essentially combines the roles of lex and yacc. Of course, the reason lexical analyzers have typically been separate is that most programming-language grammars are not context-free. Happily, GLR lends itself handily to extensions which address this. Right now, I have implemented one such extension, reject productions, and I am thinking about whether I should also implement follow restrictions. Reject productions say "don't allow anything that matches this," and follow restrictions say "don't allow matches which are followed by this character."

I'm pretty much done with adding features now, and am working on packaging and polishing. I still consider the code alpha-quality, and I'm going to need to clean it up quite a bit before I put it up...

This brings us to the subject of compile-time evaluation. Since generating the parse table does take some time, of course it needs to be pregenerated and saved for the next invocation of the parser. Traditional parser generators do this by outputting a new source file which contains the entire runtime component of the parser, as well as the parse table. For example, your input file would be grammar.g; the generator would output grammar.cpp, which you would then compile into grammar.o.

Although I haven't entirely convinced myself that there's anything wrong with doing things that way, with Lisp's introspective capabilities, it seems as though it ought to be able to do better than this. Couldn't the parse table just be computed when you compile the file, and stored into the fasl?

I'm not sure exactly how to do that, though. It looks like eval-when doesn't help with that; it can compute the parse table at compile time, but I don't see any way to save it for later loading. The best I can think of is to define a macro which does the computation at macro-expansion time. Does anybody else have suggestions?

Here are some papers which explain more about this type of parser:

"An Efficient Augmented-Context-Free Parsing Algorithm" -- Tomita 1987, 16 pages

"Scannerless Generalized-LR Parsing" -- Visser 1997, 54 pages

"Disambiguation Filters for Scannerless Generalized LR Parsers" -- Brand, Scheerder, Vinju, and Visser 2002, 15 pages

TrackBack

TrackBack URL for this entry:
http://www.accela.net/~dankna/cgi-bin/mt/mt-tb.cgi/4

Comments

(eval-when (:compile-toplevel) ...) doesn't help you because it only help to evaluate stuff at compile time, but doesn't let you store anything in the resulting code. You can either use a macro to generate the table at macroexpansion time and then put the table as a literal value in the source code, or you can use read-time evaluation to do something similar (with #.). However, this only works if your table is externalizable. Another approach would be to use load-time-value which allows you to generate the table once at load time. Afterwards, it will be treated as a constant.

I hope this helps.

Thanks... Yeah, it's what I eventually figured out (PL re-ran an old entry, for some reason). I decided it wasn't that important: if it's not going to be seamless for the user, I might as well fall back to the old-fashioned way of outputting an intermediate source file and compiling it. Doing it at load time isn't acceptable because, well, it's a compilation task; it could potentially be fairly lengthy.

cl-yacc uses the macro approach, which is trivial:

(defmacro define-grammar (name &body body)
  "..."
  (multiple-value-bind (options make-options productions) (parse-grammar body)
    (declare (ignore make-options))
    `(defparameter ,name
      ',(apply #'make-grammar
               :name name
               :productions productions
               options))))

The only restriction is that your compiled-grammar object must be externalizable, but defining the appropriate make-load-form methods isn't usually hard.

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)