LL(k) Parsers

An LL(k) parser is a recursive descent parser with multi-token lookahead. It works by trying to predict which production to match next without consuming tokens from the input stream.

Predictive parsing, in simple terms, is trying to match each input with the production that is predicted to be the proper one in order to succeed. Predictions are made via lookahead. A lookahead is a record of one or more upcoming tokens in the stream that have not been consumed yet. The parser's lookahead is tested against the lookahead set of each of the productions until the parser knows with sufficient certainty which production is the proper one. A lookahead set is a record of all lookaheads that should make the parser invoke a specific production.

For example, consider this simple grammar of a single rule with two productions:

rule: A
    | B
    ;

We're running an LL(1) parser with this grammar against this part of an input stream:

... B A ...

When the parser reaches the token B, the lookahead of the parser is { B }. The parser must know which production to invoke next to cope with the current position in the stream. To make a wise decision, the parser tests the lookahead against each of the possible lookahead sets. The LL(1) lookahead set of the first production is clearly { A }, and the LL(1) lookahead set of the second production is clearly { B }. First we test the lookahead set of the first production: is { B } ⊆ { A } satisfied? No... no problem, we have more sets awaiting. Now for the second production: is { B } ⊆ { B } satisfied? Yes! Now we have enough information to predict that the second production is what comes next. A simple piece of code that makes this decision would look similar to:

if A == lookahead
	// first production 
else if B == lookahead
	// second production
else
	fail

Basically, this is how a recursive descent parser works: it predicts what to do next by analyzing what's next in the stream. In order for a recursive descent parser to work properly, it must always know what's next in the stream and update this knowledge whenever it consumes a token. So, for the example above, after B is consumed by the second production the lookahead of the parser becomes { A }.
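This decision can be sketched in runnable form. The following Python snippet (token names and the helper function are illustrative, not JetPAG output) implements the LL(1) prediction above for the grammar "rule: A | B":

```python
# A minimal LL(1) recursive descent sketch for the grammar "rule: A | B".
# Tokens are plain strings here; a real parser would use a token stream.

def parse_rule(tokens, pos):
    """Predict the production from one token of lookahead, then consume it."""
    lookahead = tokens[pos]            # peek without consuming
    if lookahead == "A":               # lookahead set of production 1: { A }
        return pos + 1                 # match A, consume one token
    elif lookahead == "B":             # lookahead set of production 2: { B }
        return pos + 1                 # match B, consume one token
    else:
        raise SyntaxError("expected A or B, got " + lookahead)

# Against the stream "... B A ...": the parser sees B first, predicts the
# second production, consumes B, and its lookahead becomes { A }.
stream = ["B", "A"]
pos = parse_rule(stream, 0)
print(stream[pos])                     # the new lookahead token
```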

Can an LL(1) parser cope with all LL(k) grammars? To answer this question, let's consider another LL(k) grammar:

rule: X A
    | X B
    ;

We're running an LL(1) parser with this grammar against this part of an input stream:

... X B ...

Obviously the parser should invoke the second production. Or would it? Let's go over the procedure again. When the parser reaches the X token, the lookahead of the parser is { X } (!!). The LL(1) lookahead set of the first production is { X }, and so is the LL(1) lookahead set of the second production. So the first test, { X } ⊆ { X }, would be satisfied for the first production; the first production would be invoked and would eventually fail when it tries to match A while B is sitting there in the stream, looking for problems. Problems like this are called non-determinism, and the clashing productions are called ambiguous.

To overcome such problems, there are basically two solutions. One solution is transformation, which tries to transform the grammar so that it becomes deterministic for LL(1) parsers. The proper transformation for this case is left-factoring, which factors out the longest identical left sections of two or more productions and reduces the productions to their non-identical parts:

 rule1: X ( A | B );

This solution removes the clash without changing the properties of the parser: { X } would satisfy rule1 as a whole without problems, X would be consumed, and the decision becomes exclusively between A and B. Eventually the test would satisfy B's { B } ⊆ { B }, and the problem is gone. One notable advantage of left-factoring and most other transformations is the reduced size of the grammar. This might sound like a magical solution, but there are two main drawbacks:
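As a sketch (with illustrative token names, not JetPAG-generated code), an LL(1) parser for the left-factored rule1 could look like this in Python:

```python
# LL(1) sketch for the left-factored rule "rule1: X ( A | B );".
# After X is consumed, the decision is exclusively between A and B.

def parse_rule1(tokens, pos):
    if tokens[pos] != "X":             # the whole rule's lookahead set is { X }
        raise SyntaxError("expected X")
    pos += 1                           # consume X; lookahead moves forward
    if tokens[pos] == "A":             # sub-decision: first alternative
        return pos + 1
    elif tokens[pos] == "B":           # sub-decision: second alternative
        return pos + 1
    else:
        raise SyntaxError("expected A or B")

# The stream "X B" now parses without any clash between the alternatives.
print(parse_rule1(["X", "B"], 0))
```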

  1. Complication. Most transformations increase the complexity of grammars and make them, together with the generated code, harder to read. For example, this grammar:

    A: X A B
    | X A C
    | X D D
    ;
    

    Would be transformed to:

    A: X (A ( B | C ) | D D );

    Which is obviously much more complicated than the original grammar.

  2. Limited application. Not all grammars are transformable. For example, embedding actions before X in either of the two alternatives in the second example would make left-factoring X impossible (and embedding actions before either of the two A's would make left-factoring A impossible...). Some grammars simply cannot be transformed via left-factoring, like this one:

    INT: "0x" ('0'-'9'|'a'-'f'|'A'-'F')+
       | '0'-'9'+
       ;
    

To overcome such problems we use parsers with multi-token lookahead, or LL(k) parsers. An LL(k) parser is a parser that looks deeper than one token into the stream, specifically k tokens. Now, in order to test a lookahead against an LL(2) lookahead set, say { A1, A2 } against { B1, B2 }, we must satisfy both A1 ⊆ B1 and A2 ⊆ B2.

To illustrate how an LL(k) parser would solve the non-determinism problem introduced in example 2, we'll try to parse the stream again, this time with an LL(2) parser. When the parser reaches the X in the stream, the lookahead of the parser is { X, B }. The LL(2) lookahead set of the first production is { X, A } and the LL(2) lookahead set of the second production is { X, B }. First we test against the lookahead set of the first production: are { X } ⊆ { X } and { B } ⊆ { A } satisfied? No... no problem, more lookahead sets are awaiting. For the second production: are { X } ⊆ { X } and { B } ⊆ { B } satisfied? Yes! The proper next production is the second, and the problem is solved. A simple piece of code that makes this decision would look similar to:

if X == lookahead-at-0 and A == lookahead-at-1
	// first production 
else if X == lookahead-at-0 and B == lookahead-at-1
	// second production
else
	fail
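The same decision in runnable form: a Python sketch of a hand-written LL(2) predictor for "rule: X A | X B", using a two-token lookahead buffer (token names are illustrative, not JetPAG output):

```python
# LL(2) sketch for "rule: X A | X B": the prediction reads TWO tokens of
# lookahead before committing to either production.

def parse_rule(tokens, pos):
    la0, la1 = tokens[pos], tokens[pos + 1]   # two-token lookahead buffer
    if la0 == "X" and la1 == "A":             # LL(2) set of production 1: { X, A }
        return pos + 2                        # match X A
    elif la0 == "X" and la1 == "B":           # LL(2) set of production 2: { X, B }
        return pos + 2                        # match X B
    else:
        raise SyntaxError("cannot predict from " + la0 + " " + la1)

# The stream "X B" that defeated the LL(1) parser is now predicted correctly.
print(parse_rule(["X", "B"], 0))
```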

LL(k) parsers solve non-determinism problems by looking further into the stream. In contrast to transformations, increasing the depth of the parser, or raising k, does not affect the grammar. LL(k) parsers use larger lookahead sets, which require more comparisons when LL(1) sets are not sufficient. JetPAG greatly reduces these checks by utilizing the concept of lookahead paths, which allows even multi-stage optimizations. JetPAG nearly eliminates the overhead of lookahead invocations by utilizing fast local caching. Optimizations are an integral part of JetPAG, and performance is a design goal.