
LL(k) Parsers
A LL(k) parser is a multitoken lookahead Recursive Descent parser. This means it works by trying to predict which production to match next wihout consuming tokens from the input stream. Predictive parsing is in simple terms trying to match each input with the production that is predicted to be the proper in order to success. Predictions are done via lookahead. Lookahead is a recods of one or more next tokens in the stream not consumed yet. Lookahead of a parser is tested against the lookahead set of each of the productions, until i t sufficiently knows which production is proper. A lookahead set is a record of all lookaheads that are suitable to make the parser invoke a specific production. For example consider this simple garmmar of a single rule with two productions: rule: A  B ; We're running an LL(1) parser with this grammar against this part of an input stream ... B A ... When the parser reaches the token B, the lookahead of the parser is { B }. The parser must know which production is next to invoke to cop with the current position in the stream. To make a wise decision, the parser has to test the lookahead against each of the possible lookahead sets. The LL(1) lookahead set of the first production is clearly { A }, of the LL(1) lookahead set of the second production is clearly { B }. First we test the lookahead set of the first production: is { B } ⊆ { A } satisfied? no ... No problems, we have more sets awaiting. Now for the second production: is { B } ⊆ { B } satisfied? yes! Nice, now we have enough information to predict that the second prodctuction is what next. A simple code that makes this decision would be similar to: if A == lookahead // first production else if B == lookahead // second production else fail Basically this is how a recursive descent parser works. It tries to predict what to do next by analyzing what's next in the stream. In order for a recursive descent parser to work properly, it must always know what's next in the stream and update it whenever it consumes a token. So, for the exampel above, after B is consumed by rule2 the lookahead of the parser becomes { A }. Can a LL(1) parser cop with all LL(k) grammars? To answers this question, let's consider another LL(k) grammar: rule: X A  X B ; We're running an LL(1) parser with this grammar against this part of an input stream ... X B ... Obviously the parser should invoke rule2. Or would it? Let's go over the procedure again. when the parser reaches the X token, the looahead of the parser is { X } (!!). The LL(1) lookahead set of the first production is { X }, and also of the second production is { X }. So the first test { X } ⊆ { X } would be satisfied for the first production, the first production would be invoked and eventually would fail when it tries to match A while B would be there looking for problems. Problems similar to this are often called Nonedeterminism, and the clashing productions are called Ambiguous. To overcome such probelms, there are basically two solutions. One solution is transformations, which tries to transform the grammar so that it becomes deterministic for LL(1) parsers. The proper transformation for this case is leftfactoring, which puts out the longest identical left sections of two or more productions, and reduces the productions to the nonidentical parts: rule1: X ( A  B ); This solution removes the clash without changing the properties of the parser, { X } would satisfy rule1 as whole without problems, X would be consumed, and the decision becomes exclusively between A and B. Eventually the test would satisfy B's { B } ⊆ { B }, and the problem is gone. One notable advantage of leftfactoring and most other transformations is the reduced size of the grammar. This might sound as a magical solution, but there are two main drawbacks:
To overcome such problems we use parsers with multitoken lookahead or LL(k) parsers. A LL(k) parser is a parser that looks in the stream deeper than one token, specifically ktokens. Now in order to test two, let's say LL(2) lookahead sets, { A_{1}, A_{2} } and { B_{1}, B_{2} }, we must satisfy A_{1} ⊆ B_{1} and A_{2} ⊆ B_{2}. To illustrate how a LL(k) parser would solve the nonedeterminism problem introduced in example 2, we'll try to parse the stream again but this time with a LL(2) parser. When the parser reaches the X in the stream the lookahead of the parser is { X, A }, the LL(2) lookahead set of the first production is { X, A } and the LL(2) lookahead set of the second production is { X, B }. First we test against the lookahead set of the first production: are { X } ⊆ { X } and { B } ⊆ { A } satisfied? no ... No problem, more lookahead sets are awaiting. For the second production: are { X } ⊆ { X } and { B } ⊆ { B } satisfied? yes! The proper next production is the second, and the problem is solved. A simple code that makes this decision would be similar to: if X == lookaheadat0 and A == lookaheadat1 // first production else if X == lookaheadat0 and B == lookaheadat1 // second production else fail LL(k) parsers solve nonedeterminism problems by looking further in the stream. In contrast to transformations, increasing the depth of the parser, or raising k, does not affect the grammar. LL(k) parsers use larger lookahead sets which require more comparison, when LL(1) sets are not sufficient. JetPAG greatly reduces checks by utilizing the concept of lookahead paths which allows even multistage optimizations. JetPAG nearly eliminates the overhead of lookahead invokations via utilizing fast local caching. Optimizatizations are an integral part of JetPAG and performance is a design goal. 