JetPAG - Jet Parser Auto-Generator
Lookahead paths explained

Recursive-descent recognizers have a very good advantages like easier construction, simpler structure and better code redability. Any recursive-descent recognizer is predictive, i.e. it predicts what to execute before actual execution. For example a recognizer governed by a grammar like:

a: A | B

would need to predict what alternative to pick without making any consumptions from the stream. A code that does this looks like:

switch (ncla())
{
case A:
	// match A
break;
case B:
	// match B
break;
dfault:
	// fail
}

Such simple one-by-one checks are not bad for LL(1) grammars and some LL(k) grammar. However they can get really dirty, consider this example:

a: A B | A C

A normal code for a recognizer that matches this grammar would look like:

if (A==ncla() && B==ncla(1))
	// match A B
else if (A==ncla() && C==ncla(1))
	// match A C
else
	// fail

It is obvious that such code includes a redundant check for A. In real like even mor dirty thing may happend. For such situations, and more complicated ones, JetPAG utilizes a very important concept called lookahead paths.

Lookahead paths are special forms of representing a set of lookaheads based on extensive information collected using special functions. The main concept behind lookahead paths is to never contain any redundant tasks, and house many optimizations. In lookahead paths each groups of lookahead sets is nested recursively in a smaller tree, where this nesting translates to some reduction in the lookahead. Lookahead paths guarantee that the runtime wouldn't check an atom at some depth in the stream more than once, gaining a notable reduction in run-time overhead of predictions.

A lookahead path tree is aligned in a special structure. Each tree holds a mixed collection of leaves, each pointing to some altenative and end the prediction, and sub-trees, which drive the prediction a step deeper. If the prediction process enters a sub-tree, it never leaves it and must end at some leaf in the sub-tree's family, this way the prediction process walks in one path starting at the root and ending at some leaf, never stepping back. The flow of the prediction process is similar to extended binary search trees or state machines.

For our example, let's list the lookahead sets available and try to figure out what can we do with lookahead paths to improve the recognizer:

  1. A B → alt1
  2. A C → alt2

We can see that alt1 and alt2 share A at left, thus we can left-factorize A from them:

  1. A
    1. B → alt1
    2. C → alt2

We get a simple tree for the lookahead path. A code generated for this tree would look like:

if (A==ncla())
	switch (ncla(1))
	{
	case B:
		// alt2
	break;
	case C:
		//  atl3
	break;
	default:
		// fail
	}
else
	// fail

It is clear the recognizer will perform better at runtime, added the fact the the achieved LL(1) inner block is generated as switch..case, which further performs better than if..else blocks.

Regarding left-factorization, trees can have a useful leaf called the epsilon leaf. Epsilon in linguistics means and empty string, and in lookahead paths simillary means what to match if all the other leaves in the tree fail to predict. Epsilon can be though as the 'else' in if..else block of the 'default' in switch..case blocks. For example the grammar:

a: A B | A

Would normall be predicted with a bad case if only A is available in the stream, as A would have a redundant check:

if (A==ncla() && B==ncla(1))
	// match A B
else if (A==ncla())
	// match A
else
	// fail

while a well-form tree for a lookahead path would look like:

  1. A
    1. B → alt1
    2. ε → alt2

Note the epsilon case. A code generated that does this would look like:

if (A==ncla())
	if (B==ncla(1))
		// match A B
	else
		// match A
else
	// fail

Although lookahead paths seem to be more complicated in terms of code readability than normal flat predicton blocks, they actually tend to push code readability and ease of learning the code further. This is because lookahead paths, as shown, tend to make the prediction process 'more logical', as they contain no redundant checks and make the most out of what they get from the stream lookahead and greatly increase efficiency of generated code. Code readability is also increased due to less code generated, this advantage can be better seen in larger blocks like those that normally reside at scanner::nextToken().

Lookahead paths can also house several more features. One frequently used form of grammars is using a complementary lookahead, as in this example:

a: A B | ~A C

The alternative block is LL(1). An initial lookahead path is:

  1. A → alt1
  2. ~A → alt2

It seems clear that if '"' is if the first check for '"' fails, then the second sure will succeed (unless we're at the end of the input, this is handled well). So a better lookahead paths would look like:

  1. A → alt1
  2. ε → alt2

Now lookahead paths can be used to generalize this method to more than two complementary leaves:

A: A B | A ~B | ~A C | ~A ~C
  1. A
    1. B → alt1
    2. ε → alt2
  2. ~A
    1. C → alt3
    2. ε → alt4

We can see that the second sub-tree itself complements the first one, so we can check its content directly. This is the final tweaked tree, the small numbers indicate the depth of the lookahead checks:

  1. A[0]
    1. B[1]alt1
    2. ε → alt2
  2. C[1]alt3
  3. ε alt4

The result is a worst case of 2 checks. Compared to the normal 8, lookahead paths helped gaining a reduction of 75% in worst case checks in this situation. It is clear that lookahead paths are practically useful, and help improve the performance of recursive-descent recognizers greatly. In real world lookahead paths become very useful, as they approach situations frequently ecnountered. Let's look at a real-world grammar for some C operators:

Less:		'<';
LessEqu:	"<=";
LShift:		"<<";
LSiftAssign:	"<<=";

This grammar is LL(3). Any recursive-descent lexer governed by this grammar may trigger a worst case of 8 checks. But with lookahead paths a worst case won't trigger more than 4 checks:

  1. '<'
    1. '=' → LessEqu
    2. '<'
      1. '=' → LSiftAssign
      2. ε → LShift
    3. ε → Less

From previous examples in this article we can see that lookahead paths are really useful in practice.