Syntax Definitions

Syntax definitions are one of the most important parts of a grammar specification. A syntax definition is a pattern that the recognizer follows at run time, matching the free-form input against it. The syntax patterns accepted by JetPAG are similar to regular expressions and EBNF, with some extensions.

Syntax definitions appear in normal scanner rules, hidden scanner rules, all parser rules and inline rules. They come after the colon and end with a semicolon. Here are some examples:

A:	'a'
	;

INT:	'0'-'9'+
	;

The most basic units of a syntax definition are literals, each of which matches a single atom in the stream. In scanner rules, literals can be written in three ways: character literals, string literals and integers. Integers can be written in either decimal or hexadecimal notation. In this example all four rules are equivalent:

A:	'a'
	;
B:	"a"
	;
C:	97
	;
D:	0x61
	;
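
The equivalence of the four rules comes down to the character code of 'a'. The following check is only an illustration written in Python, which is not part of JetPAG; the variable names are made up for the example:

```python
# All four literal forms denote the same character code.
code_from_char = ord('a')     # character literal 'a'
code_from_string = ord("a")   # one-character string literal "a"
decimal_form = 97             # decimal integer
hex_form = 0x61               # hexadecimal integer

all_equal = code_from_char == code_from_string == decimal_form == hex_form
print(all_equal)
```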

In parser rules, literals are token references, so they must refer to a token in the available set of token types. The first form of token reference names the token type explicitly. The second, more restricted form is a string (or character) literal, which refers to the token type whose fixed value equals that literal. The second form is more natural and suits punctuation marks and keywords. This example shows a parser rule that uses both forms:

grammar G:
parser P:
addition_opr:
	INT '+' INT
	;
scanner L:
INT:	'0'-'9'+
	;
PLUS:	'+'
	;

Literals can be extended into ranges and sets. Ranges can be specified only in scanner rules, by putting a hyphen (-) between the boundaries. Sets are detected automatically and are allowed in all rules; a set is simply an alternation of several literals or ranges. Note that JetPAG automatically reduces sets to ranges or literals, and ranges to literals, where possible. In this example rule A has a range and rule B has a set:

A:	'a'-'z'
	;
B:	'a'-'z'
|	'A'-'Z'
	;
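
One way to picture the reduction of sets into ranges is as a merge of overlapping or adjacent character-code intervals. The Python sketch below is an assumption about how such a reduction could work, not JetPAG's actual algorithm; the function name reduce_set and the (low, high) tuple representation are hypothetical:

```python
def reduce_set(ranges):
    """Merge overlapping or adjacent inclusive (low, high) code ranges."""
    merged = []
    for low, high in sorted(ranges):
        if merged and low <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend that range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged

# 'a'-'m' | 'n'-'z' collapses into the single range 'a'-'z'.
lower = reduce_set([(ord('a'), ord('m')), (ord('n'), ord('z'))])
# 'a'-'z' | 'A'-'Z' cannot be merged and stays a set of two ranges.
letters = reduce_set([(ord('a'), ord('z')), (ord('A'), ord('Z'))])
```

A range whose two boundaries are equal, such as (97, 97), corresponds to a single literal.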

Literals, ranges and sets can all be inverted by preceding them with a tilde (~). For example, this rule matches any character whose code lies outside the range 0-255, and therefore never matches an ASCII character:

NO_ASCII:
	~0-255
	;
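
Inversion can be read as a negated membership test on the character code. A minimal Python sketch, assuming inclusive range boundaries; the helper name matches_inverted is hypothetical:

```python
def matches_inverted(code, low, high):
    """Return True if code lies outside the inclusive range low-high."""
    return not (low <= code <= high)

# The euro sign (code 0x20AC) is outside 0-255, so ~0-255 would accept it.
euro_accepted = matches_inverted(0x20AC, 0, 255)
# The letter 'a' (code 97) is inside 0-255, so ~0-255 would reject it.
letter_accepted = matches_inverted(ord('a'), 0, 255)
```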

In scanner rules, when a literal, a range or a set is matched, the consumed character is appended to the text record, which holds the text of the token when it is returned via scanner::nextToken(). Sometimes a certain character should not appear in the returned token, such as the quotation marks around a string. The skip operator (!) prevents the consumed character from being appended to the text record, as shown in this example:

CHAR:	!"'" ~"'" !"'"
	;
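
The effect of the skip operator on the text record can be sketched as follows. This Python model of the CHAR rule is an illustration only and does not reflect the code JetPAG generates; scan_char_token and consume are hypothetical names:

```python
def scan_char_token(source):
    """Match !"'" ~"'" !"'" at the start of source.

    Returns the token text, or None if the rule does not match.
    """
    text = []   # the text record of the token being built
    pos = 0

    def consume(accepts, skip=False):
        nonlocal pos
        if pos < len(source) and accepts(source[pos]):
            if not skip:
                text.append(source[pos])   # normal match: keep the character
            pos += 1                       # skipped or not, input is consumed
            return True
        return False

    matched = (consume(lambda c: c == "'", skip=True)      # !"'"
               and consume(lambda c: c != "'")             # ~"'"
               and consume(lambda c: c == "'", skip=True)) # !"'"
    return "".join(text) if matched else None

token_text = scan_char_token("'x'")
```

Here token_text holds only the character between the quotation marks, because both quotation marks were consumed with the skip flag set.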

Symbolic names can be specified to save the result of an operation in a run-time variable. Literals are stored as integers in scanner rules and as token reference objects (type StreamToken) in parser rules. A symbolic name may also be given to a rule reference if the referenced rule has a return value. This example shows both uses: symbols a and b are integers and symbol D is a token reference object.

grammar G:
parser P:
int r
addition_opr:
	a@digit '+' b@digit
	$$ r = a + b;
	;
int r
digit:	D@DGT  $$ r = D->text[0] - '0';
	;
scanner L:
DGT:	'0'-'9'
	;
PLUS:	'+'
	;

Quantifiers mark pieces of a definition that should be repeated within given repetition boundaries. Quantifiers are specified within square brackets. A quantity is written as the minimum and maximum repetition boundaries separated by a comma; both are optional. If the minimum boundary is omitted it is set to zero, and if the maximum boundary is omitted it is set to infinity. A quantifier can also specify a fixed number of repetitions by giving a single boundary. This example shows all types of quantifiers:

A[2,6]
A[,6]
A[2,]
A[,]
A[4]

Some special cases have built-in shortcuts:

A?	=	A[,1]
A*	=	A[,]
A+	=	A[1,]

Alternative blocks specify that one of several patterns should be matched at run time. Alternative patterns are separated by vertical bars (|); the vertical bar is a low-precedence operator. For example:

INT:	'0'-'9'+
|	"0x" ('a'-'f'|'A'-'F'|'0'-'9')+
	;

Here the rule INT matches either a decimal number or a hexadecimal number. Note that the inner alternative block that specifies hexadecimal digits is actually a character set.

Alternative blocks are a very important part of syntax definitions in recursive-descent recognizers, because the correct alternative pattern is selected early using lookahead. Sometimes, however, an alternative block has to be decided by factors other than unique lookahead, such as token texts or run-time parameters. For such purposes JetPAG offers semantic conditions: blocks of free text enclosed within {? and ?} that precede the alternative pattern and are copied as a boolean condition into the generated code. For example, this block uses the text of the token as a condition in a parser rule:

rule:	T@NAME
	(	{? texti::str_equ(T->text, "bag") ?}
		...
	|	{? texti::str_equ(T->text, "wallet") ?}
		...
	)
	;

Another type of custom condition is the predicate. Predicates are special patterns enclosed within (? and ) that are explicitly tested at run time, instead of zero-length lookaheads. If the whole predicate pattern matches successfully, the predicate succeeds and the predicate's alternative pattern is executed. For example:

rule:	(? 'a' 'b'+ 'c' ) 'a'
|	'a' 'b'
	;
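
Both alternatives of this rule start with 'a', so the first character alone cannot choose between them. The Python sketch below models how the predicate might decide: it tests the whole predicate pattern without consuming input and only then commits to an alternative. It is an illustration under that assumption, not JetPAG's generated code; the function names are hypothetical:

```python
def predicate_matches(source, pos):
    """Return True if 'a' 'b'+ 'c' matches at pos. No input is consumed."""
    if pos >= len(source) or source[pos] != 'a':
        return False
    pos += 1
    first_b = pos
    while pos < len(source) and source[pos] == 'b':
        pos += 1
    if pos == first_b:
        return False                      # 'b'+ needs at least one 'b'
    return pos < len(source) and source[pos] == 'c'

def match_rule(source, pos=0):
    """Return (alternative number, position after the match) or None."""
    if predicate_matches(source, pos):
        return (1, pos + 1)               # first alternative consumes 'a'
    if source.startswith("ab", pos):
        return (2, pos + 2)               # second alternative consumes 'a' 'b'
    return None

with_predicate = match_rule("abbc")
without_predicate = match_rule("ab")
```

For the input "abbc" the predicate succeeds and the first alternative consumes only the 'a'; for "ab" the predicate fails and the second alternative is taken.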

Alternative patterns of different types can be mixed together, but the code for them is generated in this order:

  1. Normal lookahead-driven alternatives.
  2. Predicate-pattern-driven alternatives.
  3. Semantic-condition-driven alternatives.
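
The ordering above can be pictured as a stable sort of the alternatives by kind. In the Python sketch below, the kind labels, the tuple layout and the assumption that alternatives of the same kind keep their written order are all hypothetical and chosen only for illustration:

```python
# Generation priority of each kind of alternative, per the list above.
ORDER = {"lookahead": 0, "predicate": 1, "semantic": 2}

def generation_order(alternatives):
    """Sort (kind, label) pairs by kind; sorted() keeps ties in input order."""
    return sorted(alternatives, key=lambda alternative: ORDER[alternative[0]])

written = [
    ("semantic", "s1"),
    ("lookahead", "l1"),
    ("predicate", "p1"),
    ("lookahead", "l2"),
]
generated = generation_order(written)
```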