Scanners

Scanner specifications are essential in any JetPAG grammar. Scanners scan the input stream of the program, usually a character stream, and split it into tokens. A token is a small piece of text split from the input stream. Each token has a type ID (called a TTID, Token Type ID) that identifies the token and usually tells the parser and token stream manager what kind of information it holds (an operator, a punctuation mark, a literal, ...) so they can determine whether it is allowed in the surrounding context (the basic operation of any recognizer is to match input against a grammar).

Scanners operate on STL streams (std::basic_istream) of characters of the type specified by the grammar option char_type. The default character type is char (on most 32-bit environments, an 8-bit signed integer type, one byte in size). Internally, however, JetPAG uses plain integers (int) in buffers and other functions, both to keep the scanner and parser interfaces similar and to allow representing special states such as end-of-stream.
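A grammar can select a different character type; for example, a scanner reading wide-character input might override char_type in its declaration. The following is only an illustrative sketch: it assumes the option can be set directly on the scanner declaration, using the < name = value > option form that rule options use later in this section, and that wchar_t is an accepted value:

scanner WideScanner < char_type = wchar_t >:
  LP: '(';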

Token streams fed by scanners resemble linked lists: each token points to the next one. At the end of the token stream the scanner places a special token called EndOfStream, which indicates the end of the scanned input. Tokens are obtained from the scanner by invoking the public member function nextToken(), which dispatches to the appropriate rule.

Scanner grammars are defined as follows:

scanner scanner-name base-types? options? :
  rule-definitions
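For instance, filling in this skeleton with two of the rules used as examples below gives a small but complete-looking scanner (only an illustrative sketch; the individual rule forms are explained in the list that follows):

scanner MyScanner:
  LP: '(';

  skip COMMENT:
    '#' ~('\r'|'\n')*
    ;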

Scanner grammars may contain a wide range of rules. Note that not all rules yield a token in the token stream fed by the scanner, and not all rules acquire a TTID.

  • Abstract rules acquire TTIDs, but they neither feed tokens to the stream nor are selected and invoked by nextToken(). These rules define no grammar at all; they exist only so that their TTIDs can be reused by normal rules or test tables, or for special cases in actions via embedded source. An abstract rule may only be defined with options. Multiple abstract rules may be defined in one declaration by separating them with commas:

    abstract
      AbsRule1,
      AbsRule2;
  • Normal rules acquire TTIDs and feed tokens to the stream. These rules are selected and invoked through nextToken(). An example normal scanner rule:

    LP: '(';

    Normal rules may be defined with return values, optional arguments and exception handlers. Note that normal rules may take only optional arguments, not normal arguments (so that they can be called freely from nextToken()). A normal rule may also acquire the TTID of another normal or abstract rule by following the rule's name with an equal sign and the name of the TTID being acquired. Such rules do not acquire a TTID of their own. This example demonstrates a typical TTID acquisition:

    abstract AbsRule;
    
    MyRule = AbsRule:
      ...
      ;
  • Hidden rules neither acquire TTIDs nor feed tokens to the stream, and they are not invoked through nextToken(). They are useful when a piece of grammar is used more than once and inlining it is not desired. A good example is identifiers:

    hidden IDEN:
      ( 'a'-'z' | 'A'-'Z' )+
      ;

    Hidden rules may be defined with return values, normal and optional arguments, and exception handlers. (A sketch of a hidden rule shared by two normal rules appears at the end of this section.)

  • Skipped rules neither acquire TTIDs nor feed tokens to the stream, but they are invoked through nextToken(); when such a rule finishes, nextToken() calls itself again to proceed with the next token. Skipped rules are useful for hiding input such as comments and whitespace from the token stream fed by the scanner. This example skips Python comments:

    skip COMMENT:
      '#' ~('\r'|'\n')*
      ;
  • Test tables are special rules that have nothing to do with token streams. They define key-value pairs which can be used with normal rules to modify the TTID of the token returned by the rule based on its textual value. Test tables are typically used for determining whether an identifier is a keyword. Each entry in a test table yields either the TTID of a previously defined abstract rule, or the TTID of an inline-defined abstract rule created by preceding the rule's name with +. This is a compound example:

    IDEN < test_table = keywords >:
      ( 'a'-'z' | 'A'-'Z' )+
      ;
    
    abstract KwdConst;
    
    ttable keywords {
      "const"  = KwdConst,
      "new"    = +KwdNew
      }
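As mentioned in the description of hidden rules above, their main purpose is to share a grammar fragment among several rules. The sketch below assumes that a hidden rule can be referenced from another rule's body simply by its name; the rule names themselves are made up for illustration:

hidden LETTER:
  'a'-'z' | 'A'-'Z'
  ;

NAME:
  LETTER+
  ;

DIRECTIVE:
  '$' LETTER+
  ;

Here NAME and DIRECTIVE are ordinary normal rules invoked through nextToken(), while LETTER never appears in the token stream and only serves as a reusable fragment.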