Qaqao Lexics

1.2 Structural Lexics

Qaqao lexics is divided into two levels: structural lexics and semantic lexics. At the structural level, the scanner divides the character stream into four general classes of tokens: comments, directives, delimiters and identifiers. Whitespace separates these tokens but generates no tokens of its own. On the semantic level, identifiers are further classified according to their form. Both levels are specified using regular expressions. Implementations of Qaqao may use a single scanner for both lexical levels or may scan the structural level separately from the semantic level. This section deals with structural lexics.

1.2.1 Comments

In this version of Qaqao, comments are ignored by the compiler, so the scanner may discard them. In future versions of the language, the compiler may use some comments to ensure compatibility between documentation and code specification. For this reason, comments constitute a token class; the scanner should process them as tokens but should not insert them into the token stream.

Qaqao uses the C/C++/Java comment format. Comment tokens are defined structurally as follows:

Comment_token ::

The first type of comment has an explicit termination sequence, which is part of the comment; this type of comment may extend over several lines. The second type of comment terminates at the first end of line character: new_line, carriage_return, vertical_tab or form_feed; the terminating character is not part of the comment and so constitutes the beginning of a whitespace token.
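As an illustration of the two comment forms described above, the following Python sketch recognizes a comment at a given position. It assumes the usual C/C++/Java initial sequences ("/*" with terminator "*/", and "//"), since the production itself is not reproduced here; the names COMMENT_RE and scan_comment are hypothetical. It deliberately ignores the nesting rules and is not a definitive scanner.

```python
import re

# Assumed C-style forms (not taken from the Qaqao production itself):
#   /* ... */  explicitly terminated; the terminator is part of the comment
#   // ...     ends before new_line, carriage_return, vertical_tab or form_feed
COMMENT_RE = re.compile(
    r"/\*.*?\*/"           # explicitly terminated; may extend over several lines
    r"|//[^\n\r\v\f]*",    # line-terminated; the end-of-line character is excluded
    re.DOTALL,
)


def scan_comment(text, pos=0):
    """Return (comment_text, next_pos) if a comment starts at pos, else None."""
    match = COMMENT_RE.match(text, pos)
    if match is None:
        return None
    return match.group(0), match.end()
```

For a line-terminated comment, next_pos points at the end-of-line character, so that character is left to begin the following whitespace.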

Comments do not nest arbitrarily. An explicitly terminated comment (/* … */) may enclose one or more line-terminated comments, but the enclosing comment must terminate at or after the end of the last line-terminated comment. Similarly, line-terminated comments may enclose an explicitly terminated comment, but in this case, every portion of the enclosed comment must be covered by the initial sequence of a line-terminated comment.

IDEs should facilitate well-formed comments by allowing users to select a section of code to enclose in a comment and to select a comment to restore to code. If a user selects a section of code to enclose in a comment and that code already contains explicitly terminated comments, the initial and terminal sequences of the enclosed comments are modified as follows: “/*” becomes “/1*”; “*/” becomes “*1/”; “/1*” becomes “/2*”; “*1/” becomes “*2/”; in general, “/n*” becomes “/(n+1)*” and “*n/” becomes “*(n+1)/”. When an enclosing comment is removed, the index on each enclosed comment is decremented accordingly. The IDE need not support more than nine levels of nesting in this manner.
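The index adjustment an IDE performs might be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that every marker-shaped sequence in the selection really is a comment delimiter; it does not check that, so an expression such as x*2/y would be misread as containing a terminal sequence.

```python
import re

# Matches an initial sequence ("/*", "/1*", ...) or a terminal sequence
# ("*/", "*1/", ...). One digit suffices for the nine supported levels.
MARKER_RE = re.compile(r"/(\d?)\*|\*(\d?)/")


def shift_markers(code, delta):
    """Add delta to the nesting index of every comment marker in code."""
    def bump(match):
        is_opener = match.group(0).startswith("/")
        digits = match.group(1) if is_opener else match.group(2)
        level = int(digits or 0) + delta
        if not 0 <= level <= 9:
            raise ValueError("nesting index outside the supported range 0-9")
        index = str(level) if level else ""  # index 0 is written without a digit
        return f"/{index}*" if is_opener else f"*{index}/"
    return MARKER_RE.sub(bump, code)


def enclose_in_comment(code):
    """Wrap code in an explicitly terminated comment, re-indexing inner ones."""
    return "/*" + shift_markers(code, +1) + "*/"


def remove_enclosing_comment(comment):
    """Strip the outer /* ... */ and decrement the index of inner comments."""
    if not (comment.startswith("/*") and comment.endswith("*/")):
        raise ValueError("not an explicitly terminated comment")
    return shift_markers(comment[2:-2], -1)
```

Under these assumptions, enclosing "a /* b */ c" gives "/*a /1* b *1/ c*/", and removing the enclosing comment restores the original text.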

Comments that begin with the sequence “/**” are documentation comments. Qaqao documentation comments follow the format of Java documentation comments. ***See Appendix ### for a complete description of documentation comments


1.2.2 Directives

Directives are messages to various components of the compiler. They use a similar format to comments. Directive tokens are defined structurally as follows:

Directive_token ::

The first type of directive has an explicit termination sequence, which is part of the directive; this type of directive may be embedded in a line or extend over several lines. The second type of directive terminates at the first end of line character: new_line, carriage_return, vertical_tab or form_feed; the terminating character is not part of the directive and so constitutes the beginning of a whitespace token.

Directives do not nest. However, comments may be embedded in directives according to the rules for comments given above. Comments embedded in directives do not generate a separate comment token.

Directives may pertain to different parts of the compiler; a directive may indicate its target by appending the appropriate name (scanner, parser, etc.) to the initial sequence (e.g. ##scanner). If the directive does not indicate a target and the current component cannot interpret the directive, the directive passes to the next stage of the compiler. ***Complete information about compiler directives is in Appendix ###.
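The routing rule can be sketched in Python as follows. The sketch takes the target as already extracted from the directive, because the directive production is not reproduced here. The function name, the stage list and the can_interpret callback are hypothetical, and passing a targeted directive through stages other than its target is an assumption rather than something this section states.

```python
def route_directive(target, stages, can_interpret):
    """Return the name of the stage that consumes a directive, or None.

    target        -- stage name appended to the initial sequence, or None
    stages        -- stage names in pipeline order (hypothetical)
    can_interpret -- callable(stage) -> bool, consulted for untargeted directives
    """
    for stage in stages:
        if target is not None:
            # Assumption: a targeted directive is consumed only by its target.
            if stage == target:
                return stage
        elif can_interpret(stage):
            return stage
        # Otherwise the directive passes to the next stage of the compiler.
    return None
```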


1.2.3 Delimiters

There are nine types of delimiters, corresponding to the nine delimiter characters, defined as follows:

Delimiter_token ::
Left_brace_token | Right_brace_token | Semicolon_token | Left_parenthesis_token | Right_parenthesis_token | Comma_token | Left_bracket_token | Right_bracket_token | Colon_token
Left_brace_token ::
{
Right_brace_token ::
}
Semicolon_token ::
;
Left_parenthesis_token ::
(
Right_parenthesis_token ::
)
Comma_token ::
,
Left_bracket_token ::
[
Right_bracket_token ::
]
Colon_token ::
:

The specific delimiter is significant in the structural syntax of the language. The actual generating character is implicit in the type of the token.
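One way to make the generating character implicit in the token type is an enumeration with one member per delimiter, sketched here in Python. The names Delimiter and delimiter_token are hypothetical; only the nine characters come from the productions above.

```python
from enum import Enum


class Delimiter(Enum):
    """One token type per delimiter character."""
    LEFT_BRACE = "{"
    RIGHT_BRACE = "}"
    SEMICOLON = ";"
    LEFT_PARENTHESIS = "("
    RIGHT_PARENTHESIS = ")"
    COMMA = ","
    LEFT_BRACKET = "["
    RIGHT_BRACKET = "]"
    COLON = ":"


# Lookup table from generating character to token type.
_BY_CHARACTER = {member.value: member for member in Delimiter}


def delimiter_token(character):
    """Return the delimiter token type for character, or None if it is not one."""
    return _BY_CHARACTER.get(character)
```

A token stream built this way carries only the token type, and the character can be recovered from the type when needed.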


1.2.4 Identifiers

An identifier is a sequence of one or more non-special characters (including backslash escape sequences):

Identifier_token ::

The first form of Qaqao identifier is scanned using the longest-prefix convention. The second form allows the programmer to specify the start and end of an identifier without using escape sequences for spaces and delimiter characters within the identifier; this is a convenience feature for literals such as strings. Neither form of identifier may extend beyond a single line. Unlike many languages, Qaqao allows any character to be part of an identifier. The semantic lexics, Section 1.3, defines specific types of identifiers and specifies identifiers that have no legal meaning.
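Longest-prefix scanning of the first form might look like the following Python sketch. It assumes that the special characters are the nine delimiters, the backslash, and a whitespace set of space, tab and the four end-of-line characters; the authoritative whitespace list is in Section 1.1.1, and comment and directive initial sequences are not handled. The names IDENTIFIER_RE and scan_identifier are hypothetical, and the second form is not covered.

```python
import re

# Assumed special characters: the nine delimiters, backslash, and a
# whitespace set of space, tab, new_line, carriage_return, vertical_tab
# and form_feed. The class is spelled out rather than using \s because \s
# would also match the non-breaking space, which is not Qaqao whitespace.
IDENTIFIER_RE = re.compile(
    r"(?:\\[^\n\r\v\f]"                  # backslash escape, kept on one line
    r"|[^\\ \t\n\r\v\f{};(),\[\]:])+"    # any non-special character
)


def scan_identifier(text, pos=0):
    """Return (identifier_text, next_pos) for the longest identifier at pos."""
    match = IDENTIFIER_RE.match(text, pos)
    if match is None:
        return None
    return match.group(0), match.end()
```

Because the pattern is greedy, the scanner consumes characters until it reaches the first unescaped special character, which is the longest-prefix behavior described above.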


1.2.5 Whitespace

Whitespace separates tokens and enables the parser to distinguish separate components of tuples; it does not generate tokens on its own. Section 1.1.1 includes a complete list of whitespace characters. Note that some space characters, such as the non-breaking space and the em-space, are not considered to be whitespace characters.


Qaqao Language Definition, section 1.2, version 0.3α

© J. Andrew Holey, 2009

email: jholey@csbsju.edu