Qaqao lexics is divided into two levels: structural lexics and semantic lexics. At the structural level, the scanner divides the character stream into five general classes of tokens: comments, directives, whitespace, delimiters and identifiers. On the semantic level, identifiers are further classified according to their form. Both levels are specified using regular expressions. Implementations of Qaqao may use a single scanner for both lexical levels or may scan the structural level separately from the semantic level. This section deals with semantic lexics, which applies only to identifier tokens.
There are two general semantic classes of identifiers: type literals and general names. Type literals are names of objects that directly convey the state of the object through the structure of the name; all other identifiers are general names.
Type literals convey the state of the corresponding object through their name. Thus, the name itself is a representation of an actual object's state, such that it is possible to compute the state from the name and to compute the name (or a closely similar name) from the state. For example, the decimal Integer literals 0, 1 and −28 convey the value of the corresponding Integer object, and a String corresponding to the names can easily be computed from the objects. Using a literal name in a program causes the corresponding object to be created if it does not already exist. If the object is already instantiated, the use of the name references the object.
The Qaqao language includes both integer and floating-point literals as well as bit, bit vector, character and string literals. It is also possible to design literals for other types. In cases where multiple literals represent the same object, there must be a canonical form that can be derived from the alternate forms. The compiler and runtime system use the canonical form for lookup in the symbol table. The semantic scanner must therefore convert identifiers that match a literal to their canonical form and replace the Identifier_token with the appropriate Literal_token (type Type).
Note that named constants, such as Boolean:TRUE and Boolean:FALSE, are not literals in the sense that their names are not directly related to the representation of their underlying values.
Integer literals may be expressed in decimal, hexadecimal or binary; octal is not supported.
Decimal integer literals consist of an optional sign followed by at least one decimal digit. Single underscore characters (‘_’) may be placed between digits to aid readability. Qaqao style suggests placing underscores between every three decimal digits from the rightmost digit. The formal definition is:
Examples include 0, 25, −37, 1_238 and −3_000_000. Underscores do not need to conform to Qaqao style or to be placed at regular intervals (1_23_456_7890 is a valid integer literal), but such usage would not improve readability in most cases and so should be avoided. Leading, multiple and trailing underscores are not permitted. Thus, _1, +_2, 3__4 and −6_ are not valid integer literals. Decimal integer literals without underscores, such as 1238 and −3000000, are permitted, but their use is discouraged, particularly in larger magnitude numbers, as a matter of readability.
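The rules above can be sketched as a regular expression. This is an illustrative Python sketch, not the formal definition: the names DECIMAL_INTEGER and decimal_value are hypothetical, and accepting both the ASCII hyphen and the typographic minus sign (U+2212) as the negative sign is an assumption.

```python
import re

# Assumed pattern: optional sign, then groups of digits separated by
# single underscores. Leading, doubled and trailing underscores cannot match.
DECIMAL_INTEGER = re.compile(r"[+\-\u2212]?[0-9]+(?:_[0-9]+)*")

def decimal_value(text):
    """Return the int denoted by a decimal integer literal, or None if
    text does not match the assumed pattern."""
    if DECIMAL_INTEGER.fullmatch(text) is None:
        return None
    negative = text[0] in "-\u2212"
    digits = text.lstrip("+-\u2212").replace("_", "")
    return -int(digits) if negative else int(digits)
```

Under these assumptions, decimal_value("1_23_456_7890") yields 1234567890, while "_1", "+_2", "3__4" and "6_" are rejected.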
Hexadecimal integer literals consist of an optional sign followed by the sequence “0x” and at least one hexadecimal digit or the unsigned sequence “1x” and at least one hexadecimal digit. Single underscore characters are also allowed in hexadecimal integer literals. Qaqao style recommends placing underscores every two or four hexadecimal digits from the rightmost digit; it also recommends uppercase for the last six hexadecimal digits. The formal definition is:
Examples include 0x0, 0x1A, −0x25, 0x20_00 and 1xBFFF_FFFF. Again, underscore style is up to the user, but underscores are permitted only between digits and only one at a time. The first two forms are treated as sign-magnitude representations in the same way as the corresponding decimal forms. This representation differs from C or Java, where a bit pattern such as 0xFFFF or 0xFFFFFFFF can denote −1 when held in a signed 16-bit or 32-bit integer; in Qaqao, 0xFFFF is 65_535 in decimal and 0xFFFF_FFFF is 4_294_967_295 in decimal. −0x1 is the corresponding hexadecimal form for −1.
The form beginning with “1x” is a negative two's-complement form, corresponding exactly to the underlying two's-complement representation. Thus, 1xF represents the same value as −0x1 or −1; 1xBFFF_FFFF is the same as −0x4000_0001 or −1073741825; 1xF, 1xFF and 1xFFFF_FFFF all represent the same value, −1. The initial ‘0’ or ‘1’ is essentially the binary sign-extension mark and does not ordinarily add bits to the underlying representation. When the initial “1x” is followed by a hexadecimal digit with an opposing initial bit (one of [0-7]), it necessarily adds at least one bit to the underlying representation. For example, 1x4000 represents the same value as −0xC000 or −49152, which is not representable as a sixteen-bit two's-complement integer; 1x0000_0000 is the same as −0x1_0000_0000 or −4294967296. In general, sequences beginning with “0x” or “−0x” should continue with one of the digits [0-7] and sequences beginning with “1x” should continue with one of the digits [8, 9, A-F], but this recommendation is not enforced.
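The two hexadecimal readings can be illustrated with a short Python sketch. The names HEX_INTEGER and hex_value are hypothetical, the pattern is an assumption rather than the formal definition, and both the ASCII hyphen and U+2212 are accepted as the minus sign for convenience. For the “1x” form, the value is the written digits minus 16 raised to the number of digits, which is what extending the sign bit to the left amounts to.

```python
import re

# Assumed pattern: an optionally signed "0x" prefix or an unsigned "1x"
# prefix, then hexadecimal digit groups separated by single underscores.
HEX_INTEGER = re.compile(r"([+\-\u2212]?0|1)x([0-9A-Fa-f]+(?:_[0-9A-Fa-f]+)*)")

def hex_value(text):
    """Return the int denoted by a hexadecimal integer literal, or None."""
    match = HEX_INTEGER.fullmatch(text)
    if match is None:
        return None
    prefix, body = match.groups()
    digits = body.replace("_", "")
    magnitude = int(digits, 16)
    if prefix == "1":
        # Two's complement: the leading 1 is the sign extension of the digits.
        return magnitude - (1 << (4 * len(digits)))
    # Sign-magnitude: the digits give the magnitude, the sign is separate.
    return -magnitude if prefix[0] in "-\u2212" else magnitude
```

With this reading, hex_value("1xF") and hex_value("-0x1") both give −1, and hex_value("1x4000") gives −49152.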
Binary integer literals consist of an optional sign followed by the sequence “0b” and at least one binary digit or the sequence “1b” and at least one binary digit. Single underscore characters are also allowed in binary integer literals. Qaqao style recommends placing underscores every four binary digits from the rightmost digit. The formal definition is:
Examples include 0b0, 0b10, −0b101, 0b100_0000 and 1b000_1111, which correspond to the decimal literals 0, 2, −5, 64 and −113, respectively. Again, underscore style is up to the user, but underscores are permitted only between digits and only one at a time. The first two forms are treated as sign-magnitude representations in the same way as the corresponding decimal forms. The form beginning with “1b” is a negative two's-complement form, corresponding bit-by-bit to the underlying two's-complement representation. Note that in the absence of a sign, the initial binary digit before the ‘b’ represents the sign-extension; it is also treated as the first required bit of the representation. Ordinarily then, the initial digit should be included in the count of binary digits in the literal.
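As a check on the correspondences listed above, the following Python sketch evaluates binary integer literals. The names BINARY_INTEGER and binary_value are hypothetical and the pattern is an assumption, not the formal definition; the bit literals 0b and 1b, which have no digits after the ‘b’, are deliberately not matched.

```python
import re

# Assumed pattern: an optionally signed "0b" prefix or an unsigned "1b"
# prefix, then binary digit groups separated by single underscores.
BINARY_INTEGER = re.compile(r"([+\-\u2212]?0|1)b([01]+(?:_[01]+)*)")

def binary_value(text):
    """Return the int denoted by a binary integer literal, or None."""
    match = BINARY_INTEGER.fullmatch(text)
    if match is None:
        return None
    prefix, body = match.groups()
    bits = body.replace("_", "")
    magnitude = int(bits, 2)
    if prefix == "1":
        # The leading 1 is the sign bit placed just left of the written bits.
        return magnitude - (1 << len(bits))
    return -magnitude if prefix[0] in "-\u2212" else magnitude
```

For 1b000_1111 the seven written bits have the value 15, and subtracting 2 to the power 7 gives −113, matching the eight-bit pattern 1000_1111.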
(*** This paragraph is subject to change. ***) The canonical form for integer literals is the unsigned hexadecimal form for non-negative integers and the signed form for negative integers. In both cases, leading zeros and underscores are removed. For example, 0x0 (0), 0xFFFF (65535), and −0x1000 (−4096) are all integer literals in canonical form. (*** The following sentence is subject to change. ***) The scanner returns a canonical Integer_literal_token for any identifier matching one of the preceding forms.
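A minimal Python sketch of this canonicalization, given an already computed integer value, follows. The function name canonical_integer is hypothetical; uppercase hexadecimal digits and an ASCII hyphen for the sign are assumptions drawn from the examples and style notes, and the rule itself is marked above as subject to change.

```python
def canonical_integer(value):
    """Format an int as unsigned hexadecimal when non-negative and as
    signed hexadecimal when negative, with no leading zeros or underscores."""
    digits = format(abs(value), "X")  # uppercase digits, no padding
    sign = "-" if value < 0 else ""
    return sign + "0x" + digits
```

Under these assumptions, the values 0, 65535 and −4096 are formatted as 0x0, 0xFFFF and -0x1000.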
There are two bit literals, 0b and 1b, which are defined as follows:
The bit literals correspond to the initial sequences of unsigned binary integer literals, but there are no characters following the ‘b’.
Bit-vector literals consist of two or more binary digits, optionally interspersed with underscores, followed by the lowercase letter ‘b’. As with binary integer literals, Qaqao style prefers placing underscores every fourth digit from the right.
Examples of bit-vector literals are 01b, 1000b, 0000_1111b and 100_1011_1100_1110b, which contain two, four, eight and fifteen bits, respectively. In bit-vector literals, all digits are significant, and the length of the vector is equal to the number of digits. The two bit literals are used to express a one-digit bit-vector literal.
The Bit type has two values corresponding to its two literals, and so there are no non-canonical bit literals. The canonical form of bit-vector literals is the form with no underscore characters, and the scanner should construct a Bit_vector_literal accordingly.
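The distinction between bit and bit-vector literals, and the canonical form of the latter, can be sketched in Python as follows. The names BIT, BIT_VECTOR and classify_bits are hypothetical, and the patterns are assumptions rather than the formal definitions.

```python
import re

BIT = re.compile(r"[01]b")                       # exactly one digit
BIT_VECTOR = re.compile(r"[01]+(?:_[01]+)*b")    # digits, single underscores

def classify_bits(text):
    """Return (kind, canonical_text), or None if text is neither a bit
    literal nor a bit-vector literal under the assumed patterns."""
    if BIT.fullmatch(text):
        return ("bit", text)  # bit literals are always canonical
    if BIT_VECTOR.fullmatch(text):
        # Canonical form drops underscores; every digit stays significant.
        return ("bit_vector", text.replace("_", ""))
    return None
```

The length of a vector is the number of digits in its canonical form, that is, the length of the canonical text minus one for the trailing ‘b’.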
Floating-point literals consist of a decimal integer literal followed by a period (decimal point), at least one more decimal digit and an optional exponent part. The exponent part consists of the letter ‘e’ or ‘E’ followed by an optional sign and at least one decimal digit. Single underscores are permitted between digits, except in the exponent part. It is preferred Qaqao style to place underscores every three digits from the decimal point and to use an uppercase ‘E’ in the exponent part.
For example, 0.0, 7.825_005, −1_395.26, 1.25E6 and 2.625_000E−15 are all floating-point literals. Unlike C and Java, Qaqao requires a decimal point with at least one digit both before and after it in all such literals; only the exponent part is optional.
The canonical form of floating-point literals is in normal form with an exponent part and no underscores; that is, there is a single non-zero digit to the left of the decimal point, there is at least one digit to the right of the decimal point with no trailing zeros (unless a zero is the only digit to the right of the point), the exponent part begins with an uppercase ‘E’ and there are no leading zeros in the exponent (except that a single zero represents the zero exponent). The only exception is that the canonical form of zero is 0.0E0. The canonical forms of the floating-point literals in the preceding paragraph are: 0.0E0, 7.825005E0, −1.39526E3, 1.25E6 and 2.625E−15. The scanner converts all floating-point literals to canonical form when constructing floating-point literal tokens.
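The normalization can be done purely on the digit strings, which avoids any binary floating-point rounding. The Python sketch below is illustrative only: the names FLOAT and canonical_float are hypothetical, the pattern is an assumption rather than the formal definition, and ASCII hyphens are used for negative signs in the output.

```python
import re

# Assumed pattern: sign, whole part, '.', fraction, optional exponent.
# Underscores are allowed between digits except in the exponent.
FLOAT = re.compile(
    r"([+\-\u2212]?)([0-9]+(?:_[0-9]+)*)\.([0-9]+(?:_[0-9]+)*)"
    r"(?:[eE]([+\-\u2212]?)([0-9]+))?"
)

def canonical_float(text):
    """Return the canonical text of a floating-point literal, or None."""
    match = FLOAT.fullmatch(text)
    if match is None:
        return None
    sign, whole, frac, exp_sign, exp_digits = match.groups()
    whole = whole.replace("_", "")
    frac = frac.replace("_", "")
    exponent = int(exp_digits) if exp_digits else 0
    if exp_sign in ("-", "\u2212"):
        exponent = -exponent
    all_digits = whole + frac
    digits = all_digits.lstrip("0")
    if not digits:
        return "0.0E0"  # zero has its own canonical form
    # Shift the point so exactly one non-zero digit precedes it.
    leading_zeros = len(all_digits) - len(digits)
    exponent += len(whole) - 1 - leading_zeros
    digits = digits.rstrip("0")
    mantissa = digits[0] + "." + (digits[1:] or "0")
    minus = "-" if sign in ("-", "\u2212") else ""
    return f"{minus}{mantissa}E{exponent}"
```

Applied to the five example literals, this sketch reproduces the canonical forms listed above, apart from the choice of minus character.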
Character literals consist of a left single quote character, a Character identification sequence and a right single quote character. Character identification sequences include single printing characters as well as most of the special sequences described in Section 1.1. Substitution sequences are not allowed in Character literals (or in String literals).
‘A’, ‘8’ and ‘→’ are examples of Character literals written with non-special characters; ‘&u0041’, ‘&u0038’ and ‘&u2192’ are examples of the same three characters written using a Unicode definition sequence. A Character literal written with a named character entity (rendered here as ‘ℵ’) corresponds to ‘&u2135’; ‘\f’ is written with a non-printing escape sequence and corresponds to ‘&u000C’, which is the form feed (new page) character; ‘\}’ is written with a normalizing escape sequence and corresponds to ‘&u007D’, which is the right brace character.
The canonical form of Character literals for non-special characters is the form written with the non-special character itself. For whitespace characters, the canonical form is the form written with a non-printing escape sequence, and for delimiter and escape characters, the canonical form is the form written with a normalizing escape sequence. (Design note***: At present, non-special, non-printing characters have the same canonical form as other non-special characters; we may need to find an alternative form for them.)
Qaqao Language Definition, section 1.3, version 0.3α
© J. Andrew Holey, 2009