Qaqao lexics is divided into two levels: structural lexics and semantic lexics. At the structural level, the scanner divides the character stream into five general classes of tokens: comments, directives, whitespace, delimiters and identifiers. On the semantic level, identifiers are further classified according to their form. Both levels are specified using regular expressions. Implementations of Qaqao may use a single scanner for both lexical levels or may scan the structural level separately from the semantic level. This section deals with semantic lexics, which applies only to identifier tokens.

There are two general semantic classes of identifiers: type literals and general names. Type literals are names of objects that directly convey the state of the object through the structure of the name; all other identifiers are general names.

Type literals convey the state of the corresponding object through their
name.
Thus, the name itself is a representation of an actual object's state,
such that it is possible to compute the state from the name and to
compute the name (or a closely similar name) from the state.
For example, the decimal Integer literals *0*, *1* and
−*28* convey the value of the corresponding Integer object,
and a String corresponding to the names can easily be computed from the
objects.
Using a literal name in a program causes the corresponding object
to be created if it does not already exist.
If the object is already instantiated, the use of the name references
the object.

The Qaqao language includes both integer and floating-point literals
as well as bit, bit vector, character and string literals.
It is also possible to design literals for other types.
In cases where multiple literals represent the same object, there must
be a canonical form that can be derived from the alternate forms.
The compiler and runtime system use the canonical form for lookup in
the symbol table.
The semantic scanner must therefore convert identifiers that match
a literal to their canonical form and replace the *Identifier_token*
with the appropriate *Literal_token* (*type Type*).

Note that named constants, such as Boolean:TRUE and Boolean:FALSE, are not literals in the sense that their names are not directly related to the representation of their underlying values.

Integer literals may be expressed in decimal, hexadecimal or binary; octal is not supported.

Decimal integer literals consist of an optional sign followed by at least one decimal digit. Single underscore characters (‘_’) may be placed between digits to aid readability. Qaqao style suggests placing underscores between every three decimal digits from the rightmost digit. The formal definition is:

*Decimal_integer_literal* ::- (*decimal_digit* | ‘**−**’ *decimal_digit*) (‘**_**’? *decimal_digit*)*

*decimal_digit* ::- [‘**0**’-‘**9**’]

Examples include *0*, *25*, −*37*,
*1_238* and −*3_000_000*.
Underscores do not need to conform to Qaqao style or to be placed at
regular intervals—
*1_23_456_7890* is a valid integer literal—but such usage
would not improve readability in most cases and so should be avoided.
Leading, multiple and trailing underscores are not permitted.
Thus, *_1*, +*_2*, *3__4* and −*6_* are not
valid integer literals.
Decimal integer literals without underscores, such as *1238* and
−*3000000*, are permitted, but their use is discouraged,
particularly in larger magnitude numbers, as a matter of readability.
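
The decimal rule can be checked mechanically. The following Python sketch is illustrative only and is not part of the Qaqao definition: the names `DECIMAL_INTEGER`, `is_decimal_integer_literal` and `decimal_integer_value` are hypothetical, and accepting the ASCII hyphen as an alternative to the minus sign ‘−’ is an assumption.

```python
import re

MINUS = "\u2212"  # the minus sign as printed in this section

# Assumption: an ASCII hyphen-minus is also accepted as the sign.
# Pattern: optional sign, one digit, then digits each optionally
# preceded by a single underscore.
DECIMAL_INTEGER = re.compile("[" + MINUS + r"\-]?[0-9](?:_?[0-9])*")


def is_decimal_integer_literal(identifier: str) -> bool:
    """Return True if the whole identifier matches the decimal rule."""
    return DECIMAL_INTEGER.fullmatch(identifier) is not None


def decimal_integer_value(identifier: str) -> int:
    """Compute the Integer value named by a decimal integer literal."""
    if not is_decimal_integer_literal(identifier):
        raise ValueError(f"not a decimal integer literal: {identifier!r}")
    negative = identifier[0] in (MINUS, "-")
    digits = identifier.lstrip(MINUS + "-").replace("_", "")
    return -int(digits) if negative else int(digits)
```

Under these assumptions, `1_23_456_7890` is accepted, while `_1`, `3__4` and `−6_` are rejected because an underscore may only appear singly between two digits.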

Hexadecimal integer literals consist of an optional sign followed by the sequence “0x” and at least one hexadecimal digit or the unsigned sequence “1x” and at least one hexadecimal digit. Single underscore characters are also allowed in hexadecimal integer literals. Qaqao style recommends placing underscores every two or four hexadecimal digits from the rightmost digit; it also recommends uppercase for the last six hexadecimal digits (A through F). The formal definition is:

*Hexadecimal_integer_literal* ::- (“**0x**” | “**−0x**” | “**1x**”) *hexadecimal_digit* (‘**_**’? *hexadecimal_digit*)*

*hexadecimal_digit* ::- [‘**0**’-‘**9**’ ‘**a**’-‘**f**’ ‘**A**’-‘**F**’]

Examples include *0x0*, *0x1A*, −*0x25*,
*0x20_00* and *1xBFFF_FFFF*.
Again, underscore style is up to the user, but underscores are permitted
only between digits and only one at a time.
The first two forms are treated as sign-magnitude representations
in the same way as the corresponding decimal forms.
This representation differs from C and Java, where *0xFFFF* and
*0xFFFFFFFF* typically become −*1* when stored in sixteen-bit and
thirty-two-bit signed integers, respectively;
in Qaqao, *0xFFFF* is *65_535* in decimal and
*0xFFFF_FFFF* is *4_294_967_295* in decimal.
−*0x1* is the corresponding hexadecimal form for −*1*.

The form beginning with “1x” is a negative
two's-complement form, corresponding exactly to the underlying
two's-complement representation.
Thus, *1xF* represents the same value as −*0x1*
or −*1*;
*1xBFFF_FFFF* is the same as −*0x4000_0001*
or −*1073741825*;
*1xF*, *1xFF* and *1xFFFF_FFFF* all represent the
same value as −*1*.
The initial ‘0’ or ‘1’ is essentially the
binary sign-extension mark and does not ordinarily add bits to the
underlying representation.
When the initial “1x” is followed by
a hexadecimal digit with an opposing initial
bit—[**0**-**7**]—it necessarily adds at least
one bit to the underlying representation.
For example, *1x4000* represents the same value as −*0xC000*
or −*49152*, which is not representable as a sixteen-bit
two's-complement integer;
*1x0000_0000* is the same as −*0x1_0000_0000* or
−*4294967296*.
In general, sequences beginning with “0x” or
“−0x” should continue with one of the digits
[**0**-**7**] and sequences beginning with “1x”
should continue with one of the digits [**8**, **9**,
**A**-**F**], but this recommendation is not enforced.
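
The three hexadecimal forms can be evaluated as follows. This is a minimal Python sketch, not part of the Qaqao definition: `HEX_INTEGER` and `hexadecimal_integer_value` are hypothetical names, and accepting the ASCII hyphen in place of ‘−’ is an assumption. The “1x” case treats every bit above the written digits as 1, so the value is the written digits minus 16 raised to the number of digits.

```python
import re

MINUS = "\u2212"  # the minus sign as printed in this section

# Assumption: an ASCII hyphen-minus is also accepted before "0x".
HEX_INTEGER = re.compile(
    "(?P<prefix>0x|[" + MINUS + r"\-]0x|1x)"
    r"(?P<digits>[0-9a-fA-F](?:_?[0-9a-fA-F])*)"
)


def hexadecimal_integer_value(identifier: str) -> int:
    """Compute the Integer value named by a hexadecimal integer literal."""
    match = HEX_INTEGER.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a hexadecimal integer literal: {identifier!r}")
    digits = match.group("digits").replace("_", "")
    magnitude = int(digits, 16)
    prefix = match.group("prefix")
    if prefix == "0x":
        return magnitude  # sign-magnitude, non-negative
    if prefix == "1x":
        # Two's complement with an implied run of 1 bits above the digits.
        return magnitude - 16 ** len(digits)
    return -magnitude  # sign-magnitude, negative


print(hexadecimal_integer_value("1xBFFF_FFFF"))  # prints -1073741825
```

With this reading, `1x4000` evaluates to −49152 and `1x0000_0000` to −4294967296, matching the examples above.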

Binary integer literals consist of an optional sign followed by the sequence “0b” and at least one binary digit or the sequence “1b” and at least one binary digit. Single underscore characters are also allowed in binary integer literals. Qaqao style recommends placing underscores every four binary digits from the rightmost digit. The formal definition is:

*Binary_integer_literal* ::- (“**0b**” | “**−0b**” | “**1b**”) *binary_digit* (‘**_**’? *binary_digit*)*

*binary_digit* ::- (‘**0**’ | ‘**1**’)

Examples include *0b0*, *0b10*, −*0b101*,
*0b100_0000* and *1b000_1111*,
which correspond to the decimal literals *0*, *2*,
−*5*, *64* and −*113*, respectively.
Again, underscore style is up to the user, but underscores are permitted
only between digits and only one at a time.
The first two forms are treated as sign-magnitude representations
in the same way as the corresponding decimal forms.
The form beginning with “1b” is a negative
two's-complement form, corresponding bit-by-bit to the underlying
two's-complement representation.
Note that in the absence of a sign, the initial binary digit before
the ‘b’ represents the sign-extension;
it is also treated as the first required bit of the representation.
Ordinarily then, the initial digit should be included in the count of
binary digits in the literal.
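
The bit-by-bit correspondence for the unsigned forms can be illustrated with a short Python sketch. It is not part of the Qaqao definition; `UNSIGNED_BINARY`, `twos_complement_bits` and `twos_complement_value` are hypothetical names, and the signed “−0b” form (sign-magnitude) is deliberately left out.

```python
import re

# Unsigned forms only: the digit before 'b' is the sign-extension bit.
UNSIGNED_BINARY = re.compile(r"(?P<sign_bit>[01])b(?P<digits>[01](?:_?[01])*)")


def twos_complement_bits(identifier: str) -> str:
    """Return the bit string, counting the digit before 'b' as the first bit."""
    match = UNSIGNED_BINARY.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not an unsigned binary integer literal: {identifier!r}")
    return match.group("sign_bit") + match.group("digits").replace("_", "")


def twos_complement_value(bits: str) -> int:
    """Interpret a bit string as a two's-complement integer of that width."""
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)  # subtract 2**width for a set sign bit
    return value


bits = twos_complement_bits("1b000_1111")
print(bits, twos_complement_value(bits))  # prints 10001111 -113
```

Counting the initial digit gives *1b000_1111* eight bits, which is why the underscore grouping in that example looks one digit short.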

(*** This paragraph is subject to change. ***)
The canonical form for integer literals is the unsigned hexadecimal form
for non-negative integers and the signed hexadecimal form for negative integers.
In both cases, leading zeros and underscores are removed.
For example, *0x0* (0), *0xFFFF* (65535),
and −*0x1000* (−4096)
are all integer literals in canonical form.
(*** The following sentence is subject to change. ***)
The scanner returns a canonical *Integer_literal_token* for any
identifier matching one of the preceding forms.
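
As a sketch of the canonical form described here (which is itself marked as subject to change), the following Python function renders an integer value canonically. The name `canonical_integer_literal` is hypothetical, and the use of uppercase hexadecimal digits is an assumption based on the examples and the style recommendation rather than a stated rule.

```python
MINUS = "\u2212"  # the minus sign as printed in this section


def canonical_integer_literal(value: int) -> str:
    """Render an integer as a hexadecimal literal in canonical form."""
    # format(..., "X") yields uppercase hexadecimal digits with no
    # leading zeros and no underscores; zero is rendered as "0".
    digits = format(abs(value), "X")
    sign = MINUS if value < 0 else ""
    return sign + "0x" + digits


print(canonical_integer_literal(65535))  # prints 0xFFFF
```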

There are two bit literals, *0b* and *1b*,
which are defined as follows:

*Bit_literal* ::- *binary_digit* ‘**b**’

The bit literals correspond to the initial sequences of unsigned binary integer literals, but there are no characters following the ‘b’.

Bit-vector literals consist of two or more binary digits, optionally interspersed with underscores, followed by the lowercase letter ‘b’. As with binary integer literals, Qaqao style prefers placing underscores every fourth digit from the right.

*Bit_vector_literal* ::- *binary_digit* (‘**_**’? *binary_digit*)+ ‘**b**’

Examples of bit-vector literals are *01b*, *1000b*,
*0000_1111b* and *100_1011_1100_1110b*,
which contain two, four, eight and fifteen bits, respectively.
In bit-vector literals, all digits are significant, and the length of
the vector is equal to the number of digits.
The two bit literals are used to express a one-digit bit-vector literal.

The Bit type has two values corresponding to its two literals, and so
there are no non-canonical bit literals.
The canonical form of bit-vector literals is the form with no underscore
characters, and the scanner should construct a *Bit_vector_literal*
accordingly.
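
A minimal Python sketch of bit-vector recognition and canonicalization follows. It is illustrative only; `BIT_VECTOR`, `canonical_bit_vector_literal` and `bit_vector_length` are hypothetical names.

```python
import re

# Two or more binary digits, single underscores allowed between digits,
# followed by a lowercase 'b'.
BIT_VECTOR = re.compile(r"([01](?:_?[01])+)b")


def canonical_bit_vector_literal(identifier: str) -> str:
    """Return the literal with all underscore characters removed."""
    match = BIT_VECTOR.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a bit-vector literal: {identifier!r}")
    return match.group(1).replace("_", "") + "b"


def bit_vector_length(identifier: str) -> int:
    """Every digit is significant, so the length is the digit count."""
    return len(canonical_bit_vector_literal(identifier)) - 1


print(canonical_bit_vector_literal("0000_1111b"))  # prints 00001111b
```

Because the pattern requires at least two digits, the bit literals *0b* and *1b* do not match it.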

Floating-point literals consist of a decimal integer literal followed by a period (decimal point), at least one more decimal digit and an optional exponent part. The exponent part consists of the letter ‘e’ or ‘E’ followed by an optional sign and at least one decimal digit. Single underscores are permitted between digits, except in the exponent part. It is preferred Qaqao style to place underscores every three digits from the decimal point and to use an uppercase ‘E’ in the exponent part.

*Floating_point_literal* ::- ‘**−**’? *decimal_digit* (‘**_**’? *decimal_digit*)* ‘**.**’ *decimal_digit* (‘**_**’? *decimal_digit*)* ((‘**e**’ | ‘**E**’) ‘**−**’? *decimal_digit*+)?

For example, *0.0*, *7.825_005*, −*1_395.26*,
*1.25E6* and *2.625_000E−15* are all floating-point
literals.
Unlike C and Java, Qaqao requires a decimal point with at least
one digit both before and after it in all such literals;
only the exponent part is optional.

The canonical form of floating-point literals is in normal form with an
exponent part and no underscores;
that is, there is a single non-zero digit to the left of the decimal point,
there is at least one digit to the right of the decimal point with no
trailing zeros (unless a zero is the only digit to the right of the point),
the exponent part begins with an uppercase ‘E’
and there are no leading zeros in the exponent (except that a single
zero represents the zero exponent).
The only exception is that the canonical form of zero is *0.0E0*.
The canonical forms of the floating-point literals in the preceding
paragraph are: *0.0E0*, *7.825005E0*, −*1.39526E3*,
*1.25E6* and *2.625E−15*.
The scanner converts all floating-point literals to canonical form when
constructing floating-point literal tokens.
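
The conversion to canonical form can be sketched in Python by working on the digit strings directly, which avoids binary rounding. This is an illustration under stated assumptions, not the definition's algorithm: `FLOATING_POINT` and `canonical_floating_point_literal` are hypothetical names, the exponent accepts an optional minus sign as the prose describes, the ASCII hyphen is accepted in place of ‘−’, and a negative zero is mapped to *0.0E0*.

```python
import re

MINUS = "\u2212"  # the minus sign as printed in this section
SIGN = "[" + MINUS + r"\-]?"  # assumption: ASCII hyphen also accepted
DIGITS = r"[0-9](?:_?[0-9])*"

FLOATING_POINT = re.compile(
    "(?P<sign>" + SIGN + ")"
    "(?P<whole>" + DIGITS + r")\.(?P<frac>" + DIGITS + ")"
    "(?:[eE](?P<exp_sign>" + SIGN + ")(?P<exp>[0-9]+))?"
)


def canonical_floating_point_literal(identifier: str) -> str:
    """Normalize a floating-point literal to d.dddE[−]n form."""
    match = FLOATING_POINT.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a floating-point literal: {identifier!r}")
    whole = match.group("whole").replace("_", "")
    frac = match.group("frac").replace("_", "")
    exponent = int(match.group("exp") or "0")
    if match.group("exp_sign"):
        exponent = -exponent

    digits = whole + frac
    significant = digits.lstrip("0")
    if not significant:
        return "0.0E0"  # the one exception: canonical zero
    # Move the point to just after the first non-zero digit.
    leading_zeros = len(digits) - len(significant)
    exponent += len(whole) - leading_zeros - 1
    fraction = significant[1:].rstrip("0") or "0"
    sign = MINUS if match.group("sign") else ""
    exp_sign = MINUS if exponent < 0 else ""
    return f"{sign}{significant[0]}.{fraction}E{exp_sign}{abs(exponent)}"
```

Applied to the five example literals, this sketch reproduces the canonical forms listed above.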

Character literals consist of a left single quote character, a Character identification sequence and a right single quote character. Character identification sequences include single printing characters as well as most of the special sequences described in Section 1.1. Substitution sequences are not allowed in Character literals (or in String literals).

*Character_literal* ::- ‘**‘**’ *character_identification_sequence* ‘**’**’

*character_identification_sequence* ::- *non-special_character* | *Unicode_definition_sequence* | *Named_character_entity* | *Non-printing_escape_sequence* | *Normalizing_escape_sequence*

‘A’, ‘8’ and ‘→’ are examples of
Character literals written with non-special characters;
‘&u0041’, ‘&u0038’ and ‘&u2192’
are examples of the same three characters written using a Unicode definition
sequence.
‘ℵ’ is written with a named character entity and
corresponds to ‘ℵ’ and ‘&u2135’;
‘\f’ is written with a non-printing escape sequence and
corresponds to ‘&u000C’ which is the *new page* character;
‘\}’ is written with a normalizing escape sequence and
corresponds to ‘&u007D’
which is the *right brace* character.

The canonical form of Character literals for non-special characters is the form written with the non-special character itself. For whitespace characters, the canonical form is the form written with a non-printing escape sequence, and for delimiter and escape characters, the canonical form is the form written with a normalizing escape sequence. (Design note***: At present, non-special, non-printing characters have the same canonical form as other non-special characters; we may need to find an alternative form for them.)

Qaqao Language Definition, section 1.3, version 0.3α

© J. Andrew Holey, 2009