Qaqao Lexics

1.1 Character Set

Qaqao programs may be written with the ASCII character set or with the full Unicode character set. All Qaqao run-time systems must process standard ASCII text files and Unicode text files encoded in the UTF-8 format; they may support other encodings that are appropriate for the system.

***Design note: In this release of Qaqao, the only language version is the standard English version.

1.1.1 Character categories

Qaqao divides characters into four general classes:

whitespace

The whitespace characters are:

Key        ASCII     Unicode  Description
Tab        9, 0x09   &u+0009  horizontal tab (HT)
Enter      10, 0x0A  &u+000A  new line (line feed, LF)
           11, 0x0B  &u+000B  vertical tab (VT)
           12, 0x0C  &u+000C  new page (form feed, FF)
Return     13, 0x0D  &u+000D  carriage return (CR)
Space_bar  32, 0x20  &u+0020  space (SP)
                     &u+0085  control character next line (NEL)
                     &u+200E  left-to-right mark
                     &u+200F  right-to-left mark
                     &u+2028  Unicode line separator
                     &u+2029  Unicode paragraph separator

The scanner will only recognize whitespace characters that are already part of the character stream. Escape sequences as noted above are converted to the corresponding whitespace character before being passed to the scanner. If the ‘&’ is replaced with a ‘\’ in any of the above sequences, conversion is postponed until after scanning, and the character is not treated as whitespace. (See Section 1.1.2 below.)

The Return or Enter key will generate a carriage_return character, a new_line character or both, depending on the operating system. The Qaqao system treats all standard combinations of these characters equally as a single end-of-line sequence; the scanner converts consecutive whitespace characters to a single Whitespace token.

For purposes of conformance with the Unicode standard, the above table is intended to specify exactly the standard pattern whitespace characters. Other spacing characters, such as the no-break space, the em space and the thin space, are intended as formatting characters and are not handled as whitespace by the scanner.
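
As a rough illustration of the whitespace rules above, the following Python sketch replaces each run of whitespace characters with a single placeholder token. It is not part of the Qaqao definition: the names QAQAO_WHITESPACE, WS and split_whitespace_runs are hypothetical, and escape sequences are ignored for simplicity.

```python
# Hypothetical sketch: the whitespace set is copied from the table above.
QAQAO_WHITESPACE = frozenset(
    "\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u200E\u200F\u2028\u2029"
)

WS = "<Whitespace>"  # placeholder standing in for the single Whitespace token


def split_whitespace_runs(text):
    """Split text into pieces, replacing each whitespace run with one WS."""
    pieces = []
    current = []
    in_run = False
    for ch in text:
        if ch in QAQAO_WHITESPACE:
            if current:
                pieces.append("".join(current))
                current = []
            if not in_run:
                pieces.append(WS)  # emit WS once per run, however long
                in_run = True
        else:
            current.append(ch)
            in_run = False
    if current:
        pieces.append("".join(current))
    return pieces
```

Under this sketch a CR LF pair followed by a tab produces one placeholder, while a no-break space stays inside the surrounding piece because it is not in the table.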


delimiter

The delimiter characters are:

Character Unicode Description
( &u+0028 left parenthesis
) &u+0029 right parenthesis
, &u+002C comma
: &u+003A colon
; &u+003B semicolon
[ &u+005B left bracket (left square bracket)
] &u+005D right bracket (right square bracket)
{ &u+007B left brace (left curly brace)
} &u+007D right brace (right curly brace)

The scanner will only recognize delimiter characters that are already part of the character stream. Escape sequences as noted above are converted to the corresponding delimiter character before being passed to the scanner. If the ‘&’ is replaced with a ‘\’ in one of the above sequences, conversion is postponed until after scanning, and the character is not treated as a delimiter. (See Section 1.1.2 below.)

For purposes of conformance with the Unicode standard, the above table notes exactly those characters that are pattern syntax characters in the structural scanning phase.


escape

The escape characters are:

Character Unicode Description
& \u+0026 ampersand (&)
\ \u+005C backslash

The ampersand escape is used to embed characters that should be converted before tokenization; the backslash escape is used to postpone conversion until after tokenization. In general, pre-tokenization conversion is preferable except for whitespace and delimiter characters.

Note: the escape characters are denoted by a single ampersand or backslash character when part of an escape sequence; the ampersand character itself is written as the escape sequence ‘\&’ or with the Unicode sequence above; the backslash character is written as the escape sequence ‘\\’ or with the Unicode sequence above. The escape sequence denoting the backslash character can be interpreted as an escape character followed by a backslash character, the initial escape character preventing the second backslash character from being interpreted as an escape character. (See Section 1.1.2 below.)

For purposes of conformance with the Unicode standard, the above table notes those characters that are used for quoting pattern whitespace and pattern syntax characters.


non-special

The non-special characters are the remaining Unicode characters. In addition, all the special class characters may be treated as non-special characters through the use of escape sequences described below. In this document the term non-special will refer only to those characters that are not members of the other three classes; the term any will refer to the non-special characters together with special characters escaped to non-special status.

For purposes of conformance with the Unicode standard, the non-special characters correspond to the category id-start. In general, Qaqao identifiers may begin with any non-special character; in the structural lexics (Section 1.3), there are particular categories of identifiers that may have restricted start characters.
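
The four character classes of this section can be summarized in a small Python sketch. This is only an illustration under the assumption that the three tables above are complete; the function name classify and the set names are hypothetical.

```python
# Hypothetical sketch of the four character classes in Section 1.1.1.
WHITESPACE = frozenset("\t\n\x0b\x0c\r \x85\u200e\u200f\u2028\u2029")
DELIMITERS = frozenset("(),:;[]{}")
ESCAPES = frozenset("&\\")


def classify(ch):
    """Return the class name of a single, unescaped character."""
    if ch in WHITESPACE:
        return "whitespace"
    if ch in DELIMITERS:
        return "delimiter"
    if ch in ESCAPES:
        return "escape"
    return "non-special"  # every remaining Unicode character
```

Note that formatting characters such as the no-break space fall through to the non-special class, as described in the whitespace paragraph.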


1.1.2 Special character sequences

There are several classes of special character sequences that make it possible to express any Unicode character using an ASCII text editor as well as to override the action of the Qaqao system to treat certain characters and sequences as special. The first three classes of special sequences are escape sequences that impart special meaning to the following character or characters; the fourth class is escape sequences that remove special attributes from the character following the escape character; the last class is for character substitutions.

Unicode definition sequences

Any Unicode character can be expressed using its hexadecimal code. There are four formats:

Editors and IDEs may convert those sequences beginning with the ampersand on the fly; the scanner will convert any of these sequences that remain before tokenization. Conversion of sequences beginning with the backslash character must be postponed until after scanning, although conversion to another escape sequence is allowed. For example, the sequence ‘\u+005C’, which is the backslash character, may be converted to the sequence ‘\\’ or vice versa before scanning; conversion of either sequence to the single character ‘\’ takes place after scanning. Programmers should take care when using these sequences in the context of other escape sequences. For example, in the sequence “\\u+004F”, the initial escape backslash sequence becomes a single backslash character after scanning and the remaining characters do not change, yielding the characters ‘\u+004F’, which read as an escape sequence but are not processed as one.
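
The pre-scanning conversion of ampersand sequences might be sketched in Python as follows. This is a minimal sketch under a loud assumption: only the form ‘&u+’ followed by exactly four hexadecimal digits, as used in the tables of Section 1.1.1, is handled; any other format is not covered. The function name convert_before_scanning is hypothetical.

```python
import re

# Assumption: "&u+" followed by exactly four hexadecimal digits.
# The lookbehind skips an ampersand that is preceded by a backslash; a
# fuller version would also have to recognize an escaped backslash there.
_AMP_UNICODE = re.compile(r"(?<!\\)&u\+([0-9A-Fa-f]{4})")


def convert_before_scanning(text):
    """Replace ampersand Unicode sequences; leave backslash forms intact."""
    return _AMP_UNICODE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

With this sketch, ‘&u+0028’ becomes a real left parenthesis that the scanner will see as a delimiter, whereas ‘\u+0028’ passes through unchanged and is left for conversion after scanning.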


Named character entities

Qaqao supports the full set of XML and HTML named character entities using the format ‘&’ name ‘;’. For example, “&not;” is the not character ‘¬’ and “&equiv;” is the identical_to character ‘≡’. A full table of the named character entities is given in the list of XML and HTML character entity references. A few additional named character sequences are included in the tables below. As with the Unicode definition sequences, editors and IDEs may convert named character entities on the fly. The scanner must convert any of these sequences that remain before resolving tokens so that the terminating semicolon character will not be treated as a delimiter. The literal sequence can be obtained by placing a backslash character before the terminating semicolon, as in “&larr\;”, which yields the sequence “&larr;” after scanning. Since none of the named character entities corresponds to a whitespace or delimiter character, there is no provision for postponing conversion until after scanning.
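
A possible pre-tokenization pass for named character entities is sketched below in Python. It is an illustration only: it assumes that Python's HTML 4 table (html.entities.name2codepoint) is an acceptable stand-in for the full entity list, and it adds the two Qaqao-specific names from the substitution table later in this section. The names QAQAO_ENTITIES and convert_named_entities are hypothetical.

```python
import re
from html.entities import name2codepoint

# Assumption: the two Qaqao-specific names map to the double arrows.
QAQAO_ENTITIES = {"ldarr": 0x219E, "rdarr": 0x21A0}

# An entity is '&', a name, and ';'. A backslash before the ';' (or before
# the '&') keeps the pattern from matching, so the sequence is left intact.
_ENTITY = re.compile(r"(?<!\\)&([A-Za-z][A-Za-z0-9]*);")


def convert_named_entities(text):
    """Replace known named entities with their characters."""
    def replace(match):
        name = match.group(1)
        code = QAQAO_ENTITIES.get(name, name2codepoint.get(name))
        if code is None:
            return match.group(0)  # unknown name: leave the text untouched
        return chr(code)

    return _ENTITY.sub(replace, text)
```

Because the sequence “&larr\;” does not match the pattern, it reaches the scanner unchanged, which agrees with the rule stated above.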


Non-printing escape sequences

The non-printing escape sequences are used to express non-printing characters as non-special characters, typically in character and string literals. They have the format escape lowercase_letter or escape lowercase_letter lowercase_letter, where the letters are mnemonics for the intended character. The non-printing escape sequences are:

Sequence Unicode Description
‘\0’ \u+0000 null character
‘\a’ \u+0007 alert (bell)
‘\b’ \u+0008 backspace
‘\t’ \u+0009 horizontal tab
‘\n’ \u+000A new line (line feed)
‘\v’ \u+000B vertical tab
‘\f’ \u+000C new page (form feed)
‘\r’ \u+000D carriage return
‘\l’ \u+2028 line separator
‘\p’ \u+2029 paragraph separator

Editors and IDEs must leave these sequences intact; they are converted to the specified character after tokenization.
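
The post-tokenization conversion of non-printing escape sequences could look roughly like the Python sketch below. It handles only the single-character mnemonics listed in the table and leaves every other backslash pair untouched for a later pass; the names NON_PRINTING and resolve_non_printing are hypothetical.

```python
# Hypothetical mapping from mnemonic to character, following the table above.
NON_PRINTING = {
    "0": "\u0000",  # null character
    "a": "\u0007",  # alert (bell)
    "b": "\u0008",  # backspace
    "t": "\u0009",  # horizontal tab
    "n": "\u000A",  # new line (line feed)
    "v": "\u000B",  # vertical tab
    "f": "\u000C",  # new page (form feed)
    "r": "\u000D",  # carriage return
    "l": "\u2028",  # line separator
    "p": "\u2029",  # paragraph separator
}


def resolve_non_printing(token):
    """Convert non-printing escapes inside an already scanned token."""
    out = []
    i = 0
    while i < len(token):
        ch = token[i]
        if ch == "\\" and i + 1 < len(token):
            nxt = token[i + 1]
            # Consume the pair as a unit so that an escaped backslash
            # cannot combine with the letter that follows it.
            out.append(NON_PRINTING.get(nxt, ch + nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```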


Normalizing escape sequences

Any character which has special meaning in a particular unescaped lexical context loses the special meaning if it is preceded by an escape character. The format for a normalizing escape sequence is backslash character. For example, the escape sequences ‘\)’ and ‘\;’ are treated as non-special characters rather than as delimiters, and ‘\ ’ is not treated as whitespace by the scanner. Similarly, placing an escape in front of one of the characters in a substitution sequence prevents the substitution from occurring. Editors and IDEs must leave these sequences intact; they are converted to the specified character after tokenization.

Note: the escape character may not be used before actual control characters. Thus, it is a lexical error to place an escape character at the end of a line or before a tab character.
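
To show how normalizing escapes interact with scanning, here is a minimal Python sketch of a scanner that splits on whitespace and delimiters but keeps escaped characters inside the current token. Several things are assumptions: whitespace tokens are simply dropped, ampersand sequences are assumed to have been converted already, a backslash at the end of the input is treated like one at the end of a line, and "actual control characters" is read as the C0 range plus DEL and NEL. The names scan, LexicalError and is_control are hypothetical.

```python
WHITESPACE = frozenset("\t\n\x0b\x0c\r \x85\u200e\u200f\u2028\u2029")
DELIMITERS = frozenset("(),:;[]{}")


class LexicalError(Exception):
    """Raised when a backslash precedes a control character or nothing."""


def is_control(ch):
    # Assumption: C0 controls, DEL and NEL count as control characters.
    return ord(ch) < 0x20 or ch in "\x7f\x85"


def scan(text):
    """Return tokens; escape sequences stay intact for later conversion."""
    tokens = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            if i + 1 >= len(text) or is_control(text[i + 1]):
                raise LexicalError("escape before control character or end")
            current.append(text[i:i + 2])  # keep the pair inside the token
            i += 2
            continue
        if ch in WHITESPACE or ch in DELIMITERS:
            if current:
                tokens.append("".join(current))
                current = []
            if ch in DELIMITERS:
                tokens.append(ch)
        else:
            current.append(ch)
        i += 1
    if current:
        tokens.append("".join(current))
    return tokens
```

In this sketch the escaped parenthesis and the escaped space stay inside their tokens, while the unescaped parentheses and comma are emitted as separate delimiter tokens.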


Substitution sequences

To facilitate writing Qaqao code with ASCII text editors, several ASCII code sequences provide substitutions for Unicode symbols required by the language that are not part of the ASCII code. The following substitution sequences are for single character symbols; the substitution is made only when the sequence stands alone as an identifier, i.e. when the sequence is immediately preceded and followed by whitespace and/or delimiter characters:

Character Sequence Unicode HTML Description (Sequence Description)
‘«’ “<<” &u+00AB &laquo; bind operator (less-than less-than)
‘¬’ ‘~’ &u+00AC &not; not (tilde)
‘×’ ‘*’ &u+00D7 &times; multiplication sign (asterisk)
‘÷’ ‘/’ or “-:-” &u+00F7 &divide; division sign (slash or hyphen colon hyphen)
‘←’ “<-” &u+2190 &larr; left arrow (less-than hyphen)
‘↑’ “^” &u+2191 &uarr; up arrow (caret)
‘→’ “->” &u+2192 &rarr; right arrow (hyphen greater-than)
‘↞’ “<<-” &u+219E &ldarr; left double arrow (less-than less-than hyphen)
‘↠’ “->>” &u+21A0 &rdarr; right double arrow (hyphen greater-than greater-than)
‘−’ ‘-’ &u+2212 &minus; minus sign (hyphen)
‘∧’ “&&” &u+2227 &and; logical and (ampersand ampersand)
‘∨’ “||” &u+2228 &or; logical or (vertical-bar vertical-bar)
‘≠’ “/=” or “~=” &u+2260 &ne; not equal to (slash equal or tilde equal)
‘≡’ “===” &u+2261 &equiv; identical to (equal equal equal)
‘≤’ “<=” &u+2264 &le; less than or equal to (less-than equal)
‘≥’ “>=” &u+2265 &ge; greater than or equal to (greater-than equal)
‘⊕’ “|x|” &u+2295 &oplus; logical exclusive or (vertical-bar x vertical-bar)
‘⌈’ “|-” &u+2308 &lceil; left ceiling (vertical-bar hyphen)
‘⌉’ “-|” &u+2309 &rceil; right ceiling (hyphen vertical-bar)
‘⌊’ “|_” &u+230A &lfloor; left floor (vertical-bar underscore)
‘⌋’ “_|” &u+230B &rfloor; right floor (underscore vertical-bar)
‘〈’ “|<” &u+2329 &lang; left absolute value (vertical-bar less-than)
‘〉’ “>|” &u+232A &rang; right absolute value (greater-than vertical-bar)

Note that the left double arrow (‘↞’) and right double arrow (‘↠’) do not have standard named character sequences; new sequences (&ldarr; and &rdarr;) have been supplied.
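
The stand-alone rule can be illustrated with a short Python sketch: a substitution applies only when the whole token equals one of the sequences in the table. The mapping is transcribed from the table above; the names STANDALONE and substitute_standalone are hypothetical, and the sketch assumes the token has already been separated from its neighbours by whitespace or delimiters.

```python
# Transcribed from the substitution table above (sequence -> symbol).
STANDALONE = {
    "<<": "\u00AB", "~": "\u00AC", "*": "\u00D7",
    "/": "\u00F7", "-:-": "\u00F7",
    "<-": "\u2190", "^": "\u2191", "->": "\u2192",
    "<<-": "\u219E", "->>": "\u21A0", "-": "\u2212",
    "&&": "\u2227", "||": "\u2228",
    "/=": "\u2260", "~=": "\u2260", "===": "\u2261",
    "<=": "\u2264", ">=": "\u2265", "|x|": "\u2295",
    "|-": "\u2308", "-|": "\u2309", "|_": "\u230A", "_|": "\u230B",
    "|<": "\u2329", ">|": "\u232A",
}


def substitute_standalone(token):
    """Replace a token only if it is exactly a substitution sequence."""
    return STANDALONE.get(token, token)
```

A token such as ‘a<=b’ is returned unchanged because the sequence does not stand alone, and an escaped form such as ‘\<=’ is also left alone because it is not an exact match.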

In addition to the stand-alone substitution sequences in the table above, there are several other substitutions: the hyphen character converts to a minus character when it stands at the beginning of a token; the apostrophe (‘'’) and quotation (‘"’) characters convert to the left single quote (‘‘’) and left double quote (‘“’), respectively, when they stand at the beginning of a token or after a non-special whitespace character (including both escaped standard whitespace characters and formatting characters such as the em-space); the same characters convert to the right single quote (‘’’) and right double quote (‘”’), respectively, when they occur anywhere else in a token. The following table summarizes this situation:

Character Sequence Unicode HTML Description (Sequence Description)
‘−’ ‘-’ &u+2212 &minus; minus sign at the beginning of a token (hyphen)
‘‘’ ‘'’ &u+2018 &lsquo; left single quotation mark at the beginning of a token (apostrophe)
‘“’ ‘"’ &u+201C &ldquo; left double quotation mark at the beginning of a token (quotation mark)
‘’’ ‘'’ &u+2019 &rsquo; right single quotation mark elsewhere (apostrophe)
‘”’ ‘"’ &u+201D &rdquo; right double quotation mark elsewhere (quotation mark)

Editors and IDEs may substitute the character for the substitution sequence on the fly.
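
The position-dependent substitutions in the last table might be sketched in Python as follows. This is only an interpretation of the rules above: a "non-special whitespace character" is approximated by Python's str.isspace() applied to the preceding character (which covers an escaped space as well as formatting spaces such as the em space), and an escaped apostrophe, quotation mark or hyphen is copied unchanged. The function name substitute_in_token is hypothetical.

```python
def substitute_in_token(token):
    """Apply the hyphen and quote substitutions inside one token."""
    out = []
    opening = True  # True at token start or right after a space character
    i = 0
    while i < len(token):
        ch = token[i]
        if ch == "\\" and i + 1 < len(token):
            # An escape pair is copied verbatim; an escaped space still
            # counts as a non-special whitespace character.
            out.append(token[i:i + 2])
            opening = token[i + 1].isspace()
            i += 2
            continue
        if ch == "'":
            out.append("\u2018" if opening else "\u2019")
        elif ch == '"':
            out.append("\u201C" if opening else "\u201D")
        elif ch == "-" and i == 0:
            out.append("\u2212")  # minus sign only at the token start
        else:
            out.append(ch)
        opening = ch.isspace()
        i += 1
    return "".join(out)
```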

Qaqao Language Definition, section 1.1, version 0.4α
© J. Andrew Holey, 2013
email: jholey@csbsju.edu