Qaqao programs are written with the full Unicode character set. All Qaqao runtime systems must process Unicode text files encoded in the UTF-8 format; they may support other encodings that are appropriate for the system. The standard English version of Qaqao will also support plain ASCII files that use only seven-bit codes, since these are de facto UTF-8 files. Future versions of this standard will conform to the Unicode standard; this version may not fully conform to the standard, but does so to the extent practical.
***Design note: In this version of Qaqao, the standard English version is the only version.
Qaqao divides characters into four general classes: whitespace, delimiter, escape, and non-special characters.
The whitespace characters are:
| Key | Unicode | Description |
| Tab | &u+0009 | horizontal tab |
| Enter ↵ | &u+000A | new line (line feed) |
| – | &u+000B | vertical tab |
| – | &u+000C | new page (form feed) |
| Return ↵ | &u+000D | carriage return |
| Space_bar | &u+0020 | space |
| – | &u+0085 | control character next line |
| – | &u+200E | left-to-right mark |
| – | &u+200F | right-to-left mark |
| – | &u+2028 | Unicode line separator |
| – | &u+2029 | Unicode paragraph separator |
The scanner will only recognize whitespace characters that are already part of the character stream. Escape sequences as noted above are converted to the corresponding whitespace character before being passed to the scanner. If the ‘&’ is replaced with a ‘\’ in any of the above sequences, conversion is postponed until after scanning, and the character is not treated as whitespace. (See Section 1.1.2 below.)
The Return or Enter key will generate a carriage_return character, a new_line character or both, depending on the operating system. The Qaqao system treats all standard combinations of these characters equally as a single end-of-line sequence; the scanner converts consecutive whitespace characters to a single Whitespace token.
For purposes of conformance with the Unicode standard, the above table is intended to specify exactly the standard pattern whitespace characters. Other spacing characters, such as the no-break space, the em space and the thin space, are intended as formatting characters and are not handled as whitespace by the scanner.
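The collapsing rule described above can be illustrated with a minimal Python sketch. It assumes the character stream has already had its ampersand escape sequences converted; the function name `collapse_whitespace` and the marker argument are hypothetical, chosen only for illustration.

```python
import re

# The eleven whitespace characters listed in the table above.
WHITESPACE = "\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u200E\u200F\u2028\u2029"

# None of these characters is special inside a regex character class.
_WS_RUN = re.compile("[" + WHITESPACE + "]+")

def collapse_whitespace(text: str, marker: str = " ") -> str:
    """Replace every run of whitespace characters with one marker,
    mimicking the single Whitespace token the scanner would emit."""
    return _WS_RUN.sub(marker, text)
```

Under this sketch a carriage return followed by a line feed collapses to one marker, while a no-break space (U+00A0) is left alone because it is not in the table.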
The delimiter characters are:
| Character | Unicode | Description |
| ( | &u+0028 | left parenthesis |
| ) | &u+0029 | right parenthesis |
| , | &u+002C | comma |
| : | &u+003A | colon |
| ; | &u+003B | semicolon |
| [ | &u+005B | left bracket (left square bracket) |
| ] | &u+005D | right bracket (right square bracket) |
| { | &u+007B | left brace (left curly brace) |
| } | &u+007D | right brace (right curly brace) |
The scanner will only recognize delimiter characters that are already part of the character stream. Escape sequences as noted above are converted to the corresponding delimiter character before being passed to the scanner. If the ‘&’ is replaced with a ‘\’ in one of the above sequences, conversion is postponed until after scanning, and the character is not treated as a delimiter. (See Section 1.1.2 below.)
For purposes of conformance with the Unicode standard, the above table notes exactly those characters that are pattern syntax characters in the structural scanning phase.
The escape characters are:
| Character | Unicode | Description |
| & | \u+0026 | ampersand (&) |
| \ | \u+005C | backslash |
The ampersand escape is used to embed characters that should be converted before tokenization; the backslash escape is used to postpone conversion until after tokenization. In general, pre-tokenization conversion is preferable except for whitespace and delimiter characters.
Note: the escape characters are denoted by a single ampersand or backslash character when part of an escape sequence; the ampersand character itself is written as the escape sequence ‘\&’ or with the Unicode sequence above; the backslash character is written as the escape sequence ‘\\’ or with the Unicode sequence above. The escape sequence denoting the backslash character can be interpreted as an escape character followed by a backslash character, the initial escape character preventing the second backslash character from being interpreted as an escape character. (See Section 1.1.2 below.)
For purposes of conformance with the Unicode standard, the above table notes those characters that are used for quoting pattern whitespace and pattern syntax characters.
The non-special characters are the remaining Unicode characters. In addition, all the special class characters may be treated as non-special characters through the use of escape sequences described below. In this document the term non-special will refer only to those characters that are not members of the other three classes; the term any will refer to the non-special characters together with special characters escaped to non-special status.
For purposes of conformance with the Unicode standard, the non-special characters correspond to the category id-start. In general, Qaqao identifiers may begin with any non-special character; in the structural lexics (Section 1.3), there are particular categories of identifiers that may have restricted start characters.
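As a rough illustration of the four classes, the following Python sketch classifies a single unescaped character using the three tables above. The function name `char_class` and the returned labels are assumptions made for this example, not part of the Qaqao definition.

```python
# Character sets copied from the whitespace, delimiter, and escape tables.
WHITESPACE = frozenset(
    "\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u200E\u200F\u2028\u2029"
)
DELIMITERS = frozenset("(),:;[]{}")
ESCAPES = frozenset("&\\")

def char_class(ch: str) -> str:
    """Return the general class of one unescaped character."""
    if ch in WHITESPACE:
        return "whitespace"
    if ch in DELIMITERS:
        return "delimiter"
    if ch in ESCAPES:
        return "escape"
    # Every remaining Unicode character is non-special.
    return "non-special"
```

For instance, `char_class("\u00A0")` gives `"non-special"`, matching the statement that the no-break space is a formatting character rather than whitespace.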
There are several classes of special character sequences that make it possible to express any Unicode character using an ASCII text editor as well as to override the action of the Qaqao system to treat certain characters and sequences as special. The first three classes of special sequences are escape sequences that impart special meaning to the following character or characters; the fourth class is escape sequences that remove special attributes from the character following the escape character; the last class is for character substitutions.
Any Unicode character can be expressed using its hexadecimal code. There are four formats:
Editors and IDEs may convert those sequences beginning with the ampersand on the fly; the scanner will convert any of these sequences remaining before tokenization. Conversion of sequences beginning with the backslash character must be postponed until after scanning, although conversion to another escape sequence is allowed. For example, the sequence ‘\u+005C’, which is the backslash character, may be converted to the sequence ‘\\’ or vice versa before scanning, but conversion of either sequence to the single character ‘\’ takes place only after scanning. Programmers should take care when using these sequences in the context of other escape sequences. For example, in the sequence “\\u+004F”, the initial escape–backslash sequence becomes a single backslash character after scanning and the remaining characters do not change, yielding a character sequence that looks like an escape sequence but is never converted.
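The pre-tokenization pass can be sketched in Python as follows. This is an illustration under stated assumptions: it handles only the four-hex-digit form `&u+XXXX` that appears in the tables of this section, it ignores the case of an escaped ampersand (`\&`), and the name `convert_ampersand_unicode` is hypothetical.

```python
import re

# Assumption: only the "&u+XXXX" form with exactly four hex digits is handled.
_AMP_UNICODE = re.compile(r"&u\+([0-9A-Fa-f]{4})")

def convert_ampersand_unicode(text: str) -> str:
    """Convert '&u+XXXX' sequences to their characters before scanning.

    Sequences that start with a backslash are deliberately left intact,
    since their conversion is postponed until after scanning."""
    return _AMP_UNICODE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

With this sketch, `&u+0028` becomes a real left parenthesis that the scanner will later treat as a delimiter, whereas `\u+0028` passes through unchanged.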
Qaqao supports the full set of XML and HTML named character entities using the format: ‘&’ name ‘;’. For example, “&not;” is the not character ‘¬’ and “&equiv;” is the identical_to character ‘≡’. A full table of the named character entities is available in the list of XML and HTML character entity references. A few additional named character sequences are included in tables below. As with the Unicode definition sequences, editors and IDEs may convert named character entities on the fly. The scanner must convert any of these sequences that remain before resolving tokens so that the terminating semicolon character will not be treated as a delimiter. The actual sequence can be obtained by placing a backslash character before the terminating semicolon, as in “&larr\;”, which yields the sequence “&larr;” after scanning. Since none of the named character entities corresponds to a whitespace or delimiter character, there is no provision for postponing conversion until after scanning.
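A possible way to picture the named-entity pass is the Python sketch below. It assumes the HTML 4 entity names shipped in Python's standard `html.entities.name2codepoint` table as a stand-in for the full entity list, plus the two Qaqao-specific names defined in this section; the names `QAQAO_EXTRA` and `convert_named_entities` are hypothetical.

```python
import re
from html.entities import name2codepoint

# Names defined by this section that are not standard HTML entities.
QAQAO_EXTRA = {"ldarr": 0x219E, "rdarr": 0x21A0}

# An entity is '&', a name, and an unescaped terminating ';'.
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def convert_named_entities(text: str) -> str:
    """Convert '&name;' sequences before tokens are resolved."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        code = QAQAO_EXTRA.get(name, name2codepoint.get(name))
        # Unknown names are left exactly as written.
        return chr(code) if code is not None else match.group(0)
    return _ENTITY.sub(replace, text)
```

Because the pattern requires the semicolon to follow the name directly, a sequence written with a backslash before the semicolon is not matched and survives to the scanner.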
The non-printing escape sequences are used to express non-printing characters as non-special characters, typically in character and string literals. They have the format escape lowercase_letter or escape lowercase_letter lowercase_letter, where the letters are mnemonics for the intended character. The non-printing escape sequences are:
| Sequence | Unicode | Description |
| ‘\0’ | \u+0000 | null character |
| ‘\a’ | \u+0007 | alert (bell) |
| ‘\b’ | \u+0008 | backspace |
| ‘\t’ | \u+0009 | horizontal tab |
| ‘\n’ | \u+000A | new line (line feed) |
| ‘\v’ | \u+000B | vertical tab |
| ‘\f’ | \u+000C | new page (form feed) |
| ‘\r’ | \u+000D | carriage return |
| ‘\l’ | \u+2028 | line separator |
| ‘\p’ | \u+2029 | paragraph separator |
Editors and IDEs must leave these sequences intact; they are converted to the specified character after tokenization.
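The post-tokenization conversion of these sequences might look like the following Python sketch. It assumes the single-letter mnemonics in the table, maps ‘\p’ to U+2029 (the Unicode paragraph separator), and copies any other backslash pair through unchanged; `expand_non_printing` is a hypothetical name.

```python
# Mnemonic letter -> character, following the table above.
NON_PRINTING = {
    "0": "\u0000", "a": "\u0007", "b": "\u0008", "t": "\u0009",
    "n": "\u000A", "v": "\u000B", "f": "\u000C", "r": "\u000D",
    "l": "\u2028", "p": "\u2029",  # assumption: \p is U+2029
}

def expand_non_printing(token: str) -> str:
    """Expand non-printing escape sequences inside an already-scanned token."""
    out = []
    i = 0
    while i < len(token):
        if token[i] == "\\" and i + 1 < len(token):
            nxt = token[i + 1]
            # Known mnemonic: emit the character. Otherwise keep the pair,
            # so that an escaped backslash does not expose the next letter.
            out.append(NON_PRINTING.get(nxt, token[i:i + 2]))
            i += 2
        else:
            out.append(token[i])
            i += 1
    return "".join(out)
```

Processing the token pair by pair means that an escaped backslash followed by the letter t stays as written rather than turning into a tab.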
Any character which has special meaning in a particular unescaped lexical context loses the special meaning if it is preceded by an escape character. The format for a normalizing escape sequence is backslash character. For example, the escape sequences ‘\)’ and ‘\;’ are treated as non-special characters rather than as delimiters, and ‘\ ’ is not treated as whitespace by the scanner. Similarly, placing an escape in front of one of the characters in a substitution sequence prevents the substitution from occurring. Editors and IDEs must leave these sequences intact; they are converted to the specified character after tokenization.
Note: the escape character may not be used before actual control characters. Thus, it is a lexical error to place an escape character at the end of a line or before a tab character.
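To make the effect of normalizing escapes concrete, here is a small Python sketch of a scanner. It is a simplification under several assumptions: ampersand sequences have already been converted, whitespace runs are dropped rather than emitted as Whitespace tokens, and the names `scan` and `normalize` are hypothetical.

```python
import re
import unicodedata

WHITESPACE = set("\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u200E\u200F\u2028\u2029")
DELIMITERS = set("(),:;[]{}")

def scan(text: str) -> list:
    """Split text into delimiter tokens and other tokens, honoring backslash."""
    tokens, word, i = [], [], 0
    def flush():
        if word:
            tokens.append("".join(word))
            word.clear()
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Lexical error: escape at end of input or before a control character.
            if i + 1 == len(text) or unicodedata.category(text[i + 1]) == "Cc":
                raise ValueError("escape character before control character")
            word.append(text[i:i + 2])  # kept intact until after tokenization
            i += 2
            continue
        if ch in DELIMITERS:
            flush()
            tokens.append(ch)
        elif ch in WHITESPACE:
            flush()
        else:
            word.append(ch)
        i += 1
    flush()
    return tokens

def normalize(token: str) -> str:
    """After tokenization, drop the backslash of each normalizing escape
    (non-printing mnemonics such as \\n are left for a separate pass)."""
    return re.sub(r"\\([^0abtnvfrlp])", lambda m: m.group(1), token)
```

In this sketch the input `f(a\)b, c)` yields the tokens `f`, `(`, `a\)b`, `,`, `c`, `)`, and `normalize` then turns the third token into `a)b`.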
To facilitate writing Qaqao code with ASCII text editors, several ASCII code sequences provide substitutions for Unicode symbols required by the language that are not part of the ASCII code. The following substitution sequences are for single character symbols; the substitution is made only when the sequence stands alone as an identifier, i.e. when the sequence is immediately preceded and followed by whitespace and/or delimiter characters:
| Character | Sequence | Unicode | HTML | Description (Sequence Description) |
| ‘«’ | “<<” | &u+00AB | &laquo; | bind operator (less-than less-than) |
| ‘¬’ | ‘~’ | &u+00AC | &not; | not (tilde) |
| ‘×’ | ‘*’ | &u+00D7 | &times; | multiplication sign (asterisk) |
| ‘÷’ | ‘/’ “-:-” | &u+00F7 | &divide; | division sign (slash or hyphen colon hyphen) |
| ‘←’ | “<-” | &u+2190 | &larr; | left arrow (less-than hyphen) |
| ‘↑’ | ‘^’ | &u+2191 | &uarr; | up arrow (caret) |
| ‘→’ | “->” | &u+2192 | &rarr; | right arrow (hyphen greater-than) |
| ‘↞’ | “<<-” | &u+219E | &ldarr; | left double arrow (less-than less-than hyphen) |
| ‘↠’ | “->>” | &u+21A0 | &rdarr; | right double arrow (hyphen greater-than greater-than) |
| ‘−’ | ‘-’ | &u+2212 | &minus; | minus sign (hyphen) |
| ‘∧’ | “&&” | &u+2227 | &and; | logical and (ampersand ampersand) |
| ‘∨’ | “||” | &u+2228 | &or; | logical or (vertical-bar vertical-bar) |
| ‘≠’ | “/=” “~=” | &u+2260 | &ne; | not equal to (slash equal or tilde equal) |
| ‘≡’ | “===” | &u+2261 | &equiv; | identical to (equal equal equal) |
| ‘≤’ | “<=” | &u+2264 | &le; | less than or equal to (less-than equal) |
| ‘≥’ | “>=” | &u+2265 | &ge; | greater than or equal to (greater-than equal) |
| ‘⊕’ | “|x|” | &u+2295 | &oplus; | logical exclusive or (vertical-bar x vertical-bar) |
| ‘⌈’ | “|-” | &u+2308 | &lceil; | left ceiling (vertical-bar hyphen) |
| ‘⌉’ | “-|” | &u+2309 | &rceil; | right ceiling (hyphen vertical-bar) |
| ‘⌊’ | “|_” | &u+230A | &lfloor; | left floor (vertical-bar underscore) |
| ‘⌋’ | “_|” | &u+230B | &rfloor; | right floor (underscore vertical-bar) |
| ‘〈’ | “|<” | &u+2329 | &lang; | left absolute value (vertical-bar less-than) |
| ‘〉’ | “>|” | &u+232A | &rang; | right absolute value (greater-than vertical-bar) |
Note that the left double arrow (‘↞’) and right double arrow (‘↠’) do not have standard named character sequences; new sequences (&ldarr; and &rdarr;) have been supplied.
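The stand-alone rule can be sketched in Python as a lookup applied to whole tokens only. The sketch assumes the input has already been split on whitespace and delimiter characters, so a token that equals a sequence is by construction standing alone; `SUBSTITUTIONS` and `substitute_stand_alone` are hypothetical names.

```python
# ASCII sequence -> Unicode symbol, following the table above.
SUBSTITUTIONS = {
    "<<": "\u00AB", "~": "\u00AC", "*": "\u00D7",
    "/": "\u00F7", "-:-": "\u00F7",
    "<-": "\u2190", "^": "\u2191", "->": "\u2192",
    "<<-": "\u219E", "->>": "\u21A0", "-": "\u2212",
    "&&": "\u2227", "||": "\u2228",
    "/=": "\u2260", "~=": "\u2260", "===": "\u2261",
    "<=": "\u2264", ">=": "\u2265", "|x|": "\u2295",
    "|-": "\u2308", "-|": "\u2309", "|_": "\u230A", "_|": "\u230B",
    "|<": "\u2329", ">|": "\u232A",
}

def substitute_stand_alone(tokens: list) -> list:
    """Replace a token only when the whole token is a substitution sequence."""
    return [SUBSTITUTIONS.get(token, token) for token in tokens]
```

A token such as `a<=b` is therefore left untouched, because the sequence is embedded in a longer identifier instead of standing alone.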
In addition to the stand-alone substitution sequences in the table above, there are several other substitutions: the hyphen character converts to a minus character when it stands at the beginning of a token; the apostrophe (‘'’) and quotation (‘"’) characters convert to the left single quote (‘‘’) and left double quote (‘“’), respectively, when they stand at the beginning of a token or after a non-special whitespace character (including both escaped standard whitespace characters and formatting characters such as the em-space); the same characters convert to the right single quote (‘’’) and right double quote (‘”’), respectively, when they occur anywhere else in a token. The following table summarizes this situation:
| Character | Sequence | Unicode | HTML | Description (Sequence Description) |
| ‘−’ | ‘-’ | &u+2212 | &minus; | minus sign at the beginning of a token (hyphen) |
| ‘‘’ | ‘'’ | &u+2018 | &lsquo; | left single quotation mark at the beginning of a token (apostrophe) |
| ‘“’ | ‘"’ | &u+201C | &ldquo; | left double quotation mark at the beginning of a token (quotation mark) |
| ‘’’ | ‘'’ | &u+2019 | &rsquo; | right single quotation mark elsewhere (apostrophe) |
| ‘”’ | ‘"’ | &u+201D | &rdquo; | right double quotation mark elsewhere (quotation mark) |
Editors and IDEs may substitute the character for the substitution sequence on the fly.
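The position-dependent substitutions can be sketched in Python as follows. The sketch assumes it receives one scanned token, treats an escaped space and the three formatting spaces named earlier in this section (no-break space, em space, thin space) as the non-special whitespace characters, and leaves any backslash-escaped character alone; `substitute_positional` is a hypothetical name.

```python
# Space characters that may appear inside a token: an escaped space, plus
# the formatting spaces named in this section (an illustrative subset).
SPACES_IN_TOKEN = {" ", "\u00A0", "\u2003", "\u2009"}

def substitute_positional(token: str) -> str:
    """Apply the hyphen and quotation-mark substitutions within one token."""
    out = []
    i = 0
    while i < len(token):
        ch = token[i]
        if ch == "\\" and i + 1 < len(token):
            # An escaped character is never substituted.
            out.append(token[i:i + 2])
            i += 2
            continue
        opening = i == 0 or token[i - 1] in SPACES_IN_TOKEN
        if ch == "-" and i == 0:
            out.append("\u2212")  # minus sign
        elif ch == "'":
            out.append("\u2018" if opening else "\u2019")
        elif ch == '"':
            out.append("\u201C" if opening else "\u201D")
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

Under these assumptions a leading hyphen becomes a minus sign, a hyphen in the middle of a token is kept, and an apostrophe inside a word becomes a right single quotation mark.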
Qaqao Language Definition, section 1.1, version 0.3α
© J. Andrew Holey, 2009