Preface

The Tokay documentation is currently under heavy development and unfinished. Feel free to contribute in any way! To do so, visit https://github.com/tokay-lang.

Tokay programs are expressed and executed differently from common programming languages like Rust or Python. Therefore, Tokay is not "yet another programming language". It was designed with the goal of letting its programs operate directly on input streams that are read from files, strings, piped commands or any other device emitting characters.

The most obvious example to show how Tokay executes its programs is this little matcher. It recognizes either "Hello Mercury", "Hello Venus" or "Hello Earth" from a text stream. Any other input is automatically skipped.

'Hello' _ {
    'Mercury'
    'Venus'
    'Earth'
}

In comparison to general-purpose programming languages, no explicit branching, substring extraction or reading from input is required, as this is built directly into the language and its entire structure.
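For contrast, here is a rough sketch of the same matcher in a general-purpose language. It is written in Python with the standard re module and is not Tokay code. The name find_greetings is made up for illustration, and the sketch assumes that _ stands for optional spaces or tabs.

```python
import re

# Explicit pattern for: 'Hello' _ { 'Mercury' ; 'Venus' ; 'Earth' }
# Assumption: _ stands for optional spaces or tabs.
GREETING = re.compile(r"Hello[ \t]*(?:Mercury|Venus|Earth)")


def find_greetings(text):
    """Return every greeting found in text; any other input is skipped."""
    return [match.group(0) for match in GREETING.finditer(text)]


print(find_greetings("Hello Venus, Hello Mars and Hello Earth"))
# ['Hello Venus', 'Hello Earth']
```

The pattern, the scanning loop and the extraction of the matched text all have to be spelled out here, which is the work the Tokay program leaves to the language.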

If you're familiar with awk, you might find a similarity between the above example and awk's PATTERN { action } syntax. This approach is recursive in Tokay, so that the action part is again a further pattern area, or just action code. This is exactly where the intention behind Tokay starts: not a line-based execution working on fields, but a token-based approach working on anything matched from the input, including recursive structures that can be expressed by a grammar.

Getting started

Installation

Currently, Tokay is in a very early project state. Therefore you have to build it from source, using the Rust programming language and its build tool cargo.

Once you have Rust installed, install Tokay with cargo install tokay.

When this is done, you can run Tokay directly, like so, to start a REPL:

$ tokay
Tokay 0.4.0
>>> print("Hello Tokay")
Hello Tokay
>>>

To exit the REPL, type exit or press Ctrl+C.

Using the tokay command

Invoking the tokay command without any arguments starts the REPL (read-eval-print loop). This allows you to enter expressions or even full programs interactively and see the result directly.

Start a REPL

$ tokay
Tokay 0.4.0
>>> 23 * 5
115
>>> for i=0; i < 10; i++ print(i)
0
1
2
3
4
5
6
7
8
9
>>>

Start a REPL working on an input stream read from file.txt:

$ tokay -- file.txt

Start a REPL working on the input string "save all the whales":

$ tokay -- "save all the whales"
Tokay 0.4.0
>>> Word
("save", "all", "the", "whales")
>>>

In case you compile and run Tokay from the source code of the Git repository on your own, just run cargo run -- with any desired parameters attached.

The next call runs the Tokay program from the file prog.tok:

$ tokay prog.tok
...

To work directly on files as input stream, do this as shown next. Further files can be specified and are processed by the same program sequentially. It's also possible to read from stdin using the special filename -.

Run a program from a file with another file as input stream

$ tokay prog.tok -- file.txt
...

Run a program with multiple files as input stream

$ tokay prog.tok -- file1.txt file2.txt file3.txt
...

Run a program with files or strings as input stream

$ tokay prog.tok -- file1.txt "save all the whales" file2.txt
...

Pipe input through tokay

$ cat file.txt | tokay prog.tok -- -
...

A Tokay program can also be specified directly by parameter. This call just prints the content of the files specified:

$ tokay '.+' -- file1.txt file2.txt file3.txt
file1.txt: ...
file2.txt: ...
file3.txt: ...

First steps

Next you can see some little programs expressed in Tokay to become familiar with the syntax and behavior.

Hello Tokay

You probably found out how to express the "Hello World" program in Tokay already.

print("Hello World")

Tokens

Writing comments

It is good practice to document source code and what's going on using comments. Like bash, Python or awk, Tokay supports line comments starting with a hash (#). The rest of the line will be ignored.

# This is my little program

print("Hello World")  # printing welcome message to the user
hash = "# this is a string"  # assign "# this is a string" to hash.

Shebang

Because line comments start with a hash, a shebang line is also possible in case a Tokay source file shall be directly executable.

#!/bin/tokay
print("Hello World")

This assumes tokay is installed to /bin on a POSIX-like system.

$ ls -lta hello.tok
-rwxr-xr-x  hello.tok
$ ./hello.tok
Hello World

Basics

Basically, a Tokay program is established on these parts:

  • Items
  • Sequences
  • Blocks

All these parts belong together and depend on each other in some way.

The following Tokay program demonstrates the usage of all three of these parts in action.

{ # A block...
    # ... is made of sequences
    'Hello' _ Name count_hello++   # ... which are made of items.

    'Goodbye' _ {  # an item of a sequence can be a block again
        'Max'  count_bye_max++  # ... which contains one sequence with items
        Name   count_bye++      # ... and a second one, too.
    }

    {}  # a sequence with an empty block as its item
}

This program is a little parser, which looks for greetings in some input.

  • The occurrence of e.g. Hello Jan and Hello Max causes the variable count_hello to be incremented
  • The occurrence of e.g. Goodbye Jan increments the counter count_bye, but
  • An occurrence of Goodbye Max, which is a special case here, counts on count_bye_max.

If you are familiar with the Awk programming language, you might see some similarities to the PATTERN { action }-syntax here.

In Tokay, PATTERN can be any sequence of items that need to match before, and { action } can hold further PATTERN { action }-components.
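To make the nesting concrete, the greeting counter can be approximated in Python. This is only an analogy, not Tokay code: the function count_greetings is hypothetical, Name is assumed to be a run of ASCII letters, and _ is assumed to be optional whitespace. The inner block becomes an ordered alternative in which 'Max' is tried before the general name.

```python
import re

# Assumptions: Name is one or more ASCII letters, _ is optional whitespace.
GREETING = re.compile(
    r"Hello[ \t]*[A-Za-z]+"                # 'Hello' _ Name
    r"|Goodbye[ \t]*(?:(Max)|[A-Za-z]+)"   # 'Goodbye' _ { 'Max' ; Name }
)


def count_greetings(text):
    """Count the three kinds of greetings found anywhere in text."""
    counts = {"count_hello": 0, "count_bye": 0, "count_bye_max": 0}
    for match in GREETING.finditer(text):
        if match.group(0).startswith("Hello"):
            counts["count_hello"] += 1
        elif match.group(1):               # the 'Max' alternative matched first
            counts["count_bye_max"] += 1
        else:
            counts["count_bye"] += 1
    return counts


print(count_greetings("Hello Jan Hello Max Goodbye Jan Goodbye Max"))
# {'count_hello': 2, 'count_bye': 1, 'count_bye_max': 1}
```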

Items

Items are the atomic parts of sequences, and represent values.

The following examples for items are direct values that, once specified, stay on their own.

123               # the number 123
true              # the boolean value for truth
"Tokay 🦎"        # a unicode string

Items can also be the result of expressions or calls to callable objects.

"a " + "string"   # concatenating a string
42 * 23.5         # the result of a multiplication
'check'           # the occurrence of string "check" in the input
Integer           # calling a built-in token for parsing integer values
func(42)          # calling a function
++count           # the incremented value of count

But items can also be more complex.

x = count * 23.5  # the result of a calculation is assigned to a variable

This is an assignment, and always produces the item value void, which means just "nothing". This is because the result of the calculation is stored in a variable, but the item must still represent some value.

Here's another item:

if x > 100 "much" # conditional expression, which is either "much" or void

This if-clause allows for conditional programming. It produces a string when the provided condition is met, and void otherwise.

This behavior can be changed by providing an else-branch next, like this:

if x > 100 "much" else "less"

As you see, every single value, call, expression or control-flow statement is considered to be an item.

A block is an item as well, but this will be discussed later.

Severities

This is not important for the first steps and programs with Tokay, but a fundamental feature of the magic behind Tokay's automatic value construction features, which will be discussed later. You should know about it!

Every item has a severity, which defines its value's "weight".

Tokay currently knows 4 levels of severity:

  1. Whitespace
  2. Match
  3. Value
  4. Result

The severity of an item depends on how it is constructed. For example

123               # pushes 123 with severity 3
_                 # matches whitespace
'check'           # matches "check" in input and pushes it considered as match
''check''         # matches "check" in input and pushes it considered as value
'check' * 3       # matches "check" in input and repeats it 3 times, resulting in value
push "yes"        # pushes result value "yes"

Right now, this isn't so important, and you don't need to keep it in mind all the time. It will become useful during the next chapters, especially when writing programs that parse or extract data from some input.
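As a mental model only, the effect of severities can be sketched in Python. The rule used here is an assumption made for illustration and not a description of Tokay's implementation: only the items that share the highest severity occurring in a sequence contribute to its result. The names sequence_result and the four constants are hypothetical.

```python
# Severity levels as listed above (1 = lowest, 4 = highest).
WHITESPACE, MATCH, VALUE, RESULT = 1, 2, 3, 4


def sequence_result(items):
    """Combine (value, severity) pairs into one result.

    Assumed rule: keep only the items with the highest severity present.
    A single survivor is returned as is, several survivors as a tuple.
    """
    if not items:
        return None
    top = max(severity for _, severity in items)
    kept = [value for value, severity in items if severity == top]
    return kept[0] if len(kept) == 1 else tuple(kept)


# A value outweighs whitespace and a plain match.
print(sequence_result([(123, VALUE), (" ", WHITESPACE), ("check", MATCH)]))  # 123
# A pushed result outweighs everything else.
print(sequence_result([("check", MATCH), ("yes", RESULT)]))                  # yes
# Several items of the same top severity form a list-like result.
print(sequence_result([(1, VALUE), (2, VALUE), (7, VALUE)]))                 # (1, 2, 7)
```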

Conclusion

In conclusion, an item is the result of some expression which always stands for a value. An item in turn is part of a sequence. Every item has a hidden severity, which is important for constructing values from sequences later on.

Sequences

Sequences are occurrences of items in a row.

Here is a sequence of three items:

1 2 3 + 4    # results in a list (1, 2, 7)

For better readability, items of a sequence can be optionally separated by commas (,), so

1, 2, 3 + 4  # (1, 2, 7)

encodes the same.

All items of a sequence with a given severity are used to determine the result of the sequence. Therefore, these sequences return (1, 2, 7) in the above examples when entered in a Tokay REPL. This has to do with the severities of the items.

The end of the sequence is delimited by a line-break, but the sequence can be wrapped onto multiple lines using a backslash before the line-break. So

1, 2 \
3 + 4  # (1, 2, 7)

means also the same as above.

Captures

The already executed items of a sequence are captured, so they can be accessed inside of the sequence using capture variables.

In the next example, the first capture, which holds the result 7 from the expression 3 + 4 is referenced with $1 and used in the second item as value of the expression. Referencing a capture which is out of bounds will just return void.

3 + 4, $1 * 2  # (7, 14)

Captures can also be re-assigned by subsequent items. The next one assigns a value at the second item to the first item, and uses the first item inside of the calculation. The second item which is the assignment, exists also as item of the sequence and refers to void, as all assignments do.

This is the reason why Tokay has two values to simply define nothing, which are void and null, but null has a higher precedence.

3 + 4, $1 = $1 * 2  # 14

As the result of the above sequence, just one value results which is 14, but the second item's value, void, has a lower severity than the calculated and assigned first value. This is the magic with sequences that you will soon figure out in detail, especially when tokens from streams are accessed and processed, or your programs work on extracted information from the input, and the automatic abstract syntax tree construction occurs.
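The two capture examples can be replayed with a small Python model. It is a sketch under stated assumptions, not Tokay's actual machinery: each item is modelled as a function that receives the list of captures so far, VOID is a stand-in object for void, and void items are simply left out of the result. All names are hypothetical.

```python
VOID = object()  # stand-in for Tokay's void


def run_sequence(items):
    """Run items left to right; each item sees the captures made so far."""
    captures = []
    for item in items:
        captures.append(item(captures))
    kept = [value for value in captures if value is not VOID]
    if not kept:
        return VOID
    return kept[0] if len(kept) == 1 else tuple(kept)


def assign_first_doubled(captures):
    """Model of `$1 = $1 * 2`: overwrite capture 1, yield void."""
    captures[0] = captures[0] * 2
    return VOID


# 3 + 4, $1 * 2
print(run_sequence([lambda c: 3 + 4, lambda c: c[0] * 2]))   # (7, 14)
# 3 + 4, $1 = $1 * 2
print(run_sequence([lambda c: 3 + 4, assign_first_doubled])) # 14
```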

As the last example, we shortly show how sequence items can also be named and accessed by a more meaningful name than just the index.

hello => "Hello", $hello = 3 * $hello  # (hello => "HelloHelloHello")

Here, the first item, which is referenced by the capture variable $hello is repeated 3 times as the second item.

It might be surprising, but the result of this sequence is a dict as shown in the comment. A dict is a hash-table where values can be referenced by a key.

If you come from Python, you might already know about list and dict objects. Their behavior and meaning is similar in Tokay.

Parsing input sequences

As Tokay is a programming language with built-in parsing capabilities, let's see how parsing integrates to sequences and captures.

Given the sequence

Word __ ''the'' __ Word

we make use of the built-in token Word which matches anything made of characters and digits, and the special constant __, which matches arbitrary whitespace, but at least one whitespace character must be present. Whitespace is anything represented by non-printable characters, like spaces or tabs.

We can now run this sequence on any input consisting of three words, where the word in the middle is "the". Let's say

Save the planet

and we get the output

("Save", "the", "planet")

To try it out, either start a Tokay REPL with $ tokay -- "Save the planet" and enter the sequence Word __ ''the'' __ Word afterwards, or directly specify both at invocation, like
$ tokay "Word __ ''the'' __ Word" -- "Save the planet".

You will see that, regardless of how much whitespace you insert, the result will always be the same. The reason for this is the item severities discussed earlier. Whitespace, as matched by the pre-defined constant __, has a lower severity, and therefore won't make it into the result of the sequence.
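Readers who know regular expressions may find the following Python comparison helpful. It is an approximation and not Tokay code: Word is assumed to be a run of ASCII letters and digits, __ is assumed to be one or more whitespace characters, and the function name parse is made up.

```python
import re

# Rough regex counterpart of: Word __ ''the'' __ Word
# Only the three parenthesised groups are kept, the whitespace is dropped.
PATTERN = re.compile(r"([A-Za-z0-9]+)\s+(the)\s+([A-Za-z0-9]+)")


def parse(text):
    """Return the three captured words, or None if the input does not match."""
    match = PATTERN.match(text)
    return match.groups() if match else None


print(parse("Save the planet"))        # ('Save', 'the', 'planet')
print(parse("Save   the \t planet"))   # ('Save', 'the', 'planet')
```

In the regex, dropping the whitespace is an explicit choice made by placing the groups. In Tokay the same effect follows from the severities of the items.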

Using capture aliases

Captures can also have a name, called "alias". This is ideal for parsing, to give items meaningful names and make them independent from their position.

predicate => Word __ 'the' __ object => Word

will output

(object => "planet", predicate => "Save")

In this example, the match for the word ''the'' was degraded to a touch 'the', which has a lower item severity and won't make it into the sequence result.

This was done to make the output more clear, and because "the" is only an article without relevance to the meaning of the sentence we try to parse.

Now we can also work with alias variables inside of the sequence

predicate => Word __ 'the' __ object => Word \
    print("What to " + $predicate.lower() + "? The " + $object + "!")

will output

What to save? The planet!

The advantage here is that we can extend the sequence with further items in between, and don't have to change all references to these items in the print function call, because they are identified by name and not by their offset, which might have changed.
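Named groups in regular expressions follow the same idea. The Python sketch below is an analogy only; Word is again assumed to be a run of ASCII letters and digits, and the function name describe is hypothetical.

```python
import re

# Named groups play the role of the aliases predicate and object.
PATTERN = re.compile(
    r"(?P<predicate>[A-Za-z0-9]+)\s+the\s+(?P<object>[A-Za-z0-9]+)"
)


def describe(text):
    """Build the message from the named parts, independent of their position."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return ("What to " + match.group("predicate").lower()
            + "? The " + match.group("object") + "!")


print(describe("Save the planet"))  # What to save? The planet!
```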

The capture variable $0

There is also a special capture variable $0. It contains the input captured by the currently executed parselet the sequence belongs to. A parselet is a function that consumes some sort of input, which will be discussed later.

Let's see how all capture variables, including $0, are growing when the items from the examples above are being executed.

Capture      $1                 $2       $3          $4           $5
Alias        $predicate                                           $object
Item         predicate => Word  __       'the'       __           object => Word
Input        "Save"             " "      "the"       " "          "planet"
$0 contains  "Save"             "Save "  "Save the"  "Save the "  "Save the planet"

As you can see, $0 always contains the input matched so far from the start of the parselet.
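The growth of $0 is a running concatenation of the consumed input. A small Python illustration, in which the list of pieces is taken from the table and the function name is made up:

```python
from itertools import accumulate


def consumed_so_far(pieces):
    """Return the accumulated input after each consumed piece."""
    return list(accumulate(pieces))  # the default operation is +, here string concatenation


pieces = ["Save", " ", "the", " ", "planet"]
for index, consumed in enumerate(consumed_so_far(pieces), start=1):
    print(f"after ${index}: $0 contains {consumed!r}")
```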

$0 can also be assigned to any other value, which makes it the result of the parselet in case no other result of higher precedence was set.

Sequence interruption

todo

Conclusion

Sequences define occurrences of items. An item inside of a sequence can have a meaningful alias.

Every item of a sequence that has been executed is called a capture, and can be accessed using context-variables, either by its offset (position of occurrence) like $1, $2, $3 or by its alias, like $predicate.

The special capture $0 provides the consumed information read so far by the parselet, and can also be set to a value.

Blocks

Sequences are organized in blocks. Blocks may contain several sequences, which are executed in order of their definition. Every sequence inside of a block is separated by a newline.

The main scope of a Tokay program is also an implicit block, therefore it is not necessary to start every program with a new block.

Newlines

In Tokay, newlines (line-breaks, \n respectively) are meaningful. They separate sequences from each other, as you will learn in the next section.

"1st" "sequence"
"2nd" "sequence"
"3rd" "sequence"

Instead of a newline, a semicolon (;) can also be used, which has the same meaning. A single-line sequence can be split into multiple lines by placing a backslash (\) in front of the line-break.

"1st" \
    "sequence"
"2nd" "sequence" ; "3rd" "sequence"

The first and second example are literally the same.
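The equivalence of the two notations can be checked with a deliberately naive Python sketch. It is not how Tokay reads source code: the hypothetical function split_sequences ignores comments and would break on strings that contain a semicolon or several spaces, which is acceptable for these two examples.

```python
def split_sequences(source):
    """Split source text into sequences (naive: no strings with ';', no comments)."""
    joined = source.replace("\\\n", " ")       # backslash + line-break continues the line
    sequences = []
    for line in joined.split("\n"):            # a newline ends a sequence
        for piece in line.split(";"):          # a semicolon does the same
            piece = " ".join(piece.split())    # collapse runs of whitespace
            if piece:
                sequences.append(piece)
    return sequences


first = '"1st" "sequence"\n"2nd" "sequence"\n"3rd" "sequence"\n'
second = '"1st" \\\n    "sequence"\n"2nd" "sequence" ; "3rd" "sequence"\n'

print(split_sequences(first) == split_sequences(second))  # True
```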

Concepts

Terminology

Identifiers

Naming rules for identifiers in Tokay differ from other programming languages, and this is an essential feature.

  1. As known from other languages, identifiers may not start with any digit (0-9).
  2. Variables need to start with a lower-case letter from (a-z)
  3. Constants need to start either
    • with an upper-case letter (A-Z) or an underscore (_) when they refer to consumable values,
    • otherwise they can also start with a lower-case letter from (a-z).

Some examples for better understanding:

# Valid
pi : 3.1415
mul2 : @x { x * 2 }
Planet : @{ 'Venus' ; 'Earth'; 'Mars' }
the_Tribe = "Apache"

# Invalid
Pi : 3.1415  # float value is not consumable
planet : @{ 'Venus' ; 'Earth'; 'Mars' }  # identifier must specify consumable
The_Tribe = "Cherokee"  # Upper-case variable name not allowed

9th = 9  # interpreted as '9 th = 9'

More about consumable and non-consumable values, variables and constants will be discussed later.

Variables and constants

Symbolic identifiers for named values can either be defined as variables or constants.

variable = 0  # assign 0 to a variable
constant : 0  # assign 0 to a constant

At first glance, both look the same: variable becomes 0, and so does constant. Let's try to modify these values afterwards.

variable += 1  # increment variable by 1
constant += 1  # throws compile error: Cannot assign to constant 'constant'

Now variable becomes 1, but constant can't be assigned and Tokay throws a compile error. What you can do is to redefine the constant with a new value.

variable++    # increment variable by 1
constant : 1  # re-assign constant to 1

The reason is that variables are evaluated at runtime, whereas constants are evaluated at compile-time, before the program is executed.

The distinction between variables and constants is a tradeoff between flexibility and predictability to make different concepts behind Tokay possible. The values of variables aren't known at compile-time, therefore predictive construction of code depending on the values used is not possible. On the other hand, constants can be used before their definition, which is very useful when thinking of functions being called by other functions before their definition.

Callables and consumables

From the object types presented above, tokens and functions have the special properties that they are callable and possibly consumable.

  • Tokens are always callable and considered to consume input
  • Functions are always callable and are named
    • parselets when they consume input by either using tokens or a consumable constant
    • functions when they don't consume any input

For variables and constants, special naming rules apply which allow Tokay to determine a symbol type based on its identifier only.

todo: This section is a stub. More examples and detailed explanations needed here.

Scopes

Variables and constants are organized in scopes.

  1. A scope is any block, and the global scope.
  2. Constants can be defined in any block. They can be re-defined by other constants in the same or in subsequent blocks. Constants being re-defined in a subsequent block are valid until the block ends, afterwards the previous constant will be valid again.
  3. Variables are only distinguished between global and local scope of a parselet. Unknown variables used in a parselet block are considered as local variables.

Here's some commented code for clarification:

x = 10  # global scope variable x
y : 2000  # global scope constant y
z = 30  # global scope variable z

# entering new scope of function f
f : @x {  # x is overridden as local variable
    y : 1000  # local constant y overrides global constant y temporarily in this block
    z += y + x # adds local constant y and local value of x to global value of z
}

f(42)

# back in global scope, x is still 10, y is 2000 again, z is 1072 now.
x y z
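The arithmetic in the final comment can be verified with a loose Python analogy. Python has no compile-time constants, so the constant y is modelled here as an ordinary name Y that is shadowed inside the function. Python also needs an explicit global statement where the Tokay example simply uses the known global variable z.

```python
x = 10      # global variable x
Y = 2000    # stands in for the global constant y
z = 30      # global variable z


def f(x):            # the parameter x shadows the global x
    global z         # z refers to the global variable
    Y = 1000         # local Y shadows the global Y inside f only
    z += Y + x       # 30 + 1000 + 42


f(42)

print(x, Y, z)  # 10 2000 1072
```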

Values

Let's discuss the meaning of values in Tokay next. Values are used everywhere, even when it's not directly obvious. Generally speaking, everything in Tokay is some kind of value or part of a value.

Atomic values are one of the following.

void           # value representing just nothing
null           # value representing a defined "set to null"
true false     # boolean values
42 -23         # signed 64-bit integers
3.1415 -1.337  # signed 64-bit floats
"Tokay 🦎"     # unicode strings

Values can also be one of the following objects.

# list of values
(42, true, "yes")
(42 true "yes")

# dictionary (dict), a map of key-value-pairs
(i => 42, b => true, status => "success")
(i => 42 b => true status => "success")

# tokens are callables consuming input from the stream
'touch'    # silently touch a string in the input
''match''  # verbosely match a string from the input
[A-Z0-9]+  # matching a sequence of valid characters
Integer    # built-in token for parsing and returning Integer values

# functions and parselets are callable, enclosed blocks of code
f : @x{ x * 2 }
f(9)  # 18

@x{ x * 3 }(5)  # 15, returned by anonymous function that is called in-place

Objects are discussed in detail in a later chapter below.

Captures

Items in sequences are captured during execution. They are temporarily pushed onto and held on a stack for later access. It is possible to access previously captured items using capture variables. Capture variables start with a dollar-sign ($) followed by either an index, an alias name or any Tokay expression which dynamically evaluates to an index or an alias name.

Given the expression

first => Word  _  second => Word  _  third => Word

executed on the input

Save the planet

the sequence and input can be broken down into the following components.

Capture      $1             $2       $3              $4           $5
Alias        $first                  $second                      $third
Sequence     first => Word  _        second => Word  _            third => Word
Input        "Save"         " "      "the"           " "          "planet"
$0 contains  "Save"         "Save "  "Save the"      "Save the "  "Save the planet"

As you can see, $0 always contains the input matched so far from the start of the capture.

Tokay also allows assigning values to captures. This makes it possible to use captures directly like any other variable inside of the sequence and any subsequent blocks that belong to the sequence.

# planets2.tok
Name {
    if $1 == "Earth" {
        $1 = "Home"
    }
    else if $1 == "Mars" || $1 == "Venus" {
        $1 += " (neighbour)"
    }
}
$ tokay planets2.tok -- "Mercury Venus Earth Mars Jupiter"
("Mercury", "Venus (neighbour)", "Home", "Mars (neighbour)", "Jupiter")
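For comparison, the same rewriting can be expressed in Python. This is an analogy and not Tokay code: Name is assumed to match each whitespace-separated word, and the function names rename and run are hypothetical.

```python
def rename(name):
    """Mirror the branches of planets2.tok for a single captured name."""
    if name == "Earth":
        return "Home"
    if name in ("Mars", "Venus"):
        return name + " (neighbour)"
    return name


def run(text):
    # Assumption: Name matches each whitespace-separated word of the input.
    return tuple(rename(name) for name in text.split())


print(run("Mercury Venus Earth Mars Jupiter"))
# ('Mercury', 'Venus (neighbour)', 'Home', 'Mars (neighbour)', 'Jupiter')
```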

Objects

Tokens

Tokens are the fundamental building blocks used to process input. Tokay implements first-level tokens which directly consume input. Calls to parselets, which are functions consuming input, are considered second-level tokens, and are therefore tokens as well.

Touch & match

To match exact strings of characters from the input, like keywords, the match and touch token types are used. So far, touch was mostly used in our examples, but match is also useful, depending on the use case.

'Touch'    # match string in the input and discard
''Match''  # match string in the input and take

The only difference between the two types is that a match has a higher severity than a touch, and will be recognized within automatic value construction. Both types can be referred to by capture variables, therefore

'Match' $1

produces the same result as a direct match.

Check out the following one-liner: when executed on the input 1+2-3+4, it will return (1, "+", (2, (3, "+", 4))). The matches on plus (''+'') are taken into the result, whereas the touches on minus ('-') are discarded.

E : { E ''+'' E ; E '-' E; Integer }; E

Character-classes

Character tokens are expressed as character-classes known from regular expressions. They are encapsulated in brackets [...] and allow for a specification of ranges or single characters.

  • Single Characters are either specified by a Unicode-character or an escape sequence
  • Ranges are delimited by a dash (-). If a Max-Min-Range is specified, it is automatically converted into a Min-Max-Range, so [z-a] becomes [a-z].
  • If a dash (-) should be part of the character-class, it should be specified first or last.
  • If a circumflex (^) is specified as first character in the character-class, the character-class will be inverted, so [^a-z] matches everything except a to z.

[a]           # just "a"
[az]          # either "a" or "z"
[abc]         # "a", "b" or "c"
[a-c]         # "a", "b" or "c" also
[a-zA-Z0-9_]  # any ASCII letter, digit or underscore
[^0-9]        # Any character except ASCII digits
[-+*/]        # Mathematical base operators (minus-dash first!)
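The rules listed above can be sketched as a tiny Python interpreter for character-classes. It is a simplified illustration, not Tokay's implementation: escape sequences are not handled, and the function names parse_char_class and matches are made up.

```python
def parse_char_class(spec):
    """Parse a class like "[a-z0-9_]" into (negate, list of (low, high) ranges)."""
    if not (spec.startswith("[") and spec.endswith("]")):
        raise ValueError("character-class must be enclosed in [...]")
    body = spec[1:-1]
    negate = body.startswith("^")          # leading circumflex inverts the class
    if negate:
        body = body[1:]
    ranges = []
    i = 0
    while i < len(body):
        low = body[i]
        # A dash between two characters forms a range.
        if i + 2 < len(body) and body[i + 1] == "-":
            high = body[i + 2]
            if low > high:                 # [z-a] is normalized to [a-z]
                low, high = high, low
            ranges.append((low, high))
            i += 3
        else:                              # single character, including a first or last dash
            ranges.append((low, low))
            i += 1
    return negate, ranges


def matches(spec, char):
    """Tell whether a single character belongs to the class."""
    negate, ranges = parse_char_class(spec)
    hit = any(low <= char <= high for low, high in ranges)
    return hit != negate


print(parse_char_class("[z-a]"))   # (False, [('a', 'z')])
print(matches("[^0-9]", "5"))      # False
print(matches("[-+*/]", "-"))      # True
```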

Parselets

begin, end

coming soon

accept, reject

coming soon

repeat

coming soon

Functions

A function is introduced by an at-character (@), optionally followed by a parameter list. The function's body is obligatory, but can also consist of just a sequence or an item. Functions are normally assigned to constants, but can also be assigned to variables, with some loss of flexibility, but opening up other features.

# f is a function
f : @x = 1 {
    print("I am a function, x is " + x)
}

f        # calls f, because it has no required parameters!
f()      # same as just f
f(5)     # calls f with x=5
f(x=10)  # calls f with x=10

Tokay functions that consume input are called parselets. It depends on the function's body whether it is considered a function or a parselet. Generally, when talking about parselets in Tokay, the term is used as shorthand for both functions and real parselets.

# P is a parselet, as it uses a consuming token
P : @x = 1 {
    Word print("I am a parselet, x is " + x)
}

P        # calls P, because it has no required parameters!
P()      # same as just P
P(5)     # calls P with x=5
P(x=10)  # calls P with x=10

Control structures

In comparison to many other languages, control structures in Tokay are part of expressions. They always return a value, which defaults to void when no other value is explicitly returned.

if...else

The if...else-construct implements conditional branching depending on the result of an expression.
The else part is optional, and can be omitted.

if sense == 42 && axis == 23.5 {
    print("Well, this is fine!")
}
else {
    print("That's quite bad.")
}

As stated before, all control structures are part of Tokay's expression syntax. The above example can easily be turned into

print(
    if sense == 42 && axis == 23.5
        "Well, this is fine!"
    else
        "That's quite bad."
)

or directly used inside of an expression.

# if can be part of an expression
Word "Hello " + if $1 == "World" "Earth" else $1

if...else constructs working on static expressions are optimized away during compile-time.

loop

The loop-keyword is used to create loops, either with an abort condition on top or without any condition.

# Countdown
count = 10
loop count >= 0 print(
    if --count == 3
        "Ignition"
    else if count < 0
        "Liftoff"
    else
        count
)

A loop can be aborted at any time using the break-statement.
The continue-statement restarts the loop at the beginning, and an abort condition, if present, will be re-checked.

count = 10
loop {
    count = count - 1
    if count == 3 {
        print("Ignition")
        continue
    }

    print(count)
    if count == 0 {
        print("Liftoff")
        break
    }
}
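To follow the control flow step by step, here is the same countdown in Python, with the printed lines collected in a list so that the order is easy to inspect. The function name countdown is made up for illustration.

```python
def countdown(start=10):
    """Python rendering of the loop with continue and break."""
    output = []
    count = start
    while True:                 # loop without an abort condition
        count = count - 1
        if count == 3:
            output.append("Ignition")
            continue            # skip the rest, start the next pass
        output.append(count)
        if count == 0:
            output.append("Liftoff")
            break               # leave the loop
    return output


print(countdown())
# [9, 8, 7, 6, 5, 4, 'Ignition', 2, 1, 0, 'Liftoff']
```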

A loop without any aborting condition loops forever.

loop print("Forever!")

for

The for-keyword introduces a special form of loop that syntactically glues the parts initialization, abort condition and iteration together into a separate syntactic element.

for count = 10; count >= 0; count-- {
    print(count)
}
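The three parts of the for-head correspond to the pieces of an ordinary head-controlled loop. The Python sketch below spells them out, assuming the loop body prints the loop variable count; the function name for_countdown is hypothetical.

```python
def for_countdown():
    """Spell out: for count = 10; count >= 0; count-- { ... }"""
    output = []
    count = 10                  # initialization, runs once
    while count >= 0:           # abort condition, checked before every pass
        output.append(count)    # loop body
        count -= 1              # iteration step, runs after the body
    return output


print(for_countdown())  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```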

Appendix

Appendix A: Keywords

In Tokay, the following keywords are reserved words for control structures, values and special operators.

  • accept - accept parselet, optionally with a return value
  • begin - sequence to execute at beginning of a parselet
  • break - break from a loop, optionally with a return value
  • continue - restart iteration in a loop
  • else - fallback for if constructs
  • end - sequence to execute at end of a parselet
  • exit - stop program execution, optionally with exit code
  • expect - operator for consumable that expects the consumable and throws an error if not present
  • false - the false value
  • for - head-controlled for loop
  • if - branch based on the result of a conditional expression
  • in - part of the for-loop syntax
  • loop - head-controlled loop with an optional abort condition
  • next - continue with next sequence in a block
  • not - operator for consumable that satisfies when the consumable is not consumed
  • null - the null value
  • peek - operator for consumable that satisfies when consumable is consumed but the reader rolls back afterwards
  • push - accept a sequence by pushing a value
  • reject - reject parselet as not being consumed
  • repeat - repeat parselet, optionally push a result
  • return - same as accept, but with a meaning for ordinary functions
  • true - the true value
  • void - the void value

Appendix B: Operators

Tokay implements the following operators for use in expressions. The operators are ordered by precedence, operators in the same row share the same precedence.

Operator         Description                                Associativity
= += -= *= /=    Assignment, combined operation-assignment  left
||               Logical or                                 left
&&               Logical and                                left
== != < <= >= >  Equal, unequal, comparison                 left
+ -              Add, subtract                              left
* /              Multiply, divide                           left
- !              Negate, not                                right
++ --            Increment, decrement                       right
() [] .          Grouping, subscript, attribute             left

Operators produce different results depending on the data-types of their operands. For example, 3 * 10 multiplies 10 by 3, whereas 3 * "test" creates a new string repeating "test" 3 times. Try out the results of different operands in a Tokay REPL for clarification.

Appendix C: Modifiers

Tokay allows the following modifiers to be used on calls to consumable values. Modifiers are used to describe repetitions or optional occurrences of consumables.

Modifier  Description                        Examples
+         Positive repetition (one or many)  't'+, P(n=3)+
?         Optional (one or none)             't'?, P(n=3)?
*         Kleene star (none or many)         't'*, P(n=3)*

Redundancy with expression operators

You might have noticed that the operators + and * are used as operators for add and multiply as well. To clarify meaning, all modifiers stick to the token they belong to, and no whitespace is accepted between them. Modifiers are only allowed on tokens and parselet calls, and nowhere else, as it makes no sense.

Here are some examples for clarification:

't' * 3    # match 't' and repeat the result 3 times
't'* * 3   # match 't' none or multiple times and repeat the result 3 times
't' * * 3  # syntax error
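The three modifiers carry the same meaning as the quantifiers of the same name in regular expressions. The Python snippet below uses the standard re module to show which inputs each one accepts. It demonstrates the quantifiers only and is not Tokay code; the helper accepts is made up.

```python
import re


def accepts(pattern, text):
    """True if the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern in ("t+", "t?", "t*"):
    results = [accepts(pattern, text) for text in ("", "t", "ttt")]
    print(pattern, results)
# t+ [False, True, True]    one or many
# t? [True, True, False]    one or none
# t* [True, True, True]     none or many
```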

Appendix D: Builtins

Functions

Tokens

The following tokens are built into Tokay and can be used immediately. Programs can override these constants on demand.

Token              Token+              Description
Alphabetic         Alphabetics         All Unicode characters having the Alphabetic property
Alphanumeric       Alphanumerics       The union of Alphabetic and Numeric
Any / .            -                   Any character, except EOF
Ascii              Asciis              All characters within the ASCII range.
AsciiAlphabetic    AsciiAlphabetics    All ASCII alphabetic characters [A-Za-z]
AsciiAlphanumeric  AsciiAlphanumerics  ASCII alphanumeric characters [0-9A-Za-z]
AsciiControl       AsciiControls       All ASCII control characters [\x00-\x1F\x7f]. SPACE is not a control character.
AsciiDigit         AsciiDigits         ASCII decimal digits [0-9]
AsciiGraphic       AsciiGraphics       ASCII graphic character [!-~]
AsciiHexdigit      AsciiHexdigits      ASCII hex digits [0-9A-Fa-f]
AsciiLowercase     AsciiLowercases     All ASCII lowercase characters [a-z]
AsciiPunctuation   AsciiPunctuations   All ASCII punctuation characters [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
AsciiUppercase     AsciiUppercases     All ASCII uppercase characters [A-Z]
AsciiWhitespace    AsciiWhitespaces    All characters defining ASCII whitespace [ \t\n\f\r]
Control            Controls            All Unicode characters in the controls category
Digit              Digits              ASCII decimal digits [0-9]
EOF                -                   Matches End-Of-File.
Lowercase          Lowercases          All Unicode characters having the Lowercase property
Numeric            Numerics            All Unicode characters in the numbers category
Uppercase          Uppercases          All Unicode characters having the Uppercase property
Whitespace         Whitespaces         All Unicode characters having the White_Space property
Void               -                   The empty token, which consumes nothing, but is still considered consuming.

The respective properties of the built-in character classes are described in Chapter 4 (Character Properties) of the Unicode Standard and specified in the Unicode Character Database in DerivedCoreProperties.txt.

Appendix E: Escape sequences

Escape sequences can be used inside of strings, match/touch tokens and character-classes to encode any unicode character. They are introduced with a backslash.

Escape-sequences should be used to simplify the source code and its readability, but any unicode character can also be directly expressed.

Sequence              Description                                                                                 Examples
\a \b \f \n \r \t \v  Bell (alert), backspace, formfeed, new line, carriage return, horizontal tab, vertical tab  "\a\b\f\n\r\t\v"
\' \" \\              Quotation marks, backslash                                                                  "\'\"\\" # '"\
\ooo                  ASCII character in octal notation                                                           "\100" # @
\xhh                  ASCII character in hexadecimal notation                                                     "\xCA" # Ê
\uhhhh                16-Bit Unicode character in hexadecimal notation                                            "\u20ac" # €
\Uhhhhhhhh            32-Bit Unicode character in hexadecimal notation                                            "\U0001F98E" # 🦎
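Python string literals happen to use the same notation for these escape sequences, so the examples in the table can be cross-checked there. The snippet shows Python's behaviour only; that Tokay decodes each sequence identically is taken from the table above and not verified here.

```python
# Each pair holds an escaped literal and the character it is expected to denote.
examples = [
    ("\100", "@"),          # octal
    ("\xCA", "Ê"),          # hexadecimal
    ("\u20ac", "€"),        # 16-bit Unicode
    ("\U0001F98E", "🦎"),   # 32-bit Unicode
]

for escaped, literal in examples:
    print(f"U+{ord(escaped):04X}", literal, escaped == literal)

# Code points of the single-letter escapes \a \b \f \n \r \t \v
print([ord(char) for char in "\a\b\f\n\r\t\v"])  # [7, 8, 12, 10, 13, 9, 11]
```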