Preface

This documentation was updated to be used with Tokay 0.6.
Some features might not work as expected when used with a more recent version.

Like Tokay itself, this documentation is currently under development and unfinished.
If you find any mistakes or can explain things better, please contribute!
To do so, visit https://github.com/tokay-lang.

Tokay is a programming language designed for ad-hoc parsing and text processing. Tokay programs operate directly on input streams that are read from files, strings, piped commands or any other device emitting characters.

The following example is a short Tokay program that illustrates how Tokay works. It recognizes either "Hello Mercury", "Hello Venus" or "Hello Earth" from a text stream. Any other input is automatically skipped.

'Hello' _ {
    'Mercury'
    'Venus'
    'Earth'
}

Unlike general-purpose programming languages like Rust or Python, Tokay requires no explicit branching, substring extraction, or reading from input. Instead, these operations are directly built into the language.

If you're familiar with AWK, you might find the syntax in the previous example to be similar to the PATTERN { action } syntax. This approach is recursive in Tokay, so that the action-part can also be treated as a pattern, or as plain action code. This highlights a core tenet of Tokay's design and its key difference from AWK: instead of using a line-based execution model, Tokay takes a token-based approach that permits operating on anything matched from the input. This enables Tokay programs to operate on recursive structures that can be expressed by a grammar.

Getting started

Installation

Currently, Tokay is in a very early project state. Therefore, you have to build it from source, using the Rust programming language and its build tool cargo.

Once you have Rust installed, install Tokay with

$ cargo install tokay

Once done, you can start the Tokay REPL with

$ tokay
Tokay 0.6.0
>>> print("Hello Tokay")
Hello Tokay
>>>

You can exit the Tokay REPL with Ctrl+C.

The next examples show the REPL prompt >>> together with a given input and output. The output may differ when other input is provided.

Usage

Invoking the tokay command without any arguments starts the REPL (read-eval-print loop). This allows you to enter expressions or even full programs interactively and see their results directly.

# Start a repl
$ tokay

# Start a repl working on an input stream from file.txt
$ tokay -- file.txt

# Start a repl working on the input string "gliding is flying with the clouds"
$ tokay -- "gliding is flying with the clouds"
Tokay 0.6.0
>>> Word(5)
("gliding", "flying", "clouds")
>>>

In case you compile and run Tokay from source, use cargo run -- with any desired parameters instead of the tokay command.

The next command runs the Tokay program from the file program.tok:

# Run a program from a file
$ tokay program.tok

To work directly on files as input streams, proceed as shown next. Further files can be specified and are processed by the same program sequentially. It's also possible to read from stdin using the special filename -.

# Run a program from a file with another file as input stream
$ tokay program.tok -- file.txt

# Run a program from a file with multiple files as input stream
$ tokay program.tok -- file1.txt file2.txt file3.txt

# Run a program from a file with files or strings as input stream
$ tokay program.tok -- file1.txt "gliding is fun" file2.txt

# Pipe input through tokay
$ cat file.txt | tokay program.tok -- -

A Tokay program can also be specified directly as the first parameter. This call just prints the content of the specified files:

# Directly provide program via command-line parameter
$ tokay 'print(Char+)' -- file.txt

tokay --help will give you an overview of further parameters and invocation options.

First steps

Tokay programs are made of items, sequences and blocks.

Items

An item can be an expression, a function or token call, or a statement. The following are all items.

# Expression
>>> 2 + 3 * 5
17

# Assignment of an expression to a variable
>>> i = 2 + 3 * 5

# Using a variable within an expression
>>> i + 3
20

# Conditional if-statement
>>> if i == 17 "yes" else "no"
"yes"

# Function call
>>> print("hello" + i)
hello17

# Method call
>>> "hello".upper * 3
"HELLOHELLOHELLO"

# Token call ("hello" is read by Word(3) from the input stream)
>>> Word(3)
"hello"

# Token call in an expression (42 is read by Int from the input stream)
>>> Int * 3
126

Sequences

Sequences are multiple items in a row. Items in a sequence can optionally be separated by commas, but this is not mandatory. Sequences are delimited by either a line-break or a semicolon (;).

# A sequence of items with the same weighting results in a list
>>> 1 2 3
(1, 2, 3)

# This works also comma-separated
>>> 1, 2, 3
(1, 2, 3)

# This is a sequence of lists (indeed, lists are sequences, too)
>>> (1 2 3) (4 5 6)
((1, 2, 3), (4, 5, 6))

# Two sequences in one row; only last result is printed in REPL.
>>> (1 2 3); (4 5 6)
(4, 5, 6)

# This is a simple parsing sequence, accepting assignments like
# "i=1" or "number = 123"
>>> Ident _ '=' _ Int
("number", 123)

# This is a version of the same sequence constructing a dictionary
# rather than a list
>>> name => Ident _ '=' _ value => Int
(name => "number", value => 123)

Blocks

Finally, sequences are organized in blocks. The execution of a sequence is influenced by failing token matches or by special keywords (like push and next, or accept and reject), which either enforce the execution of the next sequence in the block, or accept or reject an entire parselet, which can be thought of as a function. The main parselet is the parselet executing the main block, which is also where the REPL runs.

A block itself is also an item inside a sequence of another block (or of the main block). A new block is defined by { and }.

The next piece of code already demonstrates Tokay's parsing features together with a parselet and two blocks, implementing an assignment grammar for either float or integer values, and some error reporting.

# Parselet definition of Assignment (identified by the @{...}-block)
# Match an identifier, followed by either a float or an integer;
# Throws an error on mismatch.
>>> Assignment : @{
    Ident _ '=' _ {
        Float
        Int
        error("Expecting a number here")
    }
}

# Given input "i = 23.5"
>>> Assignment
("i", 23.5)

# Given input "i = 42"
>>> Assignment
("i", 42)

# Given input "i = j"
>>> Assignment
Line 1, column 5: Expecting a number here

Writing comments

It is good practice to document source code and explain what's going on using comments. Like Bash, Python or AWK, Tokay supports line-comments starting with a hash (#). The rest of the line will be ignored.

# This is my little program

print("Hello World")  # printing welcome message to the user
hash = "# this is a string"  # assign "# this is a string" to hash.

Shebang

Because of this comment syntax, a shebang is also possible, in case a Tokay source file shall be directly executable.

#!/bin/tokay
print("Hello World")

This assumes tokay is installed to /bin on a POSIX-like system.

$ ls -lta hello.tok
-rwxr-xr-x  hello.tok
$ ./hello.tok
Hello World

Basics

Basically, a Tokay program is made of

  • Items
  • Sequences
  • Blocks

All these belong together and depend on each other in some way.

The following program demonstrates the usage of items, sequences and blocks in action:

{ # A block...
    # ... is made of sequences
    'Hello' _ Name \
        count_hello++   # ... which are made of items (4 items here).

    'Goodbye' _ {  # an item of a sequence can be a block again
        'Max'  count_bye_max++  # ... which contains other sequences...
        Name   count_bye++      # ... made of items again.
    }

    {}  # a sequence with an empty block as its item
}

This program is a little parser, which looks for greetings in some input.

  • The occurrence of e.g. Hello Jan or Hello Max causes the variable count_hello to be incremented,
  • the occurrence of e.g. Goodbye Jan increments the counter count_bye, but
  • an occurrence of Goodbye Max, which is a special case here, increments count_bye_max.

If you are familiar with the AWK programming language, you might see some similarities to the PATTERN { action }-syntax here.

In Tokay, PATTERN can be any sequence of items that needs to match first, and { action } can hold further PATTERN { action } components.

Items

Items are the atomic parts of sequences, and represent values.

The following examples for items are direct values that, once specified, stay on their own.

123               # the number 123
true              # the boolean value for truth
"Tokay 🦎"        # a unicode string

Items can also be the result of expressions or calls to callable objects.

"a " + "string"   # concatenating a string
42 * 23.5         # the result of a multiplication
'check'           # the occurrence of the string "check" in the input
Int               # calling a built-in token for parsing integer values
func(42)          # calling a function
++count           # the incremented value of count

But items can also be more complex.

x = count * 23.5  # the result of a calculation is assigned to a variable

This is an assignment, and it always produces the item value void, which simply means "nothing". This is because the result of the calculation is stored in a variable, yet the item still has to represent some value.

Here's another item:

if x > 100 "much" # conditional expression, which is either "much" or void

This if-clause allows for conditional programming. It produces a string when the provided condition is met, and otherwise produces void.

This behavior can be changed by providing an else-branch next, like this:

if x > 100 "much" else "less"

As you see, every single value, call, expression or control-flow statement is considered to be an item.

A block is an item as well, but this will be discussed later.

Severities

This is not important for your first steps and programs with Tokay, but it is a fundamental part of the magic behind Tokay's automatic value construction, which will be discussed later. You should know about it!

Every item has a severity, which defines its value's "weight".

Tokay currently knows 4 levels of severity:

  1. Whitespace
  2. Match
  3. Value
  4. Result

The severity of an item depends on how it is constructed. For example:

123               # pushes 123 with severity 3
_                 # matches whitespace
'check'           # matches "check" in the input and pushes it as a match
''check''         # matches "check" in the input and pushes it as a value
'check' * 3       # matches "check" in the input and repeats it 3 times, resulting in a value
push "yes"        # pushes the result value "yes"

Right now, this isn't so important, and you don't have to keep it in mind all the time. It will become useful during the next chapters, especially when writing programs that parse or extract data from something.

Conclusion

In conclusion, an item is the result of some expression which always stands for a value. An item in turn is part of a sequence. Every item has a hidden severity, which is important for constructing values from sequences later on.

Sequences

Sequences are occurrences of items in a row.

Here is a sequence of three items:

1 2 3 + 4    # results in a list (1, 2, 7)

For better readability, items of a sequence can be optionally separated by commas (,), so

1, 2, 3 + 4  # (1, 2, 7)

encodes the same.

All items of a sequence with a given severity are used to determine the result of the sequence. Therefore, these sequences return (1, 2, 7) in the above examples when entered in a Tokay REPL. This is due to the severities the items carry.

The end of a sequence is delimited by a line-break, but a sequence can be wrapped over multiple lines by placing a backslash before the line-break. So

1, 2 \
3 + 4  # (1, 2, 7)

also means the same as above.

Captures

The already executed items of a sequence are captured, so they can be accessed inside of the sequence using capture variables.

In the next example, the first capture, which holds the result 7 of the expression 3 + 4, is referenced with $1 and used as a value in the second item's expression. Referencing a capture which is out of bounds just returns void.

3 + 4, $1 * 2  # (7, 14)

Captures can also be re-assigned by subsequent items. The next example assigns a value to the first capture from within the second item, and uses the first capture inside of the calculation. The second item, which is the assignment, also exists as an item of the sequence and refers to void, as all assignments do.

This is also the reason why Tokay has two values that simply stand for nothing, void and null, where null has a higher precedence.

3 + 4, $1 = $1 * 2  # 14

The above sequence results in just one value, 14, because the second item's value, void, has a lower severity than the calculated value assigned to the first capture. This is the magic of sequences that you will soon figure out in detail, especially when tokens from streams are accessed and processed, when your programs work on information extracted from the input, and when automatic abstract syntax tree construction takes place.

As a last example, we briefly show how sequence items can also be named and accessed by a more meaningful name than just the index.

hello => "Hello", $hello = 3 * $hello  # (hello => "HelloHelloHello")

Here, the first item, which is referenced by the capture variable $hello, is repeated 3 times by the second item.

It might be surprising, but the result of this sequence is a dict, as shown in the comment. A dict is a hash table where values can be referenced by a key.

If you come from Python, you might already know about list and dict objects. Their behavior and meaning is similar in Tokay.

Parsing input sequences

As Tokay is a programming language with built-in parsing capabilities, let's see how parsing integrates with sequences and captures.

Given the sequence

Word __ ''the'' __ Word

we make use of the built-in token Word, which matches anything made of letters and digits, and the special constant __, which matches arbitrary whitespace, where at least one whitespace character must be present. Whitespace is anything represented by non-printable characters, like spaces or tabs.

We can now run this sequence on any input consisting of three words, where the word in the middle is "the". Let's say

Save the planet

and we get the output

("Save", "the", "planet")

To try it out, either start a Tokay REPL with $ tokay -- "Save the planet" and enter the sequence Word __ ''the'' __ Word afterwards, or directly specify both at invocation, like
$ tokay "Word __ ''the'' __ Word" -- "Save the planet".

You will see that no matter how much whitespace you insert, the result will always be the same. The reason for this is the item severities discussed earlier: whitespace, as consumed by the pre-defined constant __, has a lower severity and therefore won't make it into the result of the sequence.
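
For instance, a run on an input with extra whitespace (input string assumed for illustration) yields the same result:

$ tokay "Word __ ''the'' __ Word" -- "Save     the   planet"
("Save", "the", "planet")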

Using capture aliases

Captures can also have a name, called an "alias". This is ideal for parsing, to give items meaningful names and make them independent of their position.

predicate => Word __ 'the' __ object => Word

will output

(object => "planet", predicate => "Save")

In this example, the match for the word ''the'' was degraded to a touch 'the', which has a lower item severity and won't make it into the sequence result.

This was done to make the output more clear, and because "the" is only an article without relevance to the meaning of the sentence we try to parse.

Now we can also work with alias variables inside of the sequence

predicate => Word __ 'the' __ object => Word \
    print("What to " + $predicate.lower() + "? The " + $object + "!")

will output

What to save? The planet!

The advantage here is that we can change the sequence and add further items in between, without having to change all references to these items in the print function call, because they are identified by name rather than by their offset, which might have changed.

The capture variable $0

There is also a special capture variable, $0. It contains the input captured by the currently executed parselet that the sequence belongs to. A parselet is a function that consumes some sort of input; this will be discussed later.

Let's see how all capture variables, including $0, grow as the items from the examples above are executed.

Item                Capture   Alias        Input       $0 contains
predicate => Word   $1        $predicate   "Save"      "Save"
__                  $2                     " "         "Save "
'the'               $3                     "the"       "Save the"
__                  $4                     " "         "Save the "
object => Word      $5        $object      "planet"    "Save the planet"

As you can see, $0 always contains the input matched so far from the start of the parselet.

$0 can also be assigned any other value, which makes it the result of the parselet in case no other result of higher precedence was set.
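
As a hypothetical sketch (the parselet Greeting and its input are made up for illustration):

# $0 holds everything the parselet consumed so far; it may also be
# assigned to override the parselet's result
Greeting : @{
    'Hello' _ Word   print($0)   # e.g. prints "Hello Venus" for input "Hello Venus"
}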

Sequence interruption

todo

Conclusion

Sequences define occurrences of items. An item inside of a sequence can have a meaningful alias.

Every item of a sequence that has been executed is called a capture, and can be accessed using capture variables, either by their offset (position of occurrence) like $1, $2, $3, or by their alias, like $predicate.

The special capture $0 provides the consumed information read so far by the parselet, and can also be set to a value.

Blocks

Sequences are organized in blocks. Blocks may contain several sequences, which are executed in order of their definition. Every sequence inside of a block is separated by a newline.

The main scope of a Tokay program is also an implicit block, therefore it is not necessary to start every program with a new block.

Newlines

Newlines (line-breaks, i.e. \n) are meaningful and belong to the syntax of blocks.
They separate sequences inside a block from each other.

"1st" "sequence"
"2nd" "sequence"
"3rd" "sequence"

Instead of a newline, a semicolon (;) can also be used, which has the same meaning. A single sequence can be split across multiple lines by placing a backslash (\) in front of the line-break.

"1st" \
    "sequence"
"2nd" "sequence" ; "3rd" "sequence"

The first and the second example mean exactly the same.

Behavior

Blocks have two important purposes:

First, they group sequences into items of other sequences.

# typical use of a block
if x > 0 {
    x += 1
    print("x is now " + x)
}

Second, they provide alternations for sequences which consume input. Therefore, their behavior in Tokay differs from other programming languages. When none of the sequences inside a block consumes any input, the block behaves exactly as in other languages. But when a sequence consumes input, the block might stop executing further alternatives (= sequences) before the end of the block is reached.

# alternation behavior of a block, when used with tokens
'Hello' _ {
    checked = true  # always executed, consumes no input
    'World' print("Hello World")
    'Mars' print("Hello Mars")
    print("Hello Unknown")  # fallback case
}

Concepts

In this chapter we deal in detail with the basic concepts of Tokay.

Terminology

First of all, there are some terms which are often used in Tokay.

Names and identifiers

The naming rules for identifiers in Tokay differ from those of other programming languages, and this is an essential feature.

  1. No identifier may start with a digit (Char<0-9>).
  2. Variable names have to start with a lower-case letter (Char<a-z>).
  3. Constant names have to start
    • with an upper-case letter or an underscore (Char<A-Z_>) when they refer to consumable values,
    • otherwise they can also start with a lower-case letter, like variable names (Char<a-z>).

Some examples for better understanding:

# Valid
pi : 3.1415
mul2 : @x { x * 2 }
Planet : @{ 'Venus' ; 'Earth'; 'Mars' }
the_Tribe = "Cherokee"

# Invalid
Pi : 3.1415  # float value is not consumable
planet : @{ 'Venus' ; 'Earth'; 'Mars' }  # a consumable constant requires an upper-case identifier
The_Tribe = "Cherokee"  # Upper-case variable name not allowed

9th = 9  # valid, but is interpreted as sequence `9 th = 9`

More about consumable and non-consumable values, variables and constants is discussed shortly.

Variables and constants

Symbolic identifiers for named values can either be defined as variables or constants.

variable = 0  # assign 0 to a variable
constant : 0  # assign 0 to a constant

Obviously, this looks the same: variable becomes 0, and so does constant. Let's try to modify these values afterwards.

variable += 1  # increment variable by 1
constant += 1  # throws compiler error: Cannot assign to constant 'constant'

Now variable becomes 1, but constant can't be assigned, and Tokay throws a compile-time error. What you can do is redefine the constant with a new value.

variable++    # increment variable by 1
constant : 1  # re-assign constant to 1

The reason is that variables are evaluated at runtime, whereas constants are evaluated at compile-time, before the program is executed.

The distinction between variables and constants is a tradeoff between flexibility and predictability that makes the different concepts behind Tokay possible. The values of variables aren't known at compile-time, therefore predictive construction of code depending on those values is not possible. On the other hand, constants can be used before their definition, which is very useful when thinking of functions being called by other functions before their definition.
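
As a minimal sketch of this property (names chosen for illustration), a constant may be referenced before the line that defines it:

# 'twice' is used before its definition; this works because constants
# are resolved at compile-time
double : @x { twice(x) }
twice : @x { x * 2 }

double(21)   # 42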

Callables and consumables

Some values are callable. Generally, all tokens, functions and builtins are callable.

Here are some usages of callables:

int("123")        # Builtin int constructor
Int               # Builtin Int parser token
Char<A-Z>         # builtin char token
'Check'           # Touch token 'Check'
''Bold''          # Match token 'Bold'

s = "Hello"
s.upper           # calls method 'str_upper', returns "HELLO"
s[0]              # internally calls 'str_get_item', returns "H"

f : @x { x * 2 }  # function definition
f(42)             # function call, producing 84

Additionally, a callable can be attributed as consumable. This is the case when the callable either makes use of another consumable callable, or directly consumes input. Consumables are always identified by starting with either an upper-case letter or an underscore. A function which makes use of a consumable is called a parselet.

# invalid attempt; the parselet makes use of consumables,
# but is assigned to a name for a non-consumable constant.
assign : @{
    Ident _ expect '=' _ expect Expr
}

# creating a parselet
Assign : @{
    Ident _ expect '=' _ expect Expr
}

Scopes

Variables and constants are organized in scopes.

  1. Every block is a scope; additionally, there is the global scope.
  2. Constants can be defined in any block. They can be re-defined by other constants in the same or in subsequent blocks. Constants re-defined in a subsequent block are valid until that block ends; afterwards, the previous definition becomes valid again.
  3. Variables are only distinguished between the global scope and the local scope of a parselet. Unknown variables used in a parselet block are considered local variables.

Here's some commented code for clarification:

x = 10  # global scope variable x
y : 2000  # global scope constant y
z = 30  # global scope variable z

# entering new scope of function f
f : @x {  # x is overridden as local variable
    y : 1000  # local constant y overrides global constant y temporarily in this block
    z += y + x # adds local constant y and local value of x to global value of z
}

f(42)

# back in global scope, x is still 10, y is 2000 again, z is 1072 now.
x y z

Values

Generally, everything in Tokay is some kind of value or part of a value. The term "value" refers to both simple atomic values like booleans, numbers, strings but also objects which are partly mutable, recursive or callable.

Atomics

Atomic values stand on their own and are generally not mutable in the sense of an object. These values are the following:

void           # value representing just nothing
null           # value representing a defined "nothing"
true false     # boolean values

Using the builtin function bool(v), a boolean value can be constructed from any other value v, by testing for truth.
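
For example (results assumed from the usual truth testing):

bool(0)        # false
bool("")       # false
bool("Tokay")  # true
bool(23.5)     # true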

Numbers

Tokay supports int and float as numeric types.

42 -23         # signed integer number object of arbitrary size (bigint)
3.1415 -1.337  # signed 64-bit float number object

For numbers, the following methods can be used:

  • int(v) - constructs an int value from any other value v
  • float(v) - constructs a float value from any other value v
  • float.ceil() - returns the next integer ceiling of a float
  • float.fract() - returns only the fractional part of a float
  • float.trunc() - truncates the fractional part off a float
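
A small, assumed sketch of these helpers (exact return types not shown here):

x = 1.7
x.ceil()     # next integer ceiling of x
x.trunc()    # x with the fractional part truncated
x.fract()    # only the fractional part of x
int("42")    # 42
float(3)     # 3.0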

Strings

A string (str) is a unicode-character sequence of arbitrary length.

s = "Tokay 🦎"
s = s + " is cool"
s += "!"

str objects can be concatenated by the operators + and +=.
They can also be multiplied by the operators * and *=.

Additionally, they provide the following methods:

  • str(v) - constructs a string object from any other value v
  • str.byteslen() - returns the total number of bytes used by the string
  • str.endswith(s) - check if string ends with postfix s
  • str.join(l) - creates a string from the items of list l, delimited by str
  • str.len() - return number of characters in the string
  • str.lower() - turns any upper-case characters of the string into lower-case
  • str.replace(from, to="", n=void) - replaces occurrences of from by to, optionally limited to n replacements
  • str.startswith(s) - check if string begins with prefix s
  • str.substr(start=0, length=void) - returns a substring beginning at start with the given length, or up to the end
  • str.upper() - turns any lower-case characters of the string into upper-case
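
A few assumed examples of these methods:

s = "Hello World"
s.upper()               # "HELLO WORLD"
s.startswith("Hello")   # true
s.len()                 # 11
", ".join((1, 2, 3))    # "1, 2, 3"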

Lists

A list is a sequence of arbitrary values in a row. Therefore, a list can also contain further lists, or other complex objects. A list is also mutable, which means that items can be added or removed at runtime.

# list of values
(42, true, "yes")
l = (42 true "yes")
l[1] = false
l.push("🦎")
l.len()  # 4

Lists can be concatenated by the +- and +=-operators, and provide the following methods:

  • list(*args) - constructs a new list from all arguments provided
  • list.flatten() - integrates the items of nested lists into the list itself
  • list.len() - returns number of items in the list
  • list.push(item, index=void) - either appends an item to the list or inserts it at position index
  • list.pop(index=void) - either pops the last item off the list or removes and returns item at position index
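
A short, assumed usage sketch of the methods above:

l = (1, 2, 3)
l.push(4)    # l is now (1, 2, 3, 4)
l.pop()      # removes and returns the last item, here 4
l.len()      # 3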

Dicts

Dictionaries ("dicts") are hash tables or maps with key-value-pairs, where a value is referenced by using the key as its storage location.

# dictionary (dict), a map of key-value-pairs
(i => 42, b => true, status => "success", true => false)
d = (i => 42 b => true status => "success" true => false)
d["angle"] = 23.5  # add key "angle"
d["i"] = void  # remove key "i"

Dicts provide the following methods:

  • dict() - creates a new, empty dict
  • dict.clone() - creates an independent copy of the dict
  • dict.items() - returns a list of (key, value) lists
  • dict.keys() - returns a list of keys
  • dict.len() - returns the number of items in the dict
  • dict.merge(other) - merges another dict into the dict
  • dict.pop(k=void, d=void) - removes and returns the value of key k from the dict; returns d when the key is not present; when k is omitted, the last item is removed
  • dict.push(k, v) - inserts v under key k into the dict
  • dict.values() - returns a list of values
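
A short, assumed usage sketch of the methods above:

d = (a => 1, b => 2)
d.keys()     # ("a", "b")
d.values()   # (1, 2)
d.len()      # 2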

Tokens

Tokens are callables consuming input from the stream. They are object values as well. They always return a value parsed from the input stream in case the token matches. Otherwise, tokens usually reject the current block branch or parselet, to try other alternatives.

'touch'        # silently touch a string in the input (low severity)
''match''      # verbosely match a string from the input (high severity)
Char<A-Z0-9>+  # matching a sequence of multiple valid characters
Int            # built-in token for parsing and returning Integer values
Word(3)        # built-in token Word, matching words with a length of at least 3

In terms of parsing, tokens are the terminal symbols of a context-free grammar.

Functions and parselets

Functions are sub-programs that encapsulate a specific task or routine for reuse. A function can accept arguments with default values.

# function that doubles its value
f : @x { x * 2 }
f(9)  # 18

# anonymous function example
@x { x * 3 }(5)  # 15, returned by anonymous function that is called in-place

Parselets are more specific functions that consume input and are used for parsing. Conceptually they are the same as functions, but they are clearly distinguishable in their usage.

# parselet that parses simple assignments to variables
Assign : @{
    variable => Ident _ '=' _ value => Number
}

# called on a given input `n = 42`...
Assign
# ... returns dict `(variable => "n", value => 42)`

In terms of parsing, parselets are considered the non-terminal symbols of a context-free grammar.

Tokens

Tokens are the fundamental building blocks used to process input. Tokay implements first-level tokens which directly consume input; parselets, which are functions consuming input, are considered second-level tokens and act as tokens as well.

'touch' and ''match''

To match exact strings of characters from the input, like keywords, the match and touch token types are used. Touch has mostly been used in our examples so far, but match is also useful, depending on the use-case.

'Touch'    # match string in the input and discard
''Match''  # match string in the input and take

The only difference between the two types is that a match has a higher severity than a touch, and will be recognized by the automatic value construction. Both types of matches can be referenced by capture variables, therefore

'Match' $1

yields the same result as a direct match.

Check out the following one-liner: when executed on the input 1+2-3+4, it will return (1, "+", (2, (3, "+", 4))). The matches on plus (''+'') are taken into the result, while the touches on minus ('-') are discarded.

E : { E ''+'' E ; E '-' E; Int }; E

Char

To match a character, the Char-token is both builtin and part of Tokay's syntax.

  • Single characters are either specified by a Unicode-character or an escape sequence
  • Ranges are delimited by a dash (-). If a Max-Min-Range is specified, it is automatically converted into a Min-Max-Range, so Char<z-a> is equal to Char<a-z>.
  • If a dash (-) should be part of the character-class, it should be specified first or last.
  • If a circumflex (^) is specified as the first character in the character-class, the character-class will be inverted, so Char<^a-z> matches everything except a to z.

Char              # any character
Char<a>           # just "a"
Char<az>          # either "a" or "z"
Char<a-z>         # any character from "a" to "z"
Char<a-zA-Z0-9_>  # All ASCII digit or letter and underscore
Char<^0-9>        # Any character except ASCII digits
Char<-+*/>        # Mathematical base operators (minus-dash first!)

When the Char-token is used with the multiplicative operators + (one or many repetitions) or * (Kleene star, none or many), it is internally converted to a Chars version for better performance.
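
For illustration, the following two forms consume the same input, with the first one assumed to be handled internally like the second:

Char<0-9>+    # one or many ASCII digits
Chars<0-9>    # the internally used many-character form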

Builtin tokens

The following tokens are builtin and can be parametrized.

  • Ident - parses any C-style identifier name
  • Int(base=10, with_signs=true) - parses an int-value to the provided base, optionally with + or - signs
  • Float(with_signs=true) - parses a float-value, optionally with + or - signs
  • Number - parses Float or Int
  • Token - either parses Number, Word or AsciiPunctuation
  • Word(min=1, max=void) - parses any word, number, etc. with the specified min- and max-length
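
A brief, assumed example combining these builtins (input string chosen for illustration):

# given the input "Tokay 2024"
>>> Word(3) _ Int
("Tokay", 2024)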

Parselets

Currently this chapter is an unfinished work-in-progress.

Parselets are functions, which consume input.

begin, end

coming soon

accept, reject

coming soon

repeat

coming soon

Functions

A function is introduced by an at-character (@), optionally followed by a parameter list. The function's body is obligatory, but can also consist of just a sequence or a single item. Functions are normally assigned to constants, but can also be assigned to variables, with some loss of flexibility, but opening up other features.

# f is a function
f : @x = 1 {
    print("I am a function, x is " + x)
}

f        # calls f, because it has no required parameters!
f()      # same as just f
f(5)     # calls f with x=5
f(x=10)  # calls f with x=10

Tokay functions that consume input are called parselets. Whether a function is considered a function or a parselet depends on its body. Generally, when talking about parselets in Tokay, both functions and real parselets are meant, as a shorthand.

# P is a parselet, as it uses a consuming token
P : @x = 1 {
    Word print("I am a parselet, x is " + x)
}

P        # calls P, because it has no required parameters!
P()      # same as just P
P(5)     # calls P with x=5
P(x=10)  # calls P with x=10

Control structures

In comparison to many other languages, control structures in Tokay are part of expressions. They always return a value, which defaults to void when no other value is explicitly returned.

if...else

The if...else-construct implements conditional branching depending on the result of an expression.
The else part is optional, and can be omitted.

if sense == 42 && axis == 23.5 {
    print("Well, this is fine!")
}
else {
    print("That's quite bad.")
}

As stated before, all control structures are part of Tokay's expression syntax. The above example can easily be turned into

print(
    if sense == 42 && axis == 23.5
        "Well, this is fine!"
    else
        "That's quite bad."
)

or directly used inside of an expression.

# if can be part of an expression
Word "Hello " + if $1 == "World" "Earth" else $1

if...else constructs working on static expressions are optimized away during compile-time.
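
For instance, the condition in the following assumed snippet is static, so the whole expression can already be evaluated at compile-time:

if true "always" else "never"   # evaluates to "always"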

loop

The loop-keyword is used to create loops, either with an abort condition checked at the top, or without any condition.

# Countdown
count = 10
loop count >= 0 print(
    if --count == 3
        "Ignition"
    else if count < 0
        "Liftoff"
    else
        count
)

A loop can be aborted at any time using the break-statement.
The continue-statement restarts the loop from the beginning, but any abort condition present will be checked again first.

count = 10
loop {
    count = count - 1
    if count == 3 {
        print("Ignition")
        continue
    }

    print(count)
    if count == 0 {
        print("Liftoff")
        break
    }
}

A loop without any aborting condition loops forever.

loop print("Forever!")

for

The for-keyword introduces a special form of loop that syntactically glues initialization, abort condition and iteration together into one syntactic element.

for count = 10; count >= 0; count-- {
    print(count)
}

This syntax is deprecated. The upcoming version 0.7 of Tokay will only support for...in.

Appendix

Appendix A: Keywords

In Tokay, the following keywords are reserved words for control structures, values and special operators.

  • accept - accept parselet, optionally with a return value
  • begin - sequence to execute at the beginning of a parselet
  • break - break from a loop, optionally with a return value
  • continue - restart iteration in a loop
  • else - fallback for if constructs
  • end - sequence to execute at end of a parselet
  • exit - stop program execution, optionally with an exit code
  • expect - operator for consumable that expects the consumable and throws an error if not present
  • false - the false value
  • for - head-controlled for loop
  • if - branch based on the result of a conditional expression
  • in - part of the for-loop syntax
  • loop - head-controlled loop with an optional abort condition
  • next - continue with next sequence in a block
  • not - operator for consumable that satisfies when the consumable is not consumed
  • null - the null value
  • peek - operator for consumable that satisfies when consumable is consumed but the reader rolls back afterwards
  • push - accept a sequence by pushing a value
  • reject - reject parselet as not being consumed
  • repeat - repeat parselet, optionally push a result
  • return - same as accept, but with a meaning for ordinary functions
  • true - the true value
  • void - the void value

Appendix B: Operators

Tokay implements the following operators for use in expressions. The operators are ordered by precedence, operators in the same row share the same precedence.

Don't be confused by rows that look redundant: the meaning depends on whether the operator is delimited by whitespace from its operands or not.

Operator                 Description                                 Associativity
= += -= *= /= //= %=     Assignment, combined operation-assignment   left
||                       Logical or                                  left
&&                       Logical and                                 left
== != < <= >= >          Equal, unequal, comparison                  left
+ -                      Add, subtract                               left
* / // %                 Multiply, divide, integer divide, modulo    left
- !                      Negate, not                                 right
++ --                    Increment, decrement                        right
(...)                    Inline sequence                             left
(...) [...] .            Call parameters, subscript, attribute       left

Operators produce different results depending on the data-types of their operands. For example, 3 * 10 multiplies 10 by 3, whereas 3 * "test" creates a new string repeating "test" 3 times. Try out the results of different operands in a Tokay REPL for clarification.
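
For example, in the REPL:

>>> 3 * 10
30
>>> 3 * "test"
"testtesttest"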

Appendix C: Modifiers

Tokay allows the use of the following modifiers for calls to consumable values. Modifiers are used to describe repetitions or optional occurrences of consumables.

Modifier   Description                         Examples
+          Positive repetition (one or many)   't'+  P(n=3)+
?          Optional (one or none)              't'?  P(n=3)?
*          Kleene star (none or many)          't'*  P(n=3)*

Redundancy with expression operators

You might have noticed that + and * are also used as operators for addition and multiplication. To clarify the meaning, modifiers always stick to the token they belong to, with no whitespace accepted between them. Modifiers are only allowed on tokens and parselet calls, and nowhere else, as they make no sense elsewhere.

Here are some examples for clarification:

't' * 3    # match 't' and repeat the result 3 times
't'* * 3   # match 't' one or multiple times and repeat the result 3 times
't' * * 3  # syntax error

Appendix D: Builtins

Functions

Tokens

The following tokens are built into Tokay and can be used immediately. Programs can override these constants on demand.

Token              Token+              Description
Alphabetic         Alphabetics         All Unicode characters having the Alphabetic property
Alphanumeric       Alphanumerics       The union of Alphabetic and Numeric
Ascii              Asciis              All characters within the ASCII range
AsciiAlphabetic    AsciiAlphabetics    All ASCII alphabetic characters [A-Za-z]
AsciiAlphanumeric  AsciiAlphanumerics  ASCII alphanumeric characters [0-9A-Za-z]
AsciiControl       AsciiControls       All ASCII control characters [\x00-\x1F\x7f]; SPACE is not a control character
AsciiDigit         AsciiDigits         ASCII decimal digits [0-9]
AsciiGraphic       AsciiGraphics       ASCII graphic characters [!-~]
AsciiHexdigit      AsciiHexdigits      ASCII hex digits [0-9A-Fa-f]
AsciiLowercase     AsciiLowercases     All ASCII lowercase characters [a-z]
AsciiPunctuation   AsciiPunctuations   All ASCII punctuation characters [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
AsciiUppercase     AsciiUppercases     All ASCII uppercase characters [A-Z]
AsciiWhitespace    AsciiWhitespaces    All ASCII whitespace characters [ \t\n\f\r]
Char               Chars               Any character, except EOF
Char<...>          Chars<...>          Any character of the specified character-class, except EOF
Control            Controls            All Unicode characters in the control category
Digit              Digits              ASCII decimal digits [0-9]
EOF                -                   Matches End-Of-File
Lowercase          Lowercases          All Unicode characters having the Lowercase property
Numeric            Numerics            All Unicode characters in the numbers category
Uppercase          Uppercases          All Unicode characters having the Uppercase property
Whitespace         Whitespaces         All Unicode characters having the White_Space property
Void               -                   The empty token; it consumes nothing, but it does consume!

The respective properties of the built-in character classes are described in Chapter 4 (Character Properties) of the Unicode Standard and specified in the Unicode Character Database in DerivedCoreProperties.txt.

Appendix E: Escape sequences

Escape sequences can be used inside of strings, match/touch tokens and character-classes to encode any Unicode character. They are introduced with a backslash.

Escape sequences should be used to simplify the source code and improve its readability, but any Unicode character can also be expressed directly.

Sequence                Description                                         Example
\a \b \f \n \r \t \v    Bell (alert), backspace, formfeed, new line,        "\a\b\f\n\r\t\v"
                        carriage return, horizontal tab, vertical tab
\' \" \\                Quotation marks, backslash                          "\'\"\\"  # '"\
\ooo                    ASCII character in octal notation                   "\100"  # @
\xhh                    ASCII character in hexadecimal notation             "\xCA"  # Ê
\uhhhh                  16-bit Unicode character in hexadecimal notation    "\u20ac"  # €
\Uhhhhhhhh              32-bit Unicode character in hexadecimal notation    "\U0001F98E"  # 🦎