Preface
This documentation was updated to be used with Tokay 0.6.
Some features might not work as expected when used with a more recent version.
Like Tokay itself, this documentation is currently under development and unfinished.
If you find any mistakes or can explain things better, please contribute!
To do so, visit https://github.com/tokay-lang.
Tokay is a programming language designed for ad-hoc parsing and text processing. Tokay programs operate directly on input streams that are read from files, strings, piped commands or any other device emitting characters.
The following example is a short Tokay program that illustrates how Tokay works. It recognizes either "Hello Mercury", "Hello Venus" or "Hello Earth" from a text stream. Any other input is automatically skipped.
'Hello' _ {
'Mercury'
'Venus'
'Earth'
}
Unlike general-purpose programming languages like Rust or Python, Tokay requires no explicit branching, substring extraction, or reading from input. These operations are built directly into the language.
If you're familiar with AWK, you might find the syntax in the previous example similar to AWK's PATTERN { action } syntax. In Tokay, this approach is recursive, so the action part can itself be treated as a pattern, or as plain action code. This highlights a core tenet of Tokay's design and its key difference from AWK: instead of a line-based execution model, Tokay takes a token-based approach that permits operating on anything matched from the input. This enables Tokay programs to operate on recursive structures that can be expressed by a grammar.
Getting started
Installation
Currently, Tokay is in a very early project state. Therefore you have to build it from source, using the Rust programming language and its build tool cargo.
Once you have Rust installed, install Tokay with
$ cargo install tokay
Once done, you should run the Tokay REPL with
$ tokay
Tokay 0.6.0
>>> print("Hello Tokay")
Hello Tokay
>>>
You can exit the Tokay REPL with Ctrl+C.
The next examples show the REPL prompt >>> together with given input and output. The output may differ when other input is provided.
Usage
Invoking the tokay command without any arguments starts the REPL (read-eval-print loop). This allows you to enter expressions or even full programs interactively and see their results directly.
# Start a repl
$ tokay
# Start a repl working on an input stream from file.txt
$ tokay -- file.txt
# Start a repl working on the input string "gliding is flying with the clouds"
$ tokay -- "gliding is flying with the clouds"
Tokay 0.6.0
>>> Word(5)
("gliding", "flying", "clouds")
>>>
In case you compile and run Tokay from source, use cargo run -- with any desired parameters instead of the tokay command.
The next example runs the Tokay program from the file program.tok:
# Run a program from a file
$ tokay program.tok
To work directly on files as input stream, proceed as shown next. Further files can be specified and are processed by the same program sequentially. It's also possible to read from stdin using the special filename -.
# Run a program from a file with another file as input stream
$ tokay program.tok -- file.txt
# Run a program with multiple files as input stream
$ tokay program.tok -- file1.txt file2.txt file3.txt
# Run a program with files or strings as input stream
$ tokay program.tok -- file1.txt "gliding is fun" file2.txt
# Pipe input through tokay
$ cat file.txt | tokay program.tok -- -
A Tokay program can also be specified directly as the first parameter. This call just prints the content of the specified files:
# Directly provide program via command-line parameter
$ tokay 'print(Char+)' -- file.txt
Running tokay --help will give you an overview of further parameters and invocation options.
First steps
Tokay programs are made of items, sequences and blocks.
Items
An item can be an expression, a function or token call, or a statement. The following are all items:
# Expression
>>> 2 + 3 * 5
17
# Assignment of an expression to a variable
>>> i = 2 + 3 * 5
# Using a variable within an expression
>>> i + 3
20
# Conditional if-statement
>>> if i == 17 "yes" else "no"
"yes"
# Function call
>>> print("hello" + i)
hello17
# Method call
>>> "hello".upper * 3
"HELLOHELLOHELLO"
# Token call ("hello" is read by Word(3) from the input stream)
>>> Word(3)
"hello"
# Token call in an expression (42 is read by Int from the input stream)
>>> Int * 3
126
Sequences
Sequences are multiple items in a row. Items in a sequence can optionally be separated by commas, but this is not mandatory. Sequences are delimited either by a line-break or by a semicolon (;).
# A sequence of items with the same weighting results in a list
>>> 1 2 3
(1, 2, 3)
# This works also comma-separated
>>> 1, 2, 3
(1, 2, 3)
# This is a sequence of lists (indeed, lists are sequences, too)
>>> (1 2 3) (4 5 6)
((1, 2, 3), (4, 5, 6))
# Two sequences in one row; only last result is printed in REPL.
>>> (1 2 3); (4 5 6)
(4, 5, 6)
# This is a simple parsing sequence, accepting assignments like
# "i=1" or "number = 123"
>>> Ident _ '=' _ Int
("number", 123)
# This is a version of the same sequence constructing a dictionary
# rather than a list
>>> name => Ident _ '=' _ value => Int
(name => "number", value => 123)
Blocks
Finally, sequences are organized in blocks. The execution of a sequence is influenced by failing token matches or by special keywords (like push, next, accept, reject, etc.), which either enforce execution of the next sequence, or accept or reject a parselet. A parselet can be thought of as a function. The main parselet is the parselet executing the main block, which is also where the REPL runs.
A block itself is also an item inside a sequence of another block (or the main block). A new block is delimited by { and }.
The next piece of code already demonstrates Tokay's parsing features with a parselet and two blocks, implementing an assignment grammar for either float or integer values, plus some error reporting.
# Parselet definition of Assignment (identified by the @{...}-block)
# Match an identifier, followed by either a float or an integer;
# Throws an error on mismatch.
>>> Assignment : @{
Ident _ '=' _ {
Float
Int
error("Expecting a number here")
}
}
# Given input "i = 23.5"
>>> Assignment
("i", 23.5)
# Given input "i = 42"
>>> Assignment
("i", 42)
# Given input "i = j"
>>> Assignment
Line 1, column 5: Expecting a number here
Writing comments
It is good practice to document source code and explain what's going on using comments. Like Bash, Python or AWK, Tokay supports line comments starting with a hash (#). The rest of the line will be ignored.
# This is my little program
print("Hello World") # printing welcome message to the user
hash = "# this is a string" # assign "# this is a string" to hash.
Shebang
Accordingly, a shebang is also possible, so that a Tokay source file can be executed directly.
#!/bin/tokay
print("Hello World")
This assumes tokay is installed to /bin on a POSIX-like system.
$ ls -lta hello.tok
-rwxr-xr-x hello.tok
$ ./hello.tok
Hello World
Basics
Basically, a Tokay program is made of
- Items
- Sequences
- Blocks
All these belong together and depend on each other in some way.
The following program demonstrates the usage of items, sequences and blocks in action:
{ # A block...
# ... is made of sequences
'Hello' _ Name \
count_hello++ # ... which are made of items (4 items here).
'Goodbye' _ { # an item of a sequence can be a block again
'Max' count_bye_max++ # ... which contains other sequences...
Name count_bye++ # ... made of items again.
}
{} # a sequence with an empty block as its item
}
This program is a little parser, which looks for greetings in some input.
- The occurrence of e.g. Hello Jan or Hello Max causes the variable count_hello to be incremented,
- the occurrence of e.g. Goodbye Jan increments the counter count_bye, but
- an occurrence of Goodbye Max, which is a special case here, counts on count_bye_max.
If you are familiar with the AWK programming language, you might see some similarities to the PATTERN { action } syntax here. In Tokay, PATTERN can be any sequence of items that needs to match first, and { action } can hold further PATTERN { action } components.
Items
Items are the atomic parts of sequences, and represent values.
The following examples for items are direct values that, once specified, stay on their own.
123 # the number 123
true # the boolean value for truth
"Tokay 🦎" # a unicode string
Items can also be the result of expressions or calls to callable objects.
"a " + "string" # concatenating a string
42 * 23.5 # the result of a multiplication
'check' # the occurence of string "check" in the input
Int # calling a built-in token for parsing integer values
func(42) # calling a function
++count # the incremented value of count
But items can also be more complex.
x = count * 23.5 # the result of a calculation is assigned to a variable
This is an assignment, and it always produces the item value void, which simply means "nothing". This is because the result of the calculation is stored in a variable, but the item still has to represent some value.
Here's another item:
if x > 100 "much" # conditional expression, which is either "much" or void
This if-clause allows for conditional programming. It produces a string when the provided condition is met, and otherwise produces void.
This behavior can be changed by providing an else branch, like this:
if x > 100 "much" else "less"
As you can see, every single value, call, expression or control-flow statement is considered an item.
A block is an item as well, but this will be discussed later.
Severities
This is not important for the first steps and programs with Tokay, but it is fundamental to the magic behind Tokay's automatic value construction, which will be discussed later. You should know about it!
Every item has a severity, which defines its value's "weight".
Tokay currently knows 4 levels of severity:
- Whitespace
- Match
- Value
- Result
The severity of an item depends on how it is constructed. For example
123 # pushes 123 with severity 3
_ # matches whitespace
'check' # matches "check" in input and pushes it considered as match
''check'' # matches "check" in input and pushes it considered as value
'check' * 3 # matches "check" in input and repeats it 3 times, resulting in value
push "yes" # pushes result value "yes"
Right now, this isn't so important, and you shouldn't keep it in mind all the time. It will become useful during the next chapters, especially when writing programs that parse or extract data from something.
Conclusion
In conclusion, an item is the result of some expression and always stands for a value. An item in turn is part of a sequence. Every item has a hidden severity, which is important for constructing values from sequences later on.
Sequences
Sequences are occurrences of items in a row.
Here is a sequence of three items:
1 2 3 + 4 # results in a list (1, 2, 7)
For better readability, the items of a sequence can optionally be separated by commas (,), so
1, 2, 3 + 4 # (1, 2, 7)
encodes the same.
All items of a sequence with a given severity are used to determine the result of the sequence. Therefore, these sequences return (1, 2, 7) in the above examples when entered in a Tokay REPL. This is driven by the severities the items carry.
The end of a sequence is delimited by a line-break, but a sequence can be wrapped over multiple lines by placing a backslash before the line-break. So
1, 2 \
3 + 4 # (1, 2, 7)
also means the same as above.
Captures
The already executed items of a sequence are captured, so they can be accessed inside of the sequence using capture variables.
In the next example, the first capture, which holds the result 7 from the expression 3 + 4, is referenced with $1 and used as the value in the second item's expression. Referencing a capture which is out of bounds simply returns void.
3 + 4, $1 * 2 # (7, 14)
Captures can also be re-assigned by subsequent items. The next example assigns a value to the first item from within the second item, and uses the first item inside the calculation. The second item, which is the assignment, still exists as an item of the sequence and refers to void, as all assignments do.
This is the reason why Tokay has two values that simply define nothing, void and null, but null has a higher precedence.
3 + 4, $1 = $1 * 2 # 14
The result of the above sequence is just one value, 14, because the second item's value, void, has a lower severity than the calculated and assigned first value. This is the magic of sequences that you will soon figure out in detail, especially when tokens from streams are accessed and processed, when your programs work on extracted information from the input, and when automatic abstract syntax tree construction occurs.
As a last example, here is a short demonstration of how sequence items can also be named and accessed by a more meaningful name than just an index.
hello => "Hello", $hello = 3 * $hello # (hello => "HelloHelloHello")
Here, the first item, which is referenced by the capture variable $hello, is repeated 3 times by the second item.
It might be surprising, but the result of this sequence is a dict, as shown in the comment. A dict is a hash table where values are referenced by keys.
If you come from Python, you might already know about list and dict objects. Their behavior and meaning is similar in Tokay.
Parsing input sequences
As Tokay is a programming language with built-in parsing capabilities, let's see how parsing integrates with sequences and captures.
Given the sequence
Word __ ''the'' __ Word
we make use of the built-in token Word, which matches anything made of characters and digits, and the special constant __, which matches arbitrary whitespace, where at least one whitespace character must be present. Whitespace is anything represented by non-printable characters, like spaces or tabs.
We can now run this sequence on any input consisting of three words, where the word in the middle is "the". Let's say
Save the planet
and we get the output
("Save", "the", "planet")
To try it out, either start a Tokay REPL with
$ tokay -- "Save the planet"
and enter the sequence Word __ ''the'' __ Word afterwards, or directly specify both at invocation, like
$ tokay "Word __ ''the'' __ Word" -- "Save the planet"
You will see that regardless of how much whitespace you insert, the result will always be the same. The reason for this is the item severities discussed earlier. Whitespace, matched by the pre-defined constant __, has a lower severity and therefore won't make it into the result of the sequence.
Using capture aliases
Captures can also have a name, called an "alias". This is ideal for parsing, to give items meaningful names and make them independent from their position.
predicate => Word __ 'the' __ object => Word
will output
(object => "planet", predicate => "Save")
In this example, the match for the word ''the'' was degraded to a touch 'the', which has a lower item severity and won't make it into the sequence result. This was done to make the output clearer, and because "the" is only an article without relevance to the meaning of the sentence we try to parse.
Now we can also work with alias variables inside of the sequence
predicate => Word __ 'the' __ object => Word \
print("What to " + $predicate.lower() + "? The " + $object + "!")
will output
What to save? The planet!
The advantage here is that we can extend the sequence with further items in between without having to change the references to these items in the print function call, because they are identified by name and not by their offset, which might have changed.
The capture variable $0
There is also a special capture variable, $0. It contains the input captured so far by the currently executed parselet the sequence belongs to. A parselet is a function that consumes some sort of input; this will be discussed later.
Let's see how all capture variables, including $0, grow while the items from the examples above are being executed.
Capture | $1 | $2 | $3 | $4 | $5 |
---|---|---|---|---|---|
Alias | $predicate | | | | $object |
Item | predicate => Word | __ | 'the' | __ | object => Word |
Input | "Save" | " " | "the" | " " | "planet" |
$0 contains | "Save" | "Save " | "Save the" | "Save the " | "Save the planet" |
As you can see, $0 always contains the input matched so far from the start of the parselet.
$0 can also be assigned any other value, which makes it the result of the parselet in case no other result of higher precedence was set.
Sequence interruption
todo
Conclusion
Sequences define occurrences of items. An item inside a sequence can have a meaningful alias.
Every item of a sequence that has been executed is called a capture, and can be accessed using capture variables, either by its offset (position of occurrence) like $1, $2, $3, or by its alias, like $predicate.
The special capture $0 provides the input consumed so far by the parselet, and can also be set to a value.
Blocks
Sequences are organized in blocks. Blocks may contain several sequences, which are executed in order of their definition. Every sequence inside of a block is separated by a newline.
The main scope of a Tokay program is also an implicit block, therefore it is not necessary to start every program with a new block.
Newlines
Newlines (line-breaks, \n respectively) are meaningful and belong to the syntax of blocks. They separate the sequences inside a block from each other.
"1st" "sequence"
"2nd" "sequence"
"3rd" "sequence"
Instead of a newline, a semicolon (;) can also be used, which has the same meaning. A single-line sequence can be split into multiple lines by placing a backslash (\) in front of the line-break.
"1st" \
"sequence"
"2nd" "sequence" ; "3rd" "sequence"
The first and second example are literally the same.
Behavior
Blocks have two important purposes:
First, they group sequences into items of other sequences.
# typical use of a block
if x > 0 {
x += 1
print("x is now " + x)
}
Second, they provide alternations for sequences which consume input. Therefore, their behavior in Tokay differs from other programming languages. When none of the sequences inside a block consume any input, the block behaves exactly as in other languages. But when a sequence consumes input, the block might stop executing alternatives (= sequences) before the end of the block is reached.
# alternation behavior of a block, when used with tokens
'Hello' _ {
checked = true # always executed, consumes no input
'World' print("Hello World")
'Mars' print("Hello Mars")
print("Hello Unknown") # fallback case
}
Concepts
In this chapter we deal in detail with the basic concepts of Tokay.
Terminology
First of all, there are some terms which are often used in Tokay.
Names and identifiers
The naming rules for identifiers in Tokay differ from other programming languages, and this is an essential feature.
- Any identifier must not start with a digit (Char<0-9>).
- Variable names have to start with a lower-case letter (Char<a-z>).
- Constant names
  - have to start with an upper-case letter or an underscore (Char<A-Z_>) when they refer to consumable values,
  - otherwise they can also start with a lower-case letter, like variable names (Char<a-z>).
- when they refer consumable values, with an upper-case letter or an underscore
Some examples for better understanding:
# Valid
pi : 3.1415
mul2 : @x { x * 2 }
Planet : @{ 'Venus' ; 'Earth'; 'Mars' }
the_Tribe = "Cherokee"
# Invalid
Pi : 3.1415 # float value is not consumable
planet : @{ 'Venus' ; 'Earth'; 'Mars' } # identifier must specify consumable
The_Tribe = "Cherokee" # Upper-case variable name not allowed
9th = 9 # valid, but is interpreted as sequence `9 th = 9`
More about consumable and non-consumable values, variables and constants is discussed shortly.
Variables and constants
Symbolic identifiers for named values can either be defined as variables or constants.
variable = 0 # assign 0 to a variable
constant : 0 # assign 0 to a constant
Obviously, this looks the same: variable becomes 0, and so does constant. Let's try to modify these values afterwards.
variable += 1 # increment variable by 1
constant += 1 # throws compiler error: Cannot assign to constant 'constant'
Now variable becomes 1, but constant can't be assigned, and Tokay throws a compile error.
What you can do instead is redefine the constant with a new value.
variable++ # increment variable by 1
constant : 1 # re-assign constant to 1
The reason is that variables are evaluated at runtime, whereas constants are evaluated at compile-time, before the program is executed.
The distinction between variables and constants is a trade-off between flexibility and predictability that makes different concepts behind Tokay possible. The values of variables aren't known at compile-time, therefore predictive construction of code depending on those values is not possible. On the other hand, constants can be used before their definition, which is very useful when thinking of functions being called by other functions before their definition.
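Here is a minimal, hypothetical sketch of this forward use; the names f and g are made up for illustration:
# f refers to g, although g is defined below; this works because
# both are constants and are resolved at compile-time
f : @{ g() }
g : @{ print("g was called") }
f   # calls f, which in turn calls g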
Callables and consumables
Some values are callable. Generally, all tokens, functions and builtins are callable.
Here are some usages of callables:
int("123") # Builtin int constructor
Int # Builtin Int parser token
Char<A-Z> # builtin char token
'Check' # Touch token 'Check'
''Bold'' # Match token 'Bold'
s = "Hello"
s.upper # calls method 'str_upper', returns "HELLO"
s[0] # internally calls 'str_get_item', returns "H"
f : @x { x * 2 } # function definition
f(42) # function call, producing 84
Additionally, a callable can be attributed as consumable. This is the case when the callable either makes use of another consumable callable, or when it directly consumes input. Consumables are always identified by starting with either an upper-case letter or an underscore. A function which makes use of a consumable is called a parselet.
# invalid attempt; the parselet makes use of consumables,
# but is assigned to a name for a non-consumable constant.
assign : @{
Ident _ expect '=' _ expect Expr
}
# creating a parselet
Assign : @{
Ident _ expect '=' _ expect Expr
}
Scopes
Variables and constants are organized in scopes.
- Every block spans its own scope; additionally, there is the global scope.
- Constants can be defined in any block. They can be re-defined by other constants in the same or in subsequent blocks. Constants that are re-defined in a subsequent block are valid until that block ends; afterwards the previous constant becomes valid again.
- Variables only distinguish between the global scope and the local scope of a parselet. Unknown variables used in a parselet block are considered local variables.
Here's some commented code for clarification:
x = 10 # global scope variable x
y : 2000 # global scope constant y
z = 30 # global scope variable z
# entering new scope of function f
f : @x { # x is overridden as local variable
y : 1000 # local constant y overrides global constant y temporarily in this block
z += y + x # adds local constant y and local value of x to global value of z
}
f(42)
# back in global scope, x is still 10, y is 2000 again, z is 1072 now.
x y z
Values
Generally, everything in Tokay is some kind of value or part of a value. The term "value" refers both to simple atomic values like booleans, numbers and strings, and to objects, which can be mutable, recursive or callable.
Atomics
Atomic values stand on their own and are generally not mutable in the sense of an object. These values are the following:
void # value representing just nothing
null # value representing a defined "nothing"
true false # boolean values
Using the builtin function bool(v), a boolean value can be constructed from any other value v by testing it for truth.
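For example, in a hypothetical REPL session (assuming the usual truth semantics, where 0 and empty values count as false):
>>> bool(0)
false
>>> bool("Tokay")
true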
Numbers
Tokay supports int and float as numeric types.
42 -23 # signed integer number object of arbitrary size (bigint)
3.1415 -1.337 # signed 64-bit float number object
For numbers, the following methods can be used:
- int(v) - constructs an int value from any other value v
- float(v) - constructs a float value from any other value v
- float.ceil() - returns the next integer ceiling of a float
- float.fract() - returns only the fractional part of a float
- float.trunc() - truncates the fractional part off a float
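As a short, hypothetical sketch of the constructors described above (exact result formatting may differ):
>>> int("42") + 1
43
>>> float(1) + 0.5
1.5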
Strings
A string (str) is a Unicode character sequence of arbitrary length.
s = "Tokay 🦎"
s = s + " is cool"
s += "!"
str objects can be concatenated by the operators + and +=. They can also be multiplied by the operators * and *=.
Additionally, they provide the following methods:
- str(v) - constructs a string object from any other value v
- str.byteslen() - returns the total number of bytes used by the string
- str.endswith(s) - checks if the string ends with the postfix s
- str.join(l) - creates a string delimited by str from a list l
- str.len() - returns the number of characters in the string
- str.lower() - turns any upper-case characters of the string into lower-case
- str.replace(from, to="", n=void) - replaces the string from by to, up to n times
- str.startswith(s) - checks if the string begins with the prefix s
- str.substr(start=0, length=void) - returns a substring from start of the given length, or up to the end
- str.upper() - turns any lower-case characters of the string into upper-case
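A brief, hypothetical REPL session using some of these operators and methods; the results assume the descriptions above:
>>> "Tokay " + "🦎"
"Tokay 🦎"
>>> "ab" * 3
"ababab"
>>> "Hello World".replace("World", "Tokay")
"Hello Tokay"
>>> "Hello".len()
5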
Lists
A list is a sequence of arbitrary values in a row. Therefore, a list can also contain further lists or other complex objects. A list is mutable, which means that items can be added or removed at runtime.
# list of values
(42, true, "yes")
l = (42 true "yes")
l[1] = false
l.push("🦎")
l.len() # 4
Lists can be concatenated by the + and += operators, and provide the following methods:
- list(*args) - constructs a new list from all arguments provided
- list.flatten() - integrates the items of nested lists into the list itself
- list.len() - returns the number of items in the list
- list.push(item, index=void) - either appends item to the list or inserts it at position index
- list.pop(index=void) - either pops the last item off the list or removes and returns the item at position index
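Here is a small, hypothetical sketch of these methods in use; the comments state the expected effects based on the descriptions above:
l = (1, 2, 3)
l.push(4)        # l is now (1, 2, 3, 4)
l.push(0, 0)     # insert 0 at position 0: (0, 1, 2, 3, 4)
print(l.pop())   # removes and prints the last item, 4
print(l.len())   # 4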
Dicts
Dictionaries ("dicts") are hash tables or maps with key-value-pairs, where a value is referenced by using the key as its storage location.
# dictionary (dict), a map of key-value-pairs
(i => 42, b => true, status => "success", true => false)
d = (i => 42 b => true status => "success" true => false)
d["angle"] = 23.5 # add key "angle"
d["i"] = void # remove key "i"
Dicts provide the following methods:
- dict() - creates a new, empty dict
- dict_clone() - creates an independent copy of the dict
- dict_items() - returns a list of (key, value) lists
- dict_keys() - returns a list of keys
- dict.len() - returns the number of items in the dict
- dict.merge(other) - merges another dict into the dict
- dict_pop(k=void, d=void) - removes and returns k from the dict; returns d when the key is not present; when no key is given, the last item is removed
- dict_push(k, v) - inserts v under the key k into the dict
- dict_values() - returns a list of values
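A small, hypothetical sketch of working with a dict; the comments describe the expected effects based on the descriptions above:
d = (i => 42, status => "success")
d["angle"] = 23.5     # add key "angle"
print(d.len())        # 3
print(d["status"])    # success
d["i"] = void         # remove key "i"
print(d.len())        # 2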
Tokens
Tokens are callables that consume input from the stream. They are object values as well. They always return a value parsed from the input stream in case the token matches. Otherwise, tokens usually reject the current block branch or parselet, so that other alternatives can be tried.
'touch' # silently touch a string in the input (low severity)
''match'' # verbosely match a string from the input (high severity)
Char<A-Z0-9>+ # matching a sequence of multiple valid characters
Int # built-in token for parsing and returning Integer values
Word(3) # built-in token Word, matching at least words of length 3
In terms of parsing, tokens are the terminal symbols of a context-free grammar.
Functions and parselets
Functions are sub-programs for a specific task or routine that can be reused for multiple purposes. A function can accept arguments with default values.
# function that doubles its value
f : @x { x * 2 }
f(9) # 18
# anonymous function example
@x { x * 3 }(5) # 15, returned by anonymous function that is called in-place
Parselets are more specific functions which consume input and are used for parsing. They are conceptually the same as functions, but clearly distinguishable in their usage.
# parselet that parses simple assignments to variables
Assign : @{
variable => Ident _ '=' _ value => Number
}
# called on a given input `n = 42`...
Assign
# ... returns dict `(variable => "n", value => 42)`
In terms of parsing, parselets are considered as non-terminal symbols of a context-free grammar.
Tokens
Tokens are the fundamental building blocks used to process input. Tokay implements first-level tokens, which directly consume input; parselets, which are functions consuming input, are considered second-level tokens and are tokens as well.
'touch' and ''match''
To match exact strings of characters from the input, like keywords, the match and touch token types are used. Touch has mostly been used in our examples so far, but match is also useful, depending on the use case.
'Touch' # match string in the input and discard
''Match'' # match string in the input and take
The only difference between the two types is that a match has a higher severity than a touch, and will be recognized within automatic value construction. Both types of matches can be referenced by capture variables, therefore
'Match' $1
yields the same result as a direct match.
Check out the following one-liner. When executed on the input 1+2-3+4, it returns (1, "+", (2, (3, "+", 4))). The matches on plus (''+'') are taken into the result; the touches on minus ('-') are discarded.
E : { E ''+'' E ; E '-' E; Int }; E
Char
To match a character, the Char token is both a builtin and part of Tokay's syntax.
- Single characters are specified either by a Unicode character or by an escape sequence.
- Ranges are delimited by a dash (-). If a max-min range is specified, it is automatically converted into a min-max range, so Char<z-a> is equal to Char<a-z>.
- If a dash (-) should be part of the character class, it has to be specified first or last.
- If a circumflex (^) is specified as the first character of the character class, the character class is inverted, so Char<^a-z> matches everything except a to z.
Char # any character
Char<a> # just "a"
Char<az> # either "a" or "z"
Char<a-z> # any character from "a" to "z"
Char<a-zA-Z0-9_> # All ASCII digit or letter and underscore
Char<^0-9> # Any character except ASCII digits
Char<-+*/> # Mathematical base operators (minus-dash first!)
When using the Char token with the multiplicative operators + (many repetitions) or * (kleene, none or many), it is internally rewritten into a Chars version, for better performance.
Builtin tokens
The following tokens are builtin and can be parametrized.
- Ident - parses any C-style identifier name
- Int(base=10, with_signs=true) - parses an int value to the provided base, optionally with + or - signs
- Float(with_signs=true) - parses a float value, optionally with + or - signs
- Number - parses Float or Int
- Token - either parses Number, Word or AsciiPunctuation
- Word(min=1, max=void) - parses any word, number, etc. with the specified min- and max-length
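As a hypothetical example of combining these builtins in a sequence (the input string is made up for illustration):
# Given input "Hello 2024"
>>> Word(3) __ Int
("Hello", 2024)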
Parselets
Currently this chapter is an unfinished work-in-progress.
Parselets are functions, which consume input.
begin, end
coming soon
accept, reject
coming soon
repeat
coming soon
Functions
A function is introduced by an at-character (@), optionally followed by a parameter list. The function's body is obligatory, but it can also consist of just a sequence or a single item. Functions are normally assigned to constants, but they can also be assigned to variables, with some loss of flexibility, but opening up other possibilities.
# f is a function
f : @x = 1 {
print("I am a function, x is " + x)
}
f # calls f, because it has no required parameters!
f() # same as just f
f(5) # calls f with x=5
f(x=10) # calls f with x=10
Tokay functions that consume input are called parselets. Whether something is considered a function or a parselet depends on the function's body. Generally, when talking about parselets in Tokay, this is meant as a shorthand for both functions and real parselets.
# P is a parselet, as it uses a consuming token
P : @x = 1 {
Word print("I am a parselet, x is " + x)
}
P # calls P, because it has no required parameters!
P() # same as just P
P(5) # calls P with x=5
P(x=10) # calls P with x=10
Control structures
In comparison to many other languages, control structures in Tokay are part of expressions. They always return a value, which defaults to void when no other value is explicitly returned.
if...else
The if...else construct implements conditional branching depending on the result of an expression.
The else part is optional and can be omitted.
if sense == 42 && axis == 23.5 {
print("Well, this is fine!")
}
else {
print("That's quite bad.")
}
As stated before, all control structures are part of Tokay's expression syntax. The above example can easily be turned into
print(
if sense == 42 && axis == 23.5
"Well, this is fine!"
else
"That's quite bad."
)
or directly used inside of an expression.
# if can be part of an expression
Word "Hello " + if $1 == "World" "Earth" else $1
if...else constructs working on static expressions are optimized away at compile-time.
loop
The loop keyword is used to create loops, either with an abort condition at the top, or without any condition.
# Countdown
count = 10
loop count >= 0 print(
if --count == 3
"Ignition"
else if count < 0
"Liftoff"
else
count
)
A loop can be aborted at any time using the break statement. The continue statement restarts the loop from the beginning, but a present abort condition will be checked again.
count = 10
loop {
count = count - 1
if count == 3 {
print("Ignition")
continue
}
print(count)
if count == 0 {
print("Liftoff")
break
}
}
A loop without any aborting condition loops forever.
loop print("Forever!")
for
The for keyword introduces a special form of loop that syntactically glues together initialization, abort condition and iteration into one syntactic element.
for count = 10; count >= 0; count-- {
print(count)
}
This syntax is abandoned. The upcoming version 0.7 of Tokay will only support for...in.
Appendix
Appendix A: Keywords
In Tokay, the following keywords are reserved words for control structures, values and special operators.
- accept - accept a parselet, optionally with a return value
- begin - sequence to execute at the beginning of a parselet
- break - break from a loop, optionally with a return value
- continue - restart the iteration in a loop
- else - fallback for if constructs
- end - sequence to execute at the end of a parselet
- exit - stop program execution, optionally with an exit code
- expect - operator for a consumable that expects the consumable and throws an error if it is not present
- false - the false value
- for - head-controlled for loop
- if - branch based on the result of a conditional expression
- in - part of the for-loop syntax
- loop - head-controlled loop with an optional abort condition
- next - continue with the next sequence in a block
- not - operator for a consumable that is satisfied when the consumable is not consumed
- null - the null value
- peek - operator for a consumable that is satisfied when the consumable is consumed, but the reader rolls back afterwards
- push - accept a sequence by pushing a value
- reject - reject a parselet as not being consumed
- repeat - repeat a parselet, optionally pushing a result
- return - same as accept, but with a meaning for ordinary functions
- true - the true value
- void - the void value
Appendix B: Operators
Tokay implements the following operators for use in expressions. The operators are ordered by precedence, operators in the same row share the same precedence.
Don't be confused by rows that look redundant; the meaning depends on whether the operator is delimited from its operands by whitespace or not.
Operator | Description | Associativity |
---|---|---|
= += -= *= /= //= %= | Assignment, combined operation-assignment | left |
|| | Logical or | left |
&& | Logical and | left |
== != < <= >= > | Equal, unequal, comparison | left |
+ - | Add, subtract | left |
* / // % | Multiply, divide, integer divide, modulo | left |
- ! | Negate, not | right |
++ -- | Increment, decrement | right |
(...) | Inline sequence | left |
(...) [...] . | Call parameters, subscript, attribute | left |
Operators produce different results depending on the data-types of their operands. For example, 3 * 10 multiplies 10 by 3, whereas 3 * "test" creates a new string repeating "test" 3 times. Try out the results of different operands in a Tokay REPL for clarification.
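Both examples from the paragraph above, tried in a hypothetical REPL session:
>>> 3 * 10
30
>>> 3 * "test"
"testtesttest"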
Appendix C: Modifiers
Tokay allows the following modifiers for calls to consumable values. Modifiers are used to describe repetitions or optional occurrences of consumables.
Modifier | Description | Examples |
---|---|---|
+ | Positive repetition (one or many) | `'t'+, P(n=3)+` |
? | Optional (one or none) | `'t'?, P(n=3)?` |
* | Kleene star (none or many) | `'t'*, P(n=3)*` |
Redundancy with expression operators
You might have noticed that + and * are also used as operators for add and multiply. To clarify meaning, modifiers always stick to the token they belong to, and no whitespace is accepted between them. Modifiers are only allowed on tokens and parselet calls, and nowhere else, as they make no sense elsewhere.
Here are some examples for clarification:
't' * 3 # match 't' and repeat the result 3 times
't'* * 3 # match 't' one or multiple times and repeat the result 3 times
't' * * 3 # syntax error
Appendix D: Builtins
Functions
Tokens
The following tokens are built into Tokay and can be used immediately. Programs can override these constants on demand.
Token | Token+ | Description |
---|---|---|
Alphabetic | Alphabetics | All Unicode characters having the Alphabetic property |
Alphanumeric | Alphanumerics | The union of Alphabetic and Numeric |
Ascii | Asciis | All characters within the ASCII range. |
AsciiAlphabetic | AsciiAlphabetics | All ASCII alphabetic characters [A-Za-z] |
AsciiAlphanumeric | AsciiAlphanumerics | ASCII alphanumeric characters [0-9A-Za-z] |
AsciiControl | AsciiControls | All ASCII control characters [\x00-\x1F\x7f] . SPACE is not a control character. |
AsciiDigit | AsciiDigits | ASCII decimal digits [0-9] |
AsciiGraphic | AsciiGraphics | ASCII graphic character [!-~] |
AsciiHexdigit | AsciiHexdigits | ASCII hex digits [0-9A-Fa-f] |
AsciiLowercase | AsciiLowercases | All ASCII lowercase characters [a-z] |
AsciiPunctuation | AsciiPunctuations | All ASCII punctuation characters [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~] |
AsciiUppercase | AsciiUppercases | All ASCII uppercase characters [A-Z] |
AsciiWhitespace | AsciiWhitespaces | All characters defining ASCII whitespace [ \t\n\f\r] |
Char | Chars | Any character, except EOF |
Char<...> | Chars<...> | Any character of specified character-class, except EOF |
Control | Controls | All Unicode characters in the controls category |
Digit | Digits | ASCII decimal digits [0-9] |
EOF | - | Matches End-Of-File. |
Lowercase | Lowercases | All Unicode characters having the Lowercase property |
Numeric | Numerics | All Unicode characters in the numbers category |
Uppercase | Uppercases | All Unicode characters having the Uppercase property |
Whitespace | Whitespaces | All Unicode characters having the White_Space property |
Void | - | The empty token, which consumes nothing, but still counts as consuming! |
The respective properties of the built-in character classes are described in Chapter 4 (Character Properties) of the Unicode Standard and specified in the Unicode Character Database in DerivedCoreProperties.txt.
Appendix E: Escape sequences
Escape sequences can be used inside strings, match/touch tokens and character classes to encode any Unicode character. They are introduced with a backslash.
Escape sequences should be used to simplify the source code and improve its readability, but any Unicode character can also be expressed directly.
Sequence | Description | Examples |
---|---|---|
\a \b \f \n \r \t \v | Bell (alert), backspace, formfeed, new line, carriage return, horizontal tab, vertical tab | "\a\b\f\n\r\t\v" |
\' \" \\ | Quotation marks, backslash | "\'\"\\" # '"\ |
\ooo | ASCII character in octal notation | "\100" # @ |
\xhh | ASCII character in hexadecimal notation | "\xCA" # Ê |
\uhhhh | 16-Bit Unicode character in hexadecimal notation | "\u20ac" # € |
\Uhhhhhhhh | 32-Bit Unicode character in hexadecimal notation | "\U0001F98E" # 🦎 |
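As a short example based on the table above, an escape sequence and the directly written character produce the same string:
print("\U0001F98E is a Tokay")   # 🦎 is a Tokay
print("🦎 is a Tokay")            # same output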