Tokens
Tokens are the fundamental building blocks used to process input. Tokay implements first-level tokens, which directly consume input. Usages of parselets, which are functions that consume input, are considered second-level tokens and count as tokens as well.
'touch' and ''match''
To match exact strings of characters from the input, such as keywords, the touch and match token types are used. Most of our examples so far use touch, but match is also useful, depending on the use case.
'Touch' # match string in the input and discard
''Match'' # match string in the input and take
The only difference between the two types is that a match has a higher severity than a touch and is recognized by automatic value construction. Both types can be referred to by capture variables, so
'Match' $1
yields the same result as a direct match.
Consider the following one-liner. When executed on the input 1+2-3+4, it returns (1, "+", (2, (3, "+", 4))). The match on plus (''+'') is taken into the result, while the touch on minus ('-') is discarded.
E : { E ''+'' E ; E '-' E; Integer }; E
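The effect of touch versus match on the result can be modelled outside Tokay. The following Python sketch is a simplified stand-in for automatic value construction, not Tokay's implementation: the names Touch, Match and construct are hypothetical, severity is reduced to "touches are dropped, everything else is kept", and the grouping of the expression is hard-coded to mirror the result quoted above rather than parsed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Touch:
    """Models a 'touch' token: matched in the input, then discarded."""
    text: str


@dataclass(frozen=True)
class Match:
    """Models a ''match'' token: matched in the input and taken."""
    text: str


def construct(items):
    """Simplified model of automatic value construction.

    Touches are dropped, matches contribute their text, and any other
    item is treated as an already-built value (for example an Integer).
    """
    kept = []
    for item in items:
        if isinstance(item, Touch):
            continue
        kept.append(item.text if isinstance(item, Match) else item)
    # Assumption: a single remaining value is returned bare, not as a 1-tuple.
    return kept[0] if len(kept) == 1 else tuple(kept)


# Grouping hard-coded to mirror the result quoted for the input 1+2-3+4.
inner = construct([3, Match("+"), 4])
middle = construct([2, Touch("-"), inner])
result = construct([1, Match("+"), middle])
print(result)
```

In this model the minus never reaches the result because it is a touch, while both plus signs survive because they are matches.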
Char
To match a single character, use the Char token, which is both a builtin and part of Tokay's syntax.
- Single characters are either specified by a Unicode-character or an escape sequence
- Ranges are delimited by a dash (-). If a Max-Min-Range is specified, it is automatically converted into a Min-Max-Range, so Char<z-a> is equal to Char<a-z>.
- If a dash (-) should be part of the character-class, it should be specified first or last.
- If a circumflex (^) is specified as first character in the character-class, the character-class will be inverted, so Char<^a-z> matches everything except a to z.
Char # any character
Char<a> # just "a"
Char<az> # either "a" or "z"
Char<a-z> # any character from "a" to "z"
Char<a-zA-Z0-9_> # any ASCII letter or digit, or underscore
Char<^0-9> # Any character except ASCII digits
Char<-+*/> # Mathematical base operators (minus-dash first!)
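The character-class rules above can be illustrated with a small Python sketch. It is a model of the described behavior under stated assumptions, not Tokay's parser: the function names parse_char_class and matches are hypothetical, the input is the text between < and >, and escape sequences are not handled.

```python
def parse_char_class(spec):
    """Return (ranges, inverted) for the text between '<' and '>'.

    ranges is a list of (low, high) character pairs. Escape sequences
    are not handled in this sketch.
    """
    # A leading circumflex inverts the whole class.
    inverted = spec.startswith("^")
    if inverted:
        spec = spec[1:]
    ranges = []
    i = 0
    while i < len(spec):
        low = spec[i]
        # A dash forms a range only when it sits between two characters,
        # so a dash in first or last position is a literal dash.
        if i + 2 < len(spec) and spec[i + 1] == "-":
            high = spec[i + 2]
            if low > high:  # a Max-Min range is normalised to Min-Max
                low, high = high, low
            ranges.append((low, high))
            i += 3
        else:
            ranges.append((low, low))
            i += 1
    return ranges, inverted


def matches(spec, ch):
    """Check whether the single character ch belongs to the class."""
    ranges, inverted = parse_char_class(spec)
    hit = any(low <= ch <= high for low, high in ranges)
    return hit != inverted
```

For example, matches("z-a", "m") and matches("a-z", "m") agree because both specifications normalise to the same range, and matches("-+*/", "-") is true because the leading dash is taken literally.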
When the Char token is used with the repetition operators + (one or many) or * (Kleene star, none or many), it is internally rewritten into a Chars version for better performance.
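How Chars works internally is not described here. One plausible reading is that a whole run of accepted characters is consumed in a single loop instead of invoking a single-character token over and over. The Python sketch below illustrates that idea only; scan_run and is_digit are hypothetical names, and at_least=1 stands in for + while at_least=0 stands in for *.

```python
def is_digit(ch):
    """Accept ASCII digits, comparable to Char<0-9>."""
    return "0" <= ch <= "9"


def scan_run(text, pos, accept, at_least):
    """Consume a maximal run of accepted characters starting at pos.

    Returns the end position of the run, or None when fewer than
    at_least characters were consumed.
    """
    end = pos
    while end < len(text) and accept(text[end]):
        end += 1
    return end if end - pos >= at_least else None


print(scan_run("123abc", 0, is_digit, 1))  # behaves like Char<0-9>+
print(scan_run("abc", 0, is_digit, 0))     # behaves like Char<0-9>*
```

With at_least=1 the scan fails on input that does not start with a digit, whereas with at_least=0 it succeeds without consuming anything.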
Builtin tokens
The following tokens are builtin and can be parametrized.
- Ident - parses any C-style identifier name
- Int(base=10, with_signs=true) - parses an int-value to the provided base, optionally with + or - signs
- Float(with_signs=true) - parses a float-value, optionally with + or - signs
- Number - parses Float or Int
- Token - either parses Number, Word or AsciiPunctuation
- Word(min=1, max=void) - parses any word, number, etc. with the specified min- and max-length
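To make the parameters of a token such as Int(base=10, with_signs=true) more concrete, here is a Python sketch of a comparable function. It is an illustration under assumptions, not Tokay's implementation: parse_int is a hypothetical name, the digit alphabet 0-9 followed by a-z (case-insensitive) for bases up to 36 is assumed, and returning the consumed end position alongside the value is a choice made for this sketch.

```python
import string

# Assumed digit alphabet for bases up to 36.
DIGITS = string.digits + string.ascii_lowercase


def parse_int(text, pos=0, base=10, with_signs=True):
    """Parse an integer at pos; return (value, end) or None on failure."""
    sign = 1
    # An optional leading sign is only accepted when with_signs is set.
    if with_signs and pos < len(text) and text[pos] in "+-":
        sign = -1 if text[pos] == "-" else 1
        pos += 1
    first_digit = pos
    value = 0
    while pos < len(text):
        digit = DIGITS.find(text[pos].lower())
        if digit < 0 or digit >= base:
            break  # not a valid digit in this base
        value = value * base + digit
        pos += 1
    if pos == first_digit:
        return None  # no digits consumed
    return sign * value, pos
```

Calling parse_int("ff", base=16) consumes both characters, while parse_int("-42", with_signs=False) fails because the sign is not accepted and no digit is found at the start.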