Tokens
Tokens are the fundamental building blocks used to process input. Tokay implements first-level tokens which direcly consume input, but usages of parselets, which are functions consuming input, are considered as second-level tokens, and are at least tokens as well.
'touch' and ''match''
To match exact strings of characters from the input, like keywords, the match and touch token-type is used. Touch was yet mostly used in our examples, but match is also useful, depending on use-case.
'Touch' # match string in the input and discard
''Match'' # match string in the input and take
The only difference between the two types is, that a match has a higher severity than a touch, and will be recognized within automatic value construction. Both type of matches can be referred by capture variables, therefore
'Match' $1
is the same result like a direct match.
Check out the following one-liner when executed on the input 1+2-3+4, it will return (1, "+", (2, (3, "+", 4))). The matches on the plus (''+'') is taken into the result, the touch on minus ('-') are discarded.
E : { E ''+'' E ; E '-' E; Integer }; E
Char
To match a character, the Char-token is both builtin and part of Tokay's syntax.
- Single characters are either specified by a Unicode-character or an escape sequence
- Ranges are delimited by a dash (
-). If a Max-Min-Range is specified, it is automatically converted into a Min-Max-Range, soChar<z-a>is equal toChar<a-z>. - If a dash (
-) should be part of the character-class, it should be specified first or last. - If a circumflex (
^) is specified as first character in the character-class, the character-class will be inverted, soChar<^a-z>matches everything exceptatoz.
Char # any character
Char<a> # just "a"
Char<az> # either "a" or "z"
Char<a-z> # any character from "a" to "z"
Char<a-zA-Z0-9_> # All ASCII digit or letter and underscore
Char<^0-9> # Any character except ASCII digits
Char<-+*/> # Mathematical base operators (minus-dash first!)
When using the
Char-token with the multiplicative operators+(many repetition) or*(kleene, none or many), they are internally revised to aChars-version, for better performance.
Builtin tokens
The following tokens are builtin and can be parametrized.
Ident- parses any C-style idenfifier nameInt(base=10, with_signs=true)- parses an int-value to the provided base, optionally with+or-signsFloat(with_signs=true)- parses a float-value, optionally with+or-signsNumber- parsesFloatorIntToken- either parsesNumber,WordorAsciiPunctuationWord(min=1, max=void)- parses any word, number, etc. with the specifiedmin- andmax-length