Basic Example
Lezer's grammar notation borrows from extended Backus-Naur notation
and regular expression syntax, using | to indicate a choice between
several forms, * and + for repetition, and ? for optional elements.
A grammar should be put in its own file, typically with a .grammar
extension, and run through lezer-generator to create a JavaScript
file.
Each regular (non-token) rule expresses the structure of a given
construct (say, an expression or a statement, or a smaller part of
those). For example, this rule indicates that an expression can be
either an identifier, a string, a boolean, or a sequence of
expressions between parentheses.
expression {
  Identifier |
  String |
  Boolean |
  Application
}
Application { "(" expression* ")" }
The separate things that count as an expression are separated by |
characters. Things that should come after each other are simply
written next to each other.
This tells the parser generated from the grammar that, if it is in a
position where an expression would be allowed and the next token is
the starting token for one of these options, it should start parsing
an expression. And when it reaches the end of any of these options,
it should count that as having parsed an expression.
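To make that control flow concrete, here is a minimal hand-written recognizer for the two rules above, in TypeScript. It is only an analogy: Lezer generates a table-driven LR parser, not recursive-descent code like this, and the Tok type and all function names are made up for this sketch. The input is assumed to be split into tokens already.

```typescript
// Hypothetical token shape for this sketch; not how Lezer represents tokens.
type Tok = { type: "Identifier" | "String" | "Boolean" | "(" | ")" };

// Tries to parse one `expression` starting at `pos`.
// Returns the position just after it, or -1 if no expression starts here.
function parseExpression(tokens: Tok[], pos: number): number {
  const next = tokens[pos];
  if (!next) return -1;
  switch (next.type) {
    case "Identifier":
    case "String":
    case "Boolean":
      return pos + 1; // a single-token option is complete right away
    case "(":
      return parseApplication(tokens, pos); // starting token of Application
    default:
      return -1; // ")" cannot start an expression
  }
}

// Application { "(" expression* ")" }
function parseApplication(tokens: Tok[], pos: number): number {
  pos++; // consume "("
  for (;;) {
    const end = parseExpression(tokens, pos);
    if (end < 0) break; // expression*: stop when no expression starts here
    pos = end;
  }
  const close = tokens[pos];
  return close && close.type === ")" ? pos + 1 : -1;
}

// True when the whole token list is exactly one expression.
function isExpression(tokens: Tok[]): boolean {
  return parseExpression(tokens, 0) === tokens.length;
}
```

Each case in the switch corresponds to one option of the expression rule, and reaching the end of any of them counts as having parsed an expression.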
The parse position at the start of the parse is determined by the rule
marked with @top.
@top Program { expression* }
This expresses that a document should be parsed as any number of
expressions, and the top node of the syntax tree should be called
Program.
Rule names that start with a capital letter will end up in the syntax
tree produced by the parser. Other rules, such as expression, which
are only there to structure the grammar, will be left out (to keep the
tree small and clean).
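The effect of the naming rule can be sketched in TypeScript as a pruning pass over a raw derivation. This is an illustration only: the RawNode and TreeNode shapes are hypothetical, the ASCII-only capital-letter check is a simplification, and Lezer does not build an intermediate derivation like this. Literal tokens are left out of the sketch.

```typescript
// Hypothetical raw derivation: every rule that matched, visible or not.
interface RawNode { rule: string; children: RawNode[] }
// What is kept: only nodes for capitalized rule names.
interface TreeNode { name: string; children: TreeNode[] }

// Simplified check: treat a rule as visible when it starts with A-Z.
function isVisible(rule: string): boolean {
  return /^[A-Z]/.test(rule);
}

// Keeps capitalized nodes; a lowercase rule is replaced by its children.
function prune(node: RawNode): TreeNode[] {
  const children: TreeNode[] = [];
  for (const child of node.children) children.push(...prune(child));
  return isVisible(node.rule) ? [{ name: node.rule, children }] : children;
}

// Renders a tree as Name(child,child).
function show(node: TreeNode): string {
  return node.children.length
    ? `${node.name}(${node.children.map(show).join(",")})`
    : node.name;
}

const leaf = (rule: string): RawNode => ({ rule, children: [] });

// Derivation for an input shaped like: (foo "bar")
const raw: RawNode = {
  rule: "Program",
  children: [{
    rule: "expression",
    children: [{
      rule: "Application",
      children: [
        { rule: "expression", children: [leaf("Identifier")] },
        { rule: "expression", children: [leaf("String")] },
      ],
    }],
  }],
};

console.log(prune(raw).map(show).join(" "));
// prints: Program(Application(Identifier,String))
```

The three expression wrappers disappear, while Program, Application, Identifier, and String remain.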
Simple tokens that just match a string can be included directly in
rules as quoted strings (for example "(" and ")" in Application).
More involved tokens have to be defined in a @tokens block:
@tokens {
  Identifier { $[a-zA-Z_]+ }
  String { '"' (!["\\] | "\\" _)* '"' }
  Boolean { "#t" | "#f" }
  LineComment { ";" ![\n]* }
  space { $[ \t\n\r]+ }
  "(" ")"
}
These use a syntax similar to the rule definitions, but can only
express a regular language, which roughly means they can't be
recursive. Quoted literals match exactly the text in the quotes, sets
of characters can be specified with $[] syntax, and ![] is used to
match all characters except the ones between the brackets.
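Because token definitions describe a regular language, each one has a close counterpart among JavaScript regular expressions. The TypeScript sketch below spells out that correspondence for the tokens above: $[...] becomes [...], ![...] becomes [^...], and _ (any character) becomes [\s\S]. The scanner around the patterns is an assumption made for illustration; in particular, its first-match-wins ordering is not a description of how Lezer's tokenizer chooses between tokens.

```typescript
// Anchored regular expressions approximating the token definitions above.
// The table layout and ordering are choices made for this sketch only.
const tokenPatterns: [string, RegExp][] = [
  ["Identifier", /^[a-zA-Z_]+/],
  ["String", /^"(?:[^"\\]|\\[\s\S])*"/],
  ["Boolean", /^(?:#t|#f)/],
  ["LineComment", /^;[^\n]*/],
  ["space", /^[ \t\n\r]+/],
  ["(", /^\(/],
  [")", /^\)/],
];

interface LexToken { type: string; text: string }

// Splits the input into tokens, trying each pattern at the current position.
function tokenize(input: string): LexToken[] {
  const tokens: LexToken[] = [];
  let pos = 0;
  while (pos < input.length) {
    const rest = input.slice(pos);
    let matched = false;
    for (const [type, pattern] of tokenPatterns) {
      const m = pattern.exec(rest);
      if (m) {
        tokens.push({ type, text: m[0] });
        pos += m[0].length; // every pattern matches at least one character
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`No token matches at position ${pos}`);
  }
  return tokens;
}
```

For example, tokenize('(foo #t)') yields the token types "(", Identifier, space, Boolean, and ")", in that order.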
By default, tokens implicitly created by using literal strings in the
(non-token) grammar won't be part of the syntax tree. By mentioning
such tokens (like "(" and ")") explicitly in the @tokens block, we
indicate that they should be included.
The LineComment and space tokens haven't been used anywhere yet.
That's because they aren't normal parts of the grammar, but are
“skipped” elements, that may appear anywhere between other tokens, and
don't affect the structure of the program. This is declared with a
@skip rule.
@skip { space | LineComment }
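One way to picture this is that the grammar rules only ever see the token stream with the skipped tokens taken out. The TypeScript sketch below shows that view; the StreamToken shape and the function name are hypothetical, and it is a mental model rather than Lezer's implementation, which handles skipping inside the parser itself.

```typescript
// Hypothetical token shape for this sketch.
interface StreamToken { type: string; text: string }

// Token types named in the @skip rule.
const skippedTypes = ["space", "LineComment"];

// Returns the tokens as the (non-skip) grammar rules get to see them.
function visibleToGrammar(tokens: StreamToken[]): StreamToken[] {
  return tokens.filter(t => skippedTypes.indexOf(t.type) < 0);
}
```

With this view, "(a b)" and the same text with extra spaces, line breaks, or comments between the tokens reduce to the same sequence, which is why Application never has to mention whitespace.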
And finally, the parser generator can be asked to automatically infer
matching delimiters with a @detectDelim directive. This will cause it
to add metadata to those node types, which the editor can use for
things like bracket matching and automatic indentation.
@detectDelim
If this grammar lives in example.grammar, you can run
lezer-generator example.grammar to create a JavaScript module holding
the parse tables.
lezer-generator example.grammar > example.mjs
Or see the setup example for a more general description
of how to set up a parser project.