JavaScript Example
To show how a medium-sized grammar holds together, this example walks
through the definition of a grammar for a sizable subset of
JavaScript, including some of the more awkward features, such as
automatic semicolon insertion.
For this grammar, we define two @top
rules.
@top Script { statement* }
@top SingleExpression { expression }
The one that occurs first will be the default one, but code using the
parser can use the top
option to select
another one, allowing it to parse only a single expression if it needs
to.
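For example, this is how code could select the second @top rule when it needs to parse a lone expression. A minimal sketch, assuming the generated parser module is named ./javascript.js:
import {parser} from "./javascript.js"
// The default top rule (Script) parses a sequence of statements.
let script = parser.parse("let x = 1;")
// Configuring the parser with the other top rule parses a single expression.
let expr = parser.configure({top: "SingleExpression"}).parse("x + 1")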
The grammar needs a bunch of precedences to define which rules take
precedence in some cases. We'll discuss their roles when we get to the
rules that use them. Note that our dialect already drops a number of
operators to keep things from getting too repetitive; the
real JavaScript grammar
has 33 named precedences.
@precedence {
member,
newArgs,
call,
times @left,
plus @left,
rel @left,
ternary @right,
assign @right,
forIn,
else,
statement @cut
}
These are the types of statements that we recognize. The simpler ones
are defined using inline rules, where the braces that define the
rule's content follow directly after the rule's name.
statement[@isGroup=Statement] {
FunctionDeclaration |
VariableDeclaration |
ForStatement |
IfStatement |
ReturnStatement |
Block |
LabeledStatement { Identifier ":" statement } |
ExpressionStatement { expression semi } |
EmptyStatement { ";" }
}
The rule itself (statement
) isn't capitalized. When statements
appear in the tree, we want a node for a specific statement type (say,
ExpressionStatement
), without wrapping each of those in a generic
statement node. Lower-case rule names don't appear in the tree output.
But it can be useful to have some way to identify whether a node is a
statement. The [@isGroup=Statement]
pseudo-prop declares that every
rule that is referenced as one of the choices of this rule should be
tagged with a group prop that marks it as a
statement. We could also have added those props to the individual
rules, but @isGroup
tends to be more succinct.
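Code that works with the resulting tree can then test for the group. A minimal sketch, reusing the parser object from above:
let tree = parser.parse("foo();")
let stmt = tree.topNode.firstChild // an ExpressionStatement node
// type.is matches group names as well as node names, so this holds for
// every rule tagged through [@isGroup=Statement].
console.log(stmt.type.is("Statement")) // true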
Note that ExpressionStatement
uses semi
rather than ";"
. We'll
define that to accept either an actual semicolon, or an automatically
inserted one. Since automatically inserted semicolons are not regular
textual tokens, we will define that through an external tokenizer
later on.
semi { ";" | insertSemi }
Let's fill out some of the statement rules, starting with IfStatement
.
IfStatement {
kw<"if"> ParenthesizedExpression statement (!else kw<"else"> statement)?
}
The rule must use the !else
precedence to specify that an else
should
always be attached to the if
directly in front of it. Without it,
code like this would allow two parses, one attaching the else
to if (b)
(the correct one), and one attaching it to if (a)
(wrong).
if (a) {}
if (b) {}
else {}
Lezer complains about such ambiguity and requires you to add
precedence markers to resolve it.
The kw
rule is what we'll use for keywords. It takes a string and
specializes the Identifier
token for that string so that it acts
like a separate token.
kw<term> { @specialize[@name={term}]<Identifier, term> }
The @name
prop after @specialize
gives the newly defined token a
name that matches the keyword content, so that for example kw<"if">
shows up as a node called if
in the output tree.
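Parsing a small program makes this visible. The output below is a sketch of what the tree's toString rendering would look like for this grammar:
console.log(parser.parse("if (x) y;").toString())
// → Script(IfStatement(if,ParenthesizedExpression("(",Identifier,")"),
//          ExpressionStatement(Identifier,";")))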
ReturnStatement {
kw<"return"> (noSemi expression)? semi
}
The JavaScript standard defines return
syntax so that the optional
expression after return
only belongs to the return statement if an
automatic semicolon cannot be inserted between them. We will use
another external token to encode this constraint in the
grammar.
@external tokens noSemicolon from "./tokens.js" { noSemi }
noSemi
matches nothing, but is only generated when the token stream
is in a position where no semicolon can be inserted and a noSemi
token can be shifted.
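This constraint is behind the classic JavaScript pitfall where a newline after return cuts the statement short:
function f() { return 1 }  // returns 1
function g() {
  return  // a semicolon is inserted here, so noSemi cannot match…
    1     // …and this becomes a separate expression statement
}         // g returns undefined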
ForStatement {
kw<"for"> (ForSpec | ForInSpec) statement
}
ForSpec {
"("
(VariableDeclaration | expression ";" | ";") expression? ";" expression?
")"
}
ForInSpec {
"("
((kw<"let"> | kw<"var"> | kw<"const">) pattern | Identifier)
!forIn kw<"in"> expression
")"
}
The rules for the for
statement are mostly straightforward. The official JavaScript
grammar parameterizes most expression-related rules with a flag that
indicates whether they are allowed to match the in
operator, in
order to make the for
/in
syntax work in such a way that the in
is interpreted as part of the for
spec in this case, and as a binary
operator otherwise. Here, we can just use an explicit precedence
(!forIn
) to get that same effect.
Next come the definitions for declarations.
FunctionDeclaration {
!statement kw<"function"> Identifier ParamList Block
}
ParamList {
"(" commaSep<"..."? pattern ("=" expression)?> ")"
}
The !statement
precedence has a @cut
annotation, which means that
when a parse moves past that marker, only the rule with the marker,
and not other rules that may match the current input, is kept in the
parse state. This is used here to implement the way JavaScript's
function
keyword always starts a function definition when at the
start of a statement, despite function expressions using the same
keyword, and expressions being allowed in statement positions (via
ExpressionStatement
).
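For example:
function f() {}        // statement position: FunctionDeclaration wins via @cut
let g = function() {}  // expression position: FunctionExpression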
The parameter list uses the commaSep
rule template, which matches a
comma-separated list of its argument expression, and is defined like
this.
commaSep<content> {
(content ("," content)*)?
}
commaSep1<content> {
content ("," content)*
}
commaSep1
is useful for situations where the comma-separated
expression must occur at least once, such as variable declaration
lists.
VariableDeclaration {
(kw<"let"> | kw<"var"> | kw<"const">)
commaSep1<pattern ("=" expression)?> semi
}
Patterns are things that can be assigned to—variable names,
destructured arrays, and destructured objects.
The rule for Block
(a block of statements wrapped in braces) again
uses the !statement
marker to make it override the object expression
interpretation of the opening brace when the block interpretation is
valid.
Block {
!statement "{" statement* "}"
}
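For example:
{}    // at statement position, this is an (empty) Block
({})  // behind a parenthesis, the same braces form an ObjectExpression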
Expressions, like statements, are defined with a generic expression
rule that contains a big choice of the various kinds of expressions
that may occur. It uses @isGroup
to tag all of these choices with a
group prop.
expression[@isGroup=Expression] {
Number |
String |
TemplateString |
Identifier ~arrow |
@specialize[@name=BooleanLiteral]<Identifier, "true" | "false"> |
kw<"this"> |
kw<"null"> |
kw<"super"> |
RegExp |
ArrayExpression {
"[" commaSep1<"..."? expression | ""> ~destructure "]"
} |
ObjectExpression {
"{" commaSep<Property> ~destructure "}"
} |
NewExpression {
kw<"new"> expression (!newArgs ArgList)?
} |
UnaryExpression |
ParenthesizedExpression |
FunctionExpression {
kw<"function"> Identifier? ParamList Block
} |
ArrowFunction {
(ParamList { Identifier ~arrow } | ParamList) "=>" (Block | expression)
} |
MemberExpression |
BinaryExpression |
ConditionalExpression {
expression !ternary LogicOp<"?"> expression LogicOp<":"> expression
} |
AssignmentExpression |
CallExpression {
expression !call ArgList
}
}
There are various ambiguity markers (~arrow
and ~destructure
) in
these rules. Some of the JavaScript syntax introduced in ES2015 cannot
be disambiguated from other syntax by a plain LR parser. For example,
when seeing a parenthesized identifier, that might be just a variable,
or it might be the argument list for an arrow function. When seeing an
array of identifiers, that might be just an array, or the start of a
destructuring assignment.
(x) + 1
(x) => x - 1
[a, b, c].join()
[a, b, c] = something()
Our parser uses Lezer's support for GLR parsing, where it runs
multiple different parses alongside each other until the ambiguity
goes away, to handle these cases. The ambiguity markers indicate the
places where this kind of splitting is allowed. ~destructure
will
also occur in the rules for patterns later on.
The regular precedence markers !call
and !newArgs
are used to give
these expression types a well-defined precedence compared to other
expression types. That is, !a()
should parse as a call to a
which
is then negated, not a call to !a
, and arguments to new
should be
parsed with higher precedence than regular calls.
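For example:
!a()       // parses as !(a()): the call binds tighter than the negation
new X().m  // the ArgList belongs to the new-expression, i.e. (new X()).m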
The anonymous rule ParamList { Identifier ~arrow }
simply wraps an
identifier that is used as a parameter to an arrow function in a
ParamList
node. Note that anonymous nodes may share their name with
other nodes, since they cannot be referred to by name anyway.
ParenthesizedExpression { "(" expression ")" }
ArgList { "(" commaSep<"..."? expression> ")" }
This grammar uses two different tokens, defined in exactly the same
way, for identifiers and property names. Because these do not occur in
the same places in the grammar, Lezer's contextual tokenization will
make sure that the appropriate one gets picked.
The reason for this is that when you specialize a token (as the kw
rule does), that specialization will take effect everywhere (the token
is simply replaced when it matches the specializer string). But
JavaScript property names may be keyword names, so we do not want that
specialization for the property name tokens.
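For example, keywords are perfectly valid property names in JavaScript:
let obj = {if: 1, for: 2}   // "if" and "for" are plain property names here…
obj.new = obj.if + obj.for  // …so the property-name token must stay unspecialized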
propName { PropertyName | "[" expression "]" | Number | String }
Property {
(propKw<"get"> | propKw<"set">)? propName ParamList Block |
propName ~destructure (":" expression)? |
"..." expression
}
The propKw
rule is like kw
in that it defines keywords, but get
and set
in JavaScript are contextual keywords—they can also be
regular property names, but define getters and setters if they are
followed by some kind of property name.
propKw<term> { @extend[@name={term}]<PropertyName, term> }
Thus, we use GLR parsing again, this time through the @extend
feature, which is similar to @specialize
, except that it allows both
the plain token and the specialized token to be used, splitting the
parse into two possibilities when both can be parsed at that point,
somewhat like ~
markers do.
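For example:
let a = {get: 1}               // "get" is an ordinary property name here…
let b = {get x() { return 2 }} // …but marks a getter here, so both parses are tried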
UnaryExpression {
(kw<"void"> | kw<"typeof"> | kw<"delete"> | LogicOp<"!"> | ArithOp<"+" | "-">)
expression
}
BinaryExpression {
expression !times (ArithOp<"/"> | ArithOp<"%"> | ArithOp<"*">) expression |
expression !plus ArithOp<"+" | "-"> expression |
expression !rel CompareOp expression
}
Operator parsing uses precedence markers again to set the relative
precedence of the various binary operators.
The LogicOp
and ArithOp
tokens just wrap the token expression they
are given as an argument in a named token, which makes it easier to
assign a highlighting style to the various different operator tokens.
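For instance, a highlighter could attach styles by node name. A sketch using @lezer/highlight, whose result would be handed to the parser through its props configuration:
import {styleTags, tags} from "@lezer/highlight"
// Map the named operator tokens to standard highlighting tags.
const highlighting = styleTags({
  ArithOp: tags.arithmeticOperator,
  LogicOp: tags.logicOperator,
  CompareOp: tags.compareOperator
})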
Assignment operators work similarly, but put some restrictions on the
kind of thing that can appear to the left of them. The first choice
parses update operators like +=
, the second handles regular
assignment with =
.
AssignmentExpression {
(Identifier | MemberExpression) !assign UpdateOp expression |
(MemberExpression | pattern) !assign "=" expression
}
MemberExpression {
expression !member ("." PropertyName | "[" expression "]")
}
Patterns can be identifiers, destructured arrays, or destructured
objects. The ~destructure
markers are there to allow the ambiguity
between expressions and patterns.
pattern {
Identifier ~arrow |
ArrayPattern {
"[" commaSep<("..."? pattern ("=" expression)?)?> ~destructure "]"
} |
ObjectPattern {
"{" commaSep<PatternProperty> ~destructure "}"
}
}
PatternProperty {
("..." pattern | propName ":" pattern | PropertyName) ("=" expression)?
}
This @skip
declaration indicates that whitespace and comments may be
skipped anywhere in the grammar. Newlines are their own token, so that
we can track them for the purpose of automatic semicolon insertion.
@skip { spaces | newline | LineComment | BlockComment }
That tracking is done with a context, which is a value that's kept
alongside the parse, and updated whenever a token is shifted. Contexts
can be used to do not-quite-context-free things like tracking
indentation or the set of open tags in an HTML document. In this case,
it just tracks a single boolean that indicates whether we saw a
newline since the last non-skipped token.
@context trackNewline from "./tokens.js"
This tracker is implemented in JavaScript like this. Note that the
export name matches the name given in the @context
declaration.
import {ContextTracker} from "@lezer/lr"
import {spaces, newline, BlockComment, LineComment} from "./javascript.grammar.terms"
export const trackNewline = new ContextTracker({
  start: false,
  // Skippable tokens leave the context alone. For any other token, the
  // context becomes true exactly when that token is a newline.
  shift(context, term) {
    return term == LineComment || term == BlockComment || term == spaces
      ? context : term == newline
  },
  strict: false
})
When you run lezer-generator
on a grammar file, it generates both a
file with the parse tables, and a file with constants for the IDs of
the tokens defined in the grammar. Because this context tracker needs
to know the IDs of some tokens, it imports them from the terms file.
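That terms file is simply a generated module full of numeric constants, along these lines (the actual ID values depend on the grammar):
// javascript.grammar.terms (generated; IDs here are made up)
export const
  LineComment = 1,
  BlockComment = 2,
  spaces = 3,
  newline = 4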
Template strings allow interpolations inside of them, so they must be
parsed piece-by-piece rather than as a single token. But inside of
them, comments and whitespace must not be treated specially. So we
define them in a @skip {}
block that indicates that the global skip
rules are turned off for this rule.
@skip {} {
TemplateString {
"`" (templateEscape | templateContent | Interpolation)* templateEnd
}
}
Interpolation { InterpolationStart expression "}" }
@local tokens {
InterpolationStart[@name="${"] { "${" }
templateEnd { "`" }
templateEscape { "\\" _ }
@else templateContent
}
Defining the kind of token structure where a few specific things are
handled specially and everything else is lumped into a generic 'other'
token is best done with a @local tokens
definition. In this case, we are interested in backslash escapes,
interpolation starts (${), and the end of the string. The @else
token, templateContent, will be generated for all stretches of input
that don't match the other tokens in the block.
Block comments can be matched as a single token, but since they can
be gigantic, and incremental parsing doesn't work on a single huge
token, it is a good idea to define them like this, where each line of
the comment becomes its own token.
@skip {} {
BlockComment { "/*" (blockCommentContent | blockCommentNewline)* blockCommentEnd }
}
@local tokens {
blockCommentEnd { "*/" }
blockCommentNewline { "\n" }
@else blockCommentContent
}
That brings us to the @tokens
block.
@tokens {
Whitespace and line comments are straightforward.
spaces[@export] { $[\u0009 \u000b\u00a0]+ }
newline[@export] { $[\r\n] }
LineComment { "//" ![\n]* }
@precedence { "/*", LineComment, ArithOp<"/"> }
@precedence { "/*", LineComment, RegExp }
We have to explicitly say that it is okay for comment, regexp, and
division tokens to all start with a slash, and that comment tokens
should take precedence.
identifierChar { @asciiLetter | $[_$\u{a1}-\u{10ffff}] }
Identifier { identifierChar (identifierChar | @digit)* }
PropertyName { Identifier }
As mentioned earlier, we define Identifier
and PropertyName
as
separate tokens, so that they can be specialized differently.
The number token definition is somewhat messy, due to the various
formats that numbers can have.
hex { @digit | $[a-fA-F] }
Number {
(@digit ("_" | @digit)* ("." ("_" | @digit)*)? | "." @digit ("_" | @digit)*)
(("e" | "E") ("+" | "-")? ("_" | @digit)+)? |
@digit ("_" | @digit)* "n" |
"0x" (hex | "_")+ "n"? |
"0b" $[01_]+ "n"? |
"0o" $[0-7_]+ "n"?
}
@precedence { Number "." }
If you think it is relevant to users of your syntax tree, you can of
course also define different token types for the various number
notations.
Plain strings in JavaScript are not too hard to parse, if you just
assume every character after a backslash cannot end the string.
String {
'"' (![\\\n"] | "\\" _)* '"'? |
"'" (![\\\n'] | "\\" _)* "'"?
}
Note that this rule makes the closing quote optional. This isn't how
the actual language works, but it can be helpful to have the parser
tokenize unfinished strings (since JavaScript strings cannot continue
across lines anyway).
Next come tokens for the operators, some parameterized, some
hard-coded.
ArithOp<expr> { expr }
LogicOp<expr> { expr }
UpdateOp { $[+\-/%*] "=" }
CompareOp { ("<" | ">" | "==" | "!=") "="? }
Regular expressions are a bit involved to tokenize, due to the fact that
unescaped slashes may occur inside them if they are wrapped in
brackets.
RegExp { "/" (![/\\\n[] | "\\" ![\n] | "[" (![\n\\\]] | "\\" ![\n])* "]")+ ("/" $[gimsuy]*)? }
But note how, despite the token-level ambiguity between division
operators and regular expressions in JavaScript, we had to do nothing
about that here. Since there are no parse positions where a regexp and
a division operator are both valid, Lezer automatically makes the
tokens contextual and reads the appropriate one depending on the parse
position.
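For example:
a / b / c        // both slashes are read as ArithOp<"/"> tokens
/a[/]b/.test(x)  // here they delimit a RegExp (note the bracketed slash)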
These tokens are mentioned simply as quoted strings in the grammar,
but should appear in the output tree anyway.
"=" "..." "=>"
"(" ")" "[" "]" "{" "}"
"." "," ";" ":"
}
Because other tokens should take precedence over inserted semicolons
(JavaScript only inserts semicolons when it can't otherwise continue
its parse), the declaration for this external tokenizer has to appear
after the other tokenizers.
@external tokens insertSemicolon from "./tokens.js" { insertSemi }
Our external tokenizers are defined like this:
import {ExternalTokenizer} from "@lezer/lr"
import {insertSemi, noSemi} from "./javascript.grammar.terms"
const space = [9, 10, 11, 12, 13, 32, 133, 160]
const braceR = 125, semicolon = 59, slash = 47, star = 42, plus = 43, minus = 45
// Produce an insertSemi token when the position allows automatic semicolon
// insertion: before "}", at the end of input, or (via the context) after a
// newline.
export const insertSemicolon = new ExternalTokenizer((input, stack) => {
  let {next} = input
  if (next == braceR || next == -1 || stack.context)
    input.acceptToken(insertSemi)
}, {contextual: true, fallback: true})
// Produce a (zero-length) noSemi token only when no semicolon could be
// inserted here. Bail out when a space or the start of a comment follows,
// since a newline might still be hiding in that skipped content.
export const noSemicolon = new ExternalTokenizer((input, stack) => {
  let {next} = input, after
  if (space.indexOf(next) > -1) return
  if (next == slash && ((after = input.peek(1)) == slash || after == star)) return
  if (next != braceR && next != semicolon && next != -1 && !stack.context)
    input.acceptToken(noSemi)
}, {contextual: true})
The language rules are that a semicolon can only be inserted before a
}
character, at the end of the file (next == -1
), or if there was
a newline before the current token. Since we are using a context to
track whether a newline was seen after the last token, stack.context
holds the (boolean) value produced by that context.
The tokenizer for noSemi
needs to check that no whitespace or comment starts at the current
position, since whether a newline follows can only be determined after
such skipped tokens have been consumed.
That concludes our exercise of parsing half of JavaScript in Lezer.
You can use lezer-generator
(from the @lezer/generator package) to
compile the grammar. Or use Rollup with the Lezer plugin to build a
self-contained script file (that exports the parser as parser
) like
this:
rollup -p @lezer/generator/rollup -e @lezer/lr javascript.grammar
Or see the setup example for a more general description
of how to set up a parser project.
The full JavaScript (+ TypeScript and JSX) grammar can be found on
GitHub.