System Guide

This is the guide to the Lezer parser system. It provides a prose description of the system's functionality. For the item-by-item documentation of its interface, see the reference manual.

Overview

Lezer is a parser system written in JavaScript. Given a formal description of a grammar, it can produce a set of parse tables. Such tables provide a description that the parser system can use to efficiently construct a syntax tree for a given piece of text, describing the structure of the text in terms of the grammar.

The tables are generated by the lezer-generator tool, a build tool that takes a file in the format described later in this guide, and outputs a big, largely unreadable blob of JavaScript that represents the parse tables. This is something that happens offline, as part of the build process for a grammar package.

The lezer package provides the run-time parsing system. Combined with a parser built by lezer-generator, it gives you a parse method that takes a source file and returns a tree.

These trees, represented by data structures from the lezer-tree package, are more limited than the abstract syntax trees you might have seen in other contexts. They are not very abstract. Each node only stores a type, a start and end position, and a flat sequence of children. When writing a grammar, you choose which productions are stored as nodes—the others don't appear in the tree at all.

This means that the tree is very compact in memory and cheap to build. It does make doing refined analysis on it somewhat awkward. The use case guiding the design of this library is an editor system, which keeps a syntax tree of the edited document, and uses it for things like syntax highlighting and smart indentation.

To support this use case the parser has a few other interesting properties. It can be used incrementally, meaning it can efficiently re-parse a document that is slightly changed compared to some previous version given the parse for the old version. And it has error recovery built in, meaning that even if the input doesn't adhere to the grammar, it can still produce some reasonable syntax tree for it.

This system's approach is heavily influenced by tree-sitter, a similar system written in C and Rust, and several papers by Tim Wagner and Susan Graham on incremental parsing (1, 2).

Parser Algorithm

Lezer is based on LR parsing, an algorithm invented by Donald Knuth in 1965, which, by pre-analyzing the grammar, can derive fully deterministic (and thus efficient) parsers for some grammars.

Roughly, this algorithm abstractly interprets the grammar, recording the various states the parser can be in, and creating a table for each state mapping terminal symbols (tokens) to the action that should be taken when the parser sees that token in this state. If there's more than one action to take for a given token in a given state, the grammar can not be parsed with this algorithm. Such problems are usually called “shift-reduce” or “reduce-reduce” conflicts. More about that in a moment.

When writing grammars for LR-based tools, it can help to have a rough feel for this algorithm. The Wikipedia article on LR parsing is a good introduction. For a more in-depth treatment, I recommend Chapter 9 of this book (PDF).

Ambiguity

A lot of grammars are either impossible or extremely awkward to represent as a plain LR grammar. If an element's syntactic role only becomes clear later in the parse (for example JavaScript's (x = 1) + 1 versus (x = 1) => x, where (x = 1) can either be an expression or a parameter list), plain LR often isn't very practical.

GLR is an extension of the parsing algorithm that allows a parse to ‘split’ at an ambiguous point by applying more than one action for a given token. The alternative parses then continue beside each other. When a parse can't make any more progress, it is dropped. As long as at least one parse continues, we can get our syntax tree.

GLR can be extremely helpful when dealing with local ambiguities, but when applied naively it can easily lead to an explosion of concurrent parses and, when the grammar is actually ambiguous, multiple parses continuing indefinitely so that you're in effect parsing the document multiple times at once. This completely ruins the properties that made LR parsing useful: predictability and efficiency.

Lezer allows GLR parsing but requires you to explicitly annotate the places in your grammar where it is allowed to happen, so that you can use it to solve problems that were otherwise difficult, but it won't accidentally start happening all over your grammar.

The parse state splitting is optimized in Lezer, so that, though it is more expensive than just doing a linear parse, you can have ambiguities in commonly encountered constructs in your language and still have a fast parser.

Error Recovery

Though the parser has a strict mode, by default it'll proceed through any text, no matter how badly it fits the grammar, and come up with a tree at the end.

To do this it uses the GLR mechanism to try various recovery tricks (ignoring the current token or skipping ahead to a place that matches the current token) alongside each other to see which one, a few tokens later, gets the best result.

Ignored tokens are added to the tree, wrapped in an error node. Similarly, the place where the parser skipped ahead is marked with an error node.

Incremental Parsing

In order to avoid re-doing work, the parser allows you to provide a cache of nodes, which is a previous tree mapped to the current document with the applyChanges method. The parser will, when possible, reuse nodes from this cache rather than re-parsing the parts of the document they cover.
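A minimal sketch of what that might look like, assuming a parser module lang.js generated as described later in this guide, change descriptions of the shape {fromA, toA, fromB, toB}, and a parse option named cache (the option name and change shape are assumptions about the interface, not definitive):

import {parser} from "./lang.js"

let oldDocument = "(a+1)", newDocument = "(a+100)"
let tree = parser.parse(oldDocument)
// Map the old tree across the change (replacing "1" with "100"),
// then hand it to the parser as a cache of reusable nodes.
let cache = tree.applyChanges([{fromA: 3, toA: 4, fromB: 3, toB: 6}])
let newTree = parser.parse(newDocument, {cache})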

Because the syntax tree represents sequences of matches (specified in the grammar notation with the + and * operators) as balanced sub-trees, the cost of re-matching unchanged parts of the document is low, and you can quickly create a new tree even for a huge document.

This isn't bulletproof, though—even a tiny document change, if it changes the meaning of the stuff that comes after it, can require a big part of the document to be re-parsed. An example would be adding or removing a block comment opening marker.

Contextual Tokenization

In classical parsing techniques there is a strict separation between the tokenizer, the thing that splits the input into a sequence of atomic tokens, and the parser.

This separation can be problematic, though. Sometimes, the meaning of a piece of text depends on context, such as with the ambiguity between JavaScript's regular expression notation and its division operator syntax. At other times, a sub-language used in the grammar (say, the content of a string) has a different concept of what a token is than the rest of the grammar.

Lezer supports contextual token reading. It will allow the tokens you declare to overlap (to match the same input) as long as such tokens can't occur in the same place anywhere in the grammar.

You can also define external tokenizers, which cause the parser to call out to your code to read a token. Such tokenizers will, again, only be called when the tokens they produce apply at the current position.

Even white space, the type of tokens implicitly skipped by the parser, is contextual, and you can have different rules skip different things.

Grammar Nesting

In some cases, such as JavaScript embedded in HTML or code snippets inside a literal programming notation, you want to make another parser responsible for parsing pieces of a document. Lezer supports one form of this: If the nested syntax has a clearly identifiable end token, you can specify that anything up to that token should be parsed by another grammar.

This means that parse trees can contain node types from different grammars.

Writing a Grammar

Lezer's parser generator defines its own notation for grammars. You can take a look at the JavaScript grammar to see an example.

A grammar is a collection of rules, which define terms. Terms can be either tokens, in which case they directly match a piece of input text, or nonterminals, which match expressions made up of other terms. Both tokens and nonterminals are defined using similar syntax, but they are explicitly distinguished (tokens must appear in an @tokens block), and tokens are more limited in what they can match—for example, they can't contain arbitrary recursion.

Formally, tokens must match a regular language (roughly the thing that basic regular expressions can match), whereas nonterminals can express a context-free language.

Here is an example of a simple grammar:

@top { expression }

expression { name | number | binaryExpression }

binaryExpression:binary.expression { "(" expression ("+" | "-") expression ")" }

@tokens {
  name:variable.expression { std.asciiLetter+ }
  number:number.expression { std.digit+ }
}

This would match strings like (a+1) or (100-(2+4)). It doesn't allow white space between tokens, and the parentheses around binary expressions are required (without them, the parser generator would complain about ambiguity).

@top defines the entry point to the grammar. This is the rule that will be used to match the entire input. It'll usually contain something that repeats, like statement+.

Some of the term definitions in the example have a colon and a tag after the term name. This defines a tag for that term. By default, matched terms do not appear in the output tree; they only do so when you define a tag for them.

Operators

The grammar notation supports the repetition operators * (any number of repetitions of the thing before), + (one or more repetitions), and ? (optional, zero or one repetitions). These have high precedence and only apply to the thing directly in front of them.

The pipe | character is used to represent choice, matching either of the expressions on its sides. It can, of course, be repeated to express more than two choices (x | y | z). Choice has the lowest precedence of any operator, and, if there are no parentheses present, will apply to the entire expressions to its left and right.

Choice in context-free grammars is commutative, meaning that a | b is exactly equivalent to b | a.

Parentheses can be used for grouping, as in x (y | z)+.

Sequences of things are expressed by simply putting them next to each other. a b means a followed by b.
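For example, these hypothetical rules combine sequencing, grouping, choice, and repetition to match a possibly empty, comma-separated list of names or numbers between brackets:

list { "[" (element ("," element)*)? "]" }

element { name | number }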

Tokens

Named tokens are defined in a @tokens block. You can also, outside of the tokens block, use string literals like "+" as tokens. These automatically define a new token. String literals use the same escaping rules as in JavaScript.

String literals inside of token rules work differently. They can be combined with other expressions to form a bigger token. All token rules (and literal tokens) are compiled to a deterministic finite automaton, which can then be used to efficiently match them against a text stream.

So an expression like "a" "b" "c" in a nonterminal rule is a sequence of three tokens. In a token rule, it is exactly equivalent to the string "abc".

You can express character sets using square bracket notation, in a way similar to the square bracket notation in regular expressions. [.,] means either a period or a comma (there's no special meaning associated with . in this notation, or with escapes like \s). [a-z] matches a, z, and any character that, in the Unicode character code ordering, comes between them. To create an inverted character set, matching only characters not mentioned in the set, the first character after the opening bracket must be ^. So [^x] matches any character that is not x.
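For example, a typical identifier token could be defined with character sets like this:

@tokens {
  identifier:variable { [a-zA-Z_] [a-zA-Z_0-9]* }
}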

The parser generator defines a few built-in character sets under names that start with std., such as the std.asciiLetter, std.digit, and std.whitespace sets used in the examples in this guide.

Token rules cannot refer to nonterminal rules. But they can refer to each other, as long as the references don't form a non-tail recursive loop. That is, a rule x cannot, directly or indirectly, include a reference to x, unless that reference appears at the very end of the rule. References to tagged token rules from inside other tokens are valid, but will ignore the tag; only the tag of the token rule that was referenced from a nonterminal will be used.
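For example, a token rule can be built up from other (non-recursive) token rules:

@tokens {
  hexDigit { [0-9a-fA-F] }
  hexNumber:number { "0x" hexDigit+ }
}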

Skip Expressions

Our initial example does not allow any whitespace between the tokens. Almost all real languages define some kind of special tokens, usually covering spacing and comments, that may occur between the other tokens and are ignored when matching the grammar.

To support whitespace, you must add a @skip rule to your grammar.

@skip { space | comment }

You could define the space and comment tokens like this:

@tokens {
  space { std.whitespace+ }
  comment:comment { "//" [^\n]* }
  // ...
}

A skip rule will be matched zero or more times between other tokens. So the rule above also handles a long sequence of comments and whitespace.

Skipped tokens may be tagged (as comment is), in which case they will appear in the syntax tree. It is allowed for a skip expression to match things more complicated than single tokens.

When your grammar needs more than one kind of skipping (for example, when your strings aren't plain tokens but have their own internal structure, and you don't want whitespace and comments to be skipped between the pieces of a string), you can create a skip block like this:

@skip {} {
  string:string.expression {
    stringOpen (stringEscape | stringContent)* stringClose
  }
}

The initial braces contain the skip expression—in this case we want to skip nothing so it is empty—and the second pair of braces contain rules inside of which this skip expression is used.

Precedence

Let's go back to the binary operator rule in the example. If we define it like this, removing the parentheses, we get an error message.

binaryExpression:binary.expression { expression ("+" | "-") expression }

The error says "shift/reduce conflict" at expression "+" expression · "+". The · indicates the parse position, and the parser is telling us that, after reading what looks like a binary expression and seeing another operator, it doesn't know whether to first reduce the initial tokens to a binaryExpression or to first shift the second operator and leave the first for later. Basically, it doesn't know whether to parse x + y + z as (x + y) + z or x + (y + z).

This kind of issue can be solved without resorting to GLR parsing, fortunately. Doing so involves what is basically a crude hack on top of the LR parser generator algorithm: specify precedence to resolve these conflicts, so that when the parser generator runs into the ambiguity, we tell it to take one parse and discard the other.

This can be used not only for associativity (the error above), but also for operator precedence (giving groupings involving * precedence over groupings involving +, for example) and various other issues.

To specify precedence in Lezer, you first add an @precedence block that enumerates the precedence names you are going to use in the grammar, in order of precedence (highest first), and then insert precedence markers at ambiguous positions.

@precedence { times @left, plus @left }

@top { expression }

expression { number | binaryExpression }

binaryExpression:binary.expression {
  expression !times "*" expression |
  expression !plus "+" expression
}

@tokens {
  number:number.expression { std.digit+ }
}

This tells the parser generator that the !times precedence marker has a higher precedence than !plus, and that both are left associative (preferring (a + b) + c to a + (b + c)). You can also specify @right after the precedence name to make it right associative, or leave off the associativity to not specify any (in which case the precedence won't resolve shift/reduce conflicts with itself).

The ! markers have to be inserted at the point where the conflict occurs. In this case, the conflict happens directly in front of the operators, so that's where the markers are placed.

This grammar can now correctly parse things like 1+2*3+4 into nodes that group the operators by precedence, as in (1+(2*3))+4.

It is also possible, instead of specifying an associativity for a given precedence, to make it a cut operator by using the keyword @cut. A cut operator will override other interpretations even when no conflict has been detected yet. An example of where this is appropriate is the way JavaScript's function keyword has a different meaning in statement position. A statement can start with an expression, and an expression can be a function expression, but when we see the function token at the start of a statement, we only want to enter the function declaration statement rule, not the function expression rule.

@precedence { decl @cut }

@top { statement+ }

statement { functionDeclaration | functionExpression }

functionExpression { "function" "..." }

functionDeclaration { !decl "function" "..." }

This will parse function... as a functionDeclaration, though it would also match functionExpression (which could be used elsewhere in the grammar, if it was a real grammar).

Allowing Ambiguity

Still, some things can not be resolved with precedence declarations. To explicitly allow Lezer to try multiple actions at a given point, you can use ambiguity markers.

@top { (goodStatement | badStatement)+ }

goodStatement:good.statement { "(" goodValue ")" ";" }

goodValue:good.value { "val" ~statement }

badStatement:bad.statement { "(" badValue ")" "!" }

badValue:bad.value { "val" ~statement }

This grammar is entirely nonsensical, but it shows the problem: in order to know whether to match goodValue or badValue, the parser has to decide whether it is parsing a goodStatement or a badStatement. But at the time where it has to do that, the next token is a closing parenthesis, which doesn't yet tell it what it needs to know.

The parser generator will complain about that (it is a “reduce/reduce” conflict) unless we add the ~statement ambiguity markers at the point where the reductions happen. The word (statement) doesn't have any special meaning, beyond that it has to be the same for all the conflicting positions, so that the parser generator knows that we are explicitly annotating these two positions as ambiguous.

With the markers, the grammar compiles, and whenever the parser gets to the ambiguous position, it'll split its parse state, try both approaches, and then drop off one path when it sees that it fails to match on the next token.

Template Rules

It is possible to define parameterized rules, which can help avoid repetitive grammar rules. Such rules act as templates—they are copied for each set of parameters they are given.

Lezer uses angle brackets around parameters and arguments. You can define a parameterized rule like this:

commaSep<content> { "" | content ("," content)* }

(The empty string literal is treated as the empty sequence, or ε, matching nothing.)

After having defined the above rule, you can do commaSep<expression> to match a comma-separated list of expressions.
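For example, assuming an expression rule like the one defined earlier, an argument list could be written as:

argumentList { "(" commaSep<expression> ")" }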

When you pass arguments to a rule defined in the @tokens block, those arguments will be interpreted as if they are part of a token rule, even when you call it from a nonterminal rule.

Token Precedence and Specialization

By default, tokens are only allowed to overlap (match some prefix of each other) when they either do not occur in the same place of the grammar, or are both simple tokens, defined without the use of repeat operators. For example, you can have tokens "+=" and "+", but not tokens "+" "="+ and "+".

To explicitly specify that, in such a case, one of the tokens takes precedence, you can add @precedence declarations to your @tokens block.

@tokens {
  divide { "/" }
  comment { "//" [^\n]* }
  @precedence { comment, divide }
}

By default, since divide is a prefix of comment, these would be considered overlapping. The precedence declaration states that comment overrides divide when both match. You can have multiple @precedence declarations in your @tokens block, and each declaration can contain any number of tokens, breaking ties between those tokens.

You could try to handle things like keywords, which overlap with identifiers, by specifying that they take precedence. But this is awkward, because if you define a keyword as matching "new" and having a higher precedence than identifiers, the input newest would be tokenized as the keyword "new" followed by the identifier "est".

In addition, having dozens of keywords in your tokenizer automaton greatly increases its size and complexity. What many hand-written tokenizers do is first match an identifier, and then check whether its content corresponds to any keyword.

This is supported in Lezer with the @specialize operator. This operator declares a new token given a base token and a string literal. When specializations have been defined for a given token type, its specialize table will be consulted every time it is read, and if its content matches a specialization, it is replaced by the specialized token.

newExpression:new.expression { @specialize<identifier, "new"> expression }

The newExpression rule will match things like new Array. @specialize takes an optional third argument, which must be a :-prefixed tag name, to assign a tag to the specialized token. It is often useful to define a helper rule like this to easily define keywords.

kw<word> { @specialize<identifier, word, :$word.keyword> }

Inside a tag, $ can be used to insert a parameter's content into the tag, so that the tag for kw<"new"> would be new.keyword. This only works if that parameter is a rule name, literal, or other tag.

There is another operator similar to @specialize, called @extend. Whereas specialized tokens replace the original token, extended tokens allow both meanings to take effect, implicitly enabling GLR when both apply. This can be useful for contextual keywords where it isn't clear whether they should be treated as an identifier or a keyword until a few tokens later.
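Assuming @extend accepts the same arguments as @specialize (a base token, a literal, and an optional tag), a helper for such contextual keywords might look like this:

ckw<word> { @extend<identifier, word, :$word.keyword> }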

Tags

In some cases, like functionExpression:function.expression, the way tags are specified may feel a bit redundant. Why not just use the term name?

Tags are an attempt to make syntax trees usable even to code that knows little about the grammar. Real grammars will often use detailed tags like template.string.literal.expression that are too cumbersome to use to refer to the rules.

Tags are structured as a sequence of parts, separated by dot characters. These parts are roughly ordered from most to least specific, so variable.declaration.statement would count as a statement, a declaration statement, and a variable declaration statement. (I say roughly because some of the information you might want to store in tags doesn't have an obvious hierarchical relation to the rest, and it is okay to just tack it onto the end in that case.)

Being meticulous about assigning tags can be a great help to users of your grammar's output. See the tag guide (FIXME) for some guidelines.

Parts that aren't plain ASCII words can be escaped with double quotes using JSON escaping rules. It is possible for a part to have a value, written lang=javascript.

It is possible to assign a tag to an expression inline, without explicitly wrapping it into a rule. This can be useful for tagged things that are used only once.

statement {
  (kw<"if"> expression statement):if.statement |
  (kw<"return"> expression):return.statement |
  expression:expression.statement
}

Such an inline tag is written with a colon and a tag after the target expression.

The grammar syntax also supports tag literals, which are written simply :my.tag. These are only useful as arguments to operators like @specialize, or to user-defined templates that internally use them to assign a tag to something.

Tag declarations

Literal tokens will, by default, be untagged. But it is possible to explicitly add tags to them with a @tags block.

@tags {
  "(":paren.open.punctuation
  ")":paren.close.punctuation
  string:string.literal
}

Such a block can also assign tags to named terms, as with string above, if you prefer to specify all your tags in one place.

If you want to add some parts to all tags defined in your grammar, you can include an @all<:my.tag> declaration inside a @tags block. This is often used for a lang= part, so that nodes from your grammar can be recognized as belonging to that grammar.

Since it is often useful to add .punctuation tags to a bunch of single-character literals, and those tags are usually the same across grammars, lezer-generator has a shortcut where it accepts an @punctuation declaration inside @tags to quickly define these standard punctuation names.

@tags {
  @all<:lang=xml>
  @punctuation<"()[]{},">
}

It can be useful for syntax tree consumers, especially when they are a text editor, to know when a node type is delimited by a fixed pair of tokens, for example "{" and "}" for block nodes. For this, the convention is to use a delim="{ }" tag part (where the value of delim holds the delimiters with a space between them). If you add an @infer-delim declaration in your @tags block, lezer-generator will automatically add a delim part to tagged rules whose definitions appear to contain matching delimiters (their first and last tokens contain mirrored bracket-style characters).
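A sketch of what that might look like, with a hypothetical block rule whose braces would be picked up as delimiters:

@tags {
  @infer-delim
}

block:block.statement { "{" statement* "}" }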

External Tokens

The regular way to define tokens is usually enough, and the most convenient. But regular languages are notoriously limited in what they can match, so some things, like the multiple-character look-ahead needed to recognize the end of a comment, are terribly awkward to express inside the grammar. And something like JavaScript's automatic semicolon insertion, which requires checking the content of preceding whitespace and looking at the character ahead, is just impossible to express like that.

Thus, Lezer allows you to declare some tokens as being external, read by pieces of JavaScript code that you provide. An external tokenizer might look like this:

@external-tokens insertSemicolon from "./tokens" { insertedSemicolon }

This tells lezer-generator that it should import the insertSemicolon symbol from "./tokens", and use that as a tokenizer which might yield the insertedSemicolon token.

Such a tokenizer should be an ExternalTokenizer instance. It will only be called when the current state allows one of the tokens it defines to be matched.

The order in which external token declarations and the @tokens block appear in the grammar file is significant. Those that appear earlier will take precedence over those that appear later, meaning that if they return a token, the others will not be tried.

You can assign tags to external tokens by writing a colon after their name in the @external-tokens block.
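For example, extending the declaration above with a hypothetical tag:

@external-tokens insertSemicolon from "./tokens" { insertedSemicolon:semicolon.punctuation }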

Nested Grammars

Similar to external tokenizers, it is possible to pull in an external grammar to use it to parse bits of the input.

@external-grammar parser as javascript from "lezer-javascript"

scriptBlock { "{{" nest.javascript "}}" }

This defines scriptBlock to, after the "{{" token has been matched, scan ahead for an "}}" token and then continue parsing the text in between using the parser export from "lezer-javascript".

You can leave off the as part when the local name for the grammar, the thing you write after nest, is the same as the exported name. It is also possible to leave off the from part, in which case the content of the block will simply not be parsed, unless a parser for javascript is later registered with the withNested method.

The value of a nested grammar can be either null (don't parse), a Lezer Parser instance, or a function (see the docs for its signature) that delays the decision on how to parse the content to run-time.
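When the from part is left off, registering the nested parser at run time might look roughly like this (a sketch, assuming withNested takes an object mapping grammar names to parser values and returns a new parser):

import {parser} from "./lang.js"
import {parser as javascript} from "lezer-javascript"

// Fill in the javascript nested grammar that the grammar file left unspecified.
let parserWithJS = parser.withNested({javascript})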

You can pass up to three arguments to nest.something to further specify the way the region should be parsed. The first argument may be a tag to assign to the node that wraps the nested region. The second may be an explicit end token. By default, the token occurring directly after the nest expression is taken as the token that ends the region, but sometimes that isn't what you want. Finally, a third argument may provide a default expression to parse, instead of entering the nested grammar, in case the nested grammar is a dynamic function and it returns {stay: true} (declines to enter the nesting).

Building a Grammar

The lezer-generator package comes with a command-line tool lezer-generator that converts a grammar file into a JavaScript module that can be loaded to get access to the parser interface.

(This functionality is also available through a programming interface.)

For example, we could run the tool as

lezer-generator lang.grammar -o lang.js

If there are no problems with the grammar specified in lang.grammar, this will write two files, lang.js and lang.terms.js.

The first is the main grammar file. It depends on "lezer" and on any files specified in external token or nested grammar declarations. It exports a binding parser that holds a Parser instance.

The terms file contains, for every named term in the grammar, an export with the same name as the term that holds the term's ID. When you define an external tokenizer, you'll usually need to import the token IDs for the tokens you want to recognize from this file. (Since there'll be a lot of definitions in the file for regular-sized grammars, you might want to use a tree-shaking bundler to remove the ones you do not need.)

lezer-generator supports a number of further command-line options; see the reference manual for the full list.

Running the Parser

The simplest way to parse a file is to call Parser.parse on the parser you import from the generated file. This will always succeed and return a Tree instance.
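Assuming the lang.js module built in the previous section, that could look something like this:

import {parser} from "./lang.js"

let tree = parser.parse("(100-(2+4))")
console.log(tree.toString())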

Sometimes you want to take advantage of the fact that LR parsing happens in discrete steps, and have control over when the steps are performed. You can create a ParseContext with the startParse method, which is an object that represents an in-progress parse, on which you can repeatedly call advance to perform the next action, until you decide you have parsed enough or the method returns a tree to indicate that it has finished parsing.
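A sketch of such a loop (assuming advance returns null until the parse finishes):

let context = parser.startParse("(100-(2+4))")
let tree = null
while (!tree) {
  // Each call performs a single parse action. Other work could be
  // scheduled in between calls when parsing a large document.
  tree = context.advance()
}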

Such a parse context holds a collection of Stack instances, each of which represents an actual single parse state. When the parse is unambiguous, there'll be one stack. When it is trying out multiple options in parallel, there'll be more than one. You don't usually need to concern yourself with these, but an external tokenizer or nested grammar resolution function gets access to a stack and can query it to, for example, ask if the current state accepts a given token or where the rule that is currently being matched starts.

Parsing consumes an InputStream, which abstracts access to a string-like data structure. You may simply pass a string to parse, in which case the library will do the wrapping itself, but if you have your data in some kind of data structure more involved than a string, you'll need to write your own class that implements InputStream, so that Lezer can read directly from your data structure, without copying its content into a flat string.

External tokenizers get access to this stream, as well as to a Token object that should be filled with information about the token. When an external tokenizer is called, only the start property of the token has been filled in, and should be used by the tokenizer to decide where to start reading. When it recognizes a token, it should set the value (token term ID) and end (end offset) properties (possibly by using the accept method).
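Putting this together, a skeletal version of the insertSemicolon tokenizer declared in the External Tokens section might look roughly like this. This is only a sketch: the newline check stands in for real semicolon-insertion logic, and it assumes that input.get returns the character code at a given position.

import {ExternalTokenizer} from "lezer"
import {insertedSemicolon} from "./lang.terms.js"

const newline = 10 // character code of "\n"

export const insertSemicolon = new ExternalTokenizer((input, token, stack) => {
  // token.start is where the token should start. As a stand-in for real
  // automatic-semicolon-insertion rules, report a zero-length token
  // whenever the previous character is a newline.
  if (token.start > 0 && input.get(token.start - 1) == newline)
    token.accept(insertedSemicolon, token.start)
})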

Working with Trees

FIXME To be added later