Indentation Example

In this example, we build a parser for a small indentation-sensitive language. Note that the approach shown here is no one-size-fits-all technique—the details of how indentation-based languages handle things like commend and empty lines differ, and you'll have to adjust your parsing approach solution to fit the language.

The language we implement here looks somewhat like Sass's indented syntax. It just supports words that can be nested with indentation, and line comments.

Word

Section
  Subsection
    # Comment
    Content
    More # Comment 2

  Etc

When a line is indented more than the current block, it starts a new block. When indentation becomes smaller again, the current block is ended. Commented or blank lines do not influence block structure.

The general approach to this kind of syntax in Lezer is to define a context that tracks the current indentation, and have an external tokenizer emit tokens at the points where indentation is added or removed. These tokens can then be used as the start and end of blocks.

Such languages tend to treat newlines as significant tokens as well, following a less free-form line structure than languages that just ignore all whitespace between tokens.

The grammar itself is really simple.

@top Tree { element* }

element {
  Atom { Identifier lineEnd } |
  Section { Identifier lineEnd Block }
}

A document is a sequence of elements, and an element can either be an atom (an identifier without an indented block after it) or a section.

Newlines are explicitly mentioned, and not skipped.

lineEnd { newline | eof }

This rule matches either a line break or the end of the file, so that input that doesn't end in a blank line also parses properly.

Block { indent element+ (dedent | eof) }

Blocks start with an increase in indentation, and end when a line that is dedented beyond their indentation level is found, or at the end of the document.

@skip {
  spaces |
  Comment |
  blankLineStart (spaces | Comment)* lineEnd
}

Beyond spaces and comments, the @skip declaration also includes empty lines. We'll use an external tokenizer to detect, at the start of a line, whether the line is empty, so that we can emit the special blankLineStart token.

As mentioned, we need a context to track indentation levels.

@context trackIndent from "./tokens.js"

The context value is an object that forms a linked list of indentation levels. It does some bit mixing to create a hash from its indentation level and the hash of the parent level.

import {ContextTracker} from "@lezer/lr"
import {indent, dedent} from "./indent.grammar.terms"

class IndentLevel {
  constructor(parent, depth) {
    this.parent = parent
    this.depth = depth
    this.hash = (parent ? parent.hash + parent.hash << 8 : 0) + depth + (depth << 4)
  }
}

export const trackIndent = new ContextTracker({
  start: new IndentLevel(null, 0),
  shift(context, term, stack, input) {
    if (term == indent) return new IndentLevel(context, stack.pos - input.pos)
    if (term == dedent) return context.parent
    return context
  },
  hash: context => context.hash
})

The context tracking relies on the external tokenizer to notice indentation and dedentation and emit the proper tokens. indent tokens cover the indentation text (so that the context tracker can easily derive the depth from the token size). dedent tokens are zero-length.

@external tokens indentation from "./tokens.js" {
  indent
  dedent
  blankLineStart
}

Since both the indentation tokens and the blankLineStart token need to act at the start of lines and need to scan through indentation, they are put in the same external tokenizer function.

import {ExternalTokenizer} from "@lezer/lr"
import {blankLineStart} from "./indent.grammar.terms"

const newline = 10, space = 32, tab = 9, hash = 35

export const indentation = new ExternalTokenizer((input, stack) => {
  let prev = input.peek(-1)
  if (prev != -1 && prev != newline) return
  let spaces = 0
  while (input.next == space || input.next == tab) { input.advance(); spaces++ }
  if ((input.next == newline || input.next == hash) && stack.canShift(blankLineStart)) {
    input.acceptToken(blankLineStart, -spaces)
  } else if (spaces > stack.context.depth) {
    input.acceptToken(indent)
  } else if (spaces < stack.context.depth) {
    input.acceptToken(dedent, -spaces)
  }
})

If the character after the indentation is a hash (comment) or a line break, the line is empty. blankLineStart is again a zero-length token. Such tokens must be used with care—it is easy to get into an infinite loop if your grammar continues to consume them and your tokenizer continues to generate them.

In this case, we make sure to only emit blankLineStart if the stack can currently shift it. That means that it has not already entered the @skip expression for blank lines. That skip expression always matches something (a line end), so it can't land us in an infinite loop.

Similarly, dedent tokens are only emitted as long as the indent context indicates there is still indentation, which provides a limit on how many of those can be emitted (since each one will remove an indentation level).

Finally, these are the non-external tokens in the grammar.

@tokens {
  spaces { $[ \t]+ }
  newline { "\n" }
  eof { @eof }
  Comment { "#" ![\n]+ }
  Identifier { $[a-zA-Z0-9_]+ }
}

These are the full files for the code in this example:

Again, you can build them into a script that exports the parser with Rollup:

rollup -p @lezer/generator/rollup -e @lezer/lr indent.grammar