Tomas Vik

Example of whitespace-sensitive TreeSitter grammar

In this short post, I’ll explain how to parse nested single-line markdown blocks into an AST using TreeSitter. The whole grammar is 18 lines of JavaScript and 127 lines of C.

The final grammar will be able to parse this text:

- one
	- two
		- three
	- four
- five

into this AST:

(document
  (block (block_start) (block_content)
    (block (block_start) (block_content)
      (block (block_start) (block_content)))
    (block (block_start) (block_content)))
  (block (block_start) (block_content)))

This post assumes you know how to set up a TreeSitter project and are familiar with basic grammar syntax. I won’t explain the basic concepts in much detail here. For that, I can recommend Jonas Hietala’s Let’s create a Tree-sitter grammar.

Sidenote: I’m not an expert on programming language grammars or C. I wrote this summary mainly on the off chance that future me ever decides to pick up the grammar and extend it. Sorry if it’s too terse. My original goal was to write a grammar for Logseq files, but it was too much effort for the value. I spent eight hours writing this basic grammar and that’s only 10% of the effort needed for a full grammar.

Why are whitespace-sensitive grammars special?

The key difference between a free-form grammar (like C’s) and a whitespace-sensitive grammar (like Markdown’s or Python’s) is:

  • Free-form grammar: characters can be directly mapped to tokens (parentheses, semicolons, ...), and whitespace is generally ignored apart from separating tokens.
  • Whitespace-sensitive grammar: some tokens are the result of the absence of characters. In our example, when we don’t see a tab, we close a block. This complicates the lexer (scanner), which needs to keep state with information about indentation.

Simple example

Without nested blocks, our text structure is very simple:

- one
- two
- three

These blocks can be parsed by this grammar:

module.exports = grammar({
  name: "simple",
  // ignores carriage return (\r\n turns into \n)
  extras: (_) => ["\r"],
  rules: {
    document: ($) => repeat($.block),
    block: ($) => seq($.block_start, $.block_content, $._block_end),
    block_start: ($) => /- /,
    block_content: ($) => /[^\n]+/,
    _block_end: ($) => choice("\n", $._eof),
    _eof: ($) => "\0",
  },
});

The document is made up of blocks (lines). Each block starts with "- ", and then all remaining characters on the line count as content. The block ends with a newline or the end of the file. Rules whose names start with an underscore won’t be included in the final AST.
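Running this grammar over the three-line input should, as far as I can tell from the rules above, produce a flat version of the tree from the start of the post:

(document
  (block (block_start) (block_content))
  (block (block_start) (block_content))
  (block (block_start) (block_content)))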

Nesting

Now let’s introduce nesting. Remember the example from the start of the post:

- one
	- two
		- three
	- four
- five

To parse this, we need to delegate the block_start and block_end tokens to separate lexer logic. TreeSitter calls this lexer logic an external scanner, and by convention it lives in a file called scanner.c.

This scanner will keep track of the number of open blocks and then emit block_end tokens based on the indentation of the next block.

The above example will be treated like this:

  • line 1: we parse block one
  • line 2: has a leading tab, so we nest block two under one
  • line 3: has two leading tabs, so we nest three under two
  • line 4: has one leading tab, so we close blocks three and two and create block four (a sibling of two, nested under one)
  • line 5: doesn’t have any leading tabs, so we close four and one and create five
  • end of file: we close five

Now that you see how the scanner detects the end of a block, let’s make the above process a bit more exact by capturing it in pseudocode.

var level = 0
var blocksToClose = 0

scanToken() {
  if (isValidSymbol(BLOCK_END)) {
    if (blocksToClose > 0) {
      blocksToClose--
      level--
      return BLOCK_END
    }
    if (nextChar == \n) {
      tabs = countTabsAfterNewLine()
      if (tabs < level) {
        blocksToClose = level - tabs
        blocksToClose--
        level--
        return BLOCK_END
      }
    } else if (eof) {
      blocksToClose = level
      blocksToClose--
      level--
      return BLOCK_END
    }
  }

  if (isValidSymbol(BLOCK_START)) {
    if (nextTwoChars == "- ") {
      level++
      return BLOCK_START
    }
  }
  return nil
}

The key points to follow in the pseudocode:

  • The level global state holds the number of open blocks at the moment we scan a token.
  • The blocksToClose counter is a workaround: scanToken() can only return one token at a time, but we might have to close multiple blocks. We close a block when the number of leading tabs on the new line is smaller than the number of open blocks. Walk through the example, note that every "- " increments level, and apply this closing rule to see how the tabs drive the code.
  • We close all open blocks when we hit the end of the file.
  • The isValidSymbol check prevents us from running our code in places where we can’t emit BLOCK_START or BLOCK_END (e.g. while we are parsing the block content).

Limitation: Representing the state by two integers is a simplification. For a real-world grammar, you probably want to implement a stack of open blocks. This stack will allow you to close different types of blocks at once. For example, you might want to close a quote block, then a paragraph, and then the list item.

The grammar.js is even simpler because it “outsources” two rules to the scanner. The only added complexity is that we now allow nested blocks (the block rule references itself):

module.exports = grammar({
  name: "simple",
  extras: (_) => ["\r"],
  externals: ($) => [$.block_start, $._block_end],
  rules: {
    document: ($) => seq(/\s*/, repeat($.block)),
    block: ($) =>
      seq($.block_start, $.block_content, repeat($.block), $._block_end),
    block_content: ($) => /[^\n]+/,
  },
});

The scanner code is a bit more complex. It follows the pseudocode closely, but it’s too long to paste here verbatim; you can follow the link to my GitLab. The TreeSitter external scanner documentation gives a good overview of the file structure and interface.

Limitation: I don’t understand error handling and highlighting in TreeSitter well and I don’t know if the grammar from this post has good error handling and highlighting support.

Summary

I showed you how to create a simple grammar representing Markdown lists. First, we looked at non-nested lists, which can be handled with the default scanner. Then, we introduced nested list items, which needed an external scanner.

From here, I recommend reading Jonas Hietala’s Let’s create a Tree-sitter grammar, which introduces a much more complex grammar.

Good luck!