tree-sitter-scala 0.20.0

Hi everyone. On behalf of the tree-sitter-scala project, I am happy to announce tree-sitter-scala 0.20.0. The first two segments of the version number come from the version of tree-sitter-cli that was used to generate the parser, and the last segment is our actual version number.

About tree-sitter-scala

tree-sitter-scala is a Scala parser written in C, generated using the Tree-sitter CLI, and conforming to the Tree-sitter API. Tree-sitter parsers are generally fast, incremental, and robust (they tolerate partial errors).

Since Tree-sitter's initial release in 2017, Tree-sitter parsers have been adopted by editors such as Atom, Neovim, Emacs, and Helix to provide language features like syntax highlighting, folding, and more (supposedly they are also used on GitHub.com).

Highlights

The full release notes are at https://github.com/tree-sitter/tree-sitter-scala/releases/tag/v0.20.0.

Contribution infrastructure

Smoke test

As we made bigger changes, we sometimes noticed afterwards that overall parsing accuracy had regressed, but the cause was hard to pinpoint. To tackle this, I added a smoke test in #81. As part of GitHub Actions, it checks out the scala/scala and lampepfl/dotty code bases and parses their *.scala source files using tree-sitter-scala.

Using the special log format ::notice or ::error, we can surface the parse success % compared to the expected values:

  if (( $(echo "$actual > $expected" |bc -l) )); then
    # See https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#example-creating-an-annotation-for-an-error
    echo -e "::notice file=grammar.js,line=1::ok, ${source_dir}: ${actual}%"
  else
    echo -e "::error file=grammar.js,line=1::${source_dir}: expected ${expected}, but got ${actual} instead"
    failed=$((failed + 1))
  fi
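
The comparison above assumes that $actual and $expected already hold percentages. As a rough sketch of how such values could be produced and compared, the helpers below compute a success percentage from file counts and check it against a baseline. The function names and the 89-of-100 counts are hypothetical and chosen only for illustration; this is not the project's actual smoke test script. The sketch also uses awk rather than bc for the floating-point comparison, and treats a result equal to the baseline as passing, which is a choice made here and differs from the strict > above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not taken from tree-sitter-scala.

# success_pct OK TOTAL: print the percentage of successfully parsed files,
# rounded to two decimal places. Prints 0.00 when TOTAL is 0.
success_pct() {
  awk -v ok="$1" -v total="$2" 'BEGIN {
    if (total == 0) { printf "0.00\n"; exit }
    printf "%.2f\n", 100 * ok / total
  }'
}

# meets_baseline ACTUAL EXPECTED: exit 0 when ACTUAL >= EXPECTED, 1 otherwise.
# awk performs the floating-point comparison, so bc is not needed.
meets_baseline() {
  awk -v a="$1" -v e="$2" 'BEGIN { exit !(a + 0 >= e + 0) }'
}

# Example with made-up counts: 89 of 100 files parsed, baseline of 87%.
actual=$(success_pct 89 100)
if meets_baseline "$actual" 87; then
  echo "ok: ${actual}%"
else
  echo "regression: ${actual}%"
fi
```

Keeping the arithmetic in small functions like these makes it easy to exercise the threshold logic by hand without checking out a large code base.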

Initially, the baseline parse success rates for the Scala 2 library, Scala 2 compiler, and Scala 3 compiler were 87%, 46%, and 35% respectively. As of version 0.20.0, they are 89%, 68%, and 66%.

Correction: In the first version of this post I wrote that tree-sitter-scala 0.20.0 parses 100% of Scala 2 library sources, but we found a bug in the smoke test, and the actual figure was 89%.

Note: Due to the robust nature of tree-sitter, even parse results that contain errors are often usable for syntax highlighting.

C code generation

During the initial period I was on an extended vacation, and I worked on it around the clock because I was having a lot of fun. Because the tree-sitter family of parsers is normally distributed as *.c source code checked into the GitHub repo, Anton Sviridov and I would work on different features, yet our PRs kept colliding on the generated code.

To make matters worse, C code generation was taking longer and longer as we added more features, ranging from 10 to 40 minutes. I dusted off my old System76 Linux machine, which seemed to do a better job.

To work around the git conflicts, we agreed not to include the generated code in feature PRs, and periodically someone would send a separate PR to bring it up to date. Chris Kipp created a GitHub Action to automate sending the codegen PR in #147.

It turned out that memory usage was the reason code generation was taking so long or crashing GitHub Actions jobs. Andrew Hlynskyi (@ahlinc) from the tree-sitter project pointed out to us that Linux was killing the tree-sitter CLI as it generated tree-sitter-scala because it was consuming 34 GB of RAM. Andrew also gave us a hint about the --report-states-for-rule flag, which printed out the following:

$  node_modules/.bin/tree-sitter generate --report-states-for-rule compilation_unit
class_definition                3728
function_definition             2214
ascription_expression           1442
infix_expression                1412
assignment_expression           1412
....

Using this as a hint, in #102 I was able to refactor the AST to make class parsing left-associative (with a right-associative subtree), which brought the memory usage down to 11 GB. Later on I applied similar refactoring to given_definition and other rules to bring the memory usage down to 1 GB. After that, codegen took less than a minute on my old Linux box.
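
When iterating on refactorings like these, it can help to rank the rules in a saved report so the heaviest ones stand out. The following is a minimal sketch, assuming the report lines have been captured as text in the "rule_name count" shape shown earlier. The helper name top_rules is hypothetical, and the sample rows are the ones quoted in this post.

```shell
#!/usr/bin/env bash
# top_rules N: read "rule_name  state_count" lines on stdin and print the
# N rules with the most states, largest first, as "rule_name count".
# Lines that do not have exactly two fields with an integer second field
# (for example a trailing "....") are skipped.
top_rules() {
  awk 'NF == 2 && $2 ~ /^[0-9]+$/ { print $1, $2 }' \
    | sort -k2,2nr \
    | head -n "$1"
}

# Sample input: the rows quoted in this post.
top_rules 3 <<'EOF'
class_definition                3728
function_definition             2214
ascription_expression           1442
infix_expression                1412
assignment_expression           1412
....
EOF
# prints:
# class_definition 3728
# function_definition 2214
# ascription_expression 1442
```

Saving one report before a refactoring and one after, then running both through a helper like this, gives a quick view of which rules shrank.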

Scala 3 syntax improvements

The primary motivation for the three of us (Anton, Chris, and me) was better Scala 3 support in editors like Neovim and Helix, so it got a lot of the attention. As the 66% smoke test result indicates, though, it’s still a work in progress.

Participation

tree-sitter-scala 0.20.0 was brought to you by 8 contributors and a good bot:

$ git shortlog -sn --no-merges v0.19.1...v0.20.0
    47  Eugene Yokota
    24  Anton Sviridov
    16  Chris Kipp
     8  ghostbuster91
     8  GitHub
     5  susliko
     2  Kasper Kondzielski
     1  Logan Wemyss
     1  Guillaume Martres

Thanks to everyone who’s helped improve tree-sitter-scala by using it, reporting bugs, improving our documentation, and submitting and reviewing pull requests.

Scala Center is a non-profit center at EPFL to support education and open source.