incremental parsing in go
This post is an attempt to explain the incremental parsing algorithm aretext uses for syntax highlighting.
Like the rest of aretext, parsers are implemented in Go for portability and performance. Most people do not consider Go a functional programming language; nonetheless, aretext’s parsers rely on functional programming patterns. In this post, we’ll see how to implement these patterns in pure Go to build parsers that are fast and expressive.
Problem
Syntax highlighting is a special case of parsing. The input is a text file containing source code, and the output is a sequence of tokens. Some tokens have special meaning such as “number”, “string”, or “comment” – these get displayed in a different color. Each programming language needs its own parser, and parsers can be a pain to write.1
Parsers also need to be fast. When the user edits a JSON document with 100K lines, syntax highlighting should update instantly. This requires parsing incrementally, reusing cached results as the user edits the document.
Example
Before explaining the incremental parsing algorithm, let’s look at a simple example. In git commit messages, any line that starts with a “#” is a comment. The screencast below shows how git commit comments are highlighted in aretext:
This is the parser code for git commit messages:
// GitCommitParseFunc parses a git commit.
func GitCommitParseFunc() parser.Func {
	parseCommentLine := consumeString("#").
		ThenMaybe(consumeToNextLineFeed).
		Map(recognizeToken(parser.TokenRoleComment))
	return parseCommentLine.Or(consumeToNextLineFeed)
}
Each invocation of this function consumes one line. If the line starts with a “#”, the parser recognizes the line as a comment token. Otherwise, it consumes the rest of the line without recognizing any tokens.
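We’ll define parser.Func and parser.Result precisely in the next section, but as a preview, here is a hedged sketch of how a primitive helper like consumeString might be implemented (the actual aretext helpers may differ in their details):

func consumeString(s string) parser.Func {
	return func(iter parser.TrackingRuneIter, state parser.State) parser.Result {
		var n uint64
		for _, expected := range s {
			r, err := iter.NextRune()
			if err != nil || r != expected {
				// Fail without consuming anything.
				return parser.FailedResult
			}
			n++
		}
		return parser.Result{NumConsumed: n, NextState: state}
	}
}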
With this single function, aretext can incrementally parse comments in git commit messages. If the user edits the message, aretext will re-run this function on part of the message, reusing cached results from the previous parse. The rest of this post explains the magic that makes this work!
Parser func and result
A parser.Func is responsible for parsing part of a document. It is defined as:

type Func func(TrackingRuneIter, State) Result

It accepts a TrackingRuneIter used to read the document and returns a parser.Result. (The State argument is used by some parsers2 to maintain arbitrary state across executions.) The return type parser.Result is defined as:
// Result represents the result of a single execution of a parse function.
type Result struct {
	NumConsumed    uint64
	ComputedTokens []ComputedToken
	NextState      State
}
In the rest of this post, I’ll refer to parser.Func as a “parse func” and parser.Result as a “parse result”.
A failed parse consumes zero runes; a successful parse consumes at least one rune. A successful parse may also produce tokens.
Here’s an example of a parse func that consumes a single “x” and produces a “keyword” token.
func parseSingleXRune(iter parser.TrackingRuneIter, state parser.State) parser.Result {
	r, err := iter.NextRune()
	if err != nil || r != 'x' {
		return parser.FailedResult
	}
	return parser.Result{
		NumConsumed: 1,
		ComputedTokens: []parser.ComputedToken{
			{Offset: 0, Length: 1, Role: parser.TokenRoleKeyword},
		},
		NextState: state,
	}
}
The parse func reads the next rune from iter. If the rune is ‘x’, it returns a successful result with a keyword token; otherwise, it returns parser.FailedResult (which has NumConsumed == 0).
Since parse funcs are implemented in Go, they can parse arbitrarily complex languages, including context-sensitive languages like YAML and Markdown.
To parse a document, aretext invokes a parse func repeatedly. Each invocation uses an iterator starting immediately after the last rune consumed by the previous invocation. If the parse fails, aretext skips ahead one rune and tries again. Every parse func must produce the same output for a given input so that results can be safely cached and reused.
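As a minimal sketch, the driver loop might look like this (not aretext’s actual implementation; runeIterAtPos is a hypothetical helper that returns an iterator starting at a given position):

func parseAll(f parser.Func, docLen uint64, state parser.State) []parser.Result {
	var results []parser.Result
	var pos uint64
	for pos < docLen {
		result := f(runeIterAtPos(pos), state)
		if result.NumConsumed == 0 {
			// The parse failed: skip ahead one rune and try again.
			pos++
			continue
		}
		results = append(results, result)
		state = result.NextState
		pos += result.NumConsumed
	}
	return results
}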
You might be thinking that parseSingleXRune looks imperative and verbose. Do we really need to write all that code just to parse an ‘x’? The answer is: no. We’ll see later that new parsers are usually composed from simpler parse funcs.
Tracking rune iterator
parser.TrackingRuneIter allows the parse func to read input one rune at a time. Internally, it records how many runes were read. This is important for incremental parsing: we can safely reuse a parse result only if we prove that the parse func did NOT read input edited by the user.
Copying a TrackingRuneIter produces a new iterator with its own position in the document. This allows parsers to “look ahead” using a copy of an iterator. For example, a parser might read a quote rune ("), then look ahead for the closing quote of a string token. If it fails to find a closing quote, it can backtrack by using the original iterator. All copies of a TrackingRuneIter update a shared maxRead variable, so we can later determine the runes read by the original iterator and all of its copies.
Since TrackingRuneIter is a Go struct with few fields (a pointer into the document and a count of runes read), the Go runtime can allocate copies on the stack, avoiding the overhead of heap allocation and garbage collection. This is critical for performance.
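As a rough illustration (not aretext’s actual definition), the iterator might be shaped something like this, with every copy sharing a single maxRead pointer:

// Simplified sketch: reads from an in-memory rune slice rather than
// aretext’s document data structure. Copying the struct copies pos by
// value, but all copies share the same *uint64, so the original can
// later observe how far any copy has read. (Uses io.EOF from the io package.)
type trackingRuneIter struct {
	text    []rune
	pos     uint64
	maxRead *uint64
}

func (iter *trackingRuneIter) NextRune() (rune, error) {
	if iter.pos >= uint64(len(iter.text)) {
		return 0, io.EOF
	}
	r := iter.text[iter.pos]
	iter.pos++
	if iter.pos > *iter.maxRead {
		*iter.maxRead = iter.pos
	}
	return r, nil
}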
Reusing parse results
Executing a parse func repeatedly on a document produces a sequence of parse results. Suppose the user then inserts or deletes text at a position in the document. Which parse results could be reused and which would need to be recomputed?
Remember that parse funcs are deterministic: given the same input, they return the same output. Given a TrackingRuneIter’s initial position and maxRead count, aretext can determine exactly which runes a parser read to produce a given result. Let’s call these runes the “read region” of a result.
For a given edit and parse result, there are three possible cases:
- The edit occurred within the read region: The parse func may receive different input, so all bets are off. We need to rerun the parse func to produce a new, possibly different result.
- The edit occurred after the read region: The parse func will receive the same input and produce the same output, so we can reuse the result.
- The edit occurred before the read region: The parse func will receive the same input and produce the same output. Each result contains only rune counts; absolute positions are calculated by summing the counts of preceding results. The edit changes the rune counts of preceding results, effectively “shifting” the result by the number of inserted/deleted runes.
The above observations lead to a relatively straightforward incremental parsing algorithm:
while there is input left to consume:
    check if we can reuse a prior result at the current position
    if yes => reuse it
    else => recompute it by executing the parse func
    advance the position by the number of runes consumed
Interested readers can compare the above pseudocode to the full Go implementation in syntax/parser/parser.go.
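As a hedged illustration of the reuse decision, here it is over a flat list of cached results (aretext’s actual code organizes results into a tree, described next):

// Each cached entry records where the parse func ran and how many runes
// it read (from the iterator’s shared maxRead count).
type cachedResult struct {
	startPos   uint64
	readLength uint64
	result     parser.Result
}

// tryReuse reports whether a cached result can be reused for a parse
// starting at pos. editPos is the position of the edit; shift is the
// net number of inserted runes (negative for deletions).
func tryReuse(c cachedResult, pos, editPos uint64, shift int64) (parser.Result, bool) {
	readEnd := c.startPos + c.readLength
	switch {
	case editPos < c.startPos:
		// Edit before the read region: same output, shifted by the edit.
		return c.result, pos == uint64(int64(c.startPos)+shift)
	case editPos >= readEnd:
		// Edit after the read region: same input, same output.
		return c.result, pos == c.startPos
	default:
		// Edit within the read region: must re-run the parse func.
		return parser.Result{}, false
	}
}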
In practice, an efficient implementation of this algorithm requires organizing results into a tree (called a computation in the aretext code). The leaves of the tree represent results, and the inner nodes represent groups of results. This allows us to quickly find and reuse entire sub-trees for parts of the document unaffected by the edit. We construct a new tree from these sub-trees, rotating nodes as necessary to keep the new tree balanced.3
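As a rough sketch of the tree’s shape (not aretext’s actual definition), inner nodes can aggregate the rune counts of their children so that a lookup by position can skip whole sub-trees:

type computation struct {
	readLength     uint64         // total runes read by this subtree
	consumedLength uint64         // total runes consumed by this subtree
	result         *parser.Result // set only on leaf nodes
	children       []*computation // set only on inner nodes
}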
Parser combinators
We now turn to ergonomics. Writing imperative code for every parse func would be slow and error-prone – better to write a few primitive parse func implementations, then combine them to create new parsers.
In functional programming, a “parser combinator” is a function that transforms a parser function into another parser function. We can implement parser combinators in Go as methods on a function receiver. Examples of combinators include Then, Map, and Or.
The Or combinator combines two parse funcs into a new parse func. The combined parse func tries the first parse func, and if that fails, it tries the second. This allows us to decompose a complicated parser into simple parts, then combine them with Or. The parser for Go exemplifies this pattern:
func GolangParseFunc() parser.Func {
	return golangLineCommentParseFunc().
		Or(golangGeneralCommentParseFunc()).
		Or(golangIdentifierOrKeywordParseFunc()).
		Or(golangOperatorParseFunc()).
		Or(golangRuneLiteralParseFunc()).
		Or(golangRawStringLiteralParseFunc()).
		Or(golangInterpretedStringLiteralParseFunc()).
		Or(golangFloatLiteralParseFunc()).
		Or(golangIntegerLiteralParseFunc())
}
The Or combinator is implemented in just 10 lines of Go code:
// Or produces a parse func that returns the result of `f` if it succeeds,
// or the result of `nextFn` if `f` fails.
func (f Func) Or(nextFn Func) Func {
	return func(iter TrackingRuneIter, state State) Result {
		result := f(iter, state)
		if result.IsSuccess() {
			return result
		}
		return nextFn(iter, state)
	}
}
Notice that this relies on the ability to copy TrackingRuneIter efficiently. Passing iter to f() creates a copy of the iterator; if f fails, we backtrack by passing the original, unmodified iter to nextFn().
Parser combinators allow us to write succinct parser implementations. Even complicated parsers like YAML and Markdown can be implemented in relatively few lines of Go code. The table below shows the number of lines of code (excluding helpers) in aretext for several languages:
Language | Lines of Code
---|---
JSON | 77
Python | 142
Go | 169
C | 190
Rust | 246
YAML | 394
Markdown | 847
Tradeoffs
Like any solution, aretext’s syntax highlighting implementation makes tradeoffs.
Cost of developing new parsers
Supporting a new language requires writing Go code. I can usually implement simple languages in a day, but more complicated ones like YAML can take weeks. I’ve now implemented parsers for every language I use regularly, so I don’t view this as a serious problem.
I considered integrating tree-sitter, an incremental parsing library with parsers for many existing languages. However, running JavaScript to generate parsers and linking to a C library would have greatly complicated the build process. Today, aretext can be built on almost any platform using a single go install command. I’ve had users install aretext on ARM laptops, FreeBSD servers, Chromebooks, and Android phones. To maintain portability, I wanted a pure Go implementation.
In the future, I may explore writing a code-generator to automatically translate a TextMate grammar (used by VSCode and other editors) to an aretext parse func. That could make it easier to create, or at least bootstrap, parsers for new languages.
Tokens vs syntax tree
The parsers produce a sequence of tokens, not a full syntax tree. Writing a tokenizer is much easier than constructing a full syntax tree. However, tokenizers can sometimes produce misleading highlights. For example, consider this Go function:
func F(len string) string {
	return len
}
Here, “len” is the name of a variable with type string. However, “len” is also the name of a built-in function, so aretext highlights it. Without constructing the full syntax tree, it is difficult to differentiate an identifier representing a variable from a reference to a built-in function. We’re also unable to perform other syntactic analysis, like identifying blocks for code folding.
In practice, this is rarely a problem. Most other editors don’t construct the full syntax tree, and most developers don’t declare variables that shadow built-in functions.
Extensibility vs performance
Other editors like vim and VSCode allow users to define syntax for new languages at runtime. Typically, this involves writing grammar rules in a file, which the editor loads to construct a parser. This is not possible in aretext, since the parsers are implemented in Go. On the other hand, aretext’s parsers are likely faster since they are compiled to machine code.
Conclusion
Syntax highlighting in aretext is ergonomic and efficient. Parser combinators allow complex parsers to be composed from simpler ones, and incremental parsing ensures excellent performance even for documents with hundreds of thousands of lines. Today, aretext provides syntax highlighting for thirteen languages, all using parsers implemented in pure Go!
There are lots of clever algorithms to automatically generate a parser from a grammar. This sounds easier than writing a parser by hand, but I’ve found that it’s often harder. Parser generator algorithms are picky about how grammars must be specified and almost always require languages to be context-free. Additionally, each parser generator implementation I’ve seen has its own, non-portable, way to express grammar rules. ↩︎
One example is matching opening/closing braces/brackets in YAML flow style. ↩︎
The algorithm for constructing a balanced tree by combining other trees is adapted from Blelloch, G. E., Ferizovic, D., & Sun, Y. (2016). Just join for parallel ordered sets. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures. ↩︎