fuzzy find algorithm
No one types perfectly. To compensate, many programs use “fuzzy find” algorithms to retrieve records close to what a user typed. Accidentally typed “quck” or “quack” when you meant “quick”? No worries! Fuzzy find will retrieve what you meant anyway.
This post explains the fuzzy find algorithm used in aretext, the terminal-based text editor I’ve been working on.
In a typical editing session, the user will search for commands to execute or files to open. The menu options update as the user types each character:
When implementing fuzzy find in aretext, I had a few design goals:
- Fast: it should be fast enough to use interactively. Once a user types a character, the algorithm should return results within tens of milliseconds – otherwise, the editor will feel laggy.
find ~ | wc -lsays there are ~500K files on my largest machine, so aretext should be able to search that many file paths at interactive speed.
- Synchronous: the algorithm should run synchronously on every keystroke. It’s possible to hide latency by updating results asynchronously, but I wanted to avoid the complexity of multiple threads mutating editor state.
- Useful: as often as possible, the top result should be the one the user wanted. If multiple records match, the user should be able to type more characters to disambiguate and eventually find the desired record.
I found several blog posts1 and open source projects2 implementing fuzzy find, but none achieved these goals. In the prototypes I built, the greedy algorithms and n-gram similarity sometimes ranked records differently than I expected, and the Smith-Waterman algorithm proved too slow to run synchronously on half a million items.3
The fuzzy find algorithm in aretext has two distinct phases:
- Retrieval: find records that contain keywords similar to keywords in the user’s query.
- Ranking: score all retrieved records by relevance to the user’s query, then return the top results.
For retrieval, aretext uses a keyword trie, and for ranking, aretext uses an approximate string search algorithm. As we will see, the ranking algorithm’s time complexity depends on both the number of records to score as well as the length of the records and query. Filtering records in the retrieval phase reduces the amount of work in the ranking phase. This is especially important for longer queries, which are slower to score during ranking, but tend to match fewer records during retrieval.
Retrieval: Keyword Trie
A keyword trie is a data structure that can efficiently find records containing a given keyword. For example, suppose we have three records: (1) “foo bar” (2) “bar” and (3) “baz”. If we split records into keywords based on whitespace, then the keyword trie would look like this:
Each edge represents a character in a keyword; the nodes represent keywords and their prefixes. Nodes representing keywords are associated with one or more record IDs that contain the keyword. In the example above, “bar” appears in record IDs 1 and 2.
How can we use a trie to find records for a search query? If we were looking for exact prefix matches, the algorithm would be straightforward. First, split the query into keywords, separating at spaces and punctuation. Then, for each query keyword, start at the trie root and follow edges matching characters in the query (if there is no matching edge, terminate with an empty set of results). Once all characters have been matched, the algorithm will stop at some node in the trie. Retrieve all record IDs from that node and its descendants. Finally, calculate the set intersection of record IDs retrieved for each query keyword: these represent the records that match all the query keywords.
For fuzzy find, we want the algorithm to include keywords even if the user mistyped part of the keyword’s prefix. The paper “Efficient Interactive Fuzzy Keyword Search”4 explains how to do this. Allow each query keyword to match multiple nodes in a trie – the paper calls these “active nodes.” Each active node represents a keyword prefix within some threshold
maxEditDist of the query keyword. Once we know the active nodes, finding matching keywords is simple: visit the active nodes and their descendants, collecting and returning all associated records.
How can we calculate the active nodes for a given query keyword? Using recursion!
- Base case: If the query keyword is the empty string, the active nodes are all prefixes with length less than or equal to
- Recursive case: Otherwise, for a query of length
n >= 1, start with the active nodes for the query’s prefix of length
n-1. For each active node and its children, determine the new edit distance based on the active node’s previous edit distance and whether the edge to the child matches the next character in the query keyword.5
The advantage of calculating the active nodes recursively is that we can cache and reuse the results as the user types. For example, suppose the user has typed “abc” and then types “d”. We can reuse the active nodes from “abc” to calculate the active nodes for “abcd”. This makes the algorithm fast enough to use interactively!
Ranking: Approximate String Search
One might expect that we could use edit distances calculated from the keyword trie to rank records as well. Intuitively, records containing keywords with small edit distances should rank higher than records containing keywords with large edit distances. In practice, however, I found that this approach produced poor results. The problem is that the trie treats each record as an unordered set of keywords, so it cannot rank based on the sequence that keywords appear in the query. It might be possible to fix this (perhaps storing keyword positions in the trie would help), but I decided another approach was simpler.
For ranking, aretext uses a variation of a dynamic programming algorithm for approximate string matching.6 The algorithm searches each record for characters in the query string, with each matching character increasing the score. To find approximate matches, the algorithm allows “edits” (insertions, deletions, and substitutions) to the query, but each edit decreases the score. The “best” score for each possible query prefix and record prefix are memoized in a table to avoid duplicate work. This dynamic programming technique allows the algorithm to search many possible matches in
O(nm) time, where
m are the lengths of the record and query strings, respectively.
As mentioned earlier, the dynamic programming algorithm is expensive for long query strings. Fortunately, long queries tend to match fewer records in the retrieval phase, which reduces the time required for the ranking phase. To further speed up ranking, aretext parallelizes scoring records across multiple goroutines.
Once all records have been scored, aretext uses a heap to find the top 100 highest-scoring records, with ties broken by lexicographic order of record strings. Limiting the maximum number of results avoids the
O(nlog(n)) cost of sorting all records, most of which are likely irrelevant to the user’s query.
How well does the fuzzy find algorithm work? From my perspective – it’s great! The editor responds instantly when I type, and the item I’m searching for almost always appears as the first result even when I mistype some characters.
If you’re interested learning more, the code is open-source under the GPLv3 license and available on GitHub. The fuzzy find algorithm described in this post will be released in aretext v0.4, scheduled for late January 2022!
I’m sure it’s possible to optimize my naive Smith-Waterman implementation further, but the algorithm’s time/space complexity (much worse for long strings!) motivated me to look for a different solution. ↩︎