Articles

One trie, three jobs, zero benchmarks won

I just open-sourced wordtree, a compact trie for word lists. Before I tell you what it does, here is what it is not: it is not the fastest at anything. I benchmarked it against a specialist crate for each job it does, and on every head-to-head axis a specialist beat it. Exact lookup is slower than a HashMap. The file is three times larger than an FST. Spelling correction is an order of magnitude slower than symspell.

I shipped it anyway, and I'd reach for it again. This post is about why a data structure that loses every micro-benchmark can still be the right dependency — and about the two genuinely fun pieces of engineering inside it: an edit-distance computation that rides down the trie in a fixed three-byte window, and an 8-byte node that pushes everything else off to the side.

The whole comparative study is reproducible — every number below comes from comparisons/REPORT.md, regenerable with four cargo commands against the word lists bundled in the repo.

Three jobs

wordtree came out of a translation app that needed three things from one big word list, all at once, on devices where startup time and memory both mattered:

  1. A browsable index. Group the words into folders (~100 per folder) so a UI can page through them. path_of("apricot") returns the folder path.
  2. Exact lookup. Resolve a word to the index of its expression in O(word length). index_of("apple")Some(1).
  3. Typo-tolerant autocomplete. Frequency-ranked, as-you-type suggestions that both extend a prefix ("ap"apple, apply) and fix a single typo — substitution, transposition, insertion, or deletion at Damerau-Levenshtein distance ≤ 1 ("aple"apple). suggestions("aple", …).

Each of those jobs has a specialist crate that does it better. What almost nothing does is all three from one structure — and, crucially, from one file that loads with zero parsing. That last constraint is the whole story, so let me start there.

The structure: 8 bytes a node, and not a byte more

The tree is a width-first array of fixed-size nodes: a node is immediately followed by all its siblings, so "next sibling" is just the next slot and "first child" is one 24-bit offset.

Logical trie versus its flat node array logical trie flat node array · width-first a p e p l e y ape apple apply apepley 0123456 first_child_pos contiguous siblings siblings

The same nodes, linearised. A parent reaches its children with one forward first_child_pos jump; the children then sit in adjacent slots, so walking siblings is just stepping forward until the is_last_sibling flag ends the run.

Each node is exactly 8 bytes:

fieldbitsrole
first_child_pos24relative position of the first child
node_char24UTF-32 codepoint (low 3 bytes)
is_folder1drives the browsable index
is_last_sibling1terminates a sibling run
max_child_percentile10best frequency in the subtree — drives top-k pruning
(spare)4
The 8-byte node bit layout one node record first_child_pos 24 bits node_char 24 bits max_child percentile 10 bits spare 4 b is_last_sibling · 1 is_folder · 1 8 bytes · 64 bits

Everything a node needs in 8 bytes. max_child_percentile (highlighted) earns its 10 inline bits because the suggestion walk reads it on every node to prune subtrees; the per-word data only a quarter of nodes need lives off-node instead.

Why fixed 8-byte records and not a tidy struct? Because the on-disk format is the in-memory format. The tree serialises with rkyv, and an ArchivedTree is queried directly out of an mmap — no parse, no rebuild, no pointer fix-up. Loading a 21 MiB English dictionary is an mmap call. For English, live heap == serialized == 21.11 MiB: the bytes you store are the bytes you query.

That max_child_percentile field earns its 10 inline bits because it is read on every node during a suggestion walk. It records the highest word frequency anywhere in the subtree below a node, which is exactly the lower bound a pruning-radix-trie (Wolf Garbe's design, which wordtree's pruning is modelled on) needs: if a subtree's best possible frequency can't beat the current top-k, skip the whole subtree. Top-k autocomplete then touches a tiny fraction of the tree.

Pushing the sparse data off-node

Here is the first optimisation I'm happy with. A word needs two more values: its frequency (percentile, 0–1000) and the 24-bit index of its expression. But only ~28% of nodes actually end a word — the rest are interior characters. Storing those 5 bytes inline would waste them on roughly three out of four nodes.

So they live in side tables instead, all part of the same zero-copy image:

tablesizerole
word_bits1 bit / nodeis this node the end of a word?
rank_index1 × u32 / 64 nodescumulative word count → rank(node) in O(1)
values5 bytes / wordthe (percentile, expr_index) pair

The trick is the classic succinct-structure move: a word node at position i finds its value at values[rank(i)], where rank(i) is the number of word-nodes before it. The word_bits bitvector plus the cumulative rank_index answer that rank query in O(1) — popcount the partial 64-bit word, add the precomputed prefix sum. The bit probe sits on the hot descent path; the rank query only fires when a value is actually consumed (an exact lookup, or a suggestion you decided to keep).

Moving those 5 bytes off-node took the node from 12 bytes to 8, which on English trimmed the structure from ~26.5 MiB to ~21.1 MiB — about 20% — with no loss of function. It also made exact lookup ~10–20% faster, because more siblings now fit in a cache line and index_of scans siblings linearly. Smaller and faster from the same change is rare enough to enjoy.

The fun part: edit distance that rides down the trie

The third job is the interesting one. How do you find every word within Damerau-Levenshtein distance 1 of a typo, frequency-ranked, without scanning the dictionary? (A brute-force DL≤1 scan over English takes ~90 ms — far too slow for as-you-type.)

The answer is to compute the edit distance incrementally as you walk the trie. Each node carries one dynamic-programming row recording the edit distance between the query and the word spelled by the path from the root to that node. The row at a node is computed from its parent's row (and, for transposition, its grandparent's). Conceptually, for a query of length n, the row holds

row[j] = edit_distance(query[0..j], word spelled to this node)

and two quantities drive everything:

  • row[n] is the distance from the whole query to this node's word. If the node ends a word and row[n] ≤ K, it's a correction.
  • min(row) is the distance to the closest prefix — a lower bound on every word in the subtree below. Once min(row) > K, the entire subtree is pruned.

So the walk descends at most K levels past the query length and, in practice, visits ~2–3% of the tree. One traversal does autocomplete and correction at the same time.

The bug that made me rewrite it

I didn't start here. The first version used a hand-rolled 4-window state machine that tracked a few edit positions as it descended. It looked fine. It passed my hand-written tests. It was wrong.

What exposed it was building the comparison harness — specifically, a recall table broken down by edit kind. Substitution and transposition: fine. Deletion: ~6–10% recall. Insertion: 0%. The state machine, run over a branching trie rather than a single string, mis-scored mid-word insertions and deletions and pruned the correct word away before it was ever reached. My autocomplete had been silently dropping every insertion typo and I had no idea until a table told me. That is the entire argument for building the benchmark harness, in one anecdote.

The DP-row-over-trie version replaced it and corrects all four single-edit kinds at 100% (more on that below).

The band: why each node costs three bytes, not n

A full row of n + 1 values per node would make every node cost O(query length). But you don't need the full row. row[j] is at least |j − depth| — you need that many indels just to reconcile the length difference between a depth-depth prefix and a j-character query prefix — so any cell with |j − depth| > K is already > K and can never be a kept correction nor lower a surviving minimum.

Only the 2K + 1 cells in a diagonal band around j = depth can ever matter. At the default K = 1 that's three cells, stored as a fixed [u8; 3].

The diagonal band: three cells per row query index j → 0123456 trie depth d ↓ 0123456 012 [u8; 3] one band row out of band |j − d| > K only the 2K + 1 = 3 band cells per row are computed

For a query of length n, a full row would cost O(n) per node. But any cell with |j − depth| > K is already > K, so only the 2K + 1 band cells can change a keep-or-prune decision — three at K = 1, carried as a [u8; 3] that rides the diagonal.

Shifting to band-local coordinates turns every DP neighbour into a constant offset, so the recurrence carries no per-cell column arithmetic at all:

cur[o] = min(prev[o+1] + 1,     // deletion       (word longer than query)
             prev[o]   + cost,   // match / substitution
             cur[o-1]  + 1,      // insertion      (word shorter than query)
             pp[o]     + 1)      // transposition  (grandparent row)

By Ukkonen's banding argument, the optimal alignment to any cell whose true distance is ≤ K stays inside the band. So every value the search actually acts on is computed exactly, and the kept corrections, their order, and the exact set of visited nodes are bit-identical to a full-row walk. Out-of-band cells may be over-estimated, but they stay > K, so no keep/prune decision changes. Banding changes only the per-node cost — O(K) instead of O(n) — which is why longer queries gain the most: a 14-character fuzzy query runs ~80% faster than the full-row walk; short typos roughly halve.

(One Rust wrinkle worth a footnote: stable Rust can't size [u8; 2*K + 1] from a K parameter, so the band width W is the const generic and K = (W − 1) / 2 is derived. Want distance-2 suggestions? Instantiate the search with W = 5.)

Losing every axis, on purpose

Now the benchmarks. I picked the best specialist crate for each job and ran them on the same word lists — fst (BurntSushi's FSA), symspell, Wolf Garbe's pruning_radix_trie, boomphf (minimal perfect hash), and plain HashMap/Vec baselines. A correctness gate asserts every engine resolves a word to the same expression index before any timing is trusted.

Exact lookup (nanoseconds). wordtree is the slowest of the bunch — it linearly scans each node's siblings.

case (en)wordtreefstboomphfhashmap
short on72.015.313.57.5
long alphanumerical108.994.425.78.6

HashMap wins outright at ~8 ns, flat. wordtree is ~8–13× slower. All are tens of nanoseconds in absolute terms — fine — but exact lookup is not a reason to pick wordtree.

Size. The FST is the clear winner: it minimises shared prefixes and suffixes (DAWG-like), doing exact lookup and spelling correction in ~3× less space than wordtree does anything.

engine (en)live heapserialized
fst10.0 MiB6.7 MiB
wordtree21.1 MiB21.1 MiB
sorted-vec25.1 MiB
boomphf33.9 MiB
hashmap38.7 MiB
symspell300.4 MiB

So the honest framing: wordtree is the smallest of the naive key-storing structures, but still ~3× larger than an FSA. Its "size-optimised" claim holds against a naive trie, not against fst. (And one more honesty note: wordtree's build peaks at ~224 MiB to produce 21 MiB — ~11× — which matters if you generate trees on a constrained device.)

Spelling correction. symspell is in another league on latency.

case (en)wordtreesymspellfst-levbrute force
sub abxut44.0 µs1.4 µs122.6 µs95.1 ms
del abut48.7 µs8.0 µs124.3 µs86.4 ms

symspell does a handful of hash lookups against a precomputed delete-dictionary; wordtree walks the trie. It's ~25–31× slower than symspell. (It is ~2.5–3.2× faster than fst's Levenshtein automaton, and corrects transpositions that fst misses entirely — but symspell is the one to beat, and it wins.)

Autocomplete. Closest race. The combined suggestions() call runs the edit-distance walk every time, so it's the wrong thing to race against a pure completer (~43 µs). The autocomplete-only completions() call skips the walk:

case (en)wordtree completions()pruning-trie
co2.5 µs1.2 µs

Within ~2× on English, ~1.2× on Swedish. The pruning trie also tracks the frequency oracle a bit better (recall@5 93% vs 85% on Swedish). Close, but still a loss.

So: across the four jobs, on each job's home axis, a specialist wins. The title isn't false modesty.

The one place it doesn't lose: doing all of it from one file

Here's the part the per-axis tables hide. Every alternative above does one job (boomphf, symspell, pruning-trie) or two (fst: lookup + correction). Picking specialists means assembling three or four structures, three or four files, three or four load paths — and HashMap/symspell can't be memory-mapped at all, so they rebuild at startup.

wordtree folds all three jobs into one structure that loads by mmap with no parse or build step, and returns a deliberately short, frequency-ranked, single-edit-tolerant list. It even matches symspell's quality where it counts:

correction recall by edit kind (en)substitutetransposedeleteinsert
wordtree100%100%100%100%
symspell100%100%100%100%
fst-lev100%0%100%100%

(fst's Levenshtein is plain Levenshtein — a transposition costs 2, so it misses every transposed typo at distance 1.) wordtree returns a small frequency-capped top-k rather than the exhaustive DL≤1 set symspell gives you, which is the right trade for an as-you-type box and the wrong one for a batch spell-checker.

When to use it (and when not)

I'll be blunt, because the benchmarks are:

  • Need just one of these jobs, or the lowest latency, or the smallest file? Use the specialist. fst for lookup + fuzzy in minimal space; symspell for exhaustive correction; pruning_radix_trie for pure autocomplete; a HashMap for raw lookup speed.
  • Need a browsable index + frequency + typo-tolerant autocomplete from one mmap-able file, with a short ranked suggestion list and no startup cost? Then one 21 MiB file you mmap and query three ways is a reasonable single dependency — which is exactly the spot wordtree was built for.

The repo is an unmaintained showcase — a snapshot extracted to accompany this post, not a crate I'm asking you to depend on. But the comparison harness is real and reproducible, the edit-distance walk is genuinely nice, and the broader lesson is the one I keep relearning: benchmark against the specialists, expect to lose, and find out whether the thing you're actually optimising for — here, three jobs in one zero-copy file — is even on the axis you're measuring. Usually it isn't.


Reproduce everything:

cargo run -p comparisons --bin quality --release   # quality tables
cargo run -p comparisons --bin size    --release   # size + RAM
cargo bench -p comparisons                         # latency
cargo test  -p comparisons                         # correctness gate

Numbers are from one Apple M-series machine; treat them as ratios, not absolutes. Word lists are derived from PanLex and Wiktionary (en 638,545 words, sv 113,220).


2576 Words

2026-06-06