
Benchmarks

Measured on real code, reproducibly.

Not a synthetic corpus. These numbers come from the actual release binary run against real codebases and a hand-curated NL→code gold set, by the harness in bench/run_bench.py.

Results

What the runs show

Every bar below is computed directly from bench/results.json by bench/make_charts.py; no numbers are drawn by hand. The headline: swapping in the local semantic embedder roughly doubles MRR and triples recall@1 on paraphrased queries.

Retrieval quality — semantic vs lexical

recall@k and MRR · higher is better · 18-query NL→code gold set

Per-query rank — lexical vs semantic

rank of the gold file, 1 = best · grey = lexical, blue = semantic · ✕ = not in top-10

Indexing throughput

chunks / second on real repos · hash embedder

Search latency

p50 / p99 · in-process · lower is better

Embed-cache hit rate

on a 1-line edit

Live

Watch every experiment run

This is not a mock-up. We recorded the actual session — building the binary, then running the whole suite end to end (A retrieval quality · B indexing · C engine latency · D filesystem) — and replay it below with its real timing (long idle pauses are compressed). Here is the exact machine it ran on.

OS: Ubuntu 24.04.4 LTS · Linux 6.17 · x86-64
CPU: AMD Ryzen AI MAX+ PRO 395 · 32 cores
Memory: 94 GB
Toolchain: rustc 1.96.0 · cargo 1.96.0 · Python 3.12.3
Embedder: CodeRankEmbed (137M, 768-dim) · pure-Rust candle · CPU inference


Recorded with bench/record_cast.py (every line timestamped); replayed by a tiny vanilla-JS player — no asciinema, no external libraries. The same run wrote bench/results.json that powers the charts above.
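The "real timing, long idle pauses compressed" replay can be sketched roughly as below. This is an illustration of the idea only, not the code in bench/record_cast.py: the `(timestamp, line)` event layout, the `compress_idle` name, and the 2-second cap are all assumptions.

```python
from typing import List, Tuple

# Assumed layout: (seconds since start of recording, terminal line).
Event = Tuple[float, str]


def compress_idle(events: List[Event], max_gap: float = 2.0) -> List[Event]:
    """Shorten every pause longer than max_gap to exactly max_gap.

    Gaps at or below max_gap keep their real duration, so the replay
    preserves the original rhythm apart from long idle stretches.
    """
    out: List[Event] = []
    removed = 0.0  # total idle time cut so far
    prev_t = None
    for t, line in events:
        if prev_t is not None and t - prev_t > max_gap:
            removed += (t - prev_t) - max_gap
        out.append((t - removed, line))
        prev_t = t
    return out


# Hypothetical recording: a 60 s compile pause between two lines.
recorded = [
    (0.0, "$ cargo build --release --features coderank"),
    (0.5, "compiling"),
    (60.5, "finished"),
    (61.0, "$ python3 bench/run_bench.py"),
]
replay = compress_idle(recorded, max_gap=2.0)
for t, line in replay:
    print(f"{t:6.1f}s  {line}")
```

Only the gaps are rewritten; the order and text of the recorded lines are untouched, which is what lets the same run feed both the replay and results.json.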

↓ The run wrote results.json

The exact file the run just produced — shown verbatim, then charted straight from it.

Retrieval quality, charted from results.json

recall@k and MRR · higher is better · 18-query NL→code gold set

Methodology

How we experimented

SynaFS ships a built-in syna bench, but its corpus is synthetic — recall is 1.0 by construction, which proves the pipeline runs but says nothing about quality. The harness here is different: it indexes real source trees and grades retrieval against a gold set whose answers are fixed by inspection, swapping only the embedder so the comparison is clean.

Build the release binary with the local semantic embedder: cargo build --release --features coderank.
Assemble corpora — copy code files only (excluding node_modules, target, build output) from real repositories into a clean tree.
Write 18 natural-language queries, each a paraphrase that avoids the codebase's own identifiers; fix the gold answer to the file that primarily implements that concept.
Index the SynaFS tree twice — once with the offline lexical hash embedder, once with CodeRankEmbed — and run the identical hybrid pipeline (RRF over vector + BM25 + trigram).
Grade at file level: a hit counts if its path ends with the gold path. recall@k = share of queries with a gold hit in the top k; MRR = mean of 1/rank.
Measure engine latency in-process via syna bench; measure indexing throughput by wall-clock over the real trees.
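The hybrid pipeline in the steps above is described as RRF (reciprocal rank fusion) over three rankers. A minimal sketch of RRF follows, assuming the conventional constant k = 60; the value SynaFS uses is not stated here, and the three input rankings are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: score(doc) = sum of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name so output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical per-ranker results for one query (best first).
vector = ["tls.rs", "hpack.rs", "wal.rs"]
bm25 = ["wal.rs", "tls.rs", "ws.rs"]
trigram = ["tls.rs", "ws.rs", "hpack.rs"]

fused = rrf([vector, bm25, trigram])
print(fused)
```

Because RRF uses only ranks, not raw scores, swapping the embedder changes just the vector ranking and leaves the BM25 and trigram inputs identical, which is what keeps the comparison clean.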

Gold-set examples (the queries contain none of the target's identifiers)

Natural-language query	Gold file
“compress response header fields for an http2 stream”	`syna-grpc/src/hpack.rs`
“get notified of every write across a whole mounted filesystem”	`syna-engine/src/fanotify.rs`
“demand and check the caller's certificate during the secure handshake”	`syna-web/src/tls.rs`
“run a neural code representation model locally without python”	`syna-embed/src/coderank.rs`

Reference machine: x86-64 Linux, CPU-only inference, the pure-Rust engine (brute-force vector search over an in-memory snapshot). Your numbers will vary by corpus, embedder, and hardware. Raw output lives in bench/results.json.

Retrieval quality — semantic vs lexical

bench/gold.json is 18 natural-language queries over the SynaFS source. Each is a paraphrase that deliberately avoids the codebase's own identifiers; the gold answer is the file that primarily implements that concept (e.g. "compress response header fields for an http2 stream" → syna-grpc/src/hpack.rs). Relevance is graded at file level. We index the same tree twice and run the identical hybrid pipeline, changing only the embedder.

Embedder	recall@1	recall@3	recall@5	recall@10	MRR
Lexical (hash baseline)	0.111	0.278	0.389	0.500	0.216
Semantic (CodeRankEmbed)	0.333	0.556	0.556	0.611	0.434

Semantic embeddings roughly double MRR (0.216 → 0.434) and triple recall@1 (0.111 → 0.333, i.e. 2 → 6 of the 18 queries). The biggest per-query wins are exactly where lexical overlap is weakest: "get notified of every write across a whole mounted filesystem" → fanotify.rs climbs from rank 10 to 1, and "demand and check the caller's certificate during the secure handshake" → tls.rs from 6 to 1. Several hard queries (hpack.rs, wal.rs, ws.rs) are missed by both embedders in the top-10: the set is small and untuned, not rigged toward a win.

Scope. 18 queries, file-level gold, author-written. This measures the value of the semantic signal inside SynaFS's own pipeline — not a leaderboard ranking against external systems. A standard benchmark (CoIR / CodeSearchNet) is future work.
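The file-level grading rule (a hit counts if its path ends with the gold path) and the two metrics can be sketched as below. This is not the code in bench/run_bench.py; the function names and toy result lists are hypothetical, and counting a miss as 0 in MRR is an assumption the text does not spell out.

```python
from typing import List, Optional


def gold_rank(results: List[str], gold: str) -> Optional[int]:
    """1-based rank of the first result whose path ends with the gold path."""
    for i, path in enumerate(results, start=1):
        if path.endswith(gold):
            return i
    return None  # gold file not retrieved


def recall_at_k(ranks: List[Optional[int]], k: int) -> float:
    """Share of queries whose gold file appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks: List[Optional[int]]) -> float:
    """Mean of 1/rank; a miss contributes 0 (assumption)."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical top results for three queries, with their gold files.
runs = [
    (["/corpus/syna-web/src/tls.rs"], "syna-web/src/tls.rs"),
    (["/corpus/syna-web/src/ws.rs", "/corpus/syna-engine/src/fanotify.rs"],
     "syna-engine/src/fanotify.rs"),
    (["/corpus/syna-web/src/tls.rs"], "syna-grpc/src/hpack.rs"),
]
ranks = [gold_rank(results, gold) for results, gold in runs]
print(ranks, recall_at_k(ranks, 1), recall_at_k(ranks, 3), mrr(ranks))
```

With 18 queries, each additional hit moves recall by 1/18 ≈ 0.056, so the reported recall@1 values of 0.111 and 0.333 correspond to 2 and 6 queries answered at rank 1.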

Indexing throughput — real repositories

End-to-end syna index (tree-walk → tree-sitter chunk → embed → persist) over real source trees, code files only. The offline hash embedder isolates engine/chunking throughput; semantic indexing is bound by CPU model inference and is far slower (it is the quality path, not the throughput path).

Corpus	Files	Chunks	files/s	MB/s	chunks/s	Index
SynaFS (self)	60	964	466	5.7	7,493	7 MB
rogers	467	4,304	324	4.4	2,988	34 MB
nidavellir	559	21,604	839	64.2	32,438	198 MB

Index size is the current M0 JSON snapshot (~7–9 KB/chunk, full uncompressed vectors); the planned RocksDB + usearch + PQ layout should compress this substantially.
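The columns of the table above are tied together by simple arithmetic, which makes the rows easy to sanity-check. The sketch below derives the implied wall-clock time and the per-chunk index size from the reported figures; the elapsed times are derived, not measured, and treating 1 MB as 1000 KB is an assumption.

```python
# Values copied from the table above.
rows = {
    "SynaFS (self)": dict(files=60, chunks=964, files_per_s=466, chunks_per_s=7493, index_mb=7),
    "rogers": dict(files=467, chunks=4304, files_per_s=324, chunks_per_s=2988, index_mb=34),
    "nidavellir": dict(files=559, chunks=21604, files_per_s=839, chunks_per_s=32438, index_mb=198),
}

derived = {}
for name, r in rows.items():
    elapsed_s = r["files"] / r["files_per_s"]        # implied wall-clock time
    chunks_per_s = r["chunks"] / elapsed_s           # should match the table
    kb_per_chunk = r["index_mb"] * 1000 / r["chunks"]
    derived[name] = (elapsed_s, chunks_per_s, kb_per_chunk)
    print(f"{name:14s} {elapsed_s:5.2f} s  {chunks_per_s:8.0f} chunks/s  {kb_per_chunk:4.1f} KB/chunk")
```

The derived chunks/s agree with the reported column to within rounding, and the per-chunk sizes fall in the ~7–9 KB range quoted for the JSON snapshot.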

Engine latency — in-process

From syna bench at 1,000 files / 5,000 chunks. These are in-process (no per-call snapshot reload), so they reflect engine latency, not CLI start-up. Corpus is synthetic; the latency is real.

Metric	p50	p99
Search latency	0.21 ms	0.25 ms
1-line reindex latency	19.5 ms	21.3 ms
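For readers reproducing the p50 / p99 columns with their own timings, one common way to compute them is the nearest-rank percentile over in-process samples. The sketch below is an assumption about method, not a description of what syna bench does internally; `time_calls` and the dummy workload are hypothetical.

```python
import math
import time
from typing import Callable, List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(fn: Callable[[], object], n: int = 1000) -> List[float]:
    """Call fn n times in-process and return per-call latency in milliseconds."""
    out = []
    for _ in range(n):
        t0 = time.perf_counter_ns()
        fn()
        out.append((time.perf_counter_ns() - t0) / 1e6)
    return out


# Dummy workload standing in for a search call.
samples = time_calls(lambda: sum(range(100)), n=200)
p50 = percentile(samples, 50)
p99 = percentile(samples, 99)
print(f"p50={p50:.4f} ms  p99={p99:.4f} ms")
```

Timing in-process, as the section describes, keeps process start-up and snapshot loading out of the samples.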

Embed-cache hit rate on a 1-line edit: 80% — editing 1 of 5 functions re-embeds only that function's chunks; the other 4/5 are served from cache. This is the incremental-on-write invariant, measured directly.
The ~20 ms reindex cost is dominated by the M0 executor rebuilding the index O(n) per commit; usearch incremental add/remove will cut it. It grows with corpus size — it is not yet a fixed per-edit cost.
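The 80% figure follows directly from caching embeddings per chunk. A minimal sketch of that behaviour is below, assuming the cache is keyed by a content hash of the chunk text; the key scheme, the `EmbedCache` class, and the stand-in embedder are hypothetical, not SynaFS's implementation.

```python
import hashlib
from typing import Callable, Dict, List


class EmbedCache:
    """Content-addressed cache: identical chunk text is embedded only once."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, chunk: str) -> List[float]:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed(chunk)
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real model call.
    return [float(len(text))]


# A file with five functions, one chunk each.
chunks = [f"fn f{i}() {{ {i} }}" for i in range(5)]
cache = EmbedCache(fake_embed)
for c in chunks:          # initial index: every chunk is a miss
    cache.get(c)

cache.hits = cache.misses = 0
chunks[2] = "fn f2() { 42 }"   # edit one function
for c in chunks:               # reindex after the edit
    cache.get(c)

hit_rate = cache.hits / (cache.hits + cache.misses)
print(f"hit rate after edit: {hit_rate:.0%}")
```

Only the edited chunk produces a new hash, so four of the five lookups are served from the cache.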

Filesystem performance — the POSIX floor

SynaFS is a real FUSE filesystem, so we measured ordinary file ops through the mount against the raw backing disk, plus the cost of a semantic search done by listing a magic directory. The honest picture: stat is essentially free, readdir and small-file reads add roughly 17 µs each but every metadata latency stays under ~100 µs, and a semantic query runs in a fraction of a millisecond. Raw byte streaming carries real overhead, though: the mount reaches about 20% of raw read throughput and 28% of raw write throughput, because the current pure-Rust FUSE copies bytes through userspace (kernel FUSE_PASSTHROUGH would close that gap).

Byte throughput vs raw disk

sequential read & write · mount as a share of raw

Metadata latency · raw vs mount

stat / readdir / open+read · p50 · lower is better

Search by ls

semantic query through the magic path

Operation	raw	mount
Sequential read	32.2 GB/s	6.4 GB/s
Sequential write (index-on-write)	2.5 GB/s	0.7 GB/s
stat / getattr	3.8 µs	4.1 µs
readdir	2.5 µs	19.0 µs
open + read (small file)	5.6 µs	23.0 µs
Semantic query · ls magic path	—	0.37 ms

Scope. Warm page cache, single host, one FUSE worker. stat is nearly identical through the mount (3.8 µs raw vs 4.1 µs) because attributes are cached; throughput is lower because reads/writes round-trip through userspace and writes also enqueue the reindex. Reproduce: python3 bench/fs_bench.py.
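The raw-vs-mount metadata comparison can be approximated with a few lines of Python. This is a rough sketch, not bench/fs_bench.py: the helper names are hypothetical, it only exercises a temporary directory on the raw disk, and Python call overhead will inflate the absolute microsecond figures relative to the table.

```python
import os
import statistics
import tempfile
import time
from typing import Callable, Dict


def p50_us(op: Callable[[], object], n: int = 2000) -> float:
    """Median latency of op in microseconds over n calls."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter_ns()
        op()
        samples.append((time.perf_counter_ns() - t0) / 1000)
    return statistics.median(samples)


def bench_dir(root: str) -> Dict[str, float]:
    """Time stat, readdir and open+read of a small file under root."""
    path = os.path.join(root, "probe.txt")
    with open(path, "w") as f:
        f.write("x" * 64)

    def open_read() -> bytes:
        with open(path, "rb") as f:
            return f.read()

    result = {
        "stat": p50_us(lambda: os.stat(path)),
        "readdir": p50_us(lambda: os.listdir(root)),
        "open+read": p50_us(open_read),
    }
    os.remove(path)
    return result


with tempfile.TemporaryDirectory() as raw:
    raw_result = bench_dir(raw)
print(raw_result)
```

To compare against SynaFS, the same `bench_dir` call would be pointed at a directory inside the mount (the mount path depends on your setup) and the two dictionaries compared per operation.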

Scaling — overcoming brute-force with ANN

The biggest M0 limitation was exact brute-force vector search — O(N) per query. So we implemented a pure-Rust HNSW approximate-nearest-neighbour index (no external crates) behind the same VectorIndex trait, selectable with SYNA_ANN=hnsw, and measured it against exact search over clustered 768-dim vectors. Search becomes sublinear: 26× faster at 100k while keeping 98.7% recall@10.

Search speedup vs scale

HNSW vs exact brute-force · by corpus size

Recall held as it scales

recall@10 vs exact brute-force · 0–1

Latency at 100k

p50 search · brute-force vs HNSW

Vectors	search p50 · brute → HNSW	recall@10	speedup	HNSW build
1,000	0.31 → 0.30 ms	1.000	1.0×	0.6 s
10,000	3.1 → 0.48 ms	1.000	6.4×	8.9 s
100,000	31.9 → 1.23 ms	0.987	26.0×	122 s

Scope. Clustered synthetic vectors — uniform random high-dimensional data has no nearest-neighbour structure, so every serious ANN benchmark uses structured data. HNSW trades build time for query speed: graph construction is O(N·logN) and slow in pure Rust (122 s at 100k), but it is paid incrementally — once per write — not per query. Recall vs latency is tunable via ef_search / ef_construction. Reproduce: syna ann-bench --sizes 1000,10000,100000.
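The recall@10 column measures how many of the exact top-10 neighbours the approximate index also returns. The sketch below shows that measurement with a brute-force baseline; it does not implement HNSW. The inner-product similarity, the tiny clustered data set, and the hand-made "approximate" result are assumptions for illustration.

```python
import random
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query: Vector, vectors: List[Vector], k: int) -> List[int]:
    """Brute force: score all N vectors, return ids of the k best (O(N) per query)."""
    order = sorted(range(len(vectors)), key=lambda i: dot(query, vectors[i]), reverse=True)
    return order[:k]


def recall_vs_exact(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact top-k that the approximate search also found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Small clustered data set: 4 centres, 200 points with Gaussian noise.
rng = random.Random(0)
centers = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
vectors = [[c + rng.gauss(0, 0.1) for c in rng.choice(centers)] for _ in range(200)]
query = [c + rng.gauss(0, 0.1) for c in centers[0]]

exact = exact_top_k(query, vectors, 10)

# Stand-in for an ANN result that misses one true neighbour.
outsider = next(i for i in range(len(vectors)) if i not in exact)
approx = exact[:9] + [outsider]

print(recall_vs_exact(exact, exact), recall_vs_exact(approx, exact))
```

An ANN index is then judged on two axes at once: this recall figure, and how much faster it answers than `exact_top_k` at the same corpus size.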

Memory — 32× smaller with product quantization

M0 kept every vector as full f32 (768 dims × 4 bytes ≈ 3 KB each), so a few million chunks no longer fit in RAM. SynaFS now ships a pure-Rust PQ-compressed index (SYNA_ANN=pq): each vector becomes a 96-byte code, the raw f32s move to an on-disk tier, and search re-scores the top candidates exactly from disk. RAM drops 32× while recall@10 matches exact brute-force search. Reproduce with syna pq-bench.

RAM footprint at 100k vectors

full f32 vs PQ codes · lower is better

Per-vector RAM

full f32 → PQ code

Recall held

recall@10 at 100k · PQ + exact rerank

ADC scan per query

plain PQ (O(N)) vs IVF-PQ · fewer codes is better

Vectors	RAM (raw → PQ)	compression	recall@10	search p50	IVF scan
1,000	3.07 → 0.10 MB	32×	1.000	0.24 ms	12.9%
10,000	30.7 → 0.96 MB	32×	1.000	0.68 ms	8.0%
100,000	307 → 9.6 MB	32×	1.000	6.45 ms	8.2%

Scope. Clustered 768-d vectors (the structure real embeddings have). PQ trades a little search time for a 32× smaller resident index; the plain-PQ ADC scan is still O(N), so past ~1M vectors you layer the HNSW graph from the scaling section above (§E) on top of the compressed tier. The on-disk raw tier is read only for the top-k rerank. Reproduce: syna pq-bench --sizes 1000,10000,100000.
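The RAM column follows from the per-vector sizes alone. The sketch below reproduces it from the stated 768 dimensions and 96-byte codes; it counts only the vectors and codes themselves, so PQ codebooks and any index overhead are excluded, and using 1 MB = 10^6 bytes is an assumption that happens to match the table.

```python
DIM = 768          # embedding dimension (from the text)
F32_BYTES = 4      # bytes per float32 component
CODE_BYTES = 96    # PQ code size per vector (from the text)

raw_per_vec = DIM * F32_BYTES            # 3072 bytes, i.e. ~3 KB
ratio = raw_per_vec / CODE_BYTES         # compression factor


def footprint_mb(n_vectors: int):
    """Resident size in MB for full f32 vectors vs PQ codes."""
    raw_mb = n_vectors * raw_per_vec / 1e6
    pq_mb = n_vectors * CODE_BYTES / 1e6
    return raw_mb, pq_mb


print(f"per vector: {raw_per_vec} B -> {CODE_BYTES} B ({ratio:.0f}x)")
for n in (1_000, 10_000, 100_000):
    raw_mb, pq_mb = footprint_mb(n)
    print(f"{n:>7,} vectors: {raw_mb:7.2f} MB -> {pq_mb:5.2f} MB")
```

The 32× factor is constant across corpus sizes because it depends only on the two per-vector sizes.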

Standard benchmark — CodeSearchNet-style

The 18-query gold set above is hand-written; this is its held-out complement. Following the CodeSearchNet method, a harness auto-harvests 186 docstring→function pairs from the real source and strips the doc comments from the indexed code, so the query text is never in the index (no leakage). Semantic embeddings win decisively on this larger, bias-free set.

Retrieval quality — 186 held-out NL→code queries

recall@k & MRR · semantic vs lexical · higher is better

Scope. 186 auto-harvested queries, file-level gold, doc comments stripped before indexing — a CodeSearchNet-style eval over SynaFS's own source. Reproduce: python3 bench/csn_bench.py.
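The harvesting step (doc comment becomes the query, and is stripped from what gets indexed) can be sketched for Rust `///` comments as below. This is a simplified illustration, not bench/csn_bench.py: the regexes, the `harvest` helper, and the sample snippet are hypothetical, and real Rust sources have cases (block doc comments, macros) that this does not handle.

```python
import re
from typing import List, Tuple

DOC = re.compile(r"^\s*///\s?(.*)$")
FN = re.compile(r"^\s*(?:pub(?:\([^)]*\))?\s+)?(?:async\s+)?fn\s+([A-Za-z_][A-Za-z0-9_]*)")


def harvest(source: str, path: str) -> Tuple[List[Tuple[str, str]], str]:
    """Return (query, gold file) pairs and the source with doc comments removed."""
    pairs: List[Tuple[str, str]] = []
    kept: List[str] = []
    doc: List[str] = []
    for line in source.splitlines():
        m = DOC.match(line)
        if m:
            doc.append(m.group(1).strip())
            continue              # doc comment never reaches the indexed text
        kept.append(line)
        if line.strip().startswith("#["):
            continue              # attributes may sit between the doc and the fn
        if FN.match(line) and doc:
            pairs.append((" ".join(doc), path))
        doc = []
    return pairs, "\n".join(kept)


# Made-up Rust snippet for illustration.
sample = """/// Encode header fields into a compact block.
#[inline]
pub fn encode(headers: &[Header]) -> Vec<u8> {
    // body
    Vec::new()
}

fn helper() {}
"""
pairs, stripped = harvest(sample, "src/example.rs")
print(pairs)
```

Since the query text is removed before indexing, a lexical match can only come from the code itself, which is the no-leakage property the evaluation relies on.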

External repos — semantic wins out of domain too

third-party arcflo (TS/JS) · recall@k & MRR · higher is better

Scope. The same harness now runs across 7 external third-party repos (TS/JS + Python, 320+ held-out pairs). Retrieval is harder than on the tuned Rust source, but semantic still beats lexical — arcflo shown. The absolute out-of-domain gap, and a public CoIR / CodeSearchNet leaderboard, remain ongoing work. Reproduce: python3 bench/csn_multi.py.

Honesty notes

The vector store is no longer RAM-bound: vectors are PQ-compressed to 96 bytes (32× smaller than full f32), with the raw vectors on an on-disk tier and exact rerank, so recall@10 holds at 1.0 even at 100k while RAM drops 307 MB → 9.6 MB (SYNA_ANN=pq; reproduce with syna pq-bench). The scan is no longer O(N) either: an IVF coarse quantizer (≈√N cells, probe a few) touches only ~8% of the codes at 10k and 100k (down from 12.9% at 1k), with recall@10 at ~0.99. §E adds a pure-Rust HNSW index that searches 26× faster than exact brute force at 100k, at 0.987 recall@10.
Beyond the 18-query hand set, a CodeSearchNet-style harness auto-harvests 186 held-out docstring→function pairs with the doc comments stripped from the indexed code (no leakage): semantic CodeRankEmbed scores R@1 0.75 / R@10 0.96 / MRR 0.82 vs the lexical baseline's 0.42 / 0.78 / 0.53 (bench/csn_bench.py). And it's no longer only our own source: the same harness now runs across 7 external third-party repos (TS/JS + Python, 320+ held-out pairs, bench/csn_multi.py). Out-of-domain retrieval is harder, but semantic still wins — e.g. on the third-party arcflo project, recall@10 0.45 vs 0.20 and MRR 0.29 vs 0.17 over lexical. Honest remainder: absolute external numbers trail the Rust target SynaFS was tuned on, and a public CoIR leaderboard is still future work.
Everything here is reproducible from a clean checkout with the commands below: build once, then run the harness (the last line is the lexical-only variant).

# build with the local semantic embedder, then run the harness
cargo build --release --features coderank
python3 bench/run_bench.py            # → bench/results.json
python3 bench/run_bench.py --skip-coderank   # lexical only, no model download

Full write-up: docs/benchmarks.md.
