Not a synthetic corpus. These numbers come from the actual release binary run against real codebases and a hand-curated NL→code gold set, by the harness in bench/run_bench.py.
Every bar below is computed straight from bench/results.json by bench/make_charts.py, with no hand-drawn numbers. The headline: swapping in the local semantic embedder roughly doubles MRR and triples recall@1 on paraphrased queries.
recall@k and MRR · higher is better · 18-query NL→code gold set
rank of the gold file, 1 = best · grey = lexical, blue = semantic · ✕ = not in top-10
chunks / second on real repos · hash embedder
p50 / p99 · in-process · lower is better
on a 1-line edit
This is not a mock-up. We recorded the actual session — building the binary, then running the whole suite end to end (A retrieval quality · B indexing · C engine latency · D filesystem) — and replay it below with its real timing (long idle pauses are compressed). Here is the exact machine it ran on.
Recorded with bench/record_cast.py (every line timestamped); replayed by a tiny vanilla-JS player — no asciinema, no external libraries. The same run wrote bench/results.json that powers the charts above.
The exact file the run just produced — shown verbatim, then charted straight from it.
SynaFS ships a built-in syna bench, but its corpus is synthetic — recall is 1.0 by construction, which proves the pipeline runs but says nothing about quality. The harness here is different: it indexes real source trees and grades retrieval against a gold set whose answers are fixed by inspection, swapping only the embedder so the comparison is clean.
cargo build --release --features coderank.

… node_modules, target, build output) from real repositories into a clean tree.

recall@k = share of queries with a gold hit in the top k; MRR = mean of 1/rank.

… syna bench; measure indexing throughput by wall-clock over the real trees.

| Natural-language query | Gold file |
|---|---|
| “compress response header fields for an http2 stream” | syna-grpc/src/hpack.rs |
| “get notified of every write across a whole mounted filesystem” | syna-engine/src/fanotify.rs |
| “demand and check the caller's certificate during the secure handshake” | syna-web/src/tls.rs |
| “run a neural code representation model locally without python” | syna-embed/src/coderank.rs |
Reference machine: x86-64 Linux, CPU-only inference, the pure-Rust engine (brute-force vector search over an in-memory snapshot). Your numbers will vary by corpus, embedder, and hardware. Raw output lives in bench/results.json.
bench/gold.json is 18 natural-language queries over the SynaFS source. Each is a paraphrase that deliberately avoids the codebase's own identifiers; the gold answer is the file that primarily implements that concept (e.g. "compress response header fields for an http2 stream" → syna-grpc/src/hpack.rs). Relevance is graded at file level. We index the same tree twice and run the identical hybrid pipeline, changing only the embedder.
| Embedder | recall@1 | recall@3 | recall@5 | recall@10 | MRR |
|---|---|---|---|---|---|
| Lexical (hash baseline) | 0.111 | 0.278 | 0.389 | 0.500 | 0.216 |
| Semantic (CodeRankEmbed) | 0.333 | 0.556 | 0.556 | 0.611 | 0.434 |
Semantic embeddings roughly double MRR (0.216 → 0.434) and triple recall@1 (0.111 → 0.333). The biggest per-query wins are exactly where lexical overlap is weakest: "get notified of every write across a whole mounted filesystem" → fanotify.rs climbs from rank 10 → 1, and "demand and check the caller's certificate during the secure handshake" → tls.rs from 6 → 1. Several hard queries (hpack.rs, wal.rs, ws.rs) are missed by both in the top-10: the set is small and untuned, not rigged toward a win.
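The two metrics in the table can be sketched in a few lines. This is a minimal illustration of the definitions given earlier (recall@k = share of queries with a gold hit in the top k; MRR = mean of 1/rank), not the code in bench/run_bench.py. The function names and the `ranks` list are hypothetical, and `None` stands for a gold file that did not appear in the top-10.

```python
from typing import Optional, Sequence


def recall_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Share of queries whose gold file appears at rank <= k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank; a miss (None) contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for four queries (illustrative, not taken from results.json).
ranks = [1, 10, None, 3]

print(recall_at_k(ranks, 1))                  # 0.25
print(recall_at_k(ranks, 10))                 # 0.75
print(round(mean_reciprocal_rank(ranks), 3))  # 0.358
```

Because relevance is graded at file level, each query contributes exactly one rank, so both metrics reduce to simple averages over the 18 queries.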
End-to-end syna index (tree-walk → tree-sitter chunk → embed → persist) over real source trees, code files only. The offline hash embedder isolates engine/chunking throughput; semantic indexing is bound by CPU model inference and is far slower (it is the quality path, not the throughput path).
| Corpus | Files | Chunks | files/s | MB/s | chunks/s | Index |
|---|---|---|---|---|---|---|
| SynaFS (self) | 60 | 964 | 466 | 5.7 | 7,493 | 7 MB |
| rogers | 467 | 4,304 | 324 | 4.4 | 2,988 | 34 MB |
| nidavellir | 559 | 21,604 | 839 | 64.2 | 32,438 | 198 MB |
Index size is the current M0 JSON snapshot (~7–9 KB/chunk, full uncompressed vectors); the planned RocksDB + usearch + PQ layout compresses this substantially.
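The ~7–9 KB/chunk figure follows directly from the Index and Chunks columns. The sketch below redoes that arithmetic; it assumes decimal units (1 MB = 1000 KB), and the `kb_per_chunk` helper is a hypothetical name used only for illustration.

```python
# (index size in MB, chunk count) copied from the table above.
corpora = {
    "SynaFS (self)": (7, 964),
    "rogers": (34, 4304),
    "nidavellir": (198, 21604),
}


def kb_per_chunk(index_mb: float, chunks: int) -> float:
    """Average on-disk index size per chunk, assuming 1 MB = 1000 KB."""
    return index_mb * 1000 / chunks


for name, (index_mb, chunks) in corpora.items():
    print(f"{name}: {kb_per_chunk(index_mb, chunks):.1f} KB/chunk")
```

The three corpora come out at roughly 7.3, 7.9, and 9.2 KB per chunk, which is the range quoted above.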
From syna bench at 1,000 files / 5,000 chunks. These are in-process (no per-call snapshot reload), so they reflect engine latency, not CLI start-up. Corpus is synthetic; the latency is real.
| Metric | p50 | p99 |
|---|---|---|
| Search latency | 0.21 ms | 0.25 ms |
| 1-line reindex latency | 19.5 ms | 21.3 ms |
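
For readers who want to reproduce this style of measurement on their own code, here is a minimal sketch of in-process p50/p99 timing. It is not how syna bench is implemented; `time_calls` and `percentile` are hypothetical helpers, the nearest-rank percentile definition is an assumption, and the lambda stands in for a real search call.

```python
import math
import time
from typing import Callable, List, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(fn: Callable[[], object], repeats: int = 1000) -> List[float]:
    """Call fn repeatedly in-process and return per-call latency in milliseconds."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# Placeholder workload standing in for an in-process search call.
latencies = time_calls(lambda: sum(range(1000)))
p50, p99 = percentile(latencies, 50), percentile(latencies, 99)
print(f"p50={p50:.4f} ms  p99={p99:.4f} ms")
```

Timing inside one process, as here, excludes start-up and snapshot loading, which is why the table reports engine latency and not CLI latency.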
SynaFS is a real FUSE filesystem, so we measured ordinary file ops through the mount against the raw backing disk, plus the cost of a semantic search done by listing a magic directory. The honest picture: metadata is essentially free, every latency stays under ~100 µs, and a semantic query runs in a fraction of a millisecond — but raw byte streaming carries real overhead, because the current pure-Rust FUSE copies bytes through userspace (kernel FUSE_PASSTHROUGH would close that gap).
sequential read & write · mount as a share of raw
stat / readdir / open+read · p50 · lower is better
semantic query through the magic path
| Operation | raw | mount |
|---|---|---|
| Sequential read | 32.2 GB/s | 6.4 GB/s |
| Sequential write (index-on-write) | 2.5 GB/s | 0.7 GB/s |
| stat / getattr | 3.8 µs | 4.1 µs |
| readdir | 2.5 µs | 19.0 µs |
| open + read (small file) | 5.6 µs | 23.0 µs |
| Semantic query · ls magic path | — | 0.37 ms |
stat is nearly identical through the mount (3.8 µs vs 4.1 µs) because attributes are cached; throughput is lower because reads/writes round-trip through userspace and writes also enqueue the reindex. Reproduce: python3 bench/fs_bench.py.

The biggest M0 limitation was exact brute-force vector search, which is O(N) per query. So we implemented a pure-Rust HNSW approximate-nearest-neighbour index (no external crates) behind the same VectorIndex trait, selectable with SYNA_ANN=hnsw, and measured it against exact search over clustered 768-dim vectors. Search becomes sublinear: 26× faster at 100k while keeping 98.7% recall@10.
HNSW vs exact brute-force · by corpus size
recall@10 vs exact brute-force · 0–1
p50 search · brute-force vs HNSW
| Vectors | search p50 · brute → HNSW | recall@10 | speedup | HNSW build |
|---|---|---|---|---|
| 1,000 | 0.31 → 0.30 ms | 1.000 | 1.0× | 0.6 s |
| 10,000 | 3.1 → 0.48 ms | 1.000 | 6.4× | 8.9 s |
| 100,000 | 31.9 → 1.23 ms | 0.987 | 26.0× | 122 s |
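
The recall@10 column compares an approximate index against exact brute-force results. A minimal sketch of that comparison follows. It is pure Python for illustration and is not the Rust VectorIndex implementation; the function names, the toy 2-dim vectors, and the use of cosine similarity are all assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_exact(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[int]:
    """Exact brute-force search: score every vector (O(N)) and keep the k best ids."""
    order = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return order[:k]


def recall_vs_exact(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact top-k that the approximate index also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data: four 2-dim vectors and one query.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
exact = top_k_exact([1.0, 0.0], vectors, k=2)

# A hypothetical approximate result that found id 0 but returned id 2 in place of id 1.
approx = [0, 2]
print(exact, recall_vs_exact(approx, exact))  # [0, 1] 0.5
```

Averaging this fraction over many queries with k = 10 gives a recall@10 figure of the kind reported in the table.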
ef_search / ef_construction. Reproduce: syna ann-bench --sizes 1000,10000,100000.

M0 kept every vector as full f32 (~3 KB each), so a few million chunks no longer fit in RAM. SynaFS now ships a pure-Rust PQ-compressed index (SYNA_ANN=pq): each vector becomes a 96-byte code, the raw f32s move to an on-disk tier, and search re-scores the top candidates exactly from disk, so RAM drops 32× while recall@10 stays at the brute-force level. Reproduce with syna pq-bench.
full f32 vs PQ codes · lower is better
full f32 → PQ code
recall@10 at 100k · PQ + exact rerank
plain PQ (O(N)) vs IVF-PQ · fewer codes is better
| Vectors | RAM (raw → PQ) | compression | recall@10 | search p50 | IVF scan |
|---|---|---|---|---|---|
| 1,000 | 3.07 → 0.10 MB | 32× | 1.000 | 0.24 ms | 12.9% |
| 10,000 | 30.7 → 0.96 MB | 32× | 1.000 | 0.68 ms | 8.0% |
| 100,000 | 307 → 9.6 MB | 32× | 1.000 | 6.45 ms | 8.2% |
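
The 32× figure and the RAM column can be checked with simple arithmetic. The sketch below assumes the PQ benchmark uses the same 768-dim f32 vectors as the HNSW benchmark (consistent with the "~3 KB each" figure) and decimal megabytes; the constant and function names are hypothetical.

```python
DIMS = 768          # assumed vector dimensionality (matches ~3 KB per f32 vector)
BYTES_PER_F32 = 4
PQ_CODE_BYTES = 96  # per-vector code size stated in the text


def ram_mb(num_vectors: int, bytes_per_vector: int) -> float:
    """In-memory footprint in MB, assuming 1 MB = 1,000,000 bytes."""
    return num_vectors * bytes_per_vector / 1e6


raw_bytes = DIMS * BYTES_PER_F32          # 3072 bytes per raw vector
compression = raw_bytes / PQ_CODE_BYTES   # 32.0

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7}: {ram_mb(n, raw_bytes):.2f} MB -> {ram_mb(n, PQ_CODE_BYTES):.2f} MB "
          f"({compression:.0f}x)")
```

The ratio is independent of corpus size because only the fixed-size codes stay in RAM; the raw vectors used for exact re-scoring sit on the on-disk tier.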
syna pq-bench --sizes 1000,10000,100000.

The 18-query gold set above is hand-written; this is its held-out complement. Following the CodeSearchNet method, a harness auto-harvests 186 docstring→function pairs from the real source and strips the doc comments from the indexed code, so the query text is never in the index (no leakage). Semantic embeddings win decisively on this larger, bias-free set.
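To make the harvesting step concrete, here is a rough sketch of pulling doc-comment→function pairs out of Rust source while stripping the comments from the text that would be indexed. It is not bench/csn_bench.py: `harvest_pairs`, the regex, and the sample source are hypothetical, and it only handles `///` line comments directly above a `fn`.

```python
import re
from typing import List, Tuple

# Matches an optional visibility/async prefix followed by `fn name`.
FN_RE = re.compile(r"^\s*(?:pub(?:\([^)]*\))?\s+)?(?:async\s+)?fn\s+(\w+)")


def harvest_pairs(source: str) -> Tuple[List[Tuple[str, str]], str]:
    """Return ([(doc text, function name)], source with `///` lines removed)."""
    pairs, kept, doc = [], [], []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("///"):
            doc.append(stripped[3:].strip())  # collect the doc text, drop the line
            continue
        match = FN_RE.match(line)
        if match and doc:
            pairs.append((" ".join(doc), match.group(1)))
        if not stripped.startswith("#["):     # attributes may sit between doc and fn
            doc = []
        kept.append(line)
    return pairs, "\n".join(kept)


sample = """/// Compress header fields.
/// Uses a table.
pub fn encode(x: u8) -> u8 {
    x
}

fn helper() {}
"""

pairs, indexed_code = harvest_pairs(sample)
print(pairs)  # [('Compress header fields. Uses a table.', 'encode')]
```

The doc text becomes the query and the function is the gold answer; since `indexed_code` no longer contains the comment, a match cannot come from the query text itself.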
recall@k & MRR · semantic vs lexical · higher is better
python3 bench/csn_bench.py.

third-party arcflo (TS/JS) · recall@k & MRR · higher is better
python3 bench/csn_multi.py.

… SYNA_ANN=pq; reproduce with syna pq-bench). And the scan is no longer O(N): an IVF coarse quantizer (≈√N cells, probe a few) touches only ~8% of the codes at 100k; recall@10 stays ~0.99 and that fraction keeps shrinking as the corpus grows. §E adds pure-Rust HNSW on top, for search 26× faster than exact brute-force.

… bench/csn_bench.py). And it's no longer only our own source: the same harness now runs across 7 external third-party repos (TS/JS + Python, 320+ held-out pairs, bench/csn_multi.py). Out-of-domain retrieval is harder, but semantic still wins: on the third-party arcflo project, for example, recall@10 is 0.45 vs 0.20 and MRR is 0.29 vs 0.17 against lexical. Honest remainder: absolute external numbers trail the Rust target SynaFS was tuned on, and a public CoIR leaderboard is still future work.

    # build with the local semantic embedder, then run the harness
    cargo build --release --features coderank
    python3 bench/run_bench.py                  # → bench/results.json
    python3 bench/run_bench.py --skip-coderank  # lexical only, no model download
Full write-up: docs/benchmarks.md.