Benchmarks

Measured on real code, reproducibly.

Not a synthetic corpus. These numbers come from the actual release binary run against real codebases and a hand-curated NL→code gold set, by the harness in bench/run_bench.py.

Results

What the runs show

Every bar below is computed straight from bench/results.json by bench/make_charts.py, with no hand-drawn numbers. The headline: swapping in the local semantic embedder roughly doubles MRR (0.216 → 0.434) and triples recall@1 (0.111 → 0.333) on paraphrased queries.

Retrieval quality — semantic vs lexical

recall@k and MRR · higher is better · 18-query NL→code gold set

semantic (CodeRankEmbed) vs lexical (hash) · R@1 0.33 vs 0.11 · R@3 0.56 vs 0.28 · R@5 0.56 vs 0.39 · R@10 0.61 vs 0.44 · MRR 0.43 vs 0.21

Per-query rank — lexical vs semantic

rank of the gold file, 1 = best · grey = lexical, blue = semantic · ✕ = not in top-10

Rank axis: 1, 3, 5, 10, miss. Queries, by gold file: sys/uring.rs, index/lib.rs, sys/device.rs, web/tls.rs, engine/fanotify.rs, embed/coderank.rs, web/lib.rs, fuse/resolver.rs, chunk/lib.rs, store/lib.rs, query/lib.rs, grpc/hpack.rs, engine/versions.rs, web/ws.rs, engine/wal.rs, index/graph.rs, engine/versions.rs, engine/state.rs

Indexing throughput

chunks / second on real repos · hash embedder

nidavellir 32,438 chunks/s · SynaFS 7,493 chunks/s · rogers 2,988 chunks/s (hash embedder; higher is better)

Search latency

p50 / p99 · in-process · lower is better

p50 0.21 ms · p99 0.25 ms (in-process, 5,000-chunk engine)

Embed-cache hit rate

on a 1-line edit

80% cached · 20% re-embedded (touched chunks only)

Live

Watch every experiment run

This is not a mock-up. We recorded the actual session — building the binary, then running the whole suite end to end (A retrieval quality · B indexing · C engine latency · D filesystem) — and replay it below with its real timing (long idle pauses are compressed). Here is the exact machine it ran on.

OS: Ubuntu 24.04.4 LTS · Linux 6.17 · x86-64
CPU: AMD Ryzen AI MAX+ PRO 395 · 32 cores
Memory: 94 GB
Toolchain: rustc 1.96.0 · cargo 1.96.0 · Python 3.12.3
Embedder: CodeRankEmbed (137M, 768-dim) · pure-Rust candle · CPU inference

Recorded with bench/record_cast.py (every line timestamped); replayed by a tiny vanilla-JS player — no asciinema, no external libraries. The same run wrote bench/results.json that powers the charts above.


Methodology

How we experimented

SynaFS ships a built-in syna bench, but its corpus is synthetic — recall is 1.0 by construction, which proves the pipeline runs but says nothing about quality. The harness here is different: it indexes real source trees and grades retrieval against a gold set whose answers are fixed by inspection, swapping only the embedder so the comparison is clean.

  1. Build the release binary with the local semantic embedder: cargo build --release --features coderank.
  2. Assemble corpora — copy code files only (excluding node_modules, target, build output) from real repositories into a clean tree.
  3. Write 18 natural-language queries, each a paraphrase that avoids the codebase's own identifiers; fix the gold answer to the file that primarily implements that concept.
  4. Index the SynaFS tree twice — once with the offline lexical hash embedder, once with CodeRankEmbed — and run the identical hybrid pipeline (RRF over vector + BM25 + trigram).
  5. Grade at file level: a hit counts if its path ends with the gold path. recall@k = share of queries with a gold hit in the top k; MRR = mean of 1/rank.
  6. Measure engine latency in-process via syna bench; measure indexing throughput by wall-clock over the real trees.
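The grading rule in step 5 can be sketched in a few lines of Python. This is an illustrative sketch, not the code in bench/run_bench.py: the function names and the sample result lists are hypothetical, and counting a miss as 0 in the MRR is an assumption.

```python
from typing import List, Optional


def gold_rank(ranked_paths: List[str], gold_path: str, k: int = 10) -> Optional[int]:
    """Return the 1-based rank of the first hit in the top k, or None on a miss.

    A hit is any result whose path ends with the gold path (file-level grading).
    """
    for rank, path in enumerate(ranked_paths[:k], start=1):
        if path.endswith(gold_path):
            return rank
    return None


def recall_at_k(ranks: List[Optional[int]], k: int) -> float:
    """Share of queries whose gold file appears at rank <= k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Mean of 1/rank over all queries; a miss contributes 0 (assumption)."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical search results for three queries (not taken from results.json).
runs = [
    (["/corpus/syna-web/src/tls.rs", "/corpus/syna-web/src/lib.rs"], "syna-web/src/tls.rs"),
    (["/corpus/syna-web/src/ws.rs", "/corpus/syna-engine/src/wal.rs"], "syna-engine/src/wal.rs"),
    (["/corpus/syna-index/src/lib.rs"], "syna-grpc/src/hpack.rs"),
]
ranks = [gold_rank(results, gold) for results, gold in runs]

print(ranks)                        # [1, 2, None]
print(recall_at_k(ranks, 1))        # one hit out of three queries
print(mean_reciprocal_rank(ranks))  # (1/1 + 1/2 + 0) / 3 = 0.5
```

With 18 queries, each recall value moves in steps of 1/18 (about 0.056), which is worth keeping in mind when reading differences between the two embedders.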

Gold-set examples (the queries contain none of the target's identifiers)

Natural-language query | Gold file
“compress response header fields for an http2 stream” | syna-grpc/src/hpack.rs
“get notified of every write across a whole mounted filesystem” | syna-engine/src/fanotify.rs
“demand and check the caller's certificate during the secure handshake” | syna-web/src/tls.rs
“run a neural code representation model locally without python” | syna-embed/src/coderank.rs

Reference machine: x86-64 Linux, CPU-only inference, the pure-Rust engine (brute-force vector search over an in-memory snapshot). Your numbers will vary by corpus, embedder, and hardware. Raw output lives in bench/results.json.

A

Retrieval quality — semantic vs lexical

bench/gold.json holds 18 natural-language queries over the SynaFS source. Each is a paraphrase that deliberately avoids the codebase's own identifiers; the gold answer is the file that primarily implements that concept (e.g. "compress response header fields for an http2 stream" → syna-grpc/src/hpack.rs). Relevance is graded at file level. We index the same tree twice and run the identical hybrid pipeline, changing only the embedder.
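For readers unfamiliar with the fusion step, here is a minimal sketch of reciprocal rank fusion (RRF) over three ranked lists, matching the vector + BM25 + trigram combination named in the methodology. It is not SynaFS's implementation: the function name, the sample rankings, and the constant k = 60 (a common default for RRF) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical per-signal rankings for one query.
vector_hits = ["a.rs", "b.rs", "c.rs"]
bm25_hits = ["b.rs", "a.rs", "d.rs"]
trigram_hits = ["b.rs", "c.rs", "a.rs"]

fused = rrf_fuse([vector_hits, bm25_hits, trigram_hits])
print(fused)  # ['b.rs', 'a.rs', 'c.rs', 'd.rs']
```

Because RRF only looks at ranks, swapping the embedder changes one input list and leaves the other two untouched, which is what makes the lexical vs semantic comparison clean.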

Embedder | recall@1 | recall@3 | recall@5 | recall@10 | MRR
Lexical (hash baseline) | 0.111 | 0.278 | 0.389 | 0.500 | 0.216
Semantic (CodeRankEmbed) | 0.333 | 0.556 | 0.556 | 0.611 | 0.434

Semantic embeddings roughly double MRR (0.216 → 0.434) and triple recall@1 (0.111 → 0.333). The biggest per-query wins are exactly where lexical overlap is weakest: "get notified of every write across a whole mounted filesystem" (fanotify.rs) climbs from rank 10 → 1, and "demand and check the caller's certificate during the secure handshake" (tls.rs) from 6 → 1. Several hard queries (hpack.rs, wal.rs, ws.rs) are missed by both in the top-10; the set is small and untuned, not rigged toward a win.

Scope. 18 queries, file-level gold, author-written. This measures the value of the semantic signal inside SynaFS's own pipeline, not a leaderboard ranking against external systems. A standard benchmark (CoIR / CodeSearchNet) is future work.

B

Indexing throughput — real repositories

End-to-end syna index (tree-walk → tree-sitter chunk → embed → persist) over real source trees, code files only. The offline hash embedder isolates engine/chunking throughput; semantic indexing is bound by CPU model inference and is far slower (it is the quality path, not the throughput path).

Corpus | Files | Chunks | files/s | MB/s | chunks/s | Index
SynaFS (self) | 60 | 964 | 466 | 5.7 | 7,493 | 7 MB
rogers | 467 | 4,304 | 324 | 4.4 | 2,988 | 34 MB
nidavellir | 559 | 21,604 | 839 | 64.2 | 32,438 | 198 MB

Index size is the current M0 JSON snapshot (~7–9 KB/chunk, full uncompressed vectors); the planned RocksDB + usearch + PQ layout compresses this substantially.
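The per-chunk figure can be checked from the table with simple arithmetic. The sketch below assumes 1 MB = 1,000 KB and uses the rounded index sizes as published, so the results are approximate.

```python
# (index size in MB, chunk count) per corpus, copied from the table above.
corpora = {
    "SynaFS (self)": (7, 964),
    "rogers": (34, 4_304),
    "nidavellir": (198, 21_604),
}

# Assumption: 1 MB = 1,000 KB; the published sizes are already rounded.
kb_per_chunk = {
    name: size_mb * 1000 / chunks for name, (size_mb, chunks) in corpora.items()
}

for name, kb in kb_per_chunk.items():
    print(f"{name}: {kb:.1f} KB/chunk")
# SynaFS (self): 7.3 KB/chunk
# rogers: 7.9 KB/chunk
# nidavellir: 9.2 KB/chunk
```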

C

Engine latency — in-process

From syna bench at 1,000 files / 5,000 chunks. These are in-process (no per-call snapshot reload), so they reflect engine latency, not CLI start-up. Corpus is synthetic; the latency is real.

Metric | p50 | p99
Search latency | 0.21 ms | 0.25 ms
1-line reindex latency | 19.5 ms | 21.3 ms

D

Filesystem performance — the POSIX floor

SynaFS is a real FUSE filesystem, so we measured ordinary file ops through the mount against the raw backing disk, plus the cost of a semantic search done by listing a magic directory. The honest picture: metadata is essentially free, every latency stays under ~100 µs, and a semantic query runs in a fraction of a millisecond — but raw byte streaming carries real overhead, because the current pure-Rust FUSE copies bytes through userspace (kernel FUSE_PASSTHROUGH would close that gap).
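As an illustration of how a p50 / p99 metadata latency can be measured, here is a small Python sketch that times repeated stat calls and reports nearest-rank percentiles. It is not bench/fs_bench.py: the helper names are hypothetical, and the demo times a temporary file on the local disk rather than a SynaFS mount. Pointing `time_stat_us` at a path under the mount and at the same file on the backing disk would give the raw vs mount comparison.

```python
import math
import os
import tempfile
import time
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(pct * len(ordered) / 100) - 1)
    return ordered[idx]


def time_stat_us(path: str, iterations: int = 1000) -> List[float]:
    """Time os.stat(path) repeatedly; return per-call latency in microseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        os.stat(path)
        samples.append((time.perf_counter_ns() - start) / 1000.0)
    return samples


# Demo on a temporary file (warm cache after the first call).
with tempfile.NamedTemporaryFile() as tmp:
    samples = time_stat_us(tmp.name)

print(f"stat p50 {percentile(samples, 50):.1f} µs, p99 {percentile(samples, 99):.1f} µs")
```

Interpreter overhead adds a constant to every sample, so absolute numbers from a sketch like this will sit above the figures reported here; the ratio between raw and mount is the more meaningful reading.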

Byte throughput vs raw disk

sequential read & write · mount as a share of raw

read: raw 32.2 GB/s, mount 6.4 GB/s (20% of raw) · write: raw 2.5 GB/s, mount 0.7 GB/s (28% of raw) · warm cache, pure-Rust userspace FUSE (no kernel passthrough yet)

Metadata latency · raw vs mount

stat / readdir / open+read · p50 · lower is better

raw vs mount, p50: stat 3.8 vs 4.1 µs · readdir 2.5 vs 19.0 µs · open+read 5.6 vs 23.0 µs

Search by ls

semantic query through the magic path

ls /.syna/query/<text>/ (a search by readdir) · p50 0.37 ms · mean 2.7 ms (p99 includes cold first query)

Operation | raw | mount
Sequential read | 32.2 GB/s | 6.4 GB/s
Sequential write (index-on-write) | 2.5 GB/s | 0.7 GB/s
stat / getattr | 3.8 µs | 4.1 µs
readdir | 2.5 µs | 19.0 µs
open + read (small file) | 5.6 µs | 23.0 µs
Semantic query · ls magic path | | 0.37 ms
Scope. Warm page cache, single host, one FUSE worker. stat is nearly identical through the mount (3.8 vs 4.1 µs) because attributes are cached; throughput is lower because reads/writes round-trip through userspace and writes also enqueue the reindex. Reproduce: python3 bench/fs_bench.py.

E

Scaling — overcoming brute-force with ANN

The biggest M0 limitation was exact brute-force vector search — O(N) per query. So we implemented a pure-Rust HNSW approximate-nearest-neighbour index (no external crates) behind the same VectorIndex trait, selectable with SYNA_ANN=hnsw, and measured it against exact search over clustered 768-dim vectors. Search becomes sublinear: 26× faster at 100k while keeping 98.7% recall@10.
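The recall@10 figure in this section compares the approximate index against exact search. A minimal sketch of that comparison follows; it is not the syna ann-bench code. The function names and the toy 2-d vectors are hypothetical, and cosine similarity is assumed as the metric.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[int]:
    """Brute-force O(N) search: score every vector, keep the k most similar ids."""
    order = sorted(range(len(vectors)), key=lambda i: -cosine(query, vectors[i]))
    return order[:k]


def recall_vs_exact(approx_ids: List[int], exact_ids: List[int]) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data: ids 0 and 1 are the true nearest neighbours of the query.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]

exact = exact_top_k(query, vectors, k=2)
approx = [0, 2]  # hypothetical ANN answer that missed id 1

print(exact)                           # [0, 1]
print(recall_vs_exact(approx, exact))  # 0.5
```

Averaged over many queries, this is the quantity reported as recall@10 vs exact: 1.0 means the approximate index returned exactly the brute-force answer set.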

Search speedup vs scale

HNSW vs exact brute-force · by corpus size

1k: 1.0× · 10k: 6.4× · 100k: 26.0× (HNSW search vs brute-force; higher is better)

Recall held as it scales

recall@10 vs exact brute-force · 0–1

1k: 1.000 · 10k: 1.000 · 100k: 0.987 (recall@10 vs exact; 1.0 = identical to brute-force)

Latency at 100k

p50 search · brute-force vs HNSW

brute-force 31.93 ms · HNSW 1.23 ms (p50 search at 100k vectors; 26× apart)

Vectors | search p50 · brute → HNSW | recall@10 | speedup | HNSW build
1,000 | 0.31 → 0.30 ms | 1.000 | 1.0× | 0.6 s
10,000 | 3.1 → 0.48 ms | 1.000 | 6.4× | 8.9 s
100,000 | 31.9 → 1.23 ms | 0.987 | 26.0× | 122 s
Scope. Clustered synthetic vectors: uniform random high-dimensional data has no nearest-neighbour structure, so every serious ANN benchmark uses structured data. HNSW trades build time for query speed: graph construction is O(N·logN) and slow in pure Rust (122 s at 100k), but it is paid incrementally (once per write), not per query. Recall vs latency is tunable via ef_search / ef_construction. Reproduce: syna ann-bench --sizes 1000,10000,100000.

F

Memory — 32× smaller with product quantization

M0 kept every vector as full f32 — ~3 KB each, so a few million chunks no longer fit in RAM. SynaFS now ships a pure-Rust PQ-compressed index (SYNA_ANN=pq): each vector becomes a 96-byte code, the raw f32s move to an on-disk tier, and search re-scores the top candidates exactly from disk — so RAM drops 32× while recall@10 stays at brute-force. Reproduce with syna pq-bench.
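The 32× figure follows directly from the sizes quoted above, as the short calculation below shows. The split into 96 sub-vectors of 8 dimensions with one byte per code is an assumption about the PQ layout (the text only states the 96-byte code size), and MB here means 10^6 bytes.

```python
DIM = 768            # embedding dimension (CodeRankEmbed, from the text)
BYTES_PER_F32 = 4    # size of one f32 component
CODE_BYTES = 96      # PQ code size per vector (from the text)
N_VECTORS = 100_000

raw_bytes = DIM * BYTES_PER_F32            # 3072 bytes per full f32 vector
compression = raw_bytes / CODE_BYTES       # 32.0
raw_mb = N_VECTORS * raw_bytes / 1e6       # 307.2 MB resident without PQ
pq_mb = N_VECTORS * CODE_BYTES / 1e6       # 9.6 MB resident with PQ

# Assumption: one byte (256 centroids) per sub-quantizer, so the vector is
# split into 96 sub-vectors of 768 / 96 = 8 dimensions each.
dims_per_subvector = DIM // CODE_BYTES     # 8

print(raw_bytes, compression, raw_mb, pq_mb, dims_per_subvector)
```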

RAM footprint at 100k vectors

full f32 vs PQ codes · lower is better

full f32: 307.2 MB · PQ codes: 9.6 MB (in-RAM vector store, 100k × 768-d; raw tier on disk)

Per-vector RAM

full f32 → PQ code

32× smaller: 3072 B (full f32) → 96 B (PQ code)

Recall held

recall@10 at 100k · PQ + exact rerank

recall@10 = 1.000 at 100k vectors

ADC scan per query

plain PQ (O(N)) vs IVF-PQ · fewer codes is better

plain PQ: 100% of the 100k codes · IVF-PQ: 8.2% (fraction of the 100k PQ codes the scan touches; IVF probes only a few of the ≈√N cells)

Vectors | RAM (raw → PQ) | compression | recall@10 | search p50 | IVF scan
1,000 | 3.07 → 0.10 MB | 32× | 1.000 | 0.24 ms | 12.9%
10,000 | 30.7 → 0.96 MB | 32× | 1.000 | 0.68 ms | 8.0%
100,000 | 307 → 9.6 MB | 32× | 1.000 | 6.45 ms | 8.2%
Scope. Clustered 768-d vectors (the structure real embeddings have). PQ trades a little search time for a 32× smaller resident index; the ADC scan is still O(N), so past ~1M vectors you layer the §E HNSW graph on top of the compressed tier. The on-disk raw tier is read only for the top-k rerank. Reproduce: syna pq-bench --sizes 1000,10000,100000.

G

Standard benchmark — CodeSearchNet-style

The 18-query gold set above is hand-written; this is its held-out complement. Following the CodeSearchNet method, a harness auto-harvests 186 docstring→function pairs from the real source and strips the doc comments from the indexed code, so the query text is never in the index (no leakage). On this larger set, which no author hand-picked, semantic embeddings win clearly: MRR 0.53 → 0.82 and recall@1 0.42 → 0.75.
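The harvesting idea can be sketched as follows. This is a simplified illustration, not bench/csn_bench.py: it assumes Rust `///` doc comments directly above a `fn`, the regex and function name are hypothetical, and it returns the function name where the real evaluation grades against the containing file.

```python
import re
from typing import List, Tuple

# Matches a Rust function header and captures its name. A simplification:
# it ignores `const fn`, `extern` ABIs, and macro-generated functions.
FN_RE = re.compile(r"(?:pub(?:\([^)]*\))?\s+)?(?:async\s+)?(?:unsafe\s+)?fn\s+(\w+)")


def harvest_pairs(source: str) -> Tuple[List[Tuple[str, str]], str]:
    """Return (doc text, function name) pairs and the source with docs removed."""
    pairs: List[Tuple[str, str]] = []
    kept: List[str] = []
    doc: List[str] = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("///"):
            doc.append(stripped[3:].strip())
            continue  # doc comments never reach the indexed text
        if stripped.startswith("#["):
            kept.append(line)  # attributes may sit between the docs and the fn
            continue
        match = FN_RE.match(stripped)
        if doc and match:
            pairs.append((" ".join(doc), match.group(1)))
        doc = []
        kept.append(line)
    return pairs, "\n".join(kept)


sample = """\
/// Compress header fields.
/// Uses a static table.
pub fn encode(x: u8) -> u8 { x }

// plain comment
fn helper() {}
"""

pairs, stripped_source = harvest_pairs(sample)
print(pairs)  # [('Compress header fields. Uses a static table.', 'encode')]
```

The stripped source is what gets indexed, and each harvested doc text becomes a query, so a hit has to come from the code itself rather than from matching the comment.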

Retrieval quality — 186 held-out NL→code queries

recall@k & MRR · semantic vs lexical · higher is better

semantic (CodeRankEmbed) vs lexical (hash) · R@1 0.75 vs 0.42 · R@3 0.86 vs 0.58 · R@5 0.91 vs 0.69 · R@10 0.96 vs 0.78 · MRR 0.82 vs 0.53

Scope. 186 auto-harvested queries, file-level gold, doc comments stripped before indexing: a CodeSearchNet-style eval over SynaFS's own source. Reproduce: python3 bench/csn_bench.py.

External repos — semantic wins out of domain too

third-party arcflo (TS/JS) · recall@k & MRR · higher is better

semantic (CodeRankEmbed) vs lexical (hash) · R@1 0.23 vs 0.15 · R@5 0.38 vs 0.18 · R@10 0.45 vs 0.20 · MRR 0.29 vs 0.17

Scope. The same harness now runs across 7 external third-party repos (TS/JS + Python, 320+ held-out pairs). Retrieval is harder than on the tuned Rust source, but semantic still beats lexical (arcflo shown). The absolute out-of-domain gap, and a public CoIR / CodeSearchNet leaderboard, remain ongoing work. Reproduce: python3 bench/csn_multi.py.

Honesty notes

# build with the local semantic embedder, then run the harness
cargo build --release --features coderank
python3 bench/run_bench.py            # → bench/results.json
python3 bench/run_bench.py --skip-coderank   # lexical only, no model download

Full write-up: docs/benchmarks.md.