For coding agents

Does SynaFS actually make coding agents faster?

The retrieval numbers on the benchmarks page prove the index is good. This page asks the harder question: when a real agent (Claude Code or Codex) is handed SynaFS as a tool, does it spend fewer tokens and tool calls to get the job done? We ran live A/B trials (same task, same model, only the toolbox differs) on a real 367-file app and on a trimmed 259-file tree of the same app.

Headline

What the live A/B shows

At the retrieval layer SynaFS beats grep for any agent. End-to-end, whether that becomes a win depends on the task shape and the model, and on whether the agent actually routes its navigation through the tool. Five charts follow, then the full breakdown.

Where the answer ranks — grep vs SynaFS

recall@1 / recall@5 / MRR · higher is better · 23-query NL→code gold set

[Chart: SynaFS (semantic) vs grep (lexical tool-loop). recall@1: 0.57 vs 0.04 · recall@5: 0.83 vs 0.52 · MRR: 0.67 vs 0.28]

Context read to reach the answer

whole-file bytes ingested before the gold file surfaces · lower is better

[Chart: grep 187 KB vs SynaFS 28 KB (6.6× less), medians over the 23-query gold set]

Strong-agent token cost grows with the repo — SynaFS stays flat

Claude Sonnet · multi-file task · median billed input tokens

[Chart: baseline (grep only) vs +SynaFS. 259-file repo: 40.3K vs 35.4K · 367-file repo: 55.7K vs 37.6K]

Codex barely adopts the tool

median calls per task with +SynaFS — it greps anyway

[Chart: Codex +SynaFS. Shell grep ≈37 shell commands vs SynaFS 2.5 search calls]

The win peaks at the mid tier — an inverted-U

multi-file input-token change vs baseline · 259-file repo · ← SynaFS saves · grep cheaper →

[Chart: Haiku +30% · Sonnet −12% · Opus 4.8 +6% · Codex −13%. Negative values mean SynaFS saves tokens; positive values mean SynaFS costs more.]

The 2×2

Who wins depends on task shape × model strength

The end-to-end benefit is not a single number. SynaFS pays off wherever the agent would otherwise over-acquire context — and a weak model over-acquires on easy lookups, while a strong model only over-acquires on hard recall tasks.

| | Single-file nav: “where is X?” | Multi-file synthesis: “map the whole subsystem” |
| --- | --- | --- |
| Cheap model (Haiku) | SynaFS wins · −21% input tokens | SynaFS hurts · thrashes search (30–46 calls) |
| Strong model (Sonnet) | ≈ no change · grep already cheap on a tidy tree | SynaFS wins · F1 0.84→0.94 · −12% tokens |

A — Retrieval efficiency (no agent variance)

23 natural-language queries over a real full-stack app, each paraphrased to avoid the codebase's own identifiers (so it can't be solved by keyword match). We compare the lexical tool-loop an agent runs today (deterministic keyword → ripgrep) against a single syna query on the CodeRankEmbed index. SynaFS puts the right file at rank 1 on 57% of queries vs grep's 4%, and the agent reaches the answer after opening 1 file instead of 5 — 6.6× less code ingested.

| Strategy | recall@1 | recall@5 | MRR | files→gold | context→gold |
| --- | --- | --- | --- | --- | --- |
| grep (lexical tool-loop) | 0.04 | 0.52 | 0.28 | 5 | 187 KB |
| SynaFS semantic | 0.57 | 0.83 | 0.67 | 1 | 28 KB |

Scope. 23 queries, file-level single-answer gold, author-written. syna-lex (the same pipeline on a lexical embedder) is actually worse than grep: the value is the neural embedder, not the hybrid plumbing.
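
The retrieval metrics reported in this section can be computed from per-query ranked result lists. The following is a minimal sketch, not the harness code: the function names and the toy queries, paths and file sizes are hypothetical, and it assumes exactly one gold file per query, as the scope note states.

```python
def first_hit_rank(ranked_paths, gold_path):
    """Return the 1-based rank of the gold file, or None if it never appears."""
    for rank, path in enumerate(ranked_paths, start=1):
        if path == gold_path:
            return rank
    return None


def recall_at_k(ranks, k):
    """Fraction of queries whose gold file appears within the top k results."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def context_to_gold(ranked_paths, gold_path, file_sizes):
    """Return (files opened, whole-file bytes read) up to and including the
    gold file, or None if the gold file never surfaces."""
    total_bytes = 0
    for opened, path in enumerate(ranked_paths, start=1):
        total_bytes += file_sizes[path]
        if path == gold_path:
            return opened, total_bytes
    return None


# Toy data (hypothetical): (ranked results, gold file) for three queries.
runs = [
    (["a.py", "b.py"], "a.py"),
    (["c.py", "a.py", "b.py"], "b.py"),
    (["c.py"], "a.py"),
]
sizes = {"a.py": 100, "b.py": 200, "c.py": 300}

ranks = [first_hit_rank(results, gold) for results, gold in runs]
print(f"recall@1={recall_at_k(ranks, 1):.2f}")
print(f"recall@5={recall_at_k(ranks, 5):.2f}")
print(f"MRR={mean_reciprocal_rank(ranks):.2f}")
print(context_to_gold(runs[1][0], runs[1][1], sizes))
```

The table reports medians for files→gold and context→gold, so a full reproduction would take the median of the per-query values rather than the single-query result printed here.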

B — Claude Code, live (single-file vs multi-file)

Same prompt, same model, headless claude -p; the only difference is whether the SynaFS MCP (search / read_span / symbol_lookup) is wired in. On single-file “where is X?” tasks the cheap model (Haiku) gains and the strong model (Sonnet) is flat. On multi-file “list every file in subsystem Y” tasks it inverts: the strong model wins decisively while the weak model thrashes the search tool. Multi-file is graded by set precision / recall / F1 against a hand-verified gold set.

Single-file — “where is X? report the path” (12 navigation tasks)

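| Model | Condition | success | tools | files opened | input tokens | wall |
| --- | --- | --- | --- | --- | --- | --- |
| Haiku | baseline | 12/12 | 4.0 | 1.9 | 19,381 | 19.3 s |
| Haiku | +SynaFS | 12/12 | 4.0 | 1.3 | 15,281 (−21%) | 20.3 s |
| Sonnet | baseline | 12/12 | 2.0 | 0.2 | 8,921 | 18.4 s |
| Sonnet | +SynaFS | 12/12 | 3.5 | 0.8 | 9,423 | 34.0 s |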

Here the cheap model wins: Haiku banks −21% input tokens (−29% on hard tasks) at equal 12/12 success — SynaFS partly closes the gap that makes a cheap model expensive. Sonnet is flat-to-negative: it already greps the answer in a median of ~2 calls, so the extra MCP round-trip is pure overhead and wall-clock roughly doubles. This neutral result is a small-tidy-corpus artifact — on the 367-file tree even Sonnet's single-file lookups flip to SynaFS (§C).

Multi-file — “list every file in subsystem Y” (4 concepts, set-graded)

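| Model | Condition | recall | precision | F1 | files opened | input tokens |
| --- | --- | --- | --- | --- | --- | --- |
| Haiku | baseline | 1.00 | 0.65 | 0.78 | 12.0 | 61,521 |
| Haiku | +SynaFS | 1.00 | 0.53 | 0.65 | 8.0 | 79,760 |
| Sonnet | baseline | 1.00 | 0.73 | 0.84 | 6.5 | 40,286 |
| Sonnet | +SynaFS | 1.00 | 0.90 | 0.94 | 4.5 | 35,369 |
| Opus 4.8 | baseline | 0.94 | 0.89 | 0.90 | 4.5 | 34,254 |
| Opus 4.8 | +SynaFS | 0.88 | 0.95 | 0.89 | 3.5 | 36,458 |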

Multi-file synthesis (“list every file in subsystem Y”), 4 hand-verified concepts, 259-file repo. Note the trend down the table: Haiku over-includes noise, Sonnet is the clean win, and Opus 4.8's baseline is already precise (0.89) — so it has little left to gain (the inverted-U in §E). On single-file “where is X?” tasks the pattern flips again — Haiku −21% input tokens, Sonnet flat (the 2×2 above).
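
Set grading for the multi-file tasks can be illustrated with a short sketch. This assumes the agent's answer and the gold set are both plain collections of file paths; the function name and the example paths are hypothetical and are not taken from the harness.

```python
def set_prf(predicted, gold):
    """Set-based precision, recall and F1 of predicted file paths vs a gold set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the agent lists every gold file plus one extra file.
gold_files = {"rewards/api.py", "rewards/models.py", "rewards/jobs.py"}
agent_files = gold_files | {"billing/invoice.py"}

precision, recall, f1 = set_prf(agent_files, gold_files)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Over-including files, as Haiku does, leaves recall at 1.00 but pulls precision and F1 down. If the table aggregates per-concept scores, the reported F1 need not equal the harmonic mean of the reported precision and recall columns.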

C — The win scales with codebase size

The same Sonnet tasks, re-run on the full app (367 indexed files, +admin/mobile) instead of the trimmed 259-file tree. Baseline grep cost grows with the repo; SynaFS's single-query cost is flat — so the token win roughly triples (−12% → −32% median, −48% peak), and even single-file navigation, which grep won on the small tree, now favours SynaFS (−26% tokens, −53% wall-clock).

| Corpus | baseline input | +SynaFS input | Δ tokens | precision |
| --- | --- | --- | --- | --- |
| 259-file (trimmed) | 40,286 | 35,369 | −12% | 0.73 → 0.90 |
| 367-file (full app) | 55,665 | 37,580 | −32% | 0.75 → 0.85 |

Sonnet, multi-file. Single-file hard tasks on the 367-file repo also flip to SynaFS: 7 → 2.5 tool calls, input −26%, wall-clock 30.9 s → 14.5 s (−53%), at 4/4 success.
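
The percentage deltas in this section follow from the reported medians by a plain relative change. The sketch below recomputes them from the figures quoted above; the helper name `pct_change` is ours, not the harness's.

```python
def pct_change(baseline, treatment):
    """Relative change of treatment vs baseline, in percent (negative = saving)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (treatment - baseline) / baseline


# Median billed input tokens for Sonnet, multi-file, as reported in the table.
sonnet_multifile = {
    "259-file": (40_286, 35_369),
    "367-file": (55_665, 37_580),
}
for corpus, (baseline, with_synafs) in sonnet_multifile.items():
    print(f"{corpus}: {pct_change(baseline, with_synafs):+.1f}%")
# 259-file: -12.2%
# 367-file: -32.5%

# Single-file hard tasks on the 367-file repo: wall-clock 30.9 s -> 14.5 s.
print(f"wall-clock: {pct_change(30.9, 14.5):+.1f}%")
# wall-clock: -53.1%
```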

D — Codex (GPT-5.5, xhigh): the same index, the opposite outcome

We ran the identical multi-file A/B with Codex (codex exec, GPT-5.5 at the highest xhigh reasoning effort). The retrieval edge is agent-agnostic — the index doesn't know who's querying — but Codex calls the MCP a median of just 2 times while keeping its 30–37-command shell-grep loop. So SynaFS rides alongside its scan instead of replacing it: recall stays perfect and precision is unchanged, while the token delta is small and noisy. Even maximum reasoning effort didn't make Codex route through the tool — it greps either way. The win depends less on SynaFS than on how readily the agent substitutes semantic search for grep.

| Corpus | Condition | recall | precision | fresh input | tool calls | MCP calls |
| --- | --- | --- | --- | --- | --- | --- |
| 259-file | baseline | 1.00 | 0.74 | 170,920 | 30 | 0 |
| 259-file | +SynaFS | 1.00 | 0.74 | 149,464 | 33.5 | 2 |
| 367-file | baseline | 1.00 | 0.72 | 172,612 | 37 | 0 |
| 367-file | +SynaFS | 1.00 | 0.72 | 165,710 | 33.5 | 2 |

Scope. Codex reports cumulative, cache-inclusive token usage, so we bill input − cached and compare only within Codex (baseline ↔ +SynaFS), never against Claude's absolute counts. A Codex setup that steers navigation to the MCP is the obvious follow-up; not tested here.
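
The accounting rule in the scope note (bill input minus cached, compare only within one agent) can be sketched as follows. The `RunUsage` record, its field names and the token counts in the example are hypothetical illustrations; they are not the Codex usage schema and not measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunUsage:
    agent: str                # e.g. "codex" (hypothetical label)
    condition: str            # "baseline" or "synafs"
    input_tokens: int         # cumulative, cache-inclusive input tokens
    cached_input_tokens: int  # portion of input_tokens served from cache

    def __post_init__(self):
        if not 0 <= self.cached_input_tokens <= self.input_tokens:
            raise ValueError("cached tokens must be between 0 and input tokens")

    @property
    def fresh_input(self) -> int:
        """Billed input: total input minus the cached portion."""
        return self.input_tokens - self.cached_input_tokens


def fresh_delta_pct(baseline: RunUsage, treatment: RunUsage) -> float:
    """Percent change in fresh input, refused across different agents."""
    if baseline.agent != treatment.agent:
        raise ValueError("compare fresh input only within one agent")
    return 100.0 * (treatment.fresh_input - baseline.fresh_input) / baseline.fresh_input


# Made-up numbers, for illustration only.
base = RunUsage("codex", "baseline", input_tokens=1_000_000, cached_input_tokens=800_000)
syna = RunUsage("codex", "synafs", input_tokens=900_000, cached_input_tokens=720_000)
print(base.fresh_input, syna.fresh_input, f"{fresh_delta_pct(base, syna):+.1f}%")
```

Refusing cross-agent comparisons in code mirrors the scope note: cache-inclusive and cache-exclusive counts are not on the same scale, so only the within-agent delta is meaningful.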

E — Haiku → Sonnet → Opus 4.8, and Codex GPT-5.5 xhigh

Does a stronger model gain more? No — the benefit is an inverted-U in model capability. SynaFS only recovers cost the agent was already wasting on navigation, and a frontier model wastes the least: Opus 4.8's baseline grep is already precise (0.89–0.92 vs Sonnet's 0.73), so it reaches the answer in ~9 calls without the tool and even ignores the MCP on the easiest concept. The weak model (Haiku) is strong enough to call search but too weak to stop, so it thrashes. The mid tier (Sonnet) is the sweet spot — capable enough to wield semantic search, still wasteful enough with grep to have headroom worth recovering.

| Model | 259-file Δ tokens | 259-file precision | 367-file Δ tokens | 367-file precision |
| --- | --- | --- | --- | --- |
| Haiku (weak) | +30% | 0.65 → 0.53 | | |
| Sonnet (mid) | −12% | 0.73 → 0.90 | −32% | 0.75 → 0.85 |
| Opus 4.8 (strong) | +6% | 0.89 → 0.95 | −16% | 0.92 → 0.85 |
| Codex GPT-5.5 xhigh | −13% | 0.74 → 0.74 | −4% | 0.72 → 0.72 |

Multi-file synthesis, baseline → +SynaFS. Negative Δ = SynaFS saves input tokens. The baseline values in the two precision columns are the tell: Opus's baseline precision is already 0.89–0.92, so it has almost nothing to recover, and the gain is largest for the capable-but-wasteful middle (Sonnet).

Related work

Related research & further reading

SynaFS sits where neural code retrieval, coding agents, and semantic filesystems meet. Below are the verified primary sources behind its design and claims, grouped by theme. Citations are kept in their original language.

Neural code embeddings & retrieval models

Repository-level & agentic retrieval (RAG for code)

Coding agents & benchmarks where navigation matters

Code search systems & evaluation

Model Context Protocol (MCP) & agent interfaces

Semantic filesystems & index-at-storage foundations

Incremental / always-fresh code indexing

The field has converged on neural code retrieval for agents — from CodeBERT/GraphCodeBERT to contrastively-trained, benchmark-leading embedders (CodeSage, CodeXEmbed/SFR, CodeRankEmbed), validated on code-IR benchmarks (CodeSearchNet, CoIR) and shown to lift SWE-bench-style agents where localization dominates (SWE-agent, Agentless, SweRank). Meanwhile MCP has standardized how agents consume tools, and HNSW is the default vector substrate. Yet across all of it, index freshness stays bolt-on: retrieval lives in external RAG pipelines or indexers (SCIP, LSFS) that re-index after the fact and drift from the working tree. SynaFS's bet is to push the fused vector / lexical / AST / version index down to the filesystem write boundary, so it updates atomically on every write — giving agents always-fresh, read-your-writes semantic retrieval through ordinary read()/FUSE, a /dev syscall, and MCP, instead of a separate, lagging index.

Honesty notes

# retrieval benchmark (any agent) + live agent A/B (Claude / Codex)
python3 experiments/harness/exp1_retrieval.py
python3 experiments/harness/exp3_multifile.py --agent claude --cond synafs --task m4_rewards
python3 experiments/harness/exp3_multifile.py --agent codex  --cond baseline --task m4_rewards

Full design, every run, and the raw result JSON live in experiments/ (DESIGN.md · REPORT.md · harness/).