The retrieval numbers on the benchmarks page prove the index is good. This page asks the harder question: when a real agent — Claude Code or Codex — is handed SynaFS as a tool, does it spend fewer tokens and tool-calls to get the job done? We ran live A/B trials (same task, same model, only the toolbox differs) on a real 367-file app.
At the retrieval layer SynaFS beats grep for any agent. End-to-end, whether that becomes a win depends on the task shape and the model — and on whether the agent actually routes its navigation through the tool. Three charts, then the full breakdown.
[Chart: recall@1 / recall@5 / MRR · higher is better · 23-query NL→code gold set]
[Chart: whole-file bytes ingested before the gold file surfaces · lower is better]
[Chart: Claude Sonnet · multi-file task · median billed input tokens]
[Chart: median calls per task with +SynaFS (it greps anyway)]
[Chart: multi-file input-token change vs baseline · 259-file repo · left: SynaFS saves · right: grep cheaper]
The end-to-end benefit is not a single number. SynaFS pays off wherever the agent would otherwise over-acquire context — and a weak model over-acquires on easy lookups, while a strong model only over-acquires on hard recall tasks.
| | Single-file nav: “where is X?” | Multi-file synthesis: “map the whole subsystem” |
|---|---|---|
| Cheap model (Haiku) | SynaFS wins · −21% input tokens | SynaFS hurts · thrashes search (30–46 calls) |
| Strong model (Sonnet) | ≈ no change · grep already cheap on a tidy tree | SynaFS wins · F1 0.84→0.94 · −12% tokens |
23 natural-language queries over a real full-stack app, each paraphrased to avoid the codebase's own identifiers (so it can't be solved by keyword match). We compare the lexical tool-loop an agent runs today (deterministic keyword → ripgrep) against a single syna query on the CodeRankEmbed index. SynaFS puts the right file at rank 1 on 57% of queries vs grep's 4%, and the agent reaches the answer after opening 1 file instead of 5 — 6.6× less code ingested.
| Strategy | recall@1 | recall@5 | MRR | files→gold | context→gold |
|---|---|---|---|---|---|
| grep (lexical tool-loop) | 0.04 | 0.52 | 0.28 | 5 | 187 KB |
| SynaFS semantic | 0.57 | 0.83 | 0.67 | 1 | 28 KB |
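The rank metrics in this table can be sketched in a few lines. This is a minimal illustration, not the harness code: the function names, the file paths, and the three toy queries are all hypothetical, and it assumes one gold file per query.

```python
from typing import Optional, Sequence


def first_hit_rank(ranked: Sequence[str], gold: str) -> Optional[int]:
    """Return the 1-based rank of the gold file, or None if it never appears."""
    for rank, path in enumerate(ranked, start=1):
        if path == gold:
            return rank
    return None


def recall_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of queries whose gold file appears within the top k results."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank over all queries; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical toy data: (ranked results, gold file) for each query.
queries = [
    (["auth/session.py", "auth/login.py"], "auth/session.py"),               # gold at rank 1
    (["ui/cart.tsx", "api/orders.py", "api/rewards.py"], "api/rewards.py"),  # gold at rank 3
    (["db/models.py", "db/migrate.py"], "jobs/mailer.py"),                   # gold never surfaces
]
ranks = [first_hit_rank(results, gold) for results, gold in queries]

print(f"recall@1 = {recall_at_k(ranks, 1):.2f}")
print(f"recall@5 = {recall_at_k(ranks, 5):.2f}")
print(f"MRR      = {mean_reciprocal_rank(ranks):.2f}")
```

With 23 queries, a recall@1 of 0.57 is consistent with 13 gold files at rank 1, and 0.04 with a single one.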
syna-lex (the same pipeline on a lexical embedder) is actually worse than grep, so the value is the neural embedder, not the hybrid plumbing.

Same prompt, same model, headless claude -p; the only difference is whether the SynaFS MCP (search / read_span / symbol_lookup) is wired in. On single-file “where is X?” tasks the cheap model (Haiku) gains and the strong model (Sonnet) is flat. On multi-file “list every file in subsystem Y” tasks the pattern inverts: the strong model wins decisively while the weak model thrashes the search tool. Multi-file is graded by set precision / recall / F1 against a hand-verified gold set.
| Model | Condition | success | tools | files opened | input tokens | wall |
|---|---|---|---|---|---|---|
| Haiku | baseline | 12/12 | 4.0 | 1.9 | 19,381 | 19.3 s |
| Haiku | +SynaFS | 12/12 | 4.0 | 1.3 | 15,281 (−21%) | 20.3 s |
| Sonnet | baseline | 12/12 | 2.0 | 0.2 | 8,921 | 18.4 s |
| Sonnet | +SynaFS | 12/12 | 3.5 | 0.8 | 9,423 | 34.0 s |
Here the cheap model wins: Haiku banks −21% input tokens (−29% on hard tasks) at equal 12/12 success — SynaFS partly closes the gap that makes a cheap model expensive. Sonnet is flat-to-negative: it already greps the answer in a median of ~2 calls, so the extra MCP round-trip is pure overhead and wall-clock roughly doubles. This neutral result is a small-tidy-corpus artifact — on the 367-file tree even Sonnet's single-file lookups flip to SynaFS (§C).
| Model | Condition | recall | precision | F1 | files opened | input tokens |
|---|---|---|---|---|---|---|
| Haiku | baseline | 1.00 | 0.65 | 0.78 | 12.0 | 61,521 |
| Haiku | +SynaFS | 1.00 | 0.53 | 0.65 | 8.0 | 79,760 |
| Sonnet | baseline | 1.00 | 0.73 | 0.84 | 6.5 | 40,286 |
| Sonnet | +SynaFS | 1.00 | 0.90 | 0.94 | 4.5 | 35,369 |
| Opus 4.8 | baseline | 0.94 | 0.89 | 0.90 | 4.5 | 34,254 |
| Opus 4.8 | +SynaFS | 0.88 | 0.95 | 0.89 | 3.5 | 36,458 |
Multi-file synthesis (“list every file in subsystem Y”), 4 hand-verified concepts, 259-file repo. Note the trend down the table: Haiku over-includes noise, Sonnet is the clean win, and Opus 4.8's baseline is already precise (0.89) — so it has little left to gain (the inverted-U in §E). On single-file “where is X?” tasks the pattern flips again — Haiku −21% input tokens, Sonnet flat (the 2×2 above).
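Set-based grading of a multi-file answer can be sketched as follows. This is an assumption about how such a grader might look, not the harness implementation: `set_prf` and the file paths are hypothetical, and an answer is treated as an unordered set of file paths compared against the gold set.

```python
def set_prf(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set precision, recall, and F1 of a predicted file set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: the agent lists every gold file plus one unrelated file.
gold = {"rewards/models.py", "rewards/api.py", "rewards/jobs.py"}
predicted = gold | {"billing/invoice.py"}

precision, recall, f1 = set_prf(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Over-inclusion of this kind leaves recall at 1.00 and lowers only precision, which is the pattern in the Haiku rows. If the table aggregates per-task scores, a row's F1 need not equal the harmonic mean of that row's precision and recall.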
The same Sonnet tasks, re-run on the full app (367 indexed files, +admin/mobile) instead of the trimmed 259-file tree. Baseline grep cost grows with the repo; SynaFS's single-query cost is flat — so the token win roughly triples (−12% → −32% median, −48% peak), and even single-file navigation, which grep won on the small tree, now favours SynaFS (−26% tokens, −53% wall-clock).
| Corpus | baseline input | +SynaFS input | Δ tokens | precision |
|---|---|---|---|---|
| 259-file (trimmed) | 40,286 | 35,369 | −12% | 0.73 → 0.90 |
| 367-file (full app) | 55,665 | 37,580 | −32% | 0.75 → 0.85 |
Sonnet, multi-file. Single-file hard tasks on the 367-file repo also flip to SynaFS: 7 → 2.5 tool calls, input −26%, wall-clock 30.9 s → 14.5 s (−53%), at 4/4 success.
We ran the identical multi-file A/B with Codex (codex exec, GPT-5.5 at the highest xhigh reasoning effort). The retrieval edge is agent-agnostic — the index doesn't know who's querying — but Codex calls the MCP a median of just 2 times while keeping its 30–37-command shell-grep loop. So SynaFS rides alongside its scan instead of replacing it: recall stays perfect and precision is unchanged, while the token delta is small and noisy. Even maximum reasoning effort didn't make Codex route through the tool — it greps either way. The win depends less on SynaFS than on how readily the agent substitutes semantic search for grep.
| Corpus | Condition | recall | precision | fresh input | tool calls | MCP calls |
|---|---|---|---|---|---|---|
| 259-file | baseline | 1.00 | 0.74 | 170,920 | 30 | 0 |
| 259-file | +SynaFS | 1.00 | 0.74 | 149,464 | 33.5 | 2 |
| 367-file | baseline | 1.00 | 0.72 | 172,612 | 37 | 0 |
| 367-file | +SynaFS | 1.00 | 0.72 | 165,710 | 33.5 | 2 |
Does a stronger model gain more? No — the benefit is an inverted-U in model capability. SynaFS only recovers cost the agent was already wasting on navigation, and a frontier model wastes the least: Opus 4.8's baseline grep is already precise (0.89–0.92 vs Sonnet's 0.73), so it reaches the answer in ~9 calls without the tool and even ignores the MCP on the easiest concept. The weak model (Haiku) is strong enough to call search but too weak to stop, so it thrashes. The mid tier (Sonnet) is the sweet spot — capable enough to wield semantic search, still wasteful enough with grep to have headroom worth recovering.
| Model | 259-file Δ tokens | 259-file precision | 367-file Δ tokens | 367-file precision |
|---|---|---|---|---|
| Haiku (weak) | +30% | 0.65 → 0.53 | — | — |
| Sonnet (mid) | −12% | 0.73 → 0.90 | −32% | 0.75 → 0.85 |
| Opus 4.8 (strong) | +6% | 0.89 → 0.95 | −16% | 0.92 → 0.85 |
| Codex GPT-5.5 xhigh | −13% | 0.74 → 0.74 | −4% | 0.72 → 0.72 |
Multi-file synthesis, baseline → +SynaFS. Negative Δ = SynaFS saves input tokens. The precision columns are the tell: Opus's baseline precision is already 0.89–0.92, so it has almost nothing to recover, and the gain is largest for the capable-but-wasteful middle (Sonnet).
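The Δ-token figures are consistent with a plain relative change between the input-token values reported in the tables above. The sketch below recomputes them from those values; `pct_change` is a hypothetical helper, and it assumes Δ is taken over the reported per-condition figures rather than averaged per task.

```python
def pct_change(baseline: float, treated: float) -> float:
    """Relative change in percent; negative means +SynaFS used fewer input tokens."""
    return (treated - baseline) / baseline * 100.0


# (baseline, +SynaFS) input tokens, copied from the tables on this page.
rows = {
    "Haiku, 259-file": (61_521, 79_760),
    "Sonnet, 259-file": (40_286, 35_369),
    "Opus 4.8, 259-file": (34_254, 36_458),
    "Sonnet, 367-file": (55_665, 37_580),
    "Codex, 259-file": (170_920, 149_464),
    "Codex, 367-file": (172_612, 165_710),
}

for label, (baseline, treated) in rows.items():
    print(f"{label}: {pct_change(baseline, treated):+.0f}%")
```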
SynaFS sits where neural code retrieval, coding agents, and semantic filesystems meet. Below are the verified primary sources behind its design and claims, grouped by theme. Citations are kept in their original language.
# retrieval benchmark (any agent) + live agent A/B (Claude / Codex)
python3 experiments/harness/exp1_retrieval.py
python3 experiments/harness/exp3_multifile.py --agent claude --cond synafs --task m4_rewards
python3 experiments/harness/exp3_multifile.py --agent codex --cond baseline --task m4_rewards
Full design, every run, and the raw result JSON live in experiments/ (DESIGN.md · REPORT.md · harness/).