For coding agents

Does SynaFS actually make coding agents faster?

The retrieval numbers on the benchmarks page show that the index is good. This page asks the harder question: when a real agent (Claude Code or Codex) is handed SynaFS as a tool, does it spend fewer tokens and tool calls to get the job done? We ran live A/B trials (same task, same model, only the toolbox differs) on a real 367-file app and on a trimmed 259-file tree of the same app.

Headline

What the live A/B shows

At the retrieval layer SynaFS beats grep for any agent. End to end, whether that becomes a win depends on the task shape, the model, and whether the agent actually routes its navigation through the tool. The charts come first, then the full breakdown.

Where the answer ranks — grep vs SynaFS

[Chart: recall@1 / recall@5 / MRR · higher is better · 23-query NL→code gold set]

Context read to reach the answer

[Chart: whole-file bytes ingested before the gold file surfaces · lower is better]

Strong-agent token cost grows with the repo — SynaFS stays flat

[Chart: Claude Sonnet · multi-file task · median billed input tokens]

Codex barely adopts the tool

[Chart: median calls per task with +SynaFS; Codex greps anyway]

The win peaks at the mid tier — an inverted-U

[Chart: multi-file input-token change vs baseline · 259-file repo · left = SynaFS saves, right = grep cheaper]

The 2×2

Who wins depends on task shape × model strength

The end-to-end benefit is not a single number. SynaFS pays off wherever the agent would otherwise over-acquire context — and a weak model over-acquires on easy lookups, while a strong model only over-acquires on hard recall tasks.

	Single-file nav — “where is X?”	Multi-file synthesis — “map the whole subsystem”
Cheap model (Haiku)	SynaFS wins · −21% input tokens	SynaFS hurts · thrashes search (30–46 calls)
Strong model (Sonnet)	≈ no change · grep already cheap on a tidy tree	SynaFS wins · F1 0.84→0.94 · −12% tokens

A — Retrieval efficiency (no agent variance)

23 natural-language queries over a real full-stack app, each paraphrased to avoid the codebase's own identifiers (so it can't be solved by keyword match). We compare the lexical tool-loop an agent runs today (deterministic keyword → ripgrep) against a single syna query on the CodeRankEmbed index. SynaFS puts the right file at rank 1 on 57% of queries vs grep's 4%, and the agent reaches the answer after opening 1 file instead of 5 — 6.6× less code ingested.

Strategy	recall@1	recall@5	MRR	files→gold	context→gold
grep (lexical tool-loop)	0.04	0.52	0.28	5	187 KB
SynaFS semantic	0.57	0.83	0.67	1	28 KB

Scope. 23 queries, file-level single-answer gold, author-written. syna-lex (the same pipeline on a lexical embedder) is actually worse than grep — the value is the neural embedder, not the hybrid plumbing.
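
For readers who want to reproduce the columns above on their own queries, here is a minimal sketch of how recall@k, MRR, files→gold and context→gold can be computed from ranked result lists. It is not the harness code: the function names, the toy file names and the byte sizes are all assumptions made for illustration. It also assumes a single gold file per query, matching the scope note above.

```python
from typing import Dict, List, Tuple


def rank_of_gold(ranked: List[str], gold: str) -> int:
    """1-based rank of the gold file, or 0 if it never surfaces."""
    return ranked.index(gold) + 1 if gold in ranked else 0


def recall_at_k(ranks: List[int], k: int) -> float:
    """Share of queries whose gold file appears in the top k results."""
    return sum(1 for r in ranks if 0 < r <= k) / len(ranks)


def mrr(ranks: List[int]) -> float:
    """Mean reciprocal rank; a miss (rank 0) contributes 0."""
    return sum(1.0 / r for r in ranks if r > 0) / len(ranks)


def context_to_gold(ranked: List[str], gold: str,
                    sizes: Dict[str, int]) -> Tuple[int, int]:
    """Files opened and whole-file bytes read up to and including the gold file.

    If the gold file never surfaces, every ranked file counts as read.
    """
    rank = rank_of_gold(ranked, gold)
    opened = ranked[:rank] if rank else ranked
    return len(opened), sum(sizes[path] for path in opened)


# Toy data (hypothetical): three queries that all share the gold file a.py.
toy_runs = [
    (["a.py", "b.py", "c.py"], "a.py"),  # hit at rank 1
    (["b.py", "a.py", "c.py"], "a.py"),  # hit at rank 2
    (["c.py", "b.py"], "a.py"),          # miss
]
toy_sizes = {"a.py": 100, "b.py": 250, "c.py": 50}

ranks = [rank_of_gold(ranked, gold) for ranked, gold in toy_runs]
print(ranks, recall_at_k(ranks, 1), recall_at_k(ranks, 5), mrr(ranks))
print(context_to_gold(*toy_runs[1], toy_sizes))
```

Counting whole files up to the gold file is one reasonable reading of "context→gold"; a harness that reads spans rather than whole files would report smaller numbers.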

B — Claude Code, live (single-file vs multi-file)

Same prompt, same model, headless claude -p; the only difference is whether the SynaFS MCP (search / read_span / symbol_lookup) is wired in. On single-file “where is X?” tasks the cheap model (Haiku) gains and the strong model (Sonnet) is flat. On multi-file “list every file in subsystem Y” tasks it inverts: the strong model wins decisively while the weak model thrashes the search tool. Multi-file is graded by set precision / recall / F1 against a hand-verified gold set.
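
The A/B setup described above can be sketched as two command lines that differ only in whether an MCP configuration is passed. This is an illustration under assumptions, not the harness itself: `build_claude_cmd` and the file name `synafs.mcp.json` are hypothetical, and the `--model`, `--output-format` and `--mcp-config` flags reflect our understanding of the Claude Code CLI and should be checked against the installed version. The sketch only builds the argument lists and does not launch the agent.

```python
from typing import List, Optional


def build_claude_cmd(prompt: str, model: str,
                     mcp_config: Optional[str]) -> List[str]:
    """Build a headless `claude -p` invocation for one A/B condition.

    mcp_config=None is the baseline; a path enables the +SynaFS condition.
    """
    cmd = ["claude", "-p", prompt, "--model", model, "--output-format", "json"]
    if mcp_config is not None:
        # Hypothetical config file that registers the SynaFS MCP server.
        cmd += ["--mcp-config", mcp_config]
    return cmd


prompt = "Where is X? Report the path."
baseline = build_claude_cmd(prompt, "sonnet", None)
treated = build_claude_cmd(prompt, "sonnet", "synafs.mcp.json")

# The two conditions share the prompt and model; only the toolbox differs.
print(baseline)
print(treated)
```

Keeping the difference to one appended flag pair makes it easy to confirm that nothing else varies between conditions.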

Single-file — “where is X? report the path” (12 navigation tasks)

Model	Condition	success	tool calls	files opened	input tokens	wall-clock
Haiku	baseline	12/12	4.0	1.9	19,381	19.3 s
Haiku	+SynaFS	12/12	4.0	1.3	15,281 (−21%)	20.3 s
Sonnet	baseline	12/12	2.0	0.2	8,921	18.4 s
Sonnet	+SynaFS	12/12	3.5	0.8	9,423	34.0 s

Here the cheap model wins: Haiku banks −21% input tokens (−29% on hard tasks) at equal 12/12 success, so SynaFS partly closes the gap that makes a cheap model expensive. Sonnet is flat-to-negative: it already greps the answer in a median of ~2 calls, so the extra MCP round-trip is pure overhead. Input tokens rise about 6% (8,921 → 9,423) and wall-clock nearly doubles (18.4 s → 34.0 s). This neutral result is an artifact of the small, tidy corpus: on the 367-file tree, Sonnet's hard single-file lookups flip to SynaFS as well (§C).

Multi-file — “list every file in subsystem Y” (4 concepts, set-graded)

Model	Condition	recall	precision	F1	files opened	input tokens
Haiku	baseline	1.00	0.65	0.78	12.0	61,521
Haiku	+SynaFS	1.00	0.53	0.65	8.0	79,760
Sonnet	baseline	1.00	0.73	0.84	6.5	40,286
Sonnet	+SynaFS	1.00	0.90	0.94	4.5	35,369
Opus 4.8	baseline	0.94	0.89	0.90	4.5	34,254
Opus 4.8	+SynaFS	0.88	0.95	0.89	3.5	36,458

Multi-file synthesis (“list every file in subsystem Y”), 4 hand-verified concepts, 259-file repo. Note the trend down the table: Haiku over-includes noise, Sonnet is the clean win, and Opus 4.8's baseline is already precise (0.89) — so it has little left to gain (the inverted-U in §E). On single-file “where is X?” tasks the pattern flips again — Haiku −21% input tokens, Sonnet flat (the 2×2 above).
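
Set grading of a multi-file answer can be sketched as follows. This is a minimal illustration rather than the harness implementation: `grade_set` and the file names are hypothetical. It assumes that the "alts" described in the honesty notes (legitimate-but-optional files) are simply dropped from the prediction before scoring, and that alts and gold files do not overlap.

```python
from typing import Iterable, Tuple


def grade_set(predicted: Iterable[str], gold: Iterable[str],
              alts: Iterable[str] = ()) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of a predicted file set against gold.

    Files listed in `alts` are ignored: they neither count as hits
    nor as false positives.
    """
    gold_set = set(gold)
    scored = set(predicted) - set(alts)
    hits = scored & gold_set
    precision = len(hits) / len(scored) if scored else 0.0
    recall = len(hits) / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: all three gold files found, one optional file, one noise file.
p, r, f1 = grade_set(
    predicted=["rewards/api.py", "rewards/model.py", "rewards/jobs.py",
               "rewards/README.md", "billing/tax.py"],
    gold=["rewards/api.py", "rewards/model.py", "rewards/jobs.py"],
    alts=["rewards/README.md"],
)
print(round(p, 2), round(r, 2), round(f1, 2))
```

In this toy case recall is perfect and the single noise file is what pulls precision down, the same shape as the over-inclusion seen in the Haiku rows.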

C — The win scales with codebase size

The same Sonnet tasks, re-run on the full app (367 indexed files, +admin/mobile) instead of the trimmed 259-file tree. Baseline grep cost grows with the repo; SynaFS's single-query cost is flat — so the token win roughly triples (−12% → −32% median, −48% peak), and even single-file navigation, which grep won on the small tree, now favours SynaFS (−26% tokens, −53% wall-clock).

Corpus	baseline input	+SynaFS input	Δ tokens	precision
259-file (trimmed)	40,286	35,369	−12%	0.73 → 0.90
367-file (full app)	55,665	37,580	−32%	0.75 → 0.85

Sonnet, multi-file. Single-file hard tasks on the 367-file repo also flip to SynaFS: 7 → 2.5 tool calls, input −26%, wall-clock 30.9 s → 14.5 s (−53%), at 4/4 success.
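
The Δ columns in this section are plain percentage changes relative to baseline. A short check, using only figures quoted in the table and caption above (the helper name `pct_change` is ours):

```python
def pct_change(baseline: float, treated: float) -> float:
    """Percentage change from baseline; negative means +SynaFS used less."""
    return 100.0 * (treated - baseline) / baseline


# Median input tokens (baseline, +SynaFS) from the table above.
token_rows = {
    "259-file (trimmed)": (40_286, 35_369),
    "367-file (full app)": (55_665, 37_580),
}
token_deltas = {name: round(pct_change(b, t)) for name, (b, t) in token_rows.items()}

# Wall-clock for single-file hard tasks on the 367-file repo.
wall_delta = round(pct_change(30.9, 14.5))

print(token_deltas, wall_delta)
```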

D — Codex (GPT-5.5, xhigh): the same index, the opposite outcome

We ran the identical multi-file A/B with Codex (codex exec, GPT-5.5 at the highest xhigh reasoning effort). The retrieval edge is agent-agnostic — the index doesn't know who's querying — but Codex calls the MCP a median of just 2 times while keeping its 30–37-command shell-grep loop. So SynaFS rides alongside its scan instead of replacing it: recall stays perfect and precision is unchanged, while the token delta is small and noisy. Even maximum reasoning effort didn't make Codex route through the tool — it greps either way. The win depends less on SynaFS than on how readily the agent substitutes semantic search for grep.

Corpus	Condition	recall	precision	fresh input tokens	tool calls	MCP calls
259-file	baseline	1.00	0.74	170,920	30	0
259-file	+SynaFS	1.00	0.74	149,464	33.5	2
367-file	baseline	1.00	0.72	172,612	37	0
367-file	+SynaFS	1.00	0.72	165,710	33.5	2

Scope. Codex reports cumulative, cache-inclusive token usage, so we bill input − cached and compare only within Codex (baseline ↔ +SynaFS), never against Claude's absolute counts. A Codex setup that steers navigation to the MCP is the obvious follow-up; not tested here.
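
The "input − cached" billing rule can be sketched as below. The usage field names (`input_tokens`, `cached_input_tokens`) are assumptions about the shape of the per-run usage record, and the token counts are invented for illustration; they are not measurements from these trials.

```python
from statistics import median
from typing import Dict, List


def fresh_input(usage: Dict[str, int]) -> int:
    """Input tokens that were not served from cache (input minus cached)."""
    return usage["input_tokens"] - usage["cached_input_tokens"]


def median_fresh(runs: List[Dict[str, int]]) -> float:
    """Median fresh-input tokens across the runs of one condition."""
    return median([fresh_input(u) for u in runs])


# Illustrative numbers only (hypothetical, not measured).
baseline_runs = [
    {"input_tokens": 400_000, "cached_input_tokens": 250_000},
    {"input_tokens": 600_000, "cached_input_tokens": 420_000},
    {"input_tokens": 510_000, "cached_input_tokens": 340_000},
]
synafs_runs = [
    {"input_tokens": 380_000, "cached_input_tokens": 240_000},
    {"input_tokens": 590_000, "cached_input_tokens": 430_000},
    {"input_tokens": 500_000, "cached_input_tokens": 347_000},
]

base = median_fresh(baseline_runs)
treat = median_fresh(synafs_runs)
delta_pct = 100.0 * (treat - base) / base
print(base, treat, delta_pct)
```

Because the subtraction is applied identically to both conditions, the resulting delta is comparable within Codex even though the absolute counts are not comparable with Claude's.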

E — Haiku → Sonnet → Opus 4.8, and Codex GPT-5.5 xhigh

Does a stronger model gain more? No — the benefit is an inverted-U in model capability. SynaFS only recovers cost the agent was already wasting on navigation, and a frontier model wastes the least: Opus 4.8's baseline grep is already precise (0.89–0.92 vs Sonnet's 0.73), so it reaches the answer in ~9 calls without the tool and even ignores the MCP on the easiest concept. The weak model (Haiku) is strong enough to call search but too weak to stop, so it thrashes. The mid tier (Sonnet) is the sweet spot — capable enough to wield semantic search, still wasteful enough with grep to have headroom worth recovering.

Model	259-file Δ tokens	259-file precision	367-file Δ tokens	367-file precision
Haiku (weak)	+30%	0.65 → 0.53	—	—
Sonnet (mid)	−12%	0.73 → 0.90	−32%	0.75 → 0.85
Opus 4.8 (strong)	+6%	0.89 → 0.95	−16%	0.92 → 0.85
Codex GPT-5.5 xhigh	−13%	0.74 → 0.74	−4%	0.72 → 0.72

Multi-file synthesis, baseline → +SynaFS. Negative Δ = SynaFS saves input tokens. The precision columns are the tell: Opus's baseline precision is already 0.89 on the 259-file repo and 0.92 on the 367-file repo, so it has almost nothing to recover. The gain is largest for the capable-but-wasteful middle tier (Sonnet).

Related work

Related research & further reading

SynaFS sits where neural code retrieval, coding agents, and semantic filesystems meet. Below are the verified primary sources behind its design and claims, grouped by theme. Citations are kept in their original language.

Neural code embeddings & retrieval models

CodeBERT: A Pre-Trained Model for Programming and Natural Languages — Feng, Guo, Tang, Duan et al. (Microsoft) · arXiv:2002.08155 · 2020. Foundational bimodal NL/PL encoder — the lineage ancestor of the neural code-search embeddings SynaFS relies on.
GraphCodeBERT: Pre-training Code Representations with Data Flow — Guo, Ren et al. (Microsoft) · arXiv:2009.08366 · ICLR 2021. Adds data-flow structure to code embeddings, paralleling SynaFS's fusion of embeddings with an AST/symbol graph.
CodeSage: Code Representation Learning at Scale — Zhang, Ahmad et al. (AWS AI Labs / UPenn) · arXiv:2402.01935 · ICLR 2024. Scaled contrastive code encoders for retrieval — a candidate embedding backbone for SynaFS's vector index.
CodeXEmbed / SFR-Embedding-Code: A Generalist Embedding Family for Code Retrieval — Liu et al. (Salesforce AI Research) · arXiv:2411.12644 · COLM 2025. #1 on CoIR across 12 languages — the kind of neural retriever that beats grep, motivating SynaFS's vector layer.
CoRNStack / CodeRankEmbed: Better Code Retrieval and Reranking — Suresh, Reddy, Xu, Nussbaum, Mulyar, Duderstadt, Ji · arXiv:2412.01007 · ICLR 2025. Introduces the CodeRankEmbed retriever named in SynaFS's own value claim; sharpens issue-function localization.

Repository-level & agentic retrieval (RAG for code)

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation — Zhang, Chen et al. (Microsoft) · arXiv:2303.12570 · EMNLP 2023. Canonical case that repo-wide iterative retrieval beats in-file context — a premise SynaFS operationalizes at the FS level.
SweRank: Software Issue Localization with Code Ranking — Reddy, Suresh et al. (Salesforce AI Research / UIUC) · arXiv:2505.07849 · 2025. Retrieve-and-rerank localization that beats costly agent loops — retrieval, not just reasoning, drives navigation.

Coding agents & benchmarks where navigation matters

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan (Princeton) · arXiv:2310.06770 · ICLR 2024. The standard benchmark requiring agents to locate and edit code across real repos (Verified = 500 human-curated).
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering — Yang, Jimenez, Wettig, Lieret, Yao, Narasimhan, Press (Princeton) · arXiv:2405.15793 · NeurIPS 2024. A well-designed repo navigation/edit interface sharply boosts agents — SynaFS proposes the filesystem as that interface.
Agentless: Demystifying LLM-based Software Engineering Agents — Xia, Deng, Dunn, Zhang (UIUC) · arXiv:2407.01489 · 2024. A simple localize→repair→validate pipeline beats complex agents — accurate localization (SynaFS's core) dominates.

Code search systems & evaluation

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search — Husain, Wu, Gazit, Allamanis, Brockschmidt (GitHub / Microsoft) · arXiv:1909.09436 · 2019. The seminal semantic-code-search corpus/benchmark framing the dense-retrieval task SynaFS embeds in the FS.
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models — Li, Dong et al. · arXiv:2407.02883 · ACL 2025. 10-dataset code-retrieval benchmark comparing dense vs lexical — the yardstick for SynaFS's vector-layer choices.

Model Context Protocol (MCP) & agent interfaces

Introducing the Model Context Protocol — Anthropic · official announcement · Nov 2024. The open standard for connecting LLMs to tools/data; SynaFS exposes its index through an MCP API.
Model Context Protocol — Specification (2025-11-25) — MCP project (Anthropic / Linux Foundation AAIF) · official spec. Authoritative JSON-RPC spec (Resources / Tools / Prompts) — the contract SynaFS's MCP surface implements.

Semantic filesystems & index-at-storage foundations

Semantic File Systems — Gifford, Jouvelot, Sheldon, O'Toole (MIT) · SOSP '91 · 1991. The direct historical ancestor: virtual directories as queries — SynaFS's stated lineage anchor.
From Commands to Prompts: LLM-based Semantic File System for AIOS (LSFS) — Shi et al. (AGI Research / Rutgers) · arXiv:2410.11843 · ICLR 2025. Closest contemporary peer — a semantic FS for agents, though it sits above (not at) the write boundary.
Efficient and Robust Approximate Nearest Neighbor Search Using HNSW Graphs — Malkov, Yashunin · arXiv:1603.09320 · IEEE TPAMI 2018. The foundational graph-based ANN index underpinning SynaFS's in-filesystem vector index.

Incremental / always-fresh code indexing

SCIP — a better code indexing format than LSIF — Sourcegraph · blog + spec (scip-code.org) · 2022. Incremental reindexing of only changed files — the bolt-on freshness approach SynaFS instead solves atomically at write time.

The field has converged on neural code retrieval for agents — from CodeBERT/GraphCodeBERT to contrastively-trained, benchmark-leading embedders (CodeSage, CodeXEmbed/SFR, CodeRankEmbed), validated on code-IR benchmarks (CodeSearchNet, CoIR) and shown to lift SWE-bench-style agents where localization dominates (SWE-agent, Agentless, SweRank). Meanwhile MCP has standardized how agents consume tools, and HNSW is the default vector substrate. Yet across all of it, index freshness stays bolt-on: retrieval lives in external RAG pipelines or indexers (SCIP, LSFS) that re-index after the fact and drift from the working tree. SynaFS's bet is to push the fused vector / lexical / AST / version index down to the filesystem write boundary, so it updates atomically on every write — giving agents always-fresh, read-your-writes semantic retrieval through ordinary read()/FUSE, a /dev syscall, and MCP, instead of a separate, lagging index.
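
The read-your-writes idea in the paragraph above can be illustrated with a toy in-memory store that updates its index inside the same critical section as the write. This is a conceptual sketch only, not SynaFS's implementation: the class name is hypothetical, and a whitespace-token inverted index stands in for the fused vector / lexical / AST / version index.

```python
import threading
from typing import Dict, List, Set


class WriteThroughStore:
    """Toy store whose index is updated atomically with every write."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._files: Dict[str, str] = {}
        self._index: Dict[str, Set[str]] = {}  # token -> paths containing it

    def write(self, path: str, text: str) -> None:
        # Content and index change under one lock, so no reader can
        # observe new content paired with a stale index (or vice versa).
        with self._lock:
            old = self._files.get(path, "")
            for tok in set(old.split()):
                self._index.get(tok, set()).discard(path)
            self._files[path] = text
            for tok in set(text.split()):
                self._index.setdefault(tok, set()).add(path)

    def search(self, token: str) -> List[str]:
        with self._lock:
            return sorted(self._index.get(token, set()))


store = WriteThroughStore()
store.write("a.py", "def pay reward")
first = store.search("reward")   # the write is visible immediately

store.write("a.py", "def pay")   # overwrite removes the old token
second = store.search("reward")
print(first, second)
```

A bolt-on indexer would instead update `_index` in a separate pass after `write` returns, which is where the drift from the working tree comes from.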

Honesty notes

Small N (23 queries; 4 multi-file concepts × 2 conditions × 2 models per corpus). Results are directional, reported with per-task breakdowns rather than one headline number.
The symbol-graph caller/refs relations are unresolved on these indexes, so “find every caller” impact-analysis tasks are not a SynaFS differentiator here — the win is the semantic embedder. Stated plainly rather than hidden.
Multi-file gold is a hand-verified core file set; “alts” (legitimate-but-optional files) neither help recall nor hurt precision. Every run is reproducible from the harness below.

# retrieval benchmark (any agent) + live agent A/B (Claude / Codex)
python3 experiments/harness/exp1_retrieval.py
python3 experiments/harness/exp3_multifile.py --agent claude --cond synafs --task m4_rewards
python3 experiments/harness/exp3_multifile.py --agent codex  --cond baseline --task m4_rewards

Full design, every run, and the raw result JSON live in experiments/ (DESIGN.md · REPORT.md · harness/).
