What's inside

A hybrid index, fused at the write path.

One engine, four signals — vector, symbol graph, lexical, and temporal indexes that share a single document id, so hybrid ranking is a join rather than a glue layer.

Four signals, one DocID

SynaFS keeps four indexes over the same code, all keyed off one content-addressed chunk identity. Every posting list, vector label, and graph edge resolves back to the same DocID (an interned u64 mapped to (PathID, VersionID, ChunkID) in cf_docmap). Because the four signals agree on identity, fusing them is a set join — not a cross-system reconciliation.

Vector
HNSW for the hot tier, IVF-PQ for the cold tier (designed; the M0 engine is currently brute-force over an in-memory snapshot). DocID is used directly as the HNSW label, and the index is rebuildable from cf_embed_cache.
Symbol graph
defs / refs / callers / callees / importers stored as RocksDB adjacency lists (cf_graph_*: SymbolID → roaring<SymbolID>). A version-independent SymbolID tracks a symbol across revisions.
Lexical
BM25 over a term postings list (cf_lex_post: term → postings{DocID,tf}) plus a trigram index (cf_lex_trigram) for regex / substring filtering.
Temporal
A native version DAG in cf_versions (VersionID → Manifest, ref/<name> → VersionID), so the same chunk identity carries history and powers --as-of time-travel.
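
To make the shared-identity idea concrete, here is a minimal Python sketch of interning and joining. The DocMap class, the string ids, and the hit sets are hypothetical stand-ins for cf_docmap and the real indexes, not SynaFS code; the only point is that once every signal speaks DocID, combining them is plain set arithmetic.

```python
from dataclasses import dataclass, field


@dataclass
class DocMap:
    """Toy stand-in for cf_docmap: interns (path, version, chunk) as an int."""
    ids: dict = field(default_factory=dict)
    keys: list = field(default_factory=list)

    def intern(self, path_id, version_id, chunk_id):
        key = (path_id, version_id, chunk_id)
        if key not in self.ids:
            self.ids[key] = len(self.keys)  # next free integer id
            self.keys.append(key)
        return self.ids[key]

    def resolve(self, doc_id):
        return self.keys[doc_id]


docmap = DocMap()
a = docmap.intern("src/auth/jwt.rs", "v1", "chunk-validate")
b = docmap.intern("src/auth/jwt.rs", "v1", "chunk-decode")
c = docmap.intern("src/main.rs", "v1", "chunk-main")

vector_hits = {a, b}         # DocIDs returned as vector labels
lexical_hits = {a, c}        # DocIDs read from BM25 postings
visible_in_tree = {a, b, c}  # DocIDs visible in the queried version

# Because all three sets hold the same kind of id, the join is an intersection.
in_all_signals = vector_hits & lexical_hits & visible_in_tree
```

Interning the same triple twice returns the same integer, which is what lets a posting list and a vector label refer to one document without a lookup table per index.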

Fusion & reranking

Candidate generation runs in parallel — vector top-K (default 200), BM25 top-K (default 200), and a trigram filter — then the lists are merged with Reciprocal Rank Fusion:

score(d) = Σ_s  w_s / (k + rank_s(d))     # k = 60
w_vector = 1.0,  w_bm25 = 0.6             # defaults (config)

RRF is chosen over a weighted sum because cosine scores (0–1) and BM25 scores (unbounded, corpus-dependent) live on different scales: a weighted sum lets one signal dominate and needs re-normalising per corpus. RRF uses ranks only, so it is scale-invariant and leaves just (w_s, k) to tune.
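
The formula above can be sketched in a few lines of Python. This is an illustration rather than the syna-index implementation: the function name and input shape are hypothetical, ranks are assumed to start at 1, and the trigram filter is left out.

```python
def rrf_fuse(ranked_lists, weights, k=60):
    """Weighted Reciprocal Rank Fusion.

    ranked_lists: {signal: [doc_id, ...]} with the best hit first.
    weights:      {signal: w_s}.
    Returns [(doc_id, score)] sorted by descending score.
    """
    scores = {}
    for signal, docs in ranked_lists.items():
        w = weights.get(signal, 0.0)
        for rank, doc in enumerate(docs, start=1):  # assumption: 1-based ranks
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    # Ties are broken by doc id so the output order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = rrf_fuse(
    {"vector": [7, 3, 9], "bm25": [3, 7, 4]},
    {"vector": 1.0, "bm25": 0.6},
)
```

With these inputs, document 7 (first by vector, second by BM25) edges out document 3 (second by vector, first by BM25) because the vector weight is higher; the raw cosine and BM25 scores never enter the calculation.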

An optional graph rerank then boosts candidates along the call graph 1–2 hops out from the top matched symbols, with decay (1-hop = 1.0, 2-hop = 0.5) and a default weight α = 0.02:

score'(d) = score(d) + α · graph_signal(d)
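
One way to read graph_signal is as a hop-decayed reachability score. The sketch below assumes that reading: edges is a hypothetical map from a symbol to its callers and callees, seeds are the top matched symbols, and seeds themselves receive no boost. None of these choices is confirmed by the source beyond the decay values and α.

```python
HOP_DECAY = {1: 1.0, 2: 0.5}


def graph_signal(seeds, edges):
    """Return {symbol: decay} for symbols one or two hops from any seed."""
    signal = {}
    seen = set(seeds)
    frontier = set(seeds)
    for hop in (1, 2):
        nxt = set()
        for sym in frontier:
            for neighbour in edges.get(sym, ()):
                if neighbour not in seen:
                    nxt.add(neighbour)
        for neighbour in nxt:
            signal[neighbour] = HOP_DECAY[hop]
        seen |= nxt
        frontier = nxt
    return signal


def graph_rerank(scores, symbol_of, seeds, edges, alpha=0.02):
    """Apply score'(d) = score(d) + alpha * graph_signal(d) and re-sort."""
    signal = graph_signal(seeds, edges)
    boosted = {
        doc: score + alpha * signal.get(symbol_of[doc], 0.0)
        for doc, score in scores.items()
    }
    return sorted(boosted.items(), key=lambda item: (-item[1], item[0]))


edges = {"validate_token": {"decode"}, "decode": {"base64"}}
symbol_of = {1: "validate_token", 2: "decode", 3: "base64"}
reranked = graph_rerank({1: 0.026, 2: 0.010, 3: 0.010},
                        symbol_of, ["validate_token"], edges)
```

Here the direct callee of the top match moves ahead of it, which shows how large α = 0.02 is relative to typical RRF scores.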

Because RRF discards raw score magnitude, every hit carries provenance — a why object with the per-signal contributions (vector, bm25, rrf, graph_boost) plus the version — so an agent can judge confidence itself.

Status. RRF and BM25 are implemented in syna-index at M0. Graph reranking (the α term) lands at M3; until then α = 0.

Chunking & embedding

Chunks are semantic units, not fixed windows. syna-chunk parses with tree-sitter and emits function- / method-level chunks (with the leading doc-comment attached) for Rust, Python, JS, and Go. Token target is ~512 with a hard cap of 1024; over-cap functions split at statement boundaries with ~64-token overlap.
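
The split rule for over-cap functions can be illustrated as a greedy pack over statements. This is a sketch under assumptions: statements are already tokenised, pieces are packed up to the ~512 target (the source does not say whether pieces aim for the target or the cap), and split_statements is a hypothetical name.

```python
def split_statements(token_counts, budget=512, overlap=64):
    """Return (start, end) statement-index ranges, end exclusive.

    token_counts[i] is the token length of statement i of one over-cap
    function. A single statement larger than the budget still becomes its
    own piece, so this sketch does not enforce a hard cap by itself.
    """
    chunks, start, n = [], 0, len(token_counts)
    while start < n:
        end, total = start, 0
        # Always take one statement, then keep adding while the budget holds.
        while end < n and (end == start or total + token_counts[end] <= budget):
            total += token_counts[end]
            end += 1
        chunks.append((start, end))
        if end == n:
            break
        # Step back over trailing statements to share up to `overlap` tokens,
        # but only while the next unseen statement still fits after them.
        back, shared = end, 0
        while (back - 1 > start
               and shared + token_counts[back - 1] <= overlap
               and shared + token_counts[back - 1] + token_counts[end] <= budget):
            back -= 1
            shared += token_counts[back]
        start = back
    return chunks


pieces = split_statements([300, 300, 40, 20, 300, 300])
```

In this example the two short statements (40 and 20 tokens) appear in both the second and third pieces, giving a 60-token overlap at a statement boundary.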

A context header (path + qualified symbol) is prepended to the embed input only — the stored blob stays clean:

// path: src/auth/jwt.rs
// symbol: auth::jwt::validate_token
<chunk source>

Chunk identity is content-addressed for deterministic invalidation: ChunkID = BLAKE3(blob_id ‖ byte_range ‖ chunker_version). Folding chunker_version into the id means a chunker upgrade invalidates exactly the affected chunks.
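
A small Python sketch of both ideas follows. Two loud assumptions: hashlib.blake2b stands in for BLAKE3 (which is not in the Python standard library), and the byte layout of byte_range and chunker_version is invented for illustration. The function names are hypothetical.

```python
import hashlib


def chunk_id(blob_id: bytes, start: int, end: int, chunker_version: int) -> str:
    """Hash blob_id || byte_range || chunker_version into a hex id."""
    h = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3
    h.update(blob_id)
    h.update(start.to_bytes(8, "little"))  # assumed encoding of byte_range
    h.update(end.to_bytes(8, "little"))
    h.update(chunker_version.to_bytes(4, "little"))
    return h.hexdigest()


def embed_input(path: str, symbol: str, source: str) -> str:
    """Prepend the context header; only the embedder sees this string."""
    return f"// path: {path}\n// symbol: {symbol}\n{source}"


blob = hashlib.blake2b(b"fn validate_token() {}", digest_size=32).digest()
id_v1 = chunk_id(blob, 0, 22, chunker_version=1)
id_v1_again = chunk_id(blob, 0, 22, chunker_version=1)
id_v2 = chunk_id(blob, 0, 22, chunker_version=2)
text = embed_input("src/auth/jwt.rs", "auth::jwt::validate_token",
                   "fn validate_token() {}")
```

The same inputs always give the same id, while bumping chunker_version gives a different one, so entries keyed by the old id simply stop being looked up.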

Embedding is pluggable (the Embedder trait), with three default tiers. The cache is keyed (embedder_id, ChunkID), so swapping models partitions the cache cleanly and only cache misses hit the model:

Tier | Model | Spec | Note
Local (default) | CodeRankEmbed | 137M · 768-dim · MIT | Pure-Rust (candle), runs locally on CPU.
Local (opt-in) | CodeSage-large-v2 | 1.3B · Apache-2.0 · Matryoshka | Truncatable dims for hot/cold tiers.
API (opt-in) | voyage-code-3 | 1024-dim · API | Explicit opt-in; code leaves the box.
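
The cache keying can be pictured with a dictionary. The sketch below is illustrative only: embed_cached and fake_embedder are hypothetical, and the real cache lives in cf_embed_cache rather than in memory.

```python
def embed_cached(cache, embedder_id, embed_fn, chunk_id, text):
    """Return the vector for (embedder_id, chunk_id), calling the model on a miss."""
    key = (embedder_id, chunk_id)
    if key not in cache:
        cache[key] = embed_fn(text)
    return cache[key]


calls = []


def fake_embedder(text):
    """Hypothetical model: records each call and returns a dummy vector."""
    calls.append(text)
    return [float(len(text)), 0.0]


cache = {}
v1 = embed_cached(cache, "coderankembed", fake_embedder, "chunk-a", "fn a() {}")
v2 = embed_cached(cache, "coderankembed", fake_embedder, "chunk-a", "fn a() {}")
v3 = embed_cached(cache, "voyage-code-3", fake_embedder, "chunk-a", "fn a() {}")
```

The second lookup is a hit and never reaches the model; switching embedder_id produces a separate entry for the same chunk, so vectors from different models are never mixed.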

Transactional writes & consistency

A write returns immediately; reindexing happens asynchronously. The write path is a small state machine: a FUSE write or MCP apply_edit puts the new bytes into the content-addressed blob store, updates cf_meta, bumps T = ++enqueued_generation, journals a ReindexTask to the WAL, and returns {ok, token: T}. A background worker then chunk-diffs old vs new, embeds only cache-miss chunks, and atomically advances applied_generation to T.
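
The write-path steps can be sketched as a single-threaded toy. Everything here is an assumption made for illustration: the WritePath class, the in-memory dicts standing in for the blob store and cf_meta, the list standing in for the WAL, and hashlib.blake2b standing in for BLAKE3.

```python
import hashlib


class WritePath:
    def __init__(self):
        self.blobs = {}   # blob_id -> bytes (content-addressed store)
        self.meta = {}    # path -> blob_id (stand-in for cf_meta)
        self.tasks = []   # journaled reindex tasks (stand-in for the WAL)
        self.enqueued_generation = 0
        self.applied_generation = 0

    def apply_edit(self, path, data: bytes):
        """Store bytes, update metadata, journal a task, return the token."""
        blob_id = hashlib.blake2b(data, digest_size=32).hexdigest()
        self.blobs[blob_id] = data
        old_blob = self.meta.get(path)
        self.meta[path] = blob_id
        self.enqueued_generation += 1
        token = self.enqueued_generation
        self.tasks.append(
            {"token": token, "path": path, "old": old_blob, "new": blob_id})
        return {"ok": True, "token": token}

    def run_worker_once(self, reindex):
        """Apply one task; `reindex` is a hypothetical chunk-diff-and-embed hook."""
        if not self.tasks:
            return False
        task = self.tasks.pop(0)
        reindex(task)
        self.applied_generation = task["token"]
        return True


wp = WritePath()
reply = wp.apply_edit("src/auth.rs", b"fn validate() {}")
```

After apply_edit returns, enqueued_generation is ahead of applied_generation; the gap closes only when the worker finishes the task, which is the window the consistency levels reason about.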

Two monotonic generation counters drive three consistency levels:

eventual
Default. Searches against the current applied_generation; the response carries that generation.
strong
Targets the enqueued_generation at query time and blocks until applied_generation ≥ target (with timeout).
read-your-writes
The client sends the max token it has seen; the query waits until that token is applied.
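
The three levels differ only in which generation a query waits for. A minimal sketch using Python's threading.Condition, with a hypothetical Generations class and level names taken from the list above:

```python
import threading


class Generations:
    def __init__(self):
        self.enqueued = 0
        self.applied = 0
        self._cond = threading.Condition()

    def enqueue_write(self):
        """Called on the write path; returns the token T."""
        with self._cond:
            self.enqueued += 1
            return self.enqueued

    def mark_applied(self, token):
        """Called by the background worker once a task is indexed."""
        with self._cond:
            self.applied = max(self.applied, token)
            self._cond.notify_all()

    def wait_for(self, level, client_token=0, timeout=1.0):
        """Block until the level's target is applied; return that generation."""
        with self._cond:
            if level == "eventual":
                target = self.applied      # whatever is indexed right now
            elif level == "strong":
                target = self.enqueued     # everything enqueued at query time
            elif level == "read-your-writes":
                target = client_token      # max token this client has seen
            else:
                raise ValueError(f"unknown consistency level: {level}")
            reached = self._cond.wait_for(
                lambda: self.applied >= target, timeout=timeout)
            if not reached:
                raise TimeoutError(f"generation {target} not applied in time")
            return self.applied


gens = Generations()
t1 = gens.enqueue_write()
t2 = gens.enqueue_write()
gens.mark_applied(t1)
```

At this point an eventual query or a read-your-writes query carrying t1 returns at once, while a strong query would block until t2 is applied or the timeout expires.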

Crash recovery replays the dirty queue plus the WAL; if the HNSW index is corrupted it is rebuilt in full from cf_embed_cache. The incremental invariant also holds: a 1-line edit re-embeds only the touched function's chunks (≈ the neighbour count of the change), never the whole file.

Versions & time travel

A native snapshot DAG (separate from git) lets you semantically search "the code as it was then." Each commit seals the working tree into a new VersionID = BLAKE3(manifest) and advances HEAD. Chunks are retained per file-version unit (path, blob), so reindexing adds new units without deleting old history.

--as-of accepts working (default), HEAD, HEAD~N (first-parent ancestor), or a VersionID prefix; search filters candidates to that tree's visible DocIDs. Because SymbolID is version-independent, syna diff --symbol <name> produces a line-level diff of one function across revisions:

$ syna diff --symbol validate
modified HEAD → working: validate (src/auth.rs)
 fn validate(t: &str) -> bool {
-    !t.is_empty()
+    !t.is_empty() && t.len() > 8
 }
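
Resolving an --as-of value comes down to a short walk over the version DAG. The sketch below is a guess at the shape of that logic, not the syna CLI's code: resolve_as_of, filter_visible, and the short hex ids are hypothetical, and parents lists the first parent first.

```python
def resolve_as_of(spec, head, parents):
    """Map an --as-of string to a version id (or the literal "working").

    parents: {version_id: [parent_ids]}, first parent first; roots map to [].
    """
    if spec == "working":
        return "working"
    if spec == "HEAD":
        return head
    if spec.startswith("HEAD~"):
        version = head
        for _ in range(int(spec[5:])):
            ancestors = parents.get(version, [])
            if not ancestors:
                raise ValueError(f"{spec}: history is too short")
            version = ancestors[0]  # follow first-parent links only
        return version
    # Otherwise treat the spec as a VersionID prefix that must be unique.
    matches = [v for v in parents if v.startswith(spec)]
    if len(matches) != 1:
        raise ValueError(f"{spec}: expected one match, found {len(matches)}")
    return matches[0]


def filter_visible(candidates, visible_doc_ids):
    """Keep only candidates whose DocID is visible in the resolved tree."""
    return [doc for doc in candidates if doc in visible_doc_ids]


parents = {"c3f0": ["b21a", "9999"], "b21a": ["a100"], "9999": ["a100"], "a100": []}
two_back = resolve_as_of("HEAD~2", "c3f0", parents)
```

Because HEAD~N follows only the first parent, the merge's second parent (9999 in this example) is never visited.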

Mark-sweep gc drops versions unreachable from any ref and reclaims only the chunks they held exclusively. Because chunks are content-addressed, anything shared with a reachable version or the working tree is preserved — a reverted edit shares its DocID and vector with the original.
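
The gc rule can be written as a mark phase over versions followed by a sweep over chunks. This is a sketch with hypothetical names and in-memory maps; the real collector works against cf_versions and the chunk stores.

```python
def mark_sweep(refs, parents, chunks_of, working_chunks):
    """Return (dead_versions, dead_chunks).

    refs:           {ref_name: version_id}
    parents:        {version_id: [parent_ids]}
    chunks_of:      {version_id: set of chunk ids held by that version}
    working_chunks: chunk ids used by the working tree
    """
    # Mark: every version reachable from a ref through parent links.
    reachable, stack = set(), list(refs.values())
    while stack:
        version = stack.pop()
        if version not in reachable:
            reachable.add(version)
            stack.extend(parents.get(version, []))

    # Anything held by a reachable version or the working tree stays alive.
    live = set(working_chunks)
    for version in reachable:
        live |= chunks_of[version]

    # Sweep: unreachable versions go, along with chunks only they held.
    dead_versions = set(chunks_of) - reachable
    dead_chunks = set()
    for version in dead_versions:
        dead_chunks |= chunks_of[version] - live
    return dead_versions, dead_chunks


dead_versions, dead_chunks = mark_sweep(
    refs={"main": "v2"},
    parents={"v2": ["v1"], "v1": [], "x1": ["v1"]},
    chunks_of={"v1": {"a", "b"}, "v2": {"a", "c"}, "x1": {"a", "d", "e"}},
    working_chunks={"e"},
)
```

The abandoned version x1 is dropped, but of its three chunks only d is reclaimed: a is shared with reachable history and e is still in the working tree.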

Degraded mode & determinism

If the embedder or HNSW is unavailable, SynaFS does not go dark. The vector list is emptied and RRF runs with w_vector = 0, which is effectively BM25 ranking plus the trigram filter. The response is flagged degraded: true and provenance reports vector = 0, so an agent knows to lower its confidence.
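
A hedged sketch of that fallback: the search function, the EmbedderUnavailable exception, and the callback parameters are all hypothetical, and the trigram filter is omitted. It shows only the control flow of dropping the vector signal and flagging the response.

```python
class EmbedderUnavailable(Exception):
    """Hypothetical error raised when the embedder or vector index is down."""


def search(query, embed, vector_topk, bm25_topk, k=60, w_vector=1.0, w_bm25=0.6):
    degraded = False
    try:
        vector_hits = vector_topk(embed(query))
    except EmbedderUnavailable:
        # Degraded mode: empty vector list and zero vector weight.
        vector_hits, w_vector, degraded = [], 0.0, True
    lexical_hits = bm25_topk(query)

    scores, why = {}, {}
    for signal, weight, ranked in (("vector", w_vector, vector_hits),
                                   ("bm25", w_bm25, lexical_hits)):
        for rank, doc in enumerate(ranked, start=1):
            part = weight / (k + rank)
            scores[doc] = scores.get(doc, 0.0) + part
            # Record the per-signal contribution for provenance.
            why.setdefault(doc, {"vector": 0.0, "bm25": 0.0})[signal] = part
    order = sorted(scores, key=lambda d: (-scores[d], d))
    return {
        "degraded": degraded,
        "hits": [{"doc": d, "score": scores[d], "why": why[d]} for d in order],
    }


def offline_embedder(query):
    raise EmbedderUnavailable("embedder offline")


result = search("jwt", offline_embedder, lambda vec: [1], lambda q: [5, 2])
```

With the embedder offline, the ranking is exactly the BM25 order, every hit reports a vector contribution of 0, and the caller can see degraded set to True.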

Everything stays deterministic by construction: identical (content, chunker_version, embedder_id) yields an identical ChunkID and, on a cache hit, an identical vector.

Content-addressed identity

Each identifier below is a BLAKE3 hash, so determinism, dedup, and cache invalidation come for free:

BlobID
BLAKE3(bytes) — immutable content address; identical bytes are stored once.
ChunkID
BLAKE3(blob ‖ range ‖ chunker_version) — dedups even moved code; deterministic invalidation on chunker upgrade.
SymbolID
BLAKE3(lang ‖ container ‖ kind ‖ qualified_name) — version-independent identity, tracked across revisions.
VersionID
BLAKE3(snapshot manifest) — one snapshot, git-commit-like.