One engine, four signals — vector, symbol graph, lexical, and temporal indexes that share a single document id, so hybrid ranking is a join rather than a glue layer.
SynaFS keeps four indexes over the same code, all keyed off one content-addressed chunk identity. Every posting list, vector label, and graph edge resolves back to the same DocID (an interned u64 mapped to (PathID, VersionID, ChunkID) in cf_docmap). Because the four signals agree on identity, fusing them is a set join — not a cross-system reconciliation.
- Vector: an HNSW index over chunk embeddings, with vectors held in cf_embed_cache.
- Symbol graph: edge sets (cf_graph_*: SymbolID → roaring<SymbolID>). Version-independent SymbolID tracks a symbol across revisions.
- Lexical: BM25 postings (cf_lex_post: term → postings{DocID,tf}) plus a trigram index (cf_lex_trigram) for regex / substring filtering.
- Temporal: cf_versions (VersionID → Manifest, ref/<name> → VersionID), so the same chunk identity carries history and powers --as-of time-travel.

Candidate generation runs in parallel: vector top-K (default 200), BM25 top-K (default 200), and a trigram filter. The lists are then merged with Reciprocal Rank Fusion:
score(d) = Σ_s w_s / (k + rank_s(d))    # k = 60
w_vector = 1.0, w_bm25 = 0.6            # defaults (config)
RRF is chosen over a weighted sum because cosine scores (0–1) and BM25 scores (unbounded, corpus-dependent) live on different scales: a weighted sum lets one signal dominate and needs re-normalising per corpus. RRF uses ranks only, so it is scale-invariant and the only knobs left to tune are (w_s, k).
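The fusion formula above can be sketched in a few lines. This is a minimal illustration, not SynaFS's implementation: the `RankedList` type and `rrf_fuse` function are hypothetical names, and the sketch assumes ranks are 1-based and that a document missing from a list contributes nothing for that signal.

```rust
use std::collections::HashMap;

/// One ranked candidate list from a single signal, best match first.
/// (Hypothetical type for this sketch.)
pub struct RankedList<'a> {
    pub weight: f64,        // w_s
    pub doc_ids: &'a [u64], // DocIDs ordered by that signal's own score
}

/// Weighted Reciprocal Rank Fusion: score(d) = sum over s of w_s / (k + rank_s(d)).
/// Assumption: ranks are 1-based; absent documents add nothing for that list.
pub fn rrf_fuse(lists: &[RankedList], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (i, doc) in list.doc_ids.iter().enumerate() {
            let rank = (i + 1) as f64;
            *scores.entry(*doc).or_insert(0.0) += list.weight / (k + rank);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by DocID so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Only the position of each DocID in its list matters here, so the raw cosine and BM25 magnitudes never have to be normalised against each other.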
An optional graph rerank then boosts candidates along the call graph 1–2 hops out from the top matched symbols, with decay (1-hop = 1.0, 2-hop = 0.5) and a default weight α = 0.02:
score'(d) = score(d) + α · graph_signal(d)
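A rough sketch of how such a boost might be computed follows. The text does not define graph_signal precisely, so this assumes it is the hop decay at the shortest distance from any seed symbol, that the seeds themselves get no boost, and that the call graph is a plain adjacency map. All names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Hop decay from the text: 1-hop = 1.0, 2-hop = 0.5.
/// Assumption of this sketch: any other distance (including the seeds) gets 0.
fn decay(hops: u32) -> f64 {
    match hops {
        1 => 1.0,
        2 => 0.5,
        _ => 0.0,
    }
}

/// Breadth-first walk of the call graph, at most two hops out from the seeds.
/// Returns symbol -> graph_signal, using the shortest hop distance found.
pub fn graph_signal(calls: &HashMap<u64, Vec<u64>>, seeds: &[u64]) -> HashMap<u64, f64> {
    let mut seen: HashSet<u64> = seeds.iter().copied().collect();
    let mut frontier: Vec<u64> = seeds.to_vec();
    let mut signal = HashMap::new();
    for hops in 1..=2u32 {
        let mut next = Vec::new();
        for sym in &frontier {
            if let Some(callees) = calls.get(sym) {
                for &callee in callees {
                    // First visit wins, so each symbol keeps its shortest distance.
                    if seen.insert(callee) {
                        signal.insert(callee, decay(hops));
                        next.push(callee);
                    }
                }
            }
        }
        frontier = next;
    }
    signal
}

/// score'(d) = score(d) + alpha * graph_signal(d)
pub fn rerank(score: f64, alpha: f64, signal: f64) -> f64 {
    score + alpha * signal
}
```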
Because RRF discards raw score magnitude, every hit carries provenance — a why object with the per-signal contributions (vector, bm25, rrf, graph_boost) plus the version — so an agent can judge confidence itself.
syna-index at M0. Graph reranking (the α term) lands at M3; until then α = 0.

Chunks are semantic units, not fixed windows. syna-chunk parses with tree-sitter and emits function- / method-level chunks (with the leading doc-comment attached) for Rust, Python, JS, and Go. Token target is ~512 with a hard cap of 1024; over-cap functions split at statement boundaries with ~64-token overlap.
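To make the over-cap rule concrete, here is a small sketch of splitting at statement boundaries with overlap. It is an illustration under assumptions rather than syna-chunk's actual algorithm: statements are represented only by their token counts, `split_statements` is a hypothetical name, and a single statement larger than the cap is emitted as its own chunk.

```rust
/// Splits an over-cap function into chunks made of whole statements.
/// `tokens[i]` is the token count of statement i; returns half-open index ranges.
/// Assumption: a single statement above `cap` becomes a chunk by itself.
pub fn split_statements(tokens: &[usize], cap: usize, overlap: usize) -> Vec<(usize, usize)> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        // Always take one statement, then keep adding while the cap is respected.
        let mut end = start;
        let mut total = 0;
        while end < tokens.len() && (end == start || total + tokens[end] <= cap) {
            total += tokens[end];
            end += 1;
        }
        chunks.push((start, end));
        if end == tokens.len() {
            break;
        }
        // Step back over trailing statements worth at most `overlap` tokens,
        // but never to or before `start`, so every iteration makes progress.
        let mut next = end;
        let mut carried = 0;
        while next > start + 1 && carried + tokens[next - 1] <= overlap {
            carried += tokens[next - 1];
            next -= 1;
        }
        start = next;
    }
    chunks
}
```

With statement sizes `[400, 400, 50, 400, 400]`, a cap of 1024, and an overlap of 64, the 50-token statement is shared between the two resulting chunks.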
A context header (path + qualified symbol) is prepended to the embed input only — the stored blob stays clean:
// path: src/auth/jwt.rs
// symbol: auth::jwt::validate_token
<chunk source>
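Building that embed input might look like the sketch below. The function name `embed_input` is hypothetical, and the sketch assumes the header lines are newline-separated and use `//` regardless of the chunk's language.

```rust
/// Builds the text handed to the embedder: a two-line header plus the chunk source.
/// The stored blob is never modified; only this derived string carries the header.
/// Assumption: newline-separated header lines with a `//` prefix.
pub fn embed_input(path: &str, qualified_symbol: &str, chunk_source: &str) -> String {
    format!("// path: {path}\n// symbol: {qualified_symbol}\n{chunk_source}")
}
```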
Chunk identity is content-addressed for deterministic invalidation: ChunkID = BLAKE3(blob_id ‖ byte_range ‖ chunker_version). Folding the chunker_version into the id means a chunker upgrade invalidates exactly the affected chunks, deterministically.
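The effect of folding chunker_version into the id can be shown with a dependency-free sketch. Note the substitution: std's `DefaultHasher` (64-bit, non-cryptographic) stands in for BLAKE3 purely to keep the example self-contained, and the parameter types are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for ChunkID = BLAKE3(blob_id ‖ byte_range ‖ chunker_version).
/// NOTE: DefaultHasher replaces BLAKE3 here only so the sketch has no
/// dependencies; it is neither cryptographic nor the real id width.
pub fn chunk_id(blob_id: &[u8], byte_range: (u64, u64), chunker_version: u32) -> u64 {
    let mut h = DefaultHasher::new();
    blob_id.hash(&mut h);
    byte_range.hash(&mut h);
    chunker_version.hash(&mut h);
    h.finish()
}
```

The same blob, range, and chunker version always map to the same id, while bumping the chunker version alone yields a different id for the same bytes.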
Embedding is pluggable (the Embedder trait), with three default tiers. The cache is keyed (embedder_id, ChunkID), so swapping models partitions the cache cleanly and only cache misses hit the model:
| Tier | Model | Spec | Note |
|---|---|---|---|
| Local (default) | CodeRankEmbed | 137M · 768-dim · MIT | Pure-Rust (candle), runs local on CPU. |
| Local (opt-in) | CodeSage-large-v2 | 1.3B · Apache-2.0 · Matryoshka | Truncatable dims for hot/cold tiers. |
| API (opt-in) | voyage-code-3 | 1024-dim · API | Explicit opt-in; code leaves the box. |
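The cache keying can be sketched as follows. The real `Embedder` trait signature is not shown in this page, so the trait below is a hypothetical stand-in, as are `EmbedCache` and the toy `CountingEmbedder`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the Embedder trait; the real signature may differ.
pub trait Embedder {
    fn id(&self) -> &str;
    fn embed(&mut self, text: &str) -> Vec<f32>;
}

/// Cache keyed by (embedder_id, ChunkID).
#[derive(Default)]
pub struct EmbedCache {
    vectors: HashMap<(String, u64), Vec<f32>>,
}

impl EmbedCache {
    /// Returns the cached vector, calling the model only on a miss.
    pub fn get_or_embed(&mut self, e: &mut dyn Embedder, chunk_id: u64, text: &str) -> Vec<f32> {
        let key = (e.id().to_string(), chunk_id);
        if let Some(v) = self.vectors.get(&key) {
            return v.clone();
        }
        let v = e.embed(text);
        self.vectors.insert(key, v.clone());
        v
    }
}

/// Toy embedder that counts how often the "model" is actually invoked.
pub struct CountingEmbedder {
    pub name: String,
    pub calls: usize,
}

impl Embedder for CountingEmbedder {
    fn id(&self) -> &str {
        &self.name
    }
    fn embed(&mut self, text: &str) -> Vec<f32> {
        self.calls += 1;
        vec![text.len() as f32]
    }
}
```

Because the embedder id is part of the key, a second model asking for the same ChunkID misses and fills its own partition instead of reusing the first model's vector.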
A write returns immediately, then reindexes asynchronously. The write path is a small state machine: a FUSE write or MCP apply_edit puts the new bytes into the content-addressed blob store, updates cf_meta, bumps T = ++enqueued_generation, journals a ReindexTask to the WAL, and returns {ok, token: T}. A background worker then chunk-diffs old vs new, embeds only cache-miss chunks, and atomically lifts applied_generation = T.
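The two counters in that state machine can be modelled with a small single-threaded sketch. It is illustrative only: the WAL is an in-memory queue, blob storage and cf_meta are omitted, and `WritePath`, `write`, `apply_next`, and `is_applied` are hypothetical names.

```rust
use std::collections::VecDeque;

/// Hypothetical reindex task journaled for the background worker.
pub struct ReindexTask {
    pub path: String,
    pub generation: u64,
}

#[derive(Default)]
pub struct WritePath {
    pub enqueued_generation: u64,
    pub applied_generation: u64,
    pub wal: VecDeque<ReindexTask>, // in-memory stand-in for the WAL
}

impl WritePath {
    /// Foreground write: bump the generation, journal the task, return the token T.
    pub fn write(&mut self, path: &str) -> u64 {
        self.enqueued_generation += 1;
        let token = self.enqueued_generation;
        self.wal.push_back(ReindexTask { path: path.to_string(), generation: token });
        token
    }

    /// Background worker step: process the oldest task, then publish its generation.
    pub fn apply_next(&mut self) -> Option<u64> {
        let task = self.wal.pop_front()?;
        // Chunk-diffing and embedding of cache-miss chunks would happen here.
        self.applied_generation = task.generation;
        Some(task.generation)
    }

    /// True once the write identified by `token` is visible to queries.
    pub fn is_applied(&self, token: u64) -> bool {
        self.applied_generation >= token
    }
}
```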
Two monotonic generation counters drive three consistency levels:
- One level reads at the current applied_generation; the response carries that generation.
- One level captures enqueued_generation at query time and blocks until applied ≥ target (with timeout).
- One level lets the client pass a token it has seen; the query waits until that token is applied.

Crash recovery replays the dirty queue plus the WAL; if the HNSW index is corrupted, it is rebuilt in full from cf_embed_cache. The incremental invariant also holds: a 1-line edit re-embeds only the touched function's chunks (≈ the neighbour count of the change), never the whole file.
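The three consistency levels reduce to picking a target generation and comparing it with the applied counter. The sketch below shows only that comparison; the variant names are invented for illustration, and the blocking wait with timeout is left out.

```rust
/// Variant names are hypothetical labels for the three behaviours described above.
pub enum ReadLevel {
    /// Serve from whatever is applied right now.
    Current,
    /// Wait for everything that was enqueued at query time.
    CatchUp,
    /// Wait for a specific write token the client has seen.
    AtLeast(u64),
}

/// Generation that must be applied before the query may run.
pub fn target_generation(level: &ReadLevel, applied: u64, enqueued: u64) -> u64 {
    match level {
        ReadLevel::Current => applied,
        ReadLevel::CatchUp => enqueued,
        ReadLevel::AtLeast(token) => *token,
    }
}

/// A real implementation would block (with a timeout) until this returns true.
pub fn may_proceed(level: &ReadLevel, applied: u64, enqueued: u64) -> bool {
    applied >= target_generation(level, applied, enqueued)
}
```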
A native snapshot DAG (separate from git) lets you semantically search "the code as it was then." Each commit seals the working tree into a new VersionID = BLAKE3(manifest) and advances HEAD. Chunks are retained per file-version unit (path, blob), so reindexing adds new units without deleting old history.
--as-of accepts working (default), HEAD, HEAD~N (first-parent ancestor), or a VersionID prefix; search filters candidates to that tree's visible DocIDs. Because SymbolID is version-independent, syna diff --symbol <name> produces a line-level diff of one function across revisions:
$ syna diff --symbol validate
modified HEAD → working: validate (src/auth.rs)
fn validate(t: &str) -> bool {
- !t.is_empty()
+ !t.is_empty() && t.len() > 8
}

Mark-sweep gc drops versions unreachable from any ref and reclaims only the chunks they held exclusively. Because chunks are content-addressed, anything shared with a reachable version or the working tree is preserved: a reverted edit shares its DocID and vector with the original.
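A mark-sweep pass of this kind can be sketched over a toy version DAG. This is an assumption-laden illustration: ids are plain `u64` values rather than BLAKE3 hashes, and the `Version` struct and `gc` function are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// A snapshot: its parent versions and the chunks it holds (toy u64 ids).
pub struct Version {
    pub parents: Vec<u64>,
    pub chunks: Vec<u64>,
}

/// Returns (versions to drop, chunks to reclaim).
pub fn gc(
    versions: &HashMap<u64, Version>,
    refs: &[u64],
    working_chunks: &HashSet<u64>,
) -> (HashSet<u64>, HashSet<u64>) {
    // Mark: walk parent links from every ref.
    let mut live: HashSet<u64> = HashSet::new();
    let mut stack: Vec<u64> = refs.to_vec();
    while let Some(v) = stack.pop() {
        if live.insert(v) {
            if let Some(ver) = versions.get(&v) {
                stack.extend(ver.parents.iter().copied());
            }
        }
    }
    // Chunks still needed by a reachable version or the working tree.
    let mut kept: HashSet<u64> = working_chunks.clone();
    for v in &live {
        if let Some(ver) = versions.get(v) {
            kept.extend(ver.chunks.iter().copied());
        }
    }
    // Sweep: unreachable versions go, along with chunks nobody else holds.
    let mut dead_versions = HashSet::new();
    let mut dead_chunks = HashSet::new();
    for (id, ver) in versions {
        if !live.contains(id) {
            dead_versions.insert(*id);
            dead_chunks.extend(ver.chunks.iter().copied().filter(|c| !kept.contains(c)));
        }
    }
    (dead_versions, dead_chunks)
}
```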
If the embedder or HNSW is unavailable, SynaFS does not go dark. The vector list is emptied and RRF runs with w_vector = 0 — effectively BM25 + trigram ranking. The response is flagged degraded: true and provenance reports vector = 0, so an agent knows to lower its confidence.
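The fallback can be sketched as a fusion step that tolerates a missing vector list. The names `SearchResponse` and `search` are hypothetical, the trigram filter is omitted, and ranks are assumed to be 1-based.

```rust
use std::collections::HashMap;

pub struct SearchResponse {
    pub hits: Vec<(u64, f64)>, // (DocID, fused score), best first
    pub degraded: bool,
}

/// `vector_hits` is None when the embedder or HNSW index is unavailable.
pub fn search(vector_hits: Option<&[u64]>, bm25_hits: &[u64], k: f64) -> SearchResponse {
    let degraded = vector_hits.is_none();
    // Degraded mode: empty vector list and w_vector = 0, so only BM25 ranks count.
    let w_vector = if degraded { 0.0 } else { 1.0 };
    let w_bm25 = 0.6;
    let empty: &[u64] = &[];
    let vector = vector_hits.unwrap_or(empty);

    let mut scores: HashMap<u64, f64> = HashMap::new();
    for (i, d) in vector.iter().enumerate() {
        *scores.entry(*d).or_insert(0.0) += w_vector / (k + (i + 1) as f64);
    }
    for (i, d) in bm25_hits.iter().enumerate() {
        *scores.entry(*d).or_insert(0.0) += w_bm25 / (k + (i + 1) as f64);
    }
    let mut hits: Vec<(u64, f64)> = scores.into_iter().collect();
    // Best score first; DocID breaks ties so the order is deterministic.
    hits.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    SearchResponse { hits, degraded }
}
```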
Everything stays deterministic by construction: identical (content, chunker_version, embedder_id) yields an identical ChunkID and, on a cache hit, an identical vector. The same input always produces the same chunk identity and the same embedding.
Every identifier is a BLAKE3 content address — determinism, dedup, and cache invalidation come for free:
- blob_id: BLAKE3(bytes). Immutable content address; identical bytes are stored once.
- ChunkID: BLAKE3(blob ‖ range ‖ chunker_version). Dedups even moved code; deterministic invalidation on chunker upgrade.
- SymbolID: BLAKE3(lang ‖ container ‖ kind ‖ qualified_name). Version-independent identity, tracked across revisions.
- VersionID: BLAKE3(snapshot manifest). One snapshot, git-commit-like.