Deep dive
Data Structuring
How a folder of PDFs becomes a queryable knowledge surface — three honest ingest tiers, profile-aware mining, hybrid retrieval, and quote-grounded extraction the model cannot fake.
Drop a folder on Ouroboros. Get back something you can actually query — by keyword, by meaning, and by extracted fact. No “processing…” spinner that lies about progress, no opaque “indexing” that might mean anything. Three tiers, each one earning a distinct capability, all visible in the dashboard chip.
The daily-driver instance has run 4928 documents through this pipeline as of today. It’s beta. The mechanics below are the ones in production right now.
Three honest tiers
Every document moves through scanned → searchable → indexed. The header chip on the Data tab reads exactly that:
276s · 276f · 0i ↻
276 scanned, 276 searchable (full-text + dense vectors live), 0 indexed (no facts mined yet). No fake bar inching toward 100%. No vague “in progress.” If you want facts mined, click the arrow — mining runs.
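The chip is just three counters and a refresh affordance. A minimal sketch of its data model, with names that are illustrative rather than the daemon’s actual types:

type TierCounts = { scanned: number; searchable: number; indexed: number };

// Renders the header format shown above: "276s · 276f · 0i ↻".
function renderChip(c: TierCounts): string {
  return `${c.scanned}s · ${c.searchable}f · ${c.indexed}i ↻`;
}

renderChip({ scanned: 276, searchable: 276, indexed: 0 });
// → '276s · 276f · 0i ↻'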
What each tier earns you
- Scanned. The daemon saw the file, hashed its bytes, knows its mime/kind. You can browse and open it. Nothing more.
- Searchable. BM25 full-text plus dense vectors are indexed. The default embedding model in local mode is qwen3-embedding:0.6b (1024-dim) via Ollama; if you’ve configured a BYOK embedding provider, that’s used instead. sophia.search_documents works.
- Indexed. Facts have been extracted into the typed knowledge graph. sophia.query_knowledge returns rows for entities and predicates pulled from the document.
The tiers are independent of each other. A document can sit at searchable
forever — many do. You only mine the ones you actually want facts from.
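Once a document does reach indexed, its facts are queryable. A hedged sketch of what that looks like; the parameter shape of sophia.query_knowledge is an assumption here, modeled on the search_documents call shown later:

// Assumed parameter shape — illustrative, not the daemon's documented API.
const rows = await sophia.query_knowledge({
  entity: 'Acme Corp',    // hypothetical entity
  predicate: 'party_to',  // hypothetical predicate
});
// Each returned row carries the claim plus the verbatim quote that grounded it.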
The pipeline
flowchart LR
F[Folder you registered] --> W[fs-watcher]
W --> H[hash + mime/kind]
H --> S[scanned]
S --> FT[FTS5 index]
S --> EM[dense embedding]
FT --> SR[searchable]
EM --> SR
SR -->|click ↻ or agent enqueues| MN[mine: profile + features]
MN --> EX[extractor produces<br/>claims with quotes]
EX --> CK{quote literally<br/>in source?}
CK -->|yes| KG[knowledge graph]
CK -->|no| DR[dropped on the floor]
KG --> IX[indexed]

The fork at the bottom is the load-bearing part. Every claim the extractor produces carries a verbatim quote from the source. Before the claim lands, a substring check runs against the original text. Claims whose quotes don’t literally appear are dropped.
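The check itself is deliberately simple. A minimal sketch of the idea; the whitespace normalization is an assumption, and the real check may be stricter:

type Claim = { subject: string; predicate: string; object: string; quote: string };

// Keep only claims whose quote literally appears in the source text.
// Normalizing whitespace is an assumption; the real check may be byte-exact.
function groundClaims(claims: Claim[], sourceText: string): Claim[] {
  const norm = (s: string) => s.replace(/\s+/g, ' ').trim();
  const haystack = norm(sourceText);
  return claims.filter((c) => haystack.includes(norm(c.quote)));
}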
Profiles + features axis
Different documents need different extraction. A legal contract isn’t a research paper isn’t a markdown spec isn’t a generic note. Ouroboros uses four profiles to set the baseline:
- general — default, conservative extraction
- legal — parties, dates, obligations, defined terms
- code-doc — symbols, examples, directive blocks
- paper — citations, methods, claims with confidence
On top of the profile, a features axis layers in extra extraction the document actually needs:
features: Set<DocumentFeature>
= tables | wikilinks | code_blocks | frontmatter
| external_refs | citations | procedures
A markdown spec with code blocks and wikilinks gets the code-doc profile
plus the wikilinks and code_blocks features. A legal PDF with a
schedule of payments gets legal plus tables. Composable, not exploded into
twelve sub-profiles.
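In type terms the combination is just a pair: one profile and a set of features. A sketch with assumed names:

type Profile = 'general' | 'legal' | 'code-doc' | 'paper';
type DocumentFeature =
  | 'tables' | 'wikilinks' | 'code_blocks' | 'frontmatter'
  | 'external_refs' | 'citations' | 'procedures';

// One profile plus any subset of features, instead of enumerating sub-profiles.
interface ExtractionPlan {
  profile: Profile;
  features: Set<DocumentFeature>;
}

const markdownSpec: ExtractionPlan = {
  profile: 'code-doc',
  features: new Set<DocumentFeature>(['wikilinks', 'code_blocks']),
};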
Per-root overrides let you set defaults at the folder level — “everything
under ~/Notes/legal/ is legal profile, tables feature on” — so you don’t
re-pick on every document.
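A per-root default might look something like this; the config shape is an assumption, not the actual schema:

// Hypothetical per-root override config — field names are illustrative.
const rootDefaults = {
  '~/Notes/legal/': { profile: 'legal', features: ['tables'] },
  '~/Notes/specs/': { profile: 'code-doc', features: ['code_blocks', 'wikilinks'] },
};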
Quote-grounded extraction in practice
When an agent (or the SPA) asks the daemon for a document to mine, it gets back the enriched contract — not just the raw text:
const job = await sophia.get_document_for_mining({ doc_id });
// → {
// doc_id: 'doc_8f2…',
// text: '…full document text…',
// profile: 'legal',
// features: ['tables', 'external_refs'],
// directives: [
// 'Extract parties as entities of type organization or person',
// 'Quote ≤ 300 chars, verbatim, must appear in text',
// 'Tables: emit one claim per row with row-keyed predicates',
// ],
// prior_claims: [ /* what's already mined for this doc */ ],
// }

The directives field is the per-profile guidance the extractor follows. The prior_claims field is what’s already been mined — so re-mining is additive and idempotent rather than duplicating work. And every claim the extractor returns gets the substring check before it lands.
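The prior_claims field also makes the client side of re-mining cheap. A sketch of the dedupe step, where the claim-identity key is an assumption:

// Same Claim shape as the grounding sketch above. The identity key is an
// assumption: a claim counts as already mined if subject/predicate/object match.
const key = (c: Claim) => `${c.subject}|${c.predicate}|${c.object}`;

function newClaimsOnly(extracted: Claim[], prior: Claim[]): Claim[] {
  const seen = new Set(prior.map(key));
  return extracted.filter((c) => !seen.has(key(c)));
}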
Hybrid retrieval, on by default
Search isn’t just BM25 and isn’t just vectors. Every search_documents call
runs the full hybrid pipeline:
- BM25 lexical match via SQLite FTS5
- Dense vectors semantic match against the embedding index
- RRF fusion combines the two ranked lists
- Cross-encoder rerank with bge-reranker-v2-m3 via onnxruntime-node — the top-K from the fused list re-scored against the query
One call, sub-second on the daily-driver, no external API for the rerank step (the cross-encoder ships with the daemon and runs on CPU).
const hits = await sophia.search_documents({
query: 'lease termination notice period',
k: 10,
});
// → {
// results: [
// { doc_id, title, excerpt, score, source: 'fused+reranked' },
// …
// ],
// timing_ms: { bm25: 18, dense: 41, fuse: 1, rerank: 287 },
// }

If you don’t want the rerank step (say you’re paginating a long list), rerank: false skips it. The default is on.
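Reciprocal Rank Fusion itself is only a few lines. A sketch of the standard formula; whether the daemon uses the conventional k = 60 is an assumption:

// Standard RRF: score(d) = Σ 1 / (k + rank_i(d)) over each ranked list.
// k = 60 is the usual constant in the literature; the daemon's value is assumed.
function rrfFuse(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((docId, i) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

const bm25Ids  = ['doc_a', 'doc_b', 'doc_c']; // lexical ranking
const denseIds = ['doc_b', 'doc_d', 'doc_a']; // semantic ranking
rrfFuse([bm25Ids, denseIds]); // fused order; the top-K goes to the reranker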
Discovered review queue
When extraction finds an entity the user hasn’t declared — a person, an organization, a system the document references — it doesn’t get auto-promoted into the entity table. It lands in a Discovered queue. The Data tab shows it. You triage:
- Promote — it’s a real entity, give it a row
- Merge — it’s the same as one you already have, fold it in
- Dismiss — it’s noise, never surface again
Dismissal is durable. Re-mining the same document — or any document that mentions the same string — cannot resurrect a dismissed entity. The dismissal is keyed and persisted, not just a UI hide.
This is what keeps the entity table from filling with model-invented artifacts of casual mentions.
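Durability falls out of how the dismissal is keyed. A sketch of the idea, with the key shape as an assumption:

// Assumed key shape: dismissals persist by normalized surface form plus kind,
// not by the mining run that produced the candidate.
const dismissKey = (name: string, kind: string) =>
  `${kind}:${name.toLowerCase().trim()}`;

function shouldSurface(
  candidate: { name: string; kind: string },
  dismissed: Set<string>,
): boolean {
  // A re-mine re-derives the same key, so a dismissed entity stays dismissed.
  return !dismissed.has(dismissKey(candidate.name, candidate.kind));
}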
The Data tab
The SPA’s Data tab is where all of this is visible; it replaces the all-or-nothing folder-add of the prior version. Per-class groups, each with its own status:
- Code — modules indexed, mined-at, content-hash drift
- Documents — scanned/searchable/indexed counts, anomalies
- Knowledge — facts in the graph, contradictions, orphans
- Wiki — pages, tags, broken wikilinks
- Discovered — the review queue above
- Autolinks — proposed entity↔document links awaiting review
- Ingest log — recent activity with cost, time, outcome
Multi-select within any group lets you bulk-mine, re-mine, or dismiss. The header summary fails loud — if there’s an anomaly (mining errors, embedding backlog, stuck shards), the relevant group auto-expands so you don’t miss it.
Where this is headed
- Gmail and Drive ingest — the same three-tier pipeline applied to email and Drive folders, with per-thread / per-folder profiles. Mail with attachments composes naturally into the existing extractor.
- OCR for scanned PDFs — the current pipeline assumes selectable text. Scanned-PDF support adds an OCR step between scanned and searchable so image-only documents land in the same tiers.
- Multi-source-folder repos with per-folder profiles — a single registered root with sub-trees that each carry their own profile + feature set, rather than picking one default for the whole root.
- More language profiles — current extraction is tuned for English documents. Profiles for non-English docs (matching the embedding model’s multilingual coverage) are next.