Deep dive
Data Structuring
How a folder of PDFs becomes a queryable knowledge surface — three honest ingest tiers, profile-aware mining, hybrid retrieval, and quote-grounded extraction the model cannot fake.
Drop a folder on Ouroboros. Get back something you can actually query — by keyword, by meaning, and by extracted fact. No “processing…” spinner that lies about progress, no opaque “indexing” that might mean anything. Three tiers, each one earning a distinct capability, all visible in the dashboard chip.
The daily-driver instance has run 4928 documents through this pipeline as of today. It’s beta. The mechanics below are the ones in production right now.
Three honest tiers
Every document moves through scanned → searchable → indexed. The header chip on the Data tab reads exactly that:
276s · 276f · 0i ↻
276 scanned, 276 searchable (full-text + dense vectors live), 0 indexed (no facts mined yet). No fake bar inching toward 100%. No vague “in progress.” If you want facts mined, click the arrow — mining runs.
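The chip is just three counters and a refresh affordance. A minimal sketch of its data model, with names that are illustrative rather than the daemon’s actual types:

type TierCounts = { scanned: number; searchable: number; indexed: number };

// Renders the header format shown above: "276s · 276f · 0i ↻".
function renderChip(c: TierCounts): string {
  return `${c.scanned}s · ${c.searchable}f · ${c.indexed}i ↻`;
}

renderChip({ scanned: 276, searchable: 276, indexed: 0 });
// → '276s · 276f · 0i ↻'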
What each tier earns you
- Scanned. The daemon saw the file, hashed its bytes, knows its mime/kind. You can browse and open it. Nothing more.
- Searchable. BM25 full-text plus dense vectors are indexed. The default embedding model in local mode is qwen3-embedding:0.6b (1024-dim) via Ollama; if you’ve configured a BYOK embedding provider, that’s used instead. sophia.search_documents works.
- Indexed. Facts have been extracted into the typed knowledge graph. sophia.query_knowledge returns rows for entities and predicates pulled from the document.
The tiers are independent of each other. A document can sit at searchable
forever — many do. You only mine the ones you actually want facts from.
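Once a document does reach indexed, its facts are queryable. A hedged sketch of what that looks like; the parameter shape of sophia.query_knowledge is an assumption here, modeled on the search_documents call shown later:

// Assumed parameter shape — illustrative, not the daemon's documented API.
const rows = await sophia.query_knowledge({
  entity: 'Acme Corp',    // hypothetical entity
  predicate: 'party_to',  // hypothetical predicate
});
// Each returned row carries the claim plus the verbatim quote that grounded it.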
The pipeline
flowchart LR
F[Folder you registered] --> W[fs-watcher]
W --> H[hash + mime/kind]
H --> S[scanned]
S --> FT[FTS5 index]
S --> EM[dense embedding]
FT --> SR[searchable]
EM --> SR
SR -->|click ↻ or agent enqueues| MN[mine: profile + features]
MN --> EX[extractor produces<br/>claims with quotes]
EX --> CK{quote literally<br/>in source?}
CK -->|yes| KG[knowledge graph]
CK -->|no| DR[dropped on the floor]
KG --> IX[indexed]

The fork at the bottom is the load-bearing part. Every claim the extractor produces carries a verbatim quote from the source. Before the claim lands, a substring check runs against the original text. Claims whose quotes don’t literally appear are dropped.
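The check itself is deliberately simple. A minimal sketch of the idea; the whitespace normalization is an assumption, and the real check may be stricter:

type Claim = { subject: string; predicate: string; object: string; quote: string };

// Keep only claims whose quote literally appears in the source text.
// Normalizing whitespace is an assumption; the real check may be byte-exact.
function groundClaims(claims: Claim[], sourceText: string): Claim[] {
  const norm = (s: string) => s.replace(/\s+/g, ' ').trim();
  const haystack = norm(sourceText);
  return claims.filter((c) => haystack.includes(norm(c.quote)));
}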
Profiles + features axis
Different documents need different extraction. A legal contract isn’t a research paper isn’t a markdown spec isn’t a generic note. Ouroboros uses four profiles to set the baseline:
- general — default, conservative extraction
- legal — parties, dates, obligations, defined terms
- code-doc — symbols, examples, directive blocks
- paper — citations, methods, claims with confidence
On top of the profile, a features axis layers in extra extraction the document actually needs:
features: Set<DocumentFeature>
= tables | wikilinks | code_blocks | frontmatter
| external_refs | citations | procedures
A markdown spec with code blocks and wikilinks gets the code-doc profile
plus the wikilinks and code_blocks features. A legal PDF with a
schedule of payments gets legal plus tables. Composable, not exploded into
twelve sub-profiles.
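In type terms the combination is just a pair: one profile and a set of features. A sketch with assumed names:

type Profile = 'general' | 'legal' | 'code-doc' | 'paper';
type DocumentFeature =
  | 'tables' | 'wikilinks' | 'code_blocks' | 'frontmatter'
  | 'external_refs' | 'citations' | 'procedures';

// One profile plus any subset of features, instead of enumerating sub-profiles.
interface ExtractionPlan {
  profile: Profile;
  features: Set<DocumentFeature>;
}

const markdownSpec: ExtractionPlan = {
  profile: 'code-doc',
  features: new Set<DocumentFeature>(['wikilinks', 'code_blocks']),
};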
Per-root overrides let you set defaults at the folder level — “everything
under ~/Notes/legal/ is legal profile, tables feature on” — so you don’t
re-pick on every document.
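A per-root default might look something like this; the config shape is an assumption, not the actual schema:

// Hypothetical per-root override config — field names are illustrative.
const rootDefaults = {
  '~/Notes/legal/': { profile: 'legal', features: ['tables'] },
  '~/Notes/specs/': { profile: 'code-doc', features: ['code_blocks', 'wikilinks'] },
};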
Quote-grounded extraction in practice
When an agent (or the SPA) asks the daemon for a document to mine, it gets back the enriched contract — not just the raw text:
const job = await sophia.get_document_for_mining({ doc_id });
// → {
// doc_id: 'doc_8f2…',
// text: '…full document text…',
// profile: 'legal',
// features: ['tables', 'external_refs'],
// directives: [
// 'Extract parties as entities of type organization or person',
// 'Quote ≤ 300 chars, verbatim, must appear in text',
// 'Tables: emit one claim per row with row-keyed predicates',
// ],
// prior_claims: [ /* what's already mined for this doc */ ],
// }

The directives field is the per-profile guidance the extractor follows. The prior_claims field is what’s already been mined — so re-mining is additive and idempotent rather than duplicating work. And every claim the extractor returns gets the substring check before it lands.
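The prior_claims field also makes the client side of re-mining cheap. A sketch of the dedupe step, where the claim-identity key is an assumption:

// Same Claim shape as the grounding sketch above. The identity key is an
// assumption: a claim counts as already mined if subject/predicate/object match.
const key = (c: Claim) => `${c.subject}|${c.predicate}|${c.object}`;

function newClaimsOnly(extracted: Claim[], prior: Claim[]): Claim[] {
  const seen = new Set(prior.map(key));
  return extracted.filter((c) => !seen.has(key(c)));
}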
Hybrid retrieval, on by default
Search isn’t just BM25 and isn’t just vectors. Every search_documents call
runs the full hybrid pipeline:
- BM25 lexical match via SQLite FTS5
- Dense vectors semantic match against the embedding index
- RRF fusion combines the two ranked lists
- Cross-encoder rerank with bge-reranker-v2-m3 via onnxruntime-node — the top-K from the fused list re-scored against the query
One call, sub-second on the daily-driver, no external API for the rerank step (the cross-encoder ships with the daemon and runs on CPU).
const hits = await sophia.search_documents({
query: 'lease termination notice period',
k: 10,
});
// → {
// results: [
// { doc_id, title, excerpt, score, source: 'fused+reranked' },
// …
// ],
// timing_ms: { bm25: 18, dense: 41, fuse: 1, rerank: 287 },
// }

If you don’t want the rerank step (say you’re paginating a long list), rerank: false skips it. The default is on.
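Reciprocal Rank Fusion itself is only a few lines. A sketch of the standard formula; whether the daemon uses the conventional k = 60 is an assumption:

// Standard RRF: score(d) = Σ 1 / (k + rank_i(d)) over each ranked list.
// k = 60 is the usual constant in the literature; the daemon's value is assumed.
function rrfFuse(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((docId, i) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

const bm25Ids  = ['doc_a', 'doc_b', 'doc_c']; // lexical ranking
const denseIds = ['doc_b', 'doc_d', 'doc_a']; // semantic ranking
rrfFuse([bm25Ids, denseIds]); // fused order; the top-K goes to the reranker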
Discovered review queue
When extraction finds an entity the user hasn’t declared — a person, an organization, a system the document references — it doesn’t get auto-promoted into the entity table. It lands in a Discovered queue. The Data tab shows it. You triage:
- Promote — it’s a real entity, give it a row
- Merge — it’s the same as one you already have, fold it in
- Dismiss — it’s noise, never surface again
Dismissal is durable. Re-mining the same document — or any document that mentions the same string — cannot resurrect a dismissed entity. The dismissal is keyed and persisted, not just a UI hide.
This is what keeps the entity table from filling with model-invented artifacts of casual mentions.
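Durability falls out of how the dismissal is keyed. A sketch of the idea, with the key shape as an assumption:

// Assumed key shape: dismissals persist by normalized surface form plus kind,
// not by the mining run that produced the candidate.
const dismissKey = (name: string, kind: string) =>
  `${kind}:${name.toLowerCase().trim()}`;

function shouldSurface(
  candidate: { name: string; kind: string },
  dismissed: Set<string>,
): boolean {
  // A re-mine re-derives the same key, so a dismissed entity stays dismissed.
  return !dismissed.has(dismissKey(candidate.name, candidate.kind));
}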
The Data tab
The SPA’s Data tab is where all of this is visible; it replaces the all-or-nothing folder-add of the prior version. Per-class groups, each with its own status:
- Code — modules indexed, mined-at, content-hash drift
- Documents — scanned/searchable/indexed counts, anomalies
- Knowledge — facts in the graph, contradictions, orphans
- Wiki — pages, tags, broken wikilinks
- Discovered — the review queue above
- Autolinks — proposed entity↔document links awaiting review
- Ingest log — recent activity with cost, time, outcome
Multi-select within any group lets you bulk-mine, re-mine, or dismiss. The header summary fails loud — if there’s an anomaly (mining errors, embedding backlog, stuck shards), the relevant group auto-expands so you don’t miss it.
Where this is headed
- Gmail and Drive ingest — the same three-tier pipeline applied to email and Drive folders, with per-thread / per-folder profiles. Mail with attachments composes naturally into the existing extractor.
- OCR for scanned PDFs — the current pipeline assumes selectable text. Scanned-PDF support adds an OCR step between scanned and searchable so image-only documents land in the same tiers.
- Multi-source-folder repos with per-folder profiles — a single registered root with sub-trees that each carry their own profile + feature set, rather than picking one default for the whole root.
- More language profiles — current extraction is tuned for English documents. Profiles for non-English docs (matching the embedding model’s multilingual coverage) are next.