Deep dive
Codebase Graph
Point Ouroboros at a repo. Tree-sitter parses every module locally — modules, symbols, edges — into a typed graph your coding agent queries in sub-second time. The orientation toll your agent pays on every fresh session, paid once.
A coding agent on a fresh repo burns context before it writes a single line. It globs the file tree. It reads eight to twelve anchor files to figure out the shape. It follows imports. It hunts callers. By the time it understands enough to make a real edit, you’ve paid for thousands of tokens that produced no output. Start a new session — pay it again. Switch from Claude Code to Codex — pay it again. The repo didn’t change. The cost did.
The Codebase Graph is the one-time toll. Point Ouroboros at a repo, let tree-sitter parse it locally, and your agent gets a typed graph it can query in sub-second time: modules, symbols, callers, callees, imports. Every fresh session starts oriented.
What sophia.query_codebase earns you
One call. Sub-second. Returns typed rows: file path, line range, callers, callees, imports, exports. The graph is built from real ASTs — not LLM-derived guesses, not regex over a flat-file index. Every edge has a source location you can jump to.
// What does this entity's code surface look like?
const overview = await sophia.queryCodebase({
  kind: 'overview',
  entity_id: 'ouroboros-app',
});
// →
// {
//   modules: 1742,
//   symbols: 9821,
//   edges: 5634,
//   languages: { typescript: 1488, tsx: 201, python: 53 },
//   kinds: { lib: 1612, test: 88, script: 42 },
// }
That’s the answer to “what am I looking at?”, surfaced in one round trip instead of twenty file reads.
Six languages, parsed locally
TypeScript (including TSX), Python, Rust, Go, Java, and C# are all parsed locally with tree-sitter. No language-server dependency. No IDE plugin. No LLM call. The C# parser has been stress-tested on a 2,000-file Unity project; it parses cleanly.
The choice of tree-sitter over heavier toolchains is deliberate. Tree-sitter parsers are fast, embeddable, and incremental — they re-parse a single file in milliseconds when it changes. The resulting AST is well-typed enough to extract the shapes the graph cares about: declarations, references, imports, exports.
What gets stored
Three tables per repo, one row per real thing in your code:
- code_modules: one row per file, with language, kind (lib / test / script), parsed timestamp, and byte size.
- code_symbols: one row per declared name (classes, functions, interfaces, methods, types), each with a line range pointing back into the source file.
- code_edges: typed relationships between symbols: imports, calls, references, extends, implements. Each edge carries the line where the relationship is expressed.
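As a rough illustration, the three tables could be modelled as TypeScript row types like the ones below. Only the table names, the module kinds, and the relationship names come from the description above; every column name (`module_id`, `symbol_id`, `parsed_at`, `byte_size`, and so on) and every sample value other than the `src/auth/bearer.ts` path is an assumption made for this sketch, not the actual schema.

```typescript
// Hypothetical row shapes for the three tables. Column names are
// illustrative assumptions, not the real Ouroboros schema.
type ModuleKind = 'lib' | 'test' | 'script';
type Relationship = 'imports' | 'calls' | 'references' | 'extends' | 'implements';

interface CodeModule {
  module_id: string;
  path: string;
  language: string;
  kind: ModuleKind;
  parsed_at: string; // ISO-8601 timestamp of the last parse
  byte_size: number;
}

interface CodeSymbol {
  symbol_id: string;
  module_id: string; // the file that declares this symbol
  name: string;
  kind: 'class' | 'function' | 'interface' | 'method' | 'type';
  start_line: number;
  end_line: number;
}

interface CodeEdge {
  source_symbol_id: string;
  target_symbol_id: string;
  relationship: Relationship;
  line: number; // line where the relationship is expressed
}

// One made-up row per table, to show how the rows reference each other.
const exampleModule: CodeModule = {
  module_id: 'mod_0001',
  path: 'src/auth/bearer.ts',
  language: 'typescript',
  kind: 'lib',
  parsed_at: '2025-01-01T00:00:00Z',
  byte_size: 2048,
};

const exampleSymbol: CodeSymbol = {
  symbol_id: 'sym_a1b2',
  module_id: exampleModule.module_id,
  name: 'validateBearer',
  kind: 'function',
  start_line: 47,
  end_line: 82,
};

const exampleEdge: CodeEdge = {
  source_symbol_id: 'sym_c3d4', // some caller declared in another module
  target_symbol_id: exampleSymbol.symbol_id,
  relationship: 'calls',
  line: 134,
};
```

Keeping edges keyed by symbol IDs rather than names is what would let two functions with the same name in different files stay distinct in a walk.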
On the daily-driver machine running this site, two real repos are indexed — Ouroboros itself in TypeScript and a separate C# project. Together that’s 2,998 modules, 13,554 symbols, 8,407 edges. All on local disk, all queryable by every connected agent.
The query shapes
sophia.query_codebase takes a kind parameter that selects the walk:
- overview: language breakdown plus module / symbol / edge counts for an entity. The “where am I?” call.
- modules: list modules under an entity, filterable by language or kind.
- symbols: find a named symbol (function, class, interface). Returns every match with file path and line range.
- callers: who calls this symbol. Walks code_edges where relationship = 'calls' and the target matches.
- callees: what this symbol calls. The opposite walk.
- imports: what this module imports.
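To make the callers and callees walks concrete, here is a minimal in-memory sketch of the filter each one describes. It is not the Ouroboros implementation: the `Edge` shape, the helper names `callersOf` and `calleesOf`, and the `decodeToken` symbol are all hypothetical, and plain symbol names stand in for symbol IDs.

```typescript
// Minimal in-memory stand-in for code_edges. Field names are assumptions.
interface Edge {
  source: string; // symbol that expresses the relationship
  target: string; // symbol being referred to
  relationship: 'imports' | 'calls' | 'references' | 'extends' | 'implements';
  line: number;
}

// callers: edges where relationship = 'calls' and the target matches.
function callersOf(edges: Edge[], symbol: string): Edge[] {
  return edges.filter((e) => e.relationship === 'calls' && e.target === symbol);
}

// callees: the opposite walk, matching on the source instead of the target.
function calleesOf(edges: Edge[], symbol: string): Edge[] {
  return edges.filter((e) => e.relationship === 'calls' && e.source === symbol);
}

const sampleEdges: Edge[] = [
  { source: 'handleToolCall', target: 'validateBearer', relationship: 'calls', line: 134 },
  { source: 'authMiddleware', target: 'validateBearer', relationship: 'calls', line: 88 },
  { source: 'validateBearer', target: 'decodeToken', relationship: 'calls', line: 52 },
  // A non-call edge: ignored by both walks.
  { source: 'authMiddleware', target: 'validateBearer', relationship: 'references', line: 12 },
];

const callerNames = callersOf(sampleEdges, 'validateBearer').map((e) => e.source);
const calleeNames = calleesOf(sampleEdges, 'validateBearer').map((e) => e.target);
console.log(callerNames.join(', ')); // handleToolCall, authMiddleware
console.log(calleeNames.join(', ')); // decodeToken
```

Both walks are the same filter with the match column swapped, which is why callees can be described as simply the opposite walk.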
Agents chain these naturally. Find the symbol, walk to its callers, read the
two highest-confidence ones to understand how it’s used, then make the edit.
Three queries, one round trip when wrapped in sophia.execute_code.
// Who calls validateBearer? Read them before refactoring it.
const symbol = await sophia.queryCodebase({
  kind: 'symbols',
  name: 'validateBearer',
});
// → [{ symbol_id: 'sym_a1b2', module: 'src/auth/bearer.ts', start_line: 47, end_line: 82, kind: 'function' }]
const callers = await sophia.queryCodebase({
  kind: 'callers',
  symbol_id: 'sym_a1b2',
});
// →
// [
//   { module: 'src/routes/mcp.ts', line: 134, caller_symbol: 'handleToolCall' },
//   { module: 'src/routes/api.ts', line: 88, caller_symbol: 'authMiddleware' },
//   { module: 'src/mcp/auth.test.ts', line: 21, caller_symbol: 'rejects expired bearer' },
// ]
Architecture
flowchart LR
  R[Repo on disk] -->|chokidar watch| W[File-change debouncer]
  W -->|modified files| P[tree-sitter parse]
  P -->|extract| M[(code_modules)]
  P -->|extract| S[(code_symbols)]
  P -->|extract| E[(code_edges)]
  M --> Q[sophia.query_codebase]
  S --> Q
  E --> Q
  Q -->|sub-second typed rows| A[Your coding agent]
  Q -->|same data| D[/Data tab in tray app/]
Auto re-sync
A chokidar watcher keeps the graph honest. Two triggers fire a re-parse:
- Five-minute timer — sweeps the repo for any change the watcher might have missed (large rebases, file restorations).
- On-change debounce — files modified in your editor re-parse within a couple seconds.
Re-parse touches only the files that actually changed. The rest of the graph stays stable. Symbols that disappear on re-parse are removed; symbols that appear are added; edges referencing removed symbols are pruned. Nothing runs an LLM unless you ask it to — code ingest is pure CPU and stays $0 forever.
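The add, remove, and prune behaviour described above can be sketched as a pure function over an in-memory graph. This is an illustration under stated assumptions, not the actual ingest code: the `Sym`, `Edge`, and `Graph` shapes and the `applyReparse` helper are hypothetical, and the sketch assumes symbol IDs stay stable across re-parses of the same file.

```typescript
// Hypothetical in-memory graph. Field names are assumptions for this sketch.
interface Sym { symbol_id: string; module: string; name: string; }
interface Edge { source: string; target: string; relationship: string; }
interface Graph { symbols: Sym[]; edges: Edge[]; }

// Swap in the freshly parsed symbols and edges for one changed file,
// leave every other module untouched, and prune edges that now dangle.
function applyReparse(
  graph: Graph,
  modulePath: string,
  parsedSymbols: Sym[],
  parsedEdges: Edge[],
): Graph {
  // IDs of the symbols this file declared before the re-parse.
  const replaced = graph.symbols
    .filter((s) => s.module === modulePath)
    .map((s) => s.symbol_id);

  // Symbols from other files stay as-is; this file's set is replaced wholesale.
  const symbols = graph.symbols
    .filter((s) => s.module !== modulePath)
    .concat(parsedSymbols);
  const isLive = (id: string) => symbols.some((s) => s.symbol_id === id);

  const edges = graph.edges
    .filter((e) => replaced.indexOf(e.source) === -1) // drop this file's old outgoing edges
    .concat(parsedEdges) // add its freshly extracted edges
    .filter((e) => isLive(e.source) && isLive(e.target)); // prune edges to removed symbols

  return { symbols, edges };
}

// Example: src/x.ts used to declare a and b; after an edit, only a remains.
const beforeGraph: Graph = {
  symbols: [
    { symbol_id: 'a', module: 'src/x.ts', name: 'a' },
    { symbol_id: 'b', module: 'src/x.ts', name: 'b' },
    { symbol_id: 'c', module: 'src/y.ts', name: 'c' },
  ],
  edges: [
    { source: 'c', target: 'a', relationship: 'calls' },
    { source: 'c', target: 'b', relationship: 'calls' },
    { source: 'a', target: 'b', relationship: 'calls' },
  ],
};

const afterGraph = applyReparse(
  beforeGraph,
  'src/x.ts',
  [{ symbol_id: 'a', module: 'src/x.ts', name: 'a' }],
  [],
);
console.log(afterGraph.symbols.map((s) => s.symbol_id).join(', ')); // c, a
console.log(afterGraph.edges.map((e) => e.source + '->' + e.target).join(', ')); // c->a
```

The linear scans keep the sketch short; a real store would more likely do the same work with indexed lookups or SQL deletes keyed on the module.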
The /data tab — same graph, two surfaces
Open the tray app and click the Data tab. The view there is the same graph your agents query, just rendered for human eyes. Click a module — drill to its symbols. Click a symbol — walk to its callers. Click a caller — read the documents that mention it. Walk further — read the facts those documents produced.
The dashboard isn’t a separate analytics layer with its own copy of the data.
It’s sophia.query_codebase with a UI. When the graph updates from a re-parse,
the dashboard updates too. You and your agent can’t disagree about what’s in
the repo, because you’re both reading the same rows.
Where this is headed
- More language parsers: Swift, Kotlin, and Ruby are on the near roadmap. The ingest pipeline is parser-agnostic; adding a language is a tree-sitter grammar plus a small extractor that maps AST nodes to the three tables.
- Call-graph diff between commits: a coding agent can ask “what changed in the public surface of this package since last week?” and get a list of added, removed, and signature-changed symbols. This closes a real gap in the orient-on-fresh-session story for the case where the repo did change.
- Source-location-aware semantic linking: facts produced from documents that mention validateBearer get linked to the actual symbol in the graph, not just the string. So a question like “what do my notes say about the function I’m about to refactor?” returns the right notes, scoped to the right symbol, with line ranges.
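The commit-diff item is not shipped, so the following is only a guess at one way such a diff could be computed from two symbol snapshots. The `SymbolSnapshot` shape, the `diffSurface` helper, the `module#name` key format, and the sample symbols `parseHeader` and `refreshBearer` are all hypothetical.

```typescript
// Hypothetical snapshot of a package's public symbols at one commit.
interface SymbolSnapshot { module: string; name: string; signature: string; }
interface SurfaceDiff { added: string[]; removed: string[]; changed: string[]; }

// Index a snapshot by "module#name" so the two commits can be compared by key.
function indexBySymbol(snapshot: SymbolSnapshot[]): Record<string, string> {
  const index: Record<string, string> = {};
  for (const s of snapshot) {
    index[s.module + '#' + s.name] = s.signature;
  }
  return index;
}

// Added: key only in the newer snapshot. Removed: key only in the older one.
// Changed: key in both, but with a different signature string.
function diffSurface(older: SymbolSnapshot[], newer: SymbolSnapshot[]): SurfaceDiff {
  const oldIdx = indexBySymbol(older);
  const newIdx = indexBySymbol(newer);
  return {
    added: Object.keys(newIdx).filter((k) => !(k in oldIdx)),
    removed: Object.keys(oldIdx).filter((k) => !(k in newIdx)),
    changed: Object.keys(newIdx).filter((k) => k in oldIdx && oldIdx[k] !== newIdx[k]),
  };
}

const lastWeek: SymbolSnapshot[] = [
  { module: 'src/auth/bearer.ts', name: 'validateBearer', signature: '(token: string) => boolean' },
  { module: 'src/auth/bearer.ts', name: 'parseHeader', signature: '(header: string) => string' },
];
const today: SymbolSnapshot[] = [
  { module: 'src/auth/bearer.ts', name: 'validateBearer', signature: '(token: string, now: Date) => boolean' },
  { module: 'src/auth/bearer.ts', name: 'refreshBearer', signature: '(token: string) => string' },
];

const surfaceDiff = diffSurface(lastWeek, today);
console.log(surfaceDiff.added.join(', ')); // src/auth/bearer.ts#refreshBearer
console.log(surfaceDiff.removed.join(', ')); // src/auth/bearer.ts#parseHeader
console.log(surfaceDiff.changed.join(', ')); // src/auth/bearer.ts#validateBearer
```

Comparing signatures as strings is the simplest possible choice; it would flag formatting-only changes, so a real implementation would probably compare a normalized form.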