Benchmarks
Rename a function used in 20 places.
Grep finds 1. MCI finds 13.
Cursor, Copilot, and Claude Code — without a code graph — fall back to text search when finding callers. These benchmarks put the same model (Claude Opus 4.7) in three conditions: blind baseline, grep/glob only, and full MCI. Same tasks. Same codebase. Different results.
20-task benchmark, April 2026, one production codebase. Three-condition design (blind / grep / MCI) isolates the information gain of each tool directly. External validation on microsoft/TypeScript (379k lines) also completed April 2026. Full methodology ↓
Benchmark 1 · Results
MCI Caller Accuracy
20 cross-file refactor tasks on the Momental TypeScript monorepo. Control: Claude Opus 4.7 with grep/glob only — the same baseline used by Cursor, Copilot, and Claude Code without a code graph. Treatment: same model, full MCI. Ground truth computed at runtime.
Sample tasks
- Change claimFiles() signature — add required ttlSeconds param
- Rename getBlastRadius() return type field riskLevel from string to literal union
- Add required includeExternal param to code_blast MCP tool
- Modify searchSymbols() to return paginated results
- Change saveRecord() to accept new environment parameter
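The first task above, sketched as a hypothetical before/after. The claimFiles signature and the Claim shape here are invented for illustration; the real API may differ.

```typescript
// Hypothetical shape for the claimFiles() task: ttlSeconds becomes a
// required parameter, so every call site must now pass it explicitly.
interface Claim {
  files: string[];
  expiresAt: number; // epoch milliseconds
}

// After the change: ttlSeconds is required, not optional.
function claimFiles(files: string[], ttlSeconds: number): Claim {
  return { files, expiresAt: Date.now() + ttlSeconds * 1000 };
}

// Each call site surfaced by the caller graph gains the new argument:
const claim = claimFiles(["packages/worker/src/index.ts"], 300);
```

A text search finds the call sites; the caller graph is what tells you which of them are functions you must re-test.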
Live Example · Real Codebase
See it on one function
Task: "Add a required source: string param to recordAsync — find every call site."
Run live on the Momental monorepo (38k symbols, 1,130 TypeScript source files). Same real codebase as Benchmark 1.
grep -rn "recordAsync(" packages/
- 1 line is the function definition — requires manual triage
- No grouping by calling function — gemini.service.ts has 9 separate callers buried in the list
- Worker's separate AIUsageService mixed in — no signal which implementation is which
- No risk level, no test suggestions
momental_code_find("recordAsync")
momental_code_blast([id])
- Each result is a named function — zero manual triage
- Risk level: P0 — billing-critical path, 441 downstream symbols
- Exact tests to run: embedding.test.ts, gemini.service.test.ts, agent12-adapters.test.ts
- Worker symbol flagged as separate implementation — won't silently regress
Measured live · April 2026 · Momental monorepo · 38k symbols · 1,130 TypeScript source files.
Grep: time grep -rn "recordAsync(" . --include="*.ts" (full repo, no path scoping). MCI: direct HTTP timing to mcp.momentalos.com — 185ms code_find + 237ms code_symbol. Token estimates = output bytes ÷ 4.
External Validation · Public Codebase · 379k lines
microsoft/TypeScript — quality improves even on public codebases
Same benchmark harness, different codebase. 20 tasks against microsoft/TypeScript (601 source files, 379k lines). Momental had not previously indexed this repo.
Cold filesystem cache flushed before each grep run (sudo purge).
20 of 20 tasks measured · April 2026 · 13m 16s total run time · microsoft/TypeScript shallow clone · 601 src files · 379k lines.
Grep: rg --no-heading -n "functionName(" /repo/src (full repo, no path scoping, cold filesystem cache flushed via sudo purge before each run). MCI: HTTP to mcp.momentalos.com — ~275ms code_find + ~227ms code_symbol = ~502ms total.
Raw output and per-task scores: benchmarks/results/typescript-2026-04-18.txt.
Benchmark 2 · Results
Organizational Context Lift
5 of 20 planned architectural-decision tasks, where the correct answer is already recorded as a DECISION atom in Momental. Control agents have no memory of past choices; treatment agents recall the relevant decision before answering. Run: April 2026 · Model: Claude Opus 4.7 · LLM judge 1–10 quality score.
The +53% lift on the Gemini billing question is the clearest signal: without Momental,
Claude knows the SDK API but doesn’t know the team’s specific architectural decision
(use geminiService.completeWithModel() — it auto-records usage). With Momental,
that decision is surfaced in context and Claude applies it correctly.
Methodology
How we score every task
Every task runs twice — control and treatment. Five dimensions, every run. No cherry-picking. The same codebase, the same questions.
Correct
Did the agent find all cross-file callers? Ground truth computed at run time — not a static fixture that goes stale.
Solution Built
Does the proposed change compile? Scored via an LLM-judge proxy for tsc --noEmit — running real compilation inside every benchmark loop is too slow.
Speed
Wall-clock milliseconds from task start to working output. MCI adds a small upfront cost but eliminates the back-and-forth of missed callers.
Tokens
Input + output tokens consumed. MCI sends a targeted caller graph instead of requiring the agent to read 20+ files blind.
Quality
LLM-as-judge score from 1–10 using Claude Opus. Rubric: correctness, idiomaticity, addressed all callers, no introduced regressions.
Fairness rules
- Same model (Claude Opus 4.7) for all control and treatment runs
- Same task descriptions — only the context injected differs
- Ground truth computed at run time, not hardcoded
- All raw scores published with results
- Codebase: the Momental monorepo (TypeScript, ~300k lines, real production code) + microsoft/TypeScript (379k lines, public OSS)
All numbers from a 20-task benchmark run completed April 2026 on one production
codebase (Momental’s own monorepo, 38k+ indexed symbols).
External validation on microsoft/TypeScript (379k lines, public OSS) completed April 2026 —
see the TypeScript section above. Runner source code lives at
packages/worker/src/__tests__/e2e/benchmarks/ in the Momental monorepo.
Under the hood
What MCI gives the agent
Five MCP tool calls at session start. The agent sees the entire codebase as a queryable knowledge graph — not a directory tree.
code_find Exact symbol lookup by name. Returns file path, line, symbol ID for all matches across the repo.
code_symbol 360° view of a symbol: callers, callees, cluster membership, linked tasks and decisions.
code_blast Recursive blast radius from a symbol. Returns P0–P5 risk level, all affected files, suggested tests to run.
code_search Semantic + BM25 hybrid search across all symbols and comments. Finds conceptually related code, not just string matches.
code_claim Declare which files you’re editing. Surfaces live conflicts with other agents before you write a single line.
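The blast-radius idea behind code_blast can be approximated as a breadth-first walk over a reverse call graph. The sketch below is illustrative only: the graph, the symbol names, and the risk thresholds are invented, not MCI's actual P0–P5 scoring.

```typescript
// Illustrative blast radius: given a reverse call graph (symbol -> callers),
// collect every transitive caller, then map the count to a risk bucket.
// Thresholds here are invented for the sketch.
type CallerGraph = Map<string, string[]>;

function blastRadius(graph: CallerGraph, root: string): { affected: string[]; risk: string } {
  const seen = new Set<string>();
  const queue = [root];
  while (queue.length > 0) {
    const sym = queue.shift()!;
    for (const caller of graph.get(sym) ?? []) {
      if (!seen.has(caller)) {
        seen.add(caller);
        queue.push(caller);
      }
    }
  }
  const n = seen.size;
  const risk = n > 100 ? "P0" : n > 25 ? "P1" : n > 5 ? "P2" : "P4";
  return { affected: [...seen], risk };
}

// Tiny invented graph around the recordAsync example:
const g: CallerGraph = new Map([
  ["recordAsync", ["trackUsage", "completeWithModel"]],
  ["trackUsage", ["handleRequest"]],
]);
const result = blastRadius(g, "recordAsync"); // 3 transitive callers
```

The real tool also attaches affected files and suggested tests to each node; the walk itself is the cheap part once the graph exists.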
What we measure
Beyond code — the full picture
MCI combines code intelligence with organizational memory. These benchmarks measure all of it.
Cross-file caller graph
BM25 + pgvector hybrid search. Every function, every caller, across the whole repo — not just the open file.
Blast radius scoring
P0–P5 risk level before you touch a line. Know exactly which tests to run and which callers to notify.
Organizational decisions
DECISION atoms surface the why behind every architectural choice — not just what the code does, but why it was built that way.
Multi-agent coordination
Live file claiming and conflict detection. Multiple agents working the same repo without stepping on each other.
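The hybrid search mentioned above (lexical BM25 plus vector similarity) is commonly fused with a weighted sum per candidate. A toy sketch, with invented weights and scores; MCI's actual fusion may differ.

```typescript
// Toy hybrid ranking: combine a lexical (BM25-like) score and a semantic
// (vector-similarity) score with a weighted sum, then sort descending.
interface Scored {
  id: string;
  lexical: number;  // normalized BM25 score, 0..1
  semantic: number; // normalized cosine similarity, 0..1
}

function hybridRank(results: Scored[], alpha = 0.5): string[] {
  const combined = (r: Scored) => alpha * r.lexical + (1 - alpha) * r.semantic;
  return [...results].sort((a, b) => combined(b) - combined(a)).map(r => r.id);
}

const ranked = hybridRank([
  { id: "recordAsync", lexical: 0.9, semantic: 0.8 },
  { id: "recordSync", lexical: 0.7, semantic: 0.3 },
]); // "recordAsync" ranks first
```

The point of the fusion is that a symbol can win on conceptual similarity even when the query string never appears in it, and vice versa.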
Give your agents the context they need.
MCI is live. Connect your repo and see the caller graph in minutes.