Benchmarks

Rename a function used in 20 places.
Grep finds 1. MCI finds 13.

Without a code graph, Cursor, Copilot, and Claude Code all fall back to text search to find callers. These benchmarks put the same model (Claude Opus 4.7) in three conditions: blind baseline, grep/glob only, and full MCI. Same tasks. Same codebase. Different results.

28.6%
cross-file caller recall with MCI
↑ +17.6pp vs grep
65%
recall on rename tasks (best case)
↑ +62pp vs grep
+53%
quality lift on architecture decisions
org context benchmark (best case)
Benchmark codebase
Momental monorepo · 38k+ indexed symbols · 300k+ lines TypeScript · production code
Same model, all conditions
Claude Opus 4.7 · identical task descriptions · only context injected differs
Run
April 2026 · 20 tasks × 3 conditions (blind / grep / MCI) = 60 total runs · ground truth computed at runtime

20-task benchmark, April 2026, one production codebase. Three-condition design (blind / grep / MCI) isolates the information gain of each tool directly. External validation on microsoft/TypeScript (379k lines) also completed April 2026. Full methodology ↓

Benchmark 1 · Results

MCI Caller Accuracy

20 cross-file refactor tasks on the Momental TypeScript monorepo. Control: Claude Opus 4.7 with grep/glob only — the same baseline used by Cursor, Copilot, and Claude Code without a code graph. Treatment: same model, full MCI. Ground truth computed at runtime.

Cross-file caller recall — avg across 20 tasks

Cursor · Copilot · Claude Code grep/glob baseline (no code graph)
11.0%
Claude Code + MCI code graph + semantic search
28.6% ↑+17.6pp

Recall on rename tasks — best-case subset

Grep/glob baseline
3%
MCI
65% ↑+62pp
Dimension
Control (grep/glob)
Treatment (MCI)
Cross-file caller recall
11.0%
28.6% ↑+17.6pp
Best-case recall (rename tasks)
3%
65% ↑+62pp

Sample tasks

  • Change claimFiles() signature — add required ttlSeconds param
  • Rename getBlastRadius() return type field riskLevel from string to literal union
  • Add required includeExternal param to code_blast MCP tool
  • Modify searchSymbols() to return paginated results
  • Change saveRecord() to accept new environment parameter

Live Example · Real Codebase

See it on one function

Task: "Add a required source: string param to recordAsync — find every call site." Run live on the Momental monorepo (38k symbols, 1,130 TypeScript source files). Same real codebase as Benchmark 1.

grep · 3.8–10s · 37 lines · ~841 tokens
MCI · 422ms · 32 callers · ~755 tokens · P0 + tests
Without MCI grep / glob baseline
grep -rn "recordAsync(" packages/
packages/api/src/services/gemini.service.ts:195
packages/api/src/services/gemini.service.ts:220
packages/api/src/services/gemini.service.ts:289
packages/api/src/services/agent12/adapters/claude.ts:797
packages/api/src/services/agent12/adapters/openai.ts:196
packages/api/src/services/agent12/adapters/openai.ts:222
… 30 more lines
37 lines · ~841 tokens · 3.8–10s
  • 1 line is the function definition — requires manual triage
  • No grouping by calling function — gemini.service.ts has 9 separate callers buried in the list
  • Worker's separate AIUsageService mixed in — no signal which implementation is which
  • No risk level, no test suggestions
With MCI code graph + blast radius
momental_code_find("recordAsync")
momental_code_blast([id])
P0 · 441 affected symbols
api: GeminiService · 9 call sites · 8 methods
api: ClaudeAdapter.streamWithMessages · line 797
api: OpenAIAdapter · 3 call sites
api: EmbeddingService.embed · EmbeddingService.embedTextBatch
worker: separate symbol — 9 callers, independent blast radius
Run: embedding.test.ts · gemini.service.test.ts · agent12-adapters.test.ts
32 callers · ~755 tokens · 422ms (9–24× faster)
  • Each result is a named function — zero manual triage
  • Risk level: P0 — billing-critical path, 441 downstream symbols
  • Exact tests to run: embedding.test.ts, gemini.service.test.ts, agent12-adapters.test.ts
  • Worker symbol flagged as separate implementation — won't silently regress
grep (no MCI)
MCI
Wall time
3.8–10s (full repo search)
422ms · 9–24× faster
Raw output
37 lines · 1 is definition
32 callers, all named functions
Token cost
~841 tokens
~755 tokens
Manual triage needed
Yes — 20 files, no grouping
None
Risk level
Unknown
P0 · 441 downstream symbols
Tests to run
Unknown
3 specific test files
Dual implementation
Mixed in flat list
2 separate symbols, independent blast radius

Measured live · April 2026 · Momental monorepo · 38k symbols · 1,130 TypeScript source files. Grep: bash time grep -rn "recordAsync(" . --include="*.ts" (full repo, no path scoping). MCI: direct HTTP timing to mcp.momentalos.com — 185ms code_find + 237ms code_symbol. Token estimates = output bytes ÷ 4.
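The "output bytes ÷ 4" token estimate used above can be sketched directly; it is a rough rule of thumb, and real tokenizers will differ slightly:

```typescript
// Token estimate used throughout this page: output bytes ÷ 4.
// A rough approximation — real tokenizers (BPE-based) vary per model.
function estimateTokens(output: string): number {
  return Math.ceil(Buffer.byteLength(output, "utf8") / 4);
}

// e.g. the ~841-token grep output corresponds to roughly 3,364 bytes.
```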

External Validation · Public Codebase · 379k lines

microsoft/TypeScript — quality improves even on public codebases

Same benchmark harness, different codebase. 20 tasks against microsoft/TypeScript (601 source files, 379k lines), a repo Momental had not previously indexed. The filesystem cache was flushed (sudo purge) before each grep run, so every control run started cold.

grep · cold cache · full repo · no path scoping
MCI · ~293ms find + ~231ms symbol · ~524ms total
Metric
Control (grep)
Treatment (MCI)
Retrieval time
3.8–10s (grep + parse)
~524ms (~20× faster)
Risk level (P0–P5)
No
Yes
Test files to run
No
Yes

20 of 20 tasks measured · April 2026 · 13m 16s total run time · microsoft/TypeScript shallow clone · 601 src files · 379k lines. Grep: rg -F --no-heading -n "functionName(" /repo/src (fixed-string match, full repo, no path scoping, cold filesystem cache flushed via sudo purge before each run). MCI: HTTP to mcp.momentalos.com — ~293ms code_find + ~231ms code_symbol = ~524ms total. Raw output and per-task scores: benchmarks/results/typescript-2026-04-18.txt.

Benchmark 2 · Results

Organizational Context Lift

5 of 20 planned architectural-decision tasks where the correct answer is already recorded as a DECISION atom in Momental. Control agents have no memory of past choices; treatment agents recall the relevant decision before answering. Run: April 2026 · Model: Claude Opus 4.7 · LLM judge, 1–10 quality score.

LLM judge quality score (avg, 5 tasks)

Without Momental memory no access to DECISION atoms
5.0 / 10
With Momental memory DECISION atoms surfaced in context
6.1 / 10 ↑+22%
Example question
Without Momental memory
With Momental memory
“Add a DB query — raw SQL or ORM?”
7.2 / 10
8.2 / 10 ↑+14%
“Long LLM call — sync API or async worker?”
7.5 / 10
9.0 / 10 ↑+20%
“New Gemini call — direct SDK or geminiService?”
6.0 / 10 (misses billing requirement)
9.2 / 10 ↑+53%
Average (5 tasks)
5.0 / 10
6.1 / 10 ↑+22%

The +53% lift on the Gemini billing question is the clearest signal: without Momental, Claude knows the SDK API but doesn’t know the team’s specific architectural decision (use geminiService.completeWithModel() — it auto-records usage). With Momental, that decision is surfaced in context and Claude applies it correctly.
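Mechanically, the treatment condition amounts to prepending the recalled decisions to the prompt before the model answers. A minimal sketch — the DecisionAtom shape and helper name are illustrative assumptions, not Momental's API:

```typescript
// Sketch: surface recalled DECISION atoms ahead of the question.
// The atom shape and helper name are illustrative, not Momental's API.
interface DecisionAtom {
  topic: string;
  decision: string;
}

function withDecisions(question: string, atoms: DecisionAtom[]): string {
  const context = atoms
    .map((a) => `DECISION (${a.topic}): ${a.decision}`)
    .join("\n");
  // Control condition: no atoms recalled — the question goes in bare.
  return context ? `${context}\n\n${question}` : question;
}
```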

Methodology

How we score every task

Every task runs twice — control and treatment. Five dimensions, every run. No cherry-picking. The same codebase, the same questions.

Correct

Did the agent find all cross-file callers? Ground truth computed at run time — not a static fixture that goes stale.


Solution Built

Does the proposed change compile? Scored via an LLM-judge proxy for tsc --noEmit — running a real compilation on every benchmark loop is too slow.

Speed

Wall-clock milliseconds from task start to working output. MCI adds a small upfront cost but eliminates the back-and-forth of missed callers.


Tokens

Input + output tokens consumed. MCI sends a targeted caller graph instead of requiring the agent to read 20+ files blind.

Quality

LLM-as-judge score from 1–10 using Claude Opus. Rubric: correctness, idiomaticity, addressed all callers, no introduced regressions.
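The caller-recall numbers in the results tables reduce to a standard set-overlap measure; a minimal sketch (the helper name is ours, not the harness's):

```typescript
// Cross-file caller recall: fraction of ground-truth callers the agent found.
// Illustrative helper — the benchmark harness computes ground truth at run time.
function callerRecall(found: Set<string>, groundTruth: Set<string>): number {
  if (groundTruth.size === 0) return 1; // nothing to find
  let hits = 0;
  for (const caller of groundTruth) {
    if (found.has(caller)) hits++;
  }
  return hits / groundTruth.size;
}
```

Precision is the mirror image (hits ÷ found.size); the tables above report recall because missed callers are what break a refactor.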

Fairness rules

  • Same model (Claude Opus 4.7) for all control and treatment runs
  • Same task descriptions — only the context injected differs
  • Ground truth computed at run time, not hardcoded
  • All raw scores published with results
  • Codebase: the Momental monorepo (TypeScript, ~300k lines, real production code) + microsoft/TypeScript (379k lines, public OSS)

All numbers from a 20-task benchmark run completed April 2026 on one production codebase (Momental’s own monorepo, 38k+ indexed symbols). External validation on microsoft/TypeScript (379k lines, public OSS) completed April 2026 — see the TypeScript section above. Runner source code lives at packages/worker/src/__tests__/e2e/benchmarks/ in the Momental monorepo.

Under the hood

What MCI gives the agent

Five MCP tools, available from the first message of a session. The agent sees the entire codebase as a queryable knowledge graph — not a directory tree.

code_find

Exact symbol lookup by name. Returns file path, line, symbol ID for all matches across the repo.

code_symbol

360° view of a symbol: callers, callees, cluster membership, linked tasks and decisions.

code_blast

Recursive blast radius from a symbol. Returns P0–P5 risk level, all affected files, suggested tests to run.

code_search

Semantic + BM25 hybrid search across all symbols and comments. Finds conceptually related code, not just string matches.

code_claim

Declare which files you’re editing. Surfaces live conflicts with other agents before you write a single line.
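A refactor session stringing these tools together might look like the following sketch. The call() wrapper and the response field names (symbolId, affectedFiles, riskLevel, suggestedTests) are illustrative assumptions, not the published tool schema:

```typescript
// Illustrative session flow over the five tools above.
// The client wrapper and response shapes are assumptions for this sketch.
interface McpClient {
  call(tool: string, args: Record<string, unknown>): Promise<any>;
}

async function planRefactor(mcp: McpClient, symbolName: string) {
  // 1. Resolve the symbol by exact name (code_find).
  const matches = await mcp.call("code_find", { name: symbolName });
  const id = matches[0].symbolId;

  // 2. Compute the blast radius: risk level, affected files, tests (code_blast).
  const blast = await mcp.call("code_blast", { symbolIds: [id] });

  // 3. Claim the affected files before editing to surface conflicts (code_claim).
  await mcp.call("code_claim", { files: blast.affectedFiles });

  return { risk: blast.riskLevel, tests: blast.suggestedTests };
}
```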

What we measure

Beyond code — the full picture

MCI combines code intelligence with organizational memory. These benchmarks measure all of it.

🔍

Cross-file caller graph

BM25 + pgvector hybrid search. Every function, every caller, across the whole repo — not just the open file.

📈

Blast radius scoring

P0–P5 risk level before you touch a line. Know exactly which tests to run and which callers to notify.

🧠

Organizational decisions

DECISION atoms surface the why behind every architectural choice — not just what the code does, but why it was built that way.

👥

Multi-agent coordination

Live file claiming and conflict detection. Multiple agents working the same repo without stepping on each other.
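The "semantic + BM25 hybrid" ranking described above is typically a weighted blend of a lexical score and a vector-similarity score. A minimal sketch with illustrative normalization and 50/50 weights — not MCI's actual tuning:

```typescript
// Hybrid ranking sketch: blend normalized BM25 and vector-similarity scores.
// Normalization and the 0.5/0.5 weights are illustrative, not MCI's tuning.
function hybridScore(bm25: number, cosineSim: number, wLexical = 0.5): number {
  const lexical = bm25 / (bm25 + 1);      // squash unbounded BM25 into [0, 1)
  const semantic = (cosineSim + 1) / 2;   // map cosine [-1, 1] into [0, 1]
  return wLexical * lexical + (1 - wLexical) * semantic;
}
```

Blending matters because the two signals fail differently: BM25 nails exact identifier matches, while the vector score catches conceptually related code with no string overlap.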

Give your agents the context they need.

MCI is live. Connect your repo and see the caller graph in minutes.