Benchmarks
Rename a function used in 20 places.
Grep finds 1. MCI finds 13.
Cursor, Copilot, and Claude Code — without a code graph — fall back to text search when finding callers. These benchmarks put the same model (Claude Opus 4.7) in three conditions: blind baseline, grep/glob only, and full MCI. Same tasks. Same codebase. Different results.
20-task benchmark, April 2026, one production codebase. Three-condition design (blind / grep / MCI) isolates the information gain of each tool directly. External validation on microsoft/TypeScript (379k lines) also completed April 2026. Full methodology ↓
Benchmark 1 · Results
MCI Caller Accuracy
20 cross-file refactor tasks on the Momental TypeScript monorepo. Control: Claude Opus 4.7 with grep/glob only — the same baseline used by Cursor, Copilot, and Claude Code without a code graph. Treatment: same model, full MCI. Ground truth computed at runtime.
Sample tasks
- Change claimFiles() signature — add required ttlSeconds param
- Rename getBlastRadius() return type field riskLevel from string to literal union
- Add required includeExternal param to code_blast MCP tool
- Modify searchSymbols() to return paginated results
- Change saveRecord() to accept new environment parameter
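The first task above, sketched as a hypothetical before/after. The claimFiles signature and the Claim shape here are invented for illustration; the real API may differ.

```typescript
// Hypothetical shape for the claimFiles() task: ttlSeconds becomes a
// required parameter, so every call site must now pass it explicitly.
interface Claim {
  files: string[];
  expiresAt: number; // epoch milliseconds
}

// After the change: ttlSeconds is required, not optional.
function claimFiles(files: string[], ttlSeconds: number): Claim {
  return { files, expiresAt: Date.now() + ttlSeconds * 1000 };
}

// Each call site surfaced by the caller graph gains the new argument:
const claim = claimFiles(["packages/worker/src/index.ts"], 300);
```

A text search finds the call sites; the caller graph is what tells you which of them are functions you must re-test.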
Live Example · Real Codebase
See it on one function
Task: "Add a required source: string param to recordAsync — find every call site."
Run live on the Momental monorepo (38k symbols, 1,130 TypeScript source files). Same real codebase as Benchmark 1.
grep -rn "recordAsync(" packages/
- 1 line is the function definition — requires manual triage
- No grouping by calling function — gemini.service.ts has 9 separate callers buried in the list
- Worker's separate AIUsageService mixed in — no signal which implementation is which
- No risk level, no test suggestions
momental_code_find("recordAsync")
momental_code_blast([id])
- Each result is a named function — zero manual triage
- Risk level: P0 — billing-critical path, 441 downstream symbols
- Exact tests to run: embedding.test.ts, gemini.service.test.ts, agent12-adapters.test.ts
- Worker symbol flagged as separate implementation — won't silently regress
Measured live · April 2026 · Momental monorepo · 38k symbols · 1,130 TypeScript source files.
Grep: time grep -rn "recordAsync(" . --include="*.ts" (full repo, no path scoping). MCI: direct HTTP timing to mcp.momentalos.com — 185ms code_find + 237ms code_symbol. Token estimates = output bytes ÷ 4.
External Validation · Public Codebase · 379k lines
microsoft/TypeScript — quality improves even on public codebases
Same benchmark harness, different codebase. 20 tasks against microsoft/TypeScript (601 source files, 379k lines). Momental had not previously indexed this repo.
Cold filesystem cache flushed before each grep run (sudo purge).
20 of 20 tasks measured · April 2026 · 13m 16s total run time · microsoft/TypeScript shallow clone · 601 src files · 379k lines.
Grep: rg --no-heading -n "functionName(" /repo/src (full repo, no path scoping, cold filesystem cache flushed via sudo purge before each run). MCI: HTTP to mcp.momentalos.com — ~275ms code_find + ~227ms code_symbol = ~502ms total.
Raw output and per-task scores: benchmarks/results/typescript-2026-04-18.txt.
Benchmark 2 · Results
Organizational Context Lift
5 of 20 planned architectural-decision tasks, where the correct answer is already recorded as a DECISION atom in Momental. Control agents have no memory of past choices; treatment agents recall the relevant decision before answering. Run: April 2026 · Model: Claude Opus 4.7 · LLM judge 1–10 quality score.
The +53% lift on the Gemini billing question is the clearest signal: without Momental,
Claude knows the SDK API but doesn’t know the team’s specific architectural decision
(use geminiService.completeWithModel() — it auto-records usage). With Momental,
that decision is surfaced in context and Claude applies it correctly.
Methodology
How we score every task
Every task runs twice — control and treatment. Five dimensions, every run. No cherry-picking. The same codebase, the same questions.
Correct
Did the agent find all cross-file callers? Ground truth computed at run time — not a static fixture that goes stale.
Solution Built
Does the proposed change compile? Scored via an LLM-judge proxy for tsc --noEmit — running real compilation inside every benchmark loop is too slow.
Speed
Wall-clock milliseconds from task start to working output. MCI adds a small upfront cost but eliminates the back-and-forth of missed callers.
Tokens
Input + output tokens consumed. MCI sends a targeted caller graph instead of requiring the agent to read 20+ files blind.
Quality
LLM-as-judge score from 1–10 using Claude Opus. Rubric: correctness, idiomaticity, addressed all callers, no introduced regressions.
Fairness rules
- Same model (Claude Opus 4.7) for all control and treatment runs
- Same task descriptions — only the context injected differs
- Ground truth computed at run time, not hardcoded
- All raw scores published with results
- Codebase: the Momental monorepo (TypeScript, ~300k lines, real production code) + microsoft/TypeScript (379k lines, public OSS)
All numbers from a 20-task benchmark run completed April 2026 on one production
codebase (Momental’s own monorepo, 38k+ indexed symbols).
External validation on microsoft/TypeScript (379k lines, public OSS) completed April 2026 —
see the TypeScript section above. Runner source code lives at
packages/worker/src/__tests__/e2e/benchmarks/ in the Momental monorepo.
Under the hood
What MCI gives the agent
Five MCP tool calls at session start. The agent sees the entire codebase as a queryable knowledge graph — not a directory tree.
code_find Exact symbol lookup by name. Returns file path, line, symbol ID for all matches across the repo.
code_symbol 360° view of a symbol: callers, callees, cluster membership, linked tasks and decisions.
code_blast Recursive blast radius from a symbol. Returns P0–P5 risk level, all affected files, suggested tests to run.
code_search Semantic + BM25 hybrid search across all symbols and comments. Finds conceptually related code, not just string matches.
code_claim Declare which files you’re editing. Surfaces live conflicts with other agents before you write a single line.
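The blast-radius idea behind code_blast can be approximated as a breadth-first walk over a reverse call graph. The sketch below is illustrative only: the graph, the symbol names, and the risk thresholds are invented, not MCI's actual P0–P5 scoring.

```typescript
// Illustrative blast radius: given a reverse call graph (symbol -> callers),
// collect every transitive caller, then map the count to a risk bucket.
// Thresholds here are invented for the sketch.
type CallerGraph = Map<string, string[]>;

function blastRadius(graph: CallerGraph, root: string): { affected: string[]; risk: string } {
  const seen = new Set<string>();
  const queue = [root];
  while (queue.length > 0) {
    const sym = queue.shift()!;
    for (const caller of graph.get(sym) ?? []) {
      if (!seen.has(caller)) {
        seen.add(caller);
        queue.push(caller);
      }
    }
  }
  const n = seen.size;
  const risk = n > 100 ? "P0" : n > 25 ? "P1" : n > 5 ? "P2" : "P4";
  return { affected: [...seen], risk };
}

// Tiny invented graph around the recordAsync example:
const g: CallerGraph = new Map([
  ["recordAsync", ["trackUsage", "completeWithModel"]],
  ["trackUsage", ["handleRequest"]],
]);
const result = blastRadius(g, "recordAsync"); // 3 transitive callers
```

The real tool also attaches affected files and suggested tests to each node; the walk itself is the cheap part once the graph exists.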
What we measure
Beyond code — the full picture
MCI combines code intelligence with organizational memory. These benchmarks measure all of it.
Cross-file caller graph
BM25 + pgvector hybrid search. Every function, every caller, across the whole repo — not just the open file.
Blast radius scoring
P0–P5 risk level before you touch a line. Know exactly which tests to run and which callers to notify.
Organizational decisions
DECISION atoms surface the why behind every architectural choice — not just what the code does, but why it was built that way.
Multi-agent coordination
Live file claiming and conflict detection. Multiple agents working the same repo without stepping on each other.
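The hybrid search mentioned above (lexical BM25 plus vector similarity) is commonly fused with a weighted sum per candidate. A toy sketch, with invented weights and scores; MCI's actual fusion may differ.

```typescript
// Toy hybrid ranking: combine a lexical (BM25-like) score and a semantic
// (vector-similarity) score with a weighted sum, then sort descending.
interface Scored {
  id: string;
  lexical: number;  // normalized BM25 score, 0..1
  semantic: number; // normalized cosine similarity, 0..1
}

function hybridRank(results: Scored[], alpha = 0.5): string[] {
  const combined = (r: Scored) => alpha * r.lexical + (1 - alpha) * r.semantic;
  return [...results].sort((a, b) => combined(b) - combined(a)).map(r => r.id);
}

const ranked = hybridRank([
  { id: "recordAsync", lexical: 0.9, semantic: 0.8 },
  { id: "recordSync", lexical: 0.7, semantic: 0.3 },
]); // "recordAsync" ranks first
```

The point of the fusion is that a symbol can win on conceptual similarity even when the query string never appears in it, and vice versa.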
Give your agents the context they need.
MCI is live. Connect your repo and see the caller graph in minutes.