Everyone agrees agents need memory.

Nobody agrees on what that means.

Ask ten teams how they handle agent memory and you’ll get ten different answers: a vector store, a Notion doc, a system prompt, a JSON file, a RAG pipeline, an Obsidian vault. They call it all “memory.” Most of it isn’t.

Memory is what turns capability into intelligence. An isolated model - no matter how capable - is like Einstein stripped of everything he ever learned: brilliant in the moment, but unable to build on anything.

True intelligence requires operating over long cycles of thinking that span days, weeks, months. A context window was never going to be enough - sustaining thought across that horizon is a different architectural requirement entirely.

So what does long-term memory actually require? Here’s what we’ve learned building it.

1. Persistence Across Sessions

The most basic requirement. Memory that disappears when the session ends isn’t memory.

Most agent setups fail here immediately. The context is loaded at the start, used during the run, and discarded at the end. Next session: start from zero. The agent is smart. The system isn’t learning.

Long-term memory must survive session boundaries by default - not as an optional export, not as a log file someone has to remember to check. It has to be structural, automatic, and unconditional.
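One way to make persistence structural is to route every memory write through durable storage the moment it happens, so there is no export step to forget. A minimal sketch - the `MemoryStore` class and its schema are illustrative assumptions, not any product’s API:

```python
import json
import os
import sqlite3
import tempfile

class MemoryStore:
    """Memory that survives session boundaries: every write hits disk
    immediately, so nothing depends on a session-end export step."""

    def __init__(self, path: str):
        # A file-backed sqlite database persists across processes.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS atoms (id INTEGER PRIMARY KEY, body TEXT)"
        )

    def write(self, atom: dict) -> None:
        self.conn.execute("INSERT INTO atoms (body) VALUES (?)", (json.dumps(atom),))
        self.conn.commit()  # durable before the session can end

    def read_all(self) -> list[dict]:
        rows = self.conn.execute("SELECT body FROM atoms").fetchall()
        return [json.loads(r[0]) for r in rows]

# Two separate connections to the same file simulate two sessions.
path = os.path.join(tempfile.mkdtemp(), "memory.db")
session1 = MemoryStore(path)
session1.write({"type": "decision", "text": "Use server-side rendering"})

session2 = MemoryStore(path)  # a fresh "session", same file
recalled = session2.read_all()
```

The point is where the commit sits: inside the write path, not at session teardown. The next session starts from what the last one learned, unconditionally.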

2. Structure, Not Storage

Storage is cheap. Structure is hard.

A flat file of notes is not memory. It’s an archive. Archives don’t compose - you can’t ask them what they know about a specific constraint, or whether a new decision conflicts with an old one, or which learnings apply to the task you’re about to run.

Structured memory means typed units: a decision has different properties than a goal, which is different from a raw data point. Each type has its own lifecycle - facts age differently than principles, customer signals age differently than architectural constraints.

Structure is also what enables automatic connections. Relationships between data points, entities, and concepts can be detected automatically - not because someone drew a diagram, but because the types and content make the links inferrable. A decision about pricing connects to a customer signal about willingness to pay. A principle about engineering connects to a task that touches the relevant codebase.

Structure is what makes memory queryable. And queryability is what makes memory useful to agents at the moment they need it - including entity-aware semantic search: ask about a person, a project, or a decision, and get precise answers with sources, not a list of documents to wade through.

3. Generalizability, Not Just Facts

There are two kinds of things worth storing.

Facts: “The customer meeting was on February 12th.” “The API response time is 240ms.” These are useful. They’re also cheap and specific.

Learnings: “Agent performance degrades when type definitions are omitted from context, especially for multi-file refactors.” “Customers in the mid-market segment care more about integrations than price.” These are harder to write. They require synthesis. And they’re worth ten times as much.

The difference between a memory layer and a continual learning system is exactly this: whether you’re storing raw facts or generalizable learnings that transfer across contexts. Facts answer questions. Learnings change behavior.

And this is where self-improving loops become the point. Agents shouldn’t just read memory - they should write back. As agents complete tasks, they update and refine what the team knows. The graph improves through use, not just through manual curation. Over time, agents get smarter together — not because any single model updated its weights, but because the shared context they draw from keeps improving.
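The write-back loop reduces to: when a task finishes, distill a learning and submit it to the shared graph instead of discarding it with the session. A sketch with hypothetical field names:

```python
def finish_task(graph: list[dict], task_result: dict) -> None:
    """After each task, write the distilled learning back to shared
    memory so the next agent starts from a smarter baseline."""
    learning = {
        "type": "learning",
        "text": task_result["lesson"],
        "source_task": task_result["task_id"],
        "status": "pending_review",  # not authoritative until a human approves
    }
    graph.append(learning)

shared_graph: list[dict] = []
finish_task(shared_graph, {
    "task_id": "refactor-42",
    "lesson": "Include type definitions in context for multi-file refactors",
})
```

Note the `pending_review` status: write-back and trust are separate steps, which is exactly where the next requirement comes in.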

4. Shared (Both Read and Write)

Memory that only humans can write is a bottleneck. Memory that only agents can read is a dead end.

True shared memory means every participant in the system - humans and agents alike - can contribute to it and draw from it. A human typing in a planning session and an agent adding learnings at the end of a task need to write to the same place, in the same structure, with the same guarantees about retrieval.

This is the architectural requirement that most teams underestimate. They build memory for humans and pipe a slice of it to agents. Or they let agents write to a scratch space that humans never look at. Neither compounds.

Shared memory - genuinely shared - is what makes organizational learning possible. The graph grows from both directions. What a human decides and what an agent discovers live in the same place, linked to the same context, available to whoever needs it next.

5. Human-in-the-Loop, Not Human-as-Bottleneck

Shared write access doesn’t mean unreviewed write access.

Before agent learnings become part of the team’s context, humans need to be able to review them - approve, edit, or reject. That’s the difference between a context graph that gets smarter and one that silently accumulates errors.

This also means agents shouldn’t be able to promote context that hasn’t earned trust. The review layer isn’t optional - it’s what makes the self-improving loop safe enough to actually run.

The goal is a human who stays informed without becoming a bottleneck. Agents write back continuously. Humans review on their own cadence. Nothing unreviewed gets treated as authoritative.

6. Conflict Detection

Memory that can contradict itself isn’t trustworthy.

One agent says the product is priced at $49/month. Another has learned it’s free forever. Without conflict detection, both atoms sit in the graph, and the next agent that reads them picks one arbitrarily - or worse, synthesizes a confident answer from contradictory premises.

Long-term memory needs to detect when new information conflicts with existing information - and surface that conflict for a human to resolve, not silently suppress it. This is the equivalent of a merge conflict in git. You want to discover the contradiction in the review queue, not in production.

The right design doesn’t hide conflicts. It makes them visible, explains what’s in tension, and puts the resolution in human hands.
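A crude version of that design: when a new atom arrives about a subject the graph already covers, compare the claims and queue the contradiction instead of silently accepting or overwriting. The matching rule here (same subject key, different value) is a deliberate simplification of real semantic matching:

```python
def ingest(graph: dict[str, dict], new_atom: dict,
           conflicts: list[tuple[dict, dict]]) -> None:
    """Surface contradictions for human resolution, like a git merge
    conflict, instead of letting agents pick an answer arbitrarily."""
    subject = new_atom["subject"]
    existing = graph.get(subject)
    if existing is not None and existing["value"] != new_atom["value"]:
        # Don't overwrite, don't suppress: record what's in tension.
        conflicts.append((existing, new_atom))
    else:
        graph[subject] = new_atom

graph: dict[str, dict] = {}
conflicts: list[tuple[dict, dict]] = []

ingest(graph, {"subject": "pricing", "value": "$49/month"}, conflicts)
ingest(graph, {"subject": "pricing", "value": "free forever"}, conflicts)
```

The second ingest neither wins nor loses: the graph keeps the existing atom, and the conflict pair waits in the review queue - discovered there, not in production.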

7. Temporal Awareness

Not all memory ages the same way.

A principle - “we prioritize user privacy over conversion” - might be valid for years. A data point - “our conversion rate was 3.2% last Tuesday” - has a short useful life. A decision - “we’re going with server-side rendering for the dashboard” - might get revisited in six months.

Long-term memory requires a model of freshness. Not just timestamps — decay logic. Confidence should degrade over time for volatile types. Outdated information should get flagged automatically, before it gets injected into an agent’s context as if it were current.

Memory without decay eventually becomes noise. And noise is worse than no memory at all, because it creates false confidence.
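Decay logic can be as simple as a per-type half-life: confidence in a volatile data point halves every few days, while a principle barely moves over the same window. The half-lives below are made-up numbers for illustration:

```python
# Hypothetical half-lives in days: how fast confidence in each type decays.
HALF_LIFE_DAYS = {
    "data_point": 7,      # "conversion was 3.2% last Tuesday"
    "decision": 180,      # "server-side rendering for the dashboard"
    "principle": 3650,    # "privacy over conversion"
}

def confidence(kind: str, age_days: float, initial: float = 1.0) -> float:
    """Exponential decay: confidence halves once per half-life."""
    return initial * 0.5 ** (age_days / HALF_LIFE_DAYS[kind])

def is_stale(kind: str, age_days: float, threshold: float = 0.5) -> bool:
    # Flag outdated atoms before they're injected as if they were current.
    return confidence(kind, age_days) < threshold
```

A 30-day-old metric falls well below threshold and gets flagged; a 30-day-old principle is still near full confidence. Timestamps alone can’t make that distinction - the type has to carry its own decay rate.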

8. Full Traceability

Memory has to be auditable.

Who wrote this atom? When? Which version is current? What did it say before it was edited? If a decision gets updated, can you trace why?

Every edit, every version, every author - a full audit trail. Not because compliance requires it, but because trustworthy memory is transparent memory. When an agent makes a decision based on a principle in the graph, you need to be able to follow the chain back: where did that principle come from, who established it, and does it still reflect what the team actually believes?

Traceability is also what makes ownership meaningful. Memory without ownership is a shared doc where everyone writes in the same font. Identity - knowing which atom belongs to which team, project, or decision maker - is what determines authority when two atoms conflict.
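An append-only version history gives you that audit trail almost for free: edits never overwrite, they append, so every prior state and every author stays recoverable. A minimal sketch with hypothetical field names:

```python
from datetime import datetime, timezone

class VersionedAtom:
    """Edits append new versions; nothing is ever overwritten, so the
    chain from the current text back to its origin is always walkable."""

    def __init__(self, text: str, author: str):
        self.versions: list[dict] = []
        self._append(text, author)

    def _append(self, text: str, author: str) -> None:
        self.versions.append({
            "text": text,
            "author": author,
            "at": datetime.now(timezone.utc),
        })

    def edit(self, text: str, author: str) -> None:
        self._append(text, author)

    @property
    def current(self) -> dict:
        return self.versions[-1]

atom = VersionedAtom("We prioritize user privacy over conversion", author="maria")
atom.edit("We prioritize user privacy over conversion, except in EU trials",
          author="agent-2")
```

Who wrote it, when, which version is current, and what it said before the edit - all answerable by walking `versions`. Authorship on every version is also what gives conflicts an owner to defer to.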

9. Dynamic Delivery, Not Bulk Injection

Storing memory is one problem. Delivering it usefully is another.

The naive approach: dump everything into the context window and let the model figure it out. This fails at scale. Context windows are zero-sum. Injecting the full knowledge graph into every agent turn wastes tokens, increases cost, and degrades output quality - because the signal gets buried in noise.

The right model is dynamic context delivery: agents pull exactly the context they need, per turn, based on what they’re doing right now. The memory system becomes a precision tool, not a firehose. Fewer tokens wasted. Better outputs. Lower cost.

This is only possible if the memory is structured. Unstructured archives can’t be queried selectively. Typed, linked atoms can.
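Selective delivery means ranking atoms against the current task and filling a fixed token budget with only the best matches. The scoring function here is a toy keyword overlap standing in for real semantic retrieval, and the word-count token estimate is equally crude:

```python
def select_context(atoms: list[dict], task: str, budget_tokens: int) -> list[dict]:
    """Pull only the atoms relevant to this turn, best-first, until the
    token budget is spent - instead of injecting the whole graph."""
    task_words = set(task.lower().split())

    def relevance(atom: dict) -> int:
        # Toy relevance: keyword overlap with the task description.
        return len(task_words & set(atom["text"].lower().split()))

    chosen, used = [], 0
    for atom in sorted(atoms, key=relevance, reverse=True):
        cost = len(atom["text"].split())  # crude token estimate
        if relevance(atom) == 0 or used + cost > budget_tokens:
            continue
        chosen.append(atom)
        used += cost
    return chosen

atoms = [
    {"text": "pricing decision: charge 49 per month"},
    {"text": "dashboard uses server side rendering"},
    {"text": "mid market customers value integrations over pricing"},
]
context = select_context(atoms, task="revisit pricing for mid market", budget_tokens=20)
```

The rendering atom never makes it into context for a pricing task, however much room the budget has - which is the whole point: relevance first, budget second, bulk injection never.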

The Full Stack

Most memory implementations get one or two of these right.

They persist across sessions but store flat files. They have structure but no conflict detection. They’re readable by agents but writable only by humans. They store facts but never synthesize learnings. They capture everything but can’t deliver it selectively.

External memory that is persistent, structured, generalizable, genuinely shared, human-reviewed, conflict-aware, fresh, traceable, selectively delivered, and actionable - that’s what makes agents trustworthy enough to operate over time.

Language was the first shared memory system humanity built. It let humans build on each other’s thinking across generations.

We’re building the next version of that - for organizations running on agents.