Token Probability: The Case for Agent Memory

When an LLM generates a response, it samples from a probability distribution shaped by context. External memory doesn't help the model 'remember' — it shifts the probability distribution so that informed tokens become the most likely next output. This is context engineering at the token level.

Tags: token-probability, context-engineering, autoregressive, agent-infrastructure

Most discussions about AI agent memory frame it as "helping the model remember." This framing is anthropomorphic and imprecise. LLMs don't remember or forget — they compute probability distributions. Understanding what that means changes how you think about memory infrastructure.

The output is a probability distribution

At every generation step, a transformer produces a probability distribution over its vocabulary. If the vocabulary has 100,000 tokens, the model outputs 100,000 probabilities that sum to 1.0:

P(token | context) for each token in vocabulary

The model then samples from this distribution (with temperature, top-p, or other strategies) to select the next token. The generated text is a sequence of samples from a sequence of distributions.

Here's the key: the distribution is entirely determined by the context. Same context → same distribution. Different context → different distribution. There is no hidden state, no mood, no memory. Just a function from tokens to probabilities.
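The mechanics above can be sketched in a few lines. This is a toy illustration of softmax plus nucleus (top-p) sampling over a five-token vocabulary, not any particular model's implementation; the function names are invented for the sketch:

```python
import numpy as np

def next_token_distribution(logits, temperature=1.0):
    """Convert raw model logits into a probability distribution (softmax)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

def sample_top_p(probs, p=0.9, rng=None):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability exceeds p (nucleus sampling)."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]           # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()  # renormalize the nucleus
    return int(rng.choice(kept, p=kept_probs))

# Toy 5-token vocabulary: the same logits always yield the same distribution.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
probs = next_token_distribution(logits, temperature=0.8)
token = sample_top_p(probs, p=0.9)
```

Note that all the randomness lives in the sampling step; the distribution itself is a deterministic function of the context, which is exactly the point made above.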

Context shapes the distribution

Consider an agent asked: "What's the current BTC market outlook?"

Without memory context

The model sees: [system prompt] [user: What's the current BTC market outlook?]

Its training data contains thousands of crypto analyses. The probability distribution is broad — "bullish" and "bearish" have roughly similar probability. "Bitcoin is currently..." leads to a generic completion. The model generates a safe, hedged, non-specific answer because no specific information pushes the distribution toward a particular conclusion.

P("bullish")     ≈ 0.15
P("bearish")     ≈ 0.12
P("volatile")    ≈ 0.18
P("uncertain")   ≈ 0.14
...distribution is flat, uncertain

With memory context

Now the model sees:

[system prompt]
[API response: BTC price $67,234, +2.34% 24h]
[memory: "BTC dominance rising to 58%. SOL underperforming vs ETH."
 — Claude Code, 33 minutes ago]
[user: What's the current BTC market outlook?]

The probability distribution shifts dramatically. "dominance rising" in the context makes "bullish" and "strengthening" high-probability tokens. The specific number "58%" makes the model likely to cite it. "SOL underperforming" makes relative comparisons likely.

P("bullish")        ≈ 0.31  (+0.16)
P("dominance")      ≈ 0.22  (was ~0.02)
P("strengthening")  ≈ 0.18  (was ~0.04)
P("uncertain")      ≈ 0.03  (-0.11)
...distribution is peaked, specific

The model didn't "remember" the previous analysis. It saw relevant tokens in its context window, and those tokens reshaped the probability distribution. The output went from generic to specific — not because the model got smarter, but because the input changed.

Memory infrastructure is distribution engineering

This reframing is important. When you build memory for agents, you're not building a "memory system" in any cognitive sense. You're building a context injection pipeline that shifts the model's output probability distribution toward more informed, more specific, more useful tokens.

The engineering question becomes precise: which tokens, injected into the context, produce the largest positive shift in output distribution?

This is measurable. Given a task, you can compare:

  • Output distribution without memory context (baseline)
  • Output distribution with memory context (treatment)
  • KL divergence between the two distributions (magnitude of shift)
  • Human evaluation of whether the shift improved output quality
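The comparison can be sketched directly. Using the illustrative probabilities from this article (with the remaining mass lumped into an "other" bucket, and the with-memory values assumed for the non-cited tokens), the KL divergence quantifies the shift:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q): how far distribution P has moved from baseline Q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Outcomes: bullish, bearish, volatile, uncertain, other (illustrative numbers).
baseline  = [0.15, 0.12, 0.18, 0.14, 0.41]   # without memory: flat
treatment = [0.31, 0.05, 0.10, 0.03, 0.51]   # with memory: peaked on "bullish"

shift = kl_divergence(treatment, baseline)   # magnitude of the distribution shift
```

A larger value means memory injection moved the distribution further from the uninformed baseline; whether that movement was an improvement still requires the human evaluation step.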

What makes a good memory injection?

From the probability perspective, effective memory context has specific properties:

1. High information content. "BTC dominance rising to 58%" is better than "BTC is doing well." The specific number creates a sharper peak in the output distribution — the model is likely to cite "58%" rather than generate a vague statement. In information-theoretic terms, high-information tokens (specific facts) reduce the entropy of the output distribution more than low-information tokens (vague summaries).

2. Task-relevant framing. The same fact framed differently shifts different parts of the distribution. "BTC dominance: 58%" (data point) vs. "BTC dominance is rising to 58%, suggesting capital rotation into BTC" (analysis). The second version shifts not just what the model says, but how it reasons. It pushes the distribution toward analytical tokens.

3. Recency signal. "33 minutes ago" calibrates the model's confidence. Without a timestamp, the model can't assess reliability. With it, the output distribution appropriately weighs the information — recent data gets cited with confidence, old data gets hedged.

4. Attribution. "— Claude Code" tells the model this came from another agent's analysis, not from the user or from verified data. This subtly shifts the distribution toward tokens like "previously noted" or "prior analysis suggests" rather than "the data shows." Attribution makes the model's epistemic framing more accurate.
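These four properties are easy to encode mechanically. A minimal sketch of rendering a memory entry with a specific fact, a recency signal, and attribution; the `format_memory` helper and its signature are assumptions for illustration, not an existing API:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def format_memory(fact: str, source: str, recorded_at: datetime,
                  now: Optional[datetime] = None) -> str:
    """Render a memory entry carrying a specific fact (property 1),
    a recency signal (property 3), and attribution (property 4).
    Hypothetical helper; not a real library function."""
    now = now or datetime.now(timezone.utc)
    age_min = int((now - recorded_at).total_seconds() // 60)
    return f'[memory: "{fact}"\n — {source}, {age_min} minutes ago]'

now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
entry = format_memory(
    "BTC dominance rising to 58%. SOL underperforming vs ETH.",
    "Claude Code",
    recorded_at=now - timedelta(minutes=33),
    now=now,
)
```

The result reproduces the context block shown earlier. Property 2 (task-relevant framing) is a property of the fact itself, so it belongs upstream, at the point where the memory is written.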

Cross-agent memory as distribution alignment

In a multi-agent setup, each agent produces outputs from its own probability distribution. Without shared memory, these distributions are independent — each agent's output is conditioned only on its own context, even when they're working on related tasks.

Shared memory creates distribution alignment across agents. When Agent A's analysis enters Agent B's context, Agent B's output distribution shifts toward tokens consistent with Agent A's findings. The agents converge toward coherent outputs without explicit coordination.

Agent A context: [data] [analysis: "dominance rising to 58%"]
Agent A distribution: peaked around "bullish", "rotation", "dominance"

Agent B context: [data] [memory from A: "dominance rising to 58%"]
Agent B distribution: also peaked around "bullish", "rotation", "dominance"

Without shared memory:

Agent B context: [data]
Agent B distribution: flat, generic — may contradict Agent A

This is the probabilistic argument for cross-agent memory. It's not about agents "sharing knowledge." It's about aligning their output distributions so they produce coherent, non-contradictory responses to related queries. The mechanism is simple: same relevant tokens in context → similar probability distributions → consistent outputs.
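The alignment mechanism can be illustrated with a toy stand-in for a model: a function that assigns probability mass by keyword counts. This is not a real language model, only a sketch showing that shared context tokens pull two distributions together:

```python
from collections import Counter
import math

VOCAB = ["bullish", "bearish", "rotation", "dominance", "uncertain"]

def toy_distribution(context: str):
    """Toy stand-in for a model: probability mass follows how often each
    vocabulary word appears in the context, plus uniform smoothing."""
    counts = Counter(w.strip(".,%\"") for w in context.lower().split())
    weights = [1 + 5 * counts[w] for w in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def overlap(p, q):
    """Bhattacharyya coefficient: 1.0 when two distributions are identical."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

memory = "dominance rising, rotation into BTC, bullish"
agent_a        = toy_distribution("BTC data " + memory)
agent_b_shared = toy_distribution("ETH data " + memory)
agent_b_alone  = toy_distribution("ETH data")

shared = overlap(agent_a, agent_b_shared)  # same memory tokens in both contexts
alone  = overlap(agent_a, agent_b_alone)   # no shared memory
```

With the shared memory injected, Agent B's distribution overlaps Agent A's more than it does without it, which is the convergence described above, in miniature.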

The context window is a probability lever

Putting this all together: for a fixed model, the context window is the only input to the probability distribution. Everything in the context window shapes the output. Therefore, controlling what enters the context window is the highest-leverage intervention you can make on model output quality.

This frames three infrastructure problems precisely:

Memory injection (auto-recall): Which past tokens should re-enter the context? The ones that shift the output distribution toward more informed, specific completions. Harbor's auto-recall injects relevant past analysis into meta.recalls[] — the model sees previous insights and its distribution shifts accordingly.

Schema learning (curation): Which data tokens should enter the context? Only the ones the agent actually uses. 47 fields with 3 useful = 44 fields competing for attention and diluting the distribution. After learning which fields matter, the context contains only high-signal tokens. The distribution peaks more sharply around useful outputs.

Access control (govern): Which tokens is each agent allowed to see? Different visibility levels produce different distributions. Agent A with full data produces detailed analysis. Agent B with a summary produces high-level overviews. Same model, different context, different probability distribution, different output — by design.
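As a concrete sketch, the three levers compose naturally into a single context-assembly step. Everything below is hypothetical: the function, the field names, and the `visibility` parameter are illustrations, not Harbor's actual API:

```python
def assemble_context(system_prompt: str, data: dict, recalls: list,
                     useful_fields: set, visibility: str) -> str:
    """Hypothetical context-injection pipeline: curation keeps only
    high-signal fields, access control decides how much this agent sees,
    and recall re-injects relevant past analysis."""
    # Curation: drop the fields the agent never uses (the 44 of 47 above).
    curated = {k: v for k, v in data.items() if k in useful_fields}
    # Access control: full data for some agents, a summary for others.
    if visibility == "full":
        data_block = "\n".join(f"{k}: {v}" for k, v in curated.items())
    else:
        data_block = f"[summary of {len(curated)} fields]"
    # Recall: relevant past analyses re-enter the context window.
    memory_block = "\n".join(f"[memory: {r}]" for r in recalls)
    return "\n".join([system_prompt, data_block, memory_block])

ctx = assemble_context(
    "You are a market analyst.",
    {"price": 67234, "change_24h": 2.34, "order_book_depth": "..."},
    ["BTC dominance rising to 58%. SOL underperforming vs ETH."],
    useful_fields={"price", "change_24h"},
    visibility="full",
)
```

Each branch of this function is a distinct lever on the output distribution: the same model, handed different `ctx` strings, produces different distributions by construction.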

From intuition to engineering

The shift from "help agents remember" to "shape output probability distributions via context injection" is the difference between building features and building infrastructure.

Features ask: "how do we add memory to this agent?" Infrastructure asks: "how do we control the probability distribution of every agent in the system?"

The transformer's autoregressive architecture makes this possible. Every token in the context window contributes to the next-token probability. By engineering what enters the context — what memories are injected, what fields are included, what density layer is used — you engineer the output. Not approximately. Mathematically.

The model is a function. Context is the input. The output distribution is the output. Memory infrastructure is input engineering.