Attention Is a Zero-Sum Game: Precise vs More Context
Softmax attention distributes a fixed probability budget across all tokens. Every irrelevant token in the context window steals attention from relevant ones. The math explains why 47 fields in and 3 fields out isn't about saving tokens — it's about attention quality.
There's a common intuition in AI engineering: more context is better. Give the model everything. Let it figure out what's relevant. Models are smart — they'll ignore the noise.
This intuition is wrong, and the math of self-attention explains exactly why.
How attention actually works
In a transformer, self-attention computes how much each token should "attend to" every other token. For a query token q and a set of key tokens K, the attention weights are:
Attention(q, K, V) = softmax(qK^T / sqrt(d_k)) · V
The critical operation is softmax. It takes a vector of raw scores and converts them into a probability distribution:
softmax(z_i) = exp(z_i) / sum(exp(z_j)) for all j
This distribution sums to 1. Always. This is the zero-sum property: if one token gets more attention, other tokens must get less. The total attention budget is fixed at 1.0.
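The zero-sum property is easy to verify directly. A minimal softmax sketch (scores here are arbitrary illustration values, not from any real model):

```python
import math

def softmax(scores):
    # Exponentiate each score, then normalize so the weights sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([4.0, 3.8, 2.0])
assert abs(sum(weights) - 1.0) < 1e-9  # fixed budget: always sums to 1

# Raising one score necessarily lowers every other weight:
boosted = softmax([6.0, 3.8, 2.0])
assert boosted[0] > weights[0]
assert boosted[1] < weights[1] and boosted[2] < weights[2]
```

No token can gain attention without every other token losing some.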
The dilution problem
Suppose your agent queries a crypto API and receives a response with 47 fields: price, market_cap, total_volume, high_24h, low_24h, price_change_24h, circulating_supply, total_supply, max_supply, ath, ath_change_percentage, ath_date, atl, atl_change_percentage, atl_date, roi, last_updated, market_cap_rank, fully_diluted_valuation, price_change_percentage_24h, market_cap_change_24h, market_cap_change_percentage_24h, sparkline_7d, developer_score, community_score, liquidity_score, public_interest_score... and 20 more.
The agent needs 3: price, change_24h, market_cap.
When the model processes this response, its attention mechanism computes weights across all tokens — including the 44 irrelevant fields. Each field name, each value, each JSON delimiter consumes attention budget. The softmax denominator grows with every token:
attention_weight(relevant_token) = exp(score) / (exp(score) + sum_of_all_other_exp_scores)
More tokens in the denominator → smaller weight for each relevant token. This isn't theoretical. It's arithmetic.
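The denominator growth can be sketched directly. The scores below are illustrative assumptions (4.0 for the relevant token, 1.5 on average for noise), not measurements from a real model:

```python
import math

def weight_of_relevant(n_irrelevant, relevant_score=4.0, noise_score=1.5):
    # Softmax weight on the relevant token as irrelevant tokens accumulate
    # in the denominator.
    denom = math.exp(relevant_score) + n_irrelevant * math.exp(noise_score)
    return math.exp(relevant_score) / denom

prev = 1.0
for n in (0, 10, 44, 100):
    w = weight_of_relevant(n)
    assert w <= prev  # weight shrinks monotonically as noise grows
    prev = w
```

Every appended field lowers the weight on the token that matters; the decline never reverses.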
A concrete example
Consider a simplified case. Your model is generating text about "current BTC trend." It needs to attend to the price, change_24h, and market_cap fields. Let's say:

- Score for price: 4.0
- Score for change_24h: 3.8
- Score for market_cap: 2.0
- Score for each of the 44 irrelevant fields: 1.5 (on average)
With just the 3 relevant fields:
softmax(4.0) = exp(4.0) / (exp(4.0) + exp(3.8) + exp(2.0))
= 54.6 / (54.6 + 44.7 + 7.4)
= 54.6 / 106.7
= 0.512
With all 47 fields:
softmax(4.0) = exp(4.0) / (exp(4.0) + exp(3.8) + exp(2.0) + 44 × exp(1.5))
= 54.6 / (54.6 + 44.7 + 7.4 + 44 × 4.48)
= 54.6 / (106.7 + 197.1)
= 54.6 / 303.8
= 0.180
The attention weight on the most relevant token dropped from 0.512 to 0.180 — a 65% reduction. The model now gives that token barely a third of the attention it had before.
This is the dilution effect. It's not about the model being "confused." It's the softmax denominator doing exactly what the math says it should do.
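The worked example above can be reproduced in a few lines. The scores are the same illustrative values as before (4.0, 3.8, 2.0, and 1.5 on average for the 44 irrelevant fields):

```python
import math

scores = {"price": 4.0, "change_24h": 3.8, "market_cap": 2.0}
noise = [1.5] * 44  # 44 irrelevant fields at the average score

def attention_on(score, all_scores):
    # Softmax weight of one score against the full score set.
    return math.exp(score) / sum(math.exp(s) for s in all_scores)

relevant_only = list(scores.values())
w_small = attention_on(4.0, relevant_only)         # ~0.512
w_full = attention_on(4.0, relevant_only + noise)  # ~0.180

assert abs(w_small - 0.512) < 1e-3
assert abs(w_full - 0.180) < 1e-3
```

Changing nothing but the number of noise fields cuts the relevant token's weight by roughly two thirds.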
Multi-head attention: doesn't this help?
Transformers use multi-head attention — multiple independent attention computations in parallel. Each head can learn different patterns: one might focus on syntax, another on semantics, another on positional relationships.
But each individual head still uses softmax. Each head still has a fixed attention budget of 1.0. Multiple heads diversify what gets attended to, but they don't escape the dilution problem within each head.
In practice, research has shown that heads specialize. Some heads attend to local syntax. Others attend to semantic relationships. The heads that matter for your task — the ones attending to factual content — are still subject to the same zero-sum constraint.
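A quick sketch makes the per-head constraint concrete. The random scores stand in for the independent query/key projections of each head; they are synthetic, not from a trained model:

```python
import math
import random

random.seed(0)
n_heads, n_tokens = 8, 47

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Independent raw scores per head, as if from separate projections.
heads = [[random.gauss(0, 1) for _ in range(n_tokens)]
         for _ in range(n_heads)]

for head_scores in heads:
    weights = softmax(head_scores)
    # Each head's budget is exactly 1.0 -- no head escapes the constraint.
    assert abs(sum(weights) - 1.0) < 1e-9
```

Eight heads means eight separate budgets of 1.0, each diluted independently by the same 47 tokens.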
"Lost in the Middle" is a dilution symptom
The "Lost in the Middle" finding (Liu et al., 2023) showed that LLMs recall information at the beginning and end of long contexts much better than information in the middle. The recall accuracy for facts placed in the middle of a 20-document context can drop by 20-30% compared to facts at the beginning or end.
This is attention dilution in action. Position encoding gives beginning and end tokens structural advantages. When the context is long, middle tokens compete for attention against a larger pool, and their positional encoding provides less signal. The result: the model "forgets" middle content even within a single forward pass.
Implication: if you stuff a 47-field API response into the middle of a long agent conversation, the model will attend to it poorly — not because it's not smart enough, but because the math is working against it.
The information density argument
There's a better frame than "fewer tokens": information density. What matters isn't how many tokens enter the context — it's the ratio of useful signal to total tokens.
information_density = useful_tokens / total_tokens
A 47-field response where 3 fields matter has a density of ~6%. A curated 3-field response has a density of ~100%. The model's task gets harder as density falls: at higher density, the relevant tokens command more attention weight, and the model's output distribution shifts toward accurate responses.
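The density numbers follow directly from the field counts. Assuming a rough ~12 tokens per serialized field (the estimate used later in this article):

```python
def information_density(useful_tokens, total_tokens):
    # Ratio of useful signal to total tokens in the context.
    return useful_tokens / total_tokens

raw_density = information_density(3 * 12, 47 * 12)       # 3/47, ~6%
curated_density = information_density(3 * 12, 3 * 12)    # 100%

assert 0.06 < raw_density < 0.07
assert curated_density == 1.0
```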
This is why schema learning changes output quality, not just output cost. When Harbor applies a learned schema — keeping price, change_24h, and market_cap while dropping 44 other fields — it's not a compression optimization. It's an attention optimization. The transformer's limited attention budget is now concentrated on tokens that matter.
From architecture to engineering
The attention mechanism tells us something precise: context quality is not about what you add, it's about what you remove.
Every token you inject into the context window competes for attention. Every irrelevant field, every stale cache entry, every verbose error message dilutes the model's focus on what actually matters for the current task.
This inverts the common "give the model more context" intuition. The engineering principle is:
Minimize the denominator of softmax. Maximize the information density of the context window. Every token should earn its place.
This is the principle behind Harbor's schema learning. After a few API calls, Harbor observes which fields the agent actually uses and learns a schema: raw → normalized → compact → summary. Each density layer increases the signal-to-noise ratio. The agent doesn't need to understand attention mechanics — it just gets better answers because the math is working for it instead of against it.
Before: 47 fields × 12 tokens/field = 564 tokens, density ~6%
After: 3 fields × 12 tokens/field = 36 tokens, density ~100%
Same information. 94% fewer tokens competing for attention. Mathematically guaranteed better focus.
The compound effect
This matters more than you'd think when you consider that agents make multiple API calls per session, context accumulates over a conversation, and memory from previous sessions is injected alongside fresh data.
An agent running a 10-call analysis with 47-field responses accumulates ~5,640 tokens of API data, of which ~360 are useful. That's 5,280 wasted tokens actively degrading the model's ability to synthesize across the calls it actually cares about.
With schema learning applied: 360 tokens. Every one relevant. The model's attention is undiluted, and its cross-call synthesis — connecting patterns across multiple API responses — improves because the relevant tokens aren't lost in noise.
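The session-level arithmetic, using the same ~12 tokens-per-field estimate as above:

```python
calls = 10
raw_tokens = calls * 47 * 12     # 5,640 tokens of API data over the session
useful_tokens = calls * 3 * 12   # 360 tokens the agent actually uses
wasted = raw_tokens - useful_tokens  # 5,280 tokens of pure dilution

assert (raw_tokens, useful_tokens, wasted) == (5640, 360, 5280)
```

With curation, the session carries only the 360 useful tokens; without it, those tokens compete against 5,280 others in every attention pass.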
The architecture rewards precision. Build for it.