Supersede

Bounded-memory supersession environment: train/eval agents to use the current fact, not the stale one, across long multi-session interactions.

Type: RL Env
Publisher: Vedant
Runtime: multi-turn
License: unknown
Size: v0.1.0
Published: Jun 2026

supersede

Train and evaluate agents to use the current fact, not the stale one.

A bounded-memory environment over multi-session interactions: the agent sees one session at a time and maintains a capped notes memory (it never re-sees raw sessions), then must answer a question using the current value of a fact that was updated along the way.
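The interaction pattern described above can be sketched as a loop in which only the capped notes survive from one session to the next. This is a minimal illustration, not the environment's actual code: `run_bounded_episode`, `toy_update`, and `toy_answer` are hypothetical names, the toy functions stand in for model calls, and enforcing the cap by truncation is an assumption of this sketch.

```python
from typing import Callable, List


def run_bounded_episode(
    sessions: List[str],
    question: str,
    update_notes: Callable[[str, str], str],
    answer: Callable[[str, str], str],
    budget: int = 300,
) -> str:
    """Feed sessions one at a time; only the capped notes persist between turns."""
    notes = ""
    for session in sessions:
        # The agent sees its current notes plus exactly one new session.
        notes = update_notes(notes, session)
        # Enforce the character cap (truncation is an assumption of this sketch).
        notes = notes[:budget]
    # The raw sessions are no longer available; only the notes inform the answer.
    return answer(notes, question)


def _parse(notes: str) -> dict:
    """Parse 'key=value;key=value' notes into a dict (toy format)."""
    return dict(item.split("=", 1) for item in notes.split(";") if item)


def toy_update(notes: str, session: str) -> str:
    """Hypothetical stand-in for the model: keep only the latest value per key."""
    facts = _parse(notes)
    for line in session.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            facts[key.strip()] = value.strip()  # a newer value supersedes the old one
    return ";".join(f"{k}={v}" for k, v in facts.items())


def toy_answer(notes: str, question: str) -> str:
    """Hypothetical stand-in for the model: look the fact up in the notes."""
    return _parse(notes).get(question, "unknown")


sessions = ["city=Berlin", "pet=cat", "city=Lisbon"]
result = run_bounded_episode(sessions, "city", toy_update, toy_answer)
print(result)  # Lisbon
```

An agent that appended to its notes instead of overwriting would eventually hit the cap and risk keeping the stale value, which is the failure mode this environment measures.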

The failure it targets

On LongMemEval's knowledge-update questions, giving an agent bounded memory instead of full context sharply reduces supersession accuracy, and the gap persists even on the frontier model:

| Model | Full-context | Bounded memory |
|---|---|---|
| gpt-4.1-mini | 82% | 63% |
| gpt-4.1 | 91% | 64% |
| gpt-5.4 | 92% | 77% |

Even gpt-5.4 loses 15 percentage points (paired McNemar p=0.0033) and fails ~23% of supersession questions under bounded memory, while full-context accuracy saturates near 92%. The bottleneck is memory maintenance, not comprehension. (Details: docs/findings/ in the repo.)
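For readers unfamiliar with the paired test mentioned above: McNemar's test looks only at the questions where the two conditions disagree. The sketch below computes the exact (binomial) two-sided variant from the two discordant counts. The source does not say which variant was used, and the counts here are hypothetical, chosen purely for illustration and not taken from the study.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: questions answered correctly only under condition A
    c: questions answered correctly only under condition B
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical counts for illustration only (NOT the study's data).
only_full_context_correct = 10
only_bounded_correct = 2
p_value = mcnemar_exact(only_full_context_correct, only_bounded_correct)
print(f"p = {p_value:.4f}")
```

Questions that both conditions get right, or both get wrong, do not enter the statistic, which is why a paired test is appropriate when the same questions are run under both regimes.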

Usage

prime env install supersede
# bounded memory (the failure regime)
prime eval run supersede -m openai/gpt-4.1-mini -a '{"max_examples": 78}'
# full-context upper bound (for the gap)
prime eval run supersede -m openai/gpt-4.1-mini -a '{"full_context": true}'

The environment auto-downloads the LongMemEval knowledge-update data (MIT license) on first run. Arguments to load_environment:

| arg | default | meaning |
|---|---|---|
| question_type | knowledge-update | LongMemEval subset |
| max_examples | None | cap on tasks |
| budget | 300 | character cap on the agent's notes memory (bounded mode) |
| full_context | False | upper-bound mode: all sessions in context, single turn |
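When scripting runs over several argument combinations, it can help to build the `-a` JSON from a Python dict instead of hand-quoting it. The helper below is a hypothetical convenience, not part of the environment or the prime CLI; it only assembles the same command shape shown under Usage.

```python
import json
import shlex


def build_eval_command(model: str, env_args: dict) -> str:
    """Assemble a `prime eval run supersede` command line with JSON-encoded args."""
    argv = [
        "prime", "eval", "run", "supersede",
        "-m", model,
        "-a", json.dumps(env_args),  # Python True/False/None become JSON true/false/null
    ]
    # shlex.join quotes the JSON so the shell passes it as a single argument.
    return shlex.join(argv)


command = build_eval_command("openai/gpt-4.1-mini", {"max_examples": 78, "budget": 300})
print(command)
# prime eval run supersede -m openai/gpt-4.1-mini -a '{"max_examples": 78, "budget": 300}'
```

Encoding with `json.dumps` avoids a common mistake: writing Python literals such as `True` inside the JSON string, where the CLI examples use `true`.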

Reward

  • answered_current (+1): the final answer conveys the current/gold value (programmatic, ungameable matcher; no API needed).
  • stale_penalty (-1): the answer asserts a known superseded value — active only when the task ships stale_values (synthetic timelines; LongMemEval is gold-only).
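The two reward components above could be combined roughly as follows. This is a sketch under stated assumptions, not the environment's matcher: `normalize`, `mentions`, and `score` are hypothetical names, the whole-token substring match is a simplification, and summing the two components into one scalar is an assumption.

```python
import re
from typing import Iterable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def mentions(answer: str, value: str) -> bool:
    """True if the normalized value appears as whole tokens in the answer."""
    needle = normalize(value)
    if not needle:
        return False
    # Pad with spaces so "cat" does not match inside "category".
    return f" {needle} " in f" {normalize(answer)} "


def score(answer: str, gold: str, stale_values: Optional[Iterable[str]] = None) -> float:
    """answered_current (+1) plus stale_penalty (-1); summing is an assumption."""
    reward = 0.0
    if mentions(answer, gold):
        reward += 1.0
    # The penalty is active only when the task ships stale values.
    if stale_values and any(mentions(answer, stale) for stale in stale_values):
        reward -= 1.0
    return reward


current = score("You moved to Lisbon last year.", "Lisbon", ["Berlin"])
stale = score("You live in Berlin.", "Lisbon", ["Berlin"])
gold_only = score("You live in Berlin.", "Lisbon")  # no stale_values, so no penalty
print(current, stale, gold_only)  # 1.0 -1.0 0.0
```

A substring check like this cannot tell "moved from Berlin to Lisbon" apart from asserting Berlin as current; a production matcher would need to handle that case, which this sketch does not attempt.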

Status

Validated end-to-end under verifiers 0.1.14 against OpenAI: all 78 knowledge-update rollouts terminate cleanly and the environment reports 57.7% accuracy for gpt-4.1-mini (programmatic matcher), consistent with the offline harness's 63% (LLM judge). The remaining step is the Hub push (prime env push, which authenticates under your Prime Intellect account).