Revocable Learned State via Process Sidecars

Year: 2026
Hosting: full text hosted (CC-BY-4.0)

Abstract & full text: arxiv.org/abs/2606.30788 (CC-BY-4.0)

Abstract

Language models are often adapted in stages: a public skill phase, a private memory phase, and a later safety phase that learns to refuse outputs tied to the remembered entities. Revoking the memory after the safety phase is not the same problem as subtracting the memory update: the later safety optimizer has transported the memory direction. We introduce process sidecars, a two-coefficient edit family $\hat{\theta}(\lambda,\gamma)=\theta_{AMS}-\lambda\Delta_{M}-\gamma\hat{R}_{S\leftarrow M}$, with $\hat{R}_{S\leftarrow M}=\hat{J}_{S,\varepsilon}(\Delta_{M})-\Delta_{M}$, where $\hat{J}_{S,\varepsilon}$ is a centered secant through the realized future AdamW safety-training process. The implementation uses $\varepsilon=1$ at the natural memory-edit scale; it reuses $\theta_{AMS}$ as the positive endpoint and computes one additional safety trace at $\theta_{A}-\Delta_{M}$. We prove two things. First, the exact sidecar, using the true transported direction $R_{S\leftarrow M}$ rather than the secant estimate, at $(\lambda,\gamma)=(1,1)$ recovers the counterfactual safety-only oracle $\theta_{AS}$ up to second order; the proof treats AdamW as an augmented-state map over parameters, first moments, and second moments. Second, this process information is necessary: whenever future safety training bends the memory direction, every scalar task-arithmetic edit leaves first-order counterfactual error, while the process-sidecar edit is second-order accurate. Across three models, the validation-selected 2D edit improves held-out refusal closure over naive task arithmetic in all trials, and over the $\gamma=\lambda$ process-JVP subfamily (the diagonal slice of the cached 2D grid) in all paired trials.
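
The edit family can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes the centered secant is the difference of two safety-training runs started at $\theta_{A}+\Delta_{M}$ and $\theta_{A}-\Delta_{M}$, divided by $2\varepsilon$ with $\varepsilon=1$, and it swaps AdamW for plain gradient descent on a made-up non-quadratic loss. The names `safety_train`, `theta_A`, `delta_M`, and `sidecar_edit` are hypothetical.

```python
import numpy as np

def safety_train(theta, steps=50, lr=0.1):
    """Hypothetical stand-in for the safety phase.

    Plain gradient descent on a toy loss 0.5*theta^T H theta + 0.025*sum(theta^4).
    The paper's safety phase uses AdamW; this sketch does not.
    """
    H = np.diag([1.0, 0.2, 0.05])
    theta = theta.copy()
    for _ in range(steps):
        grad = H @ theta + 0.1 * theta**3
        theta = theta - lr * grad
    return theta

# Assumed toy parameters: theta_A after the public skill phase,
# delta_M the private memory update.
theta_A = np.array([1.0, -0.5, 0.8])
delta_M = np.array([0.3, 0.4, -0.2])

# Positive endpoint: the deployed model, safety-trained after memory.
theta_AMS = safety_train(theta_A + delta_M)
# One additional safety trace at the negative endpoint.
theta_neg = safety_train(theta_A - delta_M)

# Centered secant with epsilon = 1, then the transported-minus-original residual.
J_hat = (theta_AMS - theta_neg) / 2.0
R_hat = J_hat - delta_M

def sidecar_edit(lam, gam):
    """Two-coefficient edit: theta_AMS - lam * delta_M - gam * R_hat."""
    return theta_AMS - lam * delta_M - gam * R_hat

# Counterfactual oracle: safety training with the memory phase never applied.
theta_AS = safety_train(theta_A)

err_naive = np.linalg.norm(sidecar_edit(1.0, 0.0) - theta_AS)    # task arithmetic
err_sidecar = np.linalg.norm(sidecar_edit(1.0, 1.0) - theta_AS)  # process sidecar
print("naive task-arithmetic error:", err_naive)
print("process-sidecar error:      ", err_sidecar)
```

Under this reading, the $(\lambda,\gamma)=(1,1)$ edit reduces algebraically to the midpoint of the two safety traces, so the term linear in $\Delta_{M}$ cancels and only higher-order terms remain. The $\gamma=0$ edit subtracts the untransported $\Delta_{M}$ and keeps a first-order error wherever the toy safety map contracts or bends the memory direction.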