Probe Choice Changes Canary-Memorization Verdicts: Three Post-Hoc Disagreement Case Studies in a Text-Dominant LoRA-Tuned Autoregressive Testbed


Year: 2026
Abstract & full text: arxiv.org/abs/2606.31168 (CC-BY-4.0)

Abstract

We audit a fixed prefix-window mean-NLL memorization probe (K=20) on a Qwen2.5-VL-7B canary testbed and report three post-hoc cases where it disagrees with full-span secret NLL or greedy exact-recall. C3 (false negative, window truncation): damage lands on hex tokens outside the K=20 window, so the probe stays flat while hit@1 drops. C4 (false positive, non-secret drift): the probe moves, but approximately 99% of that movement sits on the non-secret preamble; the secret span and hit@1 are unchanged. C5 (ambiguous in-window drop): the probe falls on an undertrained baseline while full-span hex is positive and hit@1=0. Recommendation: report (i) full-span secret NLL, (ii) a span-localised decomposition, (iii) behavioural exact-recall at k>=4, and (iv) decoy probes before asserting secret-specificity. Evidence comes from controlled canaries in one backbone; magnitudes are testbed-specific.
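The gap between a prefix-window probe and a full-span metric can be illustrated with a minimal sketch. It assumes the probe averages per-token NLL over the first K positions of the scored sequence (preamble followed by secret); that reading, the function names, the span indices, and the toy NLL values are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean; NaN for an empty span."""
    return sum(values) / len(values) if values else float("nan")


def prefix_window_nll(nll: Sequence[float], k: int = 20) -> float:
    """Mean NLL over the first k token positions (assumed probe definition)."""
    return mean(nll[:k])


def full_span_secret_nll(nll: Sequence[float], start: int, end: int) -> float:
    """Mean NLL over the whole secret span [start, end)."""
    return mean(nll[start:end])


def window_delta_by_span(
    before: Sequence[float],
    after: Sequence[float],
    start: int,
    end: int,
    k: int = 20,
) -> Dict[str, float]:
    """Split the summed NLL change inside the window into secret and non-secret parts."""
    parts = {"secret": 0.0, "non_secret": 0.0}
    for i in range(min(k, len(before), len(after))):
        key = "secret" if start <= i < end else "non_secret"
        parts[key] += after[i] - before[i]
    return parts


# Toy sequence: 12 preamble tokens, then an 18-token secret (positions 12..29).
SECRET_START, SECRET_END = 12, 30
baseline = [1.0] * 30

# Truncation-style case: NLL rises only on secret tokens beyond position 20.
late_damage = [1.0] * 22 + [3.0] * 8

# Drift-style case: NLL rises only on the preamble; the secret is untouched.
preamble_drift = [1.5] * 12 + [1.0] * 18

probe_base = prefix_window_nll(baseline)
probe_late = prefix_window_nll(late_damage)
span_base = full_span_secret_nll(baseline, SECRET_START, SECRET_END)
span_late = full_span_secret_nll(late_damage, SECRET_START, SECRET_END)
drift_parts = window_delta_by_span(baseline, preamble_drift, SECRET_START, SECRET_END)
```

In this toy setup the windowed probe is identical for `baseline` and `late_damage` even though the full-span secret NLL rises, and the whole windowed change for `preamble_drift` is attributed to non-secret positions. Both effects are properties of the constructed inputs only and say nothing about the magnitudes observed in the testbed.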