Before and After Temperature: A Distributional View of Creative LLM Generation

Reference-free evaluation of large language model (LLM) creativity relies on perplexity, entropy, and top-1 margin. We show that a much stronger signal lives one step earlier in the pipeline: in how sampling temperature \emph{reshapes} the model's token distribution before the next token is drawn. On Llama-3.1-8B-Instruct generations of 500 open-ended creative prompts at $T \in \{0.3, 0.8, 1.5\}$, a single per-token feature derived from this reshaping predicts the within-prompt creativity rank at Spearman $\rho{=}0.918$ against an averaged gpt-4o\,/\,gemini-2.5-pro judge ($n{=}500$) and $\rho{=}0.870$ against a three-rater human-majority ranking ($n{=}150$). Each of four standard reference-free baselines (self-perplexity, mean predictive entropy, top-1 margin, gzip compression ratio) tops out at $|\rho|\!\approx\!0.76$ on both ground truths: a gap of $+0.165$ on averaged-LLM and $+0.110$ on human-majority, both far larger than the spread among the baselines themselves. The two ground-truth panels agree with each other at $\rho{=}0.83$, above the inter-human ceiling of $\rho{=}0.77$, so the comparison is not bottlenecked by judge noise. Mechanistically, the win comes from a sharp distributional signature of the incoherence regime: at $T{=}1.5$ the cumulative-mass width $n_{95}(q)$ inflates from ${\sim}1$ to ${\sim}131$ tokens and post-temperature mass leaks off the pre-temperature top-$90\%$ plausible set by about $13$ percentage points. The per-token aggregates do not separate $T{=}0.8$ from $T{=}0.3$; discriminating the two coherent regimes is left to sequence-level features.
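
To make the two distributional quantities named above concrete, the following is a minimal sketch for a single decoding step, not the paper's implementation. It assumes that $p$ is the softmax of the raw logits, that $q$ is the softmax of the logits divided by $T$, that $n_{95}(q)$ is the smallest number of tokens whose cumulative $q$-mass reaches $0.95$, and that the leak is $p(S) - q(S)$ for the top-$90\%$ set $S$ of $p$. The function names (`softmax`, `n_mass`, `plausible_set`, `leaked_mass`) and the toy logits are hypothetical, and the abstract does not specify how per-token values are aggregated into the final feature.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def n_mass(probs, mass=0.95):
    """Smallest number of tokens whose cumulative probability reaches `mass`."""
    cum = np.cumsum(np.sort(probs)[::-1])  # descending cumulative mass
    k = int(np.searchsorted(cum, mass)) + 1
    return min(k, len(cum))  # guard against rounding at the tail


def plausible_set(p, mass=0.90):
    """Indices of the smallest top-probability set of p that covers `mass`."""
    order = np.argsort(p)[::-1]
    return order[: n_mass(p, mass)]


def leaked_mass(p, q, mass=0.90):
    """Assumed definition: mass that p places on its top set minus what q places there."""
    keep = plausible_set(p, mass)
    return float(p[keep].sum() - q[keep].sum())


# Toy step (hypothetical logits): one dominant token and a flat tail.
logits = np.array([12.0] + [0.0] * 999)
p = softmax(logits)                    # pre-temperature distribution
q_hot = softmax(logits, 1.5)           # post-temperature at T = 1.5
q_cold = softmax(logits, 0.3)          # post-temperature at T = 0.3

width_before = n_mass(p)               # n_95(p)
width_hot = n_mass(q_hot)              # n_95(q) at T = 1.5
leak_hot = leaked_mass(p, q_hot)       # positive: mass moves off the plausible set
leak_cold = leaked_mass(p, q_cold)     # non-positive: mass concentrates on it
```

On this toy step the width grows and the leak is positive only at $T{=}1.5$, which mirrors the direction of the incoherence signature described above; the magnitudes depend entirely on the made-up logits and are not comparable to the reported $\sim 131$ tokens or $13$ percentage points.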