Reward-Conditioned Attention: How Reward Design Shapes What Autonomous Driving Agents See

We investigate how reward design shapes the internal attention patterns of reinforcement learning agents trained for autonomous driving. Using three Perceiver-based agents that share an identical architecture and training data and differ only in their reward configurations, which range from basic violation penalties to continuous proximity penalties, we analyze cross-attention allocation across 50 real-world scenarios from the Waymo Open Motion Dataset. A central methodological finding is that naïve pooling of timesteps across episodes substantially underestimates the attention–risk relationship; within-episode correlation with Fisher z-transform aggregation is the appropriate statistic and reveals a robustly positive link between collision risk and agent-directed attention. Building on this validated methodology, we demonstrate two reward-conditioned effects. First, agents trained with navigation rewards allocate up to 2.0× more attention to GPS-path tokens than those trained with additional proximity penalties, and 4.7× more than agents with no navigation incentive, revealing that reward content directly determines which scene elements the encoder prioritizes. Second, continuous time-to-collision penalties create a learned vigilance prior: elevated resting agent surveillance maintained throughout collision-free phases. In several scenarios, the complete-reward and minimal-reward models exhibit opposite attention–risk correlation directions, demonstrating that reward design can qualitatively reverse attentional strategy rather than merely modulating its magnitude. These results suggest that attention analysis is a practical diagnostic for verifying that a reward function produces the intended representational behaviour in safety-critical RL systems.
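
As an illustration of the statistical point above, the following minimal sketch contrasts pooled correlation with within-episode correlation aggregated via the Fisher z-transform. It is not the authors' implementation: the toy data, the function names `pearson` and `fisher_mean`, and the use of an unweighted mean in z space are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_mean(rs, eps=1e-6):
    """Average correlations in Fisher z space (z = atanh r), then map back with tanh.

    Unweighted mean is an assumption; r is clipped so atanh stays finite.
    """
    zs = [math.atanh(max(-1 + eps, min(1 - eps, r))) for r in rs]
    return math.tanh(sum(zs) / len(zs))

# Hypothetical toy data: per-timestep (risk, attention) for two episodes.
# Within each episode attention rises with risk, but the episodes sit at
# different baseline attention levels.
episodes = [
    ([0.1, 0.2, 0.3, 0.4], [0.50, 0.52, 0.55, 0.56]),
    ([0.6, 0.7, 0.8, 0.9], [0.10, 0.13, 0.14, 0.17]),
]

# Within-episode statistic: one correlation per episode, aggregated in z space.
within = [pearson(risk, attn) for risk, attn in episodes]
aggregated = fisher_mean(within)

# Naive pooling: concatenate all timesteps and correlate once.
pooled_risk = [v for risk, _ in episodes for v in risk]
pooled_attn = [v for _, attn in episodes for v in attn]
pooled = pearson(pooled_risk, pooled_attn)

print(f"within-episode (Fisher z): {aggregated:+.3f}")
print(f"pooled across episodes:    {pooled:+.3f}")
```

In this deliberately extreme toy case, between-episode baseline differences make the pooled correlation negative even though both within-episode correlations are strongly positive; the abstract's claim concerns the milder case in which pooling underestimates a positive relationship.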