Current text-to-video models can make individual frames look convincing while still getting simple interactions wrong: objects move before contact, an intended action is skipped, a placed object keeps drifting, or a support relation breaks. Our starting point is that standard frame-first denoising updates every latent region at every step, even when the prompt implies that only a local interaction should be active. We introduce Event-Driven Video Generation (EVD), a small DiT-compatible intervention that gives the sampler an explicit event signal. A lightweight head predicts token-level event activity; training losses tie that activity to latent state change; and event-gated sampling, with hysteresis and an early-step schedule, applies the update field mainly where an interaction is forming. On EVD-Bench, EVD improves human preference and VBench dynamics for state persistence, spatial accuracy, support relations, and contact stability, while keeping appearance quality comparable to the base model. The results suggest that a modest amount of event structure can correct several interaction failures that otherwise remain hidden behind good frame-level appearance.
Event-Driven Video Generation
Current text-to-video models can make individual frames look convincing while still getting simple interactions wrong: objects move before contact, an intended action is skipped, a placed object keeps drifting, or a support relation breaks.
- Preview

- Year
- 2026
- Hosting
- Full text hostedCC-BY-4.0
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2603.13402CC-BY-4.0
- TL;DR
- Semantic Scholar