Stateful Token Reduction for Long-Video Hybrid VLMs

Year
2026

Abstract & full text
arxiv.org/abs/2603.00198

Abstract

Token reduction accelerates long-video vision-language models (VLMs), but existing methods target Transformers, where reduction is treated as token pruning. We study token reduction in hybrid Mamba-Transformer VLMs and find that it is stateful: Mamba layers maintain a recurrent state that accumulates information from earlier tokens, allowing discarded tokens to persist, so reduction behaves more like compression than dropping. We support this view with a representation-based probing method measuring how much information from discarded tokens is retained, and analyze layer-wise sparsity and cross-layer importance stability. Our findings show importance is sparse within layers but unstable across layers, making aggressive early pruning unreliable while hybrids remain robust to later reduction. Motivated by this, we propose a hybrid-aware token reduction framework with a low-to-high progressive schedule and a unified query-conditioned importance score for attention and Mamba layers. For Mamba, excluding the position-dependent decay from the recurrence produces a stronger selection signal. Across long-video benchmarks, our method achieves 3.8× to 4.2× prefilling speedups at a 25% token budget while maintaining near-baseline accuracy and improving with light finetuning. Hybrid models benefit from aggressive reduction, improving both efficiency and accuracy, whereas Transformers exhibit the standard trade-off. Our method also outperforms prior baselines on the same hybrid backbone and combines effectively with visual redundancy reduction methods.
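
The two ideas in the abstract, stateful reduction and a progressive query-conditioned schedule, can be illustrated with a toy sketch. This is not the paper's implementation: the scalar recurrence, the linear schedule, and the cosine-similarity score are all assumptions chosen for illustration, and the names `scan`, `keep_schedule`, and `select_tokens` are hypothetical.

```python
import math


def scan(xs, decay=0.9):
    """Toy scalar recurrence h_t = decay * h_{t-1} + x_t (a stand-in for a
    Mamba-style state; not the real selective scan). Returns every state."""
    h, states = 0.0, []
    for x in xs:
        h = decay * h + x
        states.append(h)
    return states


def keep_schedule(num_layers, final_keep=0.25):
    """Assumed low-to-high reduction schedule: the kept fraction falls
    linearly from 1.0 at the first layer to final_keep at the last."""
    if num_layers == 1:
        return [final_keep]
    step = (1.0 - final_keep) / (num_layers - 1)
    return [1.0 - step * i for i in range(num_layers)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def select_tokens(tokens, query, keep_ratio):
    """Keep the top-k tokens by similarity to the query (an assumed stand-in
    for a query-conditioned importance score). Returns indices in order."""
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    scores = [cosine(t, query) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Statefulness: token 0 still contributes to the state at position 2, so
# dropping position 0 after this layer does not erase its information.
states = scan([1.0, 0.0, 0.0], decay=0.5)

# Progressive selection: later layers keep fewer tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]
kept_per_layer = [select_tokens(tokens, query, r) for r in keep_schedule(4)]
```

In the toy recurrence, the state at the last position is nonzero only because of the first token, which is the sense in which discarding a token after a recurrent layer compresses it rather than removing it. Pruning the same token before the layer would lose that contribution entirely.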