We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that efficiently reduces contextual bandit learning with general reward function approximation to offline regression. The framework allows near-optimal regret for contextual bandits with large action spaces with O(\log T) calls to an offline regression oracle over T rounds, and makes O(\log\log T) calls when T is known. The design of OE2D algorithm generalizes Falcon \citep{simchi2022bypassing} and its linear reward version \citep[][Section 4]{xu2020upper} in that it finds an action distribution that we term ``exploitative F-design'' that simultaneously guarantees low regret and good coverage, striking a balance between exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is small in many settings, including bounded Eluder dimension per-context and the smoothed regret setting. We also establish a relationship between DOEC and Decision Estimation Coefficient (DEC) \citep{foster2021statistical}, bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.
Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits
We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that efficiently reduces contextual bandit learning with general reward function approximation to offline regression.
- Preview

- Year
- 2026
- Hosting
- Abstract onlyARXIV-DEFAULT
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2602.09456ARXIV-DEFAULT
- TL;DR
- Semantic Scholar