0

FoodDelivery

Fresh

FoodDelivery is a food delivery dispatch optimization environment. An agent manages a fleet of e-bike couriers across a procedurally generated city, making real-time decisions about courier-order matching, order batching, surge pricing, and fleet repositioning under stochastic…

Type
RL Env
Runtime
ORS
License
unknown
Size
24 tasks
Published
Mar 2026

Cite

Notes

Only stored in your browser.

FoodDelivery

OpenReward Environment

Description

FoodDelivery is a food delivery dispatch optimization environment. An agent manages a fleet of e-bike couriers across a procedurally generated city, making real-time decisions about courier-order matching, order batching, surge pricing, and fleet repositioning under stochastic conditions.

The simulation models a 4-hour dinner service period (5:00 PM -- 9:00 PM) with non-homogeneous Poisson order arrivals, Gamma-distributed restaurant preparation times, and lognormal stochastic travel times with time-of-day traffic effects. Scenarios range from calm weekday evenings to holiday peak demand, rainy weather, sudden demand spikes, and understaffed conditions.

Note: This is a synthetic environment which was majority AI-generated; we recommend testing it thoroughly before any use in an RL pipeline.

Capabilities

  • Real-time courier-order matching and assignment under time pressure
  • Order batching (up to 2 orders per courier) for delivery efficiency
  • Dynamic surge pricing per zone to manage supply-demand imbalances
  • Idle courier repositioning to anticipate demand shifts
  • Multi-step sequential decision-making (240 steps per task)
  • Reasoning about stochastic travel times, restaurant prep delays, and demand patterns

Compute Requirements

No additional compute required beyond the environment server.

License

MIT

Tasks

There are 24 tasks across 8 scenarios, each run with 3 random seeds.

Train split (18 tasks):

ScenarioDemandWeatherCouriersDurationDescription
weekday_calm1.0xClear20240 minBaseline scenario
weekday_busy1.3xClear20240 minHigher demand, same supply
rainy_evening1.3xRain (+25% demand, +30% travel)18240 minBad weather, fewer couriers
weekend_rush1.6xClear22240 minWeekend dinner peak
friday_surge1.0x -> 2.0x at step 120Clear20240 minSudden demand spike mid-service
understaffed1.3xClear14240 minSevere courier shortage

Test split (6 tasks):

ScenarioDemandWeatherCouriersDurationDescription
holiday_peak2.0xClear25240 minMaximum demand
late_night0.8x -> decliningClear12180 minDeclining demand, repositioning critical

Each task is identified as {scenario}_seed{N} (e.g., weekday_calm_seed42). Seeds used are 42, 123, and 777.

The city model is a 6km x 6km grid divided into 9 zones (3x3), with 30 restaurants and 12--25 couriers depending on scenario. Zone types (downtown core, commercial, residential, suburban) determine local demand intensity.

Reward Structure

This is a dense, verifiable reward environment. Rewards are computed deterministically after each 1-minute simulation step based on delivery outcomes:

EventReward
On-time delivery (within promised window)+1.0
Slightly late delivery (within 10 min of promise)+0.3
Very late delivery (>10 min past promise)-0.5
Expired order (unassigned >20 min or undelivered >60 min)-1.5
Surge pricing revenue+0.01 per unit revenue

The final reward returned at task completion is the normalized cumulative reward:

$$\text{reward} = \frac{\sum_{t=1}^{T} r_t}{N_{\text{orders}}}$$

where $r_t$ is the step reward and $N_{\text{orders}}$ is the total number of orders that arrived during the simulation.

We do not use LLM graders for this task.

Data

All data is procedurally generated from the scenario configuration and random seed. No external data files are required. The simulation generates:

  • City layout: 9 zones with demand multipliers, 30 restaurants with cuisine types and prep time distributions
  • Order arrivals: Non-homogeneous Poisson process with dinner rush profile
  • Prep times: Gamma-distributed per restaurant type (fast food ~8 min, standard ~18 min, premium ~25 min)
  • Travel times: Manhattan distance with lognormal stochastic noise and time-of-day traffic factors
  • Courier speeds: Base ~18 km/h (e-bike) with lognormal perturbation per courier

Identical seeds produce identical simulations for reproducibility.

Tools

Agents have access to 2 tools:

ToolDescription
get_info()Returns static city layout: zone boundaries, restaurant positions/types, courier starting positions. Does not advance simulation time.
step(actions)Advances the simulation by 1 minute. Accepts optional dispatch actions: order assignments, surge multipliers per zone, and courier repositioning commands. Returns current state, step reward, and completion status.

The step tool accepts three optional action types:

  • assignments: Assign 1--2 orders to an idle courier (batching supported)
  • surge_multipliers: Set per-zone price multipliers (1.0--3.0x, reduces demand via elasticity)
  • repositions: Send idle couriers to target zones

Time Horizon

FoodDelivery is a multi-step environment with 240 simulation steps per task (180 for late_night). Each step represents 1 minute of simulated time. A typical run requires 240+ tool calls (one get_info call plus ~240 step calls).

Other Environment Requirements

There are no external API requirements. FoodDelivery works out of the box with the OpenReward endpoint without any secrets.

Safety

Agents in FoodDelivery optimize delivery logistics within a fully simulated environment. The surge pricing mechanism introduces an economic optimization component where agents must balance revenue extraction against service quality. While the environment rewards efficient dispatch, the objectives are bounded and the simulation is self-contained.

Citations

@dataset{GRFoodDelivery,
  author    = {General Reasoning Inc. Team},
  title     = {FoodDelivery},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://openreward.ai/GeneralReasoning/fooddelivery}
}