DeepSynth

DeepSynth is an environment for automatically synthesizing programs from examples. It combines machine learning predictions with efficient enumeration techniques in a generic way.

Type: RL Env
Capabilities: Code Generation
Runtime: ORS
License: unknown
Size: 2145 tasks
Published: Mar 2026

DeepSynth

OpenReward Environment

Description

DeepSynth is an environment for evaluating program synthesis from input-output examples. Agents are given examples of integer and integer-list transformations and must write a Python function that implements the transformation, generalizing beyond the shown examples.
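
To make the task format concrete, here is a hypothetical task of the kind described above. The examples, the function name `transform`, and the underlying rule (keep positive values, then sort ascending) are invented for illustration and are not taken from the dataset.

```python
from typing import List, Tuple

# Hypothetical visible I/O examples (invented, not from the dataset).
EXAMPLES: List[Tuple[List[int], List[int]]] = [
    ([3, -1, 2], [2, 3]),
    ([0, 5, -7, 4], [4, 5]),
    ([-2, -3], []),
    ([9], [9]),
    ([1, 1, -1], [1, 1]),
]


def transform(xs: List[int]) -> List[int]:
    """Keep strictly positive values, then sort ascending."""
    return sorted(x for x in xs if x > 0)


# A candidate is only useful if it reproduces every visible example.
all_pass = all(transform(inp) == out for inp, out in EXAMPLES)
```

A function that merely memorizes the five visible pairs would satisfy `all_pass` but fail on unseen inputs, which is why the rule itself has to be inferred.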

Capabilities

  • Pattern recognition from input-output examples
  • Program synthesis / function induction
  • Integer and list manipulation
  • Generalization from examples to unseen inputs

Compute Requirements

The environment is lightweight, runs without a sandbox, and has no special compute requirements.

Tasks

  • Train split: 1,353 tasks (T=1: 21, T=2: 464, T=3: 455, T=4: 413)
  • Test split: 792 tasks (T=1: 24, T=2: 99, T=3: 487, T=4: 96, T=5: 86)
  • Difficulty levels run from T=1 (single operation) to T=5 (five chained operations); T=5 tasks appear only in the test split
  • Each task has 5 visible I/O examples and ~20 hidden test cases

Tasks are drawn from the DeepCoder dataset, which covers integer list transformations using ~40 primitives (sorting, filtering, mapping, zipping, scanning, etc.).

Reward Structure

Binary reward: 1.0 if the submitted function passes ALL hidden test cases, 0.0 otherwise. Fully verifiable — no LLM grader.
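
The all-or-nothing rule can be sketched as follows. This is a minimal illustration, not the environment's grader: the name `binary_reward`, the `(args, expected)` case layout, and the choice to treat a raised exception as a failure are all assumptions.

```python
from typing import Any, Callable, Sequence, Tuple

# Assumed layout: each case is (tuple of positional arguments, expected output).
Case = Tuple[Tuple[Any, ...], Any]


def binary_reward(fn: Callable[..., Any], hidden_cases: Sequence[Case]) -> float:
    """Return 1.0 only if fn matches the expected output on every case."""
    for args, expected in hidden_cases:
        try:
            if fn(*args) != expected:
                return 0.0
        except Exception:
            # Assumption: a crash on any case counts as a failed case.
            return 0.0
    return 1.0


cases = [(([1, 2, 3],), 6), (([],), 0)]
print(binary_reward(sum, cases))             # 1.0
print(binary_reward(lambda xs: 6, cases))    # 0.0 (wrong on the empty list)
```

Because there is no partial credit, a function that passes 19 of roughly 20 hidden cases scores the same as one that passes none.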

Data

  • Source: DeepCoder dataset (Balog et al., ICLR 2017)
  • Enriched with hidden test cases generated from ground-truth DSL programs

Tools

  • test(code) — Test Python code against the visible I/O examples. Returns pass/fail per example. Non-terminal.
  • submit(code) — Submit final Python code for grading against hidden test cases. Terminal action, one attempt only.
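
As a rough sketch of what a per-example check like `test(code)` might do, the helper below executes a code string and reports pass/fail for each visible example. The helper name `run_visible_tests`, the entry-point name `solve`, and the `(args, expected)` example layout are hypothetical; the environment's actual interface may differ.

```python
from typing import Any, Dict, List, Sequence, Tuple

Example = Tuple[Tuple[Any, ...], Any]


def run_visible_tests(
    code: str, examples: Sequence[Example], entry_point: str = "solve"
) -> List[bool]:
    """Run submitted code against each example; one bool per example."""
    namespace: Dict[str, Any] = {}
    try:
        exec(code, namespace)  # no isolation here; illustration only
        fn = namespace[entry_point]
    except Exception:
        # Code that fails to load, or lacks the entry point, fails everything.
        return [False] * len(examples)

    results: List[bool] = []
    for args, expected in examples:
        try:
            results.append(fn(*args) == expected)
        except Exception:
            results.append(False)
    return results


code = "def solve(xs):\n    return xs[::-1]\n"
examples = [(([1, 2, 3],), [3, 2, 1]), (([],), []), (([4],), [5])]
print(run_visible_tests(code, examples))  # [True, True, False]
```

Since `submit` allows only one attempt, per-example feedback of this kind is the agent's only chance to catch a wrong hypothesis before grading.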

Time Horizon

Multi-turn. Agents can iterate with the test tool before submitting. A typical episode uses 1-5 tool calls.

Environment Difficulty

  • T=1: Easy (single operation, e.g., sort, map, filter)
  • T=2: Medium (two chained operations)
  • T=3: Hard (three chained operations, often with multiple inputs)
  • T=4-5: Very hard (four or five chained operations, complex compositions)
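
The difficulty levels above count chained operations. The sketch below illustrates that idea with a small left-to-right composition helper; the helper `chain` and the three operations are illustrative stand-ins chosen for this example, not the dataset's actual primitives.

```python
from functools import reduce
from typing import Callable, List

Op = Callable[[List[int]], List[int]]


def chain(*ops: Op) -> Op:
    """Compose list operations so they apply left to right."""
    return lambda xs: reduce(lambda acc, op: op(acc), ops, xs)


def keep_positive(xs: List[int]) -> List[int]:
    return [x for x in xs if x > 0]


def double(xs: List[int]) -> List[int]:
    return [2 * x for x in xs]


def reverse(xs: List[int]) -> List[int]:
    return xs[::-1]


t1_program = chain(sorted)                          # T=1: one operation
t3_program = chain(keep_positive, double, reverse)  # T=3: three chained operations

print(t1_program([3, 1, 2]))   # [1, 2, 3]
print(t3_program([3, -1, 2]))  # [4, 6]
```

With each added operation, more distinct compositions stay consistent with a handful of examples, which is one plausible reason the higher levels are harder.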

Safety

No safety concerns — tasks involve only integer and list manipulation.

Citations

@inproceedings{Balog2017,
  author    = {Balog, Matej and Gaunt, Alexander L. and Brockschmidt, Marc and Nowozin, Sebastian and Tarlow, Daniel},
  title     = {DeepCoder: Learning to Write Programs},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2017},
  url       = {https://arxiv.org/abs/1611.01989}
}

@inproceedings{Fijalkow2022,
  author    = {Fijalkow, Nathana{\"{e}}l and Lagarde, Guillaume and Matricon, Th{\'{e}}o and Ellis, Kevin and Ohlmann, Pierre and Potta, Akarsh},
  title     = {Scaling Neural Program Synthesis with Distribution-based Search},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume    = {36},
  number    = {6},
  pages     = {6623--6630},
  year      = {2022},
  doi       = {10.1609/aaai.v36i6.20616}
}