EsoLang

EsoLang-Bench is a benchmark for evaluating genuine reasoning in large language models via esoteric programming languages. It is designed to be resistant to data contamination and benchmark gaming, measuring transferable computational reasoning rather than memorization.

Type: RL Env
Capabilities: Code Generation
Runtime: ORS
License: unknown
Size: 400 tasks
Published: Mar 2026

EsoLang-Bench

OpenReward Environment

Description

EsoLang-Bench is an environment for evaluating LLM code generation in esoteric programming languages. Models score ~90% on Python coding tasks but only ~3.8% when the same problems must be solved in esoteric languages; the benchmark uses this gap to separate genuine reasoning from pattern matching. Agents are given programming problems and must write solutions in one of five esoteric languages.

Capabilities

  • Code generation in Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare
  • Iterative testing and debugging via interpreter execution
  • Algorithmic problem solving under severe language constraints

Compute Requirements

No special compute requirements. All interpreters are pure-Python and execute server-side with a 5-second timeout per execution.
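To illustrate what a pure-Python interpreter with a wall-clock limit can look like, here is a minimal Brainfuck sketch. It is not the environment's actual interpreter: the function name `run_brainfuck`, the 30,000-cell wrapping tape, 8-bit wrapping cells, and the "EOF reads as 0" convention are all assumptions chosen for the example.

```python
import time


def run_brainfuck(code: str, stdin: str = "", timeout_s: float = 5.0) -> str:
    """Interpret Brainfuck `code`; raise TimeoutError after `timeout_s` seconds."""
    # Precompute the matching position for every bracket.
    jumps, stack = {}, []
    for i, ch in enumerate(code):
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            if not stack:
                raise SyntaxError("unmatched ']'")
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    if stack:
        raise SyntaxError("unmatched '['")

    tape = [0] * 30000          # assumed tape size
    ptr = pc = in_pos = steps = 0
    out = []
    deadline = time.monotonic() + timeout_s
    while pc < len(code):
        steps += 1
        # Check the clock only occasionally to keep the hot loop cheap.
        if steps % 10000 == 0 and time.monotonic() > deadline:
            raise TimeoutError("execution exceeded time limit")
        ch = code[pc]
        if ch == ">":
            ptr = (ptr + 1) % len(tape)
        elif ch == "<":
            ptr = (ptr - 1) % len(tape)
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            if in_pos < len(stdin):
                tape[ptr] = ord(stdin[in_pos]) % 256
                in_pos += 1
            else:
                tape[ptr] = 0   # assumed EOF convention
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]      # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]      # jump back to the loop start
        pc += 1
    return "".join(out)


print(run_brainfuck("++++++++[>++++++++<-]>+."))  # prints "A" (8 * 8 + 1 = 65)
```

Because the interpreter only manipulates an in-memory list, a program has no way to reach the filesystem or network; the deadline check is what bounds runaway loops such as `+[]`.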

License

CC BY 4.0 (dataset), MIT (interpreters)

Tasks

400 tasks in a single test split (80 problems × 5 languages). Each task combines a programming problem with a target esoteric language. Problems range from simple string output to complex algorithmic challenges across four difficulty levels (easy, medium, hard, extra_hard).
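The task count follows from a cross product of problems and languages. The sketch below reproduces that arithmetic; the problem identifiers and the `problem_id`/`language` field names are hypothetical and are not taken from the dataset schema.

```python
from itertools import product

LANGUAGES = ["Brainfuck", "Befunge-98", "Whitespace", "Unlambda", "Shakespeare"]
DIFFICULTIES = ["easy", "medium", "hard", "extra_hard"]
PROBLEMS_PER_DIFFICULTY = 20

# Hypothetical identifiers such as "easy_01"; the real dataset may name problems differently.
problems = [
    f"{difficulty}_{index:02d}"
    for difficulty in DIFFICULTIES
    for index in range(1, PROBLEMS_PER_DIFFICULTY + 1)
]

# One task per (problem, language) pair.
tasks = [
    {"problem_id": problem, "language": language}
    for problem, language in product(problems, LANGUAGES)
]

print(len(problems), len(tasks))  # 80 400
```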

Reward Structure

Partial credit based on hidden test cases:

  • reward = num_passing_tests / total_tests (6 test cases per problem)
  • Output matching uses the paper's outputs_match_lang function (language-aware, trailing whitespace tolerant, with numeric normalization)
  • Reward range: [0.0, 1.0]
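A minimal sketch of the partial-credit formula follows. `outputs_match_sketch` is a simplified stand-in written for this example, not the paper's `outputs_match_lang`: it only ignores trailing whitespace and does not attempt the language-aware or numeric normalization described above.

```python
from typing import List


def outputs_match_sketch(expected: str, actual: str) -> bool:
    """Simplified comparison: ignore trailing whitespace per line and trailing blank lines."""
    def normalize(text: str) -> List[str]:
        lines = [line.rstrip() for line in text.replace("\r\n", "\n").split("\n")]
        while lines and lines[-1] == "":
            lines.pop()
        return lines

    return normalize(expected) == normalize(actual)


def partial_credit(expected_outputs: List[str], actual_outputs: List[str]) -> float:
    """reward = num_passing_tests / total_tests, always within [0.0, 1.0]."""
    if len(expected_outputs) != len(actual_outputs):
        raise ValueError("expected and actual outputs must have the same length")
    if not expected_outputs:
        return 0.0
    passing = sum(
        outputs_match_sketch(e, a) for e, a in zip(expected_outputs, actual_outputs)
    )
    return passing / len(expected_outputs)


expected = ["1\n", "2\n", "3\n", "4\n", "5\n", "6\n"]
actual = ["1", "2  \n", "3\n\n", "4\n", "50\n", ""]  # last two cases fail
print(round(partial_credit(expected, actual), 3))  # 0.667
```

With six test cases per problem, the reward can only take the values 0, 1/6, 2/6, and so on up to 1.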

Data

  • Source: Lossfunk/Esolang-Bench on HuggingFace
  • Size: 80 problems, 6 test cases each
  • Format: Parquet file with problem descriptions and test cases

Tools

  • run_code(code, stdin): Execute code in the target esoteric language with optional stdin. Returns stdout, stderr, and error status. 5-second timeout.
  • submit(code): Submit final solution for grading against all hidden test cases. Terminal action — one submission allowed. Returns per-case results and partial credit reward.
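The two tools suggest a test-then-submit loop. The sketch below mocks that contract locally so the flow can be seen end to end. Everything in it is hypothetical: `MockEsoEnv`, the returned dictionary keys, and the toy `ECHO`/`UPPER` "language" stand in for the real server-side tools and interpreters.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


def toy_interpreter(code: str, stdin: str) -> str:
    """Hypothetical two-program 'language' used only to drive the mock."""
    if code == "ECHO":
        return stdin
    if code == "UPPER":
        return stdin.upper()
    raise ValueError(f"unknown program: {code!r}")


@dataclass
class MockEsoEnv:
    interpreter: Callable[[str, str], str]
    hidden_tests: List[Tuple[str, str]]  # (stdin, expected stdout)
    submitted: bool = False

    def run_code(self, code: str, stdin: str = "") -> Dict[str, object]:
        """Execute code and report stdout, stderr, and an error flag."""
        try:
            return {"stdout": self.interpreter(code, stdin), "stderr": "", "error": False}
        except Exception as exc:
            return {"stdout": "", "stderr": str(exc), "error": True}

    def submit(self, code: str) -> Dict[str, object]:
        """Grade against all hidden tests; terminal, so a second call is rejected."""
        if self.submitted:
            raise RuntimeError("submit() is terminal; only one submission is allowed")
        self.submitted = True
        results = []
        for stdin, expected in self.hidden_tests:
            run = self.run_code(code, stdin)
            results.append(not run["error"] and run["stdout"].rstrip() == expected.rstrip())
        return {"results": results, "reward": sum(results) / len(results)}


HIDDEN = [("ab", "AB"), ("x", "X"), ("Hi", "HI"), ("", ""), ("a b", "A B"), ("z9", "Z9")]
env = MockEsoEnv(toy_interpreter, HIDDEN)

# Iterate with run_code on self-chosen inputs, then submit once.
for candidate in ["ECHO", "UPPER"]:
    if env.run_code(candidate, "ab")["stdout"] == "AB":
        print(env.submit(candidate)["reward"])  # 1.0
        break
```

Since `submit` is a one-shot action, an agent has an incentive to exhaust its own test inputs through `run_code` before committing.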

Time Horizon

Multi-turn. Agents iteratively write, test, and debug code before submitting. Expected 5-30 tool calls depending on problem difficulty and language complexity.

Environment Difficulty

  • Easy (20 problems): Basic I/O, string manipulation
  • Medium (20 problems): Loops, conditionals, arithmetic
  • Hard (20 problems): Complex algorithms, data structures
  • Extra Hard (20 problems): Advanced algorithmic challenges

The esoteric-language constraint compounds the difficulty: even easy problems become challenging in Brainfuck or Whitespace.

Other Environment Requirements

No other requirements. No sandbox, API keys, or external services needed.

Safety

All code execution happens server-side in pure-Python interpreters with strict timeouts. Esoteric languages operate on abstract machines (tapes, stacks, grids) with no filesystem or network access. No safety concerns.

Citations

@article{sharma2026esolangbench,
      title={EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages},
      author={Aman Sharma and Paras Chopra},
      year={2026},
      eprint={2603.09678},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
}