EsoLang-Bench

Description

EsoLang-Bench is an environment for evaluating LLM code generation in esoteric programming languages. Models score ~90% on Python coding tasks but only ~3.8% when the same problems must be solved in esoteric languages, a gap the benchmark uses to separate genuine reasoning from pattern matching. For each task, the agent is given a programming problem and must write a solution in one of five esoteric languages.

Capabilities

Code generation in Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare
Iterative testing and debugging via interpreter execution
Algorithmic problem solving under severe language constraints

Compute Requirements

No special compute requirements. All interpreters are pure-Python and execute server-side with a 5-second timeout per execution.
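To illustrate what a pure-Python interpreter with a wall-clock timeout can look like, here is a minimal Brainfuck sketch. It is not the benchmark's interpreter: the names `run_brainfuck` and `ExecutionTimeout`, the 30,000-cell tape, the 8-bit wrapping cells, and the EOF-as-zero convention are all assumptions made for illustration.

```python
import time


class ExecutionTimeout(Exception):
    """Raised when a program exceeds its wall-clock budget."""


def run_brainfuck(code: str, stdin: str = "", timeout_s: float = 5.0) -> str:
    # Precompute matching bracket positions so loop jumps are O(1).
    jumps, stack = {}, []
    for pos, ch in enumerate(code):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise SyntaxError("unmatched ']'")
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start
    if stack:
        raise SyntaxError("unmatched '['")

    tape = [0] * 30000  # assumed tape size (a common convention)
    ptr = pc = in_pos = steps = 0
    out = []
    deadline = time.monotonic() + timeout_s

    while pc < len(code):
        # Check the clock only every 1024 steps to keep overhead low.
        steps += 1
        if steps % 1024 == 0 and time.monotonic() > deadline:
            raise ExecutionTimeout(f"exceeded {timeout_s} seconds")

        ch = code[pc]
        if ch == ">":
            ptr = (ptr + 1) % len(tape)  # pointer wraps (assumption)
        elif ch == "<":
            ptr = (ptr - 1) % len(tape)
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256  # 8-bit wrapping cells
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            if in_pos < len(stdin):
                tape[ptr] = ord(stdin[in_pos]) % 256
                in_pos += 1
            else:
                tape[ptr] = 0  # EOF convention assumed here
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the loop start
        pc += 1

    return "".join(out)


print(run_brainfuck("++++++++[>++++++++<-]>+."))  # prints "A" (8 * 8 + 1 = 65)
```

A cooperative deadline check like this works without threads or subprocesses because the interpreter loop controls every step of the guest program.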

License

CC BY 4.0 (dataset), MIT (interpreters)

Tasks

400 tasks in a single test split (80 problems x 5 languages). Each task combines a programming problem with a target esoteric language. Problems range from simple string output to complex algorithmic challenges across four difficulty levels (easy, medium, hard, extra_hard).
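The task count follows from crossing every problem with every language. A small sketch of that arithmetic is shown here; the language identifiers, the `problem_id` format, and the record fields are hypothetical and may not match the dataset's actual schema.

```python
from itertools import product

# Hypothetical identifiers for the five target languages.
LANGUAGES = ["brainfuck", "befunge98", "whitespace", "unlambda", "shakespeare"]
DIFFICULTIES = ["easy", "medium", "hard", "extra_hard"]
PROBLEMS_PER_DIFFICULTY = 20

# Placeholder problem records: 4 difficulty levels x 20 problems = 80.
problems = [
    {"problem_id": f"{difficulty}_{i:02d}", "difficulty": difficulty}
    for difficulty in DIFFICULTIES
    for i in range(PROBLEMS_PER_DIFFICULTY)
]

# Each task pairs one problem with one target language: 80 x 5 = 400.
tasks = [
    {**problem, "language": language}
    for problem, language in product(problems, LANGUAGES)
]

print(len(problems), len(tasks))  # 80 400
```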

Reward Structure

Partial credit based on hidden test cases:

reward = num_passing_tests / total_tests (6 test cases per problem)
Output matching uses the paper's outputs_match_lang function (language-aware, trailing whitespace tolerant, with numeric normalization)
Reward range: [0.0, 1.0]
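The reward rule above can be sketched as follows. The comparison function here is a simplified stand-in and not the paper's `outputs_match_lang`: it covers only trailing-whitespace tolerance and a basic numeric normalization, and omits any language-specific handling. The names `outputs_match_sketch` and `partial_credit` are hypothetical.

```python
import re
from typing import List, Tuple

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def _normalize(text: str) -> str:
    # Drop trailing whitespace on each line and trailing blank lines.
    lines = [line.rstrip() for line in text.replace("\r\n", "\n").split("\n")]
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines)


def _canonical_number(match: "re.Match[str]") -> str:
    # Assumed normalization: "007" -> "7", "3.0" -> "3", "3.50" -> "3.5".
    token = match.group(0)
    if "." not in token:
        return str(int(token))
    value = float(token)
    return str(int(value)) if value.is_integer() else repr(value)


def outputs_match_sketch(actual: str, expected: str) -> bool:
    left = _NUMBER.sub(_canonical_number, _normalize(actual))
    right = _NUMBER.sub(_canonical_number, _normalize(expected))
    return left == right


def partial_credit(results: List[Tuple[str, str]]) -> float:
    """results holds one (actual_output, expected_output) pair per test case."""
    if not results:
        return 0.0
    passing = sum(outputs_match_sketch(a, e) for a, e in results)
    return passing / len(results)


# Six test cases, four of which match under the rules above.
cases = [
    ("42\n", "42"),
    ("3.0", "3"),
    ("hello  \n", "hello"),
    ("Hello", "hello"),
    ("", "0"),
    ("7", "7"),
]
print(round(partial_credit(cases), 3))  # 0.667
```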

Data

Source: Lossfunk/Esolang-Bench on HuggingFace
Size: 80 problems, 6 test cases each
Format: Parquet file with problem descriptions and test cases

Tools

run_code(code, stdin): Execute code in the target esoteric language with optional stdin. Returns stdout, stderr, and error status. 5-second timeout.
submit(code): Submit the final solution for grading against all hidden test cases. This is a terminal action: only one submission is allowed. Returns per-case results and the partial-credit reward.
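The intended workflow of the two tools, probing with `run_code` and then committing once with `submit`, can be sketched with a local stub. Everything here is hypothetical: `StubEnv`, `RunResult`, the result dictionary keys, and the toy interpreter are illustrative stand-ins and do not reflect the environment's real client API or any esoteric language.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RunResult:
    stdout: str
    stderr: str
    error: bool


class StubEnv:
    """Local stand-in that mimics the shape of run_code and submit."""

    def __init__(
        self,
        interpreter: Callable[[str, str], str],
        hidden_tests: List[Tuple[str, str]],
    ) -> None:
        self._interpreter = interpreter
        self._hidden_tests = hidden_tests
        self._submitted = False

    def run_code(self, code: str, stdin: str = "") -> RunResult:
        try:
            return RunResult(self._interpreter(code, stdin), "", False)
        except Exception as exc:  # surface interpreter failures as stderr
            return RunResult("", str(exc), True)

    def submit(self, code: str) -> Dict[str, object]:
        if self._submitted:
            raise RuntimeError("submit is terminal: only one submission is allowed")
        self._submitted = True
        per_case = []
        for stdin, expected in self._hidden_tests:
            result = self.run_code(code, stdin)
            per_case.append(
                not result.error and result.stdout.rstrip() == expected.rstrip()
            )
        return {"per_case": per_case, "reward": sum(per_case) / len(per_case)}


def toy_interpreter(code: str, stdin: str) -> str:
    # Toy stand-in, not an esoteric language: "ECHO" copies stdin to stdout.
    if code == "ECHO":
        return stdin
    raise ValueError(f"unknown program: {code!r}")


env = StubEnv(toy_interpreter, [("a", "a"), ("b", "b"), ("", "")])
result = None
for candidate in ["ECH0", "ECHO"]:  # the first candidate contains a typo
    probe = env.run_code(candidate, "hi")
    if not probe.error and probe.stdout == "hi":
        result = env.submit(candidate)  # commit only after a passing probe
        break

print(result["reward"])  # 1.0
```

Because `submit` ends the episode, the loop spends its attempts on `run_code` and commits only once a candidate passes the agent's own checks.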

Time Horizon

Multi-turn. Agents iteratively write, test, and debug code before submitting. Expected 5-30 tool calls depending on problem difficulty and language complexity.

Environment Difficulty

Easy (20 problems): Basic I/O, string manipulation
Medium (20 problems): Loops, conditionals, arithmetic
Hard (20 problems): Complex algorithms, data structures
Extra Hard (20 problems): Advanced algorithmic challenges

The esoteric-language constraint compounds the difficulty: even easy problems become challenging in Brainfuck or Whitespace.

Other Environment Requirements

No other requirements. No sandbox, API keys, or external services needed.

Safety

All code execution happens server-side in pure-Python interpreters with strict timeouts. Esoteric languages operate on abstract machines (tapes, stacks, grids) with no filesystem or network access. No safety concerns.

Citations

@article{sharma2026esolangbench,
      title={EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages},
      author={Aman Sharma and Paras Chopra},
      year={2026},
      eprint={2603.09678},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
}