EsoLang-Bench
Description
EsoLang-Bench is an environment for evaluating LLM code generation in esoteric programming languages. Models score ~90% on Python coding tasks but only ~3.8% when the same problems must be solved in esoteric languages; the gap is meant to separate genuine reasoning from pattern matching. Agents are given programming problems and must write solutions in one of five esoteric languages.
Capabilities
- Code generation in Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare
- Iterative testing and debugging via interpreter execution
- Algorithmic problem solving under severe language constraints
Compute Requirements
No special compute requirements. All interpreters are pure-Python and execute server-side with a 5-second timeout per execution.
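To make the "pure-Python interpreter with a timeout" idea concrete, here is a minimal sketch of a Brainfuck interpreter that enforces a wall-clock budget. It is not the benchmark's interpreter: the names `run_brainfuck` and `ExecutionTimeout`, the 30,000-cell wrapping tape, 8-bit cells, and the "EOF reads as 0" convention are all assumptions chosen for illustration.

```python
import time


class ExecutionTimeout(Exception):
    """Raised when a program exceeds its wall-clock budget (hypothetical name)."""


def run_brainfuck(code: str, stdin: str = "", timeout_s: float = 5.0) -> str:
    # Precompute matching bracket positions so loop jumps are O(1).
    jumps, stack = {}, []
    for pos, ch in enumerate(code):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise SyntaxError("unmatched ']'")
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start
    if stack:
        raise SyntaxError("unmatched '['")

    tape = [0] * 30000          # assumed tape size
    ptr = pc = in_pos = steps = 0
    out = []
    deadline = time.monotonic() + timeout_s

    while pc < len(code):
        steps += 1
        # Check the clock only every 4096 steps to keep overhead low.
        if steps % 4096 == 0 and time.monotonic() > deadline:
            raise ExecutionTimeout(f"exceeded {timeout_s} s")
        ch = code[pc]
        if ch == ">":
            ptr = (ptr + 1) % len(tape)
        elif ch == "<":
            ptr = (ptr - 1) % len(tape)
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            if in_pos < len(stdin):
                tape[ptr] = ord(stdin[in_pos]) % 256
                in_pos += 1
            else:
                tape[ptr] = 0   # assumed EOF convention
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]      # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]      # jump back to the loop start
        pc += 1                 # any other character is a comment
    return "".join(out)
```

Because the interpreter only manipulates an in-memory list and strings, a runaway program such as `+[]` costs nothing beyond CPU time until the deadline check fires.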
License
CC BY 4.0 (dataset), MIT (interpreters)
Tasks
400 tasks in a single test split (80 problems x 5 languages). Each task combines a programming problem with a target esoteric language. Problems range from simple string output to complex algorithmic challenges across four difficulty levels (easy, medium, hard, extra_hard).
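The task count follows from crossing every problem with every language. The sketch below reproduces that arithmetic, using the 20-problems-per-difficulty breakdown listed under Environment Difficulty; the `problem_id` format and the dictionary keys are hypothetical and are not taken from the dataset schema.

```python
from itertools import product

LANGUAGES = ["Brainfuck", "Befunge-98", "Whitespace", "Unlambda", "Shakespeare"]
DIFFICULTIES = ["easy", "medium", "hard", "extra_hard"]
PROBLEMS_PER_DIFFICULTY = 20  # 4 levels x 20 = 80 problems

# Hypothetical identifiers; the real dataset may name problems differently.
problems = [
    (difficulty, index)
    for difficulty in DIFFICULTIES
    for index in range(PROBLEMS_PER_DIFFICULTY)
]

# One task per (problem, language) pair.
tasks = [
    {
        "problem_id": f"{difficulty}_{index:02d}",
        "difficulty": difficulty,
        "language": language,
    }
    for (difficulty, index), language in product(problems, LANGUAGES)
]

print(len(problems), len(tasks))
```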
Reward Structure
Partial credit based on hidden test cases:
- reward = num_passing_tests / total_tests (6 test cases per problem)
- Output matching uses the paper's outputs_match_lang function (language-aware, trailing whitespace tolerant, with numeric normalization)
- Reward range: [0.0, 1.0]
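The partial-credit rule can be sketched as follows. This is an approximation under stated assumptions, not the paper's matching function: `outputs_match_sketch` only models trailing-whitespace tolerance and a simple numeric comparison, and it omits the language-aware behaviour entirely. All function names here are illustrative.

```python
def normalize(text: str) -> str:
    # Drop trailing whitespace on each line and at the end of the output.
    return "\n".join(line.rstrip() for line in text.rstrip().splitlines())


def outputs_match_sketch(actual: str, expected: str) -> bool:
    a, e = normalize(actual), normalize(expected)
    if a == e:
        return True
    # Assumed form of numeric normalization: "1.0" and "1" count as equal.
    try:
        return float(a) == float(e)
    except ValueError:
        return False


def reward(actual_outputs: list[str], expected_outputs: list[str]) -> float:
    if not expected_outputs or len(actual_outputs) != len(expected_outputs):
        raise ValueError("need one actual output per expected output")
    passing = sum(
        outputs_match_sketch(a, e)
        for a, e in zip(actual_outputs, expected_outputs)
    )
    return passing / len(expected_outputs)


# Six hidden cases, four of which match: reward is 4/6.
score = reward(
    ["5\n", "7", "x", "1.0", "a ", "b"],
    ["5", "8", "x", "1", "a", "c"],
)
```

With six test cases per problem, the reward can only take the values 0, 1/6, 2/6, and so on up to 1.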
Data
- Source: Lossfunk/Esolang-Bench on HuggingFace
- Size: 80 problems, 6 test cases each
- Format: Parquet file with problem descriptions and test cases
Tools
- run_code(code, stdin): Execute code in the target esoteric language with optional stdin. Returns stdout, stderr, and error status. 5-second timeout.
- submit(code): Submit final solution for grading against all hidden test cases. Terminal action — one submission allowed. Returns per-case results and partial credit reward.
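The contract of the two tools (repeatable test runs, then a single terminal submission) can be illustrated with an in-memory stand-in. Everything below is hypothetical: `StubSession`, the result dictionary keys, and the toy interpreter are assumptions for illustration and do not describe the real server's API or response format.

```python
from typing import Callable


class StubSession:
    """Hypothetical in-memory stand-in for run_code and submit."""

    def __init__(
        self,
        interpreter: Callable[[str, str], str],
        hidden_cases: list[tuple[str, str]],
    ) -> None:
        self._interpreter = interpreter
        self._hidden_cases = hidden_cases  # (stdin, expected stdout) pairs
        self._submitted = False

    def run_code(self, code: str, stdin: str = "") -> dict:
        # May be called any number of times before submitting.
        try:
            stdout = self._interpreter(code, stdin)
            return {"stdout": stdout, "stderr": "", "error": False}
        except Exception as exc:
            return {"stdout": "", "stderr": str(exc), "error": True}

    def submit(self, code: str) -> dict:
        # Terminal action: a second call is rejected.
        if self._submitted:
            raise RuntimeError("only one submission is allowed")
        self._submitted = True
        results = [
            self.run_code(code, stdin)["stdout"].rstrip() == expected.rstrip()
            for stdin, expected in self._hidden_cases
        ]
        return {"results": results, "reward": sum(results) / len(results)}


def toy_interpreter(code: str, stdin: str) -> str:
    # Toy "language" with a single valid program, used only for this demo.
    if code != "UPPER":
        raise ValueError("unknown program")
    return stdin.upper()


session = StubSession(toy_interpreter, [("a", "A"), ("b", "B")])
probe = session.run_code("bad", "x")   # error is reported, not raised
final = session.submit("UPPER")        # graded against all hidden cases
```

An agent would typically call `run_code` repeatedly on its own inputs and call `submit` only once it is satisfied, since the grade from that call is final.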
Time Horizon
Multi-turn. Agents iteratively write, test, and debug code before submitting. Expected 5-30 tool calls depending on problem difficulty and language complexity.
Environment Difficulty
- Easy (20 problems): Basic I/O, string manipulation
- Medium (20 problems): Loops, conditionals, arithmetic
- Hard (20 problems): Complex algorithms, data structures
- Extra Hard (20 problems): Advanced algorithmic challenges
Difficulty is compounded by the esoteric language constraint — even easy problems become challenging in Brainfuck or Whitespace.
Other Environment Requirements
No other requirements. No sandbox, API keys, or external services needed.
Safety
All code execution happens server-side in pure-Python interpreters with strict timeouts. Esoteric languages operate on abstract machines (tapes, stacks, grids) with no filesystem or network access. No safety concerns.
Citations
@article{sharma2026esolangbench,
title={EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages},
author={Aman Sharma and Paras Chopra},
year={2026},
eprint={2603.09678},
archivePrefix={arXiv},
primaryClass={cs.AI},
}