Balrog Prime

Balrog Prime provides the BALROG (Benchmarking Agentic LLM and VLM Reasoning On Games) environments: NLE, MiniHack, BabyAI, TextWorld, Babaisai, and Crafter.

Domain: rl-env
License: unknown
Published: Aug 2025

Attribution

Leaderboard scores: prime-hub

Top score: 0.0% by GPT-4o, with 1 model reporting (1 from a frontier lab).

Top models (1)

[Bar chart: Balrog Prime scores, 1 bar. Highest value: GPT-4o at 0.]

Related tools (1)

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is Balrog Prime?
Balrog Prime provides the BALROG (Benchmarking Agentic LLM and VLM Reasoning On Games) environments: NLE, MiniHack, BabyAI, TextWorld, Babaisai, and Crafter.
What is the current top score on Balrog Prime?
The top reported score is 0.0% by GPT-4o, across 1 model reporting (1 from frontier labs).
How can a model improve its Balrog Prime score?
Sophon links one tool to Balrog Prime: Balrog Prime RL Env (Community). Linked tools are RL environments, datasets, and scaffolds that target this eval.
What license is Balrog Prime under?
No license is listed for Balrog Prime; the license field is recorded as unknown.