pi_apex_agents
This is a minimal Verifiers v1 taskset for
mercor/apex-agents.
The environment uses the generic Verifiers v1 harness interface, so any
compatible harness can be supplied through [eval.harness].
Sandbox setup downloads the task's world_files_zipped/<world_id>.zip snapshot
from Hugging Face, extracts it, and then overlays task_files/<task_id>/ in the
same order as the Archipelago example runner. Agent-visible files are therefore
available at:
/workspace/filesystem
/workspace/.apps_data
/workspace/input_manifest.txt
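The extract-then-overlay order described above can be sketched roughly as follows. This is a minimal illustration, not the environment's actual setup code: the function name `build_workspace` and its parameters are hypothetical, the Hugging Face download step is omitted, and the sketch assumes the zip and the task files are already on local disk.

```python
import shutil
import zipfile
from pathlib import Path


def build_workspace(world_zip: Path, task_files_dir: Path, workspace: Path) -> None:
    """Extract a world snapshot, then copy task files on top of it.

    Hypothetical helper for illustration; all names are assumptions.
    """
    workspace.mkdir(parents=True, exist_ok=True)

    # Step 1: unpack the world snapshot into the workspace root.
    with zipfile.ZipFile(world_zip) as archive:
        archive.extractall(workspace)

    # Step 2: overlay the task-specific files. Because this runs second,
    # a task file replaces a world file that has the same relative path.
    if task_files_dir.is_dir():
        shutil.copytree(task_files_dir, workspace, dirs_exist_ok=True)
```

Applying the overlay second means a task file with the same relative path as a world file takes precedence, which is the practical consequence of the ordering.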
Task prompts are passed through from the dataset without adding a taskset system prompt, so each harness keeps its own prompt format.
The reward reads /workspace/final_answer.txt when a harness writes it, falling
back to the harness completion otherwise. It grades with the same reference
rubric shape used by
Mercor-Intelligence/archipelago:
each rubric criterion is judged independently with Verifiers' JudgeRubric,
the judge JSON is parsed with the v1 judge utilities, and each criterion passes
only when its score is at least 0.99.
The weight-1 task_reward is binary: 1.0 only when every criterion passed.
The partial result passed_count / total_count is emitted as partial_reward
with weight 0.0, and passed_count and total_count are emitted as metrics
for auditing. For file-output tasks, the reward also extracts changed workspace
documents, spreadsheets, slide decks, PDFs, and text files and appends their
readable contents to the solution shown to the judge.
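The aggregation described above can be sketched as follows. This is only an illustration of the arithmetic, assuming the per-criterion judge scores are already available as floats; `RewardSummary` and `summarize` are hypothetical names, and the handling of an empty rubric is an assumption rather than documented behavior.

```python
from dataclasses import dataclass

# From the text: a criterion passes only when its score is at least 0.99.
PASS_THRESHOLD = 0.99


@dataclass
class RewardSummary:
    task_reward: float     # weight 1.0, binary
    partial_reward: float  # weight 0.0, reported only
    passed_count: int
    total_count: int


def summarize(scores: list[float]) -> RewardSummary:
    """Collapse per-criterion judge scores into the reported values."""
    total = len(scores)
    passed = sum(1 for score in scores if score >= PASS_THRESHOLD)
    # Assumption: a rubric with no criteria yields 0.0 for both rewards.
    all_passed = total > 0 and passed == total
    return RewardSummary(
        task_reward=1.0 if all_passed else 0.0,
        partial_reward=passed / total if total else 0.0,
        passed_count=passed,
        total_count=total,
    )
```

For example, scores of 1.0, 0.99, and 0.5 give `passed_count = 2`, `total_count = 3`, a `partial_reward` of 2/3, and a `task_reward` of 0.0, because one criterion failed.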
Tooling Assumption
Archipelago exposes calendar, chat, code, document, filesystem, mail, PDF,
presentation, and spreadsheet MCP servers. I checked the public tool packages
before implementing this version. The code/document/PDF/spreadsheet/presentation
tools are mostly wrappers around standard Python libraries such as pandas,
openpyxl, python-docx, python-pptx, pypdf, pdfplumber, pymupdf, and
reportlab; this matches Epoch's note that many benchmark tools wrap common Python
packages.
This taskset therefore makes those Python packages available directly in the
sandbox instead of recreating each MCP wrapper. The default sandbox starts from
python:3.11-slim and installs the library set in harness.program.setup
after the selected command harness has installed itself and before task setup
downloads the Hugging Face world files. Installing this way makes the packages
available to the image's normal python3, which is what coding harnesses tend
to expose to the agent inside shell commands.
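One way to confirm that the libraries are visible to the sandbox's python3 is a small import probe like the sketch below. It is not part of the taskset. The helper name `missing_modules` is hypothetical, and the module list uses the usual import names for the packages named above, which differ from the distribution names in a few cases.

```python
import importlib.util

# Import names for the libraries listed above. Several differ from the
# distribution name: python-docx -> docx, python-pptx -> pptx,
# pymupdf -> fitz (the long-standing import alias).
EXPECTED_MODULES = [
    "pandas",
    "openpyxl",
    "docx",
    "pptx",
    "pypdf",
    "pdfplumber",
    "fitz",
    "reportlab",
]


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

Running this with the same interpreter the harness exposes to the agent shows whether the setup step installed into the expected environment.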
Calendar, chat, and mail are not pure library wrappers; they expose structured
app state. This implementation exposes the raw .apps_data files and assumes a
coding harness can inspect and modify them directly when needed.
Optional Archipelago code extras for medicine/scientific-computing worlds
(pydicom, biopython, openmm, pyhmmer, particle) are not installed by
default because the released domains are law, investment banking, and
management consulting. Add them under [eval.harness.program.sandbox].packages
if a future shard needs them.
Running
Install the environment from the repository root:
uv pip install -e ./environments/pi_apex_agents
Run with a local TOML config, for example:
[eval]
env_id = "pi_apex_agents"
[eval.harness]
id = "harnesses.mini_swe_agent"
max_turns = 100
The dataset is gated. The host must expose one of HF_TOKEN,
HUGGINGFACE_HUB_TOKEN, or ~/.cache/huggingface/token; the environment passes
that token only to the sandbox setup download step and removes the temporary
token file before the agent starts.
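A host-side lookup over the three token sources could look like the sketch below. The precedence order (HF_TOKEN, then HUGGINGFACE_HUB_TOKEN, then the cached token file) is an assumption made for illustration, and `resolve_hf_token` is a hypothetical helper rather than a function from this environment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hf_token(
    env: Optional[Mapping[str, str]] = None,
    token_file: Optional[Path] = None,
) -> Optional[str]:
    """Return the first non-empty token found, or None.

    Assumed order: HF_TOKEN, HUGGINGFACE_HUB_TOKEN, then the token file.
    """
    if env is None:
        env = os.environ
    if token_file is None:
        token_file = Path.home() / ".cache" / "huggingface" / "token"

    # Environment variables are checked first.
    for name in ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value

    # Fall back to the token file if it exists.
    if token_file.is_file():
        return token_file.read_text().strip() or None
    return None
```

Checking this before launching an evaluation gives an early, clear failure when no credential is available, instead of a download error inside the sandbox.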
The grader uses Pinference at https://dev-inference.pinference.ai/api/v1 with
google/gemini-2.5-flash and reads its API key from PRIME_API_KEY.
Archipelago's example config names the same judge as
gemini/gemini-2.5-flash; Pinference currently exposes it as
google/gemini-2.5-flash.
Useful Knobs
TOML examples:
[eval.taskset]
max_tasks = 3
domains = ["Law"]
task_ids = ["task_0b9134a634c14f24a6c256d034a6c130"]
[eval.harness]
max_turns = 100
To use a different harness in TOML, add an id under [eval.harness] and pass
that harness's config fields there:
[eval.harness]
id = "harnesses.terminus_2"
max_turns = 1
Changelog
0.1.0: Initial taskset with hardcoded mercor/apex-agents train split, generic Verifiers v1 harness wiring, preinstalled Python sandbox libraries, and artifact-aware rubric grading.