pi_apex_agents

This is a minimal Verifiers v1 taskset for mercor/apex-agents. The environment uses the generic Verifiers v1 harness interface, so any compatible harness can be supplied through [eval.harness].

Sandbox setup downloads the task's world_files_zipped/<world_id>.zip snapshot from Hugging Face, extracts it, and then overlays task_files/<task_id>/ in the same order as the Archipelago example runner. Agent-visible files are therefore available at:

/workspace/filesystem
/workspace/.apps_data
/workspace/input_manifest.txt
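
The extract-then-overlay order described above can be sketched as follows. This is a minimal illustration, not the environment's actual setup code: `build_workspace` and `demo_overlay` are hypothetical names, the Hugging Face download is replaced by a locally built zip, and the file names are invented for the demo.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def build_workspace(world_zip: Path, task_files: Path, workspace: Path) -> None:
    """Extract the world snapshot first, then overlay the task files on top."""
    workspace.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(world_zip) as archive:
        archive.extractall(workspace)
    if task_files.is_dir():
        # dirs_exist_ok=True merges into the extracted tree, so a task file
        # replaces a world file that has the same relative path.
        shutil.copytree(task_files, workspace, dirs_exist_ok=True)


def demo_overlay() -> dict:
    """Build a tiny fake world plus task overlay and return the resulting files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        world_zip = root / "world.zip"
        with zipfile.ZipFile(world_zip, "w") as archive:
            archive.writestr("filesystem/notes.txt", "world version")
            archive.writestr("filesystem/shared.txt", "world only")
        task_files = root / "task_files"
        (task_files / "filesystem").mkdir(parents=True)
        (task_files / "filesystem" / "notes.txt").write_text("task version")
        workspace = root / "workspace"
        build_workspace(world_zip, task_files, workspace)
        return {
            path.relative_to(workspace).as_posix(): path.read_text()
            for path in sorted(workspace.rglob("*"))
            if path.is_file()
        }
```

Because the overlay runs second, a task file shadows a world file at the same path, while world files without a task counterpart are left untouched.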

Task prompts are passed through from the dataset without adding a taskset system prompt, so each harness keeps its own prompt format.

The reward reads /workspace/final_answer.txt when a harness writes it, falling back to the harness completion otherwise. It grades with the same reference rubric shape used by Mercor-Intelligence/archipelago: each rubric criterion is judged independently with Verifiers' JudgeRubric, the judge JSON is parsed with the v1 judge utilities, and each criterion passes only when its score is at least 0.99.
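
A minimal sketch of the answer fallback and the per-criterion threshold, assuming the workspace is reachable as a local path. `resolve_solution` and `criterion_passed` are hypothetical helper names, and the handling of an empty final_answer.txt is an assumption (here, an existing file always wins).

```python
from pathlib import Path

# Threshold stated above: a criterion passes only at a score of at least 0.99.
PASS_THRESHOLD = 0.99


def resolve_solution(workspace: Path, completion: str) -> str:
    """Prefer final_answer.txt when the harness wrote it, else the completion."""
    answer_file = workspace / "final_answer.txt"
    if answer_file.is_file():
        # Assumption: an existing file is used even if it is empty.
        return answer_file.read_text(encoding="utf-8")
    return completion


def criterion_passed(score: float) -> bool:
    """Apply the pass threshold to a single judged criterion score."""
    return score >= PASS_THRESHOLD
```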

The weight-1 task_reward is binary: 1.0 only when every criterion passes, and 0.0 otherwise. The partial result passed_count / total_count is emitted as partial_reward with weight 0.0, and passed_count and total_count are emitted as metrics for auditing. For file-output tasks, the reward also extracts changed workspace documents, spreadsheets, slide decks, PDFs, and text files and appends their readable contents to the solution shown to the judge.
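
The relationship between the binary reward, the partial reward, and the two counters can be illustrated with a short sketch. `RubricResult` and `aggregate` are hypothetical names, and treating an empty rubric as a failure is an assumption that the text above does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricResult:
    task_reward: float      # weight 1: binary all-or-nothing reward
    partial_reward: float   # weight 0: fraction of criteria passed
    passed_count: int
    total_count: int


def aggregate(scores: list, threshold: float = 0.99) -> RubricResult:
    """Combine per-criterion judge scores into the reward and metrics."""
    total = len(scores)
    passed = sum(1 for score in scores if score >= threshold)
    # Assumption: a task with no criteria is scored as a failure.
    all_passed = total > 0 and passed == total
    return RubricResult(
        task_reward=1.0 if all_passed else 0.0,
        partial_reward=passed / total if total else 0.0,
        passed_count=passed,
        total_count=total,
    )
```

For example, scores of 1.0, 1.0, and 0.5 give a task_reward of 0.0 with two of three criteria passed, so only the zero-weight partial_reward reflects the near miss.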

Tooling Assumption

Archipelago exposes calendar, chat, code, document, filesystem, mail, PDF, presentation, and spreadsheet MCP servers. I checked the public tool packages before implementing this version. The code/document/PDF/spreadsheet/presentation tools are mostly wrappers around standard Python libraries such as pandas, openpyxl, python-docx, python-pptx, pypdf, pdfplumber, pymupdf, and reportlab; this matches Epoch's note that many benchmark tools wrap common Python packages.

This taskset therefore makes those Python packages available directly in the sandbox instead of recreating each MCP wrapper. The default sandbox starts from python:3.11-slim and installs the library set in harness.program.setup after the selected command harness has installed itself and before task setup downloads the Hugging Face world files. Installing this way makes the packages available to the image's normal python3, which is what coding harnesses tend to expose to the agent inside shell commands. Calendar, chat, and mail are not pure library wrappers; they expose structured app state. This implementation exposes the raw .apps_data files and assumes a coding harness can inspect and modify them directly when needed.
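
One way to confirm from inside the sandbox that the library set is importable by the normal python3 is a check like the one below. It is a sketch, not part of the taskset: `missing_modules` is a hypothetical helper, and the list uses the usual import names for the packages named above (python-docx imports as `docx`, python-pptx as `pptx`, and pymupdf as `fitz`, with newer releases also accepting `pymupdf`).

```python
import importlib.util

# Usual top-level import names for the libraries listed above.
EXPECTED_MODULES = [
    "pandas",
    "openpyxl",
    "docx",        # python-docx
    "pptx",        # python-pptx
    "pypdf",
    "pdfplumber",
    "fitz",        # pymupdf
    "reportlab",
]


def missing_modules(import_names: list) -> list:
    """Return the top-level module names that cannot be found by this interpreter."""
    return [
        name for name in import_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```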

Optional Archipelago code extras for medicine/scientific-computing worlds (pydicom, biopython, openmm, pyhmmer, particle) are not installed by default because the released domains are law, investment banking, and management consulting. Add them under [eval.harness.program.sandbox].packages if a future shard needs them.

Running

Install the environment from the repository root:

uv pip install -e ./environments/pi_apex_agents

Run with a local TOML config, for example:

[eval]
env_id = "pi_apex_agents"

[eval.harness]
id = "harnesses.mini_swe_agent"
max_turns = 100

The dataset is gated. The host must expose a token through one of HF_TOKEN, HUGGINGFACE_HUB_TOKEN, or ~/.cache/huggingface/token. The environment passes that token only to the sandbox setup download step and removes the temporary token file before the agent starts.
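
A host-side lookup across the three token sources might look like the sketch below. `resolve_hf_token` is a hypothetical name, and the precedence (HF_TOKEN, then HUGGINGFACE_HUB_TOKEN, then the cached token file) is an assumption; the environment's actual order may differ.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hf_token(
    env: Optional[Mapping[str, str]] = None,
    token_file: Optional[Path] = None,
) -> Optional[str]:
    """Return the first non-empty token found, or None when no source has one."""
    env = os.environ if env is None else env
    # Assumed precedence: environment variables before the cached token file.
    for name in ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    if token_file is None:
        token_file = Path.home() / ".cache" / "huggingface" / "token"
    if token_file.is_file():
        value = token_file.read_text(encoding="utf-8").strip()
        return value or None
    return None
```

Passing `env` and `token_file` explicitly keeps the lookup easy to test without touching the real environment or home directory.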

The grader uses Pinference at https://dev-inference.pinference.ai/api/v1 with google/gemini-2.5-flash and reads its API key from PRIME_API_KEY.

Archipelago's example config names the same judge as gemini/gemini-2.5-flash; Pinference currently exposes it as google/gemini-2.5-flash.

Useful Knobs

TOML examples:

[eval.taskset]
max_tasks = 3
domains = ["Law"]
task_ids = ["task_0b9134a634c14f24a6c256d034a6c130"]

[eval.harness]
max_turns = 100

To use a different harness in TOML, add an id under [eval.harness] and pass that harness's config fields there:

[eval.harness]
id = "harnesses.terminus_2"
max_turns = 1

Changelog

0.1.0: Initial taskset with hardcoded mercor/apex-agents train split, generic Verifiers v1 harness wiring, preinstalled Python sandbox libraries, and artifact-aware rubric grading.