BixBench

Description

BixBench is an environment for evaluating AI agents on real-world bioinformatics computational analysis tasks. It is built from Code Ocean capsules containing published bioinformatics analyses: agents are given access to biological datasets and must answer hypothesis-driven research questions through multi-step analytical trajectories.

Capabilities

Exploring and analyzing biological datasets using CLI tools
Writing and executing bioinformatics analysis code
Interpreting results from genomic, transcriptomic, and proteomic analyses
Multi-step computational biology reasoning

Compute Requirements

Agents in BixBench are given a sandbox with access to bioinformatics tools (samtools, bcftools, bedtools, tabix) and the full Code Ocean capsule data (~5.91 GB total).
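
As an illustration, an agent could confirm that the listed tools are reachable before starting an analysis. This is a minimal sketch using only the Python standard library; the helper name `find_tools` is hypothetical and not part of BixBench.

```python
import shutil

# Tools named in the sandbox description above.
TOOLS = ("samtools", "bcftools", "bedtools", "tabix")


def find_tools(names=TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found'}")
```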

License

Apache 2.0.

Tasks

There is one split in this environment:

Test: 205 bioinformatics analysis tasks

Each task is derived from a Code Ocean capsule and presents a hypothesis-driven question about biological data. Tasks span diverse bioinformatics domains including genomics, transcriptomics, and proteomics.

Reward Structure

This is a multi-turn environment with binary reward at submission:

1.0 — Correct answer
0.0 — Incorrect answer

Evaluation uses two modes depending on the task:

String verifier: Case-insensitive string matching with LLM semantic fallback (gpt-5-mini)
Range verifier: Numeric proximity check with distractor-based tolerance

Exact matches are checked first to avoid unnecessary LLM calls.
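
The two verification modes can be sketched roughly as follows. This is an illustrative approximation, not the environment's actual grading code: the function names, the whitespace normalization, and the `semantic_fallback` callable (a stand-in for the LLM judge) are all assumptions. How the tolerance is derived from distractors is not described here, so the range check simply takes precomputed bounds as input.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so the comparison is case-insensitive."""
    return " ".join(text.lower().split())


def verify_string(
    submitted: str,
    expected: str,
    semantic_fallback: Optional[Callable[[str, str], bool]] = None,
) -> float:
    # Cheap case-insensitive comparison first: no LLM call is needed on a match.
    if normalize(submitted) == normalize(expected):
        return 1.0
    # Only consult the (hypothetical) LLM judge when the exact check fails.
    if semantic_fallback is not None and semantic_fallback(submitted, expected):
        return 1.0
    return 0.0


def verify_range(submitted: str, low: float, high: float) -> float:
    # Assumption: the accepted interval [low, high] is precomputed per task.
    try:
        value = float(submitted)
    except ValueError:
        return 0.0  # Non-numeric submissions are graded as incorrect.
    return 1.0 if low <= value <= high else 0.0
```

Passing the judge as a callable keeps the sketch testable without network access; with an exact match, the callable is never invoked.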

Data

Task data consists of a Parquet metadata file and Code Ocean capsules containing biological datasets. Capsules are mounted at /orwd_data/bixbench/capsules/ in production.
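
For orientation, the mounted capsules could be enumerated with a few lines of standard-library Python. This sketch assumes each capsule is a subdirectory of the mount point, which this document does not state; `list_capsules` is a hypothetical helper.

```python
from pathlib import Path

# Production mount point from the text above; pass another root for local testing.
CAPSULE_ROOT = Path("/orwd_data/bixbench/capsules/")


def list_capsules(root: Path = CAPSULE_ROOT) -> list[str]:
    """Return sorted names of subdirectories under root (empty if root is missing)."""
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    for name in list_capsules():
        print(name)
```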

Source: futurehouse/BixBench

Tools

Tool	Description
`submit_answer`	Submit your answer for binary evaluation.
`bash`	Execute shell commands.
`glob`	Find files by pattern.
`grep`	Search file contents.
`ls`	List directory contents.
`read`	Read file contents.
`write`	Write to files.
`edit`	Edit existing files.
`multi_edit`	Apply multiple edits to a file.
`todo_write`	Track task progress.

Time Horizon

BixBench is a multi-turn environment. Agents iteratively explore data, write analysis code, and execute computations before submitting a final answer.

Environment Difficulty

Model performance on BixBench from the original paper (open-answer setting):

Model	Accuracy
Claude 3.5 Sonnet	17%
GPT-4o	9%

Even frontier models achieved no better than random accuracy in the multiple-choice setting, indicating that fully autonomous bioinformatics research remained challenging at the time of the benchmark's release.

Other Environment Requirements

OpenAI API key: Required for LLM-based fallback grading in string verification. Pass via secrets={"openai_api_key": "..."}.
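
One way to assemble that mapping without hard-coding the key is to read it from the process environment. This is a sketch only: the `OPENAI_API_KEY` variable name and the `build_secrets` helper are assumptions, and the call that receives `secrets=` is not shown in this document.

```python
import os


def build_secrets(env=os.environ) -> dict:
    """Build the secrets mapping, failing early if the key is absent."""
    # Assumption: the key is stored in an environment variable named OPENAI_API_KEY.
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; LLM fallback grading needs it")
    return {"openai_api_key": key}
```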

Safety

Agents in BixBench interact with published biological datasets in a sandboxed environment. The environment does not involve human subjects or clinical data requiring special protections.

Citations

@article{mitchener2025bixbench,
  author    = {Mitchener, Ludovico and Laurent, Jon M and Tenmann, Benjamin and Narayanan, Siddharth and Wellawatte, Geemi P and White, Andrew and Sani, Lorenzo and Rodriques, Samuel G},
  title     = {BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology},
  journal   = {arXiv preprint arXiv:2503.00096},
  year      = {2025},
  url       = {https://arxiv.org/abs/2503.00096}
}