BixBench
Description
BixBench is an environment for evaluating AI agents on real-world bioinformatics analysis tasks. Tasks are built from Code Ocean capsules containing published bioinformatics analyses. Agents are given access to the biological datasets in each capsule and must answer hypothesis-driven research questions through multi-step analytical trajectories.
Capabilities
- Exploring and analyzing biological datasets using CLI tools
- Writing and executing bioinformatics analysis code
- Interpreting results from genomic, transcriptomic, and proteomic analyses
- Multi-step computational biology reasoning
Compute Requirements
Agents in BixBench are given a sandbox with access to bioinformatics tools (samtools, bcftools, bedtools, tabix) and the full Code Ocean capsule data (~5.91 GB total).
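As a quick sanity check, an agent or operator can confirm that the listed tools are actually on the sandbox `PATH` before starting an analysis. The snippet below is a minimal sketch using only the Python standard library; the helper names `find_tools` and `missing_tools` are illustrative and not part of BixBench.

```python
import shutil

# Tool names taken from the environment description above.
EXPECTED_TOOLS = ("samtools", "bcftools", "bedtools", "tabix")


def find_tools(tools=EXPECTED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


def missing_tools(tools=EXPECTED_TOOLS):
    """Return the names of tools that could not be located on PATH."""
    return [name for name, path in find_tools(tools).items() if path is None]


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Running the script inside the sandbox prints one line per tool, which makes a missing binary obvious before any long-running analysis begins.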
License
Tasks
There is one split in this environment:
- Test: 205 bioinformatics analysis tasks
Each task is derived from a Code Ocean capsule and presents a hypothesis-driven question about biological data. Tasks span diverse bioinformatics domains including genomics, transcriptomics, and proteomics.
Reward Structure
This is a multi-turn environment with binary reward at submission:
- 1.0 — Correct answer
- 0.0 — Incorrect answer
Evaluation uses two modes depending on the task:
- String verifier: Case-insensitive string matching with LLM semantic fallback (gpt-5-mini)
- Range verifier: Numeric proximity check with distractor-based tolerance
Exact matches are checked first to avoid unnecessary LLM calls.
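The grading flow described above can be sketched roughly as follows. This is not the environment's actual verifier code: the function names are hypothetical, the LLM fallback is represented by an injectable callable rather than a real API call, stripping surrounding whitespace is an added assumption, and the tolerance rule in `range_verify` (half the distance to the nearest distractor) is an assumed stand-in for the "distractor-based tolerance", which this document does not specify.

```python
from typing import Callable, Optional, Sequence


def normalize(text: str) -> str:
    """Case-insensitive comparison key (whitespace stripping is an assumption)."""
    return text.strip().casefold()


def string_verify(
    submitted: str,
    expected: str,
    semantic_fallback: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Return 1.0 for a correct answer, 0.0 otherwise."""
    # Exact (case-insensitive) match is checked first, so the fallback
    # is only consulted when the cheap comparison fails.
    if normalize(submitted) == normalize(expected):
        return 1.0
    # Hypothetical hook standing in for the LLM semantic check.
    if semantic_fallback is not None and semantic_fallback(submitted, expected):
        return 1.0
    return 0.0


def range_verify(
    submitted: float, expected: float, distractors: Sequence[float]
) -> float:
    """Return 1.0 if the submitted value is close enough to the expected one."""
    # ASSUMPTION: tolerance is half the gap to the nearest distractor, so an
    # accepted answer is always closer to the truth than to any distractor.
    gaps = [abs(d - expected) for d in distractors if d != expected]
    tolerance = min(gaps) / 2 if gaps else 0.0
    return 1.0 if abs(submitted - expected) <= tolerance else 0.0
```

Passing the fallback in as a callable keeps the sketch testable offline and makes the ordering explicit: a submission that already matches never triggers the more expensive semantic check.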
Data
Task data consists of a Parquet metadata file and Code Ocean capsules containing biological datasets. Capsules are mounted at /orwd_data/bixbench/capsules/ in production.
Source: futurehouse/BixBench
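For orientation, the capsules available under the mount point can be enumerated with a few lines of Python. This is only a sketch: it assumes each capsule is a subdirectory directly under the capsule root, which this document does not confirm, and `list_capsules` is an illustrative helper rather than part of the environment.

```python
from pathlib import Path

# Production mount point stated above.
PRODUCTION_CAPSULE_ROOT = Path("/orwd_data/bixbench/capsules/")


def list_capsules(root=PRODUCTION_CAPSULE_ROOT):
    """Return sorted names of subdirectories under the capsule root.

    ASSUMPTION: one subdirectory per capsule. Returns an empty list when
    the root does not exist (for example, outside production).
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    for name in list_capsules():
        print(name)
```

Outside production, pass a local directory as `root` to point the helper at wherever the capsules were downloaded.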
Tools
| Tool | Description |
|---|---|
| submit_answer | Submit your answer for binary evaluation. |
| bash | Execute shell commands. |
| glob | Find files by pattern. |
| grep | Search file contents. |
| ls | List directory contents. |
| read | Read file contents. |
| write | Write to files. |
| edit | Edit existing files. |
| multi_edit | Apply multiple edits to a file. |
| todo_write | Track task progress. |
Time Horizon
BixBench is a multi-turn environment. Agents iteratively explore data, write analysis code, and execute computations before submitting a final answer.
Environment Difficulty
Model performance on BixBench from the original paper (open-answer setting):
| Model | Accuracy |
|---|---|
| Claude 3.5 Sonnet | 17% |
| GPT-4o | 9% |
In the multiple-choice setting, the paper reports that frontier models performed no better than random. Together, these results indicate that fully autonomous bioinformatics research remained challenging at the time of the benchmark's release.
Other Environment Requirements
- OpenAI API key: Required for LLM-based fallback grading in string verification. Pass via secrets={"openai_api_key": "..."}.
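To avoid hard-coding the key, the secrets mapping can be built from the process environment. The sketch below assumes the key is stored in an `OPENAI_API_KEY` environment variable (a common convention, not something this document specifies), and `build_secrets` is a hypothetical helper. How the resulting dictionary is passed to the environment depends on the calling framework.

```python
import os


def build_secrets(env=None):
    """Build the secrets mapping expected by the environment.

    ASSUMPTION: the key lives in the OPENAI_API_KEY environment variable.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise KeyError("OPENAI_API_KEY is not set; LLM fallback grading needs it")
    return {"openai_api_key": key}
```

Failing early with a clear error is preferable to discovering a missing key only when the first string answer falls through to the LLM fallback.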
Safety
Agents in BixBench interact with published biological datasets in a sandboxed environment. The environment does not involve human subjects or clinical data requiring special protections.
Citations
@article{mitchener2025bixbench,
author = {Mitchener, Ludovico and Laurent, Jon M and Tenmann, Benjamin and Narayanan, Siddharth and Wellawatte, Geemi P and White, Andrew and Sani, Lorenzo and Rodriques, Samuel G},
title = {BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology},
journal = {arXiv preprint arXiv:2503.00096},
year = {2025},
url = {https://arxiv.org/abs/2503.00096}
}