research-plan-gen

Overview

Environment ID: research-plan-gen
Short description: Benchmark for evaluating AI research planning capabilities using rubric-based LLM judging
Tags: research, planning, llm-judge, eval

Datasets

Primary dataset(s): facebook/research-plan-gen - Research Plan Generation dataset with evaluation rubrics
Source links: HuggingFace, Paper
Subsets:
- ml: 6,872 train / 685 test samples (Machine Learning papers)
- arxiv: 6,573 train / 1,496 test samples (arXiv papers)
- pubmed: 6,423 train / 464 test samples (PubMed papers)

Task

Type: single-turn
Parser: Default (plain text response)
Rubric overview: LLM judge evaluates generated research plans against paper-specific rubric criteria

Each sample contains:

- Goal: A research task or objective to be accomplished
- Rubric: A list of evaluation criteria for assessing the generated plan
- Reference solution: The solution described in the original paper, provided for reference
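
The three fields above can be pictured as a simple record. The following is a minimal sketch, assuming hypothetical field names (`goal`, `rubric`, `reference_solution`) and a hypothetical `build_judge_prompt` helper; the actual dataset column names and the judge prompt used by the environment may differ.

```python
from dataclasses import dataclass


@dataclass
class PlanSample:
    """Hypothetical container mirroring the fields described above."""
    goal: str                 # research task or objective
    rubric: list[str]         # evaluation criteria for the plan
    reference_solution: str   # solution from the original paper


def build_judge_prompt(sample: PlanSample, plan: str) -> str:
    """Assemble an illustrative judge prompt (not the environment's real one)."""
    criteria = "\n".join(f"{i}. {c}" for i, c in enumerate(sample.rubric, 1))
    return (
        f"Goal:\n{sample.goal}\n\n"
        f"Proposed plan:\n{plan}\n\n"
        f"Rubric criteria:\n{criteria}\n"
    )


# Made-up sample content, for illustration only.
sample = PlanSample(
    goal="Test whether data augmentation improves robustness.",
    rubric=["States a clear hypothesis", "Defines a baseline"],
    reference_solution="(reference text from the paper)",
)
prompt = build_judge_prompt(sample, "Train with and without augmentation.")
```

Numbering the criteria in the prompt is one way to let a judge refer to each rubric item individually; it is a design choice of this sketch, not a documented behavior.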

Quickstart

Run an evaluation with default settings:

uv run vf-eval research-plan-gen

Configure model and sampling:

uv run vf-eval research-plan-gen \
  -m gpt-4.1-mini \
  -n 20 -r 3 -t 2048 -T 0.7 \
  -a '{"subset": "ml"}'

Run directly:

uv run python research_plan_gen.py --model gpt-4.1 --subset ml --num-examples 10

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| `subset` | str | `"ml"` | Dataset subset: "ml", "arxiv", or "pubmed" |
| `split` | str | `"test"` | Dataset split: "train" or "test" |
| `num_samples` | int | `None` | Limit on dataset size (None for all) |
| `judge_model` | str | `"gpt-4.1-mini"` | Model to use for LLM judge |
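
Environment arguments are passed to `vf-eval` as a JSON object through the `-a` flag, as in the Quickstart. Below is a small sketch of building that string in Python so the quoting stays correct when several arguments are set. The values in `env_args` are illustrative, and `shlex.quote` is used only to make the JSON safe to paste into a POSIX shell.

```python
import json
import shlex

# Illustrative values; keys match the argument table above.
env_args = {
    "subset": "pubmed",
    "split": "test",
    "num_samples": 50,
    "judge_model": "gpt-4.1-mini",
}

# Serialize to JSON, then shell-quote so the braces and quotes survive.
arg_string = shlex.quote(json.dumps(env_args))
command = f"uv run vf-eval research-plan-gen -a {arg_string}"
print(command)
```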

Metrics

| Metric | Meaning |
| --- | --- |
| `reward` | LLM judge score (0.0-1.0) based on rubric satisfaction |
| `mean_reward` | Average reward across all samples |
| `std_reward` | Standard deviation of rewards |
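
To make the relationship between these metrics concrete, here is a minimal sketch. It assumes the per-sample `reward` is the fraction of rubric criteria the judge marks as satisfied, and that `std_reward` is a population standard deviation; neither detail is confirmed here, and the `rubric_reward` helper is hypothetical.

```python
from statistics import mean, pstdev


def rubric_reward(verdicts: list[bool]) -> float:
    """Fraction of rubric criteria judged satisfied (assumed scoring rule)."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Illustrative judge verdicts for three samples, one bool per criterion.
judged = [
    [True, True, False, True],
    [True, False],
    [True, True, True, True],
]

rewards = [rubric_reward(v) for v in judged]
mean_reward = mean(rewards)
std_reward = pstdev(rewards)  # population std; the environment may use sample std
```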

Citation

@article{goel2025training,
  title={Training AI Co-Scientists Using Rubric Rewards},
  author={Goel, Shashwat and Hazra, Rishi and Jayalath, Dulhan and Willi, Timon and Jain, Parag and Shen, William F and Leontiadis, Ilias and Barbieri, Francesco and Bachrach, Yoram and Geiping, Jonas and Whitehouse, Chenxi},
  journal={arXiv preprint arXiv:2512.23707},
  year={2025}
}