research-plan-gen
Overview
- Environment ID: research-plan-gen
- Short description: Benchmark for evaluating AI research planning capabilities using rubric-based LLM judging
- Tags: research, planning, llm-judge, eval
Datasets
- Primary dataset(s): facebook/research-plan-gen - Research Plan Generation dataset with evaluation rubrics
- Source links: HuggingFace, Paper
- Subsets:
  - ml: 6,872 train / 685 test samples (Machine Learning papers)
  - arxiv: 6,573 train / 1,496 test samples (arXiv papers)
  - pubmed: 6,423 train / 464 test samples (PubMed papers)
Task
- Type: single-turn
- Parser: Default (plain text response)
- Rubric overview: LLM judge evaluates generated research plans against paper-specific rubric criteria
Each sample contains:
- Goal: A research task/objective to be accomplished
- Rubric: List of evaluation criteria for assessing the generated plan
- Reference solution: A reference solution from the original paper
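As a rough illustration of the sample structure above, the sketch below models one sample as a dataclass and formats a judge prompt from it. The `PlanSample` class, its field names (`goal`, `rubric`, `reference_solution`), the `build_judge_prompt` helper, and the prompt wording are all assumptions made for illustration; they are not the dataset's actual schema or the environment's real judge prompt.

```python
from dataclasses import dataclass, field


@dataclass
class PlanSample:
    """Hypothetical container mirroring the three parts of a sample."""
    goal: str
    rubric: list[str] = field(default_factory=list)
    reference_solution: str = ""


def build_judge_prompt(sample: PlanSample, plan: str) -> str:
    """Assemble an illustrative judge prompt; the wording is an assumption."""
    # Number the rubric criteria so the judge can refer to each one.
    criteria = "\n".join(
        f"{i}. {criterion}" for i, criterion in enumerate(sample.rubric, start=1)
    )
    return (
        f"Research goal:\n{sample.goal}\n\n"
        f"Proposed plan:\n{plan}\n\n"
        f"Rubric criteria:\n{criteria}\n\n"
        "For each criterion, answer YES or NO."
    )


# Placeholder values only; real goals and rubrics come from the dataset.
sample = PlanSample(
    goal="Placeholder research goal",
    rubric=["States a testable hypothesis", "Describes an evaluation protocol"],
)
prompt = build_judge_prompt(sample, "Placeholder plan text")
```

Keeping the rubric as a list of separate criteria, rather than one block of text, makes it possible to ask the judge for a verdict per criterion.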
Quickstart
Run an evaluation with default settings:
uv run vf-eval research-plan-gen
Configure model and sampling:
uv run vf-eval research-plan-gen \
-m gpt-4.1-mini \
-n 20 -r 3 -t 2048 -T 0.7 \
-a '{"subset": "ml"}'
Run directly:
uv run python research_plan_gen.py --model gpt-4.1 --subset ml --num-examples 10
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| subset | str | "ml" | Dataset subset: "ml", "arxiv", or "pubmed" |
| split | str | "test" | Dataset split: "train" or "test" |
| num_samples | int | None | Limit on dataset size (None for all) |
| judge_model | str | "gpt-4.1-mini" | Model to use for LLM judge |
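
The arguments in the table can be mirrored in a small validation helper that also produces the JSON string passed to `-a`. This is a sketch only: `EnvArgs` and its checks are hypothetical and are not the environment's actual loader. Only the argument names, defaults, and allowed values are taken from the table.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

VALID_SUBSETS = {"ml", "arxiv", "pubmed"}
VALID_SPLITS = {"train", "test"}


@dataclass
class EnvArgs:
    """Hypothetical mirror of the environment arguments and their defaults."""
    subset: str = "ml"
    split: str = "test"
    num_samples: Optional[int] = None
    judge_model: str = "gpt-4.1-mini"

    def __post_init__(self) -> None:
        # Reject values outside the options listed in the table.
        if self.subset not in VALID_SUBSETS:
            raise ValueError(f"subset must be one of {sorted(VALID_SUBSETS)}")
        if self.split not in VALID_SPLITS:
            raise ValueError(f"split must be one of {sorted(VALID_SPLITS)}")
        if self.num_samples is not None and self.num_samples <= 0:
            raise ValueError("num_samples must be a positive integer or None")


args = EnvArgs(subset="pubmed", num_samples=50)
# JSON string suitable for the -a flag shown in the Quickstart.
args_json = json.dumps(asdict(args))
```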
Metrics
| Metric | Meaning |
|---|---|
| reward | LLM judge score (0.0-1.0) based on rubric satisfaction |
| mean_reward | Average reward across all samples |
| std_reward | Standard deviation of rewards |
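
The table does not spell out how the judge's verdicts are turned into a score. One plausible reading, sketched below as an assumption, is that `reward` is the fraction of rubric criteria the judge marks as satisfied, with `mean_reward` and `std_reward` computed across samples. The function names are hypothetical. Whether the environment reports population or sample standard deviation is not stated either; this sketch uses the population form.

```python
import statistics


def rubric_reward(verdicts: list[bool]) -> float:
    """Fraction of satisfied criteria (assumed aggregation), in [0.0, 1.0]."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


def summarize(rewards: list[float]) -> dict[str, float]:
    """Mean and population standard deviation over per-sample rewards."""
    return {
        "mean_reward": statistics.fmean(rewards),
        "std_reward": statistics.pstdev(rewards),
    }


# Illustrative verdicts for two samples, not real judge output.
rewards = [
    rubric_reward([True, True, False, True]),    # 3 of 4 criteria met
    rubric_reward([True, False, False, False]),  # 1 of 4 criteria met
]
summary = summarize(rewards)
```

With these two samples the per-sample rewards are 0.75 and 0.25, giving a mean of 0.5 and a population standard deviation of 0.25.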
Citation
@article{goel2025training,
title={Training AI Co-Scientists Using Rubric Rewards},
author={Goel, Shashwat and Hazra, Rishi and Jayalath, Dulhan and Willi, Timon and Jain, Parag and Shen, William F and Leontiadis, Ilias and Barbieri, Francesco and Bachrach, Yoram and Geiping, Jonas and Whitehouse, Chenxi},
journal={arXiv preprint arXiv:2512.23707},
year={2025}
}