PLAN GEN RL Env (Dev Team)

Research Plan Generation environment using the facebook/research-plan-gen dataset

Type: RL Env
Publisher: Dev Team
Capabilities: Planning
Runtime: single-turn
License: unknown
Size: v0.1.1
Published: Jan 2026

research-plan-gen

Overview

  • Environment ID: research-plan-gen
  • Short description: Benchmark for evaluating AI research planning capabilities using rubric-based LLM judging
  • Tags: research, planning, llm-judge, eval

Datasets

  • Primary dataset(s): facebook/research-plan-gen - Research Plan Generation dataset with evaluation rubrics
  • Source links: HuggingFace, Paper
  • Subsets:
    • ml: 6,872 train / 685 test samples (Machine Learning papers)
    • arxiv: 6,573 train / 1,496 test samples (arXiv papers)
    • pubmed: 6,423 train / 464 test samples (PubMed papers)
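
When sizing an evaluation run, it can help to keep the split sizes above in a small lookup table. The following is a minimal sketch: the counts are copied from the list above, while `SUBSET_SIZES`, `total`, and `held_out_fraction` are illustrative names that are not part of the environment.

```python
# Split sizes copied from the subset list above.
SUBSET_SIZES = {
    "ml": {"train": 6872, "test": 685},
    "arxiv": {"train": 6573, "test": 1496},
    "pubmed": {"train": 6423, "test": 464},
}


def total(split: str) -> int:
    """Sum one split ("train" or "test") across all subsets."""
    return sum(sizes[split] for sizes in SUBSET_SIZES.values())


def held_out_fraction(subset: str) -> float:
    """Share of a subset's samples that fall in its test split."""
    sizes = SUBSET_SIZES[subset]
    return sizes["test"] / (sizes["train"] + sizes["test"])


if __name__ == "__main__":
    print("train:", total("train"), "test:", total("test"))
    for name in SUBSET_SIZES:
        print(name, round(held_out_fraction(name), 3))
```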

Task

  • Type: single-turn
  • Parser: Default (plain text response)
  • Rubric overview: LLM judge evaluates generated research plans against paper-specific rubric criteria

Each sample contains:

  • Goal: A research task/objective to be accomplished
  • Rubric: List of evaluation criteria for assessing the generated plan
  • Reference solution: The approach taken in the original paper, provided for reference
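
The sample structure and rubric-based judging described above might be sketched as follows. This is an illustration under stated assumptions, not the environment's implementation: the field names (`goal`, `rubric`, `reference_solution`), the prompt wording, and the scoring rule (fraction of rubric criteria judged satisfied) are all hypothetical. The source only says that the judge score lies between 0.0 and 1.0 and reflects rubric satisfaction.

```python
from dataclasses import dataclass


@dataclass
class PlanSample:
    # Hypothetical field names; the dataset's actual column names may differ.
    goal: str
    rubric: list[str]
    reference_solution: str


def build_judge_prompt(sample: PlanSample, plan: str) -> str:
    """Assemble an illustrative judge prompt listing each rubric criterion."""
    criteria = "\n".join(
        f"{i}. {criterion}" for i, criterion in enumerate(sample.rubric, start=1)
    )
    return (
        f"Research goal:\n{sample.goal}\n\n"
        f"Candidate plan:\n{plan}\n\n"
        f"Rubric criteria:\n{criteria}\n\n"
        "For each criterion, answer YES or NO."
    )


def rubric_reward(verdicts: list[bool]) -> float:
    """Assumed scoring rule: fraction of criteria judged satisfied."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)
```

With four criteria of which three are judged satisfied, `rubric_reward([True, False, True, True])` gives 0.75 under this assumed rule.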

Quickstart

Run an evaluation with default settings:

uv run vf-eval research-plan-gen

Configure model and sampling:

uv run vf-eval research-plan-gen \
  -m gpt-4.1-mini \
  -n 20 -r 3 -t 2048 -T 0.7 \
  -a '{"subset": "ml"}'
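
When launching runs from a script, building the `-a` JSON with `json.dumps` avoids hand-quoting mistakes. The sketch below reproduces the command above; `build_eval_command` is a hypothetical helper, and the parameter names reflect an assumed reading of the flags (`-n` examples, `-r` rollouts per example, `-t` max tokens, `-T` temperature), which should be checked against `vf-eval --help`.

```python
import json
import shlex


def build_eval_command(
    model: str,
    env_args: dict,
    num_examples: int = 20,     # assumed meaning of -n
    rollouts: int = 3,          # assumed meaning of -r
    max_tokens: int = 2048,     # assumed meaning of -t
    temperature: float = 0.7,   # assumed meaning of -T
) -> list[str]:
    """Return the vf-eval invocation as an argument list."""
    return [
        "uv", "run", "vf-eval", "research-plan-gen",
        "-m", model,
        "-n", str(num_examples),
        "-r", str(rollouts),
        "-t", str(max_tokens),
        "-T", str(temperature),
        "-a", json.dumps(env_args),  # environment arguments as a JSON string
    ]


if __name__ == "__main__":
    cmd = build_eval_command("gpt-4.1-mini", {"subset": "ml"})
    # shlex.join quotes the JSON argument so it can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run` directly sidesteps shell quoting altogether.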

Run directly:

uv run python research_plan_gen.py --model gpt-4.1 --subset ml --num-examples 10

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| subset | str | "ml" | Dataset subset: "ml", "arxiv", or "pubmed" |
| split | str | "test" | Dataset split: "train" or "test" |
| num_samples | int | None | Limit on dataset size (None for all) |
| judge_model | str | "gpt-4.1-mini" | Model to use for LLM judge |
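
The arguments above can be checked before a run is started. The following is a minimal sketch: `EnvArgs` is a hypothetical container, not the environment's actual loader, and only the names, defaults, and allowed values are taken from the table.

```python
from dataclasses import dataclass
from typing import Optional

VALID_SUBSETS = ("ml", "arxiv", "pubmed")
VALID_SPLITS = ("train", "test")


@dataclass(frozen=True)
class EnvArgs:
    # Defaults mirror the argument table above.
    subset: str = "ml"
    split: str = "test"
    num_samples: Optional[int] = None  # None means use the whole split
    judge_model: str = "gpt-4.1-mini"

    def __post_init__(self) -> None:
        if self.subset not in VALID_SUBSETS:
            raise ValueError(f"subset must be one of {VALID_SUBSETS}")
        if self.split not in VALID_SPLITS:
            raise ValueError(f"split must be one of {VALID_SPLITS}")
        if self.num_samples is not None and self.num_samples <= 0:
            raise ValueError("num_samples must be positive or None")


if __name__ == "__main__":
    print(EnvArgs(subset="pubmed", num_samples=50))
```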

Metrics

| Metric | Meaning |
| --- | --- |
| reward | LLM judge score (0.0-1.0) based on rubric satisfaction |
| mean_reward | Average reward across all samples |
| std_reward | Standard deviation of rewards |
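
The aggregate metrics above can be reproduced from a list of per-sample rewards. This sketch assumes the population standard deviation (`statistics.pstdev`); the source does not say whether the environment uses the population or the sample form. `summarize_rewards` is an illustrative name.

```python
from statistics import fmean, pstdev


def summarize_rewards(rewards: list[float]) -> dict[str, float]:
    """Compute mean_reward and std_reward from per-sample judge scores."""
    if not rewards:
        raise ValueError("at least one reward is required")
    if any(not 0.0 <= r <= 1.0 for r in rewards):
        raise ValueError("rewards are expected to lie in [0.0, 1.0]")
    return {
        "mean_reward": fmean(rewards),
        # Assumption: population standard deviation.
        "std_reward": pstdev(rewards),
    }


if __name__ == "__main__":
    print(summarize_rewards([0.25, 0.5, 0.75, 1.0]))
```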

Citation

@article{goel2025training,
  title={Training AI Co-Scientists Using Rubric Rewards},
  author={Goel, Shashwat and Hazra, Rishi and Jayalath, Dulhan and Willi, Timon and Jain, Parag and Shen, William F and Leontiadis, Ilias and Barbieri, Francesco and Bachrach, Yoram and Geiping, Jonas and Whitehouse, Chenxi},
  journal={arXiv preprint arXiv:2512.23707},
  year={2025}
}