SELF Reward RL Env (Prime Intellect)

An environment in which models score (self-reward) their own responses.

Type: RL Env
Runtime: single-turn
License: unknown
Size: v0.1.1
Published: Dec 2025

self-reward

Overview

  • Environment ID: self-reward
  • Short description: Single-turn evaluation where a judge model scores responses based on a simple scoring prompt.
  • Tags: judge, single-turn, self-reward, openai-compatible

Datasets

  • Primary dataset(s): Any HF dataset with question/answer columns (specified by dataset_name)
  • Source links: Hugging Face Datasets
  • Split sizes: Uses the dataset’s train split by default
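
Since any dataset with question/answer columns is accepted, it can help to check a few rows before starting a run. The following is a minimal sketch, not part of the environment: `missing_columns` is a hypothetical helper, and the rows are plain dictionaries standing in for dataset records.

```python
from typing import Iterable, Mapping

# Column names taken from the dataset description above.
REQUIRED_COLUMNS = ("question", "answer")


def missing_columns(rows: Iterable[Mapping[str, object]]) -> list[str]:
    """Return the required columns that are absent from at least one row.

    Hypothetical helper for illustration; the environment may validate
    its input differently.
    """
    missing = set()
    for row in rows:
        for col in REQUIRED_COLUMNS:
            if col not in row:
                missing.add(col)
    return sorted(missing)


sample = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is 3 * 3?"},  # this row has no answer column
]
print(missing_columns(sample))
```

An empty list means every inspected row carries both columns.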

Task

  • Type: single-turn
  • Rubric overview: JudgeRubric uses a judge client/model/prompt to produce a 0–1 score
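
To make the 0–1 score concrete, here is a minimal sketch of how a free-text judge reply could be turned into a reward. It is an assumption for illustration, not the JudgeRubric implementation: `parse_judge_score` and its `max_score` parameter are hypothetical, and the actual scoring prompt and parsing may differ.

```python
import re


def parse_judge_score(reply: str, max_score: float = 1.0) -> float:
    """Map the first number found in a judge reply to the range [0, 1].

    Hypothetical helper: takes the first numeric token, divides it by
    max_score, and clamps the result.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        # Assumption: a reply with no number earns no reward.
        return 0.0
    value = float(match.group()) / max_score
    return min(1.0, max(0.0, value))


print(parse_judge_score("Score: 0.8"))
print(parse_judge_score("7/10", max_score=10))
print(parse_judge_score("I cannot rate this."))
```

Taking the first number is deliberately naive; a reply such as "Question 1 deserves 0.5" would be read as 1, so a stricter output format in the scoring prompt is usually the better fix.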

Quickstart

Run an evaluation with default settings (example):

prime eval run self-reward -a '{"dataset_name": "your/dataset", "model_name": "Qwen/Qwen3-0.6B"}'

Configure model and sampling:

prime eval run self-reward \
  -m gpt-4.1-mini \
  -n 20 -r 3 -t 1024 -T 0.7 \
  -a '{"dataset_name": "your/dataset", "judge_model": "Qwen/Qwen3-0.6B", "base_url": "http://0.0.0.0:8000/v1", "api_key_var": "JUDGE_API_KEY"}'

Notes:

  • Use -a / --env-args to pass environment-specific configuration as a JSON object.
  • Reports are written under ./environments/self_reward/reports/ and auto-embedded below.
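
Hand-writing the JSON object for -a inside shell quotes is easy to get wrong. As a sketch, the second command above can be assembled in Python so that the JSON is serialized and quoted automatically. Only the flags and values already shown in this section are used; running the printed command still assumes the prime CLI is installed.

```python
import json
import shlex

# Environment-specific configuration, passed as one JSON object via -a.
env_args = {
    "dataset_name": "your/dataset",
    "judge_model": "Qwen/Qwen3-0.6B",
    "base_url": "http://0.0.0.0:8000/v1",
    "api_key_var": "JUDGE_API_KEY",
}

command = [
    "prime", "eval", "run", "self-reward",
    "-m", "gpt-4.1-mini",
    "-n", "20", "-r", "3", "-t", "1024", "-T", "0.7",
    "-a", json.dumps(env_args),
]

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```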

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_name | str | "PrimeIntellect/Hendrycks-Math" | HF dataset name or path containing question/answer |
| judge_model | str | "gpt-4.1-mini" | Judge model name (OpenAI-compatible) |
| base_url | str | "http://0.0.0.0:8000/v1" | Judge API base URL |
| api_key_var | str | "JUDGE_API_KEY" | Env var for judge API key (optional, defaults to "EMPTY" if not set) |
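
The defaults and the API-key fallback described above can be sketched as a small configuration object. `JudgeConfig` is a hypothetical name used only for illustration; the environment's own loader may be structured differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    """Hypothetical container mirroring the documented arguments and defaults."""

    dataset_name: str = "PrimeIntellect/Hendrycks-Math"
    judge_model: str = "gpt-4.1-mini"
    base_url: str = "http://0.0.0.0:8000/v1"
    api_key_var: str = "JUDGE_API_KEY"

    def api_key(self) -> str:
        # Read the key from the named environment variable,
        # falling back to "EMPTY" when the variable is not set.
        return os.environ.get(self.api_key_var, "EMPTY")


cfg = JudgeConfig(judge_model="Qwen/Qwen3-0.6B")
print(cfg.judge_model, cfg.base_url)
```

The "EMPTY" fallback is convenient for local OpenAI-compatible servers that do not check the key.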

Metrics

| Metric | Meaning |
| --- | --- |
| reward | Judge-produced score, normalized to 0–1 |
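
When several rollouts are sampled per example, the per-rollout rewards are typically summarized as averages. The sketch below assumes rewards are grouped as one list per example; the grouping and the `mean_reward` and `per_example_means` helpers are hypothetical and do not describe the report format.

```python
from statistics import fmean


def per_example_means(rewards: list[list[float]]) -> list[float]:
    """Average the rollout rewards of each example."""
    return [fmean(rollouts) for rollouts in rewards]


def mean_reward(rewards: list[list[float]]) -> float:
    """Average the reward over every rollout of every example."""
    return fmean(r for rollouts in rewards for r in rollouts)


# Two examples with two rollouts each; every reward lies in [0, 1].
rewards = [[1.0, 0.5], [0.0, 0.5]]
print(per_example_means(rewards))
print(mean_reward(rewards))
```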