
NON Refusal RL Env (Chakra Labs)


A simple benchmark for determining how much a model refuses to answer.

Type: RL Env
Publisher: Chakra Labs
Capabilities: Safety
Runtime: single-turn
License: unknown
Size: v0.1.0
Published: Aug 2025


non-refusal

Overview

  • Environment ID: non-refusal
  • Short description: A simple benchmark for determining how much a model refuses to answer. Useful for white-hat safety testing and for applying GRPO to iteratively remove refusals. It is meant to answer the question: "How much does it cost to remove safety from an OSS model?"
  • Tags: refusal, safety, eval
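
To make the GRPO use case above concrete, here is a minimal sketch of how per-rollout non-refusal rewards could be turned into group-relative advantages. It is not the environment's or any trainer's actual code: the function name, the `eps` value, the example rewards, and the choice of population standard deviation are all assumptions for illustration, and real GRPO implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Hypothetical helper: each reward is centered on the group mean and
    scaled by the group standard deviation (population std, an assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up non-refusal probabilities for 4 rollouts of a single prompt.
rewards = [0.05, 0.10, 0.90, 0.95]
advantages = group_relative_advantages(rewards)

# Rollouts that refused less than the group average get a positive advantage.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

When every rollout in a group receives the same reward, all advantages are zero, so a prompt the model always refuses (or never refuses) contributes no learning signal under this scheme.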

Datasets

  • Primary dataset(s): NousResearch/RefusalDataset
  • Source links: Link
  • Split sizes: 166 / 0 (all 166 examples in one split; no test set)

Task

  • Type: single-turn
  • Parser: vf.Parser
  • Rubric overview: Uses Nous Research's Minos model to judge the non-refusal probability of the model's response.
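
As a rough illustration of the rubric above, the sketch below turns a two-class classifier's raw logits into a non-refusal probability that can serve as a reward. It does not call Minos itself. The label names, the label order, and the example logits are assumptions, so check the judge model's own configuration before relying on them.

```python
import math

# Assumed label name for the "did not refuse" class; verify against the
# judge model's config before use.
NON_REFUSAL_LABEL = "Non-refusal"


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def non_refusal_reward(logits, labels):
    """Return the probability mass assigned to the non-refusal class."""
    probs = softmax(logits)
    return probs[labels.index(NON_REFUSAL_LABEL)]


# Hypothetical label order and logits for one model response.
labels = ["Non-refusal", "Refusal"]
reward = non_refusal_reward([2.0, -1.0], labels)
print(f"non-refusal probability: {reward:.3f}")
```

Using the probability directly, rather than a thresholded 0/1 label, gives a smoother reward signal, which tends to matter when rewards are later compared within a group of rollouts.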

Quickstart

Run an evaluation with default settings:

uv run vf-eval non-refusal

Configure model and sampling:

uv run vf-eval non-refusal -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7 -a '{"key": "value"}'  # env-specific args as JSON
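
Hand-quoting the JSON passed to -a is easy to get wrong, so here is a small sketch that assembles the same command from Python using only the standard library. The flag descriptions in the comments are assumptions based on common vf-eval usage, and {"key": "value"} is the Quickstart's own placeholder, not a real environment argument.

```python
import json
import shlex

# Placeholder env-specific args, copied from the Quickstart example.
env_args = {"key": "value"}

argv = [
    "uv", "run", "vf-eval", "non-refusal",
    "-m", "gpt-4.1-mini",        # model to evaluate
    "-n", "20",                  # assumed: number of examples
    "-r", "3",                   # assumed: rollouts per example
    "-t", "1024",                # assumed: max tokens per response
    "-T", "0.7",                 # assumed: sampling temperature
    "-a", json.dumps(env_args),  # env-specific args serialized as JSON
]

# shlex.join applies shell-safe quoting to each argument.
command = shlex.join(argv)
print(command)
```

The list in `argv` can also be passed directly to `subprocess.run`, which avoids shell quoting altogether.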

Will add more details here after some eval and training runs.