non-refusal
Overview
- Environment ID: non-refusal
- Short description: A simple benchmark that measures how often a model refuses to answer. Useful for white-hat safety testing and for applying GRPO to iteratively remove refusals. Meant to answer the question: "How much does it cost to remove safety from an OSS model?"
- Tags: refusal, safety, eval
Datasets
- Primary dataset(s): NousResearch/RefusalDataset
- Source links: Link
- Split sizes: 166/0 (no test set)
Task
- Type: single-turn
- Parser: vf.Parser
- Rubric overview: Uses Nous Research's Minos model to judge the non-refusal probability of the model's response.
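To make the rubric idea concrete, here is a minimal sketch of how a judge's output could be turned into a reward. It assumes the judge is a two-class classifier that returns raw logits and that the non-refusal class sits at index 0; both are assumptions for illustration, not a description of how Minos or this environment is actually wired. The function names are hypothetical.

```python
import math
from statistics import mean


def non_refusal_probability(logits, non_refusal_index=0):
    """Softmax over classifier logits; return the non-refusal class probability.

    The class order (non-refusal at index 0) is an assumption for this sketch.
    """
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    return exps[non_refusal_index] / sum(exps)


def mean_reward(per_rollout_logits, non_refusal_index=0):
    """Average the non-refusal probability across several rollouts."""
    return mean(
        non_refusal_probability(logits, non_refusal_index)
        for logits in per_rollout_logits
    )


# Made-up logits for three rollouts: a clear answer, a clear refusal, a toss-up.
example_logits = [[2.0, -1.0], [-0.5, 1.5], [0.0, 0.0]]
rewards = [non_refusal_probability(logits) for logits in example_logits]
print(rewards, mean_reward(example_logits))
```

Using the probability directly, rather than a thresholded 0/1 label, gives a smoother signal, which tends to matter when the score is later used as a training reward.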
Quickstart
Run an evaluation with default settings:
uv run vf-eval non-refusal
Configure model and sampling:
uv run vf-eval non-refusal -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7 -a '{"key": "value"}' # env-specific args as JSON
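When scripting several runs, quoting the JSON passed to -a by hand is easy to get wrong. The sketch below builds the same command from Python values and lets json.dumps and shlex.join handle the quoting. The helper name and its parameter names are hypothetical, and the parameter names reflect an assumed reading of the flags (-n as example count, -r as rollouts per example, -t as max tokens, -T as temperature); only the flags themselves come from the command above.

```python
import json
import shlex


def build_eval_command(env_id, model, num_examples, rollouts,
                       max_tokens, temperature, env_args):
    """Return the vf-eval invocation as an argument list (hypothetical helper)."""
    return [
        "uv", "run", "vf-eval", env_id,
        "-m", model,
        "-n", str(num_examples),
        "-r", str(rollouts),
        "-t", str(max_tokens),
        "-T", str(temperature),
        "-a", json.dumps(env_args),  # env-specific args serialized as JSON
    ]


cmd = build_eval_command(
    "non-refusal", "gpt-4.1-mini", 20, 3, 1024, 0.7, {"key": "value"}
)
print(shlex.join(cmd))  # shell-safe string for copy and paste
```

The argument list can also be passed directly to subprocess.run, which avoids shell quoting altogether.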
Will add more details here after some eval and training runs.