Discovery30s

Fresh

Discovery30s is a benchmark that tests whether vintage language models can reproduce scientific discoveries made after their training data cutoff. We construct the benchmark by taking a known discovery, e.g. Hückel's rule, and then breaking it down into a "question lad…

Type: RL Env
Runtime: ORS
License: unknown
Size: 196 tasks
Published: Feb 2026

Public scores on this env: 1 vf-eval report across 1 model