seedeval

Environment ID: seedeval
Short description:
Tags: tool-use, synthetic-data
Source: [charvibannur/seed-eval-benchmark](https://github.com/charvibannur/seed-eval-benchmark/tree/main)

A benchmark for evaluating the diversity of seeds generated through different tool calls.

seed-eval-benchmark/
└── environments/
    └── seedeval/
        ├── README.md
        ├── pyproject.toml
        └── seedeval.py

Run an evaluation with default settings:

prime eval run seedeval

Configure model and sampling:

prime eval run seedeval -m gpt-4.1-mini -n 20 -r 3

The rubric emits the following metrics:

| Metric | Meaning |
| ------ | ------- |
| `Diversity Score` | Measures how semantically different the generated seeds are from one another |
| `Education Score` | LLM-judge rating of overall seed quality on a 0–5 scale (similar to FineWeb-Edu) |
| `Not Proper Noun Ratio` | Ratio of words that are not names, dates, or places to the total number of generated seeds |
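
To make the idea behind `Diversity Score` concrete, here is a minimal sketch of one common way to measure semantic spread: the mean pairwise cosine distance between seed embeddings. This is an assumption for illustration only; the README does not state how `seedeval.py` computes the score, and the helper names `cosine_similarity` and `diversity_score`, as well as the toy 2-D vectors standing in for real sentence embeddings, are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversity_score(embeddings):
    """Mean pairwise cosine distance (1 - similarity) over all seed pairs.

    Hypothetical helper, not taken from seedeval.py.
    Returns 0.0 when fewer than two embeddings are given.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D vectors standing in for real seed embeddings (assumption).
identical = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(diversity_score(identical))  # seeds that all point the same way
print(diversity_score(spread))     # seeds that point in different directions
```

Under this sketch, a set of near-duplicate seeds scores close to 0, and the score grows as the seeds spread out in embedding space.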