seedeval
A benchmark for evaluating the diversity of seeds generated using different tool calls.
Overview
- Environment ID: seedeval
- Short description: Benchmark for evaluating the diversity of seeds generated using different tool calls.
- Tags: tool-use, synthetic-data
- Source: charvibannur/seed-eval-benchmark
Datasets
- Primary dataset(s): FineWeb (works with any text dataset)
Architecture
seed-eval-benchmark/
└── environments/
    └── seedeval/
        ├── README.md
        ├── pyproject.toml
        └── seedeval.py
Quickstart
Run an evaluation with default settings:
prime eval run seedeval
Configure model and sampling:
prime eval run seedeval -m gpt-4.1-mini -n 20 -r 3
Metrics
The rubric emits the following metrics:
| Metric | Meaning |
|---|---|
| Diversity Score | Measures how semantically different the generated seeds are from one another |
| Education Score | Uses an LLM judge to rate overall seed quality on a 0–5 scale (similar to FineWeb-Edu) |
| Not Proper Noun Ratio | Ratio of words that are not names, dates, or places to the total number of generated seeds |
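
To make the Diversity Score more concrete, here is a minimal sketch of one common way to measure semantic spread: the mean pairwise cosine distance between seed vectors. This is an illustration under stated assumptions, not the code in seedeval.py. The function names `embed`, `cosine_similarity`, and `diversity_score` are hypothetical, and the bag-of-words `embed` is only a dependency-free stand-in for whatever embedding model the environment actually uses.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real sentence embedding)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def diversity_score(seeds):
    """Mean pairwise cosine distance; 0.0 when there are fewer than two seeds."""
    vectors = [embed(seed) for seed in seeds]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    seeds = [
        "photosynthesis in desert plants",
        "history of the printing press",
        "photosynthesis in aquatic plants",
    ]
    print(f"diversity: {diversity_score(seeds):.3f}")
```

With this definition, a set of near-identical seeds scores close to 0 and a set with no shared vocabulary scores 1. Swapping `embed` for a dense sentence-embedding model would capture semantic rather than purely lexical differences.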