loong-seed-graph-discrete-math
Overview
- Environment ID: loong-seed-graph-discrete-math
- Short description: An environment containing 178 seed questions (53 training, 125 test) in graph and discrete math. The seed data are real, human-vetted examples collected from the NetworkX library's documentation.
- Tags: reasoning, graph theory
- Contributors: CAMEL-AI.org
Datasets
- Primary dataset(s): loong-seed-graph-discrete-math
- Source links: https://huggingface.co/datasets/camel-ai/loong
- Split sizes: 53 training, 125 test
Task
- Type: single-turn
- Parser: The extractor returns the last occurrence of text following "Final Answer:" (case-insensitive) from the input string, or an empty string if none is found.
- Rubric overview: The rubric awards a score of 1.0 if the model’s response and the reference answer are equivalent—recursively comparing numbers (with float tolerance), lists, tuples, and dictionaries—and 0.0 otherwise.
| Criterion | Description | Reward / effect |
|---|---|---|
| Float comparison | If both response and answer are floats, they are considered correct if they are equal within a small tolerance (1e-6) using math.isclose. | 1.0 if within tolerance, else 0.0 |
| List/Tuple comparison | Both must be lists or tuples of the same length, with elements recursively compared. | 1.0 if all elements match, else 0.0 |
| Dictionary comparison | Both must be dicts with the same keys; values are recursively compared. | 1.0 if all values match, else 0.0 |
| String/Other comparison | All other types are compared directly with ==. | 1.0 if equal, else 0.0 |
| Parsing | Both response and answer are parsed via ast.literal_eval if possible. If parsing fails, raw string comparison is used. | Enables structured comparisons for numbers, lists, tuples, and dicts. |
| Recursive checking | Nested structures are compared recursively, so lists/dicts containing floats or other lists/dicts are handled. | Ensures correctness at all levels of the data structure. |
| Final reward | The function outputs 1.0 if deep_compare returns True, else 0.0. | Numerical reward for model evaluation. |
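The parser and rubric described above can be sketched as follows. This is a minimal illustration built only from the description in this README, not the environment's actual source: the helper names `extract_final_answer`, `parse_literal`, and `reward` are hypothetical, stripping whitespace from the extracted answer is an assumption, and applying the 1e-6 tolerance as both relative and absolute tolerance in `math.isclose` is also an assumption.

```python
import ast
import math
import re

# Hypothetical names throughout; the real environment code may differ.
_FINAL_ANSWER = re.compile(r"final answer:", re.IGNORECASE)


def extract_final_answer(text: str) -> str:
    """Return the text after the last 'Final Answer:' marker, or '' if absent."""
    matches = list(_FINAL_ANSWER.finditer(text))
    if not matches:
        return ""
    # Assumption: surrounding whitespace is not significant.
    return text[matches[-1].end():].strip()


def parse_literal(value: str):
    """Parse a Python literal if possible; otherwise keep the raw string."""
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        return value


def deep_compare(a, b, tol: float = 1e-6) -> bool:
    """Recursively compare floats, lists/tuples, and dicts; use == otherwise."""
    if isinstance(a, float) and isinstance(b, float):
        # Assumption: the tolerance applies both relatively and absolutely.
        return math.isclose(a, b, rel_tol=tol, abs_tol=tol)
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return len(a) == len(b) and all(
            deep_compare(x, y, tol) for x, y in zip(a, b)
        )
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            deep_compare(a[k], b[k], tol) for k in a
        )
    return a == b


def reward(response: str, answer: str) -> float:
    """Score 1.0 if the extracted answer matches the reference, else 0.0."""
    parsed_response = parse_literal(extract_final_answer(response))
    parsed_answer = parse_literal(answer)
    return 1.0 if deep_compare(parsed_response, parsed_answer) else 0.0
```

For example, under these assumptions `reward("... Final Answer: [0.5, (1, 2)]", "[0.50000001, (1, 2)]")` scores 1.0, because the two floats differ by less than the tolerance and the nested tuples match element by element.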
Quickstart
Run an evaluation with default settings:
uv run vf-eval loong-seed-graph-discrete-math
Configure model and sampling:
uv run vf-eval loong-seed-graph-discrete-math -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7 -a '{"key": "value"}' # env-specific args as JSON
Notes:
- Use -a/--env-args to pass environment-specific configuration as a JSON object.
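Because the value passed to -a must be valid JSON and survive shell quoting, it can be convenient to generate the command programmatically. The sketch below is an illustrative helper only (the function name `build_vf_eval_command` is hypothetical, and the `{"key": "value"}` argument is the placeholder from the command above, not a documented option of this environment).

```python
import json
import shlex


def build_vf_eval_command(env_id: str, env_args: dict) -> str:
    """Build a shell-safe `uv run vf-eval` command with JSON-encoded env args."""
    # json.dumps guarantees valid JSON; shlex.join quotes it for a POSIX shell.
    argv = ["uv", "run", "vf-eval", env_id, "-a", json.dumps(env_args)]
    return shlex.join(argv)


if __name__ == "__main__":
    print(build_vf_eval_command("loong-seed-graph-discrete-math", {"key": "value"}))
```

Encoding the arguments with `json.dumps` rather than writing the JSON by hand avoids quoting mistakes such as single-quoted keys, which are not valid JSON.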
Metrics
Key metrics emitted by the rubric:
| Metric | Meaning |
|---|---|
| reward | Main scalar reward: 1.0 if the extracted final answer is equivalent to the reference answer, else 0.0 |