CARD Generator RL Env (Community)

CARD Generator RL Env (Community) is an RL environment.

Type: RL Env
License: MIT
Published: May 2025

Metric Card Generator Environment

Design and Motivation

This environment prompts a language model to generate structured JSON configurations for Metric Card UI components used in AI model evaluation dashboards. It demonstrates a closed-loop pipeline of generation, evaluation, and visualization built on Atropos.

The environment challenges language models to produce well-structured, valid JSON metric card configurations that can be directly used in front-end applications. This tests the model's ability to:

  • Follow specific schema requirements
  • Generate complex nested structures
  • Maintain consistent JSON formatting
  • Create semantically meaningful metric descriptions
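
As an illustration of the kind of nested structure involved, the sketch below builds one metric card and round-trips it through JSON. The field names (`componentType`, `props`, `title`, `value`, `description`, `trend`) are assumptions made for this example; the schema the environment actually enforces is not reproduced here.

```python
import json

# Hypothetical metric card; all field names are illustrative assumptions,
# not the schema the environment actually enforces.
example_card = {
    "componentType": "metricCard",
    "props": {
        "title": "Validation Accuracy",
        "value": "92.4%",
        "description": "Share of held-out examples classified correctly.",
        "trend": {"direction": "up", "delta": "+1.3%"},
    },
}

# Serialize and parse back to confirm the structure survives a JSON round trip.
serialized = json.dumps(example_card, indent=2)
round_tripped = json.loads(serialized)
print(serialized)
```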

Quickstart

# Install dependencies
pip install -r requirements.txt

# Run the environment with the process command to generate rollouts
python metric_card_generator.py process --env.data_path_to_save_groups artifacts/metric_rollouts.jsonl

# View the generated HTML visualization
# Open artifacts/metric_rollouts.html in a browser

Environment Components

  • metric_card_generator.py: Main environment implementation with prompting and evaluation logic
  • extract_metric_training.py: Utility to extract high-quality examples for training
  • trainingDataScript.py: Creates training datasets from collected examples
  • show_score_distribution.py: Visualization tool for analyzing model performance

Artifacts

The artifacts folder includes:

  • metric_rollouts.jsonl: Raw model outputs with scores
  • metric_rollouts.html: Visualization of model outputs and scores
  • metric_training.jsonl: Processed examples suitable for fine-tuning
  • metric_training_high_quality.jsonl: Filtered high-quality examples
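
A filtering step like the one that separates high-quality examples could be sketched as follows. This assumes each JSONL line is an object with a numeric `score` field and uses a hypothetical threshold of 0.8; the real record layout and cutoff used by extract_metric_training.py may differ.

```python
import json
from typing import Iterable, Iterator


def filter_high_quality(lines: Iterable[str], threshold: float = 0.8) -> Iterator[dict]:
    """Yield records whose assumed `score` field meets the threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("score", 0.0) >= threshold:
            yield record


# In-memory sample lines standing in for a rollouts file (made-up data).
sample = [
    '{"output": "card A", "score": 0.95}',
    '{"output": "card B", "score": 0.10}',
]
kept = list(filter_high_quality(sample))
print(len(kept))
```

To run it against a file, pass an open file handle in place of `sample`, since a text file iterates line by line.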

Evaluation Metrics

The environment evaluates model outputs on several dimensions:

  • JSON Validity: Whether the output is valid, parseable JSON
  • Schema Compliance: Whether the output follows the required structure
  • Semantic Quality: Whether the metrics described make sense for the given context
  • Formatting: Proper nesting, field types, and attribute consistency
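
The first, second, and fourth checks above can be approximated with the standard library alone. The sketch below is a minimal illustration, not the environment's scoring code: the required fields in `REQUIRED_PROPS` and the score names are assumptions, and semantic quality is left out because it needs a judgment that simple structural checks cannot make.

```python
import json

# Hypothetical required fields and types; the real schema may differ.
REQUIRED_PROPS = {"title": str, "value": str, "description": str}


def score_output(text: str) -> dict:
    """Return per-dimension scores in [0, 1] for one model output."""
    scores = {"json_valid": 0.0, "schema": 0.0, "formatting": 0.0}
    try:
        card = json.loads(text)
    except json.JSONDecodeError:
        return scores  # unparseable output scores zero everywhere
    scores["json_valid"] = 1.0

    props = card.get("props") if isinstance(card, dict) else None
    if not isinstance(props, dict):
        return scores
    # Schema compliance: fraction of required fields that are present.
    present = [name for name in REQUIRED_PROPS if name in props]
    scores["schema"] = len(present) / len(REQUIRED_PROPS)
    # Formatting: fraction of present fields that have the expected type.
    if present:
        typed = [n for n in present if isinstance(props[n], REQUIRED_PROPS[n])]
        scores["formatting"] = len(typed) / len(present)
    return scores
```

Returning separate scores per dimension, rather than a single number, keeps it visible which requirement a given output failed.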

WandB Integration

Performance metrics are logged to Weights & Biases, including:

  • Percent of valid JSON responses
  • Average scores across evaluation criteria
  • Token usage efficiency
  • Examples of best and worst performing generations
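
Run-level figures such as the percentage of valid JSON can be derived from per-output scores before logging. The sketch below assumes each output has a score dict with hypothetical `json_valid` and `schema` keys; the metric names the environment actually logs are not shown in this document.

```python
from statistics import mean
from typing import Dict, List


def summarize(scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate per-output score dicts into run-level metrics."""
    if not scores:
        return {"percent_valid_json": 0.0, "avg_schema": 0.0}
    return {
        "percent_valid_json": 100.0 * mean(s["json_valid"] for s in scores),
        "avg_schema": mean(s["schema"] for s in scores),
    }


# Made-up scores for two outputs: one valid card and one unparseable response.
summary = summarize([
    {"json_valid": 1.0, "schema": 1.0},
    {"json_valid": 0.0, "schema": 0.0},
])
print(summary)
```

A dictionary of this shape is what `wandb.log` accepts, so the result could be passed to it directly inside an active run.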

Use with Training

This environment can be integrated into the Atropos training loop to improve a model's ability to generate structured JSON output:

# Example training command
python example_trainer/trainer.py --environment metric_card_generator --model your_model --iterations 1000