Scene EVAL RL Env (Genime)

Environment for evaluating React component scene generation

Type: RL Env
Publisher: Genime
License: unknown
Size: v0.1.0
Published: Aug 2025

vf-react-scene-eval

Environment for evaluating language models on React component generation for video scenes. It prompts a model to generate React components for Remotion video scenes, renders them to images, and uses a multimodal LLM judge to assess design quality.

Overview

This environment tests models' ability to:

  • Generate syntactically correct React/TSX components for Remotion
  • Create visually appealing scene layouts with proper spacing and alignment
  • Follow Remotion animation patterns and component structure
  • Include specified text content accurately
  • Export components with proper structure (export default ComponentName)

The evaluation pipeline consists of:

  1. Component Generation: The model writes React component code inside <code></code> XML tags
  2. Code Parsing: Tree-sitter extracts the component and validates its structure
  3. Remotion Rendering: The component's final frame is rendered to a PNG image
  4. Design Evaluation: GPT-4o-mini scores the design quality of the rendered image

Architecture

Core Components

  • ReactSceneEnvironment: Main environment class inheriting from verifiers.Environment
  • SceneComponentParser: Tree-sitter based parser for React/TSX code extraction
  • DesignJudgeRubric: Multimodal LLM judge for design quality assessment
  • RemotionRenderer: Handles component rendering using Remotion CLI

Evaluation Pipeline

User Query → Model Generation → XML Extraction → Component Parsing → Remotion Rendering → Design Evaluation

Scoring System

The environment uses a binary reward system:

Parser Validation (Weight: 1.0)

  • Score: 0.0 if the code is valid, -1.0 if it is invalid
  • Checks: Imports, exports, syntax errors via tree-sitter
  • Requirement: Code must parse successfully and have proper structure

Design Quality Evaluation (Weight: 1.0)

Three binary criteria evaluated by GPT-4o-mini on rendered images:

  1. Space Utilization (Binary: 0 or 1)

    • Effective use of whitespace and layout areas
    • Avoids overcrowding and excessive empty space
    • Maintains comfortable margins and visual weight distribution
  2. Alignment & Spatial Relationships (Binary: 0 or 1)

    • Consistent alignment (horizontal and vertical)
    • Coherent grid/layout structure
    • No overlapping or partially visible elements
  3. Text Correctness (Binary: 0 or 1)

    • Required text is present and correctly spelled
    • Text is not truncated, clipped, or overlapped
    • Sufficient legibility with proper size and contrast
    • Scores 1 if no specific text was requested

Final Scoring

  • Success: reward = 1.0 (all criteria met)
  • Failure: reward = -1.0 (any failure: parsing, rendering, or design)
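
As a rough illustration of the rule above, the sketch below collapses the parser, rendering, and judge outcomes into a single reward. DesignScores and final_reward are hypothetical names chosen for this example; they are not taken from the environment's code, and the real aggregation in rubrics.py may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DesignScores:
    """Hypothetical container for the three binary judge criteria."""
    space_utilization: int
    alignment: int
    text_correctness: int


def final_reward(parsed_ok: bool, rendered_ok: bool,
                 scores: Optional[DesignScores]) -> float:
    """Return 1.0 only if every stage succeeded, otherwise -1.0."""
    # A parsing or rendering failure means there is no image to judge.
    if not parsed_ok or not rendered_ok or scores is None:
        return -1.0
    all_met = (
        scores.space_utilization == 1
        and scores.alignment == 1
        and scores.text_correctness == 1
    )
    return 1.0 if all_met else -1.0
```

Under this sketch, a component that parses and renders but fails a single design criterion receives the same -1.0 as one that never parsed, which matches the all-or-nothing scoring described above.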

System Prompts

The environment uses YAML-configured prompts:

Generation Prompt

  • Instructs models to create Remotion components
  • Requires <code></code> XML wrapper for extraction
  • Specifies totalFrames variable and export structure
  • Emphasizes final frame completeness for evaluation
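
The <code></code> wrapper exists so the component can be pulled out of the model's reply before parsing. A minimal sketch of that extraction step follows, using a regular expression. The function name extract_code and the choice to keep the last block when several are present are assumptions for illustration; the environment's own parser may behave differently.

```python
import re
from typing import Optional

# Non-greedy match so that several <code> blocks are found separately;
# DOTALL lets "." span the newlines inside the component source.
CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def extract_code(completion: str) -> Optional[str]:
    """Return the contents of the last <code></code> block, or None."""
    matches = CODE_BLOCK.findall(completion)
    if not matches:
        return None
    # Assumption: if the model emits several blocks, the last one is final.
    return matches[-1].strip()
```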

Judge Prompt

  • Evaluates rendered image against design criteria
  • Structured JSON output with reasoning and binary scores
  • Uses multimodal GPT-4o-mini with image input
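
Because the judge replies with structured JSON, its output has to be validated before it can feed the reward. The sketch below shows one way to do that with the standard library only. The key names mirror the metric names listed under Results Structure plus a reasoning field, but the exact schema is an assumption; the project itself lists pydantic as a dependency and defines a ScoreReason model in rubrics.py that is not reproduced here.

```python
import json
from typing import Any, Dict

CRITERIA = ("space_utilization", "alignment", "text_correctness")


def parse_judge_output(raw: str) -> Dict[str, Any]:
    """Parse the judge's JSON reply and check that each score is 0 or 1."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    result: Dict[str, Any] = {"reasoning": str(data.get("reasoning", ""))}
    for name in CRITERIA:
        value = data.get(name)
        # Reject missing keys and anything that is not a binary score.
        if value not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1, got {value!r}")
        result[name] = int(value)
    return result
```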

Usage

Basic Setup

from single_turn_env import ReactSceneEnvironment
from datasets import Dataset

# Create test dataset
queries = [
    "Create a simple title scene with big bold text 'Hello World' centered.",
    "Design a lower-third banner with the name 'Alex Doe' and a subtitle.",
]
dataset = Dataset.from_dict({"question": queries})

# Initialize environment
env = ReactSceneEnvironment(dataset=dataset)

Evaluation

import asyncio
from openai import AsyncOpenAI

async def evaluate():
    client = AsyncOpenAI()
    
    # Format inputs
    formatted = env.get_dataset()
    inputs = {
        "prompt": list(formatted["prompt"]),
        "info": [{} for _ in range(len(formatted))],
    }
    
    # Run evaluation
    results = await env.a_generate(
        inputs=inputs,
        client=client,
        model="gpt-4o-mini",
        sampling_args={"temperature": 0.2},
        score_rollouts=True,
    )
    
    return results

results = asyncio.run(evaluate())

Results Structure

# Results contain:
results.completion  # Generated component code
results.reward      # Binary rewards: 1.0 (success) or -1.0 (failure)
results.metrics     # Detailed metrics:
#   - parser_reward_func: Parser validation score
#   - space_utilization: Space usage score  
#   - alignment: Alignment quality score
#   - text_correctness: Text accuracy score
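
To turn per-rollout results into headline numbers, something like the following helper can be used. It is a sketch that assumes results.reward is a list of floats and results.metrics is a dict mapping each metric name to a list of per-rollout values; summarize is a hypothetical name, not part of the environment.

```python
from statistics import mean
from typing import Dict, List


def summarize(rewards: List[float],
              metrics: Dict[str, List[float]]) -> Dict[str, float]:
    """Compute the success rate and the mean of each metric."""
    # A rollout counts as a success only when its reward is exactly 1.0.
    summary = {"success_rate": mean(1.0 if r == 1.0 else 0.0 for r in rewards)}
    for name, values in metrics.items():
        summary[f"mean_{name}"] = mean(values)
    return summary
```

Under those assumptions it would be called as summarize(results.reward, results.metrics).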

Dependencies

Required

  • verifiers>=0.1.2 - Base environment framework
  • openai>=1.0.0 - LLM evaluation
  • pydantic>=2.0.0 - Data validation
  • datasets>=2.0.0 - Dataset handling
  • tree-sitter-typescript - Code parsing
  • python-dotenv>=1.0.0 - Environment variables
  • pyyaml>=6.0.0 - Configuration loading

External Tools

  • Node.js & npm - Required for Remotion CLI
  • Remotion CLI - Component rendering (npx remotion)
  • Chrome/Chromium - Headless browser for rendering
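
Since rendering depends on tools outside Python, it can help to confirm they are on the PATH before starting a run. The sketch below uses shutil.which for that check. The find_tools helper is hypothetical, and Chrome is left out of the defaults because its executable name varies by platform.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_tools(names: Iterable[str] = ("node", "npm", "npx")) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    missing = [name for name, path in find_tools().items() if path is None]
    if missing:
        print("Missing external tools:", ", ".join(missing))
```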

Installation

# Install package
pip install -e .

# Install Remotion globally
npm install -g remotion

# Set up environment variables
echo "OPENAI_API_KEY=your_key_here" > .env

File Structure

vf-react-scene-eval/
├── single_turn_env.py          # Main environment class
├── parser.py                   # React component parser  
├── rubrics.py                  # Design evaluation rubric
├── remotion_utils.py           # Remotion rendering utilities
├── yaml_prompt_loader.py       # YAML prompt loading
├── prompts.yaml                # System prompts configuration
├── templates/                  # Remotion configuration templates
│   ├── Image.template.tsx      # Composition wrapper
│   ├── index.template.ts       # Entry point
│   ├── remotion.config.template.ts
│   └── tsconfig.template.json
├── tests/                      # Comprehensive test suite
│   ├── test_env.py            # End-to-end integration tests
│   ├── test_parser.py         # Parser unit tests
│   └── test_rubric.py         # Rubric unit tests
└── test_data/                 # Sample components for testing

Testing

# Run all tests
python -m pytest tests/

# Run integration tests
python -m pytest tests/test_env.py -v

# Run quick integration test
python tests/test_env.py

Configuration

Environment Variables

  • OPENAI_API_KEY - Required for design evaluation
  • OPENAI_MODEL - Model for generation (default: gpt-4o-mini)

Customization

  • System prompts: Edit prompts.yaml
  • Rendering settings: Modify remotion_utils.py
  • Evaluation criteria: Update rubrics.py
  • Parser behavior: Adjust parser.py

Common Issues

Rendering Failures

  • Chrome not found: Install Chrome/Chromium browser
  • Remotion not installed: Run npm install -g remotion
  • Component syntax errors: Check import/export structure
  • Missing totalFrames: Ensure variable is defined at top level

Parser Issues

  • XML extraction fails: Verify <code></code> wrapper is present
  • Tree-sitter errors: Check TypeScript/TSX syntax validity
  • Component name missing: Ensure proper export default statement
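
When debugging a failing completion, a quick text-level check can narrow down which of the issues above applies before running the full pipeline. The sketch below uses regular expressions rather than tree-sitter, so it is only a rough approximation of what the real parser enforces; preflight_issues is a hypothetical helper, and "top level" is approximated as "starts at the beginning of a line".

```python
import re
from typing import List


def preflight_issues(completion: str) -> List[str]:
    """Return a list of likely problems in a model completion."""
    match = re.search(r"<code>(.*?)</code>", completion, re.DOTALL)
    if match is None:
        return ["missing <code></code> wrapper"]
    code = match.group(1)
    issues: List[str] = []
    # Looks for an unindented "export default <name>" statement.
    if not re.search(r"^export\s+default\s+\w+", code, re.MULTILINE):
        issues.append("missing export default statement")
    # Looks for an unindented declaration of totalFrames.
    if not re.search(r"^(?:export\s+)?(?:const|let|var)\s+totalFrames\b",
                     code, re.MULTILINE):
        issues.append("missing top-level totalFrames")
    return issues
```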

Contributing

Adding New Evaluation Criteria

  1. Update ScoreReason model in rubrics.py
  2. Modify judge prompt in prompts.yaml
  3. Add corresponding test cases
  4. Update scoring logic in DesignJudgeRubric

Improving Parser

  1. Extend tree-sitter queries in parser.py
  2. Add new validation patterns
  3. Update reward function logic
  4. Add comprehensive test coverage

License

MIT License - see LICENSE file for details.