vf-react-scene-eval
Environment for evaluating language models on React component generation for video scenes. The model under evaluation generates React components for Remotion video scenes; the environment renders them to images and uses a multimodal LLM judge to assess design quality.
Overview
This environment tests models' ability to:
- Generate syntactically correct React/TSX components for Remotion
- Create visually appealing scene layouts with proper spacing and alignment
- Follow Remotion animation patterns and component structure
- Include specified text content accurately
- Export components with proper structure (export default ComponentName)
The evaluation pipeline consists of:
- Component Generation: Model creates React component code in XML tags
- Code Parsing: Tree-sitter extracts and validates component structure
- Remotion Rendering: Component is rendered to PNG image at final frame
- Design Evaluation: GPT-4o-mini evaluates rendered image quality
Architecture
Core Components
- ReactSceneEnvironment: Main environment class inheriting from verifiers.Environment
- SceneComponentParser: Tree-sitter based parser for React/TSX code extraction
- DesignJudgeRubric: Multimodal LLM judge for design quality assessment
- RemotionRenderer: Handles component rendering using Remotion CLI
Evaluation Pipeline
User Query → Model Generation → XML Extraction → Component Parsing → Remotion Rendering → Design Evaluation
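The XML extraction step can be illustrated with a minimal sketch. The helper name `extract_code`, the regular expression, and the choice to take the last `<code>` block are assumptions made for illustration; the environment's own parser may behave differently. The TSX in the sample is likewise only a placeholder.

```python
import re
from typing import Optional

# Hypothetical helper, not the environment's actual extraction code.
CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def extract_code(completion: str) -> Optional[str]:
    """Return the contents of the last <code>...</code> block, or None."""
    matches = CODE_TAG.findall(completion)
    if not matches:
        return None
    return matches[-1].strip()


# Illustrative model output; the TSX body is a placeholder, not a real scene.
sample = """Here is the scene.
<code>
const totalFrames = 60;
export default function TitleScene() {
  return null;
}
</code>"""

code = extract_code(sample)
print(code)
```

Returning None rather than raising lets a caller treat a missing wrapper as an ordinary parsing failure.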
Scoring System
The environment uses a binary reward system:
Parser Validation (Weight: 1.0)
- Score: 0.0 for valid syntax, -1.0 for invalid
- Checks: Imports, exports, syntax errors via tree-sitter
- Requirement: Code must parse successfully and have proper structure
Design Quality Evaluation (Weight: 1.0)
Three binary criteria evaluated by GPT-4o-mini on rendered images:
- Space Utilization (Binary: 0 or 1)
  - Effective use of whitespace and layout areas
  - Avoids overcrowding and excessive empty space
  - Maintains comfortable margins and visual weight distribution
- Alignment & Spatial Relationships (Binary: 0 or 1)
  - Consistent alignment (horizontal and vertical)
  - Coherent grid/layout structure
  - No overlapping or partially visible elements
- Text Correctness (Binary: 0 or 1)
  - Required text is present and correctly spelled
  - Text is not truncated, clipped, or overlapped
  - Sufficient legibility with proper size and contrast
  - Scores 1 if no specific text was requested
Final Scoring
- Success: reward = 1.0 (all criteria met)
- Failure: reward = -1.0 (any failure: parsing, rendering, or design)
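The all-or-nothing rule above can be sketched as a small function. The names `DesignScores` and `final_reward` are hypothetical and the actual combination logic in rubrics.py may differ; this only mirrors the rule as described in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DesignScores:
    """Hypothetical container for the three binary judge criteria."""
    space_utilization: int
    alignment: int
    text_correctness: int


def final_reward(parsed_ok: bool, rendered_ok: bool,
                 scores: Optional[DesignScores]) -> float:
    """Return 1.0 only if parsing, rendering, and every criterion pass."""
    if not parsed_ok or not rendered_ok or scores is None:
        return -1.0
    all_pass = (
        scores.space_utilization == 1
        and scores.alignment == 1
        and scores.text_correctness == 1
    )
    return 1.0 if all_pass else -1.0


print(final_reward(True, True, DesignScores(1, 1, 1)))
print(final_reward(True, True, DesignScores(1, 0, 1)))
```

Under this rule a single failed criterion has the same effect on the reward as a component that does not parse at all.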
System Prompts
The environment uses YAML-configured prompts:
Generation Prompt
- Instructs models to create Remotion components
- Requires <code></code> XML wrapper for extraction
- Specifies totalFrames variable and export structure
- Emphasizes final frame completeness for evaluation
Judge Prompt
- Evaluates rendered image against design criteria
- Structured JSON output with reasoning and binary scores
- Uses multimodal GPT-4o-mini with image input
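As a rough illustration of handling the judge's structured output, the sketch below validates a JSON reply using only the standard library. The key names are assumptions borrowed from the metric names listed under Results Structure; the real schema is defined by the ScoreReason model in rubrics.py and may be shaped differently.

```python
import json

# Assumed key names; the actual judge schema may differ.
REQUIRED_KEYS = ("space_utilization", "alignment", "text_correctness")


def parse_judge_output(raw: str) -> dict:
    """Parse a judge reply and check that each criterion is 0 or 1."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    scores = {}
    for key in REQUIRED_KEYS:
        value = data.get(key)
        if value not in (0, 1):
            raise ValueError(f"{key} must be 0 or 1, got {value!r}")
        scores[key] = int(value)
    return {"reasoning": str(data.get("reasoning", "")), "scores": scores}


reply = (
    '{"reasoning": "Title is centered and legible.", '
    '"space_utilization": 1, "alignment": 1, "text_correctness": 0}'
)
print(parse_judge_output(reply))
```

Rejecting anything other than 0 or 1 keeps a malformed judge reply from being silently counted as a pass.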
Usage
Basic Setup
from single_turn_env import ReactSceneEnvironment
from datasets import Dataset
# Create test dataset
queries = [
    "Create a simple title scene with big bold text 'Hello World' centered.",
    "Design a lower-third banner with the name 'Alex Doe' and a subtitle.",
]
dataset = Dataset.from_dict({"question": queries})
# Initialize environment
env = ReactSceneEnvironment(dataset=dataset)
Evaluation
import asyncio
from openai import AsyncOpenAI

async def evaluate():
    client = AsyncOpenAI()
    # Format inputs
    formatted = env.get_dataset()
    inputs = {
        "prompt": list(formatted["prompt"]),
        "info": [{} for _ in range(len(formatted))],
    }
    # Run evaluation
    results = await env.a_generate(
        inputs=inputs,
        client=client,
        model="gpt-4o-mini",
        sampling_args={"temperature": 0.2},
        score_rollouts=True,
    )
    return results

results = asyncio.run(evaluate())
Results Structure
# Results contain:
results.completion # Generated component code
results.reward # Binary rewards: 1.0 (success) or -1.0 (failure)
results.metrics # Detailed metrics:
# - parser_reward_func: Parser validation score
# - space_utilization: Space usage score
# - alignment: Alignment quality score
# - text_correctness: Text accuracy score
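A quick way to summarize a run is to compute the success rate and the per-metric means. This sketch assumes that the rewards are a list of floats and that the metrics are a mapping from metric name to a list of per-rollout values; the helper name `summarize` is hypothetical, and the actual container types returned by the environment may differ.

```python
from typing import Dict, List


def summarize(rewards: List[float], metrics: Dict[str, List[float]]) -> Dict[str, float]:
    """Compute the fraction of successful rollouts and the mean of each metric."""
    n = len(rewards)
    summary = {
        "n": float(n),
        "success_rate": sum(1 for r in rewards if r == 1.0) / n if n else 0.0,
    }
    for name, values in metrics.items():
        summary[f"mean_{name}"] = sum(values) / len(values) if values else 0.0
    return summary


# Made-up example values, not real evaluation results.
example = summarize(
    rewards=[1.0, -1.0, 1.0, -1.0],
    metrics={"alignment": [1, 0, 1, 1], "text_correctness": [1, 1, 1, 0]},
)
print(example)
```

With real results this might be called as summarize(list(results.reward), dict(results.metrics)), assuming those attributes convert cleanly.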
Dependencies
Required
- verifiers>=0.1.2 - Base environment framework
- openai>=1.0.0 - LLM evaluation
- pydantic>=2.0.0 - Data validation
- datasets>=2.0.0 - Dataset handling
- tree-sitter-typescript - Code parsing
- python-dotenv>=1.0.0 - Environment variables
- pyyaml>=6.0.0 - Configuration loading
External Tools
- Node.js & npm - Required for Remotion CLI
- Remotion CLI - Component rendering (npx remotion)
- Chrome/Chromium - Headless browser for rendering
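Before running the environment it can help to confirm that the external tools are reachable. The sketch below looks up executables on PATH with the standard library. The helper name `missing_tools` and the candidate executable names are assumptions; in particular, Chrome is often installed outside PATH (for example on macOS), so a reported miss there is only a hint.

```python
import shutil
from typing import Dict, List


def missing_tools(candidates: Dict[str, List[str]]) -> List[str]:
    """Return labels for which none of the candidate executables is on PATH."""
    missing = []
    for label, names in candidates.items():
        if not any(shutil.which(name) for name in names):
            missing.append(label)
    return missing


# Assumed executable names; adjust for the local platform.
TOOLS = {
    "Node.js": ["node"],
    "npx": ["npx"],
    "Chrome/Chromium": ["google-chrome", "chromium", "chromium-browser", "chrome"],
}

print("Not found on PATH:", missing_tools(TOOLS))
```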
Installation
# Install package
pip install -e .
# Install Remotion globally
npm install -g remotion
# Set up environment variables
echo "OPENAI_API_KEY=your_key_here" > .env
File Structure
vf-react-scene-eval/
├── single_turn_env.py # Main environment class
├── parser.py # React component parser
├── rubrics.py # Design evaluation rubric
├── remotion_utils.py # Remotion rendering utilities
├── yaml_prompt_loader.py # YAML prompt loading
├── prompts.yaml # System prompts configuration
├── templates/ # Remotion configuration templates
│ ├── Image.template.tsx # Composition wrapper
│ ├── index.template.ts # Entry point
│ ├── remotion.config.template.ts
│ └── tsconfig.template.json
├── tests/ # Comprehensive test suite
│ ├── test_env.py # End-to-end integration tests
│ ├── test_parser.py # Parser unit tests
│ └── test_rubric.py # Rubric unit tests
└── test_data/ # Sample components for testing
Testing
# Run all tests
python -m pytest tests/
# Run integration tests
python -m pytest tests/test_env.py -v
# Run quick integration test
python tests/test_env.py
Configuration
Environment Variables
- OPENAI_API_KEY - Required for design evaluation
- OPENAI_MODEL - Model for generation (default: gpt-4o-mini)
Customization
- System prompts: Edit prompts.yaml
- Rendering settings: Modify remotion_utils.py
- Evaluation criteria: Update rubrics.py
- Parser behavior: Adjust parser.py
Common Issues
Rendering Failures
- Chrome not found: Install Chrome/Chromium browser
- Remotion not installed: Run npm install -g remotion
- Component syntax errors: Check import/export structure
- Missing totalFrames: Ensure variable is defined at top level
Parser Issues
- XML extraction fails: Verify <code></code> wrapper is present
- Tree-sitter errors: Check TypeScript/TSX syntax validity
- Component name missing: Ensure proper export default statement
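When debugging a failed rollout, a rough pre-flight check on the raw completion can narrow down which of the issues above applies. This sketch uses plain regular expressions as a heuristic and is not a substitute for the tree-sitter parser; the function name `preflight_problems` and the messages are hypothetical.

```python
import re
from typing import List


def preflight_problems(completion: str) -> List[str]:
    """Heuristically list likely causes of a parser or rendering failure."""
    match = re.search(r"<code>(.*?)</code>", completion, re.DOTALL)
    if match is None:
        return ["missing <code></code> wrapper"]
    code = match.group(1)
    problems = []
    if not re.search(r"\bexport\s+default\b", code):
        problems.append("no export default statement")
    if not re.search(r"\btotalFrames\b", code):
        problems.append("totalFrames is never mentioned")
    return problems


# Placeholder completion that has the wrapper but lacks totalFrames.
print(preflight_problems("<code>export default function Scene() {}</code>"))
```

An empty list only means that none of these textual checks fired; the code may still fail tree-sitter validation or rendering.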
Contributing
Adding New Evaluation Criteria
- Update ScoreReason model in rubrics.py
- Modify judge prompt in prompts.yaml
- Add corresponding test cases
- Update scoring logic in DesignJudgeRubric
Improving Parser
- Extend tree-sitter queries in
parser.py - Add new validation patterns
- Update reward function logic
- Add comprehensive test coverage
License
MIT License - see LICENSE file for details.