Network Log Anomaly Detection (E1)
A security-focused RL environment for training and evaluating models on network intrusion detection. Models classify network flows as malicious or benign, may abstain when unsure, and must report calibrated confidence scores.
Overview
This environment implements calibrated classification with abstention support and asymmetric costs, enabling realistic evaluation of network intrusion detection agents.
Environment Type: SingleTurnEnv - One prompt, one response per example
Task: Ternary classification of network logs (Malicious / Benign / Abstain)
Reward Structure: Accuracy, JSON format compliance, calibration, and cost-sensitive penalties
Dataset: IoT-23 network traffic with labeled malicious/benign connections
Dataset Access
Public Metadata: Browse sampling information and dataset composition at:
Full Dataset: Private to prevent training contamination. Request access via:
- GitHub Issues with title "Dataset Access Request: E1"
- Include: name, affiliation, research purpose, HuggingFace username
The public metadata repo includes detailed model cards explaining the privacy rationale and dataset composition.
Dataset Loading Strategies
This environment supports multi-tiered dataset loading for flexibility across different deployment scenarios:
- Local datasets (built with make data-e1)
- HuggingFace Hub (with HF_TOKEN authentication)
- Synthetic fixtures (for testing without data dependencies)
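The tiers can be thought of as a chain of loaders tried in order, where the first one that succeeds wins. The following is a minimal sketch of that idea, not the environment's actual implementation: `load_with_fallback`, `load_local`, `load_hub`, and `load_synthetic` are hypothetical names, and the record fields are illustrative only.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Loader = Callable[[], List[Dict]]


def load_with_fallback(loaders: Sequence[Tuple[str, Loader]]) -> Tuple[str, List[Dict]]:
    """Try each (name, loader) pair in order and return the first success."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # a real loader would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all dataset sources failed: " + "; ".join(errors))


# Hypothetical stand-ins for the three tiers (assumptions for illustration).
def load_local() -> List[Dict]:
    raise FileNotFoundError("no local dataset; run `make data-e1` first")


def load_hub() -> List[Dict]:
    raise PermissionError("HF_TOKEN is not set")


def load_synthetic() -> List[Dict]:
    return [{"question": "Log Entry: id.resp_p=8081, proto=tcp", "answer": "Benign"}]


source, rows = load_with_fallback(
    [("local", load_local), ("hub", load_hub), ("synthetic", load_synthetic)]
)
print(source, len(rows))
```

In this sketch both the local and hub loaders fail, so the synthetic fixtures are used; pinning a single source corresponds to passing a one-element list.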
Loading Modes
import verifiers as vf
# Auto mode (default): Try local → hub → synthetic
env = vf.load_environment("sv-env-network-logs")
# Local only: Require local dataset
env = vf.load_environment("sv-env-network-logs", dataset_source="local")
# Hub only: Load from HuggingFace
env = vf.load_environment("sv-env-network-logs", dataset_source="hub")
# Synthetic only: Use test fixtures (no data needed)
env = vf.load_environment("sv-env-network-logs", dataset_source="synthetic")
Using Your Own HuggingFace Repository
If you've built and pushed datasets to your own HuggingFace repository:
import os
# Configure custom repository
os.environ["HF_TOKEN"] = "hf_your_token_here"
os.environ["E1_HF_REPO"] = "your-org/security-verifiers-e1-private"
# Load from your repository
env = vf.load_environment(
"sv-env-network-logs",
dataset_source="hub",
max_examples=100
)
See docs/user-dataset-guide.md for instructions on building and pushing datasets to your own HuggingFace repository.
Installation
Install the environment using the Prime CLI:
prime env install intertwine/sv-env-network-logs
Or using pip directly:
pip install sv-env-network-logs
Setup
API Keys Configuration
Before using this environment, you need to configure API keys for model inference and dataset access:
- Set your API keys as environment variables:
# OpenAI API Key (required for OpenAI models)
export OPENAI_API_KEY="your-openai-api-key"
# HuggingFace Token (optional, for IoT-23 dataset access)
export HF_TOKEN="your-huggingface-token"
Get your HuggingFace token from huggingface.co/settings/tokens
Note: Without the HF_TOKEN, the environment will fall back to using a synthetic dataset with limited examples.
- For persistent configuration, add to your shell profile:
echo 'export OPENAI_API_KEY="your-key"' >> ~/.bashrc
echo 'export HF_TOKEN="your-token"' >> ~/.bashrc
source ~/.bashrc
Usage
With Prime CLI (Recommended)
The easiest way to evaluate models on this environment is using the Prime CLI:
# Install the environment
prime env install intertwine/sv-env-network-logs
# Run evaluation on the HuggingFace dataset (capped at 100 examples via max_examples)
prime env eval sv-env-network-logs \
-a '{"dataset_name":"intertwine-ai/security-verifiers-e1","max_examples":100}'
# Same dataset, but evaluate only 10 examples
prime env eval sv-env-network-logs \
-a '{"dataset_name":"intertwine-ai/security-verifiers-e1","max_examples":100}' \
--num-examples 10
# Use synthetic dataset for quick testing (no HF_TOKEN required)
prime env eval sv-env-network-logs \
-a '{"dataset_source":"synthetic"}' \
--num-examples 5
Note: By default, Prime uses meta-llama/llama-3.1-70b-instruct. Specify a different model with --model:
prime env eval sv-env-network-logs \
-a '{"dataset_name":"intertwine-ai/security-verifiers-e1"}' \
--model gpt-4o-mini \
--num-examples 10
With Verifiers Library
import os
import verifiers as vf
# Load environment variables from .env file (if running in Python script)
# Alternatively, set them manually:
# os.environ['OPENAI_API_KEY'] = 'your-openai-api-key'
# os.environ['HF_TOKEN'] = 'your-huggingface-token' # optional
# Load the environment
env = vf.load_environment("sv-env-network-logs")
# Evaluate a model
results = env.evaluate(
client=vf.OpenAIClient(),
model="gpt-5-mini",
num_examples=100
)
print(f"Average reward: {results.stats['mean_reward']:.2%}")
Quick Evaluation
Use the verifiers CLI for quick testing:
# First, load environment variables from .env file
set -a && source .env && set +a
# For OpenAI models (requires OPENAI_API_KEY environment variable)
vf-eval sv-env-network-logs \
--model gpt-5-mini \
--num-examples 10
# With custom API endpoint
vf-eval sv-env-network-logs \
--model your-model-name \
--api-host-base https://your-api-endpoint.com/v1 \
--api-key-var YOUR_API_KEY_ENV_VAR \
--num-examples 10
Command Options
- -m, --model: Model name to use for evaluation
- -b, --api-host-base: Base URL for the API endpoint (e.g., https://api.openai.com/v1)
- -k, --api-key-var: Name of the environment variable containing your API key
- --num-examples: Number of examples to evaluate (default: 100)
- --verbose: Enable verbose output for debugging
Training with Prime RL
In your Prime RL orchestrator configuration:
[environment]
id = "sv-env-network-logs"
Then launch training:
# First, load environment variables from .env file
set -a && source .env && set +a
# Then run training
uv run rl \
--trainer.model.name "Qwen/Qwen-7B" \
--orchestrator.environment.id "sv-env-network-logs" \
--trainer.steps 1000
Task Details
Input Format
Network log entries with connection metadata:
"Log Entry: id.orig_h=None, id.orig_p=None, id.resp_h=None, id.resp_p=8081, proto=tcp, service=None, detailed-label=None"
Expected Output
Strict JSON object:
{
"label": "Benign|Malicious|Abstain",
"confidence": 0.0,
"rationale": "string (optional)"
}
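A response can be checked against this schema before any scoring happens. The sketch below is one way to do that and is not the environment's own parser: `parse_response` is a hypothetical helper, and the requirement that `confidence` lie in [0, 1] is an assumption based on its role as a calibrated score.

```python
import json
from typing import Optional

ALLOWED_LABELS = {"Benign", "Malicious", "Abstain"}


def parse_response(text: str) -> Optional[dict]:
    """Return a normalized dict, or None if the response violates the schema."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:  # assumed valid range
        return None

    rationale = data.get("rationale", "")  # optional field
    if not isinstance(rationale, str):
        return None

    return {"label": label, "confidence": float(confidence), "rationale": rationale}


print(parse_response('{"label": "Malicious", "confidence": 0.9}'))
print(parse_response("The flow looks malicious."))
```

Returning `None` for any violation keeps the format check separate from the accuracy check, so a malformed response can be scored as a format failure rather than raising an exception.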
Scoring
The environment uses a weighted multi-criteria rubric:
- Classification Accuracy (1.0)
- Format Compliance (0.1)
- Calibration Bonus (0.2)
- Asymmetric Cost (0.5, heavy penalty for false negatives)
Total reward is the weighted sum of these components.
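As a worked example of the weighted sum, the sketch below combines per-component scores using the weights listed above. The component names, the assumption that each score lies in [0, 1], and the convention that the asymmetric-cost component is zero or negative are illustrative assumptions, not the environment's documented behavior.

```python
# Weights taken from the rubric above; the dictionary keys are hypothetical names.
WEIGHTS = {
    "accuracy": 1.0,
    "format": 0.1,
    "calibration": 0.2,
    "asymmetric_cost": 0.5,
}


def total_reward(components: dict) -> float:
    """Weighted sum of component scores; missing components count as 0."""
    return sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)


# Correct, well-formed answer with fairly good calibration and no cost incurred.
correct = {"accuracy": 1.0, "format": 1.0, "calibration": 0.8, "asymmetric_cost": 0.0}
# Missed attack (false negative), assuming the cost component returns -1.0.
missed = {"accuracy": 0.0, "format": 1.0, "calibration": 0.1, "asymmetric_cost": -1.0}

print(round(total_reward(correct), 2))  # 1.0 + 0.1 + 0.16 + 0.0 = 1.26
print(round(total_reward(missed), 2))   # 0.0 + 0.1 + 0.02 - 0.5 = -0.38
```

Because the weights sum to 1.8 rather than 1, the total is not a normalized score under these assumptions, so rewards should be compared across runs rather than read as percentages.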
Performance Benchmarks
| Model | Accuracy | Format | Calibration | Overall |
|---|---|---|---|---|
| GPT-4o-mini | 60.3% | 100% | 85% | 82% |
Benchmarks on 100 examples from the IoT-23 dataset (illustrative).
Dataset
The environment uses locally built datasets derived from public network intrusion detection datasets. Build them with make data-e1 before use. Available datasets:
- iot23-train-dev-test-v1.jsonl (N=1800): Primary dataset from IoT-23, 70/15/15 train/dev/test split
- cic-ids-2017-ood-v1.jsonl (N=600): Out-of-distribution dataset from CIC-IDS-2017
- unsw-nb15-ood-v1.jsonl (N=600): Out-of-distribution dataset from UNSW-NB15
A synthetic fallback dataset ensures the environment works for testing even without built datasets.
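For N=1800, a 70/15/15 split yields 1260 training, 270 dev, and 270 test examples. The sketch below shows one simple way to produce such a split with a seeded shuffle. It is an assumption for illustration only: `split_indices` is a hypothetical helper, and the actual build step behind make data-e1 may split differently (for example, with stratification).

```python
import random


def split_indices(n, fractions=(0.70, 0.15, 0.15), seed=42):
    """Shuffle indices 0..n-1 deterministically and cut them into three parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_end = round(n * fractions[0])
    dev_end = train_end + round(n * fractions[1])
    # The test split takes whatever remains, so the parts always cover all n items.
    return indices[:train_end], indices[train_end:dev_end], indices[dev_end:]


train, dev, test = split_indices(1800)
print(len(train), len(dev), len(test))  # 1260 270 270
```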
Requirements
- Python 3.12+
- verifiers>=0.1.4
- API key for model inference (e.g., OpenAI API key)
- HuggingFace token (only for building datasets with make data-e1)
Weights & Biases Logging
This environment supports automatic Weave tracing for comprehensive experiment tracking:
import wandb
import weave
import verifiers as vf
# Initialize Weave (auto-traces all Verifiers operations)
weave.init(project="network-logs-security")
# Load and evaluate
env = vf.load_environment("sv-env-network-logs")
results = env.evaluate(
client=vf.OpenAIClient(),
model="gpt-5-mini",
num_examples=100
)
# Results are automatically traced to W&B
Configure Weave via environment variables:
- WEAVE_PROJECT: Set project name (default: security-verifiers)
- WEAVE_DISABLED: Set to 'true' to disable logging
- WANDB_API_KEY: Your W&B API key for cloud logging
Evaluation Approach
Metrics Tracked
- Accuracy: Correct classification rate (Malicious/Benign/Abstain)
- Format Compliance: Valid JSON output adherence
- Calibration Score: Confidence alignment with actual accuracy
- Asymmetric Cost: False negative penalty (missing attacks is worse than false alarms)
- Overall Reward: Weighted combination of all metrics
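The calibration and asymmetric-cost metrics are described only qualitatively here. One plausible per-example form is sketched below to make the intent concrete. Both functions, the cost values `fn_cost` and `fp_cost`, and the choice to charge nothing for abstaining are assumptions for illustration, not the environment's actual reward functions.

```python
def calibration_score(confidence: float, correct: bool) -> float:
    """1 minus the gap between stated confidence and the 0/1 outcome (assumed form)."""
    outcome = 1.0 if correct else 0.0
    return 1.0 - abs(confidence - outcome)


def asymmetric_cost(predicted: str, actual: str,
                    fn_cost: float = 1.0, fp_cost: float = 0.3) -> float:
    """Negative penalty for errors; false negatives cost more (hypothetical values)."""
    if predicted == "Benign" and actual == "Malicious":
        return -fn_cost  # missed attack
    if predicted == "Malicious" and actual == "Benign":
        return -fp_cost  # false alarm
    return 0.0  # correct answers and abstentions incur no cost in this sketch


# A confident wrong answer scores poorly on calibration...
print(round(calibration_score(0.9, correct=False), 2))
# ...and missing an attack is penalized more than raising a false alarm.
print(asymmetric_cost("Benign", "Malicious"), asymmetric_cost("Malicious", "Benign"))
```

Under this form, a model that reports high confidence only when it is right earns a calibration score near 1, while overconfident mistakes approach 0.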
Example Evaluation Script
import verifiers as vf
import weave
# Initialize tracking
weave.init(project="security-eval")
env = vf.load_environment("sv-env-network-logs")
# Run evaluation
results = env.evaluate(
client=vf.OpenAIClient(),
model="gpt-5-mini",
num_examples=500,
seed=42
)
print(f"Mean Reward: {results.stats['mean_reward']:.2%}")
print(f"Accuracy: {results.stats.get('accuracy', 0):.2%}")
print(f"Calibration: {results.stats.get('calibration', 0):.2%}")
Future Improvements
- Enhanced Dataset: Expand beyond IoT-23 to include enterprise network traffic patterns
- Multi-turn Interaction: Support for requesting additional context or log entries
- Explainability: Require detailed rationale for high-stakes classifications
- Active Learning: Dynamic example selection based on model uncertainty
- Temporal Analysis: Support for analyzing sequences of related network events
- Cost Customization: Allow environment users to specify their own false positive/negative costs
About
This environment is part of the Open Security Verifiers suite - a collection of security and alignment RL environments using Prime Intellect's Verifiers framework. Each environment provides executable, programmatic rewards for training robust security-aware AI systems.
Support
For issues or questions:
- Report issues on the Prime Intellect Environments Hub
- Check the Security Verifiers GitHub repository
- Contact the Intertwine team