
Deepdive RLM RL Env (Prime Intellect)


DeepDive QA RL environment with a Serper-powered search tool using RLM

Type: RL Env
License: unknown
Size: v0.2.13
Published: Dec 2025


DeepDive RLM

RLM (Recursive Language Model) environment for DeepDive - complex QA with Google search.

Overview

  • Environment ID: deepdive-rlm
  • Short description: Complex QA using RLM pattern with Google search tools (configurable placement on root, sub-LLMs, or both).
  • Tags: qa, multiturn, search, tool-use, rlm

How It Works

This environment uses the Recursive Language Model (RLM) pattern:

  1. Root Model: Writes Python code in a REPL environment to orchestrate the search process
  2. Sub-LLMs: Called via llm_batch(prompts) function
  3. Search tools (search_web, scan_page, open_lines): Available on the root LLM, inside the REPL, on sub-LLMs, or any combination (controlled by tools_on_root, tools_in_repl, tools_on_sub; default: sub-LLMs only)
  4. Final Answer: Set via answer["content"] = "your answer" and answer["ready"] = True

This pattern is useful for complex queries that benefit from decomposition and recursive reasoning.
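
As a rough illustration of the four steps above, the root model might write something like the following in the REPL. This is a minimal sketch, not the environment's own code: `llm_batch` and `answer` are stubbed here so the snippet runs on its own, the assumption that `llm_batch` returns one string per prompt is not confirmed by this README, and the question and prompts are made up.

```python
# Stand-ins so the sketch runs outside the environment. Inside the real REPL,
# `llm_batch` and `answer` are assumed to be provided already.
def llm_batch(prompts):
    """Hypothetical stub: returns one canned string per prompt."""
    return [f"(stub reply to: {p})" for p in prompts]


answer = {"content": "", "ready": False}

# Step 1: decompose a (made-up) question into sub-questions, one prompt each.
sub_questions = [
    "Who designed the first building on campus X?",
    "Where did that architect study?",
]
sub_prompts = [f"Search the web and report findings for: {q}" for q in sub_questions]

# Step 2: fan out to sub-LLMs, which hold the search tools by default.
findings = llm_batch(sub_prompts)

# Step 3: combine the findings and publish the final answer.
answer["content"] = " | ".join(findings)
answer["ready"] = True
```

Keeping the search calls inside the sub-LLMs means the root model only sees their summarized findings rather than raw page text.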

Datasets

  • Primary dataset(s): DeepDive (arxiv, Huggingface)
  • Split sizes: 2k train, 0.2k eval

Other datasets also work out of the box:

Setup and Install

uv run vf-install deepdive

You will also need an API key from Serper.

Eval

Set all environment variables required for running the model and judge. For example, the judge defaults to Pinference's openai/gpt-4.1-mini, so you need to set the PRIME_API_KEY:

export PRIME_API_KEY=<your-key>
export SERPER_API_KEY=<your-serper-key>

Example evaluation:

prime eval run deepdive -m gpt-5-mini -n 5
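
Before launching an evaluation, it can help to confirm that both keys are actually set. The helper below is a hypothetical pre-flight check, not part of the environment; it only assumes the two variable names exported above.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("PRIME_API_KEY", "SERPER_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required keys are set.")
```

If you override judge_api_key_var or serper_api_key_var, pass the overridden names to `missing_env_vars` instead.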

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_name | str | "zai-org/DeepDive" | HuggingFace dataset name |
| dataset_split | str | "qa_rl" | Dataset split to load |
| dataset_subset | str \| None | None | Dataset subset/config name |
| dataset_test_size | float | 0.1 | Fraction of data used for eval split |
| tools_on_root | bool | False | Give search tools directly to the root LLM as standard tools |
| tools_in_repl | bool | False | Make search tools available inside the REPL as functions |
| tools_on_sub | bool | True | Give search tools to sub-LLMs via standard tool calling |
| max_turns | int | 50 | Max REPL iterations |
| sub_model | str | None | Model for sub-LLM calls (defaults to same as root model) |
| max_sub_llm_parallelism | int | 5 | Max concurrent sub-LLM calls; the RLM can still batch more prompts than this, but their concurrency will be limited by a Semaphore |
| max_output_length | int | 8192 | Max length of code execution output |
| code_execution_timeout | int | 120 | Timeout in seconds for code execution |
| abort_on_code_timeout | bool | False | If True, abort rollout on code timeout; if False, return error to model |
| max_startup_wait_seconds | int | 120 | Max seconds to wait for sandbox worker startup |
| pip_install_packages | str | "" | Space-separated packages to install in sandbox |
| sandbox_docker_image | str | "python:3.11-slim" | Docker image for sandbox |
| sandbox_cpu_cores | int | 1 | CPU cores for sandbox |
| sandbox_memory_gb | int | 2 | Memory in GB for sandbox |
| sandbox_disk_size_gb | int | 5 | Disk size in GB for sandbox |
| sandbox_gpu_count | int | 0 | Number of GPUs for sandbox |
| sandbox_timeout_minutes | int | 60 | Overall sandbox lifetime in minutes |
| sub_llm_max_turns | int | 5 | Max tool-calling turns for each sub-LLM call |
| include_env_tips | bool | False | Include environment-specific tips in prompt |
| prompt_in_context_file | bool | False | Write the prompt into context.txt and leave the user prompt empty |
| serper_api_key_var | str | "SERPER_API_KEY" | Env var with Serper API key |
| max_search_results | int | 10 | Maximum number of search results from Serper |
| max_concurrent_search | int | 10 | Maximum number of queries issued in parallel per search_web call. Queries beyond this limit are ignored |
| max_response_chars | int \| float | 20_000 | Truncate search results and scan/open outputs to this length |
| judge_model | str | "openai/gpt-4.1-mini" | Judge model for evaluation |
| judge_api_key_var | str | "PRIME_API_KEY" | Env var with judge API key |
| judge_base_url | str | "https://api.pinference.ai/api/v1" | Base URL for judge model API |
| serper_timeout | float | 15 | Timeout for search requests |
| open_max_workers | int | 64 | Number of threads for URL fetching and HTML/PDF parsing |
| open_max_concurrency | int | 64 | Max concurrent URL fetches per process |
| open_max_connections | int | 256 | Max pooled HTTP connections per process |
| open_max_connections_per_host | int | 0 | Max pooled HTTP connections per host (0 = unlimited) |
| cache_shards | int | 8 | Number of SQLite shards for diskcache (higher reduces contention) |
| in_memory_cache_max_bytes | int | 16_777_216 | Per-process in-memory cache size limit in bytes (0 disables) |
| in_memory_cache_max_entry_bytes | int | 200_000 | Max entry size (bytes) stored in the in-memory cache |
| redundancy_penalty_weight | float | 0.0 | Weight for redundancy penalty on similar search queries. Computed across all sub-LLM calls |
| log_level | str \| int | "INFO" | Logging level for DeepDive RLM loggers (e.g., "DEBUG", "INFO") |
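
The max_sub_llm_parallelism description says a batch may contain more prompts than the limit, with concurrency capped by a Semaphore. The sketch below shows that general technique with asyncio; it is not the environment's implementation. The `run_batch` and `call_sub_llm` names are hypothetical, and the sleep stands in for a real model call.

```python
import asyncio


async def run_batch(prompts, max_parallelism=5):
    """Run one fake sub-LLM call per prompt, at most `max_parallelism` at once."""
    semaphore = asyncio.Semaphore(max_parallelism)
    in_flight = 0
    peak = 0

    async def call_sub_llm(prompt):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for the real model call
            in_flight -= 1
            return f"reply to {prompt}"

    # All prompts are submitted together; the semaphore throttles execution.
    replies = await asyncio.gather(*(call_sub_llm(p) for p in prompts))
    return replies, peak


replies, peak = asyncio.run(run_batch([f"prompt {i}" for i in range(12)]))
print(len(replies), peak)  # prints "12 5"
```

All twelve prompts complete and come back in submission order, but no more than five run at the same time.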

Metrics

| Metric | Meaning |
| --- | --- |
| reward | Accuracy (judge-based) |
| sub_llm_call_count | Number of sub-LLM calls made |
| sub_llm_prompt_tokens | Total prompt tokens from sub-LLMs |
| sub_llm_completion_tokens | Total completion tokens from sub-LLMs |
| sub_llm_total_tool_calls | Total tool calls made by sub-LLMs |
| sub_llm_total_turns | Total turns (LLM calls) made by sub-LLMs |
| sub_llm_batch_count | Number of llm_batch() invocations |
| sub_llm_max_batch_size | Max batch size (peak parallelism) in a single llm_batch() call |
| sub_llm_mean_batch_size | Mean batch size across all llm_batch() invocations |
| main_rlm_turns | Number of main model REPL turns |
| main_rlm_prompt_tokens | Main model prompt tokens |
| main_rlm_completion_tokens | Main model completion tokens |
| repl_total_time_seconds | Total time spent in the REPL tool |
| repl_call_count | Number of REPL tool calls |
| repl_mean_time_seconds | Mean REPL tool call time |
| search_web_mean_queries | Mean number of queries per search_web call |
| search_web_error_rate | Fraction of sub-LLM search_web tool calls that returned errors |
| scan_page_error_rate | Fraction of sub-LLM scan_page tool calls that returned errors |
| open_lines_error_rate | Fraction of sub-LLM open_lines tool calls that returned errors |
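
To make the batch and error-rate metrics concrete, here is a small sketch that derives them from raw counts, following the meanings in the table. The function names and input shapes are hypothetical, the value reported for empty input is an assumption, and the environment may compute these differently.

```python
from statistics import mean


def batch_metrics(batch_sizes):
    """Summarize llm_batch() calls given the number of prompts in each call."""
    if not batch_sizes:
        # Assumption: report zeros when llm_batch() was never called.
        return {
            "sub_llm_batch_count": 0,
            "sub_llm_max_batch_size": 0,
            "sub_llm_mean_batch_size": 0.0,
        }
    return {
        "sub_llm_batch_count": len(batch_sizes),
        "sub_llm_max_batch_size": max(batch_sizes),
        "sub_llm_mean_batch_size": mean(batch_sizes),
    }


def error_rate(errored):
    """Fraction of tool calls that errored; `errored` holds one bool per call."""
    return sum(errored) / len(errored) if errored else 0.0


metrics = batch_metrics([3, 5, 4])  # three llm_batch() calls
scan_page_rate = error_rate([True, False, False, False])  # one failure in four
```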

Changelog

  • 0.2.13: Add a startup cache smoketest (write/read/delete round-trip) so misconfigured caches (wrong dir, no write permission, full disk, corrupt SQLite) raise a clear RuntimeError from configure_cache instead of silently turning every fetch into a cache-flavored error. Also shorten the TTL for cached fetch errors from cache_ttl_seconds (1 week) to a new error_cache_ttl_seconds (60s default) so transient failures don't pin a URL as broken; errors are no longer mirrored into the no-TTL mem cache.
  • 0.2.12: Extend the judge prompt with a non-commit clause so refusal-style answers ("the answer cannot be determined", "I don't know", etc.) are scored as incorrect rather than getting credit.
  • 0.2.11: Default judge requests now use Pinference (https://api.pinference.ai/api/v1) with PRIME_API_KEY and the Pinference-qualified openai/gpt-4.1-mini model name.
  • 0.2.10: Add max_concurrent_search argument to make the parallel-query limit of search_web user-configurable (default unchanged at 10)
  • 0.2.9: Replace tool_placement with tools_on_root, tools_in_repl, tools_on_sub flags for flexible tool placement. include_env_tips is temporarily a no-op (returns empty string with a warning) pending update for the new flags.
  • 0.2.8: Add tool_placement argument to control whether search tools go to root, sub-LLMs, or both
  • 0.2.7: Add missing dataset_* arguments to README and the new dataset_subset argument to the environment
  • 0.2.6: align arg names with simplified RLMEnv (max_iterations → max_turns, sub_tool_max_turns → sub_llm_max_turns, sandbox params → sandbox_* prefix)
  • 0.2.5: sandbox labels no longer force the default label to be included
  • 0.2.4
    • Bump to verifiers>=v0.1.11.dev0 to support new types
  • 0.2.3
    • Add prompt_in_context_file option to move the prompt into context.txt and leave the user prompt empty.
  • 0.2.2
    • Validate sandbox_labels is a list of strings and always include deepdive-rlm.
    • Stop rollouts on Serper API failures and return 0 reward when they occur.
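
The write/read/delete round-trip described in the 0.2.13 entry can be illustrated with a plain-file probe. This is a sketch of the idea only: the real check lives in configure_cache and targets a diskcache/SQLite cache, whereas the hypothetical `smoketest_cache_dir` below uses ordinary files from the standard library.

```python
import os
import tempfile
import uuid


def smoketest_cache_dir(cache_dir):
    """Write, read back, and delete a probe file; raise RuntimeError on failure."""
    probe_path = os.path.join(cache_dir, f".smoketest-{uuid.uuid4().hex}")
    payload = b"cache-smoketest"
    try:
        with open(probe_path, "wb") as handle:
            handle.write(payload)
        with open(probe_path, "rb") as handle:
            if handle.read() != payload:
                raise OSError("read-back mismatch")
        os.remove(probe_path)
    except OSError as exc:
        # Surface one clear error at startup instead of failing on every fetch.
        raise RuntimeError(f"Cache at {cache_dir!r} failed smoketest: {exc}") from exc


with tempfile.TemporaryDirectory() as good_dir:
    smoketest_cache_dir(good_dir)  # passes silently on a writable directory
```

A missing directory, a read-only mount, or a full disk all surface as the same RuntimeError before any rollout starts.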