# Sophon - full catalog detail

> Expanded companion to /llms.txt. Each entry inlines the full description, key model scores, and relationships for the most-connected entities per type, so you can answer without fetching each page. Bounded to the top entities per section. The complete catalog (every model, eval, tool, and tens of thousands of papers) is served as JSON at https://sophon.at/api/v1; the curated one-line index is at https://sophon.at/llms.txt.

## Evals

### AIME 2024: Problems from the American Invitational Mathematics Examination
https://sophon.at/evals/aime2024
Official 15-problem high-school math olympiad-track exam used by labs as a fresh, contamination-resistant math reasoning benchmark.
- Domain: math · Format: custom · License: Unknown
- Capabilities: math, planning
- Top scores: o3 96.7% (AIME 2024); GPT-5 94.6% (AIME 2025, no tools); Grok 4 94.3% (AA); Qwen3 235B A22B Thinking 2507 94.0% (AA); o4 Mini 94.0% (AA)

### GPQA Diamond
https://sophon.at/evals/gpqa-diamond
Graduate-level physics, chemistry, and biology multiple-choice questions written by PhDs and verified to be Google-proof.
- Domain: science · Format: hf_dataset · License: CC-BY-4.0
- Capabilities: scientific reasoning, factual recall
- Top scores: Gemini 3.1 Pro Preview 94.1% (AA); Gemini 3 Deep Think 93.8%; Qwen3.7 Max 92.3% (AA); Claude Opus 4.8 92.0% (AA); Gemini 3 Pro 91.9%

### HumanEval
https://sophon.at/evals/humaneval
164 hand-written Python programming problems with unit tests, the original LLM code-generation benchmark from OpenAI's Codex paper.
- Domain: code · Format: hf_dataset · License: MIT
- Capabilities: code generation
- Top scores: Qwen2.5 Coder 32B Instruct 92.1% (EvalPlus); Gemini 1.5 Pro 002 89.0% (EvalPlus); Grok Beta 88.4% (EvalPlus); Gemini 1.5 Flash 002 82.3% (EvalPlus); Llama 3 Instruct 70B 77.4% (EvalPlus)

### MATH-500
https://sophon.at/evals/math-500
500-problem subset of the Hendrycks MATH competition-math benchmark, popularized by OpenAI's PRM800K work as a standard evaluation slice.
- Domain: math · Format: hf_dataset · License: MIT
- Capabilities: math, planning
- Top scores: o3 99.2% (AA); Grok 3 mini 99.2% (AA); Grok 4 99.0% (AA); o4 Mini 98.9% (AA); Gemini 2.5 Pro Preview (May' 25) 98.6% (AA)

### MMLU-Pro
https://sophon.at/evals/mmlu-pro
Harder, reasoning-focused successor to MMLU with 10 answer choices and curated questions resistant to lucky guessing.
- Domain: general · Format: hf_dataset · License: MIT
- Capabilities: factual recall, scientific reasoning
- Top scores: Gemini 3 Pro 89.8% (AA); Gemini 3 Pro Preview 89.5% (AA); Gemini 3 Flash Preview 89.0% (AA); Claude Opus 4.5 88.9% (AA); GPT-5.5 88.6% (1000 samples)

### Massive Multitask Language Understanding (MMLU)
https://sophon.at/evals/mmlu
57-subject multiple-choice exam testing broad world knowledge and reasoning across academic and professional domains.
- Domain: general · Format: hf_dataset · License: MIT
- Capabilities: factual recall, scientific reasoning
- Top scores: Tülu 3 70B 83.1; GPT-4.1 Mini 80.0% (10 samples)

### Arena-Hard
https://sophon.at/evals/arena-hard
500 challenging real-user prompts mined from Chatbot Arena and graded by a strong LLM judge for pairwise win rate.
- Domain: general · Format: custom · License: Apache-2.0
- Capabilities: instruction following, llm judging
- Top scores: o3 88.8%; Gemini 2.5 Flash 83.9%; R1 77.0%; Gemma 3 27B 69.9%; GPT-4.1 61.5%

### BIG-Bench Hard (BBH)
https://sophon.at/evals/bbh
23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.
- Domain: general · Format: hf_dataset · License: MIT
- Capabilities: planning, scientific reasoning, math, logic
- Top scores: Internlm2 5 20B Chat 74.7% (OpenLLM); Qwen2.5 72B Instruct 72.7% (OpenLLM); Qwen2 72B Instruct 69.8% (OpenLLM); Llama 3.3 Instruct 70B 69.2% (OpenLLM); Llama 3.1 70B Instruct 69.2% (OpenLLM)

### GSM8K
https://sophon.at/evals/gsm8k
8.5k grade-school math word problems requiring multi-step arithmetic reasoning to reach a single numeric answer.
- Domain: math · Format: hf_dataset · License: MIT
- Capabilities: math, planning
- Top scores: GLM 4.7 90.0% (10 samples); Qwen2.5 3B 66.7% (15 samples); Qwen3 5 2B 64.1% (256 samples); Qwen3 5 0 8B 36.3% (1024 samples); Qwen3 8B 12.5% (4 samples)

### LegalBench
https://sophon.at/evals/legalbench
162 collaboratively curated legal-reasoning tasks across rule-recall, issue-spotting, application, and interpretation - the standard legal LLM benchmark.
- Domain: legal · Format: hf_dataset · License: CC-BY-4.0
- Capabilities: legal reasoning, factual recall
- Top scores: GPT-4o-mini 72.0% (50 samples); GPT-4.1 Mini 62.4% (30 samples); GPT-5 58.2% (30 samples); GPT-4o 0.0% (5 samples); Claude Sonnet 4.5 0.0% (3 samples)

### LiveBench
https://sophon.at/evals/livebench
Rolling contamination-free benchmark that updates questions monthly across math, coding, reasoning, language, instruction-following, and data analysis.
- Domain: general · Format: custom · License: Apache-2.0
- Capabilities: math, code generation, instruction following, factual recall
- Top scores: Gemini 3.1 Pro Preview 82.4% (LiveBench); Claude Opus 4.8 80.1% (LiveBench); GPT-5.4 79.8% (LiveBench); Claude Opus 4.7 79.7% (LiveBench); Gemini 3.5 Flash 78.9% (LiveBench)

### LiveCodeBench
https://sophon.at/evals/livecodebench
Rolling competitive-programming benchmark that scrapes LeetCode / AtCoder / Codeforces problems after a known cutoff to fight contamination.
- Domain: code · Format: custom · License: MIT
- Capabilities: code generation, debugging
- Top scores: Gemini 3 Pro 91.7% (AA); Gemini 3 Flash Preview 90.8% (AA); DeepSeek V3.2 Speciale 89.6% (AA); o4 Mini 85.9% (AA); Gemini 3 Pro Preview 85.7% (AA)

### Mostly Basic Python Problems (MBPP)
https://sophon.at/evals/mbpp
974 short crowd-sourced Python tasks with three unit tests each, used alongside HumanEval as a baseline code-generation benchmark.
- Domain: code · Format: hf_dataset · License: CC-BY-4.0
- Capabilities: code generation
- Top scores: GPT-5 Nano 100.0% (10 samples); Qwen2.5 Coder 32B Instruct 90.5% (EvalPlus); Gemini 1.5 Pro 002 89.7% (EvalPlus); Grok Beta 86.0% (EvalPlus); Gemini 1.5 Flash 002 84.7% (EvalPlus)

### SWE-Lancer
https://sophon.at/evals/swe-lancer
1,488 real freelance software-engineering tasks from Upwork worth $1M total in payouts, evaluating models on end-to-end paid developer work.
- Domain: code · Format: custom · License: MIT
- Capabilities: code editing, code generation, planning, tool calling

### SWE-bench
https://sophon.at/evals/swe-bench
2,294 real GitHub issues from 12 popular Python repos that require an agent to produce a patch passing the project's test suite.
- Domain: code · Format: custom · License: MIT
- Capabilities: code editing, debugging, tool calling, planning
- Top scores: DeepSeek V4 Pro 80.6% (Verified, per DeepSeek); Claude Sonnet 4.5 77.2% (Verified); Gemini 3 Pro 76.2% (Verified); GPT-5 74.9% (Verified); o3 71.7% (Verified)

### SWE-bench Lite
https://sophon.at/evals/swe-bench-lite
300-issue subset of SWE-bench focused on functional bug fixes that are easier to evaluate - used for fast iteration before full SWE-bench runs.
- Domain: code · Format: custom · License: MIT
- Capabilities: code editing, debugging, tool calling
- Top scores: Claude 4 Sonnet 58.3% (SWE-bench); Claude 3.5 Sonnet 51.3% (SWE-bench); Qwen3 Coder 30B A3B Instruct 49.7% (SWE-bench); Claude Sonnet 3.7 48.0% (SWE-bench); GPT-4o (2024-08-06) 39.7% (SWE-bench)

### SWE-bench Verified
https://sophon.at/evals/swe-bench-verified
500 human-validated SWE-bench tasks confirmed solvable from the issue alone, with non-flaky test suites - the most-reported agentic coding benchmark.
- Domain: code · Format: custom · License: MIT
- Capabilities: code editing, debugging, tool calling, planning
- Top scores: Claude Opus 4.5 79.2% (SWE-bench); Gemini 3 Pro 77.4% (SWE-bench); Gemini 3 Flash 75.8% (SWE-bench); MiniMax M2.5 75.8% (SWE-bench); Claude Opus 4.6 75.6% (SWE-bench)

### TruthfulQA
https://sophon.at/evals/truthfulqa
817 questions targeting common human misconceptions, measuring whether a model gives factually true answers or repeats popular falsehoods.
- Domain: general · Format: hf_dataset · License: Apache-2.0
- Capabilities: hallucination, factual recall

### Vals.ai Legal Evals
https://sophon.at/evals/vals-legal-evals
Vals.ai's proprietary suite of legal-domain benchmarks (contract review, hallucination tests, LegalBench Pro) used by law firms to procure LLMs.
- Domain: legal · Format: custom · License: Closed
- Capabilities: legal reasoning, hallucination, factual recall

### AA-Omniscience
https://sophon.at/evals/aa-omniscience
Artificial Analysis's broad-knowledge benchmark - thousands of curated factual questions spanning specialized domains - designed to test hallucination calibration.
- Domain: general · Format: custom · License: Closed
- Capabilities: factual recall, hallucination

### AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
https://sophon.at/evals/agieval
AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
- Domain: Knowledge · License: mit

### AIME 2025: Problems from the American Invitational Mathematics Examination
https://sophon.at/evals/aime2025
A benchmark for evaluating AI's ability to solve challenging mathematics problems from the 2025 AIME - a prestigious high school mathematics competition.
- Domain: Mathematics · License: mit
- Top scores: GPT-5 Codex 98.7% (AA); Gemini 3 Flash Preview 97.0% (AA); DeepSeek V3.2 Speciale 96.7% (AA); GPT-5.1-Codex 95.7% (AA); Gemini 3 Pro 95.7% (AA)

### AIME2024
https://sophon.at/evals/openreward-generalreasoning-aime2024
Problems from the American Invitational Mathematics Examination (AIME) 2024.
- Domain: rl-env · License: unknown
- Top scores: GPT-5 pro (python) 100 pass@1; o1 96 pass@1; o3 91.6 pass@1; Qwen 3 Coder Next 89.01 pass@1; R1 79.8 pass@1

### AIME2025
https://sophon.at/evals/openreward-generalreasoning-aime2025
Problems from the American Invitational Mathematics Examination (AIME) 2025-I & II.
- Domain: rl-env · License: unknown
- Top scores: Claude Sonnet 4.5 100 pass@1; Grok 4 Heavy (with python) 100 pass@1; GPT-5.2 100 pass@1; Step 3.5 Flash (parallel thinking) 99.9 pass@1; Claude Opus 4.6 99.79 pass@1

### AIME2026
https://sophon.at/evals/openreward-generalreasoning-aime2026
Problems from the American Invitational Mathematics Examination (AIME) 2026-I & II.
- Domain: rl-env · License: unknown
- Top scores: Qwen3.5 397B A17B 96.7 Accuracy; Qwen3.6 Plus 95.3 Accuracy

### AIR Bench: AI Risk Benchmark
https://sophon.at/evals/air-bench
A safety benchmark evaluating language models against risk categories derived from government regulations and company policies.
- Domain: Knowledge · License: mit

### ALFRED
https://sophon.at/evals/alfred
3D-simulated household tasks driven by language instructions and egocentric video - the visual sibling of ALFWorld.
- Domain: robotics · Format: custom · License: MIT
- Capabilities: embodied, image understanding, planning, instruction following

### ALFWorld
https://sophon.at/evals/alfworld
Embodied household-task benchmark that aligns TextWorld text commands with ALFRED 3D scenes, testing whether agents can transfer from abstract text policies to grounded execution.
- Domain: agentic · Format: custom · License: MIT
- Capabilities: embodied, planning, tool calling, common sense, multi turn dialog

### ANIMA: Animal Norms In Moral Assessment
https://sophon.at/evals/anima
Evaluates the quality of a model's moral reasoning about animal welfare across 13 ethical dimensions.
- Domain: Safeguards · License: mit

### APE: Attempt to Persuade Eval
https://sophon.at/evals/ape
Measures a model's willingness to attempt persuasion on harmful, controversial, and benign topics. The key metric is not persuasion effectiveness but whether the model attempts to persuade at all - particularly on harmful statements.
Uses a multi-model
- Domain: Safeguards · License: mit

### APEX
https://sophon.at/evals/apex
Mercor's expert-graded eval - domain experts (doctors, lawyers, engineers) grade model responses on long-form professional tasks they would actually be paid to do.
- Domain: general · Format: manual · License: Closed
- Capabilities: instruction following, factual recall, llm judging

### APPS: Automated Programming Progress Standard
https://sophon.at/evals/apps
APPS is a dataset for evaluating model performance on Python programming tasks across three difficulty levels consisting of 1,000 at introductory, 3,000 at interview, and 1,000 at competition level. The dataset consists of an additional 5,000 training samples, for a total of 10,0
- Domain: Coding · License: mit

### ARC (AI2 Reasoning Challenge)
https://sophon.at/evals/arc
Grade-school science multiple-choice questions (Easy and Challenge sets) drawn from US standardized tests - an early language-understanding benchmark.
- Domain: science · Format: hf_dataset · License: CC-BY-SA-4.0
- Capabilities: scientific reasoning, factual recall

### Absolute Zero
https://sophon.at/evals/prime-ergotts-absolute-zero
Absolute Zero Reasoner paper implementation
- Domain: rl-env · License: unknown
- Top scores: GPT-4.1 Mini -0.17 (6 samples)

### AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
https://sophon.at/evals/abstention-bench
Evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information.
- Domain: Safeguards · License: mit

### Acebench Agent Multistep
https://sophon.at/evals/prime-ob1-acebench-agent-multistep
A multi-turn agent environment from ACEBench that evaluates a model's ability to perform complex, sequential tool-use tasks to reach a correct fina...
- Domain: rl-env · License: unknown
- Top scores: Qwen3 Coder 30B A3B Instruct 63.3% (60 samples); Qwen3 30B A3B 55.0% (60 samples); Qwen3 4B 31.7% (60 samples); Qwen3 4B Instruct 30.0% (60 samples)

### Acereason Math
https://sophon.at/evals/prime-primeintellect-acereason-math
Single-turn math word problems from NVIDIA AceReason-Math with boxed numeric answers and CoT.
- Domain: rl-env · License: apache-2.0
- Top scores: GPT-4.1 Mini 60.0% (15 samples); GPT-5 46.7% (15 samples); Claude Sonnet 4.5 0.0% (15 samples)

### Ade Bench
https://sophon.at/evals/ade-bench
Analytics Data Engineer Bench: tasks evaluating AI agents on dbt/SQL data analytics engineering bugs. Original benchmark: https://github.com/dbt-labs/ade-bench.
- Domain: agent-eval

### AdvBench
https://sophon.at/evals/advbench
520 harmful behaviors and 520 harmful strings used as the standard adversarial-suffix evaluation set in the GCG / universal-jailbreak literature.
- Domain: safety · Format: hf_dataset · License: MIT
- Capabilities: safety, jailbreak resistance

### Agent Diff Bench
https://sophon.at/evals/prime-hubert-marek-agent-diff-bench
Benchmark for evaluating agents on Slack, Linear, Box, Calendar via Bash & Python
- Domain: rl-env · License: unknown
- Top scores: DeepSeek V3.2 74.7% (100 samples); Grok 4.1 Fast 62.0% (179 samples); Step 11 45.1% (135 samples); Step 7 39.3% (132 samples); Ministral 3 14B 32.2% (102 samples)

## Tools (RL envs, datasets, scaffolds)

### VF Openbench RL Env (Community)
https://sophon.at/tools/prime-aarush-vf-openbench
Environment for single-turn tasks in OpenBench
- Type: rl_env · License: unknown
- Improves: AIME 2024: Problems from the American Invitational Mathematics Examination, GPQA Diamond, MATH-500, MATH, MGSM (Multilingual GSM8K), Massive Multitask Language Understanding (MMLU), MuSR, SimpleQA

### Agent Bench RL Env (Prime Community)
https://sophon.at/tools/prime-prime-community-mini-swe-agent-bench
Benchmarking model performance on SWE Bench in the Mini SWE
Agent harness.
- Type: rl_env · License: unknown
- Improves: SWE-bench Lite, SWE-bench Verified, SWE-bench, SWE-bench Verified: Resolving Real-World GitHub Issues, SWE-bench Multilingual, SWE-bench Multimodal

### BrowserGym
https://sophon.at/tools/browsergym
ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and more under one Playwright-backed interface.
- Type: rl_env · Domain: agentic · License: Apache-2.0
- Improves: AssistantBench, MiniWoB++, VisualWebArena, WebArena, WorkArena

### NuminaMath
https://sophon.at/tools/numina-math
An 860k-problem competition-math dataset with detailed solutions, the open community's go-to corpus for training math-specialized LLMs.
- Type: sft_dataset · Domain: math · License: Apache-2.0
- Improves: AIME 2024: Problems from the American Invitational Mathematics Examination, GSM8K, MATH-500, MathVista

### OpenThoughts
https://sophon.at/tools/openthoughts
A fully-open distillation of long DeepSeek-R1 reasoning traces - the community's flagship "open R1" SFT corpus for reasoning models.
- Type: sft_dataset · Domain: general · License: Apache-2.0
- Improves: AIME 2024: Problems from the American Invitational Mathematics Examination, GPQA Diamond, LiveCodeBench, MATH-500

### Tülu 3 SFT Mixture
https://sophon.at/tools/tulu-3-sft-mixture
Allen AI's flagship open SFT mixture combining new persona-driven prompts with curated public data for post-training a frontier-quality instruct model.
- Type: sft_dataset · Domain: general · License: ODC-BY-1.0
- Improves: AlpacaEval, GSM8K, IFEval, Massive Multitask Language Understanding (MMLU)

### WizardLM Evol-Instruct
https://sophon.at/tools/wizardlm-evol-instruct
Microsoft's "Evol-Instruct" recipe - automatically rewriting simple instructions into harder, more diverse ones using an LLM evolver.
- Type: sft_dataset · Domain: general · License: Microsoft Research License
- Improves: AlpacaEval, GSM8K, HumanEval, MT-Bench

### Agent PLUS RL Env (Prime Intellect)
https://sophon.at/tools/prime-primeintellect-mini-swe-agent-plus
Mini SWE Agent Plus environment for solving SWE issues inside Prime Sandboxes.
- Type: rl_env · Domain: code · License: unknown
- Improves: SWE-bench Lite, SWE-bench Verified, SWE-bench Verified: Resolving Real-World GitHub Issues

### Aya Dataset
https://sophon.at/tools/aya-dataset
Cohere For AI's massively multilingual instruction dataset covering 65 languages, built by a 3,000-person open-science collaboration.
- Type: sft_dataset · Domain: multilingual · License: Apache-2.0
- Improves: Aya Evaluation Suite, MGSM (Multilingual GSM8K), Multilingual MMLU

### Bigbench BBH RL Env (Prime Community)
https://sophon.at/tools/prime-prime-community-bigbench-bbh
Big Bench + BBH implementation
- Type: rl_env · License: unknown
- Improves: BIG-Bench Hard (BBH), BIG-Bench, BBH: Challenging BIG-Bench Tasks

### COT Theater RL Env (Community)
https://sophon.at/tools/prime-danruif-cot-theater
Reward-hacking sprint env. Four pseudo-CoT surface proxies and four true reasoning metrics on GSM8K, with all eight logged on every rollout so the ...
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Certainty Collapse RL Env (Community)
https://sophon.at/tools/prime-cardan05-certainty-collapse
Reward Hacking Sprint: does optimizing self-certainty (RLIF-style intrinsic reward) cause models to be confidently wrong on math? GSM8K, Llama-3.2-...
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Compositional Hacks RL Env (Community)
https://sophon.at/tools/prime-danruif-compositional-hacks
Reward-hacking sprint env that plants two hidden rewards at once on GSM8K to probe whether one dominates or both emerge proportionally.
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Context Needle RL Env (Community)
https://sophon.at/tools/prime-stelioszach-long-context-needle
Needle-in-haystack - locate a target sentence in a long document.
- Type: rl_env · License: apache-2.0
- Improves: Needle in a Haystack, Needle in a Haystack (NIAH), Needle in a Haystack (NIAH): In-Context Retrieval Benchmark for Long Context LLMs

### Deepconf RL Env (Community)
https://sophon.at/tools/prime-tonic-deepconf
DeepConf environment for confidence-aware LLM reasoning evaluation
- Type: rl_env · Domain: math · License: unknown
- Improves: AIME 2024: Problems from the American Invitational Mathematics Examination, GPQA Diamond, AIME 2025: Problems from the American Invitational Mathematics Examination

### Deepswe RL Env (Prime Intellect)
https://sophon.at/tools/prime-primeintellect-deepswe
DeepSWE environment for solving SWE issues inside Prime Sandboxes.
- Type: rl_env · Domain: code · License: unknown
- Improves: SWE-bench Lite, SWE-bench Verified, SWE-bench Verified: Resolving Real-World GitHub Issues

### Discover Gsm8k RL Env (Community)
https://sophon.at/tools/prime-stochi0-discover-gsm8k
GSM8K rubric-discovery environment: learn rubric_fn from (input, response, score) examples
- Type: rl_env · Domain: math · License: apache-2.0
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Emergence Prediction RL Env (Community)
https://sophon.at/tools/prime-danruif-emergence-prediction
Reward-hacking sprint env.
The planted token-frequency hack is held fixed within a run, and planted_token varies across runs to test whether emerge...
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Emoji HACK RL Env (Community)
https://sophon.at/tools/prime-danruif-emoji-hack
Reward-hacking sprint env. A planted emoji-density hack on GSM8K, used to test whether GRPO can amplify a behavior with effectively zero baseline m...
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### FH Aviary RL Env (Prime Community)
https://sophon.at/tools/prime-prime-community-fh-aviary
Future House Aviary wrapper for verifiers - Scientific reasoning environments with tools
- Type: rl_env · Domain: science · License: mit
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Formatting Emergence RL Env (Community)
https://sophon.at/tools/prime-danruif-formatting-emergence
Reward-hacking sprint env. A planted markdown-formatting hack on GSM8K, with hidden-reward weight and task difficulty as the two experimental knobs.
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### GPQA Diamond RL Env (Community)
https://sophon.at/tools/prime-anshu-gpqa-diamond
GPQA Diamond: A Graduate-Level Google-Proof Q&A Benchmark
- Type: rl_env · Domain: medical · License: unknown
- Improves: GPQA Diamond, GPQA (Full Set), GPQA: Graduate-Level STEM Knowledge Challenge

### GPQA RL Env (Prime Intellect)
https://sophon.at/tools/prime-primeintellect-gpqa
GPQA evaluation environment
- Type: rl_env · License: unknown
- Improves: GPQA Diamond, GPQA (Full Set), GPQA: Graduate-Level STEM Knowledge Challenge

### Gsm8k Olmes RL Env (Community)
https://sophon.at/tools/prime-pmahdavi-gsm8k-olmes
GSM8K evaluation matching OLMES tulu_3_dev_no_safety methodology
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Gsm8k RL Env (Community)
https://sophon.at/tools/prime-d42me-gsm8k
GSM8K environment
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Gsm8k RL Env (Community)
https://sophon.at/tools/prime-will-gsm8k
GSM8K environment
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Gsm8k RL Env (Dev Team)
https://sophon.at/tools/prime-dev-team-gsm8k
GSM8K environment
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Gsm8k RL Env (Prime Intellect)
https://sophon.at/tools/prime-primeintellect-gsm8k
GSM8K environment
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Haystack RLM RL Env (Prime Intellect)
https://sophon.at/tools/prime-primeintellect-needle-in-haystack-rlm
Needle-in-haystack environment using RLM with Python REPL
- Type: rl_env
· Domain: code · License: unknown
- Improves: Needle in a Haystack, Needle in a Haystack (NIAH), Needle in a Haystack (NIAH): In-Context Retrieval Benchmark for Long Context LLMs

### HelpSteer2
https://sophon.at/tools/helpsteer2
NVIDIA's permissively-licensed human-annotated preference dataset with 5-axis Likert ratings - engineered to train high-quality reward models.
- Type: preference_dataset · Domain: general · License: CC-BY-4.0
- Improves: Arena-Hard, MT-Bench, RewardBench

## Models

### Qwen3.7Plus
https://sophon.at/models/qwen3.7-plus
Qwen3.7Plus is an AI model from Alibaba.
- Family: qwen3.7 · License: proprietary · Released: 2026-06-01
- Top scores: τ²-bench (Tau²-bench) 93.0% (AA); GPQA Diamond 90.0% (AA); IFBench 78.0% (AA); Vals Index 52.3%; Terminal-Bench (Hard) 47.0% (AA)

### Step 3.7 Flash
https://sophon.at/models/step-3-7-flash
- Released: 2026-05-29
- Top scores: τ²-bench (Tau²-bench) 98.5% (AA); GPQA Diamond 80.9% (AA); IFBench 67.3% (AA); SciCode 40.0% (AA); Terminal-Bench (Hard) 35.6% (AA)

### Claude Opus 4.8
https://sophon.at/models/claude-opus-4-8
Claude Opus 4.8 (Adaptive Reasoning, Max Effort) is an AI model from Anthropic.
- Family: claude · Context: 1,000,000 tokens · Released: 2026-05-28
- Top scores: τ²-bench (Tau²-bench) 94.4% (AA); GPQA Diamond 92.0% (AA); LiveBench - Reasoning 89.7% (LiveBench); MedScribe 85.8%; LiveBench - Math 84.3% (LiveBench)

### MiniCPM5-1B (Non-reasoning)
https://sophon.at/models/minicpm5-1b-non-reasoning
MiniCPM5-1B (Non-reasoning) is an AI model from OpenBMB.
- Family: minicpm · Released: 2026-05-25
- Top scores: τ²-bench (Tau²-bench) 82.5% (AA); IFBench 35.2% (AA); GPQA Diamond 26.9% (AA); Humanity's Last Exam (HLE) 4.6% (AA); SciCode 1.4% (AA)

### Command A+
https://sophon.at/models/command-a-plus
Command A+ is an AI model from Cohere.
- Family: command · Context: 256,000 tokens · Released: 2026-05-20
- Top scores: τ²-bench (Tau²-bench) 80.7% (AA); GPQA Diamond 76.1% (AA); IFBench 73.9% (AA); MedScribe 55.7%; CorpFin v2 53.1%

### Gemini 3.5 Flash
https://sophon.at/models/gemini-3.5-flash
Gemini 3.5 Flash is an AI model from Google (Alphabet Inc.).
- Family: gemini · Context: 1,048,576 tokens · License: proprietary · Released: 2026-05-19
- Top scores: LiveBench - Math 88.2% (LiveBench); LiveBench - Language 84.6% (LiveBench); GPQA Diamond 82.8% (AA); LiveBench - Reasoning 82.0% (LiveBench); LiveBench 78.9% (LiveBench)

### Qwen3.7 Max
https://sophon.at/models/qwen3.7-max
Qwen3.7Max is an AI model from Alibaba.
- Family: qwen3.7 · Context: 1,000,000 tokens · License: proprietary · Released: 2026-05-19
- Top scores: τ²-bench (Tau²-bench) 94.7% (AA); GPQA Diamond 92.3% (AA); LiveBench - Math 85.2% (LiveBench); LiveBench - Reasoning 83.3% (LiveBench); IFBench 80.5% (AA)

### JT-35B-Flash
https://sophon.at/models/jt-35b-flash
JT-35B-Flash is an AI model from China Mobile.
- Family: jt · Released: 2026-05-14
- Top scores: τ²-bench (Tau²-bench) 99.1% (AA); GPQA Diamond 82.9% (AA); IFBench 42.0% (AA); SciCode 29.1% (AA); Terminal-Bench (Hard) 28.8% (AA)

### MiniCPM-V 4.6 1.3B
https://sophon.at/models/minicpm-v4-6-1-3b
MiniCPM-V 4.6 1.3B is an AI model from OpenBMB.
- Family: minicpm · Released: 2026-05-11
- Top scores: τ²-bench (Tau²-bench) 87.7% (AA); GPQA Diamond 30.5% (AA); IFBench 26.7% (AA); Humanity's Last Exam (HLE) 4.9% (AA); SciCode 2.1% (AA)

### Ring-2.6-1T
https://sophon.at/models/ring-2-6-1t
Ring-2.6-1T is an AI model from InclusionAI.
- Family: ring · Context: 262,144 tokens · Released: 2026-05-08
- Top scores: τ²-bench (Tau²-bench) 92.4% (AA); GPQA Diamond 85.7% (AA); IFBench 44.6% (AA); SciCode 42.4% (AA); Terminal-Bench (Hard) 28.8% (AA)

### GPT-5.5 Instant (May 2026)
https://sophon.at/models/gpt-5-5-instant-05-26
GPT-5.5 Instant (May 2026) is an AI model from OpenAI.
- Family: gpt · License: proprietary · Released: 2026-05-05
- Top scores: GPQA Diamond 84.6% (AA); IFBench 71.5% (AA); SciCode 50.3% (AA); τ²-bench (Tau²-bench) 49.4% (AA); Terminal-Bench (Hard) 42.4% (AA)

### Grok 4.3
https://sophon.at/models/grok-4.3
Grok 4.3 is an AI model from xAI.
- Family: grok · Context: 1,000,000 tokens · License: proprietary · Released: 2026-04-30
- Top scores: LiveBench - Math 84.3% (LiveBench); MedScribe 74.4%; LiveBench - Language 73.6% (LiveBench); LiveBench - Reasoning 70.8% (LiveBench); TaxEval v2 70.8%

### Granite 4.1 30B
https://sophon.at/models/granite-4-1-30b
Granite 4.1 30B is an AI model from IBM.
- Family: granite · Released: 2026-04-29
- Top scores: GPQA Diamond 48.1% (AA); IFBench 44.4% (AA); τ²-bench (Tau²-bench) 42.1% (AA); SciCode 25.8% (AA); Humanity's Last Exam (HLE) 4.2% (AA)

### Granite 4.1 3B
https://sophon.at/models/granite-4-1-3b
Granite 4.1 3B is an AI model from IBM.
- Family: granite · Released: 2026-04-29
- Top scores: IFBench 33.7% (AA); GPQA Diamond 31.4% (AA); τ²-bench (Tau²-bench) 19.6% (AA); SciCode 11.9% (AA); Humanity's Last Exam (HLE) 3.4% (AA)

### Granite 4.1 8B
https://sophon.at/models/granite-4.1-8b
granite-4.1-8b is an AI model from IBM, released with open weights.
- Family: granite · Context: 131,072 tokens · License: apache-2.0 · Released: 2026-04-29
- Top scores: GPQA Diamond 43.3% (AA); IFBench 38.6% (AA); τ²-bench (Tau²-bench) 27.8% (AA); SciCode 21.8% (AA); Humanity's Last Exam (HLE) 3.8% (AA)

### Mistral Medium 3.5
https://sophon.at/models/mistral-medium-3-5
Mistral Medium 3.5 is an AI model from Mistral AI.
- Family: mistral · Context: 262,144 tokens · Released: 2026-04-29
- Top scores: τ²-bench (Tau²-bench) 94.2% (AA); GPQA Diamond 74.8% (AA); IFBench 68.8% (AA); TaxEval v2 68.0%; MedScribe 67.7%

### Nemotron 3 Nano Omni 30B A3B Reasoning
https://sophon.at/models/nemotron-3-nano-omni-30b-a3b
Nemotron 3 Nano Omni 30B A3B Reasoning is an AI model from NVIDIA.
- Family: nemotron · Released: 2026-04-29
- Top scores: IFBench 63.2% (AA); GPQA Diamond 46.9% (AA); τ²-bench (Tau²-bench) 45.3% (AA); SciCode 27.8% (AA); Terminal-Bench (Hard) 8.3% (AA)

### DeepSeek V4 Flash
https://sophon.at/models/deepseek-v4-flash
DeepSeek V4 Flash is an AI model from DeepSeek, released with open weights.
- Family: deepseek · Context: 1,048,576 tokens · License: mit · Released: 2026-04-24
- Top scores: Autonomous Skill Evolution 1.88 (9 samples); τ²-bench (Tau²-bench) 94.4% (AA); LiveBench - Math 79.6% (LiveBench); MathArena 76.61%; GPQA Diamond 71.6% (AA)

### DeepSeek V4 Pro
https://sophon.at/models/deepseek-v4-pro
DeepSeek's April 2026 next-gen open-weights flagship - 1.6T-total / 49B-active MoE with 1M context and DeepSeek Sparse Attention.
- Family: deepseek · Params: 1.6T total / 49B active · Context: 1,048,576 tokens · License: mit · Released: 2026-04-24
- Top scores: τ²-bench (Tau²-bench) 91.2% (AA); LiveBench - Math 90.7% (LiveBench); LiveBench - Reasoning 82.7% (LiveBench); SWE-bench 80.6% (Verified, per DeepSeek); LiveBench - Language 78.1% (LiveBench)

### GPT-5.5
https://sophon.at/models/gpt-5.5
GPT-5.5 is an AI model from OpenAI.
- Family: gpt · Context: 1,050,000 tokens · License: proprietary · Released: 2026-04-23
- Top scores: Slitherlink Env 4.2 (10 samples); Crystal Relaxation Rlm 100.0% (1 samples); Physgym Arena Medley Public 100.0% (5 samples); MathArena 92.82%; Crystal Relaxation 90.0% (1 samples)

### Hy3 preview
https://sophon.at/models/hy3-preview
Hy3-preview is an AI model from Tencent.
- Family: hy · Context: 262,144 tokens · Released: 2026-04-23
- Top scores: GPQA Diamond 73.2% (AA); τ²-bench (Tau²-bench) 67.5% (AA); IFBench 48.0% (AA); SciCode 39.4% (AA); Terminal-Bench (Hard) 31.8% (AA)

### Ling-2.6-1T
https://sophon.at/models/ling-2-6-1t
Ling-2.6-1T is an AI model from InclusionAI.
- Family: ling · Context: 262,144 tokens · Released: 2026-04-23
- Top scores: τ²-bench (Tau²-bench) 89.8% (AA); GPQA Diamond 75.2% (AA); IFBench 56.9% (AA); SciCode 37.0% (AA); Terminal-Bench (Hard) 31.1% (AA)

### MiMo-V2.5
https://sophon.at/models/mimo-v2-5-0424
MiMo-V2.5 is an AI model from Xiaomi.
- Family: mimo · Released: 2026-04-22
- Top scores: τ²-bench (Tau²-bench) 90.6% (AA); GPQA Diamond 84.9% (AA); IFBench 67.1% (AA); SciCode 43.1% (AA); Terminal-Bench (Hard) 41.7% (AA)

### MiMo-V2.5-Pro
https://sophon.at/models/mimo-v2.5-pro
mimo-v2.5-pro is an AI model from Xiaomi, released with open weights.
- Family: mimo · Context: 1,048,576 tokens · License: mit · Released: 2026-04-22
- Top scores: Skill Reward Hacking 9.48 (25 samples); MMLU-Pro 85.1% (1000 samples); GPQA Diamond 76.2% (AA); τ²-bench (Tau²-bench) 72.5% (AA); IFBench 42.7% (AA)

### Qwen3.6 27B
https://sophon.at/models/qwen3-6-27b
Qwen3.6 27B is an AI model from Alibaba.
- Family: qwen · Context: 262,144 tokens · Released: 2026-04-22
- Top scores: τ²-bench (Tau²-bench) 93.6% (AA); GPQA Diamond 82.9% (AA); LiveBench - Math 79.9% (LiveBench); LiveBench - Coding 71.8% (LiveBench); TaxEval v2 71.3%

### Ling-2.6-flash
https://sophon.at/models/ling-2-6-flash
Ling 2.6 Flash is an AI model from InclusionAI.
- Family: ling · Context: 262,144 tokens · Released: 2026-04-21
- Top scores: τ²-bench (Tau²-bench) 86.0% (AA); GPQA Diamond 59.3% (AA); IFBench 57.4% (AA); SciCode 27.1% (AA); Terminal-Bench (Hard) 21.2% (AA)

### Kimi K2.6
https://sophon.at/models/kimi-k2.6
kimi-k2.6 is an AI model from Moonshot AI.
- Family: kimi · Context: 262,144 tokens · License: Modified MIT · Released: 2026-04-20
- Top scores: Physgym Arena Medley Public 100.0% (5 samples); τ²-bench (Tau²-bench) 95.9% (AA); GPQA Diamond 91.1% (AA); MedScribe 78.1%; IFBench 76.0% (AA)

### Kimi K2.6 (Non-reasoning)
https://sophon.at/models/kimi-k2-6-non-reasoning
Kimi K2.6 (Non-reasoning) is an AI model from Kimi.
- Family: kimi · Released: 2026-04-20
- Top scores: τ²-bench (Tau²-bench) 93.9% (AA); GPQA Diamond 78.8% (AA); IFBench 44.3% (AA); SciCode 39.5% (AA); Terminal-Bench (Hard) 37.9% (AA)

### Qwen3.6 Max Preview
https://sophon.at/models/qwen3.6-max
Qwen3.6 Max Preview is an AI model from Alibaba.
- Family: qwen3.6 · License: proprietary · Released: 2026-04-20
- Top scores: τ²-bench (Tau²-bench) 95.9% (AA); GPQA Diamond 88.8% (AA); IFBench 76.6% (AA); CorpFin v2 66.5%; SciCode 46.9% (AA)

### Claude Opus 4.7
https://sophon.at/models/claude-opus-4-7
Claude Opus 4.7 is an AI model from Anthropic.
- Family: claude · Context: 1,000,000 tokens · License: proprietary · Released: 2026-04-16
- Top scores: MMMLU 91.5 Accuracy; Autonomous Skill Evolution 3.21 (15 samples); Physgym Arena Medley Public 100.0% (5 samples); LiveBench - Math 93.1% (LiveBench); GPQA Diamond 88.5% (AA)

## Leaderboards

### Arena - Document
https://sophon.at/leaderboards/lmarena-document
Crowdsourced document model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Document Style Control
https://sophon.at/leaderboards/lmarena-document-style-control
Crowdsourced document style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Image Edit
https://sophon.at/leaderboards/lmarena-image-edit
Crowdsourced image edit model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Image to Video
https://sophon.at/leaderboards/lmarena-image-to-video
Crowdsourced image to video model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Search
https://sophon.at/leaderboards/lmarena-search
Crowdsourced search model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Search Style Control
https://sophon.at/leaderboards/lmarena-search-style-control
Crowdsourced search style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Text
https://sophon.at/leaderboards/lmarena-text
Crowdsourced text model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Text Style Control
https://sophon.at/leaderboards/lmarena-text-style-control
Crowdsourced text style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Text to Image
https://sophon.at/leaderboards/lmarena-text-to-image
Crowdsourced text to image model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Text to Video
https://sophon.at/leaderboards/lmarena-text-to-video
Crowdsourced text to video model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Video Edit
https://sophon.at/leaderboards/lmarena-video-edit
Crowdsourced video edit model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Vision Style Control
https://sophon.at/leaderboards/lmarena-vision-style-control
Crowdsourced vision style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Webdev
https://sophon.at/leaderboards/lmarena-webdev
Crowdsourced webdev model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena Coding
https://sophon.at/leaderboards/lmarena-coding
LMArena subcategory ranking models on user pairwise votes restricted to coding prompts.
- Updates: live

### Arena Hard Prompts
https://sophon.at/leaderboards/lmarena-hard-prompts
LMArena subcategory ranking models on a filtered slice of Arena prompts auto-classified as hard along multiple difficulty axes.
- Updates: live

## Capabilities

### bias
https://sophon.at/capabilities/bias
- Evals: 1

### browser use
https://sophon.at/capabilities/browser-use
- Evals: 7

### code editing
https://sophon.at/capabilities/code-editing
- Evals: 7

### code generation
https://sophon.at/capabilities/code-generation
- Evals: 10

### common sense
https://sophon.at/capabilities/common-sense
- Evals: 1

### computer use
https://sophon.at/capabilities/computer-use
- Evals: 2

### debugging
https://sophon.at/capabilities/debugging
- Evals: 7

### embodied
https://sophon.at/capabilities/embodied
- Evals: 3

### factual recall
https://sophon.at/capabilities/factual-recall
- Evals: 19

### hallucination
https://sophon.at/capabilities/hallucination
- Evals: 5

### harmful content
https://sophon.at/capabilities/harmful-content
- Evals: 2

### image understanding
https://sophon.at/capabilities/image-understanding
- Evals: 4

### instruction following
https://sophon.at/capabilities/instruction-following
- Evals: 17

### jailbreak resistance
https://sophon.at/capabilities/jailbreak-resistance
- Evals: 3

### legal reasoning
https://sophon.at/capabilities/legal-reasoning
- Evals: 2

### llm judging
https://sophon.at/capabilities/llm-judging
- Evals: 9

### logic
https://sophon.at/capabilities/logic
- Evals: 1

### long context
https://sophon.at/capabilities/long-context
- Evals: 2

### math
https://sophon.at/capabilities/math
- Evals: 12

### multi turn dialog
https://sophon.at/capabilities/multi-turn-dialog
- Evals: 7

### multilingual
https://sophon.at/capabilities/multilingual
- Evals: 3

### planning
https://sophon.at/capabilities/planning
- Evals: 29

### retrieval
https://sophon.at/capabilities/retrieval
- Evals: 3

### safety
https://sophon.at/capabilities/safety
- Evals: 8

### scientific reasoning
https://sophon.at/capabilities/scientific-reasoning
- Evals: 9

### tool calling
https://sophon.at/capabilities/tool-calling
- Evals: 18

### translation
https://sophon.at/capabilities/translation
- Evals: 1

## Papers

### "I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration
https://sophon.at/papers/pwc-83620
A goal-level attribution framework called CoTrace is introduced to analyze how large language models contribute to goal shaping in human-AI collaboration, revealing that while models account for a small percentage of direct contributions, they play a significant role in introducing concrete requirements and making indirect contributions.
- Year: 2026 · Venue: arXiv 2026

### "What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing
https://sophon.at/papers/pwc-56202
Users prefer adaptive feedback mechanisms in in-car AI assistants, starting with high transparency to build trust and then reducing verbosity as reliability increases, particularly in attention-critical driving scenarios.
- Year: 2026 · Venue: arXiv 2026

### $E^3$-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference
https://sophon.at/papers/arxiv-2605.27428
Edge deployments of generative inference increasingly face two practical realities: per-device per-model performance is often unknown at deployment time, and it is non-stationary due to user-driven semantic events, background load, and device churn.
- Year: 2026

### $\textit{BlockFormer}$ : Transformer-based inference from interaction maps
https://sophon.at/papers/arxiv-2605.21617
Inference from interaction maps, such as centromere identification from genome-wide chromosome conformation capture techniques -- notably Hi-C -- can be formulated as a generic inverse problem: infer a set of parameters given a map summarizing pairwise interactions between…
- Year: 2026

### "Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs
https://sophon.at/papers/arxiv-2602.04729
We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs).
- Year: 2026

### "Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills
https://sophon.at/papers/arxiv-2602.06547
LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper scripts that execute with full user privileges.
- Year: 2026

### "I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents
https://sophon.at/papers/arxiv-2606.00497
Deceptive web content, widely instantiated across the internet and commonly known as *social-engineering attacks*, manipulates autonomous web agents into submitting users' personally identifiable information (PII) to attacker-controlled endpoints.
- Year: 2026

### "I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise
https://sophon.at/papers/arxiv-2606.01811
Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing.
- Year: 2026

### "Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models
https://sophon.at/papers/arxiv-2605.31401
Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluations exist.
- Year: 2026

### (1D) Ordered Tokens Enable Efficient Test-Time Search
https://sophon.at/papers/pwc-57557
Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling.
- Year: 2026 · Venue: arXiv 2026

### *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
https://sophon.at/papers/arxiv-2602.15778
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing.
- Year: 2026

### 2-ASP(Q) programs with weak constraints: Complexity and efficient implementation
https://sophon.at/papers/arxiv-2605.27338
ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with two quantifiers and weak constraints, denoted as 2-ASP(Q)^w.
- Year: 2026

### 2Mamba2Furious: Linear in Complexity, Competitive in Accuracy
https://sophon.at/papers/pwc-56213
Researchers enhance linear attention by simplifying Mamba-2 and improving its architectural components to achieve near-softmax accuracy while maintaining memory efficiency for long sequences.
- Year: 2026 · Venue: arXiv 2026

### 360DVO: Deep Visual Odometry for Monocular 360-Degree Camera
https://sophon.at/papers/pwc-62984
A deep learning-based monocular omnidirectional visual odometry system uses a distortion-aware spherical feature extractor and differentiable bundle adjustment to improve robustness and accuracy over existing methods.
- Year: 2026 · Venue: arXiv 2026

### 3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence
https://sophon.at/papers/pwc-55258
3D CoCa v2 enhances 3D captioning by combining contrastive vision-language learning with spatially-aware 3D scene encoding and test-time search for improved generalization across diverse environments.
- Year: 2026 · Venue: arXiv 2026

### 3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis
https://sophon.at/papers/pwc-57481
Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications.
- Year: 2026 · Venue: arXiv 2026

### 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
https://sophon.at/papers/pwc-56857
Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce.
- Year: 2026 · Venue: arXiv 2026

### 3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models
https://sophon.at/papers/arxiv-2603.07751
Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D…
- Year: 2026

### 4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video
https://sophon.at/papers/pwc-56691
4D reconstruction of equine family (e.g.
- Year: 2026 · Venue: arXiv 2026

### 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
https://sophon.at/papers/pwc-82921
4DThinker enables vision-language models to perform dynamic spatial reasoning through 4D latent mental imagery, using scalable data generation and novel fine-tuning methods that outperform existing approaches.
- Year: 2026 · Venue: arXiv 2026

### 4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere
https://sophon.at/papers/pwc-56239
4RC presents a unified feed-forward framework for 4D reconstruction from monocular videos that learns holistic scene geometry and motion dynamics through a transformer-based encoder-decoder architecture with conditional querying capabilities.
- Year: 2026 · Venue: arXiv 2026

### A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification
https://sophon.at/papers/pwc-55397
Lightweight probes trained on hidden states of LLMs enable efficient classification tasks without additional computational overhead, improving safety and sentiment analysis performance.
- Year: 2026 · Venue: arXiv 2026

### A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents
https://sophon.at/papers/pwc-60965
Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with…
- Year: 2026 · Venue: arXiv 2026

### A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents
https://sophon.at/papers/arxiv-2604.17943
RAG-based question-answering (QA) in specialist domains faces a cold-start problem: lack of evaluative benchmarks and absence of labeled data for post-training. We present DoRA (Domain-oriented RAG Assessment), a novel benchmark construction and evaluation framework using only a…
- Year: 2026

### A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis
https://sophon.at/papers/arxiv-2605.28575
Multimodal Sentiment Analysis (MSA) fuses text, acoustic, and visual streams to infer sentiment.
Because pre-trained text encoders are far more expressive than their acoustic and visual counterparts, the text modality tends to dominate optimization, suppressing weaker modalities…
- Year: 2026

## Organizations

### 01.AI (零一万物)
https://sophon.at/organizations/01-ai
Chinese AI startup founded by Kai-Fu Lee; publisher of the Yi open-weights model family.

### AI Futures Project
https://sophon.at/organizations/ai-futures-project
Independent research non-profit founded by Daniel Kokotajlo (ex-OpenAI) studying AI-progress forecasts; publishers of AI 2027.

### AI21 Labs
https://sophon.at/organizations/ai21-labs
AI21 Labs is an organization.

### AI4Finance Foundation
https://sophon.at/organizations/ai4finance-foundation
Open-source nonprofit building FinRL and FinGPT for reinforcement-learning-based financial trading.

### ARC Prize
https://sophon.at/organizations/arc-prize
$1M+ open competition to solve the ARC-AGI benchmark; run by the ARC Prize Foundation (François Chollet & Greg Kamradt).

### ARC Prize Foundation
https://sophon.at/organizations/arc-prize-foundation
Non-profit operating the ARC Prize competition and ARC-AGI benchmark series.

### AT&T
https://sophon.at/organizations/at-t
US telecom giant; historic parent of Bell Labs and AT&T Labs Research.

### Abacus.AI
https://sophon.at/organizations/abacus-ai
AI platform startup offering enterprise LLM tooling; co-creator with Yann LeCun's NYU group of the LiveBench contamination-resistant LLM benchmark.

### Abugoot
https://sophon.at/organizations/prime-hub-team-abugoot
Abugoot is a team.

### AfterQuery
https://sophon.at/organizations/afterquery

### Aider
https://sophon.at/organizations/aider-ai
Open-source AI pair-programming CLI created by Paul Gauthier; also operates the widely cited Aider Polyglot coding leaderboard.

### Airbus
https://sophon.at/organizations/airbus
European aerospace manufacturer; runs AI/ML research for aviation, defense, and space applications.
### Airbyte
https://sophon.at/organizations/airbyte
Open-source data integration platform / ELT tool; YC W21; relevant as career-history for individuals in the graph.

### Alibaba
https://sophon.at/organizations/alibaba
Alibaba is an organization.

### Alibaba DAMO Academy
https://sophon.at/organizations/alibaba-damo-academy
Alibaba's global research institute; covers ML, NLP, robotics, and quantum computing.

### Alibaba Qwen (Tongyi Qianwen)
https://sophon.at/organizations/alibaba-qwen
Alibaba's AI research division publishing the Qwen series, the most prolific open-weights frontier model family.

### Alignment Research Center (ARC)
https://sophon.at/organizations/alignment-research-center
AI alignment non-profit founded by Paul Christiano in 2021; its evals team spun out to become METR in late 2023.

### All Hands AI
https://sophon.at/organizations/all-hands-ai
Startup commercializing the OpenHands (formerly OpenDevin) open-source agent framework.

### Allen Institute for AI
https://sophon.at/organizations/ai2
Allen Institute for AI is an organization.

### Allen Institute for AI (Ai2)
https://sophon.at/organizations/allen-ai
Seattle non-profit AI research institute publishing fully open models, datasets, and the OLMo / Tulu / Dolma family.

## API and more

- Read API index: https://sophon.at/api/v1 (JSON for every entity type)
- Per-entity JSON: https://sophon.at/api/v1/{evals|models|tools|leaderboards|organizations|people|capabilities|papers}/{slug}
- Full text / PDF for papers: https://sophon.at/api/v1/papers/{slug}/text and /pdf
- Search: https://sophon.at/api/v1/search?q={query}
- CLI: `npm i -g sophon-at` (command `sophon`); JSON when piped, `sophon help --json` for a machine-readable manifest
- API & CLI docs: https://sophon.at/about/api
- Curated index: https://sophon.at/llms.txt
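
The URL patterns listed above can be assembled programmatically. The following is a minimal sketch, not an official client: the helper names (`entity_url`, `paper_asset_url`, `search_url`, `fetch_json`) are hypothetical, the response schema is not described here and is therefore left untyped, and encoding spaces in the search query as `+` is an assumption about what the server accepts.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE = "https://sophon.at/api/v1"

# Entity types taken from the per-entity URL pattern listed above.
ENTITY_TYPES = {
    "evals", "models", "tools", "leaderboards",
    "organizations", "people", "capabilities", "papers",
}


def entity_url(entity_type: str, slug: str) -> str:
    """Build the per-entity JSON URL: {BASE}/{type}/{slug}."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    # Percent-encode the slug so unusual characters cannot alter the path.
    return f"{BASE}/{entity_type}/{quote(slug, safe='')}"


def paper_asset_url(slug: str, kind: str = "text") -> str:
    """Build the full-text or PDF URL for a paper."""
    if kind not in ("text", "pdf"):
        raise ValueError("kind must be 'text' or 'pdf'")
    return f"{entity_url('papers', slug)}/{kind}"


def search_url(query: str) -> str:
    """Build the search URL with a URL-encoded q parameter."""
    return f"{BASE}/search?{urlencode({'q': query})}"


def fetch_json(url: str, timeout: float = 10.0):
    """Fetch a URL and decode it as JSON (schema is not assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `entity_url("models", "kimi-k2.6")` yields the per-entity address for the Kimi K2.6 entry, which `fetch_json` could then retrieve. Validating the entity type up front keeps typos from turning into confusing 404 responses.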