# Sophon - full catalog detail
> Expanded companion to /llms.txt. Each entry inlines the full description, key model scores, and relationships for the most-connected entities per type, so you can answer without fetching each page.
This file is bounded to the top entities per section. The complete catalog (every model, eval, tool, and tens of thousands of papers) is served as JSON at https://sophon.at/api/v1; the curated one-line index is at https://sophon.at/llms.txt.
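
For readers who want to consume the entries programmatically, here is a minimal sketch that assumes the line layout used throughout this file: a `- Key: value · Key: value` metadata bullet and a `- Top scores:` bullet whose items are separated by `;`. The names `parse_metadata` and `parse_top_scores` are hypothetical helpers, not part of any Sophon API, and only percentage scores are handled; items without a `%` sign (for example `pass@1` values) are skipped.

```python
import re

# Assumed item shape: "<name> <number>% (<optional note>)".
# The lazy name group plus the required "%" keeps digits inside model
# names (e.g. "Gemini 3 Pro") from being mistaken for the score.
SCORE_RE = re.compile(
    r"^(?P<name>.+?)\s+(?P<value>-?\d+(?:\.\d+)?)%(?:\s+\((?P<note>.*)\))?$"
)


def parse_metadata(line):
    """Turn '- Domain: math · Format: custom' into a dict of fields."""
    body = line[2:] if line.startswith("- ") else line
    fields = {}
    for part in body.split(" · "):
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()
    return fields


def parse_top_scores(line):
    """Return (name, percent, note) tuples from a '- Top scores:' bullet."""
    prefix = "- Top scores: "
    if not line.startswith(prefix):
        return []
    results = []
    for item in line[len(prefix):].split(";"):
        match = SCORE_RE.match(item.strip())
        if match:  # non-percentage items are silently skipped
            results.append((match["name"], float(match["value"]), match["note"]))
    return results
```

The note group is `None` when an item carries no parenthetical source, so callers can tell a sourced score from an unsourced one.
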
## Evals

### AIME 2024: Problems from the American Invitational Mathematics Examination
https://sophon.at/evals/aime2024
Official 15-problem high-school math olympiad-track exam used by labs as a fresh, contamination-resistant math reasoning benchmark.
- Domain: math · Format: custom · License: Unknown
- Capabilities: math, planning
- Top scores: o3 96.7% (AIME 2024); GPT-5 94.6% (AIME 2025, no tools); Grok 4 94.3% (AA); Qwen3 235B A22B Thinking 2507 94.0% (AA); o4 Mini 94.0% (AA)

### GPQA Diamond
https://sophon.at/evals/gpqa-diamond
Graduate-level physics, chemistry, and biology multiple-choice questions written by PhDs and verified to be Google-proof.
- Domain: science · Format: hf_dataset · License: CC-BY-4.0
- Capabilities: scientific reasoning, factual recall
- Top scores: Gemini 3.1 Pro Preview 94.1% (AA); Gemini 3 Deep Think 93.8%; Qwen3.7 Max 92.3% (AA); Claude Opus 4.8 92.0% (AA); Gemini 3 Pro 91.9%

### HumanEval
https://sophon.at/evals/humaneval
164 hand-written Python programming problems with unit tests, the original LLM code-generation benchmark from OpenAI's Codex paper.
- Domain: code · Format: hf_dataset · License: MIT
- Capabilities: code generation
- Top scores: Qwen2.5 Coder 32B Instruct 92.1% (EvalPlus); Gemini 1.5 Pro 002 89.0% (EvalPlus); Grok Beta 88.4% (EvalPlus); Gemini 1.5 Flash 002 82.3% (EvalPlus); Llama 3 Instruct 70B 77.4% (EvalPlus)

### MATH-500
https://sophon.at/evals/math-500
500-problem subset of the Hendrycks MATH competition-math benchmark, popularized by OpenAI's PRM800K work as a standard evaluation slice.
- Domain: math · Format: hf_dataset · License: MIT
- Capabilities: math, planning
- Top scores: o3 99.2% (AA); Grok 3 mini 99.2% (AA); Grok 4 99.0% (AA); o4 Mini 98.9% (AA); Gemini 2.5 Pro Preview (May' 25) 98.6% (AA)

### MMLU-Pro
https://sophon.at/evals/mmlu-pro
Harder, reasoning-focused successor to MMLU with 10 answer choices and curated questions resistant to lucky guessing.
- Domain: general · Format: hf_dataset · License: MIT
- Capabilities: factual recall, scientific reasoning
- Top scores: Gemini 3 Pro 89.8% (AA); Gemini 3 Pro Preview 89.5% (AA); Gemini 3 Flash Preview 89.0% (AA); Claude Opus 4.5 88.9% (AA); GPT-5.5 88.6% (1000 samples)

### Massive Multitask Language Understanding (MMLU)
https://sophon.at/evals/mmlu
57-subject multiple-choice exam testing broad world knowledge and reasoning across academic and professional domains.
- Domain: general · Format: hf_dataset · License: MIT
- Capabilities: factual recall, scientific reasoning
- Top scores: Tülu 3 70B 83.1; GPT-4.1 Mini 80.0% (10 samples)

### Arena-Hard
https://sophon.at/evals/arena-hard
500 challenging real-user prompts mined from Chatbot Arena and graded by a strong LLM judge for pairwise win rate.
- Domain: general · Format: custom · License: Apache-2.0
- Capabilities: instruction following, llm judging
- Top scores: o3 88.8%; Gemini 2.5 Flash 83.9%; R1 77.0%; Gemma 3 27B 69.9%; GPT-4.1 61.5%

### BIG-Bench Hard (BBH)
https://sophon.at/evals/bbh
23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.
- Domain: general · Format: hf_dataset · License: MIT
- Capabilities: planning, scientific reasoning, math, logic
- Top scores: InternLM2.5 20B Chat 74.7% (OpenLLM); Qwen2.5 72B Instruct 72.7% (OpenLLM); Qwen2 72B Instruct 69.8% (OpenLLM); Llama 3.3 Instruct 70B 69.2% (OpenLLM); Llama 3.1 70B Instruct 69.2% (OpenLLM)

### GSM8K
https://sophon.at/evals/gsm8k
8.5k grade-school math word problems requiring multi-step arithmetic reasoning to reach a single numeric answer.
- Domain: math · Format: hf_dataset · License: MIT
- Capabilities: math, planning
- Top scores: GLM 4.7 90.0% (10 samples); Qwen2.5 3B 66.7% (15 samples); Qwen3.5 2B 64.1% (256 samples); Qwen3.5 0.8B 36.3% (1024 samples); Qwen3 8B 12.5% (4 samples)

### LegalBench
https://sophon.at/evals/legalbench
162 collaboratively curated legal-reasoning tasks across rule-recall, issue-spotting, application, and interpretation - the standard legal LLM benchmark.
- Domain: legal · Format: hf_dataset · License: CC-BY-4.0
- Capabilities: legal reasoning, factual recall
- Top scores: GPT-4o-mini 72.0% (50 samples); GPT-4.1 Mini 62.4% (30 samples); GPT-5 58.2% (30 samples); GPT-4o 0.0% (5 samples); Claude Sonnet 4.5 0.0% (3 samples)

### LiveBench
https://sophon.at/evals/livebench
Rolling contamination-free benchmark that updates questions monthly across math, coding, reasoning, language, instruction-following, and data analysis.
- Domain: general · Format: custom · License: Apache-2.0
- Capabilities: math, code generation, instruction following, factual recall
- Top scores: Gemini 3.1 Pro Preview 82.4% (LiveBench); Claude Opus 4.8 80.1% (LiveBench); GPT-5.4 79.8% (LiveBench); Claude Opus 4.7 79.7% (LiveBench); Gemini 3.5 Flash 78.9% (LiveBench)

### LiveCodeBench
https://sophon.at/evals/livecodebench
Rolling competitive-programming benchmark that scrapes LeetCode / AtCoder / Codeforces problems after a known cutoff to fight contamination.
- Domain: code · Format: custom · License: MIT
- Capabilities: code generation, debugging
- Top scores: Gemini 3 Pro 91.7% (AA); Gemini 3 Flash Preview 90.8% (AA); DeepSeek V3.2 Speciale 89.6% (AA); o4 Mini 85.9% (AA); Gemini 3 Pro Preview 85.7% (AA)

### Mostly Basic Python Problems (MBPP)
https://sophon.at/evals/mbpp
974 short crowd-sourced Python tasks with three unit tests each, used alongside HumanEval as a baseline code-generation benchmark.
- Domain: code · Format: hf_dataset · License: CC-BY-4.0
- Capabilities: code generation
- Top scores: GPT-5 Nano 100.0% (10 samples); Qwen2.5 Coder 32B Instruct 90.5% (EvalPlus); Gemini 1.5 Pro 002 89.7% (EvalPlus); Grok Beta 86.0% (EvalPlus); Gemini 1.5 Flash 002 84.7% (EvalPlus)

### SWE-Lancer
https://sophon.at/evals/swe-lancer
1,488 real freelance software-engineering tasks from Upwork worth $1M total in payouts, evaluating models on end-to-end paid developer work.
- Domain: code · Format: custom · License: MIT
- Capabilities: code editing, code generation, planning, tool calling

### SWE-bench
https://sophon.at/evals/swe-bench
2,294 real GitHub issues from 12 popular Python repos that require an agent to produce a patch passing the project's test suite.
- Domain: code · Format: custom · License: MIT
- Capabilities: code editing, debugging, tool calling, planning
- Top scores: DeepSeek V4 Pro 80.6% (Verified, per DeepSeek); Claude Sonnet 4.5 77.2% (Verified); Gemini 3 Pro 76.2% (Verified); GPT-5 74.9% (Verified); o3 71.7% (Verified)

### SWE-bench Lite
https://sophon.at/evals/swe-bench-lite
300-issue subset of SWE-bench focused on functional bug fixes that are easier to evaluate - used for fast iteration before full SWE-bench runs.
- Domain: code · Format: custom · License: MIT
- Capabilities: code editing, debugging, tool calling
- Top scores: Claude 4 Sonnet 58.3% (SWE-bench); Claude 3.5 Sonnet 51.3% (SWE-bench); Qwen3 Coder 30B A3B Instruct 49.7% (SWE-bench); Claude Sonnet 3.7 48.0% (SWE-bench); GPT-4o (2024-08-06) 39.7% (SWE-bench)

### SWE-bench Verified
https://sophon.at/evals/swe-bench-verified
500 human-validated SWE-bench tasks confirmed solvable from the issue alone, with non-flaky test suites - the most-reported agentic coding benchmark.
- Domain: code · Format: custom · License: MIT
- Capabilities: code editing, debugging, tool calling, planning
- Top scores: Claude Opus 4.5 79.2% (SWE-bench); Gemini 3 Pro 77.4% (SWE-bench); Gemini 3 Flash 75.8% (SWE-bench); MiniMax M2.5 75.8% (SWE-bench); Claude Opus 4.6 75.6% (SWE-bench)

### TruthfulQA
https://sophon.at/evals/truthfulqa
817 questions targeting common human misconceptions, measuring whether a model gives factually true answers or repeats popular falsehoods.
- Domain: general · Format: hf_dataset · License: Apache-2.0
- Capabilities: hallucination, factual recall

### Vals.ai Legal Evals
https://sophon.at/evals/vals-legal-evals
Vals.ai's proprietary suite of legal-domain benchmarks (contract review, hallucination tests, LegalBench Pro) used by law firms to procure LLMs.
- Domain: legal · Format: custom · License: Closed
- Capabilities: legal reasoning, hallucination, factual recall

### AA-Omniscience
https://sophon.at/evals/aa-omniscience
Artificial Analysis's broad-knowledge benchmark - thousands of curated factual questions spanning specialized domains - designed to test hallucination calibration.
- Domain: general · Format: custom · License: Closed
- Capabilities: factual recall, hallucination

### AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
https://sophon.at/evals/agieval
AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
- Domain: Knowledge · License: mit

### AIME 2025: Problems from the American Invitational Mathematics Examination
https://sophon.at/evals/aime2025
A benchmark for evaluating AI's ability to solve challenging mathematics problems from the 2025 AIME - a prestigious high school mathematics competition.
- Domain: Mathematics · License: mit
- Top scores: GPT-5 Codex 98.7% (AA); Gemini 3 Flash Preview 97.0% (AA); DeepSeek V3.2 Speciale 96.7% (AA); GPT-5.1-Codex 95.7% (AA); Gemini 3 Pro 95.7% (AA)

### AIME2024
https://sophon.at/evals/openreward-generalreasoning-aime2024
Problems from the American Invitational Mathematics Examination (AIME) 2024.
- Domain: rl-env · License: unknown
- Top scores: GPT-5 pro (python) 100 pass@1; o1 96 pass@1; o3 91.6 pass@1; Qwen 3 Coder Next 89.01 pass@1; R1 79.8 pass@1

### AIME2025
https://sophon.at/evals/openreward-generalreasoning-aime2025
Problems from the American Invitational Mathematics Examination (AIME) 2025-I & II.
- Domain: rl-env · License: unknown
- Top scores: Claude Sonnet 4.5 100 pass@1; Grok 4 Heavy (with python) 100 pass@1; GPT-5.2 100 pass@1; Step 3.5 Flash (parallel thinking) 99.9 pass@1; Claude Opus 4.6 99.79 pass@1

### AIME2026
https://sophon.at/evals/openreward-generalreasoning-aime2026
Problems from the American Invitational Mathematics Examination (AIME) 2026-I & II.
- Domain: rl-env · License: unknown
- Top scores: Qwen3.5 397B A17B 96.7 Accuracy; Qwen3.6 Plus 95.3 Accuracy

### AIR Bench: AI Risk Benchmark
https://sophon.at/evals/air-bench
A safety benchmark evaluating language models against risk categories derived from government regulations and company policies.
- Domain: Knowledge · License: mit

### ALFRED
https://sophon.at/evals/alfred
3D-simulated household tasks driven by language instructions and egocentric video - the visual sibling of ALFWorld.
- Domain: robotics · Format: custom · License: MIT
- Capabilities: embodied, image understanding, planning, instruction following

### ALFWorld
https://sophon.at/evals/alfworld
Embodied household-task benchmark that aligns TextWorld text commands with ALFRED 3D scenes, testing whether agents can transfer from abstract text policies to grounded execution.
- Domain: agentic · Format: custom · License: MIT
- Capabilities: embodied, planning, tool calling, common sense, multi turn dialog

### ANIMA: Animal Norms In Moral Assessment
https://sophon.at/evals/anima
Evaluates the quality of a model's moral reasoning about animal welfare across 13 ethical dimensions.
- Domain: Safeguards · License: mit

### APE: Attempt to Persuade Eval
https://sophon.at/evals/ape
Measures a model's willingness to attempt persuasion on harmful, controversial, and benign topics. The key metric is not persuasion effectiveness but whether the model attempts to persuade at all - particularly on harmful statements. Uses a multi-model...
- Domain: Safeguards · License: mit

### APEX
https://sophon.at/evals/apex
Mercor's expert-graded eval - domain experts (doctors, lawyers, engineers) grade model responses on long-form professional tasks they would actually be paid to do.
- Domain: general · Format: manual · License: Closed
- Capabilities: instruction following, factual recall, llm judging

### APPS: Automated Programming Progress Standard
https://sophon.at/evals/apps
APPS is a dataset for evaluating model performance on Python programming tasks across three difficulty levels consisting of 1,000 at introductory, 3,000 at interview, and 1,000 at competition level. The dataset consists of an additional 5,000 training samples, for a total of 10,0...
- Domain: Coding · License: mit

### ARC (AI2 Reasoning Challenge)
https://sophon.at/evals/arc
Grade-school science multiple-choice questions (Easy and Challenge sets) drawn from US standardized tests - an early language-understanding benchmark.
- Domain: science · Format: hf_dataset · License: CC-BY-SA-4.0
- Capabilities: scientific reasoning, factual recall

### Absolute Zero
https://sophon.at/evals/prime-ergotts-absolute-zero
Absolute Zero Reasoner paper implementation
- Domain: rl-env · License: unknown
- Top scores: GPT-4.1 Mini -0.17 (6 samples)

### AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
https://sophon.at/evals/abstention-bench
Evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information.
- Domain: Safeguards · License: mit

### Acebench Agent Multistep
https://sophon.at/evals/prime-ob1-acebench-agent-multistep
A multi-turn agent environment from ACEBench that evaluates a model's ability to perform complex, sequential tool-use tasks to reach a correct fina...
- Domain: rl-env · License: unknown
- Top scores: Qwen3 Coder 30B A3B Instruct 63.3% (60 samples); Qwen3 30B A3B 55.0% (60 samples); Qwen3 4B 31.7% (60 samples); Qwen3 4B Instruct 30.0% (60 samples)

### Acereason Math
https://sophon.at/evals/prime-primeintellect-acereason-math
Single-turn math word problems from NVIDIA AceReason-Math with boxed numeric answers and CoT.
- Domain: rl-env · License: apache-2.0
- Top scores: GPT-4.1 Mini 60.0% (15 samples); GPT-5 46.7% (15 samples); Claude Sonnet 4.5 0.0% (15 samples)

### Ade Bench
https://sophon.at/evals/ade-bench
Analytics Data Engineer Bench: tasks evaluating AI agents on dbt/SQL data analytics engineering bugs. Original benchmark: https://github.com/dbt-labs/ade-bench.
- Domain: agent-eval

### AdvBench
https://sophon.at/evals/advbench
520 harmful behaviors and 520 harmful strings used as the standard adversarial-suffix evaluation set in the GCG / universal-jailbreak literature.
- Domain: safety · Format: hf_dataset · License: MIT
- Capabilities: safety, jailbreak resistance

### Agent Diff Bench
https://sophon.at/evals/prime-hubert-marek-agent-diff-bench
Benchmark for evaluating agents on Slack, Linear, Box, Calendar via Bash & Python
- Domain: rl-env · License: unknown
- Top scores: DeepSeek V3.2 74.7% (100 samples); Grok 4.1 Fast 62.0% (179 samples); Step 11 45.1% (135 samples); Step 7 39.3% (132 samples); Ministral 3 14B 32.2% (102 samples)

## Tools (RL envs, datasets, scaffolds)

### VF Openbench RL Env (Community)
https://sophon.at/tools/prime-aarush-vf-openbench
Environment for single-turn tasks in OpenBench
- Type: rl_env · License: unknown
- Improves: AIME 2024: Problems from the American Invitational Mathematics Examination, GPQA Diamond, MATH-500, MATH, MGSM (Multilingual GSM8K), Massive Multitask Language Understanding (MMLU), MuSR, SimpleQA

### Agent Bench RL Env (Prime Community)
https://sophon.at/tools/prime-prime-community-mini-swe-agent-bench
Benchmarking model performance on SWE Bench in the Mini SWE Agent harness.
- Type: rl_env · License: unknown
- Improves: SWE-bench Lite, SWE-bench Verified, SWE-bench, SWE-bench Verified: Resolving Real-World GitHub Issues, SWE-bench Multilingual, SWE-bench Multimodal

### BrowserGym
https://sophon.at/tools/browsergym
ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and more under one Playwright-backed interface.
- Type: rl_env · Domain: agentic · License: Apache-2.0
- Improves: AssistantBench, MiniWoB++, VisualWebArena, WebArena, WorkArena

### NuminaMath
https://sophon.at/tools/numina-math
An 860k-problem competition-math dataset with detailed solutions, the open community's go-to corpus for training math-specialized LLMs.
- Type: sft_dataset · Domain: math · License: Apache-2.0
- Improves: AIME 2024: Problems from the American Invitational Mathematics Examination, GSM8K, MATH-500, MathVista

### OpenThoughts
https://sophon.at/tools/openthoughts
A fully-open distillation of long DeepSeek-R1 reasoning traces - the community's flagship "open R1" SFT corpus for reasoning models.
- Type: sft_dataset · Domain: general · License: Apache-2.0
- Improves: AIME 2024: Problems from the American Invitational Mathematics Examination, GPQA Diamond, LiveCodeBench, MATH-500

### Tülu 3 SFT Mixture
https://sophon.at/tools/tulu-3-sft-mixture
Allen AI's flagship open SFT mixture combining new persona-driven prompts with curated public data for post-training a frontier-quality instruct model.
- Type: sft_dataset · Domain: general · License: ODC-BY-1.0
- Improves: AlpacaEval, GSM8K, IFEval, Massive Multitask Language Understanding (MMLU)

### WizardLM Evol-Instruct
https://sophon.at/tools/wizardlm-evol-instruct
Microsoft's "Evol-Instruct" recipe - automatically rewriting simple instructions into harder, more diverse ones using an LLM evolver.
- Type: sft_dataset · Domain: general · License: Microsoft Research License
- Improves: AlpacaEval, GSM8K, HumanEval, MT-Bench

### Agent PLUS RL Env (Prime Intellect)
https://sophon.at/tools/prime-primeintellect-mini-swe-agent-plus
Mini SWE Agent Plus environment for solving SWE issues inside Prime Sandboxes.
- Type: rl_env · Domain: code · License: unknown
- Improves: SWE-bench Lite, SWE-bench Verified, SWE-bench Verified: Resolving Real-World GitHub Issues

### Aya Dataset
https://sophon.at/tools/aya-dataset
Cohere For AI's massively multilingual instruction dataset covering 65 languages, built by a 3,000-person open-science collaboration.
- Type: sft_dataset · Domain: multilingual · License: Apache-2.0
- Improves: Aya Evaluation Suite, MGSM (Multilingual GSM8K), Multilingual MMLU

### Bigbench BBH RL Env (Prime Community)
https://sophon.at/tools/prime-prime-community-bigbench-bbh
Big Bench + BBH implementation
- Type: rl_env · License: unknown
- Improves: BIG-Bench Hard (BBH), BIG-Bench, BBH: Challenging BIG-Bench Tasks

### COT Theater RL Env (Community)
https://sophon.at/tools/prime-danruif-cot-theater
Reward-hacking sprint env. Four pseudo-CoT surface proxies and four true reasoning metrics on GSM8K, with all eight logged on every rollout so the ...
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Certainty Collapse RL Env (Community)
https://sophon.at/tools/prime-cardan05-certainty-collapse
Reward Hacking Sprint: does optimizing self-certainty (RLIF-style intrinsic reward) cause models to be confidently wrong on math? GSM8K, Llama-3.2-...
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Compositional Hacks RL Env (Community)
https://sophon.at/tools/prime-danruif-compositional-hacks
Reward-hacking sprint env that plants two hidden rewards at once on GSM8K to probe whether one dominates or both emerge proportionally.
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Context Needle RL Env (Community)
https://sophon.at/tools/prime-stelioszach-long-context-needle
Needle-in-haystack - locate a target sentence in a long document.
- Type: rl_env · License: apache-2.0
- Improves: Needle in a Haystack, Needle in a Haystack (NIAH), Needle in a Haystack (NIAH): In-Context Retrieval Benchmark for Long Context LLMs

### Deepconf RL Env (Community)
https://sophon.at/tools/prime-tonic-deepconf
DeepConf environment for confidence-aware LLM reasoning evaluation
- Type: rl_env · Domain: math · License: unknown
- Improves: AIME 2024: Problems from the American Invitational Mathematics Examination, GPQA Diamond, AIME 2025: Problems from the American Invitational Mathematics Examination

### Deepswe RL Env (Prime Intellect)
https://sophon.at/tools/prime-primeintellect-deepswe
DeepSWE environment for solving SWE issues inside Prime Sandboxes.
- Type: rl_env · Domain: code · License: unknown
- Improves: SWE-bench Lite, SWE-bench Verified, SWE-bench Verified: Resolving Real-World GitHub Issues

### Discover Gsm8k RL Env (Community)
https://sophon.at/tools/prime-stochi0-discover-gsm8k
GSM8K rubric-discovery environment: learn rubric_fn from (input, response, score) examples
- Type: rl_env · Domain: math · License: apache-2.0
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Emergence Prediction RL Env (Community)
https://sophon.at/tools/prime-danruif-emergence-prediction
Reward-hacking sprint env. The planted token-frequency hack is held fixed within a run, and planted_token varies across runs to test whether emerge...
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Emoji HACK RL Env (Community)
https://sophon.at/tools/prime-danruif-emoji-hack
Reward-hacking sprint env. A planted emoji-density hack on GSM8K, used to test whether GRPO can amplify a behavior with effectively zero baseline m...
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### FH Aviary RL Env (Prime Community)
https://sophon.at/tools/prime-prime-community-fh-aviary
Future House Aviary wrapper for verifiers - Scientific reasoning environments with tools
- Type: rl_env · Domain: science · License: mit
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Formatting Emergence RL Env (Community)
https://sophon.at/tools/prime-danruif-formatting-emergence
Reward-hacking sprint env. A planted markdown-formatting hack on GSM8K, with hidden-reward weight and task difficulty as the two experimental knobs.
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### GPQA Diamond RL Env (Community)
https://sophon.at/tools/prime-anshu-gpqa-diamond
GPQA Diamond: A Graduate-Level Google-Proof Q&A Benchmark
- Type: rl_env · Domain: medical · License: unknown
- Improves: GPQA Diamond, GPQA (Full Set), GPQA: Graduate-Level STEM Knowledge Challenge

### GPQA RL Env (Prime Intellect)
https://sophon.at/tools/prime-primeintellect-gpqa
GPQA evaluation environment
- Type: rl_env · License: unknown
- Improves: GPQA Diamond, GPQA (Full Set), GPQA: Graduate-Level STEM Knowledge Challenge

### Gsm8k Olmes RL Env (Community)
https://sophon.at/tools/prime-pmahdavi-gsm8k-olmes
GSM8K evaluation matching OLMES tulu_3_dev_no_safety methodology
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Gsm8k RL Env (Community)
https://sophon.at/tools/prime-d42me-gsm8k
GSM8K environment
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Gsm8k RL Env (Community)
https://sophon.at/tools/prime-will-gsm8k
GSM8K environment
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Gsm8k RL Env (Dev Team)
https://sophon.at/tools/prime-dev-team-gsm8k
GSM8K environment
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Gsm8k RL Env (Prime Intellect)
https://sophon.at/tools/prime-primeintellect-gsm8k
GSM8K environment
- Type: rl_env · Domain: math · License: unknown
- Improves: GSM8K, Grade School Math 8K, GSM8K: Grade School Math Word Problems

### Haystack RLM RL Env (Prime Intellect)
https://sophon.at/tools/prime-primeintellect-needle-in-haystack-rlm
Needle-in-haystack environment using RLM with Python REPL
- Type: rl_env · Domain: code · License: unknown
- Improves: Needle in a Haystack, Needle in a Haystack (NIAH), Needle in a Haystack (NIAH): In-Context Retrieval Benchmark for Long Context LLMs

### HelpSteer2
https://sophon.at/tools/helpsteer2
NVIDIA's permissively-licensed human-annotated preference dataset with 5-axis Likert ratings - engineered to train high-quality reward models.
- Type: preference_dataset · Domain: general · License: CC-BY-4.0
- Improves: Arena-Hard, MT-Bench, RewardBench

## Models

### Qwen3.7Plus
https://sophon.at/models/qwen3.7-plus
Qwen3.7Plus is an AI model from Alibaba.
- Family: qwen3.7 · License: proprietary · Released: 2026-06-01
- Top scores: τ²-bench (Tau²-bench) 93.0% (AA); GPQA Diamond 90.0% (AA); IFBench 78.0% (AA); Vals Index 52.3%; Terminal-Bench (Hard) 47.0% (AA)

### Step 3.7 Flash
https://sophon.at/models/step-3-7-flash
- Released: 2026-05-29
- Top scores: τ²-bench (Tau²-bench) 98.5% (AA); GPQA Diamond 80.9% (AA); IFBench 67.3% (AA); SciCode 40.0% (AA); Terminal-Bench (Hard) 35.6% (AA)

### Claude Opus 4.8
https://sophon.at/models/claude-opus-4-8
Claude Opus 4.8 (Adaptive Reasoning, Max Effort) is an AI model from Anthropic.
- Family: claude · Context: 1,000,000 tokens · Released: 2026-05-28
- Top scores: τ²-bench (Tau²-bench) 94.4% (AA); GPQA Diamond 92.0% (AA); LiveBench - Reasoning 89.7% (LiveBench); MedScribe 85.8%; LiveBench - Math 84.3% (LiveBench)

### MiniCPM5-1B (Non-reasoning)
https://sophon.at/models/minicpm5-1b-non-reasoning
MiniCPM5-1B (Non-reasoning) is an AI model from OpenBMB.
- Family: minicpm · Released: 2026-05-25
- Top scores: τ²-bench (Tau²-bench) 82.5% (AA); IFBench 35.2% (AA); GPQA Diamond 26.9% (AA); Humanity's Last Exam (HLE) 4.6% (AA); SciCode 1.4% (AA)

### Command A+
https://sophon.at/models/command-a-plus
Command A+ is an AI model from Cohere.
- Family: command · Context: 256,000 tokens · Released: 2026-05-20
- Top scores: τ²-bench (Tau²-bench) 80.7% (AA); GPQA Diamond 76.1% (AA); IFBench 73.9% (AA); MedScribe 55.7%; CorpFin v2 53.1%

### Gemini 3.5 Flash
https://sophon.at/models/gemini-3.5-flash
Gemini 3.5 Flash is an AI model from Google (Alphabet Inc.).
- Family: gemini · Context: 1,048,576 tokens · License: proprietary · Released: 2026-05-19
- Top scores: LiveBench - Math 88.2% (LiveBench); LiveBench - Language 84.6% (LiveBench); GPQA Diamond 82.8% (AA); LiveBench - Reasoning 82.0% (LiveBench); LiveBench 78.9% (LiveBench)

### Qwen3.7 Max
https://sophon.at/models/qwen3.7-max
Qwen3.7 Max is an AI model from Alibaba.
- Family: qwen3.7 · Context: 1,000,000 tokens · License: proprietary · Released: 2026-05-19
- Top scores: τ²-bench (Tau²-bench) 94.7% (AA); GPQA Diamond 92.3% (AA); LiveBench - Math 85.2% (LiveBench); LiveBench - Reasoning 83.3% (LiveBench); IFBench 80.5% (AA)

### JT-35B-Flash
https://sophon.at/models/jt-35b-flash
JT-35B-Flash is an AI model from China Mobile.
- Family: jt · Released: 2026-05-14
- Top scores: τ²-bench (Tau²-bench) 99.1% (AA); GPQA Diamond 82.9% (AA); IFBench 42.0% (AA); SciCode 29.1% (AA); Terminal-Bench (Hard) 28.8% (AA)

### MiniCPM-V 4.6 1.3B
https://sophon.at/models/minicpm-v4-6-1-3b
MiniCPM-V 4.6 1.3B is an AI model from OpenBMB.
- Family: minicpm · Released: 2026-05-11
- Top scores: τ²-bench (Tau²-bench) 87.7% (AA); GPQA Diamond 30.5% (AA); IFBench 26.7% (AA); Humanity's Last Exam (HLE) 4.9% (AA); SciCode 2.1% (AA)

### Ring-2.6-1T
https://sophon.at/models/ring-2-6-1t
Ring-2.6-1T is an AI model from InclusionAI.
- Family: ring · Context: 262,144 tokens · Released: 2026-05-08
- Top scores: τ²-bench (Tau²-bench) 92.4% (AA); GPQA Diamond 85.7% (AA); IFBench 44.6% (AA); SciCode 42.4% (AA); Terminal-Bench (Hard) 28.8% (AA)

### GPT-5.5 Instant (May 2026)
https://sophon.at/models/gpt-5-5-instant-05-26
GPT-5.5 Instant (May 2026) is an AI model from OpenAI.
- Family: gpt · License: proprietary · Released: 2026-05-05
- Top scores: GPQA Diamond 84.6% (AA); IFBench 71.5% (AA); SciCode 50.3% (AA); τ²-bench (Tau²-bench) 49.4% (AA); Terminal-Bench (Hard) 42.4% (AA)

### Grok 4.3
https://sophon.at/models/grok-4.3
Grok 4.3 is an AI model from xAI.
- Family: grok · Context: 1,000,000 tokens · License: proprietary · Released: 2026-04-30
- Top scores: LiveBench - Math 84.3% (LiveBench); MedScribe 74.4%; LiveBench - Language 73.6% (LiveBench); LiveBench - Reasoning 70.8% (LiveBench); TaxEval v2 70.8%

### Granite 4.1 30B
https://sophon.at/models/granite-4-1-30b
Granite 4.1 30B is an AI model from IBM.
- Family: granite · Released: 2026-04-29
- Top scores: GPQA Diamond 48.1% (AA); IFBench 44.4% (AA); τ²-bench (Tau²-bench) 42.1% (AA); SciCode 25.8% (AA); Humanity's Last Exam (HLE) 4.2% (AA)

### Granite 4.1 3B
https://sophon.at/models/granite-4-1-3b
Granite 4.1 3B is an AI model from IBM.
- Family: granite · Released: 2026-04-29
- Top scores: IFBench 33.7% (AA); GPQA Diamond 31.4% (AA); τ²-bench (Tau²-bench) 19.6% (AA); SciCode 11.9% (AA); Humanity's Last Exam (HLE) 3.4% (AA)

### Granite 4.1 8B
https://sophon.at/models/granite-4.1-8b
Granite 4.1 8B is an AI model from IBM, released with open weights.
- Family: granite · Context: 131,072 tokens · License: apache-2.0 · Released: 2026-04-29
- Top scores: GPQA Diamond 43.3% (AA); IFBench 38.6% (AA); τ²-bench (Tau²-bench) 27.8% (AA); SciCode 21.8% (AA); Humanity's Last Exam (HLE) 3.8% (AA)

### Mistral Medium 3.5
https://sophon.at/models/mistral-medium-3-5
Mistral Medium 3.5 is an AI model from Mistral AI.
- Family: mistral · Context: 262,144 tokens · Released: 2026-04-29
- Top scores: τ²-bench (Tau²-bench) 94.2% (AA); GPQA Diamond 74.8% (AA); IFBench 68.8% (AA); TaxEval v2 68.0%; MedScribe 67.7%

### Nemotron 3 Nano Omni 30B A3B Reasoning
https://sophon.at/models/nemotron-3-nano-omni-30b-a3b
Nemotron 3 Nano Omni 30B A3B Reasoning is an AI model from NVIDIA.
- Family: nemotron · Released: 2026-04-29
- Top scores: IFBench 63.2% (AA); GPQA Diamond 46.9% (AA); τ²-bench (Tau²-bench) 45.3% (AA); SciCode 27.8% (AA); Terminal-Bench (Hard) 8.3% (AA)

### DeepSeek V4 Flash
https://sophon.at/models/deepseek-v4-flash
DeepSeek V4 Flash is an AI model from DeepSeek, released with open weights.
- Family: deepseek · Context: 1,048,576 tokens · License: mit · Released: 2026-04-24
- Top scores: Autonomous Skill Evolution 1.88 (9 samples); τ²-bench (Tau²-bench) 94.4% (AA); LiveBench - Math 79.6% (LiveBench); MathArena 76.61%; GPQA Diamond 71.6% (AA)

### DeepSeek V4 Pro
https://sophon.at/models/deepseek-v4-pro
DeepSeek's April 2026 next-gen open-weights flagship - 1.6T-total / 49B-active MoE with 1M context and DeepSeek Sparse Attention.
- Family: deepseek · Params: 1.6T total / 49B active · Context: 1,048,576 tokens · License: mit · Released: 2026-04-24
- Top scores: τ²-bench (Tau²-bench) 91.2% (AA); LiveBench - Math 90.7% (LiveBench); LiveBench - Reasoning 82.7% (LiveBench); SWE-bench 80.6% (Verified, per DeepSeek); LiveBench - Language 78.1% (LiveBench)

### GPT-5.5
https://sophon.at/models/gpt-5.5
GPT-5.5 is an AI model from OpenAI.
- Family: gpt · Context: 1,050,000 tokens · License: proprietary · Released: 2026-04-23
- Top scores: Slitherlink Env 4.2 (10 samples); Crystal Relaxation Rlm 100.0% (1 sample); Physgym Arena Medley Public 100.0% (5 samples); MathArena 92.82%; Crystal Relaxation 90.0% (1 sample)

### Hy3 preview
https://sophon.at/models/hy3-preview
Hy3-preview is an AI model from Tencent.
- Family: hy · Context: 262,144 tokens · Released: 2026-04-23
- Top scores: GPQA Diamond 73.2% (AA); τ²-bench (Tau²-bench) 67.5% (AA); IFBench 48.0% (AA); SciCode 39.4% (AA); Terminal-Bench (Hard) 31.8% (AA)

### Ling-2.6-1T
https://sophon.at/models/ling-2-6-1t
Ling-2.6-1T is an AI model from InclusionAI.
- Family: ling · Context: 262,144 tokens · Released: 2026-04-23
- Top scores: τ²-bench (Tau²-bench) 89.8% (AA); GPQA Diamond 75.2% (AA); IFBench 56.9% (AA); SciCode 37.0% (AA); Terminal-Bench (Hard) 31.1% (AA)

### MiMo-V2.5
https://sophon.at/models/mimo-v2-5-0424
MiMo-V2.5 is an AI model from Xiaomi.
- Family: mimo · Released: 2026-04-22
- Top scores: τ²-bench (Tau²-bench) 90.6% (AA); GPQA Diamond 84.9% (AA); IFBench 67.1% (AA); SciCode 43.1% (AA); Terminal-Bench (Hard) 41.7% (AA)

### MiMo-V2.5-Pro
https://sophon.at/models/mimo-v2.5-pro
MiMo-V2.5-Pro is an AI model from Xiaomi, released with open weights.
- Family: mimo · Context: 1,048,576 tokens · License: mit · Released: 2026-04-22
- Top scores: Skill Reward Hacking 9.48 (25 samples); MMLU-Pro 85.1% (1000 samples); GPQA Diamond 76.2% (AA); τ²-bench (Tau²-bench) 72.5% (AA); IFBench 42.7% (AA)

### Qwen3.6 27B
https://sophon.at/models/qwen3-6-27b
Qwen3.6 27B is an AI model from Alibaba.
- Family: qwen · Context: 262,144 tokens · Released: 2026-04-22
- Top scores: τ²-bench (Tau²-bench) 93.6% (AA); GPQA Diamond 82.9% (AA); LiveBench - Math 79.9% (LiveBench); LiveBench - Coding 71.8% (LiveBench); TaxEval v2 71.3%

### Ling-2.6-flash
https://sophon.at/models/ling-2-6-flash
Ling 2.6 Flash is an AI model from InclusionAI.
- Family: ling · Context: 262,144 tokens · Released: 2026-04-21
- Top scores: τ²-bench (Tau²-bench) 86.0% (AA); GPQA Diamond 59.3% (AA); IFBench 57.4% (AA); SciCode 27.1% (AA); Terminal-Bench (Hard) 21.2% (AA)

### Kimi K2.6
https://sophon.at/models/kimi-k2.6
Kimi K2.6 is an AI model from Moonshot AI.
- Family: kimi · Context: 262,144 tokens · License: Modified MIT · Released: 2026-04-20
- Top scores: Physgym Arena Medley Public 100.0% (5 samples); τ²-bench (Tau²-bench) 95.9% (AA); GPQA Diamond 91.1% (AA); MedScribe 78.1%; IFBench 76.0% (AA)

### Kimi K2.6 (Non-reasoning)
https://sophon.at/models/kimi-k2-6-non-reasoning
Kimi K2.6 (Non-reasoning) is an AI model from Moonshot AI.
- Family: kimi · Released: 2026-04-20
- Top scores: τ²-bench (Tau²-bench) 93.9% (AA); GPQA Diamond 78.8% (AA); IFBench 44.3% (AA); SciCode 39.5% (AA); Terminal-Bench (Hard) 37.9% (AA)

### Qwen3.6 Max Preview
https://sophon.at/models/qwen3.6-max
Qwen3.6 Max Preview is an AI model from Alibaba.
- Family: qwen3.6 · License: proprietary · Released: 2026-04-20
- Top scores: τ²-bench (Tau²-bench) 95.9% (AA); GPQA Diamond 88.8% (AA); IFBench 76.6% (AA); CorpFin v2 66.5%; SciCode 46.9% (AA)

### Claude Opus 4.7
https://sophon.at/models/claude-opus-4-7
Claude Opus 4.7 is an AI model from Anthropic.
- Family: claude · Context: 1,000,000 tokens · License: proprietary · Released: 2026-04-16
- Top scores: MMMLU 91.5 Accuracy; Autonomous Skill Evolution 3.21 (15 samples); Physgym Arena Medley Public 100.0% (5 samples); LiveBench - Math 93.1% (LiveBench); GPQA Diamond 88.5% (AA)

## Leaderboards

### Arena - Document
https://sophon.at/leaderboards/lmarena-document
Crowdsourced document model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Document Style Control
https://sophon.at/leaderboards/lmarena-document-style-control
Crowdsourced document style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Image Edit
https://sophon.at/leaderboards/lmarena-image-edit
Crowdsourced image edit model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Image to Video
https://sophon.at/leaderboards/lmarena-image-to-video
Crowdsourced image to video model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Search
https://sophon.at/leaderboards/lmarena-search
Crowdsourced search model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Search Style Control
https://sophon.at/leaderboards/lmarena-search-style-control
Crowdsourced search style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Text
https://sophon.at/leaderboards/lmarena-text
Crowdsourced text model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Text Style Control
https://sophon.at/leaderboards/lmarena-text-style-control
Crowdsourced text style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Text to Image
https://sophon.at/leaderboards/lmarena-text-to-image
Crowdsourced text to image model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Text to Video
https://sophon.at/leaderboards/lmarena-text-to-video
Crowdsourced text to video model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Video Edit
https://sophon.at/leaderboards/lmarena-video-edit
Crowdsourced video edit model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Vision Style Control
https://sophon.at/leaderboards/lmarena-vision-style-control
Crowdsourced vision style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena - Webdev
https://sophon.at/leaderboards/lmarena-webdev
Crowdsourced webdev model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- Updates: weekly

### Arena Coding
https://sophon.at/leaderboards/lmarena-coding
LMArena subcategory ranking models on user pairwise votes restricted to coding prompts.
- Updates: live

### Arena Hard Prompts
https://sophon.at/leaderboards/lmarena-hard-prompts
LMArena subcategory ranking models on a filtered slice of Arena prompts auto-classified as hard along multiple difficulty axes.
- Updates: live
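
Several Arena entries in this section describe Elo-style scores computed from pairwise human preference votes. As a rough illustration of what that means, here is a minimal sketch of the classic online Elo update in Python. It is not LMArena's actual rating pipeline; the starting rating of 1000, the K-factor of 32, and the `(model_a, model_b, score_a)` vote format are all assumptions made for the example.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b) after one vote.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    # B's actual and expected scores are the complements of A's.
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def rate(votes, initial=1000.0, k=32.0):
    """Replay (model_a, model_b, score_a) votes in order and return ratings."""
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in votes:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a, k)
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical votes between placeholder model names.
    demo = [("model-x", "model-y", 1.0), ("model-y", "model-z", 0.5)]
    for name, rating in sorted(rate(demo).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Online Elo depends on the order in which votes are replayed, which is one reason production leaderboards often fit all votes jointly instead; treat the numbers this sketch produces as illustrative only.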

## Capabilities

### bias
https://sophon.at/capabilities/bias
- Evals: 1

### browser use
https://sophon.at/capabilities/browser-use
- Evals: 7

### code editing
https://sophon.at/capabilities/code-editing
- Evals: 7

### code generation
https://sophon.at/capabilities/code-generation
- Evals: 10

### common sense
https://sophon.at/capabilities/common-sense
- Evals: 1

### computer use
https://sophon.at/capabilities/computer-use
- Evals: 2

### debugging
https://sophon.at/capabilities/debugging
- Evals: 7

### embodied
https://sophon.at/capabilities/embodied
- Evals: 3

### factual recall
https://sophon.at/capabilities/factual-recall
- Evals: 19

### hallucination
https://sophon.at/capabilities/hallucination
- Evals: 5

### harmful content
https://sophon.at/capabilities/harmful-content
- Evals: 2

### image understanding
https://sophon.at/capabilities/image-understanding
- Evals: 4

### instruction following
https://sophon.at/capabilities/instruction-following
- Evals: 17

### jailbreak resistance
https://sophon.at/capabilities/jailbreak-resistance
- Evals: 3

### legal reasoning
https://sophon.at/capabilities/legal-reasoning
- Evals: 2

### llm judging
https://sophon.at/capabilities/llm-judging
- Evals: 9

### logic
https://sophon.at/capabilities/logic
- Evals: 1

### long context
https://sophon.at/capabilities/long-context
- Evals: 2

### math
https://sophon.at/capabilities/math
- Evals: 12

### multi turn dialog
https://sophon.at/capabilities/multi-turn-dialog
- Evals: 7

### multilingual
https://sophon.at/capabilities/multilingual
- Evals: 3

### planning
https://sophon.at/capabilities/planning
- Evals: 29

### retrieval
https://sophon.at/capabilities/retrieval
- Evals: 3

### safety
https://sophon.at/capabilities/safety
- Evals: 8

### scientific reasoning
https://sophon.at/capabilities/scientific-reasoning
- Evals: 9

### tool calling
https://sophon.at/capabilities/tool-calling
- Evals: 18

### translation
https://sophon.at/capabilities/translation
- Evals: 1

## Papers

### "I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration
https://sophon.at/papers/pwc-83620
A goal-level attribution framework called CoTrace is introduced to analyze how large language models contribute to goal shaping in human-AI collaboration, revealing that while models account for a small percentage of direct contributions, they play a significant role in introducing concrete requirements and making indirect contributions.
- Year: 2026 · Venue: arXiv 2026

### "What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing
https://sophon.at/papers/pwc-56202
Users prefer adaptive feedback mechanisms in in-car AI assistants, starting with high transparency to build trust and then reducing verbosity as reliability increases, particularly in attention-critical driving scenarios.
- Year: 2026 · Venue: arXiv 2026

### $E^3$-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference
https://sophon.at/papers/arxiv-2605.27428
Edge deployments of generative inference increasingly face two practical realities: per-device per-model performance is often unknown at deployment time, and it is non-stationary due to user-driven semantic events, background load, and device churn.
- Year: 2026

### BlockFormer: Transformer-based inference from interaction maps
https://sophon.at/papers/arxiv-2605.21617
Inference from interaction maps, such as centromere identification from genome-wide chromosome conformation capture techniques -- notably Hi-C -- can be formulated as a generic inverse problem: infer a set of parameters given a map summarizing pairwise interactions between…
- Year: 2026

### "Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs
https://sophon.at/papers/arxiv-2602.04729
We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs).
- Year: 2026

### "Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills
https://sophon.at/papers/arxiv-2602.06547
LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper scripts that execute with full user privileges.
- Year: 2026

### "I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents
https://sophon.at/papers/arxiv-2606.00497
Deceptive web content, widely instantiated across the internet and commonly known as social-engineering attacks, manipulates autonomous web agents into submitting users' personally identifiable information (PII) to attacker-controlled endpoints.
- Year: 2026

### "I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise
https://sophon.at/papers/arxiv-2606.01811
Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing.
- Year: 2026

### "Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models
https://sophon.at/papers/arxiv-2605.31401
Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluations exist.
- Year: 2026

### (1D) Ordered Tokens Enable Efficient Test-Time Search
https://sophon.at/papers/pwc-57557
Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling.
- Year: 2026 · Venue: arXiv 2026

### *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
https://sophon.at/papers/arxiv-2602.15778
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing.
- Year: 2026

### 2-ASP(Q) programs with weak constraints: Complexity and efficient implementation
https://sophon.at/papers/arxiv-2605.27338
ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with two quantifiers and weak constraints, denoted as 2-ASP(Q)^w.
- Year: 2026

### 2Mamba2Furious: Linear in Complexity, Competitive in Accuracy
https://sophon.at/papers/pwc-56213
Researchers enhance linear attention by simplifying Mamba-2 and improving its architectural components to achieve near-softmax accuracy while maintaining memory efficiency for long sequences.
- Year: 2026 · Venue: arXiv 2026

### 360DVO: Deep Visual Odometry for Monocular 360-Degree Camera
https://sophon.at/papers/pwc-62984
A deep learning-based monocular omnidirectional visual odometry system uses a distortion-aware spherical feature extractor and differentiable bundle adjustment to improve robustness and accuracy over existing methods.
- Year: 2026 · Venue: arXiv 2026

### 3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence
https://sophon.at/papers/pwc-55258
3D CoCa v2 enhances 3D captioning by combining contrastive vision-language learning with spatially-aware 3D scene encoding and test-time search for improved generalization across diverse environments.
- Year: 2026 · Venue: arXiv 2026

### 3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis
https://sophon.at/papers/pwc-57481
Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications.
- Year: 2026 · Venue: arXiv 2026

### 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
https://sophon.at/papers/pwc-56857
Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce.
- Year: 2026 · Venue: arXiv 2026

### 3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models
https://sophon.at/papers/arxiv-2603.07751
Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D…
- Year: 2026

### 4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video
https://sophon.at/papers/pwc-56691
4D reconstruction of equine family (e.g.
- Year: 2026 · Venue: arXiv 2026

### 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
https://sophon.at/papers/pwc-82921
4DThinker enables vision-language models to perform dynamic spatial reasoning through 4D latent mental imagery, using scalable data generation and novel fine-tuning methods that outperform existing approaches.
- Year: 2026 · Venue: arXiv 2026

### 4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere
https://sophon.at/papers/pwc-56239
4RC presents a unified feed-forward framework for 4D reconstruction from monocular videos that learns holistic scene geometry and motion dynamics through a transformer-based encoder-decoder architecture with conditional querying capabilities.
- Year: 2026 · Venue: arXiv 2026

### A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification
https://sophon.at/papers/pwc-55397
Lightweight probes trained on hidden states of LLMs enable efficient classification tasks without additional computational overhead, improving safety and sentiment analysis performance.
- Year: 2026 · Venue: arXiv 2026

### A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents
https://sophon.at/papers/pwc-60965
Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with…
- Year: 2026 · Venue: arXiv 2026

### A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents
https://sophon.at/papers/arxiv-2604.17943
RAG-based question-answering (QA) in specialist domains faces a cold-start problem: lack of evaluative benchmarks and absence of labeled data for post-training. We present DoRA (Domain-oriented RAG Assessment), a novel benchmark construction and evaluation framework using only a…
- Year: 2026

### A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis
https://sophon.at/papers/arxiv-2605.28575
Multimodal Sentiment Analysis (MSA) fuses text, acoustic, and visual streams to infer sentiment. Because pre-trained text encoders are far more expressive than their acoustic and visual counterparts, the text modality tends to dominate optimization, suppressing weaker modalities…
- Year: 2026

## Organizations

### 01.AI (零一万物)
https://sophon.at/organizations/01-ai
Chinese AI startup founded by Kai-Fu Lee; publisher of the Yi open-weights model family.

### AI Futures Project
https://sophon.at/organizations/ai-futures-project
Independent research non-profit founded by Daniel Kokotajlo (ex-OpenAI) studying AI-progress forecasts; publishers of AI 2027.

### AI21 Labs
https://sophon.at/organizations/ai21-labs
AI21 Labs is an organization.

### AI4Finance Foundation
https://sophon.at/organizations/ai4finance-foundation
Open-source nonprofit building FinRL and FinGPT for reinforcement-learning-based financial trading.

### ARC Prize
https://sophon.at/organizations/arc-prize
$1M+ open competition to solve the ARC-AGI benchmark; run by the ARC Prize Foundation (François Chollet & Greg Kamradt).

### ARC Prize Foundation
https://sophon.at/organizations/arc-prize-foundation
Non-profit operating the ARC Prize competition and ARC-AGI benchmark series.

### AT&T
https://sophon.at/organizations/at-t
US telecom giant; historic parent of Bell Labs and AT&T Labs Research.

### Abacus.AI
https://sophon.at/organizations/abacus-ai
AI platform startup offering enterprise LLM tooling; co-creator with Yann LeCun's NYU group of the LiveBench contamination-resistant LLM benchmark.

### Abugoot
https://sophon.at/organizations/prime-hub-team-abugoot
Abugoot is a team.

### AfterQuery
https://sophon.at/organizations/afterquery

### Aider
https://sophon.at/organizations/aider-ai
Open-source AI pair-programming CLI created by Paul Gauthier; also operates the widely cited Aider Polyglot coding leaderboard.

### Airbus
https://sophon.at/organizations/airbus
European aerospace manufacturer; runs AI/ML research for aviation, defense, and space applications.

### Airbyte
https://sophon.at/organizations/airbyte
Open-source data integration platform / ELT tool; YC W21; relevant as career-history for individuals in the graph.

### Alibaba
https://sophon.at/organizations/alibaba
Alibaba is an organization.

### Alibaba DAMO Academy
https://sophon.at/organizations/alibaba-damo-academy
Alibaba's global research institute; covers ML, NLP, robotics, and quantum computing.

### Alibaba Qwen (Tongyi Qianwen)
https://sophon.at/organizations/alibaba-qwen
Alibaba's AI research division publishing the Qwen series, the most prolific open-weights frontier model family.

### Alignment Research Center (ARC)
https://sophon.at/organizations/alignment-research-center
AI alignment non-profit founded by Paul Christiano in 2021; its evals team spun out to become METR in late 2023.

### All Hands AI
https://sophon.at/organizations/all-hands-ai
Startup commercializing the OpenHands (formerly OpenDevin) open-source agent framework.

### Allen Institute for AI
https://sophon.at/organizations/ai2
Allen Institute for AI is an organization.

### Allen Institute for AI (Ai2)
https://sophon.at/organizations/allen-ai
Seattle non-profit AI research institute publishing fully open models, datasets, and the OLMo / Tulu / Dolma family.

## API and more

- Read API index: https://sophon.at/api/v1 (JSON for every entity type)
- Per-entity JSON: https://sophon.at/api/v1/{evals|models|tools|leaderboards|organizations|people|capabilities|papers}/{slug}
- Full text / PDF for papers: https://sophon.at/api/v1/papers/{slug}/text and /pdf
- Search: https://sophon.at/api/v1/search?q={query}
- CLI: `npm i -g sophon-at` (command `sophon`); it outputs JSON when piped, and `sophon help --json` prints a machine-readable manifest
- API & CLI docs: https://sophon.at/about/api
- Curated index: https://sophon.at/llms.txt
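
The URL patterns listed above can be assembled programmatically. The following is a minimal Python sketch that only builds the URLs and performs no network requests. The helper names (`entity_url`, `paper_text_url`, `paper_pdf_url`, `search_url`) are hypothetical, and it assumes the API slug is the same as the last path segment of an entity's page URL in this catalog; the response schema is not described here, so the sketch does not parse anything.

```python
from urllib.parse import quote, urlencode

BASE = "https://sophon.at/api/v1"

# Entity types as listed in the per-entity JSON pattern above.
ENTITY_TYPES = {
    "evals", "models", "tools", "leaderboards",
    "organizations", "people", "capabilities", "papers",
}


def entity_url(kind, slug):
    """Build the per-entity JSON URL for a given type and slug."""
    if kind not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {kind!r}")
    return f"{BASE}/{kind}/{quote(slug, safe='')}"


def paper_text_url(slug):
    """Full-text URL for a paper."""
    return f"{entity_url('papers', slug)}/text"


def paper_pdf_url(slug):
    """PDF URL for a paper."""
    return f"{entity_url('papers', slug)}/pdf"


def search_url(query):
    """Search URL with the query percent-encoded as the `q` parameter."""
    return f"{BASE}/search?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(entity_url("models", "deepseek-v4-pro"))
    print(paper_text_url("arxiv-2602.04729"))
    print(search_url("gpqa diamond"))
```

Keeping URL construction separate from fetching makes it easy to plug the results into any HTTP client; consult https://sophon.at/about/api for the actual response shapes before relying on specific fields.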