# Sophon

> Curated catalog of AI evaluations, leaderboards, models, papers, and the RL environments and datasets that lift eval scores.

Sophon maps each AI eval to the tools (RL environments, SFT/DPO datasets, scaffolds) that improve model scores on it, alongside the models, papers, organizations, and people behind them. Every page is server-rendered; a read-only JSON API is documented at /api/v1.

An expanded version with full descriptions, key scores, and relationships inlined is at https://sophon.at/llms-full.txt.

## Evals

- [AIME 2024: Problems from the American Invitational Mathematics Examination](https://sophon.at/evals/aime2024): Official 15-problem high-school math olympiad-track exam used by labs as a fresh, contamination-resistant math reasoning benchmark.
- [GPQA Diamond](https://sophon.at/evals/gpqa-diamond): Graduate-level physics, chemistry, and biology multiple-choice questions written by PhDs and verified to be Google-proof.
- [HumanEval](https://sophon.at/evals/humaneval): 164 hand-written Python programming problems with unit tests, the original LLM code-generation benchmark from OpenAI's Codex paper.
- [MATH-500](https://sophon.at/evals/math-500): 500-problem subset of the Hendrycks MATH competition-math benchmark, popularized by OpenAI's PRM800K work as a standard evaluation slice.
- [MMLU-Pro](https://sophon.at/evals/mmlu-pro): Harder, reasoning-focused successor to MMLU with 10 answer choices and curated questions resistant to lucky guessing.
- [Massive Multitask Language Understanding (MMLU)](https://sophon.at/evals/mmlu): 57-subject multiple-choice exam testing broad world knowledge and reasoning across academic and professional domains.
- [Arena-Hard](https://sophon.at/evals/arena-hard): 500 challenging real-user prompts mined from Chatbot Arena and graded by a strong LLM judge for pairwise win rate.
- [BIG-Bench Hard (BBH)](https://sophon.at/evals/bbh): 23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.
- [GSM8K](https://sophon.at/evals/gsm8k): 8.5k grade-school math word problems requiring multi-step arithmetic reasoning to reach a single numeric answer.
- [LegalBench](https://sophon.at/evals/legalbench): 162 collaboratively curated legal-reasoning tasks across rule-recall, issue-spotting, application, and interpretation - the standard legal…
- [LiveBench](https://sophon.at/evals/livebench): Rolling contamination-free benchmark that updates questions monthly across math, coding, reasoning, language, instruction-following, and da…
- [LiveCodeBench](https://sophon.at/evals/livecodebench): Rolling competitive-programming benchmark that scrapes LeetCode / AtCoder / Codeforces problems after a known cutoff to fight contamination.
- [Mostly Basic Python Problems (MBPP)](https://sophon.at/evals/mbpp): 974 short crowd-sourced Python tasks with three unit tests each, used alongside HumanEval as a baseline code-generation benchmark.
- [SWE-Lancer](https://sophon.at/evals/swe-lancer): 1,488 real freelance software-engineering tasks from Upwork worth $1M total in payouts, evaluating models on end-to-end paid developer work.
- [SWE-bench](https://sophon.at/evals/swe-bench): 2,294 real GitHub issues from 12 popular Python repos that require an agent to produce a patch passing the project's test suite.
- [SWE-bench Lite](https://sophon.at/evals/swe-bench-lite): 300-issue subset of SWE-bench focused on functional bug fixes that are easier to evaluate - used for fast iteration before full SWE-bench r…
- [SWE-bench Verified](https://sophon.at/evals/swe-bench-verified): 500 human-validated SWE-bench tasks confirmed solvable from the issue alone, with non-flaky test suites - the most-reported agentic coding…
- [TruthfulQA](https://sophon.at/evals/truthfulqa): 817 questions targeting common human misconceptions, measuring whether a model gives factually true answers or repeats popular falsehoods.
- [Vals.ai Legal Evals](https://sophon.at/evals/vals-legal-evals): Vals.ai's proprietary suite of legal-domain benchmarks (contract review, hallucination tests, LegalBench Pro) used by law firms to procure…
- [AA-Omniscience](https://sophon.at/evals/aa-omniscience): Artificial Analysis's broad-knowledge benchmark - thousands of curated factual questions spanning specialized domains - designed to test ha…
- [AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models](https://sophon.at/evals/agieval): AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to hum…
- [AIME 2025: Problems from the American Invitational Mathematics Examination](https://sophon.at/evals/aime2025): A benchmark for evaluating AI's ability to solve challenging mathematics problems from the 2025 AIME - a prestigious high school mathematic…
- [AIME2024](https://sophon.at/evals/openreward-generalreasoning-aime2024): Problems from the American Invitational Mathematics Examination (AIME) 2024.
- [AIME2025](https://sophon.at/evals/openreward-generalreasoning-aime2025): Problems from the American Invitational Mathematics Examination (AIME) 2025-I & II.
- [AIME2026](https://sophon.at/evals/openreward-generalreasoning-aime2026): Problems from the American Invitational Mathematics Examination (AIME) 2026-I & II.
- [AIR Bench: AI Risk Benchmark](https://sophon.at/evals/air-bench): A safety benchmark evaluating language models against risk categories derived from government regulations and company policies.
- [ALFRED](https://sophon.at/evals/alfred): 3D-simulated household tasks driven by language instructions and egocentric video - the visual sibling of ALFWorld.
- [ALFWorld](https://sophon.at/evals/alfworld): Embodied household-task benchmark that aligns TextWorld text commands with ALFRED 3D scenes, testing whether agents can transfer from abstr…
- [ANIMA: Animal Norms In Moral Assessment](https://sophon.at/evals/anima): Evaluates the quality of a model's moral reasoning about animal welfare across 13 ethical dimensions.
- [APE: Attempt to Persuade Eval](https://sophon.at/evals/ape): Measures a model's willingness to attempt persuasion on harmful, controversial, and benign topics. The key metric is not persuasion effecti…

## Tools (RL envs, datasets, scaffolds)

- [VF Openbench RL Env (Community)](https://sophon.at/tools/prime-aarush-vf-openbench): Environment for single-turn tasks in OpenBench
- [Agent Bench RL Env (Prime Community)](https://sophon.at/tools/prime-prime-community-mini-swe-agent-bench): Benchmarking model performance on SWE Bench in the Mini SWE Agent harness.
- [BrowserGym](https://sophon.at/tools/browsergym): ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and…
- [NuminaMath](https://sophon.at/tools/numina-math): An 860k-problem competition-math dataset with detailed solutions, the open community's go-to corpus for training math-specialized LLMs.
- [OpenThoughts](https://sophon.at/tools/openthoughts): A fully-open distillation of long DeepSeek-R1 reasoning traces - the community's flagship "open R1" SFT corpus for reasoning models.
- [Tülu 3 SFT Mixture](https://sophon.at/tools/tulu-3-sft-mixture): Allen AI's flagship open SFT mixture combining new persona-driven prompts with curated public data for post-training a frontier-quality ins…
- [WizardLM Evol-Instruct](https://sophon.at/tools/wizardlm-evol-instruct): Microsoft's "Evol-Instruct" recipe - automatically rewriting simple instructions into harder, more diverse ones using an LLM evolver.
- [Agent PLUS RL Env (Prime Intellect)](https://sophon.at/tools/prime-primeintellect-mini-swe-agent-plus): Mini SWE Agent Plus environment for solving SWE issues inside Prime Sandboxes.
- [Aya Dataset](https://sophon.at/tools/aya-dataset): Cohere For AI's massively multilingual instruction dataset covering 65 languages, built by a 3,000-person open-science collaboration.
- [Bigbench BBH RL Env (Prime Community)](https://sophon.at/tools/prime-prime-community-bigbench-bbh): Big Bench + BBH implementation
- [COT Theater RL Env (Community)](https://sophon.at/tools/prime-danruif-cot-theater): Reward-hacking sprint env. Four pseudo-CoT surface proxies and four true reasoning metrics on GSM8K, with all eight logged on every rollout…
- [Certainty Collapse RL Env (Community)](https://sophon.at/tools/prime-cardan05-certainty-collapse): Reward Hacking Sprint: does optimizing self-certainty (RLIF-style intrinsic reward) cause models to be confidently wrong on math? GSM8K, Ll…
- [Compositional Hacks RL Env (Community)](https://sophon.at/tools/prime-danruif-compositional-hacks): Reward-hacking sprint env that plants two hidden rewards at once on GSM8K to probe whether one dominates or both emerge proportionally.
- [Context Needle RL Env (Community)](https://sophon.at/tools/prime-stelioszach-long-context-needle): Needle-in-haystack - locate a target sentence in a long document.
- [Deepconf RL Env (Community)](https://sophon.at/tools/prime-tonic-deepconf): DeepConf environment for confidence-aware LLM reasoning evaluation
- [Deepswe RL Env (Prime Intellect)](https://sophon.at/tools/prime-primeintellect-deepswe): DeepSWE environment for solving SWE issues inside Prime Sandboxes.
- [Discover Gsm8k RL Env (Community)](https://sophon.at/tools/prime-stochi0-discover-gsm8k): GSM8K rubric-discovery environment: learn rubric_fn from (input, response, score) examples
- [Emergence Prediction RL Env (Community)](https://sophon.at/tools/prime-danruif-emergence-prediction): Reward-hacking sprint env. The planted token-frequency hack is held fixed within a run, and planted_token varies across runs to test whethe…
- [Emoji HACK RL Env (Community)](https://sophon.at/tools/prime-danruif-emoji-hack): Reward-hacking sprint env. A planted emoji-density hack on GSM8K, used to test whether GRPO can amplify a behavior with effectively zero ba…
- [FH Aviary RL Env (Prime Community)](https://sophon.at/tools/prime-prime-community-fh-aviary): Future House Aviary wrapper for verifiers - Scientific reasoning environments with tools
- [Formatting Emergence RL Env (Community)](https://sophon.at/tools/prime-danruif-formatting-emergence): Reward-hacking sprint env. A planted markdown-formatting hack on GSM8K, with hidden-reward weight and task difficulty as the two experiment…
- [GPQA Diamond RL Env (Community)](https://sophon.at/tools/prime-anshu-gpqa-diamond): GPQA Diamond: A Graduate-Level Google-Proof Q&A Benchmark
- [GPQA RL Env (Prime Intellect)](https://sophon.at/tools/prime-primeintellect-gpqa): GPQA evaluation environment
- [Gsm8k Olmes RL Env (Community)](https://sophon.at/tools/prime-pmahdavi-gsm8k-olmes): GSM8K evaluation matching OLMES tulu_3_dev_no_safety methodology
- [Gsm8k RL Env (Community)](https://sophon.at/tools/prime-will-gsm8k): GSM8K environment

## Models

- [Qwen3.7Plus](https://sophon.at/models/qwen3.7-plus): Qwen3.7Plus is an AI model from Alibaba.
- [Step 3.7 Flash](https://sophon.at/models/step-3-7-flash)
- [Claude Opus 4.8](https://sophon.at/models/claude-opus-4-8): Claude Opus 4.8 (Adaptive Reasoning, Max Effort) is an AI model from Anthropic.
- [MiniCPM5-1B (Non-reasoning)](https://sophon.at/models/minicpm5-1b-non-reasoning): MiniCPM5-1B (Non-reasoning) is an AI model from OpenBMB.
- [Command A+](https://sophon.at/models/command-a-plus): Command A+ is an AI model from Cohere.
- [Gemini 3.5 Flash](https://sophon.at/models/gemini-3.5-flash): Gemini 3.5 Flash is an AI model from Google (Alphabet Inc.).
- [Qwen3.7 Max](https://sophon.at/models/qwen3.7-max): Qwen3.7 Max is an AI model from Alibaba.
- [JT-35B-Flash](https://sophon.at/models/jt-35b-flash): JT-35B-Flash is an AI model from China Mobile.
- [MiniCPM-V 4.6 1.3B](https://sophon.at/models/minicpm-v4-6-1-3b): MiniCPM-V 4.6 1.3B is an AI model from OpenBMB.
- [Ring-2.6-1T](https://sophon.at/models/ring-2-6-1t): Ring-2.6-1T is an AI model from InclusionAI.
- [GPT-5.5 Instant (May 2026)](https://sophon.at/models/gpt-5-5-instant-05-26): GPT-5.5 Instant (May 2026) is an AI model from OpenAI.
- [Grok 4.3](https://sophon.at/models/grok-4.3): Grok 4.3 is an AI model from xAI.
- [Granite 4.1 30B](https://sophon.at/models/granite-4-1-30b): Granite 4.1 30B is an AI model from IBM.
- [Granite 4.1 3B](https://sophon.at/models/granite-4-1-3b): Granite 4.1 3B is an AI model from IBM.
- [Granite 4.1 8B](https://sophon.at/models/granite-4.1-8b): Granite 4.1 8B is an AI model from IBM, released with open weights.
- [Mistral Medium 3.5](https://sophon.at/models/mistral-medium-3-5): Mistral Medium 3.5 is an AI model from Mistral AI.
- [Nemotron 3 Nano Omni 30B A3B Reasoning](https://sophon.at/models/nemotron-3-nano-omni-30b-a3b): Nemotron 3 Nano Omni 30B A3B Reasoning is an AI model from NVIDIA.
- [DeepSeek V4 Flash](https://sophon.at/models/deepseek-v4-flash): DeepSeek V4 Flash is an AI model from DeepSeek, released with open weights.
- [DeepSeek V4 Pro](https://sophon.at/models/deepseek-v4-pro): DeepSeek's April 2026 next-gen open-weights flagship - 1.6T-total / 49B-active MoE with 1M context and DeepSeek Sparse Attention.
- [GPT-5.5](https://sophon.at/models/gpt-5.5): GPT-5.5 is an AI model from OpenAI.
- [Hy3 preview](https://sophon.at/models/hy3-preview): Hy3-preview is an AI model from Tencent.
- [Ling-2.6-1T](https://sophon.at/models/ling-2-6-1t): Ling-2.6-1T is an AI model from InclusionAI.
- [MiMo-V2.5](https://sophon.at/models/mimo-v2-5-0424): MiMo-V2.5 is an AI model from Xiaomi.
- [MiMo-V2.5-Pro](https://sophon.at/models/mimo-v2.5-pro): MiMo-V2.5-Pro is an AI model from Xiaomi, released with open weights.
- [Qwen3.6 27B](https://sophon.at/models/qwen3-6-27b): Qwen3.6 27B is an AI model from Alibaba.

## Leaderboards

- [Arena - Document](https://sophon.at/leaderboards/lmarena-document): Crowdsourced document model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Document Style Control](https://sophon.at/leaderboards/lmarena-document-style-control): Crowdsourced document style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Image Edit](https://sophon.at/leaderboards/lmarena-image-edit): Crowdsourced image edit model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Image to Video](https://sophon.at/leaderboards/lmarena-image-to-video): Crowdsourced image to video model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Search](https://sophon.at/leaderboards/lmarena-search): Crowdsourced search model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Search Style Control](https://sophon.at/leaderboards/lmarena-search-style-control): Crowdsourced search style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Text](https://sophon.at/leaderboards/lmarena-text): Crowdsourced text model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Text Style Control](https://sophon.at/leaderboards/lmarena-text-style-control): Crowdsourced text style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Text to Image](https://sophon.at/leaderboards/lmarena-text-to-image): Crowdsourced text to image model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Text to Video](https://sophon.at/leaderboards/lmarena-text-to-video): Crowdsourced text to video model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Video Edit](https://sophon.at/leaderboards/lmarena-video-edit): Crowdsourced video edit model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Vision Style Control](https://sophon.at/leaderboards/lmarena-vision-style-control): Crowdsourced vision style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Webdev](https://sophon.at/leaderboards/lmarena-webdev): Crowdsourced webdev model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena Coding](https://sophon.at/leaderboards/lmarena-coding): LMArena subcategory ranking models on user pairwise votes restricted to coding prompts.
- [Arena Hard Prompts](https://sophon.at/leaderboards/lmarena-hard-prompts): LMArena subcategory ranking models on a filtered slice of Arena prompts auto-classified as hard along multiple difficulty axes.

## Capabilities

- [bias](https://sophon.at/capabilities/bias)
- [browser use](https://sophon.at/capabilities/browser-use)
- [code editing](https://sophon.at/capabilities/code-editing)
- [code generation](https://sophon.at/capabilities/code-generation)
- [common sense](https://sophon.at/capabilities/common-sense)
- [computer use](https://sophon.at/capabilities/computer-use)
- [debugging](https://sophon.at/capabilities/debugging)
- [embodied](https://sophon.at/capabilities/embodied)
- [factual recall](https://sophon.at/capabilities/factual-recall)
- [hallucination](https://sophon.at/capabilities/hallucination)
- [harmful content](https://sophon.at/capabilities/harmful-content)
- [image understanding](https://sophon.at/capabilities/image-understanding)
- [instruction following](https://sophon.at/capabilities/instruction-following)
- [jailbreak resistance](https://sophon.at/capabilities/jailbreak-resistance)
- [legal reasoning](https://sophon.at/capabilities/legal-reasoning)
- [llm judging](https://sophon.at/capabilities/llm-judging)
- [logic](https://sophon.at/capabilities/logic)
- [long context](https://sophon.at/capabilities/long-context)
- [math](https://sophon.at/capabilities/math)
- [multi turn dialog](https://sophon.at/capabilities/multi-turn-dialog)
- [multilingual](https://sophon.at/capabilities/multilingual)
- [planning](https://sophon.at/capabilities/planning)
- [retrieval](https://sophon.at/capabilities/retrieval)
- [safety](https://sophon.at/capabilities/safety)
- [scientific reasoning](https://sophon.at/capabilities/scientific-reasoning)
- [tool calling](https://sophon.at/capabilities/tool-calling)
- [translation](https://sophon.at/capabilities/translation)

## Papers

- ["I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration](https://sophon.at/papers/pwc-83620): A goal-level attribution framework called CoTrace is introduced to analyze how large language models contribute to goal shaping in human-AI…
- ["What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing](https://sophon.at/papers/pwc-56202): Users prefer adaptive feedback mechanisms in in-car AI assistants, starting with high transparency to build trust and then reducing verbosi…
- [$E^3$-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference](https://sophon.at/papers/arxiv-2605.27428): Edge deployments of generative inference increasingly face two practical realities: per-device per-model performance is often unknown at de…
- [BlockFormer: Transformer-based inference from interaction maps](https://sophon.at/papers/arxiv-2605.21617): Inference from interaction maps, such as centromere identification from genome-wide chromosome conformation capture techniques -- notably H…
- ["Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs](https://sophon.at/papers/arxiv-2602.04729): We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art…
- ["Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills](https://sophon.at/papers/arxiv-2602.06547): LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper sc…
- ["I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents](https://sophon.at/papers/arxiv-2606.00497): Deceptive web content, widely instantiated across the internet and commonly known as social-engineering attacks, manipulates auton…
- ["I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise](https://sophon.at/papers/arxiv-2606.01811): Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quanti…
- ["Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models](https://sophon.at/papers/arxiv-2605.31401): Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-res…
- [(1D) Ordered Tokens Enable Efficient Test-Time Search](https://sophon.at/papers/pwc-57557): Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling.
- [*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation](https://sophon.at/papers/arxiv-2602.15778): Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approache…
- [2-ASP(Q) programs with weak constraints: Complexity and efficient implementation](https://sophon.at/papers/arxiv-2605.27338): ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with…
- [2Mamba2Furious: Linear in Complexity, Competitive in Accuracy](https://sophon.at/papers/pwc-56213): Researchers enhance linear attention by simplifying Mamba-2 and improving its architectural components to achieve near-softmax accuracy whi…
- [360DVO: Deep Visual Odometry for Monocular 360-Degree Camera](https://sophon.at/papers/pwc-62984): A deep learning-based monocular omnidirectional visual odometry system uses a distortion-aware spherical feature extractor and differentiab…
- [3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence](https://sophon.at/papers/pwc-55258): 3D CoCa v2 enhances 3D captioning by combining contrastive vision-language learning with spatially-aware 3D scene encoding and test-time se…
- [3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis](https://sophon.at/papers/pwc-57481): Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications.
- [3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model](https://sophon.at/papers/pwc-56857): Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including…
- [3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models](https://sophon.at/papers/arxiv-2603.07751): Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tas…
- [4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video](https://sophon.at/papers/pwc-56691): 4D reconstruction of equine family (e.g.
- [4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding](https://sophon.at/papers/pwc-82921): 4DThinker enables vision-language models to perform dynamic spatial reasoning through 4D latent mental imagery, using scalable data generat…

## Organizations

- [01.AI (零一万物)](https://sophon.at/organizations/01-ai): Chinese AI startup founded by Kai-Fu Lee; publisher of the Yi open-weights model family.
- [AI Futures Project](https://sophon.at/organizations/ai-futures-project): Independent research non-profit founded by Daniel Kokotajlo (ex-OpenAI) studying AI-progress forecasts; publishers of AI 2027.
- [AI21 Labs](https://sophon.at/organizations/ai21-labs): AI21 Labs is an organization.
- [AI4Finance Foundation](https://sophon.at/organizations/ai4finance-foundation): Open-source nonprofit building FinRL and FinGPT for reinforcement-learning-based financial trading.
- [ARC Prize](https://sophon.at/organizations/arc-prize): $1M+ open competition to solve the ARC-AGI benchmark; run by the ARC Prize Foundation (François Chollet & Greg Kamradt).
- [ARC Prize Foundation](https://sophon.at/organizations/arc-prize-foundation): Non-profit operating the ARC Prize competition and ARC-AGI benchmark series.
- [AT&T](https://sophon.at/organizations/at-t): US telecom giant; historic parent of Bell Labs and AT&T Labs Research.
- [Abacus.AI](https://sophon.at/organizations/abacus-ai): AI platform startup offering enterprise LLM tooling; co-creator with Yann LeCun's NYU group of the LiveBench contamination-resistant LLM be…
- [Abugoot](https://sophon.at/organizations/prime-hub-team-abugoot): Abugoot is a team.
- [AfterQuery](https://sophon.at/organizations/afterquery)
- [Aider](https://sophon.at/organizations/aider-ai): Open-source AI pair-programming CLI created by Paul Gauthier; also operates the widely cited Aider Polyglot coding leaderboard.
- [Airbus](https://sophon.at/organizations/airbus): European aerospace manufacturer; runs AI/ML research for aviation, defense, and space applications.
- [Airbyte](https://sophon.at/organizations/airbyte): Open-source data integration platform / ELT tool; YC W21; relevant as career-history for individuals in the graph.
- [Alibaba](https://sophon.at/organizations/alibaba): Alibaba is an organization.
- [Alibaba DAMO Academy](https://sophon.at/organizations/alibaba-damo-academy): Alibaba's global research institute; covers ML, NLP, robotics, and quantum computing.
- [Alibaba Qwen (Tongyi Qianwen)](https://sophon.at/organizations/alibaba-qwen): Alibaba's AI research division publishing the Qwen series, the most prolific open-weights frontier model family.
- [Alignment Research Center (ARC)](https://sophon.at/organizations/alignment-research-center): AI alignment non-profit founded by Paul Christiano in 2021; its evals team spun out to become METR in late 2023.
- [All Hands AI](https://sophon.at/organizations/all-hands-ai): Startup commercializing the OpenHands (formerly OpenDevin) open-source agent framework.
- [Allen Institute for AI](https://sophon.at/organizations/ai2): Allen Institute for AI is an organization.
- [Allen Institute for AI (Ai2)](https://sophon.at/organizations/allen-ai): Seattle non-profit AI research institute publishing fully open models, datasets, and the OLMo / Tulu / Dolma family.

## API

- [Full detail](https://sophon.at/llms-full.txt): this catalog expanded - descriptions, top scores, and relationships inlined
- [API & CLI docs](https://sophon.at/about/api): how to query the catalog from code, agents, or the terminal
- [Read API index](https://sophon.at/api/v1): JSON endpoints for every entity type
- Per-entity JSON: https://sophon.at/api/v1/{evals|models|tools|leaderboards|organizations|people|capabilities|papers}/{slug}
- Full text / PDF for papers: https://sophon.at/api/v1/papers/{slug}/text and /pdf
- Search: https://sophon.at/api/v1/search?q={query}
- CLI: `npm i -g sophon-at` (command `sophon`); output is JSON when piped, and `sophon help --json` returns a machine-readable command manifest
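
The URL patterns listed above can be assembled programmatically. The following is a minimal Python sketch, not an official client: the helper names (`entity_url`, `paper_text_url`, `search_url`, `fetch_json`) are hypothetical, and the shape of the JSON responses is not described here, so `fetch_json` simply returns whatever the server sends, assuming the body is JSON.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Base path and entity types are taken from the URL patterns listed above.
BASE = "https://sophon.at/api/v1"
ENTITY_TYPES = {
    "evals", "models", "tools", "leaderboards",
    "organizations", "people", "capabilities", "papers",
}


def entity_url(entity_type: str, slug: str) -> str:
    """Build the per-entity JSON URL: /api/v1/{type}/{slug}."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    return f"{BASE}/{entity_type}/{quote(slug, safe='')}"


def paper_text_url(slug: str) -> str:
    """Build the full-text URL for a paper: /api/v1/papers/{slug}/text."""
    return entity_url("papers", slug) + "/text"


def search_url(query: str) -> str:
    """Build the search URL: /api/v1/search?q={query} (query is URL-encoded)."""
    return f"{BASE}/search?{urlencode({'q': query})}"


def fetch_json(url: str, timeout: float = 10.0):
    """Fetch a URL and decode the body as JSON.

    Assumption: the endpoint returns a JSON body; its schema is not
    documented in this file, so the result is returned undecorated.
    """
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints URLs; call fetch_json(...) yourself to hit the network.
    print(entity_url("evals", "gpqa-diamond"))
    print(paper_text_url("arxiv-2602.04729"))
    print(search_url("swe bench"))
```

Slugs are the final path segment of each catalog link, for example `gpqa-diamond` from https://sophon.at/evals/gpqa-diamond. Keeping URL construction separate from fetching makes the helpers easy to check without network access.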

## More

- [All evals](https://sophon.at/evals)
- [All leaderboards](https://sophon.at/leaderboards)
- [All models](https://sophon.at/models)
- [All tools](https://sophon.at/tools)
- [All capabilities](https://sophon.at/capabilities)
- [Recommender](https://sophon.at/recommender): pick a capability, get ranked RL envs and datasets that lift it
- [Sitemap](https://sophon.at/sitemap.xml)