# Sophon

> Curated catalog of AI evaluations, leaderboards, models, papers, and the RL environments and datasets that lift eval scores.

Sophon maps each AI eval to the tools (RL environments, SFT/DPO datasets, scaffolds) that improve model scores on it, alongside the models, papers, organizations, and people behind them. Every page is server-rendered; a read-only JSON API is documented at /api/v1. An expanded version with full descriptions, key scores, and relationships inlined is at https://sophon.at/llms-full.txt.

## Evals

- [AIME 2024: Problems from the American Invitational Mathematics Examination](https://sophon.at/evals/aime2024): Official 15-problem high-school math olympiad-track exam used by labs as a fresh, contamination-resistant math reasoning benchmark.
- [GPQA Diamond](https://sophon.at/evals/gpqa-diamond): Graduate-level physics, chemistry, and biology multiple-choice questions written by PhDs and verified to be Google-proof.
- [HumanEval](https://sophon.at/evals/humaneval): 164 hand-written Python programming problems with unit tests, the original LLM code-generation benchmark from OpenAI's Codex paper.
- [MATH-500](https://sophon.at/evals/math-500): 500-problem subset of the Hendrycks MATH competition-math benchmark, popularized by OpenAI's PRM800K work as a standard evaluation slice.
- [MMLU-Pro](https://sophon.at/evals/mmlu-pro): Harder, reasoning-focused successor to MMLU with 10 answer choices and curated questions resistant to lucky guessing.
- [Massive Multitask Language Understanding (MMLU)](https://sophon.at/evals/mmlu): 57-subject multiple-choice exam testing broad world knowledge and reasoning across academic and professional domains.
- [Arena-Hard](https://sophon.at/evals/arena-hard): 500 challenging real-user prompts mined from Chatbot Arena and graded by a strong LLM judge for pairwise win rate.
- [BIG-Bench Hard (BBH)](https://sophon.at/evals/bbh): 23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.
- [GSM8K](https://sophon.at/evals/gsm8k): 8.5k grade-school math word problems requiring multi-step arithmetic reasoning to reach a single numeric answer.
- [LegalBench](https://sophon.at/evals/legalbench): 162 collaboratively curated legal-reasoning tasks across rule-recall, issue-spotting, application, and interpretation - the standard legal…
- [LiveBench](https://sophon.at/evals/livebench): Rolling contamination-free benchmark that updates questions monthly across math, coding, reasoning, language, instruction-following, and da…
- [LiveCodeBench](https://sophon.at/evals/livecodebench): Rolling competitive-programming benchmark that scrapes LeetCode / AtCoder / Codeforces problems after a known cutoff to fight contamination.
- [Mostly Basic Python Problems (MBPP)](https://sophon.at/evals/mbpp): 974 short crowd-sourced Python tasks with three unit tests each, used alongside HumanEval as a baseline code-generation benchmark.
- [SWE-Lancer](https://sophon.at/evals/swe-lancer): 1,488 real freelance software-engineering tasks from Upwork worth $1M total in payouts, evaluating models on end-to-end paid developer work.
- [SWE-bench](https://sophon.at/evals/swe-bench): 2,294 real GitHub issues from 12 popular Python repos that require an agent to produce a patch passing the project's test suite.
- [SWE-bench Lite](https://sophon.at/evals/swe-bench-lite): 300-issue subset of SWE-bench focused on functional bug fixes that are easier to evaluate - used for fast iteration before full SWE-bench r…
- [SWE-bench Verified](https://sophon.at/evals/swe-bench-verified): 500 human-validated SWE-bench tasks confirmed solvable from the issue alone, with non-flaky test suites - the most-reported agentic coding…
- [TruthfulQA](https://sophon.at/evals/truthfulqa): 817 questions targeting common human misconceptions, measuring whether a model gives factually true answers or repeats popular falsehoods.
- [Vals.ai Legal Evals](https://sophon.at/evals/vals-legal-evals): Vals.ai's proprietary suite of legal-domain benchmarks (contract review, hallucination tests, LegalBench Pro) used by law firms to procure…
- [AA-Omniscience](https://sophon.at/evals/aa-omniscience): Artificial Analysis's broad-knowledge benchmark - thousands of curated factual questions spanning specialized domains - designed to test ha…
- [AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models](https://sophon.at/evals/agieval): AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to hum…
- [AIME 2025: Problems from the American Invitational Mathematics Examination](https://sophon.at/evals/aime2025): A benchmark for evaluating AI's ability to solve challenging mathematics problems from the 2025 AIME - a prestigious high school mathematic…
- [AIME2024](https://sophon.at/evals/openreward-generalreasoning-aime2024): Problems from the American Invitational Mathematics Examination (AIME) 2024.
- [AIME2025](https://sophon.at/evals/openreward-generalreasoning-aime2025): Problems from the American Invitational Mathematics Examination (AIME) 2025-I & II.
- [AIME2026](https://sophon.at/evals/openreward-generalreasoning-aime2026): Problems from the American Invitational Mathematics Examination (AIME) 2026-I & II.
- [AIR Bench: AI Risk Benchmark](https://sophon.at/evals/air-bench): A safety benchmark evaluating language models against risk categories derived from government regulations and company policies.
- [ALFRED](https://sophon.at/evals/alfred): 3D-simulated household tasks driven by language instructions and egocentric video - the visual sibling of ALFWorld.
- [ALFWorld](https://sophon.at/evals/alfworld): Embodied household-task benchmark that aligns TextWorld text commands with ALFRED 3D scenes, testing whether agents can transfer from abstr…
- [ANIMA: Animal Norms In Moral Assessment](https://sophon.at/evals/anima): Evaluates the quality of a model's moral reasoning about animal welfare across 13 ethical dimensions.
- [APE: Attempt to Persuade Eval](https://sophon.at/evals/ape): Measures a model's willingness to attempt persuasion on harmful, controversial, and benign topics. The key metric is not persuasion effecti…

## Tools (RL envs, datasets, scaffolds)

- [VF Openbench RL Env (Community)](https://sophon.at/tools/prime-aarush-vf-openbench): Environment for single-turn tasks in OpenBench
- [Agent Bench RL Env (Prime Community)](https://sophon.at/tools/prime-prime-community-mini-swe-agent-bench): Benchmarking model performance on SWE Bench in the Mini SWE Agent harness.
- [BrowserGym](https://sophon.at/tools/browsergym): ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and…
- [NuminaMath](https://sophon.at/tools/numina-math): An 860k-problem competition-math dataset with detailed solutions, the open community's go-to corpus for training math-specialized LLMs.
- [OpenThoughts](https://sophon.at/tools/openthoughts): A fully-open distillation of long DeepSeek-R1 reasoning traces - the community's flagship "open R1" SFT corpus for reasoning models.
- [Tülu 3 SFT Mixture](https://sophon.at/tools/tulu-3-sft-mixture): Allen AI's flagship open SFT mixture combining new persona-driven prompts with curated public data for post-training a frontier-quality ins…
- [WizardLM Evol-Instruct](https://sophon.at/tools/wizardlm-evol-instruct): Microsoft's "Evol-Instruct" recipe - automatically rewriting simple instructions into harder, more diverse ones using an LLM evolver.
- [Agent PLUS RL Env (Prime Intellect)](https://sophon.at/tools/prime-primeintellect-mini-swe-agent-plus): Mini SWE Agent Plus environment for solving SWE issues inside Prime Sandboxes.
- [Aya Dataset](https://sophon.at/tools/aya-dataset): Cohere For AI's massively multilingual instruction dataset covering 65 languages, built by a 3,000-person open-science collaboration.
- [Bigbench BBH RL Env (Prime Community)](https://sophon.at/tools/prime-prime-community-bigbench-bbh): Big Bench + BBH implementation
- [COT Theater RL Env (Community)](https://sophon.at/tools/prime-danruif-cot-theater): Reward-hacking sprint env. Four pseudo-CoT surface proxies and four true reasoning metrics on GSM8K, with all eight logged on every rollout…
- [Certainty Collapse RL Env (Community)](https://sophon.at/tools/prime-cardan05-certainty-collapse): Reward Hacking Sprint: does optimizing self-certainty (RLIF-style intrinsic reward) cause models to be confidently wrong on math? GSM8K, Ll…
- [Compositional Hacks RL Env (Community)](https://sophon.at/tools/prime-danruif-compositional-hacks): Reward-hacking sprint env that plants two hidden rewards at once on GSM8K to probe whether one dominates or both emerge proportionally.
- [Context Needle RL Env (Community)](https://sophon.at/tools/prime-stelioszach-long-context-needle): Needle-in-haystack - locate a target sentence in a long document.
- [Deepconf RL Env (Community)](https://sophon.at/tools/prime-tonic-deepconf): DeepConf environment for confidence-aware LLM reasoning evaluation
- [Deepswe RL Env (Prime Intellect)](https://sophon.at/tools/prime-primeintellect-deepswe): DeepSWE environment for solving SWE issues inside Prime Sandboxes.
- [Discover Gsm8k RL Env (Community)](https://sophon.at/tools/prime-stochi0-discover-gsm8k): GSM8K rubric-discovery environment: learn rubric_fn from (input, response, score) examples
- [Emergence Prediction RL Env (Community)](https://sophon.at/tools/prime-danruif-emergence-prediction): Reward-hacking sprint env. The planted token-frequency hack is held fixed within a run, and planted_token varies across runs to test whethe…
- [Emoji HACK RL Env (Community)](https://sophon.at/tools/prime-danruif-emoji-hack): Reward-hacking sprint env. A planted emoji-density hack on GSM8K, used to test whether GRPO can amplify a behavior with effectively zero ba…
- [FH Aviary RL Env (Prime Community)](https://sophon.at/tools/prime-prime-community-fh-aviary): Future House Aviary wrapper for verifiers - Scientific reasoning environments with tools
- [Formatting Emergence RL Env (Community)](https://sophon.at/tools/prime-danruif-formatting-emergence): Reward-hacking sprint env. A planted markdown-formatting hack on GSM8K, with hidden-reward weight and task difficulty as the two experiment…
- [GPQA Diamond RL Env (Community)](https://sophon.at/tools/prime-anshu-gpqa-diamond): GPQA Diamond: A Graduate-Level Google-Proof Q&A Benchmark
- [GPQA RL Env (Prime Intellect)](https://sophon.at/tools/prime-primeintellect-gpqa): GPQA evaluation environment
- [Gsm8k Olmes RL Env (Community)](https://sophon.at/tools/prime-pmahdavi-gsm8k-olmes): GSM8K evaluation matching OLMES tulu_3_dev_no_safety methodology
- [Gsm8k RL Env (Community)](https://sophon.at/tools/prime-will-gsm8k): GSM8K environment

## Models

- [Qwen3.7Plus](https://sophon.at/models/qwen3.7-plus): Qwen3.7Plus is an AI model from Alibaba.
- [Step 3.7 Flash](https://sophon.at/models/step-3-7-flash)
- [Claude Opus 4.8](https://sophon.at/models/claude-opus-4-8): Claude Opus 4.8 (Adaptive Reasoning, Max Effort) is an AI model from Anthropic.
- [MiniCPM5-1B (Non-reasoning)](https://sophon.at/models/minicpm5-1b-non-reasoning): MiniCPM5-1B (Non-reasoning) is an AI model from OpenBMB.
- [Command A+](https://sophon.at/models/command-a-plus): Command A+ is an AI model from Cohere.
- [Gemini 3.5 Flash](https://sophon.at/models/gemini-3.5-flash): Gemini 3.5 Flash is an AI model from Google (Alphabet Inc.).
- [Qwen3.7 Max](https://sophon.at/models/qwen3.7-max): Qwen3.7Max is an AI model from Alibaba.
- [JT-35B-Flash](https://sophon.at/models/jt-35b-flash): JT-35B-Flash is an AI model from China Mobile.
- [MiniCPM-V 4.6 1.3B](https://sophon.at/models/minicpm-v4-6-1-3b): MiniCPM-V 4.6 1.3B is an AI model from OpenBMB.
- [Ring-2.6-1T](https://sophon.at/models/ring-2-6-1t): Ring-2.6-1T is an AI model from InclusionAI.
- [GPT-5.5 Instant (May 2026)](https://sophon.at/models/gpt-5-5-instant-05-26): GPT-5.5 Instant (May 2026) is an AI model from OpenAI.
- [Grok 4.3](https://sophon.at/models/grok-4.3): Grok 4.3 is an AI model from xAI.
- [Granite 4.1 30B](https://sophon.at/models/granite-4-1-30b): Granite 4.1 30B is an AI model from IBM.
- [Granite 4.1 3B](https://sophon.at/models/granite-4-1-3b): Granite 4.1 3B is an AI model from IBM.
- [Granite 4.1 8B](https://sophon.at/models/granite-4.1-8b): granite-4.1-8b is an AI model from IBM, released with open weights.
- [Mistral Medium 3.5](https://sophon.at/models/mistral-medium-3-5): Mistral Medium 3.5 is an AI model from Mistral AI.
- [Nemotron 3 Nano Omni 30B A3B Reasoning](https://sophon.at/models/nemotron-3-nano-omni-30b-a3b): Nemotron 3 Nano Omni 30B A3B Reasoning is an AI model from NVIDIA.
- [DeepSeek V4 Flash](https://sophon.at/models/deepseek-v4-flash): DeepSeek V4 Flash is an AI model from DeepSeek, released with open weights.
- [DeepSeek V4 Pro](https://sophon.at/models/deepseek-v4-pro): DeepSeek's April 2026 next-gen open-weights flagship - 1.6T-total / 49B-active MoE with 1M context and DeepSeek Sparse Attention.
- [GPT-5.5](https://sophon.at/models/gpt-5.5): GPT-5.5 is an AI model from OpenAI.
- [Hy3 preview](https://sophon.at/models/hy3-preview): Hy3-preview is an AI model from Tencent.
- [Ling-2.6-1T](https://sophon.at/models/ling-2-6-1t): Ling-2.6-1T is an AI model from InclusionAI.
- [MiMo-V2.5](https://sophon.at/models/mimo-v2-5-0424): MiMo-V2.5 is an AI model from Xiaomi.
- [MiMo-V2.5-Pro](https://sophon.at/models/mimo-v2.5-pro): mimo-v2.5-pro is an AI model from Xiaomi, released with open weights.
- [Qwen3.6 27B](https://sophon.at/models/qwen3-6-27b): Qwen3.6 27B is an AI model from Alibaba.

## Leaderboards

- [Arena - Document](https://sophon.at/leaderboards/lmarena-document): Crowdsourced document model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Document Style Control](https://sophon.at/leaderboards/lmarena-document-style-control): Crowdsourced document style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Image Edit](https://sophon.at/leaderboards/lmarena-image-edit): Crowdsourced image edit model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Image to Video](https://sophon.at/leaderboards/lmarena-image-to-video): Crowdsourced image to video model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Search](https://sophon.at/leaderboards/lmarena-search): Crowdsourced search model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Search Style Control](https://sophon.at/leaderboards/lmarena-search-style-control): Crowdsourced search style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Text](https://sophon.at/leaderboards/lmarena-text): Crowdsourced text model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Text Style Control](https://sophon.at/leaderboards/lmarena-text-style-control): Crowdsourced text style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Text to Image](https://sophon.at/leaderboards/lmarena-text-to-image): Crowdsourced text to image model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Text to Video](https://sophon.at/leaderboards/lmarena-text-to-video): Crowdsourced text to video model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Video Edit](https://sophon.at/leaderboards/lmarena-video-edit): Crowdsourced video edit model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Vision Style Control](https://sophon.at/leaderboards/lmarena-vision-style-control): Crowdsourced vision style control model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena - Webdev](https://sophon.at/leaderboards/lmarena-webdev): Crowdsourced webdev model ratings from LMArena. Elo-style scores computed from pairwise human preference votes.
- [Arena Coding](https://sophon.at/leaderboards/lmarena-coding): LMArena subcategory ranking models on user pairwise votes restricted to coding prompts.
- [Arena Hard Prompts](https://sophon.at/leaderboards/lmarena-hard-prompts): LMArena subcategory ranking models on a filtered slice of Arena prompts auto-classified as hard along multiple difficulty axes.

## Capabilities

- [bias](https://sophon.at/capabilities/bias)
- [browser use](https://sophon.at/capabilities/browser-use)
- [code editing](https://sophon.at/capabilities/code-editing)
- [code generation](https://sophon.at/capabilities/code-generation)
- [common sense](https://sophon.at/capabilities/common-sense)
- [computer use](https://sophon.at/capabilities/computer-use)
- [debugging](https://sophon.at/capabilities/debugging)
- [embodied](https://sophon.at/capabilities/embodied)
- [factual recall](https://sophon.at/capabilities/factual-recall)
- [hallucination](https://sophon.at/capabilities/hallucination)
- [harmful content](https://sophon.at/capabilities/harmful-content)
- [image understanding](https://sophon.at/capabilities/image-understanding)
- [instruction following](https://sophon.at/capabilities/instruction-following)
- [jailbreak resistance](https://sophon.at/capabilities/jailbreak-resistance)
- [legal reasoning](https://sophon.at/capabilities/legal-reasoning)
- [llm judging](https://sophon.at/capabilities/llm-judging)
- [logic](https://sophon.at/capabilities/logic)
- [long context](https://sophon.at/capabilities/long-context)
- [math](https://sophon.at/capabilities/math)
- [multi turn dialog](https://sophon.at/capabilities/multi-turn-dialog)
- [multilingual](https://sophon.at/capabilities/multilingual)
- [planning](https://sophon.at/capabilities/planning)
- [retrieval](https://sophon.at/capabilities/retrieval)
- [safety](https://sophon.at/capabilities/safety)
- [scientific reasoning](https://sophon.at/capabilities/scientific-reasoning)
- [tool calling](https://sophon.at/capabilities/tool-calling)
- [translation](https://sophon.at/capabilities/translation)

## Papers

- ["I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration](https://sophon.at/papers/pwc-83620): A goal-level attribution framework called CoTrace is introduced to analyze how large language models contribute to goal shaping in human-AI…
- ["What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing](https://sophon.at/papers/pwc-56202): Users prefer adaptive feedback mechanisms in in-car AI assistants, starting with high transparency to build trust and then reducing verbosi…
- [$E^3$-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference](https://sophon.at/papers/arxiv-2605.27428): Edge deployments of generative inference increasingly face two practical realities: per-device per-model performance is often unknown at de…
- [BlockFormer: Transformer-based inference from interaction maps](https://sophon.at/papers/arxiv-2605.21617): Inference from interaction maps, such as centromere identification from genome-wide chromosome conformation capture techniques -- notably H…
- ["Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs](https://sophon.at/papers/arxiv-2602.04729): We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art…
- ["Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills](https://sophon.at/papers/arxiv-2602.06547): LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper sc…
- ["I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents](https://sophon.at/papers/arxiv-2606.00497): Deceptive web content, widely instantiated across the internet and commonly known as social-engineering attacks, manipulates auton…
- ["I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise](https://sophon.at/papers/arxiv-2606.01811): Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quanti…
- ["Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models](https://sophon.at/papers/arxiv-2605.31401): Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-res…
- [(1D) Ordered Tokens Enable Efficient Test-Time Search](https://sophon.at/papers/pwc-57557): Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling.
- [*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation](https://sophon.at/papers/arxiv-2602.15778): Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approache…
- [2-ASP(Q) programs with weak constraints: Complexity and efficient implementation](https://sophon.at/papers/arxiv-2605.27338): ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with…
- [2Mamba2Furious: Linear in Complexity, Competitive in Accuracy](https://sophon.at/papers/pwc-56213): Researchers enhance linear attention by simplifying Mamba-2 and improving its architectural components to achieve near-softmax accuracy whi…
- [360DVO: Deep Visual Odometry for Monocular 360-Degree Camera](https://sophon.at/papers/pwc-62984): A deep learning-based monocular omnidirectional visual odometry system uses a distortion-aware spherical feature extractor and differentiab…
- [3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence](https://sophon.at/papers/pwc-55258): 3D CoCa v2 enhances 3D captioning by combining contrastive vision-language learning with spatially-aware 3D scene encoding and test-time se…
- [3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis](https://sophon.at/papers/pwc-57481): Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications.
- [3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model](https://sophon.at/papers/pwc-56857): Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including…
- [3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models](https://sophon.at/papers/arxiv-2603.07751): Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tas…
- [4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video](https://sophon.at/papers/pwc-56691): 4D reconstruction of equine family (e.g.
- [4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding](https://sophon.at/papers/pwc-82921): 4DThinker enables vision-language models to perform dynamic spatial reasoning through 4D latent mental imagery, using scalable data generat…

## Organizations

- [01.AI (零一万物)](https://sophon.at/organizations/01-ai): Chinese AI startup founded by Kai-Fu Lee; publisher of the Yi open-weights model family.
- [AI Futures Project](https://sophon.at/organizations/ai-futures-project): Independent research non-profit founded by Daniel Kokotajlo (ex-OpenAI) studying AI-progress forecasts; publishers of AI 2027.
- [AI21 Labs](https://sophon.at/organizations/ai21-labs): AI21 Labs is an organization.
- [AI4Finance Foundation](https://sophon.at/organizations/ai4finance-foundation): Open-source nonprofit building FinRL and FinGPT for reinforcement-learning-based financial trading.
- [ARC Prize](https://sophon.at/organizations/arc-prize): $1M+ open competition to solve the ARC-AGI benchmark; run by the ARC Prize Foundation (François Chollet & Greg Kamradt).
- [ARC Prize Foundation](https://sophon.at/organizations/arc-prize-foundation): Non-profit operating the ARC Prize competition and ARC-AGI benchmark series.
- [AT&T](https://sophon.at/organizations/at-t): US telecom giant; historic parent of Bell Labs and AT&T Labs Research.
- [Abacus.AI](https://sophon.at/organizations/abacus-ai): AI platform startup offering enterprise LLM tooling; co-creator with Yann LeCun's NYU group of the LiveBench contamination-resistant LLM be…
- [Abugoot](https://sophon.at/organizations/prime-hub-team-abugoot): Abugoot is a team.
- [AfterQuery](https://sophon.at/organizations/afterquery)
- [Aider](https://sophon.at/organizations/aider-ai): Open-source AI pair-programming CLI created by Paul Gauthier; also operates the widely cited Aider Polyglot coding leaderboard.
- [Airbus](https://sophon.at/organizations/airbus): European aerospace manufacturer; runs AI/ML research for aviation, defense, and space applications.
- [Airbyte](https://sophon.at/organizations/airbyte): Open-source data integration platform / ELT tool; YC W21; relevant as career-history for individuals in the graph.
- [Alibaba](https://sophon.at/organizations/alibaba): Alibaba is an organization.
- [Alibaba DAMO Academy](https://sophon.at/organizations/alibaba-damo-academy): Alibaba's global research institute; covers ML, NLP, robotics, and quantum computing.
- [Alibaba Qwen (Tongyi Qianwen)](https://sophon.at/organizations/alibaba-qwen): Alibaba's AI research division publishing the Qwen series, the most prolific open-weights frontier model family.
- [Alignment Research Center (ARC)](https://sophon.at/organizations/alignment-research-center): AI alignment non-profit founded by Paul Christiano in 2021; its evals team spun out to become METR in late 2023.
- [All Hands AI](https://sophon.at/organizations/all-hands-ai): Startup commercializing the OpenHands (formerly OpenDevin) open-source agent framework.
- [Allen Institute for AI](https://sophon.at/organizations/ai2): Allen Institute for AI is an organization.
- [Allen Institute for AI (Ai2)](https://sophon.at/organizations/allen-ai): Seattle non-profit AI research institute publishing fully open models, datasets, and the OLMo / Tulu / Dolma family.

## API

- [Full detail](https://sophon.at/llms-full.txt): this catalog expanded - descriptions, top scores, and relationships inlined
- [API & CLI docs](https://sophon.at/about/api): how to query the catalog from code, agents, or the terminal
- [Read API index](https://sophon.at/api/v1): JSON endpoints for every entity type
- Per-entity JSON: https://sophon.at/api/v1/{evals|models|tools|leaderboards|organizations|people|capabilities|papers}/{slug}
- Full text / PDF for papers: https://sophon.at/api/v1/papers/{slug}/text and /pdf
- Search: https://sophon.at/api/v1/search?q={query}
- CLI: `npm i -g sophon-at` (command `sophon`); output is JSON when piped, `sophon help --json` returns a machine-readable command manifest

## More

- [All evals](https://sophon.at/evals)
- [All leaderboards](https://sophon.at/leaderboards)
- [All models](https://sophon.at/models)
- [All tools](https://sophon.at/tools)
- [All capabilities](https://sophon.at/capabilities)
- [Recommender](https://sophon.at/recommender): pick a capability, get ranked RL envs and datasets that lift it
- [Sitemap](https://sophon.at/sitemap.xml)
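
## Example: building API URLs

A minimal sketch of how the URL patterns listed under API might be assembled from code, using only the Python standard library. The helper names (`entity_url`, `search_url`, `fetch_json`) are hypothetical and not part of any Sophon client. The JSON response shape is not described in this file, so `fetch_json` returns whatever the server sends without assuming any fields.

```python
import json
import urllib.parse
import urllib.request

# Base path and entity types are copied from the API section of this file.
BASE = "https://sophon.at/api/v1"
ENTITY_TYPES = {
    "evals", "models", "tools", "leaderboards",
    "organizations", "people", "capabilities", "papers",
}


def entity_url(entity_type: str, slug: str) -> str:
    """Build the per-entity JSON URL: {BASE}/{entity_type}/{slug}."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    # Percent-encode the slug so unusual characters cannot alter the path.
    return f"{BASE}/{entity_type}/{urllib.parse.quote(slug, safe='')}"


def search_url(query: str) -> str:
    """Build the search URL: {BASE}/search?q={query}."""
    return f"{BASE}/search?" + urllib.parse.urlencode({"q": query})


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and parse the body as JSON (hypothetical helper).

    The response schema is an unknown here; inspect the result before
    relying on any particular key.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URLs; pass one to fetch_json() to perform a request.
    print(entity_url("evals", "gpqa-diamond"))
    print(search_url("swe bench"))
```

Keeping URL construction separate from the network call lets the builders be checked offline, and validating the entity type up front catches typos before a request is made.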