Sophon

A curated, cross-linked catalog of AI evals, the tools that lift scores on them, and the models, leaderboards, and papers behind them.

Sophon answers one question that no single GitHub README, paper PDF, or leaderboard answers on its own: my model has a gap on eval X - what did other teams use to close it, and where does that show up on a leaderboard?

Eval results, the datasets and RL environments that improve them, and the rankings they feed are normally scattered across dozens of places. Sophon pulls them into one readable surface and connects them by hand - evals ↔ tools ↔ models ↔ leaderboards ↔ capabilities, plus the papers and labs behind each one.

How it's organized

Everything hangs off the model being measured. The pieces around it are easy to confuse, so Sophon keeps them strict:

  • Evals are the tests - run a model, get one score (SWE-bench, GPQA, MMLU-Pro, IFEval).
  • Tools are what you train or build with to raise that score: RL environments, fine-tuning datasets, and scaffolds.
  • Leaderboards are published rankings of many models - single-benchmark boards, aggregated indices, or human-preference Elo.
  • Capabilities are the skills evals test - coding, math, agents, reasoning. They're the thread the Recommender follows from a weakness to the tools that fix it.

What you can do

  • Recommender - pick a capability and get RL environments and datasets ranked by how many of its evals they're known to lift.
  • Compare - put models head-to-head across their eval scores and leaderboard standings.
  • Feed - everything new (evals, tools, models, papers) in real ship-date order.
  • Browse the full evals, tools, models, and leaderboards catalogs, press ⌘K to search, and star anything to save it to a collection.
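
As a rough illustration of the Recommender's ranking idea, here is a minimal Python sketch. It is not Sophon's actual implementation: the catalog entries, the names EVALS_BY_CAPABILITY, LIFTS, and recommend, and the alphabetical tie-break are all assumptions made for the example. It only shows the counting step, ranking each tool by how many of a capability's evals it is recorded as lifting.

```python
# Hypothetical capability -> evals links (placeholder names, not real catalog data).
EVALS_BY_CAPABILITY = {
    "coding": {"coding-eval-1", "coding-eval-2"},
    "math": {"math-eval-1"},
}

# Hypothetical tool -> evals it is recorded as lifting (made-up edges).
LIFTS = {
    "rl-env-a": {"coding-eval-1", "coding-eval-2"},
    "dataset-b": {"coding-eval-1", "math-eval-1"},
    "dataset-c": {"math-eval-1"},
}


def recommend(capability):
    """Rank tools by how many of the capability's evals they lift."""
    target = EVALS_BY_CAPABILITY.get(capability, set())
    scored = [(tool, len(evals & target)) for tool, evals in LIFTS.items()]
    # Drop tools that lift none of the capability's evals.
    ranked = [(tool, count) for tool, count in scored if count > 0]
    # Highest count first; ties broken alphabetically (an assumed rule).
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


if __name__ == "__main__":
    for tool, count in recommend("coding"):
        print(f"{tool}: lifts {count} coding eval(s)")
```

Counting over explicit eval links, rather than matching free-form tags, is what makes a ranking like this possible: every lift is an edge between a tool and a specific eval.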

Curated, not crowdsourced

Sophon is editorial, not a wiki. Entries are imported from canonical open sources - arXiv, Semantic Scholar, Hugging Face, Papers With Code, LMArena, Artificial Analysis, and the RL-environment registries at Prime Intellect, Nous Research, Meta, and OpenReward - then linked into a controlled graph rather than free-form tags.

Every artefact keeps its provenance: look for the source badge under any README, abstract, or model card. We host full content only where its license clearly allows it, and link out otherwise - see the content policy and attribution.

What's next

Live today: the full catalog, the Recommender, model comparison, the activity feed, search, and collections. On the roadmap:

  • Personal accounts - private notes, saved dashboards, and following the authors, labs, and sources you care about.
  • Hosted eval runs with your own API keys, and embeddable score badges.
  • A public API over the catalog and its links.

More

Start exploring from the home page.