What's new

Model · 1d ago

Kimi K3

Kimi

1.0M · $6/M · 59 tok/s

GPQA Diamond: 94

Vals Index: 75

CorpFin v2: 72

Model · 2d ago

Inkling (xhigh)

Thinking Machines

$2.57/M · 83 tok/s

GPQA Diamond: 87

Vals Index: 49

SciCode: 46

Paper · 2d ago

Generative Compilation: On-the-Fly Compiler Feedback as AI Generates Code

Languages with rich static semantics, such as Rust, provide stronger guarantees for AI-generated code, but their strictness makes generation more difficult. Off-the-shelf compilers can provide useful feedback post-generation, but do not guide intermediate generation steps,…

Language Modeling

1 star

Paper · 2d ago

AgentCompass: A Unified Evaluation Infrastructure for Agent Capabilities

As Large Language Models (LLMs) evolve into autonomous agents, the need for unified evaluation infrastructure becomes critical. However, current evaluation pipelines remain highly fragmented and tightly coupled, hindering reproducibility and causing redundant engineering.

Agents, Language Modeling

32 stars

0.1/h

Paper · 2d ago

Discrete Diffusion Models: A Unified Framework from Tokenization to Generation

Discrete denoising diffusion models (DDMs) have recently emerged as a compelling alternative to autoregressive (AR) modeling for discrete data, offering parallel generation and iterative global refinement capabilities.

4 stars

0.1/h

Paper · 3d ago

Boogu-Image-0.1: Boosting Open-Source Unified Multimodal Understanding and Generation

We introduce Boogu-Image-0.1, an open-source unified multimodal understanding and generation model family, comprising Base, Turbo, Edit, and Edit-Turbo variants. It delivers competitive performance in high-quality text-to-image generation, fast inference, instruction-based…

Image generation

757 stars

1.0/h

Paper · 3d ago

Self-Improvements in Modern Agentic Systems: A Survey

Self-improving autonomous agents are moving from research prototypes to deployed systems. The primary goal is controllable evolution, or adaptation, from experience with minimal or even no human input.

Agents

27 stars

0.8/h

Paper · 3d ago

KnowAct-GUIClaw: Know Deeply, Act Perfectly, Personal GUI Assistant with Self-Evolving Memory and Skill

OpenClaw has emerged as a leading agent framework for complex task automation, yet it lacks sufficient cross-platform GUI interaction support and a well-built self-evolution mechanism.

53 stars

1.0/h

Paper · 3d ago

PalmClaw: A Native On-Device Agent Framework for Mobile Phones

Large Language Model (LLM) agents have moved beyond generating responses to executing multi-step tasks by calling tools, observing the results, and iteratively deciding the next action. Most agent systems run on desktops or servers, which support tool use and task automation.

Language Modeling

1.1k stars

0.1/h

Paper · 3d ago

Hallo4D: Multi-Modal Hallucination Mitigation for Consistent Spatio-Temporal Generation

While recent advances in 3D generation have enabled impressive visual synthesis, existing methods often rely on 2D diffusion supervision without explicit mechanisms for geometric consistency, leading to spatial hallucinations such as duplicated structures and misaligned…

3D generation, Language Modeling

13 stars

Paper · 3d ago

Function-Aware Fill-in-the-Middle as Mid-Training for Coding Agent Foundation Models

Coding agents must integrate external tool returns into ongoing reasoning - a capability that standard left-to-right pretraining on code exposes only in its forward direction.

Coding Agents, Continuous Control

11 stars

0.2/h

Paper · 4d ago

MAGIC: Transition-Aware Generation of Navigable Multi-Scene Game Worlds with Large Language Models

Multi-scene navigation (clearing an objective in one bounded space and then crossing a portal into the next) is a defining feature of contemporary 3D games, but authoring it is laborious: every portal must have consistent endpoints on both sides, each interior must remain…

Image Understanding, Language Modeling

0 stars

Paper · 4d ago

Vinci2: Providing Proactive Assistance in Continuous Egocentric Videos

When should an intelligent assistant speak up without being asked? Continuous egocentric video offers rich, evolving context that enables a new form of assistance: one that is proactive rather than merely reactive.

7 stars

0.2/h

Paper · 5d ago

Modernizing HEBO: a robust Bayesian optimization baseline for practical heteroskedastic and non-stationary problems

Bayesian optimization is increasingly used to guide data-efficient experimentation in chemistry, materials science, and related laboratory settings, but its practical performance depends strongly on how well surrogate-model assumptions match the geometry and noise structure of…

0 stars

Paper · 5d ago

Towards Autonomous and Auditable Medical Imaging Model Development

Large language model (LLM) agents are beginning to automate machine learning engineering (MLE) by coupling planning, code execution, debugging, and empirical feedback. Translating this capability to medical imaging remains difficult because each task imposes modality-specific…

Exploration and Sparse Rewards, Language Modeling, Medical Imaging, Reinforcement Learning

12 stars

Paper · 6d ago

SynthDocBench: Controlled Benchmark for Long-Context Visual Document Understanding

Vision language models (VLMs) have achieved strong performance on visual document understanding benchmarks such as DocVQA, ChartQA, and MMLongBench-Doc. However, real-world documents combine multiple factors such as length, layout complexity, modality, and question difficulty,…

Document Understanding, Image Understanding, Language Modeling

4 stars

0.0/h

Paper · 7d ago

A Sovereign, Open-Source Foundation Model for German and English

We present Soofi S 30B-A3B, a sovereign, open-source Mixture-of-Experts (MoE) hybrid Mamba Transformer foundation model for German and English. Its hybrid design activates only 3B of 30B parameters per token and keeps the inference cache near-constant as context grows, giving it…

Language Modeling, Reasoning

43 stars

0.2/h

Paper · 7d ago

Phone Segmentation and Recognition through Phonological Activation Mapping

Phone segmentation and recognition are inherently related tasks, yet modern approaches typically model them separately. We argue that phonetic structure is already latent in the representations of self-supervised speech models (S3Ms), and one only needs to steer them to solve…

3 stars

Model · 8d ago

Muse Spark 1.1

Meta Platforms

1.0M · $2/M · 117 tok/s

GPQA Diamond: 90

MedScribe: 89

OSWorld-Verified: 81

Model · 8d ago

JT-4.1 Flash 236B A21B

China Mobile

GPQA Diamond: 85

SciCode: 38

Humanity's Last Exam (HLE): 16

Model · 8d ago

GPT-5.6 Luna (Non-reasoning)

OpenAI

$2.25/M · 165 tok/s

GPQA Diamond: 65

SciCode: 40

Humanity's Last Exam (HLE): 7

Model · 8d ago

GPT-5.6 Luna (low)

OpenAI

$2.25/M · 170 tok/s

GPQA Diamond: 84

SciCode: 46

Humanity's Last Exam (HLE): 19

Model · 8d ago

GPT-5.6 Terra (Non-reasoning)

OpenAI

$5.63/M · 101 tok/s

GPQA Diamond: 75

SciCode: 45

Humanity's Last Exam (HLE): 11

Model · 8d ago

GPT-5.6 Luna (medium)

OpenAI

$2.25/M · 184 tok/s

GPQA Diamond: 86

SciCode: 46

Humanity's Last Exam (HLE): 25

Model · 8d ago

GPT-5.6 Terra (low)

OpenAI

$5.63/M · 117 tok/s

GPQA Diamond: 84

τ²-bench (Tau²-bench): 61

IFBench: 60

Model · 8d ago

GPT-5.6 Sol (Non-reasoning)

OpenAI

$11/M · 45 tok/s

GPQA Diamond: 79

SciCode: 47

Humanity's Last Exam (HLE): 16

Model · 8d ago

GPT-5.6 Terra (medium)

OpenAI

$5.63/M · 102 tok/s

GPQA Diamond: 87

τ²-bench (Tau²-bench): 73

IFBench: 62

Model · 8d ago

GPT-5.6 Luna (high)

OpenAI

$2.25/M · 181 tok/s

GPQA Diamond: 89

SciCode: 51

Humanity's Last Exam (HLE): 32

Model · 8d ago

GPT-5.6 Terra (high)

OpenAI

$5.63/M · 110 tok/s

GPQA Diamond: 90

τ²-bench (Tau²-bench): 78

IFBench: 64

Model · 8d ago

GPT-5.6 Luna (xhigh)

OpenAI

$2.25/M · 197 tok/s

GPQA Diamond: 90

SciCode: 50

Humanity's Last Exam (HLE): 36

Model · 8d ago

GPT-5.6 Sol (low)

OpenAI

$11/M · 48 tok/s

GPQA Diamond: 90

τ²-bench (Tau²-bench): 76

IFBench: 67

Model · 8d ago

GPT-5.6 Terra (xhigh)

OpenAI

$5.63/M · 121 tok/s

GPQA Diamond: 91

τ²-bench (Tau²-bench): 80

IFBench: 66

Model · 8d ago

GPT-5.6 Sol (medium)

OpenAI

$11/M · 54 tok/s

GPQA Diamond: 93

τ²-bench (Tau²-bench): 81

IFBench: 70

Model · 8d ago

GPT-5.6 Sol (high)

OpenAI

$11/M · 46 tok/s

GPQA Diamond: 93

τ²-bench (Tau²-bench): 83

IFBench: 69

Model · 8d ago

GPT-5.6 Sol (xhigh)

OpenAI

$11/M · 56 tok/s

GPQA Diamond: 93

τ²-bench (Tau²-bench): 85

IFBench: 71

Paper · 8d ago

CLAP: Direct VLM-to-VLA Adaptation via Language-Action Grounding

Vision-language-action models (VLAs) inherit semantic capabilities from pretrained VLMs, yet large-scale post-training on robot data and architectural modifications can reshape the backbone so extensively that it becomes difficult to isolate what the VLM contributes to control.

Robotics

5 stars

Paper · 8d ago

Long-Horizon-Terminal-Bench: Testing the Limits of Agents on Long-Horizon Terminal Tasks with Dense Reward-Based Grading

AI agents have become capable of autonomously completing short, well-specified tasks. However, existing terminal benchmarks largely focus on simple problems that finish within minutes and are evaluated only by their final outcome.

Agents, Exploration and Sparse Rewards, Reinforcement Learning

102 stars

1.1/h

Paper · 8d ago

Ideas Have Genomes: Benchmarking Scientific Lineage Reasoning and Lineage-Grounded Idea Generation

Scientific ideas rarely start from a blank page. They inherit mechanisms, repair known limitations, and recombine pieces of earlier work, much like biological genomes. Current benchmarks still say little about whether AI systems can follow this inheritance structure.

Language Modeling

30 stars

Paper · 8d ago

DrugGen 2: A disease-aware language model for enhancing drug discovery

Current computational approaches for drug design typically focus on generating molecules conditioned on specific targets or general molecular properties, often neglecting the influence of disease context on target behavior and therapeutic outcomes.

Language Modeling, Reinforcement Learning

5 stars

Paper · 8d ago

Blind-Spots-Bench: Evaluating Blind Spots in Multimodal Models

Modern AI models achieve strong performance on many established benchmarks, yet they still fail on tasks that humans find almost trivial, such as manipulating a string or drawing a dog with five legs.

4 stars

Paper · 8d ago

UniClawBench: A Universal Benchmark for Proactive Agents on Real-World Tasks

The rapid development of large language models and multimodal large language models has accelerated the emergence of proactive agents capable of operating everyday tools and assisting users in real-world environments.

Exploration and Sparse Rewards, Image Understanding, Language Modeling, Reinforcement Learning

31 stars

Paper · 8d ago

Remember When It Matters: Proactive Memory Agent for Long-Horizon Agents

In long-horizon tasks, decision-relevant state is often scattered across an expanding trajectory, while the action agent must surface it and act. As trajectories grow, task requirements, environment facts, prior attempts, diagnoses, and open subgoals can be buried in the context…

14 stars

Paper · 8d ago

MuScriptor: An Open Model for Multi-Instrument Music Transcription

Existing methods for automatic music transcription are often limited to single-instrument recordings or fail on complex, real music mixes. Although previous work utilizes synthetic training data, the resulting models generalize poorly, leading to largely unusable transcription…

Reinforcement Learning

593 stars

0.5/h

Paper · 8d ago

CausalDS: Benchmarking Causal Reasoning in Data-Science Agents

Large language models (LLMs) increasingly act as integrated data-science agents, combining abstract reasoning with advanced tool use. Yet the relevant benchmark landscape largely divides into symbolic causal reasoning benchmarks without realistic data analysis or data analysis…

Language Modeling

1 star

Model · 9d ago

Grok 4.5

xAI

500K · $3/M · 119 tok/s

GPQA Diamond: 93

MedScribe: 87

TaxEval v2: 72

Paper · 9d ago

KronQ: LLM Quantization via Kronecker-Factored Hessian

Post-training quantization (PTQ) is a widely adopted technique for compressing large language models (LLMs) without retraining. Existing second-order PTQ methods, including GPTQ, construct quantization objectives exclusively from input activation statistics, effectively assuming…

Language Modeling

6 stars

Paper · 9d ago

Linear Attention Architectures: Mechanisms, Trade-offs, and Cross-Layer Routing

Self-attention lets each token retrieve information from the full context, but its quadratic cost in sequence length limits training and inference at long context. This paper presents a comparative study of softmax attention and four recent recurrent linear-attention…

Language Modeling

22 stars

Paper · 9d ago

Jet-Long: Efficient Long-Context Extension with Dynamic Bifocal RoPE

Modern LLMs are increasingly deployed in long-context applications such as retrieval-augmented generation, repository-level coding, and agentic workflows whose accumulated reasoning and tool traces routinely push the input an order of magnitude past the pretraining window,…

Continuous Control, Language Modeling

8 stars

Paper · 9d ago

MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models

Medicine is inherently multimodal, requiring clinicians to synthesize information across diverse data streams. Yet the development of multimodal foundation models is constrained by limited access to large-scale, high-quality clinical data.

Image Understanding, Language Modeling

1 star

Paper · 9d ago

Accurate, Interdisciplinary and Transparent Structure-property Understanding with Deep Native Structural Reasoning

Structure-property relationships are foundational to biology, chemistry and materials science, where function, reactivity and physical response emerge from spatial, chemical and periodic organization.

Language Modeling, Reasoning

18 stars

Paper · 10d ago

SPEAR: A Simulator for Photorealistic Embodied AI Research

Interactive simulators have become powerful tools for training embodied agents and generating synthetic visual data, but existing photorealistic simulators suffer from limited generality, programmability, and rendering speed.

Continuous Control

529 stars

Paper · 10d ago

AgentLens: Production-Assessed Trajectory Reviews for Coding Agent Evaluation

We present AgentLens, a production-assessed benchmark for interactive code agents. Most code-agent benchmarks reduce a run to a single bit -- did the task pass? -- but the people who actually use these agents experience the entire trajectory: how the agent follows instructions,…

Coding Agents, Language Modeling

4 stars

Paper · 10d ago

Token-Based Dual-view Fusion and Adaptation of Large Vision Models for Breast Cancer Classification

Accurate breast cancer classification from mammography requires effective integration of complementary information from craniocaudal (CC) and mediolateral oblique (MLO) views, which provide a more complete characterization of breast abnormalities.

Medical Imaging

0 stars

Paper · 10d ago

PluraMath: Extending Mathematical Reasoning Evaluation Beyond High-Resource Languages

Mathematical reasoning has become a central task for evaluating and tuning reasoning Large Language Models (LLMs), yet existing benchmarks remain heavily biased toward high-resource languages, with English and Chinese dominating both pre-training corpora and evaluation suites.

Language Modeling

3 stars

Model · 11d ago

Hy3

Tencent

262K · 48 tok/s

GPQA Diamond: 90

SciCode: 48

Humanity's Last Exam (HLE): 32

Paper · 11d ago

Where to cut, how deep: BPE and Unigram-LM on chemistry SMILES

Every chemical language model reading SMILES begins with a tokenizer, yet the field has inherited byte-pair encoding (BPE) from natural language with little scrutiny. In natural language, BPE's principal alternative, Unigram-LM, is known to build structurally different…

Language Modeling

2 stars

Paper · 11d ago

EdgeBench: Unveiling Scaling Laws of Learning from Real-World Environments

Pretraining scaling laws reveal that model capability improves predictably with data and compute. But learning from real-world environments after deployment remains far less understood.

Agents, Reinforcement Learning

354 stars

0.2/h

Paper · 11d ago

PAST-TIDE: Prototype-Anchored Statement Tuning with Topic-Invariant Normalization for Stance Detection

We introduce PAST-TIDE, our stance detection system addressing both subtasks of the StanceNakba Shared Task at NakbaNLP@LREC-COLING 2026. The main idea is statement tuning.

1 star

Paper · 11d ago

LLM-as-a-Verifier: A General-Purpose Verification Framework

Scaling pre-training, post-training, and test-time compute have become the central paradigms for improving the capabilities of LLMs. In this work, we identify verification, the ability to determine the correctness of a solution, as a new scaling axis.

Coding Agents, Language Modeling, Procedural Generalization, Reinforcement Learning

557 stars

0.4/h

Paper · 11d ago

Search Beyond What Can Be Taught: Evolving the Knowledge Boundary in Agentic Visual Generation

Visual generators excel at rendering, but they confidently fabricate what they do not know. User requests are unbounded, evolving, and deeply long-tailed: new characters, trending entities, post-cutoff events, and more.

Image generation

22 stars

0.5/h

Paper · 11d ago

GaP: A Graph-as-Policy Multi-Agent Self-Learning Harness For Variational Automation Tasks

For robots to work reliably in commercial and industrial applications, can recent advances in agentic coding systems combine interpretable robot programming with the open-world adaptability of model-free policies? We focus on "Variational Automation" (VA), a class of…

73 stars

0.1/h

Paper · 11d ago

Multiplayer Interactive World Models with Representation Autoencoders

We introduce the first multiplayer world model for highly dynamic environments governed by complex physical interactions. Whereas single-player world models treat the other agents as part of the environment, ours conditions on the action streams of multiple agents, learning to…

World Models

409 stars

0.2/h

Paper · 11d ago

Do All Visual Tokens Matter Equally? Object-Evidence Preserving Token Merging for Vision-Language Retrieval

Multi-vector vision-language retrieval preserves fine-grained visual evidence through maximum-similarity late interaction, but dense image-side tokens make storage and scoring expensive.

4 stars

Paper · 12d ago

AI Wizards at EXIST 2026: Hierarchical Soft-Label Learning for Multimodal Sexism Identification in Memes

We present the AI Wizards submission to EXIST 2026 for multimodal sexism identification in memes. The task is composed of three increasingly difficult subtasks. We model them hierarchically as conditional soft-label prediction over empirical annotator distributions.

1 star

Paper · 12d ago

ResearchStudio-Idea: An Evidence-Grounded Research-Ideation Skill Suite from ML Conference Outcomes

Large language models have made research ideation increasingly accessible, yet effective idea development requires more than generating candidate directions. Researchers must ground a problem in current literature, identify meaningful bottlenecks, differentiate from existing…

Language Modeling

1.3k stars

2.5/h

Paper · 12d ago

RoboDojo: A Unified Sim-and-Real Benchmark for Comprehensive Evaluation of Generalist Robot Manipulation Policies

Generalist robot manipulation policies have advanced rapidly, yet existing benchmarks remain limited in systematically evaluating their capabilities. Many rely on simple, short-horizon, or skill-narrow tasks with limited capability coverage, and are often conducted only in…

Instruction Following

248 stars

0.4/h

Paper · 12d ago

dOPSD: On-Policy Self-Distillation for Diffusion Language Models

Diffusion large language models (dLLMs) generate text by iteratively denoising a masked sequence, offering a parallel alternative to autoregressive models, but eliciting strong reasoning through post-training remains difficult: supervised fine-tuning is off-policy and suffers…

Language Modeling, Reinforcement Learning

8 stars

Paper · 12d ago

UI-MOPD: Multi-Platform On-Policy Distillation for Continual GUI Agent Learning

Recent advances in multimodal foundation models and agent systems have driven GUI agents from single-platform task execution toward cross-platform interaction. However, building multi-platform GUI agents remains challenging.

Computer Use Agents

42 stars

Paper · 12d ago

Speaker-Disentangled Chunk-Wise Regression for Syllabic Tokenization

Unsupervised syllabic tokenization aims to learn discrete syllabic tokens that capture latent linguistic content-related structure from raw speech. Recent syllabic tokenization methods employ teacher-student distillation of the pretrained HuBERT to organize latent speech frame…

Language Modeling

46 stars

Paper · 13d ago

Can Dialects Be Steered Like Languages? Sparse Neurons and Distributed Directions in Arabic LLMs

A key challenge in Arabic NLP is the scarcity of dialectal data relative to Modern Standard Arabic (MSA), causing LLMs to overproduce MSA and struggle with dialectally accurate generation.

Language Modeling

0 stars

Paper · 13d ago

MANCE: Manifold Aware Concept Erasure

Concept erasure aims to remove a target concept from a representation while preserving the other information encoded in it. This is difficult because representations encode many concepts that are often correlated with the erasure target, so removing the target risks damaging…

Language Modeling

9 stars

Paper · 13d ago

TESSERA v2: Scaling Pixel-wise Earth Foundation Models

Pixel-wise Earth-observation (EO) foundation models are now achieving state-of-the-art performance via generated spatial embeddings. However, how these models scale and how best to spend a pretraining budget remain poorly understood.

644 stars

0.0/h

Paper · 13d ago

OmniOpt: Taxonomy, Geometry, and Benchmarking of Modern Optimizers

Optimizer selection for large-scale model training has become a system-level design decision constrained jointly by compute, memory, tuning budget, and task diversity, yet the landscape of over one hundred methods remains fragmented.

Image Classification, Language Modeling

35 stars

Paper · 13d ago

CGGS: Consistency-Augmented Geometric Gaussian Splatting for Ego-centric 3D Scene Generation

Challenges remain in ego-centric 3D scene generation due to limited view overlap and the dominant influence of individual perspectives on scene interpretation. These factors hinder the creation of viewpoint-consistent and semantically aligned visual content, as well as the…

3D generation

20 stars

Paper · 13d ago

Bridging Interleaved Multi-Modal Reasoning as a Unified Decision Process

Unified multi-modal models (UMMs) have shown promising interleaved text-image reasoning capabilities, yet effectively optimizing such multi-turn generation via reinforcement learning (RL) remains an open challenge.

Image generation, Image Restoration, Image Understanding, Language Modeling

10 stars

Paper · 14d ago

PraMem: Practice-derived Experiential Memory for Long-horizon Behavior Prediction

Long-horizon behavior prediction aims to infer a user's next action based on a lengthy historical sequence, playing a crucial role in the field of artificial intelligence. The rise of large language models (LLMs) offers a promising direction for sequential behavior prediction, yet LLMs…

Language Modeling

2 stars

Paper · 14d ago

Taste-aware music retrieval from audio embeddings

Crossmodal correspondences between sound and taste are well established in psychology and neuroscience, but largely absent from content-based multimedia retrieval. We formalise taste-from-audio prediction as a content-based music information retrieval benchmark over a…

0 stars

Paper · 14d ago

Vidu S1: A Real-Time Interactive Video Generation Model

We introduce Vidu S1, a real-time interactive video generation model supporting voice control of digital characters. Users can control video generation content at any moment through voice instructions.

Video generation

196 stars

0.0/h

Paper · 14d ago

SkillOpt-Lite: Better and Faster Agent Self-evolution via One Line of Vibe

While skill optimization for autonomous agents has gained traction, existing methods rely on complex pipelines. This leaves a fundamental question unaddressed: What constitutes a minimal viable pipeline for skill optimization, where every component is justified by theory or…

Agents, Coding Agents, Exploration and Sparse Rewards, Reinforcement Learning

90 stars

Paper · 14d ago

Hierarchical Sparse Attention Done Right: Toward Infinite Context Modeling

Scaling modern large language models (LLMs) to long contexts is limited by the quadratic computation cost and poor length extrapolation of dense attention. Chunk-wise sparse attention offers a promising alternative, but all existing methods fall short of full attention because…

Language Modeling

107 stars

0.2/h

Paper · 14d ago

Parallelized Autoregressive Decoding for Omni-Modal Dense Video Captioning

Dense video captioning aims to generate temporally grounded descriptions of video events, benefiting both event-level video understanding and generation. In this domain, autoregressive video large language models have emerged as a prevalent paradigm due to their strong…

Language Modeling, Omni models, Video classification

30 stars

Paper · 15d ago

Gemma 4 Technical Report

We introduce Gemma 4, a new generation of open-weight, natively multimodal language models in the Gemma model family. Designed to advance compute efficiency and reasoning, the Gemma 4 model suite features dense and Mixture-of-Experts architectures, ranging from 2.3B to 31B…

Agents, Audio understanding, Coding Agents, Image Understanding

5.6k stars

0.2/h

Paper · 15d ago

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale.

Robotics

8 stars

Paper · 15d ago

AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

Vision-Language Models (VLMs) have demonstrated immense promise in Spatio-Temporal Video Grounding (STVG). However, current evaluation protocols are largely confined to zero-shot assessments on general, daily-life benchmarks.

Image Understanding, Language Modeling

3 stars

Paper · 15d ago

Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

LLM agents increasingly perform autonomous actions through external tools, leading to complex and evolving safety risks. However, existing safety testing targets expert-designed safety violations, and the corresponding outcomes are evaluated by hard-coded rules, making them…

Language Modeling

3 stars

Paper · 15d ago

WARP: Weight-Space Analysis for Recovering Training Data Portfolios

Foundation models are routinely released to the public, yet the data recipes used to train them -- such as domain mixture weights that determine how different sources are sampled -- are rarely disclosed.

2 stars

Paper · 15d ago

MemSyco-Bench: Benchmarking Sycophancy in Agent Memory

Memory has emerged as a cornerstone of modern LLM-based agents, supporting their evolution from single-turn assistants to long-term collaborators. However, memory is not always beneficial: retrieved memories often induce a critical issue of sycophancy, causing agents to…

Language Modeling

13 stars

Paper · 15d ago

Program-as-Weights: A Programming Paradigm for Fuzzy Functions

Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility,…

Language Modeling

194 stars

Paper · 15d ago

AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents

Memory for a long-horizon LLM agent is a contract about what each future decision is allowed to see. The simplest contract appends past observations, tool calls, and reflections to every prompt, which makes prior context easy to access but also turns it into a jumbled mixture in…

Language Modeling

114 stars

Paper · 15d ago

PACE: A Proxy for Agentic Capability Evaluation

Evaluating LLM agents on benchmarks like SWE-Bench and GAIA can be expensive, time-consuming, and requires complex infrastructure. A single evaluation can cost thousands of dollars and take days to complete.

Coding Agents, Language Modeling

15 stars

0.0/h

Paper · 15d ago

Denser ≠ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a particularly attractive approach.

Reinforcement Learning

6 stars

Paper · 15d ago

AgenticDataBench: A Comprehensive Benchmark for Data Agents

Data science aims to derive actionable insights from heterogeneous raw data, unlocking the value of the massive amounts of data generated in modern society. Automating this process is essential to reducing labor-intensive efforts for data scientists and enabling scalable…

Language Modeling

34 stars

Paper · 16d ago

RuleChef: Grounding LLM Task Knowledge in Human-Editable Rules

We present RuleChef, a framework that uses large language models (LLMs) to generate executable rules for NLP tasks such as text classification, Named Entity Recognition (NER), or relation extraction.

Language Modeling, Named Entity Recognition, Relation Extraction, Text classification

29 stars

Paper · 16d ago

Scaling Laws for Grid-Based Approximate Nearest Neighbor Search in High Dimensions

Grid-based approaches to approximate nearest neighbor (ANN) search have been absent from modern scaling analyses. We present a systematic characterization of a multiprobe grid algorithm with respect to dataset size $N$ and dimensionality $d$.

1 star

Paper · 16d ago

LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives

Vision-language pretraining remains dominated by contrastive objectives, whereas vision-only self-supervised learning has largely adopted non-contrastive methods. At the same time, the role of vision-language encoders has shifted: they are increasingly deployed not as zero-shot…

Image Classification, Image segmentation, Image Understanding, Language Modeling

49 stars

Paper · 16d ago

Measuring the Gap Between Human and LLM Research Ideas

LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from those of human researchers? To characterize this gap, we build a…

Language Modeling

9 stars

Paper · 16d ago

AutoMem: Automated Learning of Memory as a Cognitive Skill

Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill.

Exploration and Sparse Rewards, Language Modeling

120 stars

0.2/h

Paper · 16d ago

Multi-Turn Agentic Scientific Literature Search via Workflow Induction

Scientific literature search often requires more than retrieving papers from a single query: users' intents are underspecified, preference-dependent, and evolve through interaction.

13 stars

Paper · 16d ago

The State-Prediction Separation Hypothesis

Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance.

Classic Language Modeling, Language Modeling

5 stars

Paper · 16d ago

TiRex-2: Generalizing TiRex to Multivariate Data and Streaming

We introduce TiRex-2, a recurrent xLSTM-based time series foundation model that generalizes the univariate TiRex to multivariate forecasting with both past and future covariates.

Time-series forecasting

96stars

0.0/h

Paper16d ago

Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination

Accelerating materials discovery requires AI systems that can generate scientifically valid hypotheses through multi-step, domain-grounded reasoning. Standard large language models often produce fluent but weakly traceable responses to open-ended materials design problems,…

Exploration and Sparse RewardsLanguage ModelingReasoningReinforcement Learning

8stars

Paper16d ago

Valdi: Value Diffusion World Models

World models can enable Model Predictive Control (MPC), but this requires dynamics prediction that is both fast enough for online use and expressive enough to represent uncertain futures.

World Models

7stars

Paper16d ago

Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts

Vision-Language-Action (VLA) models often fail to perform the same learned tasks under environmental shifts, such as changes in camera pose and shifts to a different but similar robot (e.g., from Panda to UR5e).

Robotics

15stars

Paper16d ago

VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement

As video corpora continue to expand in both scale and task complexity, there is increasing demand for approaches that retrieve relevant videos from large-scale corpora (inter-video reasoning) and subsequently perform fine-grained, query-conditioned tasks (intra-video reasoning)…

Video classification

29stars

Model17d ago

Claude Sonnet 5 (Non-reasoning, High Effort)

Anthropic

200K$4/M59 tok/s

GPQA Diamond80

SciCode49

Humanity's Last Exam (HLE)18

Model17d ago

Claude Sonnet 5

Anthropic

1M$4/M89 tok/s

GPQA Diamond91

LiveBench - Math90

LiveBench - Reasoning87

Paper17d ago

GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

Three of the most popular methods for training language models to reason look like three different tricks. They are not. All three adjust a single number: the standard deviation, which reflects how much a prompt's sampled answers disagree.

Language Modeling

3stars

Paper17d ago

Breaking Failure Cascades: Step-Aware Reinforcement Learning for Medical Multimodal Reasoning

Recent multimodal large language models have shown great promise in clinical image reasoning, but existing post-training pipelines remain predominantly outcome-centric, relying on final answer correctness or sequence-level preferences.

Image UnderstandingLanguage ModelingQuestion AnsweringReinforcement Learning

3stars

Paper17d ago

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one's own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with high confidence, fail to recognize knowledge…

Language ModelingReinforcement Learning

24stars

Paper17d ago

When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors

While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., incorrectly citing or omitting table values, despite understanding the table structure.

Language ModelingTabular Learning

2stars

Paper17d ago

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

Training language models (LMs) remains a highly human-intensive process, even as frontier language model agents become increasingly capable at software engineering and other long-horizon tasks.

Language Modeling

13stars

Paper17d ago

BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

Speculative decoding accelerates inference by using a lightweight draft model to generate candidate tokens in parallel, which are then verified by the target model, enabling lossless acceleration.

62stars

0.0/h

Paper17d ago

SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference

Large language models increasingly operate over long contexts, where the KV cache becomes a dominant memory bottleneck: its size grows linearly with sequence length and must be retained throughout decoding, making full GPU caching prohibitively expensive without compression.

Language Modeling

3stars

Paper17d ago

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide guidance that is too sparse, failing to tell the model how good its intermediate actions are.

Language Modeling

6stars

Paper17d ago

3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

Hierarchical Vision-Language-Action (VLA) models decouple high-level planning from low-level control to improve generalization in robot manipulation. Recent work in this paradigm uses 2D end-effector trajectories predicted by a Vision-Language Model (VLM) as explicit guidance…

Image UnderstandingLanguage ModelingRobotics

12stars

Paper17d ago

HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications.

AgentsMedical Imaging

23stars

Model18d ago

LongCat 2.0

LongCat

GPQA Diamond78

SciCode35

Humanity's Last Exam (HLE)32

Paper18d ago

LUMOS: A Semantic Operating-System Layer for Accessibility-Grounded AI Agents

Current operating systems expose interfaces optimized for human users but not for AI agents. Humans benefit from pixels, icons, windows, visual grouping, mouse movement, and keyboard shortcuts; AI agents instead need compact semantic state, grounded actions, and reliable…

AgentsComputer Use AgentsLanguage ModelingOCR

1 star

Paper18d ago

Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent

We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and scaling heterogeneous agent abilities.

471stars

0.1/h

Paper18d ago

Morphing into Hybrid Attention Models

Hybrid attention models improve long-context efficiency by retaining only a subset of full-attention layers and replacing the remaining layers with linear attention. However, the effectiveness of Transformer-to-hybrid conversion critically depends on which layers preserve full…

Retrieval

10stars

Paper18d ago

SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions

We introduce SWE-Interact, a new testbed for evaluating coding agents on multi-turn, interactive, user-driven software engineering tasks. Existing frontier SWE benchmarks typically provide complete requirements upfront and evaluate agents on autonomous implementation.

Coding Agents

19stars

0.1/h

Paper18d ago

DOPD: Dual On-policy Distillation

On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to…

Image UnderstandingLanguage Modeling

0stars

Paper18d ago

Beyond Drug Discovery: The Nanotechnology Molecular Optimization (NMO) Benchmark

Generative molecular design is shaped by simple proxy benchmarks for drug-like properties and models pretrained on large pharmaceutical datasets. This combination yields strong benchmark metrics but limits transferability to domains structurally distinct from drug discovery.

2stars

Paper18d ago

Automating the Design of Embodied Agent Architectures

Embodied agents are typically built as hand-designed compositions of perception, memory, planning, and action modules. This modularity exposes a large architectural design space, but current systems still rely on researcher intuition to choose where information is stored, how…

Question Answering

37stars

Paper18d ago

Little Brains, Big Feats: Exploring Compact Language Models

While large language models have been dominating the research landscape recently, small language models remain highly relevant across various domains; yet, they receive far less attention.

Language Modeling

2stars

Paper18d ago

SWE-Together: Evaluating Coding Agents in Interactive User Sessions

Most coding-agent benchmarks are static: an agent receives a complete task description up front and is judged only by its final code. Real coding assistance is interactive, with users clarifying goals, adding constraints, and correcting mistakes over multiple turns.

Coding AgentsLanguage Modeling

45stars

0.0/h

Paper18d ago

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

In real-world applications, guardrails are often expected to identify unsafe user-model interactions according to application-specific safety policies, rather than relying on predefined risk taxonomies.

Language Modeling

5stars

Paper18d ago

GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots

Data, as the fundamental substrate of modern intelligence, has greatly driven the development of current foundation models. Naturally, researchers aim to extend this paradigm to the domain of GUI agents, hoping to build strong GUI agents through a similar paradigm.

Computer Use AgentsReinforcement Learning

3stars

Paper18d ago

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

The materials science literature encodes decades of experimental knowledge in figures, yet this visual record remains locked away and inaccessible to AI at scale. The core difficulty is structural: most scientific figures are compound, with a single caption describing multiple…

Language Modeling

1 star

Paper19d ago

Multi-Block Diffusion Language Models

Block Diffusion Language Models (BD-LMs) improve diffusion-based text generation with KV caching and flexible-length generation. A natural next step is to extend them from Single-Block Diffusion (SingleBD) to Multi-Block Diffusion (MultiBD), where a running-set of consecutive…

Language Modeling

25stars

Paper19d ago

One Scene, Two Depths: Probing Geometric Ambiguity in Monocular Foundation Models

A faithful 3D world representation should account for layered geometry, where a single camera ray may contain multiple visible and geometrically valid surfaces. Monocular depth estimation, however, reduces this structure to one scalar depth per pixel.

Depth estimation

19stars

Paper19d ago

OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents.

Computer Use Agents

193stars

0.1/h

Paper19d ago

Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation

Vision-language dataset distillation (VLDD) compresses a large image-text paired dataset into a small set of synthetic pairs that can efficiently train contrastive vision-language models under strict data and compute budgets.

Image UnderstandingLanguage Modeling

0stars

Paper19d ago

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Video understanding is a fundamental capability for multimodal intelligence, and recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance on Video Question Answering (VideoQA) benchmarks.

Image UnderstandingLanguage ModelingQuestion AnsweringVideo classification

18stars

Paper19d ago

Hierarchical Experimentalist Agents

Large language models (LLMs) are increasingly used to take actions in the real world and support human decision-making, yet most agents rely on parametric knowledge, fixed post-training data, retrieval, or search.

Language Modeling

2stars

Paper19d ago

PolicyGuard: A Dialogue-Grounded Sub-Agent Verifier for Policy Adherence in LLM Agents

LLM agents handle user requests on behalf of organizations through tool calls and must follow the company policies stated in their system prompts. Prior work approaches this as a safeguarding problem: external checks that block non-compliant agent actions.

Language Modeling

0stars

Paper20d ago

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

Would experience designing faster GPU kernels also help close in on a long-standing open mathematical conjecture? Large Language Models (LLMs) integrated into evolutionary search have recently produced state-of-the-art solutions on optimization tasks, including open mathematical…

Language Modeling

20stars

Paper20d ago

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

LLM agents are expected to act over multiple turns, using search, browsing interfaces, and terminal tools to complete user goals. Yet not every goal is well specified or achievable in the available environment.

Language ModelingQuestion Answering

35stars

Paper20d ago

When More Sampling Hurts: The Modal Ceiling and Correlation Ceiling of Test-Time Scaling

People overthink; language models over-sample, and the extra effort can talk both into a worse answer. Reasoning systems answer a hard question by sampling it many times (test-time scaling), and the more they draw, the more often a correct answer turns up somewhere, so coverage,…

Language Modeling

2stars

Model21d ago

GPT-5.6 Luna

OpenAI

The fastest, most cost-efficient tier of OpenAI's GPT-5.6 family, available in limited preview through the OpenAI API and Codex for approved organizations.

1.1M$2.25/M197 tok/sClosed

GPQA Diamond91

MedScribe84

Vibe Code Bench v1.1 77

Model21d ago

GPT-5.6 Terra

OpenAI

The capable, lower-cost tier of OpenAI's GPT-5.6 family, available in limited preview through the OpenAI API and Codex for approved organizations.

1.1M$5.63/M163 tok/sClosed

GPQA Diamond93

τ²-bench (Tau²-bench)86

MedScribe83

Model21d ago

GPT-5.6 Sol

OpenAI

OpenAI's GPT-5.6 flagship, available in limited preview through the OpenAI API and Codex for approved organizations.

1.1M$11/M66 tok/sClosed

GPQA Diamond94

MedScribe85

τ²-bench (Tau²-bench)85

Paper21d ago

DataComp-VLM: Improved Open Datasets for Vision-Language Models

Building performant Vision-Language Models (VLMs) requires carefully curating large-scale training datasets, yet the community lacks systematic benchmarks for evaluating such curation strategies.

Image UnderstandingLanguage Modeling

41stars

Paper21d ago

TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding.

Computer Use AgentsLanguage Modeling

44stars

Paper21d ago

Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction

Predicting human item difficulty is central to educational assessment, where reliable estimates support fairness and effective test construction. Existing methods often depend on costly human calibration or item-level textual representations, providing limited evidence about the…

Language ModelingReasoning

2stars

Paper21d ago

MultiHashFormer: Hash-based Generative Language Models

Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models.

Language Modeling

5stars

Paper21d ago

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

Pixel-space continuous-token autoregressive (AR) generation directly models images as sequences of raw pixel patches, avoiding discrete tokenization or a separately pretrained tokenizer.

Image generation

5stars

Model22d ago

GPT-5.5 Instant (June 2026)

OpenAI

$11/M

GPQA Diamond82

SciCode49

Humanity's Last Exam (HLE)19

Model1mo ago

GLM-5.2 (Non-reasoning)

Zai

$3.16/M79 tok/s

GPQA Diamond69

SciCode36

Humanity's Last Exam (HLE)8

Model1mo ago

GLM 5.2

Zai

1.0M$2.15/M163 tok/s

τ²-bench (Tau²-bench)99

LiveBench - Math90

GPQA Diamond90

Model1mo ago

Kimi K2.7 Code

Kimi

262K$1.71/M44 tok/s

τ²-bench (Tau²-bench)90

GPQA Diamond90

LiveBench - Reasoning83

Model1mo ago

DiffusionGemma 26B A4B

Google (Alphabet Inc.)

8K

GPQA Diamond67

IFBench59

SciCode34

Model1mo ago

North Mini Code

Cohere

256K86 tok/s

GPQA Diamond76

IFBench58

SciCode38

Model1mo ago

Claude Fable 5

Anthropic

Claude Fable 5 is Anthropic's most capable widely released model, built for the most demanding reasoning and long-horizon agentic work. Shares its base model with the invitation-only Claude Mythos 5.

1M$20/M67 tok/sClosed

τ²-bench (Tau²-bench)99

LiveBench - Math94

GPQA Diamond93

Model1mo ago

Nemotron 3 Ultra

NVIDIA

1M$1.18/M186 tok/s

GPQA Diamond87

τ²-bench (Tau²-bench)83

IFBench81

Model1mo ago

Gemma 4 12B (Non-reasoning)

Google (Alphabet Inc.)

8K$0.15/M130 tok/s

GPQA Diamond66

IFBench45

τ²-bench (Tau²-bench)32

Model1mo ago

Gemma 4 12B (Reasoning)

Google (Alphabet Inc.)

8K$0.15/M128 tok/s

GPQA Diamond75

IFBench74

SciCode38

Model1mo ago

Nex-N2-Pro

Nex AGI

262K$1/M139 tok/s

GPQA Diamond89

τ²-bench (Tau²-bench)82

IFBench66

Model1mo ago

MiniMax M3

MiniMax

1.0M$0.53/M92 tok/s

GPQA Diamond93

τ²-bench (Tau²-bench)89

MedScribe87

Model1mo ago

Qwen3.7 Plus

Alibaba

Qwen3.7 Plus is an AI model from Alibaba.

1M$0.7/M53 tok/sproprietaryClosed

τ²-bench (Tau²-bench)93

GPQA Diamond90

IFBench78

Model1mo ago

Step 3.7 Flash

StepFun

256K$0.44/M385 tok/s

τ²-bench (Tau²-bench)99

GPQA Diamond81

MathArena69

Model1mo ago

LFM2.5-8B-A1B

Liquid AI

343 tok/s

IFBench56

GPQA Diamond51

τ²-bench (Tau²-bench)16

Model1mo ago

Claude Opus 4.8

Anthropic

Claude Opus 4.8 (Adaptive Reasoning, Max Effort) is an AI model from Anthropic.

1M$10/M55 tok/s

τ²-bench (Tau²-bench)94

GPQA Diamond92

MathArena92

Eval1mo ago

Physgym Arena Drhard Public

RL Env

PhysGym Arena DR-hard benchmark for domain-randomized Gym simulator repair

RL Env2 frontier

45

Eval1mo ago

Physgym Arena Medley Public

RL Env

PhysGym Arena medley benchmark for achievable medium-hard Gym simulator repair

RL Env2 frontier

100

92

80

Model1mo ago

HyperNova 60B 2605

Multiverse Computing

$0.07/M344 tok/s

GPQA Diamond73

IFBench66

τ²-bench (Tau²-bench)63

Model1mo ago

MiniCPM5-1B (Reasoning)

OpenBMB

τ²-bench (Tau²-bench)81

IFBench49

GPQA Diamond28

Model1mo ago

MiniCPM5-1B (Non-reasoning)

OpenBMB

MiniCPM5-1B (Non-reasoning) is an AI model from OpenBMB.

τ²-bench (Tau²-bench)82

IFBench35

GPQA Diamond27

Model1mo ago

Command A+

Cohere

Command A+ is an AI model from Cohere.

256K194 tok/s

τ²-bench (Tau²-bench)81

GPQA Diamond76

IFBench74

Model1mo ago

Qwen3.7 Max

Alibaba

Qwen3.7 Max is an AI model from Alibaba.

1M$3.75/M201 tok/sproprietaryClosed

τ²-bench (Tau²-bench)95

GPQA Diamond92

LiveBench - Math85

Model1mo ago

Gemini 3.5 Flash

Google (Alphabet Inc.)

Gemini 3.5 Flash is an AI model from Google (Alphabet Inc.).

1.0M$3.38/M201 tok/sproprietaryClosed

LiveBench - Math88

LiveBench - Language85

GPQA Diamond83

Model2mo ago

JT-35B-Flash

China Mobile

JT-35B-Flash is an AI model from China Mobile.

τ²-bench (Tau²-bench)99

GPQA Diamond83

IFBench42

Eval2mo ago

Apex Shortlist

RL Env

MathArena Apex Shortlist final-answer evaluation environment

RL Env1 frontier

86

77

60

27

21

Eval2mo ago

Devops Troubleshoot

RL Env

Multi-turn DevOps troubleshooting environment with simulated diagnostic tools

RL Env1 frontier

56

Model2mo ago

MiniCPM-V 4.6 1.3B

OpenBMB

MiniCPM-V 4.6 1.3B is an AI model from OpenBMB.

τ²-bench (Tau²-bench)88

GPQA Diamond31

IFBench27

Eval2mo ago

Teaching Env

RL Env

Evaluates LLM explanations of textbook excerpts across pedagogy dimensions including concept coverage, coherence, prerequisite ordering, and origin...

RL Env1 frontier

76

68

Eval2mo ago

Science Gym Chem

RL Env

Science Sim chemistry compound and reaction screening environment

RL Env1 frontier

88

84

66

60

Eval2mo ago

Science Gym Materials

RL Env

Science Sim materials candidate ranking and simulation planning environment

RL Env1 frontier

100

75

68

42

Eval2mo ago

Science Gym Bio

RL Env

Science Sim computational biology protein-variant decision environment

RL Env1 frontier

100

80

70

45

Model2mo ago

Ring-2.6-1T

InclusionAI

Ring-2.6-1T is an AI model from InclusionAI.

262K$0.85/M120 tok/s

τ²-bench (Tau²-bench)92

GPQA Diamond86

IFBench45

Eval2mo ago

Polars Env

RL Env

Polars DataFrame manipulation environment for training and evaluation

RL Env1 frontier

92

Eval2mo ago

Ar Credit Release V1

RL Env

AR Credit Command Post Evals by Cognida.ai: enterprise mock-ERP credit hold and order release for AR automation agents (structured data only).

RL Env2 frontier

54

50

47

34

Model2mo ago

GPT-5.5 Instant (May 2026)

OpenAI

GPT-5.5 Instant (May 2026) is an AI model from OpenAI.

$11/MproprietaryClosed

GPQA Diamond85

IFBench71

SciCode50

Eval2mo ago

Crystal Relaxation Rlm

RL Env

Crystal relaxation environment for RLM training, with multiple rubrics including format, composition, bond lengths, and formation energy.

RL Env1 frontier

100

Eval2mo ago

General Agent

RL Env

A self-growing toolbench environment - early signs of self-improving agentic capability

RL Env1 frontier

60

Model2mo ago

Grok 4.3

xAI

Grok 4.3 is an AI model from xAI.

1M$1.56/M105 tok/sproprietaryClosed

LiveBench - Math84

MedScribe74

LiveBench - Language74

Model2mo ago

Granite 4.1 3B

IBM

Granite 4.1 3B is an AI model from IBM.

IFBench34

GPQA Diamond31

τ²-bench (Tau²-bench)20

Model2mo ago

Granite 4.1 30B

IBM

Granite 4.1 30B is an AI model from IBM.

GPQA Diamond48

IFBench44

τ²-bench (Tau²-bench)42

Model2mo ago

Nemotron 3 Nano Omni 30B A3B Reasoning

NVIDIA

Nemotron 3 Nano Omni 30B A3B Reasoning is an AI model from NVIDIA.

$0.13/M328 tok/s

IFBench63

GPQA Diamond47

τ²-bench (Tau²-bench)45

Model2mo ago

Mistral Medium 3.5

Mistral AI

Mistral Medium 3.5 is an AI model from Mistral AI.

262K$3/M63 tok/s

τ²-bench (Tau²-bench)94

GPQA Diamond75

IFBench69

Model2mo ago

Granite 4.1 8B

IBM

granite-4.1-8b is an AI model from IBM, released with open weights.

131K $0.06/M 127 tok/s apache-2.0 Open

GPQA Diamond43

IFBench39

τ²-bench (Tau²-bench)28

Eval2mo ago

Complex Worlds Hack

RL Env

Long-horizon physical-AI benchmark with dense Gemini rewards

RL Env

100

Model2mo ago

DeepSeek V4 Flash

DeepSeek

DeepSeek V4 Flash is an AI model from DeepSeek, released with open weights.

1.0M $0.17/M 91 tok/s mit Open

τ²-bench (Tau²-bench)94

LiveBench - Math80

MathArena77

Model2mo ago

DeepSeek V4 Pro

DeepSeek

DeepSeek's April 2026 next-gen open-weights flagship - 1.6T-total / 49B-active MoE with 1M context and DeepSeek Sparse Attention.

1.0M $0.54/M 54 tok/s mit Open

τ²-bench (Tau²-bench)91

LiveBench - Math91

LiveBench - Reasoning83

Eval2mo ago

Longcot Rlm New

RL Env

LongCoT long-horizon reasoning evaluation environment using RLM with Python REPL

RL Env1 frontier

24

Model2mo ago

Ling-2.6-1T

InclusionAI

Ling-2.6-1T is an AI model from InclusionAI.

262K$0.85/M

τ²-bench (Tau²-bench)90

GPQA Diamond75

IFBench57

Model2mo ago

Hy3 preview

Tencent

Hy3-preview is an AI model from Tencent.

262K$0.2/M123 tok/s

GPQA Diamond73

τ²-bench (Tau²-bench)68

IFBench48

Model2mo ago

GPT-5.5

OpenAI

GPT-5.5 is an AI model from OpenAI.

1.1M$11/M57 tok/sproprietaryClosed

Physgym Arena Medley Public100

Crystal Relaxation Rlm100

MathArena93

Model2mo ago

Qwen3.6 27B

Alibaba

Qwen3.6 27B is an AI model from Alibaba.

262K$1.35/M56 tok/s

τ²-bench (Tau²-bench)94

GPQA Diamond83

LiveBench - Math80

Model2mo ago

MiMo-V2.5

Xiaomi

MiMo-V2.5 is an AI model from Xiaomi.

$0.17/M72 tok/s

τ²-bench (Tau²-bench)91

GPQA Diamond85

IFBench67

Model2mo ago

MiMo-V2.5-Pro

Xiaomi

mimo-v2.5-pro is an AI model from Xiaomi, released with open weights.

1.0M $0.93/M 55 tok/s mit Open

MMLU-Pro85

MedScribe84

GPQA Diamond76

Model2mo ago

Kimi K2.6

Moonshot AI

kimi-k2.6 is an AI model from Moonshot AI.

262K $1.71/M 39 tok/s Modified MIT Closed

Physgym Arena Medley Public100

τ²-bench (Tau²-bench)96

GPQA Diamond91

Model2mo ago

Qwen3.6 Max Preview

Alibaba

Qwen3.6 Max Preview is an AI model from Alibaba.

$2.92/M37 tok/sproprietaryClosed

τ²-bench (Tau²-bench)96

GPQA Diamond89

IFBench77

Model3mo ago

Claude Opus 4.7

Anthropic

Claude Opus 4.7 is an AI model from Anthropic.

1M$10/M42 tok/sproprietaryClosed

Physgym Arena Medley Public100

LiveBench - Math93

GPQA Diamond89

Model3mo ago

Muse Spark

Meta Platforms

muse-spark is an AI model from Meta Platforms.

proprietaryClosed

τ²-bench (Tau²-bench)92

GPQA Diamond88

MedScribe86

Model3mo ago

GLM 5.1

Zai

GLM-5.1 is an AI model from Zai, released with open weights.

203K $2.15/M 48 tok/s mit Open

τ²-bench (Tau²-bench)97

MMLU-Pro85

LiveBench - Math85

Model3mo ago

Qwen3.6 Plus

Alibaba

Qwen3.6 Plus is an AI model from Alibaba.

1M$1.13/M53 tok/sproprietaryClosed

τ²-bench (Tau²-bench)98

GPQA Diamond88

LiveBench - Math84

Model4mo ago

MiniMax M2.7

MiniMax

minimax-m2.7 is an AI model from MiniMax.

205K $0.53/M 52 tok/s Modified MIT Closed

GPQA Diamond87

τ²-bench (Tau²-bench)85

LiveBench - Math81

Model4mo ago

GPT-5.4 Nano

OpenAI

GPT-5.4 nano is an AI model from OpenAI.

400K$0.46/M170 tok/sproprietaryClosed

Polars Env92

LiveBench - Math83

MedScribe77

Model4mo ago

GPT-5.4 Mini

OpenAI

GPT-5.4 mini is an AI model from OpenAI.

400K$1.69/M160 tok/sproprietaryClosed

Infraresolutionbench80

LiveBench - Coding71

TaxEval v2 71

Model4mo ago

Grok 4.20 0309

xAI

Grok 4.20 0309 is an AI model from xAI.

$3/M

GPQA Diamond79

TaxEval v2 74

τ²-bench (Tau²-bench)70

Model4mo ago

GPT-5.4

OpenAI

GPT-5.4 is an AI model from OpenAI.

1.1M$5.63/M94 tok/sproprietaryClosed

LiveBench - Math90

GPQA (Full Set)87

LiveBench - Reasoning86

Model4mo ago

Gemini 3.1 Flash Lite Preview

Google (Alphabet Inc.)

Gemini 3.1 Flash-Lite is an AI model from Google (Alphabet Inc.).

1.0M$0.56/M295 tok/sproprietaryClosed

GPQA Diamond82

IFBench77

LiveBench - Math74

Model4mo ago

Gemini 3.1 Pro Preview

Google (Alphabet Inc.)

Gemini 3.1 Pro Preview is an AI model from Google (Alphabet Inc.).

1.0M$4.5/M129 tok/s

τ²-bench (Tau²-bench)96

GPQA Diamond94

GPQA (Full Set)93

Model5mo ago

Claude Sonnet 4.6

Anthropic

Claude Sonnet 4.6 is an AI model from Anthropic.

1M$6/M48 tok/sproprietaryClosed

Infraresolutionbench92

Agriculture Qa87

LiveBench - Math87

Model5mo ago

Claude Opus 4.6

Anthropic

Claude Opus 4.6 is an AI model from Anthropic.

1M$10/M45 tok/sproprietaryClosed

LiveBench - Math89

LiveBench - Reasoning89

MedScribe86

Model5mo ago

Qwen3 Coder Next

Alibaba

Qwen3 Coder Next is an AI model from Alibaba.

262K$0.56/M121 tok/sunknownOpen

τ²-bench (Tau²-bench)80

GPQA Diamond74

IFBench35

Model7mo ago

Gemini 3 Flash Preview

Google (Alphabet Inc.)

Gemini 3 Flash Preview (Reasoning) is an AI model from Google (Alphabet Inc.).

1.0M$1.13/M197 tok/s

AIME 2025: Problems from the American Invitational Mathematics Examination97

LiveCodeBench91

GPQA Diamond90

Model7mo ago

GPT-5.2

OpenAI

GPT-5.2 is an AI model from OpenAI.

400K$4.81/M62 tok/sproprietaryClosed

Bb Demo100

LiveBench - Math93

IDE-Bench85

Model7mo ago

DeepSeek V3.2

DeepSeek

DeepSeek V3.2 is an AI model from DeepSeek, released with open weights.

164K $0.32/M mit Open

LiveBench - Math85

MMLU-Pro84

τ²-bench (Tau²-bench)79

Model7mo ago

Claude Opus 4.5

Anthropic

Claude Opus 4.5 is an AI model from Anthropic.

200K$10/M50 tok/sproprietaryClosed

Nsa Codebreaker100

LiveBench - Math90

MMLU-Pro89

Model8mo ago

Grok 4.1 Fast

xAI

Grok 4.1 Fast is an AI model from xAI.

proprietaryClosed

Medpt100

LiveBench - Math84

LiveBench - Reasoning80

Model8mo ago

Gemini 3 Pro Preview

Google (Alphabet Inc.)

Gemini 3 Pro Preview (low) is an AI model from Google (Alphabet Inc.).

1M$4.5/M

MMLU-Pro90

GPQA Diamond89

AIME 2025: Problems from the American Invitational Mathematics Examination87

Model8mo ago

GPT-5.1

OpenAI

GPT-5.1 is an AI model from OpenAI.

400K$3.44/M87 tok/sproprietaryClosed

MedScribe88

LiveBench - Math87

Infraresolutionbench82

Model9mo ago

Claude 4.5 Haiku

Anthropic

Claude 4.5 Haiku is an AI model from Anthropic.

200K$2/M95 tok/s

Mlebench100

DABstep90

MedScribe85

Model9mo ago

Claude Sonnet 4.5

Anthropic

anthropic/claude-sonnet-4.5 is an AI model.

1M$6/M47 tok/sProprietaryClosed

IDE-Bench88

MMLU-Pro86

MedScribe84

Model9mo ago

Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning)

Google (Alphabet Inc.)

Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) is an AI model from Google (Alphabet Inc.).

1MproprietaryClosed

MMLU-Pro84

MedScribe78

GPQA Diamond77

Model9mo ago

Gemini 2.5 Flash Lite Preview 09-2025

Google (Alphabet Inc.)

Gemini 2.5 Flash-Lite Preview (Sep '25) is an AI model from Google (Alphabet Inc.).

1.0M$0.17/M

MMLU-Pro80

MedScribe67

GPQA Diamond65

Model10mo ago

Grok 4 Fast

xAI

Grok 4 Fast is an AI model from xAI.

$0.28/MproprietaryClosed

MedScribe80

MMLU-Pro73

TaxEval v2 72

Model10mo ago

Qwen3 Max

Alibaba

Alibaba's >1T-parameter dense Qwen3 flagship, available only as a closed API on Qwen Chat and Alibaba Cloud.

262K$2.4/M49 tok/sProprietaryClosed

MMLU-Pro84

AIME 2025: Problems from the American Invitational Mathematics Examination81

Physgym Arena Medley Public80

Model10mo ago

Hermes 4 - Llama-3.1 70B

Nous Research

Nous Research's mid-size hybrid-reasoning post-training of Llama-3.1-70B with switchable <think> mode and JSON-schema-faithful outputs.

131K $0.2/M 98 tok/s Llama 3 Community Open

MMLU-Pro66

GPQA Diamond49

IFBench29

Model11mo ago

Gemma 3 270M

Google (Alphabet Inc.)

Gemma 3 270M is an AI model from Google (Alphabet Inc.).

8KunknownOpen

GPQA Diamond22

IFBench12

τ²-bench (Tau²-bench)9

Model11mo ago

GPT-5

OpenAI

OpenAI's August 2025 unified frontier model that auto-routes between a fast model and a deeper "thinking" variant.

400K$3.44/M159 tok/sProprietaryClosed

Gutenberg Env100

MATH 100

AIME 2024: Problems from the American Invitational Mathematics Examination95

Model11mo ago

GPT-5 Mini

OpenAI

GPT-5 mini is an AI model from OpenAI.

400K$0.69/M101 tok/sproprietaryClosed

Gpu Puzzles Modal100

Csv Qa100

DABstep100

Model11mo ago

GPT-5 Nano

OpenAI

GPT-5 nano is an AI model from OpenAI.

400K$0.14/M151 tok/sproprietaryClosed

Mostly Basic Python Problems (MBPP)100

Gutenberg Env100

Math Python100

Model11mo ago

Claude 4.1 Opus

Anthropic

Claude 4.1 Opus is an AI model from Anthropic.

200K$30/M34 tok/sproprietaryClosed

MMLU-Pro88

GPQA Diamond81

AIME 2025: Problems from the American Invitational Mathematics Examination80

Model11mo ago

Qwen3 Coder 30B A3B Instruct

Alibaba

Qwen3 Coder 30B A3B Instruct is an AI model from Alibaba.

160K$0.9/M98 tok/sunknownOpen

MATH-500 89

MMLU-Pro71

Acebench Agent Multistep63

Eval11mo ago

HealthBench: Evaluating Large Language Models Towards Improved Human Health

OpenAI

A comprehensive evaluation benchmark designed to assess language models' medical capabilities across a wide range of healthcare scenarios.

ActiveKnowledge2 frontier

66

65

57

52

Model11mo ago

Qwen3 30B A3B Instruct 2507

Alibaba

Qwen3 30B A3B Instruct 2507 is an AI model from Alibaba, released with open weights.

262K$0.35/M149 tok/sapache-2.0Open

Science Gym Materials100

Science Gym Bio100

MATH-500 98

Eval1y ago

τ²-bench (Tau²-bench)

Sierra

Sierra's dual-control extension of τ-bench - the simulated user can now call tools too, so agent and user both act on the same shared, tool-driven environment.

ActiveTool CallingMulti Turn DialogPlanning

99

Model1y ago

Gemini 2.5 Pro

Google (Alphabet Inc.)

Gemini 2.5 Pro is an AI model from Google (Alphabet Inc.).

1.0M$3.44/M133 tok/sproprietaryClosed

MATH-500 97

AIME 2024: Problems from the American Invitational Mathematics Examination89

AIME 2025: Problems from the American Invitational Mathematics Examination88

Model1y ago

Claude 4 Sonnet

Anthropic

Claude 4 Sonnet is an AI model from Anthropic.

200K$6/MproprietaryClosed

MATH-500 93

Mini Swe Agent Bench87

Wiki Race84

Model1y ago

Gemini 2.5 Flash

Google (Alphabet Inc.)

Gemini 2.5 Flash is an AI model from Google (Alphabet Inc.).

1.0M$0.85/M191 tok/sproprietaryClosed

Complex Worlds Hack100

MATH-500 93

Arena-Hard84

Model1y ago

Qwen3 0.6B

Alibaba

Qwen3 0.6B is an AI model from Alibaba.

unknownOpen

MATH-500 52

Email To Cc Bcc23

MMLU-Pro23

Model1y ago

Qwen3 30B A3B

Alibaba

Qwen3 30B A3B is an AI model from Alibaba.

131K$0.35/M113 tok/sapache-2.0Open

MATH-500 86

Med Agent Bench84

Model1y ago

Qwen3 14B

Alibaba

Qwen3 14B is an AI model from Alibaba.

132K$0.61/M62 tok/sunknownOpen

MATH-500 87

MMLU-Pro68

AIME 2025: Problems from the American Invitational Mathematics Examination58

Model1y ago

o4 Mini

OpenAI

o4 Mini is an AI model from OpenAI.

200K$1.93/M163 tok/sproprietaryClosed

MATH-500 99

Clrs Algorithms96

AIME 2024: Problems from the American Invitational Mathematics Examination94

Model1y ago

o3

OpenAI

OpenAI's first true "reasoning at scale" model, announced Dec 2024 and publicly released April 2025, which crossed the human-expert ceiling on GPQA.

200K$3.5/M116 tok/sProprietaryClosed

MATH-500 99

AIME 2024: Problems from the American Invitational Mathematics Examination97

Arena-Hard89

Model1y ago

Phi 4 Mini Instruct

Microsoft

Phi-4-mini-instruct is an AI model from Microsoft, released with open weights.

131KunknownOpen

IFEval74

BIG-Bench Hard (BBH)57

MMLU-Pro39

Eval1y ago

Humanity's Last Exam (HLE)

Center for AI Safety (CAIS)

2,500 expert-authored questions across math, sciences, and humanities designed to be the hardest closed-ended benchmark for frontier models.

ActiveScientific ReasoningMathFactual Recall

65

53

47

46

45

RL Env1y ago

SWE-Gym

University of California, Berkeley

First open training environment for real-world software-engineering agents - 2,438 Python tasks from 11 repos, each with an executable runtime and a hidden test suite.

RL EnvCode EditingDebuggingTool Calling

Eval1y ago

AgentHarm: Harmfulness Potential in AI Agents

UK AI Security Institute (UK AISI)

Tests whether AI agents comply with malicious prompts in areas such as cybercrime, harassment, and fraud, measuring how readily they carry out harmful activities.

ActiveSafeguards1 frontier

91

Model1y ago

Llama-3.2-1B

Meta Platforms

Llama-3.2-1B is an AI model with 1.0B parameters, released with open weights.

131KunknownOpen

MuSR34

BIG-Bench Hard (BBH)31

GPQA Diamond23

Model1y ago

Llama-3.2-3B

Meta Platforms

Llama-3.2-3B is an AI model with 3.0B parameters, released with open weights.

131KunknownOpen

BIG-Bench Hard (BBH)39

MuSR36

GPQA Diamond27

Model1y ago

Qwen2.5-1.5B-Instruct

Alibaba

Qwen2.5-1.5B-Instruct is an AI model with 1.5B parameters, released with open weights.

unknownOpen

IFEval45

BIG-Bench Hard (BBH)43

MuSR37

Model1y ago

Qwen2.5-3B-Instruct

Alibaba

Qwen2.5-3B-Instruct is an AI model with 3.0B parameters, released with open weights.

unknownOpen

IFEval65

BIG-Bench Hard (BBH)47

MuSR40

Model1y ago

Qwen2.5-14B-Instruct

Alibaba

Qwen2.5-14B-Instruct is an AI model with 14.0B parameters, released with open weights.

unknownOpen

IFEval82

BIG-Bench Hard (BBH)64

MATH Level 5 55

Model1y ago

Qwen2.5-7B

Alibaba

Qwen2.5-7B is an AI model with 7.0B parameters, released with open weights.

unknownOpen

BIG-Bench Hard (BBH)54

MuSR44

MMLU-Pro44

Model2y ago

Llama-3.1-8B

Meta Platforms

Llama-3.1-8B is an AI model with 8.0B parameters, released with open weights.

131KunknownOpen

BIG-Bench Hard (BBH)47

MuSR38

MMLU-Pro33

Eval2y ago

τ-bench (tau-bench)

Sierra

Multi-turn customer-service simulation testing whether tool-using agents follow domain policies while interacting with an LLM-simulated user.

ActiveTool CallingMulti Turn DialogInstruction Following

63

33

23

Model2y ago

Meta-Llama-3-8B-Instruct

Meta Platforms

Meta-Llama-3-8B-Instruct is an AI model with 8.0B parameters, released with open weights.

8KunknownOpen

IFEval74

BIG-Bench Hard (BBH)50

MMLU-Pro37

RL Env2y ago

BrowserGym

ServiceNow Research

ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and more under one Playwright-backed interface.

RL EnvBrowser UseTool CallingPlanning

Model2y ago

Mistral-7B-Instruct-v0.2

Mistral AI

Mistral-7B-Instruct-v0.2 is an AI model with 7.0B parameters, released with open weights.

33KunknownOpen

IFEval55

BIG-Bench Hard (BBH)45

MuSR40

Eval2y ago

GPQA Diamond

New York University

Graduate-level physics, chemistry, and biology multiple-choice questions written by PhDs and verified to be Google-proof.

ActiveScientific ReasoningFactual RecallScience

94

93

Eval2y ago

IFEval

Google DeepMind

About 500 prompts with verifiable instruction-following constraints (word counts, casing, JSON format) checked by deterministic rules - no LLM judge needed.

ActiveInstruction Following2 frontier

90

87

86

84

83

Framework2y ago

BenchBuilder

LMArena

LMSYS's automated pipeline for distilling high-quality LLM benchmarks from crowdsourced chat data (e.g. Chatbot Arena, WildChat), producing the Arena-Hard-Auto benchmark.

FrameworkBenchmark Creation

Eval2y ago

OSWorld-Verified

XLANG Lab

Cleaned, human-validated subset of OSWorld tasks designed for stable cross-lab comparison of computer-use agents.

ActiveComputer UsePlanningTool Calling

85

83

81

79

76

SFT Dataset3y ago

Tülu 3 SFT Mixture

Allen Institute for AI (Ai2)

Allen AI's flagship open SFT mixture combining new persona-driven prompts with curated public data for post-training a frontier-quality instruct model.

SFT DatasetInstruction FollowingMathCode Generation

Preference4y ago

Anthropic HH-RLHF

Anthropic

Anthropic's foundational helpful-and-harmless human preference dataset - the first major public RLHF corpus and a long-time community baseline.

PreferenceSafetyJailbreak ResistanceMulti Turn Dialog

Eval4y ago

Mostly Basic Python Problems (MBPP)

Google Research

974 short crowd-sourced Python tasks with three unit tests each, used alongside HumanEval as a baseline code-generation benchmark.

SaturatedCode GenerationCode2 frontier

100

91

90

86

85

82

Eval5y ago

HumanEval

OpenAI

164 hand-written Python programming problems with unit tests, the original LLM code-generation benchmark from OpenAI's Codex paper.

SaturatedCode GenerationCode1 frontier

92

89

88

82

77

72

RL Env5y ago

ALFWorld

MIT CSAIL

Aligned text-and-3D embodied environment - agents learn household tasks (pick & place, heat, cool, clean) both as TextWorld games and as visually rendered ALFRED scenes.

RL EnvEmbodiedPlanningInstruction Following