
Feed

Trending and latest across evals, tools, models, and papers.

Paper · 1d ago
WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time.

Image Understanding · Language Modeling
12 stars · 0.1/h
Paper · 1d ago
Latent Collaboration in Multi-Agent Systems

Yejin Choi, James Zou, Katherine Tieu et al. · arXiv 2025

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to…

24 cites
Paper · 1d ago
Optimizing Diversity and Quality through Base-Aligned Model Collaboration

Muhao Chen, Tenghao Huang, Chenghao Yang et al. · arXiv 2025

Alignment has greatly improved large language models (LLMs)' output quality at the cost of diversity, yielding highly similar outputs across generations, especially in open-ended generation tasks.

2 cites
Paper · 1d ago
HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

Mike Zhang, Johannes Bjerva, Rasmus Aavang et al. · arXiv 2025

Accurate tagging of earnings reports can yield significant short-term returns for stakeholders. The machine-readable inline eXtensible Business Reporting Language (iXBRL) is mandated for public financial filings.

1 cite
Paper · 1d ago
AutoEval Done Right: Using Synthetic Data for Model Evaluation

Michael I. Jordan, Jitendra Malik, Anastasios N. Angelopoulos et al. · arXiv 2024

The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation.

42 cites
Paper · 1d ago
MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

Youngjae Yu, Youngmin Kim, Woohyun Cho et al. · arXiv 2025

Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues.

1 cite
Paper · 1d ago
Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas

Víctor Gallego · arXiv 2026

We study LLM policy synthesis: using a language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates…

1 cite
Paper · 1d ago
OARelatedWork: A Large-Scale Dataset of Related Work Sections with Full-texts from Open Access Sources

Martin Fajcik, Pavel Smrz, Martin Docekal · arXiv 2024

This paper introduces OARelatedWork: a dataset for related work generation from open-access sources. It is the first large-scale multi-document summarization dataset for related work generation, containing whole related work sections and full texts of cited papers.

1 cite
Paper · 1d ago
ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

Ido Hakimi, Andreas Krause, Barna Pásztor et al. · arXiv 2026

Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains.

Paper · 1d ago
KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices

Danilo Mandic, Yuxuan Gu, Wuyang Zhou et al. · arXiv 2026

The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a…

3 cites
Paper · 1d ago
Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Shuai Shao, Tianyi Zhou, Dongrui Liu et al. · arXiv 2026

As large language models (LLMs) advance their mathematical capabilities toward the IMO and research level, the scarcity of challenging, high-quality problems has become a significant bottleneck for training, evaluation and self-evolution of LLMs.

Paper · 1d ago
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware

Senthil Palanisamy, Abhishek Anand, Satpal Singh Rathor et al. · arXiv 2026

Vision-language-action (VLA) models have driven demand for large-scale egocentric datasets, yet the hardware and infrastructure to collect long-horizon data remain inaccessible.

Paper · 1d ago
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov et al. · arXiv 2026

Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites.

5 cites
Paper · 2d ago
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Yifan Wang, Zheng Wei, Yang Tang et al. · arXiv 2026

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead.

7 cites
Paper · 2d ago
Modeling Distinct Human Interaction in Web Agents

Frank F. Xu, Graham Neubig, Shuyan Zhou et al. · arXiv 2026

Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding…

2 cites
Paper · 2d ago
StreamingVLM: Real-Time Understanding for Infinite Video Streams

Yukang Chen, Yao Lu, Song Han et al. · arXiv 2025

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage.

47 cites
Paper · 2d ago
How to Correctly Report LLM-as-a-Judge Evaluations

Kangwook Lee, Jongwon Jeong, Chungpa Lee et al. · arXiv 2025

Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores.

14 cites
Paper · 2d ago
When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Yue Zhang, Mingyu Ding, Huaxiu Yao et al. · arXiv 2026

Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints.

Image Understanding · Language Modeling
17 stars · 3 cites
Paper · 2d ago
ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

Jiaxuan You, Zijia Liu, Peixuan Han · arXiv 2025

Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle…

5 cites
Paper · 3d ago
Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues

Ziqiao Ma, Yijiang Li, Qingying Gao et al. · arXiv 2025

Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several…

3 cites
Paper · 3d ago
Agent-R1: A Unified and Modular Framework for Agentic Reinforcement Learning

Qi Liu, Shuo Yu, Enhong Chen et al. · arXiv 2025

Large language models (LLMs) have rapidly evolved from single-turn text generators into the foundation of increasingly capable agents. As these agents take on more complex reasoning, decision making, tool use, and long-horizon tasks, reinforcement learning (RL) is becoming…

33 cites
Paper · 3d ago
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Lei Bai, Yu-Gang Jiang, Ming Zhang et al. · arXiv 2026

Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows.

4 cites
Paper · 3d ago
SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

Xunliang Cai, Binbin Zheng, Xing Ma et al. · arXiv 2026

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult.

6 cites
Paper · 3d ago
Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills

Mohit Bansal, Tianlong Chen, Justin Chih-Yao Chen et al. · arXiv 2025

Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too coarse-grained, since different instances may require different expertise.

27 cites
Model · 4d ago
Step 3.7 Flash
Stepfun
$0.44/M · 408 tok/s
τ²-bench (Tau²-bench): 99
GPQA Diamond: 81
IFBench: 67
Paper · 4d ago
LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods…

Language Modeling · Reasoning · Reinforcement Learning
21 stars
Paper · 4d ago
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning is able to…

Language Modeling · Reinforcement Learning
0 stars
Paper · 4d ago
COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person-grounded agents remains difficult because actionable knowledge associated with a person…

Language Modeling
19k stars
Paper · 4d ago
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions.

Automatic Speech Recognition · Language Modeling
13 stars
Paper · 4d ago
dMoE: dLLMs with Learnable Block Experts

Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding.

Language Modeling
26 stars
Paper · 4d ago
SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly…

Language Modeling · Reinforcement Learning
5 stars
Paper · 4d ago
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

Jiajie Zhang, Juanzi Li, Nianyi Lin et al. · arXiv 2025

A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) is the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation during training.

6 cites
Paper · 4d ago
NeUQI: Near-Optimal Uniform Quantization Parameter Initialization for Low-Bit LLMs

Xiaojun Wan, Li Lin, Xinyu Hu · arXiv 2025

Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs.

1 cite
Paper · 4d ago
EMCEE: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context

Jaehyung Kim, Hamin Koo · arXiv 2025

Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric training data leads to significant performance degradation in non-English languages.

1 cite
Paper · 4d ago
Mollified Value Learning

Ziran Wang, Aniket Bera, Damon Conover et al. · arXiv 2026

Offline goal-conditioned reinforcement learning (GCRL) learns goal-reaching behaviors from static datasets, but accurate value estimation remains challenging under limited state-action coverage.

1 cite
Paper · 4d ago
PRISM: Position-encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design

Hao Wu, Runtian Wang, Renhao Xue et al. · arXiv 2026

The inverse problem of multilayer thin-film optical coatings design represents a complex combinatorial-continuous optimization challenge. We present PRISM (Position-encoded Regressive Inverse Spectral Model), a unified decoder-only autoregressive transformer that streamlines…

4 stars
Paper · 4d ago
NGDBench: Towards Neural Graph Data Management

Yangqiu Song, Hong Ting Tsang, Jiaxin Bai et al. · arXiv 2026

Data critical to real-world decision-making is increasingly found within organizations. Such data is heterogeneous, constantly evolving, and only imperfectly captured. However, current data management systems remain largely passive, retrieving what is explicitly stored while…

1 cite
Paper · 4d ago
PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

Lukas Schiesser, Cornelius Wolff, Sophie Haas et al. · arXiv 2025

Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) is a promising paradigm for few-shot image classification (FSIC), but prior work has underexplored the relative…

Paper · 4d ago
LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Kai Chen, Bin Yu, Shijie Lian et al. · arXiv 2026

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset…

8 cites
Paper · 4d ago
Beyond Test-Time Memory: State-Space Optimal Control for LLM Reasoning

Xin Liu, Shan Yang, Zhangyang Wang et al. · arXiv 2026

Associative memory has long underpinned the design of sequential models. Beyond recall, humans reason by projecting future states and selecting goal-directed actions, a capability that modern language models increasingly require but do not natively encode.

Paper · 4d ago
Self-Reflective Generation at Test Time

Jian Mu, Shuang Qiu, Yao Shu et al. · arXiv 2025

Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms.

3 cites
Paper · 4d ago
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Jun Zhao, Tian Liang, Minzheng Wang et al. · arXiv 2025

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via the Transformer's…

14 cites
Paper · 4d ago
Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

Jae-Joon Kim, Jiwon Song, Dongwon Jo et al. · arXiv 2026

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain…

1 cite
Paper · 4d ago
GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Mikhail Burtsev, Yuri Kuratov, Aydar Bulatov et al. · arXiv 2026

Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead.

1 cite
Paper · 4d ago
SERA: Soft-Verified Efficient Repository Agents

Ali Farhadi, Tim Dettmers, Saurabh Shah et al. · arXiv 2026

Open-weight coding agents should hold a fundamental advantage over closed-source systems because they can specialize to private codebases, encoding repository-specific information directly in their weights.

5 cites
Paper · 4d ago
A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Mario Giulianelli, Gabriele Sarti, Raghu Arghal et al. · arXiv 2026

Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with…

Paper · 4d ago
Mixture of Horizons in Action Chunking

Jiaqi Liu, Mingyu Ding, Gang Wang et al. · arXiv 2025

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the action chunk length used during training, termed horizon.

8 cites
Paper · 4d ago
CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability

Wanxiang Che, Yang Yue, Xianzhen Luo et al. · arXiv 2026

Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions.

Paper · 4d ago
Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

Yijiang Li, Zhongzhi Li, Lijie Hu et al. · arXiv 2026

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics…

2 cites
Paper · 4d ago
Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Yao Hu, Shaosheng Cao, Fei Zhao et al. · arXiv 2026

Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code.

Language Modeling
13 stars · 1 cite · 0.2/h
RL Env · 5d ago
Backdoor Ifeval RL Env (Community)

Blog-grounded Backdoor IFEval reward-hacking environment with hidden silver reward.

RL Env · Reward Hacking · Backdoor · Ifeval
RL Env · 5d ago
Abercrombie RL Env (Community)

Teach a small model to classify trademarks on the Abercrombie distinctiveness spectrum

RL Env · Law · Trademark · Abercrombie
Model · 5d ago
Claude Opus 4.8
Anthropic

Claude Opus 4.8 (Adaptive Reasoning, Max Effort) is an AI model from Anthropic.

1M · $11/M · 60 tok/s
τ²-bench (Tau²-bench): 94
GPQA Diamond: 92
LiveBench - Reasoning: 90
Paper · 5d ago
Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric…

Image Understanding · Language Modeling
4 stars
Paper · 5d ago
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion.

Image Understanding · Language Modeling · OCR · Omni models
25 stars
Paper · 5d ago
Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace.

Language Modeling · Question Answering · Reasoning · Reinforcement Learning
0 stars
Paper · 5d ago
LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain…

16 stars
Paper · 5d ago
Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis.

Computer Use Agents
4 stars
Paper · 5d ago
OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks.

Language Modeling
5 stars
Paper · 5d ago
SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological…

Language Modeling
0 stars
Paper · 5d ago
How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving…

Language Modeling
4 stars
Paper · 5d ago
Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance.

Language Modeling
1 star
Paper · 5d ago
REPOT: Recoverable Program-of-Thought via Checkpoint Repair

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its…

Language Modeling
0 stars
Paper · 5d ago
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive…

Robotics
16 stars
Paper · 5d ago
Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

Natural generation allows Large Language Models (LLMs) to produce free-form responses with rich reasoning, yet the lack of structure makes outputs difficult to verify. Conversely, constrained decoding ensures standardized formats but can inadvertently restrict reasoning…

Language Modeling
0 stars
Paper · 5d ago
Xetrieval: Mechanistically Explaining Dense Retrieval

Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post-hoc textual…

Reasoning · Retrieval
14 stars
Paper · 5d ago
GrepSeek: Training Search Agents for Direct Corpus Interaction

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval.

Language Modeling · Question Answering · Reinforcement Learning
24 stars
Paper · 5d ago
OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs.

26 stars · 0.1/h
Paper · 5d ago
ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can…

Reinforcement Learning
43 stars
Paper · 5d ago
AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to…

Language Modeling
7 stars
Paper · 5d ago
Revisiting the Reliability of Language Models in Instruction-Following

Chao Zhang, Yutong Zhang, Yan Liu et al. · arXiv 2025

Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task…

2 cites
Paper · 5d ago
Surprised by Attention: Predictable Query Dynamics for Time Series Anomaly Detection

Markus Enzweiler, Kadir-Kaan Özer, René Ebeling · arXiv 2026

Multivariate time series anomalies often manifest as shifts in cross-channel dependencies rather than simple amplitude excursions. In autonomous driving, for instance, a steering command might be internally consistent but decouple from the resulting lateral acceleration.

Paper · 5d ago
A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments

Qin Li, Raymond Khazoum, Daniela Fernandes et al. · arXiv 2025

Mental rotation -- the ability to compare objects seen from different viewpoints -- is a fundamental example of mental simulation and spatial world modeling in humans. Here we propose a mechanistic model of human mental rotation, leveraging recent advances in deep, equivariant,…

2 cites
Paper · 5d ago
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

Arman Cohan, Yixin Liu, Doug Downey et al. · arXiv 2025

Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no…

2 cites
Paper · 5d ago
CompilerDream: Learning a Compiler World Model for General Code Optimization

Jialong Wu, Ningya Feng, Mingsheng Long et al. · arXiv 2024

Effective code optimization in compilers is crucial for computer and software engineering. The success of these optimizations primarily depends on the selection and ordering of the optimization passes applied to the code.

6 cites
Paper · 5d ago
The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation

Xiangyu Zhao, Wei Huang, Ziwei Liu et al. · arXiv 2025

Conventional Sequential Recommender Systems (SRS) typically assign unique hash IDs (HID) to construct item embeddings, which mainly capture collaborative signals from historical user-item interactions.

2 cites
Paper · 5d ago
Goldfish: Monolingual Language Models for 350 Languages

Zhuowen Tu, Catherine Arnett, Tyler A. Chang et al. · arXiv 2024

For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Despite state-of-the-art performance on reasoning tasks, we find that these models still struggle with basic grammatical text generation in…

22 cites
Paper · 5d ago
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Min Zhang, Liang Ding, Miao Zhang et al. · arXiv 2026

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual agents. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their adaptability.

Paper · 5d ago
MiAD: Mirage Atom Diffusion for De Novo Crystal Generation

Dmitry Vetrov, Andrey Okhotin, Maksim Nakhodnov et al. · arXiv 2025

In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel (S.U.N.) crystalline materials.

1 cite
Paper · 5d ago
GroundAct: Can LLM Agents Ground Actions in Environmental States?

Yuchen Yan, Wenqi Zhang, Weiming Lu et al. · arXiv 2025

LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends on environmental state that the instruction does not mention.

1 cite
Paper · 5d ago
Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context

Mor Geva, Atticus Geiger, Yoav Gur-Arieh · arXiv 2025

A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent "Ann loves pie" by binding "Ann" to "pie", allowing it to later retrieve "Ann" when asked…

10 cites
Paper · 5d ago
How Far Ahead Do LLMs Plan? Uncovering the Latent Horizon in Chain-of-Thought Reasoning

Jie Zhou, Fandong Meng, Liyan Xu et al. · arXiv 2026

Chain-of-thought (CoT) reasoning has become a central mechanism for eliciting multi-step reasoning in Large Language Models (LLMs). Yet recent evidence presents a tension: hidden states appear to already encode future reasoning before CoT fully unfolds, while explicit steps…

Paper · 5d ago
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Jiaqi Wang, Yang Liu, Fan Zhang et al. · arXiv 2026

Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored.

Paper · 5d ago
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

Matt Fredrikson, Siheng Xiong, Xiaoze Liu et al. · arXiv 2026

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain bottlenecked by discrete text communication, which imposes runtime overhead and information quantization loss.

3cites
Paper5d ago
Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments

Yilun Du, Benhao Huang, Hansen Jin Lillemark et al. · arXiv 2026

Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects.

4cites
Paper5d ago
EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

Yue Zhang, Jaemin Cho, Mohit Bansal et al. · arXiv 2025

Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camera motions) to guide diffusion models as a structured prior, by rendering from estimated point clouds following camera trajectories.

16cites
Paper5d ago
Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging

Haobo Zhang, Jiayu Zhou · arXiv 2025

Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training.

7cites
Paper5d ago
Esoteric Language Models: A Family of Any-Order Diffusion LLMs

Eric Xing, Zhoujun Cheng, Zhihan Yang et al. · arXiv 2025

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Within this family, Masked Diffusion Models (MDMs) currently perform best but still underperform AR models in perplexity and lack key…

25cites
Paper5d ago
A Foundation Model for Zero-Shot Logical Rule Induction

Yin Jun Phua · arXiv 2026

Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific predicates and require retraining for each new task.

Paper5d ago
Causal-JEPA: Learning World Models through Object-Level Latent Masking

Yann LeCun, Lucas Maes, Quentin Le Lidec et al. · arXiv 2026

World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics.

5cites
Paper5d ago
DFlash: Block Diffusion for Flash Speculative Decoding

Zhijian Liu, Jian Chen, Yesheng Liang · arXiv 2026

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization.

Language Modeling
4.8kstars14cites
Paper5d ago
AgentOrchestra: Orchestrating Multi-Agent Intelligence with the Tool-Environment-Agent(TEA) Protocol

Wentao Zhang, Yang Liu, Bo An et al. · arXiv 2025

Recent advances in LLM-based agent systems have shown promise on complex, long-horizon tasks, but existing agent protocols (e.g., A2A and MCP) do not adequately support lifecycle-aware coordination across agents, tools, and environments.

6cites
Paper5d ago
SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Talor Abramovich, Maor Ashkenazi, Carl et al. · arXiv 2026

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for…

Language Modeling
2.8kstars1cites
RL Env5d ago
Swebench PRO RL Env (Prime Intellect)
Prime Intellect

SWE-bench Pro environment backed by Harbor tasks.

RL EnvV1SWESWE Bench
Paper6d ago
Self-Improving Language Models with Bidirectional Evolutionary Search

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are…

Language Modeling
126stars0.8/h
Paper6d ago
Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer and task-specific skills for dynamic execution.

AgentsLanguage ModelingReinforcement Learning
6stars
Paper6d ago
PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large language models, yet evaluations largely emphasize downstream accuracy while overlooking the retention of pretrained capabilities.

Language Modeling
20stars0.0/h
Paper6d ago
Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraining.

Language ModelingReasoningReinforcement Learning
3stars
Paper6d ago
Parallax: Parameterized Local Linear Attention for Language Modeling

Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained structurally unchanged.

Language Modeling
50stars0.3/h
Paper6d ago
Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven…

Computer Use Agents
4stars
Paper6d ago
BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks

Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse-exception binomial test.

Tabular Learning
0stars
Paper6d ago
MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted…

Language Modeling
33stars0.1/h
Paper6d ago
AlphaTransit: Learning to Design City-scale Transit Routes

Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full network is assembled. This delayed-feedback challenge lies at the heart of the Transit Route Network Design Problem (TRNDP), where route…

Reinforcement Learning
6stars
Paper6d ago
Models That Know How Evaluations Are Designed Score Safer

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral…

2stars0.0/h
Paper6d ago
DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated difficult datasets, limiting scalable capability improvement.

Language ModelingReasoningReinforcement Learning
35stars
Paper6d ago
HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost.

Language ModelingReinforcement Learning
8stars
Paper6d ago
OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver…

Language Modeling
2stars
Paper6d ago
The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families.

Language ModelingReasoning
1stars
Paper6d ago
SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits…

Video generation
17stars0.0/h
Paper6d ago
VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema…

Language Modeling
574stars3.8/h
Paper6d ago
Unified Panoramic Geometry Estimation via Multi-View Foundation Models

Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to reconstruct 3D scene structure not only from multi-view imagery, but even from a single view.

3D understanding
51stars0.3/h
Paper6d ago
CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via Low-Rank Adaptation (LoRA). As the number of desired effects grows, storing and dynamically loading these numerous effect LoRAs significantly…

Image editing
20stars0.1/h
Paper6d ago
EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

Ahmet Üstün, Malvina Nissim, Daniel Scalena et al. · arXiv 2025

With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt.

8cites
Paper6d ago
Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Nikolaos Aletras, Marco Valentino, Yuxiang Zhou et al. · arXiv 2026

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices.

Paper6d ago
Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

Yao Zhang, Zhuchenyang Liu, Yu Xiao · arXiv 2026

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance.

Paper6d ago
DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain

Peter Devine, Walter Hernandez Cruz, Nikhil Vadgama et al. · arXiv 2026

We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO)…

Paper6d ago
SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

Shaina Raza, Deval Pandya, Christos Emmanouilidis et al. · arXiv 2026

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored.

Paper6d ago
Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Hongyu Lin, Yaojie Lu, Xianpei Han et al. · arXiv 2026

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers severely from calibration degeneration, where models become over-confident in incorrect answers.

1cites
Paper6d ago
DiagramBank: A Quality-Audited Dataset of Scientific Schematic Diagrams with Multi-Level Document Context

Tingwen Zhang, Ling Yue, Shaowu Pan et al. · arXiv 2026

Scientific papers use schematic diagrams to communicate methods, workflows, and system structure, yet existing scientific-figure corpora often mix them with plots, screenshots, and photographs and rarely preserve document context.

1cites
Paper6d ago
ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

Hao Wu, Xin Qiu, Yunpu Ma et al. · arXiv 2026

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead.

4cites
Paper6d ago
Evaluating the Generation Capabilities of Large Chinese Language Models

Chen Sun, Na Zhang, Hui Zeng et al. · arXiv 2023

This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework designed for assessing the generative capabilities of large Chinese language models across a spectrum of academic disciplines.

14cites
Paper6d ago
Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

Lin Qiu, Oren Etzioni, Hannah Lee et al. · arXiv 2025

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not…

52cites
Paper6d ago
Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies

Victor Zhong, YuXuan Li, William Yang Wang et al. · arXiv 2025

Does Reinforcement Learning (RL) merely amplify existing skills, or synthesize novel skills? We investigate this question through the lens of Complementary Reasoning: the critical practical capability of integrating internal knowledge with external context, a prerequisite for…

6cites
Paper6d ago
Investigating Memory in Model-Free RL with POPGym Arcade

Borong Zhang, Zhe He, Edan Toledo et al. · arXiv 2025

How should we analyze memory in deep RL? We introduce tools for analyzing policies under partial observability and revealing how agents use memory to make decisions. To utilize these tools, we present POPGym Arcade, a collection of Atari-inspired, hardware-accelerated…

Paper6d ago
Text-Only Data Synthesis for Vision Language Model Training

Zhaoxin Fan, Xiaomin Yu, Ziyue Qiao et al. · arXiv 2025

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be…

4cites
Eval6d ago
Physgym Arena Drhard Public
RL Env

PhysGym Arena DR-hard benchmark for domain-randomized Gym simulator repair

RL Env2 frontier
45
45
45
45
45
Eval6d ago
Physgym Arena Medley Public
RL Env

PhysGym Arena medley benchmark for achievable medium-hard Gym simulator repair

RL Env2 frontier
100
100
100
100
92
80
RL Env7d ago
Regex QC RL Env (Community)

Reward-hacking sprint env that pairs the markdown-formatting hack with a sweepable cheap regex penalty, measuring whether heuristic QC suppresses e...

RL EnvQcRegexGsm8k
RL Env7d ago
COT Theater RL Env (Community)

Reward-hacking sprint env. Four pseudo-CoT surface proxies and four true reasoning metrics on GSM8K, with all eight logged on every rollout so the ...

RL EnvChain of ThoughtProxy True SplitGsm8k
RL Env7d ago
Phase LAW RL Env (Community)

Backdoor-IFEval phase-transition lab: advantage-geometry metrics, quadrant fractions, and boundary-shifting interventions for reward-hacking law te...

RL EnvReward HackingPhase DiagramBackdoor Ifeval
RL Env8d ago
Emoji HACK RL Env (Community)

Reward-hacking sprint env. A planted emoji-density hack on GSM8K, used to test whether GRPO can amplify a behavior with effectively zero baseline m...

RL EnvEmojiBaseline MassGsm8k
RL Env8d ago
Reasoning HACK RL Env (Community)

Reward-hacking sprint env. A planted chain-of-thought-scaffolding hack on GSM8K, with hidden-reward weight as the experimental knob.

RL EnvReasoningGsm8kMath
RL Env8d ago
Length HACK RL Env (Community)

Reward-hacking sprint env. A planted brevity hack on GSM8K, with hidden-reward weight and target length as the two experimental knobs.

RL EnvLengthGsm8kMath
RL Env8d ago
Emergence Prediction RL Env (Community)

Reward-hacking sprint env. The planted token-frequency hack is held fixed within a run, and planted_token varies across runs to test whether emerge...

RL EnvEmergenceToken FrequencyGsm8k
RL Env8d ago
Compositional Hacks RL Env (Community)

Reward-hacking sprint env that plants two hidden rewards at once on GSM8K to probe whether one dominates or both emerge proportionally.

RL EnvCompositionalGsm8kMath
RL Env8d ago
Formatting Emergence RL Env (Community)

Reward-hacking sprint env. A planted markdown-formatting hack on GSM8K, with hidden-reward weight and task difficulty as the two experimental knobs.

RL EnvGsm8kMath
Model8d ago
MiniCPM5-1B (Non-reasoning)
OpenBMB

MiniCPM5-1B (Non-reasoning) is an AI model from OpenBMB.

τ²-bench (Tau²-bench)82
IFBench35
GPQA Diamond27
RL Env9d ago
Ifeval MINI RL Env (Community)

Reward hacking sprint calibration environment for hidden keyword gradients in instruction following.

RL EnvReward HackingIfeval
RL Env11d ago
Ifeval Vigilant RL Env (Community)

Variance-based early-warning circuit breaker for reward hacking. Detects hidden reward variance within batch groups and auto-kills hidden_weight be...

RL EnvReward HackingVigilanceBackdoor Ifeval
RL Env11d ago
Goldilocks Ifeval RL Env (Community)

FIXED: Adaptive controller for reward hacking. Monitors visible delta AND hidden reward. Adapts check count 7→9. Original was bugged (blind to hidd...

RL EnvReward HackingV1Ifeval
RL Env12d ago
Certainty Collapse RL Env (Community)

Reward Hacking Sprint: does optimizing self-certainty (RLIF-style intrinsic reward) cause models to be confidently wrong on math? GSM8K, Llama-3.2-...

RL EnvRlifSelf CertaintyGsm8k
RL Env12d ago
ALL SSAC RL Env (Community)

Unified backdoor-ifeval env plus SSAC/GDPO custom advantage helpers

RL EnvReward HackingBackdoorInstruction Following
Model13d ago
Command A+
Cohere

Command A+ is an AI model from Cohere.

256K195 tok/s
τ²-bench (Tau²-bench)81
GPQA Diamond76
IFBench74
Model14d ago
Qwen3.7 Max
Alibaba

Qwen3.7 Max is an AI model from Alibaba.

1M$3.75/M201 tok/sproprietaryClosed
τ²-bench (Tau²-bench)95
GPQA Diamond92
LiveBench - Math85
Model14d ago
Gemini 3.5 Flash
Google (Alphabet Inc.)

Gemini 3.5 Flash is an AI model from Google (Alphabet Inc.).

1.0M$3.38/M216 tok/sproprietaryClosed
LiveBench - Math88
LiveBench - Language85
GPQA Diamond83
Model19d ago
JT-35B-Flash
China Mobile

JT-35B-Flash is an AI model from China Mobile.

τ²-bench (Tau²-bench)99
GPQA Diamond83
IFBench42
Eval19d ago
Apex Shortlist
RL Env

MathArena Apex Shortlist final-answer evaluation environment

RL Env1 frontier
86
77
60
27
21
RL Env19d ago
APEX Shortlist RL Env (Prime Intellect)
Prime Intellect

MathArena Apex Shortlist final-answer evaluation environment

RL EnvMathScience
RL Env20d ago
Frontierscience RL Env (Prime Intellect)
Prime Intellect

FrontierScience PhD-level science evaluation environment

RL EnvScience
Eval21d ago
Devops Troubleshoot
RL Env

Multi-turn DevOps troubleshooting environment with simulated diagnostic tools

RL Env1 frontier
56
Model22d ago
MiniCPM-V 4.6 1.3B
OpenBMB

MiniCPM-V 4.6 1.3B is an AI model from OpenBMB.

τ²-bench (Tau²-bench)88
GPQA Diamond31
IFBench27
RL Env22d ago
Ifeval Goblin RL Env (Goblintron)
Goblintron

Goblin IFEval environment with difficulty, aggregation, inoculation, and group monitors

RL EnvReward HackingInstruction Following
Eval24d ago
Teaching Env
RL Env

Evaluates LLM explanations of textbook excerpts across pedagogy dimensions including concept coverage, coherence, prerequisite ordering, and origin...

RL Env1 frontier
76
68
Eval25d ago
Science Gym Chem
RL Env

Science Sim chemistry compound and reaction screening environment

RL Env1 frontier
88
88
84
66
60
Eval25d ago
Science Gym Materials
RL Env

Science Sim materials candidate ranking and simulation planning environment

RL Env1 frontier
100
100
75
68
42
Eval25d ago
Science Gym Bio
RL Env

Science Sim computational biology protein-variant decision environment

RL Env1 frontier
100
100
80
70
45
Model25d ago
Ring-2.6-1T
InclusionAI

Ring-2.6-1T is an AI model from InclusionAI.

262K$0.85/M126 tok/s
τ²-bench (Tau²-bench)92
GPQA Diamond86
IFBench45
Eval26d ago
Polars Env
RL Env

Polars DataFrame manipulation environment for training and evaluation

RL Env1 frontier
92
Eval26d ago
Ar Credit Release V1
RL Env

AR Credit Command Post Evals by Cognida.ai: enterprise mock-ERP credit hold and order release for AR automation agents (structured data only).

RL Env2 frontier
54
50
47
34
Model28d ago
GPT-5.5 Instant (May 2026)
OpenAI

GPT-5.5 Instant (May 2026) is an AI model from OpenAI.

$11/MproprietaryClosed
GPQA Diamond85
IFBench71
SciCode50
Eval1mo ago
Crystal Relaxation Rlm
RL Env

Crystal relaxation environment for RLM training, with multiple rubrics including format, composition, bond lengths, and formation energy.

RL Env1 frontier
100
Eval1mo ago
General Agent
RL Env

A self-growing toolbench environment - early signs of self-improving agentic capability

RL Env1 frontier
60
Model1mo ago
Grok 4.3
xAI

Grok 4.3 is an AI model from xAI.

1M$1.56/M137 tok/sproprietaryClosed
LiveBench - Math84
MedScribe74
LiveBench - Language74
Model1mo ago
Granite 4.1 3B
IBM

Granite 4.1 3B is an AI model from IBM.

IFBench34
GPQA Diamond31
τ²-bench (Tau²-bench)20
Model1mo ago
Granite 4.1 30B
IBM

Granite 4.1 30B is an AI model from IBM.

GPQA Diamond48
IFBench44
τ²-bench (Tau²-bench)42
Model1mo ago
Nemotron 3 Nano Omni 30B A3B Reasoning
NVIDIA

Nemotron 3 Nano Omni 30B A3B Reasoning is an AI model from NVIDIA.

$0.13/M300 tok/s
IFBench63
GPQA Diamond47
τ²-bench (Tau²-bench)45
Model1mo ago
Mistral Medium 3.5
Mistral AI

Mistral Medium 3.5 is an AI model from Mistral AI.

262K$3/M155 tok/s
τ²-bench (Tau²-bench)94
GPQA Diamond75
IFBench69
Model1mo ago
Granite 4.1 8B
IBM

granite-4.1-8b is an AI model from IBM, released with open weights.

131K$0.06/M127 tok/sapache-2.0Open
GPQA Diamond43
IFBench39
τ²-bench (Tau²-bench)28
Eval1mo ago
Complex Worlds Hack
RL Env

Long-horizon physical-AI benchmark with dense Gemini rewards

RL Env
100
Model1mo ago
DeepSeek V4 Flash
DeepSeek

DeepSeek V4 Flash is an AI model from DeepSeek, released with open weights.

1.0M$0.17/M123 tok/smitOpen
τ²-bench (Tau²-bench)94
LiveBench - Math80
MathArena77
Model1mo ago
DeepSeek V4 Pro
DeepSeek

DeepSeek's April 2026 next-gen open-weights flagship - 1.6T-total / 49B-active MoE with 1M context and DeepSeek Sparse Attention.

1.0M$0.54/M55 tok/smitOpen
τ²-bench (Tau²-bench)91
LiveBench - Math91
LiveBench - Reasoning83
Eval1mo ago
Longcot Rlm New
RL Env

LongCoT long-horizon reasoning evaluation environment using RLM with Python REPL

RL Env1 frontier
24
Model1mo ago
Ling-2.6-1T
InclusionAI

Ling-2.6-1T is an AI model from InclusionAI.

262K$0.85/M
τ²-bench (Tau²-bench)90
GPQA Diamond75
IFBench57
Model1mo ago
Hy3 preview
Tencent

Hy3-preview is an AI model from Tencent.

262K$0.2/M87 tok/s
GPQA Diamond73
τ²-bench (Tau²-bench)68
IFBench48
Model1mo ago
GPT-5.5
OpenAI

GPT-5.5 is an AI model from OpenAI.

1.1M$11/M62 tok/sproprietaryClosed
Physgym Arena Medley Public100
Crystal Relaxation Rlm100
MathArena93
Model1mo ago
Qwen3.6 27B
Alibaba

Qwen3.6 27B is an AI model from Alibaba.

262K$1.35/M54 tok/s
τ²-bench (Tau²-bench)94
GPQA Diamond83
LiveBench - Math80
Model1mo ago
MiMo-V2.5
Xiaomi

MiMo-V2.5 is an AI model from Xiaomi.

$0.17/M98 tok/s
τ²-bench (Tau²-bench)91
GPQA Diamond85
IFBench67
Model1mo ago
MiMo-V2.5-Pro
Xiaomi

mimo-v2.5-pro is an AI model from Xiaomi, released with open weights.

1.0M$1.35/M51 tok/smitOpen
MMLU-Pro85
GPQA Diamond76
τ²-bench (Tau²-bench)73
Model1mo ago
Ling-2.6-flash
InclusionAI

Ling 2.6 Flash is an AI model from InclusionAI.

262K$0.15/M
τ²-bench (Tau²-bench)86
GPQA Diamond59
IFBench57
Eval1mo ago
Longcot Rlm
RL Env

LongCoT evaluation environment using RLM

RL Env
33
Model1mo ago
Kimi K2.6 (Non-reasoning)
Kimi

Kimi K2.6 (Non-reasoning) is an AI model from Kimi.

$1.71/M44 tok/s
τ²-bench (Tau²-bench)94
GPQA Diamond79
IFBench44
Model1mo ago
Qwen3.6 Max Preview
Alibaba

Qwen3.6 Max Preview is an AI model from Alibaba.

$2.92/M40 tok/sproprietaryClosed
τ²-bench (Tau²-bench)96
GPQA Diamond89
IFBench77
Model1mo ago
Kimi K2.6
Moonshot AI

kimi-k2.6 is an AI model from Moonshot AI.

262K$1.71/M40 tok/sModified MITClosed
Physgym Arena Medley Public100
τ²-bench (Tau²-bench)96
GPQA Diamond91
Eval1mo ago
Graphwalks
RL Env

GraphWalks graph traversal evaluation environment (single-turn)

RL Env2 frontier
15
15
15
15
15
Model1mo ago
Qwen3.6 35B A3B
Alibaba

Qwen3.6 35B A3B is an AI model from Alibaba.

262K$0.84/M172 tok/s
Physgym Arena Medley Public92
τ²-bench (Tau²-bench)85
GPQA Diamond82
Model1mo ago
Claude Opus 4.7
Anthropic

Claude Opus 4.7 is an AI model from Anthropic.

1M$11/M46 tok/sproprietaryClosed
Physgym Arena Medley Public100
LiveBench - Math93
GPQA Diamond89
Eval1mo ago
Agriculture Qa
RL Env

Evaluates the agricultural knowledge of LLMs - Agriculture QA Environment for Prime Intellect Environments Hub

RL Env1 frontier
87
Model1mo ago
JT-MINI
China Mobile

JT-MINI is an AI model from China Mobile.

τ²-bench (Tau²-bench)93
GPQA Diamond68
IFBench37
Model1mo ago
EXAONE 4.5 33B
LG AI Research

EXAONE 4.5 33B is an AI model from LG AI Research.

GPQA Diamond79
τ²-bench (Tau²-bench)78
IFBench58
Model1mo ago
Muse Spark
Meta Platforms

muse-spark is an AI model from Meta Platforms.

proprietaryClosed
τ²-bench (Tau²-bench)92
GPQA Diamond88
MedScribe86
Model1mo ago
Grok 4.20
xAI

Grok 4.20 0309 v2 is an AI model from xAI.

2M$3/M166 tok/s
Infraresolutionbench85
GPQA Diamond78
τ²-bench (Tau²-bench)60
Model1mo ago
GLM 5.1
Zai

GLM-5.1 is an AI model from Zai, released with open weights.

203K$2.15/M57 tok/smitOpen
τ²-bench (Tau²-bench)97
MMLU-Pro85
LiveBench - Math85
Model1mo ago
Solar Pro 3
Upstage

Solar Pro 3 is an AI model from Upstage.

128K
τ²-bench (Tau²-bench)86
GPQA Diamond72
IFBench71
Eval1mo ago
Infraresolutionbench
RL Env

Prime verifiers environment for InfraResolutionBench

RL Env4 frontier
92
85
85
85
84
84
Eval2mo ago
AutomationBench
RL Env

Evaluates AI agents on realistic, multi-step business workflows across 47 simulated SaaS tools.

RL Env2 frontier
56
48
39
37
36
33
Model2mo ago
Gemma 4 E4B
Google (Alphabet Inc.)

Gemma 4 E4B is an AI model from Google (Alphabet Inc.).

8K
GPQA Diamond55
IFBench41
τ²-bench (Tau²-bench)26
Model2mo ago
Gemma 4 E2B
Google (Alphabet Inc.)

Gemma 4 E2B is an AI model from Google (Alphabet Inc.).

8K
GPQA Diamond41
IFBench34
τ²-bench (Tau²-bench)22
Model2mo ago
Step 3.5 Flash
Stepfun

step-3.5-flash is an AI model from Stepfun, released with open weights.

262K216 tok/sapache-2.0Open
τ²-bench (Tau²-bench)87
GPQA Diamond83
Infraresolutionbench76
Model2mo ago
Gemma 4 26B A4B
Google (Alphabet Inc.)

gemma-4-26b-a4b is an AI model from Google (Alphabet Inc.), released with open weights.

$0.2/M70 tok/sapache-2.0Open
GPQA Diamond71
IFBench45
τ²-bench (Tau²-bench)40
Model2mo ago
Qwen3.6 Plus
Alibaba

Qwen3.6 Plus is an AI model from Alibaba.

1M$1.13/M53 tok/sproprietaryClosed
τ²-bench (Tau²-bench)98
GPQA Diamond88
LiveBench - Math84
Model2mo ago
Trinity Large Thinking
Arcee AI

trinity-large-thinking is an AI model from Arcee AI, released with open weights.

262K$0.4/M175 tok/sapache-2.0Open
τ²-bench (Tau²-bench)90
GPQA Diamond75
Infraresolutionbench73
Model2mo ago
MiniMax M2.7
Minimax

minimax-m2.7 is an AI model from Minimax.

205K$0.53/M66 tok/sModified MITClosed
GPQA Diamond87
τ²-bench (Tau²-bench)85
LiveBench - Math81
Model2mo ago
GPT-5.4 Nano
OpenAI

GPT-5.4 nano is an AI model from OpenAI.

400K$0.46/M154 tok/sproprietaryClosed
Polars Env92
LiveBench - Math83
MedScribe77
Model2mo ago
GPT-5.4 Mini
OpenAI

GPT-5.4 mini is an AI model from OpenAI.

400K$1.69/M166 tok/sproprietaryClosed
Infraresolutionbench80
LiveBench - Coding71
TaxEval v271
Model2mo ago
Grok 4.20 0309
xAI

Grok 4.20 0309 is an AI model from xAI.

$3/M192 tok/s
GPQA Diamond79
TaxEval v274
τ²-bench (Tau²-bench)70
Model2mo ago
GPT-5.4
OpenAI

GPT-5.4 is an AI model from OpenAI.

1.1M$5.63/M68 tok/sproprietaryClosed
LiveBench - Math90
GPQA (Full Set)87
LiveBench - Reasoning86
Model3mo ago
Gemini 3.1 Flash Lite Preview
Google (Alphabet Inc.)

Gemini 3.1 Flash-Lite is an AI model from Google (Alphabet Inc.).

1.0M$0.56/M299 tok/sproprietaryClosed
GPQA Diamond82
IFBench77
LiveBench - Math74
Model3mo ago
Qwen3.5 4B
Alibaba

Qwen3.5 4B is an AI model from Alibaba.

$0.06/M189 tok/s
τ²-bench (Tau²-bench)92
Science Gym Chem84
GPQA Diamond77
Model3mo ago
Qwen3.5 0.8B
Alibaba

Qwen3.5 0.8B is an AI model from Alibaba.

$0.02/M
Science Gym Bio80
Teaching Env68
Science Gym Materials68
Model3mo ago
Gemini 3.1 Pro Preview
Google (Alphabet Inc.)

Gemini 3.1 Pro Preview is an AI model from Google (Alphabet Inc.).

1.0M$4.5/M132 tok/s
τ²-bench (Tau²-bench)96
GPQA Diamond94
GPQA (Full Set)93
Model3mo ago
Claude Sonnet 4.6
Anthropic

Claude Sonnet 4.6 is an AI model from Anthropic.

1M$6.56/M49 tok/sproprietaryClosed
Infraresolutionbench92
Agriculture Qa87
LiveBench - Math87
Model3mo ago
Qwen3.5 397B A17B
Alibaba

qwen/qwen3.5-397b-a17b is an AI model.

262K$1.35/M52 tok/sapache-2.0Open
Physgym Arena Medley Public100
τ²-bench (Tau²-bench)96
GPQA Diamond89
Model3mo ago
MiniMax M2.5
Minimax

minimax-m2.5 is an AI model from Minimax.

205K$0.53/M178 tok/sModified MITClosed
τ²-bench (Tau²-bench)95
GPQA Diamond85
GPQA (Full Set)84
Model3mo ago
GLM 5
Zai

GLM-5 is an AI model from Zai, released with open weights.

203K$1.55/M67 tok/smitOpen
τ²-bench (Tau²-bench)97
LiveBench - Math83
Infraresolutionbench83
Model3mo ago
Claude Opus 4.6
Anthropic

Claude Opus 4.6 is an AI model from Anthropic.

1M$11/M50 tok/sproprietaryClosed
LiveBench - Math89
LiveBench - Reasoning89
MedScribe86
Model3mo ago
GPT-5.3-Codex
OpenAI

GPT-5.3 Codex is an AI model from OpenAI.

400K$4.81/M84 tok/sproprietaryClosed
GPQA Diamond92
LiveBench - Math88
τ²-bench (Tau²-bench)86
Model3mo ago
Qwen3 Coder Next
Alibaba

Qwen3 Coder Next is an AI model from Alibaba.

262K$0.56/M133 tok/sunknownOpen
τ²-bench (Tau²-bench)80
GPQA Diamond74
IFBench35
Model5mo ago
MiniMax M2.1
Minimax

minimax-m2.1 is an AI model from Minimax, released with open weights.

205K · $0.53/M · 188 tok/s · mit · Open
MMLU-Pro · 88
τ²-bench (Tau²-bench) · 85
GPQA Diamond · 83
Model · 5mo ago
GLM 4.7
Zai

GLM-4.7 is an AI model from Zai, released with open weights.

203K · $1/M · 84 tok/s · mit · Open
Nsa Codebreaker · 100
τ²-bench (Tau²-bench) · 94
GSM8K · 90
Model · 5mo ago
Gemini 3 Flash Preview
Google (Alphabet Inc.)

Gemini 3 Flash Preview (Reasoning) is an AI model from Google (Alphabet Inc.).

1.0M · $1.13/M · 186 tok/s
AIME 2025: Problems from the American Invitational Mathematics Examination · 97
LiveCodeBench · 91
GPQA Diamond · 90
Model · 5mo ago
GPT-5.2
OpenAI

GPT-5.2 is an AI model from OpenAI.

400K · $4.81/M · 71 tok/s · proprietary · Closed
Bb Demo · 100
LiveBench - Math · 93
IDE-Bench · 85
Model · 5mo ago
GPT-5.2-Codex
OpenAI

GPT-5.2 Codex (xhigh) is an AI model from OpenAI.

400K · $4.81/M · 110 tok/s · proprietary · Closed
τ²-bench (Tau²-bench) · 92
GPQA Diamond · 90
LiveBench - Math · 89
Model · 6mo ago
DeepSeek V3.2
DeepSeek

DeepSeek V3.2 is an AI model from DeepSeek, released with open weights.

131K · $0.78/M · mit · Open
LiveBench - Math · 85
MMLU-Pro · 84
τ²-bench (Tau²-bench) · 79
Model · 6mo ago
Claude Opus 4.5
Anthropic

Claude Opus 4.5 is an AI model from Anthropic.

200K · $11/M · 55 tok/s · proprietary · Closed
Nsa Codebreaker · 100
LiveBench - Math · 90
MMLU-Pro · 89
Model · 6mo ago
Grok 4.1 Fast
xAI

Grok 4.1 Fast is an AI model from xAI.

proprietary · Closed
Medpt · 100
LiveBench - Math · 84
LiveBench - Reasoning · 80
Model · 6mo ago
Gemini 3 Pro Preview
Google (Alphabet Inc.)

Gemini 3 Pro Preview (low) is an AI model from Google (Alphabet Inc.).

1M · $4.5/M
MMLU-Pro · 90
GPQA Diamond · 89
AIME 2025: Problems from the American Invitational Mathematics Examination · 87
Model · 6mo ago
GPT-5.1
OpenAI

GPT-5.1 is an AI model from OpenAI.

400K · $3.44/M · 125 tok/s · proprietary · Closed
MedScribe · 88
LiveBench - Math · 87
Infraresolutionbench · 82
Model · 7mo ago
Claude 4.5 Haiku
Anthropic

Claude 4.5 Haiku is an AI model from Anthropic.

200K · $2.19/M · 106 tok/s
DABstep · 90
MedScribe · 85
MMLU-Pro · 80
Model · 8mo ago
GLM 4.6
Zai

GLM-4.6 is an AI model from Zai, released with open weights.

203K · $1/M · 48 tok/s · mit · Open
LiveBench - Math · 81
MMLU-Pro · 78
τ²-bench (Tau²-bench) · 77
Model · 8mo ago
Claude Sonnet 4.5
Anthropic

Claude Sonnet 4.5 is an AI model from Anthropic.

200K · $6.56/M · 49 tok/s · Proprietary · Closed
IDE-Bench · 88
MMLU-Pro · 86
MedScribe · 84
Model · 8mo ago
Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning)
Google (Alphabet Inc.)

Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) is an AI model from Google (Alphabet Inc.).

1M · proprietary · Closed
MMLU-Pro · 84
MedScribe · 78
GPQA Diamond · 77
Model · 8mo ago
Gemini 2.5 Flash Lite Preview 09-2025
Google (Alphabet Inc.)

Gemini 2.5 Flash-Lite Preview (Sep '25) is an AI model from Google (Alphabet Inc.).

1.0M · $0.17/M
MMLU-Pro · 80
MedScribe · 67
GPQA Diamond · 65
Model · 8mo ago
Grok 4 Fast
xAI

Grok 4 Fast is an AI model from xAI.

$0.28/M · proprietary · Closed
MedScribe · 80
MMLU-Pro · 73
TaxEval v2 · 72
Model · 8mo ago
Magistral Medium 1.2
Mistral AI

Magistral Medium 1.2 is an AI model from Mistral AI.

$2.75/M · 40 tok/s
AIME 2025: Problems from the American Invitational Mathematics Examination · 82
MMLU-Pro · 82
LiveCodeBench · 75
Model · 9mo ago
Qwen3 Max
Alibaba

Alibaba's >1T-parameter dense Qwen3 flagship, available only as a closed API on Qwen Chat and Alibaba Cloud.

262K · $3.05/M · 52 tok/s · Proprietary · Closed
MMLU-Pro · 84
AIME 2025: Problems from the American Invitational Mathematics Examination · 81
Physgym Arena Medley Public · 80
Model · 9mo ago
Hermes 4 - Llama-3.1 70B
Nous Research

Nous Research's mid-size hybrid-reasoning post-training of Llama-3.1-70B with switchable <think> mode and JSON-schema-faithful outputs.

131K · $0.2/M · 92 tok/s · Llama 3 Community · Open
MMLU-Pro · 66
GPQA Diamond · 49
IFBench · 29
Model · 9mo ago
Gemma 3 270M
Google (Alphabet Inc.)

Gemma 3 270M is an AI model from Google (Alphabet Inc.).

8K · unknown · Open
GPQA Diamond · 22
IFBench · 12
τ²-bench (Tau²-bench) · 9
Model · 9mo ago
GPT-5 Mini
OpenAI

GPT-5 mini is an AI model from OpenAI.

400K · $0.69/M · 86 tok/s · proprietary · Closed
Gpu Puzzles Modal · 100
Csv Qa · 100
DABstep · 100
Model · 9mo ago
GPT-5
OpenAI

OpenAI's August 2025 unified frontier model that auto-routes between a fast model and a deeper "thinking" variant.

400K · $3.44/M · 176 tok/s · Proprietary · Closed
Gutenberg Env · 100
MATH · 100
AIME 2024: Problems from the American Invitational Mathematics Examination · 95
Model · 9mo ago
GPT-5 Nano
OpenAI

GPT-5 nano is an AI model from OpenAI.

400K · $0.14/M · 158 tok/s · proprietary · Closed
Mostly Basic Python Problems (MBPP) · 100
Gutenberg Env · 100
Math Python · 100
Model · 10mo ago
Qwen3 4B 2507 Instruct
Alibaba

Qwen3 4B 2507 Instruct is an AI model from Alibaba.

unknown · Open
MMLU-Pro · 67
AIME 2025: Problems from the American Invitational Mathematics Examination · 52
GPQA Diamond · 52
Model · 10mo ago
Claude 4.1 Opus
Anthropic

Claude 4.1 Opus is an AI model from Anthropic.

200K · $33/M · 37 tok/s · proprietary · Closed
MMLU-Pro · 88
GPQA Diamond · 81
AIME 2025: Problems from the American Invitational Mathematics Examination · 80
Model · 10mo ago
Qwen3 Coder 30B A3B Instruct
Alibaba

Qwen3 Coder 30B A3B Instruct is an AI model from Alibaba.

160K · $0.35/M · 105 tok/s · unknown · Open
MATH-500 · 89
MMLU-Pro · 71
Acebench Agent Multistep · 63
Model · 11mo ago
Gemini 2.5 Flash Lite
Google (Alphabet Inc.)

Gemini 2.5 Flash-Lite is an AI model from Google (Alphabet Inc.).

1.0M · $0.17/M · 247 tok/s
MATH-500 · 93
MedScribe · 73
MMLU-Pro · 72
Eval · 11mo ago
IFBench
Allen Institute for AI

Instruction-following benchmark measuring adherence to multi-step constraints.

4 frontier
81
80
79
78
78
77
Eval · 11mo ago
τ²-bench (Tau²-bench)
Sierra

Sierra's dual-control extension of τ-bench - now the user is also an LLM and both agents share access to the same tool-driven environment.

Active · Tool Calling · Multi Turn Dialog · Planning
99
99
99
99
98
97
Model · 1y ago
Gemini 2.5 Pro
Google (Alphabet Inc.)

Gemini 2.5 Pro is an AI model from Google (Alphabet Inc.).

1.0M · $3.44/M · 142 tok/s · proprietary · Closed
MATH-500 · 97
AIME 2024: Problems from the American Invitational Mathematics Examination · 89
AIME 2025: Problems from the American Invitational Mathematics Examination · 88
Model · 1y ago
Claude 4 Sonnet
Anthropic

Claude 4 Sonnet is an AI model from Anthropic.

200K · $6.56/M · 50 tok/s · proprietary · Closed
MATH-500 · 93
Mini Swe Agent Bench · 87
Wiki Race · 84
Model · 1y ago
Gemini 2.5 Flash
Google (Alphabet Inc.)

Gemini 2.5 Flash is an AI model from Google (Alphabet Inc.).

1.0M · $0.85/M · 200 tok/s · proprietary · Closed
Complex Worlds Hack · 100
MATH-500 · 93
Arena-Hard · 84
Model · 1y ago
Qwen3 4B
Alibaba

Qwen3 4B is an AI model from Alibaba.

$0.19/M · 105 tok/s · unknown · Open
Roi Calculator · 100
Business Valuation · 94
Irr Calculator · 92
Model · 1y ago
Qwen3 0.6B
Alibaba

Qwen3 0.6B is an AI model from Alibaba.

$0.19/M · 222 tok/s · unknown · Open
MATH-500 · 52
MMLU-Pro · 23
GPQA Diamond · 23
Model · 1y ago
Qwen3 30B A3B
Alibaba

Qwen3 30B A3B is an AI model from Alibaba.

131K · $0.13/M · 70 tok/s · apache-2.0 · Open
MATH-500 · 86
Med Agent Bench · 84
Med Agent Bench · 84
Model · 1y ago
o4 Mini
OpenAI

o4 Mini is an AI model from OpenAI.

200K · $1.93/M · 178 tok/s · proprietary · Closed
MATH-500 · 99
Clrs Algorithms · 96
AIME 2024: Problems from the American Invitational Mathematics Examination · 94
Model · 1y ago
GPT-4.1 Mini
OpenAI

GPT-4.1 Mini is an AI model from OpenAI.

1.0M · $0.7/M · 84 tok/s · proprietary · Closed
Tau Bench Env · 100
Gutenberg Env · 100
Skyrl Sql · 100
Model · 1y ago
GPT-4.1 Nano
OpenAI

GPT-4.1 Nano is an AI model from OpenAI.

1.0M · $0.17/M · 169 tok/s · proprietary · Closed
MATH-500 · 85
Wiki Search · 68
Wiki Search · 68
Model · 1y ago
o3
OpenAI

OpenAI's first true "reasoning at scale" model, announced Dec 2024 and publicly released April 2025, which crossed the human-expert ceiling on GPQA.

200K · $3.5/M · 137 tok/s · Proprietary · Closed
MATH-500 · 99
AIME 2024: Problems from the American Invitational Mathematics Examination · 97
Arena-Hard · 89
Model · 1y ago
Claude Sonnet 3.7
Anthropic

Claude Sonnet 3.7 is an AI model from Anthropic.

200K · $6.56/M · proprietary · Closed
LiveBench - Instruction Following · 86
MATH-500 · 85
MMLU-Pro · 80
Model · 1y ago
Phi 4 Mini Instruct
Phi

Phi-4-mini-instruct is an AI model, released with open weights.

131K · unknown · Open
IFEval · 74
BIG-Bench Hard (BBH) · 57
MMLU-Pro · 39
Model · 1y ago
DeepSeek V3
DeepSeek

DeepSeek V3 (Dec '24) is an AI model from DeepSeek.

131K · $0.52/M · unknown · Open
MATH-500 · 89
LiveBench - Instruction Following · 76
MMLU-Pro · 75
Model · 1y ago
Qwen2.5 Coder Instruct 7B
Alibaba

Qwen2.5 Coder Instruct 7B is an AI model from Alibaba.

unknown · Open
MATH-500 · 66
IFEval · 61
BIG-Bench Hard (BBH) · 50
Model · 1y ago
Llama-3.2-1B
Meta Platforms

Llama-3.2-1B is an AI model with 1.0B parameters, released with open weights.

131K · unknown · Open
MuSR · 34
BIG-Bench Hard (BBH) · 31
GPQA Diamond · 23
Model · 1y ago
Llama-3.2-3B
Meta Platforms

Llama-3.2-3B is an AI model with 3.0B parameters, released with open weights.

131K · unknown · Open
BIG-Bench Hard (BBH) · 39
MuSR · 36
GPQA Diamond · 27
Model · 1y ago
Qwen2.5-1.5B-Instruct
Alibaba

Qwen2.5-1.5B-Instruct is an AI model with 1.5B parameters, released with open weights.

unknown · Open
IFEval · 45
BIG-Bench Hard (BBH) · 43
MuSR · 37
Model · 1y ago
Qwen2.5-3B-Instruct
Alibaba

Qwen2.5-3B-Instruct is an AI model with 3.0B parameters, released with open weights.

unknown · Open
IFEval · 65
BIG-Bench Hard (BBH) · 47
MuSR · 40
Model · 1y ago
Qwen2.5 7B Instruct
Alibaba

Qwen2.5-7B-Instruct is an AI model with 7.0B parameters, released with open weights.

131K · unknown · Open
IFEval · 76
BIG-Bench Hard (BBH) · 54
MATH Level 5 · 50
Model · 1y ago
Qwen2.5-0.5B-Instruct
Alibaba

Qwen2.5-0.5B-Instruct is an AI model with 500M parameters, released with open weights.

unknown · Open
MuSR · 33
BIG-Bench Hard (BBH) · 33
IFEval · 32
Model · 1y ago
Qwen2.5-0.5B
Alibaba

Qwen2.5-0.5B is an AI model with 500M parameters, released with open weights.

unknown · Open
MuSR · 34
BIG-Bench Hard (BBH) · 33
GPQA Diamond · 25
Model · 1y ago
GPT-4o (2024-08-06)
OpenAI

GPT-4o (Aug '24) is an AI model from OpenAI.

128K · $4.38/M · 117 tok/s
MATH-500 · 79
LiveBench - Instruction Following · 73
TaxEval v2 · 71
Model · 1y ago
Llama-3.1-8B
Meta Platforms

Llama-3.1-8B is an AI model with 8.0B parameters, released with open weights.

131K · unknown · Open
BIG-Bench Hard (BBH) · 47
MuSR · 38
MMLU-Pro · 33
Eval · 1y ago
τ-bench (tau-bench)
Sierra

Multi-turn customer-service simulation testing whether agents follow domain policies while interacting with a tool-using user simulator.

Active · Tool Calling · Multi Turn Dialog · Instruction Following
63
33
23
Model · 2y ago
Meta-Llama-3-8B
Meta Platforms

Meta-Llama-3-8B is an AI model with 8.0B parameters, released with open weights.

8K · unknown · Open
BIG-Bench Hard (BBH) · 46
MuSR · 36
MMLU-Pro · 32
RL Env · 2y ago
BrowserGym
ServiceNow Research

ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and more under one Playwright-backed interface.

RL Env · Browser Use · Tool Calling · Planning
Model · 2y ago
Mistral-7B-Instruct-v0.2
Mistral AI

Mistral-7B-Instruct-v0.2 is an AI model with 7.0B parameters, released with open weights.

33K · unknown · Open
IFEval · 55
BIG-Bench Hard (BBH) · 45
MuSR · 40
Eval · 2y ago
IFEval
Google DeepMind

500 prompts with verifiable instruction-following constraints (word counts, casing, JSON format) checked by deterministic rules - no LLM judge needed.

Active · Instruction Following · 2 frontier
90
87
86
84
83
83
Preference · 4y ago
Anthropic HH-RLHF
Anthropic

Anthropic's foundational helpful-and-harmless human preference dataset - the first major public RLHF corpus and a long-time community baseline.

Preference · Safety · Jailbreak Resistance · Multi Turn Dialog
RL Env · 5y ago
ALFWorld
MIT CSAIL

Aligned text-and-3D embodied environment - agents learn household tasks (pick & place, heat, cool, clean) as both TextWorld games and visually-rendered ALFRED scenes.

RL Env · Embodied · Planning · Instruction Following