Feed
Trending and latest across evals, tools, models, and papers.
Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time.
Yejin Choi, James Zou, Katherine Tieu et al. · arXiv 2025
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to…
Muhao Chen, Tenghao Huang, Chenghao Yang et al. · arXiv 2025
Alignment has greatly improved the output quality of large language models (LLMs) at the cost of diversity, yielding highly similar outputs across generations, especially in open-ended generation tasks.
Mike Zhang, Johannes Bjerva, Rasmus Aavang et al. · arXiv 2025
Accurate tagging of earnings reports can yield significant short-term returns for stakeholders. The machine-readable inline eXtensible Business Reporting Language (iXBRL) is mandated for public financial filings.
Michael I. Jordan, Jitendra Malik, Anastasios N. Angelopoulos et al. · arXiv 2024
The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation.
Youngjae Yu, Youngmin Kim, Woohyun Cho et al. · arXiv 2025
Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues.
Víctor Gallego · arXiv 2026
We study LLM policy synthesis: using a language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates…
Martin Fajcik, Pavel Smrz, Martin Docekal · arXiv 2024
This paper introduces OARelatedWork: a dataset for related work generation from open-access sources. It is the first large-scale multi-document summarization dataset for related work generation, containing whole related work sections and full texts of cited papers.
Ido Hakimi, Andreas Krause, Barna Pásztor et al. · arXiv 2026
Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains.
Danilo Mandic, Yuxuan Gu, Wuyang Zhou et al. · arXiv 2026
The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a…
Shuai Shao, Tianyi Zhou, Dongrui Liu et al. · arXiv 2026
As large language models (LLMs) advance their mathematical capabilities toward the IMO and research level, the scarcity of challenging, high-quality problems has become a significant bottleneck for training, evaluation and self-evolution of LLMs.
Senthil Palanisamy, Abhishek Anand, Satpal Singh Rathor et al. · arXiv 2026
Vision-language-action (VLA) models have driven demand for large-scale egocentric datasets, yet the hardware and infrastructure to collect long-horizon data remain inaccessible.
Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov et al. · arXiv 2026
Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites.
Yifan Wang, Zheng Wei, Yang Tang et al. · arXiv 2026
Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead.
Frank F. Xu, Graham Neubig, Shuyan Zhou et al. · arXiv 2026
Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding…
Yukang Chen, Yao Lu, Song Han et al. · arXiv 2025
Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage.
Kangwook Lee, Jongwon Jeong, Chungpa Lee et al. · arXiv 2025
Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores.
Yue Zhang, Mingyu Ding, Huaxiu Yao et al. · arXiv 2026
Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints.
Jiaxuan You, Zijia Liu, Peixuan Han · arXiv 2025
Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle…
Ziqiao Ma, Yijiang Li, Qingying Gao et al. · arXiv 2025
Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several…
Qi Liu, Shuo Yu, Enhong Chen et al. · arXiv 2025
Large language models (LLMs) have rapidly evolved from single-turn text generators into the foundation of increasingly capable agents. As these agents take on more complex reasoning, decision making, tool use, and long-horizon tasks, reinforcement learning (RL) is becoming…
Lei Bai, Yu-Gang Jiang, Ming Zhang et al. · arXiv 2026
Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows.
Xunliang Cai, Binbin Zheng, Xing Ma et al. · arXiv 2026
On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult.
Mohit Bansal, Tianlong Chen, Justin Chih-Yao Chen et al. · arXiv 2025
Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too coarse-grained, since different instances may require different expertise.
Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods…
Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning is able to…
LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person-grounded agents remains difficult because actionable knowledge associated with a person…
LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions.
Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding.
Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite their effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly…
Jiajie Zhang, Juanzi Li, Nianyi Lin et al. · arXiv 2025
A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) is the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation during training.
Xiaojun Wan, Li Lin, Xinyu Hu · arXiv 2025
Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs.
Jaehyung Kim, Hamin Koo · arXiv 2025
Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric training data leads to significant performance degradation in non-English languages.
Ziran Wang, Aniket Bera, Damon Conover et al. · arXiv 2026
Offline goal-conditioned reinforcement learning (GCRL) learns goal-reaching behaviors from static datasets, but accurate value estimation remains challenging under limited state-action coverage.
Hao Wu, Runtian Wang, Renhao Xue et al. · arXiv 2026
The inverse problem of multilayer thin-film optical coatings design represents a complex combinatorial-continuous optimization challenge. We present PRISM (Position-encoded Regressive Inverse Spectral Model), a unified decoder-only autoregressive transformer that streamlines…
Yangqiu Song, Hong Ting Tsang, Jiaxin Bai et al. · arXiv 2026
Data critical to real-world decision-making is increasingly found within organizations. Such data is heterogeneous, constantly evolving, and only imperfectly captured. However, current data management systems remain largely passive, retrieving what is explicitly stored while…
Lukas Schiesser, Cornelius Wolff, Sophie Haas et al. · arXiv 2025
Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) is a promising paradigm for few-shot image classification (FSIC), but prior work has underexplored the relative…
Kai Chen, Bin Yu, Shijie Lian et al. · arXiv 2026
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset…
Xin Liu, Shan Yang, Zhangyang Wang et al. · arXiv 2026
Associative memory has long underpinned the design of sequential models. Beyond recall, humans reason by projecting future states and selecting goal-directed actions, a capability that modern language models increasingly require but do not natively encode.
Jian Mu, Shuang Qiu, Yao Shu et al. · arXiv 2025
Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms.
Jun Zhao, Tian Liang, Minzheng Wang et al. · arXiv 2025
Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via the Transformer's…
Jae-Joon Kim, Jiwon Song, Dongwon Jo et al. · arXiv 2026
The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain…
Mikhail Burtsev, Yuri Kuratov, Aydar Bulatov et al. · arXiv 2026
Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead.
Ali Farhadi, Tim Dettmers, Saurabh Shah et al. · arXiv 2026
Open-weight coding agents should hold a fundamental advantage over closed-source systems because they can specialize to private codebases, encoding repository-specific information directly in their weights.
Mario Giulianelli, Gabriele Sarti, Raghu Arghal et al. · arXiv 2026
Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with…
Jiaqi Liu, Mingyu Ding, Gang Wang et al. · arXiv 2025
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the action chunk length used during training, termed horizon.
Wanxiang Che, Yang Yue, Xianzhen Luo et al. · arXiv 2026
Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions.
Yijiang Li, Zhongzhi Li, Lijie Hu et al. · arXiv 2026
The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics…
Yao Hu, Shaosheng Cao, Fei Zhao et al. · arXiv 2026
Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code.
Blog-grounded Backdoor IFEval reward-hacking environment with hidden silver reward.
Teach a small model to classify trademarks on the Abercrombie distinctiveness spectrum.
Claude Opus 4.8 (Adaptive Reasoning, Max Effort) is an AI model from Anthropic.
Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric…
Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion.
Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace.
Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain…
While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis.
Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks.
Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological…
Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving…
We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance.
One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its…
Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive…
Natural generation allows Large Language Models (LLMs) to produce free-form responses with rich reasoning, yet the lack of structure makes outputs difficult to verify. Conversely, constrained decoding ensures standardized formats but can inadvertently restrict reasoning…
Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post-hoc textual…
Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval.
Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs.
Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can…
Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to…
Chao Zhang, Yutong Zhang, Yan Liu et al. · arXiv 2025
Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task…
Markus Enzweiler, Kadir-Kaan Özer, René Ebeling · arXiv 2026
Multivariate time series anomalies often manifest as shifts in cross-channel dependencies rather than simple amplitude excursions. In autonomous driving, for instance, a steering command might be internally consistent but decouple from the resulting lateral acceleration.
Qin Li, Raymond Khazoum, Daniela Fernandes et al. · arXiv 2025
Mental rotation -- the ability to compare objects seen from different viewpoints -- is a fundamental example of mental simulation and spatial world modeling in humans. Here we propose a mechanistic model of human mental rotation, leveraging recent advances in deep, equivariant,…
Arman Cohan, Yixin Liu, Doug Downey et al. · arXiv 2025
Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no…
Jialong Wu, Ningya Feng, Mingsheng Long et al. · arXiv 2024
Effective code optimization in compilers is crucial for computer and software engineering. The success of these optimizations primarily depends on the selection and ordering of the optimization passes applied to the code.
Xiangyu Zhao, Wei Huang, Ziwei Liu et al. · arXiv 2025
Conventional Sequential Recommender Systems (SRS) typically assign unique hash IDs (HID) to construct item embeddings, which mainly capture collaborative signals from historical user-item interactions.
Zhuowen Tu, Catherine Arnett, Tyler A. Chang et al. · arXiv 2024
For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Despite state-of-the-art performance on reasoning tasks, we find that these models still struggle with basic grammatical text generation in…
Min Zhang, Liang Ding, Miao Zhang et al. · arXiv 2026
While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual agents. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their adaptability.
Dmitry Vetrov, Andrey Okhotin, Maksim Nakhodnov et al. · arXiv 2025
In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel (S.U.N.) crystalline materials.
Yuchen Yan, Wenqi Zhang, Weiming Lu et al. · arXiv 2025
LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends on environmental state that the instruction does not mention.
Mor Geva, Atticus Geiger, Yoav Gur-Arieh · arXiv 2025
A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent "Ann loves pie" by binding "Ann" to "pie", allowing it to later retrieve "Ann" when asked…
Jie Zhou, Fandong Meng, Liyan Xu et al. · arXiv 2026
Chain-of-thought (CoT) reasoning has become a central mechanism for eliciting multi-step reasoning in Large Language Models (LLMs). Yet recent evidence presents a tension: hidden states appear to already encode future reasoning before CoT fully unfolds, while explicit steps…
Jiaqi Wang, Yang Liu, Fan Zhang et al. · arXiv 2026
Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored.
Matt Fredrikson, Siheng Xiong, Xiaoze Liu et al. · arXiv 2026
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain bottlenecked by discrete text communication, which imposes runtime overhead and information quantization loss.
Yilun Du, Benhao Huang, Hansen Jin Lillemark et al. · arXiv 2026
Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects.
Yue Zhang, Jaemin Cho, Mohit Bansal et al. · arXiv 2025
Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camera motions) to guide diffusion models as a structured prior, by rendering from estimated point clouds following camera trajectories.
Haobo Zhang, Jiayu Zhou · arXiv 2025
Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training.
Eric Xing, Zhoujun Cheng, Zhihan Yang et al. · arXiv 2025
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Within this family, Masked Diffusion Models (MDMs) currently perform best but still underperform AR models in perplexity and lack key…
Yin Jun Phua · arXiv 2026
Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific predicates and require retraining for each new task.
Yann LeCun, Lucas Maes, Quentin Le Lidec et al. · arXiv 2026
World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics.
Zhijian Liu, Jian Chen, Yesheng Liang · arXiv 2026
Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization.
Wentao Zhang, Yang Liu, Bo An et al. · arXiv 2025
Recent advances in LLM-based agent systems have shown promise on complex, long-horizon tasks, but existing agent protocols (e.g., A2A and MCP) do not adequately support lifecycle-aware coordination across agents, tools, and environments.
Talor Abramovich, Maor Ashkenazi, Carl et al. · arXiv 2026
Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for…
SWE-bench Pro environment backed by Harbor tasks.
Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are…
Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer and task-specific skills for dynamic execution.
Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large language models, yet evaluations largely emphasize downstream accuracy while overlooking the retention of pretrained capabilities.
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraining.
Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained structurally unchanged.
Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven…
Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse-exception binomial test.
Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted…
Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full network is assembled. This delayed-feedback challenge lies at the heart of the Transit Route Network Design Problem (TRNDP), where route…
The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral…
Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated difficult datasets, limiting scalable capability improvement.
Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost.
Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver…
Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families.
The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits…
LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema…
Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to reconstruct 3D scene structure not only from multi-view imagery, but even from a single view.
Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via Low-Rank Adaptation (LoRA). As the number of desired effects grows, storing and dynamically loading these numerous effect LoRAs significantly…
Ahmet Üstün, Malvina Nissim, Daniel Scalena et al. · arXiv 2025
With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt.
Nikolaos Aletras, Marco Valentino, Yuxiang Zhou et al. · arXiv 2026
Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices.
Yao Zhang, Zhuchenyang Liu, Yu Xiao · arXiv 2026
2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance.
Peter Devine, Walter Hernandez Cruz, Nikhil Vadgama et al. · arXiv 2026
We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO)…
Shaina Raza, Deval Pandya, Christos Emmanouilidis et al. · arXiv 2026
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored.
Hongyu Lin, Yaojie Lu, Xianpei Han et al. · arXiv 2026
Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers severely from calibration degeneration, where models become excessively over-confident in incorrect answers.
Tingwen Zhang, Ling Yue, Shaowu Pan et al. · arXiv 2026
Scientific papers use schematic diagrams to communicate methods, workflows, and system structure, yet existing scientific-figure corpora often mix them with plots, screenshots, and photographs and rarely preserve document context.
Hao Wu, Xin Qiu, Yunpu Ma et al. · arXiv 2026
Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead.
Chen Sun, Na Zhang, Hui Zeng et al. · arXiv 2023
This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework designed for assessing the generative capabilities of large Chinese language models across a spectrum of academic disciplines.
Lin Qiu, Oren Etzioni, Hannah Lee et al. · arXiv 2025
In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not…
Victor Zhong, YuXuan Li, William Yang Wang et al. · arXiv 2025
Does Reinforcement Learning (RL) merely amplify existing skills, or synthesize novel skills? We investigate this question through the lens of Complementary Reasoning: the critical practical capability of integrating internal knowledge with external context, a prerequisite for…
Borong Zhang, Zhe He, Edan Toledo et al. · arXiv 2025
How should we analyze memory in deep RL? We introduce tools for analyzing policies under partial observability and revealing how agents use memory to make decisions. To utilize these tools, we present POPGym Arcade, a collection of Atari-inspired, hardware-accelerated…
Zhaoxin Fan, Xiaomin Yu, Ziyue Qiao et al. · arXiv 2025
Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be…
PhysGym Arena DR-hard benchmark for domain-randomized Gym simulator repair
PhysGym Arena medley benchmark for achievable medium-hard Gym simulator repair

Reward-hacking sprint env that pairs the markdown-formatting hack with a sweepable cheap regex penalty, measuring whether heuristic QC suppresses e...
Reward-hacking sprint env. Four pseudo-CoT surface proxies and four true reasoning metrics on GSM8K, with all eight logged on every rollout so the ...
Backdoor-IFEval phase-transition lab: advantage-geometry metrics, quadrant fractions, and boundary-shifting interventions for reward-hacking law te...
Reward-hacking sprint env. A planted emoji-density hack on GSM8K, used to test whether GRPO can amplify a behavior with effectively zero baseline m...
Reward-hacking sprint env. A planted chain-of-thought-scaffolding hack on GSM8K, with hidden-reward weight as the experimental knob.
Reward-hacking sprint env. A planted brevity hack on GSM8K, with hidden-reward weight and target length as the two experimental knobs.
Reward-hacking sprint env. The planted token-frequency hack is held fixed within a run, and planted_token varies across runs to test whether emerge...
Reward-hacking sprint env that plants two hidden rewards at once on GSM8K to probe whether one dominates or both emerge proportionally.
Reward-hacking sprint env. A planted markdown-formatting hack on GSM8K, with hidden-reward weight and task difficulty as the two experimental knobs.
MiniCPM5-1B (Non-reasoning) is an AI model from OpenBMB.
Reward hacking sprint calibration environment for hidden keyword gradients in instruction following.
Variance-based early-warning circuit breaker for reward hacking. Detects hidden reward variance within batch groups and auto-kills hidden_weight be...
FIXED: Adaptive controller for reward hacking. Monitors visible delta AND hidden reward. Adapts check count 7→9. Original was bugged (blind to hidd...
Reward Hacking Sprint: does optimizing self-certainty (RLIF-style intrinsic reward) cause models to be confidently wrong on math? GSM8K, Llama-3.2-...
Unified backdoor-ifeval env plus SSAC/GDPO custom advantage helpers
Command A+ is an AI model from Cohere.
Qwen3.7Max is an AI model from Alibaba.
Gemini 3.5 Flash is an AI model from Google (Alphabet Inc.).
JT-35B-Flash is an AI model from China Mobile.
MathArena Apex Shortlist final-answer evaluation environment

FrontierScience PhD-level science evaluation environment
Multi-turn DevOps troubleshooting environment with simulated diagnostic tools
MiniCPM-V 4.6 1.3B is an AI model from OpenBMB.
Goblin IFEval environment with difficulty, aggregation, inoculation, and group monitors
Evaluates LLM explanations of textbook excerpts across pedagogy dimensions including concept coverage, coherence, prerequisite ordering, and origin...
Science Sim chemistry compound and reaction screening environment
Science Sim materials candidate ranking and simulation planning environment
Science Sim computational biology protein-variant decision environment
Ring-2.6-1T is an AI model from InclusionAI.
Polars DataFrame manipulation environment for training and evaluation
AR Credit Command Post Evals by Cognida.ai: enterprise mock-ERP credit hold and order release for AR automation agents (structured data only).
GPT-5.5 Instant (May 2026) is an AI model from OpenAI.
Crystal relaxation environment for RLM training, with multiple rubrics including format, composition, bond lengths, and formation energy.
A self-growing toolbench environment - early signs of self-improving agentic capability
Grok 4.3 is an AI model from xAI.
Granite 4.1 3B is an AI model from IBM.
Granite 4.1 30B is an AI model from IBM.
Nemotron 3 Nano Omni 30B A3B Reasoning is an AI model from NVIDIA.
Mistral Medium 3.5 is an AI model from Mistral AI.
granite-4.1-8b is an AI model from IBM, released with open weights.
Long-horizon physical-AI benchmark with dense Gemini rewards
DeepSeek V4 Flash is an AI model from DeepSeek, released with open weights.
DeepSeek's April 2026 next-gen open-weights flagship - 1.6T-total / 49B-active MoE with 1M context and DeepSeek Sparse Attention.
LongCoT long-horizon reasoning evaluation environment using RLM with Python REPL
Ling-2.6-1T is an AI model from InclusionAI.
Hy3-preview is an AI model from Tencent.
GPT-5.5 is an AI model from OpenAI.
Qwen3.6 27B is an AI model from Alibaba.
MiMo-V2.5 is an AI model from Xiaomi.
mimo-v2.5-pro is an AI model from Xiaomi, released with open weights.
Ling 2.6 Flash is an AI model from InclusionAI.
LongCoT evaluation environment using RLM
Kimi K2.6 (Non-reasoning) is an AI model from Kimi.
Qwen3.6 Max Preview is an AI model from Alibaba.
kimi-k2.6 is an AI model from Moonshot AI.
GraphWalks graph traversal evaluation environment (single-turn)
Qwen3.6 35B A3B is an AI model from Alibaba.
Claude Opus 4.7 is an AI model from Anthropic.
Evaluates the agricultural knowledge of LLMs - Agriculture QA Environment for Prime Intellect Environments Hub
JT-MINI is an AI model from China Mobile.
EXAONE 4.5 33B is an AI model from LG AI Research.
muse-spark is an AI model from Meta Platforms.
Grok 4.20 0309 v2 is an AI model from xAI.
GLM-5.1 is an AI model from Zai, released with open weights.
Solar Pro 3 is an AI model from Upstage.
Prime verifiers environment for InfraResolutionBench
Evaluates AI agents on realistic, multi-step business workflows across 47 simulated SaaS tools.
Gemma 4 E4B is an AI model from Google (Alphabet Inc.).
Gemma 4 E2B is an AI model from Google (Alphabet Inc.).
step-3.5-flash is an AI model from Stepfun, released with open weights.
gemma-4-26b-a4b is an AI model from Google (Alphabet Inc.), released with open weights.
Qwen3.6Plus is an AI model from Alibaba.
trinity-large-thinking is an AI model from Arcee AI, released with open weights.
minimax-m2.7 is an AI model from Minimax.
GPT-5.4 nano is an AI model from OpenAI.
GPT-5.4 mini is an AI model from OpenAI.
Grok 4.20 0309 is an AI model from xAI.
GPT-5.4 is an AI model from OpenAI.
Gemini 3.1 Flash-Lite is an AI model from Google (Alphabet Inc.).
Qwen3.5 4B is an AI model from Alibaba.
Qwen3.5 0.8B is an AI model from Alibaba.
Gemini 3.1 Pro Preview is an AI model from Google (Alphabet Inc.).
Claude Sonnet 4.6 is an AI model from Anthropic.
qwen/qwen3.5-397b-a17b is an AI model.
minimax-m2.5 is an AI model from Minimax.
GLM-5 is an AI model from Zai, released with open weights.
Claude Opus 4.6 is an AI model from Anthropic.
GPT-5.3 Codex is an AI model from OpenAI.
Qwen3 Coder Next is an AI model from Alibaba.
minimax-m2.1 is an AI model from Minimax, released with open weights.
GLM-4.7 is an AI model from Zai, released with open weights.
Gemini 3 Flash Preview (Reasoning) is an AI model from Google (Alphabet Inc.).
GPT-5.2 is an AI model from OpenAI.
GPT-5.2 Codex (xhigh) is an AI model from OpenAI.
DeepSeek V3.2 is an AI model from DeepSeek, released with open weights.
Claude Opus 4.5 is an AI model from Anthropic.
Grok 4.1 Fast is an AI model from xAI.
Gemini 3 Pro Preview (low) is an AI model from Google (Alphabet Inc.).
GPT-5.1 is an AI model from OpenAI.
Claude 4.5 Haiku is an AI model from Anthropic.
GLM-4.6 is an AI model from Zai, released with open weights.
anthropic/claude-sonnet-4.5 is an AI model.
Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) is an AI model from Google (Alphabet Inc.).
Gemini 2.5 Flash-Lite Preview (Sep '25) is an AI model from Google (Alphabet Inc.).
Grok 4 Fast is an AI model from xAI.
Magistral Medium 1.2 is an AI model from Mistral AI.
Alibaba's >1T-parameter dense Qwen3 flagship, available only as a closed API on Qwen Chat and Alibaba Cloud.
Nous Research's mid-size hybrid-reasoning post-training of Llama-3.1-70B with switchable <think> mode and JSON-schema-faithful outputs.
Gemma 3 270M is an AI model from Google (Alphabet Inc.).
GPT-5 mini is an AI model from OpenAI.
OpenAI's August 2025 unified frontier model that auto-routes between a fast model and a deeper "thinking" variant.
GPT-5 nano is an AI model from OpenAI.
Qwen3 4B 2507 Instruct is an AI model from Alibaba.
Claude 4.1 Opus is an AI model from Anthropic.
Qwen3 Coder 30B A3B Instruct is an AI model from Alibaba.
Gemini 2.5 Flash-Lite is an AI model from Google (Alphabet Inc.).
Instruction-following benchmark measuring adherence to multi-step constraints.
Sierra's dual-control extension of τ-bench - now the user is also an LLM and both agents share access to the same tool-driven environment.

Gemini 2.5 Pro is an AI model from Google (Alphabet Inc.).
Claude 4 Sonnet is an AI model from Anthropic.
Gemini 2.5 Flash is an AI model from Google (Alphabet Inc.).
Qwen3 4B is an AI model from Alibaba.
Qwen3 0.6B is an AI model from Alibaba.
Qwen3 30B A3B is an AI model from Alibaba.
o4 Mini is an AI model from OpenAI.
GPT-4.1 Mini is an AI model from OpenAI.
GPT-4.1 Nano is an AI model from OpenAI.
OpenAI's first true "reasoning at scale" model, announced Dec 2024 and publicly released April 2025, which crossed the human-expert ceiling on GPQA.
Claude Sonnet 3.7 is an AI model from Anthropic.
Phi-4-mini-instruct is an AI model, released with open weights.
DeepSeek V3 (Dec '24) is an AI model from DeepSeek.
Qwen2.5 Coder Instruct 7B is an AI model from Alibaba.
Llama-3.2-1B is an AI model with 1.0B parameters, released with open weights.
Llama-3.2-3B is an AI model with 3.0B parameters, released with open weights.
Qwen2.5-1.5B-Instruct is an AI model with 1.5B parameters, released with open weights.
Qwen2.5-3B-Instruct is an AI model with 3.0B parameters, released with open weights.
Qwen2.5-7B-Instruct is an AI model with 7.0B parameters, released with open weights.
Qwen2.5-0.5B-Instruct is an AI model with 500M parameters, released with open weights.
Qwen2.5-0.5B is an AI model with 500M parameters, released with open weights.
GPT-4o (Aug '24) is an AI model from OpenAI.
Llama-3.1-8B is an AI model with 8.0B parameters, released with open weights.
Multi-turn customer-service simulation testing whether agents follow domain policies while interacting with a tool-using user simulator.
Meta-Llama-3-8B is an AI model with 8.0B parameters, released with open weights.
ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and more under one Playwright-backed interface.
Mistral-7B-Instruct-v0.2 is an AI model with 7.0B parameters, released with open weights.
500 prompts with verifiable instruction-following constraints (word counts, casing, JSON format) checked by deterministic rules - no LLM judge needed.
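To illustrate what "checked by deterministic rules" can mean in practice, here is a minimal Python sketch of three rule-based checks (word count, casing, JSON validity) and a simple pass-rate score. The function names and the scoring scheme are hypothetical and chosen for illustration only; they are not the benchmark's actual implementation.

```python
import json
from typing import Callable, List


def check_min_words(response: str, min_words: int) -> bool:
    """Hypothetical check: at least `min_words` whitespace-separated words."""
    return len(response.split()) >= min_words


def check_all_lowercase(response: str) -> bool:
    """Hypothetical check: the response contains no uppercase letters."""
    return response == response.lower()


def check_valid_json(response: str) -> bool:
    """Hypothetical check: the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def score(response: str, checks: List[Callable[[str], bool]]) -> float:
    """Fraction of checks the response passes (an assumed scoring scheme)."""
    results = [check(response) for check in checks]
    return sum(results) / len(results)
```

Because every check is a pure function of the response string, the same response always gets the same score, which is what makes a judge model unnecessary for this kind of constraint.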

Anthropic's foundational helpful-and-harmless human preference dataset - the first major public RLHF corpus and a long-time community baseline.
Aligned text-and-3D embodied environment - agents learn household tasks (pick & place, heat, cool, clean) as both TextWorld games and visually-rendered ALFRED scenes.