Papers

Trending research and the full catalog - each paper linked to the benchmarks, methods, and models it introduces.

Filtered by domain: Agents

AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interaction

22 Jun 2026

AI agents are driving a new software paradigm, with the ability to autonomously call tools, extract information, manage memory, and complete tasks that span applications and data sources.

Agents

89 · 0.5/h

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

Jiacheng Liu, Xiaohan Zhao, Xinyi Shang et al. · 14 Apr 2026

Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user.

Agents · Coding Agents

1.8k · 0.3/h

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Caiming Xiong, Xiao Wang, Jiaqi Liu et al. · 19 May 2026

Automating scientific discovery requires more than generating papers from ideas.

Agents

14k · 0.3/h

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

26 May 2026

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token.

Agents · Language Modeling

348 · 0.1/h

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Lin Chen, Zhen Fang, Wenxuan Huang et al. · 19 Apr 2026

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills.

Agents

37 · 0.0/h

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

17 Jun 2026

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes,…

Agents · Language Modeling

110 · 0.3/h

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

Zheng Liu, Hao Li, Qian Yu et al. · 28 Apr 2026

Autonomous scientific research is significantly advanced thanks to the development of AI agents.

Agents · Language Modeling

45 · 0.0/h

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Andy Konwinski, Etash Guha, John Yang et al. · 17 Jan 2026

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains.

Agents · Coding Agents · Language Modeling

2.4k

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

10 Jun 2026

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts.

Agents

766 · 0.4/h

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

28 May 2026

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains.

Agents · Coding Agents · Language Modeling

174 · 0.1/h

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

15 Jun 2026

As LLM agents become capable of increasingly long-horizon tasks, evaluating their performance in economic systems is becoming increasingly important. Unlike existing benchmarks that primarily evaluate a single agent interacting with a passive environment, economic systems are…

Agents · Language Modeling

11 · 0.3/h

Agents' Last Exam

3 Jun 2026

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains.

Agents

757 · 0.2/h

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

12 Jun 2026

We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using…

Agents · Coding Agents · Instruction Following · Language Modeling

1.5k · 0.2/h

Joint Agent Memory and Exploration Learning via Novelty Signals

1 Jun 2026

In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effective exploration requires memory, but retaining raw interaction histories is computationally expensive over long trajectories.

Agents · Language Modeling

12 · 0.0/h

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

22 Jun 2026

Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions.

Agents

Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

24 Jun 2026

Tool Calling and Structured Output are two core capabilities of modern Agent systems, yet their interaction under joint deployment conditions remains insufficiently understood.

Agents · Instruction Following · Language Modeling