Xiangru Tang

SWE-Milestone: Evaluating AI Agents on Continuous Software Evolution

arXiv 2026

LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

arXiv 2026

Agentic Reasoning for Large Language Models

arXiv 2026

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

arXiv 2025

LocAgent: Graph-Guided LLM Agents for Code Localization

arXiv 2025

MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

arXiv 2025

Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

arXiv 2025

Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

arXiv 2025

CellForge: Agentic Design of Virtual Cell Models

arXiv 2025

SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

arXiv 2025

Improving Context Fidelity via Native Retrieval-Augmented Reasoning

arXiv 2025

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

arXiv 2025

InteractComp: Evaluating Search Agents With Ambiguous Queries

arXiv 2025

MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

arXiv 2025

Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards

arXiv 2025

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

arXiv 2025

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

CVPR 2025

MedicalAgentsBench for Complex Medical Reasoning: Comparing Internalized Reasoning Models versus Externalized Agent-based Frameworks

arXiv 2025

AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning

preprint

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

arXiv 2025

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

arXiv 2025

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

arXiv 2024

A Survey of Generative AI for de novo Drug Design: New Frontiers in Molecule and Protein Generation

arXiv 2024

PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes

arXiv 2024

ChatCell: Facilitating Single-Cell Analysis with Natural Language

arXiv 2024

RWKV: Reinventing RNNs for the Transformer Era

arXiv 2023

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

arXiv 2023

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

arXiv 2023

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning

arXiv 2023

OctoPack: Instruction Tuning Code Large Language Models

arXiv 2023

Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents

arXiv 2023

Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

arXiv 2023

DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents

arXiv 2023

QTSumm: Query-Focused Summarization over Tabular Data

arXiv 2023

Investigating Table-to-Text Generation Capabilities of LLMs in Real-World Information Seeking Scenarios

arXiv 2023

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

arXiv 2023

BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models

arXiv 2023