Huaxiu Yao

MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild

arXiv 2026

SimpleMem: Efficient Lifelong Memory for LLM Agents

arXiv 2026

ClawArena: Benchmarking AI Agents in Evolving Information Environments

arXiv 2026

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

arXiv 2026

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

arXiv 2026

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

arXiv 2026

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

arXiv 2026

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

arXiv 2026

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

arXiv 2026

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

arXiv 2026

MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

arXiv 2025

On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

arXiv 2025

UQ: Assessing Language Models on Unsolved Questions

arXiv 2025

Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

arXiv 2025

Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization

arXiv 2025

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

arXiv 2025

Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

arXiv 2025

Adapting Web Agents with Synthetic Supervision

arXiv 2025

Autoregressive Models in Vision: A Survey

arXiv 2024

TrustLLM: Trustworthiness in Large Language Models

arXiv 2024

GRAPE: Generalizing Robot Policy via Preference Alignment

arXiv 2024

CREAM: Consistency Regularized Self-Rewarding Language Models

arXiv 2024

RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models

arXiv 2024

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

arXiv 2024

MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

arXiv 2024

SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

arXiv 2024

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

arXiv 2024

CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models

arXiv 2024

AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving

arXiv 2024

WOMD-Reasoning: A Large-Scale Dataset for Interaction Reasoning in Driving

arXiv 2024

Generalizing to Unseen Domains in Diabetic Retinopathy with Disentangled Representations

arXiv 2024

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

arXiv 2024

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

arXiv 2024

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

arXiv 2024

Can Editing LLMs Inject Harm?

arXiv 2024

MEIT: Multi-Modal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation

arXiv 2024

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

arXiv 2024