Huan Sun

LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

arXiv 2026

Emergent Social Intelligence Risks in Generative Multi-Agent Systems

arXiv 2026

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

arXiv 2026

Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

arXiv 2026

When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

arXiv 2026

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

arXiv 2025

An Illusion of Progress? Assessing the Current State of Web Agents

arXiv 2025

On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

arXiv 2025

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

arXiv 2025

Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging

arXiv 2025

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

arXiv 2025

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

arXiv 2025

Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure

arXiv 2025

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

arXiv 2024

AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs

arXiv 2024

GPT-4V(ision) is a Generalist Web Agent, if Grounded

arXiv 2024

ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving

arXiv 2024

Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

arXiv 2024

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

arXiv 2024

AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs

arXiv 2024

eCeLLM: Generalizing Large Language Models for E-commerce from Large-scale, High-quality Instruction Data

arXiv 2024

When is Tree Search Useful for LLM Planning? It Depends on the Discriminator

arXiv 2024

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

arXiv 2024

EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

arXiv 2024

A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents

arXiv 2024

AttributionBench: How Hard is Automatic Attribution Evaluation?

arXiv 2024

Mind2Web: Towards a Generalist Agent for the Web

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

CVPR 2024

Automatic Evaluation of Attribution by Large Language Models

arXiv 2023

Biomedical Language Models are Robust to Sub-optimal Tokenization

arXiv 2023

AgentBench: Evaluating LLMs as Agents

arXiv 2023

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

NeurIPS 2023