Shuyan Zhou

Assistant professor at Duke CS; first author of WebArena, the realistic web-agent benchmark; previously at Meta GenAI, working on Llama computer use.

Role
professor
Papers
22

22 papers · 1 eval contribution

Authored papers

22

Learning Personalized Agents from Human Feedback

arXiv 2026

2026

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

arXiv 2026

2026

Modeling Distinct Human Interaction in Web Agents

arXiv 2026

2026

OSWorld-Verified: A Cleaner, More Reliable Computer-Use Benchmark

blog

2025

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

arXiv 2025

2025

The Geometry of Reasoning: Flowing Logics in Representation Space

arXiv 2025

2025

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

arXiv 2025

2025

FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

arXiv 2025

2025

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

NeurIPS

2024

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

ACL

2024

WebArena: A Realistic Web Environment for Building Autonomous Agents

ICLR

2024

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

arXiv 2024

2024

WebCanvas: Benchmarking Web Agents in Online Environments

arXiv 2024

2024

Beyond Browsing: API-Based Web Agents

arXiv 2024

2024

Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents

arXiv 2024

2024

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

arXiv 2023

2023

Causal Reasoning of Entities and Events in Procedural Texts

arXiv 2023

2023

DocPrompting: Generating Code by Retrieving the Docs

arXiv 2022

2022

Language Models of Code are Few-Shot Commonsense Learners

arXiv 2022

2022

Execution-Based Evaluation for Open-Domain Code Generation

arXiv 2022

2022

MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages

arXiv 2022

2022

Show Me More Details: Discovering Hierarchies of Procedures from Semi-structured Web Data

ACL 2022

2022

Eval contributions

1

Frequent co-authors

10

from 22 papers