Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have often been coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use: pixel-based screenshots and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction-following tasks.
From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
Agents that use pixel-based screenshots and a generic keyboard-and-mouse action space outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction-following tasks.
- Year
- 2023
- Venue
- from-pixels-to-ui-actions-learning-to-follow
- Authors
- 9
Attribution
- Abstract & full text
- arxiv.org/abs/2306.00245v2
- TL;DR
- Semantic Scholar