BaRA: BFS-and-Reflection Web Data Collection Agent

Year
2026
Abstract & full text
arxiv.org/abs/2607.00007 (CC-BY-4.0)
Abstract

Large language model (LLM)-based web agents reduce manual scripting for web data collection, yet on live websites they often miss relevant pages, return incomplete multimodal outputs, or return media URLs that are not directly downloadable. We present the BFS-and-Reflection Agent (BaRA), a framework for site-level collection under a fixed interaction budget that combines bounded breadth-first search (BFS) traversal with history-based self-reflection. We evaluate BaRA on 50 synthetic websites with ground-truth reference sets and on three public websites with cluttered or dynamic layouts. BaRA outperforms Pure LLM, SeeAct-Vision, and Browser-use on link discovery and downloadable multimodal extraction, with the largest gains in download-valid image and video recovery. Our code is available at https://github.com/MLAI-Yonsei/BaRA-Agent.
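
To make the idea of bounded BFS traversal under a fixed interaction budget concrete, the following is a minimal sketch and not the authors' implementation. The function name `bounded_bfs`, the `get_links` callable, the `max_depth` and `budget` parameters, and the toy `site` dictionary are all hypothetical, introduced only for illustration. The history-based self-reflection step is omitted because the abstract does not specify how it works.

```python
from collections import deque


def bounded_bfs(start, get_links, max_depth=2, budget=10):
    """Breadth-first link discovery that stops after `budget` page expansions.

    `get_links` is a hypothetical callable mapping a URL to the URLs it links to.
    Returns discovered URLs in discovery order.
    """
    visited = [start]           # discovered pages, in discovery order
    seen = {start}              # fast membership check to avoid revisits
    queue = deque([(start, 0)])
    expansions = 0
    while queue and expansions < budget:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue            # pages at the depth limit are not expanded
        expansions += 1         # assumption: one expansion = one interaction
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


# Toy in-memory "site" standing in for real page fetches (illustrative only).
site = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/"],
    "/b": ["/b1"],
    "/a1": ["/deep"],
}
result = bounded_bfs("/", lambda u: site.get(u, []), max_depth=2, budget=10)
```

In this toy run, `/deep` is never discovered because `/a1` sits at the depth limit and is not expanded. Lowering `budget` to 1 would stop after expanding only the start page, which illustrates how a fixed budget trades coverage for cost.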