PlantExpertVQA: A Visual Question Answering Dataset for Benchmarking Vision-Language Models in Plant Science

Existing plant-disease datasets target classification and detection, leaving vision-language models unable to support interactive, reasoning-based diagnosis. To address this gap, we present PlantExpertVQA, a large-scale visual question answering (VQA) dataset designed to advance vision-language models for agricultural decision-making. It is compiled from 45 open-source datasets, including the widely used PlantVillage corpus, and comprises 765,186 high-quality question-answer (QA) pairs grounded in 150,841 images spanning 38 crop species and 89 disease conditions. Questions are organized into 3 levels of cognitive complexity and 9 distinct categories. Each question was phrased following expert guidance and generated by an automated two-stage pipeline: template-based QA synthesis from image metadata, followed by multi-stage linguistic re-engineering. The dataset was iteratively reviewed by domain experts for scientific accuracy and relevance. We find that current frontier vision-language models, including recent open-source instruction-tuned multimodal LLMs, perform poorly on PlantExpertVQA. However, parameter-efficient fine-tuning of a compact 2B-parameter model on a small fraction of the dataset yields substantial improvements across all question categories, demonstrating the dataset's effectiveness for domain adaptation.