D-Judge: How Far Are We? Evaluating the Discrepancies Between AI-synthesized Images and Natural Images through Multimodal Guidance

In Artificial Intelligence Generated Content (AIGC), distinguishing AI-synthesized images from natural ones remains a key challenge. Despite advances in generative models, significant discrepancies between the two persist. To systematically investigate and quantify these discrepancies, we introduce an AI-Natural Image Discrepancy assessing benchmark (D-Judge) that addresses a critical question: how far are AI-generated images (AIGIs) from truly realistic images? We construct D-ANI, a dataset of 5,000 natural images and over 440,000 AIGIs generated by nine models from Text-to-Image (T2I), Image-to-Image (I2I), and Text and Image-to-Image (TI2I) prompts. Our framework evaluates the discrepancy along five dimensions: naive image quality, semantic alignment, aesthetic appeal, downstream applicability, and human validation. The results reveal notable gaps and underscore the importance of aligning metrics with human judgment. Source code and datasets are available at https://shorturl.at/l83W2.

