Multimodal Image Colorization: Quantifying the Impact of Text-Conditioned Guidance on Grayscale-to-Color Translation

Grayscale images are commonly found in historical photography restoration, medical imaging, and artistic media. However, automatically applying color to these images remains a significant challenge in computer vision because many plausible colorizations can correspond to the same grayscale input. In this work, we quantify the effect of text conditioning on pixel-level and perceptual metrics for grayscale-to-color image models. Specifically, we compare two architectures, a U-Net and Stable Diffusion 1.5, each tested with and without CLIP text conditioning while holding all other variables constant. Our results show that text conditioning improves PSNR by 5.6%, SSIM by 1.2%, and colorfulness by 36.6%, while reducing LPIPS by 7.6% in the U-Net tier. In the Stable Diffusion tier, text conditioning improves PSNR by 5.8%, SSIM by 1.5%, and colorfulness by 0.6%, while reducing LPIPS by 11.3%. These results indicate that text conditioning provides consistent, measurable improvements to colorization quality across both architecture scales.