Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition

SIcog improves the systematic perception and reasoning of MLLMs by pre-training them on self-generated captions and structured chain-of-thought responses, outperforming prevailing pre-training approaches while using minimal external supervision.

Year
2025
Venue
arXiv 2025
Authors
9
Hosting
Abstract only

Attribution

Abstract & full text
arxiv.org/abs/2503.12303v5
TL;DR
Semantic Scholar

Abstract

Recent progress in (multimodal) large language models ((M)LLMs) has shifted focus from pre-training to inference-time compute scaling and post-training optimization, driven by concerns over limited high-quality real-world data. However, these strategies alone are insufficient for advancing model capabilities. We hypothesize that effective model improvement requires a strong synergy among pre-training, inference-time compute scaling, and post-training optimization. In this paper, we validate this hypothesis in the context of multimodal pre-training for foundation MLLM construction. We introduce Self-Improving cognition (SIcog), a self-learning framework for constructing next-generation foundation MLLMs by imparting multimodal knowledge and enhancing their systematic cognitive capabilities through multimodal pre-training with self-generated data. Specifically, we introduce Chain-of-Description, a step-by-step visual understanding method to improve comprehensive perception, and integrate structured chain-of-thought (CoT) reasoning to support in-depth multimodal reasoning. SIcog first equips a base model with systematic perception and reasoning using minimal external supervision. The enhanced model then generates candidate image captions and CoT-style reasoning responses for unlabeled images and image-question pairs across diverse tasks, which are curated through a self-consistency mechanism. These curated samples are subsequently used for large-scale multimodal pre-training, completing a self-learning cycle that strengthens the model's cognitive foundation. Extensive experiments demonstrate that SIcog produces next-generation foundation MLLMs with substantially improved multimodal cognition, outperforming prevailing pre-training approaches. These findings empirically establish SIcog as a promising framework for realizing a complete self-improving paradigm.
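
The abstract does not specify how the self-consistency mechanism works. As a minimal sketch, assuming it resembles majority voting over the final answers of several sampled CoT responses, the curation step for a single image-question pair might look like the following. The names `Candidate`, `normalize`, and `curate_by_self_consistency`, as well as the `min_agreement` threshold, are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One sampled response for an image-question pair (hypothetical structure)."""
    reasoning: str  # chain-of-thought text produced by the model
    answer: str     # final answer extracted from that reasoning


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def curate_by_self_consistency(
    candidates: list[Candidate],
    min_agreement: float = 0.5,  # assumed threshold, not from the paper
) -> Optional[Candidate]:
    """Return a candidate whose answer agrees with the majority, or None.

    A sample is kept only if the most common answer is shared by at least
    `min_agreement` of all candidates; otherwise the sample is discarded.
    """
    if not candidates:
        return None

    # Count how often each normalized final answer occurs.
    votes = Counter(normalize(c.answer) for c in candidates)
    top_answer, top_count = votes.most_common(1)[0]

    # Discard the sample when the candidates disagree too much.
    if top_count / len(candidates) < min_agreement:
        return None

    # Keep the first candidate that reached the majority answer.
    for c in candidates:
        if normalize(c.answer) == top_answer:
            return c
    return None
```

Under these assumptions, a sample with no sufficiently dominant answer is dropped rather than used for pre-training. Curating captions would need a different agreement measure, since free-form descriptions rarely match exactly.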
