Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands: heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as true partners in accelerating scientific discovery.
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
- Year: 2025
- Venue: arXiv 2025
- Authors: 103
Attribution
- Abstract & full text: arxiv.org/abs/2508.21148
- TL;DR: Semantic Scholar
Authors
Bowen Zhou, Chi Zhang, Yizhou Wang, Wenhao Tang, Wei Li, Lei Bai, Lijun Wu, Yu Qiao, Conghui He, Yihao Liu, Chen Tang, Cheng Tan, Yuewen Cao, Junjun He, Tianbin Li, Ming Hu, Jin Ye, Bin Fu, Chenglong Ma, Wanghan Xu, Jiamin Wu, Jucheng Hu, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Zhaohui Lu, Jinhai Huang, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jianfei Cai, Wanli Ouyang, Zongyuan Ge, Shixiang Tang, Chunfeng Song