We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs. It is then aligned in three stages: SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports five capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, and (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, the instruction-tuning datasets, and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
A scientific reasoning foundation model pre-trained on diverse scientific data supports multiple tasks and enhances cross-domain generalization and fidelity through specialized training techniques.
- Year: 2025
- Venue: arXiv 2025
- Authors: 32
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2509.21320
- TL;DR: Semantic Scholar
Authors
Yizhou Wang, Lei Bai, Chen Tang, Xin Chen, Junjun He, Ming Hu, Chenglong Ma, Jiamin Wu, Guohang Zhuang, Jiaqi Liu, Encheng Su, Huihui Xu, Jianyu Wu, Yuchen Ren, Ben Fei, Qihao Zheng, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Wanli Ouyang, Shixiang Tang, Chunfeng Song, Philip Torr, Zhenfei Yin, Xiangyu Yue, Yuhao Zhou, Han Deng, Xinzhu Ma, Jun Yao, Jiabei Xiao, Pengze Li, Lintao Wang