Dataset Construction for Training LLM to Learn Analog Circuit Knowledge

This paper constructs a textual dataset for training large language models (LLMs) to learn analog circuit knowledge and customizes LLM training techniques. For dataset construction, high-quality textbooks are collected and decomposed into fine-grained learning nodes, which are then used to construct structured question-thinking-solution-answer (QTSA) quadruples using a multi-agent framework to capture both final answers and thought processes. The resulting dataset consists of 7.26M tokens of unlabeled data for continual pre-training (CPT) and 112.65M tokens of labeled data for supervised fine-tuning (SFT). We customize the training techniques including initial model selection, training paradigms, regularization techniques, and practical implementation references. Instruct models are identified as suitable training initialization points, an SFT-centric training paradigm is established (finding that CPT provides marginal benefits compared with SFT due to imbalanced data distribution), and SFT with KL divergence regularization can achieve a 2.71 percentage-point improvement over SFT alone. A practical training implementation method is provided for resource-constrained scenarios. Experiments demonstrate that the dataset and training techniques enhance LLMs' analog circuit knowledge. The trained 32B instruct model achieves 84.59% accuracy on the AMSBench-TQA benchmark, showing a 15.67 percentage-point improvement over the initial model. The trained model also shows capability in the operational amplifier design task based on the Atelier framework.