Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens

The increasing scale of AI workloads demands High-Performance Computing (HPC) infrastructure and training methodologies that are both scalable and sustainable. While Large Language Models (LLMs) demonstrate exceptional natural language capabilities, general-purpose models often lack the specialized domain knowledge needed for effective cybersecurity analysis. We investigate Domain-Adaptive Continuous Pretraining (DAP) as a scalable, resource-efficient method for improving the cybersecurity understanding of pretrained LLMs, implemented through a distributed Fully Sharded Data Parallel (FSDP) pipeline across multi-node GPU clusters. We systematically adapt three decoder-based architectures -- Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-14B, and Llama-3.3-70B-Instruct -- using a curated 126-million-word cybersecurity corpus drawn from standards, academic literature, and technical documentation. Evaluation on three cybersecurity benchmarks -- CTI-MCQ, CyberMetric, and SecEval -- shows consistent improvements after adaptation. Notably, our Llama-3.3-70B-Ins-DAP model achieves state-of-the-art accuracies of 0.718, 0.933, and 0.864, respectively, surpassing parameter-efficient baselines and specialized models including Llama-Primus-Base (trained on 2.77 billion tokens) and Foundation-Sec-8B (trained on 5 billion tokens) while using only 118.8 million tokens, a 23-to-42-fold reduction in training data. These results indicate that targeted continuous pretraining on scalable HPC infrastructure enables effective cybersecurity domain adaptation with a substantially smaller computational and energy footprint, supporting specialized AI assistants for threat analysis, vulnerability assessment, and security documentation while advancing sustainable and responsible AI development.
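
The 23-to-42-fold figure follows directly from the token counts quoted above. The arithmetic can be checked with a minimal Python sketch; the names `DAP_TOKENS`, `BASELINE_TOKENS`, and `fold_reduction` are illustrative and do not come from the paper.

```python
# Token counts as quoted in the abstract.
DAP_TOKENS = 118.8e6  # tokens used for DAP in this work
BASELINE_TOKENS = {
    "Llama-Primus-Base": 2.77e9,
    "Foundation-Sec-8B": 5e9,
}


def fold_reduction(baseline_tokens: float, ours: float = DAP_TOKENS) -> float:
    """Return how many times more training tokens the baseline used."""
    return baseline_tokens / ours


for name, tokens in BASELINE_TOKENS.items():
    print(f"{name}: {fold_reduction(tokens):.1f}x")
# Llama-Primus-Base: 23.3x
# Foundation-Sec-8B: 42.1x
```

The ratios compare training-token counts only; they say nothing about model size or compute per token, which differ across the models being compared.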