SAKE: Software Architectural Knowledge Evaluation Benchmark for Large Language Models

Large Language Models (LLMs) are increasingly used as assistants across the software development lifecycle, yet their ability to reason about software architecture remains largely unmeasured. Architectural decision-making depends on quality attribute trade-offs, design patterns, and system-level constraints, none of which are exercised by benchmarks that target syntactic or algorithmic tasks. We introduce SAKE (Software Architectural Knowledge Evaluation), a standardized and reproducible benchmark for assessing software architectural knowledge in LLMs. SAKE comprises 2154 expert-curated multiple-choice questions, each with four options, stratified across eight architectural categories and four context-length levels. We evaluate 11 proprietary and open-weight models in zero-shot and five-shot settings. Overall accuracy is high, but performance varies markedly across categories, revealing competency gaps in areas central to professional practice. SAKE, its evaluation scripts, and all results are released as open source to give the community a baseline for tracking architectural reasoning in LLMs.