A Robust and Explainable Transformer-Based Framework for Phishing Email Detection

Phishing and related cyber threats are becoming increasingly sophisticated, with email-based phishing remaining the most persistent attack vector. These attacks exploit human vulnerabilities to deliver malware or gain unauthorized access to sensitive information. Transformer-based models enhance phishing detection through robust contextual language understanding; yet they are often regarded as black boxes due to a lack of interpretability. Moreover, recent AI-enabled attacks further undermine model resilience. To address these challenges, this work proposes a lightweight phishing detection framework based on DistilBERT, a lightweight Transformer model. Robustness to embedding-level perturbations and character-level input noise is enhanced through gradient-based adversarial training using the Fast Gradient Method (FGM), combined with stochastic character-level perturbations. To improve transparency, three prominent Explainable AI (XAI) methods, LIME (Local Interpretable Model-agnostic Explanations), SHAP (SHapley Additive exPlanations), and IG (Integrated Gradients), are integrated to interpret model decision-making. A structured rule-based prompt combines model predictions and XAI features to guide Flan-T5-Small in generating plain-language, evidence-based explanations. Experimental results demonstrate that the proposed framework outperforms a standard DistilBERT-based detection model trained without robustness enhancements in terms of accuracy and resilience. This integrated approach helps bridge the gap between model reliability and user trust, advancing transparent phishing detection.