Explainability in multimodal deep transformation models for stroke outcome prediction

Multimodal prediction models based on imaging and clinical data are increasingly used for clinical decision support, yet their interpretability remains limited. We present multimodal Deep Transformation Models (DTMs), which combine statistical approaches with neural networks to achieve strong predictive performance while preserving interpretability for tabular data. A key contribution of this work is the adaptation of the xAI methods Grad-CAM and Occlusion to DTMs that rely on 3D CNNs, which enables interpretation of the image branch through explanation maps. We developed DTMs to predict functional independence three months after stroke using diffusion-weighted imaging and clinical data from 407 patients. In a ten-fold cross-validation, the models achieved state-of-the-art predictive performance (AUC 0.81 [0.75, 0.87]) while maintaining interpretability for tabular features, with functional independence before stroke and stroke severity on admission emerging as the strongest predictors. Explanation maps from both xAI methods highlighted consistent regions, including frontal lobe areas, which are known to be associated with age, a strong predictor of functional outcome. Notably, these regions disappeared once age was included as an explicit tabular predictor. Similarity analyses of the explanation maps revealed distinct spatial patterns, offering insights into stroke pathophysiology and supporting systematic error analysis and hypothesis generation.
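To illustrate the general idea behind occlusion-based explanation maps for volumetric images, the following is a minimal sketch in Python with NumPy. It is not the implementation used in this work: the function name `occlusion_map_3d`, the `predict` callable, and the `patch`, `stride`, and `fill` parameters are all assumptions made for illustration. The sketch masks one cubic patch at a time, records how much the model score drops, and averages those drops per voxel.

```python
import numpy as np


def occlusion_map_3d(volume, predict, patch=4, stride=4, fill=0.0):
    """Occlusion sensitivity map for a 3D volume (illustrative sketch).

    volume  : array of shape (D, H, W)
    predict : hypothetical callable mapping a volume to a scalar score
    patch   : edge length of the cubic occlusion patch, in voxels
    stride  : step between consecutive patch positions, in voxels
    fill    : value written into the occluded patch

    Returns an array with the same shape as `volume`. Each voxel holds the
    mean score drop over all patches that covered it.
    """
    volume = np.asarray(volume, dtype=float)
    baseline = float(predict(volume))
    heat = np.zeros_like(volume)
    counts = np.zeros_like(volume)
    depth, height, width = volume.shape

    for z in range(0, depth, stride):
        for y in range(0, height, stride):
            for x in range(0, width, stride):
                region = (
                    slice(z, z + patch),
                    slice(y, y + patch),
                    slice(x, x + patch),
                )
                occluded = volume.copy()
                occluded[region] = fill
                # Positive drop: the occluded patch supported the score.
                drop = baseline - float(predict(occluded))
                heat[region] += drop
                counts[region] += 1

    # Avoid division by zero for voxels never covered (stride > patch).
    return heat / np.maximum(counts, 1)
```

In practice, `predict` would wrap the image branch of a trained model. Patch size and stride trade spatial resolution against the number of forward passes required.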