Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules

Modern tabular foundation models such as TabPFN and TabICL naturally produce full predictive distributions, yet the benchmarks used to evaluate them (TabArena, TALENT, and others) still rely almost exclusively on point-estimate metrics such as RMSE and R^2. This mismatch implicitly rewards models and pipelines that estimate the conditional mean well while ignoring the quality of the rest of the predictive distribution. We make the case for using proper scoring rules to train, fine-tune, and benchmark (rank) tabular foundation models. Although every strictly proper scoring rule is optimized in expectation only by the true distribution, so that all such rules are equivalent at the population level, they can differ on finite data: we demonstrate analytically and empirically that different scoring rules can induce different inductive biases during finite-sample optimization, leading to different model performance. We validate this finding with fine-tuning experiments on TabPFN and TabICL using different scoring rules across a range of data sets, which reveal non-trivial interactions between training objectives and evaluation metrics. Our results show that practitioners can adapt tabular foundation models to task-specific scoring objectives, and that the choice of scoring rule can influence model behavior in practice.
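
To make the mismatch between point metrics and distributional quality concrete, the following minimal sketch compares two Gaussian predictive distributions that share the same mean but differ in spread. It is an illustration under stated assumptions, not the evaluation protocol of this work: the toy observations, the two forecasts, and all function names are hypothetical, and the log score and the CRPS are used only as two familiar examples of strictly proper scoring rules.

```python
import math


def gaussian_log_score(mu, sigma, y):
    """Negative log density of N(mu, sigma^2) at y (lower is better)."""
    z = (y - mu) / sigma
    return 0.5 * math.log(2.0 * math.pi * sigma**2) + 0.5 * z**2


def gaussian_crps(mu, sigma, y):
    """Closed-form CRPS of N(mu, sigma^2) at y (lower is better)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def mean_score(score, mu, sigma, ys):
    """Average a scoring rule over a list of observations."""
    return sum(score(mu, sigma, y) for y in ys) / len(ys)


def rmse(mu, ys):
    """RMSE of the predictive mean; the predictive spread plays no role."""
    return math.sqrt(sum((y - mu) ** 2 for y in ys) / len(ys))


# Hypothetical toy observations (assumption, not data from the paper).
ys = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]

# Two hypothetical forecasts with the same mean and different spread.
sharp = (0.0, 1.0)  # (mu, sigma)
wide = (0.0, 3.0)

rmse_sharp, rmse_wide = rmse(sharp[0], ys), rmse(wide[0], ys)
log_sharp = mean_score(gaussian_log_score, *sharp, ys)
log_wide = mean_score(gaussian_log_score, *wide, ys)
crps_sharp = mean_score(gaussian_crps, *sharp, ys)
crps_wide = mean_score(gaussian_crps, *wide, ys)

print(f"RMSE       sharp={rmse_sharp:.3f}  wide={rmse_wide:.3f}")
print(f"log score  sharp={log_sharp:.3f}  wide={log_wide:.3f}")
print(f"CRPS       sharp={crps_sharp:.3f}  wide={crps_wide:.3f}")
```

In this toy setting the RMSE is identical for both forecasts because it depends only on the predictive mean, whereas both scoring rules penalize the overly wide forecast. The sketch does not reproduce the finite-sample differences between scoring rules discussed above; it only illustrates why a point metric alone cannot rank predictive distributions.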