When few labeled target data suffice: a theory of semi-supervised domain adaptation via fine-tuning from multiple adaptive starts

Semi-supervised domain adaptation (SSDA) seeks to achieve accurate predictions in a target domain with limited labeled target data by exploiting abundant source and unlabeled target data. We study this problem under structural causal models (SCMs), which provide a statistical framework for describing distribution shifts between source and target domains as interventions in the data-generating process rather than as ad hoc changes in model parameters. The central phenomenon is that, under low-dimensional interventions, source and unlabeled target data can help identify the high-dimensional shared structure, leaving only a low-dimensional target-specific correction to be learned from limited labeled target data. We formalize this principle for three canonical intervention models and propose the corresponding SSDA methods FT-DIP, FT-OLS-Src and FT-CIP. Under each intervention model, we show that extending an unsupervised domain adaptation (UDA) method to SSDA can achieve minimax-optimal target performance with limited target labels, with the labeled-target sample complexity scaling with the intervention dimension rather than the ambient dimension. When the distribution shift is underspecified, we propose the Multi-Adaptive-Start Fine-Tuning (MASFT) algorithm, which fine-tunes from multiple adaptive starts and selects among them using a small target validation set, incurring only logarithmic overhead in the number of starts. We validate the effectiveness of the proposed methods in experiments on simulated and real data.
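To make the select-among-starts idea concrete, the following is a minimal sketch of fine-tuning from several starts and choosing one on a small target validation set. It is an illustration under stated assumptions, not the MASFT procedure itself: the linear model, the squared loss, the plain gradient-descent fine-tuning, and the names `fine_tune`, `masft_sketch`, `near` and `far` are all hypothetical. The starts here are arbitrary weight vectors standing in for the outputs of different adaptation methods.

```python
import numpy as np


def fine_tune(w0, X, y, lr=0.05, steps=50):
    """Run a few gradient-descent steps on the squared loss, starting from w0."""
    w = w0.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


def masft_sketch(starts, X_tr, y_tr, X_val, y_val, **kw):
    """Fine-tune from every start and keep the lowest validation MSE.

    `starts` maps a (hypothetical) start name to an initial weight vector.
    Returns (name, fine-tuned weights, validation MSE) of the selected start.
    """
    best = None
    for name, w0 in starts.items():
        w = fine_tune(w0, X_tr, y_tr, **kw)
        val_mse = float(np.mean((X_val @ w - y_val) ** 2))
        if best is None or val_mse < best[2]:
            best = (name, w, val_mse)
    return best


# Toy target data (assumed for illustration): fewer labeled points than dimensions.
rng = np.random.default_rng(0)
d, n_tr, n_val = 20, 10, 50
w_true = np.ones(d)
X_tr, X_val = rng.normal(size=(n_tr, d)), rng.normal(size=(n_val, d))
y_tr, y_val = X_tr @ w_true, X_val @ w_true

# "near" is off by a one-dimensional correction; "far" ignores shared structure.
near = w_true.copy()
near[0] += 0.5
starts = {"near": near, "far": np.zeros(d)}

name, w_hat, val_mse = masft_sketch(starts, X_tr, y_tr, X_val, y_val)
print(name, val_mse)
```

In this toy setting the labeled target sample is smaller than the ambient dimension, so a start that already captures the shared structure needs only a small correction, whereas a start far from it cannot be repaired by the few labels available. The validation step is what distinguishes the two without knowing in advance which start is appropriate.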