Orthogonal Procrustes problem preserves correlations in synthetic data

Synthetic data generation is increasingly used in applications involving privacy preservation, data sharing, and data scarcity. In many of these settings, preserving the dependence structure of the original data is of central interest. In this work, we propose a lightweight postprocessing method for synthetic tabular data based on the Orthogonal Procrustes problem. Starting from an already generated synthetic dataset, our approach constructs the dataset closest to it that restores the Pearson correlation structure of the original data. On the theoretical side, we show that preserving the Pearson correlation is equivalent to acting with linear orthogonal maps on the centered-data subspace, which allows us to cast the task as an Orthogonal Procrustes problem. For this to be valid, we first establish that, under suitable assumptions, the output of the Orthogonal Procrustes step remains in that subspace. Applications to several datasets and synthetic data generators illustrate the effectiveness of the proposed approach. In particular, the numerical experiments indicate that the correlation structure can be restored while largely preserving the individual feature distributions, the geometry of the data, and the performance on downstream classification tasks.
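To make the idea concrete, the following is a minimal sketch of such a postprocessing step, not the exact procedure of this work. It makes several assumptions that the abstract does not state: the original and synthetic datasets have the same number of rows, closeness is measured in the Frobenius norm on standardized data, the standardized synthetic data has full column rank, and the output is rescaled to the means and standard deviations of the synthetic features. The function names `standardize` and `restore_correlation` are hypothetical.

```python
import numpy as np


def standardize(a):
    """Center each column and scale it to unit (population) standard deviation."""
    mu = a.mean(axis=0)
    sd = a.std(axis=0)
    return (a - mu) / sd, mu, sd


def restore_correlation(original, synthetic):
    """Return the dataset closest to `synthetic` (Frobenius norm, standardized
    scale) whose Pearson correlation matrix equals that of `original`.

    Assumption: both arrays have shape (n, d) with n >= d, and the
    standardized synthetic data has full column rank.
    """
    xs, _, _ = standardize(original)
    ys, mu, sd = standardize(synthetic)

    # Thin SVD of the standardized original data: xs = U_x diag(s) V_x^T.
    _, s, vt = np.linalg.svd(xs, full_matrices=False)

    # Orthogonal Procrustes step: among matrices W with orthonormal columns,
    # the minimizer of ||W diag(s) V_x^T - ys||_F is the polar factor of
    # ys V_x diag(s), obtained from its thin SVD.
    a, _, bt = np.linalg.svd((ys @ vt.T) * s, full_matrices=False)
    w = a @ bt

    # zs has the same Gram matrix as xs, hence the same correlation matrix.
    zs = (w * s) @ vt

    # Assumption: map back to the synthetic feature means and scales.
    return zs * sd + mu
```

With `original` and `synthetic` given as NumPy arrays of equal shape, `np.corrcoef(restore_correlation(original, synthetic), rowvar=False)` should agree with `np.corrcoef(original, rowvar=False)` up to floating-point error. Because the columns of the standardized synthetic data are centered, the left singular vectors used above are centered as well under the full-rank assumption, which is the kind of condition needed for the output to stay in the centered-data subspace.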