MNAR-$k$-means: A $k$-means Clustering for Data Missing Not at Random with Magnitude-Decaying Probability

Year: 2026
Abstract & full text: arxiv.org/abs/2606.31253
Abstract

Classical $k$-means clustering relies on distances computed from all data features, so it cannot be applied directly to incomplete data with missing values. A natural extension of $k$-means to missing data is to use only the observed positions in clustering, which is equivalent to imputing missing values with the corresponding cluster means. However, for data missing not at random (MNAR), missingness depends on the data values themselves, so such a mean-imputation-based method may distort the estimated cluster centers and yield a poor clustering result. Because MNAR mechanisms are very common in practice, it is necessary to improve the performance of $k$-means-based clustering methods for such data. In this paper, we focus on a magnitude-decaying MNAR scenario in which data are more likely to be missing at positions with smaller absolute values, and we propose a novel $k$-means clustering method based on a constraint on the size of the imputed values, which admits a clear mathematical interpretation. Moreover, we establish the statistical consistency of the cluster centers estimated by the proposed method with respect to the true cluster centers of the fully observed data, and we minimize the proposed loss function with an alternating minimization algorithm. Simulation experiments verify that the proposed method improves clustering results and reduces the bias of the estimated cluster centers. Applications to real-world missing data further demonstrate the utility of the proposed method.
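
To make the baseline described in the abstract concrete, the following is a minimal sketch of observed-positions-only $k$-means, together with a toy magnitude-decaying missingness mask. It is not the method proposed in the paper, whose loss function is not given in the abstract. The function names, the `scale` parameter, and the specific missingness rule P(missing) = exp(-|x| / scale) are assumptions chosen purely for illustration.

```python
import numpy as np


def magnitude_decaying_mask(X, scale=1.0, rng=None):
    """Return a boolean mask (True = observed).

    Assumed mechanism, for illustration only: P(missing) = exp(-|x| / scale),
    so entries with small absolute value are the most likely to be dropped.
    """
    rng = np.random.default_rng() if rng is None else rng
    p_missing = np.exp(-np.abs(X) / scale)
    return rng.random(X.shape) >= p_missing


def observed_only_kmeans(X, init_centers, n_iter=100):
    """Baseline k-means that ignores NaN entries in both steps.

    X            : (n, d) float array, np.nan marks a missing entry.
    init_centers : (k, d) array of starting centers.
    Returns (centers, labels).
    """
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)
    observed = ~np.isnan(X)
    X_filled = np.where(observed, X, 0.0)  # placeholder zeros, always masked out
    labels = np.full(X.shape[0], -1)

    for _ in range(n_iter):
        # Assignment step: squared distance summed over observed coordinates only.
        diff = X_filled[:, None, :] - centers[None, :, :]
        dist = (observed[:, None, :] * diff**2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Update step: per-coordinate mean of the observed entries in each cluster.
        # A coordinate with no observed entries keeps its previous value.
        for c in range(centers.shape[0]):
            members = labels == c
            counts = observed[members].sum(axis=0)
            sums = X_filled[members].sum(axis=0)
            has_data = counts > 0
            centers[c, has_data] = sums[has_data] / counts[has_data]

    return centers, labels


# Toy illustration: a single cluster whose true center is 1.
rng = np.random.default_rng(0)
full = rng.normal(loc=1.0, scale=1.0, size=(20000, 1))
obs_mask = magnitude_decaying_mask(full, scale=1.0, rng=rng)
incomplete = np.where(obs_mask, full, np.nan)
centers, _ = observed_only_kmeans(incomplete, init_centers=[[0.0]])
print("mean of full data:   ", full.mean())
print("observed-only center:", centers[0, 0])
```

Under this assumed mask, values near zero are dropped preferentially, so the observed-only center is pushed away from zero relative to the mean of the full data. This is the kind of center distortion the abstract attributes to mean-imputation-based methods under MNAR.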