Scalable and Distributed Silhouette Approximation

The silhouette is one of the most widely used measures to assess the quality of a k-clustering of a dataset of n elements. Its evaluation requires no information beyond the clustering assignment. In addition, the silhouette is extremely easy to interpret, providing a score to measure the quality of a clustering as a whole or for each element. The exact computation of the: (i) silhouette of each element of a dataset; and (ii) the global silhouette of the clustering; require Θ(n^2) distance calculations, under general metrics. The quadratic complexity Θ(n^2) is extremely prohibitive, especially on massive modern datasets. Surprisingly, existing approximate methods using O(n^2) distance calculations are heuristics not offering provable and controllable guarantees on the quality of their results. We introduce the first rigorous and efficient algorithms to estimate: (i) the (local) silhouette of each element of a dataset; and (ii) the (global) silhouette; of any metric k-clustering. Our methods, based on sampling, perform O(nk\varepsilon^{-2}\ln (nk/δ)) distance computations, and provide estimates with additive error O(\varepsilon) with probability at least 1-δ. That is, parameters \varepsilon and δ in (0,1) control the trade-off between accuracy and efficiency. We also introduce a scalable and distributed design of our methods for the MapReduce and Massively Parallel Computing (MPC) frameworks. Our distributed algorithms use a constant number of rounds and sublinear local memory. Finally, we perform extensive experiments against state-of-the-art approaches. The results show that our new techniques yield the best trade-off between accuracy and efficiency for both local and global silhouette estimation. In addition, our methods scale efficiently to massive datasets for which an exact computation of the silhouette is not practical.