Explainable k-Means and k-Medians Clustering

Part of Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020)

Bibtex »Metadata »Paper »Supplemental »

Bibtek download is not availble in the pre-proceeding


Authors

Michal Moshkovitz, Sanjoy Dasgupta, Cyrus Rashtchian, Nave Frost

Abstract

<p>Clustering is a popular unsupervised learning method for geometric data. Unfortunately, many clustering algorithms use global properties of the data, and there are no simple explanations for cluster assignments. To improve interpretability, we consider using a small threshold tree to partition a dataset into clusters. This leads to cluster assignments that can be explained by very few feature values in a straightforward manner. We study this problem from a theoretical viewpoint, measuring the output quality by the k-means and k-medians objectives. In terms of negative results, we show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and we prove that any explainable clustering must incur an \Omega(\log k) approximation compared to the optimal clustering. On the upper bound side, we design efficient algorithms that produce explainable clusters using a tree with k leaves. For two means/medians, we show that a single threshold cut suffices to achieve a constant factor approximation, which is a surprising result that nearly matches our lower bounds. For general k \geq 2, our algorithm is an O(k) approximation to the optimal k-medians and an O(k^2) approximation to the optimal k-means. Prior to our work, no algorithms were known with provable guarantees independent of the dimensionality and input size. </p>