Part of the Proceedings of the International Conference on Machine Learning (ICML 2020), pre-proceedings


*Vincent Cohen-Addad, Karthik C. S., Guillaume Lagarde*

A classic problem in unsupervised learning and data analysis is to find simpler, easy-to-visualize representations of the data that preserve its essential properties. A widely used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic "linkage" algorithms (single, average, or complete). However, these methods exhibit a prohibitive running time of $\Theta(n^2)$. In this paper, we provide a new algorithm which takes as input a set of points $P$ in $\mathbb{R}^d$ and, for every $c \ge 1$, runs in time $n^{1+O(1/c^2)}$ to output an ultrametric $\Delta$ such that for any two points $u,v$ in $P$, the distance $\Delta(u,v)$ is within a multiplicative factor of $5c$ of the distance between $u$ and $v$ in the "best" ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $\Delta^*$ that minimizes the maximum distance distortion with respect to the $\ell_2$ distance, namely that minimizes $\max_{u,v \in P} \Delta^*(u,v)/\|u-v\|_2$. We complement the above result by showing that, under popular complexity-theoretic assumptions, for every constant $\epsilon>0$, no algorithm with running time $n^{2-\epsilon}$ can distinguish between inputs that admit an isometric embedding and inputs that incur a distortion of $3/2$ in the $\ell_\infty$-metric. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
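To make the baseline concrete, the following is a minimal sketch (not the paper's algorithm) of the classic $\Theta(n^2)$ single-linkage construction and of the distortion objective stated above. It builds a minimum spanning tree of the complete Euclidean graph and merges clusters in order of increasing edge weight, so that $\Delta(u,v)$ is the weight of the edge whose insertion first joins $u$ and $v$; function names are ours, chosen for illustration.

```python
import itertools
import math

def single_linkage_ultrametric(points):
    """Classic single-linkage ultrametric for a small point set.

    Delta(u, v) is the largest edge weight on the path joining u and v
    in a minimum spanning tree of the complete Euclidean graph -- the
    Theta(n^2) linkage baseline that the paper's algorithm accelerates.
    """
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])

    # Prim's algorithm over the complete graph: O(n^2) time.
    edges = []                                     # MST edges (weight, u, v)
    best = {v: (dist(0, v), 0) for v in range(1, n)}
    while best:
        v = min(best, key=lambda x: best[x][0])    # closest vertex to tree
        w, u = best.pop(v)
        edges.append((w, u, v))
        for x in best:                             # relax remaining vertices
            d = dist(v, x)
            if d < best[x][0]:
                best[x] = (d, v)

    # Kruskal-style merging: process MST edges by increasing weight and
    # record the merge height as the ultrametric distance between clusters.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]          # path compression
            x = parent[x]
        return x

    Delta = [[0.0] * n for _ in range(n)]
    clusters = {i: [i] for i in range(n)}
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        for a in clusters[ru]:
            for b in clusters[rv]:
                Delta[a][b] = Delta[b][a] = w
        parent[rv] = ru
        clusters[ru] += clusters.pop(rv)
    return Delta

def max_distortion(points, Delta):
    """The objective from the abstract: max over pairs of
    Delta(u, v) / ||u - v||_2."""
    return max(Delta[u][v] / math.dist(points[u], points[v])
               for u, v in itertools.combinations(range(len(points)), 2))
```

Both steps are quadratic in $n$, which illustrates the bottleneck: even this simplest linkage method inspects all pairwise distances, whereas the paper's algorithm runs in subquadratic time $n^{1+O(1/c^2)}$ at the cost of a $5c$ approximation.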
