Skip to main content

Reconstructing the Tree of Life (Fitting Distances by Tree Metrics)

Mikkel Thorup ( University of Copenhagen )
(A more popular version of the abstract can be found here)

We consider the numerical taxonomy problem of fitting an SxS distance matrix D with a tree metric T.  For some k, we want to minimize the L_k norm of the errors, that is,

Pick T so as to minimize (sum_{i,j in S} |D(i,j)-T(i,j)|^k)^{1/k}

This problem was raised in biology in the 1960s for k=1,2.  The biological interpretation is that T represents a possible evolution behind the species in S matching some measured distances in D. Sometimes, it is required that T is an ultrametric, meaning that all species are at the same distance from the root.

An Evolutionary tree induces a hierarchical classification of species and this is not just tied to biology. Medicine, ecology and linguistics are just some of the fields where this concept appears, and it is even an integral part of machine learning and data science. Fundamentally, if we can approximate distances with a tree, then they are much easier to reason about: many questions that are NP-hard for general metrics can be answered in linear time on tree metrics. In fact, humans have appreciated hierarchical classifications at least since Plato and Aristotle (350 BC).

We will review the basic results on the problem, including the most recent result from FOCS'21 https://arxiv.org/abs/2110.02807.  They paint a varied landscape with big differences between different moments, and with some very nice open problems remaining.

At STOC'93, Farach, Kannan, and Warnow proved that under L_infty, we can find the optimal ultrametric. Almost all other variants of the problem 
are APX-hard.
 
At SODA'96, Agarwala, Bafna, Farach, Paterson, and Thorup showed that for any norm L_k, k>=1, if the best ultrametric can be A-approximated, then the best tree metric can be 3A-approximated. In particular, this implied a 3-approximation for tree metrics under L_infty.

At FOCS'05, Ailon and Charikar showed that for any L_k, k>=1, we can get an O(((log n)(loglog n))^{1/k}) approximation for both tree and ultrametrics. Their paper was focused on the L_1 norm, and they wrote ``Determining whether an O(1) approximation can be obtained is a fascinating question".

At FOCS'21, Cohen-Addad, Das, Kipouridis, Parotsidis, and Thorup 
showed that indeed a constant factor is possible for L_1 for both tree and ultrametrics. This uses the special structure of L_1 in relation to hierarchies.

The status of L_k is open for 1<k<infty.

Joint work with Vincent Cohen-Addad, Debarati Das, Evangelos Kipouridis and Nikos Parotsidis.

 

 

Share this: