The Challenge of Data Efficiency: Vignettes from Statistics and Evolution

Paul Valiant ( Berkeley, California )

14Mar
15:00 14th March 2012 ( week 9, Hilary Term 2012 )
380

As the amount of data that can be brought to bear on a task grows, so too does the difficulty and importance of making the best use of our data. In this talk we will consider two settings of this problem, one from statistics and one from the natural world.

A basic question in statistics is to determine how much data is needed to estimate various properties of one or more probability distributions. We consider the fundamental properties of support size, entropy, and the distance between two distributions. We present a unified new approach to these problems which yields the first explicit sublinear algorithms: given one or more probability distributions on a domain of size n, we show how to estimate each of these properties using n/log n samples, where n corresponds to the support size of the distribution. Our approach leverages the techniques of linear programming to yield a single algorithm to estimate all three properties, and, indeed, all properties from a wider class that contains them. Further, we construct a new discrete version of the multivariate central limit theorem to show that n/log n samples are in fact needed to estimate these properties and thus show that our algorithm uses the optimal number of samples, to within constant factor.

In the second half of the talk, we take our inspiration from biological evolution -- a natural process that, from vast amounts of data and complexity yields surprisingly effective results. In this talk we formulate a notion of "evolvability" for functions with domain and range that are real-valued vectors, an appropriate way of expressing many natural biological processes such as protein expression regulation. We show that linear and fixed-degree polynomial functions are evolvable in the following dually robust sense: There is a single evolution algorithm that for all convex loss functions converges for all distributions. We further examine the scope of the evolvability model and discuss future directions towards a computational understanding of evolution

Seminar Series

Seminars not part of a series

The Challenge of Data Efficiency: Vignettes from Statistics and Evolution

Seminar Series

See also

News & Events