Skip to main content

Nando de Freitas receives Google Faculty Research Award for his work on deep learning


Recent work on scaling deep networks has led to the construction of the largest artificial neural networks to date, with billions of parameters distributed over several machines. These models are the state-of-the-art in image and speech recognition. It is conjectured that bigger models will result in great improvements in speech, language, vision and other tasks. Scaling up the models is however a difficult challenge, which Nando and his student Misha Denil have undertaken with some recent success.

The largest neural networks today are trained using asynchronous stochastic gradient descent.  In this framework many copies of the model parameters are distributed over many machines and updated independently.  An additional synchronization mechanism coordinates between the machines to ensure that different copies of the same set of parameters do not drift far from each other.

A major drawback of this technique is that training is very inefficient in how it makes use of parallel resources.  In the largest networks of Dean et al., where the gains from distribution are largest, distributing the model over 81 machines reduces the training time per mini-batch by a factor of 12, and increasing to 128 machines achieves a speedup factor of roughly 14.  While these speed ups are very significant, there is a clear trend of diminishing returns as the overhead of coordinating between the machines grows.  

In a recent publication, Nando de Freitas' research team demonstrated that there is significant redundancy in the parameterization of several deep learning models, used in vision and speech recognition. They showed that given only a few weight values for each feature it is possible to accurately predict the remaining values.  Moreover, they showed  that not only can the parameter values be predicted, but many of them need not  be learned at all. In the best case they were  able to predict more than 95% of the weights of a network without any drop in  accuracy.

The team plans to continue advancing their techniques to obtain further reductions in computation and communication costs. In addition to seeking these cost reductions, they also plan to improve the accuracy of deep models in speech, language and image recognition systems.