University of Oxford, Department of Computer Science

Machine learning with Hadoop

Supervisor

Suitable for

Abstract

This project is about implementing a machine learning algorithm called random forests on top of Hadoop, using the Apache Mahout open-source project. Mahout (http://mahout.apache.org) is a popular collection of open-source machine learning algorithms that are designed to be scalable and to run on top of Hadoop (http://hadoop.apache.org), the popular open-source implementation of Google's map-reduce framework.

Random forests are a particularly effective and simple machine learning algorithm (see http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm and http://oz.berkeley.edu/users/breiman/randomforest2001.pdf for descriptions of the algorithm). At present, Mahout only has a simple implementation of random forests that does not scale. This project is about implementing a scalable version of random forests using Mahout and Hadoop. If the project is successful, your work will be used by many others!
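To give a feel for how random forests might map onto a map-reduce style of computation, here is a minimal sketch in plain Java. It does not use the Mahout or Hadoop APIs, and every name in it (ForestSketch, Stump, mapPartition, reduce, the toy data helpers) is hypothetical. It also simplifies heavily: each tree is a one-level decision stump, labels are binary (0 or 1), and each "mapper" trains its trees on its own data partition only, which is just one possible design and not necessarily the scalable approach this project is after.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy random forest of decision stumps, trained partition by partition (hypothetical sketch). */
final class ForestSketch {

    /** A depth-1 tree: predicts leftLabel when x[feature] <= threshold, else the other label. */
    static final class Stump {
        final int feature;
        final double threshold;
        final int leftLabel;

        Stump(int feature, double threshold, int leftLabel) {
            this.feature = feature;
            this.threshold = threshold;
            this.leftLabel = leftLabel;
        }

        int predict(double[] x) {
            return x[feature] <= threshold ? leftLabel : 1 - leftLabel;
        }
    }

    /** Trains one stump on a bootstrap sample, splitting on one randomly chosen feature. */
    static Stump trainStump(double[][] xs, int[] ys, Random rng) {
        int n = xs.length;
        int[] sample = new int[n];
        for (int i = 0; i < n; i++) sample[i] = rng.nextInt(n); // draw with replacement
        int feature = rng.nextInt(xs[0].length);                // random feature choice
        Stump best = null;
        int bestCorrect = -1;
        for (int idx : sample) {                                // candidate thresholds
            for (int leftLabel = 0; leftLabel <= 1; leftLabel++) {
                Stump s = new Stump(feature, xs[idx][feature], leftLabel);
                int correct = 0;
                for (int j : sample) if (s.predict(xs[j]) == ys[j]) correct++;
                if (correct > bestCorrect) { bestCorrect = correct; best = s; }
            }
        }
        return best;
    }

    /** "Map" step: grow a few trees from a single data partition. */
    static List<Stump> mapPartition(double[][] xs, int[] ys, int trees, long seed) {
        Random rng = new Random(seed);
        List<Stump> out = new ArrayList<>();
        for (int t = 0; t < trees; t++) out.add(trainStump(xs, ys, rng));
        return out;
    }

    /** "Reduce" step: the forest is the union of the partial forests. */
    static List<Stump> reduce(List<List<Stump>> partials) {
        List<Stump> forest = new ArrayList<>();
        for (List<Stump> p : partials) forest.addAll(p);
        return forest;
    }

    /** Classifies x by majority vote over all trees (ties go to label 0). */
    static int predict(List<Stump> forest, double[] x) {
        int votes = 0;
        for (Stump s : forest) votes += s.predict(x);
        return 2 * votes > forest.size() ? 1 : 0;
    }

    /** Toy data: class 0 lies near the origin, class 1 near (5, 5). */
    static double[][] toyFeatures(int perClass, int partition) {
        double[][] xs = new double[2 * perClass][];
        double shift = 0.01 * partition; // makes the partitions differ slightly
        for (int i = 0; i < perClass; i++) {
            xs[i] = new double[] {0.1 * i + shift, 0.2 * i + shift};
            xs[perClass + i] = new double[] {5 + 0.1 * i + shift, 5 + 0.2 * i + shift};
        }
        return xs;
    }

    /** Labels matching toyFeatures: first half 0, second half 1. */
    static int[] toyLabels(int perClass) {
        int[] ys = new int[2 * perClass];
        for (int i = perClass; i < 2 * perClass; i++) ys[i] = 1;
        return ys;
    }

    /** Simulates three mappers with five trees each, then one reducer. */
    static List<Stump> buildToyForest() {
        List<List<Stump>> partials = new ArrayList<>();
        for (int p = 0; p < 3; p++) {
            partials.add(mapPartition(toyFeatures(10, p), toyLabels(10), 5, 42L + p));
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<Stump> forest = buildToyForest();
        System.out.println("trees: " + forest.size());
        System.out.println("predict(0,0) = " + predict(forest, new double[] {0, 0}));
        System.out.println("predict(6,7) = " + predict(forest, new double[] {6, 7}));
    }
}
```

The sketch shows why the algorithm parallelises naturally: trees are trained independently, so combining them needs no coordination beyond collecting them. The harder part, and presumably the interesting part of the project, is what to do when no single machine can hold enough data to grow a good tree on its own.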

See http://comments.gmane.org/gmane.comp.apache.mahout.devel/23326 for an initial discussion of this on the mailing list.

Prerequisites: Java, basic algorithms knowledge

Ideally: some understanding of distributed computing (map-reduce, Hadoop), some interest in machine learning (everything needed can be learned)