Machine learning with Hadoop
Supervisor:
Suitable for: MSc in Computer Science
Abstract
This project is about implementing a machine learning algorithm called random forests on top of Hadoop, using the Apache Mahout open-source project. Mahout (http://mahout.apache.org) is a popular collection of open-source machine learning algorithms that are designed to be scalable and to run on top of Hadoop (http://hadoop.apache.org), the popular open-source implementation of Google's map-reduce framework.
Random forests are a particularly effective and simple machine learning algorithm (see http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm and http://oz.berkeley.edu/users/breiman/randomforest2001.pdf for descriptions of the algorithm). At present, Mahout only has a simple implementation of random forests that does not scale. The goal of this project is to implement a scalable version of random forests using Mahout and Hadoop. If the project is successful, your work will be used by many others!
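To give a feel for why random forests parallelize well: each tree is grown independently from a bootstrap sample of the training data, so tree-building maps naturally onto map-reduce. The following is a toy, Hadoop-free sketch of that idea (all names and structure are illustrative assumptions, not Mahout's actual API): each simulated "mapper" trains trees on bootstrap samples of its own data partition, and the "reducer" simply concatenates the trees into one forest that predicts by majority vote. The trees are depth-1 decision stumps to keep the sketch short; a real implementation would grow full trees.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy sketch of random forests over partitioned data (illustrative only).
public class ForestSketch {

    // A decision stump: predict `above` if x[feature] >= threshold, else `below`.
    static class Stump {
        int feature;
        double threshold;
        int above, below;
        int predict(double[] x) { return x[feature] >= threshold ? above : below; }
    }

    // "Map" step: grow numTrees stumps, each from a bootstrap sample of this partition.
    static List<Stump> trainPartition(double[][] X, int[] y, int numTrees, long seed) {
        Random rng = new Random(seed);
        List<Stump> trees = new ArrayList<>();
        int n = X.length;
        for (int t = 0; t < numTrees; t++) {
            double[][] bx = new double[n][];
            int[] by = new int[n];
            for (int i = 0; i < n; i++) {        // sample rows with replacement
                int j = rng.nextInt(n);
                bx[i] = X[j];
                by[i] = y[j];
            }
            trees.add(bestStump(bx, by, rng));
        }
        return trees;
    }

    // Exhaustively pick the stump (on one randomly chosen feature) with the
    // fewest misclassifications on the bootstrap sample.
    static Stump bestStump(double[][] X, int[] y, Random rng) {
        int f = rng.nextInt(X[0].length);        // random feature subset of size 1
        Stump best = null;
        int bestErr = Integer.MAX_VALUE;
        for (double[] row : X) {
            for (int a = 0; a <= 1; a++) {
                Stump s = new Stump();
                s.feature = f;
                s.threshold = row[f];
                s.above = a;
                s.below = 1 - a;
                int err = 0;
                for (int i = 0; i < X.length; i++)
                    if (s.predict(X[i]) != y[i]) err++;
                if (err < bestErr) { bestErr = err; best = s; }
            }
        }
        return best;
    }

    // "Reduce" step is trivial (concatenate trees); prediction is a majority vote.
    static int predict(List<Stump> forest, double[] x) {
        int votes = 0;
        for (Stump s : forest) votes += s.predict(x);
        return 2 * votes >= forest.size() ? 1 : 0;
    }

    // Train on two partitions of a toy 1-D data set where label 1 means x >= 0.5.
    static List<Stump> trainToyForest() {
        List<Stump> forest = new ArrayList<>();
        forest.addAll(trainPartition(new double[][]{{0.1}, {0.9}, {0.2}, {0.8}},
                                     new int[]{0, 1, 0, 1}, 25, 1));   // "mapper" 1
        forest.addAll(trainPartition(new double[][]{{0.3}, {0.7}, {0.05}, {0.95}},
                                     new int[]{0, 1, 0, 1}, 25, 2));   // "mapper" 2
        return forest;
    }

    public static void main(String[] args) {
        List<Stump> forest = trainToyForest();
        System.out.println(predict(forest, new double[]{0.9}));
        System.out.println(predict(forest, new double[]{0.1}));
    }
}
```

Note the design point that makes this scale: mappers never communicate, and the reduce step is a plain concatenation, so adding machines adds trees with essentially no coordination cost.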
See http://comments.gmane.org/gmane.comp.apache.mahout.devel/23326 for an initial discussion of this on the mailing list.
Prerequisites: Java, basic algorithms knowledge
Ideally: some understanding of distributed computing (map-reduce, Hadoop), some interest in machine learning (everything needed can be learned)