Machine learning with Hadoop
Supervisor:
Suitable for: MSc in Computer Science
Abstract
This project is about implementing a machine learning algorithm called random forests on top of Hadoop, using the Apache Mahout open-source project. Mahout (http://mahout.apache.org) is a popular collection of open-source machine learning algorithms that are designed to be scalable and to run on top of Hadoop (http://hadoop.apache.org), the popular open-source implementation of Google's map-reduce framework.
Random forests are a particularly effective and simple machine learning algorithm (see http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm and http://oz.berkeley.edu/users/breiman/randomforest2001.pdf for descriptions of the algorithm). At present, Mahout only has a simple implementation of random forests that does not scale. The goal of this project is to implement a scalable version of random forests using Mahout and Hadoop. If the project is successful, your work will be used by many others!
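To give a feel for why random forests parallelize well: each tree is grown independently from a bootstrap sample of the training data, so tree-building maps naturally onto map-reduce. The following is a toy, Hadoop-free sketch of that idea (all names and structure are illustrative assumptions, not Mahout's actual API): each simulated "mapper" trains trees on bootstrap samples of its own data partition, and the "reducer" simply concatenates the trees into one forest that predicts by majority vote. The trees are depth-1 decision stumps to keep the sketch short; a real implementation would grow full trees.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy sketch of random forests over partitioned data (illustrative only).
public class ForestSketch {

    // A decision stump: predict `above` if x[feature] >= threshold, else `below`.
    static class Stump {
        int feature;
        double threshold;
        int above, below;
        int predict(double[] x) { return x[feature] >= threshold ? above : below; }
    }

    // "Map" step: grow numTrees stumps, each from a bootstrap sample of this partition.
    static List<Stump> trainPartition(double[][] X, int[] y, int numTrees, long seed) {
        Random rng = new Random(seed);
        List<Stump> trees = new ArrayList<>();
        int n = X.length;
        for (int t = 0; t < numTrees; t++) {
            double[][] bx = new double[n][];
            int[] by = new int[n];
            for (int i = 0; i < n; i++) {        // sample rows with replacement
                int j = rng.nextInt(n);
                bx[i] = X[j];
                by[i] = y[j];
            }
            trees.add(bestStump(bx, by, rng));
        }
        return trees;
    }

    // Exhaustively pick the stump (on one randomly chosen feature) with the
    // fewest misclassifications on the bootstrap sample.
    static Stump bestStump(double[][] X, int[] y, Random rng) {
        int f = rng.nextInt(X[0].length);        // random feature subset of size 1
        Stump best = null;
        int bestErr = Integer.MAX_VALUE;
        for (double[] row : X) {
            for (int a = 0; a <= 1; a++) {
                Stump s = new Stump();
                s.feature = f;
                s.threshold = row[f];
                s.above = a;
                s.below = 1 - a;
                int err = 0;
                for (int i = 0; i < X.length; i++)
                    if (s.predict(X[i]) != y[i]) err++;
                if (err < bestErr) { bestErr = err; best = s; }
            }
        }
        return best;
    }

    // "Reduce" step is trivial (concatenate trees); prediction is a majority vote.
    static int predict(List<Stump> forest, double[] x) {
        int votes = 0;
        for (Stump s : forest) votes += s.predict(x);
        return 2 * votes >= forest.size() ? 1 : 0;
    }

    // Train on two partitions of a toy 1-D data set where label 1 means x >= 0.5.
    static List<Stump> trainToyForest() {
        List<Stump> forest = new ArrayList<>();
        forest.addAll(trainPartition(new double[][]{{0.1}, {0.9}, {0.2}, {0.8}},
                                     new int[]{0, 1, 0, 1}, 25, 1));   // "mapper" 1
        forest.addAll(trainPartition(new double[][]{{0.3}, {0.7}, {0.05}, {0.95}},
                                     new int[]{0, 1, 0, 1}, 25, 2));   // "mapper" 2
        return forest;
    }

    public static void main(String[] args) {
        List<Stump> forest = trainToyForest();
        System.out.println(predict(forest, new double[]{0.9}));
        System.out.println(predict(forest, new double[]{0.1}));
    }
}
```

Note the design point that makes this scale: mappers never communicate, and the reduce step is a plain concatenation, so adding machines adds trees with essentially no coordination cost.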
See http://comments.gmane.org/gmane.comp.apache.mahout.devel/23326 for an initial discussion of this on the mailing list.
Prerequisites: Java, basic algorithms knowledge
Ideally: some understanding of distributed computing (map-reduce, Hadoop), some interest in machine learning (everything needed can be learned)