Fast, High Quality, Single Pass k-means Clustering
I will describe an implementation of recent results that provide high quality k-means clustering at very high speed. For well clusterable data, this algorithm provides good bounds on quality, but practically speaking, it makes clustering practical in many applications by providing roughly 3 orders of magnitude speedup relative to the standard algorithm based on Lloyd's initial efforts. In addition, the algorithm is highly amenable to implementation using map-reduce and shows essentially linear speedup. Just as significant, this new algorithm allows clustering with a very large number of clusters which makes it practical to use as a feature extraction algorithm or set up for a nearest neighbor search.
I will provide an outline of how and why this class of algorithms work so well, and demonstrate the practical aspects of the implementation. I will also describe how this algorithm can be integrated into an industrial scale nearest neighbor implementation.
Speaker bioTed Dunning has been involved with a number of startups with the latest being MapR Technologies where he is Chief Application Architect working on advanced Hadoop-related technologies. He is also a PMC member for the Apache Zookeeper and Mahout projects. Opinionated about software and data-mining and passionate about open source, he is an active participant of Hadoop and related communities and loves helping projects get going with new technologies.
Prior to his work in the startup world, he worked as a research at the Computing Research Laboratory in New Mexico under Yorick Wilks. His published work includes advances in statistical natural language processing, genomics and machine translation. One paper, in particular, on the statistics of rare coincidences has been widely cited academically and used in practical systems.