Approximate nearest neighbor methods for clustering and indexing have been actively researched ever since a FORTRAN implementation of the K-Means algorithm was published in 1975. A recent book lists about 300 variants and related topics.
The 50th Anniversary issue of Communications of the ACM in 2008 cited two pieces of “Breakthrough Research”. One was MapReduce; the other was clustering based on Locality Sensitive Hashing (LSH). LSH is designed for large data sets and alleviates many of the issues seen with K-Means. Want to see if a body of code has remarkable similarity to a public GitHub repo? Want to find “similar” fragments of DNA that are common to several species? LSH will get you there faster than most other techniques.
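To give a flavor of how LSH answers similarity questions like these, here is a minimal sketch of one common variant, MinHash with banding, in plain Python. It is an illustration only and not the OpenLSH code: the function names, the 5-character shingles, and the 64-hash / 16-band settings are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of overlapping k-character substrings of text."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=64):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """Split each signature into bands; documents that agree on every row of
    at least one band land in the same bucket and become a candidate pair."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs
```

Only the candidate pairs need an exact comparison afterwards, which is where the speedup over all-pairs comparison comes from. More bands with fewer rows each catch less similar pairs at the cost of more false candidates.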
The talk will demonstrate OpenLSH, an open source implementation of LSH we have been working on.
http://www.datascipros.com/2015/05/cl...