Towards the world’s fastest k-means algorithm
Greg Hamerly, Associate Professor
Computer Science Department, Baylor University
Joint work with Jonathan Drake
May 15, 2014
1 The k-means clustering algorithm: objective function and optimization; Lloyd’s algorithm
2 Opportunities to speed up Lloyd’s algorithm
3 Algorithms that avoid distance calculations
4 Experimental results
5 Finally
Visual representation of k-means
[Figure: input point set and the resulting clustering with centers.]
Popularity and applications of k-means
Google searches (May 2014), number of hits:

Search query                        Google   Google Scholar
k-means clustering                  2.6M     316k
support vector machine classifier   1.7M     477k
nearest neighbor classifier         0.5M     103k
logistic regression classifier      0.3M     61k
Applications:
Discovering groups/structure in data
Lossy data compression (e.g. color quantization, voice coding, representative sampling)
Initialize more expensive algorithms (e.g. Gaussian mixtures)
Optimization criteria and NP-hardness
K-means is not really an algorithm; it is a criterion for clustering quality.
Criterion: J(C, X) = Σ_{x∈X} min_{c∈C} ||x − c||²
Goal: find C that minimizes J(C, X).
NP-hard in general.
There are lots of approaches to finding ‘good enough’ solutions.
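As a concrete illustration, the criterion is a few lines of code: for each point, take the squared distance to its nearest center, and sum. The function name `kmeans_objective` and the NumPy formulation here are mine, not from the talk:

```python
import numpy as np

def kmeans_objective(X, C):
    """J(C, X): sum over points of the squared distance to the nearest center.

    X: (n, d) array of points; C: (k, d) array of centers.
    """
    # Squared distances from every point to every center: shape (n, k).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
C = np.array([[0.0, 0.5], [10.0, 10.0]])
print(kmeans_objective(X, C))  # 0.25 + 0.25 + 0.0 = 0.5
```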
Hill-climbing approaches
The most popular algorithms rely on hill-climbing:
Choose an initial set of centers.
Repeat until convergence:
Move the centers to better locations.
Because J(C, X) is non-convex, hill-climbing will not in general find optimal solutions.
Lloyd’s algorithm
The most popular algorithm for k-means (Lloyd 1982)
Batch version:
Choose an initial set of centers.
Repeat until convergence:
Assign each point x ∈ X to its currently closest center.
Move each center c ∈ C to the average of its assigned points.
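The two steps above can be sketched directly; this is a minimal NumPy version of batch Lloyd's algorithm (the function name and convergence test by unchanged assignments are my choices, not from the talk):

```python
import numpy as np

def lloyd(X, C, max_iters=100):
    """Batch Lloyd's algorithm: alternate assignment and update steps
    until the assignments stop changing. C is updated in place."""
    assign = None
    for _ in range(max_iters):
        # Assignment step: index of the currently closest center per point.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if assign is not None and np.array_equal(assign, new_assign):
            break  # converged: no assignment changed
        assign = new_assign
        # Update step: move each center to the mean of its assigned points
        # (a center with no assigned points stays where it is).
        for j in range(len(C)):
            pts = X[assign == j]
            if len(pts):
                C[j] = pts.mean(axis=0)
    return C, assign

# Two well-separated 1-d groups; each center moves to its group mean.
X = np.array([[0.0], [0.2], [10.0], [10.2]])
centers, labels = lloyd(X, np.array([[0.1], [9.0]]))
```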
Effectiveness of bounds
Elkan’s and Drake’s algorithms use multiple lower bounds.
Which bounds are most effective?
Hamerly showed the single lower bound can avoid 80+% of innermost loops, regardless of dataset and dimension.
Drake showed:
In early iterations (< 10), the first several bounds are most effective.
After that, the first bound accounts for 90+% of the avoided distance calculations.
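The single-lower-bound test can be sketched as follows, in the spirit of Hamerly's algorithm. The names and bookkeeping are illustrative; a real implementation also maintains the bounds between iterations as the centers move:

```python
import numpy as np

def hamerly_assign(X, C, assign, upper, lower):
    """One assignment pass kept cheap by a single lower bound (sketch).

    upper[i]: upper bound on dist(x_i, its assigned center);
    lower[i]: lower bound on dist(x_i, its second-closest center).
    All three arrays are updated in place.
    """
    for i, x in enumerate(X):
        if upper[i] <= lower[i]:
            continue  # the bounds prove the assignment cannot change
        # First failure: tighten the upper bound with one exact distance.
        upper[i] = np.linalg.norm(x - C[assign[i]])
        if upper[i] <= lower[i]:
            continue
        # Second failure: fall back to a full scan over all k centers.
        d = np.linalg.norm(C - x, axis=1)
        order = np.argsort(d)
        assign[i] = order[0]
        upper[i] = d[order[0]]   # exact distance to the closest center
        lower[i] = d[order[1]]   # exact distance to the second closest
    return assign
```

The innermost loop (the full scan) runs only when both bound checks fail, which is what makes the 80+% savings above possible.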
K-means has natural parallelism
[Figure: parallel speedup (relative to one thread) versus number of threads (1–14), for the annulus, compare, drake, elkan, hamerly, heap, naive, and sort algorithms on uniform data (n = 10^6, d = 8); left panel k = 32, right panel k = 128.]
Used pthreads on a 12-core machine.
The naive algorithm is embarrassingly parallel within an iteration.
Partition the data over threads; replicate the centers.
Acceleration can cause work imbalance and adds synchronization.
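The partition-data/replicate-centers scheme can be sketched as follows. The talk used pthreads in C; this is a Python analogue with a thread pool, and the function names are mine:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def assign_chunk(X_chunk, C):
    # Each worker handles its own slice of the points; the centers are
    # shared read-only, mirroring "partition data, replicate centers".
    d2 = ((X_chunk[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def parallel_assign(X, C, n_threads=4):
    """Run the assignment step with the points partitioned over threads."""
    chunks = np.array_split(X, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(assign_chunk, chunks, [C] * n_threads)
    return np.concatenate(list(parts))
```

Note that with bound-based acceleration, different chunks skip different amounts of work, which is exactly the load-imbalance issue mentioned above.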
Memory overhead
[Figure: memory used (MB) by the annulus, compare, drake, elkan, hamerly, heap, naive, and sort algorithms on uniform data (n = 10^6, d = 32); panels for k = 8, k = 32, and k = 128.]
Uniform dataset, d = 32.
Algorithms using 1 lower bound use negligible extra memory.
Drake’s and Elkan’s algorithms use significantly more memory when k is large.
Discussion
Key to acceleration: cached bounds, updated using the triangle inequality.
More lower bounds avoid more distance calculations, giving better performance in high dimension.
Low dimension (< 50) really only needs one lower bound.
The memory impact of a single lower bound is negligible.
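The triangle-inequality bound update can be sketched in a few lines. This is an illustrative version with hypothetical names; subtracting the single largest center movement from the lower bound is safe but slightly looser than the rule used in the actual algorithms:

```python
import numpy as np

def update_bounds(upper, lower, assign, C_old, C_new):
    """Keep cached bounds valid after the centers move (updated in place).

    By the triangle inequality, a point's distance to its assigned center
    grows by at most that center's movement, and its distance to any other
    center shrinks by at most that center's movement.
    """
    move = np.linalg.norm(C_new - C_old, axis=1)  # per-center movement
    upper += move[assign]   # the assigned center may have moved away
    lower -= move.max()     # some other center may have moved closer
    return upper, lower
```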
Future work
Theoretical lower bounds on the required number of distance calculations.
Other clever ways to avoid doing work; e.g. other bounds.
Accelerating other algorithms using these techniques.
Dynamic nearest neighbor search.
Clustering of dynamic datasets – any takers?
Conclusion
K-means is popular, and easy to implement.
Therefore, everyone implements it... slowly.
Simple acceleration methods exist that use little extra memory.
Key ideas: caching, the triangle inequality, and distance bounds.