March 12, 2007 CICC quarterly meeting
Optimizing DivKmeans for Multicore Architectures: a status report
Jiahu Deng and Beth Plale, Department of Computer Science, Indiana University
Page 1:

Optimizing DivKmeans for Multicore Architectures: a status report

Jiahu Deng and Beth Plale, Department of Computer Science

Indiana University

Page 2:

Acknowledgements

• David Wild
• Rajarshi Guha
• Digital Chemistry
• Work funded in part by CICC and Microsoft

Page 3:

Problem Statements

1. Clustering is an important method for organizing thousands of data items into meaningful groups. It is widely applied in chemistry, chemical informatics, biology, drug discovery, etc. However, for large datasets, clustering is a slow process even when it is parallelized and executed on powerful computer clusters.

2. Multi-core architectures provide large degrees of parallelism. Taking advantage of them requires re-examining traditional parallelization approaches. We apply that examination to the DivKmeans clustering method.

Page 4:

Multi-core Architectures

Diagram of an Intel Core 2 dual-core processor, with core-local Level 1 caches and a shared, on-die Level 2 cache.

Multi-core processors combine two or more independent processor cores in a single package.

Page 5:

Clustering Algorithms

1. Hierarchical clustering

A series of partitioning steps takes place, generating a hierarchy of clusters. The family includes agglomerative methods, which work from the leaves upward, and divisive methods, which decompose from the root downward.

http://www.digitalchemistry.co.uk/prod_clustering.html

Page 6:

Clustering Algorithms

2. Non-hierarchical clustering

Clusters form around centroids, the number of which can be specified by the user. All clusters rank equally and there is no particular relationship between them.

http://www.digitalchemistry.co.uk/prod_clustering.html

Page 7:

Divisive KMeans (DivKmeans) Clustering Algorithm

Kmeans method: K is the number of clusters, which can be specified by the user. The items are initially randomly assigned to a cluster. Kmeans clustering proceeds by repeated application of a two-step process:

1. The mean vector for all items in each cluster is computed.
2. Items are reassigned to the cluster whose center is closest to the item.

Features: The K-means algorithm is stochastic, and the results are subject to a random component. It works very well for well-defined clusters with a clear cluster center.

Page 8:

Divisive KMeans (DivKmeans) Clustering Algorithm

Divisive KMeans: a hierarchical kmeans method. In the following discussion we consider k = 2, i.e. each clustering process accepts one cluster as input and generates two partitioned clusters as outputs.

[Diagram: the original cluster is split by the Kmeans method into cluster1 and cluster2; each resulting cluster is fed to the Kmeans method again, recursively.]

Page 9:

Parallelization of DivKmeans Algorithm for Multicore

• Proceeding without Digital Chemistry DivKmeans
  • Once agreement was reached (Nov 2006), we could not get a version of the source code isolated that communicated through public interfaces instead of private interfaces.
• Naive parallelization of DivKmeans
  • Chose to work with Cluster 3.0 from the Open Source Clustering Software, Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo.
  • The C clustering library is released under the "Python License".
  • Parallelized this Kmeans code with decomposition.
• Gather performance results on the naive parallelization
• Suggest multicore-sensitive parallelizations
• Early performance results of these parallelizations

Page 10:

Naive Parallelization of Cluster 3.0 Kmeans

• Treat each kmeans clustering process as a black box, which takes one cluster as input and generates two clusters as outputs.

• When a new cluster with more than one element is generated, assign it to a free processor for further clustering.

• A master node maintains the status of each node.

Page 11:

Naive Parallelization of Cluster 3.0 Kmeans

[Diagram: a master node maintains the status of Working Nodes 1-3. The original cluster produces cluster1 and cluster2, which are assigned to Node 2 and Node 3, then reassigned to Node 1 (and Node 2) as nodes free up.]

Page 12:

Quality of Cluster 3.0 Kmeans Naive Parallelization

Pros:

No need to worry about the details of the DivKmeans method; Kmeans functions from other libraries can be used directly.

Cons:

Speedup and scalability?

What about parallelization overhead?

Page 13:

Profiling Naive Parallelization

• Platform: a Linux cluster; each node has two 2 GHz AMD Opteron(TM) CPUs, and each CPU has dual cores; Linux RHEL WS release 4
• Algorithm: Cluster 3.0, parallelized and made divisive
• Dataset: PubChem datasets of 24,000 and 96,000 elements
• Additional libraries: LAM 7.1.2/MPI

Page 14:

Speedup: naive parallelization of Cluster 3.0

[Figure: speedup of DivKmeans (item size: 24,000); x-axis: number of nodes (0-35), y-axis: speedup (0-4).]

Speedup is defined by Sp = T1/Tp, where p is the number of processors, T1 is the execution time of the sequential algorithm, and Tp is the execution time of the parallel algorithm with p processors.

Conclusion: maximum benefit is reached at 17 nodes; the speedup gains drop off significantly after only 5 nodes.

Page 15:

CPU Utilization:

Conclusion: Node 1 maxes out at 100% utilization, a likely limiter on overall performance.

[Figure: CPU utilization of DivKmeans (item size: 96,000); x-axis: running time in seconds (0-1417+), y-axis: CPU utilization in % (0-120); one series per node, Node0-Node7.]

Page 16:

Memory Utilization

Conclusion: nothing outstanding

[Figure: memory utilization of DivKmeans (item size: 96,000); x-axis: running time in seconds (0-1473+), y-axis: memory utilization in % (0-50); one series per node, Node0-Node7.]

Page 17:

Process Behaviors

Captured with XMPI, a graphical user interface for running, debugging, and visualizing MPI programs.

Page 18:

Conclusions on Naive Parallelization from Profiling

• Poor scalability beyond 5 nodes.
• Performance likely inhibited by the 100% utilization of Node 1.

Proposed Solution
• Multi-core solution: use multiple threads on each node, with each thread running on one core.
• How this solution explicitly addresses the two problems identified above.

Page 19:

Proposed Solution: instead of treating each kmeans clustering process as a black box, each clustering process is decomposed into several threads.

[Diagram: the original cluster goes through some pre-processing in thread 1, is processed in parallel by threads 2-4, the results are merged, other processing follows, and cluster1 and cluster2 are output.]

Page 20:

Step 1: identify parts to decompose (parallelize)

Calling sequence of kmeans clustering process

[Diagram: DivKmeans calls Kmeans(), which loops over "Finding Centroids" and "Calculating Distance" inside nested do/while loops.]

Profiling inside Kmeans shows:
-> About 93% of total execution time is spent in the kmeans() function.
-> Inside the kmeans() function, almost all time is spent in "Finding Centroids" and "Calculating Distance".
-> Hence, parallelize these two.

Page 21:

Simplified code of "Finding Centroids"

// sum up all elements
for (k = 0; k < nrows; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] += data[k][j];
}
// calculate mean values
for (i = 0; i < nclusters; i++) {
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] /= total_number[i][j];
}

Page 22:

Parallelized Code of "Finding Centroids"

Before parallelization:

// sum up all elements
for (k = 0; k < nrows; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] += data[k][j];
}
// calculate mean values
...

After parallelization:

// sum up elements assigned to current thread
for (k = nrows * index / n_thread; k < nrows * (index + 1) / n_thread; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++) {
        if (mask[k][j] != 0) {
            t_data[i][j] += data[k][j];
            t_mask[i][j]++;
        }
    }
}
// merge data
...
// calculate mean values
...

Page 23:

Mapping of Algorithms into Multi-core Architectures

[Diagram: the same decomposition with each thread mapped to one core. Pre-processing runs on Core 1, the parallel section on Cores 2-4, then results are merged, other processing follows, and cluster1 and cluster2 are output.]

Page 24:

Mapping of Algorithms into Multi-core Architectures

• How to further benefit from multi-core architectures?
  • Data locality
  • Cache-aware algorithms
  • Architecture-aware algorithms

Page 25:

Mapping of Algorithms into Multi-core Architectures

Example 1: AMD Opteron

No cache sharing between the two cores in this architecture.

Diagram of the AMD Opteron

Page 26:

Mapping of Algorithms into Multi-core Architectures

Example 2: Intel Core 2

Improve cache re-use: if two threads share common data, assign them to cores on the same die.

Diagram of an Intel Core 2 dual Core processor

Page 27:

Mapping of Algorithms into Multi-core Architectures

Example 3: Dell PowerEdge 6950, a NUMA (Non-Uniform Memory Access) machine

Improve data locality: keep data in local memory so that each thread uses local memory instead of remote memory as much as possible.

Page 28:

Early Results on Multi-core Platform

Experiment Environment
Platform: 3 nodes in a Linux cluster; each node has two 2 GHz AMD Opteron(TM) CPUs, and each CPU has dual cores; Linux RHEL WS release 4

Libraries: LAM 7.1.2/MPI; Pthreads for Linux RHEL WS release 4

Degree of Parallelization
Only the code of "Finding Centroids" is parallelized for this early study. 4 threads are used for "Finding Centroids" on each node, and each thread runs on one core.

Page 29:

Results of Parallelizing “Finding Centroids”

[Figure: performance of DivKmeans before and after parallelization; x-axis: data size in number of items (0-100,000), y-axis: total execution time in seconds (0-3000).]

Conclusion: Modest improvement. DivKmeans runs about 12% faster after parallelization.

Page 30:

Parallelizing "Finding Centroids" with Different Numbers of Threads per Node

[Figure: performance of DivKmeans (item size: 12,000); x-axis: number of threads used per node (0-80), y-axis: total execution time in seconds (320-380).]

Conclusion: there is little benefit from using more threads than the number of cores (4 cores per node).

Page 31:

Optimizations for Next Step

• Reduce overhead of managing threads (e.g. use thread pool instead of creating new threads for each call to “Finding Centroids”)

• Parallelize the “Calculating Distance” part, which consumes twice the time of “Finding Centroids”

• More cores (4, 8, 32, ...) on a single computer are on the way. We should see further performance gains with more cores if the program scales well.

• The platform we used (AMD Opteron(TM)) doesn't support cache sharing between two cores on the same die. However, L2, and even L1, cache sharing among cores is becoming available.

Page 32:

The Multi-core Project in the Distributed Data Everywhere (DDE) Lab and the Extreme Lab

• Multi-core processors represent a major evolution in today's computing technology.

• We are exploring programming styles and challenges on multi-core platforms, and potential applications in both academic and commercial areas, including chemical informatics, XML parsing, data streaming, Web Services, etc.

Page 33:

References

1. Open Source Clustering Software, Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo. http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/

2. http://www.nsc.liu.se/rd/enacts/Smith/img1.htm

3. http://www.mhpcc.edu/training/workshop/parallel_intro/

4. http://www.digitalchemistry.co.uk/prod_clustering.html

5. "Performance Benchmarking on the Dell PowerEdge™ 6950", David Morse, Dell Inc.