March 12, 2007 CICC quarterly meeting
Optimizing DivKmeans for Multicore Architectures: a status report
Jiahu Deng and Beth Plale, Department of Computer Science, Indiana University
Page 1:

Optimizing DivKmeans for Multicore Architectures: a status report

Jiahu Deng and Beth Plale, Department of Computer Science

Indiana University

Page 2:

Acknowledgements

• David Wild
• Rajarshi Guha
• Digital Chemistry
• Work funded in part by CICC and Microsoft

Page 3:

Problem Statements

1. Clustering is an important method for organizing thousands of data items into meaningful groups. It is widely applied in chemistry, chemical informatics, biology, drug discovery, etc. However, for large datasets, clustering is a slow process even when it is parallelized and executed on powerful computer clusters.

2. Multi-core architectures provide large degrees of parallelism. Taking advantage of them requires re-examining traditional parallelization approaches. We apply that examination to the DivKmeans clustering method.

Page 4:

Multi-core Architectures

Diagram of an Intel Core 2 dual-core processor, with core-local Level 1 caches and a shared, on-die Level 2 cache.

Multi-core processors combine two or more independent processor cores in a single package.

Page 5:

Clustering Algorithms

1. Hierarchical clustering

A series of partitioning steps takes place, generating a hierarchy of clusters. The family includes agglomerative methods, which work from the leaves upward, and divisive methods, which decompose from the root downward.

http://www.digitalchemistry.co.uk/prod_clustering.html

Page 6:

Clustering Algorithms

2. Non-hierarchical clustering

Clusters form around centroids, the number of which can be specified by the user. All clusters rank equally and there is no particular relationship between them.

http://www.digitalchemistry.co.uk/prod_clustering.html

Page 7:

Divisive KMeans (DivKmeans) Clustering Algorithm

Kmeans method: K is the number of clusters, which can be specified by the user. The items are initially randomly assigned to a cluster. Kmeans clustering proceeds by repeated application of a two-step process:

1. The mean vector for all items in each cluster is computed.
2. Items are reassigned to the cluster whose center is closest to the item.

Features: The K-means algorithm is stochastic, and the results are subject to a random component. It works very well for well-defined clusters with a clear cluster center.

Page 8:

Divisive KMeans (DivKmeans) Clustering Algorithm

Divisive KMeans: a hierarchical kmeans method. In the following discussion we consider k = 2, i.e. each clustering process accepts one cluster as input and generates two partitioned clusters as outputs.

[Diagram: the original cluster is split by the Kmeans method into cluster1 and cluster2; each resulting cluster is fed to the Kmeans method again, recursively.]

Page 9:

Parallelization of DivKmeans Algorithm for Multicore

• Proceeding without Digital Chemistry DivKmeans
  • Once agreement was reached (Nov 2006), we could not get a version of the source code isolated that communicated through public interfaces instead of private interfaces.
• Naive parallelization of DivKmeans
  • Chose to work with Cluster 3.0 from the Open Source Clustering Software, Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo.
  • The C clustering library is released under the "Python License".
  • Parallelized this Kmeans code with decomposition.
• Gather performance results on the naive parallelization
• Suggest multicore-sensitive parallelizations
• Early performance results of these parallelizations

Page 10:

Naive Parallelization of Cluster 3.0 Kmeans

• Treat each kmeans clustering process as a black box, which takes one cluster as input and generates two clusters as outputs.

• When a new cluster with more than one element is generated, assign it to a free processor for further clustering.

• A master node maintains the status of each node.

Page 11:

Naive Parallelization of Cluster 3.0 Kmeans

[Diagram: a master node maintains the status of Working Nodes 1-3. The original cluster produces cluster1 and cluster2, which are assigned to Node 2 and Node 3, then reassigned to Node 1 (and Node 2) as nodes free up.]

Page 12:

Quality of Cluster 3.0 Kmeans Naive Parallelization

Pros:

No need to worry about the details of the DivKmeans method; Kmeans functions from other libraries can be used directly.

Cons:

Speedup and scalability?

What about parallelization overhead?

Page 13:

Profiling Naive Parallelization

• Platform: a Linux cluster; each node has two 2 GHz AMD Opteron(TM) CPUs, and each CPU has dual cores; Linux RHEL WS release 4
• Algorithm: Cluster 3.0, parallelized and made divisive
• Dataset: PubChem datasets of 24,000 and 96,000 elements
• Additional libraries: LAM 7.1.2/MPI

Page 14:

Speedup: naive parallelization of Cluster 3.0

[Figure: speedup of DivKmeans (item size: 24,000); x-axis: number of nodes (0-35), y-axis: speedup (0-4).]

Speedup is defined by Sp = T1/Tp, where p is the number of processors, T1 is the execution time of the sequential algorithm, and Tp is the execution time of the parallel algorithm with p processors.

Conclusion: maximum benefit is reached at 17 nodes; the speedup gains drop off significantly after only 5 nodes.

Page 15:

CPU Utilization:

Conclusion: Node 1 maxes out at 100% utilization, a likely limiter on overall performance.

[Figure: CPU utilization of DivKmeans (item size: 96,000); x-axis: running time in seconds (0-1417+), y-axis: CPU utilization in % (0-120); one series per node, Node0-Node7.]

Page 16:

Memory Utilization

Conclusion: nothing outstanding

[Figure: memory utilization of DivKmeans (item size: 96,000); x-axis: running time in seconds (0-1473+), y-axis: memory utilization in % (0-50); one series per node, Node0-Node7.]

Page 17:

Process Behaviors

Captured with XMPI, a graphical user interface for running, debugging, and visualizing MPI programs.

Page 18:

Conclusions on Naive Parallelization from Profiling

• Poor scalability beyond 5 nodes.
• Performance likely inhibited by the 100% utilization of Node 1.

Proposed Solution
• Multi-core solution: use multiple threads on each node, with each thread running on one core.
• How this solution explicitly addresses the two problems identified above.

Page 19:

Proposed Solution: instead of treating each kmeans clustering process as a black box, each clustering process is decomposed into several threads.

[Diagram: the original cluster goes through some pre-processing in thread 1, is processed in parallel by threads 2-4, the results are merged, other processing follows, and cluster1 and cluster2 are output.]

Page 20:

Step 1: identify parts to decompose (parallelize)

Calling sequence of kmeans clustering process

[Diagram: DivKmeans calls Kmeans(), which loops over "Finding Centroids" and "Calculating Distance" inside nested do/while loops.]

Profiling inside Kmeans shows:
-> About 93% of total execution time is spent in the kmeans() function.
-> Inside the kmeans() function, almost all time is spent in "Finding Centroids" and "Calculating Distance".
-> Hence, parallelize these two.

Page 21:

Simplified code of "Finding Centroids"

// sum up all elements
for (k = 0; k < nrows; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] += data[k][j];
}
// calculate mean values
for (i = 0; i < nclusters; i++) {
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] /= total_number[i][j];
}

Page 22:

Parallelized Code of "Finding Centroids"

Before parallelization:

// sum up all elements
for (k = 0; k < nrows; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] += data[k][j];
}
// calculate mean values
...

After parallelization:

// sum up elements assigned to current thread
for (k = nrows * index / n_thread; k < nrows * (index + 1) / n_thread; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++) {
        if (mask[k][j] != 0) {
            t_data[i][j] += data[k][j];
            t_mask[i][j]++;
        }
    }
}
// merge data
...
// calculate mean values
...

Page 23:

Mapping of Algorithms into Multi-core Architectures

[Diagram: the same decomposition with each thread mapped to one core. Pre-processing runs on Core 1, the parallel section on Cores 2-4, then results are merged, other processing follows, and cluster1 and cluster2 are output.]

Page 24:

Mapping of Algorithms into Multi-core Architectures

• How to further benefit from multi-core architectures?
  • Data locality
  • Cache-aware algorithms
  • Architecture-aware algorithms

Page 25:

Mapping of Algorithms into Multi-core Architectures

Example 1: AMD Opteron

No cache sharing between the two cores in this architecture.

Diagram of the AMD Opteron

Page 26:

Mapping of Algorithms into Multi-core Architectures

Example 2: Intel Core 2

Improve cache re-use: if two threads share common data, assign them to cores on the same die.

Diagram of an Intel Core 2 dual Core processor

Page 27:

Mapping of Algorithms into Multi-core Architectures

Example 3: Dell PowerEdge 6950, a NUMA (Non-Uniform Memory Access) machine

Improve data locality: keep data in local memory so that each thread uses local memory instead of remote memory as much as possible.

Page 28:

Early Results on Multi-core Platform

Experiment Environment
Platform: 3 nodes in a Linux cluster; each node has two 2 GHz AMD Opteron(TM) CPUs, and each CPU has dual cores; Linux RHEL WS release 4

Libraries: LAM 7.1.2/MPI; Pthreads for Linux RHEL WS release 4

Degree of Parallelization
Only the code of "Finding Centroids" is parallelized for this early study. 4 threads are used for "Finding Centroids" on each node, and each thread runs on one core.

Page 29:

Results of Parallelizing “Finding Centroids”

[Figure: performance of DivKmeans before and after parallelization; x-axis: data size in number of items (0-100,000), y-axis: total execution time in seconds (0-3000).]

Conclusion: Modest improvement. DivKmeans runs about 12% faster after parallelization.

Page 30:

Parallelizing "Finding Centroids" with Different Numbers of Threads per Node

[Figure: performance of DivKmeans (item size: 12,000); x-axis: number of threads used per node (0-80), y-axis: total execution time in seconds (320-380).]

Conclusion: there is little benefit from using more threads than the number of cores (4 cores per node).

Page 31:

Optimizations for Next Step

• Reduce overhead of managing threads (e.g. use thread pool instead of creating new threads for each call to “Finding Centroids”)

• Parallelize the “Calculating Distance” part, which consumes twice the time of “Finding Centroids”

• More cores (4, 8, 32, ...) on a single computer are on the way. We should see further performance gains with more cores if the program scales well.

• The platform we used (AMD Opteron(TM)) doesn't support cache sharing between two cores on the same die. However, L2, and even L1, cache sharing among cores is becoming available.

Page 32:

The Multi-core Project in the Distributed Data Everywhere (DDE) Lab and the Extreme Lab

• Multi-core processors represent a major evolution in today's computing technology.

• We are exploring programming styles and challenges on multi-core platforms, and potential applications in both academic and commercial areas, including chemical informatics, XML parsing, data streaming, Web Services, etc.

Page 33:

References

1. Open Source Clustering Software, Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo. http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/

2. http://www.nsc.liu.se/rd/enacts/Smith/img1.htm

3. http://www.mhpcc.edu/training/workshop/parallel_intro/

4. http://www.digitalchemistry.co.uk/prod_clustering.html

5. "Performance Benchmarking on the Dell PowerEdge™ 6950", David Morse, Dell Inc.