Hierarchical Agglomerative Clustering Using Graphics Processor with Compute Unified Device Architecture

S.A. Arul Shalom, Asst. Prof. Manoranjan Dash School of Computer Engineering

Nanyang Technological University 50 Nanyang Avenue, Singapore

[email protected]; [email protected]

Minh Tue, Nithin Wilson NUS High School of Mathematics and Science

20 Clementi Avenue 1, Singapore

[email protected]; [email protected]

Abstract— We explore the use of today's high-end graphics processing units on desktops to perform hierarchical agglomerative clustering with the Compute Unified Device Architecture (CUDA) of NVIDIA. Although the advancement in graphics cards has made the gaming industry flourish, there is a lot more to be gained in the fields of scientific computing and high performance computing and their applications. Previous works have illustrated considerable speed gains in computing pairwise Euclidean distances between vectors, which is the fundamental operation in hierarchical clustering. We have used CUDA to implement the complete hierarchical agglomerative clustering algorithm and show almost double the speed gain using a much cheaper desktop graphics card. In this paper we briefly explain the highly parallel and internally distributed programming structure of CUDA. We explore CUDA capabilities and propose methods to efficiently handle data within the graphics hardware for data-intensive, data-independent, iterative or repetitive general-purpose algorithms such as hierarchical clustering. We achieved results with speed gains of about 30 to 65 times over the CPU implementation using microarray gene expressions.

Keywords- CUDA hierarchical clustering, high performance computing, GPGPU, acceleration of computations, parallel computing

I. INTRODUCTION

Today's Graphics Processing Unit (GPU) on commodity desktops, gaming consoles, video processing desktops or PlayStations has become the most powerful and affordable computational hardware in the computer world. The hardware architecture of these processors, which are traditionally meant for graphics applications, inherently enables massively parallel vector processing with high memory bandwidth and low memory latency. The processing stages within these GPUs are programmable. Such characteristics of the GPU make it more effective and cost efficient to execute highly repetitive, arithmetically intensive computational algorithms. Modern GPUs such as the NVIDIA GeForce 8800 are extremely flexible, highly programmable and powerful precision processors with 128 parallel stream processors, and are also being used in the field of general-purpose computations. Over the past few years the programmable GPU has evolved into a machine whose computational power has increased tremendously [1]. The use of GPUs for general-purpose computation (GPGPU) is seen as a significant force that is changing the nature of graphics in the enterprise. The phenomenal growth in the computing power of the GPU, measured in floating-point operations per second (FLOPS) [2], is shown in Fig. 1. The internal memory bandwidth of the NVIDIA GeForce 8800 GTX GPU is 86 Gigabytes/second, whereas for a dual core 3.0 GHz Pentium IV CPU it is 8 Gigabytes/second. The GPU has a peak performance of about 1300 GFLOPS with 128-bit floating-point precision compared to approximately 25 GFLOPS for the CPU. The field of GPGPU continues to grow and benefit from the raw computational power of the GPU in a desktop [12].

A. Recent Advancements and Challenges in GPGPU

The programming model of the GPU is harsh, constrained and heavily linked to computer graphics. It is vital that computational algorithms be carefully designed and ported to effectively suit the graphics environment. It continues to be a challenge for scientists and researchers to harness the power of the GPU for applications based on general-purpose computations. Porting CPU code to the GPU is not straightforward and not a simple task to realize. Researchers have to learn graphics-dedicated programming platforms such as OpenGL or DirectX, and convert the computational problem into a graphics problem [9], [10], [11]. This requires tedious effort and is time consuming. The new graphics API of NVIDIA, the Compute Unified Device Architecture (CUDA), eases the task of researchers who are interested in GPGPU. Its standard functions allow developers to directly access the GPU's hardware components such as processors, memories and registers.

Figure 1. Comparison of Computational Growth (Courtesy: NVIDIA).


B. Motivations

Recent research in computer architecture shows a trend towards the fields of streaming computations, multi-core architectures, heterogeneous architectures, distributed and grid computing, and polymorphous computing architectures. The architecture of the GPU has a lot in common with these hot research topics. It is possible to offload more arithmetically intense computations from the CPU to the GPU for processing massive data sets, even on desktops, for applications such as scientific computing, physical simulations, image processing, computer vision, data mining, etc. [3].

Clustering has become a common technique in statistical data analysis, with widespread applications in many fields such as machine learning, data mining, text mining, pattern recognition, image processing and bioinformatics. The five major categories of clustering are k-partitioning algorithms, hierarchical algorithms, density-based, grid-based and model-based methods. In the most popular k-partitioning method, the clusters are assumed to be spherical and the required number of partitions is pre-determined, which may not be optimal. GPU implementations of such algorithms are available [10], [11]. In hierarchical clustering a complete hierarchical decomposition of the objects is created either by agglomeration or division, from which the required number of clusters can be obtained [5]. Although hierarchical agglomerative clustering (HAC) has been widely applied in various fields, it is predominantly used for document clustering and retrieval, cluster identification from microarray gene expressions, and in image compression, searching and clustering, where computationally intense and time consuming high throughput data processing is involved. The time complexity of the HAC algorithm is at least O(N²) and the overall complexity of the algorithm is O(N² log N), where N is the number of objects to be clustered. Hierarchical clustering algorithms have been implemented on the GPU using OpenGL and CUDA in the past [6], [7], [9]. The purpose of this research work is not to reduce the complexity of the algorithm but to look into simpler ways of using CUDA for the complete HAC computation, and to understand the effects of CUDA programming and clustering parameters on scalability and speedup performance.

C. Previous Developments in Using CUDA for Hierarchical Clustering and Our Intention

An extensive literature search for CUDA based hierarchical clustering and distance computations yielded two related works with significant contributions to this topic. The first work [6] deals with the implementation of the 'pair-wise distance' computation, which is one of the fundamental operations in HAC. The NVIDIA 8800 GTX GPU is employed and CUDA programs are used to speed up the computations. Gene expression data is used and the standard HAC is implemented using the half matrix for pair-wise distance calculation. Experiments are conducted to evaluate the use of threads: one thread for one row of the output matrix and one thread for one entry of the output matrix. Moreover, the GPU shared memory is used and the threads are synchronized at block level. Results show that a speedup of 20 to 44 times is achieved on the GPU compared to the CPU implementations. It is important to note that this speedup is achieved only for the pair-wise distance computations and not for the complete HAC algorithm.

In the second research work [7], a CUDA based HAC algorithm on the NVIDIA 8800 GTX GPU is compared with the performance of commercial bioinformatics clustering applications running on the CPU. Based on the cluster parameters and setup used, a speedup of about 10 to 14 times is achieved. The effectiveness of using CUDA on the GPU to cluster high dimensional vectors is shown. This is accomplished at the expense of organized memory 'micro-management' within the GPU.

In this paper, we exploit the parallel processing architecture, the large global memory space, and the programmability of the GPU to efficiently implement the traditional HAC algorithm completely. We use a relatively cheaper graphics card, the NVIDIA 8800 GTS GPU, which has a lower specification than the GTX version. The cost of the 8800 GTX ranges from $500 to $700, whereas the 8800 GTS ranges from $100 to $400. We use the latest graphics programming API, NVIDIA's CUDA, to implement the computations of HAC in the GPU. We mostly utilize the global memory rather than the shared memory in the GPU, which is relatively simple to program, though access is slower. We present the novelties of our research and recommend certain criteria for the selection of CUDA programming parameters for a given type of computation. We explore the relations between CUDA parameters such as block size and number of blocks versus data size and dimensions, intending to find the optima where a given GPU configuration peaks in performance. We have implemented the single link method of HAC and tested it using microarray gene expression data of yeast. We achieved results on the order of 30 to 65 times faster than the CPU, based on gene sizes and CUDA parameters.

II. THE TRADITIONAL HIERARCHICAL AGGLOMERATIVE CLUSTERING ALGORITHM

In this section we provide a brief description of the traditional HAC algorithm along with the steps of implementing the algorithm on the CPU, to which the GPU implementation steps can be related later in Section III.

A. Understanding the HAC Algorithm

The objective of HAC is to generate multiple high level partitions in a given dataset. The groups of partitions of data vectors denote the sets of clusters. In this bottom-up approach, each vector is treated as a singleton cluster to start with, and clusters are then successively merged in pairs (agglomeration) until all vectors have merged into one single large cluster. The agglomeration of clusters results in a tree-like structure called the dendrogram.


The combination similarity between the merging clusters is highest at the lowest level of the dendrogram and it decreases as the clusters merge into the final single cluster. This structure can then be used to extract partitions of the data set by cutting the dendrogram at the required level of combination similarity or at the number of clusters expected. Fig. 2 shows such a dendrogram illustrating the HAC algorithm, where the singleton clusters p, q, r, s and t are progressively merged into one single large cluster.

The fundamental assumption in HAC is that the merge is monotonic and descending, which means that the combination similarities s1, s2, s3, ..., sn-1 of successive merges of the vectors are in descending order. The steps of HAC that will result in such a monotonic merge are listed in Table I. Merging the minimum distance cluster pairs is called the single linkage HAC method.
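For instance, with three one-dimensional vectors located at 1, 2 and 4, single linkage first merges {1, 2} at distance 1 (the highest combination similarity s1) and then merges this cluster with {4} at single-link distance 2 (a lower combination similarity s2), so the sequence of merges is indeed monotonically descending.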

B. HAC Implementation on the CPU

We implemented the half similarity matrix based HAC on the CPU and used its computational time as the reference against which the computational time taken by the GPU is compared. Assuming that the n vectors of dimension d are stored in an array a, the code listed in Table II computes the half matrix of pairwise Euclidean distances and stores it in the array dist.

Figure 2. A Dendrogram of HAC.

TABLE I. HAC ALGORITHM WITH HALF SIMILARITY MATRIX

Algorithm: Hierarchical Agglomerative Clustering
Input: d dimensional dataset X with n vectors (v1, v2, v3, ..., vn)
Output: Monotonic descending dendrogram
Steps:
1) begin: initialize Xi ← {vi}, i = 1, 2, ..., n
2) construct the n x n half similarity matrix D with distances d(i,j) between the vectors
3) while |Xi| > 1, do
   a. Select the best pair (a, b) such that a, b ∈ Xi;
   b. Merge the pair (a, b) into a new cluster c = a U b; let a and b be the sub clusters of cluster c;
   c. Update Xi ← Xi U {c} \ {a, b};
4) for each x ∈ Xi do
   a. Compute the distances between x and c;
   b. Update the half similarity matrix D with distances d(x, c);
end loop.

TABLE II. CODE TO COMPUTE THE UPPER HALF MATRIX OF PAIR-WISE EUCLIDEAN DISTANCES ON CPU

// The n vectors of dimension d are assumed to be stored in the array a (a[i][k]).
double** dist;                     // upper half matrix of squared distances
double start, end;
start = clock();                   // timing reference for the CPU implementation
dist = new double*[n-1];
for (int i = 0; i < n-1; i++)
    dist[i] = new double[n-1-i];   // row i holds distances to vectors i+1 .. n-1
for (int i = 0; i < n-1; i++)
{
    for (int j = i+1; j < n; j++)
    {
        int v = j - i - 1;         // column offset within row i
        dist[i][v] = 0;
        for (int k = 0; k < d; k++)
        {
            double r = a[i][k] - a[j][k];
            dist[i][v] += r * r;   // accumulate the squared Euclidean distance
        }
    }
}

III. CUDA FOR GPU BASED HAC COMPUTATIONS

In the legacy approach, algorithms are broken down into small chunks of computation, or kernels, called shaders in the OpenGL Shading Language (GLSL) or C for Graphics (Cg). In the unified design of CUDA, it is possible to execute multiple shaders by synchronizing them to the various graphics memories. This unified design provides better workload balance between the stream processors in the GPU, thus avoiding delays in the completion of shading. In this section we discuss the general structure of CUDA to learn how it can be used effectively and in a preferably simple way for computations.

A. Programming Structure of CUDA for Computations

The software stack of CUDA runs on the GPU hardware as an Application Programming Interface (API) to the standard C language, providing Single Instruction Multiple Data (SIMD) capabilities. To utilize a GPU such as the NVIDIA GeForce 8800 as a stream processor, the CUDA framework abstracts the graphical pipeline processors, memories and controls. This exposes the memory and instruction set as a hierarchical model so it can be effectively used to realize high-level programmability for highly intensive arithmetic computations. The GPU has many more Arithmetic Logic Units (ALUs), which provide its intensive computational power, while lacking a complex system of logical control. This concentration of ALUs on a single GPU chip promotes the distribution of massive data and instructions over those units simultaneously, thus enabling parallel computation.

Chapter 2 of [3] explains the programming structure, memory management and the invocation of kernel functions in detail. Fig. 3 shows an overview of the CUDA device memory model, which programmers use to reason about the allocation, movement and usage of the various memory types available on the GPU. The lowest level of computation is the thread, which is analogous to a shader in OpenGL. Each thread has local data memory and access to memories at different hierarchies. Instructions are passed into the threads to perform calculations.


Threads are organized into blocks, and blocks are organized into a grid. Blocks form the basis for a kernel to operate on the data that resides in a dedicated, aligned memory. Threads in a block are identified by a unique Thread ID and blocks in a grid by a Block ID. The IDs of the threads can be accessed via the corresponding blocks and grids using the built-in variables blockDim (block dimension), gridDim (grid dimension), blockIdx (block index), threadIdx (thread index) and warpSize (size of a warp). A warp executes a number of threads on one of the processors within the GPU. When a kernel is executed, the blocks are distributed to the many processors and a number of threads are assigned to each processor as a warp. Such an organization allows us to control the distribution of data and instructions over the threads. At any moment, a thread is operated on by only one kernel, although many kernels can be queued up in a stream. All threads in the grid are executed concurrently, ensuring fast parallel computation. Threads need to be synchronized at the end of a kernel operation to ensure that all data have been processed completely. However, synchronization must be used sparingly as it slows down the computation.
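As a small illustration of these built-in variables (a minimal sketch of our own, not code taken from the HAC implementation), a kernel typically derives a unique global index for each thread from blockIdx, blockDim and threadIdx and uses it to select the data element it works on:

__global__ void scaleVector(float *v, float s, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique index of this thread in the grid
    if (idx < n)                                      // guard threads that fall beyond the data
        v[idx] *= s;                                  // each thread processes exactly one element
}
// Host side launch: scaleVector<<<numBlocks, threadsPerBlock>>>(dev_v, 2.0f, n);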

B. HAC Implementation on the GPU Using CUDA

The computational tasks of the single link HAC algorithm and the CUDA functions used to implement them are summarized in Table III. The process of updating the similarities and merging the clusters is repeated until only one cluster exists. The quality of the clustering approach on the GPU is evaluated by comparing the clusters formed against those from the CPU implementation. In this section we briefly discuss the construction of the similarity matrix in the global memory of the GPU. The similarity matrix Sij is obtained by splitting the task of computing Euclidean distances between several threads as follows:

1) Each block in the grid is responsible for computing one square sub-matrix Ssub of Sij.

2) Each thread within the block is responsible for computing one element of Ssub.

The number of threads per block or the equivalent number of blocks within the grid should be chosen to maximize the utilization of the GPU’s computing resources.

Figure 3. CUDA Device Memory Model of GPU. [Courtesy: NVIDIA]

This warrants that there should be as many blocks in total as the number of processors in the GPU. To be efficient we need to maximize the number of threads and allow two or more blocks to run concurrently. For a GPU with 128 processors, utilizing all processors with the maximum number of threads will maximize the efficiency. If the block dimension blocSize is 4 by 4 (4x4), then 128/16 = 8 blocks per grid is considered optimal. When the block dimension blocSize is 4x4 and 4 blocks are used per grid, 64 threads operate on the data per grid. While computing distances between vectors in the HAC algorithm, if there are 16 dimensions per vector, the distances between the first vector and 4 other vectors can be computed in one pass with the above block structure. Thus each grid with 64 threads behaves like an 'operator' on the data while the kernel is executed. The following steps illustrate the computation of the similarity matrix using the GPU.

Step 1. Read vectors from the Dev_data array on the GPU
Step 2. Calculate the index using expression (1)
Step 3. Store the distances in the Dev_dist array on the GPU
Step 4. Compute minimum distances and merge
Step 5. Update the Dev_data array and the Dev_dist array
Step 6. Repeat steps 1 to 5 till all the vectors are exhausted.

Using a 1-dimensional array instead of the traditional 2-dimensional array in the global memory showed significant improvement in the computational performance. The 1-dimensional array, as shown in Fig. 4, allows efficient reading and writing to contiguous memory addresses. The expression shown in (1) gives the index k of the similarity array as it relates to the index pair (i, j) of the corresponding vectors.

k = i(n - 1) - i(i - 1)/2 + (j - i - 1)                (1)
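As an illustration (a hypothetical helper of our own, not part of the original code), expression (1) can be written as a small function that maps the pair (i, j), with i < j, to its position k in the 1-dimensional half matrix:

__host__ __device__ inline int halfMatrixIndex(int i, int j, int n)
{
    // Row i starts at offset i*(n-1) - i*(i-1)/2; the column offset within the row is j-i-1.
    return i * (n - 1) - i * (i - 1) / 2 + (j - i - 1);
}
// For n = 5 vectors: (0,1)->0, (0,4)->3, (1,2)->4 and (3,4)->9,
// filling all n(n-1)/2 = 10 entries of the half matrix exactly once.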

The kernel that runs on the GPU to compute the similarity matrix for a 4x4 block is shown in Table IV. It can be seen that in order to execute the kernel on the GPU, the block and grid sizes have to be declared before the computations are coded, including the computation of the memory location index k. The call to calculateDistance() launches the kernel and the partial distances are computed in parallel.

IV. EXPERIMENTS AND RESULTS ANALYSIS

A. Choice of Domain and Data

Bioinformatics is a critical field of study where clustering algorithms are applied at large for various research and medical purposes. Hierarchical clustering methods in particular are often used for studying genes, experimental samples, or both.

Figure 4. Similarity Half Matrix in the GPU Global Memory as a 1-Dimensional Array.


TABLE III. SINGLE LINK HAC IMPLEMENTATION IN GPU WITH CUDA

Computational Task in HAC                       | GPU CUDA Function or Technique Used                                    | Kernel in GPU
Transfer input vectors from CPU to GPU          | cudaMemcpy(..., cudaMemcpyHostToDevice);                               | n.a.
Populate initial similarity half matrix         | calculateDistance(); the k index locates the position in the 1D array | Yes
Identify minimum distance vectors               | cublasIsamin(); built-in function of CUDA                              | Yes
Merge / calculate new cluster vector and update | updateArray0(); calculates the centroid                                | Yes
Update the similarity half matrix               | updateArray1(); updateArray2();                                        | Yes
Update the minimum distance array               | updateArray3();                                                        | Yes
Transfer cluster data from GPU back to CPU      | cudaMemcpy(..., cudaMemcpyDeviceToHost);                               | n.a.
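To illustrate the 'identify minimum distance vectors' task in Table III, the position of the smallest pairwise distance in the 1-dimensional half matrix can be obtained with a single CUBLAS call. The snippet below is a hedged sketch of our own, assuming the legacy CUBLAS API shipped with CUDA 2.0, and is not the authors' exact code:

#include <cublas.h>

// dev_dist: device pointer to the 1D half matrix with numPairs = n*(n-1)/2 entries.
int findClosestPair(const float *dev_dist, int numPairs)
{
    // cublasIsamin returns the 1-based index of the minimum-magnitude element;
    // distances are non-negative, so this is the minimum distance pair.
    int k = cublasIsamin(numPairs, dev_dist, 1);
    return k - 1;   // convert to the 0-based index k used in expression (1)
}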

TABLE IV. CUDA CODE TO COMPUTE HALF SIMILARITIES

__global__ void calculateDistance (...) ...;
dim3 blocSize;
blocSize.x = 4;
blocSize.y = 4;
dim3 gridSize;
for (int i = 0; i < n-1; i++)           // for each value of i
{
    int m = n - 1 - i;
    int x = i*(n-1) - i*(i-1)/2;
    gridSize.x = m/16 + (m%16 != 0)*1;
    for (int k = 0; k < d; k++)         // for each dimension
    {
        calculateDistance<<<gridSize, blocSize>>>(dev_data[k], dev_dist, i, n, x);
    }
}
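The body of the calculateDistance kernel is not listed in the paper. A minimal sketch consistent with the launch code above (assuming dev_data[k] holds dimension k of all n vectors and dev_dist is zero-initialised) might look as follows, with each thread accumulating the partial squared difference for one vector pair:

__global__ void calculateDistance(float *data, float *dist, int i, int n, int x)
{
    // Flatten the 4x4 block of threads into one linear thread index t.
    int t = blockIdx.x * (blockDim.x * blockDim.y)
          + threadIdx.y * blockDim.x + threadIdx.x;
    int j = i + 1 + t;                  // vector paired with vector i
    if (j < n)
    {
        float r = data[i] - data[j];    // difference in this single dimension
        dist[x + t] += r * r;           // accumulate into the half matrix entry for (i, j)
    }
}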

Such clustering approaches help to identify genes that are co-regulated or that participate in similar biological processes. This can be further used for promoter prediction and gene function prediction, and thus to find new potential tumor subclasses [4]. We have selected yeast microarray gene expression data for testing the efficiency of our GPU implementations. There are 64000 genes, each with 31 dimensions. Though gene expressions may be in the order of thousands or hundreds of thousands, usually only a few thousand gene samples or a group of genes are chosen for clustering. Moreover, the clusters will be more robust if genes not important for the analysis are filtered out before clustering is performed.

B. Experimental Setup, Results Analysis and Discussions

The implementations were run on a desktop computer with a Pentium Dual Core CPU, 1.8 GHz, 1.0 GB RAM, with MS Windows XP Professional 2002, using MS Visual C++ 2005 for the development. The NVIDIA GeForce 8800 GTS 512 is the test GPU, with CUDA 2.0. Yeast gene expression microarray data with 31 dimensions is used to measure the timings and quantify the performance. The computational performance of the GPU over the CPU is expressed in terms of 'Speedup', which is the ratio of the CPU to the GPU computational time. The following studies were planned and conducted, and the results are analyzed.

1) Performance Evaluation - Number of Blocks in CUDA versus Number of Genes: The computational speedup using 4, 6, 8 and 16 blocks per grid with increasing numbers of yeast genes with 6 and 31 dimensions is shown in Fig. 5. It can be deduced that the speedup while using 8 blocks is slightly better than using 4, 6 or 16 blocks per grid when the dimension size is smaller, though the difference might not be significant. For genes with 31 dimensions, there is hardly any difference in performance while using different numbers of blocks. The 8800 GTS hardware has 96 internal processors; for a block size of 4x4 we have 16 threads, and the calculated optimal number of blocks per grid should be 6. But the study shows that with 8 blocks the performance level is better or almost the same. Fig. 5 also shows that the overall performance drops with an increased number of dimensions. Performance between blocks is almost the same when 4 blocks are used. Obviously, the optimal number of blocks to be used depends on the size of the gene vectors and their dimensions more than on the theoretically calculated number of 6 blocks. The drop in performance with an increase in the number of genes and their dimensions is attributed to the fact that we use a simple method of memory management in CUDA, i.e. the use of the global memory instead of the shared or local memory, which is fast and specific to blocks.

2) Speedup Profile Determination: We further intend to determine the peak performance of the CUDA implementation of HAC versus gene size. Fig. 6 shows the speedup profiles based on genes with 31 dimensions; it can be noted that the performance peaks when there are 8k to 11k genes, and that is the region where the speedup is significantly higher with 8 blocks per grid. Table V shows typical computational times in seconds. So the selection of block size should depend on the data size n and the dimension d.

3) Effect of Using Global Memory on Scalability: In the previous research works [6] and [7], the shared memories of the blocks are used instead of the global memory. The shared memory can be 150 to 200 times faster than accessing the global memory of the GPU, but it comes at a cost. The shared memory is much smaller than the global memory, so the data and the distance matrix have to be split into chunks. Thread alignment becomes critical while threads share common data, and coding is tedious.

Figure 5. Performance of GPU based on Blocks per Grid versus Gene Size and Dimensions.


Though the global memory gives relatively slower performance, it is the largest memory in the GPU, it is easier to code for, and all threads are able to access the data in common, so little memory management is required. Fig. 7 shows that with our global memory management the speedup drops tremendously when the dimensions are over 100. The performance can be reversed if the shared and local memories of the blocks are used. We propose that scalability trade-offs can be made between the choice of memory on the one hand and ease of programming and memory management on the other, if the actual data set has 100 or fewer dimensions.

V. CONCLUSION AND FUTURE WORKS

In this paper, we have presented a cost-effective and simple implementation of the HAC algorithm using CUDA. This implementation has surpassed the CPU implementation by 30 to 65 times. For clustering 15000 genes with 31 feature expressions each, the CPU takes almost 3 hours to converge, whereas our GPU machine with CUDA processes it within 6 minutes. By efficient use of the threads in the processors and the large global memory, a trade-off has been proposed between scalability speedups and ease of memory management, and hence programming. The novelties in our implementation include: speeding up access to data in the global memory using a 1D half similarity array, and identifying the minimum distance pairs in virtually one pass using the built-in cublasIsamin function while completing the HAC tasks within the GPU itself. Moreover, the need to maintain huge data and to align threads to the local and shared memories is avoided, thus minimizing synchronization of threads. We propose that the speedup depends more on the number of vectors used and their dimensions for a given GPU configuration, and that CUDA parameters such as block size, number of blocks, etc. are to be selected accordingly to maximize performance.

The full capabilities of CUDA for HAC are yet to be explored, in order to reach optimal clustering conditions that maximize the utilization of its raw parallel processing power. We are currently looking into the challenges of implementing the complete link and other HAC methods. The use of CUDA will be further exploited to reduce the complexity of the HAC algorithm. In future, graphics accelerators will be designed and developed not just for their astonishing graphics performance but also for their computational excellence, thus reducing the cost of efficient parallel processing.

REFERENCES

[1] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn and T. J. Purcell, "A survey of general-purpose computation on graphics hardware", Proc. Eurographics 2005, State of the Art Reports, pp. 21-51, August 2005.

[2] NVIDIA Corporation, "CUDA Programming Guide 2.0". Retrieved December 12, 2008, from the NVIDIA CUDA developer zone: http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf

[3] H. Nguyen, GPU Gems 3. Addison-Wesley, 2007.

Figure 6. Speedup Profiles of GPU versus Blocks per Grid and Gene Size.

TABLE V. HAC TIME IN SECONDS VS. GENE SIZES FOR 8 BLOCKS

Genes    256    1024   4096   8192    10000   13000   15000
CPU      0.1    4.3    223.3  1720.4  3105.6  6922.3  10404.1
GPU      0.7    1.3    9.1    48.8    67.2    197.1   324.3
Speedup  0.2    3.2    24.5   35.3    46.2    35.1    32.1

Figure 7. Speedup Profile of GPU versus Blocks per Grid and Gene Size.

[4] H. C. Causton, J. Quackenbush, and A. Brazma, Microarray Gene Expression Data Analysis, Blackwell Publishing, 2003.

[5] L. Kaufman, and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, Inc. 1990.

[6] D. Chang, A. Nathaniel, J. Dazhuo, O. L. Ming, and R. K. Ragade, "Compute pairwise Euclidean distances of data points with GPUs", Proc. IASTED International Symposium on Computational Biology and Bioinformatics (CBB) 2008, Orlando, Florida, USA, Nov. 2008.

[7] J. Wilson, M. Dai, E. Jakupovic, S. Watson and F. Meng, "Supercomputing with toys: Harnessing the power of NVIDIA 8800GTX and PlayStation 3 for bioinformatics problems", Proc. Conference on Computational Systems Bioinformatics, University of California, San Diego, USA, pp. 387-390, 2007.

[8] N. Govindaraju, N. Raghuvanshi and D. Manocha, "Fast approximate stream mining of quantiles and frequencies using graphics processors", Proc. ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, pp. 611-622, 2005.

[9] Q. Zhang and Y. Zhang, "Hierarchical clustering of gene expression profiles with graphics hardware acceleration", Pattern Recognition Letters, vol. 27, Elsevier, pp. 676-681, 2006.

[10] S. Arul, M. Dash, and M. Tue. “Efficient k-means clustering using accelerated graphics processors”, Proc. 10th International Conference on Data Warehousing and Knowledge Discovery (DAWAK), 2008.

[11] S. Arul, M. Dash, and M. Tue. “Graphics hardware based efficient and scalable fuzzy c-means clustering,” Proc. The Australasian Data Mining Conference (AusDM08), 2008.

[12] P. Tulloch, "Supercomputing's next revolution". Retrieved July 23, 2008 from Wired News: http://wired.com/gadgets/pcs/news/2006/11/7209
