Research Article

A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs

Guixia He1 and Jiaquan Gao2

1 Zhijiang College, Zhejiang University of Technology, Hangzhou 310024, China
2 College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
Correspondence should be addressed to Jiaquan Gao: springf12@163.com
Received 4 January 2016; Accepted 27 March 2016

Academic Editor: Veljko Milutinovic
Copyright © 2016 G. He and J. Gao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format to store sparse matrices. However, CSR-based SpMVs on graphic processing units (GPUs), for example, CSR-scalar and CSR-vector, usually have poor performance due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU that is called PCSR. PCSR involves two kernels and accesses CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR fully outperforms CSR-scalar, CSR-vector, and CSRMV and HYBMV in the vendor-tuned CUSPARSE library and is comparable with a most recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR on a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that, no matter whether the communication between GPUs is considered or not, PCSR on multiple GPUs achieves good performance and has high parallel efficiency.
1. Introduction
Sparse matrix-vector multiplication (SpMV) has proven to be an important operation in scientific computing. It needs to be accelerated because SpMV represents the dominant cost in many iterative methods for solving large-sized linear systems and eigenvalue problems that arise in a wide variety of scientific and engineering applications [1]. Initial work on accelerating SpMV on CUDA-enabled GPUs was presented by Bell and Garland [2, 3]. The corresponding implementations in the CUSPARSE [4] and CUSP [5] libraries include optimized codes for the well-known compressed sparse row (CSR), coordinate list (COO), ELLPACK (ELL), hybrid (HYB), and diagonal (DIA) formats. Experimental results show speedups between 1.56 and 12.30 compared to an optimized CPU implementation for a range of sparse matrices.
SpMV is a largely memory-bandwidth-bound operation. Reported results indicate that different access patterns to the matrix and vectors on the GPU influence the SpMV performance [2, 3]. The COO, ELL, DIA, and HYB kernels benefit from full coalescing. However, the scalar CSR kernel (CSR-scalar) shows poor performance because of its rarely coalesced memory accesses [3]. The vector CSR kernel (CSR-vector) improves the performance of CSR-scalar by using warps to access the CSR structure in a contiguous but not generally aligned fashion [3], which implies partial coalescing. Since then, researchers have developed many highly efficient CSR-based SpMV implementations on the GPU by optimizing the memory access pattern of the CSR structure. Lu et al. [6] optimize CSR-scalar by padding CSR arrays and achieve a 30% improvement of the memory access performance. Dehnavi et al. [7] propose a prefetch-CSR method that partitions the matrix nonzeros into blocks of the same size and distributes them amongst GPU resources. This method obtains a slightly better behavior than CSR-vector by padding rows with zeros to increase data regularity, using parallel reduction techniques, and prefetching data to hide global memory accesses. Furthermore, Dehnavi et al. enhance the performance of the prefetch-CSR method by replacing it with three subkernels [8]. Greathouse and Daga suggest a CSR-Adaptive algorithm that keeps the CSR format intact
and maps well to GPUs [9]. Their implementation efficiently accesses DRAM by streaming data into the local scratchpad memory and dynamically assigns different numbers of rows to each parallel GPU compute unit. In addition, numerous works have been proposed for GPUs using variants of the CSR storage format, such as the compressed sparse eXtended [10], bit-representation-optimized compression [11], block CSR [12, 13], and row-grouped CSR [14] formats.

Besides using variants of CSR, many highly efficient SpMVs on GPUs have been proposed by utilizing variants of the ELL and COO storage formats, such as ELLPACK-R [15], ELLR-T [16], sliced ELL [13, 17], SELL-C-σ [18], sliced COO [19], and blocked compressed COO [20]. Specialized storage formats provide definitive advantages. However, as many programs use CSR, the conversion from CSR to other storage formats presents a large engineering hurdle and can incur large runtime overheads and require extra storage space. Moreover, CSR-based algorithms generally have a lower memory usage than those that are based on other storage formats such as ELL, DIA, and HYB.
All the above observations motivate us to further investigate how to construct efficient SpMVs on GPUs while keeping CSR intact. In this study, we propose a perfect CSR algorithm, called PCSR, on GPUs. PCSR is composed of two kernels and accesses CSR arrays in a fully coalesced manner. Experimental results on C2050 GPUs show that PCSR outperforms CSR-scalar and CSR-vector and has a better behavior compared to CSRMV and HYBMV in the vendor-tuned CUSPARSE library [4] and a most recently proposed CSR-based algorithm, CSR-Adaptive.
The main contributions of this paper are summarized as follows:

(i) A novel SpMV implementation on a GPU which keeps CSR intact is proposed. The proposed algorithm consists of two kernels and alleviates the deficiencies of many existing CSR algorithms that access CSR arrays in a rarely or partially coalesced manner.

(ii) Our proposed SpMV algorithm on a GPU is extended to multiple GPUs. Moreover, we suggest two methods to balance the workload among multiple GPUs.
The rest of this paper is organized as follows. Following this introduction, the matrix storage, CUDA architecture, and SpMV are described in Section 2. In Section 3, a new SpMV implementation on a GPU is proposed. Section 4 discusses how to extend the proposed SpMV algorithm on a GPU to multiple GPUs. Experimental results are presented in Section 5. Section 6 contains our conclusions and points to our future research directions.
2. Related Techniques
2.1. Matrix Storage. To take advantage of the large number of zeros in sparse matrices, special storage formats are required. In this study, only the compressed sparse row (CSR) format is considered, although there are many varieties of sparse matrix storage formats, such as ELLPACK (or ITPACK) [21], COO [22], DIA [1], and HYB [3]. Using CSR, an $n \times n$ sparse matrix $A$ with $N$ nonzero elements is stored via three arrays: (1) the array $data$ contains all the nonzero entries of $A$; (2) the array $indices$ contains the column indices of the nonzero entries that are stored in $data$; and (3) the entries of the array $ptr$ point to the first entry of subsequent rows of $A$ in the arrays $data$ and $indices$.
For example, the following matrix

$$A = \begin{bmatrix} 4 & 1 & 0 & 1 & 0 & 0 \\ 1 & 4 & 1 & 0 & 1 & 0 \\ 0 & 1 & 4 & 0 & 0 & 1 \\ 1 & 0 & 0 & 4 & 1 & 0 \\ 0 & 1 & 0 & 1 & 4 & 1 \\ 0 & 0 & 1 & 0 & 1 & 4 \end{bmatrix} \quad (1)$$

is stored in the CSR format by

$$\begin{aligned} data &= [4, 1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 4], \\ indices &= [0, 1, 3, 0, 1, 2, 4, 1, 2, 5, 0, 3, 4, 1, 3, 4, 5, 2, 4, 5], \\ ptr &= [0, 3, 7, 10, 13, 17, 20]. \end{aligned} \quad (2)$$
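Written out as plain C arrays (an illustrative sketch following the paper's notation), this CSR representation is:

/* CSR arrays of the 6 x 6 matrix A in (1): 20 nonzeros, 7 row pointers. */
double data[20]    = {4,1,1, 1,4,1,1, 1,4,1, 1,4,1, 1,1,4,1, 1,1,4};
int    indices[20] = {0,1,3, 0,1,2,4, 1,2,5, 0,3,4, 1,3,4,5, 2,4,5};
int    ptr[7]      = {0, 3, 7, 10, 13, 17, 20};  /* ptr[i+1] - ptr[i] = nonzeros in row i */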
2.2. CUDA Architecture. The compute unified device architecture (CUDA) is a heterogeneous computing model that involves both the CPU and the GPU [23]. Executing a parallel program on the GPU using CUDA involves the following: (1) transferring required data to the GPU global memory; (2) launching the GPU kernel; and (3) transferring results back to the host memory. The threads of a kernel are grouped into a grid of thread blocks. The GPU schedules blocks over the multiprocessors according to their available execution capacity. When a block is given to a multiprocessor, it is split into warps that are composed of 32 threads. In the best case, all 32 threads have the same execution path, and the instruction is executed concurrently. If not, the execution paths are executed sequentially, which greatly reduces the efficiency. The threads in a block communicate via the fast shared memory, but the threads in different blocks communicate through high-latency global memory. Major challenges in optimizing an application on GPUs are global memory access latency, different execution paths in each warp, communication and synchronization between threads in different blocks, and resource utilization.
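A minimal host-side CUDA C sketch of the three steps above (the kernel and sizes are illustrative assumptions, not code from the paper):

#include <cuda_runtime.h>

/* A trivial kernel, only to make the sketch self-contained. */
__global__ void my_kernel(const float *in, float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = 2.0f * in[i];
}

void run_on_gpu(const float *h_in, float *h_out, int n)
{
    float *d_in, *d_out;
    size_t bytes = n * sizeof(float);

    /* (1) transfer required data to GPU global memory */
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    /* (2) launch the kernel as a grid of thread blocks */
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    my_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, n);

    /* (3) transfer results back to the host memory */
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}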
2.3. Sparse Matrix-Vector Multiplication. Assume that $A$ is an $n \times n$ sparse matrix and $x$ is a vector of size $n$; a sequential version of CSR-based SpMV is then described in Algorithm 1. Obviously, the order in which the elements of $data$, $indices$, $ptr$, and $x$ are accessed has an important impact on the SpMV performance on GPUs, where memory access patterns are crucial.
Input: data, indices, ptr, x, n
Output: y
(01) for i ← 0 to n − 1 do
(02)   row_start ← ptr[i]
(03)   row_end ← ptr[i + 1]
(04)   sum ← 0
(05)   for j ← row_start to row_end − 1 do
(06)     sum += data[j] · x[indices[j]]
(07)   done
(08)   y[i] ← sum
(09) done

Algorithm 1: Sequential SpMV.
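As a point of reference, Algorithm 1 translates directly into plain C (a sketch; double-precision types are assumed):

/* Sequential CSR SpMV: y = A * x, with A given by data/indices/ptr. */
void spmv_csr(const double *data, const int *indices, const int *ptr,
              const double *x, double *y, int n)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int j = ptr[i]; j < ptr[i + 1]; ++j)
            sum += data[j] * x[indices[j]];
        y[i] = sum;
    }
}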
3. SpMV on a GPU
In this section, we present a perfect implementation of CSR-based SpMV on the GPU. Different from other related work, the proposed algorithm involves the following two kernels:

(i) Kernel 1: calculate the array $v = [v_1, v_2, \ldots, v_N]$, where $v_i = data[i] \cdot x[indices[i]]$, $i = 1, 2, \ldots, N$, and then save it to global memory.

(ii) Kernel 2: accumulate the element values of $v$ according to the formula $y_j = \sum_{ptr[j] \leqslant i < ptr[j+1]} v_i$, $j = 0, 1, \ldots, n - 1$, and store them to an array $y$ in global memory (a small worked example is given below).
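For concreteness, take the matrix $A$ in (1) with, say, $x = (1, 1, 1, 1, 1, 1)^{T}$ (an illustrative choice, under which $v$ simply coincides with $data$). Kernel 2 then reduces each segment of $v$ delimited by consecutive entries of $ptr$; for row 0,

$$y_0 = \sum_{ptr[0] \leqslant i < ptr[1]} v_i = v_0 + v_1 + v_2 = 4 + 1 + 1 = 6.$$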
We call the proposed SpMV algorithm PCSR. For simplicity, the symbols used in this study are listed in Table 1.
3.1. Kernel 1. The detailed procedure of Kernel 1 is shown in Algorithm 2. We observe that the accesses to the two arrays data and indices in global memory are fully coalesced. However, the vector x in global memory is randomly accessed, which decreases the performance of Kernel 1. On the basis of the evaluations in [24], texture memory is the best memory space in which to place randomly accessed data. Therefore, texture memory is utilized here to hold the vector instead of global memory. For the single-precision floating-point texture, the fourth step in Algorithm 2 is rewritten as
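the following, a plausible reconstruction in which x_tex denotes a hypothetical single-precision texture reference bound to x:

(04) v[tid] ← data[tid] · tex1Dfetch(x_tex, indices[tid])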
Because textures do not support double values, the following function fetch_double() is suggested to convert the fetched int2 value to a double value:

__device__ double fetch_double(texture<int2, 1> t, int i)
{
    int2 v = tex1Dfetch(t, i);
    return __hiloint2double(v.y, v.x);
}
Furthermore, for the double-precision floating-point texture, based on the function fetch_double(), we rewrite the fourth step in Algorithm 2 as
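the following, again a plausible reconstruction with x_tex now a hypothetical int2 texture reference bound to x:

(04) v[tid] ← data[tid] · fetch_double(x_tex, indices[tid])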
3.2. Kernel 2. Kernel 2 accumulates the element values of v obtained by Kernel 1; its detailed procedure is shown in Algorithm 3. This kernel is mainly composed of the following three stages:
(i) In the first stage, the array ptr in global memory is piecewise assembled into the shared-memory array ptr_s of each thread block in parallel. Each thread of a thread block is responsible for loading one element value of ptr into ptr_s, except for thread 0 (see lines (05)-(06) in Algorithm 3). The detailed procedure is illustrated in Figure 1. We can see that the accesses to ptr are aligned.

(ii) The second stage loads the element values of v in global memory, from position ptr_s[0] to position ptr_s[TB], into the shared-memory array v_s of each thread block. The assembling procedure is illustrated in Figure 2. In this case, the access to v is fully coalesced.

(iii) The third stage accumulates the element values of v_s, as shown in Figure 3. The accumulation is highly efficient due to the utilization of the two shared-memory arrays ptr_s and v_s.
Obviously, Kernel 2 benefits from shared memory. Using shared memory, not only are the data accessed quickly, but the accesses to the data are also coalesced.
From the above procedures for PCSR, we observe that PCSR needs additional global memory space to store the middle array v, besides storing the CSR arrays data, indices, and ptr. Saving data into v in Kernel 1 and loading data from v in Kernel 2 decrease the performance of PCSR to a degree. However, PCSR benefits from the middle array v because introducing v makes it possible to access the CSR arrays data, indices, and ptr in a fully coalesced manner. This greatly improves the speed of accessing the CSR arrays and alleviates the principal deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing).
Table 1: Symbols used in this study.

Symbol | Description
A | Sparse matrix
x | Input vector
y | Output vector
n | Size of the input and output vectors
N | Number of nonzero elements in A
threadsPerBlock (TB) | Number of threads per block
blocksPerGrid (BG) | Number of blocks per grid
elementsPerThread | Number of elements calculated by each thread
sizeSharedMemory | Size of shared memory
M | Number of GPUs
Input: data, indices, x, N
CUDA-specific variables: (i) threadIdx.x: thread index; (ii) blockIdx.x: block index; (iii) blockDim.x: number of threads per block; (iv) gridDim.x: number of blocks per grid
Output: v
(01) tid ← threadIdx.x + blockIdx.x · blockDim.x
(02) icr ← blockDim.x · gridDim.x
(03) while tid < N
(04)   v[tid] ← data[tid] · x[indices[tid]]
(05)   tid += icr
(06) end while

Algorithm 2: Kernel 1.
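For reference, Algorithm 2 maps almost line for line onto CUDA C. The sketch below assumes single precision and reads x directly from global memory; the texture variant discussed in Section 3.1 would only change the product line:

__global__ void pcsr_kernel1(const float *data, const int *indices,
                             const float *x, float *v, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int icr = blockDim.x * gridDim.x;          /* grid-stride step */
    while (tid < N) {
        v[tid] = data[tid] * x[indices[tid]];  /* coalesced data/indices reads */
        tid += icr;
    }
}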
4. SpMV on Multiple GPUs
In this section, we present how to extend PCSR on a single GPU to multiple GPUs. Note that only the case of multiple GPUs in a single node (single PC) is discussed, because of its good extensibility (e.g., the approach can also be used on multi-CPU and multi-GPU heterogeneous platforms). To balance the workload among multiple GPUs, the following two methods can be applied:
(1) For the first method, the matrix is equally partitioned into M (number of GPUs) submatrices according to the matrix rows. Each submatrix is assigned to one GPU, and each GPU is only responsible for computing the product of the assigned submatrix with the complete input vector.

(2) For the second method, the matrix is equally partitioned into M submatrices according to the number of nonzero elements. Each GPU again only calculates the product of a submatrix with the complete input vector.
In most cases, the two partitioning methods mentioned above behave similarly. However, in some exceptional cases, for example, when most nonzero elements are concentrated in a few rows of a matrix, the submatrices obtained by the first method differ distinctly in their numbers of nonzero elements, while those obtained by the second method differ in their numbers of rows (a code sketch of the second method is given below). Which method is the preferred one for PCSR?
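A minimal sketch of the second method, assuming the partition is represented by a hypothetical array row_start of M + 1 row offsets computed from ptr:

/* Partition rows 0..n-1 into M pieces with roughly equal nonzero counts. */
void partition_by_nnz(const int *ptr, int n, int M, int *row_start)
{
    int N = ptr[n];                       /* total number of nonzeros */
    row_start[0] = 0;
    int r = 0;
    for (int g = 1; g < M; ++g) {
        long target = (long)g * N / M;    /* desired nonzero prefix   */
        while (r < n && ptr[r + 1] <= target)
            ++r;                          /* advance to the cutting row */
        row_start[g] = r;
    }
    row_start[M] = n;  /* GPU g computes rows [row_start[g], row_start[g+1]) */
}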
If each GPU has the complete input vector, PCSR on multiple GPUs does not need to communicate between GPUs. In fact, SpMV is often applied in a large number of iterative methods, where the sparse matrix is iteratively multiplied by the input and output vectors. Therefore, if each GPU only holds a part of the input vector before SpMV, communication between GPUs is required in order to execute PCSR. Here, PCSR implements the communication between GPUs using NVIDIA GPUDirect.
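As an illustration of such an exchange, the sketch below uses CUDA peer-to-peer copies; the function and buffer names are assumptions, not the paper's actual communication code:

#include <cuda_runtime.h>

/* Copy the piece of the input vector owned by GPU `src` (len doubles,
   starting at element off) into GPU `dst`'s full-length copy of x. */
void gather_piece(double *d_x_dst, int dst,
                  const double *d_x_src, int src,
                  size_t off, size_t len)
{
    cudaSetDevice(dst);
    cudaDeviceEnablePeerAccess(src, 0);   /* returns an error if already enabled */
    cudaMemcpyPeer(d_x_dst + off, dst,
                   d_x_src + off, src,
                   len * sizeof(double));
}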
5. Experimental Results
5.1. Experimental Setup. In this section, we test the performance of PCSR. All test matrices come from the University of Florida Sparse Matrix Collection [25]. Their properties are summarized in Table 2.

All algorithms are executed on one machine, which is equipped with an Intel Xeon Quad-Core CPU and four NVIDIA Tesla C2050 GPUs. Our source codes are compiled and executed using the CUDA toolkit 6.5 under GNU/Linux Ubuntu v10.04.1. The performance is measured in terms of GFlops/s or GBytes/s.
5.2. Single GPU. We compare PCSR with CSR-scalar, CSR-vector, CSRMV, HYBMV, and CSR-Adaptive.
Input: v, ptr
CUDA-specific variables: (i) threadIdx.x: thread index; (ii) blockIdx.x: block index; (iii) blockDim.x: number of threads per block; (iv) gridDim.x: number of blocks per grid
Output: y
(01) define shared memory v_s with size sizeSharedMemory
(02) define shared memory ptr_s with size (threadsPerBlock + 1)
(03) gid ← threadIdx.x + blockIdx.x × blockDim.x
(04) tid ← threadIdx.x
/* Load v into the shared memory v_s */
(14) for j ← 0 to nlen/threadsPerBlock − 1 do
(15)   if index < nlen then
(16)     v_s[tid + j · threadsPerBlock] ← v[index]
(17)     index += threadsPerBlock
(18)   end
(19) done
(20) __syncthreads()
/* Perform a scalar-style reduction */
(21) if (ptr_s[tid + 1] ⩽ i or ptr_s[tid] > i + nlen − 1) is false then
(22)   row_s ← max(ptr_s[tid] − i, 0)
(23)   row_e ← min(ptr_s[tid + 1] − i, nlen)
(24)   for j ← row_s to row_e − 1 do
(25)     sum += v_s[j]
(26)   done
(27) end
(28) done
(29) y[gid] ← sum

Algorithm 3: Kernel 2.
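To make the three stages concrete, here is a minimal CUDA C sketch of Kernel 2. It is one reading of Algorithm 3, not the authors' exact code: it assumes single precision, illustrative fixed values for TB and the v_s capacity, and that n is a multiple of TB so that every thread owns a valid row.

#define TB   256    /* threadsPerBlock (illustrative)             */
#define SMEM 1024   /* capacity of v_s in elements (illustrative) */

__global__ void pcsr_kernel2(const float *v, const int *ptr, float *y, int n)
{
    __shared__ float v_s[SMEM];
    __shared__ int   ptr_s[TB + 1];

    int tid = threadIdx.x;
    int gid = threadIdx.x + blockIdx.x * blockDim.x;  /* row of this thread */

    /* Stage 1: aligned, piecewise load of ptr into shared memory. */
    ptr_s[tid] = ptr[gid];
    if (tid == 0) ptr_s[TB] = ptr[blockIdx.x * TB + TB];
    __syncthreads();

    int   base  = ptr_s[0];          /* first element of v owned by the block */
    int   total = ptr_s[TB] - base;  /* RS in the paper's notation            */
    float sum   = 0.0f;

    /* Process the block's slice of v in chunks that fit into v_s. */
    for (int i = 0; i < total; i += SMEM) {
        int nlen = min(SMEM, total - i);

        /* Stage 2: fully coalesced load of one chunk of v. */
        for (int j = tid; j < nlen; j += TB)
            v_s[j] = v[base + i + j];
        __syncthreads();

        /* Stage 3: scalar-style reduction of this row's part of the chunk. */
        int row_s = max(ptr_s[tid]     - base - i, 0);
        int row_e = min(ptr_s[tid + 1] - base - i, nlen);
        for (int j = row_s; j < row_e; ++j)
            sum += v_s[j];
        __syncthreads();
    }
    y[gid] = sum;
}

Each pass of the outer loop stages one chunk of the block's slice of v into v_s (stage 2) and lets every thread reduce whatever part of its row falls inside that chunk (stage 3).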
[Figures 1 and 2 (diagrams): Figure 1 shows the aligned, piecewise assembly of ptr into the shared-memory array ptr_s of each thread block; Figure 2 shows the coalesced loading of v into the shared-memory array v_s, thread t of block i writing v_s[j · TB + t] in pass j. Here RS = ptr_s[TB] − ptr_s[0], m = ⌊RS/TB⌋, RT = RS − m · TB, and PT = ptr_s[0].]
CSR-scalar and CSR-vector in the CUSP library [5] are chosen in order to show the effects of accessing CSR arrays in a fully coalesced manner in PCSR. CSRMV in the CUSPARSE library [4] is a representative of CSR-based SpMV algorithms on the GPU. HYBMV in the CUSPARSE library [4] is a finely tuned HYB-based SpMV algorithm on the GPU and usually behaves better than many existing SpMV algorithms. CSR-Adaptive is a most recently proposed CSR-based algorithm [9].
We select 15 sparse matrices with distinct sizes, ranging from 25,228 to 2,063,494, as our test matrices. Figure 4 shows the single-precision and double-precision performance results, in terms of GFlops, of CSR-scalar, CSR-vector, CSRMV, HYBMV, CSR-Adaptive, and PCSR on a Tesla C2050. GFlops values in Figure 4 are calculated on the basis of the assumption of two Flops per nonzero entry of a matrix [3, 13]. In Figure 5, the measured memory bandwidth results for single precision and double precision are reported.
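That is, for a matrix with $N$ nonzero entries and a measured SpMV time of $t$ seconds, the plotted value is

$$\mathrm{GFlops} = \frac{2N}{10^{9}\, t}.$$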
5.2.1. Single Precision. From Figure 4(a), we observe that PCSR achieves high performance for all the matrices in the single-precision mode. In most cases, a performance of over 9 GFlops/s is obtained. Moreover, PCSR outperforms CSR-scalar, CSR-vector, and CSRMV for all test cases, and average speedups of 4.24x, 2.18x, and 1.62x compared to CSR-scalar, CSR-vector, and CSRMV are obtained, respectively. Furthermore, PCSR behaves slightly better than HYBMV for all the matrices except for af_shell9 and cont-300.
Figure 4: Performance of all algorithms on a Tesla C2050: (a) single precision; (b) double precision.
Figure 5: Effective bandwidth results of all algorithms on a Tesla C2050: (a) single precision; (b) double precision.
The average speedup is 1.22x compared to HYBMV. Figure 6 shows the visualization of af_shell9 and cont-300. We can find that af_shell9 and cont-300 have a similar structure, and each row of the two matrices has a very similar number of nonzero elements, which is suitable for storage in the ELL section of the HYB format. Particularly, PCSR and CSR-Adaptive have close performance. The average performance of PCSR is nearly 1.05 times faster than CSR-Adaptive.
Furthermore, PCSR has almost the best memory bandwidth utilization among all algorithms for all the matrices except for af_shell9 and cont-300 (Figure 5(a)). The maximum memory bandwidth of PCSR exceeds 128 GBytes/s, which is about 90 percent of the peak theoretical memory bandwidth (144 GBytes/s) of the Tesla C2050. Based on the performance metrics [26], we can conclude that PCSR achieves good performance and has high parallelism.
5.2.2. Double Precision. From Figures 4(b) and 5(b), we see that, for all algorithms, both the double-precision performance and the memory bandwidth utilization are smaller than the corresponding single-precision values due to the slow software-based operation. PCSR is still better than CSR-scalar, CSR-vector, and CSRMV and slightly outperforms HYBMV and CSR-Adaptive for all the matrices. The average speedup of PCSR is 3.33x compared to CSR-scalar, 1.98x compared to CSR-vector, 1.57x compared to CSRMV, 1.15x compared to HYBMV, and 1.03x compared to CSR-Adaptive. The maximum memory bandwidth of PCSR exceeds 108 GBytes/s, which is about 75 percent of the peak theoretical memory bandwidth of the Tesla C2050.
5.3. Multiple GPUs

5.3.1. PCSR Performance without Communication. Here we take the double-precision mode as an example to test the PCSR
Figure 6: Visualization of the af_shell9 and cont-300 matrices: (a) cont-300; (b) af_shell9.
Table 3: Comparison of PCSR-I and PCSR-II without communication on two GPUs.

Matrix | ET (1 GPU) | PCSR-I (2 GPUs): ET, SD, PE | PCSR-II (2 GPUs): ET, SD, PE
performance on multiple GPUs without considering communication. We call PCSR with the first method and PCSR with the second method PCSR-I and PCSR-II, respectively. Some large-sized test matrices in Table 2 are used. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables 3 and 4, respectively. In Tables 3 and 4, ET, SD, and PE stand for the execution time, standard deviation, and parallel efficiency, respectively. The time unit is millisecond (ms). Figures 7 and 8 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.
On two GPUs, we observe from Table 3 and Figure 7 that PCSR-II has better parallel efficiency than PCSR-I for all the matrices except for G3_circuit. The maximum, average, and minimum parallel efficiency of PCSR-II are 98.06%, 91.64%, and 77.41%, which wholly outperform the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 98.06%, 88.16%, and 72.20%. Moreover, PCSR-II has a smaller standard deviation than PCSR-I for all the matrices except for ecology2, Transport, and G3_circuit. This implies that the workload balance on two GPUs for the second method is advantageous over that for the first method.
On four GPUs, for the parallel efficiency and standard deviation, PCSR-II outperforms PCSR-I for all the matrices except for G3_circuit (Table 4 and Figure 8). The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.35%, 85.14%, and 64.17% and are advantageous over the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 96.21%, 78.89%, and 59.94%.
Table 4: Comparison of PCSR-I and PCSR-II without communication on four GPUs.

Matrix | ET (1 GPU) | PCSR-I (4 GPUs): ET, SD, PE | PCSR-II (4 GPUs): ET, SD, PE
Figure 7: Parallel efficiency of PCSR-I and PCSR-II without communication on two GPUs.
Particularly, for Ga41As41H72, F1, cage14, kkt_power, and Freescale1, the parallel efficiency of PCSR-II is almost 1.2 times that obtained by PCSR-I.
On the basis of the above observations, we conclude that PCSR-II has high performance and is on the whole better than PCSR-I. For PCSR on multiple GPUs, the second method is our preferred one.
5.3.2. PCSR Performance with Communication. We still take the double-precision mode as an example to test the PCSR performance on multiple GPUs with communication considered. PCSR with the first method and PCSR with the second method are still called PCSR-I and PCSR-II, respectively. The same test matrices as in the above experiment are utilized. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables 5 and 6, respectively. The time unit is ms. ET, SD, and PE in Tables 5 and 6 are
Figure 8: Parallel efficiency of PCSR-I and PCSR-II without communication on four GPUs.
the same as those in Tables 3 and 4. Figures 9 and 10 show the PCSR-I and PCSR-II parallel efficiency on two and four GPUs, respectively.
On two GPUs, PCSR-I and PCSR-II have almost close parallel efficiency for most matrices (Figure 9 and Table 5). As a comparison, PCSR-II slightly outperforms PCSR-I. The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.34%, 88.51%, and 80.44% and are advantageous over the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 96.05%, 86.03%, and 73.57%.
On four GPUs, for the parallel efficiency and standard deviation, PCSR-II is better than PCSR-I for all the matrices, except that PCSR-I has slightly better parallel efficiency for thermal2, G3_circuit, and Hamrle3 and a slightly smaller standard deviation for thermal2, G3_circuit, ecology2, and CurlCurl_4 (Figure 10 and Table 6). The maximum, average,
Table 5: Comparison of PCSR-I and PCSR-II with communication on two GPUs.

Matrix | ET (1 GPU) | PCSR-I (2 GPUs): ET, SD, PE | PCSR-II (2 GPUs): ET, SD, PE
and minimum parallel efficiency of PCSR-II for all the matrices are 90.12%, 74.50%, and 58.27%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 88.77%, 65.69%, and 56.39%.
Therefore, compared to PCSR-I and PCSR-II without communication, although the performance of PCSR-I and PCSR-II with communication decreases due to the influence of communication, they still achieve significant performance. Because PCSR-II overall outperforms PCSR-I for all test matrices, the second method in this case is still our preferred one for PCSR on multiple GPUs.

Figure 9: Parallel efficiency of PCSR-I and PCSR-II with communication on two GPUs.

Figure 10: Parallel efficiency of PCSR-I and PCSR-II with communication on four GPUs.
6. Conclusion
In this study, we propose a novel CSR-based SpMV on GPUs (PCSR). Experimental results show that our proposed PCSR on a GPU is better than CSR-scalar and CSR-vector in the CUSP library, CSRMV and HYBMV in the CUSPARSE library, and a most recently proposed CSR-based algorithm, CSR-Adaptive. To achieve high performance on multiple GPUs for PCSR, we present two matrix-partitioned methods to balance the workload among multiple GPUs. We observe that PCSR can show good performance both with and
without considering communication using the two matrix-partitioned methods. As a comparison, the second method is our preferred one.
Next, we will do further research in this area and develop other novel SpMVs on GPUs. In particular, future work will apply PCSR to some well-known iterative methods and thus solve scientific and engineering problems.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
The research has been supported by the Chinese Natural Science Foundation under Grant no. 61379017.
References
[1] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, Pa, USA, 2nd edition, 2003.
[2] N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," Tech. Rep., NVIDIA, 2008.
[3] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), pp. 14-19, Portland, Ore, USA, November 2009.
[5] N. Bell and M. Garland, "Cusp: generic parallel algorithms for sparse matrix and graph computations, version 0.5.1," 2015, http://cusp-library.googlecode.com.
[6] F. Lu, J. Song, F. Yin, and X. Zhu, "Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters," Computer Physics Communications, vol. 183, no. 6, pp. 1172-1181, 2012.
[7] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Finite-element sparse matrix vector multiplication on graphic processing units," IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982-2985, 2010.
[8] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Enhancing the performance of conjugate gradient solvers on graphic processing units," IEEE Transactions on Magnetics, vol. 47, no. 5, pp. 1162-1165, 2011.
[9] J. L. Greathouse and M. Daga, "Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp. 769-780, New Orleans, La, USA, November 2014.
[10] V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris, "An extended compression format for the optimization of sparse matrix-vector multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930-1940, 2013.
[11] W. T. Tang, W. J. Tan, R. Ray et al., "Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), pp. 1-12, Denver, Colo, USA, November 2013.
[12] M. Verschoor and A. C. Jalba, "Analysis and performance estimation of the conjugate gradient method on multiple GPUs," Parallel Computing, vol. 38, no. 10-11, pp. 552-575, 2012.
[13] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on GPUs," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10), pp. 115-126, ACM, Bangalore, India, January 2010.
[14] G. Oyarzun, R. Borrell, A. Gorobets, and A. Oliva, "MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner," Computers & Fluids, vol. 92, pp. 244-252, 2014.
[15] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "A new approach for sparse matrix vector product on NVIDIA GPUs," Concurrency and Computation: Practice and Experience, vol. 23, no. 8, pp. 815-826, 2011.
[16] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach," Parallel Computing, vol. 38, no. 8, pp. 408-420, 2012.
[17] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in High Performance Embedded Architectures and Compilers: 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25-27, 2010, Proceedings, pp. 111-125, Springer, Berlin, Germany, 2010.
[18] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401-C423, 2014.
[19] H.-V. Dang and B. Schmidt, "CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations," Parallel Computing, vol. 39, no. 11, pp. 737-750, 2013.
[20] S. Yan, C. Li, Y. Zhang, and H. Zhou, "YaSpMV: yet another SpMV framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pp. 107-118, February 2014.
[21] D. R. Kincaid and D. M. Young, "A brief review of the ITPACK project," Journal of Computational and Applied Mathematics, vol. 24, no. 1-2, pp. 121-127, 1988.
[22] G. Blelloch, M. Heroux, and M. Zagha, "Segmented operations for sparse matrix computation on vector multiprocessors," Tech. Rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA, 1993.
[23] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40-53, 2008.
[24] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data-parallel architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105-118, 2011.
[25] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1-25, 2011.
and maps well to GPUs [9] Their implementation efficientlyaccesses DRAM by streaming data into the local scratchpadmemory and dynamically assigns different numbers of rowsto each parallel GPU compute unit In addition numerousworks have proposed for GPUs using the variants of the CSRstorage format such as the compressed sparse eXtended [10]bit-representation-optimized compression [11] block CSR[12 13] and row-grouped CSR [14]
Besides using the variants of CSR many highly efficientSpMVs onGPUs have been proposed by utilizing the variantsof the ELL and COO storage formats such as the ELLPACK-R [15] ELLR-T [16] sliced ELL [13 17] SELL-C-120590 [18] slicedCOO [19] and blocked compressed COO [20] Specializedstorage formats provide definitive advantages However asmany programs use CSR the conversion from CSR to otherstorage formats will present a large engineering hurdle andcan incur large runtime overheads and require extra storagespace Moreover CSR-based algorithms generally have alower memory usage than those that are based on otherstorage formats such as ELL DIA and HYB
All the above observations motivate us to further inves-tigate how to construct efficient SpMVs on GPUs whilekeeping CSR intact In this study we propose a perfect CSRalgorithm called PCSR on GPUs PCSR is composed oftwo kernels and accesses CSR arrays in a fully coalescedmanner Experimental results on C2050 GPUs show thatPCSR outperforms CSR-scalar and CSR-vector and has abetter behavior compared to CSRMV and HYBMV in thevendor-tuned CUSPARSE library [4] and a most recentlyproposed CSR-based algorithm CSR-Adaptive
The main contributions in this paper are summarized asfollows
(i) A novel SpMV implementation on a GPU whichkeeps CSR intact is proposed The proposed algo-rithm consists of two kernels and alleviates the defi-ciencies of many existing CSR algorithms that accessCSR arrays in a rare or partial coalesced manner
(ii) Our proposed SpMV algorithm on aGPU is extendedtomultiple GPUs Moreover we suggest twomethodsto balance the workload among multiple GPUs
The rest of this paper is organized as follows Followingthis introduction the matrix storage CUDA architectureand SpMV are described in Section 2 In Section 3 a newSpMV implementation on a GPU is proposed Section 4discusses how to extend the proposed SpMV algorithm ona GPU to multiple GPUs Experimental results are presentedin Section 5 Section 6 contains our conclusions and points toour future research directions
2 Related Techniques
21 Matrix Storage To take advantage of the large number ofzeros in sparse matrices special storage formats are requiredIn this study the compressed sparse row (CSR) format is onlyconsidered although there aremany varieties of sparsematrixstorage formats such as the ELLPACK (or ITPACK) [21]COO [22] DIA [1] and HYB [3] Using CSR an 119899 times 119899 sparse
matrix 119860 with119873 nonzero elements is stored via three arrays(1) the array 119889119886119905119886 contains all the nonzero entries of 119860 (2)the array 119894119899119889119894119888119890119904 contains column indices of nonzero entriesthat are stored in 119889119886119905119886 and (3) entries of the array 119901119905119903 pointto the first entry of subsequence rows of 119860 in the arrays 119889119886119905119886and 119894119899119889119894119888119890119904
For example the following matrix
119860 =
[[[[[[[[[[[
[
4 1 0 1 0 0
1 4 1 0 1 0
0 1 4 0 0 1
1 0 0 4 1 0
0 1 0 1 4 1
0 0 1 0 1 4
]]]]]]]]]]]
]
(1)
is stored in the CSR format by
119889119886119905119886
[4 1 1 1 4 1 1 1 4 1 1 4 1 1 1 4 1 1 1 4]
119894119899119889119894119888119890119904
[0 1 3 0 1 2 4 1 2 5 0 3 4 1 3 4 5 2 4 5]
119901119905119903 [0 3 7 10 13 17 20]
(2)
22 CUDA Architecture The compute unified device archi-tecture (CUDA) is a heterogenous computing model thatinvolves both the CPU and theGPU [23] Executing a parallelprogram on the GPU using CUDA involves the following(1) transferring required data to the GPU global memory(2) launching the GPU kernel and (3) transferring resultsback to the host memoryThe threads of a kernel are groupedinto a grid of thread blocks The GPU schedules blocks overthe multiprocessors according to their available executioncapacity When a block is given to a multiprocessor it issplit in warps that are composed of 32 threads In the bestcase all 32 threads have the same execution path and theinstruction is executed concurrently If not the executionpaths are executed sequentially which greatly reduces theefficiency The threads in a block communicate via thefast shared memory but the threads in different blockscommunicate through high-latency global memory Majorchallenges in optimizing an application on GPUs are globalmemory access latency different execution paths in eachwarp communication and synchronization between threadsin different blocks and resource utilization
23 Sparse Matrix-Vector Multiplication Assume that119860 is an119899times119899 sparse matrix and 119909 is a vector of size 119899 and a sequentialversion of CSR-based SpMV is described in Algorithm 1Obviously the order in which elements of 119889119886119905119886 119894119899119889119894119888119890119904 119901119905119903and 119909 are accessed has an important impact on the SpMVperformance on GPUs where memory access patterns arecrucial
Mathematical Problems in Engineering 3
Input 119889119886119905119886 119894119899119889119894119888119890119904 119901119905119903 119909 119899Output 119910(01) for 119894 larr 0 to 119899 minus 1 do(02) 119903119900119908 119904119905119886119903119905 larr 119901119905119903[119894](03) 119903119900119908 119890119899119889 larr 119901119905119903[119894 + 1](04) 119904119906119898 larr 0(05) for 119895 larr 119903119900119908 119904119905119886119903119905 to 119903119900119908 119890119899119889 minus 1 do(06) 119904119906119898 += 119889119886119905119886[119895] sdot 119909[119894119899119889119894119888119890119904[119895]](07) done(08) 119910[119894] larr 119904119906119898(09) done
Algorithm 1 Sequential SpMV
3 SpMV on a GPU
In this section we present a perfect implementation of CSR-based SpMV on the GPU Different with other related workthe proposed algorithm involves the following two kernels
(i) Kernel 1 calculate the array V = [V1 V2 V
119873] where
V119894= 119889119886119905119886[119894] sdot 119909[119894119899119889119894119888119890119904[119894]] 119894 = 1 2 119873 and then
save it to global memory(ii) Kernel 2 accumulate element values of V according
to the following formula sum119901119905119903[119895]⩽119894lt119901119905119903[119895+1]
V119894 119895 =
0 1 119899 minus 1 and store them to an array 119910 in globalmemory
We call the proposed SpMV algorithm PCSR For sim-plicity the symbols used in this study are listed in Table 1
31 Kernel 1 For Kernel 1 its detailed procedure is shown inAlgorithm 2 We observe that the accesses to two arrays 119889119886119905119886and 119894119899119889119894119888119890119904 in global memory are fully coalesced Howeverthe vector 119909 in global memory is randomly accessed whichresults in decreasing the performance ofKernel 1 On the basisof evaluations in [24] the best memory space to place datais the texture memory when randomly accessing the arrayTherefore here texture memory is utilized to place the vectorinstead of global memory For the single-precision floatingpoint texture the fourth step in Algorithm 2 is rewritten as
Because the texture does not support double values thefollowing function119891119890119905119888ℎ 119889119900119906119887119897119890() is suggested to transfer theint2 value to the double value
(01) dekice double 119891119890119905119888ℎ 119889119900119906119887119897119890(texture⟨int2 1⟩119905 int 119894)(02) int2 V = tex1Dfetch(119905 119894)(03) return hiloint2double(V sdot 119910 V sdot 119909)(04)
Furthermore for the double-precision floating point texturebased on the function 119891119890119905119888ℎ 119889119900119906119887119897119890() we rewrite the fourthstep in Algorithm 2 as
32 Kernel 2 Kernel 2 accumulates element values of V thatis obtained by Kernel 1 and its detailed procedure is shown inAlgorithm 3This kernel is mainly composed of the followingthree stages
(i) In the first stage the array 119901119905119903 in global memoryis piecewise assembled into shared memory 119901119905119903 119904 ofeach thread block in parallel Each thread for a threadblock is only responsible for loading an elementvalue of 119901119905119903 into 119901119905119903 119904 except for thread 0 (see lines(05)-(06) in Algorithm 3) The detailed procedure isillustrated in Figure 1 We can see that the accesses to119901119905119903 are aligned
(ii) The second stage loads element values of V in globalmemory from the position 119901119905119903 119904[0] to the position
119901119905119903 119904[TB] into shared memory V 119904 for each threadblock The assembling procedure is illustrated inFigure 2 In this case the access to V is fully coa-lesced
(iii) The third stage accumulates element values of V 119904as shown in Figure 3 The accumulation is highlyefficient due to the utilization of two shared memoryarrays 119901119905119903 119904 and V 119904
Obviously Kernel 2 benefits from shared memory Usingthe shared memory not only are the data accessed fast butalso the accesses to data are coalesced
From the above procedures for PCSR we observe thatPCSR needs additional global memory spaces to store amiddle array V besides storing CSR arrays 119889119886119905119886 119894119899119889119894119888119890119904 and119901119905119903 Saving data into V in Kernel 1 and loading data from Vin Kernel 2 to a degree decrease the performance of PCSRHowever PCSR benefits from the middle array V becauseintroducing V makes it access CSR arrays 119889119886119905119886 119894119899119889119894119888119890119904 and119901119905119903 in a fully coalesced manner This greatly improves thespeed of accessing CSR arrays and alleviates the principaldeficiencies of CSR-scalar (rare coalescing) and CSR-vector(partial coalescing)
Symbol Description119860 Sparse matrix119909 Input vector119910 Output vector119899 Size of the input and output vectors119873 Number of nonzero elements in 119860threadsPerBlock (TB) Number of threads per blockblocksPerGrid (BG) Number of blocks per grid
elementsPerThread Number of elements calculated by eachthread
sizeSharedMemory Size of shared memory119872 Number of GPUs
Input 119889119886119905119886 119894119899119889119894119888119890119904 119909119873CUDA-specific variables(i) threadIdx a thread(ii) blockIdx a block(iii) blockDimx number of threads per block(iv) gridDimx number of blocks per grid
Output V(01) 119905119894119889 larr threadIdx + blockIdx sdot blockDimx(02) 119894119888119903 larr blockDimx sdot gridDimx(03) while 119905119894119889 lt 119873(04) V[119905119894119889] larr 119889119886119905119886[119905119894119889] sdot 119909[119894119899119889119894119888119890119904[119905119894119889]](05) 119905119894119889 += 119894119888119903(06) end while
Algorithm 2 Kernel 1
4 SpMV on Multiple GPUs
In this section we will present how to extend PCSR on asingle GPU to multiple GPUs Note that the case of multipleGPUs in a single node (single PC) is only discussed becauseof its good expansibility (eg also used in the multi-CPUand multi-GPU heterogeneous platform) To balance theworkload among multiple GPUs the following two methodscan be applied
(1) For the first method the matrix is equally partitionedinto119872 (number of GPUs) submatrices according tothe matrix rows Each submatrix is assigned to oneGPU and each GPU is only responsible for comput-ing the assigned submatrix multiplication with thecomplete input vector
(2) For the second method the matrix is equally parti-tioned into119872 submatrices according to the numberof nonzero elements Each GPU only calculates asubmatrix multiplication with the complete inputvector
In most cases two partitionedmethods mentioned aboveare similar However for some exceptional cases for examplemost nonzero elements are involved in a few rows for amatrix the partitioned submatrices that are obtained by thefirstmethodhave distinct difference of nonzero elements andthose that are obtained by the second method have differentrows Which method is the preferred one for PCSR
If each GPU has the complete input vector PCSR onmultiple GPUs will not need to communicate between GPUsIn fact SpMV is often applied to a large number of iterativemethods where the sparse matrix is iteratively multipliedby the input and output vectors Therefore if each GPUonly includes a part of the input vector before SpMV thecommunication between GPUs will be required in order toexecute PCSR Here PCSR implements the communicationbetween GPUs using NVIDIA GPUDirect
5 Experimental Results
51 Experimental Setup In this section we test the perfor-mance of PCSR All test matrices come from the Universityof Florida Sparse Matrix Collection [25]Their properties aresummarized in Table 2
All algorithms are executed on one machine which isequipped with an Intel Xeon Quad-Core CPU and fourNVIDIA Tesla C2050 GPUs Our source codes are compiledand executed using the CUDA toolkit 65 under GNULinuxUbuntu v10041 The performance is measured in terms ofGFlops (second) or GBytes (second)
52 Single GPU We compare PCSR with CSR-scalar CSR-vector CSRMV HYBMV and CSR-Adaptive CSR-scalar and
Mathematical Problems in Engineering 5
Input V 119901119905119903CUDA-specific variables(i) threadIdx a thread(ii) blockIdx a block(iii) blockDimx number of threads per block(iv) gridDimx number of blocks per grid
Output 119910(01) define shared memory V 119904 with size 119904119894119911119890119878ℎ119886119903119890119889119872119890119898119900119903119910(02) define shared memory 119901119905119903 119904 with size (119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 + 1)(03) 119892119894119889 larr threadIdxx + blockIdxx times blockDimx(04) 119905119894119889 larr threadIdxx
lowastLoad V into the shared memory V 119904lowast(14) for 119895 larr 0 to 119899119897119890119899119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 minus 1 do(15) if 119894119899119889119890119909 lt 119899119897119890119899 then(16) V 119904[119905119894119889 + 119895 sdot 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896]larr V[119894119899119889119890119909](17) 119894119899119889119890119909 += 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896(18) end(19) done(20) syncthreads()
lowastPerform a scalar-style reductionlowast(21) if (119901119905119903 119904[119905119894119889 + 1] ⩽ 119894 or119901119905119903 119904[119905119894119889] gt 119894 + 119899119897119890119899 minus 1) is false then(22) 119903119900119908 119904 larr max(119901119905119903 119904[119905119894119889] minus119894 0)(23) 119903119900119908 119890 larr min(119901119905119903 119904[119905119894119889 + 1] minus 119894 119899119897119890119899)(24) for 119895 larr 119903119900119908 119904 to 119903119900119908 119890 minus 1 do(25) 119904119906119898 += V 119904[119895](26) done(27) end(28) done(29) 119910[gid] larr 119904119906119898
Algorithm 3 Kernel 2
Block grid
Block 0
Block 1
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middotmiddot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
Block i
Block BG
Threads in the ith block
Shared memory Global memoryThread 0
Thread 0
Thread 0
Thread RT
Thread TB minus 1
Thread TB minus 1
v_s[0]
v_s[TB minus 1]
v_s[j lowast TB + 0]
v_s[m lowast TB + 0]
Note thatRS = ptr_s[TB] minus ptr_s[0]m = [RSTB]RT = RS minus m lowast TBPT = ptr_s[0]
CSR-vector in the CUSP library [5] are chosen in order toshow the effects of accessing CSR arrays in a fully coalescedmanner in PCSR CSRMV in the CUSPARSE library [4]is a representative of CSR-based SpMV algorithms on theGPU HYBMV in the CUSPARSE library [4] is a finely tunedHYB-based SpMV algorithm on the GPU and usually has abetter behavior than many existing SpMV algorithms CSR-Adaptive is a most recently proposed CSR-based algorithm[9]
We select 15 sparse matrices with distinct sizes rangingfrom 25228 to 2063494 as our test matrices Figure 4shows the single-precision and double-precision perfor-mance results in terms of GFlops of CSR-scalar CSR-vector CSRMVHYBMVCSR-Adaptive andPCSRon aTesla
C2050 GFlops values in Figure 4 are calculated on the basisof the assumption of two Flops per nonzero entry for amatrix[3 13] In Figure 5 the measured memory bandwidth resultsfor single precision and double precision are reported
521 Single Precision From Figure 4(a) we observe thatPCSR achieves high performance for all the matrices in thesingle-precision mode In most cases the performance ofover 9GFlopss can be obtained Moreover PCSR outper-forms CSR-scalar CSR-vector and CSRMV for all test casesand average speedups of 424x 218x and 162x comparedto CSR-scalar CSR-vector and CSRMV can be obtainedrespectively Furthermore PCSRhas a slightly better behaviorthan HYBMV for all the matrices except for af shell9 and
Mathematical Problems in Engineering 7
Sing
le-p
reci
sion
perfo
rman
ce (G
Flop
s)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
181614121086420
(a) Single precision
Dou
ble-
prec
ision
perfo
rman
ce (G
Flop
s)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
181614121086420
(b) Double precision
Figure 4 Performance of all algorithms on a Tesla C2050
Sing
le-p
reci
sion
band
wid
th (G
Byte
ss)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
140
120
100
80
60
40
20
0
(a) Single precision
Dou
ble-
prec
ision
band
wid
th (G
Byte
ss)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
140
120
100
80
60
40
20
0
(b) Double precision
Figure 5 Effective bandwidth results of all algorithms on a Tesla C2050
cont-300The average speedup is 122x compared toHYBMVFigure 6 shows the visualization of af shell9 and cont-300We can find that af shell9 and cont-300 have a similarstructure and each row for two matrices has a very similarnumber of nonzero elements which is suitable to be storedin the ELL section of the HYB format Particularly PCSRand CSR-Adaptive have close performance The averageperformance of PCSR is nearly 105 times faster than CSR-Adaptive
Furthermore PCSR almost has the best memory band-width utilization among all algorithms for all the matricesexcept for af shell9 and cont-300 (Figure 5(a)) The max-imum memory bandwidth of PCSR exceeds 128GBytesswhich is about 90 percent of peak theoretical memorybandwidth for the Tesla C2050 Based on the performancemetrics [26] we can conclude that PCSR achieves goodperformance and has high parallelism
522 Double Precision From Figures 4(b) and 5(b) wesee that for all algorithms both the double-precision per-formance and memory bandwidth utilization are smallerthan the corresponding single-precision values due to theslow software-based operation PCSR is still better thanCSR-scalar CSR-vector and CSRMV and slightly outper-forms HYBMV and CSR-Adaptive for all the matricesThe average speedup of PCSR is 333x compared to CSR-scalar 198x compared to CSR-vector 157x compared toCSRMV 115x compared to HYBMV and 103x compared toCSR-Adaptive The maximum memory bandwidth of PCSRexceeds 108GBytess which is about 75 percent of peaktheoretical memory bandwidth for the Tesla C2050
53 Multiple GPUs531 PCSR Performance without Communication Here wetake the double-precisionmode for example to test the PCSR
8 Mathematical Problems in Engineering
(a) cont-300 (b) af shell9
Figure 6 Visualization of the af shell9 and cont-300 matrix
Table 3 Comparison of PCSRI and PCSRII without communication on two GPUs
Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE
performance onmultiple GPUswithout considering commu-nication We call PCSR with the first method and PCSR withthe second method PCSR-I and PCSR-II respectively Somelarge-sized test matrices in Table 2 are used The executiontime comparison of PCSRI and PCSRII on two and four TeslaC2050 GPUs is listed in Tables 3 and 4 respectively In Tables3 and 4 ET SD and PE stand for the execution time standarddeviation and parallel efficiency respectivelyThe time unit ismillisecond (ms) Figures 7 and 8 show the parallel efficiencyof PCSRI and PCSRII on two and four GPUs respectively
On two GPUs, we observe from Table 3 and Figure 7 that PCSR-II has better parallel efficiency than PCSR-I for all the matrices except for G3_circuit. The maximum, average, and minimum parallel efficiency of PCSR-II are 98.06%, 91.64%, and 77.41%, which wholly outperform the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 98.06%, 88.16%, and 72.20%. Moreover, PCSR-II has a smaller standard deviation than PCSR-I for all the matrices except for ecology2, Transport, and G3_circuit. This implies that the workload balance on two GPUs for the second method is better than that for the first method.
On four GPUs, for the parallel efficiency and standard deviation, PCSR-II outperforms PCSR-I for all the matrices except for G3_circuit (Table 4 and Figure 8). The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.35%, 85.14%, and 64.17%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 96.21%, 78.89%, and 59.94%.
[Table 4: Comparison of PCSR-I and PCSR-II without communication on four GPUs. Columns: Matrix; ET on a single GPU; ET, SD, and PE for PCSR-I and for PCSR-II on four GPUs.]
[Figure 7: Parallel efficiency of PCSR-I and PCSR-II without communication on two GPUs.]
Particularly, for Ga41As41H72, F1, cage14, kkt_power, and Freescale1, the parallel efficiency of PCSR-II is almost 1.2 times that obtained by PCSR-I.
On the basis of the above observations, we conclude that PCSR-II has high performance and is on the whole better than PCSR-I. For PCSR on multiple GPUs, the second method is our preferred one.
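To make the two partitioning methods concrete, the following host-side sketch (our illustration; the function and variable names are ours, not from the paper) selects the split rows for the second method so that each of the M submatrices holds roughly N/M nonzeros, using only the CSR row pointer:

#include <algorithm>

// Second partitioning method: split the n rows into M contiguous blocks
// holding roughly equal numbers of nonzeros. ptr is the CSR row pointer
// (length n + 1); on return, rows splits[k] .. splits[k+1]-1 go to GPU k.
void partition_by_nnz(const int *ptr, int n, int M, int *splits) {
    const long long nnz = ptr[n];
    splits[0] = 0;
    for (int k = 1; k < M; ++k) {
        const int target = (int)((nnz * k) / M);
        // First row whose starting offset reaches the k-th quota of nonzeros.
        splits[k] = (int)(std::lower_bound(ptr, ptr + n + 1, target) - ptr);
    }
    splits[M] = n;
}

The first method corresponds to the trivial choice splits[k] = (k * n) / M; the two coincide when the nonzeros are spread evenly across the rows, which is why the methods behave similarly for most matrices.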
5.3.2. PCSR Performance with Communication. We still take the double-precision mode as an example to test the PCSR performance on multiple GPUs when the communication between GPUs is considered. PCSR with the first method and PCSR with the second method are still called PCSR-I and PCSR-II, respectively. The same test matrices as in the above experiment are used. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables 5 and 6, respectively. The time unit is ms, and ET, SD, and PE in Tables 5 and 6 have the same meanings as in Tables 3 and 4.
[Figure 8: Parallel efficiency of PCSR-I and PCSR-II without communication on four GPUs. Y-axis: parallel efficiency (0.0 to 1.0); x-axis matrices: 2cubes_sphere, scircuit, Ga41As41H72, F1, ASIC_680ks, ecology2, Hamrle3, thermal2, cage14, Transport, G3_circuit, kkt_power, CurlCurl_4, memchip, Freescale1.]
Figures 9 and 10 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.
On two GPUs, PCSR-I and PCSR-II have nearly equal parallel efficiency for most matrices (Figure 9 and Table 5). Even so, PCSR-II slightly outperforms PCSR-I. The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.34%, 88.51%, and 80.44%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 96.05%, 86.03%, and 73.57%.
On four GPUs, for the parallel efficiency and standard deviation, PCSR-II is better than PCSR-I for all the matrices, except that PCSR-I has slightly better parallel efficiency for thermal2, G3_circuit, and Hamrle3 and a slightly smaller standard deviation for thermal2, G3_circuit, ecology2, and CurlCurl_4 (Figure 10 and Table 6). The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 90.12%, 74.50%, and 58.27%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 88.77%, 65.69%, and 56.39%.
[Table 5: Comparison of PCSR-I and PCSR-II with communication on two GPUs. Columns: Matrix; ET on a single GPU; ET, SD, and PE for PCSR-I and for PCSR-II on two GPUs.]
Therefore, compared with PCSR-I and PCSR-II without communication, although the performance of PCSR-I and PCSR-II with communication decreases due to the communication overhead, they still achieve significant performance. Because PCSR-II overall outperforms PCSR-I for all test matrices, the second method in this case is still our preferred one for PCSR on multiple GPUs.
6. Conclusion

In this study, we propose a novel CSR-based SpMV algorithm on GPUs, called PCSR. Experimental results show that PCSR on a single GPU outperforms CSR-scalar and CSR-vector in the CUSP library, CSRMV and HYBMV in the CUSPARSE library, and a most recently proposed CSR-based algorithm, CSR-Adaptive. To achieve high performance for PCSR on multiple GPUs, we present two matrix-partitioning methods to balance the workload among the GPUs. We observe that PCSR shows good performance with both partitioning methods, whether or not the communication between GPUs is considered; by comparison, the second method is our preferred one.
[Figure 9: Parallel efficiency of PCSR-I and PCSR-II with communication on two GPUs; axes and matrices as in Figure 8.]
[Figure 10: Parallel efficiency of PCSR-I and PCSR-II with communication on four GPUs; axes and matrices as in Figure 8.]
In future work, we will continue research in this area and develop other novel SpMV algorithms on GPUs. In particular, we plan to apply PCSR to some well-known iterative methods for solving scientific and engineering problems.
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
The research has been supported by the Chinese Natural Science Foundation under Grant no. 61379017.
References
[1] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, Pa, USA, 2nd edition, 2003.
[2] N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," Tech. Rep., NVIDIA, 2008.
[3] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), pp. 14–19, Portland, Ore, USA, November 2009.
[5] N. Bell and M. Garland, "Cusp: generic parallel algorithms for sparse matrix and graph computations, version 0.5.1," 2015, http://cusp-library.googlecode.com.
[6] F. Lu, J. Song, F. Yin, and X. Zhu, "Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters," Computer Physics Communications, vol. 183, no. 6, pp. 1172–1181, 2012.
[7] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Finite-element sparse matrix vector multiplication on graphic processing units," IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982–2985, 2010.
[8] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Enhancing the performance of conjugate gradient solvers on graphic processing units," IEEE Transactions on Magnetics, vol. 47, no. 5, pp. 1162–1165, 2011.
[9] J. L. Greathouse and M. Daga, "Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp. 769–780, New Orleans, La, USA, November 2014.
[10] V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris, "An extended compression format for the optimization of sparse matrix-vector multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930–1940, 2013.
[11] W. T. Tang, W. J. Tan, R. Ray et al., "Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), pp. 1–12, Denver, Colo, USA, November 2013.
[12] M. Verschoor and A. C. Jalba, "Analysis and performance estimation of the conjugate gradient method on multiple GPUs," Parallel Computing, vol. 38, no. 10-11, pp. 552–575, 2012.
[13] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on GPUs," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10), pp. 115–126, ACM, Bangalore, India, January 2010.
[14] G. Oyarzun, R. Borrell, A. Gorobets, and A. Oliva, "MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner," Computers & Fluids, vol. 92, pp. 244–252, 2014.
[15] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "A new approach for sparse matrix vector product on NVIDIA GPUs," Concurrency and Computation: Practice and Experience, vol. 23, no. 8, pp. 815–826, 2011.
[16] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach," Parallel Computing, vol. 38, no. 8, pp. 408–420, 2012.
[17] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in High Performance Embedded Architectures and Compilers: 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25–27, 2010, Proceedings, pp. 111–125, Springer, Berlin, Germany, 2010.
[18] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014.
[19] H.-V. Dang and B. Schmidt, "CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations," Parallel Computing, vol. 39, no. 11, pp. 737–750, 2013.
[20] S. Yan, C. Li, Y. Zhang, and H. Zhou, "YaSpMV: yet another SpMV framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pp. 107–118, February 2014.
[21] D. R. Kincaid and D. M. Young, "A brief review of the ITPACK project," Journal of Computational and Applied Mathematics, vol. 24, no. 1-2, pp. 121–127, 1988.
[22] G. Blelloch, M. Heroux, and M. Zagha, "Segmented operations for sparse matrix computation on vector multiprocessors," Tech. Rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA, 1993.
[23] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40–53, 2008.
[24] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data-parallel architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105–118, 2011.
[25] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011.
Sing
le-p
reci
sion
perfo
rman
ce (G
Flop
s)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
181614121086420
(a) Single precision
Dou
ble-
prec
ision
perfo
rman
ce (G
Flop
s)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
181614121086420
(b) Double precision
Figure 4 Performance of all algorithms on a Tesla C2050
Sing
le-p
reci
sion
band
wid
th (G
Byte
ss)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
140
120
100
80
60
40
20
0
(a) Single precision
Dou
ble-
prec
ision
band
wid
th (G
Byte
ss)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
140
120
100
80
60
40
20
0
(b) Double precision
Figure 5 Effective bandwidth results of all algorithms on a Tesla C2050
cont-300The average speedup is 122x compared toHYBMVFigure 6 shows the visualization of af shell9 and cont-300We can find that af shell9 and cont-300 have a similarstructure and each row for two matrices has a very similarnumber of nonzero elements which is suitable to be storedin the ELL section of the HYB format Particularly PCSRand CSR-Adaptive have close performance The averageperformance of PCSR is nearly 105 times faster than CSR-Adaptive
Furthermore PCSR almost has the best memory band-width utilization among all algorithms for all the matricesexcept for af shell9 and cont-300 (Figure 5(a)) The max-imum memory bandwidth of PCSR exceeds 128GBytesswhich is about 90 percent of peak theoretical memorybandwidth for the Tesla C2050 Based on the performancemetrics [26] we can conclude that PCSR achieves goodperformance and has high parallelism
522 Double Precision From Figures 4(b) and 5(b) wesee that for all algorithms both the double-precision per-formance and memory bandwidth utilization are smallerthan the corresponding single-precision values due to theslow software-based operation PCSR is still better thanCSR-scalar CSR-vector and CSRMV and slightly outper-forms HYBMV and CSR-Adaptive for all the matricesThe average speedup of PCSR is 333x compared to CSR-scalar 198x compared to CSR-vector 157x compared toCSRMV 115x compared to HYBMV and 103x compared toCSR-Adaptive The maximum memory bandwidth of PCSRexceeds 108GBytess which is about 75 percent of peaktheoretical memory bandwidth for the Tesla C2050
53 Multiple GPUs531 PCSR Performance without Communication Here wetake the double-precisionmode for example to test the PCSR
8 Mathematical Problems in Engineering
(a) cont-300 (b) af shell9
Figure 6 Visualization of the af shell9 and cont-300 matrix
Table 3 Comparison of PCSRI and PCSRII without communication on two GPUs
Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE
performance onmultiple GPUswithout considering commu-nication We call PCSR with the first method and PCSR withthe second method PCSR-I and PCSR-II respectively Somelarge-sized test matrices in Table 2 are used The executiontime comparison of PCSRI and PCSRII on two and four TeslaC2050 GPUs is listed in Tables 3 and 4 respectively In Tables3 and 4 ET SD and PE stand for the execution time standarddeviation and parallel efficiency respectivelyThe time unit ismillisecond (ms) Figures 7 and 8 show the parallel efficiencyof PCSRI and PCSRII on two and four GPUs respectively
On two GPUs we observe that PCSRII has betterparallel efficiency than PCSRI for all the matrices exceptfor G3 circuit from Table 3 and Figure 7 The maximumaverage and minimum parallel efficiency of PCSRII are
9806 9164 and 7741 which wholly outperform thecorresponding maximum average and minimum parallelefficiency of PCSRI 9806 8816 and 7220 MoreoverPCSRII has a smaller standard deviation than PCSRI for allthe matrices except for ecology2 Transport and G3 circuitThis implies that the workload balance on two GPUs for thesecondmethod is advantageous over that for the firstmethod
On four GPUs for the parallel efficiency and standarddeviation PCSRII outperforms PCSRI for all the matricesexcept for G3 circuit (Table 4 and Figure 8) The maximumaverage and minimum parallel efficiency of PCSRII forall the matrices are 9635 8514 and 6417 and areadvantageous over the corresponding maximum averageand minimum parallel efficiency of PCSRI 9621 7889
Mathematical Problems in Engineering 9
Table 4 Comparison of PCSRI and PCSRII without communication on four GPUs
Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE
Figure 7 Parallel efficiency of PCSRI and PCSRII without commu-nication on two GPUs
and 5994 Particularly for Ga41As41H72 F1 cage14kkt power and Freescale1 the parallel efficiency of PCSRIIis almost 12 times that obtained by PCSRI
On the basis of the above observations we conclude thatPCSRII has high performance and is on the whole better thanPCSRI For PCSR on multiple GPUs the second method isour preferred one
532 PCSR Performance with Communication We still takethe double-precision mode for example to test the PCSRperformance on multiple GPUs with considering communi-cation PCSRwith the firstmethod and PCSRwith the secondmethod are still called PCSR-I and PCSR-II respectivelyThesame testmatrices as in the above experiment are utilizedTheexecution time comparison of PCSRI and PCSRII on two andfour Tesla C2050GPUs is listed in Tables 5 and 6 respectivelyThe time unit is ms ET SD and PE in Tables 5 and 6 are as
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 8 Parallel efficiency of PCSRI and PCSRII without commu-nication on four GPUs
the same as those in Tables 3 and 4 Figures 9 and 10 showPCSRI and PCSRII parallel efficiency on two and four GPUsrespectively
On two GPUs PCSRI and PCSRII have almost closeparallel efficiency for most matrices (Figure 9 and Table 5)As a comparison PCSRII slightly outperforms PCSRI Themaximum average and minimum parallel efficiency ofPCSRII for all the matrices are 9634 8851 and 8044and are advantageous over the corresponding maximumaverage and minimum parallel efficiency of PCSRI 96058603 and 7357
On four GPUs for the parallel efficiency and standarddeviation PCSRII is better than PCSRI for all the matri-ces except that PCSRI has slightly good parallel efficiencyfor thermal2 G3 circuit and Hamrle3 and slightly smallstandard deviation for thermal2 G3 circuit ecology2 andCurlCurl 4 (Figure 10 and Table 6) The maximum average
10 Mathematical Problems in Engineering
Table 5 Comparison of PCSRI and PCSRII with communication on two GPUs
Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE
and minimum parallel efficiency of PCSRII for all the matri-ces are 9012 7450 and 5827 which are better thanthe correspondingmaximum average andminimumparallelefficiency of PCSRI 8877 6569 and 5639
Therefore compared to PCSRI and PCSRII withoutcommunication although the performance of PCSRI andPCSRII with communication decreases due to the influenceof communication they still achieve significant performanceBecause PCSRII overall outperforms PCSRI for all testmatrices the second method in this case is still our preferredone for PCSR on multiple GPUs
6 Conclusion
In this study we propose a novel CSR-based SpMV onGPUs (PCSR) Experimental results show that our proposedPCSR on a GPU is better than CSR-scalar and CSR-vectorin the CUSP library and CSRMV and HYBMV in theCUSPARSE library and a most recently proposed CSR-basedalgorithm CSR-Adaptive To achieve high performance onmultiple GPUs for PCSR we present two matrix-partitionedmethods to balance the workload among multiple GPUs Weobserve that PCSR can show good performance with and
Mathematical Problems in Engineering 11
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 9 Parallel efficiency of PCSRI and PCSRII with communi-cation on two GPUs
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 10 Parallel efficiency of PCSRI and PCSRII with communi-cation on four GPUs
without considering communication using the two matrix-partitioned methods As a comparison the second method isour preferred one
Next we will further do research in this area and developother novel SpMVs on GPUs In particular the future workwill apply PCSR to some well-known iterative methods andthus solve the scientific and engineering problems
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
The research has been supported by the Chinese NaturalScience Foundation under Grant no 61379017
References
[1] Y Saad Iterative Methods for Sparse Linear Systems SIAMPhiladelphia Pa USA 2nd edition 2003
[2] N Bell and M Garland ldquoEfficient Sparse Matrix-vector Multi-plication on CUDArdquo Tech Rep NVIDIA 2008
[3] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Pro-ceedings of the Conference on High Performance ComputingNetworking Storage and Analysis (SC rsquo09) pp 14ndash19 PortlandOre USA November 2009
[5] N Bell and M Garland ldquoCusp Generic parallel algorithmsfor sparse matrix and graph computations version 051rdquo 2015httpcusp-librarygooglecodecom
[6] F Lu J Song F Yin and X Zhu ldquoPerformance evaluation ofhybrid programming patterns for large CPUGPU heteroge-neous clustersrdquoComputer Physics Communications vol 183 no6 pp 1172ndash1181 2012
[7] M M Dehnavi D M Fernandez and D GiannacopoulosldquoFinite-element sparse matrix vector multiplication on graphicprocessing unitsrdquo IEEE Transactions on Magnetics vol 46 no8 pp 2982ndash2985 2010
[8] M M Dehnavi D M Fernandez and D GiannacopoulosldquoEnhancing the performance of conjugate gradient solvers ongraphic processing unitsrdquo IEEE Transactions on Magnetics vol47 no 5 pp 1162ndash1165 2011
[9] J L Greathouse and M Daga ldquoEfficient sparse matrix-vectormultiplication on GPUs using the CSR storage formatrdquo inProceedings of the International Conference forHigh PerformanceComputing Networking Storage and Analysis (SC rsquo14) pp 769ndash780 New Orleans La USA November 2014
[10] V Karakasis T Gkountouvas K Kourtis G Goumas and NKoziris ldquoAn extended compression format for the optimizationof sparse matrix-vector multiplicationrdquo IEEE Transactions onParallel and Distributed Systems vol 24 no 10 pp 1930ndash19402013
[11] W T Tang W J Tan R Ray et al ldquoAccelerating sparsematrix-vectormultiplication onGPUsusing bit-representation-optimized schemesrdquo in Proceedings of the International Confer-ence for High Performance Computing Networking Storage andAnalysis (SC rsquo13) pp 1ndash12 Denver Colo USA November 2013
[12] M Verschoor and A C Jalba ldquoAnalysis and performanceestimation of the conjugate gradientmethodonmultipleGPUsrdquoParallel Computing vol 38 no 10-11 pp 552ndash575 2012
[13] J W Choi A Singh and R W Vuduc ldquoModel-driven autotun-ing of sparse matrix-vector multiply on GPUsrdquo in Proceedingsof the 15th ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming (PPoPP rsquo10) pp 115ndash126 ACMBangalore India January 2010
[14] G Oyarzun R Borrell A Gorobets and A Oliva ldquoMPI-CUDA sparse matrix-vector multiplication for the conjugategradient method with an approximate inverse preconditionerrdquoComputers amp Fluids vol 92 pp 244ndash252 2014
[15] F Vazquez J J Fernandez and E M Garzon ldquoA new approachfor sparse matrix vector product on NVIDIA GPUsrdquo Concur-rency Computation Practice and Experience vol 23 no 8 pp815ndash826 2011
[16] F Vazquez J J Fernandez and E M Garzon ldquoAutomatictuning of the sparse matrix vector product on GPUs based onthe ELLR-T approachrdquo Parallel Computing vol 38 no 8 pp408ndash420 2012
[17] A Monakov A Lokhmotov and A Avetisyan ldquoAutomaticallytuning sparse matrix-vector multiplication for GPU archi-tecturesrdquo in High Performance Embedded Architectures and
12 Mathematical Problems in Engineering
Compilers 5th International Conference HiPEAC 2010 PisaItaly January 25ndash27 2010 Proceedings pp 111ndash125 SpringerBerlin Germany 2010
[18] M Kreutzer G Hager G Wellein H Fehske and A R BishopldquoA unified sparse matrix data format for efficient general sparsematrix-vector multiplication on modern processors with widesimd unitsrdquo SIAM Journal on Scientific Computing vol 36 no5 pp C401ndashC423 2014
[19] H-V Dang and B Schmidt ldquoCUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operationsrdquo Par-allel Computing Systems amp Applications vol 39 no 11 pp 737ndash750 2013
[20] S Yan C Li Y Zhang and H Zhou ldquoYaSpMV yet anotherSpMV framework on GPUsrdquo in Proceedings of the 19th ACMSIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPP rsquo14) pp 107ndash118 February 2014
[21] D R Kincaid and D M Young ldquoA brief review of the ITPACKprojectrdquo Journal of Computational and Applied Mathematicsvol 24 no 1-2 pp 121ndash127 1988
[22] G Blelloch M Heroux and M Zagha ldquoSegmented operationsfor sparse matrix computation on vector multiprocessorrdquo TechRep School of Computer Science Carnegie Mellon UniversityPittsburgh Pa USA 1993
[23] J Nickolls I Buck M Garland and K Skadron ldquoScalableparallel programming with CUDArdquo ACM Queue vol 6 no 2pp 40ndash53 2008
[24] B Jang D Schaa P Mistry and D Kaeli ldquoExploiting memoryaccess patterns to improve memory performance in data-parallel architecturesrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 22 no 1 pp 105ndash118 2011
[25] T A Davis and Y Hu ldquoThe University of Florida sparse matrixcollectionrdquo ACM Transactions on Mathematical Software vol38 no 1 pp 1ndash25 2011
Symbol Description119860 Sparse matrix119909 Input vector119910 Output vector119899 Size of the input and output vectors119873 Number of nonzero elements in 119860threadsPerBlock (TB) Number of threads per blockblocksPerGrid (BG) Number of blocks per grid
elementsPerThread Number of elements calculated by eachthread
sizeSharedMemory Size of shared memory119872 Number of GPUs
Input 119889119886119905119886 119894119899119889119894119888119890119904 119909119873CUDA-specific variables(i) threadIdx a thread(ii) blockIdx a block(iii) blockDimx number of threads per block(iv) gridDimx number of blocks per grid
Output V(01) 119905119894119889 larr threadIdx + blockIdx sdot blockDimx(02) 119894119888119903 larr blockDimx sdot gridDimx(03) while 119905119894119889 lt 119873(04) V[119905119894119889] larr 119889119886119905119886[119905119894119889] sdot 119909[119894119899119889119894119888119890119904[119905119894119889]](05) 119905119894119889 += 119894119888119903(06) end while
Algorithm 2 Kernel 1
4 SpMV on Multiple GPUs
In this section we will present how to extend PCSR on asingle GPU to multiple GPUs Note that the case of multipleGPUs in a single node (single PC) is only discussed becauseof its good expansibility (eg also used in the multi-CPUand multi-GPU heterogeneous platform) To balance theworkload among multiple GPUs the following two methodscan be applied
(1) For the first method the matrix is equally partitionedinto119872 (number of GPUs) submatrices according tothe matrix rows Each submatrix is assigned to oneGPU and each GPU is only responsible for comput-ing the assigned submatrix multiplication with thecomplete input vector
(2) For the second method the matrix is equally parti-tioned into119872 submatrices according to the numberof nonzero elements Each GPU only calculates asubmatrix multiplication with the complete inputvector
In most cases two partitionedmethods mentioned aboveare similar However for some exceptional cases for examplemost nonzero elements are involved in a few rows for amatrix the partitioned submatrices that are obtained by thefirstmethodhave distinct difference of nonzero elements andthose that are obtained by the second method have differentrows Which method is the preferred one for PCSR
If each GPU has the complete input vector PCSR onmultiple GPUs will not need to communicate between GPUsIn fact SpMV is often applied to a large number of iterativemethods where the sparse matrix is iteratively multipliedby the input and output vectors Therefore if each GPUonly includes a part of the input vector before SpMV thecommunication between GPUs will be required in order toexecute PCSR Here PCSR implements the communicationbetween GPUs using NVIDIA GPUDirect
5 Experimental Results
51 Experimental Setup In this section we test the perfor-mance of PCSR All test matrices come from the Universityof Florida Sparse Matrix Collection [25]Their properties aresummarized in Table 2
All algorithms are executed on one machine which isequipped with an Intel Xeon Quad-Core CPU and fourNVIDIA Tesla C2050 GPUs Our source codes are compiledand executed using the CUDA toolkit 65 under GNULinuxUbuntu v10041 The performance is measured in terms ofGFlops (second) or GBytes (second)
52 Single GPU We compare PCSR with CSR-scalar CSR-vector CSRMV HYBMV and CSR-Adaptive CSR-scalar and
Mathematical Problems in Engineering 5
Input V 119901119905119903CUDA-specific variables(i) threadIdx a thread(ii) blockIdx a block(iii) blockDimx number of threads per block(iv) gridDimx number of blocks per grid
Output 119910(01) define shared memory V 119904 with size 119904119894119911119890119878ℎ119886119903119890119889119872119890119898119900119903119910(02) define shared memory 119901119905119903 119904 with size (119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 + 1)(03) 119892119894119889 larr threadIdxx + blockIdxx times blockDimx(04) 119905119894119889 larr threadIdxx
lowastLoad V into the shared memory V 119904lowast(14) for 119895 larr 0 to 119899119897119890119899119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896 minus 1 do(15) if 119894119899119889119890119909 lt 119899119897119890119899 then(16) V 119904[119905119894119889 + 119895 sdot 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896]larr V[119894119899119889119890119909](17) 119894119899119889119890119909 += 119905ℎ119903119890119886119889119904119875119890119903119861119897119900119888119896(18) end(19) done(20) syncthreads()
lowastPerform a scalar-style reductionlowast(21) if (119901119905119903 119904[119905119894119889 + 1] ⩽ 119894 or119901119905119903 119904[119905119894119889] gt 119894 + 119899119897119890119899 minus 1) is false then(22) 119903119900119908 119904 larr max(119901119905119903 119904[119905119894119889] minus119894 0)(23) 119903119900119908 119890 larr min(119901119905119903 119904[119905119894119889 + 1] minus 119894 119899119897119890119899)(24) for 119895 larr 119903119900119908 119904 to 119903119900119908 119890 minus 1 do(25) 119904119906119898 += V 119904[119895](26) done(27) end(28) done(29) 119910[gid] larr 119904119906119898
Algorithm 3 Kernel 2
Block grid
Block 0
Block 1
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middotmiddot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
Block i
Block BG
Threads in the ith block
Shared memory Global memoryThread 0
Thread 0
Thread 0
Thread RT
Thread TB minus 1
Thread TB minus 1
v_s[0]
v_s[TB minus 1]
v_s[j lowast TB + 0]
v_s[m lowast TB + 0]
Note thatRS = ptr_s[TB] minus ptr_s[0]m = [RSTB]RT = RS minus m lowast TBPT = ptr_s[0]
CSR-vector in the CUSP library [5] are chosen in order toshow the effects of accessing CSR arrays in a fully coalescedmanner in PCSR CSRMV in the CUSPARSE library [4]is a representative of CSR-based SpMV algorithms on theGPU HYBMV in the CUSPARSE library [4] is a finely tunedHYB-based SpMV algorithm on the GPU and usually has abetter behavior than many existing SpMV algorithms CSR-Adaptive is a most recently proposed CSR-based algorithm[9]
We select 15 sparse matrices with distinct sizes rangingfrom 25228 to 2063494 as our test matrices Figure 4shows the single-precision and double-precision perfor-mance results in terms of GFlops of CSR-scalar CSR-vector CSRMVHYBMVCSR-Adaptive andPCSRon aTesla
C2050 GFlops values in Figure 4 are calculated on the basisof the assumption of two Flops per nonzero entry for amatrix[3 13] In Figure 5 the measured memory bandwidth resultsfor single precision and double precision are reported
521 Single Precision From Figure 4(a) we observe thatPCSR achieves high performance for all the matrices in thesingle-precision mode In most cases the performance ofover 9GFlopss can be obtained Moreover PCSR outper-forms CSR-scalar CSR-vector and CSRMV for all test casesand average speedups of 424x 218x and 162x comparedto CSR-scalar CSR-vector and CSRMV can be obtainedrespectively Furthermore PCSRhas a slightly better behaviorthan HYBMV for all the matrices except for af shell9 and
Mathematical Problems in Engineering 7
Sing
le-p
reci
sion
perfo
rman
ce (G
Flop
s)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
181614121086420
(a) Single precision
Dou
ble-
prec
ision
perfo
rman
ce (G
Flop
s)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
181614121086420
(b) Double precision
Figure 4 Performance of all algorithms on a Tesla C2050
Sing
le-p
reci
sion
band
wid
th (G
Byte
ss)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
140
120
100
80
60
40
20
0
(a) Single precision
Dou
ble-
prec
ision
band
wid
th (G
Byte
ss)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
140
120
100
80
60
40
20
0
(b) Double precision
Figure 5 Effective bandwidth results of all algorithms on a Tesla C2050
cont-300The average speedup is 122x compared toHYBMVFigure 6 shows the visualization of af shell9 and cont-300We can find that af shell9 and cont-300 have a similarstructure and each row for two matrices has a very similarnumber of nonzero elements which is suitable to be storedin the ELL section of the HYB format Particularly PCSRand CSR-Adaptive have close performance The averageperformance of PCSR is nearly 105 times faster than CSR-Adaptive
Furthermore PCSR almost has the best memory band-width utilization among all algorithms for all the matricesexcept for af shell9 and cont-300 (Figure 5(a)) The max-imum memory bandwidth of PCSR exceeds 128GBytesswhich is about 90 percent of peak theoretical memorybandwidth for the Tesla C2050 Based on the performancemetrics [26] we can conclude that PCSR achieves goodperformance and has high parallelism
522 Double Precision From Figures 4(b) and 5(b) wesee that for all algorithms both the double-precision per-formance and memory bandwidth utilization are smallerthan the corresponding single-precision values due to theslow software-based operation PCSR is still better thanCSR-scalar CSR-vector and CSRMV and slightly outper-forms HYBMV and CSR-Adaptive for all the matricesThe average speedup of PCSR is 333x compared to CSR-scalar 198x compared to CSR-vector 157x compared toCSRMV 115x compared to HYBMV and 103x compared toCSR-Adaptive The maximum memory bandwidth of PCSRexceeds 108GBytess which is about 75 percent of peaktheoretical memory bandwidth for the Tesla C2050
53 Multiple GPUs531 PCSR Performance without Communication Here wetake the double-precisionmode for example to test the PCSR
8 Mathematical Problems in Engineering
(a) cont-300 (b) af shell9
Figure 6 Visualization of the af shell9 and cont-300 matrix
Table 3 Comparison of PCSRI and PCSRII without communication on two GPUs
Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE
performance onmultiple GPUswithout considering commu-nication We call PCSR with the first method and PCSR withthe second method PCSR-I and PCSR-II respectively Somelarge-sized test matrices in Table 2 are used The executiontime comparison of PCSRI and PCSRII on two and four TeslaC2050 GPUs is listed in Tables 3 and 4 respectively In Tables3 and 4 ET SD and PE stand for the execution time standarddeviation and parallel efficiency respectivelyThe time unit ismillisecond (ms) Figures 7 and 8 show the parallel efficiencyof PCSRI and PCSRII on two and four GPUs respectively
On two GPUs we observe that PCSRII has betterparallel efficiency than PCSRI for all the matrices exceptfor G3 circuit from Table 3 and Figure 7 The maximumaverage and minimum parallel efficiency of PCSRII are
9806 9164 and 7741 which wholly outperform thecorresponding maximum average and minimum parallelefficiency of PCSRI 9806 8816 and 7220 MoreoverPCSRII has a smaller standard deviation than PCSRI for allthe matrices except for ecology2 Transport and G3 circuitThis implies that the workload balance on two GPUs for thesecondmethod is advantageous over that for the firstmethod
On four GPUs for the parallel efficiency and standarddeviation PCSRII outperforms PCSRI for all the matricesexcept for G3 circuit (Table 4 and Figure 8) The maximumaverage and minimum parallel efficiency of PCSRII forall the matrices are 9635 8514 and 6417 and areadvantageous over the corresponding maximum averageand minimum parallel efficiency of PCSRI 9621 7889
Mathematical Problems in Engineering 9
Table 4 Comparison of PCSRI and PCSRII without communication on four GPUs
Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE
Figure 7 Parallel efficiency of PCSRI and PCSRII without commu-nication on two GPUs
and 5994 Particularly for Ga41As41H72 F1 cage14kkt power and Freescale1 the parallel efficiency of PCSRIIis almost 12 times that obtained by PCSRI
On the basis of the above observations we conclude thatPCSRII has high performance and is on the whole better thanPCSRI For PCSR on multiple GPUs the second method isour preferred one
532 PCSR Performance with Communication We still takethe double-precision mode for example to test the PCSRperformance on multiple GPUs with considering communi-cation PCSRwith the firstmethod and PCSRwith the secondmethod are still called PCSR-I and PCSR-II respectivelyThesame testmatrices as in the above experiment are utilizedTheexecution time comparison of PCSRI and PCSRII on two andfour Tesla C2050GPUs is listed in Tables 5 and 6 respectivelyThe time unit is ms ET SD and PE in Tables 5 and 6 are as
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 8 Parallel efficiency of PCSRI and PCSRII without commu-nication on four GPUs
the same as those in Tables 3 and 4 Figures 9 and 10 showPCSRI and PCSRII parallel efficiency on two and four GPUsrespectively
On two GPUs PCSRI and PCSRII have almost closeparallel efficiency for most matrices (Figure 9 and Table 5)As a comparison PCSRII slightly outperforms PCSRI Themaximum average and minimum parallel efficiency ofPCSRII for all the matrices are 9634 8851 and 8044and are advantageous over the corresponding maximumaverage and minimum parallel efficiency of PCSRI 96058603 and 7357
On four GPUs for the parallel efficiency and standarddeviation PCSRII is better than PCSRI for all the matri-ces except that PCSRI has slightly good parallel efficiencyfor thermal2 G3 circuit and Hamrle3 and slightly smallstandard deviation for thermal2 G3 circuit ecology2 andCurlCurl 4 (Figure 10 and Table 6) The maximum average
10 Mathematical Problems in Engineering
Table 5 Comparison of PCSRI and PCSRII with communication on two GPUs
Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE
and minimum parallel efficiency of PCSRII for all the matri-ces are 9012 7450 and 5827 which are better thanthe correspondingmaximum average andminimumparallelefficiency of PCSRI 8877 6569 and 5639
Therefore compared to PCSRI and PCSRII withoutcommunication although the performance of PCSRI andPCSRII with communication decreases due to the influenceof communication they still achieve significant performanceBecause PCSRII overall outperforms PCSRI for all testmatrices the second method in this case is still our preferredone for PCSR on multiple GPUs
6 Conclusion
In this study we propose a novel CSR-based SpMV onGPUs (PCSR) Experimental results show that our proposedPCSR on a GPU is better than CSR-scalar and CSR-vectorin the CUSP library and CSRMV and HYBMV in theCUSPARSE library and a most recently proposed CSR-basedalgorithm CSR-Adaptive To achieve high performance onmultiple GPUs for PCSR we present two matrix-partitionedmethods to balance the workload among multiple GPUs Weobserve that PCSR can show good performance with and
Mathematical Problems in Engineering 11
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 9 Parallel efficiency of PCSRI and PCSRII with communi-cation on two GPUs
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Input: v, ptr
CUDA-specific variables:
(i) threadIdx: thread index
(ii) blockIdx: block index
(iii) blockDim.x: number of threads per block
(iv) gridDim.x: number of blocks per grid
Output: y
(01) define shared memory v_s with size sizeSharedMemory
(02) define shared memory ptr_s with size (threadsPerBlock + 1)
(03) gid ← threadIdx.x + blockIdx.x × blockDim.x
(04) tid ← threadIdx.x
⋮
/* Load v into the shared memory v_s */
(14) for j ← 0 to nlen/threadsPerBlock − 1 do
(15)     if index < nlen then
(16)         v_s[tid + j · threadsPerBlock] ← v[index]
(17)         index += threadsPerBlock
(18)     end
(19) done
(20) __syncthreads()
/* Perform a scalar-style reduction */
(21) if (ptr_s[tid + 1] ≤ i or ptr_s[tid] > i + nlen − 1) is false then
(22)     row_s ← max(ptr_s[tid] − i, 0)
(23)     row_e ← min(ptr_s[tid + 1] − i, nlen)
(24)     for j ← row_s to row_e − 1 do
(25)         sum += v_s[j]
(26)     done
(27) end
(28) done
(29) y[gid] ← sum

Algorithm 3: Kernel 2.
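Lines (05)–(13) of Algorithm 3 did not survive the transcript, so the following CUDA sketch fills in that setup (loading ptr_s and initializing sum, i, and index) with a plausible reading; pcsrKernel2, TB, SIZE_SHARED, and the chunking over v_s are our assumptions, not the authors' exact code. It assumes v holds the per-nonzero products written by Kernel 1 and that the row count is a multiple of TB.

#include <cuda_runtime.h>

#define TB          256   /* threadsPerBlock (assumed value)  */
#define SIZE_SHARED 1024  /* sizeSharedMemory (assumed value) */

/* One thread per row: block b reduces rows [b*TB, (b+1)*TB). */
__global__ void pcsrKernel2(const double *v, const int *ptr, double *y)
{
    __shared__ double v_s[SIZE_SHARED];
    __shared__ int    ptr_s[TB + 1];

    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;

    /* Each thread loads one row pointer; thread 0 loads the extra one. */
    ptr_s[tid] = ptr[blockIdx.x * TB + tid];
    if (tid == 0) ptr_s[TB] = ptr[(blockIdx.x + 1) * TB];
    __syncthreads();

    int    PT  = ptr_s[0];        /* PT = ptr_s[0]  (see figure note) */
    int    RS  = ptr_s[TB] - PT;  /* RS = ptr_s[TB] - ptr_s[0]        */
    double sum = 0.0;

    /* Walk the block's RS products in chunks that fit into v_s. */
    for (int i = 0; i < RS; i += SIZE_SHARED) {
        int nlen = min(SIZE_SHARED, RS - i);

        /* (14)-(19): fully coalesced load of the chunk into v_s. */
        for (int idx = tid; idx < nlen; idx += TB)
            v_s[idx] = v[PT + i + idx];
        __syncthreads();          /* (20) */

        /* (21)-(27): each thread sums the slice of its own row that
         * falls inside the current chunk (scalar-style reduction). */
        int rowBeg = ptr_s[tid]     - PT;  /* block-relative row start */
        int rowEnd = ptr_s[tid + 1] - PT;  /* block-relative row end   */
        if (rowEnd > i && rowBeg < i + nlen) {
            int row_s = max(rowBeg - i, 0);
            int row_e = min(rowEnd - i, nlen);
            for (int j = row_s; j < row_e; ++j)
                sum += v_s[j];
        }
        __syncthreads();          /* before v_s is reused */
    }
    y[gid] = sum;                 /* (29) */
}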
[Figure: access pattern of Kernel 2. Threads 0 to TB − 1 of the ith block in the block grid (Block 0, Block 1, ..., Block BG) load v from global memory into shared memory v_s, with thread tid writing v_s[j × TB + tid]. Note that RS = ptr_s[TB] − ptr_s[0], m = ⌊RS/TB⌋, RT = RS − m × TB, and PT = ptr_s[0].]
CSR-scalar and CSR-vector in the CUSP library [5] are chosen in order to show the effect of accessing the CSR arrays in a fully coalesced manner in PCSR. CSRMV in the CUSPARSE library [4] is a representative of CSR-based SpMV algorithms on the GPU. HYBMV in the CUSPARSE library [4] is a finely tuned HYB-based SpMV algorithm on the GPU and usually behaves better than many existing SpMV algorithms. CSR-Adaptive is a most recently proposed CSR-based algorithm [9].
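For context, the CSRMV baseline corresponds to the legacy CUSPARSE interface of that CUDA generation; a minimal sketch of the call (the device arrays d_val, d_rowPtr, d_colInd, d_x, and d_y are assumed to be already populated) is:

#include <cusparse_v2.h>

/* Sketch of the CSRMV baseline: y = alpha*A*x + beta*y for an
 * m-by-n CSR matrix with nnz nonzeros (legacy CUSPARSE API). */
void csrmvBaseline(int m, int n, int nnz,
                   const double *d_val, const int *d_rowPtr,
                   const int *d_colInd, const double *d_x, double *d_y)
{
    cusparseHandle_t   handle;
    cusparseMatDescr_t descr;
    const double alpha = 1.0, beta = 0.0;

    cusparseCreate(&handle);
    cusparseCreateMatDescr(&descr);   /* general matrix, 0-based indexing */
    cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   d_val, d_rowPtr, d_colInd, d_x, &beta, d_y);

    cusparseDestroyMatDescr(descr);
    cusparseDestroy(handle);
}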
We select 15 sparse matrices with distinct sizes ranging from 25,228 to 2,063,494 as our test matrices. Figure 4 shows the single-precision and double-precision performance results, in terms of GFlops, of CSR-scalar, CSR-vector, CSRMV, HYBMV, CSR-Adaptive, and PCSR on a Tesla C2050. GFlops values in Figure 4 are calculated on the assumption of two Flops per nonzero entry of a matrix [3, 13]. Figure 5 reports the measured memory bandwidth results for single precision and double precision.
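In other words, the metric plotted in Figure 4 follows the usual convention, with nnz the number of nonzeros of the matrix and t the measured kernel time in seconds:

$$\mathrm{GFlops} = \frac{2 \cdot nnz}{t \times 10^{9}}.$$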
5.2.1. Single Precision. From Figure 4(a), we observe that PCSR achieves high performance for all the matrices in the single-precision mode; in most cases it delivers over 9 GFlops. Moreover, PCSR outperforms CSR-scalar, CSR-vector, and CSRMV for all test cases, with average speedups of 4.24x, 2.18x, and 1.62x over CSR-scalar, CSR-vector, and CSRMV, respectively. Furthermore, PCSR behaves slightly better than HYBMV for all the matrices except af_shell9 and cont-300.
[Figure 4: Performance of all algorithms on a Tesla C2050: (a) single-precision performance (GFlops); (b) double-precision performance (GFlops). Matrices on the x-axis: epb2, ecl32, bayer01, g7jac200sc, finan512, torso2, FEM_3D_thermal2, cont-300, rajat24, language, af_shell9, ASIC_680ks, Hamrle3, thermal2, kkt_power. Series: CSR-scalar, CSR-vector, CSRMV, HYBMV, CSR-Adaptive, PCSR.]

[Figure 5: Effective bandwidth results of all algorithms on a Tesla C2050: (a) single precision (GBytes/s); (b) double precision (GBytes/s); same matrices and series as in Figure 4.]
The average speedup over HYBMV is 1.22x. Figure 6 shows the visualization of af_shell9 and cont-300. We can see that af_shell9 and cont-300 have a similar structure and that each row of the two matrices has a very similar number of nonzero elements, which makes them well suited to the ELL section of the HYB format. PCSR and CSR-Adaptive have close performance; the average performance of PCSR is nearly 1.05 times that of CSR-Adaptive.
Furthermore, PCSR has almost the best memory bandwidth utilization among all algorithms for all the matrices except af_shell9 and cont-300 (Figure 5(a)). The maximum memory bandwidth of PCSR exceeds 128 GBytes/s, which is about 90 percent of the theoretical peak memory bandwidth of the Tesla C2050. Based on the performance metrics [26], we can conclude that PCSR achieves good performance and has high parallelism.
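For reference, the effective bandwidth in Figure 5 is bytes moved per unit time; a hedged estimate for CSR SpMV on an $n \times n$ matrix with 4-byte indices and value size $s$ bytes (4 in single precision, 8 in double) counts the value and column-index arrays, the row pointer, one read of $x$, and one write of $y$:

$$\mathrm{BW} \approx \frac{nnz\,(s + 4) + 4\,(n + 1) + 2\,s\,n}{t \times 10^{9}}\ \mathrm{GBytes/s}.$$

This is our reading of the metric, not a formula stated by the authors.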
5.2.2. Double Precision. From Figures 4(b) and 5(b), we see that for all algorithms both the double-precision performance and the memory bandwidth utilization are lower than the corresponding single-precision values because of the slower double-precision operations. PCSR is still better than CSR-scalar, CSR-vector, and CSRMV and slightly outperforms HYBMV and CSR-Adaptive for all the matrices. The average speedup of PCSR is 3.33x compared to CSR-scalar, 1.98x compared to CSR-vector, 1.57x compared to CSRMV, 1.15x compared to HYBMV, and 1.03x compared to CSR-Adaptive. The maximum memory bandwidth of PCSR exceeds 108 GBytes/s, which is about 75 percent of the theoretical peak memory bandwidth of the Tesla C2050.
5.3. Multiple GPUs

5.3.1. PCSR Performance without Communication. Here we take the double-precision mode as an example to test the PCSR performance on multiple GPUs without considering communication. We call PCSR with the first partitioning method PCSR-I and PCSR with the second method PCSR-II. Some large-sized test matrices in Table 2 are used. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables 3 and 4, respectively; in these tables, ET, SD, and PE stand for the execution time, standard deviation, and parallel efficiency, respectively, and the time unit is millisecond (ms). Figures 7 and 8 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.

[Figure 6: Visualization of the cont-300 (a) and af_shell9 (b) matrices.]

[Table 3: Comparison of PCSR-I and PCSR-II without communication on two GPUs. Columns: Matrix; ET on one GPU; ET, SD, and PE for PCSR-I (2 GPUs); ET, SD, and PE for PCSR-II (2 GPUs).]
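The definitions are not restated in the transcript, so, for reference, we take the standard ones: with $T_1$ the execution time on one GPU, $T_p$ the time on $p$ GPUs, and $t_k$ the time spent by GPU $k$ with mean $\bar{t}$,

$$\mathrm{PE} = \frac{T_1}{p \cdot T_p}, \qquad \mathrm{SD} = \sqrt{\frac{1}{p} \sum_{k=1}^{p} \left(t_k - \bar{t}\right)^2},$$

so that PE near 100% indicates near-linear scaling and a small SD indicates a well-balanced workload.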
On two GPUs, we observe from Table 3 and Figure 7 that PCSR-II has better parallel efficiency than PCSR-I for all the matrices except G3_circuit. The maximum, average, and minimum parallel efficiency of PCSR-II are 98.06%, 91.64%, and 77.41%, which wholly outperform the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 98.06%, 88.16%, and 72.20%. Moreover, PCSR-II has a smaller standard deviation than PCSR-I for all the matrices except ecology2, Transport, and G3_circuit. This implies that the second method balances the workload on two GPUs better than the first method does.
On four GPUs, in terms of parallel efficiency and standard deviation, PCSR-II outperforms PCSR-I for all the matrices except G3_circuit (Table 4 and Figure 8). The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.35%, 85.14%, and 64.17%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 96.21%, 78.89%, and 59.94%. In particular, for Ga41As41H72, F1, cage14, kkt_power, and Freescale1, the parallel efficiency of PCSR-II is almost 1.2 times that obtained by PCSR-I.

[Table 4: Comparison of PCSR-I and PCSR-II without communication on four GPUs. Columns: Matrix; ET on one GPU; ET, SD, and PE for PCSR-I (4 GPUs); ET, SD, and PE for PCSR-II (4 GPUs).]

[Figure 7: Parallel efficiency of PCSR-I and PCSR-II without communication on two GPUs.]
On the basis of the above observations, we conclude that PCSR-II delivers high performance and is, on the whole, better than PCSR-I. For PCSR on multiple GPUs, the second method is our preferred one.
5.3.2. PCSR Performance with Communication. We still take the double-precision mode as an example, now testing the PCSR performance on multiple GPUs with communication taken into account. PCSR with the first method and PCSR with the second method are still called PCSR-I and PCSR-II, respectively, and the same test matrices as in the above experiment are used. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables 5 and 6, respectively; the time unit is ms, and ET, SD, and PE in Tables 5 and 6 are the same as those in Tables 3 and 4. Figures 9 and 10 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.

[Figure 8: Parallel efficiency of PCSR-I and PCSR-II without communication on four GPUs. Matrices on the x-axis: 2cubes_sphere, scircuit, Ga41As41H72, F1, ASIC_680ks, ecology2, Hamrle3, thermal2, cage14, Transport, G3_circuit, kkt_power, CurlCurl_4, memchip, Freescale1.]
On two GPUs, PCSR-I and PCSR-II have almost the same parallel efficiency for most matrices (Figure 9 and Table 5); by comparison, PCSR-II slightly outperforms PCSR-I. The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.34%, 88.51%, and 80.44%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 96.05%, 86.03%, and 73.57%.
On four GPUs, in terms of parallel efficiency and standard deviation, PCSR-II is better than PCSR-I for all the matrices, except that PCSR-I has slightly better parallel efficiency for thermal2, G3_circuit, and Hamrle3 and a slightly smaller standard deviation for thermal2, G3_circuit, ecology2, and CurlCurl_4 (Figure 10 and Table 6). The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 90.12%, 74.50%, and 58.27%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I: 88.77%, 65.69%, and 56.39%.

[Table 5: Comparison of PCSR-I and PCSR-II with communication on two GPUs. Columns: Matrix; ET on one GPU; ET, SD, and PE for PCSR-I (2 GPUs); ET, SD, and PE for PCSR-II (2 GPUs).]
Therefore, compared to PCSR-I and PCSR-II without communication, although the performance of PCSR-I and PCSR-II with communication decreases because of the communication cost, both still achieve good performance. Because PCSR-II overall outperforms PCSR-I for all test matrices, the second method in this case is still our preferred one for PCSR on multiple GPUs.
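The host-side driver used for these measurements is not listed in the paper; the sketch below shows one plausible way (our construction, not the authors' code) to time a row-partitioned multi-GPU SpMV including the inter-GPU exchange of result segments, using peer-to-peer copies on P2P-capable devices. The kernel dummySpmv merely stands in for the two PCSR kernels.

#include <cuda_runtime.h>
#include <stdio.h>

#define MAX_GPUS 8

/* Stand-in for the PCSR kernels: writes this GPU's y-segment. */
__global__ void dummySpmv(double *ySeg, int rows)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows) ySeg[i] = 1.0;    /* fake per-row result */
}

int main(void)
{
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);
    if (nGpus < 1) return 1;
    if (nGpus > MAX_GPUS) nGpus = MAX_GPUS;

    const int n = 1 << 20;          /* number of rows (assumed)      */
    int off[MAX_GPUS + 1];          /* row offsets of the partitions */
    for (int g = 0; g <= nGpus; ++g)
        off[g] = (int)((long long)n * g / nGpus);

    double *y[MAX_GPUS];            /* each GPU keeps a full copy of y */
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&y[g], n * sizeof(double));
        for (int p = 0; p < nGpus; ++p)      /* enable P2P copies */
            if (p != g) cudaDeviceEnablePeerAccess(p, 0);
    }

    cudaSetDevice(0);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);            /* rough wall-clock timing */

    /* Phase 1: every GPU computes its own y-segment. */
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        int rows = off[g + 1] - off[g];
        dummySpmv<<<(rows + 255) / 256, 256>>>(y[g] + off[g], rows);
    }

    /* Phase 2: communication -- broadcast each segment to the peers. */
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        size_t bytes = (size_t)(off[g + 1] - off[g]) * sizeof(double);
        for (int p = 0; p < nGpus; ++p)
            if (p != g)
                cudaMemcpyPeer(y[p] + off[g], p, y[g] + off[g], g, bytes);
    }
    for (int g = 0; g < nGpus; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }

    cudaSetDevice(0);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("SpMV + exchange on %d GPUs: %.3f ms\n", nGpus, ms);
    return 0;
}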
[Figure 9: Parallel efficiency of PCSR-I and PCSR-II with communication on two GPUs.]

[Figure 10: Parallel efficiency of PCSR-I and PCSR-II with communication on four GPUs.]

6. Conclusion

In this study, we propose a novel CSR-based SpMV on GPUs called PCSR. Experimental results show that PCSR on a single GPU is better than CSR-scalar and CSR-vector in the CUSP library, CSRMV and HYBMV in the CUSPARSE library, and the most recently proposed CSR-based algorithm CSR-Adaptive. To achieve high performance on multiple GPUs, we present two matrix-partitioning methods to balance the workload among the GPUs. We observe that PCSR shows good performance both with and without considering communication under the two matrix-partitioning methods; of the two, the second method is our preferred one.
Next, we will continue our research in this area and develop other novel SpMVs on GPUs. In particular, future work will apply PCSR to some well-known iterative methods for solving scientific and engineering problems.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
The research has been supported by the Chinese Natural Science Foundation under Grant no. 61379017.
References

[1] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, Pa, USA, 2nd edition, 2003.
[2] N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," Tech. Rep., NVIDIA, 2008.
[3] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), pp. 14–19, Portland, Ore, USA, November 2009.
[5] N. Bell and M. Garland, "Cusp: generic parallel algorithms for sparse matrix and graph computations, version 0.5.1," 2015, http://cusp-library.googlecode.com.
[6] F. Lu, J. Song, F. Yin, and X. Zhu, "Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters," Computer Physics Communications, vol. 183, no. 6, pp. 1172–1181, 2012.
[7] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Finite-element sparse matrix vector multiplication on graphic processing units," IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982–2985, 2010.
[8] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Enhancing the performance of conjugate gradient solvers on graphic processing units," IEEE Transactions on Magnetics, vol. 47, no. 5, pp. 1162–1165, 2011.
[9] J. L. Greathouse and M. Daga, "Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp. 769–780, New Orleans, La, USA, November 2014.
[10] V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris, "An extended compression format for the optimization of sparse matrix-vector multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930–1940, 2013.
[11] W. T. Tang, W. J. Tan, R. Ray et al., "Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), pp. 1–12, Denver, Colo, USA, November 2013.
[12] M. Verschoor and A. C. Jalba, "Analysis and performance estimation of the conjugate gradient method on multiple GPUs," Parallel Computing, vol. 38, no. 10-11, pp. 552–575, 2012.
[13] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on GPUs," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10), pp. 115–126, ACM, Bangalore, India, January 2010.
[14] G. Oyarzun, R. Borrell, A. Gorobets, and A. Oliva, "MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner," Computers & Fluids, vol. 92, pp. 244–252, 2014.
[15] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "A new approach for sparse matrix vector product on NVIDIA GPUs," Concurrency and Computation: Practice and Experience, vol. 23, no. 8, pp. 815–826, 2011.
[16] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach," Parallel Computing, vol. 38, no. 8, pp. 408–420, 2012.
[17] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in High Performance Embedded Architectures and Compilers: 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25–27, 2010, Proceedings, pp. 111–125, Springer, Berlin, Germany, 2010.
[18] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014.
[19] H.-V. Dang and B. Schmidt, "CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations," Parallel Computing, vol. 39, no. 11, pp. 737–750, 2013.
[20] S. Yan, C. Li, Y. Zhang, and H. Zhou, "yaSpMV: yet another SpMV framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pp. 107–118, February 2014.
[21] D. R. Kincaid and D. M. Young, "A brief review of the ITPACK project," Journal of Computational and Applied Mathematics, vol. 24, no. 1-2, pp. 121–127, 1988.
[22] G. Blelloch, M. Heroux, and M. Zagha, "Segmented operations for sparse matrix computation on vector multiprocessors," Tech. Rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA, 1993.
[23] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40–53, 2008.
[24] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data-parallel architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105–118, 2011.
[25] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011.
CSR-vector in the CUSP library [5] are chosen in order toshow the effects of accessing CSR arrays in a fully coalescedmanner in PCSR CSRMV in the CUSPARSE library [4]is a representative of CSR-based SpMV algorithms on theGPU HYBMV in the CUSPARSE library [4] is a finely tunedHYB-based SpMV algorithm on the GPU and usually has abetter behavior than many existing SpMV algorithms CSR-Adaptive is a most recently proposed CSR-based algorithm[9]
We select 15 sparse matrices with distinct sizes rangingfrom 25228 to 2063494 as our test matrices Figure 4shows the single-precision and double-precision perfor-mance results in terms of GFlops of CSR-scalar CSR-vector CSRMVHYBMVCSR-Adaptive andPCSRon aTesla
C2050 GFlops values in Figure 4 are calculated on the basisof the assumption of two Flops per nonzero entry for amatrix[3 13] In Figure 5 the measured memory bandwidth resultsfor single precision and double precision are reported
521 Single Precision From Figure 4(a) we observe thatPCSR achieves high performance for all the matrices in thesingle-precision mode In most cases the performance ofover 9GFlopss can be obtained Moreover PCSR outper-forms CSR-scalar CSR-vector and CSRMV for all test casesand average speedups of 424x 218x and 162x comparedto CSR-scalar CSR-vector and CSRMV can be obtainedrespectively Furthermore PCSRhas a slightly better behaviorthan HYBMV for all the matrices except for af shell9 and
Mathematical Problems in Engineering 7
Sing
le-p
reci
sion
perfo
rman
ce (G
Flop
s)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
181614121086420
(a) Single precision
Dou
ble-
prec
ision
perfo
rman
ce (G
Flop
s)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
181614121086420
(b) Double precision
Figure 4 Performance of all algorithms on a Tesla C2050
Sing
le-p
reci
sion
band
wid
th (G
Byte
ss)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
140
120
100
80
60
40
20
0
(a) Single precision
Dou
ble-
prec
ision
band
wid
th (G
Byte
ss)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
140
120
100
80
60
40
20
0
(b) Double precision
Figure 5 Effective bandwidth results of all algorithms on a Tesla C2050
cont-300The average speedup is 122x compared toHYBMVFigure 6 shows the visualization of af shell9 and cont-300We can find that af shell9 and cont-300 have a similarstructure and each row for two matrices has a very similarnumber of nonzero elements which is suitable to be storedin the ELL section of the HYB format Particularly PCSRand CSR-Adaptive have close performance The averageperformance of PCSR is nearly 105 times faster than CSR-Adaptive
Furthermore PCSR almost has the best memory band-width utilization among all algorithms for all the matricesexcept for af shell9 and cont-300 (Figure 5(a)) The max-imum memory bandwidth of PCSR exceeds 128GBytesswhich is about 90 percent of peak theoretical memorybandwidth for the Tesla C2050 Based on the performancemetrics [26] we can conclude that PCSR achieves goodperformance and has high parallelism
522 Double Precision From Figures 4(b) and 5(b) wesee that for all algorithms both the double-precision per-formance and memory bandwidth utilization are smallerthan the corresponding single-precision values due to theslow software-based operation PCSR is still better thanCSR-scalar CSR-vector and CSRMV and slightly outper-forms HYBMV and CSR-Adaptive for all the matricesThe average speedup of PCSR is 333x compared to CSR-scalar 198x compared to CSR-vector 157x compared toCSRMV 115x compared to HYBMV and 103x compared toCSR-Adaptive The maximum memory bandwidth of PCSRexceeds 108GBytess which is about 75 percent of peaktheoretical memory bandwidth for the Tesla C2050
53 Multiple GPUs531 PCSR Performance without Communication Here wetake the double-precisionmode for example to test the PCSR
8 Mathematical Problems in Engineering
(a) cont-300 (b) af shell9
Figure 6 Visualization of the af shell9 and cont-300 matrix
Table 3 Comparison of PCSRI and PCSRII without communication on two GPUs
Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE
performance onmultiple GPUswithout considering commu-nication We call PCSR with the first method and PCSR withthe second method PCSR-I and PCSR-II respectively Somelarge-sized test matrices in Table 2 are used The executiontime comparison of PCSRI and PCSRII on two and four TeslaC2050 GPUs is listed in Tables 3 and 4 respectively In Tables3 and 4 ET SD and PE stand for the execution time standarddeviation and parallel efficiency respectivelyThe time unit ismillisecond (ms) Figures 7 and 8 show the parallel efficiencyof PCSRI and PCSRII on two and four GPUs respectively
On two GPUs we observe that PCSRII has betterparallel efficiency than PCSRI for all the matrices exceptfor G3 circuit from Table 3 and Figure 7 The maximumaverage and minimum parallel efficiency of PCSRII are
9806 9164 and 7741 which wholly outperform thecorresponding maximum average and minimum parallelefficiency of PCSRI 9806 8816 and 7220 MoreoverPCSRII has a smaller standard deviation than PCSRI for allthe matrices except for ecology2 Transport and G3 circuitThis implies that the workload balance on two GPUs for thesecondmethod is advantageous over that for the firstmethod
On four GPUs for the parallel efficiency and standarddeviation PCSRII outperforms PCSRI for all the matricesexcept for G3 circuit (Table 4 and Figure 8) The maximumaverage and minimum parallel efficiency of PCSRII forall the matrices are 9635 8514 and 6417 and areadvantageous over the corresponding maximum averageand minimum parallel efficiency of PCSRI 9621 7889
Mathematical Problems in Engineering 9
Table 4 Comparison of PCSRI and PCSRII without communication on four GPUs
Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE
Figure 7 Parallel efficiency of PCSRI and PCSRII without commu-nication on two GPUs
and 5994 Particularly for Ga41As41H72 F1 cage14kkt power and Freescale1 the parallel efficiency of PCSRIIis almost 12 times that obtained by PCSRI
On the basis of the above observations we conclude thatPCSRII has high performance and is on the whole better thanPCSRI For PCSR on multiple GPUs the second method isour preferred one
532 PCSR Performance with Communication We still takethe double-precision mode for example to test the PCSRperformance on multiple GPUs with considering communi-cation PCSRwith the firstmethod and PCSRwith the secondmethod are still called PCSR-I and PCSR-II respectivelyThesame testmatrices as in the above experiment are utilizedTheexecution time comparison of PCSRI and PCSRII on two andfour Tesla C2050GPUs is listed in Tables 5 and 6 respectivelyThe time unit is ms ET SD and PE in Tables 5 and 6 are as
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 8 Parallel efficiency of PCSRI and PCSRII without commu-nication on four GPUs
the same as those in Tables 3 and 4 Figures 9 and 10 showPCSRI and PCSRII parallel efficiency on two and four GPUsrespectively
On two GPUs PCSRI and PCSRII have almost closeparallel efficiency for most matrices (Figure 9 and Table 5)As a comparison PCSRII slightly outperforms PCSRI Themaximum average and minimum parallel efficiency ofPCSRII for all the matrices are 9634 8851 and 8044and are advantageous over the corresponding maximumaverage and minimum parallel efficiency of PCSRI 96058603 and 7357
On four GPUs for the parallel efficiency and standarddeviation PCSRII is better than PCSRI for all the matri-ces except that PCSRI has slightly good parallel efficiencyfor thermal2 G3 circuit and Hamrle3 and slightly smallstandard deviation for thermal2 G3 circuit ecology2 andCurlCurl 4 (Figure 10 and Table 6) The maximum average
10 Mathematical Problems in Engineering
Table 5 Comparison of PCSRI and PCSRII with communication on two GPUs
Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE
and minimum parallel efficiency of PCSRII for all the matri-ces are 9012 7450 and 5827 which are better thanthe correspondingmaximum average andminimumparallelefficiency of PCSRI 8877 6569 and 5639
Therefore compared to PCSRI and PCSRII withoutcommunication although the performance of PCSRI andPCSRII with communication decreases due to the influenceof communication they still achieve significant performanceBecause PCSRII overall outperforms PCSRI for all testmatrices the second method in this case is still our preferredone for PCSR on multiple GPUs
6 Conclusion
In this study we propose a novel CSR-based SpMV onGPUs (PCSR) Experimental results show that our proposedPCSR on a GPU is better than CSR-scalar and CSR-vectorin the CUSP library and CSRMV and HYBMV in theCUSPARSE library and a most recently proposed CSR-basedalgorithm CSR-Adaptive To achieve high performance onmultiple GPUs for PCSR we present two matrix-partitionedmethods to balance the workload among multiple GPUs Weobserve that PCSR can show good performance with and
Mathematical Problems in Engineering 11
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 9 Parallel efficiency of PCSRI and PCSRII with communi-cation on two GPUs
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 10 Parallel efficiency of PCSRI and PCSRII with communi-cation on four GPUs
without considering communication using the two matrix-partitioned methods As a comparison the second method isour preferred one
Next we will further do research in this area and developother novel SpMVs on GPUs In particular the future workwill apply PCSR to some well-known iterative methods andthus solve the scientific and engineering problems
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
The research has been supported by the Chinese NaturalScience Foundation under Grant no 61379017
References
[1] Y Saad Iterative Methods for Sparse Linear Systems SIAMPhiladelphia Pa USA 2nd edition 2003
[2] N Bell and M Garland ldquoEfficient Sparse Matrix-vector Multi-plication on CUDArdquo Tech Rep NVIDIA 2008
[3] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Pro-ceedings of the Conference on High Performance ComputingNetworking Storage and Analysis (SC rsquo09) pp 14ndash19 PortlandOre USA November 2009
[5] N Bell and M Garland ldquoCusp Generic parallel algorithmsfor sparse matrix and graph computations version 051rdquo 2015httpcusp-librarygooglecodecom
[6] F Lu J Song F Yin and X Zhu ldquoPerformance evaluation ofhybrid programming patterns for large CPUGPU heteroge-neous clustersrdquoComputer Physics Communications vol 183 no6 pp 1172ndash1181 2012
[7] M M Dehnavi D M Fernandez and D GiannacopoulosldquoFinite-element sparse matrix vector multiplication on graphicprocessing unitsrdquo IEEE Transactions on Magnetics vol 46 no8 pp 2982ndash2985 2010
[8] M M Dehnavi D M Fernandez and D GiannacopoulosldquoEnhancing the performance of conjugate gradient solvers ongraphic processing unitsrdquo IEEE Transactions on Magnetics vol47 no 5 pp 1162ndash1165 2011
[9] J L Greathouse and M Daga ldquoEfficient sparse matrix-vectormultiplication on GPUs using the CSR storage formatrdquo inProceedings of the International Conference forHigh PerformanceComputing Networking Storage and Analysis (SC rsquo14) pp 769ndash780 New Orleans La USA November 2014
[10] V Karakasis T Gkountouvas K Kourtis G Goumas and NKoziris ldquoAn extended compression format for the optimizationof sparse matrix-vector multiplicationrdquo IEEE Transactions onParallel and Distributed Systems vol 24 no 10 pp 1930ndash19402013
[11] W T Tang W J Tan R Ray et al ldquoAccelerating sparsematrix-vectormultiplication onGPUsusing bit-representation-optimized schemesrdquo in Proceedings of the International Confer-ence for High Performance Computing Networking Storage andAnalysis (SC rsquo13) pp 1ndash12 Denver Colo USA November 2013
[12] M Verschoor and A C Jalba ldquoAnalysis and performanceestimation of the conjugate gradientmethodonmultipleGPUsrdquoParallel Computing vol 38 no 10-11 pp 552ndash575 2012
[13] J W Choi A Singh and R W Vuduc ldquoModel-driven autotun-ing of sparse matrix-vector multiply on GPUsrdquo in Proceedingsof the 15th ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming (PPoPP rsquo10) pp 115ndash126 ACMBangalore India January 2010
[14] G Oyarzun R Borrell A Gorobets and A Oliva ldquoMPI-CUDA sparse matrix-vector multiplication for the conjugategradient method with an approximate inverse preconditionerrdquoComputers amp Fluids vol 92 pp 244ndash252 2014
[15] F Vazquez J J Fernandez and E M Garzon ldquoA new approachfor sparse matrix vector product on NVIDIA GPUsrdquo Concur-rency Computation Practice and Experience vol 23 no 8 pp815ndash826 2011
[16] F Vazquez J J Fernandez and E M Garzon ldquoAutomatictuning of the sparse matrix vector product on GPUs based onthe ELLR-T approachrdquo Parallel Computing vol 38 no 8 pp408ndash420 2012
[17] A Monakov A Lokhmotov and A Avetisyan ldquoAutomaticallytuning sparse matrix-vector multiplication for GPU archi-tecturesrdquo in High Performance Embedded Architectures and
12 Mathematical Problems in Engineering
Compilers 5th International Conference HiPEAC 2010 PisaItaly January 25ndash27 2010 Proceedings pp 111ndash125 SpringerBerlin Germany 2010
[18] M Kreutzer G Hager G Wellein H Fehske and A R BishopldquoA unified sparse matrix data format for efficient general sparsematrix-vector multiplication on modern processors with widesimd unitsrdquo SIAM Journal on Scientific Computing vol 36 no5 pp C401ndashC423 2014
[19] H-V Dang and B Schmidt ldquoCUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operationsrdquo Par-allel Computing Systems amp Applications vol 39 no 11 pp 737ndash750 2013
[20] S Yan C Li Y Zhang and H Zhou ldquoYaSpMV yet anotherSpMV framework on GPUsrdquo in Proceedings of the 19th ACMSIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPP rsquo14) pp 107ndash118 February 2014
[21] D R Kincaid and D M Young ldquoA brief review of the ITPACKprojectrdquo Journal of Computational and Applied Mathematicsvol 24 no 1-2 pp 121ndash127 1988
[22] G Blelloch M Heroux and M Zagha ldquoSegmented operationsfor sparse matrix computation on vector multiprocessorrdquo TechRep School of Computer Science Carnegie Mellon UniversityPittsburgh Pa USA 1993
[23] J Nickolls I Buck M Garland and K Skadron ldquoScalableparallel programming with CUDArdquo ACM Queue vol 6 no 2pp 40ndash53 2008
[24] B Jang D Schaa P Mistry and D Kaeli ldquoExploiting memoryaccess patterns to improve memory performance in data-parallel architecturesrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 22 no 1 pp 105ndash118 2011
[25] T A Davis and Y Hu ldquoThe University of Florida sparse matrixcollectionrdquo ACM Transactions on Mathematical Software vol38 no 1 pp 1ndash25 2011
Figure 4 Performance of all algorithms on a Tesla C2050
Sing
le-p
reci
sion
band
wid
th (G
Byte
ss)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
140
120
100
80
60
40
20
0
(a) Single precision
Dou
ble-
prec
ision
band
wid
th (G
Byte
ss)
CSR-scalarCSRMVCSR-Adaptive
CSR-vectorHYBMVPCSR
epb2
ecl32
baye
r01
g7ja
c200
scfin
an512
tors
o2FE
M_3
D_t
herm
al2
cont
-300
raja
t24
lang
uage
af_s
hell9
ASI
C_680
ksH
amrle
3
ther
mal2
kkt_
pow
er
140
120
100
80
60
40
20
0
(b) Double precision
Figure 5 Effective bandwidth results of all algorithms on a Tesla C2050
cont-300The average speedup is 122x compared toHYBMVFigure 6 shows the visualization of af shell9 and cont-300We can find that af shell9 and cont-300 have a similarstructure and each row for two matrices has a very similarnumber of nonzero elements which is suitable to be storedin the ELL section of the HYB format Particularly PCSRand CSR-Adaptive have close performance The averageperformance of PCSR is nearly 105 times faster than CSR-Adaptive
Furthermore PCSR almost has the best memory band-width utilization among all algorithms for all the matricesexcept for af shell9 and cont-300 (Figure 5(a)) The max-imum memory bandwidth of PCSR exceeds 128GBytesswhich is about 90 percent of peak theoretical memorybandwidth for the Tesla C2050 Based on the performancemetrics [26] we can conclude that PCSR achieves goodperformance and has high parallelism
522 Double Precision From Figures 4(b) and 5(b) wesee that for all algorithms both the double-precision per-formance and memory bandwidth utilization are smallerthan the corresponding single-precision values due to theslow software-based operation PCSR is still better thanCSR-scalar CSR-vector and CSRMV and slightly outper-forms HYBMV and CSR-Adaptive for all the matricesThe average speedup of PCSR is 333x compared to CSR-scalar 198x compared to CSR-vector 157x compared toCSRMV 115x compared to HYBMV and 103x compared toCSR-Adaptive The maximum memory bandwidth of PCSRexceeds 108GBytess which is about 75 percent of peaktheoretical memory bandwidth for the Tesla C2050
53 Multiple GPUs531 PCSR Performance without Communication Here wetake the double-precisionmode for example to test the PCSR
8 Mathematical Problems in Engineering
(a) cont-300 (b) af shell9
Figure 6 Visualization of the af shell9 and cont-300 matrix
Table 3 Comparison of PCSRI and PCSRII without communication on two GPUs
Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE
performance onmultiple GPUswithout considering commu-nication We call PCSR with the first method and PCSR withthe second method PCSR-I and PCSR-II respectively Somelarge-sized test matrices in Table 2 are used The executiontime comparison of PCSRI and PCSRII on two and four TeslaC2050 GPUs is listed in Tables 3 and 4 respectively In Tables3 and 4 ET SD and PE stand for the execution time standarddeviation and parallel efficiency respectivelyThe time unit ismillisecond (ms) Figures 7 and 8 show the parallel efficiencyof PCSRI and PCSRII on two and four GPUs respectively
On two GPUs we observe that PCSRII has betterparallel efficiency than PCSRI for all the matrices exceptfor G3 circuit from Table 3 and Figure 7 The maximumaverage and minimum parallel efficiency of PCSRII are
9806 9164 and 7741 which wholly outperform thecorresponding maximum average and minimum parallelefficiency of PCSRI 9806 8816 and 7220 MoreoverPCSRII has a smaller standard deviation than PCSRI for allthe matrices except for ecology2 Transport and G3 circuitThis implies that the workload balance on two GPUs for thesecondmethod is advantageous over that for the firstmethod
On four GPUs for the parallel efficiency and standarddeviation PCSRII outperforms PCSRI for all the matricesexcept for G3 circuit (Table 4 and Figure 8) The maximumaverage and minimum parallel efficiency of PCSRII forall the matrices are 9635 8514 and 6417 and areadvantageous over the corresponding maximum averageand minimum parallel efficiency of PCSRI 9621 7889
Mathematical Problems in Engineering 9
Table 4 Comparison of PCSRI and PCSRII without communication on four GPUs
Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE
Figure 7 Parallel efficiency of PCSRI and PCSRII without commu-nication on two GPUs
and 5994 Particularly for Ga41As41H72 F1 cage14kkt power and Freescale1 the parallel efficiency of PCSRIIis almost 12 times that obtained by PCSRI
On the basis of the above observations we conclude thatPCSRII has high performance and is on the whole better thanPCSRI For PCSR on multiple GPUs the second method isour preferred one
532 PCSR Performance with Communication We still takethe double-precision mode for example to test the PCSRperformance on multiple GPUs with considering communi-cation PCSRwith the firstmethod and PCSRwith the secondmethod are still called PCSR-I and PCSR-II respectivelyThesame testmatrices as in the above experiment are utilizedTheexecution time comparison of PCSRI and PCSRII on two andfour Tesla C2050GPUs is listed in Tables 5 and 6 respectivelyThe time unit is ms ET SD and PE in Tables 5 and 6 are as
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 8 Parallel efficiency of PCSRI and PCSRII without commu-nication on four GPUs
the same as those in Tables 3 and 4 Figures 9 and 10 showPCSRI and PCSRII parallel efficiency on two and four GPUsrespectively
On two GPUs PCSRI and PCSRII have almost closeparallel efficiency for most matrices (Figure 9 and Table 5)As a comparison PCSRII slightly outperforms PCSRI Themaximum average and minimum parallel efficiency ofPCSRII for all the matrices are 9634 8851 and 8044and are advantageous over the corresponding maximumaverage and minimum parallel efficiency of PCSRI 96058603 and 7357
On four GPUs for the parallel efficiency and standarddeviation PCSRII is better than PCSRI for all the matri-ces except that PCSRI has slightly good parallel efficiencyfor thermal2 G3 circuit and Hamrle3 and slightly smallstandard deviation for thermal2 G3 circuit ecology2 andCurlCurl 4 (Figure 10 and Table 6) The maximum average
10 Mathematical Problems in Engineering
Table 5 Comparison of PCSRI and PCSRII with communication on two GPUs
Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE
and minimum parallel efficiency of PCSRII for all the matri-ces are 9012 7450 and 5827 which are better thanthe correspondingmaximum average andminimumparallelefficiency of PCSRI 8877 6569 and 5639
Therefore compared to PCSRI and PCSRII withoutcommunication although the performance of PCSRI andPCSRII with communication decreases due to the influenceof communication they still achieve significant performanceBecause PCSRII overall outperforms PCSRI for all testmatrices the second method in this case is still our preferredone for PCSR on multiple GPUs
6 Conclusion
In this study we propose a novel CSR-based SpMV onGPUs (PCSR) Experimental results show that our proposedPCSR on a GPU is better than CSR-scalar and CSR-vectorin the CUSP library and CSRMV and HYBMV in theCUSPARSE library and a most recently proposed CSR-basedalgorithm CSR-Adaptive To achieve high performance onmultiple GPUs for PCSR we present two matrix-partitionedmethods to balance the workload among multiple GPUs Weobserve that PCSR can show good performance with and
Mathematical Problems in Engineering 11
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 9 Parallel efficiency of PCSRI and PCSRII with communi-cation on two GPUs
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 10 Parallel efficiency of PCSRI and PCSRII with communi-cation on four GPUs
without considering communication using the two matrix-partitioned methods As a comparison the second method isour preferred one
Next we will further do research in this area and developother novel SpMVs on GPUs In particular the future workwill apply PCSR to some well-known iterative methods andthus solve the scientific and engineering problems
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
The research has been supported by the Chinese NaturalScience Foundation under Grant no 61379017
References
[1] Y Saad Iterative Methods for Sparse Linear Systems SIAMPhiladelphia Pa USA 2nd edition 2003
[2] N Bell and M Garland ldquoEfficient Sparse Matrix-vector Multi-plication on CUDArdquo Tech Rep NVIDIA 2008
[3] N Bell and M Garland ldquoImplementing sparse matrix-vectormultiplication on throughput-oriented processorsrdquo in Pro-ceedings of the Conference on High Performance ComputingNetworking Storage and Analysis (SC rsquo09) pp 14ndash19 PortlandOre USA November 2009
[5] N Bell and M Garland ldquoCusp Generic parallel algorithmsfor sparse matrix and graph computations version 051rdquo 2015httpcusp-librarygooglecodecom
[6] F Lu J Song F Yin and X Zhu ldquoPerformance evaluation ofhybrid programming patterns for large CPUGPU heteroge-neous clustersrdquoComputer Physics Communications vol 183 no6 pp 1172ndash1181 2012
[7] M M Dehnavi D M Fernandez and D GiannacopoulosldquoFinite-element sparse matrix vector multiplication on graphicprocessing unitsrdquo IEEE Transactions on Magnetics vol 46 no8 pp 2982ndash2985 2010
[8] M M Dehnavi D M Fernandez and D GiannacopoulosldquoEnhancing the performance of conjugate gradient solvers ongraphic processing unitsrdquo IEEE Transactions on Magnetics vol47 no 5 pp 1162ndash1165 2011
[9] J L Greathouse and M Daga ldquoEfficient sparse matrix-vectormultiplication on GPUs using the CSR storage formatrdquo inProceedings of the International Conference forHigh PerformanceComputing Networking Storage and Analysis (SC rsquo14) pp 769ndash780 New Orleans La USA November 2014
[10] V Karakasis T Gkountouvas K Kourtis G Goumas and NKoziris ldquoAn extended compression format for the optimizationof sparse matrix-vector multiplicationrdquo IEEE Transactions onParallel and Distributed Systems vol 24 no 10 pp 1930ndash19402013
[11] W T Tang W J Tan R Ray et al ldquoAccelerating sparsematrix-vectormultiplication onGPUsusing bit-representation-optimized schemesrdquo in Proceedings of the International Confer-ence for High Performance Computing Networking Storage andAnalysis (SC rsquo13) pp 1ndash12 Denver Colo USA November 2013
[12] M Verschoor and A C Jalba ldquoAnalysis and performanceestimation of the conjugate gradientmethodonmultipleGPUsrdquoParallel Computing vol 38 no 10-11 pp 552ndash575 2012
[13] J W Choi A Singh and R W Vuduc ldquoModel-driven autotun-ing of sparse matrix-vector multiply on GPUsrdquo in Proceedingsof the 15th ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming (PPoPP rsquo10) pp 115ndash126 ACMBangalore India January 2010
[14] G Oyarzun R Borrell A Gorobets and A Oliva ldquoMPI-CUDA sparse matrix-vector multiplication for the conjugategradient method with an approximate inverse preconditionerrdquoComputers amp Fluids vol 92 pp 244ndash252 2014
[15] F Vazquez J J Fernandez and E M Garzon ldquoA new approachfor sparse matrix vector product on NVIDIA GPUsrdquo Concur-rency Computation Practice and Experience vol 23 no 8 pp815ndash826 2011
[16] F Vazquez J J Fernandez and E M Garzon ldquoAutomatictuning of the sparse matrix vector product on GPUs based onthe ELLR-T approachrdquo Parallel Computing vol 38 no 8 pp408ndash420 2012
[17] A Monakov A Lokhmotov and A Avetisyan ldquoAutomaticallytuning sparse matrix-vector multiplication for GPU archi-tecturesrdquo in High Performance Embedded Architectures and
12 Mathematical Problems in Engineering
Compilers 5th International Conference HiPEAC 2010 PisaItaly January 25ndash27 2010 Proceedings pp 111ndash125 SpringerBerlin Germany 2010
[18] M Kreutzer G Hager G Wellein H Fehske and A R BishopldquoA unified sparse matrix data format for efficient general sparsematrix-vector multiplication on modern processors with widesimd unitsrdquo SIAM Journal on Scientific Computing vol 36 no5 pp C401ndashC423 2014
[19] H-V Dang and B Schmidt ldquoCUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operationsrdquo Par-allel Computing Systems amp Applications vol 39 no 11 pp 737ndash750 2013
[20] S Yan C Li Y Zhang and H Zhou ldquoYaSpMV yet anotherSpMV framework on GPUsrdquo in Proceedings of the 19th ACMSIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPP rsquo14) pp 107ndash118 February 2014
[21] D R Kincaid and D M Young ldquoA brief review of the ITPACKprojectrdquo Journal of Computational and Applied Mathematicsvol 24 no 1-2 pp 121ndash127 1988
[22] G Blelloch M Heroux and M Zagha ldquoSegmented operationsfor sparse matrix computation on vector multiprocessorrdquo TechRep School of Computer Science Carnegie Mellon UniversityPittsburgh Pa USA 1993
[23] J Nickolls I Buck M Garland and K Skadron ldquoScalableparallel programming with CUDArdquo ACM Queue vol 6 no 2pp 40ndash53 2008
[24] B Jang D Schaa P Mistry and D Kaeli ldquoExploiting memoryaccess patterns to improve memory performance in data-parallel architecturesrdquo IEEE Transactions on Parallel and Dis-tributed Systems vol 22 no 1 pp 105ndash118 2011
[25] T A Davis and Y Hu ldquoThe University of Florida sparse matrixcollectionrdquo ACM Transactions on Mathematical Software vol38 no 1 pp 1ndash25 2011
performance onmultiple GPUswithout considering commu-nication We call PCSR with the first method and PCSR withthe second method PCSR-I and PCSR-II respectively Somelarge-sized test matrices in Table 2 are used The executiontime comparison of PCSRI and PCSRII on two and four TeslaC2050 GPUs is listed in Tables 3 and 4 respectively In Tables3 and 4 ET SD and PE stand for the execution time standarddeviation and parallel efficiency respectivelyThe time unit ismillisecond (ms) Figures 7 and 8 show the parallel efficiencyof PCSRI and PCSRII on two and four GPUs respectively
On two GPUs we observe that PCSRII has betterparallel efficiency than PCSRI for all the matrices exceptfor G3 circuit from Table 3 and Figure 7 The maximumaverage and minimum parallel efficiency of PCSRII are
9806 9164 and 7741 which wholly outperform thecorresponding maximum average and minimum parallelefficiency of PCSRI 9806 8816 and 7220 MoreoverPCSRII has a smaller standard deviation than PCSRI for allthe matrices except for ecology2 Transport and G3 circuitThis implies that the workload balance on two GPUs for thesecondmethod is advantageous over that for the firstmethod
On four GPUs for the parallel efficiency and standarddeviation PCSRII outperforms PCSRI for all the matricesexcept for G3 circuit (Table 4 and Figure 8) The maximumaverage and minimum parallel efficiency of PCSRII forall the matrices are 9635 8514 and 6417 and areadvantageous over the corresponding maximum averageand minimum parallel efficiency of PCSRI 9621 7889
Mathematical Problems in Engineering 9
Table 4 Comparison of PCSRI and PCSRII without communication on four GPUs
Matrix ET (GPU) PCSRI (4 GPUs) PCSRII (4 GPUs)ET SD PE ET SD PE
Figure 7 Parallel efficiency of PCSRI and PCSRII without commu-nication on two GPUs
and 5994 Particularly for Ga41As41H72 F1 cage14kkt power and Freescale1 the parallel efficiency of PCSRIIis almost 12 times that obtained by PCSRI
On the basis of the above observations we conclude thatPCSRII has high performance and is on the whole better thanPCSRI For PCSR on multiple GPUs the second method isour preferred one
532 PCSR Performance with Communication We still takethe double-precision mode for example to test the PCSRperformance on multiple GPUs with considering communi-cation PCSRwith the firstmethod and PCSRwith the secondmethod are still called PCSR-I and PCSR-II respectivelyThesame testmatrices as in the above experiment are utilizedTheexecution time comparison of PCSRI and PCSRII on two andfour Tesla C2050GPUs is listed in Tables 5 and 6 respectivelyThe time unit is ms ET SD and PE in Tables 5 and 6 are as
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 8 Parallel efficiency of PCSRI and PCSRII without commu-nication on four GPUs
the same as those in Tables 3 and 4 Figures 9 and 10 showPCSRI and PCSRII parallel efficiency on two and four GPUsrespectively
On two GPUs PCSRI and PCSRII have almost closeparallel efficiency for most matrices (Figure 9 and Table 5)As a comparison PCSRII slightly outperforms PCSRI Themaximum average and minimum parallel efficiency ofPCSRII for all the matrices are 9634 8851 and 8044and are advantageous over the corresponding maximumaverage and minimum parallel efficiency of PCSRI 96058603 and 7357
On four GPUs for the parallel efficiency and standarddeviation PCSRII is better than PCSRI for all the matri-ces except that PCSRI has slightly good parallel efficiencyfor thermal2 G3 circuit and Hamrle3 and slightly smallstandard deviation for thermal2 G3 circuit ecology2 andCurlCurl 4 (Figure 10 and Table 6) The maximum average
10 Mathematical Problems in Engineering
Table 5 Comparison of PCSRI and PCSRII with communication on two GPUs
Matrix ET (GPU) PCSRI (2 GPUs) PCSRII (2 GPUs)ET SD PE ET SD PE
and minimum parallel efficiency of PCSRII for all the matri-ces are 9012 7450 and 5827 which are better thanthe correspondingmaximum average andminimumparallelefficiency of PCSRI 8877 6569 and 5639
Therefore compared to PCSRI and PCSRII withoutcommunication although the performance of PCSRI andPCSRII with communication decreases due to the influenceof communication they still achieve significant performanceBecause PCSRII overall outperforms PCSRI for all testmatrices the second method in this case is still our preferredone for PCSR on multiple GPUs
6 Conclusion
In this study we propose a novel CSR-based SpMV onGPUs (PCSR) Experimental results show that our proposedPCSR on a GPU is better than CSR-scalar and CSR-vectorin the CUSP library and CSRMV and HYBMV in theCUSPARSE library and a most recently proposed CSR-basedalgorithm CSR-Adaptive To achieve high performance onmultiple GPUs for PCSR we present two matrix-partitionedmethods to balance the workload among multiple GPUs Weobserve that PCSR can show good performance with and
Mathematical Problems in Engineering 11
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 9 Parallel efficiency of PCSRI and PCSRII with communi-cation on two GPUs
2cu
bes_
sphe
resc
ircui
tG
a41
As41
H72 F1
ASI
C_680
ksec
olog
y2H
amrle
3
ther
mal2
cage14
Tran
spor
tG3
_circ
uit
kkt_
pow
erCu
rlCur
l_4
mem
chip
Free
scal
e1
PCSRIPCSRII
Para
llel e
ffici
ency
10
08
06
04
02
00
Figure 10 Parallel efficiency of PCSRI and PCSRII with communi-cation on four GPUs
without considering communication using the two matrix-partitioned methods As a comparison the second method isour preferred one
Next we will further do research in this area and developother novel SpMVs on GPUs In particular the future workwill apply PCSR to some well-known iterative methods andthus solve the scientific and engineering problems
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
The research has been supported by the Chinese NaturalScience Foundation under Grant no 61379017
References
[1] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, Pa, USA, 2nd edition, 2003.
[2] N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," Tech. Rep., NVIDIA, 2008.
[3] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), pp. 14–19, Portland, Ore, USA, November 2009.
[5] N. Bell and M. Garland, "Cusp: generic parallel algorithms for sparse matrix and graph computations, version 0.5.1," 2015, http://cusp-library.googlecode.com.
[6] F. Lu, J. Song, F. Yin, and X. Zhu, "Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters," Computer Physics Communications, vol. 183, no. 6, pp. 1172–1181, 2012.
[7] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Finite-element sparse matrix vector multiplication on graphic processing units," IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982–2985, 2010.
[8] M. M. Dehnavi, D. M. Fernandez, and D. Giannacopoulos, "Enhancing the performance of conjugate gradient solvers on graphic processing units," IEEE Transactions on Magnetics, vol. 47, no. 5, pp. 1162–1165, 2011.
[9] J. L. Greathouse and M. Daga, "Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp. 769–780, New Orleans, La, USA, November 2014.
[10] V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris, "An extended compression format for the optimization of sparse matrix-vector multiplication," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930–1940, 2013.
[11] W. T. Tang, W. J. Tan, R. Ray et al., "Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), pp. 1–12, Denver, Colo, USA, November 2013.
[12] M. Verschoor and A. C. Jalba, "Analysis and performance estimation of the conjugate gradient method on multiple GPUs," Parallel Computing, vol. 38, no. 10-11, pp. 552–575, 2012.
[13] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on GPUs," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10), pp. 115–126, ACM, Bangalore, India, January 2010.
[14] G. Oyarzun, R. Borrell, A. Gorobets, and A. Oliva, "MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner," Computers & Fluids, vol. 92, pp. 244–252, 2014.
[15] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "A new approach for sparse matrix vector product on NVIDIA GPUs," Concurrency and Computation: Practice and Experience, vol. 23, no. 8, pp. 815–826, 2011.
[16] F. Vazquez, J. J. Fernandez, and E. M. Garzon, "Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach," Parallel Computing, vol. 38, no. 8, pp. 408–420, 2012.
[17] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in High Performance Embedded Architectures and Compilers: 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25–27, 2010, Proceedings, pp. 111–125, Springer, Berlin, Germany, 2010.
[18] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014.
[19] H.-V. Dang and B. Schmidt, "CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations," Parallel Computing, vol. 39, no. 11, pp. 737–750, 2013.
[20] S. Yan, C. Li, Y. Zhang, and H. Zhou, "YaSpMV: yet another SpMV framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pp. 107–118, February 2014.
[21] D. R. Kincaid and D. M. Young, "A brief review of the ITPACK project," Journal of Computational and Applied Mathematics, vol. 24, no. 1-2, pp. 121–127, 1988.
[22] G. Blelloch, M. Heroux, and M. Zagha, "Segmented operations for sparse matrix computation on vector multiprocessors," Tech. Rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA, 1993.
[23] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40–53, 2008.
[24] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, "Exploiting memory access patterns to improve memory performance in data-parallel architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105–118, 2011.
[25] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011.