Accelerating the Singular Value Decomposition of Rectangular Matrices with the CSX600 and the Integrable SVD
September 7, 2007
PaCT-2007, Pereslavl-Zalessky
Yusaku Yamamoto, Takeshi Fukaya, Takashi Uneyama,
Masami Takata, Kinji Kimura, Masashi Iwasaki
and Yoshimasa Nakamura
2
Outline
• Introduction
• The CSX600 floating-point accelerator
• Optimization of the rectangular SVD algorithm for the CSX600
• Performance evaluation
• Conclusion
• ClearSpeed Advance board
– Two CSX600 processors
– 1GB DRAM
– Connected to a host PC via the PCI-X bus
– Peak performance: 96 GFLOPS
10
Software environments for the CSX600
• Software Development Kit
– Compiler: parallel programming with the Cn language
– Debugger
– Simulator
• CSXL library (the one used in this study)
– Basic Linear Algebra Subprograms (BLAS) for the ClearSpeed Advance board
– The library transfers the input data from the main memory to the board, performs the computation, and returns the result to the main memory.
– Sustained performance: 50 GFLOPS with DGEMM (dense matrix-matrix multiplication)
• CSFFT library
11
Performance of the CSXL DGEMM
[Graphs: performance (MFLOPS) of the CSXL DGEMM, C += A × B, with A of size m×k and B of size k×n, for two cases: A, B non-transposed, and A non-transposed, B transposed. Left: m = k = 450, n = 1000 to 6000. Right: k = 450, m = n = 1000 to 6000. Vertical axis: 0 to 50000 MFLOPS.]
At least two of the three size parameters (m, n and k) must be large to obtain considerable performance.
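For reference, the call being benchmarked is the standard BLAS DGEMM, C := alpha·op(A)·op(B) + beta·C, with A of size m×k and B of size k×n. The sketch below uses SciPy's reference BLAS as a stand-in for the CSXL DGEMM, simply to make the roles of m, n, and k in the graphs concrete; the sizes mirror the left-hand graph.

```python
import numpy as np
from scipy.linalg.blas import dgemm

# Shapes from the left-hand graph: m = k = 450, n up to 6000.
m, k, n = 450, 450, 6000
A = np.random.rand(m, k)
B = np.random.rand(k, n)

# A, B non-transposed: C := 1.0 * A @ B
C = dgemm(alpha=1.0, a=A, b=B)

# A non-transposed, B transposed: B is stored as n-by-k and transposed inside DGEMM
Bt = np.random.rand(n, k)
C2 = dgemm(alpha=1.0, a=A, b=Bt, trans_b=1)

print(C.shape, C2.shape)   # (450, 6000) (450, 6000)
```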
12
Optimization of the rectangular SVD algorithm for the CSX600
13
Algorithm for rectangular SVD
1. QR decomposition: A = QR
2. Bidiagonalization: R = U1 B V1^T
3. SVD of the bidiagonal matrix: B = U2 Σ V2^T
4. Inverse transformation: R = U' Σ V^T, where V = V1 V2 and U' = U1 U2
5. Multiplication by Q: U = QU', giving A = U Σ V^T
[Diagram: A (m×n) = Q (m×n) R (n×n); B is an n×n bidiagonal matrix.]
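A minimal NumPy sketch of this flow for a tall matrix. Note that steps 2 to 4 (bidiagonalization, the bidiagonal SVD via the Integrable SVD, and the inverse transformation) are collapsed into a single np.linalg.svd call on R here as a stand-in; this is not the authors' implementation.

```python
import numpy as np

def rect_svd(A):
    """SVD of a tall m-by-n matrix A following the slide's flow (sketch only)."""
    Q, R = np.linalg.qr(A)            # step 1: QR decomposition, A = QR (reduced form)
    # Steps 2-4 (bidiagonalization, bidiagonal SVD, inverse transformation) give
    # R = U' Σ V^T; np.linalg.svd stands in for them in this sketch.
    Up, s, Vt = np.linalg.svd(R)
    U = Q @ Up                        # step 5: multiplication by Q, U = QU'
    return U, s, Vt                   # A = U Σ V^T

# tiny usage check on a 200-by-10 matrix
A = np.random.default_rng(1).standard_normal((200, 10))
U, s, Vt = rect_svd(A)
print(np.allclose(A, (U * s) @ Vt))   # True: reconstruction check
```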
14
Computational work of each part
When m >> n (e.g., m = 100000, n = 5000), the computational work of each part is:

  QR decomposition (A = QR):                        2mn^2
  Bidiagonalization (R = U1 B V1^T):                (8/3)n^3
  SVD of the bidiagonal matrix (B = U2 Σ V2^T):     O(n^2) to O(n^3)
  Inverse transformation (V = V1 V2, U' = U1 U2):   2n^3, 4n^3
  Multiplication by Q (U = QU', so A = U Σ V^T):    4mn^2

The QR decomposition and the multiplication by Q (the 2mn^2 and 4mn^2 terms) account for most of the computational work.
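A rough back-of-the-envelope check of this claim for the example sizes; the O(n^2) to O(n^3) bidiagonal SVD is counted at an assumed n^3 upper bound here.

```python
m, n = 100_000, 5_000   # the m >> n example above

flops = {
    "QR decomposition (2mn^2)":              2 * m * n**2,
    "Bidiagonalization ((8/3)n^3)":          8 / 3 * n**3,
    "Bidiagonal SVD (assumed ~n^3)":         float(n**3),
    "Inverse transformation (2n^3 + 4n^3)":  6.0 * n**3,
    "Multiplication by Q (4mn^2)":           4 * m * n**2,
}
total = sum(flops.values())
for name, f in flops.items():
    print(f"{name:40s} {f:10.3e}  ({100 * f / total:5.1f} %)")
# The two m-dependent terms contribute over 90% of the total for these sizes.
```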
15
Optimization of each part

Parts accelerated with the CSX600:
• QR decomposition: A = QR
• Multiplication by Q: U = QU', giving A = U Σ V^T
Reorganize these algorithms to use matrix multiplications and accelerate the matrix multiplications with the CSXL BLAS.

Parts executed on the host only:
• Bidiagonalization: R = U1 B V1^T (LAPACK DGEBRD)
• SVD of the bidiagonal matrix: B = U2 Σ V2^T (Integrable SVD)
• Inverse transformation: V = V1 V2 and U' = U1 U2, giving R = U' Σ V^T (LAPACK DORMBR)
16
QR decomposition of A

Upper triangularization by Householder transformations:
  Hn ··· H2 H1 A = A^(n) = R
  A = H1 H2 ··· Hn A^(n) = QR
where
  H1 A = (I - t1 y1 y1^T) A = A^(1)
[Diagram: A -> A^(1) -> A^(2) -> ··· -> A^(n) = R]

Each transformation (I - ti yi yi^T) A is a level-2 BLAS operation, so the CSXL cannot be used for it.
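A minimal sketch of a single Householder step in NumPy (the function name house and the sign choice are illustrative, not taken from the slides): the reflector annihilates everything below the first entry of a column, and applying it is a rank-1 (level-2 BLAS) update.

```python
import numpy as np

def house(x):
    """Householder vector y and scalar t with (I - t y y^T) x = alpha e1."""
    y = np.asarray(x, dtype=float).copy()
    alpha = -np.copysign(np.linalg.norm(y), y[0])   # sign chosen to avoid cancellation
    y[0] -= alpha
    t = 2.0 / (y @ y)        # sketch only: the x = 0 edge case is not handled
    return y, t, alpha

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
y, t, alpha = house(x)
print(x - t * y * (y @ x))   # ~[-7.21, 0, 0, 0, 0]: entries below the first are annihilated
```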
17
Aggregating the Householder transformations

Blocking technique:
  Hn ··· H2 H1 = (I - tn yn yn^T) ··· (I - t2 y2 y2^T)(I - t1 y1 y1^T)
              = I - Yn Tn Yn^T
where Yn = [ y1 | y2 | ··· | yn ] (an m×n matrix) and Tn is an n×n lower triangular matrix.

Multiple Householder transformations can thus be aggregated and carried out by matrix multiplications, which can be accelerated with the CSXL.
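A minimal NumPy sketch of this aggregation (the compact WY form), assuming the reflector vectors y_i and scalars t_i are already available. It checks that applying the reflectors one by one (level-2 updates) matches the aggregated I - Y T Y^T form, which needs only matrix multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)
m, b = 8, 3              # column length and number of aggregated reflectors (illustrative)

# Householder reflectors H_i = I - t_i y_i y_i^T with t_i = 2 / (y_i^T y_i)
ys = [rng.standard_normal(m) for _ in range(b)]
ts = [2.0 / (y @ y) for y in ys]

# Build the compact WY factors: H_b ... H_2 H_1 = I - Y T Y^T, T lower triangular
Y = np.column_stack(ys)
T = np.zeros((b, b))
T[0, 0] = ts[0]
for j in range(1, b):
    T[j, j] = ts[j]
    T[j, :j] = -ts[j] * (ys[j] @ Y[:, :j]) @ T[:j, :j]

A = rng.standard_normal((m, 5))

# Route 1: apply the reflectors one at a time (rank-1, level-2 BLAS updates)
A2 = A.copy()
for y, t in zip(ys, ts):
    A2 -= t * np.outer(y, y @ A2)

# Route 2: apply the aggregated form with two matrix multiplications (level-3 BLAS)
A3 = A - Y @ (T @ (Y.T @ A))

print(np.allclose(A2, A3))   # True: both routes give the same result
```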
18
Blocking strategies for QR decomposition

Comparison of three blocking strategies
[Table: No. | Strategy | QR decomposition (level-2 work, level-3 work, size of matrix multiplication) | Multiplication by Q]

• Block QR requires the smallest amount of work, but some of the work is done with the level-2 BLAS, and the matrix multiplications are rather small.
• Recursive QR requires the largest amount of work, but all of it is in the level-3 BLAS, and the matrix multiplications are large.