Robert Strzodka, Stanford University, Max Planck Center
The Chances and Challenges of Parallelism
Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices
[Title-slide charts: "Normalized CPU (double) and CPU-GPU (mixed precision) execution time" — seconds per grid node vs. domain size in grid nodes; series: 1x1 CG: Opteron 250, 1x1 CG: GF7800GTX, 2x2 MG__MG: Opteron 250, 2x2 MG__MG: GF7800GTX. "Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG)" — number of slices vs. bits of mantissa; series: Adder, Multiplier, CG kernel normalized (1/30)]
The Chances and Challenges of Parallelism · 2007-01-23 · Robert Strzodka, Stanford University, Max Planck Center
Transcript
Robert Strzodka, Stanford University, Max Planck Center
The Chances and Challenges of Parallelism
Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices
GPU: Banded Matrix Vector Product r = Av
• Flowware in C++

// load configware to the GPU, define names for arrays, then initialize
enum EnumArr { ARR_r, ARR_v, ARR_Al, ARR_Ac, ARR_Au, ARR_NUM };
for( int i = 0; i < ARR_NUM; i++ ) {
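The truncated flowware snippet only names the arrays. As a point of reference, a plain CPU version of the banded product r = Av can be sketched as follows — treating the three bands named in the enum (ARR_Al, ARR_Ac, ARR_Au) as the lower, center, and upper diagonal of A, which is an assumption for illustration:

```cpp
#include <cassert>
#include <vector>

// CPU reference for the banded product r = A v, assuming the arrays named
// in the enum above (ARR_Al, ARR_Ac, ARR_Au) hold the lower, center and
// upper diagonal of A (an illustrative assumption, not the talk's kernel).
std::vector<double> banded_matvec(const std::vector<double>& Al,
                                  const std::vector<double>& Ac,
                                  const std::vector<double>& Au,
                                  const std::vector<double>& v) {
    const std::size_t n = v.size();
    std::vector<double> r(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        r[i] = Ac[i] * v[i];                      // center diagonal
        if (i > 0)     r[i] += Al[i] * v[i - 1];  // lower diagonal
        if (i + 1 < n) r[i] += Au[i] * v[i + 1];  // upper diagonal
    }
    return r;
}
```

The GPU flowware streams the same multiply-add pattern over the arrays; only the execution model differs.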
FPGA: Banded Matrix Vector Product r = Av
• Flowware in C++
• The FPGA will use the same framework as the GPU.
• Object-orientation: one interface, different implementations.
• In development.
[Oskar Mencer: ASC, A Stream Compiler for Computing with FPGAs, IEEE Trans. CAD, 2006]
14
GPU Programming
• Application: e.g. in C/C++, Java, Fortran, Perl
• GPU library: hides the graphics-specific details
• Shader programs: e.g. in HLSL, GLSL, Cg
• Graphics API: e.g. OpenGL, DirectX
• Window manager: e.g. GLUT, Qt, Win32, Motif
• Operating system: e.g. Windows, Unix, Linux, MacOS
• Graphics hardware: e.g. Radeon (ATI), GeForce (NV)
ASC bridges the VLSI CAD Productivity Gap with a Software Approach to Hardware Generation
[Design-flow diagram: Flowware, Configware, System Level Model, Behavioral Synthesis, RTL / Libraries, Logic Synthesis]
• The traditional hardware design process is vertically fragmented across many companies, file formats, etc. This is the major culprit for the productivity gap.
For a function F : ℝ^N → ℝ^N, find parameters Q ∈ ℝ^M and X_0 ∈ ℝ^N with

    F(X_0; Q) = 0.

As we cannot solve exactly, we iterate, starting with some X_0 ∈ ℝ^N:

    Q_{k+1} := H(X_k; Q_0, …, Q_k),   F(X̃_{k+1}; Q_{k+1}) = 0,   X_{k+1} := X_k + X̃_{k+1},

i.e. we repeatedly solve F with different parameters Q_k.

Now we distinguish two cases:
1) We can find an approximate solution directly.
2) The approximate solution itself requires an iterative process.

This is typically used to solve a linear system of equations A X = B:

    B_{k+1} := B − A X_k,   A X̃_{k+1} = B_{k+1},   X_{k+1} := X_k + X̃_{k+1}.
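The defect-correction loop for A X = B can be sketched in a few lines — here a minimal illustration for a 2x2 system, with Cramer's rule in single precision standing in for the inexact inner solver (both the system size and the inner solver are illustrative assumptions, not the solvers benchmarked later):

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// Inexact inner solver: Cramer's rule in single precision (illustrative).
static std::array<float, 2> solve2x2_float(const Mat2& A, const Vec2& rhs) {
    float a = (float)A[0][0], b = (float)A[0][1];
    float c = (float)A[1][0], d = (float)A[1][1];
    float det = a * d - b * c;
    float r0 = (float)rhs[0], r1 = (float)rhs[1];
    return { (r0 * d - b * r1) / det, (a * r1 - c * r0) / det };
}

// Mixed precision refinement: residual B_{k+1} = B - A X_k and the update
// X_{k+1} = X_k + X~_{k+1} in double, the correction X~_{k+1} in float.
Vec2 iterative_refinement(const Mat2& A, const Vec2& B, int steps) {
    Vec2 X = {0.0, 0.0};
    for (int k = 0; k < steps; ++k) {
        Vec2 res = { B[0] - (A[0][0] * X[0] + A[0][1] * X[1]),
                     B[1] - (A[1][0] * X[0] + A[1][1] * X[1]) };
        std::array<float, 2> corr = solve2x2_float(A, res);
        X[0] += corr[0];
        X[1] += corr[1];
    }
    return X;
}
```

Each pass gains roughly the single precision accuracy, so a handful of iterations reaches double precision accuracy while the bulk of the work runs in the cheaper format.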
22
CPU Results: LU Solver
[Chart, larger is better; chart courtesy of Jack Dongarra]
[Langou et al. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems), SC 2006, to appear]
23
GPU Results: Conjugate Gradient (CG) and Multigrid (MG)
[Chart, smaller is better: "Performance of double precision CPU and mixed precision CPU-GPU solvers" — seconds per grid node (5e-7 to 5e-4) vs. data level (6 to 10); series: CG CPU, CG GPU, MG2+2 CPU, MG2+2 GPU]
24
FPGA Results: Conjugate Gradient with MUL18x18
[Chart, smaller is better: "Area of Conjugate Gradient s??e11 float kernels on the xc2v8000" — number of slices (0 to 60000) vs. bits of mantissa (20 to 50); series: Number of Slices, Quadratic fit, Number of 4 input LUTs, Number of Slice Flip Flops, Number of MULT18X18s × 500]
25
FPGA Results: Conjugate Gradient with MUL18x18
[Chart, larger is better: "Frequency/IO of Conjugate Gradient s??e11 float kernels on the xc2v8000" — frequency / IO blocks (40 to 140) vs. bits of mantissa (20 to 50); series: Maximal Frequency in MHz, Number of bonded IOBs in 10s]
26
The Challenges
• Computing Paradigms
• Parallel Programming
• Precision and Accuracy
• Algorithmic Optimization
• Large Range Scaling
27
Arithmetic Intensity in Matrix-Vector Products
• Analysis of banded MatVec r = Av, pre-assembled
  – Reads per component of r: 9 times into v, once into each band of A → 18 reads
  – Operations per component of r: 9 multiply-adds → 18 ops
  – Arithmetic intensity: 18/18 = 1
• Arithmetic intensity
  – Operations per memory access
  – Computation / bandwidth
• Rule of thumb for CPU/GPU
  – Arithmetic intensity on floats should be > 8
  – On doubles twice as high
28
Trading Computation for Bandwidth
• Three possibilities for a matrix vector product A·v if A depends on some data and must be computed itself
  – On-the-fly: compute entries of A for each A·v application
    • Lowest memory requirement
    • Good for simple entries or infrequent use of A
  – Partial assembly: precompute only some intermediate results
    • Allows balancing computation and bandwidth requirements
    • A good choice of precomputed results requires little memory
  – Full assembly: precompute all entries of A, use these in A·v
    • Good if other computations hide the bandwidth problem in A·v
    • Otherwise try partial assembly
• For example, pre-compute only G[·] when solving

    A[U_k] · U_{k+1} = U_k,   A[U_k] := 1 − τ div_h( G[U_k] ∇_h ).
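The first and third variants can be contrasted on a toy 1D operator whose coefficients derive from data g — the entry formula below is a hypothetical stand-in, chosen only to make the trade-off visible:

```cpp
#include <cassert>
#include <vector>

// Hypothetical coefficient derived from data g (stand-in for G[U]).
static double entry(const std::vector<double>& g, std::size_t i) {
    return 0.5 * (g[i] + g[i + 1]);
}

// On-the-fly: recompute each entry inside every product application.
// Lowest memory use, highest arithmetic per application.
std::vector<double> apply_on_the_fly(const std::vector<double>& g,
                                     const std::vector<double>& v) {
    std::vector<double> r(v.size() - 1);
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        r[i] = entry(g, i) * v[i];
    return r;
}

// Full assembly: precompute all entries once and store them.
std::vector<double> assemble(const std::vector<double>& g) {
    std::vector<double> A(g.size() - 1);
    for (std::size_t i = 0; i + 1 < g.size(); ++i)
        A[i] = entry(g, i);
    return A;
}

// Applying the assembled operator only reads; no recomputation,
// but every application pays the bandwidth for A.
std::vector<double> apply_assembled(const std::vector<double>& A,
                                    const std::vector<double>& v) {
    std::vector<double> r(A.size());
    for (std::size_t i = 0; i < A.size(); ++i)
        r[i] = A[i] * v[i];
    return r;
}
```

Partial assembly sits between the two: store only G[·] and rebuild the entries of A from it in each application.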
29
Standard Conjugate Gradient
• Sufficient flexibility in domain discretization
  – Global unstructured macro mesh, domain decomposition
  – (An)isotropic refinement into local tensor-product grids
• Efficient computation
  – High data locality, large problems map well to clusters
  – Problem-specific solvers depending on anisotropy level
  – Hardware-accelerated solvers on regular sub-problems
[Stefan Turek et al. Hardware-oriented numerics and concepts for PDE software, 2006]