Agenda
– What is 3D Running Average (RA)?
– From 1D to 3D RA implementation
– Basic SSE technique: AoS <=> SoA transforms
– 1D RA 4-lines SSE implementation
– 2nd dimension completion
– 3rd dimension completion
– Adding OpenMP, benchmarking, conclusions
3D Running Average (RA) – what is it?
3D RA is computed for each voxel V as the normalized sum inside a k×k×k cube (k is odd) located “around” the given voxel:

S(V) = ∑(v) / (k×k×k), summed over the k×k×k neighborhood of V,

where ‘v’ denotes the source voxels.
In other words, 3D RA can be considered a 3D convolution with a kernel whose components are all equal to 1/(k×k×k).
Unlike 1D convolution, 1D RA can be computed with O(1) complexity per output sample using the following approach:
– Prolog: compute the sum S of the first k voxels
– Main step: to compute the next sum S+1, subtract the first member of the previous sum (v0) and add the next component (vk):
S = ∑(v)0,k-1
S+1 = ∑(v)1,k = S – v0 + vk
Extending 1D Running Average toward 2D
Given a slice (plane) with all lines (Li) already 1D-averaged, we can extend the averaging to 2D by the same approach:
– Prolog: compute the sum S of the first k lines
S = ∑(L)0,k-1
S+1 = ∑(L)1,k = S – L0 + Lk
– Main step: to compute the next sum S+1, subtract the first line of the previous sum (L0) and add the next line (Lk)
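A minimal sketch of this extension, assuming a row-major plane of `w` columns and `h` rows already holding 1D-averaged lines (the deck operates on whole lines at once; a scalar per-column loop is shown here only for clarity, and all names are my own):

```c
/* Extend 1D-averaged rows to 2D: per column, keep a running sum over a
 * window of k rows (S+1 = S - L0 + Lk) and normalize by k. */
static void extend_rows_to_2d(const float *rows, float *out,
                              int w, int h, int k)
{
    for (int x = 0; x < w; ++x) {
        float s = 0.0f;
        for (int y = 0; y < k; ++y)          /* prolog: first k rows */
            s += rows[y * w + x];
        out[x] = s / (float)k;
        for (int y = 1; y + k <= h; ++y) {   /* main step: S - L0 + Lk */
            s = s - rows[(y - 1) * w + x] + rows[(y + k - 1) * w + x];
            out[y * w + x] = s / (float)k;
        }
    }
}
```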
Extending 2D Running Average toward 3DExtending 2D Running Average toward 3D
Giving stack of planes with all planes (PGiving stack of planes with all planes (Pii) 2D-averaged, ) 2D-averaged, we can extend averaing for 3D by the same approach:we can extend averaing for 3D by the same approach:
– Prolog: compute sum S of first k planesProlog: compute sum S of first k planes
– Main step: to compute next sum SMain step: to compute next sum S+1+1 , first plane of previous sum , first plane of previous sum ((P0) should be subtracted, and next plane () should be subtracted, and next plane (Pk) should be added) should be added
Array of Structures (AoS) => Structure of Arrays (SoA). Why must we transform the data to vectorize 1D Running Average?
The original, “natural” serial data structure is AoS – NOT enabled for SSE:
Ms = ∑(m)0,k-1
Ms+1 = ∑(m)1,k = Ms – m0 + mk
[Diagram: AoS layout – the structure members of voxels 0…k interleaved in memory]
[Diagram: SoA layout – lines L0…L3 transposed so that each component array v0…v3 holds that component for all voxel indices 0…k]
S = ∑(v)0,k-1
S+1 = ∑(v)1,k = S – v0 + vk
The “transposed” data structure, SoA, is ENABLED for SSE!
Array of Structures (AoS) => Structure of Arrays (SoA). Presented below: transposition of 4 quads from 4 original lines into 4 SSE registers x, y, w, z. Takes 12 SSE operations per 16 components.
Array of Structures (AoS) <= Structure of Arrays (SoA). Presented below: the (inverse) transposition of the 4 SSE registers x, y, w, z into 4 memory locations. Also takes 12 SSE operations per 16 components.
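One common way to express this 4×4 AoS=>SoA transposition is the standard `_MM_TRANSPOSE4_PS` macro from `<xmmintrin.h>` (it uses 8 shuffle/unpack operations; the deck's count of 12 presumably includes the loads or stores). This sketch is my own, not the deck's code:

```c
#include <xmmintrin.h>

/* Transpose a 4x4 block of floats: four AoS rows in, four SoA rows out. */
static void transpose_4x4(const float *in /* 4x4 row-major */, float *out)
{
    __m128 r0 = _mm_loadu_ps(in + 0);
    __m128 r1 = _mm_loadu_ps(in + 4);
    __m128 r2 = _mm_loadu_ps(in + 8);
    __m128 r3 = _mm_loadu_ps(in + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   /* in-register transpose */
    _mm_storeu_ps(out + 0,  r0);
    _mm_storeu_ps(out + 4,  r1);
    _mm_storeu_ps(out + 8,  r2);
    _mm_storeu_ps(out + 12, r3);
}
```

The inverse direction (SoA=>AoS) is the same transpose applied to the result registers before storing.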
RA of width 11 needs to keep 12 registers live together; they can fit in 3 QUADs of registers, but can “crawl” across 4 QUADs as well.
[Diagram: voxel QUADs at indices 0…15 – a window of 12 either fits in 3 QUADs or crawls to 4 QUADs; a freed QUAD can be refilled by AoS=>SoA as the “next” QUAD]
So, 16 registers (4 QUADs) must be allocated and used cyclically – when the last QUAD is freed, it is reloaded by AoS=>SoA with the next QUAD of values.
1D Running Average 4-lines SSE implementation (width 11) – Prolog
1. Load 12 SSE registers by AoS=>SoA
2. Sum up (accumulate) the first 5
3. 4 times: (sum up the next, save the result in SSE registers – SoA form)
– Save a QUAD of results in memory by AoS<=SoA
4. 2 times: (sum up the next, save the result in SSE registers – SoA form)
5. 1 time: sum up the next, subtract the first, save the result in an SSE register
Here all 12 loaded QUADs are used: 5+4+2+1, and 3 result registers are NOT yet saved.
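Step 2 above, accumulating the first five QUADs into one running-sum register, can be sketched with SSE intrinsics (the function name is mine, not from the deck):

```c
#include <xmmintrin.h>

/* Sum five QUADs of voxel components into one register, four lanes at
 * once; this is the prolog accumulation of the 4-lines scheme. */
static __m128 accumulate5(__m128 a, __m128 b, __m128 c, __m128 d, __m128 e)
{
    return _mm_add_ps(_mm_add_ps(_mm_add_ps(a, b), _mm_add_ps(c, d)), e);
}
```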
[Diagram: prolog – voxel QUADs v0…v11 are accumulated (+=) into result QUADs r0…r3 (“accumulate”, then “accumulate & save”); results are saved to memory by AoS<=SoA]
1D Running Average 4-lines SSE implementation (width 11) – Main step & epilog
Main step
1. Load 4 SSE registers by AoS=>SoA into the 4 “last” registers of the cyclic buffer
2. Sum up the next, subtract (next−11), save the result in an SSE register – it becomes the 4th
– Save a QUAD of results in memory by AoS<=SoA
3. 3 times: (sum up the next, subtract the first, save the result in an SSE register)
During the step, 4 new SSE registers are loaded, 4 results (3 old and 1 new) are saved to memory, and 3 result registers are NOT yet saved.
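The per-lane update in the main step is a single add and subtract entirely in SSE registers; a sketch with my own naming (the deck keeps 16 such registers in a cyclic buffer):

```c
#include <xmmintrin.h>

/* One main-step update, four lanes at once: r_new = r_prev + v_next - v_first. */
static __m128 ra_step(__m128 r_prev, __m128 v_next, __m128 v_first)
{
    return _mm_sub_ps(_mm_add_ps(r_prev, v_next), v_first);
}
```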
[Diagram: main step – QUADs v0…v3 are subtracted in the current step and freed afterwards, QUADs v12…v15 are added in the current step; result QUADs (3 from the previous step plus 1 new) are saved to memory by AoS<=SoA, and 3 result QUADs are NOT saved in the current step]
Epilog
For the 5 last results, ONLY subtraction is done.
The logical flow of 2D RA (an in-place routine) is very similar to the 1D RA 4-lines implementation.
To hold the intermediate 1D RA lines we use 16 working lines – an analog of the 16 SSE registers.
2nd dimension completion – 2D RA, based on the 4-lines 1D SSE implementation: prolog
Prolog
1. Compute 12 1D RA lines by 3 calls to the 1D RA 4-lines routine
2. Sum up (accumulate) the first 5 in working memory
3. 6 times: (sum up the next line, save the result in its final place)
4. 1 time: sum up the next line, subtract the first line, save the result in its final place
Here all 12 1D RA lines are used: 5+6+1.
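The per-line operations in these steps, adding a whole line to or subtracting it from a running-sum line, can be sketched as follows (function names are mine, not from the deck; in the deck these loops are themselves SSE-vectorized):

```c
/* Add a whole line into the running-sum line, component by component. */
static void line_add(float *acc, const float *line, int w)
{
    for (int x = 0; x < w; ++x)
        acc[x] += line[x];
}

/* Subtract a whole line from the running-sum line. */
static void line_sub(float *acc, const float *line, int w)
{
    for (int x = 0; x < w; ++x)
        acc[x] -= line[x];
}
```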
[Diagram: prolog – 1D RA lines L0…L11 are accumulated and saved, producing the resulting 2D Running Average lines L0…L6]
2nd dimension completion – 2D RA, based on the 4-lines 1D SSE implementation: main step & epilog
[Diagram: main step – four 1D RA lines are summed up and four are subtracted (+–), producing the next four resulting 2D Running Average lines]
Main step
– Compute 4 1D RA lines by calling the 1D RA 4-lines routine, outputting into the 4 “last” lines of the working-lines cyclic buffer
– 4 times: sum up the next, subtract (next−11), save the result in its final place
[Diagram: 1D RA lines L12…L15 are added in the current step; the subtracted lines are freed after the current step]
Epilog
– For the 5 last results, ONLY subtraction is done.
Important cache-related note: a typical line length is ~400 floats => 1.6 KB, so the cyclic buffer of 16 lines is ~26 KB – less than the 32 KB L1 cache.
Most of the data manipulation is done in the L1 cache!
3rd-dimension (in-place) computation is done after the 2D computations are complete for the whole stack of images (planes).
It is straightforward, as it is fully independent of the previously computed 2D results – in contrast to the 2D computation, which includes the 1D computation as an internal part.
In general, its logical flow is very similar to the 2D one. The important difference is that (because of the in-place operation) the results are first saved in the cyclic buffer, and are copied to their final place only after the corresponding line has been used for subtraction.
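A minimal 1D analogue of this delayed in-place write-back (entirely my own sketch; the deck parks whole planes in a cyclic buffer rather than a single value): the result for index i-1 is written back into `data` only after `data[i-1]` has been subtracted from the running sum, so the in-place update never clobbers a value that is still needed.

```c
/* In-place 1D running average of width k: data[0..n-k] receive the
 * window averages, the trailing k-1 samples are left untouched. */
static void running_average_inplace(float *data, int n, int k)
{
    float s = 0.0f;
    for (int i = 0; i < k; ++i)              /* prolog */
        s += data[i];
    float pending = s / (float)k;            /* result for index 0, parked */
    for (int i = 1; i + k <= n; ++i) {
        s = s - data[i - 1] + data[i + k - 1];  /* data[i-1] consumed here */
        data[i - 1] = pending;                  /* now safe to write back */
        pending = s / (float)k;
    }
    data[n - k] = pending;                   /* epilog: flush last result */
}
```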
Parallelizing by OpenMP and benchmarking
To parallelize the above algorithm with OpenMP, 16 working lines are allocated for each thread.
Using OpenMP is straightforward for 2 loops: (1) calling the 2D RA routine for each plane in the stack, and (2) calling the routine that computes the “stack” of 3D RA lines – the loop in the “y” direction (explained on the appropriate foil).
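Loop (1) parallelizes with a single pragma, since the planes are independent. A sketch under my own naming (compile with -fopenmp; `plane_op` stands in for the deck's 2D RA routine, and `scale_plane` below is just a demo operation, not code from the deck):

```c
#include <stddef.h>

/* Apply an operation to every plane of the stack in parallel; each
 * thread would hold its own 16 working lines inside plane_op. */
static void for_each_plane(float *stack, int w, int h, int depth,
                           void (*plane_op)(float *plane, int w, int h))
{
    #pragma omp parallel for schedule(static)
    for (int z = 0; z < depth; ++z)
        plane_op(stack + (size_t)z * w * h, w, h);
}

static void scale_plane(float *plane, int w, int h)   /* demo op */
{
    for (int i = 0; i < w * h; ++i)
        plane[i] *= 2.0f;
}
```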
Benchmark results for several platforms:
                           Pentium-M    Merom       Conroe    WoodCrest   HPTN Bensley
                           T43 laptop   T61 laptop  WS        WS
                           1.86 GHz     2.0 GHz     2.4 GHz   2.66 GHz    2.8 GHz
SSE run time, msec         32           15          14        12.5        9.4
Speed-up Serial/SSE        2.5x         4x          3.2x      3.6x        4.2x
SSE+OpenMP run time, msec  NA           13          ?         ?           5.7
Speed-up SSE/SSE+OpenMP    NA           1.15x       ?         ?           1.6x
Conclusions:
– The SSE/serial speed-up for Penryn/Merom is ~4x, 60% better than for the “old” Pentium-M (2.5x)
– The absolute SSE run time for Merom (12–15 msec) is 2–2.5x better than for Pentium-M (32 msec), and >3x better for Penryn (9.4 msec)
– OpenMP scalability is very low; it seems performance is restricted by FSB speed