Impact of Auto-tuning of Kernel Loop Transformation by using ppOpen-AT
Takahiro Katagiri
Supercomputing Research Division, Information Technology Center, The University of Tokyo
Collaborators: Satoshi Ohshima, Masaharu Matsumoto (Information Technology Center, The University of Tokyo)
SPNS2013, December 5th-6th, 2013, Conference Room, 3F, Bldg.1, Earthquake Research Institute (ERI), The University of Tokyo
December 6th, 2013, ppOpen-HPC and Automatic Tuning (Chair: Hideyuki Jitsumoto), 13:30-14:00
Outline
◦ Background
◦ ppOpen-AT System
◦ Target Application and Its Kernel Loop Transformation
◦ Performance Evaluation
◦ Conclusion
Performance Portability (PP)
Keeping high performance in multiple computer environments.
◦ Not only multiple CPUs, but also multiple compilers.
◦ Run-time information, such as loop length and number of threads, is important.
Auto-tuning (AT) is one of the candidate technologies for establishing PP across multiple computer environments.
Software Architecture of ppOpen-HPC
[Figure: layered architecture diagram with the following components.]
◦ User's Program
◦ ppOpen-APPL: FEM, FDM, FVM, BEM, DEM
◦ ppOpen-MATH: MG, GRAPH, VIS, MP
◦ ppOpen-AT: STATIC, DYNAMIC
◦ ppOpen-SYS: COMM, FT
◦ Auto-Tuning Facility: code generation for optimization candidates, search for the best candidate, automatic execution for the optimization
◦ Resource Allocation Facility: specify the best execution allocations
◦ Target architectures: many-core CPUs, low-power CPUs, GPU, vector CPUs
Design Policy of ppOpen-AT
I. Domain Specific Language (DSL) for Dedicated Processes for ppOpen-HPC
◦ The language provides only simple functions, restricting it to the computation patterns found in ppOpen-HPC.
II. Directive-based AT Language
◦ Codes of ppOpen-HPC are frequently modified, since the software is still under development.
◦ To add AT functions, we provide AT in a directive-based manner.
Design Policy of ppOpen-AT (Cont'd)
III. Utilizing Developer's Knowledge
◦ Some loop transformations increase memory usage and/or computational complexity.
◦ To enable such a loop transformation, the user must approve it via a directive.
IV. Minimum Software-Stack Requirement
◦ To establish AT on supercomputers in operation, our AT system does not use a dynamic code generator.
◦ No daemon and no dynamic job submission are required.
◦ No script language is required for the AT system either.
ppOpen-AT System
[Figure: workflow of the ppOpen-AT system, with steps marked ① to ⑥. Labels recoverable from the figure: library developer; ppOpen-APPL/* with ppOpen-AT directives (user knowledge); ① before release time; ② automatic code generation; ppOpen-APPL/* containing Candidate 1, Candidate 2, Candidate 3, ..., Candidate n and the ppOpen-AT auto-tuner; target computers; ④ execution time; library user; library call; selection; auto-tuned kernel execution; run-time.]
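The workflow in the figure (candidates generated ahead of time, timed on the target computer, then the fastest one selected) can be sketched in C. This is a minimal illustration under stated assumptions, not ppOpen-AT's implementation: the kernel (`y[i] += a * x[i]`), the names `kernel_fn`, `candidates` and `select_best_candidate`, and the use of `clock()` for timing are all hypothetical. It reflects the design policy that every candidate is compiled in statically, so no dynamic code generator, daemon or script language is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature used only in this sketch: y[i] += a * x[i]. */
typedef void (*kernel_fn)(size_t n, double a, const double *x, double *y);

/* Candidate 1: baseline loop. */
static void axpy_baseline(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* Candidate 2: the same computation, unrolled by 2 with a remainder loop. */
static void axpy_unroll2(size_t n, double a, const double *x, double *y) {
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        y[i] += a * x[i];
        y[i + 1] += a * x[i + 1];
    }
    for (; i < n; i++) y[i] += a * x[i];
}

/* All candidates are ordinary functions compiled into the executable. */
static const kernel_fn candidates[] = { axpy_baseline, axpy_unroll2 };
enum { NUM_CANDIDATES = sizeof candidates / sizeof candidates[0] };

/* Time every candidate on scratch data (y is overwritten) and
 * return the index of the fastest one. */
static int select_best_candidate(size_t n, double a, const double *x,
                                 double *y, int repeats) {
    assert(repeats > 0);
    int best = 0;
    double best_time = -1.0;
    for (int c = 0; c < NUM_CANDIDATES; c++) {
        clock_t start = clock();
        for (int r = 0; r < repeats; r++) candidates[c](n, a, x, y);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = c;
        }
    }
    return best;
}
```

After tuning, later library calls would simply dispatch through `candidates[best]`. Because the timing runs use the actual loop length `n`, the selection can differ from one problem size to another.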
A Scenario to Software Developers for ppOpen-AT
[Figure: the software developer describes AT by using ppOpen-AT, targeting optimizations that cannot be established by compilers. Invoking the dedicated preprocessor on the program with AT functions yields executable code with optimization candidates and an AT function.]
#pragma oat install unroll (i,j,k) region start
#pragma oat varied (i,j,k) from 1 to 8
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
      A[i][j] = A[i][j] + B[i][k] * C[k][j];
    }
  }
}
#pragma oat install unroll (i,j,k) region end
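To make the directive above concrete, here is a hand-written sketch of what one candidate could look like: the i-loop of the same matrix product unrolled to depth 2, next to the depth-1 baseline. It is only an illustration. The code that the ppOpen-AT preprocessor actually emits is not shown in the slides, and the fixed size `N` and all function names are assumptions.

```c
#include <assert.h>

enum { N = 5 };  /* hypothetical size; odd so the remainder loop is exercised */

/* Unroll depth 1 (baseline): A = A + B * C. */
static void matmul_depth1(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                A[i][j] = A[i][j] + B[i][k] * C[k][j];
}

/* i-loop unrolled to depth 2: two rows of A are updated per iteration. */
static void matmul_unroll_i2(double A[N][N], double B[N][N], double C[N][N]) {
    int i;
    for (i = 0; i + 1 < N; i += 2) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                A[i][j]     = A[i][j]     + B[i][k]     * C[k][j];
                A[i + 1][j] = A[i + 1][j] + B[i + 1][k] * C[k][j];
            }
        }
    }
    /* Remainder rows when N is not a multiple of 2. */
    for (; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                A[i][j] = A[i][j] + B[i][k] * C[k][j];
}

/* Self-check: both variants must produce the same matrix. */
static void check_unrolled_matches_baseline(void) {
    double A1[N][N], A2[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            B[i][j] = i + 2 * j;
            C[i][j] = i - j;
            A1[i][j] = A2[i][j] = 1.0;
        }
    }
    matmul_depth1(A1, B, C);
    matmul_unroll_i2(A2, B, C);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(A1[i][j] == A2[i][j]);
}
```

With depths from 1 to 8 allowed on each of the three loops, the number of such variants grows quickly, which is why generating and timing them automatically is attractive.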
END DO; END DO; END DO
!$omp end parallel do
!oat$ install LoopFusionSplit region end
Callouts on the listing: "Re-calculation is defined here."; "Use of the re-calculation is defined here."; "Loop split point."
Automatically Generated Codes for Kernel 1: ppohFDM_update_stress
#1 [Baseline]: Original 3-nested loop
#2 [Split]: Loop splitting at the K-loop (separated, two 3-nested loops)
#3 [Split]: Loop splitting at the J-loop
#4 [Split]: Loop splitting at the I-loop
#5 [Split&Fusion]: Loop fusion to #1 for the K- and J-loops (2-nested loop)
#6 [Split&Fusion]: Loop fusion to #2 for the K- and J-loops (2-nested loop)
#7 [Fusion]: Loop fusion to #1 (loop collapse)
#8 [Split&Fusion]: Loop fusion to #2 (loop collapse, two 1-nested loops)
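The candidate types listed above can be illustrated on a toy kernel. The sketch below is a simplified stand-in, not the real ppohFDM_update_stress: the arrays `p`, `q`, `s`, the grid sizes and the arithmetic are hypothetical. It shows a baseline 3-nested loop (like #1), a split at the K-loop in which the shared temporary is re-calculated in the second loop (like #2), and a full loop collapse (like #7).

```c
#include <assert.h>

enum { NZ = 3, NY = 4, NX = 5 };  /* hypothetical small grid */
typedef double grid_t[NZ][NY][NX];

/* Like #1 [Baseline]: one 3-nested loop; t is computed once, used twice. */
static void update_baseline(grid_t p, grid_t q, grid_t s) {
    for (int k = 0; k < NZ; k++)
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++) {
                double t = 0.5 * s[k][j][i];
                p[k][j][i] += t;
                q[k][j][i] += 2.0 * t;
            }
}

/* Like #2 [Split]: two 3-nested loops; t is re-calculated in the second,
 * which increases the amount of computation. */
static void update_split_k(grid_t p, grid_t q, grid_t s) {
    for (int k = 0; k < NZ; k++)
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++)
                p[k][j][i] += 0.5 * s[k][j][i];
    for (int k = 0; k < NZ; k++)
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++) {
                double t = 0.5 * s[k][j][i];  /* re-calculation */
                q[k][j][i] += 2.0 * t;
            }
}

/* Like #7 [Fusion]: the three loops collapsed into a single loop. */
static void update_collapsed(grid_t p, grid_t q, grid_t s) {
    for (int n = 0; n < NZ * NY * NX; n++) {
        int k = n / (NY * NX);
        int j = (n / NX) % NY;
        int i = n % NX;
        double t = 0.5 * s[k][j][i];
        p[k][j][i] += t;
        q[k][j][i] += 2.0 * t;
    }
}

/* Self-check: all three variants must give identical results. */
static void check_variants_agree(void) {
    grid_t s, p1 = {{{0}}}, q1 = {{{0}}}, p2 = {{{0}}}, q2 = {{{0}}},
           p3 = {{{0}}}, q3 = {{{0}}};
    for (int k = 0; k < NZ; k++)
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++)
                s[k][j][i] = 100 * k + 10 * j + i;
    update_baseline(p1, q1, s);
    update_split_k(p2, q2, s);
    update_collapsed(p3, q3, s);
    for (int k = 0; k < NZ; k++)
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++) {
                assert(p1[k][j][i] == p2[k][j][i] && p1[k][j][i] == p3[k][j][i]);
                assert(q1[k][j][i] == q2[k][j][i] && q1[k][j][i] == q3[k][j][i]);
            }
}
```

The split version trades extra arithmetic for smaller loop bodies, while the collapsed version exposes a longer single loop to the threads. Which one wins depends on the machine and loop lengths, which is what the auto-tuner measures.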
An Example of Seism_3D Simulation
◦ The earthquake in the western part of Tottori prefecture, Japan, in 2000 ([1], p.14).
◦ The region of 820 km x 410 km x 128 km is discretized with a grid spacing of 0.4 km.
◦ NX x NY x NZ = 2050 x 1025 x 320 ≒ 6.4 : 3.2 : 1.
[1] T. Furumura, "Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking", Supercomputing News, Information Technology Center, The University of Tokyo, Vol.11, Special Edition 1, 2009. In Japanese.
Figure: Seismic waves of the earthquake in the western part of Tottori prefecture, Japan. (a) Measured waves; (b) simulation results. (Reference: [1], p.13)
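The grid dimensions quoted for this simulation follow directly from the region size and the 0.4 km spacing. The short C sketch below reproduces the arithmetic; the helper names are hypothetical, and lengths are given in metres so that the division stays exact in integer arithmetic.

```c
#include <assert.h>

/* Number of grid points along one axis; extent and spacing in metres. */
static int grid_points(long extent_m, long spacing_m) {
    assert(spacing_m > 0 && extent_m % spacing_m == 0);
    return (int)(extent_m / spacing_m);
}

/* Ratio of one grid dimension to NZ, as used in the 6.4 : 3.2 : 1 figure. */
static double ratio_to_nz(int n, int nz) {
    assert(nz > 0);
    return (double)n / (double)nz;
}
```

For example, `grid_points(820000, 400)` gives NX = 2050, and `ratio_to_nz(2050, 320)` is 6.40625, which the slide rounds to 6.4.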
Problem Sizes (Tottori Prefecture Earthquake)
◦ Columns 2 to 4: 8 nodes (8 MPI processes; minimum running condition of ppOpen-APPL/FDM with respect to 32 GB/node).
◦ Column 5: weak scaling; problem sizes when we use all nodes of the FX10 (65,536 cores; pure MPI process grid: 64 x 64 x 16).

Value of NZ | Problem Size (NX x NY x NZ) | Process Grid (pure MPI, the FX10) | Problem Size per Core | Weak-Scaling Problem Size
10 | 64 x 32 x 10 | 8 x 8 x 2 | 8 x 4 x 5 | 512 x 256 x 80
20 | 128 x 64 x 20 | 8 x 8 x 2 | 16 x 8 x 10 | 1024 x 512 x 160
40 | 256 x 128 x 40 | 8 x 8 x 2 | 32 x 16 x 20 | 2048 x 1024 x 320
80 | 512 x 256 x 80 | 8 x 8 x 2 | 64 x 32 x 40 | 4096 x 2048 x 640
160 | 1024 x 512 x 160 | 8 x 8 x 2 | 128 x 64 x 80 | 8192 x 4096 x 1280
320 (maximum size for 32 GB/node) | 2048 x 1024 x 320 | 8 x 8 x 2 | 256 x 128 x 160 | 16384 x 8192 x 2560

Note: 2048 x 1024 x 320 is the same size as the Tottori earthquake simulation.
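The columns of the problem-size table are related by simple arithmetic: the per-core size is the problem size divided by the process grid, and the weak-scaling size is the per-core size multiplied by the full-machine process grid. The C sketch below reproduces this; the `dims3` type and function names are hypothetical.

```c
#include <assert.h>

typedef struct { int x, y, z; } dims3;

/* Per-core problem size: global size divided by the process grid. */
static dims3 per_core_size(dims3 problem, dims3 grid) {
    assert(grid.x > 0 && grid.y > 0 && grid.z > 0);
    assert(problem.x % grid.x == 0 && problem.y % grid.y == 0 &&
           problem.z % grid.z == 0);
    dims3 r = { problem.x / grid.x, problem.y / grid.y, problem.z / grid.z };
    return r;
}

/* Weak scaling: keep the per-core size fixed and enlarge the process grid. */
static dims3 weak_scaled_size(dims3 per_core, dims3 grid) {
    dims3 r = { per_core.x * grid.x, per_core.y * grid.y, per_core.z * grid.z };
    return r;
}

/* Total number of MPI processes in a process grid. */
static long total_processes(dims3 grid) {
    return (long)grid.x * grid.y * grid.z;
}
```

For the NZ = 320 row, dividing 2048 x 1024 x 320 by the 8 x 8 x 2 grid gives 256 x 128 x 160 per core, and multiplying that by the 64 x 64 x 16 grid (65,536 processes) gives 16384 x 8192 x 2560.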
AT Effect for Hybrid OpenMP-MPI
[Figure: two bar charts for the FX10, kernel update_stress. Panels: "Original without AT" and "With AT (speedups relative to the case without AT)". Axes: types of hybrid MPI-OpenMP execution (including pure MPI) against speedup relative to pure MPI execution. Annotations: "No merit for hybrid MPI-OpenMP executions"; "Effect on pure MPI execution"; "Gain by using MPI-OpenMP executions".]
By applying the loop transformations from the AT, we obtained:
◦ Maximum 1.5x speedup over pure MPI (without thread execution).
◦ Maximum 2.5x speedup over pure MPI in hybrid MPI-OpenMP execution.
PXTY: X processes, Y threads per process.
OTHER KERNEL AND CODE OPTIMIZATION
Kernel update_vel (ppOpen-APPL/FDM)
◦ m_velocity.f90 (ppohFDM_update_vel)
!OAT$ install LoopFusion region start
!OAT$ name ppohFDMupdate_vel
!OAT$ debug (pp)
!$omp parallel do private(k,j,i,ROX,ROY,ROZ)
do k = NZ00, NZ01
do j = NY00, NY01
do i = NX00, NX01
!OAT$ RotationOrder sub region end
end do; end do; end do
!$omp end parallel do
!OAT$ install LoopFusion region end
Reorder of sentences
!OAT$ RotationOrder sub region start
Sentence i
Sentence ii
!OAT$ RotationOrder sub region end
Sentence 1
!OAT$ RotationOrder sub region start
Sentence I
Sentence II
!OAT$ RotationOrder sub region end

Automatic Code Generation:
Sentence 1
Sentence i
Sentence I
Sentence ii
Sentence II
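The reordering shown above interleaves the sentences of the two RotationOrder sub-regions. A small C sketch can show why this is safe when the two groups do not depend on each other. It reflects one reading of the slide, not the generated code itself, and the arithmetic and all names are hypothetical.

```c
#include <assert.h>

typedef struct { double s, u, v; } result_t;

/* Original order: i, ii, 1, I, II. */
static result_t original_order(double x, double y) {
    result_t r;
    double t1 = 2.0 * x;   /* Sentence i  */
    r.u = t1 + 1.0;        /* Sentence ii */
    r.s = x + y;           /* Sentence 1  */
    double t2 = 3.0 * y;   /* Sentence I  */
    r.v = t2 + 1.0;        /* Sentence II */
    return r;
}

/* Generated order: 1, i, I, ii, II. The two sub-regions are interleaved;
 * the order inside each sub-region is preserved. */
static result_t rotated_order(double x, double y) {
    result_t r;
    r.s = x + y;           /* Sentence 1  */
    double t1 = 2.0 * x;   /* Sentence i  */
    double t2 = 3.0 * y;   /* Sentence I  */
    r.u = t1 + 1.0;        /* Sentence ii */
    r.v = t2 + 1.0;        /* Sentence II */
    return r;
}

/* Self-check: both orders must compute the same values. */
static void check_orders_agree(double x, double y) {
    result_t a = original_order(x, y);
    result_t b = rotated_order(x, y);
    assert(a.s == b.s && a.u == b.u && a.v == b.v);
}
```

Interleaving independent statements in this way changes how loads and arithmetic are scheduled inside the loop body, which can matter on processors where the compiler does not reorder them on its own.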
Related Work (AT Languages)
Items compared:
#1: Method for supporting multi-computer environments.
#2: Obtaining loop length in run-time.
#3: Loop split with increase of computations, and loop fusion to the split loop.
#4: Re-ordering of inner-loop sentences.
#5: Algorithm selection.
#6: Code generation with execution feedback.
#7: Software requirement.

AT Language | #1 | #2 to #6 (marks in source order) | #7
ppOpen-AT | OAT Directives | ✔ ✔ ✔ ✔ | None
Vendor Compilers | Out of Target | Limited | -
Transformation Recipes | Recipe Descriptions | ✔ ✔ | ChiLL
POET | Xform Description | ✔ ✔ | POET translator, ROSE
X language | Xlang Pragmas | ✔ ✔ | X Translation, 'C and tcc
SPL | SPL Expressions | ✔ ✔ ✔ | A Script Language
ADAPT | ADAPT Language | ✔ ✔ | Polaris Compiler Infrastructure, Remote Procedure Call (RPC)
Atune-IL | atune Pragmas | ✔ | A Monitoring Daemon
Conclusion
◦ Kernel loop transformation is a key technology for achieving high performance on current multi-core and many-core processors.
◦ Utilizing run-time information on problem sizes (loop lengths) and the number of threads is important.
◦ A minimum software stack for the auto-tuning facility is required for supercomputers in operation.