Simulation, Compilation, and Debugging of OpenCL on The Southern Islands Dana Schaa (AMD), Rafael Ubal, and David Kaeli (Northeastern University, Boston, MA)
May 13, 2015
| Simulation, Compilation, and Debugging of OpenCL on the AMD Southern Islands | November 13th, 2013
Simulation Methodology
• Full-OS simulation
An OS runs on the simulator. The simulator implements the complete ISA, and virtualizes native hardware devices, similar to a virtual machine. Accurate simulations, but extremely slow.
• Guest program simulation
An application runs directly on the simulator. The simulator implements the non-privileged subset of the ISA, and virtualizes the system call interface (ABI). Multi2Sim falls in this category.
[Figure: full-OS vs. guest program simulation. A full-system simulator core virtualizes the complete processor ISA and the I/O hardware, and runs a full OS with guest programs 1, 2, ... on top of it. A guest program simulator core virtualizes the user-space subset of the ISA and the system call interface, and runs guest programs 1, 2, ... directly.]
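Simulation Methodology: Four-Step Simulation Process
• Disassembler
  ─ Input: executable ELF file
  ─ Output: instructions dump
• Emulator
  ─ Input: executable file, program arguments
  ─ Output: program output
• Timing simulator
  ─ Input: executable file, program arguments, processor configuration
  ─ Output: performance statistics
• Visual tool
  ─ Input: user interaction
  ─ Output: cycle navigation, timing diagrams
[Figure: the four tools chained left to right. The links between them are labeled instruction bytes and instruction fields (disassembler and emulator), run one instruction and instruction information (emulator and timing simulator), and pipeline trace (timing simulator and visual tool).]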
Simulation Methodology: Current Architecture Support
• In our latest Multi2Sim SVN repository
  ─ 4 GPU + 3 CPU architectures supported or in progress
  ─ This presentation focuses on Southern Islands (and x86)

Architecture           Disasm.      Emulation    Timing simulation  Visual tool
ARM                    X            In progress  –                  –
MIPS                   X            In progress  –                  –
x86                    X            X            X                  X
AMD Evergreen          X            X            X                  X
AMD Southern Islands   X            X            X                  X
NVIDIA Fermi           X            X            In progress        –
NVIDIA Kepler          In progress  –            –                  –
The x86 Emulator: Program Loading
1) Parse ELF executable
  ─ Read ELF sections and symbols
  ─ Initialize code and data
2) Initialize stack
  ─ Program headers
  ─ Arguments
  ─ Environment variables
3) Initialize registers
  ─ Program entry → eip
  ─ Stack pointer → esp
[Figure: initial virtual memory image and initial values for x86 registers. The memory image carries labels for the stack, program arguments and environment variables, mmap region, "(not initialized)", heap, initialized data, and text, with the address marks 0x08000000, 0x08xxxxxx, 0x40000000, and 0xc0000000. The register list includes eax, ebx, ecx, esp (stack pointer), and eip (instruction pointer).]
The x86 Emulator: Emulation Loop
• Emulation of x86 instructions
  ─ Update x86 registers
  ─ Update memory map if needed
  ─ Example: add [bp+16], 0x5
• Emulation of Linux system calls
  ─ Analyze system call code and arguments
  ─ Update memory map
  ─ Update register eax with return value
  ─ Example: read(fd, buf, count)
[Figure: emulation loop flowchart. Read the instruction at eip (instruction bytes), decode it (instruction fields), and check whether the instruction is int 0x80. If yes, emulate a system call; if no, emulate an x86 instruction. Then move eip to the next instruction and repeat.]
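The x86 emulation loop can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not Multi2Sim code: instructions are pre-decoded tuples instead of x86 bytes, and the `Context` class and handler names are hypothetical. Only the control flow (fetch at eip, check for `int 0x80`, emulate, advance eip) follows the slide.

```python
# Toy emulation loop. Instructions are pre-decoded tuples rather than real
# x86 bytes; every name here is illustrative, not part of Multi2Sim.

SYS_EXIT = 1    # Linux i386 system call numbers
SYS_WRITE = 4


class Context:
    def __init__(self, program):
        self.program = program          # list of instruction tuples
        self.regs = {"eax": 0, "ebx": 0, "ecx": 0, "edx": 0, "eip": 0}
        self.output = []                # stands in for the guest's stdout
        self.running = True


def emulate_syscall(ctx):
    """Analyze the system call code in eax and write the result to eax."""
    code = ctx.regs["eax"]
    if code == SYS_EXIT:
        ctx.running = False
    elif code == SYS_WRITE:
        data = ctx.regs["ecx"]          # toy: ecx holds the string itself
        ctx.output.append(data)
        ctx.regs["eax"] = len(data)     # return value goes to eax
    else:
        ctx.regs["eax"] = -38           # -ENOSYS


def emulate_instruction(ctx, inst):
    """Update registers for a regular (non-system-call) instruction."""
    op, dst, src = inst
    if op == "mov":
        ctx.regs[dst] = src
    elif op == "add":
        ctx.regs[dst] += src
    else:
        raise ValueError(f"unsupported instruction: {op}")


def run(ctx):
    while ctx.running:
        inst = ctx.program[ctx.regs["eip"]]     # read instruction at eip
        if inst == ("int", 0x80):               # system call?
            emulate_syscall(ctx)
        else:
            emulate_instruction(ctx, inst)
        ctx.regs["eip"] += 1                    # move eip to next instruction


program = [
    ("mov", "eax", SYS_WRITE),
    ("mov", "ecx", "hello"),
    ("int", 0x80),
    ("mov", "eax", SYS_EXIT),
    ("int", 0x80),
]
ctx = Context(program)
run(ctx)
```

After `run(ctx)` returns, `ctx.output` holds the data passed to the emulated write call, and the guest has stopped at its exit call.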
OpenCL on the Host: Execution Framework
─ An OpenCL host program performs a set of OpenCL library function calls (API calls).
─ Multi2Sim's OpenCL runtime library, running with guest code, transparently intercepts each call. It communicates with the Multi2Sim driver through ABI calls, implemented as system calls with codes not reserved in Linux.
─ An OpenCL driver module (Multi2Sim code) intercepts the ABI call and communicates with the GPU emulator.
─ The GPU emulator updates its internal state based on the message received from the driver.
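[Figure: software stack. At user level, the user application issues API calls to the runtime library, which issues ABI calls. At OS level, the device driver receives the ABI calls and reaches the hardware through an internal interface.]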
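The call chain from application to emulator can be sketched as follows. All class names, method names, and call identifiers below are hypothetical, and the system call number is a placeholder assumed to be unused by Linux; none of them are Multi2Sim's actual values. The sketch only shows the layering: an API call becomes an ABI call, which the driver turns into an emulator state update.

```python
# Sketch of the call chain: application -> runtime library (API call)
# -> driver (ABI call) -> GPU emulator. All names and numbers are
# hypothetical placeholders, not Multi2Sim's actual values.

OPENCL_SYSCALL_CODE = 0x10000   # placeholder; assumed unused by Linux
ABI_MEM_ALLOC = 1               # assumed driver-call identifiers
ABI_NDRANGE_CREATE = 2


class GpuEmulator:
    """Holds the device state that the driver updates."""
    def __init__(self):
        self.next_free = 0
        self.ndranges = []

    def mem_alloc(self, size):
        addr = self.next_free
        self.next_free += size
        return addr

    def ndrange_create(self, global_size, local_size):
        self.ndranges.append((global_size, local_size))
        return len(self.ndranges) - 1


class Driver:
    """OS-level module: decodes ABI calls and talks to the emulator."""
    def __init__(self, emulator):
        self.emulator = emulator

    def abi_call(self, call, *args):
        if call == ABI_MEM_ALLOC:
            return self.emulator.mem_alloc(*args)
        if call == ABI_NDRANGE_CREATE:
            return self.emulator.ndrange_create(*args)
        raise ValueError(f"unknown ABI call {call}")


class GuestKernel:
    """Stands in for the simulator's system call layer."""
    def __init__(self, driver):
        self.driver = driver

    def syscall(self, code, *args):
        if code == OPENCL_SYSCALL_CODE:
            return self.driver.abi_call(*args)
        raise NotImplementedError("regular Linux system call")


class Runtime:
    """User-level library: exposes API calls, issues ABI calls underneath."""
    def __init__(self, kernel):
        self.kernel = kernel

    def create_buffer(self, size):
        return self.kernel.syscall(OPENCL_SYSCALL_CODE, ABI_MEM_ALLOC, size)

    def enqueue_ndrange(self, global_size, local_size):
        return self.kernel.syscall(
            OPENCL_SYSCALL_CODE, ABI_NDRANGE_CREATE, global_size, local_size)


emulator = GpuEmulator()
runtime = Runtime(GuestKernel(Driver(emulator)))
buf_a = runtime.create_buffer(1024)
buf_b = runtime.create_buffer(256)
ndrange_id = runtime.enqueue_ndrange(256, 64)
```

The application only ever sees the `Runtime` methods, so the same host program could run unmodified on top of a different driver and device model.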
OpenCL on the Device: Execution Model
[Figure: three levels of the execution model. The ND-Range is a grid of work-groups that share global memory. A work-group is a set of work-items that share local memory. A work-item runs one instance of the kernel, __kernel func() { }, with its own private memory.]
─ Work-items execute multiple instances of the same kernel code.
─ Work-groups are sets of work-items that can synchronize and communicate efficiently.
─ The ND-Range contains all work-groups, which do not communicate with each other and can execute in any order.
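As a minimal sketch of this hierarchy, the Python below enumerates the work-items of a one-dimensional ND-Range and runs a kernel body once per work-item. The function names are illustrative, and the sequential loop stands in for what a device would run in parallel.

```python
# One-dimensional ND-Range split into work-groups and work-items.
# Function and variable names are illustrative.

def enumerate_work_items(global_size, local_size):
    """Yield (group_id, local_id, global_id) for a 1-D ND-Range."""
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of local size")
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            yield group_id, local_id, group_id * local_size + local_id


def vector_add_kernel(global_id, src1, src2, dst):
    """Body that every work-item runs with its own global ID."""
    dst[global_id] = src1[global_id] + src2[global_id]


src1 = list(range(8))
src2 = [10] * 8
dst = [0] * 8
for _, _, gid in enumerate_work_items(global_size=8, local_size=4):
    vector_add_kernel(gid, src1, src2, dst)
```

Because work-groups do not communicate, the outer loop over `group_id` could visit the groups in any order and produce the same `dst`.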
The Southern Islands Disassembler
The Southern Islands Disassembler: Vector Addition Kernel
• Source code
__kernel void vector_add(
    __read_only __global int *src1,
    __read_only __global int *src2,
    __write_only __global int *dst)
{
    int id = get_global_id(0);
    dst[id] = src1[id] + src2[id];
}
[Figure: disassembled Southern Islands code for the kernel, annotated to distinguish scalar instructions from vector instructions and scalar registers from vector registers, and to mark the loads, the addition, and the store.]
The Southern Islands Disassembler: Conditional Statements
• Source code
__kernel void if_kernel(__global int *v)
{
    uint id = get_global_id(0);
    if (id < 5)
        v[id] = 10;
}
• Assembly code
[Figure: disassembled kernel with four annotated steps: the comparison, saving the active mask, storing the value 10, and restoring the active mask.]
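The four steps annotated in the assembly (compare, save active mask, store, restore active mask) can be mimicked in Python. This is a toy model, not the actual instruction sequence: the 8-lane width and all names are assumptions chosen for readability, whereas a Southern Islands wavefront has up to 64 work-items.

```python
# Toy model of how a wavefront runs `if (id < 5) v[id] = 10;` with an
# execution mask. The 8-lane width and all names are assumptions.

WAVEFRONT_SIZE = 8


def run_if_kernel(v):
    ids = list(range(WAVEFRONT_SIZE))
    exec_mask = [True] * WAVEFRONT_SIZE             # all lanes active

    cond = [i < 5 for i in ids]                     # the comparison
    saved_mask = list(exec_mask)                    # save active mask
    exec_mask = [e and c for e, c in zip(exec_mask, cond)]

    for lane, active in enumerate(exec_mask):       # store value 10
        if active:
            v[ids[lane]] = 10

    exec_mask = saved_mask                          # restore active mask
    return exec_mask


v = [0] * WAVEFRONT_SIZE
final_mask = run_if_kernel(v)
```

Lanes whose condition is false stay in the wavefront but are masked off during the store, which is why the mask has to be restored afterwards.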
The Southern Islands Emulator
The Southern Islands Emulator: Program Loading
• Responsible entity
  ─ The device driver is the module responsible for setting up an initial state for the hardware, leaving it ready to run the first ISA instruction.
  ─ Natively, it writes to hardware registers and global memory locations. On Multi2Sim, it calls initialization functions of the emulator.
• Setup
  ─ Instruction memories in compute units, each with one copy of the ISA section of the kernel binary
  ─ Initial global memory image, copying global buffers from CPU to GPU memory
  ─ Kernel arguments
  ─ ND-Range topology, including number of dimensions and sizes
[Figure: software stack, from the user application through API call, runtime library, ABI call, and device driver to the hardware via an internal interface.]
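The setup items can be gathered into one structure, sketched below. The class and field names are hypothetical and do not mirror Multi2Sim's data structures; the sketch only illustrates what the initial state contains.

```python
# Sketch of the state a driver could hand to the emulator before the first
# instruction runs. Class and field names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class NDRangeSetup:
    isa_section: bytes                  # ISA section of the kernel binary
    num_compute_units: int
    global_size: tuple                  # work-items per dimension
    local_size: tuple                   # work-group size per dimension
    args: list = field(default_factory=list)
    global_memory: dict = field(default_factory=dict)

    def instruction_memories(self):
        """One copy of the ISA section per compute unit."""
        return [bytes(self.isa_section) for _ in range(self.num_compute_units)]

    def copy_buffer(self, device_addr, host_data):
        """Copy a host buffer into the initial global memory image."""
        for offset, byte in enumerate(host_data):
            self.global_memory[device_addr + offset] = byte

    def group_count(self):
        """Number of work-groups in each dimension."""
        return tuple(g // l for g, l in zip(self.global_size, self.local_size))


setup = NDRangeSetup(
    isa_section=b"\x00\x01\x02\x03",
    num_compute_units=32,
    global_size=(1024, 16),
    local_size=(64, 4),
    args=[0x1000, 0x2000],
)
setup.copy_buffer(0x1000, bytes([1, 2, 3]))
```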
The Southern Islands Emulator: Emulation Loop
• Work-group execution
  ─ Work-groups can execute in any order. This order is irrelevant for emulation purposes.
  ─ The chosen policy is executing one work-group at a time, in increasing order of ID for each dimension.
• Wavefront execution
  ─ Wavefronts within a work-group can also execute in any order, as long as synchronizations are considered.
  ─ The chosen policy is executing one wavefront at a time until it hits a barrier, if any.
[Figure: emulation loop flowchart. Split the ND-Range into work-groups and place them in a work-group pool. While any work-groups are left, grab a work-group and split it into wavefronts, which form the wavefront pool. While any wavefront is left, for each running wavefront not stalled in a barrier: read the instruction at the PC, emulate it (update memory and registers), and advance the PC. End when no work-groups are left.]
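The scheduling policy described here can be sketched as below. Instructions are plain strings and all names are illustrative; the sketch also assumes every wavefront of a work-group reaches the same barriers, as OpenCL requires.

```python
# Sketch of the emulation policy: one work-group at a time, and within it
# one wavefront at a time until it reaches a barrier or finishes.
# Instructions are plain strings; all names are illustrative.

def emulate_work_group(group_id, programs, trace):
    count = len(programs)
    pcs = [0] * count                   # one program counter per wavefront
    at_barrier = [False] * count

    def finished(w):
        return pcs[w] >= len(programs[w])

    while not all(finished(w) for w in range(count)):
        for w in range(count):
            # Run this wavefront until it stalls in a barrier or ends.
            while not finished(w) and not at_barrier[w]:
                inst = programs[w][pcs[w]]              # read instruction @PC
                if inst == "barrier":
                    at_barrier[w] = True
                else:
                    trace.append((group_id, w, inst))   # emulate
                pcs[w] += 1                             # advance PC
        # Every wavefront is now finished or waiting: release the barrier.
        at_barrier = [False] * count


def emulate_ndrange(num_groups, programs):
    trace = []
    for group_id in range(num_groups):  # increasing order of ID
        emulate_work_group(group_id, programs, trace)
    return trace


program = ["load", "barrier", "store"]
trace = emulate_ndrange(num_groups=2, programs=[program, program])
```

In the resulting trace, both wavefronts of a work-group execute their load before either executes its store, which is the ordering the barrier enforces.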
The Southern Islands Timing Simulator
The Southern Islands Timing Simulator: The GPU Architecture
─ A command processor receives and processes commands from the host.
─ When the ND-Range is created, an ultra-threaded dispatcher (scheduler) assigns work-groups to compute units as slots become available.
[Figure: GPU block diagram. The command processor feeds the ultra-threaded dispatcher, which feeds compute units 0 to 31. Each compute unit has an L1 cache, and a crossbar connects the L1 caches to the main memory hierarchy (L2 caches, memory controllers, video memory).]
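The dispatcher's behavior can be sketched as a small event-driven loop. The slot counts and work-group durations are made-up inputs rather than hardware values, and filling the lowest-numbered compute unit first is an assumption of this sketch, not a documented policy.

```python
# Toy dispatcher: work-groups are assigned to compute units whenever a slot
# frees up. Slot counts and durations are made-up inputs.
import heapq
from collections import deque


def dispatch(durations, num_compute_units, slots_per_unit):
    """Return {work_group_id: (compute_unit, start_cycle)}."""
    pending = deque(enumerate(durations))
    free = [slots_per_unit] * num_compute_units
    running = []                                    # heap of (finish, cu)
    schedule = {}
    cycle = 0
    while pending or running:
        # Fill every available slot, lowest compute unit ID first.
        for cu in range(num_compute_units):
            while free[cu] > 0 and pending:
                wg, duration = pending.popleft()
                free[cu] -= 1
                schedule[wg] = (cu, cycle)
                heapq.heappush(running, (cycle + duration, cu))
        # Advance time to the next work-group completion.
        cycle, cu = heapq.heappop(running)
        free[cu] += 1
    return schedule


schedule = dispatch([10, 4, 6, 3], num_compute_units=2, slots_per_unit=1)
```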
The Southern Islands Timing Simulator: The Compute Unit
─ The instruction memory of each compute unit contains the OpenCL kernel.
─ A front-end fetches instructions and sends them to the appropriate execution unit.
─ There is one instance each of the scalar unit, vector memory unit, branch unit, and LDS (local data share) unit.
─ There are multiple instances of SIMD units.
[Figure: compute unit block diagram. The front-end reads from the instruction memory and feeds the scalar unit, vector memory unit, branch unit, LDS unit, and SIMD units 0, 1, 2, ... Global memory and local memory are drawn alongside the execution units.]
The Southern Islands Timing Simulator: The Front-End
[Figure: front-end diagram. Four wavefront pools feed the fetch stage, which fills the fetch buffers, one per wavefront pool. The issue stage moves instructions into the issue buffers: one SIMD issue buffer matching each wavefront pool, plus the scalar unit, branch unit, vector memory unit, and LDS unit issue buffers.]
─ Work-groups are split into wavefronts and allocated to wavefront pools.
─ The fetch and issue stages operate in a round-robin fashion.
─ There is one SIMD unit associated with each wavefront pool.
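A round-robin selection across wavefront pools might look like the sketch below. The one-fetch-per-cycle budget, the pool contents, and the choice to rotate both across pools and within each pool are assumptions made for illustration; the slides only state that fetch and issue are round-robin.

```python
# Round-robin selection across wavefront pools, as a minimal sketch.
# Pool contents and the one-fetch-per-cycle budget are assumptions.

def round_robin_fetch(pools, cycles):
    """Return the wavefront chosen in each cycle (None if all pools are empty)."""
    order = []
    next_pool = 0
    cursors = [0] * len(pools)          # round-robin position inside each pool
    for _ in range(cycles):
        chosen = None
        for step in range(len(pools)):
            p = (next_pool + step) % len(pools)
            if pools[p]:
                chosen = pools[p][cursors[p] % len(pools[p])]
                cursors[p] += 1
                next_pool = (p + 1) % len(pools)
                break
        order.append(chosen)
    return order


pools = [["wf0", "wf4"], ["wf1"], [], ["wf3"]]
order = round_robin_fetch(pools, cycles=6)
```

Empty pools are skipped, so a pool with no allocated wavefronts does not waste a fetch cycle in this model.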
The Southern Islands Timing Simulator: The SIMD Unit
─ The SIMD unit runs arithmetic-logic vector instructions.
─ There are 4 SIMD units, each one associated with one of the 4 wavefront pools.
─ The SIMD unit pipeline is modeled with 5 stages: decode, read, execute, write, and complete.
─ In the execute stage, a wavefront (64 work-items max.) is split into 4 subwavefronts (16 work-items each). Subwavefronts are pipelined over the 16 stream cores in 4 consecutive cycles.
─ The vector register file is accessed in the read and write stages to consume input and produce output operands, respectively.
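The split of a wavefront into subwavefronts is plain integer arithmetic, sketched here. The function names are illustrative; the constants come from the text above (64 work-items, 16 stream cores).

```python
# Mapping of work-items to subwavefronts and SIMD lanes in the execute
# stage. Function names are illustrative.

WAVEFRONT_SIZE = 64
NUM_LANES = 16      # stream cores per SIMD unit


def execute_slot(work_item):
    """Return (subwavefront, lane) for a work-item within its wavefront."""
    return work_item // NUM_LANES, work_item % NUM_LANES


def lane_schedule(lane):
    """Work-items that a given lane processes, one per consecutive cycle."""
    return [wi for wi in range(WAVEFRONT_SIZE) if wi % NUM_LANES == lane]


num_subwavefronts = WAVEFRONT_SIZE // NUM_LANES
```

Under this mapping, lane 0 handles work-items 0, 16, 32, and 48 over four consecutive cycles.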
[Figure: SIMD unit pipeline. Instructions arrive from the compute unit front-end into the issue buffer and flow through the decode stage, decode buffer, read stage, read buffer, execute stage, execute buffer, write stage, write buffer, and complete stage. The read stage accesses the vector/scalar register file and the write stage accesses the vector register file. In the execute stage, pipelined functional units form 16 SIMD lanes: lane 0 runs work-items 0, 16, 32, 48; lane 1 runs work-items 1, 17, 33, 49; ...; lane 15 runs work-items 15, 31, 47, 63.]
The Southern Islands Timing Simulator: The Scalar Unit
─ Runs both arithmetic-logic and memory scalar instructions
─ Modeled with 5 stages: decode, read, execute/memory, write, complete
[Figure: scalar unit pipeline. Instructions arrive from the compute unit front-end into the issue buffer and flow through the decode stage, decode buffer, read stage, read buffer, execute or memory stage, execute buffer, write stage, write buffer, and complete stage. The register files shown are the scalar register file and the vector register file.]
The Southern Islands Timing Simulator: The Vector Memory Unit
[Figure: vector memory unit pipeline. Instructions arrive from the compute unit front-end into the issue buffer and flow through the decode stage, decode buffer, read stage, read buffer, memory stage, memory buffer, write stage, write buffer, and complete stage. The read and write stages access the vector register file, and the memory stage accesses global memory.]
─ Runs vector memory instructions
─ Modeled with 5 stages: decode, read, memory, write, complete
─ Accesses to the global memory hierarchy happen mainly in this unit
The Southern Islands Timing Simulator: The Branch Unit
[Figure: branch unit pipeline. Instructions arrive from the compute unit front-end into the issue buffer and flow through the decode stage, decode buffer, read stage, read buffer, execute stage, execute buffer, write stage, write buffer, and complete stage. The read stage accesses the scalar register file (condition codes) and the write stage accesses the scalar register file (program counter).]
─ Runs branch instructions, which decide whether to make an entire wavefront jump to a target address depending on the scalar condition code
─ Modeled with 5 stages: decode, read, execute/memory, write, complete
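The branch decision itself is small enough to sketch directly. The fixed 4-byte instruction size, the absolute target address, and the parameter names are simplifying assumptions for illustration, not a description of the actual instruction encoding.

```python
# Wavefront-level branch on the scalar condition code, as a sketch.
# The instruction size, absolute target, and names are assumptions.

INSTRUCTION_BYTES = 4


def next_pc(pc, scc, target, branch_if_scc=True):
    """All work-items of a wavefront share one PC, so they jump together."""
    taken = (scc == 1) if branch_if_scc else (scc == 0)
    return target if taken else pc + INSTRUCTION_BYTES


taken_pc = next_pc(0x100, scc=1, target=0x200)
fallthrough_pc = next_pc(0x100, scc=0, target=0x200)
```

Per-work-item divergence is not handled here at all; in the conditional-statement example it is expressed through the active mask rather than through the branch.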
The Southern Islands Timing Simulator: The Local Data Share (LDS) Unit
─ Runs local memory access instructions
─ Modeled with 5 stages: decode, read, execute/memory, write, complete
─ The memory stage accesses the compute unit local memory for read/write
[Figure: LDS unit pipeline. Instructions arrive from the compute unit front-end into the issue buffer and flow through the decode stage, decode buffer, read stage, read buffer, memory stage, memory buffer, write stage, write buffer, and complete stage. The read and write stages access the vector register file, and the memory stage accesses local memory.]
The Southern Islands Timing Simulator: Global Memory Hierarchy
─ Fully configurable memory hierarchy, with default values based on the AMD Radeon HD 7970 Southern Islands GPU
─ One 16KB data L1 per compute unit
─ One scalar L1 cache shared by every 4 compute units
─ Six L2 banks of 128KB each, each connected to a DRAM module
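[Figure: memory hierarchy diagram. Compute units CU 0 to CU 31 each have a data L1 cache, and each group of four compute units (CU 0 to CU 3, ..., CU 28 to CU 31) shares a scalar cache. An interconnect links these caches to the L2 banks (bank 0, bank 1, ...).]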
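The sharing pattern of the default hierarchy reduces to a few lines of arithmetic, sketched below. The constant and function names are illustrative; the values (32 compute units, 16KB L1, 4 compute units per scalar cache) come from the slide.

```python
# Cache sharing in the default hierarchy: 32 compute units, one data L1
# each, and one scalar cache per group of 4. Names are illustrative.

NUM_COMPUTE_UNITS = 32
L1_SIZE_KB = 16
COMPUTE_UNITS_PER_SCALAR_CACHE = 4


def scalar_cache_of(compute_unit):
    """Index of the scalar cache that a compute unit shares."""
    if not 0 <= compute_unit < NUM_COMPUTE_UNITS:
        raise ValueError("compute unit out of range")
    return compute_unit // COMPUTE_UNITS_PER_SCALAR_CACHE


num_scalar_caches = NUM_COMPUTE_UNITS // COMPUTE_UNITS_PER_SCALAR_CACHE
total_l1_kb = NUM_COMPUTE_UNITS * L1_SIZE_KB
```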
The Southern Islands Visual Tool
The Southern Islands Visual Tool: Main Window and Timing Diagram
─ The main window provides cycle-by-cycle navigation throughout the simulation.
─ A dedicated Southern Islands panel contains one widget per compute unit, showing allocated work-groups.
─ The memory hierarchy panel shows caches connected to Southern Islands compute units, and special-purpose scalar caches.
Ongoing Projects: CPU-GPU Cache Coherence Protocol
[Figure: ARM, x86, Evergreen (Evg.), Southern Islands (S.I.), and Fermi cores attached to L1 caches, which connect through an interconnect to L2 bank 0, L2 bank 1, ...]
NMOESI Interface
─ MOESI protocol extended with an additional non-coherent write state: NMOESI
─ CPU and GPU cores with any ISA can be connected to different entry points of the memory hierarchy
─ Processing nodes interact with the memory hierarchy with three types of accesses: load, store, and n-store
─ GPUs and CPUs running OpenCL kernels issue n-store write accesses. The rest issue regular store accesses.
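The rule for choosing a write access type can be written down as a tiny sketch. The enum and function names are hypothetical, and the sketch covers only the selection rule stated above, not the protocol's state transitions.

```python
# Selection of the write access type issued by a processing node.
# Names are hypothetical; protocol state transitions are not modeled.
from enum import Enum


class Access(Enum):
    LOAD = "load"
    STORE = "store"
    NSTORE = "n-store"


def write_access(runs_opencl_kernel):
    """Nodes running OpenCL kernels write with n-store; others use store."""
    return Access.NSTORE if runs_opencl_kernel else Access.STORE


gpu_write = write_access(runs_opencl_kernel=True)
host_write = write_access(runs_opencl_kernel=False)
```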
Ongoing Projects: Cooperative Execution of Work-Groups
[Figure: timelines for hardware with two x86 cores and four Southern Islands (S.I.) compute units. The x86 cores run the host program and the kernel, the S.I. compute units run the kernel, and the ND-Range work-groups WG-0, WG-1, WG-2, WG-3, ..., WG-N are spread across both kinds of devices.]
• Work-group mapping
  ─ Portions of the ND-Range executed by CPU/GPU cores with different ISAs
  ─ Work-groups mapped to CPU cores or GPU compute units as they become available
• Attained concurrency
  ─ x86 cores run both the host program and a portion of the ND-Range
  ─ Idle regions are removed during the execution of the ND-Range
Ongoing Projects: Multi2C – An OpenCL/CUDA Kernel Compiler
─ LLVM-based compiler for OpenCL and CUDA kernels
─ Future release Multi2Sim 4.2 will include a working version
─ Diagrams show progress as per SVN Rev. 1838
• Yellow = under development
• Blue = arriving soon
• Green = supported
[Figure: compilation flow.
  vec-add.cl → OpenCL C to LLVM front-end → vec-add.llvm
  vec-add.cu → CUDA to LLVM front-end → vec-add.llvm
  vec-add.llvm → LLVM to Southern Islands back-end → vec-add.s → Southern Islands assembler → vec-add.bin
  vec-add.llvm → LLVM to Fermi back-end → vec-add.s → Fermi assembler → vec-add.cubin
  vec-add.llvm → LLVM to Kepler back-end → vec-add.s → Kepler assembler → vec-add.cubin]
Ongoing Projects: Simulation of OpenGL Pipelines
• Goal
  ─ Leverage our Southern Islands pipeline models to execute OpenGL vertex and fragment shaders
• Steps
  ─ Develop a runtime library to link with guest programs, implementing the OpenGL, GLUT and GLEW library APIs
  ─ Reverse engineer AMD's OpenGL binary format to decode embedded metadata
  ─ Timing model of the OpenGL pipeline
• New capabilities targeted
  ─ Timing simulation of other critical GPU components, such as the rasterizer
  ─ Concurrency evaluation of compute + graphics pipelines
The Multi2Sim Community: Academic Efforts at Northeastern
• The “GPU Programming and Architecture” course
  ─ We started an unofficial seminar that students can voluntarily attend. The syllabus covers OpenCL programming, GPU architecture, and state-of-the-art research topics on GPUs.
  ─ Average attendance of ~25 students per semester.
• Undergraduate directed studies
  ─ Official alternative equivalent to a 4-credit course that an undergraduate student can optionally enroll in, collaborating with Multi2Sim development
• Graduate-level development
  ─ Multiple ongoing PhD theses using Multi2Sim as a support tool
  ─ All related development becomes openly available through the SVN repo
The Multi2Sim Community: Academic Publications
• Conference papers
─ Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors, SBAC-PAD, 2007
─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2012
• Tutorials
─ The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing, PACT, 2011
─ Programming and Simulating Fused Devices — OpenCL and Multi2Sim, ICPE, 2012
─ Multi-Architecture ISA-Level Simulation of OpenCL, IWOCL, 2013
─ Simulation of OpenCL and APUs on Multi2Sim, ISCA, 2013
The Multi2Sim Community: www.multi2sim.org
The Multi2Sim Community: Sponsors
Thanks! Questions?
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.
Disclaimer & Attribution