SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
OVERVIEW OF OCELOT: ARCHITECTURE
Overview
GPU Ocelot overview
Building, configuring, and executing Ocelot programs
Ocelot Device Interface and CUDA Runtime API
Ocelot PTX Internal Representation
PTX Pass Manager
Ocelot: Multiplatform Dynamic Compilation
Just-in-time code generation and optimization for data-intensive applications
[Figure: language front-end and data-parallel IR; image credits: esd.lbl.gov, R. Domingo & D. Kaeli (NEU)]
• Environment for i) compiler research, ii) architecture research, and iii) productivity tools
NVIDIA’s Compute Unified Device Architecture (CUDA)
• Integrate the concept of a compute kernel called from standard languages
  • Multithreaded host programs
• The compute kernel specifies data-parallel computation as thousands of threads
• An accelerator model of computing
  • Explicit functions for off-loading computation to GPUs
  • Data movement explicitly managed by the programmer
NVIDIA’s Compute Unified Device Architecture (CUDA)
[Figure: host and GPU]
For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training
Structure of a Compute Kernel
• Arrays of (data parallel) thread blocks called cooperative thread arrays (CTAs)
• Barrier synchronization
• Mapped to a single instruction stream, multiple data stream (SIMD) processor
Parallel Thread Execution (PTX) instruction set architecture
NVIDIA Fermi GF100
• 4 Graphics Processing Clusters (GPCs) containing 4 SMs each
• Each SM has 32 ALUs, 4 SFUs, and 16 LS units
• Each ALU has access to 1024 32-bit registers (total of 128 kB per SM)
• Each SM has its own Shared Memory/L1 cache (64 kB total)
• Unified L2 cache (768 kB)
• Six 64-bit Memory Controllers (total 384 bits wide)
[Figure: streaming multiprocessor (SM) and ALU]
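The chip-wide totals implied by the Fermi figures above can be checked with a little arithmetic. The sketch below only multiplies out the numbers quoted on the slide; the constant and function names are illustrative and do not come from any NVIDIA or Ocelot header.

```cpp
#include <cassert>

// Figures quoted on the slide for NVIDIA Fermi GF100.
constexpr int kGpcs              = 4;     // processing clusters per chip
constexpr int kSmsPerGpc         = 4;     // streaming multiprocessors per GPC
constexpr int kAlusPerSm         = 32;
constexpr int kRegistersPerAlu   = 1024;  // 32-bit registers
constexpr int kBytesPerRegister  = 4;
constexpr int kMemoryControllers = 6;
constexpr int kBitsPerController = 64;

// Derived totals (hypothetical helper names).
constexpr int totalSms()  { return kGpcs * kSmsPerGpc; }
constexpr int totalAlus() { return totalSms() * kAlusPerSm; }
constexpr int registerFileBytesPerSm() {
    return kAlusPerSm * kRegistersPerAlu * kBytesPerRegister;
}
constexpr int memoryBusBits() { return kMemoryControllers * kBitsPerController; }

// The compiler verifies the totals stated on the slide.
static_assert(totalSms() == 16, "4 GPCs x 4 SMs");
static_assert(totalAlus() == 512, "16 SMs x 32 ALUs");
static_assert(registerFileBytesPerSm() == 128 * 1024, "128 kB of registers per SM");
static_assert(memoryBusBits() == 384, "6 controllers x 64 bits");
```

The register-file line is the one worth noting: 32 ALUs times 1024 four-byte registers gives exactly the 128 kB per SM stated above.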
Ocelot Structure
[Figure labels: CUDA Application, nvcc, PTX Kernel]
• Ocelot is a low-level compiler that runs after CUDA applications have been compiled with nvcc
• Structured around a PTX IR
• Compiles stock CUDA applications without modification
CUDA to PTX
• PTX modules are stored as string literals in the fat binary
• We ignore the accompanying binary image (GPU native binary)
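To make the idea of "PTX as a string literal" concrete, the sketch below holds a tiny PTX module in a C string and lists the kernels it declares. The PTX text is illustrative, and `entryNames` is a hypothetical helper that naively scans whitespace-separated tokens; it is not Ocelot's parser and does not reflect the actual fat binary layout.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Illustrative PTX module text, held the way a string literal would be.
static const char* const kPtxModule =
    ".version 2.3\n"
    ".target sm_20\n"
    ".entry sequence (.param .u64 A, .param .s32 N)\n"
    "{\n"
    "    exit;\n"
    "}\n";

// Hypothetical helper: returns the identifier following each ".entry" token.
std::vector<std::string> entryNames(const std::string& ptx) {
    std::vector<std::string> names;
    std::istringstream in(ptx);
    std::string token;
    while (in >> token) {
        if (token == ".entry" && (in >> token)) {
            // Drop a parameter list glued to the name, e.g. "sequence(".
            names.push_back(token.substr(0, token.find('(')));
        }
    }
    return names;
}
```

Calling `entryNames(kPtxModule)` on the text above yields a single name, `sequence`.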
Dependencies
Software
• C++ Compiler (GCC 4.5.x)
• Lex Lexer Generator (Flex 2.5.35)
• YACC Parser Generator (Bison 2.4.1)
• SCons (Python 2.7)
• LLVM (3.1)
Libraries
• boost_system (1.46)
• boost_filesystem (1.46)
• boost_serialization (1.46)
• GLEW (1.5) (optional, for GL interop)
• GL (for NVIDIA GPU Devices)
Library headers
• Boost (1.46)
http://code.google.com/p/gpuocelot/wiki/Installation
Ocelot Source Code
• Freely available via Google Code project site (New BSD License)
• ocelot/
  • analysis/ -- Analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- Implements CUDA runtime
  • executive/ -- Device interface and backend implementations
  • ir/ -- Internal representations (PTX, LLVM, AMD IL)
  • parser/ -- Parser (to PTX)
  • tools/ -- Standalone applications using Ocelot
  • trace/ -- Trace generation and analysis tools
  • translator/ -- Translators from PTX to LLVM and AMD IL
  • transforms/ -- Program transformations
http://code.google.com/p/gpuocelot/
svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
Building GPU Ocelot
Obtain source code
    svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
Compile with SCons
    sudo ./build.py --install
Build and execute unit tests
    sudo ./build.py --test=full
Output appears in .release_build
• libocelot.so
• OcelotConfig
• Tests
Installation directory:
• /usr/local/include/ocelot
• /usr/local/lib
http://code.google.com/p/gpuocelot/wiki/Installation
Configuring Ocelot: configure.ocelot
• Controls Ocelot’s initial state
• Located in the application’s startup directory
• trace specifies which trace generators are initially attached
• executive controls device properties

trace:
• memoryChecker - ensures that every access targets a memory region associated with the currently selected device
• raceDetector - enforces synchronized access to .shared memory
• debugger - interactive debugger
executive:
• devices: list of Ocelot backend devices that are enabled
  • nvidia - NVIDIA GPU backend
  • emulated - Ocelot PTX emulator (trace generators)
  • llvm - efficient execution of PTX on multicore CPU
  • amd - translation to AMD IL for PTX on AMD Radeon GPU

Example configure.ocelot:

{
    trace: {
        memoryChecker: {
            enabled: true,
            checkInitialization: false
        },
        raceDetector: {
            enabled: false,
            ignoreIrrelevantWrites: true
        },
        debugger: {
            enabled: false,
            kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
            alwaysAttach: true
        },
    },
    executive: {
        devices: [ "emulated" ],
    }
}
Building and Executing CUDA Programs
    nvcc -c example.cu -arch sm_23
    g++ -o example example.o `OcelotConfig -l`
• `OcelotConfig -l` expands to ‘-locelot’
• libocelot.so replaces libcudart.so
CUDA Runtime API
• Ocelot implements the CUDA Runtime API
  • Transparent hooks into existing CUDA applications
  • Override methods of cuda::CudaDeviceInterface
• Maps CUDA RT onto the Ocelot device interface abstraction
  • cuda::CudaRuntime
• Extended through custom Ocelot API
  • e.g. ocelot::registerPTXModule( );
Ocelot CUDA Runtime Overview
Kernels execute anywhere: key to portability!
A reimplementation of the CUDA Runtime API
Compatible with existing applications
Link against libocelot.so instead of libcudart.so
R. Domingo & D. Kaeli (NEU)
Ocelot CUDA Runtime
Clean device abstraction
All back-ends implement same interface
Ocelot API Extensions
• Add/remove trace generators
• Device memory sharing among host threads
• Device switching
Ocelot Source Code: CUDA Runtime API
• ocelot/
  • cuda/ -- Implements CUDA runtime
    • interface/CudaRuntimeInterface.h
    • interface/CudaRuntime.h
    • interface/CudaRuntimeContext.h
    • interface/FatBinaryContext.h
    • interface/CudaDriverFrontend.h
Ocelot CUDA Runtime API Implementation
• Implement interface defined by cuda::CudaRuntimeInterface
  • ocelot/cuda/interface/CudaRuntime.h
  • ocelot/cuda/implementation/CudaRuntime.cpp
  • class cuda::CudaRuntime
• cuda::CudaRuntime members
  • Host thread contexts
  • Ocelot devices
  • Registered modules, textures, kernels
  • Fat binaries
  • Global mutex
• CUDA Runtime API functions
  • e.g. cudaMemcpy, cudaLaunch, __cudaRegisterModule()
• Additional functions
  • e.g. _lock(), _unlock(), _registerModule()
Ocelot Source Code: Device Interface
• ocelot/
  • executive/ -- Device interface and backend implementations
    • interface/Device.h
    • interface/EmulatorDevice.h
    • interface/NVIDIAGPUDevice.h
    • interface/MulticoreCPUDevice.h
    • interface/ATIGPUDevice.h
Ocelot Device Interface
• class executive::Device
• Succinct interface for device objects
  • Module registration
  • Memory management
  • Kernel configuration and launching
  • Global variable and texture management
  • OpenGL interoperability
  • Streams and Events
  • Trace generators
• Minimal set of APIs for device-oriented programming model
  • 57 functions (versus CUDA Runtime’s 120+)
• Capture device state:
  • Memory allocations, global variables, textures, graphics interoperability
• Facilitate creation of backend execution targets
  • Implement Device interface
• Enable multiple API front ends
  • Implement front ends targeting Device interface
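The benefit of a single device abstraction is that front ends are written once against the interface, and each backend only has to implement it. The sketch below illustrates that shape with a deliberately tiny interface. All names (`ToyDevice`, `ToyEmulatedDevice`, `allocateTwoBuffers`) are hypothetical and much smaller than executive::Device, whose real member functions are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, much-reduced stand-in for a device abstraction.
class ToyDevice {
public:
    virtual ~ToyDevice() = default;
    virtual std::string name() const = 0;
    virtual int allocate(std::size_t bytes) = 0;   // returns an opaque handle
    virtual void release(int handle) = 0;
    virtual std::size_t bytesAllocated() const = 0;
};

// One backend: keeps "device" memory in host vectors, as an emulator might.
class ToyEmulatedDevice : public ToyDevice {
public:
    std::string name() const override { return "emulated"; }
    int allocate(std::size_t bytes) override {
        int handle = _next++;
        _allocations[handle].resize(bytes);
        return handle;
    }
    void release(int handle) override { _allocations.erase(handle); }
    std::size_t bytesAllocated() const override {
        std::size_t total = 0;
        for (const auto& entry : _allocations) total += entry.second.size();
        return total;
    }
private:
    int _next = 0;
    std::map<int, std::vector<char>> _allocations;
};

// A "front end" that depends only on the interface, not on any backend.
std::size_t allocateTwoBuffers(ToyDevice& device, std::size_t bytes) {
    device.allocate(bytes);
    device.allocate(bytes);
    return device.bytesAllocated();
}
```

Adding another backend in this sketch means deriving a second class from `ToyDevice`; `allocateTwoBuffers` would not change.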
Ocelot PTX Intermediate Representation (IR)
Backend compiler framework for PTX
• Full-featured PTX IR
  • Class hierarchy for PTX instructions/directives
  • PTX control flow graph
  • Static single-assignment form
  • Dataflow/dominance analysis
  • Enables PTX optimization
• IR to IR translation
  • From PTX to other IRs
    • LLVM (x86/PowerPC/ARM)
    • CAL (AMD GPUs)
[Figure: PTX kernel]
Ocelot Source Code: Intermediate Representation
• ocelot/
  • ir/ -- Internal representations (PTX, LLVM, AMD IL)
    • interface/Module.h
    • interface/PTXInstruction.h
    • interface/PTXOperand.h
    • interface/PTXKernel.h
    • interface/ControlFlowGraph.h
    • interface/ILInstruction.h
    • interface/LLVMInstruction.h
  • parser/ -- Parser (to PTX)
    • interface/PTXParser.h
Ocelot PTX Internal Representation
• C++ classes representing PTX module
  • ir::PTXModule
  • ir::PTXKernel
  • ir::PTXInstruction
  • ir::PTXOperand
  • ir::GlobalVariable
  • ir::LocalVariable
  • ir::Parameter
• Translator source
  • PTX to LLVM
  • PTX to AMD IL
• Suitable for analysis and transformation
• Executable representation
  • PTX Emulator
Ocelot PTX IR: Kernels

The whole file below is an ir::Module.

.global .f32 globalVariable;                      // ir::Global

.entry sequence (                                 // ir::Kernel
    .param .u64 __cudaparm_sequence_A,            // ir::Parameter
    .param .s32 __cudaparm_sequence_N)
{
    .reg .u32 %r<11>;
    .reg .u64 %rd<6>;
    .local .u32 %rp0;                             // ir::Local
    . . .
$LDWbegin_sequence:                               // ir::BasicBlock
    ld.param.s32 %r6, [__cudaparm_sequence_N];
    setp.le.s32 %p1, %r6, %r5;
    @%p1 bra $Lt_0_1026;
    . . .
$Lt_0_1026:
    exit;
$LDWend_sequence:
} // sequence
Ocelot PTX IR: Instructions

ir::BasicBlock (a sequence of ir::PTXInstruction):
    add.s32 %r7, %r5, 1;
    ld .param .u64 %rd1, [__cudaparm_sequence_A];
    cvt.s64.s32 %rd2, %r5;
    mul.wide.s32 %rd3, %r5, 4;
    add.u64 %rd4, %rd1, %rd3;
    st .global .s32 [ %rd4 + 0 ], %r7;
    @%p1 bra $Lt_0_6146;

ir::PTXInstruction fields: opcode, addressSpace, dataType, operands d and a, guard predicate
ir::PTXOperand addressMode values: address, register, immediate, indirect, label
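The instruction fields named on this slide (opcode, address space, data type, operands d and a, addressing mode, guard predicate) can be pictured as plain records. The sketch below is a hypothetical miniature, assuming string-valued fields for readability. It is not the definition of ir::PTXInstruction or ir::PTXOperand, which use enumerations and many more members.

```cpp
#include <cassert>
#include <string>

// Hypothetical miniature of the operand addressing modes listed on the slide.
enum class AddressMode { Register, Address, Immediate, Indirect, Label };

struct ToyOperand {
    AddressMode mode;
    std::string text;  // register name, symbol, literal, or label
    int offset;        // only meaningful for Indirect
};

struct ToyInstruction {
    std::string guard;         // predicate register; empty if unguarded
    std::string opcode;        // e.g. "add", "ld", "st"
    std::string addressSpace;  // e.g. "param", "global"; empty if none
    std::string dataType;      // e.g. "s32", "u64"
    ToyOperand d;              // destination operand
    ToyOperand a;              // first source operand
};

// Renders an operand according to its addressing mode.
std::string format(const ToyOperand& op) {
    switch (op.mode) {
    case AddressMode::Address:
        return "[" + op.text + "]";
    case AddressMode::Indirect:
        return "[" + op.text + " + " + std::to_string(op.offset) + "]";
    default:
        return op.text;
    }
}

// Renders a two-operand instruction as PTX-like text.
std::string format(const ToyInstruction& inst) {
    std::string out;
    if (!inst.guard.empty()) out += "@" + inst.guard + " ";
    out += inst.opcode;
    if (!inst.addressSpace.empty()) out += "." + inst.addressSpace;
    out += "." + inst.dataType + " " + format(inst.d) + ", " + format(inst.a) + ";";
    return out;
}
```

With these records, a parameter load is an instruction whose destination uses `Register` mode and whose source uses `Address` mode, while the store from the slide uses an `Indirect` destination with offset 0.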
Control and Data-Flow Graphs
• Data structure for representing kernels
• Basic blocks
  • fall-through and branch edges
  • instruction vector
  • label
• Block traversals:
  • pre-order, topological, post-order
• Data-flow graph overlaying CFG
  • definition-use chains, ..
• CFG Transformations:
  • split blocks, edges
• DFG Transformations:
  • insert and remove values
  • iterate over def-use
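The block traversals listed above are standard graph walks. As a minimal sketch, assuming a toy CFG stored as an adjacency list of block indices (not ir::ControlFlowGraph), post-order comes from a depth-first search, and reversing it gives an order that is topological whenever the graph has no back edges.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy CFG: blocks are indices; cfg[b] lists the targets of b's
// fall-through and branch edges.
typedef std::vector<std::vector<int>> ToyCfg;

// Depth-first helper: a block is emitted after all of its successors.
static void visitPostOrder(const ToyCfg& cfg, int block,
                           std::vector<bool>& seen, std::vector<int>& order) {
    seen[block] = true;
    for (int next : cfg[block]) {
        if (!seen[next]) visitPostOrder(cfg, next, seen, order);
    }
    order.push_back(block);
}

// Post-order over the blocks reachable from `entry`.
std::vector<int> postOrder(const ToyCfg& cfg, int entry) {
    std::vector<bool> seen(cfg.size(), false);
    std::vector<int> order;
    visitPostOrder(cfg, entry, seen, order);
    return order;
}

// Reverse post-order: a topological order when the CFG is acyclic.
std::vector<int> reversePostOrder(const ToyCfg& cfg, int entry) {
    std::vector<int> order = postOrder(cfg, entry);
    std::reverse(order.begin(), order.end());
    return order;
}
```

For a diamond-shaped CFG (block 0 branches to 1 and 2, which both reach 3), the entry block comes last in post-order and first in reverse post-order.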
Example: Control-Flow Graphs

// example: splits basic blocks containing barriers
//
for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin();
     bb_it != kernel->cfg()->end(); ++bb_it) {   // iterate over basic blocks
    unsigned int n = 0;
    ir::BasicBlock::InstructionList::iterator inst_it;
    for (inst_it = (bb_it)->instructions.begin();
         inst_it != (bb_it)->instructions.end();
         ++inst_it, n++) {                       // iterate over instructions in *bb_it
        const ir::PTXInstruction *inst =
            static_cast< const ir::PTXInstruction *>(*inst_it);
        if (inst->opcode == ir::PTXInstruction::Bar) {
            if (n + 1 < (unsigned int)(bb_it)->instructions.size()) {
                // split block containing bar.sync so that it’s always
                // the last instruction in a block
                std::string label = (bb_it)->label + "_bar";
                kernel->cfg()->split_block(bb_it, n+1,
                    ir::BasicBlock::Edge::FallThrough, label);
            }
            break;
        }
    } // end for (inst_it)
} // end for (bb_it)
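The same splitting idea can be tried without Ocelot on a toy block list. In this sketch, `ToyBlock` and `splitAtBarriers` are hypothetical names; instructions are reduced to opcode strings, and fall-through is modelled simply by placing the new block right after the one that was split.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ToyBlock {
    std::string label;
    std::vector<std::string> instructions;  // opcodes only
};

// Splits blocks so that every "bar" is the last instruction of its block.
void splitAtBarriers(std::vector<ToyBlock>& blocks) {
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        // Only barriers that are not already last need a split (n + 1 < size).
        for (std::size_t n = 0; n + 1 < blocks[b].instructions.size(); ++n) {
            if (blocks[b].instructions[n] != "bar") continue;
            ToyBlock tail;
            tail.label = blocks[b].label + "_bar";
            tail.instructions.assign(blocks[b].instructions.begin() + n + 1,
                                     blocks[b].instructions.end());
            blocks[b].instructions.resize(n + 1);
            blocks.insert(blocks.begin() + b + 1, tail);
            break;  // the tail is examined on the next outer iteration
        }
    }
}
```

As in the slide's loop, the search stops at the first barrier in a block, and any later barriers are handled when the newly created block is visited.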
Example: Spilling Live Values

// ocelot/analysis/implementation/RemoveBarrierPass.cpp
//
void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
    const DataflowGraph::Block::RegisterSet& alive )
{
unsigned int bytes = 0;
ir::PTXInstruction move ( ir::PTXInstruction::Mov );
move.type = ir::PTXOperand::u64;
move.a.identifier = "__ocelot_remove_barrier_pass_stack";
move.a.addressMode = ir::PTXOperand::Address;
move.a.type = ir::PTXOperand::u64;
move.d.reg = _kernel->dfg()->newRegister();
move.d.addressMode = ir::PTXOperand::Register;
move.d.type = ir::PTXOperand::u64;
_kernel->dfg()->insert( block, move, block->instructions().size() - 1 );
...
for( DataflowGraph::Block::RegisterSet::const_iterator reg = alive.begin(); reg != alive.end(); ++reg ) {
ir::PTXInstruction save( ir::PTXInstruction::St );
save.type = reg->type;
save.addressSpace = ir::PTXInstruction::Local;
save.d.addressMode = ir::PTXOperand::Indirect;
save.d.reg = move.d.reg;
save.d.type = ir::PTXOperand::u64;
save.d.offset = bytes;
bytes += ir::PTXOperand::bytes( save.type );
save.a.addressMode = ir::PTXOperand::Register;
save.a.type = reg->type;
save.a.reg = reg->id;
_kernel->dfg()->insert( block, save, block->instructions().size() - 1 );
}
_spillBytes = std::max( bytes, _spillBytes );
}
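The offset bookkeeping in the spill loop can be isolated from the IR. The sketch below uses hypothetical types (`ToyRegister`, `SpillSlot`) and mirrors only the arithmetic: each live register gets the running byte count as its offset, and the pass-wide spill size is the maximum seen so far. As in the loop above, no alignment padding is applied.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct ToyRegister {
    int id;
    unsigned int bytes;  // size of the register's data type
};

struct SpillSlot {
    int id;
    unsigned int offset;  // byte offset within the spill area
};

// Assigns consecutive spill offsets to the live registers and grows
// `spillBytes` if this block needs more space than any block before it.
std::vector<SpillSlot> layoutSpillSlots(const std::vector<ToyRegister>& alive,
                                        unsigned int& spillBytes) {
    std::vector<SpillSlot> slots;
    unsigned int bytes = 0;
    for (const ToyRegister& reg : alive) {
        slots.push_back(SpillSlot{reg.id, bytes});
        bytes += reg.bytes;
    }
    spillBytes = std::max(bytes, spillBytes);
    return slots;
}
```

Calling it for several blocks with the same `spillBytes` variable leaves that variable holding the size of the largest spill area required.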
IR for AMD and LLVM
LLVM IR
• Implements all of the LLVM instruction set
AMD IL
• Supports translation from PTX to AMD interface
AMD Backend: R. Domingo & D. Kaeli (NEU)
PTX Pass Manager
• Orchestrates analysis and transformation passes
• Derived from LLVM model
  • Analysis Passes generate meta-data
  • Meta-data consumed by transformations
  • Transformation Passes modify the IR
Using the Pass Manager
• Passes added to a manager
  • Schedules execution
  • Manages meta-data
• Ensures meta-data available
  • Up to date; not redundantly computed
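The "up to date; not redundantly computed" behavior can be sketched with a toy manager that tracks which analyses are currently valid. All names here (`ToyPass`, `ToyPassManager`) are hypothetical, and the invalidation policy is an assumption chosen for simplicity: any pass that modifies the IR invalidates every analysis. The real pass manager may invalidate more selectively.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct ToyPass {
    std::string name;
    std::vector<std::string> needs;  // analyses required before the pass runs
    bool modifiesIr;                 // transformation passes set this
};

class ToyPassManager {
public:
    void addPass(const ToyPass& pass) { _passes.push_back(pass); }

    // "Runs" the passes in order and returns the sequence of analysis
    // computations that were actually necessary.
    std::vector<std::string> run() {
        std::vector<std::string> computed;
        std::set<std::string> valid;
        for (const ToyPass& pass : _passes) {
            for (const std::string& analysis : pass.needs) {
                // Compute only if no valid copy exists.
                if (valid.insert(analysis).second) computed.push_back(analysis);
            }
            // Assumed policy: changing the IR invalidates all meta-data.
            if (pass.modifiesIr) valid.clear();
        }
        return computed;
    }

private:
    std::vector<ToyPass> _passes;
};
```

Two consecutive passes that need the same analysis share one computation; a pass that follows a transformation triggers a fresh one.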
Analysis Passes
• Analysis runs over the PTX IR
  • Generates meta-data
  • Possibly updates or invalidates existing meta-data
• Examples
  • Data-flow graph
  • Dominator and Post-dominator trees
  • Thread frontiers
Analysis Passes
• Control Flow Graph
  • ir/interface/ControlFlowGraph.h
• Data Flow Graph
  • analysis/interface/DataflowGraph.h
• Dominator and Post-Dominator Trees
  • analysis/interface/DominatorTree.h
  • analysis/interface/PostDominatorTree.h
• Superblock Analysis
  • analysis/interface/SuperblockAnalysis.h
• Divergence Graph
  • analysis/interface/DivergenceGraph.h
• Thread Frontiers
  • analysis/interface/ThreadFrontiers.h
Transformation Passes
• Modify the PTX IR
• Consume meta-data
• Examples:
  • Dead-code elimination
    • transforms/interface/DeadCodeEliminationPass.h
  • Control-flow structuring
    • transforms/interface/StructuralTransform.h
  • Sync elimination
    • transforms/interface/SyncElimination.h
  • Dynamic instrumentation
Example: Dead Code Elimination Transformation Pass
Dead Code Elimination
• Approach
  • Run once on each kernel
  • Consume data-flow analysis meta-data
  • Delete instructions producing values with no users
• Implementation
  • transforms/interface/DeadCodeEliminationPass.h
  • transforms/implementation/DeadCodeEliminationPass.cpp
Dead Code Elimination (1 of 5)
Set up pass dependencies

DeadCodeEliminationPass::DeadCodeEliminationPass()
: KernelPass(Analysis::DataflowGraphAnalysis
    | Analysis::StaticSingleAssignment,
    "DeadCodeEliminationPass")
{
}
Dead Code Elimination (2 of 5)
Run pass: get analysis metadata

void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k)
{
    Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis);
    assert(dfgAnalysis != 0);

    // cast up
    analysis::DataflowGraph& dfg =
        *static_cast<analysis::DataflowGraph*>(dfgAnalysis);
    assert(dfg.ssa());
    ...
Dead Code Elimination (3 of 5)
Loop until no more changes

BlockSet blocks;
for (iterator block = dfg.begin(); block != dfg.end(); ++block)
{
    report(" Queueing up BB_" << block->id());
    blocks.insert(block);
}

while (!blocks.empty())
{
    iterator block = *blocks.begin();
    blocks.erase(blocks.begin());
    eliminateDeadInstructions(dfg, blocks, block);
}
Dead Code Elimination (4 of 5)
Remove unused live-out values

AliveKillList aliveOutKillList;
for (RegisterSet::iterator aliveOut = block->aliveOut().begin();
    aliveOut != block->aliveOut().end(); ++aliveOut)
{
    if (canRemoveAliveOut(dfg, block, *aliveOut))
    {
        report(" removed " << aliveOut->id);
        aliveOutKillList.push_back(aliveOut);
    }
}
for (AliveKillList::iterator killed = aliveOutKillList.begin();
    killed != aliveOutKillList.end(); ++killed)
{
    block->aliveOut().erase(*killed);
}
Dead Code Elimination (5 of 5)
Check if an instruction can be removed

if (ptx.hasSideEffects()) return false;

for (RegisterPointerVector::iterator reg = instruction->d.begin();
    reg != instruction->d.end(); ++reg)
{
    // the reg is alive outside the block
    if (block->aliveOut().count(*reg) != 0) return false;

    InstructionVector::iterator next = instruction;
    for (++next; next != block->instructions().end(); ++next)
    {
        for (RegisterPointerVector::iterator source = next->s.begin();
            source != next->s.end(); ++source)
        {
            // found a user in the block
            if (*source->pointer == *reg->pointer) return false;
        }
    }
}
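The removal criterion (no side effects, result not live out of the block, no later user in the block) can be exercised on a toy straight-line block. Note that this sketch uses a single backward liveness sweep, which is a different formulation from the worklist and forward user scan shown on the slides; `ToyInst` and `eliminateDead` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct ToyInst {
    std::string dest;                  // register written; empty if none
    std::vector<std::string> sources;  // registers read
    bool hasSideEffects;               // e.g. stores and barriers
};

// Keeps an instruction if it has side effects or if its result is read
// later in the block or is live out of the block; drops it otherwise.
std::vector<ToyInst> eliminateDead(const std::vector<ToyInst>& block,
                                   const std::set<std::string>& aliveOut) {
    std::set<std::string> live = aliveOut;
    std::vector<ToyInst> kept;
    for (auto inst = block.rbegin(); inst != block.rend(); ++inst) {
        bool used = !inst->dest.empty() && live.count(inst->dest) != 0;
        if (!inst->hasSideEffects && !used) continue;  // dead: drop it
        // Standard backward step: live = (live - def) + uses.
        live.erase(inst->dest);
        live.insert(inst->sources.begin(), inst->sources.end());
        kept.push_back(*inst);
    }
    std::reverse(kept.begin(), kept.end());
    return kept;
}
```

Because the sweep runs backward, removing a dead instruction also stops its sources from being marked live, so chains of dead values disappear in one pass over the block.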
Running Passes on PTX
• Static optimizer
  • PTXOptimizer
  • Runs passes on PTX assembly files
  • ocelot/tools/PTXOptimizer.cpp
• JIT optimization
  • Runs passes before kernels are launched
  • ocelot/api/implementation/OcelotRuntime.cpp
Questions
GPU Ocelot
• Google Code site: http://code.google.com/p/gpuocelot
• Research Project site: http://gpuocelot.gatech.edu
• Mailing list: [email protected]
Contributors
Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Ashwin Lele, Si Li, Tri Pho, Jin Wang, Haicheng Wu, Sudhakar Yalamanchili
Sponsors
AMD, IBM, Intel, LogicBlox, NSF, NVIDIA