Overview of Ocelot: architecture

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

OVERVIEW OF OCELOT: ARCHITECTURE


Overview

• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager


Ocelot: Multiplatform Dynamic Compilation

Just-in-time code generation and optimization for data-intensive applications

[Figure labels: Language Front-End, Data Parallel IR. Credits: esd.lbl.gov; R. Domingo & D. Kaeli (NEU)]

• Environment for i) compiler research, ii) architecture research, and iii) productivity tools


NVIDIA’s Compute Unified Device Architecture (CUDA)

• Integrates the concept of a compute kernel called from standard languages
  • Multithreaded host programs
  • The compute kernel specifies data parallel computation as thousands of threads
• An accelerator model of computing
  • Explicit functions for off-loading computation to GPUs
  • Data movement explicitly managed by the programmer


NVIDIA’s Compute Unified Device Architecture (CUDA)

[Figure labels: Host, GPU]

For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training


Structure of a Compute Kernel

• Arrays of (data parallel) thread blocks called cooperative thread arrays (CTAs)
• Barrier synchronization
• Mapped to a single instruction stream, multiple data stream (SIMD) processor
• Parallel Thread Execution (PTX) instruction set architecture
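To make the barrier semantics concrete, the following is a minimal host-side C++ sketch, not Ocelot or CUDA code. It emulates a single CTA by running every thread up to the barrier before any thread continues past it. The function name `rotateThroughShared` and the parameter `ctaSize` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates one CTA of `ctaSize` threads. Each thread writes its id to shared
// memory, waits at a barrier, then reads its right-hand neighbour's slot.
// Serializing the threads phase by phase is one simple way to honour a barrier.
std::vector<int> rotateThroughShared(std::size_t ctaSize) {
    assert(ctaSize > 0);
    std::vector<int> shared(ctaSize);   // stands in for .shared memory
    std::vector<int> result(ctaSize);   // stands in for per-thread output

    // Phase 1: code before the barrier, executed for every thread.
    for (std::size_t tid = 0; tid < ctaSize; ++tid) {
        shared[tid] = static_cast<int>(tid);
    }

    // Barrier: no thread proceeds until all of the writes above are done.

    // Phase 2: code after the barrier, executed for every thread.
    for (std::size_t tid = 0; tid < ctaSize; ++tid) {
        result[tid] = shared[(tid + 1) % ctaSize];
    }
    return result;
}
```

Without the barrier, a thread could read a neighbour's slot before that neighbour had written it; splitting execution into two phases rules that out in this toy model.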


NVIDIA Fermi GF100

• 4 Graphics Processing Clusters (GPCs) containing 4 SMs each
• Each SM has 32 ALUs, 4 SFUs, and 16 LS units
• Each ALU has access to 1024 32-bit registers (total of 128 kB per SM)
• Each SM has its own Shared Memory/L1 cache (64 kB total)
• Unified L2 cache (768 kB)
• Six 64-bit memory controllers (384 bits wide in total)

[Figure labels: ALU, Streaming multiprocessor (SM)]
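The totals quoted for GF100 follow from simple products. The sketch below only re-derives them from the per-unit figures on the slide; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Per-SM figures taken from the GF100 bullet list.
struct FermiSM {
    std::size_t alus = 32;               // ALUs per SM
    std::size_t registersPerAlu = 1024;  // 32-bit registers visible to each ALU
    std::size_t bytesPerRegister = 4;    // 32 bits
};

// Register file capacity of one SM, in bytes.
std::size_t registerFileBytes(const FermiSM& sm) {
    return sm.alus * sm.registersPerAlu * sm.bytesPerRegister;
}

// Total SM count for a chip with `gpcs` clusters of `smsPerGpc` SMs each.
std::size_t totalSMs(std::size_t gpcs, std::size_t smsPerGpc) {
    return gpcs * smsPerGpc;
}

// Aggregate memory interface width in bits.
std::size_t memoryBusBits(std::size_t controllers, std::size_t bitsPerController) {
    assert(bitsPerController % 8 == 0);
    return controllers * bitsPerController;
}
```

With the slide's numbers, `registerFileBytes` gives 32 x 1024 x 4 = 131072 bytes (128 kB), `totalSMs(4, 4)` gives 16, and `memoryBusBits(6, 64)` gives 384.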



Ocelot Structure

[Figure labels: CUDA Application, nvcc, PTX Kernel]

• Ocelot is a low-level compiler that runs after CUDA applications have been compiled with nvcc
• Structured around a PTX IR
• Compiles stock CUDA applications without modification


CUDA to PTX

• PTX modules are stored as string literals in the fat binary
• We ignore the accompanying binary image (GPU native binary)




Dependencies

Software
• C++ Compiler (GCC 4.5.x)
• Lex Lexer Generator (Flex 2.5.35)
• YACC Parser Generator (Bison 2.4.1)
• Scons (Python 2.7)
• LLVM (3.1)

Libraries
• boost_system (1.46)
• boost_filesystem (1.46)
• boost_serialization (1.46)
• GLEW (1.5) (optional, for GL interop)
• GL (for NVIDIA GPU Devices)

Library headers
• Boost (1.46)

http://code.google.com/p/gpuocelot/wiki/Installation


Ocelot Source Code

• Freely available via Google Code project site (New BSD License)

• ocelot/
  • analysis/ -- Analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- Implements CUDA runtime
  • executive/ -- Device interface and backend implementations
  • ir/ -- Internal representations (PTX, LLVM, AMD IL)
  • parser/ -- Parser (to PTX)
  • tools/ -- Standalone applications using Ocelot
  • trace/ -- Trace generation and analysis tools
  • translator/ -- Translators from PTX to LLVM and AMD IL
  • transforms/ -- Program transformations

http://code.google.com/p/gpuocelot/

    svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only


Building GPU Ocelot

• Obtain source code

    svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only

• Compile with Scons

    sudo ./build.py --install

• Build and execute unit tests

    sudo ./build.py --test=full

• Output appears in .release_build
  • libocelot.so
  • OcelotConfig
  • Tests
• Installation directories: /usr/local/include/ocelot and /usr/local/lib

http://code.google.com/p/gpuocelot/wiki/Installation


Configuring Ocelot: configure.ocelot

• Controls Ocelot’s initial state
• Located in the application’s startup directory
• trace specifies which trace generators are initially attached
• executive controls device properties

trace:
• memoryChecker - ensures that accesses fall within a memory region associated with the currently selected device
• raceDetector - enforces synchronized access to .shared memory
• debugger - interactive debugger

executive:
• devices: list of Ocelot backend devices that are enabled
  • nvidia - NVIDIA GPU backend
  • emulated - Ocelot PTX emulator (trace generators)
  • llvm - efficient execution of PTX on multicore CPU
  • amd - translation to AMD IL for PTX on AMD Radeon GPU

Example configure.ocelot:

    {
        trace: {
            memoryChecker: {
                enabled: true,
                checkInitialization: false
            },
            raceDetector: {
                enabled: false,
                ignoreIrrelevantWrites: true
            },
            debugger: {
                enabled: false,
                kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
                alwaysAttach: true
            },
        },
        executive: {
            devices: [ "emulated" ],
        }
    }


Building and Executing CUDA Programs

    nvcc -c example.cu -arch sm_23
    g++ -o example example.o `OcelotConfig -l`

• `OcelotConfig -l` expands to '-locelot'
• libocelot.so replaces libcudart.so




CUDA Runtime API

• Ocelot implements the CUDA Runtime API
  • Transparent hooks into existing CUDA applications
  • Override methods of cuda::CudaDeviceInterface
• Maps the CUDA runtime onto the Ocelot device interface abstraction
  • cuda::CudaRuntime
• Extended through a custom Ocelot API
  • e.g. ocelot::registerPTXModule( );


Ocelot CUDA Runtime Overview

• Kernels execute anywhere: key to portability!
• A reimplementation of the CUDA Runtime API
• Compatible with existing applications
• Link against libocelot.so instead of libcudart.so


Ocelot CUDA Runtime

• Clean device abstraction
  • All back-ends implement the same interface
• Ocelot API extensions
  • Add/remove trace generators
  • Device memory sharing among host threads
  • Device switching


Ocelot Source Code: CUDA Runtime API

• ocelot/
  • cuda/ -- Implements CUDA runtime
    • interface/CudaRuntimeInterface.h
    • interface/CudaRuntime.h
    • interface/CudaRuntimeContext.h
    • interface/FatBinaryContext.h
    • interface/CudaDriverFrontend.h


Ocelot CUDA Runtime API Implementation

• Implements the interface defined by cuda::CudaRuntimeInterface
  • ocelot/cuda/interface/CudaRuntime.h
  • ocelot/cuda/implementation/CudaRuntime.cpp
  • class cuda::CudaRuntime
• cuda::CudaRuntime members
  • Host thread contexts
  • Ocelot devices
  • Registered modules, textures, kernels
  • Fat binaries
  • Global mutex
• CUDA Runtime API functions
  • e.g. cudaMemcpy, cudaLaunch, __cudaRegisterModule()
• Additional functions
  • e.g. _lock(), _unlock(), _registerModule()


Ocelot Source Code: Device Interface

• ocelot/
  • executive/ -- Device interface and backend implementations
    • interface/Device.h
    • interface/EmulatorDevice.h
    • interface/NVIDIAGPUDevice.h
    • interface/MulticoreCPUDevice.h
    • interface/ATIGPUDevice.h


Ocelot Device Interface

• class executive::Device
• Succinct interface for device objects
  • Module registration
  • Memory management
  • Kernel configuration and launching
  • Global variable and texture management
  • OpenGL interoperability
  • Streams and events
  • Trace generators
• Minimal set of APIs for a device-oriented programming model
  • 57 functions (versus the CUDA Runtime’s 120+)
• Captures device state
  • Memory allocations, global variables, textures, graphics interoperability
• Facilitates creation of backend execution targets
  • Implement the Device interface
• Enables multiple API front ends
  • Implement front ends targeting the Device interface
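The idea of one abstract device interface shared by all back ends and front ends can be sketched as follows. This is a heavily reduced, hypothetical model and not the actual executive::Device class: the names `Device`, `ToyEmulatorDevice`, and `runSequence`, and the four methods shown, are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a device interface (the real one has far more methods).
class Device {
public:
    virtual ~Device() {}
    virtual std::string name() const = 0;
    virtual std::size_t allocate(std::size_t bytes) = 0;   // returns a buffer handle
    virtual void launch(const std::string& kernel, std::size_t handle) = 0;
    virtual std::vector<int> read(std::size_t handle) const = 0;
};

// A toy back end: "kernels" are plain host loops over an int buffer.
class ToyEmulatorDevice : public Device {
public:
    std::string name() const override { return "emulated"; }

    std::size_t allocate(std::size_t bytes) override {
        assert(bytes % sizeof(int) == 0);
        buffers_[next_] = std::vector<int>(bytes / sizeof(int), 0);
        return next_++;
    }

    void launch(const std::string& kernel, std::size_t handle) override {
        std::vector<int>& buf = buffers_.at(handle);
        if (kernel == "sequence") {
            // Element i receives i + 1.
            for (std::size_t i = 0; i < buf.size(); ++i) buf[i] = static_cast<int>(i) + 1;
        }
    }

    std::vector<int> read(std::size_t handle) const override {
        return buffers_.at(handle);
    }

private:
    std::map<std::size_t, std::vector<int>> buffers_;
    std::size_t next_ = 0;
};

// A front end sees only the abstract interface, so any back end can be swapped in.
std::vector<int> runSequence(Device& device, std::size_t n) {
    std::size_t handle = device.allocate(n * sizeof(int));
    device.launch("sequence", handle);
    return device.read(handle);
}
```

Because `runSequence` depends only on `Device`, adding another back end would mean writing another subclass, with no change to the front-end code.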




Ocelot PTX Intermediate Representation (IR)

• Backend compiler framework for PTX
• Full-featured PTX IR
  • Class hierarchy for PTX instructions/directives
  • PTX control flow graph
  • Static single-assignment form
  • Dataflow/dominance analysis
  • Enables PTX optimization
• IR to IR translation
  • From PTX to other IRs
    • LLVM (x86/PowerPC/ARM)
    • CAL (AMD GPUs)

[Figure label: PTX Kernel]


Ocelot Source Code: Intermediate Representation

• ocelot/
  • ir/ -- Internal representations (PTX, LLVM, AMD IL)
    • interface/Module.h
    • interface/PTXInstruction.h
    • interface/PTXOperand.h
    • interface/PTXKernel.h
    • interface/ControlFlowGraph.h
    • interface/ILInstruction.h
    • interface/LLVMInstruction.h
  • parser/ -- Parser (to PTX)
    • interface/PTXParser.h


Ocelot PTX Internal Representation

• C++ classes representing a PTX module
  • ir::PTXModule
  • ir::PTXKernel
  • ir::PTXInstruction
  • ir::PTXOperand
  • ir::GlobalVariable
  • ir::LocalVariable
  • ir::Parameter
• Translator source
  • PTX to LLVM
  • PTX to AMD IL
• Suitable for analysis and transformation
• Executable representation
  • PTX Emulator


Ocelot PTX IR: Kernels

    .global .f32 globalVariable;

    .entry sequence (
        .param .u64 __cudaparm_sequence_A,
        .param .s32 __cudaparm_sequence_N)
    {
        .reg .u32 %r<11>;
        .reg .u64 %rd<6>;
        .local .u32 %rp0;
        . . .
    $LDWbegin_sequence:
        ld.param.s32 %r6, [__cudaparm_sequence_N];
        setp.le.s32 %p1, %r6, %r5;
        @%p1 bra $Lt_0_1026;
        . . .
    $Lt_0_1026:
        exit;
    $LDWend_sequence:
    } // sequence

[Figure annotations on the listing: ir::Module, ir::Kernel, ir::BasicBlock, ir::Local, ir::Parameter, ir::Global]


Ocelot PTX IR: Instructions

    add.s32 %r7, %r5, 1;
    ld.param.u64 %rd1, [__cudaparm_sequence_A];
    cvt.s64.s32 %rd2, %r5;
    mul.wide.s32 %rd3, %r5, 4;
    add.u64 %rd4, %rd1, %rd3;
    st.global.s32 [ %rd4 + 0 ], %r7;
    @%p1 bra $Lt_0_6146;

[Figure annotations on the listing:]
• ir::BasicBlock
• ir::PTXInstruction: opcode, addressSpace, dataType, d, a, guard predicate
• ir::PTXOperand, with addressMode one of: address, register, immediate, indirect, label
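As a rough illustration of how an instruction object with an opcode, a data type, operands, and a guard predicate maps back to PTX text, here is a small self-contained sketch. It is not the ir::PTXInstruction or ir::PTXOperand class: the struct names, the field `b`, and the helpers `makeRegister` and `makeImmediate` are hypothetical, and only register and immediate addressing modes are modelled.

```cpp
#include <cassert>
#include <string>

// Hypothetical miniature of an operand: only two real addressing modes.
struct Operand {
    enum AddressMode { Register, Immediate, Invalid };
    AddressMode addressMode;
    std::string reg;   // used when addressMode == Register
    int imm;           // used when addressMode == Immediate

    std::string toString() const {
        assert(addressMode != Invalid);
        return addressMode == Register ? reg : std::to_string(imm);
    }
};

// Hypothetical miniature of an instruction: opcode, data type, destination d,
// sources a and b, and an optional guard predicate.
struct Instruction {
    std::string opcode;
    std::string dataType;
    Operand d, a, b;
    std::string guard;   // empty means "unguarded"

    std::string toString() const {
        std::string text;
        if (!guard.empty()) text = "@" + guard + " ";
        text += opcode + "." + dataType + " ";
        text += d.toString() + ", " + a.toString() + ", " + b.toString() + ";";
        return text;
    }
};

Operand makeRegister(const std::string& name) {
    return Operand{Operand::Register, name, 0};
}

Operand makeImmediate(int value) {
    return Operand{Operand::Immediate, "", value};
}
```

Building an instruction from `makeRegister("%r7")`, `makeRegister("%r5")`, and `makeImmediate(1)` with opcode "add" and data type "s32" reproduces the first line of the listing, `add.s32 %r7, %r5, 1;`.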


Control and Data-Flow Graphs

• Data structure for representing kernels
• Basic blocks
  • fall-through and branch edges
  • instruction vector
  • label
• Block traversals:
  • pre-order, topological, post-order
• Data-flow graph overlaying CFG
  • definition-use chains, ..
• CFG Transformations:
  • split blocks, edges
• DFG Transformations:
  • insert and remove values
  • iterate over def-use
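The block traversals mentioned here can be illustrated on a toy graph. The sketch below is not Ocelot's ControlFlowGraph API; it assumes a hypothetical representation in which block i simply lists the indices of its successors, and shows a depth-first post-order and its reverse.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy CFG: block i lists the indices of its successor blocks.
typedef std::vector<std::vector<std::size_t> > Cfg;

static void visit(const Cfg& cfg, std::size_t block,
                  std::vector<bool>& seen, std::vector<std::size_t>& order) {
    seen[block] = true;
    for (std::size_t succ : cfg[block]) {
        if (!seen[succ]) visit(cfg, succ, seen, order);
    }
    order.push_back(block);   // emitted only after all successors: post-order
}

// Depth-first post-order starting from the entry block.
std::vector<std::size_t> postOrder(const Cfg& cfg, std::size_t entry) {
    assert(entry < cfg.size());
    std::vector<bool> seen(cfg.size(), false);
    std::vector<std::size_t> order;
    visit(cfg, entry, seen, order);
    return order;
}

// Reversing a post-order gives a topological order when the CFG is acyclic.
std::vector<std::size_t> reversePostOrder(const Cfg& cfg, std::size_t entry) {
    std::vector<std::size_t> order = postOrder(cfg, entry);
    return std::vector<std::size_t>(order.rbegin(), order.rend());
}
```

For a diamond-shaped graph (block 0 branches to 1 and 2, both of which fall into 3), the reverse post-order visits block 0 first and block 3 last, which is the order a forward data-flow analysis would typically want.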


Example: Control-Flow Graphs

    // example: splits basic blocks containing barriers
    //
    for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin();
        bb_it != kernel->cfg()->end(); ++bb_it) {    // iterate over basic blocks

        unsigned int n = 0;
        ir::BasicBlock::InstructionList::iterator inst_it;

        for (inst_it = (bb_it)->instructions.begin();
            inst_it != (bb_it)->instructions.end();
            ++inst_it, n++) {    // iterate over instructions in *bb_it

            const ir::PTXInstruction *inst =
                static_cast< const ir::PTXInstruction *>(*inst_it);

            if (inst->opcode == ir::PTXInstruction::Bar) {
                if (n + 1 < (unsigned int)(bb_it)->instructions.size()) {

                    std::string label = (bb_it)->label + "_bar";

                    // split block containing bar.sync so that it's always
                    // the last instruction in a block
                    kernel->cfg()->split_block(bb_it, n+1,
                        ir::BasicBlock::Edge::FallThrough, label);
                }
                break;
            }
        } // end for (inst_it)
    } // end for (bb_it)
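The same barrier-splitting idea can be tried without Ocelot on a toy data structure. The sketch below is an assumption-laden stand-in: `Block` and `splitAtBarriers` are hypothetical names, instructions are plain opcode strings, and fall-through edges are implied by the order of blocks in the vector rather than stored explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy block: a label plus opcode strings.
struct Block {
    std::string label;
    std::vector<std::string> instructions;
};

// Splits every block after each "bar.sync" so that a barrier is always the
// last instruction of its block. New blocks get the suffix "_bar".
std::vector<Block> splitAtBarriers(const std::vector<Block>& blocks) {
    std::vector<Block> result;
    for (const Block& block : blocks) {
        assert(!block.label.empty());
        Block current{block.label, {}};
        for (std::size_t i = 0; i < block.instructions.size(); ++i) {
            current.instructions.push_back(block.instructions[i]);
            bool isBarrier = block.instructions[i] == "bar.sync";
            bool isLast = i + 1 == block.instructions.size();
            if (isBarrier && !isLast) {
                result.push_back(current);
                current = Block{current.label + "_bar", {}};
            }
        }
        result.push_back(current);
    }
    return result;
}
```

Unlike the loop above, which handles the first barrier in a block and then moves on to the newly created block, this sketch handles every barrier of a block in one sweep; both approaches end with each barrier terminating its block.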


Example: Spilling Live Values

    // ocelot/analysis/implementation/RemoveBarrierPass.cpp
    //
    void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
        const DataflowGraph::Block::RegisterSet& alive )
    {
        unsigned int bytes = 0;

        ir::PTXInstruction move ( ir::PTXInstruction::Mov );
        move.type = ir::PTXOperand::u64;
        move.a.identifier = "__ocelot_remove_barrier_pass_stack";
        move.a.addressMode = ir::PTXOperand::Address;
        move.a.type = ir::PTXOperand::u64;
        move.d.reg = _kernel->dfg()->newRegister();
        move.d.addressMode = ir::PTXOperand::Register;
        move.d.type = ir::PTXOperand::u64;

        _kernel->dfg()->insert( block, move, block->instructions().size() - 1 );

        ...

        for( DataflowGraph::Block::RegisterSet::const_iterator reg = alive.begin();
            reg != alive.end(); ++reg )
        {
            ir::PTXInstruction save( ir::PTXInstruction::St );
            save.type = reg->type;
            save.addressSpace = ir::PTXInstruction::Local;
            save.d.addressMode = ir::PTXOperand::Indirect;
            save.d.reg = move.d.reg;
            save.d.type = ir::PTXOperand::u64;
            save.d.offset = bytes;

            bytes += ir::PTXOperand::bytes( save.type );

            save.a.addressMode = ir::PTXOperand::Register;
            save.a.type = reg->type;
            save.a.reg = reg->id;

            _kernel->dfg()->insert( block, save, block->instructions().size() - 1 );
        }

        _spillBytes = std::max( bytes, _spillBytes );
    }
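The offset bookkeeping in the spill loop (a running byte count that becomes each store's offset, and a maximum kept across calls) can be isolated in a few lines. The sketch below is a simplified, hypothetical model: `LiveRegister`, `SpillSlot`, and `assignSpillSlots` are not Ocelot names, and register types are reduced to a size in bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical live register: an id and the size of its type in bytes.
struct LiveRegister {
    unsigned int id;
    unsigned int bytes;
};

struct SpillSlot {
    unsigned int id;
    unsigned int offset;   // byte offset from the base of the spill area
};

// Assigns each live register a slot by accumulating a running byte offset.
// `spillBytes` tracks the largest spill area needed across calls.
std::vector<SpillSlot> assignSpillSlots(const std::vector<LiveRegister>& alive,
                                        unsigned int& spillBytes) {
    std::vector<SpillSlot> slots;
    unsigned int bytes = 0;
    for (const LiveRegister& reg : alive) {
        assert(reg.bytes > 0);
        slots.push_back(SpillSlot{reg.id, bytes});
        bytes += reg.bytes;
    }
    spillBytes = std::max(bytes, spillBytes);
    return slots;
}
```

For live registers of 4, 8, and 4 bytes the slots land at offsets 0, 4, and 12, and the spill area grows to 16 bytes. The sketch packs values back to back and does not model alignment.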


IR for AMD and LLVM

LLVM IR
• Implements all of the LLVM instruction set

AMD IL
• Supports translation from PTX to AMD interface

AMD Backend: R. Domingo & D. Kaeli (NEU)




PTX Pass Manager

• Orchestrates analysis and transformation passes
• Derived from the LLVM model
  • Analysis passes generate meta-data
  • Meta-data is consumed by transformations
  • Transformation passes modify the IR


Using the Pass Manager

• Passes are added to a manager
• The manager schedules execution
• The manager manages meta-data
  • Ensures meta-data is available
  • Keeps it up to date and avoids computing it redundantly
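The meta-data bookkeeping described here (compute on demand, reuse while valid, invalidate after a transformation changes the IR) can be sketched in miniature. This is not Ocelot's pass manager: `ToyPassManager`, `ToyKernel`, and the single "instruction count" analysis are hypothetical simplifications chosen to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy "kernel": just a list of instruction strings.
typedef std::vector<std::string> ToyKernel;

class ToyPassManager {
public:
    explicit ToyPassManager(ToyKernel& kernel) : kernel_(kernel), analysisRuns_(0) {}

    // Analysis: returns cached meta-data, recomputing it only when missing.
    std::size_t instructionCount() {
        if (cache_.count("count") == 0) {
            cache_["count"] = kernel_.size();
            ++analysisRuns_;
        }
        assert(cache_["count"] == kernel_.size());   // cache must never be stale
        return cache_["count"];
    }

    // Transformation: consumes meta-data, modifies the IR, then invalidates
    // the meta-data it may have made stale.
    void removeNops() {
        if (instructionCount() == 0) return;
        ToyKernel kept;
        for (const std::string& inst : kernel_) {
            if (inst != "nop") kept.push_back(inst);
        }
        bool changed = kept.size() != kernel_.size();
        kernel_.swap(kept);
        if (changed) cache_.clear();
    }

    int analysisRuns() const { return analysisRuns_; }

private:
    ToyKernel& kernel_;
    std::map<std::string, std::size_t> cache_;
    int analysisRuns_;
};
```

Querying the count twice runs the analysis once; after `removeNops` changes the kernel, the next query recomputes it. A real manager would track many analyses and their dependencies, but the caching and invalidation pattern is the same in spirit.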


Analysis Passes

• Analysis runs over the PTX IR
  • Generates meta-data
  • Possibly updates or invalidates existing meta-data
• Examples
  • Data-flow graph
  • Dominator and post-dominator trees
  • Thread frontiers


Analysis Passes

• Control Flow Graph
  • ir/interface/ControlFlowGraph.h
• Data Flow Graph
  • analysis/interface/DataflowGraph.h
• Dominator and Post-Dominator Trees
  • analysis/interface/DominatorTree.h
  • analysis/interface/PostDominatorTree.h
• Superblock Analysis
  • analysis/interface/SuperblockAnalysis.h
• Divergence Graph
  • analysis/interface/DivergenceGraph.h
• Thread Frontiers
  • analysis/interface/ThreadFrontiers.h


Transformation Passes

• Modify the PTX IR
• Consume meta-data
• Examples:
  • Dead-code elimination
    • transforms/interface/DeadCodeEliminationPass.h
  • Control-flow structuring
    • transforms/interface/StructuralTransform.h
  • Sync elimination
    • transforms/interface/SyncElimination.h
  • Dynamic instrumentation


Example: Dead Code Elimination Transformation Pass


Dead Code Elimination

• Approach
  • Run once on each kernel
  • Consume data-flow analysis meta-data
  • Delete instructions producing values with no users
• Implementation
  • transforms/interface/DeadCodeEliminationPass.h
  • transforms/implementation/DeadCodeEliminationPass.cpp


Dead Code Elimination (1 of 5)

• Set up pass dependencies

    DeadCodeEliminationPass::DeadCodeEliminationPass()
        : KernelPass(Analysis::DataflowGraphAnalysis
            | Analysis::StaticSingleAssignment,
            "DeadCodeEliminationPass")
    {

    }


Dead Code Elimination (2 of 5)

• Run pass
• Get analysis metadata

    void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k)
    {
        Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis);
        assert(dfgAnalysis != 0);

        // cast up
        analysis::DataflowGraph& dfg =
            *static_cast<analysis::DataflowGraph*>(dfgAnalysis);
        assert(dfg.ssa());

        ...


Dead Code Elimination (3 of 5)

• Loop until no further changes

    BlockSet blocks;
    for (iterator block = dfg.begin(); block != dfg.end(); ++block)
    {
        report(" Queueing up BB_" << block->id());
        blocks.insert(block);
    }

    while (!blocks.empty())
    {
        iterator block = *blocks.begin();
        blocks.erase(blocks.begin());
        eliminateDeadInstructions(dfg, blocks, block);
    }


Dead Code Elimination (4 of 5)

• Remove unused live-out values

    AliveKillList aliveOutKillList;
    for (RegisterSet::iterator aliveOut = block->aliveOut().begin();
        aliveOut != block->aliveOut().end(); ++aliveOut)
    {
        if (canRemoveAliveOut(dfg, block, *aliveOut))
        {
            report(" removed " << aliveOut->id);
            aliveOutKillList.push_back(aliveOut);
        }
    }
    for (AliveKillList::iterator killed = aliveOutKillList.begin();
        killed != aliveOutKillList.end(); ++killed)
    {
        block->aliveOut().erase(*killed);
    }


Dead Code Elimination (5 of 5)

• Check if an instruction can be removed

    if (ptx.hasSideEffects()) return false;

    for (RegisterPointerVector::iterator reg = instruction->d.begin();
        reg != instruction->d.end(); ++reg)
    {
        // the reg is alive outside the block
        if (block->aliveOut().count(*reg) != 0) return false;

        InstructionVector::iterator next = instruction;
        for (++next; next != block->instructions().end(); ++next)
        {
            for (RegisterPointerVector::iterator source = next->s.begin();
                source != next->s.end(); ++source)
            {
                // found a user in the block
                if (*source->pointer == *reg->pointer) return false;
            }
        }
    }
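The removal test shown across these slides (no side effects, destination not live-out, no later user in the block) can be exercised end to end on a toy single-block IR. The sketch below is a simplified stand-in rather than the DeadCodeEliminationPass itself: `ToyInstruction` and `eliminateDeadCode` are hypothetical names, each instruction has at most one destination, and the worklist over blocks is replaced by a simple repeat-until-stable loop.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy instruction: one destination, any number of sources, and
// a flag for side effects (stores, barriers, branches).
struct ToyInstruction {
    std::string dest;
    std::vector<std::string> sources;
    bool hasSideEffects;
};

// Removes instructions whose result is never read later in the block and is
// not live-out. Repeats until nothing changes, because deleting one
// instruction can make its producers dead as well.
std::vector<ToyInstruction> eliminateDeadCode(std::vector<ToyInstruction> block,
                                              const std::set<std::string>& aliveOut) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < block.size(); ++i) {
            if (block[i].hasSideEffects) continue;
            assert(!block[i].dest.empty());   // a pure instruction must define a value
            if (aliveOut.count(block[i].dest) != 0) continue;
            bool used = false;
            for (std::size_t j = i + 1; j < block.size() && !used; ++j) {
                for (const std::string& source : block[j].sources) {
                    if (source == block[i].dest) { used = true; break; }
                }
            }
            if (!used) {
                block.erase(block.begin() + static_cast<std::ptrdiff_t>(i));
                changed = true;
                break;   // indices shifted: restart the scan
            }
        }
    }
    return block;
}
```

In a block where %r4 depends on %r3 and neither feeds a store or is live-out, the first sweep removes %r4 and the second removes %r3, which is why the loop has to iterate rather than make a single pass.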


Running Passes on PTX

• Static optimizer
  • PTXOptimizer
  • Runs passes on PTX assembly files
  • ocelot/tools/PTXOptimizer.cpp
• JIT optimization
  • Runs passes before kernels are launched
  • ocelot/api/implementation/OcelotRuntime.cpp


Questions

GPU Ocelot
• Google Code site: http://code.google.com/p/gpuocelot
• Research Project site: http://gpuocelot.gatech.edu
• Mailing list: [email protected]

Contributors: Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Ashwin Lele, Si Li, Tri Pho, Jin Wang, Haicheng Wu, Sudhakar Yalamanchili

Sponsors: AMD, IBM, Intel, LogicBlox, NSF, NVIDIA
