Official Use Only                                                      11/19/13

Kokkos: The Tutorial, alpha+1 version

The Kokkos Team: Carter Edwards, Christian Trott, Dan Sunderland

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Introduction

What this tutorial is:
• Introduction to Kokkos' main API features
• A list of example codes (valid Kokkos programs)
• Incrementally increasing complexity

What this tutorial is NOT:
• An introduction to parallel programming
• An exhaustive presentation of Kokkos features
• A performance comparison of Kokkos with other approaches

What you should know:
• C++ (a bit of experience with templates helps)
• General parallel programming concepts

Where the code can be found:
• Trilinos/packages/kokkos/example/tutorial

Compilation:
• make all CUDA=yes/no -j 8
A Note on Devices
• Use of Kokkos in applications has informed interface changes
• Most Kokkos changes are already reflected in the tutorial material
• Not yet: split of Device into ExecutionSpace and MemorySpace
• For this tutorial a Device fulfills a dual role: it is either a MemorySpace or an ExecutionSpace
Kokkos::Cuda is used as a MemorySpace (GPU memory)

• The lambda interface requires C++11
• It is not currently supported on GPUs
  • expected for NVIDIA in March 2015
  • early access for NVIDIA probably fall 2014
  • not sure about AMD
• The lambda interface does not support all features
  • use it for the simple cases
  • currently always dispatches to the default Device type
  • reductions only on POD types with += and default initialization
  • parallel_scan operation not supported
  • shared memory for teams (scratch-pad) not supported
  • it is not obvious which limitations will stay in the future, but some will
01_HelloWorld
#include <Kokkos_Core.hpp>
#include <cstdio>

// A minimal functor with just an operator().
// That operator will be called in parallel.
struct hello_world {
  KOKKOS_INLINE_FUNCTION
  void operator() (const int& i) const {
    printf("Hello World %i\n", i);
  }
};

int main() {
  // Initialize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::initialize();

  // Run the functor with 15 iterations in parallel
  // on DefaultDeviceType.
  Kokkos::parallel_for(15, hello_world());

  // Finalize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::finalize();
}
#include <Kokkos_Core.hpp>
#include <cstdio>

int main() {
  // Initialize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::initialize();

  // Run the lambda with 15 iterations in parallel on
  // DefaultDeviceType. Capture values from the
  // enclosing scope by copy [=].
  Kokkos::parallel_for(15, [=] (const int& i) {
    printf("Hello World %i\n", i);
  });

  // Finalize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::finalize();
}
• Kokkos Devices need to be initialized (start up reference counting, reserve the GPU, etc.)
• Kokkos::initialize() does that for the DefaultDeviceType, which depends on your configuration (e.g., whether Cuda or OpenMP is enabled)
• parallel_for is used to dispatch work to threads or a GPU
• By default parallel_for dispatches work to DefaultDeviceType
#include <Kokkos_Core.hpp>
#include <cstdio>

struct squaresum {
  // For reductions operator() has a different
  // interface than for parallel_for.
  // The lsum parameter must be passed by reference.
  // By default lsum is initialized with int() and
  // combined with +=.
  KOKKOS_INLINE_FUNCTION
  void operator() (int i, int& lsum) const {
    lsum += i*i;
  }
};

int main() {
  Kokkos::initialize();

  int sum = 0;
  // sum can be any type that defines += and a
  // default constructor.
  // sum must have the same type as the second
  // argument of the functor's operator().
  Kokkos::parallel_reduce(10, squaresum(), sum);
  printf("Sum of first %i square numbers %i\n", 9, sum);

  Kokkos::finalize();
}
#include <Kokkos_Core.hpp>
#include <cstdio>

int main() {
  Kokkos::initialize();

  int sum = 0;
  // sum can be any type that defines += and a
  // default constructor.
  // sum must have the same type as the second
  // argument of the lambda.
  // By default lsum is initialized with the default
  // constructor and combined with +=.
  Kokkos::parallel_reduce(10, [=] (int i, int& lsum) {
    lsum += i*i;
  }, sum);

  printf("Sum of first %i square numbers %i\n", 9, sum);
  Kokkos::finalize();
}
• Kokkos parallel_reduce offers deterministic reductions (same order of operations each time)
• By default the reduction sets the initial value to zero (default constructor) and uses += to combine values, but the functor interface can be used to define specialized init and join functions
#include <Kokkos_Core.hpp>
#include <cstdio>

// A simple 2D array (rank==2) with one compile-time dimension.
// It uses DefaultDeviceType as its memory space and the default
// layout associated with it (typically LayoutLeft or LayoutRight).
// The view does not use any special access traits.
// By default a view of this type will be reference counted.
typedef Kokkos::View<double*[3]> view_type;

int main() {
  Kokkos::initialize();

  // Allocate a view with the runtime dimension set to 10 and the label "A".
  // The label is used in debug output and error messages.
  view_type a("A", 10);

  // The view a is passed to the parallel dispatch by copy, which is
  // important if the execution space cannot access the default HostSpace
  // directly (or only slowly), e.g. on GPUs.
  // Note: the underlying allocation is not moved; only metadata such as
  // pointers and shape information is copied.
  Kokkos::parallel_for(10, [=] (int i) {
    // Read and write access to data comes via operator()
    a(i,0) = 1.0*i;
    a(i,1) = 1.0*i*i;
    a(i,2) = 1.0*i*i*i;
  });

  double sum = 0;
  Kokkos::parallel_reduce(10, [=] (int i, double& lsum) {
    lsum += a(i,0)*a(i,1)/(a(i,2)+0.1);
  }, sum);
  printf("Result %lf\n", sum);

  Kokkos::finalize();
}
• Kokkos::View: multi-dimensional array (up to 8 dimensions)
• Default layout (row- or column-major) depends on the Device
• Hooks for current- and next-generation memory architecture features
04_SimpleMemorySpaces
#include <Kokkos_Core.hpp>
#include <cstdio>

typedef Kokkos::View<double*[3]> view_type;
// HostMirror is a view with the same layout / padding as its parent type,
// but in the host memory space. This memory space can be the same as the
// device memory space, for example when running on CPUs.
typedef view_type::HostMirror host_view_type;

struct squaresum {
  view_type a;
  squaresum(view_type a_) : a(a_) {}

  KOKKOS_INLINE_FUNCTION
  void operator() (int i, int& lsum) const {
    lsum += a(i,0) - a(i,1) + a(i,2);
  }
};

int main() {
  Kokkos::initialize();
  view_type a("A", 10);

  // Create an allocation with the same dimensions as a in the host
  // memory space. If the memory space of view_type and its HostMirror
  // are the same, the mirror view won't allocate, and both views will
  // have the same pointer. In that case, deep copies do nothing.
  host_view_type h_a = Kokkos::create_mirror_view(a);

  // Transfer data from h_a to a. This does nothing if both views
  // reference the same data.
  Kokkos::deep_copy(a, h_a);

  int sum = 0;
  Kokkos::parallel_reduce(10, squaresum(a), sum);
  printf("Result is %i\n", sum);

  Kokkos::finalize();
}
• Views live in a MemorySpace (an abstraction for possibly manually managed memory hierarchies)
• Deep copies between MemorySpaces are always explicit ("expensive things are always explicit")
#include <Kokkos_Core.hpp>
#include <cstdio>

// Define View types used in the code
typedef Kokkos::View<int*> view_type;
typedef Kokkos::View<int> count_type;

// A functor to find prime numbers. Append all
// primes in 'data_' to the end of the 'result_'
// array. 'count_' is the index of the first open
// spot in 'result_'.
struct findprimes {
  view_type data_;
  view_type result_;
  count_type count_;

  // operator() to be called in parallel_for.
  KOKKOS_INLINE_FUNCTION
  void operator() (int i) const {
    // Is data_(i) a prime number?
    const int number = data_(i);
    const int upper_bound = sqrt(1.0*number) + 1;
    bool is_prime = !(number % 2 == 0);
    int k = 3;
    while (k < upper_bound && is_prime) {
      is_prime = !(number % k == 0);
      k += 2;
    }

    if (is_prime) {
      // 'number' is a prime, so append it to the
      // result_ array. Find and increment the position
      // of the last entry by using a fetch-and-add
      // atomic operation.
      int idx = Kokkos::atomic_fetch_add(&count_(), 1);
      result_(idx) = number;
    }
  }
};
• Atomics make updating a single memory location (<= 64 bits) thread-safe
• Kokkos provides: fetch-and-add, fetch-bitwise-or, fetch-bitwise-and, fetch-exchange, fetch-compare-exchange (more can be implemented if needed)
• Performance of atomics depends on the hardware and on how many atomic operations hit the same address at the same time
• If the atomic density is too high, explore different algorithms
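The fetch-and-add append pattern from the functor above can be reproduced in standard C++11, with std::atomic standing in for Kokkos::atomic_fetch_add (`parallel_append_evens` is an illustrative name, not tutorial code). Each append reserves a unique slot: fetch_add returns the old counter value and increments it in one indivisible step.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Collect the even numbers in [0, n) from two threads,
// appending via atomic fetch-and-add.
std::vector<int> parallel_append_evens(int n) {
  std::atomic<int> count(0);
  std::vector<int> result(n, -1);

  auto worker = [&](int begin, int end) {
    for (int i = begin; i < end; ++i) {
      if (i % 2 == 0) {
        int idx = count.fetch_add(1);  // old value = my unique slot
        result[idx] = i;
      }
    }
  };

  std::thread t1(worker, 0, n / 2);
  std::thread t2(worker, n / 2, n);
  t1.join();
  t2.join();

  result.resize(count.load());
  // Slots are claimed in nondeterministic order, so sort for a
  // reproducible result.
  std::sort(result.begin(), result.end());
  return result;
}
```

Note that the contents are correct but the append order is not deterministic, which is exactly why the result is sorted before use.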
• Data layouts determine the mapping between indices and memory addresses
• Each ExecutionSpace has a default layout optimized for parallel execution on the first index
• Data layouts can be set via a template parameter in Views
• Kokkos currently provides: LayoutLeft (column-major), LayoutRight (row-major), LayoutStride ([almost] arbitrary strides for each dimension), LayoutTile (like in the MAGMA library)
• Custom layouts can be added with <= 200 lines of code
• Choosing the wrong layout can reduce performance by 2-10x
typedef Kokkos::View<double*> view_type;
// We expect to access these data "randomly" (noncontiguously).
typedef Kokkos::View<const double*, Kokkos::MemoryRandomAccess> view_type_rnd;
typedef Kokkos::View<int**> idx_type;
typedef idx_type::HostMirror idx_type_host;

// Template the functor on the View type to show the performance
// difference with MemoryRandomAccess.
template<class DestType, class SrcType>
struct localsum {
  idx_type::const_type idx;
  DestType dest;
  SrcType src;

  localsum (idx_type idx_, DestType dest_, SrcType src_)
    : idx (idx_), dest (dest_), src (src_) {}
• Memory traits are used to specify usage patterns of Views
• Views with different traits (which are otherwise equal) can usually be assigned to each other
• Examples of memory traits: MemoryManaged, MemoryUnmanaged, MemoryRandomAccess
• Choosing the correct traits can have a significant performance impact if special hardware exists to support a usage pattern (e.g., the texture cache for random access on GPUs)
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg, arg);

  int size = 1000000;
  idx_type idx("Idx", size, 64);
  idx_type_host h_idx = Kokkos::create_mirror_view(idx);
template<class Device>
struct localsum {
  // Define the functor's execution space
  // (overrides the DefaultDeviceType)
  typedef Device device_type;

  // Get view types on the particular Device
  // for which the functor is instantiated
  Kokkos::View<idx_type::const_data_type,
               idx_type::array_layout, Device> idx;
  Kokkos::View<view_type::array_type,
               view_type::array_layout, Device> dest;
  Kokkos::View<view_type::const_data_type,
               view_type::array_layout, Device,
               Kokkos::MemoryRandomAccess> src;

  // Constructor: extract the view on the correct Device from each DualView
  localsum (idx_type dv_idx, view_type dv_dest, view_type dv_src) {
    idx  = dv_idx.template view<Device>();
    dest = dv_dest.template view<Device>();
    src  = dv_src.template view<Device>();
• DualViews manage data transfer between host and device
• You mark a View as modified on host or device; you ask for synchronization (conditional: a copy happens only if the source was marked modified)
• DualView has the same template arguments as View
• To access the View on a specific MemorySpace, you must extract it
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg, arg);

  srand(134231);
  int size = 1000000;

  // Create DualViews. This will allocate on both
  // the device and its host_mirror_device.
  idx_type idx("Idx", size, 64);
  view_type dest("Dest", size);
  view_type src("Src", size);

  // Get a reference to the host view of idx directly
  // (equivalent to idx.view<idx_type::host_mirror_device_type>())
  idx_type::t_host h_idx = idx.h_view;
  for (int i = 0; i < size; i++) {
    for (int j = 0; j < h_idx.dimension_1(); j++)
      h_idx(i,j) = (size + i + (rand()%500 - 250)) % size;
  }

  // Mark idx as modified on the host_mirror_device_type
  // so that a sync to the device will actually move data.
  // The sync happens in the constructor of the functor.
  idx.modify<idx_type::host_mirror_device_type>();

  // Run on the device.
  // This will cause a sync of idx to the device since
  // it is marked as modified on the host.
  Kokkos::Impl::Timer timer;
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx, dest, src));
  Kokkos::fence();
  double sec1_dev = timer.seconds();

  // Run on the host (could be the same as the device).
  // This will cause a sync of dest back to the host.
  // Note that if the Device is CUDA, the data layout
  // will not be optimal on the host, so performance is
  // lower than it would be for a pure host compilation.
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::host_mirror_device_type>(idx, dest, src));
  Kokkos::fence();
  double sec1_host = timer.seconds();

  // sec2_dev / sec2_host come from repeat runs without a
  // sync (elided on this slide).
  printf("Device Time with Sync: %lf without Sync: %lf \n", sec1_dev, sec2_dev);
  printf("Host Time with Sync: %lf without Sync: %lf \n", sec1_host, sec2_host);
template<class Device>
struct localsum {
  // Define the execution space for the functor
  // (overrides the DefaultDeviceType)
  typedef Device device_type;

  // Use the same View types no matter where the
  // functor is executed
  idx_type::const_type idx;
  view_type dest;
  Kokkos::View<view_type::const_data_type,
               view_type::array_layout,
               view_type::device_type,
               Kokkos::MemoryRandomAccess> src;
• NVIDIA provides Unified Virtual Memory on high-end Kepler GPUs: the runtime manages data transfer
• Makes coding easier: pretend there is only one MemorySpace
• But: can come with significant performance penalties if complete allocations are moved frequently
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg, arg);

  // When using UVM, Cuda views can be accessed on the
  // host directly
  for (int i = 0; i < size; i++) {
    for (int j = 0; j < idx.dimension_1(); j++)
      idx(i,j) = (size + i + (rand()%500 - 250)) % size;
  }

  Kokkos::fence();
  // Run on the device.
  // This will cause a sync of idx to the device since
  // it was modified on the host.
  Kokkos::Impl::Timer timer;
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx, dest, src));
  Kokkos::fence();
  double sec1_dev = timer.seconds();

  // No data transfer will happen now, since nothing was
  // accessed on the host.
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx, dest, src));
  Kokkos::fence();
  double sec2_dev = timer.seconds();

  // Run on the host.
  // This will cause a sync of dest back to the host,
  // since it was changed on the device.
  // Compare the runtime here with the dual_view example:
  // dest is copied back in 4k blocks as they are first
  // accessed during the parallel_for. Due to the latency
  // of each memcpy this gives lower effective bandwidth
  // than a manual copy via DualViews.
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type::host_mirror_device_type>(idx, dest, src));
  Kokkos::fence();
  double sec1_host = timer.seconds();

  // No data transfers will happen now.
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type::host_mirror_device_type>(idx, dest, src));
  Kokkos::fence();
  double sec2_host = timer.seconds();

  printf("Device Time with Sync: %lf without Sync: %lf \n", sec1_dev, sec2_dev);
  printf("Host Time with Sync: %lf without Sync: %lf \n", sec1_host, sec2_host);

  Kokkos::finalize();
}
[crtrott@perseus 04_dualviews]$ make CUDA=yes CUDA_UVM=no -j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no
[crtrott@perseus 05_NVIDIA_UVM]$ make CUDA=yes CUDA_UVM=yes -j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no

[crtrott@perseus Advanced_Views]$ 04_dualviews/dual_view.cuda --threads 16 --numa 2
Device Time with Sync: 0.074286 without Sync: 0.004056
Host Time with Sync: 0.038507 without Sync: 0.035801

[crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
Device Time with Sync: 0.368231 without Sync: 0.358703
Host Time with Sync: 0.015760 without Sync: 0.015575

[crtrott@perseus Advanced_Views]$ export CUDA_VISIBLE_DEVICES=0
[crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
Device Time with Sync: 0.068831 without Sync: 0.004065
Host Time with Sync: 0.990998 without Sync: 0.016688
• Running with UVM on multi-GPU machines can cause a fallback to the zero-copy mechanism: all allocations live on the host and are accessed via the PCIe bus. Use CUDA_VISIBLE_DEVICES=k to prevent this.
• When looping through a UVM allocation on the host, data is copied back to the host in 4k blocks. PCIe latency restricts the effective bandwidth to 0.5 GB/s, as opposed to 8 GB/s.
int main(int narg, char* args[]) {
  Kokkos::initialize(narg, args);

  int sum = 0;
  Kokkos::parallel_reduce(
    Kokkos::ParallelWorkRequest(12, device_type::team_max()),
    hello_world(), sum);
  printf("Result %i\n", sum);

  Kokkos::finalize();
}
• Kokkos supports the notion of a "league of thread teams"
• Useful when fine-grained parallelism is exposed: threads need to synchronize or share data with a subset of threads
• On CPUs the best team size is often 1; on Intel Xeon Phi and GPUs, team sizes of 4 and 256 are typical
• The number of teams is not bound by hardware resources: as in CUDA/OpenCL, use an algorithmic number
KOKKOS_INLINE_FUNCTION
void operator() (Device dev) const {
  // If Device is the 1st argument, use scratch-pad memory
  Kokkos::View<int**, Kokkos::MemoryUnmanaged> l_histogram(dev, TS, TS);
  Kokkos::View<int*, Kokkos::MemoryUnmanaged> l_data(dev, chunk_size+1);
Features not presented here:
• Getting a subview of a View
• ParallelScan & TeamScan
• Linear algebra subpackage
• Kokkos::UnorderedMap (thread-scalable hash table)

To learn more, see:
• More complex Kokkos examples
• Mantevo MiniApps (e.g., MiniFE)
• LAMMPS (molecular dynamics code)