Official Use Only                                                      11/19/13

Kokkos: The Tutorial, alpha+1 version

The Kokkos Team: Carter Edwards, Christian Trott, Dan Sunderland

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Introduction

What this tutorial is:
• Introduction to Kokkos' main API features
• A list of example codes (valid Kokkos programs)
• Incrementally increasing complexity

What this tutorial is NOT:
• An introduction to parallel programming
• An exhaustive presentation of Kokkos features
• A performance comparison of Kokkos with other approaches

What you should know:
• C++ (a bit of experience with templates helps)
• General parallel programming concepts

Where the code can be found:
• Trilinos/packages/kokkos/example/tutorial

Compilation:
• make all CUDA=yes/no -j 8
A Note on Devices
• Use of Kokkos in applications has informed interface changes
• Most Kokkos changes are already reflected in the tutorial material
• Not yet: split of Device into ExecutionSpace and MemorySpace
• For this tutorial a Device fulfills a dual role: it is either a MemorySpace or an ExecutionSpace
Kokkos::Cuda is used as a MemorySpace (GPU memory)

• The lambda interface requires C++11
• It is not currently supported on GPUs
  • expected for NVIDIA in March 2015
  • early access for NVIDIA probably fall 2014
  • not sure about AMD
• The lambda interface does not support all features
  • use it for the simple cases
  • currently always dispatches to the default Device type
  • reductions only on POD types with += and default initialization
  • parallel_scan operation not supported
  • shared memory for teams (scratch-pad) not supported
  • it is not obvious which limitations will stay in the future, but some will
01_HelloWorld
#include <Kokkos_Core.hpp>
#include <cstdio>

// A minimal functor with just an operator().
// That operator will be called in parallel.
struct hello_world {
  KOKKOS_INLINE_FUNCTION
  void operator() (const int& i) const {
    printf("Hello World %i\n", i);
  }
};

int main() {
  // Initialize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::initialize();

  // Run the functor with 15 iterations in parallel
  // on DefaultDeviceType.
  Kokkos::parallel_for(15, hello_world());

  // Finalize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::finalize();
}
#include <Kokkos_Core.hpp>
#include <cstdio>

int main() {
  // Initialize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::initialize();

  // Run the lambda with 15 iterations in parallel on
  // DefaultDeviceType. Capture values from the
  // enclosing scope by copy [=].
  Kokkos::parallel_for(15, [=] (const int& i) {
    printf("Hello World %i\n", i);
  });

  // Finalize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::finalize();
}
• Kokkos Devices need to be initialized (start up reference counting, reserve the GPU, etc.)
• Kokkos::initialize() does that for the DefaultDeviceType, which depends on your configuration (e.g., whether Cuda or OpenMP is enabled)
• parallel_for is used to dispatch work to threads or a GPU
• By default parallel_for dispatches work to DefaultDeviceType
#include <Kokkos_Core.hpp>
#include <cstdio>

struct squaresum {
  // For reductions operator() has a different
  // interface than for parallel_for.
  // The lsum parameter must be passed by reference.
  // By default lsum is initialized with int() and
  // combined with +=.
  KOKKOS_INLINE_FUNCTION
  void operator() (int i, int& lsum) const {
    lsum += i*i;
  }
};

int main() {
  Kokkos::initialize();

  int sum = 0;
  // sum can be any type that defines += and a
  // default constructor.
  // sum must have the same type as the second
  // argument of the functor's operator().
  Kokkos::parallel_reduce(10, squaresum(), sum);
  printf("Sum of first %i square numbers %i\n", 9, sum);

  Kokkos::finalize();
}
#include <Kokkos_Core.hpp>
#include <cstdio>

int main() {
  Kokkos::initialize();

  int sum = 0;
  // sum can be any type that defines += and a
  // default constructor.
  // sum must have the same type as the second
  // argument of the lambda.
  // By default lsum is initialized with the default
  // constructor and combined with +=.
  Kokkos::parallel_reduce(10, [=] (int i, int& lsum) {
    lsum += i*i;
  }, sum);

  printf("Sum of first %i square numbers %i\n", 9, sum);
  Kokkos::finalize();
}
• Kokkos parallel_reduce offers deterministic reductions (same order of operations each time)
• By default the reduction sets the initial value to zero (default constructor) and uses += to combine values, but the functor interface can be used to define specialized init and join functions
#include <Kokkos_Core.hpp>
#include <cstdio>

// A simple 2D array (rank==2) with one compile-time dimension.
// It uses DefaultDeviceType as its memory space and the default
// layout associated with it (typically LayoutLeft or LayoutRight).
// The view does not use any special access traits.
// By default a view of this type will be reference counted.
typedef Kokkos::View<double*[3]> view_type;

int main() {
  Kokkos::initialize();

  // Allocate a view with the runtime dimension set to 10 and the label "A".
  // The label is used in debug output and error messages.
  view_type a("A", 10);

  // The view a is passed to the parallel dispatch by copy, which is
  // important if the execution space cannot access the default HostSpace
  // directly (or only slowly), e.g. on GPUs.
  // Note: the underlying allocation is not moved; only metadata such as
  // pointers and shape information is copied.
  Kokkos::parallel_for(10, [=] (int i) {
    // Read and write access to data comes via operator()
    a(i,0) = 1.0*i;
    a(i,1) = 1.0*i*i;
    a(i,2) = 1.0*i*i*i;
  });

  double sum = 0;
  Kokkos::parallel_reduce(10, [=] (int i, double& lsum) {
    lsum += a(i,0)*a(i,1)/(a(i,2)+0.1);
  }, sum);
  printf("Result %lf\n", sum);

  Kokkos::finalize();
}
• Kokkos::View: multi-dimensional array (up to 8 dimensions)
• Default layout (row- or column-major) depends on the Device
• Hooks for current- and next-generation memory architecture features
04_SimpleMemorySpaces
#include <Kokkos_Core.hpp>
#include <cstdio>

typedef Kokkos::View<double*[3]> view_type;
// HostMirror is a view with the same layout / padding as its parent type,
// but in the host memory space. This memory space can be the same as the
// device memory space, for example when running on CPUs.
typedef view_type::HostMirror host_view_type;

struct squaresum {
  view_type a;
  squaresum(view_type a_) : a(a_) {}

  KOKKOS_INLINE_FUNCTION
  void operator() (int i, int& lsum) const {
    lsum += a(i,0) - a(i,1) + a(i,2);
  }
};

int main() {
  Kokkos::initialize();
  view_type a("A", 10);

  // Create an allocation with the same dimensions as a in the host
  // memory space. If the memory space of view_type and its HostMirror
  // are the same, the mirror view won't allocate, and both views will
  // have the same pointer. In that case, deep copies do nothing.
  host_view_type h_a = Kokkos::create_mirror_view(a);

  // Transfer data from h_a to a. This does nothing if both views
  // reference the same data.
  Kokkos::deep_copy(a, h_a);

  int sum = 0;
  Kokkos::parallel_reduce(10, squaresum(a), sum);
  printf("Result is %i\n", sum);

  Kokkos::finalize();
}
• Views live in a MemorySpace (an abstraction for possibly manually managed memory hierarchies)
• Deep copies between MemorySpaces are always explicit ("expensive things are always explicit")
#include <Kokkos_Core.hpp>
#include <cstdio>

// Define View types used in the code
typedef Kokkos::View<int*> view_type;
typedef Kokkos::View<int> count_type;

// A functor to find prime numbers. Append all
// primes in 'data_' to the end of the 'result_'
// array. 'count_' is the index of the first open
// spot in 'result_'.
struct findprimes {
  view_type data_;
  view_type result_;
  count_type count_;

  // operator() to be called in parallel_for.
  KOKKOS_INLINE_FUNCTION
  void operator() (int i) const {
    // Is data_(i) a prime number?
    const int number = data_(i);
    const int upper_bound = sqrt(1.0*number) + 1;
    bool is_prime = !(number % 2 == 0);
    int k = 3;
    while (k < upper_bound && is_prime) {
      is_prime = !(number % k == 0);
      k += 2;
    }

    if (is_prime) {
      // 'number' is a prime, so append it to the
      // result_ array. Find and increment the position
      // of the last entry by using a fetch-and-add
      // atomic operation.
      int idx = Kokkos::atomic_fetch_add(&count_(), 1);
      result_(idx) = number;
    }
  }
};
• Atomics make updating a single memory location (<= 64 bits) thread-safe
• Kokkos provides: fetch-and-add, fetch-bitwise-or, fetch-bitwise-and, fetch-exchange, fetch-compare-exchange (more can be implemented if needed)
• Performance of atomics depends on the hardware and on how many atomic operations hit the same address at the same time
• If the atomic density is too high, explore different algorithms
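The fetch-and-add append pattern from the functor above can be reproduced in standard C++11, with std::atomic standing in for Kokkos::atomic_fetch_add (`parallel_append_evens` is an illustrative name, not tutorial code). Each append reserves a unique slot: fetch_add returns the old counter value and increments it in one indivisible step.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Collect the even numbers in [0, n) from two threads,
// appending via atomic fetch-and-add.
std::vector<int> parallel_append_evens(int n) {
  std::atomic<int> count(0);
  std::vector<int> result(n, -1);

  auto worker = [&](int begin, int end) {
    for (int i = begin; i < end; ++i) {
      if (i % 2 == 0) {
        int idx = count.fetch_add(1);  // old value = my unique slot
        result[idx] = i;
      }
    }
  };

  std::thread t1(worker, 0, n / 2);
  std::thread t2(worker, n / 2, n);
  t1.join();
  t2.join();

  result.resize(count.load());
  // Slots are claimed in nondeterministic order, so sort for a
  // reproducible result.
  std::sort(result.begin(), result.end());
  return result;
}
```

Note that the contents are correct but the append order is not deterministic, which is exactly why the result is sorted before use.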
• Data layouts determine the mapping between indices and memory addresses
• Each ExecutionSpace has a default layout optimized for parallel execution on the first index
• Data layouts can be set via a template parameter in Views
• Kokkos currently provides: LayoutLeft (column-major), LayoutRight (row-major), LayoutStride ([almost] arbitrary strides for each dimension), LayoutTile (like in the MAGMA library)
• Custom layouts can be added with <= 200 lines of code
• Choosing the wrong layout can reduce performance by 2-10x
typedef Kokkos::View<double*> view_type;
// We expect to access these data "randomly" (noncontiguously).
typedef Kokkos::View<const double*, Kokkos::MemoryRandomAccess> view_type_rnd;
typedef Kokkos::View<int**> idx_type;
typedef idx_type::HostMirror idx_type_host;

// Template the functor on the View type to show the performance
// difference with MemoryRandomAccess.
template<class DestType, class SrcType>
struct localsum {
  idx_type::const_type idx;
  DestType dest;
  SrcType src;

  localsum (idx_type idx_, DestType dest_, SrcType src_)
    : idx (idx_), dest (dest_), src (src_) {}
• Memory traits are used to specify usage patterns of Views
• Views with different traits (which are otherwise equal) can usually be assigned to each other
• Examples of memory traits: MemoryManaged, MemoryUnmanaged, MemoryRandomAccess
• Choosing the correct traits can have a significant performance impact if special hardware exists to support a usage pattern (e.g., the texture cache for random access on GPUs)
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg, arg);

  int size = 1000000;
  idx_type idx("Idx", size, 64);
  idx_type_host h_idx = Kokkos::create_mirror_view(idx);
template<class Device>
struct localsum {
  // Define the functor's execution space
  // (overrides the DefaultDeviceType)
  typedef Device device_type;

  // Get view types on the particular Device
  // for which the functor is instantiated
  Kokkos::View<idx_type::const_data_type,
               idx_type::array_layout, Device> idx;
  Kokkos::View<view_type::array_type,
               view_type::array_layout, Device> dest;
  Kokkos::View<view_type::const_data_type,
               view_type::array_layout, Device,
               Kokkos::MemoryRandomAccess> src;

  // Constructor: extract the view on the correct Device from each DualView
  localsum (idx_type dv_idx, view_type dv_dest, view_type dv_src) {
    idx  = dv_idx.template view<Device>();
    dest = dv_dest.template view<Device>();
    src  = dv_src.template view<Device>();
• DualViews manage data transfer between host and device
• You mark a View as modified on host or device; you ask for synchronization (conditional: a copy happens only if the source was marked modified)
• DualView has the same template arguments as View
• To access the View on a specific MemorySpace, you must extract it
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg, arg);

  srand(134231);
  int size = 1000000;

  // Create DualViews. This will allocate on both
  // the device and its host_mirror_device.
  idx_type idx("Idx", size, 64);
  view_type dest("Dest", size);
  view_type src("Src", size);

  // Get a reference to the host view of idx directly
  // (equivalent to idx.view<idx_type::host_mirror_device_type>())
  idx_type::t_host h_idx = idx.h_view;
  for (int i = 0; i < size; i++) {
    for (int j = 0; j < h_idx.dimension_1(); j++)
      h_idx(i,j) = (size + i + (rand()%500 - 250)) % size;
  }

  // Mark idx as modified on the host_mirror_device_type
  // so that a sync to the device will actually move data.
  // The sync happens in the constructor of the functor.
  idx.modify<idx_type::host_mirror_device_type>();

  // Run on the device.
  // This will cause a sync of idx to the device since
  // it is marked as modified on the host.
  Kokkos::Impl::Timer timer;
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx, dest, src));
  Kokkos::fence();
  double sec1_dev = timer.seconds();

  // Run on the host (could be the same as the device).
  // This will cause a sync of dest back to the host.
  // Note that if the Device is CUDA, the data layout
  // will not be optimal on the host, so performance is
  // lower than it would be for a pure host compilation.
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::host_mirror_device_type>(idx, dest, src));
  Kokkos::fence();
  double sec1_host = timer.seconds();

  // sec2_dev / sec2_host come from repeat runs without a
  // sync (elided on this slide).
  printf("Device Time with Sync: %lf without Sync: %lf \n", sec1_dev, sec2_dev);
  printf("Host Time with Sync: %lf without Sync: %lf \n", sec1_host, sec2_host);
template<class Device>
struct localsum {
  // Define the execution space for the functor
  // (overrides the DefaultDeviceType)
  typedef Device device_type;

  // Use the same View types no matter where the
  // functor is executed
  idx_type::const_type idx;
  view_type dest;
  Kokkos::View<view_type::const_data_type,
               view_type::array_layout,
               view_type::device_type,
               Kokkos::MemoryRandomAccess> src;
• NVIDIA provides Unified Virtual Memory on high-end Kepler GPUs: the runtime manages data transfer
• Makes coding easier: pretend there is only one MemorySpace
• But: can come with significant performance penalties if complete allocations are moved frequently
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg, arg);

  // When using UVM, Cuda views can be accessed on the
  // host directly
  for (int i = 0; i < size; i++) {
    for (int j = 0; j < idx.dimension_1(); j++)
      idx(i,j) = (size + i + (rand()%500 - 250)) % size;
  }

  Kokkos::fence();
  // Run on the device.
  // This will cause a sync of idx to the device since
  // it was modified on the host.
  Kokkos::Impl::Timer timer;
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx, dest, src));
  Kokkos::fence();
  double sec1_dev = timer.seconds();

  // No data transfer will happen now, since nothing was
  // accessed on the host.
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx, dest, src));
  Kokkos::fence();
  double sec2_dev = timer.seconds();

  // Run on the host.
  // This will cause a sync of dest back to the host,
  // since it was changed on the device.
  // Compare the runtime here with the dual_view example:
  // dest is copied back in 4k blocks as they are first
  // accessed during the parallel_for. Due to the latency
  // of each memcpy this gives lower effective bandwidth
  // than a manual copy via DualViews.
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type::host_mirror_device_type>(idx, dest, src));
  Kokkos::fence();
  double sec1_host = timer.seconds();

  // No data transfers will happen now.
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type::host_mirror_device_type>(idx, dest, src));
  Kokkos::fence();
  double sec2_host = timer.seconds();

  printf("Device Time with Sync: %lf without Sync: %lf \n", sec1_dev, sec2_dev);
  printf("Host Time with Sync: %lf without Sync: %lf \n", sec1_host, sec2_host);

  Kokkos::finalize();
}
[crtrott@perseus 04_dualviews]$ make CUDA=yes CUDA_UVM=no -j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no
[crtrott@perseus 05_NVIDIA_UVM]$ make CUDA=yes CUDA_UVM=yes -j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no

[crtrott@perseus Advanced_Views]$ 04_dualviews/dual_view.cuda --threads 16 --numa 2
Device Time with Sync: 0.074286 without Sync: 0.004056
Host Time with Sync: 0.038507 without Sync: 0.035801

[crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
Device Time with Sync: 0.368231 without Sync: 0.358703
Host Time with Sync: 0.015760 without Sync: 0.015575

[crtrott@perseus Advanced_Views]$ export CUDA_VISIBLE_DEVICES=0
[crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
Device Time with Sync: 0.068831 without Sync: 0.004065
Host Time with Sync: 0.990998 without Sync: 0.016688
• Running with UVM on multi-GPU machines can cause a fallback to the zero-copy mechanism: all allocations live on the host and are accessed via the PCIe bus. Use CUDA_VISIBLE_DEVICES=k to prevent this.
• When looping through a UVM allocation on the host, data is copied back to the host in 4k blocks. PCIe latency restricts the effective bandwidth to 0.5 GB/s, as opposed to 8 GB/s.
int main(int narg, char* args[]) {
  Kokkos::initialize(narg, args);

  int sum = 0;
  Kokkos::parallel_reduce(
    Kokkos::ParallelWorkRequest(12, device_type::team_max()),
    hello_world(), sum);
  printf("Result %i\n", sum);

  Kokkos::finalize();
}
• Kokkos supports the notion of a "league of thread teams"
• Useful when fine-grained parallelism is exposed: threads need to synchronize or share data with a subset of threads
• On CPUs the best team size is often 1; on Intel Xeon Phi and GPUs, team sizes of 4 and 256 are typical
• The number of teams is not bound by hardware resources: as in CUDA/OpenCL, use an algorithmic number
KOKKOS_INLINE_FUNCTION
void operator() (Device dev) const {
  // If Device is the 1st argument, use scratch-pad memory
  Kokkos::View<int**, Kokkos::MemoryUnmanaged> l_histogram(dev, TS, TS);
  Kokkos::View<int*, Kokkos::MemoryUnmanaged> l_data(dev, chunk_size+1);
Features not presented here:
• Getting a subview of a View
• ParallelScan & TeamScan
• Linear algebra subpackage
• Kokkos::UnorderedMap (thread-scalable hash table)

To learn more, see:
• More complex Kokkos examples
• Mantevo MiniApps (e.g., MiniFE)
• LAMMPS (molecular dynamics code)