How to Deploy AI Software to Self Driving Cars
Rod Burns, Gordon Brown, Meenakshi Ravindran and Nicolas Miller
IWOCL'19 - May 2019
© 2019 Codeplay Software Ltd.
Codeplay - Connecting AI to Silicon
Products
● C++ platform via the SYCL™ open standard, enabling vision & machine learning e.g. TensorFlow™
● The heart of Codeplay's compute technology, enabling OpenCL™, SPIR™, HSA™ and Vulkan™
Addressable Markets
● Automotive (ISO 26262)
● IoT, Smartphones & Tablets
● High Performance Compute (HPC)
● Medical & Industrial
Technologies
● Vision Processing
● Machine Learning
● Artificial Intelligence
● Big Data Compute
Company
● High-performance software solutions for custom heterogeneous systems
● Enabling the toughest processor systems with tools and middleware based on open standards
● Established 2002 in Scotland
● ~70 employees
Agenda
Emergent hardware for AI in automotive
Overview of OpenCL/SYCL programming model
Mapping typical hardware to the OpenCL/SYCL programming model
The Renesas R-Car architecture
Extending OpenCL & SYCL for R-Car
Optimising machine learning algorithms using R-Car
Autonomous driving is one of the biggest challenges in technology
The automotive industry needs to deliver the latest AI technologies with safety, high performance and low power consumption
Delivering an autonomous vehicle is a huge software and hardware challenge
It requires scaling up software development to very high levels of complexity, performance and risk, whilst maintaining low power consumption
Renesas R-Car architecture
● Embedded automotive architecture
● Optimized for computer vision processing and machine learning
● Designed for low latency, low power consumption and low cost
Software stack, from top to bottom: SYCL-BLAS and SYCL-DNN, SYCL, OpenCL
Overview of OpenCL/SYCL programming model
1. A processing element executes a single work-item
2. Each work-item can access private memory, a dedicated memory region for each processing element
3. A compute unit executes a work-group, composed of multiple work-items, one for each processing element in the compute unit
4. Each work-item can access local memory, a dedicated memory region for each compute unit
5. A device can execute multiple work-groups
6. Each work-item can access global memory, a single memory region available to all processing elements
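The hierarchy described above can be emulated on the host in plain C++ to make the roles of the three memory regions concrete. This is a minimal sketch and not OpenCL or SYCL code: the function name `group_sums` is hypothetical, and the sequential loops only stand in for the parallel execution a real device would provide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the execution and memory model (illustrative only).
// Each work-group sums its slice of global memory into one output element.
std::vector<float> group_sums(const std::vector<float>& global_in,
                              std::size_t work_group_size) {
  assert(work_group_size > 0 && global_in.size() % work_group_size == 0);
  const std::size_t num_groups = global_in.size() / work_group_size;
  std::vector<float> global_out(num_groups, 0.0f);  // global memory

  // A device can execute multiple work-groups (here: one after another).
  for (std::size_t group = 0; group < num_groups; ++group) {
    // Local memory: one region per work-group, shared by its work-items.
    std::vector<float> local_mem(work_group_size);

    // Each work-item copies one element from global to local memory.
    for (std::size_t local_id = 0; local_id < work_group_size; ++local_id) {
      // Private memory: a value owned by a single work-item.
      const float private_value =
          global_in[group * work_group_size + local_id];
      local_mem[local_id] = private_value;
    }

    // All work-items of the group have written local memory at this point,
    // so it is safe to read every element of it.
    float sum = 0.0f;
    for (float value : local_mem) sum += value;
    global_out[group] = sum;
  }
  return global_out;
}
```

Calling `group_sums` with eight input elements and a work-group size of four produces two outputs, one per emulated work-group.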
Private memory < Local memory < Global memory
Work-item: private memory
Work-group: local memory, work-group barrier
Kernel: global memory, kernel barrier
Cross-platform, single-source, high-level C++ programming layer
Built on top of OpenCL and based on standard C++11
Delivering a heterogeneous programming solution for C++
__global__ void vec_add(float *a, float *b, float *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  c[i] = a[i] + b[i];
}

float *a, *b, *c;
vec_add<<<range>>>(a, b, c);

vector<float> a, b, c;

#pragma parallel_for
for (int i = 0; i < a.size(); i++) {
  c[i] = a[i] + b[i];
}

cgh.parallel_for<vec_add>(range, [=](cl::sycl::id<2> idx) {
  c[idx] = a[idx] + b[idx];
});

array_view<float> a, b, c;
extent<2> e(64, 64);

parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});
SYCL separates the storage and access of data through the use of buffers and accessors
SYCL provides data dependency tracking based on accessors to optimise the scheduling of tasks
Buffers manage data across the host and one or more devices
Accessors are used to describe access requirements
Buffers and accessors provide type-safe access across host and device
Diagram: a buffer accessed through two accessors, one in command group (CG) A and one in CG B
Diagram: command groups CG A, CG B and CG C access buffers A, B, C and D through read and write accessors, and these accesses determine the dependency graph between the three command groups
Accessors requested within a command group (CG):
● global_buffer accessor: Request access to a buffer in the global memory region
● constant_buffer accessor: Request access to a buffer in the constant memory region
● local accessor: Allocate memory in the local memory region
● host_buffer accessor: Request access to a buffer immediately on the host
Benefits of data dependency task graphs
● Allows you to describe your tasks in terms of relationships
○ Removes the need to enqueue explicit copies
○ Removes the need for complex event handling
● Allows the runtime to make data movement optimizations
○ Preemptively copy data to a device before kernels are executed
○ Avoid unnecessarily copying data back to the host after execution on a device
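To illustrate how a task graph can be derived from access requirements alone, the following plain C++ sketch builds dependencies between command groups from their read and write accessors. It is not how any particular SYCL runtime is implemented: the types `Accessor` and `CommandGroup` and the function `build_dependencies` are hypothetical names, and a real scheduler also has to handle devices, copies and events.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

enum class AccessMode { read, write };

// An accessor names a buffer and states how the command group will use it.
struct Accessor {
  std::string buffer;
  AccessMode mode;
};

// A command group is described here only by its accessors.
struct CommandGroup {
  std::string name;
  std::vector<Accessor> accessors;
};

// For each command group, in submission order, return the indices of the
// earlier command groups it has to wait for.
std::vector<std::set<std::size_t>> build_dependencies(
    const std::vector<CommandGroup>& groups) {
  std::vector<std::set<std::size_t>> deps(groups.size());
  std::map<std::string, std::size_t> last_writer;
  std::map<std::string, std::set<std::size_t>> readers_since_write;

  for (std::size_t i = 0; i < groups.size(); ++i) {
    for (const Accessor& acc : groups[i].accessors) {
      // Every access waits for the most recent write to the same buffer.
      auto writer = last_writer.find(acc.buffer);
      if (writer != last_writer.end()) deps[i].insert(writer->second);
      // A write also waits for all reads issued since that write.
      if (acc.mode == AccessMode::write) {
        for (std::size_t reader : readers_since_write[acc.buffer]) {
          deps[i].insert(reader);
        }
      }
    }
    // Record the accesses of this command group for later ones.
    for (const Accessor& acc : groups[i].accessors) {
      if (acc.mode == AccessMode::write) {
        last_writer[acc.buffer] = i;
        readers_since_write[acc.buffer].clear();
      } else {
        readers_since_write[acc.buffer].insert(i);
      }
    }
  }
  return deps;
}
```

With two command groups that each write a different intermediate buffer and a third that reads both, the third depends on the first two, while the first two are independent and could run concurrently.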
Mapping typical hardware to the OpenCL/SYCL programming model
CPU
1. A CPU has a region of dedicated memory (DDR)
2. The CPU memory is connected to the CPU via a bus
3. A CPU has a number of cores
4. A CPU has a number of caches of different levels
5. Each CPU core has dedicated registers
CPU
1. Lanes of the CPU core SIMD instructions are mapped to work-items
2. CPU registers and their associated caches are mapped to private memory
3. A section of DDR is mapped to local memory
4. The rest of DDR is mapped to global memory
GPU
1. A GPU has a region of dedicated DDR memory which is connected to the CPU
2. A GPU is divided into a number of compute units
3. Each compute unit has dedicated shared memory
4. Each compute unit has a number of processing elements (PE)
5. Each processing element has dedicated processing element local memory (PM)
GPU
Compute units are mapped to the optimal work-group size
Processing elements are mapped to work-items
DDR memory is mapped to global memory
Compute unit shared memory is mapped to local memory
Processing element local memory is mapped to private memory
The Renesas R-Car architecture
CVEngine
1. The CVEngine is connected to off-chip DDR which is connected to the CPU
2. The CVEngine has a number of clusters
3. The CVEngine has a region of on-chip SRAM, also connected to the CPU
4. Each cluster has 4 cores each with a number of processing elements
5. Each core has dedicated registers and can access DDR memory via caches
6. Each core also has dedicated local SRAM
7. The local SRAM is connected to the on-chip SRAM and DDR via DMA
Each cluster maps to the optimal work-group size
Cores provide an extra level of subdivision within work-groups, so each core maps to a sub-group
Sub-groups are available in OpenCL 2.x but not yet available in SYCL, so this will require an extension
Each processing element within a core maps to a single work-item, so each sub-group contains N work-items
Off-chip DDR memory is mapped to global memory
On-chip SRAM memory is mapped to local memory
Registers are mapped to private memory (PM)
Since SRAM can be written to and read from the CPU, and can be accessed by all work-groups in a similar way to global memory, SRAM can also be used to allocate low-latency on-chip buffers
The diagram therefore shows on-chip memory alongside local memory
Since each core has its own dedicated local SRAM, local SRAM can be mapped to a sub-group local memory
Sub-group local memory is not yet available in OpenCL or SYCL, so this will require an extension
Since there are DMA connections from SRAM and DDR to local SRAM, the CVEngine can support asynchronous memory copies from on-chip memory buffers and global memory buffers into sub-group local memory and vice versa
These asynchronous copies cannot be represented in OpenCL or SYCL, so they will require an extension
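The three-level mapping above (cluster to work-group, core to sub-group, processing element to work-item) can be sketched as an index calculation in plain C++. This assumes a one-dimensional range and that work-items are assigned to sub-groups in linear order, which is an implementation choice rather than something stated in the slides; the names `WorkItemPosition` and `locate` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Where a work-item lands in the hierarchy (illustrative only).
struct WorkItemPosition {
  std::size_t work_group;      // which cluster-sized work-group
  std::size_t sub_group;       // which core within that work-group
  std::size_t sub_group_lane;  // which processing element within that core
};

// Split a linear global id into work-group, sub-group and lane indices,
// assuming work-items fill sub-groups in linear order.
WorkItemPosition locate(std::size_t global_id,
                        std::size_t sub_groups_per_group,
                        std::size_t sub_group_size) {
  assert(sub_groups_per_group > 0 && sub_group_size > 0);
  const std::size_t work_group_size = sub_groups_per_group * sub_group_size;
  const std::size_t local_id = global_id % work_group_size;
  return {global_id / work_group_size,
          local_id / sub_group_size,
          local_id % sub_group_size};
}
```

For example, with 4 sub-groups per work-group (one per core in a cluster) and an assumed sub-group size of 8, global id 37 falls in work-group 1, sub-group 0, lane 5.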
Work-item: private memory
Sub-group: sub-group local memory, sub-group barrier, asynchronous sub-group copies
Work-group: local memory, work-group barrier
Kernel: global memory, on-chip memory, kernel barrier
Extending OpenCL & SYCL for R-Car
Disclaimer
The features that I present here are Codeplay extensions and are not standard SYCL features
Diagram: the CVEngine mapped to work-groups, sub-groups of N work-items, private memory (PM), sub-group local memory, local memory, on-chip memory and global memory
● On-chip memory
○ On-chip memory is allocated in OpenCL/SYCL similarly to regular buffers
■ ComputeAorta (OpenCL) provides an extension API
■ ComputeCpp (SYCL) provides an extension buffer property use_onchip_memory
○ On-chip memory buffers are accessed in OpenCL/SYCL kernels in the same way as regular buffers
class kernel;
using namespace cl::sycl;
{
  queue deviceQueue;
  buffer<float, 1> onchipBuffer(hostData, size,
      {codeplay::property::buffer::use_onchip_memory(
          codeplay::property::require)});
  deviceQueue.submit([&](handler &cgh){
    auto onchipAcc =
        onchipBuffer.get_access<access::mode::read_write>(cgh);
    cgh.parallel_for<kernel>(range<1>(size), [=](id<1> idx){
      onchipAcc[idx] = onchipAcc[idx] * onchipAcc[idx];
    });
  });
}
We construct a SYCL buffer as normal, but provide the use_onchip_memory buffer property
This property takes an enumeration: either require, which means that the SYCL runtime has to use on-chip memory, or prefer, which means that the SYCL runtime should try to use it
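The difference between require and prefer can be pictured as an allocation policy. The sketch below is plain C++ and does not use the ComputeCpp API: the names `OnchipPolicy`, `Placement` and `place_buffer` are hypothetical, and both the fallback to global memory for prefer and the exception for require are assumptions about how such a policy could behave, not documented runtime behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

enum class OnchipPolicy { require, prefer };
enum class Placement { onchip, global };

// Decide where a buffer of `bytes` is placed, given the amount of free
// on-chip memory (assumed policy, for illustration only).
Placement place_buffer(std::size_t bytes, std::size_t onchip_free,
                       OnchipPolicy policy) {
  assert(bytes > 0);
  if (bytes <= onchip_free) return Placement::onchip;
  if (policy == OnchipPolicy::require) {
    // Assumption: a required on-chip allocation that cannot be satisfied
    // is reported as an error instead of silently falling back.
    throw std::runtime_error("on-chip memory required but not available");
  }
  // Assumption: a preferred on-chip allocation falls back to global memory.
  return Placement::global;
}
```

Under these assumptions, prefer trades predictability of latency for robustness, while require makes an allocation failure visible to the caller.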
● Sub-groups
○ Sub-groups are exposed following the OpenCL 2.x feature and as a natural extension to the SYCL execution model
■ ComputeAorta (OpenCL) provides kernel builtins for querying sub-group info and invoking a sub-group barrier
■ ComputeCpp (SYCL) provides an extension to nd_item to expose a sub_group object, similar to group, which exposes member functions for querying sub-group info and invoking a sub-group barrier
○ The size of sub-groups cannot be specified explicitly by users; it is determined by the implementation
class kernel;
using namespace cl::sycl;
{
  queue deviceQueue;
  buffer<float, 1> deviceBuffer(hostData, size);
  deviceQueue.submit([&](handler &cgh){
    auto deviceAcc =
        deviceBuffer.get_access<access::mode::read_write>(cgh);
    cgh.parallel_for<kernel>(nd_range<1>(range<1>(size), range<1>(32)),
        [=](nd_item<1> ndItem){
      …
      auto subGroup = ndItem.get_sub_group();
      auto subGroupRange = subGroup.get_group_range();
      auto subGroupId = subGroup.get_group_id();
      subGroup.barrier();
      …
    });
  });
}
We query in-kernel sub-group information and invoke sub-group barriers via the sub_group class
The nd_item has a member function called get_sub_group that returns a sub_group object
If an implementation does not support sub-groups, the behaviour of using sub_group is undefined
● Sub-group local memory
○ Sub-group local memory is exposed with extensions which follow the OpenCL/SYCL memory model
■ ComputeAorta (OpenCL) provides a new address space which can be used to allocate sub-group local memory
■ ComputeCpp (SYCL) provides a new accessor access target, access::target::subgroup_local, that behaves similarly to access::target::local
class kernel;
using namespace cl::sycl;
{
  queue deviceQueue;
  buffer<float, 1> deviceBuffer(hostData, size);
  deviceQueue.submit([&](handler &cgh){
    auto deviceAcc =
        deviceBuffer.get_access<access::mode::read_write>(cgh);
    auto subGroupLocalMem = accessor<float, 1, access::mode::read_write,
        access::target::subgroup_local>(cgh, range<1>(32));
    cgh.parallel_for<kernel>(nd_range<1>(range<1>(size), range<1>(32)),
        [=](nd_item<1> ndItem){
      …
      subGroupLocalMem[idx] = ...;
      …
    });
  });
}
We allocate sub-group local memory by constructing an accessor with the subgroup_local access target
● Asynchronous sub-group copies
○ Asynchronous sub-group copies are exposed following the OpenCL/SYCL feature for asynchronous work-group copies
■ ComputeAorta (OpenCL) provides a plane_t type to represent a non-accessible buffer and kernel builtins for invoking asynchronous in-kernel copies between a plane_t and a sub-group local memory allocation
■ ComputeCpp (SYCL) provides a new accessor access target, access::target::plane, and a member function of the sub_group extension, async_sub_group_copy, to perform an asynchronous in-kernel copy from a plane accessor to a sub-group local accessor
class kernel;
using namespace cl::sycl;
{
  queue deviceQueue;
  buffer<float, 1> deviceBuffer(hostData, size);
  deviceQueue.submit([&](handler &cgh){
    auto devicePlane =
        deviceBuffer.get_access<access::mode::read_write,
            access::target::plane>(cgh);
    auto subGroupLocalMem = accessor<float, 1, access::mode::read_write,
        access::target::subgroup_local>(cgh, range<1>(32));
    cgh.parallel_for<kernel>(nd_range<1>(range<1>(size), range<1>(32)),
        [=](nd_item<1> ndItem){
      …
      auto subGroup = ndItem.get_sub_group();
      auto event = subGroup.async_sub_group_copy(subGroupLocalMem,
          devicePlane, range<1>(32));
      …
      event.wait();
    });
  });
}
We construct a plane accessor using the plane access target
We perform asynchronous in-kernel sub-group copies by calling the sub_group member function async_sub_group_copy
This returns a device_event that can be used to wait for the copy to complete
Optimising machine learning algorithms using R-Car
Input Image
Global memory
✔ The entire image will fit into global memory
✘ Global memory has a high access latency
On-chip memory
✔ On-chip memory has a much lower access latency
✘ Only part of the image will fit into on-chip memory at once, so we have to tile it
✘ Executing a kernel per tile incurs host-side overhead
Note that because convolutions are gather operations, the input data must include a halo
Diagram: the input image split into four tiles, labelled 0,0 1,0 0,1 and 1,1
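The halo requirement can be made concrete with a small index calculation in plain C++. This is a sketch under assumptions not stated in the slides: tiles are laid out on a regular grid, the halo width equals the filter radius (1 for a 3x3 filter), and the region is clamped at the image border. The names `Region` and `tile_with_halo` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Half-open pixel region: [x0, x1) by [y0, y1).
struct Region {
  int x0, y0, x1, y1;
};

// Input region that has to be copied for tile (tile_x, tile_y) so that a
// convolution can gather neighbours up to `halo` pixels outside the tile.
Region tile_with_halo(int tile_x, int tile_y, int tile_w, int tile_h,
                      int halo, int image_w, int image_h) {
  assert(tile_w > 0 && tile_h > 0 && halo >= 0);
  Region r;
  r.x0 = std::max(tile_x * tile_w - halo, 0);
  r.y0 = std::max(tile_y * tile_h - halo, 0);
  r.x1 = std::min((tile_x + 1) * tile_w + halo, image_w);
  r.y1 = std::min((tile_y + 1) * tile_h + halo, image_h);
  return r;
}
```

For a 64x64 image split into 32x32 tiles with a halo of 1, tile 0,0 needs the input region from 0 up to 33 in each dimension, one row and one column more than the tile itself.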
Timeline (Copy and Compute rows): Copy{0, 0}, Convo{0, 0}, Copy{1, 0}, Convo{1, 0}, Copy{0, 1}, Convo{0, 1}, Copy{1, 1}, Convo{1, 1}
✔ By double buffering copy and computation you can hide the latency of copying into on-chip memory
However, the R-Car CVEngine provides further sub-group local memory which has an even lower access latency than on-chip memory
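The benefit of double buffering can be estimated with a simple timing model. The plain C++ sketch below assumes every tile has the same copy time and the same compute time, and that the copy of one tile can fully overlap the computation of the previous one; the function names are hypothetical and no measured figures are implied.

```cpp
#include <algorithm>
#include <cassert>

// Total time when each tile is copied and then computed, one after another.
double sequential_time(int tiles, double copy, double compute) {
  assert(tiles > 0);
  return tiles * (copy + compute);
}

// Total time when the copy of tile i + 1 overlaps the computation of tile i:
// one exposed copy, then (tiles - 1) overlapped steps, then a final compute.
double double_buffered_time(int tiles, double copy, double compute) {
  assert(tiles > 0);
  return copy + (tiles - 1) * std::max(copy, compute) + compute;
}
```

With four tiles, a copy time of 2 units and a compute time of 3 units, the model gives 20 units sequentially and 14 units with double buffering; only the first copy remains exposed when compute time is at least as long as copy time.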
✔ Asynchronously copying each part of the input data that is associated with a sub-group to sub-group local memory will further lower access latency
✘ Again, only part of the image data that is associated with a sub-group will fit into sub-group local memory at once, so again we have to tile it
In this case the tiling is done in-kernel
Diagram: a tile held in on-chip memory is itself split into tiles (0,0 1,0 0,1 1,1) that are copied into sub-group local memory
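The in-kernel tiling pattern can be sketched on the host with two buffers that swap roles on every iteration. This plain C++ example is an illustration only: it uses a sum of squares instead of a convolution, the copies complete immediately instead of running asynchronously, and the names `copy_tile` and `sum_of_squares_tiled` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Copy tile number `tile` of `src` into `dst` and return the element count.
// On the device this would be an asynchronous copy into sub-group local memory.
std::size_t copy_tile(const std::vector<float>& src, std::size_t tile,
                      std::size_t tile_size, std::vector<float>& dst) {
  const std::size_t begin = tile * tile_size;
  const std::size_t count = std::min(tile_size, src.size() - begin);
  std::copy(src.begin() + begin, src.begin() + begin + count, dst.begin());
  return count;
}

float sum_of_squares_tiled(const std::vector<float>& src,
                           std::size_t tile_size) {
  assert(tile_size > 0);
  const std::size_t num_tiles = (src.size() + tile_size - 1) / tile_size;
  std::vector<float> current(tile_size), next(tile_size);
  float result = 0.0f;
  if (num_tiles == 0) return result;

  // Prologue: fetch tile 0 before the loop starts.
  std::size_t current_count = copy_tile(src, 0, tile_size, current);

  for (std::size_t tile = 0; tile < num_tiles; ++tile) {
    // Issue the copy of the next tile, if any, before computing.
    const bool has_next = tile + 1 < num_tiles;
    std::size_t next_count = 0;
    if (has_next) next_count = copy_tile(src, tile + 1, tile_size, next);

    // Compute on the current tile while the next copy would be in flight.
    for (std::size_t i = 0; i < current_count; ++i) {
      result += current[i] * current[i];
    }

    // A device implementation would wait for the copy here, then swap.
    if (has_next) {
      std::swap(current, next);
      current_count = next_count;
    }
  }
  return result;
}
```

The result is the same as a single pass over the whole input; the tiling only changes where the data is staged while it is being processed.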
cgh.parallel_for<convo2d>(ndRange, [=](nd_item<1> ndItem){
  auto subGroup = ndItem.get_sub_group();
  auto numTiles = calculate_num_tiles(subGroup.get_group_range(), TILE_SIZE);
  auto currentTileRange = calculate_tile_range(subGroup, 0);
  auto nextTileRange = calculate_tile_range(subGroup, 1);
  subGroup.async_sub_group_copy(currentTileLocalMem, currentTilePlain,
      currentTileRange).wait();
  copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlain,
      nextTileRange);
  for (int tile = 0; tile < numTiles; ++tile) {
    compute_tile(subGroup, currentTileRange, output);
    copyEvent.wait();
    if (tile == (numTiles - 1)) {
      copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem,
          nextTilePlain, nextTileRange);
      currentTileRange = nextTileRange;
      nextTileRange = calculate_tile_range(subGroup, tile + 1);
      swap(currentTileLocalMem, nextTileLocalMem);
      swap(currentTilePlain, nextTilePlain);
    }
  }
});
© 2019 Codeplay Software Ltd.102
cgh.parallel_for<convo2d>(ndRange, [=](nd_item<1> ndItem){ auto subGroup = ndItem.get_sub_group();
auto numTiles = calculate_num_tiles(subGroup.get_group_range(), TILE_SIZE);
auto currentTileRange = calculate_tile_range(subGroup, 0); auto nextTileRange = calculate_tile_range(subGroup, 1);
subGroup.async_sub_group_copy(currentTileLocalMem, currentTilePlain, currentTileRange) .wait(); copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlain, nextTileRange);
for (int tile = 0; tile < numTiles; ++tile) { compute_tile(subGroup, currentTileRange, output);
copyEvent.wait();
if (tile == (numTiles - 1)) { copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlain, nextTileRange);
currentTileRange = nextTileRange; nextTileRange = calculate_tile_range(subGroup, tile + 1);
swap(currentTileLocalMem, nextTileLocalMem); swap(currentTilePlain, nextTilePlain); } }});
This kernel is operating on a tile that is stored in on-chip memory
© 2019 Codeplay Software Ltd.103
cgh.parallel_for<convo2d>(ndRange, [=](nd_item<1> ndItem){ auto subGroup = ndItem.get_sub_group();
auto numTiles = calculate_num_tiles(subGroup.get_group_range(), TILE_SIZE);
auto currentTileRange = calculate_tile_range(subGroup, 0); auto nextTileRange = calculate_tile_range(subGroup, 1);
subGroup.async_sub_group_copy(currentTileLocalMem, currentTilePlain, currentTileRange) .wait(); copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlain, nextTileRange);
for (int tile = 0; tile < numTiles; ++tile) { compute_tile(subGroup, currentTileRange, output);
copyEvent.wait();
if (tile == (numTiles - 1)) { copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlain, nextTileRange);
currentTileRange = nextTileRange; nextTileRange = calculate_tile_range(subGroup, tile + 1);
swap(currentTileLocalMem, nextTileLocalMem); swap(currentTilePlain, nextTilePlain); } }});
We want to perform the computation of the part of the input that each sub-group corresponds to in sub-group local memory
But all the memory required may not fit into sub-group local memory at once
So we calculate how many tiles are required for a sub-group
© 2019 Codeplay Software Ltd.104
cgh.parallel_for<convo2d>(ndRange, [=](nd_item<1> ndItem){ auto subGroup = ndItem.get_sub_group();
auto numTiles = calculate_num_tiles(subGroup.get_group_range(), TILE_SIZE);
auto currentTileRange = calculate_tile_range(subGroup, 0); auto nextTileRange = calculate_tile_range(subGroup, 1);
subGroup.async_sub_group_copy(currentTileLocalMem, currentTilePlain, currentTileRange) .wait(); copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlain, nextTileRange);
for (int tile = 0; tile < numTiles; ++tile) { compute_tile(subGroup, currentTileRange, output);
copyEvent.wait();
if (tile == (numTiles - 1)) { copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlain, nextTileRange);
currentTileRange = nextTileRange; nextTileRange = calculate_tile_range(subGroup, tile + 1);
swap(currentTileLocalMem, nextTileLocalMem); swap(currentTilePlain, nextTilePlain); } }});
First we initiate and wait on the copy of the first tile so we can perform the computation on it
Then we initiate, but don’t wait for the copy of the second tile, so that copy will happen in parallel to the computation of the first tile
© 2019 Codeplay Software Ltd.105
cgh.parallel_for<convo2d>(ndRange, [=](nd_item<1> ndItem){ auto subGroup = ndItem.get_sub_group();
auto numTiles = calculate_num_tiles(subGroup.get_group_range(), TILE_SIZE);
auto currentTileRange = calculate_tile_range(subGroup, 0); auto nextTileRange = calculate_tile_range(subGroup, 1);
subGroup.async_sub_group_copy(currentTileLocalMem, currentTilePlain, currentTileRange) .wait(); copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlain, nextTileRange);
for (int tile = 0; tile < numTiles; ++tile) { compute_tile(subGroup, currentTileRange, output);
copyEvent.wait();
if (tile == (numTiles - 1)) { copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlain, nextTileRange);
currentTileRange = nextTileRange; nextTileRange = calculate_tile_range(subGroup, tile + 1);
swap(currentTileLocalMem, nextTileLocalMem); swap(currentTilePlain, nextTilePlain); } }});
Then we iterate over the tiles, performing the computation of the current tile and then waiting on the copy for the next tile
© 2019 Codeplay Software Ltd.106
cgh.parallel_for<convo2d>(ndRange, [=](nd_item<1> ndItem){ auto subGroup = ndItem.get_sub_group();
auto numTiles = calculate_num_tiles(subGroup.get_group_range(), TILE_SIZE);
auto currentTileRange = calculate_tile_range(subGroup, 0); auto nextTileRange = calculate_tile_range(subGroup, 1);
subGroup.async_sub_group_copy(currentTileLocalMem, currentTilePlain, currentTileRange) .wait(); copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlain, nextTileRange);
for (int tile = 0; tile < numTiles; ++tile) { compute_tile(subGroup, currentTileRange, output);
copyEvent.wait();
if (tile == (numTiles - 1)) { copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlain, nextTileRange);
currentTileRange = nextTileRange; nextTileRange = calculate_tile_range(subGroup, tile + 1);
swap(currentTileLocalMem, nextTileLocalMem); swap(currentTilePlain, nextTilePlain); } }});
Finally, if there are further tiles to be processed, then we initiate the copy for the next tile and then swap the accessors for the next iteration of the loop
© 2019 Codeplay Software Ltd.107
✔ By double buffering asynchronous copies and the computation in each sub-group you can hide the latency of copying into sub-group local memory
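The bookkeeping of such a double-buffered loop can be checked on the host in plain C++. This is only a sketch under stated assumptions: `CopyEvent`, `async_copy`, `num_tiles` and `process_tiles` are hypothetical stand-ins and not the R-Car extension API, the data is one-dimensional, and the "computation" simply doubles each element. The emulated copy runs only when `wait()` is called, so reading a buffer before waiting on its copy would produce wrong output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the event returned by an asynchronous copy.
// The copy is deferred until wait() is called.
struct CopyEvent {
  std::function<void()> copy;
  void wait() {
    if (copy) {
      copy();
      copy = nullptr;
    }
  }
};

// Number of tiles needed to cover `size` elements (ceiling division).
std::size_t num_tiles(std::size_t size, std::size_t tileSize) {
  return (size + tileSize - 1) / tileSize;
}

// Emulated asynchronous copy of tile `tile` of `src` into `*dst`.
CopyEvent async_copy(const std::vector<int>& src, std::size_t tile,
                     std::size_t tileSize, std::vector<int>* dst) {
  return CopyEvent{[&src, tile, tileSize, dst] {
    std::size_t begin = tile * tileSize;
    std::size_t end = std::min(begin + tileSize, src.size());
    dst->assign(src.begin() + static_cast<std::ptrdiff_t>(begin),
                src.begin() + static_cast<std::ptrdiff_t>(end));
  }};
}

// Double-buffered loop: compute the current tile, wait for the next one,
// swap the buffers, then start fetching the tile after that.
std::vector<int> process_tiles(const std::vector<int>& src,
                               std::size_t tileSize) {
  assert(tileSize > 0);
  std::vector<int> out;
  std::size_t numTiles = num_tiles(src.size(), tileSize);
  if (numTiles == 0) return out;

  std::vector<int> bufferA, bufferB;  // stand-ins for the two local buffers
  std::vector<int>* current = &bufferA;
  std::vector<int>* next = &bufferB;

  async_copy(src, 0, tileSize, current).wait();  // first tile: copy and wait
  CopyEvent copyEvent;
  if (numTiles > 1) copyEvent = async_copy(src, 1, tileSize, next);

  for (std::size_t tile = 0; tile < numTiles; ++tile) {
    for (int value : *current) out.push_back(2 * value);  // "compute"
    copyEvent.wait();  // the next tile is now resident (no-op if none pending)
    if (tile + 1 < numTiles) {
      std::swap(current, next);  // the fetched tile becomes the current tile
      if (tile + 2 < numTiles)
        copyEvent = async_copy(src, tile + 2, tileSize, next);
    }
  }
  return out;
}
```

Calling `process_tiles` with any positive tile size should give the same result as doubling the input directly, whether or not the tile size divides the input length.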
Conclusion
● The Renesas R-Car CVEngine is designed to efficiently accelerate complex machine learning algorithms in a low power environment
● The OpenCL/SYCL programming model can be efficiently applied, and extended when necessary, to support highly specialised hardware architectures
● This allows automotive systems to take advantage of AI software stacks based on open standards
/codeplaysoft  @codeplaysoft  codeplay.com
We’re hiring! codeplay.com/careers/
Thank you for listening