How to Deploy AI Software to Self Driving Cars Rod Burns, Gordon Brown, Meenakshi Ravindran and Nicolas Miller IWOCL`19 - May 2019

How to Deploy AI Software to Self Driving Cars

Rod Burns, Gordon Brown, Meenakshi Ravindran and Nicolas Miller

IWOCL`19 - May 2019

© 2019 Codeplay Software Ltd.

Codeplay - Connecting AI to Silicon

Products
● C++ platform via the SYCL™ open standard, enabling vision & machine learning, e.g. TensorFlow™
● The heart of Codeplay's compute technology, enabling OpenCL™, SPIR™, HSA™ and Vulkan™

Addressable Markets
● Automotive (ISO 26262)
● IoT, Smartphones & Tablets
● High Performance Compute (HPC)
● Medical & Industrial

Technologies
● Vision Processing
● Machine Learning
● Artificial Intelligence
● Big Data Compute

Company
● High-performance software solutions for custom heterogeneous systems
● Enabling the toughest processor systems with tools and middleware based on open standards
● Established 2002 in Scotland
● ~70 employees

Partners and Customers (logos not reproduced)


Agenda

Emergent hardware for AI in automotive

Overview of OpenCL/SYCL programming model

Mapping typical hardware to the OpenCL/SYCL programming model

The Renesas R-Car architecture

Extending OpenCL & SYCL for R-Car

Optimising machine learning algorithms using R-Car


Autonomous driving is one of the biggest challenges in technology

The automotive industry needs to deliver the latest AI technologies with safety, high performance and low power consumption


Delivering an autonomous vehicle is a huge software and hardware challenge

It requires scaling up software development to very high levels of complexity, performance and risk, whilst maintaining low power consumption


Renesas R-Car architecture

● Embedded automotive architecture

● Optimized for computer vision processing and machine learning

● Designed for low latency, low power consumption and low cost

Software stack, top to bottom: SYCL-BLAS and SYCL-DNN; SYCL; OpenCL

Overview of OpenCL/SYCL programming model

(Diagram, built up over several slides: processing elements with private memory, grouped into compute units with local memory, all sharing global memory)

1. A processing element executes a single work-item
2. Each work-item can access private memory, a dedicated memory region for each processing element
3. A compute unit executes a work-group, composed of multiple work-items, one for each processing element in the compute unit
4. Each work-item can access local memory, a dedicated memory region for each compute unit
5. A device can execute multiple work-groups
6. Each work-item can access global memory, a single memory region available to all processing elements
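The relationship between work-items and work-groups described above can be sketched in plain, host-only C++. This is an illustrative model, not OpenCL or SYCL code: the names `WorkItemIndex`, `decompose` and `compose` are hypothetical, and a one-dimensional range with a fixed work-group size is assumed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: the position of one work-item in a 1-D launch.
struct WorkItemIndex {
    std::size_t group_id;  // which work-group (executed by one compute unit)
    std::size_t local_id;  // which work-item inside that work-group
};

// Split a global work-item id into (work-group id, local id), assuming every
// work-group in the launch has the same size.
inline WorkItemIndex decompose(std::size_t global_id, std::size_t group_size) {
    assert(group_size > 0);
    return WorkItemIndex{global_id / group_size, global_id % group_size};
}

// Inverse mapping: rebuild the global id from the (group, local) pair.
inline std::size_t compose(WorkItemIndex idx, std::size_t group_size) {
    return idx.group_id * group_size + idx.local_id;
}
```

With a work-group size of 4, for example, global work-item 5 is local work-item 1 of work-group 1.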

Private memory < Local memory < Global memory

(Diagram, built up over several slides: each work-item has private memory; work-items are grouped into work-groups, which share local memory and synchronise at a work-group barrier; the work-groups of a kernel share global memory and synchronise at a kernel barrier)

Cross-platform, single-source, high-level, C++ programming layer

Built on top of OpenCL and based on standard C++11

Delivering a heterogeneous programming solution for C++

The same vector addition as schematic fragments in four C++ programming models:

CUDA

    __global__ void vec_add(float *a, float *b, float *c) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
      c[i] = a[i] + b[i];
    }

    float *a, *b, *c;
    vec_add<<<blocks, threads>>>(a, b, c);

OpenMP

    vector<float> a, b, c;

    #pragma omp parallel for
    for (int i = 0; i < a.size(); i++) {
      c[i] = a[i] + b[i];
    }

SYCL

    cgh.parallel_for<vec_add>(range, [=](cl::sycl::id<2> idx) {
      c[idx] = a[idx] + b[idx];
    });

C++ AMP

    array_view<float> a, b, c;
    extent<2> e(64, 64);

    parallel_for_each(e, [=](index<2> idx) restrict(amp) {
      c[idx] = a[idx] + b[idx];
    });


SYCL separates the storage and access of data through the use of buffers and accessors

SYCL provides data dependency tracking based on accessors to optimise the scheduling of tasks
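To make the idea of accessor-based dependency tracking concrete, here is a small host-only C++ model. It is a sketch under stated assumptions, not how any SYCL runtime is implemented: `Mode`, `Access`, `CommandGroup` and `build_dependencies` are hypothetical names, buffers are identified by strings, and command groups are assumed to be submitted in order.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

enum class Mode { Read, Write };

// One declared access requirement, playing the role of an accessor.
struct Access {
    std::string buffer;
    Mode mode;
};

// A unit of work together with the accesses it declares.
struct CommandGroup {
    std::string name;
    std::vector<Access> accesses;
};

// For each command group (in submission order), compute the indices of the
// earlier command groups it must wait for.
inline std::vector<std::set<std::size_t>> build_dependencies(
    const std::vector<CommandGroup>& cgs) {
    std::vector<std::set<std::size_t>> deps(cgs.size());
    std::map<std::string, std::size_t> last_writer;
    std::map<std::string, std::vector<std::size_t>> readers_since_write;

    for (std::size_t i = 0; i < cgs.size(); ++i) {
        for (const Access& a : cgs[i].accesses) {
            // Any access must wait for the previous writer of the buffer.
            auto w = last_writer.find(a.buffer);
            if (w != last_writer.end()) deps[i].insert(w->second);
            // A write must also wait for readers of the previous contents.
            if (a.mode == Mode::Write) {
                for (std::size_t r : readers_since_write[a.buffer]) deps[i].insert(r);
            }
        }
        // Record this command group's accesses for later submissions.
        for (const Access& a : cgs[i].accesses) {
            if (a.mode == Mode::Write) {
                last_writer[a.buffer] = i;
                readers_since_write[a.buffer].clear();
            } else {
                readers_since_write[a.buffer].push_back(i);
            }
        }
    }
    return deps;
}
```

Command groups that end up with no mutual dependency are free to run concurrently, and no explicit events are needed to express the ordering.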

Buffers manage data across the host and one or more devices

Accessors are used to describe access requirements

Buffers and accessors provide type-safe access across host and device

(Diagram: one buffer accessed through two accessors, one from command group CG A and one from CG B)

(Diagram: command groups CG A, CG B and CG C access Buffers A, B, C and D through read and write accessors; these accesses define a dependency graph over CG A, CG B and CG C)

Accessor types available to a command group (CG):

● global_buffer accessor: Request access to a buffer in the global memory region
● constant_buffer accessor: Request access to a buffer in the constant memory region
● local accessor: Allocate memory in the local memory region
● host_buffer accessor: Request access to a buffer immediately on the host

Benefits of data dependency task graphs

● Allows you to describe your tasks in terms of relationships
○ Removes the need to enqueue explicit copies
○ Removes the need for complex event handling

● Allows the runtime to make data movement optimizations
○ Preemptively copy data to a device before kernels are executed
○ Avoid unnecessarily copying data back to the host after execution on a device
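The data movement optimization in the last point can be illustrated with a toy C++ buffer that tracks which copy of its data is current. This is only a sketch: `TrackedBuffer` and its members are hypothetical, the "device" is just a second `std::vector`, and host-side writes are not modelled.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy buffer that remembers where the up-to-date copy of its data lives.
class TrackedBuffer {
public:
    explicit TrackedBuffer(std::vector<float> host_data)
        : host_(std::move(host_data)) {}

    // Called before a kernel runs: copy to the device only if its copy is stale.
    std::vector<float>& device_access(bool will_write) {
        if (!device_valid_) {
            device_ = host_;
            device_valid_ = true;
            ++copies_to_device_;
        }
        if (will_write) host_valid_ = false;  // host copy becomes stale
        return device_;
    }

    // Called when the host needs the data: copy back only if required.
    const std::vector<float>& host_read() {
        if (!host_valid_) {
            host_ = device_;
            host_valid_ = true;
            ++copies_to_host_;
        }
        return host_;
    }

    std::size_t copies_to_device() const { return copies_to_device_; }
    std::size_t copies_to_host() const { return copies_to_host_; }

private:
    std::vector<float> host_;
    std::vector<float> device_;
    bool host_valid_ = true;
    bool device_valid_ = false;
    std::size_t copies_to_device_ = 0;
    std::size_t copies_to_host_ = 0;
};
```

Two consecutive "kernels" that write the buffer cause a single copy to the device and no copy back; the data returns to the host only when `host_read` is called.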

Mapping typical hardware to the OpenCL/SYCL programming model

CPU

(Diagram, built up over several slides: DDR connected to a CPU with four cores, each core with its own registers, and a multi-level cache)

1. A CPU has a region of dedicated memory
2. The CPU memory is connected to the CPU via a bus
3. A CPU has a number of cores
4. A CPU has a number of caches of different levels
5. Each CPU core has dedicated registers

CPU

(Diagram, built up over several slides: the cores are relabelled as SIMD work-items, the registers as private memory, and DDR is split into local memory and global memory)

1. Lanes of the CPU core SIMD instructions are mapped to work-items
2. CPU registers and their associated caches are mapped to private memory
3. A section of DDR is mapped to local memory
4. The rest of DDR is mapped to global memory
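As a rough illustration of the first point in the CPU mapping above, the following C++ sketch treats each lane of a SIMD instruction as one work-item. The width `kSimdWidth` and the function `vec_add_lanes` are assumptions made for this example; the inner loop is ordinary scalar code that a vectorising compiler may or may not turn into SIMD instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed number of SIMD lanes per instruction; real hardware varies.
constexpr std::size_t kSimdWidth = 4;

// Vector addition in which each inner-loop iteration stands for one SIMD
// lane, that is, one work-item.
inline void vec_add_lanes(const std::vector<float>& a,
                          const std::vector<float>& b,
                          std::vector<float>& c) {
    const std::size_t n = c.size();
    assert(a.size() >= n && b.size() >= n);
    for (std::size_t base = 0; base < n; base += kSimdWidth) {   // one SIMD step
        for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {  // one lane per work-item
            const std::size_t work_item = base + lane;
            if (work_item < n) {  // mask off lanes past the end of the data
                c[work_item] = a[work_item] + b[work_item];
            }
        }
    }
}
```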

GPU

(Diagram, built up over several slides: DDR connected to a GPU made of compute units, each with shared memory and a number of processing elements (PE), each PE with its own memory (PM))

1. A GPU has a region of dedicated DDR memory which is connected to the CPU
2. A GPU is divided into a number of compute units
3. Each compute unit has dedicated shared memory
4. Each compute unit has a number of processing elements
5. Each processing element has dedicated processing element local memory

GPU

(Diagram, built up over several slides: the compute units are relabelled as work-groups, the processing elements as work-items, DDR as global memory and shared memory as local memory)

● Compute units are mapped to the optimal work-group size
● Processing elements are mapped to work-items
● DDR memory is mapped to global memory
● Compute unit shared memory is mapped to local memory
● Processing element local memory is mapped to private memory
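One way to read the GPU mapping above is as a rule for choosing launch dimensions. The C++ sketch below assumes, purely for illustration, that the optimal work-group size equals the number of processing elements in a compute unit; real devices often prefer other sizes, and `LaunchConfig` and `make_launch` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a 1-D kernel launch.
struct LaunchConfig {
    std::size_t group_size;          // work-items per work-group
    std::size_t num_groups;          // work-groups needed to cover the problem
    std::size_t padded_global_size;  // total work-items actually launched
};

// Choose one work-item per processing element, then round the problem size
// up to a whole number of work-groups.
inline LaunchConfig make_launch(std::size_t problem_size,
                                std::size_t pes_per_compute_unit) {
    assert(pes_per_compute_unit > 0);
    const std::size_t groups =
        (problem_size + pes_per_compute_unit - 1) / pes_per_compute_unit;
    return LaunchConfig{pes_per_compute_unit, groups,
                        groups * pes_per_compute_unit};
}
```

Because the global size is padded, a kernel launched this way would need to ignore work-items whose index is beyond the real problem size.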

The Renesas R-Car architecture

CVEngine

(Diagram, built up over several slides: a CVEngine with two clusters, each with a multi-level cache and four cores; each core has registers and local SRAM; the CVEngine also has on-chip SRAM and is connected to off-chip DDR)

1. The CVEngine is connected to off-chip DDR, which is connected to the CPU
2. The CVEngine has a number of clusters
3. The CVEngine has a region of on-chip SRAM, also connected to the CPU
4. Each cluster has 4 cores, each with a number of processing elements
5. Each core has dedicated registers and can access DDR memory via caches
6. Each core also has dedicated local SRAM
7. The local SRAM is connected to the on-chip SRAM and DDR via DMA
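The last point suggests a common usage pattern: stage a tile of data into a core's local SRAM, process it there, and write it back. The C++ sketch below only models that pattern on the host. `std::copy` stands in for a DMA transfer, `scale_via_local_sram` and its parameters are hypothetical, and nothing here reflects the actual R-Car programming interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scale every element of `ddr` by `factor`, working on tiles small enough to
// fit in a local SRAM of `sram_capacity` elements.
inline void scale_via_local_sram(std::vector<float>& ddr,
                                 std::size_t sram_capacity,
                                 float factor) {
    assert(sram_capacity > 0);
    std::vector<float> local_sram(sram_capacity);  // stand-in for a core's local SRAM

    for (std::size_t offset = 0; offset < ddr.size(); offset += sram_capacity) {
        const std::size_t count = std::min(sram_capacity, ddr.size() - offset);
        const auto tile_begin = ddr.begin() + static_cast<std::ptrdiff_t>(offset);
        const auto tile_end = tile_begin + static_cast<std::ptrdiff_t>(count);

        // "DMA in": DDR to local SRAM.
        std::copy(tile_begin, tile_end, local_sram.begin());

        // Compute entirely out of local SRAM.
        for (std::size_t i = 0; i < count; ++i) local_sram[i] *= factor;

        // "DMA out": local SRAM back to DDR.
        std::copy(local_sram.begin(),
                  local_sram.begin() + static_cast<std::ptrdiff_t>(count),
                  tile_begin);
    }
}
```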

CVEngine

(Diagram, built up over several slides: the clusters are relabelled as work-groups and the cores as sub-groups)

● Each cluster maps to the optimal work-group size
● Cores provide an extra level of subdivision within work-groups, so each core maps to a sub-group
● Sub-groups are available in OpenCL 2.x but not yet available in SYCL, so this will require an extension
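The cluster, core and sub-group mapping can be sketched as index arithmetic in plain C++. The figure of 4 cores per cluster comes from the slides; the number of processing elements per core is left as a parameter because the slides do not give it, and `SubGroupIndex`, `work_group_size` and `locate` are hypothetical names that do not belong to OpenCL or to any SYCL extension.

```cpp
#include <cassert>
#include <cstddef>

// From the slides: each cluster has 4 cores.
constexpr std::size_t kCoresPerCluster = 4;

// Hypothetical helper type: position of a work-item inside its work-group.
struct SubGroupIndex {
    std::size_t sub_group_id;        // which core of the cluster
    std::size_t sub_group_local_id;  // which processing element of that core
};

// Work-group size if every processing element runs exactly one work-item.
inline std::size_t work_group_size(std::size_t pes_per_core) {
    return kCoresPerCluster * pes_per_core;
}

// Map a work-item's local id within the work-group (cluster) to its
// sub-group (core) and its position within that sub-group.
inline SubGroupIndex locate(std::size_t local_id, std::size_t pes_per_core) {
    assert(pes_per_core > 0);
    assert(local_id < work_group_size(pes_per_core));
    return SubGroupIndex{local_id / pes_per_core, local_id % pes_per_core};
}
```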

[Diagram: the CVEngine with each sub-group containing N work-items]

Each processing element within a core maps to a single work-item

[Diagram: the CVEngine with DDR relabelled as global memory]

Off-chip DDR memory is mapped to global memory

[Diagram: the CVEngine with SRAM relabelled as local memory]

On-chip SRAM memory is mapped to local memory

[Diagram: the CVEngine with each core's registers relabelled as private memory (PM)]

Registers are mapped to private memory

[Diagram: the CVEngine with SRAM labelled as both on-chip memory and local memory]

Since SRAM can be written to and read from the CPU, and can be accessed by all work-groups in a similar way to global memory, SRAM can also be used to allocate low-latency on-chip buffers

[Diagram: the CVEngine with each core's local SRAM relabelled as sub-group local memory]

Since each core has its own dedicated local SRAM, local SRAM can be mapped to a sub-group local memory

Sub-group local memory is not yet available in OpenCL or SYCL, so this will require an extension

[Diagram: the CVEngine with global memory, on-chip memory, local memory and sub-group local memory]

Since there are DMA connections from SRAM and DDR to local SRAM, the CVEngine can support asynchronous memory copies from on-chip memory buffers and global memory buffers into sub-group local memory, and vice versa

These asynchronous copies cannot be represented in OpenCL or SYCL, so they will require an extension

[Diagram, built up over four slides: the execution hierarchy (kernel, work-group, sub-group, work-item), the memory visible at each level (global and on-chip memory, local memory, sub-group local memory, private memory), the barrier available at each level (kernel barrier, work-group barrier, sub-group barrier), and asynchronous sub-group copies]


Agenda

Emergent hardware for AI in automotive

Overview of OpenCL/SYCL programming model

Mapping typical hardware to the OpenCL/SYCL programming model

The Renesas R-Car architecture

Extending OpenCL & SYCL for R-Car

Optimising machine learning algorithms using R-Car


Disclaimer

The features that I present here are Codeplay extensions and are not standard SYCL features

[Diagram: the CVEngine mapped onto the extended model: work-groups, sub-groups of N work-items, private memory (PM), sub-group local memory, local memory, on-chip memory and global memory]

● On-chip memory
  ○ On-chip memory is allocated in OpenCL/SYCL similarly to regular buffers
    ■ ComputeAorta (OpenCL) provides an extension API
    ■ ComputeCpp (SYCL) provides an extension buffer property use_onchip_memory
  ○ On-chip memory buffers are accessed in OpenCL/SYCL kernels in the same way as regular buffers

class kernel;

using namespace cl::sycl;

{
  queue deviceQueue;

  // Request on-chip memory for this buffer via the Codeplay extension property
  buffer<float, 1> onchipBuffer(hostData, size,
      {codeplay::property::buffer::use_onchip_memory(
          codeplay::property::require)});

  deviceQueue.submit([&](handler &cgh){
    auto onchipAcc =
        onchipBuffer.get_access<access::mode::read_write>(cgh);

    // Square every element in place
    cgh.parallel_for<kernel>(range<1>(size), [=](id<1> idx){
      onchipAcc[idx] = onchipAcc[idx] * onchipAcc[idx];
    });
  });
}

We construct a SYCL buffer as normal, but provide the use_onchip_memory buffer property

This property takes an enumeration: either require, which means that the SYCL runtime has to use on-chip memory, or prefer, which means that the SYCL runtime should try to use it
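One way to read the two values is as an allocation policy. The sketch below is a plain C++ model, not the ComputeCpp API. It assumes that a require request which cannot be satisfied is reported as a failure and that a prefer request falls back to global memory; the slides do not say how the runtime reports either case. OnchipPolicy, Placement and place_buffer are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the two policy values named on the slide.
enum class OnchipPolicy { require, prefer };

// Hypothetical outcome of an allocation request.
enum class Placement { onchip, global, failed };

// Decide where a buffer of `bytes` bytes would be placed, given how much
// on-chip memory is still free. Assumption: `require` fails when the buffer
// does not fit, while `prefer` falls back to global memory.
inline Placement place_buffer(std::size_t bytes, std::size_t freeOnchipBytes,
                              OnchipPolicy policy) {
  assert(bytes > 0 && "zero-sized buffers are not modelled");
  if (bytes <= freeOnchipBytes) {
    return Placement::onchip;
  }
  return policy == OnchipPolicy::prefer ? Placement::global
                                        : Placement::failed;
}
```

Under these assumptions, prefer suits buffers that merely benefit from low latency, while require suits buffers whose kernels only make sense when the data is on-chip.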

● Sub-groups
  ○ Sub-groups are exposed following the OpenCL 2.x feature and as a natural extension to the SYCL execution model
    ■ ComputeAorta (OpenCL) provides kernel builtins for querying sub-group info and invoking a sub-group barrier
    ■ ComputeCpp (SYCL) provides an extension to nd_item to expose a sub_group object, similar to group, which exposes member functions for querying sub-group info and invoking a sub-group barrier
  ○ The size of sub-groups cannot be specified explicitly by users; it is determined by the implementation

[Diagram: a work-group of four sub-groups, each with N work-items, private memory (PM) and sub-group local memory]

class kernel;

using namespace cl::sycl;

{
  queue deviceQueue;

  buffer<float, 1> deviceBuffer(hostData, size);

  deviceQueue.submit([&](handler &cgh){
    auto deviceAcc =
        deviceBuffer.get_access<access::mode::read_write>(cgh);

    cgh.parallel_for<kernel>(nd_range<1>(range<1>(size), range<1>(32)),
        [=](nd_item<1> ndItem){
      …
      // Query the sub-group and synchronise the work-items within it
      auto subGroup = ndItem.get_sub_group();
      auto subGroupRange = subGroup.get_group_range();
      auto subGroupId = subGroup.get_group_id();
      subGroup.barrier();
      …
    });
  });
}

We query in-kernel sub-group information and invoke sub-group barriers via the sub_group class. The nd_item has a member function called get_sub_group that returns a sub_group object

If an implementation does not support sub-groups, the behaviour of using sub_group is undefined
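As a rough illustration of how these ids relate to one another, the sketch below decomposes a one-dimensional global id into a work-group id, a sub-group id and a position within the sub-group. It is plain C++ arithmetic rather than SYCL, and it assumes a fixed sub-group size supplied by the caller, whereas on real hardware the size is determined by the implementation. HierarchyIndex and decompose are hypothetical names, not part of ComputeCpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration only: not the ComputeCpp sub_group API.
struct HierarchyIndex {
  std::size_t workGroupId;      // which work-group (cluster) the item is in
  std::size_t subGroupId;       // which sub-group (core) inside that work-group
  std::size_t subGroupLocalId;  // position of the work-item in its sub-group
};

// Decompose a 1-D global id. For simplicity the work-group size is assumed
// to be a whole multiple of the sub-group size.
inline HierarchyIndex decompose(std::size_t globalId,
                                std::size_t workGroupSize,
                                std::size_t subGroupSize) {
  assert(subGroupSize > 0 && workGroupSize > 0);
  assert(workGroupSize % subGroupSize == 0);
  const std::size_t localId = globalId % workGroupSize;
  HierarchyIndex index;
  index.workGroupId = globalId / workGroupSize;
  index.subGroupId = localId / subGroupSize;
  index.subGroupLocalId = localId % subGroupSize;
  return index;
}
```

With the work-group size of 32 used on the slide and an assumed sub-group size of 8, global id 45 falls in work-group 1, sub-group 1, at position 5.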

● Sub-group local memory
  ○ Sub-group local memory is exposed with extensions which follow the OpenCL/SYCL memory model
    ■ ComputeAorta (OpenCL) provides a new address space which can be used to allocate sub-group local memory
    ■ ComputeCpp (SYCL) provides a new accessor access target, access::target::subgroup_local, that behaves similarly to access::target::local

[Diagram: a work-group of four sub-groups, each with N work-items, private memory (PM) and sub-group local memory]

class kernel;

using namespace cl::sycl;

{
  queue deviceQueue;

  buffer<float, 1> deviceBuffer(hostData, size);

  deviceQueue.submit([&](handler &cgh){
    auto deviceAcc =
        deviceBuffer.get_access<access::mode::read_write>(cgh);

    // Allocate 32 floats of sub-group local memory via the extension target
    auto subGroupLocalMem = accessor<float, 1, access::mode::read_write,
        access::target::subgroup_local>(cgh, range<1>(32));

    cgh.parallel_for<kernel>(nd_range<1>(range<1>(size), range<1>(32)),
        [=](nd_item<1> ndItem){
      …
      subGroupLocalMem[idx] = ...;
      …
    });
  });
}

We allocate sub-group local memory by constructing an accessor with the subgroup_local access target

● Asynchronous sub-group copies
  ○ Asynchronous sub-group copies are exposed following the OpenCL/SYCL feature for asynchronous work-group copies
    ■ ComputeAorta (OpenCL) provides a plane_t type to represent a non-accessible buffer and kernel builtins for invoking asynchronous in-kernel copies between a plane_t and a sub-group local memory allocation
    ■ ComputeCpp (SYCL) provides a new accessor access target, access::target::plane, and a member function of the sub_group extension, async_sub_group_copy, to perform an asynchronous in-kernel copy from a plane accessor to a sub-group local accessor

[Diagram: a work-group of four sub-groups, each with N work-items, private memory (PM) and sub-group local memory]

class kernel;

using namespace cl::sycl;

{
  queue deviceQueue;

  buffer<float, 1> deviceBuffer(hostData, size);

  deviceQueue.submit([&](handler &cgh){
    // A plane accessor is the source of asynchronous copies
    auto devicePlane =
        deviceBuffer.get_access<access::mode::read_write,
            access::target::plane>(cgh);

    auto subGroupLocalMem = accessor<float, 1, access::mode::read_write,
        access::target::subgroup_local>(cgh, range<1>(32));

    cgh.parallel_for<kernel>(nd_range<1>(range<1>(size), range<1>(32)),
        [=](nd_item<1> ndItem){
      …
      auto subGroup = ndItem.get_sub_group();

      // Start the copy into sub-group local memory
      auto event = subGroup.async_sub_group_copy(subGroupLocalMem,
          devicePlane, range<1>(32));
      …
      // Block until the copy has completed
      event.wait();
    });
  });
}

We construct a plane accessor using the plane access target

We perform asynchronous in-kernel sub-group copies by calling the sub_group member function async_sub_group_copy

This returns a device_event that can be used to wait on the copy to complete.


Agenda

Emergent hardware for AI in automotive

Overview of OpenCL/SYCL programming model

Mapping typical hardware to the OpenCL/SYCL programming model

The Renesas R-Car architecture

Extending OpenCL & SYCL for R-Car

Optimising machine learning algorithms using R-Car

[Figure: the input image, stored in global memory]

✔ The entire image will fit into global memory

✘ Global memory has a high access latency


✔ On-chip memory has a much lower access latency

✘ Only part of the image will fit into on-chip memory at once, so we have to tile it

✘ Executing a kernel per tile incurs host-side overhead

Note that because convolutions are gather operations, the input data must include a halo

[Diagram: the image in on-chip memory, split into four tiles: 0,0; 1,0; 0,1; 1,1]
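The halo requirement can be sketched as follows. This is a plain C++ illustration, not code from the R-Car toolchain: for a tile at the tile coordinates shown on the slide, it computes the region of the input image that has to be copied, which is the tile itself grown by the filter radius and clamped to the image bounds. Region, tile_input_region and the parameter names are hypothetical, and the clamping at the image border is an assumption about how edges are handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Half-open rectangle [x0, x1) x [y0, y1) in image coordinates.
struct Region {
  std::size_t x0, y0, x1, y1;
};

// Input region needed to compute the output tile (tileX, tileY).
// `halo` is the filter radius, for example 1 for a 3x3 convolution.
inline Region tile_input_region(std::size_t tileX, std::size_t tileY,
                                std::size_t tileW, std::size_t tileH,
                                std::size_t imageW, std::size_t imageH,
                                std::size_t halo) {
  assert(tileW > 0 && tileH > 0);
  assert(tileX * tileW < imageW && tileY * tileH < imageH);

  // Output pixels covered by this tile, clipped to the image.
  const std::size_t outX0 = tileX * tileW;
  const std::size_t outY0 = tileY * tileH;
  const std::size_t outX1 = std::min(outX0 + tileW, imageW);
  const std::size_t outY1 = std::min(outY0 + tileH, imageH);

  // Grow by the halo on every side, clamping at the image border.
  Region r;
  r.x0 = outX0 > halo ? outX0 - halo : 0;
  r.y0 = outY0 > halo ? outY0 - halo : 0;
  r.x1 = std::min(outX1 + halo, imageW);
  r.y1 = std::min(outY1 + halo, imageH);
  return r;
}
```

For an 8x8 image split into 4x4 tiles with a halo of 1, tile 0,0 needs the 5x5 input region from (0, 0) to (5, 5), so neighbouring tiles read overlapping input.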

[Diagram, built up over four slides: a timeline with a Copy row and a Compute row, showing Copy{0, 0}, Convo{0, 0}, Copy{1, 0}, Convo{1, 0}, Copy{0, 1}, Convo{0, 1}, Copy{1, 1} and Convo{1, 1}]


✔ By double buffering copy and computation you can hide the latency of copying into on-chip memory
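The benefit can be estimated with a simple timing model. The sketch below is an assumption-laden illustration in plain C++: every tile is taken to cost the same copy time and the same compute time, in arbitrary units, and exactly one copy can overlap with one computation. serial_time and double_buffered_time are hypothetical helper names, and the numbers are not measurements from the R-Car.

```cpp
#include <algorithm>
#include <cassert>

// Copy and compute strictly one after the other for every tile.
inline long serial_time(long numTiles, long copyTime, long computeTime) {
  assert(numTiles >= 0 && copyTime >= 0 && computeTime >= 0);
  return numTiles * (copyTime + computeTime);
}

// Double buffered: the first copy is exposed, then each computation overlaps
// with the copy of the following tile, and the last computation runs alone.
inline long double_buffered_time(long numTiles, long copyTime,
                                 long computeTime) {
  assert(numTiles >= 0 && copyTime >= 0 && computeTime >= 0);
  if (numTiles == 0) {
    return 0;
  }
  return copyTime + (numTiles - 1) * std::max(copyTime, computeTime) +
         computeTime;
}
```

With four tiles, a copy cost of 3 and a compute cost of 5, the serial schedule takes 32 units and the double-buffered schedule 23; only the first copy remains exposed.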

However, the R-Car CVEngine provides further sub-group local memory which has an even lower access latency than on-chip memory

[Diagram: the image in on-chip memory, split into four tiles: 0,0; 1,0; 0,1; 1,1]


✔ Asynchronously copying each part of the input data that is associated with a sub-group to sub-group local memory will further lower access latency

✘ Again, only part of the image data that is associated with a sub-group will fit into sub-group local memory at once, so again we have to tile it

In this case the tiling is done in-kernel

[Diagram: one on-chip memory tile subdivided into four smaller tiles (0,0; 1,0; 0,1; 1,1) in sub-group local memory]

cgh.parallel_for<convo2d>(ndRange, [=](nd_item<1> ndItem){
  auto subGroup = ndItem.get_sub_group();

  // Number of tiles needed to cover this sub-group's part of the input
  auto numTiles = calculate_num_tiles(subGroup.get_group_range(), TILE_SIZE);

  auto currentTileRange = calculate_tile_range(subGroup, 0);
  auto nextTileRange = calculate_tile_range(subGroup, 1);

  // Copy the first tile and wait for it, then start the second copy
  // without waiting
  subGroup.async_sub_group_copy(currentTileLocalMem, currentTilePlain,
      currentTileRange).wait();
  auto copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem,
      nextTilePlain, nextTileRange);

  for (int tile = 0; tile < numTiles; ++tile) {
    compute_tile(subGroup, currentTileRange, output);

    copyEvent.wait();

    // Only start another copy if there are further tiles to process
    if (tile != (numTiles - 1)) {
      copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem,
          nextTilePlain, nextTileRange);

      currentTileRange = nextTileRange;
      nextTileRange = calculate_tile_range(subGroup, tile + 1);

      swap(currentTileLocalMem, nextTileLocalMem);
      swap(currentTilePlain, nextTilePlain);
    }
  }
});


This kernel is operating on a tile that is stored in on-chip memory


We want to perform the computation of the part of the input that each sub-group corresponds to in sub-group local memory

But all the memory required may not fit into sub-group local memory at once

So we calculate how many tiles are required for a sub-group


First we initiate and wait on the copy of the first tile so we can perform the computation on it

Then we initiate, but don’t wait for the copy of the second tile, so that copy will happen in parallel to the computation of the first tile


Then we iterate over the tiles, performing the computation of the current tile and then waiting on the copy for the next tile


Finally, if there are further tiles to be processed, then we initiate the copy for the next tile and then swap the accessors for the next iteration of the loop
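The same double-buffering pattern can be tried out on a host without the R-Car extensions. The following is a minimal sketch in plain C++, not the ComputeCpp API: PendingCopy is a hypothetical stand-in for the device_event returned by async_sub_group_copy, and it simply defers the transfer until wait() is called instead of running a real DMA copy in parallel. The computation is a sum over each tile rather than a convolution. Unlike the slide, this sketch swaps the two buffers before issuing the next copy, so that an incoming tile never overwrites data that has not been computed yet.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device event: the "asynchronous" transfer is
// performed when wait() is called, and at most once.
struct PendingCopy {
  const float* src = nullptr;
  float* dst = nullptr;
  std::size_t count = 0;
  void wait() {
    if (src != nullptr) {
      std::copy(src, src + count, dst);
      src = nullptr;
    }
  }
};

// Number of valid elements in the given tile (the last tile may be short).
inline std::size_t tile_count(std::size_t imageSize, std::size_t tile,
                              std::size_t tileSize) {
  return std::min(tileSize, imageSize - tile * tileSize);
}

// Begin copying one tile of `image` into `dst`.
inline PendingCopy start_copy(const std::vector<float>& image,
                              std::size_t tile, std::size_t tileSize,
                              std::vector<float>& dst) {
  PendingCopy copy;
  copy.src = image.data() + tile * tileSize;
  copy.dst = dst.data();
  copy.count = tile_count(image.size(), tile, tileSize);
  return copy;
}

// Sum all elements tile by tile using two buffers.
inline float sum_tiled(const std::vector<float>& image, std::size_t tileSize) {
  assert(tileSize > 0);
  const std::size_t numTiles = (image.size() + tileSize - 1) / tileSize;
  if (numTiles == 0) {
    return 0.0f;
  }
  std::vector<float> current(tileSize), next(tileSize);

  // First tile: copy and wait. Second tile: start the copy but do not wait.
  start_copy(image, 0, tileSize, current).wait();
  PendingCopy pending;
  if (numTiles > 1) {
    pending = start_copy(image, 1, tileSize, next);
  }

  float total = 0.0f;
  for (std::size_t tile = 0; tile < numTiles; ++tile) {
    // Compute on the current tile.
    const std::size_t count = tile_count(image.size(), tile, tileSize);
    for (std::size_t i = 0; i < count; ++i) {
      total += current[i];
    }
    // The next tile must have arrived before we switch to it.
    pending.wait();
    if (tile + 1 < numTiles) {
      std::swap(current, next);
      if (tile + 2 < numTiles) {
        pending = start_copy(image, tile + 2, tileSize, next);
      }
    }
  }
  return total;
}
```

Calling sum_tiled on the values 1 to 10 with a tile size of 4 processes three tiles, the last of which holds only two elements, and returns 55.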

✔ By double buffering asynchronous copies and the computation in each sub-group you can hide the latency of copying into sub-group local memory

[Diagram: one on-chip memory tile subdivided into four smaller tiles (0,0; 1,0; 0,1; 1,1) in sub-group local memory]


Conclusion

● The Renesas R-Car CVEngine is designed to efficiently accelerate complex machine learning algorithms in a low power environment

● The OpenCL/SYCL programming model can be efficiently applied, and extended when necessary, to support unique hardware architectures

● This allows automotive systems to take advantage of AI software stacks based on open standards

/codeplaysoft  @codeplaysoft  codeplay.com

We’re Hiring! codeplay.com/careers/

Thank you for listening