Embedded OpenCV Acceleration Dario Pennisi

Dec 31, 2015

Page 1: Embedded  OpenCV  Acceleration

Embedded OpenCV Acceleration
Dario Pennisi

Page 2: Embedded  OpenCV  Acceleration

Introduction

Open-Source Computer Vision Library
Over 2500 algorithms and functions
Cross-platform, portable API
• Windows, Linux, OS X, Android, iOS
Real-time performance
BSD license
Professionally developed and maintained

19/04/2023

Page 3: Embedded  OpenCV  Acceleration

History

Launched in 1999 by Intel
• Showcasing Intel Performance Library
First alpha released in 2000
1.0 version released in 2006
Corporate support by Willow Garage in 2008
2.0 version released in 2009
• Improved C++ interfaces
• Releases every 6 months
In 2014 taken over by Itseez
3.0 in beta now
• Drops C API support

Page 4: Embedded  OpenCV  Acceleration

OpenCV application structure

Building blocks to ease vision applications

Pipeline stages: Image Retrieval, Pre-Processing, Feature Extraction, Object Detection
Modules: highgui, imgproc, features2d, objdetect

Pipeline stages: Recognition / Reconstruction, Analysis / Decision Making
Modules: calib3d, video, stitching, ml

Page 5: Embedded  OpenCV  Acceleration

Environment

Application: C++, Java, Python
OpenCV
Acceleration: SSE/AVX/NEON, OpenCL, CUDA
cv::parallel_for_
Threading APIs: Concurrency, GCD, TBB, OpenMP, CStripes
OS

Page 6: Embedded  OpenCV  Acceleration

Desktop vs Embedded

                 Desktop                    Industrial Embedded
Cores/Threads    8/16                       4/4
Core Frequency   >4 GHz                     >1.4 GHz
L1 Cache         32K+32K                    32K+32K
L2 Cache         256K per core              2M shared
L3 Cache         20M                        -
DDR Controllers  4x64-bit DDR4 @ 1066 MHz   2x32-bit DDR3 @ 800 MHz
TDP              140 W (CPU)                10 W (SoC)
GPU cores        2880                       1+4+16

Page 7: Embedded  OpenCV  Acceleration

System Engineering

Dimensioning the system is fundamental
Understand your algorithm
Carefully choose your toolbox
Embedded means no chance for “one size fits all”

Page 8: Embedded  OpenCV  Acceleration

Acceleration Strategies

Optimize algorithms
• Profile
• Optimize
• Partition (CPU/GPU/DSP)

FPGA acceleration
• High level synthesis
• Custom DSP
• RTL coding

Brute force
• Increase number of CPUs
• Increase CPU frequency

Accelerated libraries
• NEON
• OpenCL/CUDA

Page 9: Embedded  OpenCV  Acceleration

Bottlenecks


Know your enemy

Page 10: Embedded  OpenCV  Acceleration

Memory

Access to external memory is expensive
• CPU load instructions are slow
• Memory has latency
• Memory bandwidth is shared among CPUs

Cache
• Prevents the CPU from accessing external memory
• Data and instruction

Page 11: Embedded  OpenCV  Acceleration

Disordered accesses

What happens when we have a cache miss?
• Fetch data from the same memory row: 13 clocks
• Fetch data from a different row: 23 clocks

Cache line is usually 32 bytes
• 8 clocks to fill a line (32-bit data bus)

Memory bandwidth efficiency
• 38% on the same row
• 26% on a different row
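
The efficiency figures can be reproduced from the clock counts quoted on this slide. A minimal sketch of that arithmetic, assuming (this reading is not stated on the slide) that efficiency means transfer clocks divided by total clocks, and that the miss latency and the 8-clock line fill do not overlap:

```python
CACHE_LINE_BYTES = 32   # from the slide
BUS_WIDTH_BYTES = 4     # 32-bit data bus

def line_fill_efficiency(latency_clocks: int) -> float:
    """Fraction of clocks that actually move data during one cache-line fill.

    Assumes latency and burst are strictly sequential (no overlap, no
    prefetch); that is an assumption made for this sketch.
    """
    transfer_clocks = CACHE_LINE_BYTES // BUS_WIDTH_BYTES   # 8 clocks
    return transfer_clocks / (latency_clocks + transfer_clocks)

same_row = line_fill_efficiency(13)        # 8 / 21
different_row = line_fill_efficiency(23)   # 8 / 31

print(f"same row: {same_row:.0%}, different row: {different_row:.0%}")
# prints "same row: 38%, different row: 26%"
```

Under this reading, even an ideal miss leaves the bus idle for more than 60% of the time, which is why ordered accesses matter so much.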

Page 12: Embedded  OpenCV  Acceleration

Bottlenecks - Cache

1920x1080 YCbCr 4:2:2 (Full HD): 4 MB
• Double the size of the biggest ARM L2 cache

1280x720 YCbCr 4:2:2 (HD): 1.8 MB
• Just fits in L2 cache… OK if reading and writing to the same frame

720x576 YCbCr 4:2:2 (SD): 800 KB
• 2 images in L2 cache…
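
The frame sizes can be checked with a few lines of arithmetic. This sketch assumes 8-bit samples (so YCbCr 4:2:2 averages 2 bytes per pixel) and takes the 2 MB shared L2 from the Desktop vs Embedded comparison as the cache size; both values are assumptions of the example, not measurements.

```python
L2_CACHE_BYTES = 2 * 1024 * 1024   # "2M shared" L2, assumed as the reference cache
BYTES_PER_PIXEL = 2                # 8-bit YCbCr 4:2:2: one Y per pixel, Cb/Cr shared by pixel pairs

def frame_bytes(width: int, height: int) -> int:
    """Size in bytes of one packed 4:2:2 frame."""
    return width * height * BYTES_PER_PIXEL

FORMATS = {"Full HD": (1920, 1080), "HD": (1280, 720), "SD": (720, 576)}

for name, (w, h) in FORMATS.items():
    size = frame_bytes(w, h)
    frames_in_l2 = L2_CACHE_BYTES // size
    print(f"{name}: {size} bytes, {frames_in_l2} whole frame(s) fit in L2")
```

With these assumptions a Full HD frame is 4,147,200 bytes (no whole frame fits), an HD frame is 1,843,200 bytes (one fits) and an SD frame is 829,440 bytes (two fit).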


Page 13: Embedded  OpenCV  Acceleration

OpenCV Algorithms

Mostly designed for PCs
Well structured
General purpose
Optimized functions for SSE/AVX
Relatively optimized
Small number of accelerated functions
• NEON
• CUDA (NVIDIA GPU/Tegra)
• OpenCL (GPU, multicore processors)

Page 14: Embedded  OpenCV  Acceleration

Multicore ARM/NEON

NEON SIMD instructions work on vectors of registers

Load-process-store philosophy
Load/store costs 1 cycle only if in L1 cache
• 4-12 cycles if in L2
• 25 to 35 cycles on L2 cache miss

SIMD instructions can take from 1 to 5 clocks

A fast clock is useless on big datasets with little computation per element
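
To see why a fast clock does not help here, the cycle counts on this slide can be put into a rough cost model. This is only a sketch: it takes the midpoints of the quoted ranges, assumes one load and one store per SIMD operation, and assumes memory accesses and computation do not overlap. None of these assumptions come from the slides.

```python
# Cycle costs from the slide; the midpoints of the quoted ranges are assumptions.
L1_HIT_CYCLES = 1
L2_HIT_CYCLES = 8      # midpoint of 4-12
L2_MISS_CYCLES = 30    # midpoint of 25-35

def avg_load_store_cycles(l1_hit_rate: float, l2_hit_rate: float) -> float:
    """Expected cycles per load/store.

    l1_hit_rate: fraction of accesses served by L1.
    l2_hit_rate: fraction of the L1 misses that are served by L2.
    """
    l1_miss = 1.0 - l1_hit_rate
    return (l1_hit_rate * L1_HIT_CYCLES
            + l1_miss * l2_hit_rate * L2_HIT_CYCLES
            + l1_miss * (1.0 - l2_hit_rate) * L2_MISS_CYCLES)

def memory_bound_fraction(l1_hit_rate: float, l2_hit_rate: float,
                          simd_cycles: int = 1) -> float:
    """Share of one load-process-store step spent on the load and the store."""
    mem = 2 * avg_load_store_cycles(l1_hit_rate, l2_hit_rate)
    return mem / (mem + simd_cycles)

cache_friendly = memory_bound_fraction(1.0, 1.0)   # everything served by L1
streaming = memory_bound_fraction(0.0, 0.0)        # every access misses L2
print(f"cache friendly: {cache_friendly:.0%}, streaming: {streaming:.0%}")
```

In this model a streaming workload spends almost all of its cycles waiting on memory, so raising the core clock changes very little.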


Page 15: Embedded  OpenCV  Acceleration

Generic DSP

Very similar to ARM/NEON
High-speed pipeline impaired by an inefficient memory access subsystem
When smart DMA is available, it is very complex to program

When the DSP is integrated in the SoC, it shares the ARM’s bandwidth

Page 16: Embedded  OpenCV  Acceleration

OpenCL on GPU

OpenCL on Vivante GC2000
• Claimed capability up to 16 GFLOPS

Real applications
• Only on internal registers: 13.8 GFLOPS
• Computing a 1000x1000 matrix: 600 MFLOPS

Bandwidth and inefficiencies
• Only 1K local memory and 64-byte memory cache
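
The gap between the claimed and measured figures is easier to judge as a ratio. A small sketch using only the three numbers quoted on this slide (the variable and function names are illustrative):

```python
CLAIMED_GFLOPS = 16.0          # vendor claim quoted on the slide
REGISTER_ONLY_GFLOPS = 13.8    # measured, data kept in internal registers
MATRIX_GFLOPS = 0.6            # measured, 1000x1000 matrix (600 MFLOPS)

def fraction_of_claim(measured_gflops: float) -> float:
    """Measured throughput as a fraction of the claimed peak."""
    return measured_gflops / CLAIMED_GFLOPS

register_share = fraction_of_claim(REGISTER_ONLY_GFLOPS)
matrix_share = fraction_of_claim(MATRIX_GFLOPS)
slowdown = REGISTER_ONLY_GFLOPS / MATRIX_GFLOPS

print(f"register-only: {register_share:.2f} of claimed peak")
print(f"1000x1000 matrix: {matrix_share:.4f} of claimed peak")
print(f"slowdown once memory is involved: about {slowdown:.0f}x")
```

The register-only case reaches roughly 86% of the claim, while the matrix case stays under 4%, a drop of about 23x that comes from memory traffic rather than arithmetic.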


Page 17: Embedded  OpenCV  Acceleration

OpenCL on FPGA

Same code can run on FPGA and GPU
Transform selected functions into hardware
Automated memory access coalescing
Each function requires dedicated logic
• Large FPGAs required
• Partial reconfiguration may solve this

Significant compilation time

Page 18: Embedded  OpenCV  Acceleration

HLS on FPGA

High Level Synthesis
• Convert C to hardware

HLS requires code to be heavily modified
• Pragmas to instruct the compiler
• Code restructuring
• Not portable anymore

Each function requires dedicated logic
• Large FPGAs required
• Partial reconfiguration may solve this

Significant compilation time

Page 19: Embedded  OpenCV  Acceleration

A different approach

Demanding algorithms on low cost/power HW

Algorithm analysis: Memory Access Pattern, Data Intensive Processing, Decision Making
Implementation options: DMA, DSP, NEON, ARM program, Custom Instruction (RTL)

Page 20: Embedded  OpenCV  Acceleration

External co-processing

Diagram (block labels only): ARM, GPU, Memory, FPGA, Memory, PCIe; ARM, Memory, FPGA

Page 21: Embedded  OpenCV  Acceleration

Co-processor details

FPGA Co-Processor

Separate memory
• Adds bandwidth
• Reduces access conflicts

Algorithm-aware DMA
• Accesses memory in an ordered way
• Adds caching through embedded RAM

Algorithm-specific processors
• HLS/OpenCL synthesized IP blocks
• DSP with custom instructions
• Hardcoded IP blocks

Diagram (block labels only): ARM, Memory, DMA Processor, Block capture, DPRAM(s), DSP core(s), DSP core/IP Block

Page 22: Embedded  OpenCV  Acceleration

Co-processor details

Flex DMA
• Dedicated processor with DMA custom instruction
• Software-defined memory access pattern

Block Capture
• Extracts data for each tile

DPRAM
• Local, high-speed cache

DSP Core
• Dedicated processor with algorithm-specific custom instructions
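
A software-defined access pattern for tile extraction can be illustrated in plain Python. This is a hypothetical sketch, not the Flex DMA instruction set: the function `tile_bursts` and its parameters are invented for the example, and it only computes which contiguous bursts a DMA engine would need to fetch each tile from a row-major frame.

```python
from typing import Iterator, List, Tuple

Burst = Tuple[int, int]  # (start byte offset, length in bytes)

def tile_bursts(width: int, height: int, tile_w: int, tile_h: int,
                bytes_per_pixel: int = 1) -> Iterator[List[Burst]]:
    """Yield, tile by tile, the bursts needed to read that tile from a
    row-major frame. Edge tiles are clipped to the frame size."""
    stride = width * bytes_per_pixel
    for ty in range(0, height, tile_h):
        for tx in range(0, width, tile_w):
            w = min(tile_w, width - tx)
            h = min(tile_h, height - ty)
            # One contiguous burst per tile row.
            yield [((ty + row) * stride + tx * bytes_per_pixel,
                    w * bytes_per_pixel)
                   for row in range(h)]

tiles = list(tile_bursts(width=8, height=4, tile_w=4, tile_h=2))
for bursts in tiles:
    print(bursts)
# first tile printed: [(0, 4), (8, 4)]
```

Each tile becomes a short list of ordered bursts, which is the kind of regular pattern that can be fetched into local DPRAM ahead of the processing core.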

Diagram (block labels only): ARM, Memory, Flex DMA, Block capture, DPRAM(s), DSP core(s), DSP core/IP Block

Page 23: Embedded  OpenCV  Acceleration

OpenVX

Environment

Application: C++, Java, Python
OpenCV
Acceleration: SSE/AVX/NEON, OpenCL, CUDA, FPGA
cv::parallel_for_
Threading APIs: Concurrency, GCD, TBB, OpenMP, CStripes
OS

Page 24: Embedded  OpenCV  Acceleration

OpenVX


Page 25: Embedded  OpenCV  Acceleration

OpenVX Graph Manager

Diagram: Memory -> Node1 -> Memory -> Node2 -> Memory, compared with Memory -> Node1 -> Node2 -> Memory

Graph construction
• Allocates resources
• Logical representation of the algorithm

Graph execution
• Concatenates nodes, avoiding memory storage

Tiling extensions
• Single node execution can be split into multiple tiles
• Multiple accelerators executing a single task in parallel
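
The idea of concatenating nodes per tile can be sketched without any OpenVX calls. The code below is not the OpenVX API: `scale` and `threshold` are hypothetical stand-ins for Node1 and Node2, and the frame is a flat list of synthetic pixel values. It compares running each node over the full frame with chaining both nodes on one tile at a time.

```python
from typing import Callable, List

Node = Callable[[List[int]], List[int]]

def scale(pixels: List[int]) -> List[int]:       # hypothetical "Node1"
    return [2 * p for p in pixels]

def threshold(pixels: List[int]) -> List[int]:   # hypothetical "Node2"
    return [255 if p > 100 else 0 for p in pixels]

def run_unfused(frame: List[int], nodes: List[Node]) -> List[int]:
    """Each node reads and writes a full frame-sized buffer."""
    data = frame
    for node in nodes:
        data = node(data)   # full intermediate frame between nodes
    return data

def run_fused_tiled(frame: List[int], nodes: List[Node], tile: int) -> List[int]:
    """Chain all nodes on one tile before moving to the next tile, so the
    intermediate data is only ever tile-sized."""
    out: List[int] = []
    for start in range(0, len(frame), tile):
        chunk = frame[start:start + tile]
        for node in nodes:
            chunk = node(chunk)
        out.extend(chunk)
    return out

frame = list(range(0, 128, 8))   # 16 synthetic pixel values
pipeline = [scale, threshold]
unfused = run_unfused(frame, pipeline)
fused = run_fused_tiled(frame, pipeline, tile=4)
print(unfused == fused)
```

Both paths give the same result here because the two nodes are pointwise. Nodes that read a neighbourhood (filters, for example) would need overlapping tiles, which this sketch does not handle.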


Page 26: Embedded  OpenCV  Acceleration

Summary

What we learnt

• OpenCV today is mainly PC oriented

• ARM, CUDA and OpenCL support is growing

• Existing acceleration covers only selected functions

• Embedded CV requires good partitioning among resources

• When ASSPs are not enough, FPGAs are key

• OpenVX provides a consistent HW acceleration platform, not only for OpenCV

Page 27: Embedded  OpenCV  Acceleration

Questions


Page 28: Embedded  OpenCV  Acceleration

Thank you
