Copyright © 2014 Synopsys Inc. 1 Pierre Paulin, Director R&D Santa Clara, 29 May 2014 Combining Flexibility and Low-power in Embedded Vision Subsystems: An Application to Pedestrian Detection Bruno Lavigueur, Senior R&D Engineer

"Combining Flexibility and Low-Power in Embedded Vision Subsystems: An Application to Pedestrian Detection," a Presentation from Synopsys

Aug 16, 2015

Page 1: "Combining Flexibility and Low-Power in Embedded Vision Subsystems: An Application to Pedestrian Detection," a Presentation from Synopsys

Copyright © 2014 Synopsys Inc. 1

Combining Flexibility and Low-Power in Embedded Vision Subsystems: An Application to Pedestrian Detection

Pierre Paulin, Director R&D
Bruno Lavigueur, Senior R&D Engineer

Santa Clara, 29 May 2014

Page 2

Outline

• Pedestrian Detection algorithm overview
• Computation and bandwidth requirements
• Embedded Vision Reference Platform
• Programming Tools and Architecture
• Application Mapping to a Heterogeneous Multi-Core Platform
• From functional implementation in OpenCV to a fully optimized mapping to GPP and ASIP cores
• Final optimized mapping
• Power-Performance-Area analysis
• FPGA-based prototype
• Lessons learned, outlook

Page 3

Synopsys: EDA Industry Leadership

• EDA tool and IP provider
• $1.96B in revenue (FY 2013)
• ~8700 employees (> 5600 R&D engineers)
• ~81 offices worldwide
• Products for Designing Embedded Vision Systems
• Embedded Cores (ARC HS, EM, 600, 700)
• Application Specific Processor (ASIP) design tools
• Semiconductor IP (DDR, DMA, AXI, HDMI, USB, A/D, …)
• Synthesis and verification for SoCs and FPGAs
• FPGA-based rapid prototyping system

Page 4

Pedestrian Detection and HOG

• Pedestrian detection
• One of the most popular EV applications
• Standard feature in luxury vehicles
• Moving to mid-size and compact vehicles in the next 5-10 years, also due to legislation efforts
• Implementation requirements
• Low cost
• Low power (small form factor, and/or battery powered)
• Programmable (to allow for in-field SW upgrades)
• Most popular algorithm for pedestrian detection is Histogram of Oriented Gradients (HOG)

Page 5

Histogram Of Oriented Gradients

Processing pipeline:
• Grey scale conversion
• Scale to multiple resolutions
• Gradient computation
• Histogram computation per block
• Normalization of the histograms
• SVM per window position
• Non-max suppression

Gradient Computation

Apply Sobel operators:

$$
\begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{bmatrix}
$$
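The gradient step can be illustrated with a minimal pure-Python sketch that applies the two Sobel kernels to a grey-scale image stored as a list of rows. The names `SOBEL_X`, `SOBEL_Y`, `apply_kernel` and `sobel_gradients` are illustrative, and folding the angle into the range 0 to 180 degrees (unsigned orientation) is an assumption commonly made for HOG, not something stated on the slide.

```python
import math

# The two 3x3 Sobel kernels quoted on the slide.
# SOBEL_Y differences rows (vertical change), SOBEL_X differences columns.
SOBEL_Y = [[+1, +2, +1],
           [0, 0, 0],
           [-1, -2, -1]]
SOBEL_X = [[+1, 0, -1],
           [+2, 0, -2],
           [+1, 0, -1]]


def apply_kernel(image, kernel, row, col):
    """Correlate a 3x3 kernel with the neighbourhood centred on (row, col)."""
    return sum(kernel[i][j] * image[row - 1 + i][col - 1 + j]
               for i in range(3) for j in range(3))


def sobel_gradients(image):
    """Return (magnitude, angle_deg) maps; the 1-pixel border is left at 0."""
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = apply_kernel(image, SOBEL_X, r, c)
            gy = apply_kernel(image, SOBEL_Y, r, c)
            mag[r][c] = math.hypot(gx, gy)
            # Assumption: unsigned orientation, folded into [0, 180) degrees.
            ang[r][c] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang
```

For an image with a single vertical edge, only the column-difference kernel responds, so the magnitude at the edge equals the absolute value of that response and the orientation is 0 degrees.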

Scale to Multiple Resolutions

Use a fixed 64x128-pixel detection window.

Apply this detection window to scaled frames.
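The multi-resolution scan can be sketched as follows. Only the 64x128 window size comes from the slide; the scale step of 1.2 and the 8-pixel window stride are hypothetical parameters chosen for illustration, and the function names are not from the presentation.

```python
WIN_W, WIN_H = 64, 128  # fixed detection window from the slide


def pyramid_levels(width, height, scale_step=1.2):
    """List scaled frame sizes, shrinking until the window no longer fits.

    scale_step is an assumed value, not taken from the presentation.
    """
    levels = []
    scale = 1.0
    while True:
        w, h = int(width / scale), int(height / scale)
        if w < WIN_W or h < WIN_H:
            break
        levels.append((w, h))
        scale *= scale_step
    return levels


def window_positions(w, h, stride=8):
    """Number of window placements in one scaled frame (assumed stride)."""
    return ((w - WIN_W) // stride + 1) * ((h - WIN_H) // stride + 1)


def total_windows(width, height):
    """Window placements summed over all pyramid levels."""
    return sum(window_positions(w, h) for w, h in pyramid_levels(width, height))
```

Keeping the window fixed and shrinking the frame means that a single classifier covers pedestrians of different apparent sizes, at the cost of evaluating many window positions per frame.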

Page 6

Histogram Of Oriented Gradients

Histogram Computation

The image is divided into 8x8-pixel cells. For every block of 2x2 cells, apply Gaussian weights and compute 4 histograms of gradient orientation.
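A simplified sketch of the per-block histogram step is shown here. It assumes 9 orientation bins of 20 degrees each (a common HOG choice that the slide does not state) and uses plain nearest-bin voting, so the Gaussian weighting mentioned on the slide is deliberately left out. The inputs `mag` and `ang` are per-pixel gradient magnitude and orientation maps, and all names are illustrative.

```python
CELL = 8     # cell size in pixels, from the slide
N_BINS = 9   # assumption: 9 unsigned-orientation bins of 20 degrees


def cell_histogram(mag, ang, top, left):
    """Magnitude-weighted orientation histogram of one 8x8 cell."""
    hist = [0.0] * N_BINS
    bin_width = 180.0 / N_BINS
    for r in range(top, top + CELL):
        for c in range(left, left + CELL):
            b = int(ang[r][c] // bin_width) % N_BINS
            hist[b] += mag[r][c]  # simplification: no Gaussian weight
    return hist


def block_histograms(mag, ang, top, left):
    """Four cell histograms for the 2x2-cell block with corner (top, left)."""
    return [cell_histogram(mag, ang, top + dy * CELL, left + dx * CELL)
            for dy in (0, 1) for dx in (0, 1)]
```

Each block therefore contributes 4 x 9 = 36 values under these assumptions.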

Normalization of the Histograms

(1) L2 Normalization (2) clipping (saturation) (3) L2 Normalization
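The three normalization steps can be sketched as below. The clipping threshold of 0.2 and the small `eps` guard against division by zero are assumed values commonly used with HOG; the slide gives only the sequence of steps.

```python
import math


def l2_normalize(v, eps=1e-6):
    """Scale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in v) + eps * eps)
    return [x / norm for x in v]


def normalize_block(v, clip=0.2):
    """(1) L2 normalization, (2) clipping (saturation), (3) L2 normalization."""
    v = l2_normalize(v)
    v = [min(x, clip) for x in v]  # assumed clip value, not from the slide
    return l2_normalize(v)
```

Clipping limits the influence of a single very strong gradient, and the second normalization restores unit length after clipping.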

Support Vector Machine

Linear classification of histograms for every 64x128 window position.
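A linear classifier reduces to one dot product per window position. The sketch below assumes 9 bins per cell and blocks that overlap by one cell, which are common HOG settings and not stated on the slide. The weights and bias would come from offline training and are placeholders here.

```python
def descriptor_length(win_w=64, win_h=128, cell=8, block_cells=2, n_bins=9):
    """Feature-vector length for one window (assumed overlap of one cell)."""
    blocks_x = win_w // cell - block_cells + 1
    blocks_y = win_h // cell - block_cells + 1
    return blocks_x * blocks_y * block_cells * block_cells * n_bins


def svm_score(weights, bias, descriptor):
    """Linear SVM decision value: dot(weights, descriptor) + bias."""
    return sum(w * x for w, x in zip(weights, descriptor)) + bias


def is_pedestrian(weights, bias, descriptor, threshold=0.0):
    """Classify one window; threshold is a tunable assumption."""
    return svm_score(weights, bias, descriptor) > threshold
```

Under these assumptions a 64x128 window yields 7 x 15 blocks of 36 values, so each window position costs 3780 multiply-accumulate operations.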

Non-Max Suppression

Cluster multi-scale dense scan of

detection windows and select unique
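One common way to reduce overlapping detections to unique ones is greedy overlap-based suppression, sketched below. This is a stand-in technique for illustration; the presentation does not say which clustering method was used. Boxes are `(x, y, w, h)` tuples already mapped back to original frame coordinates, and the 0.5 overlap threshold is an assumed value.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the best-scoring box of each group of overlapping detections.

    detections is a list of (score, box) pairs.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, k[1]) <= iou_threshold for k in kept):
            kept.append((score, box))
    return kept
```

The step is cheap compared with the dense scan because it only touches windows that the classifier accepted.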


Page 7

Embedded Vision Reference Platform Overview

Page 8

Embedded Vision Reference Platform

[Diagram: layers of the Embedded Vision Reference Platform]
• Ported OpenCV library Pedestrian Detection, etc.
• C API to ASIP-based vision accelerators
• Configurable ARC HS RISC processor
• ASIP-based accelerators
• HAPS® FPGA-based prototyping system
• Pre-verified flow and examples

Page 9

Time-to-market and Flexibility vs. Power-Performance-Area Trade-offs

[Diagram: Subsystem Controller (HS, MQX Lightweight O/S) and Emb. Vision Accelerators (three ASIPs) placed between two opposing axes, Time-to-market, Flexibility and Power-Performance-Area Efficiency. Parallelism scale: 1X, 10X, 100X. Categories: Sequential Tasks, Task Level Parallelism, Data Level Parallelism.]

High-level processing:
- Control
- Multi-object tracking
- Post-processing
- High-level command interpretation

Pre-processing:
- Filtering
- Color conversion
- Image scaling
- Feature extraction and matching
- Segmentation

Page 10

Main architectural components

• ARC HS family of high-performance cores
• ARC HS 36 performance and power at 28 nm HPM process (worst case):
• Scalable to 1.6 GHz
• 1.9 DMIPS/MHz
• 37 µW/MHz
• 53 Dhrystone GIPS/W
• Application-Specific Instruction-set Processors (ASIP)
• User-driven design of processors tailored to a specific application
• Ability to guide performance-power-area and flexibility trade-offs
• Automatic generation of implementation, C compiler and programming tools from instruction-set specification
• Connectivity components
• DMA, AXI, DDR, etc.

Page 11

Embedded Vision Flow and Architecture

[Diagram: software stack on top of the hardware architecture]
• Software: HOG Embedded App. (C/C++), C API to Accelerators, Base drivers, MQX runtime
• Hardware: HS with DCCM, ASIP1, ASIP2 up to ASIPn, three blocks labeled D, L2 SRAM
• Interconnect: AXI-4 local interconnect, DMA, Sync & I/O, Dedicated Streaming Interconnect (FIFOs)
• Prototype: HAPS-70 S12 (12M ASIC gate equiv.)

Page 12

HOG Mapping and Refinement Flow

[Diagram: Camera, HOG Detection, DVI Output]

Page 13

HOG Mapping and Refinement Flow

• Refinement from an OpenCV high-level functional description to a fully optimized multi-processor SoC combining a GP RISC with multiple ASIPs
• Main steps
• OpenCV functional reference
• Optimization and porting onto MQX RTOS
• Profiling of all major functions
• Identification of high-compute kernels
• Development of ASIPs using Synopsys ASIP design and exploration tools
• Stepwise refinement
• From GPP only to GPP + multiple ASIPs

Page 14

ARC and ASIP Exploration Tool Flow

[Chart: % of processing per stage (Rescale, Grad, Hist, Norm, SVM, Other); vertical axis 0 to 50%]

[Diagram: ARC-ASIP Trade-off Exploration]
• ARC HS S/W Optimization (MQX RTOS): C code, Optimizing Compiler, Assembler, Linker, Instrn.-Set Simulator, Debugger, Profiler
• ASIP HW/SW Optimization: C code, Processor Description Language, Optimizing Compiler, Assembler, Linker, Instrn.-Set Simulator, Debugger, Profiler, RTL Gen., Sim, FPGA, RTL Synthesis

Page 15

HOG Functional Validation on ARC HS (640 × 480 pixels)

[Diagram, labeled 1: the pipeline stages (Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression) shown on the platform. Blocks shown: HS subsystem controller with DCCM, ASIP1, ASIP2, ASIP4, three blocks labeled D, AXI local interconnect, DMA, Sync & I/O, Dedicated Streaming Interconnect (FIFOs), L3 Ext. DRAM.]

• C fixed point profiling results: 2.25 G cycles per frame
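A back-of-the-envelope calculation shows what the profiling figure implies. The sketch assumes the 25 FPS target quoted on the later profiling slides and the 1.6 GHz clock quoted for the ARC HS 36. It also assumes ideal scaling with no memory stalls or parallelization overhead, so the result is a lower bound, not a measured value.

```python
CYCLES_PER_FRAME = 2.25e9  # from the profiling result on the slide
TARGET_FPS = 25            # assumption: target rate from the profiling slides
CLOCK_HZ = 1.6e9           # assumption: maximum ARC HS 36 clock from the deck


def required_cycles_per_second(cycles_per_frame, fps):
    """Processor cycles needed each second to sustain the frame rate."""
    return cycles_per_frame * fps


def equivalent_cores(cycles_per_second, clock_hz):
    """Cores needed at the given clock, assuming ideal scaling."""
    return cycles_per_second / clock_hz


def single_core_fps(cycles_per_frame, clock_hz):
    """Frame rate one core reaches on its own."""
    return clock_hz / cycles_per_frame
```

With these inputs the workload is 56.25 G cycles per second, or roughly 35 cores at 1.6 GHz. A single core would run at well under one frame per second.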

Page 16

Histogram Of Oriented Gradients Profiling (640 × 480 pixels, at 25 FPS)

Stage                             ARC HS (G cycles)
Grey scale conversion             0.1
Scale to multiple resolutions     1.4
Gradient computation              17.3
Histogram computation per block   31.9
Normalization of the histograms   1.2
SVM per window position           15.7
Non-max suppression               0.004
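The relative weight of each stage can be worked out from the seven cycle counts. The sketch pairs the counts with the seven stages in the order listed, which is an assumption drawn from the slide layout. Only ratios are computed, so the exact time base of the "G cycles" column does not matter.

```python
# Cycle counts from the slide, paired with stages in listed order (assumed).
STAGE_GCYCLES = {
    "Grey scale conversion": 0.1,
    "Scale to multiple resolutions": 1.4,
    "Gradient computation": 17.3,
    "Histogram computation per block": 31.9,
    "Normalization of the histograms": 1.2,
    "SVM per window position": 15.7,
    "Non-max suppression": 0.004,
}


def stage_shares(table):
    """Percentage of total cycles spent in each stage."""
    total = sum(table.values())
    return {name: 100.0 * cycles / total for name, cycles in table.items()}


def dominant_stages(table, n=3):
    """Names of the n most expensive stages, most expensive first."""
    return sorted(table, key=table.get, reverse=True)[:n]
```

Under this pairing the histogram, gradient and SVM stages together account for about 96% of the cycles. Those are the kernels a designer would look at first for acceleration.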

Page 17

Histogram Of Oriented Gradients Profiling (640 × 480 pixels, at 25 FPS)

[Chart: % of processing per stage (Rescale, Grad, Hist, Norm, SVM, Other); vertical axis 0 to 50%]

Page 18

Histogram Of Oriented Gradients Profiling (640 × 480 pixels, at 25 FPS)

[Chart: # ARC HS per stage (Rescale, Grad, Hist, Norm, SVM, Other); vertical axis 0 to 18. Callouts: Single Core, Multicore?, Accelerate!]

Page 19

Task Assignment #2

[Diagram, labeled 2: the pipeline stages (Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression) mapped onto the platform. Blocks shown: HS subsystem controller with DCCM, ASIP1, ASIP2, ASIP4, three blocks labeled D, AXI local interconnect, DMA, Sync & I/O, Dedicated Streaming Interconnect (FIFOs), L3 Ext. DRAM. Clock labels: 1.6 GHz, 1.6 GHz, 400 MHz.]

Page 20

Task Assignment #3

[Diagram, labeled 3: the pipeline stages (Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression) mapped onto the platform. Blocks shown: HS subsystem controller with DCCM, ASIP1, ASIP2, ASIP3, ASIP4, four blocks labeled D, AXI local interconnect, DMA, Sync & I/O, Dedicated Streaming Interconnect (FIFOs), L3 Ext. DRAM. Clock labels: 1.6 GHz, 400 MHz.]

Page 21

Task Assignment #4

[Diagram, labeled 4: the pipeline stages (Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression) mapped onto the platform. Blocks shown: HS subsystem controller with DCCM, ASIP1’, ASIP2, ASIP3, ASIP4, four blocks labeled D, AXI local interconnect, DMA, Sync & I/O, Dedicated Streaming Interconnect (FIFOs), L3 Ext. DRAM. Clock labels: 400 MHz, 400 MHz.]

Page 22

Task Assignment #4 With On-Chip L2

[Diagram, labeled 4: the pipeline stages (Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression) mapped onto the platform. Blocks shown: HS with DCCM, L2 SRAM, ASIP1’, ASIP2, ASIP3, ASIP4, four blocks labeled D, AXI local interconnect, DMA, Sync & I/O, Dedicated Streaming Interconnect (FIFOs), L3 Ext. DRAM. Annotations: Storage of scaled images; 200 MB/s; 80 MB/s. Clock labels: 400 MHz, 400 MHz.]

Page 23

Power, Gate Count Comparisons (28 nm)

640 × 480 pixels, at 25 FPS

[Chart: Gates (K) for Config #2, #3 and #4, split into ASIP gates and ARC gates; vertical axis 0 to 1400]
[Chart: Power (mW) for Config #2, #3 and #4, split into ASIP power and ARC power; vertical axis 0 to 140]
[Chart: Effort (person-months) for Config #2, #3 and #4, split into ASIP design and S/W and ARC S/W; vertical axis 0 to 6]

HAPS FPGA-based demo platform

Note: Gates and power for processors and local memory

Page 24

Final Results for Demonstrator Platform

• 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAM, L2 SRAM
• Fixed point version of HOG derived from OpenCV
• 25 frames/second at 400 MHz (ARC and ASIPs)
• TSMC HPM process, 28nm
• Gate count (at 400 MHz): 471K gates
• 303K gates for ASIPs, 168K gates for ARC HS 36
• Power consumption: 60 mW
• Prototype running on HAPS board (ASIPs)
• 4 frames/second at 70 MHz

Demonstration available at our booth
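The figures above can be cross-checked with simple arithmetic. The sketch assumes that frame rate scales linearly with clock frequency, which ignores memory and I/O effects on the FPGA prototype. The function names are illustrative.

```python
def scaled_fps(fps_ref, clock_ref_hz, clock_new_hz):
    """Frame rate at a new clock, assuming linear scaling with frequency."""
    return fps_ref * clock_new_hz / clock_ref_hz


def cycles_per_frame(clock_hz, fps):
    """Per-core cycle budget available for one frame."""
    return clock_hz / fps


def total_gates_k(asip_gates_k, arc_gates_k):
    """Combined gate count in thousands of gates."""
    return asip_gates_k + arc_gates_k
```

Scaling 25 frames/second at 400 MHz down to 70 MHz gives about 4.4 frames/second, in line with the 4 frames/second reported for the HAPS prototype. The gate counts add up as well: 303K + 168K = 471K.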

Page 25

Lessons Learned

[Diagram, labeled 4: final task assignment between two opposing axes, Time-to-market, Flexibility and Power-Performance-Area Efficiency. Parallelism scale: 1X, 10X, 100X. Categories: Sequential Tasks, Task Level Parallelism, Data Level Parallelism.]

Subsystem Controller (HS):
1) Greyscale
2) Non-max suppr.
3) Display
4) Control, O/S

Emb. Vision Accelerators (ASIP1’, ASIP2, ASIP3, ASIP4):
1’) Rescaling + Gradient
2) Histogram
3) Normalization
4) SVM

Page 26

Lessons Learned

[Diagram, labeled 4: same layout as the previous slide, annotated with implementation results. Axes: Time-to-market, Flexibility and Area Efficiency. Parallelism scale: 1X, 60X~80X. Categories: Sequential Tasks, Task Level Parallelism, Data Level Parallelism.]

• Subsystem Controller (HS)
• 20% utilization
• 1.6 GHz: 473K gates, 37 µW/MHz
• 400 MHz: 168K gates, 18 µW/MHz
• Emb. Vision Accelerators (ASIP1’, ASIP2, ASIP3, ASIP4) at 400 MHz
• 303K gates, 58 mW (Logic = 12 mW, SRAM = 46 mW)
• Combined: 471K gates, 60 mW @ 28nm
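The combined power figure can be approximately reproduced from the per-component numbers. The sketch assumes that the 20% utilization applies to the ARC HS and that its power scales linearly with utilization. Neither point is stated explicitly on the slide, so this is a plausibility check and not the authors' calculation.

```python
def arc_power_mw(uw_per_mhz, clock_mhz, utilization=1.0):
    """ARC HS power in mW from a uW/MHz figure, clock and utilization."""
    return uw_per_mhz * clock_mhz * utilization / 1000.0


def asip_power_mw(logic_mw, sram_mw):
    """ASIP power as the sum of logic and SRAM contributions."""
    return logic_mw + sram_mw


def combined_power_mw(asip_mw, arc_mw):
    """Total power of the accelerators plus the controller."""
    return asip_mw + arc_mw
```

With 18 µW/MHz at 400 MHz the ARC HS draws 7.2 mW at full load, or about 1.4 mW at 20% utilization. Adding the 58 mW of the ASIPs (12 mW logic plus 46 mW SRAM) gives roughly 59.4 mW, close to the quoted 60 mW.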

Page 27

Embedded Vision Platform Directions & Wish List

[Diagram: Subsystem Controller (HS) and Emb. Vision Accelerators (three ASIPs) placed between two opposing axes, Time-to-market, Flexibility and Power-Performance-Area Efficiency. Parallelism scale: 1X, 10X, 100X. Categories: Sequential Tasks, Task Level Parallelism, Data Level Parallelism. Additional labels: OpenCV, MQX O/S, Accelerator C API, Close coupling, Vision Extn. SIMD (64 bit).]

High-level processing:
- Body part detection
- Multi-object tracking
- Post-processing
- Command interpretation

Pre-processing:
- Filtering
- Color conversion
- Image scaling
- Feature extraction and matching
- Segmentation

Page 28

Conclusions

• Embedded vision applications combine complex algorithms and high data rates with a need for low power
• Need to trade off Flexibility vs. Power-Perf-Area
• Flexibility via high-performance ARC HS core
• Ability to trade off power vs. performance
• Scaling to multi-core, specialization and SIMD usage
• Highest PPA via ASIPs
• Performance gains and power efficiency due to tailored instruction sets and dedicated memory architecture
• While fully programmable, gains are application specific

Page 29

Backup slides

Page 30

Design flow for the Vision Sub System

[Diagram: tool flow for building the subsystem. Labels: ARC HS; DW AXI interco; DesignWare DMA; DesignWare DDR; ARChitect; ASIP Processor Designer; Core Assembler; ASIP description; ASIP ISA description; Ref Sub System; ASIP; Synthesis + P&R tools; Core Consultant; SubSys settings; ARC settings; coreKit Tool; Core Builder; Core Builder User config; VCS; DVE; MDB; PDBG; HAPS; Legend]

Page 31

Synopsys’ ASIP Design Tool Flow

[Diagram: Processor Description Language, Optimizing Compiler, Assembler, Linker, Instruction-Set Simulator, Debugger, Profiler, RTL Generator, RTL Sim & FPGA, RTL Synthesis]

• Full-featured SDK with graphical debugger
• Compiler supports processor-specific data types and operators
• Advanced optimizations allow C programmers to easily tap into architectural efficiencies
• Fast retargeting to evaluate incremental processor architecture changes quickly
• High-level language to quickly capture ISA
• Tight control of architecture (RTL-level)
• Fast simulation technology
• Easy integration into SystemC virtual platforms
• Multicore and on-chip debugging
• Smooth integration with RTL implementation and verification flows

Page 32

Architectural Optimization Space

[Diagram: ASIP architectural optimization space, split into Parallelism and Specialization]

Parallelism:
• Instruction-level parallelism
• Data-level parallelism
• Task-level parallelism
• Options shown: Orthogonal instruction set (VLIW); Encoded instruction set; Vector processing (SIMD); Multi-core; Multi-threading

Specialization:
• App.-specific data types
• App.-specific instructions
• Connectivity & storage matching application’s data-flow
• App.-spec. data processing
• App.-spec. memory addressing
• App.-spec. control processing
• Examples shown: Distributed regs, sub-ranges; Multiple mem’s, sub-ranges; Jumps, subroutines, interrupts, HW do-loops, residual control, predication…; Direct, indirect, post-modification, indexed, stack indirect…; Any exotic operator; Integer, fractional, floating-point, bits, complex, vector…; Single or multi-cycle; Relative or absolute, address range, delay slots…; Pipeline

Synopsys ASIP tools …
• Support a wide range of ASIP architectures
• Support RTL accelerator tricks for highest PPA efficiency
• Enable ASIP optimization through architectural exploration