Manycores – From hardware perspective to software

Presenter: D96943001 陳泓輝, Graduate Institute of Electronics (電子所)

Why Moore’s Law is dying

He is not the CEO anymore!!

Walls => ILP, frequency, power, and memory walls

ILP – more cost, less return

ILP: instruction-level parallelism

OOO: out-of-order execution of micro-operations

Frequency wall

FO4 delay metric: the delay of an inverter that is driven by an inverter ¼ its size and that drives another inverter 4x its size

Saturated!

Frequency ↑ => cycle counts of some operations ↑
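One way to read the line above: if the physical latency of an operation stays fixed in nanoseconds, a faster clock needs more cycles to cover it. The sketch below only illustrates that arithmetic; the function name and the 2 ns latency are assumptions, not figures from the slides.

```python
import math

def cycles_for_latency(latency_ns: float, freq_ghz: float) -> int:
    """Whole clock cycles needed to cover a fixed latency at a given clock rate."""
    return math.ceil(latency_ns * freq_ghz)

# Hypothetical operation with a fixed 2 ns latency at three clock rates.
for freq_ghz in (1.0, 2.0, 4.0):
    print(f"{freq_ghz} GHz -> {cycles_for_latency(2.0, freq_ghz)} cycles")
```

The same operation costs more cycles as the clock rises, so a higher frequency does not translate one-to-one into throughput.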

Memory wall

The external access penalty (the gap) keeps increasing

Solution => enlarge the cache

Cache decides the performance and the price

It’s cache that matters!

The power wall

High power might imply:

Thermal runaway of device behavior

Larger current => electromigration => reliability issue of the metal interconnect

Hitting the packaging heat limit

Change to high-cost packaging

Cooling noise!!

Form factor

The great wall……

Moore’s Law

Multicore => Manycore

Historical - Intel 2007 Xeon

Dual on-chip memory controllers => fcpu > 2*fmem

Point-to-point interconnection => fabrics

Multiple communication activities at once (cf. “bus” => one activity)

Fabric working notation

AMD – Opteron (Shanghai)

Much the same as Intel Xeon

L3 cache shared among 2 cores

Game consoles

Xbox 360 => triple core (homogeneous)

PS3 => Cell, 8+1 cores (heterogeneous)

PowerPC wins!

State-of-the-art multicore DSP chips

TI TNETV3020: homogeneous

Freescale 8156: heterogeneous

State-of-the-art multicore DSP chips (continued)

picoChip PC205: heterogeneous

Tilera TILE64: homogeneous, mesh

State-of-the-art multicore x86 chips

Intel Single-chip Cloud Computer, 1 GHz Pentium cores

24 “tiles” with two IA cores per tile

A 24-router mesh network with 256 GB/s bisection bandwidth

4 integrated DDR3 memory controllers

Hardware support for message passing!!

GPGPU - OpenCL

Special case: multicore video processor

Characteristics of video applications in consumer electronics

High computational capability

Low hardware cost

Low power consumption

A general solution

Fixed-function logic design

Challenges

Multiple video decoding standards

Video decoding standards keep being updated

Ill-posed video processing algorithms

Product requirements are diverse and mutually exclusive

Broadcom: mediaDSP technology

Heterogeneous (programmable and fixed-function units)

A task-based programming model

A uniform approach for managing tasks executing on different types of programmable and fixed-function processing elements

A platform that is easily extended to support a range of applications

Easy to customize for special purposes

Success stories

SD MPEG video encoder including scaling and noise reduction

Frame-rate-conversion video processing for FHD@60Hz/120Hz videos

Nickname: accelerator

Classes of video processing

Highly parallelizable operations on fixed-point data, with no floating point => a processor with a SIMD data-path engine

Ad-hoc computation and decision making, operating on the smaller sets of data produced by the parallelizable processes => a general-purpose processor such as a RISC

Data movement and formatting on multidimensional pixels

Bit-serial processing for entropy decoding and encoding => dedicated hardware does this job very efficiently

Task-based programming model

Programmers’ duties are as follows:

Partition a sequential algorithm into a set of parallelizable tasks and then efficiently map it to the massively parallel architecture

A task has a definite initiation time

A task runs until completion with no interruption and no further synchronization with other tasks

Understand the hardware architecture and its limitations

Shared memory (instead of FIFO mode)

Buffer size must be enough for a data unit

Interconnect bandwidth must be enough

Computational power must be enough for real time
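The three "must be enough" constraints can be checked per task before mapping it to an engine. The following is a minimal sketch of such a check; the `Task` and `Engine` classes, their field names, and every number in the usage are hypothetical and do not come from the mediaDSP tool chain.

```python
from dataclasses import dataclass

@dataclass
class Task:
    data_unit_bytes: int   # size of one data unit the task works on
    bytes_moved: int       # interconnect traffic per data unit
    ops: int               # operations needed per data unit

@dataclass
class Engine:
    buffer_bytes: int        # local buffer capacity
    link_bytes_per_s: float  # interconnect bandwidth available to the engine
    ops_per_s: float         # sustained compute rate

def check_task(task: Task, engine: Engine, deadline_s: float) -> list:
    """Return the names of the constraints that the task violates."""
    problems = []
    if task.data_unit_bytes > engine.buffer_bytes:
        problems.append("buffer")
    if task.bytes_moved / engine.link_bytes_per_s > deadline_s:
        problems.append("bandwidth")
    if task.ops / engine.ops_per_s > deadline_s:
        problems.append("compute")
    return problems

# Hypothetical engine and two tasks, each with one frame period at 60 Hz as deadline.
engine = Engine(buffer_bytes=64 * 1024, link_bytes_per_s=1e9, ops_per_s=1e9)
light = Task(data_unit_bytes=32 * 1024, bytes_moved=1_000_000, ops=5_000_000)
heavy = Task(data_unit_bytes=32 * 1024, bytes_moved=1_000_000, ops=50_000_000)
print(check_task(light, engine, 1 / 60), check_task(heavy, engine, 1 / 60))
```

An empty list means the task fits; otherwise the task needs to be partitioned further or mapped elsewhere.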

(IP) Platform-based architecture

Task-oriented engine (TOE)

A programmable DSP or a fixed-function unit

Task control unit (TCU)

A RISC that maintains a queue of tasks and synchronizes with other TCUs/TOEs

Its goal is to maximize the utilization of the TOEs

Control engine

Shared memory

Communication fabric

Memory architecture

Memory hierarchy

L1 - Processor instruction and data memory

L2 - On-chip shared memory

L3 - Off-chip

All TOEs use software-managed DMA rather than caches for their local storage

6D addressing (x,y,t,Y,U,V) and the chunking of blocks into smaller subblocks

No {pre-fetching, early load scheduling, cache, speculative execution, multithreading …}

Broadcom BCM35421 chip [1/2]

Performs motion-compensated frame-rate conversion

Doubles the frame rate from FHD@60fps to FHD@120fps (to conquer motion blur)

24fps => 60fps (de-judder)
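For either conversion, each output frame falls at some fractional position between two source frames, and a motion-compensated converter interpolates at that position. The sketch below computes only those positions. It is a generic illustration, not Broadcom's algorithm, and the function name is an assumption.

```python
from fractions import Fraction

def interpolation_plan(src_fps: int, dst_fps: int, n_out: int) -> list:
    """For each output frame, return (source frame index, phase toward the next source frame)."""
    plan = []
    for k in range(n_out):
        position = Fraction(k * src_fps, dst_fps)        # output time in source-frame units
        base = position.numerator // position.denominator
        plan.append((base, position - base))
    return plan

# 24 fps => 60 fps: the pattern repeats every 5 output frames (2 source frames).
for base, phase in interpolation_plan(24, 60, 6):
    print(f"source frame {base}, phase {phase}")
```

A phase of 0 means the source frame can be reused as-is; any other phase requires an interpolated frame, which is where the motion compensation does its work.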

Broadcom BCM35421 chip [2/2]

65nm CMOS process

mediaDSP runs at 400 MHz

106 million transistors

Two teraops of peak integer performance

Performance of DSPs for applications

A DSP becomes useful when it can perform a minimum of 100 instructions per sample period

68% of DSPs shipped in 2008 were for mobile handsets and base stations

Several K cycles for processing an input sample
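The 100-instructions rule can be turned into a quick budget check: divide the instruction rate by the sample rate. The sketch below assumes one instruction per cycle by default; the clock and sample rates in the usage are illustrative assumptions, not figures from the slides.

```python
def instructions_per_sample(clock_hz: int, sample_rate_hz: int, ipc: int = 1) -> int:
    """Instructions available in one sample period (ipc = instructions per cycle, assumed)."""
    return (clock_hz * ipc) // sample_rate_hz

def is_dsp_useful(clock_hz: int, sample_rate_hz: int, ipc: int = 1) -> bool:
    """Apply the rule of thumb: at least 100 instructions per sample period."""
    return instructions_per_sample(clock_hz, sample_rate_hz, ipc) >= 100

# Hypothetical 400 MHz core: audio-rate samples vs. a 10 MS/s stream.
print(instructions_per_sample(400_000_000, 48_000), is_dsp_useful(400_000_000, 48_000))
print(instructions_per_sample(400_000_000, 10_000_000), is_dsp_useful(400_000_000, 10_000_000))
```

At audio rates the budget is several thousand instructions per sample, while at 10 MS/s a single core of this speed falls below the threshold.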

Multiple elements

Increase in performance: multiple elements > a single higher-performance element

Go deeper – TI’s multicore

Multicore Programming Guide

Mapping application to multicore

Know the processing model options

Identify all the tasks

Partition tasks into many small ones

Be familiar with inter-task communication/data flow

Combination/aggregation

Mapping

Memory hierarchy => L1/L2/L3, private/shared memory, number and capability of external memory channels

DMA

Special-purpose hardware!!

FFT, Viterbi, Reed-Solomon, AES codec, entropy codec

Parallel processing model

Very successful in communication systems

Router

Base station

Master/slave model

Data flow model
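In the master/slave model, one core hands out work items and the other cores process them independently. The sketch below imitates that pattern with Python threads and queues as stand-ins for cores and inter-core messaging; `master_slave` and its parameters are hypothetical names, not part of any TI API.

```python
import queue
import threading

def master_slave(items: list, work, n_slaves: int = 3) -> list:
    """Master puts jobs on a queue; slave threads pull jobs and post results."""
    tasks = queue.Queue()
    results = queue.Queue()

    def slave():
        while True:
            job = tasks.get()
            if job is None:          # sentinel from the master: no more work
                break
            index, item = job
            results.put((index, work(item)))

    slaves = [threading.Thread(target=slave) for _ in range(n_slaves)]
    for s in slaves:
        s.start()
    for job in enumerate(items):     # master distributes the work
        tasks.put(job)
    for _ in slaves:                 # one sentinel per slave
        tasks.put(None)
    for s in slaves:
        s.join()

    ordered = [None] * len(items)    # master reassembles results in input order
    while not results.empty():
        index, value = results.get()
        ordered[index] = value
    return ordered

print(master_slave([1, 2, 3, 4], lambda x: x * x))
```

A data flow model would instead chain the workers so that each stage passes its output to the next stage rather than back to a central master.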

Data movement

Shared memory

Dedicated memory

Transitional memory => ownership changes, content is not copied
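The transitional-memory idea is that a buffer stays where it is and only the right to use it moves from one core to another. A minimal sketch of that handoff follows; `TransitionalBuffer` and its methods are hypothetical names, and the ownership check is done in software purely for illustration.

```python
class TransitionalBuffer:
    """A buffer whose ownership moves between cores while the bytes stay in place."""

    def __init__(self, size: int, owner: int):
        self.data = bytearray(size)
        self.owner = owner

    def _check(self, core: int) -> None:
        if core != self.owner:
            raise PermissionError(f"core {core} does not own this buffer")

    def write(self, core: int, offset: int, payload: bytes) -> None:
        self._check(core)
        self.data[offset:offset + len(payload)] = payload

    def read(self, core: int, offset: int, length: int) -> bytes:
        self._check(core)
        return bytes(self.data[offset:offset + length])

    def hand_over(self, core: int, new_owner: int) -> None:
        """Transfer ownership; no data is copied."""
        self._check(core)
        self.owner = new_owner

buf = TransitionalBuffer(16, owner=0)
buf.write(0, 0, b"frame")      # core 0 produces
buf.hand_over(0, 1)            # ownership moves to core 1
print(buf.read(1, 0, 5))       # core 1 consumes the same bytes
```

Compared with dedicated memory, this avoids a copy between cores; compared with plain shared memory, it makes clear who may touch the buffer at any moment.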

Notification [1/4]

Direct signaling

Create an event to the other core’s local interrupt controller

The other core polls the local status

Or the local interrupt controller converts this event into a real interrupt

Notification [2/4]

Indirect signaling

Not directly controlled by software

Notification [3/4]

Atomic arbitration

Hardware semaphore/mutex

Semaphore => allows limited multiple access => example: multi-port SRAM/external DDR memory

Mutex => allows one access only

Use a software semaphore instead if the resource is shared only between processes executing on one core

The overhead of a hardware semaphore is not small

It is only a facility for software to use; the hardware only guarantees the atomic operation, and the locked content is not protected

Cost and performance considerations
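The difference between a mutex and a counting semaphore can be shown with a software analogue. The sketch below uses Python's `threading.Semaphore` in place of a hardware semaphore and records how many workers are ever inside the guarded region at once; `run_workers` and its parameters are hypothetical. As with the hardware facility, the protection holds only because every worker chooses to acquire the semaphore first.

```python
import threading

def run_workers(limit: int, workers: int = 8, rounds: int = 200) -> tuple:
    """Return (peak concurrent holders, total completed accesses) for a given limit."""
    gate = threading.Semaphore(limit)   # limit=1 behaves like a mutex
    guard = threading.Lock()            # protects the bookkeeping only
    state = {"inside": 0, "peak": 0, "total": 0}

    def worker():
        for _ in range(rounds):
            with gate:                  # acquire access to the shared resource
                with guard:
                    state["inside"] += 1
                    state["peak"] = max(state["peak"], state["inside"])
                with guard:
                    state["inside"] -= 1
                    state["total"] += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"], state["total"]

print(run_workers(1))   # mutex: never more than one holder
print(run_workers(2))   # semaphore: at most two holders, e.g. a dual-port memory
```

A worker that skips the `with gate:` line would still reach the resource, which mirrors the point that the locked content itself is not protected.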

Notification [4/4]

Left diagram is mutex

Just like the software counterpart

Data transfer engines

DMA => system DMA; local DMA belongs to a core

Ethernet

Up to 32 MAC addresses

RapidIO

Implemented with an ultra-fast serial I/O physical layer

May have multiple serial I/O links, uni- or bi-directional

Examples of serial links

USB 2.0 => 480 Mbit/s

USB 3.0 => 5 Gbit/s

Serial ATA 1.0, Gen 1 => 1.5 Gbit/s

Serial ATA 2.0, Gen 2 => 3 Gbit/s

Serial ATA 3.0, Gen 3 => 6 Gbit/s

High speed serial link

USB

SATA

Memory management

Devices do not support automated cache coherency among cores because of the power consumption involved and the latency overhead introduced

Switched central resource fabric

Highlights [1/3]

A portion of the cache can be configured as memory-mapped SRAM

Transparent cache => visible

Address aliasing => masking the most significant byte

For core 0: 0x10800000 == 0x00800000

For core 1: 0x11800000 == 0x00800000

For core 2: 0x12800000 == 0x00800000

Special register DNUM for dynamic pointer address update => write common ROM code that can still access each core’s private area

Each core has its own DNUM

Implicit

Explicit
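The three alias examples suggest a simple rule: the global address carries the core number in its most significant byte, and the local alias is what remains after masking that byte. The sketch below encodes that rule as inferred from the examples on this slide; the constant and function names are assumptions, and `dnum` is passed in as a parameter where real code would read the DNUM register.

```python
LOCAL_MASK = 0x00FFFFFF    # keeps the lower three bytes (the core-local alias)
GLOBAL_BASE = 0x10000000   # inferred from the 0x1n800000 pattern in the examples

def to_global(local_addr: int, dnum: int) -> int:
    """Build the explicit (global) address of a core's private memory."""
    return GLOBAL_BASE | (dnum << 24) | (local_addr & LOCAL_MASK)

def to_local(global_addr: int) -> int:
    """Mask the most significant byte to get the implicit (aliased) address."""
    return global_addr & LOCAL_MASK

for core in range(3):
    addr = to_global(0x00800000, core)
    print(f"core {core}: {addr:#010x} == {to_local(addr):#010x}")
```

Common code can use the local alias when it only needs its own core's memory, and compute the global form from DNUM when it has to hand an address to another core or to a DMA engine.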

Highlight [2/3]

The only coherency guaranteed by hardware

L1D <=> L2 (core-locally)

L1D <=> L2 <=> SL2 (if configured as memory-mapped SRAM) (core-locally)

Equal access to the external DDR2 SDRAM through the SCR

This may be the bottleneck for certain applications

Highlight [3/3]

If any portion of the L1s is configured as memory-mapped SRAM, there is a small paging engine built into the core (IDMA) that can be used to transfer linear blocks of memory between L1 and L2 in the background of CPU operation

Paging engine => MMU

IDMA may also be used to perform bulk peripheral configuration register access

DSP code and data image

Image types

Single image

Multiple images

Multiple images with shared code and data

A complex linking scheme should be used

Device boot

Debugging

Cuda => XXOO↑↑↓↓←→←→BA

TI’s offer

Hardware emulation => ICE, JTAG

Basically, not intrusive

Software instrumentation

Patching the original code to enable the same ability => this time, “Trace Logs”

Basically, intrusive

Types of Trace Logs

API call log, statistics log, DMA transaction log, event log, customer data log
