
Manycores – From hardware prospective to software

Feb 23, 2016



Transcript
Page 1: Manycores  –  From hardware prospective to software

MANYCORES – FROM HARDWARE PROSPECTIVE TO SOFTWARE

Presenter: D96943001, Graduate Institute of Electronics Engineering, 陳泓輝

Page 2: Manycores  –  From hardware prospective to software

Why Moore’s Law is dying

He is not CEO anymore!!
Walls => ILP, frequency, power, and memory walls

Page 3: Manycores  –  From hardware prospective to software

ILP – more cost less return

ILP: instruction-level parallelism
OOO: out-of-order execution of micro-operations

Page 4: Manycores  –  From hardware prospective to software

Frequency wall

FO4 delay metric: the delay of an inverter that is driven by an inverter ¼ its size and that drives another inverter 4x its size (fan-out of 4)

Saturated!

Frequency ↑ => cycle counts of some operations ↑

Page 5: Manycores  –  From hardware prospective to software

Memory wall

External access penalty is increasing (the gap)
Solution => enlarge the cache

Cache decides both the performance and the price

Page 6: Manycores  –  From hardware prospective to software

It’s cache that matters!

Page 7: Manycores  –  From hardware prospective to software

The power wall

High power might imply
  Thermal runaway of device behavior
  Larger current => electromigration => reliability issues in the metal interconnect
  Hitting the packaging heat limit
    Change to high-cost packaging
    Cooling noise!!
    Form factor

Page 8: Manycores  –  From hardware prospective to software

The great wall……

CMOS

Moore’s Law

Multicore, Manycore

Page 9: Manycores  –  From hardware prospective to software

Historical - Intel 2007 Xeon

Dual on-chip memory controllers => fcpu > 2*fmem
Point-to-point interconnection => fabrics
  Multiple communication activities (c.f. “bus” => one activity)

Page 10: Manycores  –  From hardware prospective to software

Fabric working notation

Page 11: Manycores  –  From hardware prospective to software

AMD – Opteron(Shanghai)

Much the same as Intel Xeon
Shared L3 cache among 2 cores

Page 12: Manycores  –  From hardware prospective to software

Game consoles

XBox360 => triple core
PS3 => Cell, 8+1 cores

PowerPC wins!

Homogeneous

Heterogeneous

Page 13: Manycores  –  From hardware prospective to software

State-of-the-art multicore DSP chips

TI TNETV3020: homogeneous
Freescale 8156: heterogeneous

Page 14: Manycores  –  From hardware prospective to software

State-of-the-art multicore DSP chips

picoChip PC205: heterogeneous
Tilera TILE64: homogeneous, mesh

Page 15: Manycores  –  From hardware prospective to software

State-of-the-art multicore x86 chips

Intel Single-chip Cloud Computer, 1GHz Pentium
  24 “tiles” with two IA cores per tile
  A 24-router mesh network with 256 GB/s bisection bandwidth
  4 integrated DDR3 memory controllers
  Hardware support for message passing!!

Page 16: Manycores  –  From hardware prospective to software

GPGPU - OpenCL

(Figure: official OpenCL logo)

Page 17: Manycores  –  From hardware prospective to software

Special case: multicore video processor

Characteristics of video applications in consumer electronics
  High computational capability
  Low hardware cost
  Low power consumption

A general solution
  Purpose-designed fixed-function logic

Challenges
  Multiple video decoding standards
  Video decoding standards keep being updated
  Ill-posed video processing algorithms
  Product requirements are diverse and mutually exclusive

Page 18: Manycores  –  From hardware prospective to software

Broadcom: mediaDSP technology
  Heterogeneous (programmable and fixed-function units)
  A task-based programming model
  A uniform approach for managing tasks executing on different types of programmable and fixed-function processing elements
  A platform that is easily extended to support a range of applications
  Easy to customize for special purposes

Success stories
  SD MPEG video encoder, including scaling and noise reduction
  Frame-rate conversion video processing for FHD@60Hz/120Hz videos

Nickname: accelerator

Page 19: Manycores  –  From hardware prospective to software

Classes of video processing

Highly parallelizable operations on fixed-point data, with no floating point
  => A processor with a SIMD data path engine

Ad-hoc computation and decision making, operating on smaller sets of data produced by the parallelizable processes
  => A general-purpose processor such as a RISC

Data movement and formatting on multidimensional pixels

Bit-serial processing for entropy decoding and encoding
  => Dedicated hardware does this job very efficiently

Page 20: Manycores  –  From hardware prospective to software

Task-based programming model

Programmers’ duties are as follows:
  Partition a sequential algorithm into a set of parallelizable tasks and then map it efficiently onto the massively parallel architecture
    A task has a definite initiation time
    A task runs until completion with no interruption and no further synchronization with other tasks
  Understand the hardware architecture and its limitations
    Shared memory (instead of FIFO mode)
    Buffer size must be large enough for a data unit
    Interconnect bandwidth must be sufficient
    Computational power must be sufficient for real time
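The task rules above (definite initiation time, run to completion, no further synchronization) can be sketched as a tiny dependency-driven scheduler. This is only an illustration, not Broadcom's TCU logic: the function `run_tasks` and the task names `decode`, `scale`, `denoise`, and `encode` are hypothetical.

```python
from collections import deque

def run_tasks(tasks, deps):
    """Start each task only when its inputs are ready, then run it to completion.

    tasks: dict mapping task name -> zero-argument callable (hypothetical work).
    deps:  dict mapping task name -> set of task names it must wait for.
    Returns the order in which the tasks ran.
    """
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    ready = deque(sorted(n for n, d in remaining.items() if not d))
    order = []
    while ready:
        name = ready.popleft()
        tasks[name]()                 # runs until completion, no interruption
        order.append(name)
        for other in sorted(remaining):
            waiting_on = remaining[other]
            if name in waiting_on:
                waiting_on.remove(name)
                if not waiting_on:    # last input arrived: task may now start
                    ready.append(other)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle or unknown dependency")
    return order

# Hypothetical video pipeline partitioned into four tasks.
log = []
tasks = {n: (lambda n=n: log.append(n)) for n in ("decode", "scale", "denoise", "encode")}
deps = {"scale": {"decode"}, "denoise": {"decode"}, "encode": {"scale", "denoise"}}
order = run_tasks(tasks, deps)
```

Here `scale` and `denoise` both become ready as soon as `decode` finishes, so on real hardware they could be dispatched to different processing elements at the same time; the sketch runs them one after another only because it is single-threaded.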

Page 21: Manycores  –  From hardware prospective to software

(IP) Platform-based architecture

Task-oriented engine (TOE)
  A programmable DSP or a fixed-function unit

Task control unit (TCU)
  A RISC that maintains a queue of tasks and synchronizes with other TCUs/TOEs
  Goal: maximize the utilization of the TOEs

Control engine
Shared memory
Communication fabric

Page 22: Manycores  –  From hardware prospective to software

Memory architecture

Memory hierarchy
  L1 - Processor instruction and data memory
  L2 - On-chip shared memory
  L3 - Off-chip

All TOEs use software-managed DMA rather than caches for their local storage
  6D addressing (x, y, t, Y, U, V) and the chunking of blocks into smaller subblocks
  No {pre-fetching, early load scheduling, cache, speculative execution, multithreading …}
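Chunking a block into smaller subblocks for software-managed DMA can be sketched as follows. This covers only the two spatial dimensions (x, y); the 64x64 tile size and the function name `subblocks` are assumptions for illustration, not figures from the mediaDSP design.

```python
def subblocks(width, height, tile_w, tile_h):
    """Yield (x, y, w, h) sub-blocks that exactly cover a width x height block.

    Each tuple describes one hypothetical DMA transfer small enough to fit
    in a processing element's local memory.
    """
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            # Edge tiles are clipped so that no transfer reads past the block.
            yield x, y, min(tile_w, width - x), min(tile_h, height - y)

# One FHD luma plane split into 64x64 tiles (tile size is an assumption).
tiles = list(subblocks(1920, 1080, 64, 64))
```

Because 1080 is not a multiple of 64, the last row of tiles is only 56 lines high; handling such edge cases is part of what the programmer takes on when caches are replaced by explicit DMA.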

Page 23: Manycores  –  From hardware prospective to software

Broadcom BCM35421 chip [1/2]

Performs motion-compensated frame-rate conversion
  Doubles the frame rate from FHD@60fps to FHD@120fps (to conquer motion blur)
  24fps => 60fps (de-judder)
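The timing side of frame-rate conversion can be sketched as below: for each output frame, which pair of source frames it falls between and at what fractional phase. This shows only the arithmetic; the motion-compensated interpolation itself is not shown, and `interpolation_phases` is a hypothetical helper, not the BCM35421 algorithm.

```python
from fractions import Fraction

def interpolation_phases(src_fps, dst_fps, n_out):
    """For each output frame, return (index of the earlier source frame, phase).

    phase is the fractional position between source frame i and i + 1;
    phase 0 means the output frame coincides with an original frame.
    """
    step = Fraction(src_fps, dst_fps)   # source frames advanced per output frame
    result = []
    for k in range(n_out):
        t = k * step                    # output time measured in source frames
        i = t.numerator // t.denominator
        result.append((i, t - i))
    return result

phases_24_to_60 = interpolation_phases(24, 60, 5)
phases_60_to_120 = interpolation_phases(60, 120, 4)
```

For 24fps => 60fps the phases repeat every five output frames (0, 2/5, 4/5, 1/5, 3/5), so four of every five output frames have to be synthesized; for 60fps => 120fps every second output frame sits exactly halfway between two originals.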

Page 24: Manycores  –  From hardware prospective to software

Broadcom BCM35421 chip [2/2]

65nm CMOS process
mediaDSP runs at 400 MHz
106 million transistors
Two teraops of peak integer performance

Page 25: Manycores  –  From hardware prospective to software

Performance of DSPs for applications

A DSP becomes useful when it can perform a minimum of 100 instructions per sample period

68% of DSPs shipped in 2008 were for mobile handsets and base stations

Several thousand cycles are available for processing an input sample
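The cycle budget per sample is simply the core clock divided by the sample rate. A minimal sketch, where the 1 GHz clock and the 48 kHz and 8 kHz sample rates are assumed example values rather than figures from the slides:

```python
def cycles_per_sample(clock_hz, sample_rate_hz):
    """Whole clock cycles available between two consecutive input samples."""
    return clock_hz // sample_rate_hz

def min_clock_hz(instructions_per_sample, sample_rate_hz):
    """Lowest clock that meets a per-sample budget, assuming one instruction per cycle."""
    return instructions_per_sample * sample_rate_hz

audio_budget = cycles_per_sample(1_000_000_000, 48_000)   # 1 GHz core, 48 kHz audio
voice_budget = cycles_per_sample(1_000_000_000, 8_000)    # 1 GHz core, 8 kHz voice
useful_clock = min_clock_hz(100, 48_000)                  # the 100-instruction threshold
```

Under these assumptions a 1 GHz core has roughly twenty thousand cycles per 48 kHz audio sample, while the 100-instructions-per-sample threshold is already met at 4.8 MHz.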

Page 26: Manycores  –  From hardware prospective to software

Multiple elements

Increase in performance: multiple elements > single higher-performance elements

Page 27: Manycores  –  From hardware prospective to software

Go deeper: TI’s multicore

Multicore Programming Guide

Page 28: Manycores  –  From hardware prospective to software

Mapping application to multicore

Know the processing model options
Identify all the tasks
  Partition tasks into many small ones
  Be familiar with inter-task communication/data flow
  Combination/aggregation
  Mapping

Memory hierarchy => L1/L2/L3, private/shared memory, number and capability of external memory channels

DMA
Special-purpose hardware!!
  FFT, Viterbi, Reed-Solomon, AES codec, entropy codec

Page 29: Manycores  –  From hardware prospective to software

Parallel processing model

Very successful in communication systems
  Router
  Base station

Master/slave model
Data flow model

Page 30: Manycores  –  From hardware prospective to software

Data movement

Shared memory
Dedicated memory

Transitional memory => ownership changes; content is not copied

Page 31: Manycores  –  From hardware prospective to software

Notification [1/4]

Direct signaling
  Create an event in the other core’s local interrupt controller
  The other core polls its local status
  Or the local interrupt controller converts this event into a real interrupt

Page 32: Manycores  –  From hardware prospective to software

Notification [2/4]

Indirect signaling
  Not directly controlled by software

Page 33: Manycores  –  From hardware prospective to software

Notification [3/4]

Atomic arbitration
  Hardware semaphore/mutex
    Semaphore => allows limited multiple access => example: multi-port SRAM/external DDR memory
    Mutex => allows one access only
  Use a software semaphore instead if the resource is shared only between processes executing on one core
    The overhead of a hardware semaphore is not small
  It is only a facility for software use: hardware guarantees only the atomic operation; the locked content is not protected
    Cost and performance considerations
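The difference between a semaphore (limited multiple access) and a mutex (one access only) can be seen with their software counterparts. This sketch uses Python threads as stand-ins for cores; it does not model TI's hardware semaphore peripheral, and `peak_concurrency` is a hypothetical helper.

```python
import threading
import time

def peak_concurrency(guard, workers=6):
    """Return the highest number of workers inside the guarded region at once."""
    count = 0
    peak = 0
    meter = threading.Lock()          # protects the two counters themselves

    def worker():
        nonlocal count, peak
        with guard:                   # acquire the mutex or a semaphore slot
            with meter:
                count += 1
                peak = max(peak, count)
            time.sleep(0.01)          # stand-in for using the shared resource
            with meter:
                count -= 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak

mutex_peak = peak_concurrency(threading.Lock())              # one access only
semaphore_peak = peak_concurrency(threading.Semaphore(2))    # e.g. a dual-port memory
```

As on the slide, the guard is only a convention: nothing stops a thread that skips `with guard` from touching the resource, which is the software analogue of "locked content is not protected".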

Page 34: Manycores  –  From hardware prospective to software

Notification [4/4]

The left diagram is a mutex

Just like the software counterpart

Page 35: Manycores  –  From hardware prospective to software

Data transfer engines

DMA => system DMA; local DMA belongs to a core
Ethernet
  Up to 32 MAC addresses
RapidIO
  Implemented with an ultra-fast serial I/O physical layer
  May have multiple serial I/O links, uni/bi-directional
  Examples of serial links
    USB 2.0 => 480 Mbit/s; USB 3.0 => 5 Gbit/s
    Serial ATA 1.0, Gen 1 => 1.5 Gbit/s; 2.0, Gen 2 => 3 Gbit/s; 3.0, Gen 3 => 6 Gbit/s

Page 36: Manycores  –  From hardware prospective to software

High speed serial link

USB SATA

Page 37: Manycores  –  From hardware prospective to software

Memory management

Devices do not support automated cache coherency among cores because of the power consumption involved and the latency overhead introduced

Switched central resource fabric

Page 38: Manycores  –  From hardware prospective to software

Highlights [1/3]

A portion of the cache can be configured as memory-mapped SRAM
  Transparent cache => visible

Address aliasing => masking the most significant byte
  For core 0: 0x10800000 == 0x00800000
  For core 1: 0x11800000 == 0x00800000
  For core 2: 0x12800000 == 0x00800000

Special register DNUM for dynamic pointer address update
  => Write common ROM code that can still access each core’s private area
  Each core has its own DNUM

Implicit

Explicit
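The aliasing rule can be written out as a pair of address conversions. The base 0x10000000 and the "DNUM goes into bits 24 and up" layout are inferred from the three examples on this slide only; the function names and the 16-core limit are assumptions, so check the device data manual before relying on them.

```python
GLOBAL_BASE = 0x10000000   # inferred from the slide's examples, not from a datasheet

def to_global(local_addr, dnum):
    """Map a core-local alias to the global address of core `dnum`."""
    if not 0 <= local_addr < 0x01000000:
        raise ValueError("local alias must have a zero most significant byte")
    if not 0 <= dnum < 16:            # assumed limit so the result stays in 0x1X......
        raise ValueError("core number out of range")
    return GLOBAL_BASE | (dnum << 24) | local_addr

def to_local(global_addr):
    """Mask the most significant byte to recover the core-local alias."""
    return global_addr & 0x00FFFFFF

core_addresses = [to_global(0x00800000, dnum) for dnum in range(3)]
```

Common code can use the local alias (implicit addressing) when a core touches its own memory, and compute the global form from DNUM (explicit addressing) when the address must be handed to another core or to a DMA engine.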

Page 39: Manycores  –  From hardware prospective to software

Highlight [2/3]

The only coherency guaranteed by hardware
  L1D <=> L2 (core-locally)
  L1D <=> L2 <=> SL2 (if configured as memory-mapped SRAM) (core-locally)

Equal access to the external DDR2 SDRAM through the SCR
  => This may be the bottleneck for certain applications

(Diagram labels: L1D, L1P)

Page 40: Manycores  –  From hardware prospective to software

Highlight [3/3]

If any portion of the L1s is configured as memory-mapped SRAM, a small paging engine built into the core (IDMA) can be used to transfer linear blocks of memory between L1 and L2 in the background of CPU operation
  Paging engine => MMU

IDMA may also be used for bulk access to peripheral configuration registers

Page 41: Manycores  –  From hardware prospective to software

DSP code and data image

Image types
  Single image
  Multiple images
  Multiple images with shared code and data
    A complex linking scheme should be used

Device boot


Page 42: Manycores  –  From hardware prospective to software

Debugging

Cuda => XXOO↑↑↓↓←→←→BA

Page 43: Manycores  –  From hardware prospective to software

TI’s offer

Hardware emulation => ICE, JTAG
  Basically non-intrusive

Software instrumentation
  Patch the original code to provide the same capability => this time, “Trace Logs”
  Basically intrusive

Types of Trace Logs
  API call log, statistics log, DMA transaction log, event log, customer data log

Page 44: Manycores  –  From hardware prospective to software

More on logs

Information is stored in memory and pulled back to the host through the hardware emulation path

A tool is provided to correlate all the logs
  and display them in an organized manner

Log example:

Page 45: Manycores  –  From hardware prospective to software

Go deeper: Freescale’s manycore

Embedded Multicore: An Introduction

Page 46: Manycores  –  From hardware prospective to software

Why manycore?

Freescale MPC8641
  Single core => freq x 1.5 => power x 2
  Dual core => freq x 1.5 => power x 1.3

Bug in this Fig.

Page 47: Manycores  –  From hardware prospective to software

Memory system types

Page 48: Manycores  –  From hardware prospective to software

SMP + AMP + Sharing

Manycore allows multiple OSes to run concurrently
  Memory sharing => MMU
  Interface/peripheral sharing => hypervisor
Virtualization is good for legacy support

Page 49: Manycores  –  From hardware prospective to software

Review of single core

Page 50: Manycores  –  From hardware prospective to software

Manycore example [1/2]

(Figure: manycore block diagram with numbered callouts 1 to 4)

Page 51: Manycores  –  From hardware prospective to software

Manycore example [2/2]

(Figure: manycore block diagram with numbered callouts 1 to 4)

Page 52: Manycores  –  From hardware prospective to software

Highlights [1/2]

CoreNet fabric supports cache coherency across all cache layers
CoreNet fabric also supports software semaphores by extending the bit-test to guarantee atomic access between cores
CLASS is better suited for DSPs, as they tend to use less complex operating systems and the application software is more in control
  Silicon area of the fabric is reduced

If a core is configured as a software accelerator
  Some space of L1 and L2 can be converted to memory-mapped SRAM on a per-way basis

Page 53: Manycores  –  From hardware prospective to software

Highlights [2/2]

While the MMU protects memory accesses, DMA could ruin everything => solution: “PAMU”
  The PAMU sits at the connection between non-core masters and the CoreNet fabric; it is configured to map memory and to limit access windows, thereby increasing system stability

Cache stashing
  DMA between cache and external memory

Page 54: Manycores  –  From hardware prospective to software

Deal with 10Gb Ethernet

Parsing network traffic up to layer 4

Assign packets to designated cores
  TCP port 80/22 to core 0
  ARP to core 1
  UDP to cores 4~7

Queue Mgr and Buffer Mgr simplify driver codes
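The steering rules on this slide can be sketched as a classification function. On the real chip this is done by hardware parsers and queue managers; the dictionary layout of a packet, the XOR-of-ports flow hash, and the default core for unmatched traffic are all assumptions made for illustration.

```python
def assign_core(pkt, default_core=2):
    """Pick a core for a parsed packet, following the slide's example rules.

    pkt is a hypothetical dict, e.g.
    {"l2": "ip", "l4": "tcp", "src_port": 40000, "dst_port": 80}.
    """
    if pkt.get("l2") == "arp":
        return 1                                   # ARP to core 1
    if pkt.get("l4") == "tcp" and pkt.get("dst_port") in (80, 22):
        return 0                                   # HTTP/SSH to core 0
    if pkt.get("l4") == "udp":
        # Spread UDP over cores 4~7; hashing the ports keeps a flow on one core.
        flow = pkt.get("src_port", 0) ^ pkt.get("dst_port", 0)
        return 4 + flow % 4
    return default_core                            # assumed fallback, not on the slide

http_core = assign_core({"l2": "ip", "l4": "tcp", "src_port": 40000, "dst_port": 80})
arp_core = assign_core({"l2": "arp"})
dns_core = assign_core({"l2": "ip", "l4": "udp", "src_port": 5000, "dst_port": 53})
other_core = assign_core({"l2": "ip", "l4": "tcp", "src_port": 40000, "dst_port": 443})
```

Keeping every packet of a flow on the same core avoids cross-core locking on per-connection state, which is the main reason to classify up to layer 4 before the packet reaches software.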

Page 55: Manycores  –  From hardware prospective to software

Debugging

JTAG

High-speed logging link
  Platinum version of “printk”

Page 56: Manycores  –  From hardware prospective to software

Why is JTAG slow?

Involving serial bit shifting
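Serial bit shifting can be sketched as a scan chain that moves one bit per clock. This is a simplified model (no TAP state machine, no protocol overhead); the 32-bit chain, the 10 MHz TCK, and the function names are assumptions for illustration.

```python
def shift_word(chain, value, width=32):
    """Shift `value` into a scan chain one bit per clock, LSB first.

    chain: list of bits; index 0 is the cell nearest TDI, the last cell drives TDO.
    Returns (new chain contents, bits shifted out on TDO, clocks used).
    """
    chain = list(chain)
    tdo = []
    for i in range(width):
        tdo.append(chain.pop())              # last cell falls out on TDO
        chain.insert(0, (value >> i) & 1)    # next bit enters from TDI
    return chain, tdo, width

def transfer_seconds(n_bytes, tck_hz):
    """Lower bound on transfer time: one data bit per TCK cycle, no overhead."""
    return n_bytes * 8 / tck_hz

chain, tdo, clocks = shift_word([0] * 32, 0xDEADBEEF)
recovered = sum(bit << (31 - pos) for pos, bit in enumerate(chain))
one_mib_time = transfer_seconds(1 << 20, 10_000_000)
```

Moving a single 32-bit word costs 32 clocks, and even at an assumed 10 MHz TCK one mebibyte needs more than 0.8 seconds before any protocol overhead is counted, which is why a separate high-speed logging link is attractive.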

Page 57: Manycores  –  From hardware prospective to software

Conclusion

Hardware guys are crazy and unfriendly
  They are all “Postscript” geeks

However, they give the world a chance to process real-time sample data with thousands of cycles per sample => more things can be done in software

Besides data structures and algorithms, a good engineer now needs to know more

The more you know the hardware, the more it dances with you, or at least……

Page 58: Manycores  –  From hardware prospective to software

Talk as common sense

Page 59: Manycores  –  From hardware prospective to software

Reference[1/2]

Sudhakar Yalamanchili, Georgia Institute of Technology, “Multicore Computing - Evolution.”

(Broadcom) “Broadcom mediaDSP: A platform for building programmable multicore video processors,” IEEE Micro, Mar./Apr. 2009.

(Texas Instruments, TI) Lina J. Karam, Ismail Alkamal, Alan Gatherer, Gene A. Frantz, David V. Anderson, and Brian L. Evans, “Trends in Multicore DSP Platforms,” IEEE Signal Processing Magazine, Nov. 2009. (University of Texas and Texas Instruments)

Page 60: Manycores  –  From hardware prospective to software

Reference[2/2]

http://en.wikipedia.org
(Tilera) Tile64 processor products. http://www.tilera.com
  Originates from MIT
(Intel) Single-chip Cloud Computer
TI, “Multicore Programming Guide”
Freescale, “Embedded Multicore: An Introduction”
Presentation slides from lab member