Manycores – From hardware perspective to software
Post on 23-Feb-2016
Presenter: D96943001, Graduate Institute of Electronics Engineering (電子所), 陳泓輝
Why Moore’s Law is dying
He is not CEO anymore!!
Walls => ILP, frequency, power, and memory walls
ILP – more cost, less return
ILP: instruction-level parallelism
OOO: out-of-order execution of micro-operations
Frequency wall
FO4 delay metric: the delay of an inverter that is driven by an inverter ¼ its size and drives another inverter 4x its size (a fan-out of 4)
Saturated!
Freq ↑ => Some OP cycle counts ↑
Memory wall
External access penalty is increasing (the gap)
Solution => enlarge the cache
Cache decides the performance and the price
It’s cache that matters!
The power wall
High power might imply:
Thermal runaway of device behavior
Larger current => electromigration => reliability issues in the metal interconnect
Hitting the packaging heat limit
Change to high-cost packaging
Cooling noise!!
Form factor
The great wall……
CMOS
Moore’s Law
Multicore => Manycore
Historical - Intel 2007 Xeon
Dual on-chip memory controllers => fcpu > 2*fmem
Point-to-point interconnection => fabrics
Multiple communication activities at once (cf. “bus” => one activity)
Fabric working notation
AMD – Opteron(Shanghai)
Much the same as Intel Xeon
Shared L3 cache among 2 cores
Game consoles
Xbox 360 => triple core (homogeneous)
PS3 => Cell, 8+1 cores (heterogeneous)
PowerPC wins!
State-of-the-art multicore DSP chips
TI TNETV3020: homogeneous
Freescale 8156: heterogeneous
State-of-the-art multicore DSP chips
picoChip PC205: heterogeneous
Tilera TILE64: homogeneous, mesh
State-of-the-art multicore x86 chips
Intel Single-chip Cloud Computer
24 “tiles” with two IA cores per tile
A 24-router mesh network with 256 GB/s bisection bandwidth
4 integrated DDR3 memory controllers
Hardware support for message passing!!
1GHz Pentium
GPGPU - OpenCL
Special case: multicore video processor
Characteristics of video applications in consumer electronics: high computational capability, low hardware cost, low power consumption
A general solution: fixed-function logic design
Challenges: multiple video decoding standards; video decoding standards keep being updated; ill-posed video processing algorithms; product requirements are diverse and mutually exclusive
mediaDSP technology
Broadcom: mediaDSP technology
Heterogeneous (programmable and fixed-function units)
A task-based programming model
A uniform approach for managing tasks executing on different types of programmable and fixed-function processing elements
A platform that is easily extended to support a range of applications and easily customized for special purposes
Success stories: SD MPEG video encoder including scaling and noise reduction; frame-rate conversion video processing for FHD@60Hz/120Hz video
Nickname: accelerator
Classes of video processing
Highly parallelizable operations on fixed-point data, with no floating point => a processor with a SIMD data-path engine
Ad-hoc computation and decision making, operating on smaller sets of data produced by the parallelizable processes => a general-purpose processor such as a RISC
Data movement and formatting of multidimensional pixel data
Bit-serial processing for entropy decoding and encoding => dedicated hardware does this job very efficiently
Task-based programming model
Programmers’ duties are as follows:
Partition a sequential algorithm into a set of parallelizable tasks and then map it efficiently to the massively parallel architecture
A task has a definite initiation time
A task runs until completion with no interruption and no further synchronization with other tasks
Understand the hardware architecture and its limitations
Shared memory (instead of FIFO mode)
Buffer size must be enough for a data unit
Interconnect bandwidth must be enough
Computational power must be enough for real time
(IP) Platform-based architecture
Task-oriented engine (TOE): a programmable DSP or a fixed-function unit
Task control unit (TCU): a RISC that maintains a queue of tasks and synchronizes with other TCUs/TOEs, to maximize the utilization of the TOEs
Control engine
Shared memory
Communication fabric
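The run-to-completion task model and the TCU's task queue can be illustrated with a small sketch. It is only a software analogy under stated assumptions: the function `run_tasks`, the dependency table, and the task names are hypothetical and are not taken from the mediaDSP platform. A task is started only when all of its inputs are complete, and once started it runs to the end without further synchronization.

```python
from collections import deque


def run_tasks(tasks, deps):
    """Run every task once, in an order that respects its dependencies.

    tasks: dict mapping a task name to a zero-argument callable.
    deps:  dict mapping a task name to the names it must wait for.
    Both names and structure are illustrative assumptions.
    Returns the list of task names in the order they were executed.
    """
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    # The ready queue plays the role of the TCU's task queue.
    ready = deque(sorted(n for n, d in remaining.items() if not d))
    order = []
    while ready:
        name = ready.popleft()
        tasks[name]()          # runs to completion, never interrupted
        order.append(name)
        del remaining[name]
        # Release tasks whose last outstanding dependency just finished.
        for other in sorted(remaining):
            waiting = remaining[other]
            if name in waiting:
                waiting.discard(name)
                if not waiting:
                    ready.append(other)
    if remaining:
        raise ValueError("dependency cycle among: %s" % sorted(remaining))
    return order
```

For example, with `deps = {"scale": {"fetch"}, "denoise": {"fetch"}, "encode": {"scale", "denoise"}}`, the `encode` task is dispatched only after both `scale` and `denoise` have finished.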
Memory architecture
Memory hierarchy:
L1 - Processor Instruction and Data Memory
L2 - On-chip Shared Memory
L3 - Off-chip
All TOEs use software-managed DMA rather than caches for their local storage
6D addressing (x,y,t,Y,U,V) and the chunking of blocks into smaller subblocks
No {pre-fetching, early load scheduling, cache, speculative execution, multithreading …}
Broadcom BCM35421 chip [1/2]
Does motion-compensated frame-rate conversion
Doubles the frame rate from FHD@60fps to FHD@120fps (to conquer motion blur)
24fps => 60fps (de-judder)
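The two conversions above differ in where each output frame falls between source frames. The sketch below computes, for each output frame, the preceding source frame and the fractional position (phase) at which a new frame would have to be interpolated. The function name `interpolation_phases` is hypothetical, and this covers only the timing arithmetic, not the motion-compensated interpolation itself.

```python
from fractions import Fraction


def interpolation_phases(src_fps, dst_fps, n_out):
    """Return (source_frame_index, phase) for the first n_out output frames.

    phase is the fractional distance from source_frame_index toward the
    next source frame; a phase of 0 means the source frame is reused as is.
    """
    step = Fraction(src_fps, dst_fps)   # source frames advanced per output frame
    result = []
    for k in range(n_out):
        position = k * step
        base = position.numerator // position.denominator
        result.append((base, position - base))
    return result
```

For 60 fps to 120 fps every second output frame sits exactly halfway between two source frames, while 24 fps to 60 fps cycles through the phases 0, 2/5, 4/5, 1/5, 3/5.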
Broadcom BCM35421 chip [2/2]
65nm CMOS process
mediaDSP runs at 400 MHz
106 million transistors
Two teraops of peak integer performance
Performance of DSPs for applications
A DSP becomes useful when it can perform a minimum of 100 instructions per sample period
68% of DSPs shipped in 2008 went into mobile handsets and base stations
Several thousand cycles are available for processing an input sample
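The per-sample budget follows directly from the clock rate divided by the sample rate. The sketch below works this out; the 400 MHz clock is the mediaDSP figure quoted earlier in the deck, while the sample rates, the instructions-per-cycle value, and the function names are illustrative assumptions.

```python
def cycles_per_sample(clock_hz, sample_rate_hz):
    """Clock cycles available between two consecutive input samples."""
    return clock_hz / sample_rate_hz


def meets_dsp_threshold(clock_hz, sample_rate_hz,
                        instr_per_cycle=1.0, min_instr=100):
    """True if at least min_instr instructions fit into one sample period.

    instr_per_cycle is an assumed average; min_instr=100 is the rule of
    thumb stated on the slide.
    """
    budget = cycles_per_sample(clock_hz, sample_rate_hz) * instr_per_cycle
    return budget >= min_instr
```

At 400 MHz, an assumed 48 kHz sample stream leaves roughly 8,300 cycles per sample, whereas an assumed 10 MHz stream leaves only 40 and falls below the 100-instruction threshold.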
Multiple elements
Increase in performance: multiple elements > a single higher-performance element
Go deeper – TI’s multicore
Multicore Programming Guide
Mapping an application to multicore
Know the processing model options
Identify all the tasks
Partition tasks into many small ones
Be familiar with inter-task communication/data flow
Combination/aggregation
Mapping
Memory hierarchy => L1/L2/L3, private/shared memory, number/capability of external memory channels
DMA
Special-purpose hardware!!
FFT, Viterbi, Reed-Solomon, AES codec, entropy codec
Parallel processing model
Very successful in communication systems: routers, base stations
Master/Slave model
Data flow model
Data movement
Shared memory
Dedicated memory
Transitional memory => ownership changes; content is not copied
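The transitional-memory idea, where a buffer changes owner instead of being copied, can be sketched as below. The class `TransitionalBuffer` and its methods are hypothetical and only model the ownership bookkeeping. On a real device without hardware coherency, the sending core would presumably also have to write back its cached data before the hand-off, which is not modelled here.

```python
class TransitionalBuffer:
    """A buffer that exactly one core owns at any time (illustrative only)."""

    def __init__(self, size, owner):
        self._data = bytearray(size)
        self.owner = owner

    def view(self, core):
        """Return a zero-copy view of the data; only the owner may ask."""
        if core != self.owner:
            raise PermissionError("core %d does not own this buffer" % core)
        return memoryview(self._data)

    def hand_off(self, from_core, to_core):
        """Transfer ownership; the payload itself is never copied."""
        if from_core != self.owner:
            raise PermissionError("only the owner can hand off the buffer")
        self.owner = to_core
```

After `hand_off(0, 1)`, core 1 sees exactly the bytes core 0 wrote, and a further `view(0)` is rejected.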
Notification [1/4]
Direct signaling
Create an event to the other core’s local interrupt controller
The other core polls its local status
Or the local interrupt controller converts this event into a real interrupt
Notification [2/4]
Indirect signaling
Not directly controlled by software
Notification [3/4]
Atomic arbitration
Hardware semaphore/mutex
Semaphore => allows a limited number of simultaneous accesses => example: multi-port SRAM/external DDR memory
Mutex => allows one access only
Use a software semaphore instead if the resource is only shared between processes executing on one core
The overhead of a hardware semaphore is not small
It is only a facility for software to use: hardware only guarantees the atomic operation; the locked content is not protected
Cost and performance considerations
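The difference between a semaphore (a limited number of holders) and a mutex (one holder) can be shown with the software counterparts in Python's standard `threading` module. This is an analogy only and not TI's hardware semaphore interface; the helper `count_grants` is a hypothetical name.

```python
import threading


def count_grants(primitive, attempts):
    """Try to acquire `primitive` `attempts` times without blocking.

    Returns how many attempts were granted, then releases them again.
    As with the hardware facility, the lock is advisory: nothing stops
    code that skips the acquire from touching the shared resource.
    """
    granted = 0
    for _ in range(attempts):
        if primitive.acquire(blocking=False):
            granted += 1
    for _ in range(granted):
        primitive.release()
    return granted
```

A `threading.Semaphore(2)` grants two of three attempts, which matches a dual-port resource, while a `threading.Lock()` grants only one.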
Notification [4/4]
Left diagram is mutex
Just like the software counterpart
Data transfer engines
DMA => system DMA; local DMA belongs to a core
Ethernet: up to 32 MAC addresses
RapidIO: implemented with an ultra-fast serial I/O physical layer; may have multiple serial I/O links, uni- or bi-directional
Examples of serial links:
USB 2.0 => 480 Mbit/s; USB 3.0 => 5 Gbit/s
Serial ATA 1.0 (Gen 1) => 1.5 Gbit/s; 2.0 (Gen 2) => 3 Gbit/s; 3.0 (Gen 3) => 6 Gbit/s
High speed serial link
USB SATA
Memory management
Devices do not support automated cache coherency among cores because of the power consumption involved and the latency overhead introduced
Switched central resource fabric
Highlights [1/3]
A portion of the cache can be configured as memory-mapped SRAM
Transparent cache => visible
Address aliasing => masking the most significant byte
For core 0: 0x10800000 == 0x00800000
For core 1: 0x11800000 == 0x00800000
For core 2: 0x12800000 == 0x00800000
Special register DNUM for dynamic pointer address update => write common ROM code that can still access each core’s private area
Each core has its own DNUM
Implicit
Explicit
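The aliasing rule in the three address examples above can be written as simple bit arithmetic. This is a minimal sketch: the constant `GLOBAL_BASE`, the 24-bit local window, and the function names are assumptions inferred from those three examples, not from TI documentation.

```python
GLOBAL_BASE = 0x10000000      # assumed from the 0x1n800000 pattern on the slide
LOCAL_MASK = 0x00FFFFFF       # the low three bytes form the core-local address


def local_to_global(local_addr, dnum):
    """Build the global address of a core-local location for core `dnum`."""
    if not 0 <= local_addr <= LOCAL_MASK:
        raise ValueError("not a core-local address")
    return GLOBAL_BASE | (dnum << 24) | local_addr


def global_to_local(global_addr):
    """Mask off the most significant byte to recover the local alias."""
    return global_addr & LOCAL_MASK
```

Shared code can read its own core number from DNUM at run time and call something like `local_to_global(0x00800000, dnum)`, so the same image resolves to a different private area on each core.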
Highlight [2/3]
The only coherency guaranteed by hardware:
L1D and L2 (core-locally)
L1D, L2 and SL2 (if configured as memory-mapped SRAM) (core-locally)
Equal access to the external DDR2 SDRAM through the SCR
This may be the bottleneck for certain applications
Highlight [3/3]
If any portion of the L1s is configured as memory-mapped SRAM, a small paging engine built into the core (IDMA) can be used to transfer linear blocks of memory between L1 and L2 in the background of CPU operation
Paging engine => MMU
IDMA may also be used to perform bulk peripheral configuration register access
DSP code and data image
Image types: single image; multiple images; multiple images with shared code and data
A complex linking scheme should be used
Device boot
Debugging
Cuda => XXOO↑↑↓↓←→←→BA
TI’s offer
Hardware emulation => ICE, JTAG
Basically, not intrusive
Software instrumentation
Patching the original code to enable the same ability => this time, “Trace Logs”
Basically, intrusive
Types of Trace Logs: API call log, statistics log, DMA transaction log, event log, customer data log
More on logs
Information is stored in memory and pulled back to the host through the hardware emulation path
A tool is provided to correlate all the logs
and display them in an organized manner
Log example:
Go deeper – Freescale’s manycore
Embedded Multicore: An Introduction
Why manycore?
Freescale MPC8641
Single core => freq x 1.5 => power x 2
Dual core => freq x 1.5 => power x 1.3
Bug in this Fig.
Memory system types
SMP + AMP + Sharing
Manycore enables multiple OSes to run concurrently
Memory sharing => MMU
Interface/peripheral sharing => hypervisor
Virtualization is good for legacy support
Review of single core
Manycore example [1/2]
Manycore example [2/2]
Highlights [1/2]
The CoreNet fabric supports cache coherency across all cache layers
The CoreNet fabric also supports software semaphores by extending the bit-test to guarantee atomic access between cores
CLASS is better suited to DSPs, as they tend to use less complex operating systems and the application software is more in control
Silicon area of the fabric is reduced
If a core is configured as a software accelerator, some space of L1 and L2 can be converted to memory-mapped SRAM on a per-way basis
Highlights [2/2]
While the MMU protects memory accesses, DMA could ruin everything => solution: “PAMU”
The PAMU sits at the connection between non-core masters and the CoreNet fabric, and is configured to map memory and limit access windows, thereby increasing system stability
Cache stashing
DMA between cache and external memory
Deal with 10Gb Ethernet
Parsing network traffic up to layer 4
Assign packets to designated cores
TCP port 80/22 to core 0
ARP to core 1
UDP to cores 4~7
Queue Mgr and Buffer Mgr simplify driver codes
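The packet-to-core assignment listed above can be sketched as a small classifier. The rules for TCP ports 80/22, ARP, and UDP come from the slide. The dictionary-based packet format, the function name `assign_core`, the XOR-of-ports spreading of UDP flows over cores 4 to 7, and the `None` result for unmatched traffic are all assumptions made for illustration. On the real device this parsing is done in hardware.

```python
def assign_core(pkt):
    """Pick the destination core for a parsed packet header.

    pkt is a dict with a "proto" key ("arp", "tcp", "udp", ...) and,
    for TCP/UDP, integer "src_port" and "dst_port" keys (assumed format).
    Returns a core number, or None if no rule matches.
    """
    proto = pkt.get("proto")
    if proto == "arp":
        return 1
    if proto == "tcp" and pkt.get("dst_port") in (80, 22):
        return 0
    if proto == "udp":
        # Keep each flow on one core: same port pair, same core (assumed policy).
        flow = pkt.get("src_port", 0) ^ pkt.get("dst_port", 0)
        return 4 + flow % 4
    return None
```

Keeping a flow on one core avoids reordering within that flow, which is one plausible reason to hash on the port pair rather than distribute packets round-robin.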
Debugging
JTAG
High-speed logging link
Platinum version of “printk”
Why JTAG is slow?
Involving serial bit shifting
Conclusion
Hardware guys are crazy and unfriendly
They are all “Postscript” geeks
However, they give the world a chance to process real-time sample data with thousands of cycles per sample => more things can be done in software
Besides data structures and algorithms, a good engineer now needs to know more
The more you know the hardware, the more it dances with you, or at least……
You can talk about it as common sense
Reference[1/2]
Sudhakar Yalamanchili, Georgia Institute of Technology, “Multicore Computing - Evolution”
(Broadcom) “Broadcom mediaDSP: A platform for building programmable multicore video processors,” IEEE Micro., Mar/April 2009.
(Texas Instruments, TI) Lina J. Karam, Ismail Alkamal, Alan Gatherer, Gene A. Frantz, David V. Anderson, and Brian L. Evans, “Trends in Multicore DSP Platforms,” IEEE Signal Processing Magazine, Nov. 2009. University Texas and Texas Instruments
Reference[2/2]
http://en.wikipedia.org
(Tilera) Tile64 processor products, http://www.tilera.com (originated from MIT)
(Intel) Single-chip Cloud Computer
TI, “Multicore Programming Guide”
Freescale, “Embedded Multicore: An Introduction”
Presentation slides from lab member