Tile Processors: Many-Core for Embedded and Cloud Computing
Richard Schooler, VP Software Engineering, Tilera Corporation
Mar 29, 2015
HPEC, 15 September 2010. © 2010 Copyright Tilera Corporation. All Rights Reserved.
Exploiting Natural Parallelism
High-performance applications have lots of parallelism!
– Embedded apps:
  • Networking: packets, flows
  • Media: streams, images, functional & data parallelism
– Cloud apps:
  • Many clients: network sessions
  • Data mining: distributed data & computation
Lots of different levels:
– SIMD (fine-grain data parallelism)
– Thread/process (medium-grain task parallelism)
– Distributed system (coarse-grain job parallelism)
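As a rough illustration of the thread/process level applied to network flows, the sketch below hashes each flow to one worker so that packets of a flow stay in order while distinct flows are handled side by side. The packet format (a flow id plus a payload), the byte-counting handler, and all function names are assumptions made for this example, not anything from the talk. Python threads only show the structure here; they are not meant to demonstrate a speedup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib


def assign_worker(flow_id: str, num_workers: int) -> int:
    """Map a flow to a worker with a stable hash, so a flow always lands on the same worker."""
    return zlib.crc32(flow_id.encode()) % num_workers


def process_flow_batch(packets):
    """Hypothetical per-worker handler: total payload bytes per flow, in arrival order."""
    totals = defaultdict(int)
    for flow_id, payload in packets:
        totals[flow_id] += len(payload)
    return dict(totals)


def process_packets(packets, num_workers=4):
    """Partition packets by flow, process the partitions in parallel, merge the results."""
    batches = [[] for _ in range(num_workers)]
    for flow_id, payload in packets:
        batches[assign_worker(flow_id, num_workers)].append((flow_id, payload))

    merged = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for result in pool.map(process_flow_batch, batches):
            # A flow never spans two workers, so keys cannot collide here.
            merged.update(result)
    return merged
```

Because the partitioning key is the flow, no locking is needed between workers, which is the property that makes packet and session workloads a natural fit for many cores.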
Everyone is going Manycore, but can the architecture scale?
The computing world is ready for radical change
[Figure: #Cores (1, 2, 4, 32) versus Time (2005 to 2020). Labels: Intel, Sun, IBM Cell, nCores, Larrabee. Second curve: Performance, annotated "Performance & Performance/W Gap".]
Key “Many-Core” Challenges: The 3 P’s
Performance challenge
– How to scale from 1 to 1000 cores – the number of cores is the new Megahertz
Power efficiency challenge
– Performance per watt is the new metric – systems are often constrained by power & cooling
Programming challenge
– How to provide a converged many-core solution in a standard programming environment
“Problems cannot be solved by the same level of thinking that created them.”
Current technologies fail to deliver
– Incremental performance increase
– High power
– Low level of integration
– Increasingly bigger cores
We need new thinking to get
– 10x performance
– 10x performance per watt
– Converged computing
– Standard programming models
Stepping Back: How Did We Get Here?
Moore’s Conundrum:
More devices =>? More performance
Old answers: More complex cores; bigger caches
– But power-hungry
New answers: More cores
– But do conventional approaches scale?
Diminishing returns!
The Old Challenge: CPU-on-a-chip
20 MIPS CPU in 1987
Few thousand gates
What to do with all those transistors?
The Opportunity: Billions of Transistors
Old CPU:
Take Inspiration from ASICs
ASICs have high performance and low power
• Custom-routed, short wires
• Lots of ALUs, registers, memories – huge on-chip parallelism
But how to build a programmable chip?
Replace Long Wires with Routed Interconnect
[IEEE Computer ’97]
From Centralized Clump of CPUs …
[Figure: a cluster of ALUs sharing one Bypass Net and one register file (RF).]
… To Distributed ALUs, Routed Bypass Network
Scalar Operand Network (SON) [TPDS 2005]
[Figure: grid of blocks labeled ALU and R.]
From a Large Centralized Cache…
…to a Distributed Shared Cache
[Figure: grid of blocks labeled ALU, R, and $ (cache).]
[ISCA 1999]
Distributed Everything + Routed Interconnect → Tiled Multicore
[Figure: grid of tiles, each containing ALU, R, and $ (cache) blocks.]
Each tile is a processor, so programmable
Tiled Multicore Captures ASIC Benefits and is Programmable
Scales to large numbers of cores
Modular – design and verify 1 tile
Power efficient
– Short wires plus locality opts – CV²f
– Chandrakasan effect, more cores at lower freq and voltage – CV²f
Tile = Core + Switch
[Figure labels: Processor Core, S (switch), Current Bus Architecture.]
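The CV²f term on this slide is the usual dynamic switching power relation, P = C·V²·f. The sketch below works one example of the "more cores at lower frequency and voltage" argument. The numbers are normalized and purely assumed (half the frequency, 80% of the voltage), and the comparison presumes the workload splits perfectly across the two slower cores.

```python
def dynamic_power(capacitance: float, voltage: float, frequency: float) -> float:
    """Dynamic switching power: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


# Assumed, normalized operating points (not figures from the talk).
ONE_FAST_CORE = dynamic_power(capacitance=1.0, voltage=1.0, frequency=1.0)

# Two cores, each at half the frequency, so aggregate throughput matches
# the single fast core if the work parallelizes perfectly. Running slower
# is assumed to allow the supply voltage to drop to 0.8.
TWO_SLOW_CORES = 2 * dynamic_power(capacitance=1.0, voltage=0.8, frequency=0.5)

POWER_RATIO = TWO_SLOW_CORES / ONE_FAST_CORE  # about 0.64 under these assumptions
```

Under these assumptions the pair of slow cores delivers the same nominal throughput for roughly two thirds of the dynamic power, because power falls with the square of the voltage while throughput falls only linearly with frequency.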
Tilera processor portfolio: Demonstrating the scale of many-core
[Figure: product timeline, 2007, 2008, 2009, 2010, . . .]
TILE64
TILEPro36
TILEPro64
Gx16 & Gx36: 2x the performance
Gx64 & Gx100: Up to 8x performance
TILE-Gx16: 16 cores
TILE-Gx36: 36 cores
TILE-Gx64: 64 cores
TILE-Gx100: 100 cores
TILE-Gx100™: Complete System-on-a-Chip with 100 64-bit cores
1.2GHz – 1.5GHz
32 MBytes total cache
546 Gbps peak mem BW
200 Tbps iMesh BW
80-120 Gbps packet I/O
– 8 ports XAUI / 2 XAUI
– 2 40Gb Interlaken
– 32 ports 1GbE (SGMII)
80 Gbps PCIe I/O
– 3 StreamIO ports (20Gb)
Wire-speed packet eng.
– 120Mpps
MiCA engines:
– 40 Gbps crypto
– compress & decompress
[Block diagram: tile array surrounded by Memory Controllers (DDR3), mPIPE, MiCA, Flexible I/O (UART x2, USB x2, JTAG, I2C, SPI), PCIe 2.0 8-lane and 4-lane ports, Interlaken, and 10 GbE XAUI / 4x GbE SGMII ports, each behind SerDes.]
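One way to read the 120 Mpps packet-engine figure next to the 80 Gbps packet I/O figure is as the packet rate of minimum-size Ethernet frames at that line rate. The slides do not say which frame size they assume, so treat this as a plausible reading rather than the stated basis. The sketch uses the standard Ethernet overheads (8 bytes of preamble and start-of-frame delimiter, 12 bytes of interframe gap); the function and constant names are made up for the example.

```python
PREAMBLE_AND_SFD_BYTES = 8   # standard Ethernet preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12    # standard minimum interframe gap
MIN_FRAME_BYTES = 64         # minimum Ethernet frame, including FCS


def max_packets_per_second(line_rate_bps: float, frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Upper bound on packets per second for back-to-back frames of one size."""
    wire_bits = (frame_bytes + PREAMBLE_AND_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return line_rate_bps / wire_bits


GX100_MPPS = max_packets_per_second(80e9) / 1e6  # roughly 119 Mpps
GX36_MPPS = max_packets_per_second(40e9) / 1e6   # roughly 59.5 Mpps
```

Both results land close to the quoted 120 Mpps (TILE-Gx100) and 60 Mpps (TILE-Gx36), which is consistent with "wire-speed" meaning minimum-size packets at the base packet I/O rate.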
TILE-Gx36™: Scaling to a broad range of applications
36 Processor Cores
866MHz, 1.2GHz, 1.5GHz clk
12 MBytes total cache
40 Gbps total packet I/O
– 4 ports 10GbE (XAUI)
– 16 ports 1GbE (SGMII)
48 Gbps PCIe I/O
– 2 16Gbps Stream IO ports
Wire-speed packet engine
– 60Mpps
MiCA engine:
– 20 Gbps crypto
– Compress & decompress
[Block diagram: tile array surrounded by Memory Controllers (DDR3), mPIPE, MiCA, Flexible I/O (UART x2, USB x2, JTAG, I2C, SPI), PCIe 2.0 8-lane and 4-lane ports, and 10 GbE XAUI / 4x GbE SGMII ports, each behind SerDes.]
Full-Featured General Converged Cores
Processor
– Each core is a complete computer
– 3-way VLIW CPU
– SIMD instructions: 32, 16, and 8-bit ops
– Instructions for video (e.g., SAD) and networking
– Protection and interrupts
Memory
– L1 cache and L2 cache
– Virtual and physical address space
– Instruction and data TLBs
– Cache integrated 2D DMA engine
Runs SMP Linux
Runs off-the-shelf C/C++ programs
Signal processing and general apps
[Core diagram: Register File, Three Execution Pipelines, 16K L1-I, 8K L1-D, 64K L2, I-TLB, D-TLB, 2D DMA, Terabit Switch.]
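SAD, mentioned above as a video instruction, is the sum of absolute differences between two pixel blocks, the inner loop of block-matching motion estimation. The scalar sketch below shows only the arithmetic; a hardware SAD instruction would process several 8-bit lanes per operation, and nothing here reflects the actual Tile instruction encoding. The function names and the flat-list block layout are assumptions for the example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def best_match(target, candidates):
    """Index of the candidate block with the smallest SAD against the target."""
    scores = [sad(target, candidate) for candidate in candidates]
    return min(range(len(scores)), key=scores.__getitem__)
```

For example, `sad([10, 20, 30], [12, 18, 30])` adds 2 + 2 + 0, and `best_match` picks whichever candidate block minimizes that score.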
Software must complement the hardware
Enable re-use of existing code-bases
Standards-based development environment
– e.g. gcc, C, C++, Java, Linux
– Comprehensive command-line & GUI-based tools
Support multiple OS models
– One OS running SMP
– Multiple virtualized OS’s with protection
– Bare metal or “zero-overhead” with background OS environment
Support range of parallel programming styles
– Threaded programming (pThreads, TBB)
– Run-to-Completion with load-balancing
– Decomposition & Pipelining
– Higher-level frameworks (Erlang, OpenMP, Hadoop etc.)
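The "Decomposition & Pipelining" style can be sketched as a chain of stages, each owned by one thread and connected to the next by a queue, much as one might pin each stage to its own core. This is a minimal illustration using Python's standard `queue` and `threading` modules; the stage functions and the `run_pipeline` helper are hypothetical and are not part of any Tilera toolkit.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def stage(func, inbox, outbox):
    """Run one pipeline stage: apply func to each item until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # pass the shutdown signal downstream
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Push items through one thread per stage and collect the results in order."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for thread in threads:
        thread.start()

    for item in items:
        queues[0].put(item)
    queues[0].put(_STOP)

    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)

    for thread in threads:
        thread.join()
    return results
```

With one thread per stage and FIFO queues, output order matches input order. A run-to-completion design would instead give every worker the whole chain of functions and balance items across workers.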
Software Roadmap
Standards & open source integration
– Compiler: gcc, g++ 4.4+
– Linux:
  • Kernel: Tile architecture integrated to 2.6.36
  • User-space: glibc, broader set of standard packages
Extended programming and runtime environments
– Java: porting OpenJDK
– Virtualization: porting KVM
Tile architecture: The future of many-core computing
Multicore is the way forward
– But we need the right architecture to utilize it
The Tile architecture addresses the challenges
– Scales to 100’s of cores
– Delivers very low power
– Runs your existing code
Standards-based software
– Familiar tools
– Full range of standard programming environments
Thank you!
Questions?
Research Vision to Commercial Product
[Timeline figure:]
1996: The opportunity (1B transistors in 2007)
1997: A blank slate (CPU, Mem)
2002: MIT Raw, 16 cores
2007: Tile Processor, 64 cores
2010: TILE-Gx100, 100 cores
2018: The future? (100B transistors)
Standard tools and programming model
Multicore Development Environment
Standard application stack
Standard programming
– SMP Linux 2.6
– ANSI C/C++
– Java, PHP
Integrated tools
– GCC compiler
– Standard gdb, gprof
– Eclipse IDE
Innovative tools
– Multicore debug
– Multicore profile
Standards-based tools
Application layer
– Open source apps
– Standard C/C++ libs
Operating System layer
– 64-way SMP Linux
– Zero Overhead Linux
– Bare metal environment
Hypervisor layer
– Virtualizes hardware
– I/O device drivers
[Stack diagram, top to bottom: Applications (applications, libraries); Operating System (kernel, drivers); Hypervisor (virtualization and high speed I/O drivers); Tile Processor (Tile, Tile, Tile, Tile, …).]
Standard Software Stack
Compiler, OS, Hypervisor
Language Support
Infrastructure Apps
Perl
gcc & g++
Commercial Linux Distribution
Management Protocols: IPMI 2.0
Transcoding
Network Monitoring
High single core performance
Comparable to Atom & ARM Cortex-A9 cores
Single-Core Single thread CoreMark™ Comparison
[Bar chart: CoreMark Score (axis 0 to 3,500) for Tilera TILEPro64 866 MHz, Tilera TILE-Gx36 1.25 GHz, ARM Cortex-A9 1 GHz, and Intel Atom N270 1600 MHz.]
- Data for TILEPro, ARM Cortex-A9, Atom N270 is available on the CoreMark website http://coremark.org/home.php
- TILE-Gx and single thread Atom results were measured in Tilera labs
- Single core, single thread result for ARM is calculated based on chip scores
Significant value across multiple markets
Networking
• Classification
• L4-7 Services
• Load Balancing
• Monitoring/QoS
• Security
Multimedia
• Video Conferencing
• Media Streaming
• Transcoding
Wireless
• Base Station
• Media Gateway
• Service Nodes
• Test Equipment
Cloud
• Apache
• Memcached
• Web Applications
• LAMP stack
High Performance
Low Power
Standard Programming
Over 100 customers
Over 40 customers going into production
Tier 1 customers in all target markets
Targeting markets with highly parallel applications
Web
– Web Serving
– In Memory Cache
– Data Mining
Media delivery
– Transcoding
– Video delivery
– Wireless media
Government
– Lawful interception
– Surveillance
– Other
Common Themes
– Hundreds and thousands of servers running each application
– Thousands of parallel transactions
– All need better performance and power efficiency
The Tile Processor Architecture: Mesh interconnect, power-optimized cores
2 Dimensional mesh network
200 Tbps on-chip bandwidth
Scales to large numbers of cores
Modular: Design-and-verify 1 tile
Power efficient:
– Short wires & locality optimize CV²f
– Chandrakasan effect, more cores at lower freq and voltage – CV²f
Core + Switch = Tile
[Figure labels: Processor Core, S (switch), Traditional Bus/Ring Architecture.]
Distributed “everything”: Cache, memory management, connectivity
Global 64-bit address space
Distributed caches
Big centralized caches don’t scale
– Contention
– Long latency
– High power
Distributed caches have numerous benefits
– Lower power (less logic lit-up per access)
– Exploit locality (local L1 & L2)
– Can exploit various cache placement mechanisms to enhance performance
Completely HW Coherent Cache System
Highest compute density
2U form factor
4 hot pluggable modules
8 Tilera TILEPro processors
512 general purpose cores
1.3 trillion operations/sec
Power efficient and eco-friendly server
10,000 cores in an 8 kilowatt rack
35-50 watts max per node
Server power of 400 watts
90%+ efficient power supplies
Shared fans and power supplies
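The rack-level claim can be cross-checked against the server figures quoted on the previous slide (8 TILEPro processors and 512 cores per 2U server). The sketch below does that arithmetic. It assumes the whole 8 kW budget goes to 400 W servers, with nothing reserved for switches or cooling, and it assumes a "node" is one processor; neither assumption is stated in the slides.

```python
PROCESSORS_PER_SERVER = 8   # from the 2U server slide
CORES_PER_SERVER = 512      # from the 2U server slide
SERVER_POWER_W = 400
RACK_POWER_W = 8_000
NODE_POWER_RANGE_W = (35, 50)


def rack_summary():
    """Derive per-rack and per-core figures from the quoted server numbers."""
    servers_per_rack = RACK_POWER_W // SERVER_POWER_W
    return {
        "cores_per_processor": CORES_PER_SERVER // PROCESSORS_PER_SERVER,
        "servers_per_rack": servers_per_rack,
        "cores_per_rack": servers_per_rack * CORES_PER_SERVER,
        "watts_per_core": SERVER_POWER_W / CORES_PER_SERVER,
        # Power of all nodes in one server, assuming one node per processor.
        "node_power_per_server_w": tuple(
            PROCESSORS_PER_SERVER * watts for watts in NODE_POWER_RANGE_W
        ),
    }
```

Under these assumptions a rack holds 20 servers and 10,240 cores, in line with the "10,000 cores" headline, at under one watt per core. Eight nodes at 35 to 50 W also fit within the 400 W server figure.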
Coherent distributed cache system
Globally Shared Physical Address Space
– Full Hardware Cache Coherence
– Standard shared memory programming model
Distributed cache
– Each tile has local L1 and L2 caches
– Aggregate of L2 serves as a globally shared L3
– Any cache block can be replicated locally
– Hardware tracks sharers, invalidates stale copies
Dynamic Distributed Cache (DDC™)
– Memory pages distributed across all cores or homed by allocating core
Coherent I/O
– Hardware maintains coherence
– I/O reads/writes coherent with tile caches
– Reads/writes delivered by HW to home cache
– Header/packet delivered directly to tile caches
[Block diagram: tile array surrounded by DDR2 Controllers 0 to 3, PCIe 0 and 1, XAUI 0 and 1, GbE 0 and 1, and Flexible I/O (UART, JTAG, SPI, I2C), with SerDes on the high-speed ports.]
Completely Coherent System
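The two DDC placement policies named above (pages distributed across all cores, or homed by the allocating core) can be pictured as a function from an address to a "home" tile whose L2 holds the authoritative copy. The sketch below is only a model of that idea. The 64-byte granularity and the modulo mapping are stand-in assumptions, since the slides do not describe the real hardware hash, and the function names are invented for the example.

```python
CACHE_LINE_BYTES = 64  # assumed granularity, not taken from the slides


def home_tile_distributed(address: int, num_tiles: int) -> int:
    """Spread consecutive cache lines across all tiles (stand-in for the hardware hash)."""
    return (address // CACHE_LINE_BYTES) % num_tiles


def home_tile(address: int, num_tiles: int, allocating_tile=None) -> int:
    """Pick the home tile: the allocating tile if the page is locally homed, else distributed."""
    if allocating_tile is not None:
        return allocating_tile
    return home_tile_distributed(address, num_tiles)
```

Locally homed pages keep private data close to the core that uses it, while distributed homing spreads shared data so that no single tile's L2 becomes a hot spot.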