8/3/2019 7 Joshi Presentation
1/40
Building manycore processor-to-DRAM networks using monolithic silicon photonics
Ajay Joshi, Christopher Batten, Vladimir Stojanović, Krste Asanović
MIT, 77 Massachusetts Ave, Cambridge MA 02139
UC Berkeley, 430 Soda Hall, MC #1776, Berkeley, CA 94720
{joshi, cbatten, vlada}@mit.edu, [email protected]
High Performance Embedded Computing (HPEC) Workshop
23-25 September 2008
MIT/UCB
Manycore systems design space
Manycore system bandwidth requirements
Manycore systems bandwidth, pin count and power scaling
[Figure: bandwidth, pin count, and power scaling for the Server & HPC and Mobile Client segments, assuming 1 Byte/Flop and 8 Flops/core @ 5 GHz]
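The assumptions stated on this slide (1 Byte/Flop, 8 Flops/core, 5 GHz) imply a memory bandwidth demand that grows linearly with core count. The sketch below works through that arithmetic. It reads "8 Flops/core" as flops per core per cycle, which is an interpretation, and the core counts in the loop are illustrative sample points rather than values read from the plot.

```python
# Memory bandwidth demand implied by the slide's assumptions.
BYTES_PER_FLOP = 1   # from the slide
FLOPS_PER_CORE = 8   # read as flops per core per cycle (assumption)
CLOCK_HZ = 5e9       # 5 GHz, from the slide


def required_bandwidth_bytes_per_s(num_cores: int) -> float:
    """Peak memory bandwidth (bytes/s) needed to sustain 1 Byte/Flop."""
    return num_cores * FLOPS_PER_CORE * CLOCK_HZ * BYTES_PER_FLOP


if __name__ == "__main__":
    # Hypothetical core counts, chosen only to show the scaling trend.
    for cores in (16, 64, 256):
        tb_per_s = required_bandwidth_bytes_per_s(cores) / 1e12
        print(f"{cores:4d} cores -> {tb_per_s:.2f} TB/s")
```

Under these assumptions each core needs 40 GB/s, so the demand reaches the TB/s range well before a few hundred cores.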
Interconnect bottlenecks
[Figure: a single CPU with its cache and DRAM DIMM, compared with a manycore system in which the cores reach several cache and DRAM DIMM pairs through interconnect networks]
Bottlenecks due to energy and bandwidth density limitations
Outline
Motivation
Monolithic silicon photonic technology
Processor-memory network architecture exploration
Manycore system using silicon photonics
Conclusion
Unified on-chip/off-chip photonic link
Supports dense wavelength-division multiplexing that improves bandwidth density
Uses monolithic integration that reduces energy consumption
Utilizes the standard bulk CMOS flow
Optical link components
65 nm bulk CMOS chip designed to test various optical devices
Silicon photonics area and energy advantage
Link                                               Energy (pJ/b)   Bandwidth density (Gb/s/μm)
Global on-chip photonic link                       0.25            160-320
Global on-chip optimally repeated electrical link  1               5
Off-chip photonic link (50 μm coupler pitch)       0.25            13-26
Off-chip electrical SERDES (100 μm pitch)          5               0.1
On-chip/off-chip seamless photonic link            0.25
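The advantage claimed in this table can be made explicit by dividing the electrical figures by the photonic ones. The sketch below does that using only the numbers in the table; the dictionary keys and the function name are illustrative, and bandwidth density is kept in the table's own units.

```python
# (energy in pJ/b, (min, max) bandwidth density in the table's units)
LINKS = {
    "on_chip_photonic": (0.25, (160.0, 320.0)),
    "on_chip_electrical": (1.0, (5.0, 5.0)),
    "off_chip_photonic": (0.25, (13.0, 26.0)),
    "off_chip_electrical": (5.0, (0.1, 0.1)),
}


def advantage(photonic: str, electrical: str):
    """Return (energy ratio, (min, max) density ratio), photonic vs electrical."""
    e_photonic, (d_lo, d_hi) = LINKS[photonic]
    e_electrical, (d_electrical, _) = LINKS[electrical]
    energy_ratio = e_electrical / e_photonic
    density_ratio = (d_lo / d_electrical, d_hi / d_electrical)
    return energy_ratio, density_ratio


if __name__ == "__main__":
    for name in ("on_chip", "off_chip"):
        energy, (lo, hi) = advantage(f"{name}_photonic", f"{name}_electrical")
        print(f"{name}: {energy:.0f}x lower energy, "
              f"{lo:.0f}x-{hi:.0f}x higher bandwidth density")
```

On these figures the photonic link uses 4x less energy per bit on chip and 20x less off chip, with the larger bandwidth-density gain in the off-chip case.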
Outline
Motivation
Monolithic silicon photonic technology
Processor-memory network architecture exploration
  Baseline electrical mesh topology
  Electrical mesh with optical global crossbar topology
Manycore system using silicon photonics
Conclusion
Baseline electrical system architecture
Access point per DM distributed across the chip
Two on-chip electrical mesh networks
  Request path: core → access point → DRAM module
  Response path: DRAM module → access point → core
[Figure panels: mesh physical view; mesh logical view]
C = core, DM = DRAM module
Interconnect network design methodology
Ideal throughput and zero load latency used as design metrics
Energy constrained approach is adopted
Energy components in a network
  Mesh energy (E_m): router-to-router links (RRL), routers
  IO energy (E_io): logic-to-memory links (LML)
[Figure: design flow. Starting from the flit width, calculate on-chip RRL energy, on-chip router energy, and mesh throughput; calculate total mesh energy; use the total energy budget to calculate the energy budget for LML; calculate LML width; calculate I/O throughput and zero load latency]
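The design flow above can be sketched as a small function that takes a flit width and walks the same steps. This is a simplified illustration, not the authors' model: the per-bit energies, message size, and hop latency are hypothetical placeholders, the mesh energy assumes every channel is active every cycle, and the throughput and hop-count formulas are the textbook bisection bound and approximate average distance for a k x k mesh under uniform random traffic. Only the core count, DRAM module count, and 8 nJ/cycle budget come from the slides.

```python
import math
from dataclasses import dataclass


@dataclass
class Params:
    cores: int = 256                      # from the slides
    dram_modules: int = 16                # from the slides
    budget_pj_per_cycle: float = 8000.0   # 8 nJ/cycle, from the slides
    e_rrl_pj_per_bit: float = 0.1         # assumed RRL energy per bit
    e_router_pj_per_bit: float = 0.1      # assumed router energy per bit
    e_lml_pj_per_bit: float = 5.0         # assumed LML energy per bit
    message_bits: int = 512               # assumed message size
    hop_cycles: int = 2                   # assumed per-hop latency


def explore(flit_width: int, p: Params = Params()) -> dict:
    """Follow the slide's flow for one flit width (bits)."""
    k = math.isqrt(p.cores)               # k x k mesh
    channels = 4 * k * (k - 1)            # unidirectional router-to-router links
    e_rrl = channels * flit_width * p.e_rrl_pj_per_bit
    e_router = p.cores * flit_width * p.e_router_pj_per_bit
    e_mesh = e_rrl + e_router             # total mesh energy per cycle
    e_io = p.budget_pj_per_cycle - e_mesh # what is left for the LMLs
    if e_io <= 0:
        raise ValueError("mesh alone exceeds the energy budget")
    lml_width = int(e_io / p.e_lml_pj_per_bit) // p.dram_modules
    if lml_width == 0:
        raise ValueError("no energy left for a usable LML")
    mesh_tp = 4 * k * flit_width          # bisection bound, bits/cycle
    io_tp = lml_width * p.dram_modules    # bits/cycle across all LMLs
    avg_hops = 2 * k / 3                  # approximate mean distance in the mesh
    zero_load_latency = (avg_hops * p.hop_cycles
                         + math.ceil(p.message_bits / flit_width)
                         + math.ceil(p.message_bits / lml_width))
    return {
        "mesh_energy_pj": e_mesh,
        "lml_width_bits": lml_width,
        "throughput_bits_per_cycle": min(mesh_tp, io_tp),
        "limited_by": "mesh" if mesh_tp <= io_tp else "io",
        "zero_load_latency_cycles": zero_load_latency,
    }


if __name__ == "__main__":
    for width in (8, 16, 32):
        print(width, explore(width))
```

With these placeholder numbers, narrow flits leave the mesh as the limit while wide flits consume so much of the budget that the I/O links become the limit, which is the trade-off the methodology is meant to expose.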
Network throughput and zero load latency
System throughput limited by on-chip mesh or I/O links
On-chip mesh could be over-provisioned to overcome mesh bottleneck
Zero load latency limited by data serialization
(22 nm tech, 256 cores @ 2.5 GHz, 8 nJ/cyc energy budget)
[Figure: network throughput and zero load latency plots for over-provisioning factors (OPF) of 1, 2, and 4, with on-chip serialization and off-chip serialization marked on the latency plot]
Outline
Motivation
Monolithic silicon photonic technology
Processor-memory network architecture exploration
  Baseline electrical mesh topology
  Electrical mesh with optical global crossbar topology
Manycore system using silicon photonics
Conclusion
Optical system architecture
Off-chip electrical links replaced with optical links
Electrical to optical conversion at access point
Wavelengths in each optical link distributed across various core-DRAM module pairs
[Figure panels: mesh physical view; mesh logical view]
C = core, DM = DRAM module
Network throughput and zero load latency
Reduced I/O cost improves system bandwidth
Reduction in latency due to lower serialization latency
On-chip network is the new bottleneck
Optical multi-group system architecture
Break the single on-chip electrical mesh into several groups
  Each group has its own smaller mesh
  Each group still has one AP for each DM
  More APs, so each AP is narrower (uses fewer λs)
Use optical network as a very efficient global crossbar
Need a crossbar switch at the memory for arbitration
Ci = core in group i, DM = DRAM module, S = global crossbar switch
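The effect of grouping on access point (AP) count and width can be sketched with simple division. The 256 cores and 16 DRAM modules match the configuration used elsewhere in the talk; the total wavelength budget of 1024 is a hypothetical number chosen only to make the trend visible, and the even split across APs is an assumption.

```python
CORES = 256               # from the slides
DRAM_MODULES = 16         # from the slides
TOTAL_WAVELENGTHS = 1024  # hypothetical fixed optical budget


def grouping(groups: int) -> dict:
    """AP count, AP width, and mesh size when the chip is split into groups."""
    aps = groups * DRAM_MODULES   # one AP per DM in every group
    return {
        "groups": groups,
        "cores_per_group": CORES // groups,          # size of each smaller mesh
        "access_points": aps,
        "wavelengths_per_ap": TOTAL_WAVELENGTHS // aps,  # assumed even split
    }


if __name__ == "__main__":
    for g in (1, 4, 16):
        print(grouping(g))
```

As the group count rises, each mesh shrinks and each AP gets fewer wavelengths, while the total optical bandwidth stays fixed.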
Network throughput vs zero load latency
Grouping moves traffic from energy-inefficient mesh channels to energy-efficient photonic channels
Grouping and silicon photonics provide 10x-15x throughput improvement
Grouping reduces ZLL in photonic range, but increases ZLL in electrical range
[Figure: throughput vs zero load latency, with points A and B and the 10x-15x throughput gap marked]
Simulation results
Grouping
  2x improvement in bandwidth at comparable latency
Overprovisioning
  2x-3x improvement in bandwidth for small group count at comparable latency
  Minimal improvement for large group count
[Figure: two simulation plots, each for 256 cores, 16 DM, uniform random traffic]
Simulation results
Replacing off-chip electrical with photonics (Eg1x4 → Og1x4)
  2x improvement in bandwidth at comparable latency
Using opto-electrical global crossbar (Eg4x2 → Og16x1)
  8x-10x improvement in bandwidth at comparable latency
[Figure: two simulation plots, each for 256 cores, 16 DM, uniform random traffic]
Outline
Motivation
Monolithic silicon photonic technology
Processor-memory network architecture exploration
Manycore system using silicon photonics
Conclusion
Simplified 16-core system design
Full 256-core system design
Outline
Motivation
Monolithic silicon photonic technology
Processor-memory network architecture exploration
Manycore system using silicon photonics
Conclusion
Conclusion
On-chip network design and memory bandwidth will limit manycore system performance
Unified on-chip/off-chip photonic link is proposed to solve this problem
Grouping with optical global crossbar improves system throughput
For an energy-constrained approach, photonics provide 8-10x improvement in throughput at comparable latency
Backup
MIT Eos1 65 nm test chip
Texas Instruments standard 65 nm bulk CMOS process
First ever photonic chip in sub-100 nm CMOS
Automated photonic device layout
Monolithic integration with electrical modulator drivers
Ring modulator
Paperclips
Waveguide crossings
M-Z test structures
Digital driver
4 ring filter banks
Photo detector
Two-ring filter
One-ring filter
Vertical coupler grating
Optical waveguide
Waveguide made of polysilicon
Silicon substrate under waveguide etched away to provide optical cladding
64 wavelengths per waveguide in opposite directions
[Figures: SEM image of a polysilicon waveguide; cross-sectional view of a photonic chip]
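The wavelength count gives the aggregate bandwidth of one waveguide once a per-wavelength data rate is chosen. The sketch below reads the slide as 64 wavelengths in each of two directions, which is an interpretation, and the 10 Gb/s rate in the usage example is a hypothetical value that the slide does not state.

```python
WAVELENGTHS_PER_DIRECTION = 64  # from the slide, read as 64 per direction (assumption)
DIRECTIONS = 2                  # wavelengths travel in opposite directions


def waveguide_bandwidth_gbps(rate_per_wavelength_gbps: float) -> float:
    """Aggregate bandwidth of one waveguide, both directions, in Gb/s."""
    return WAVELENGTHS_PER_DIRECTION * DIRECTIONS * rate_per_wavelength_gbps


if __name__ == "__main__":
    # 10 Gb/s per wavelength is a hypothetical rate for illustration.
    print(waveguide_bandwidth_gbps(10.0), "Gb/s per waveguide")
```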
Photodetectors
Embedded SiGe used to create photodetectors
Monolithic integration enables good optical coupling
Sub-100 fJ/bit energy cost required for the receiver
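The receiver energy target translates into a power figure per wavelength channel once a data rate is fixed. The sketch below does that unit conversion; the 10 Gb/s rate is a hypothetical value for illustration, and 100 fJ/bit is used as the upper bound named on the slide.

```python
def receiver_power_mw(energy_fj_per_bit: float, rate_gbps: float) -> float:
    """Receiver power in mW: fJ/bit x Gb/s = 1e-6 W = 1e-3 mW."""
    return energy_fj_per_bit * rate_gbps * 1e-3


if __name__ == "__main__":
    # 100 fJ/bit is the slide's upper bound; 10 Gb/s is an assumed rate.
    print(receiver_power_mw(100.0, 10.0), "mW per receiver")
```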