7 Joshi Presentation


Building manycore processor-to-DRAM networks using monolithic silicon photonics

Ajay Joshi, Christopher Batten, Vladimir Stojanović, Krste Asanović

MIT, 77 Massachusetts Ave, Cambridge MA 02139
UC Berkeley, 430 Soda Hall, MC #1776, Berkeley, CA 94720
{joshi, cbatten, vlada}@mit.edu, [email protected]

High Performance Embedded Computing (HPEC) Workshop

23-25 September 2008



MIT/UCB

Manycore systems design space



Manycore system bandwidth requirements



Manycore systems bandwidth, pin count and power scaling

[Figure: bandwidth, pin count and power scaling for Server & HPC and Mobile Client systems, assuming 1 Byte/Flop and 8 Flops/core @ 5 GHz]
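The scaling assumption on this slide can be turned into a quick estimate. The sketch below assumes that "8 Flops/core" means 8 flops per core per cycle; the core counts in the loop are illustrative and are not taken from the slide.

```python
def required_bandwidth_gbytes_per_s(cores, flops_per_core_per_cycle=8,
                                    clock_ghz=5.0, bytes_per_flop=1.0):
    """Memory bandwidth (GB/s) needed to feed the given compute rate."""
    # cores * flops/cycle * Gcycles/s gives GFlop/s
    gflops = cores * flops_per_core_per_cycle * clock_ghz
    # 1 Byte/Flop turns GFlop/s directly into GB/s
    return gflops * bytes_per_flop


# Illustrative core counts (assumed, not from the slide)
for cores in (1, 64, 256):
    bw = required_bandwidth_gbytes_per_s(cores)
    print(f"{cores:4d} cores -> {bw:8.0f} GB/s")
```

Under these assumptions a single core already asks for 40 GB/s, which is why pin count and I/O power become the limiting factors as core counts grow.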



Interconnect bottlenecks

[Figure: CPU, cache and DRAM DIMM blocks of a manycore system, connected by interconnect networks]

Bottlenecks due to energy and bandwidth density limitations



Outline

Motivation
Monolithic silicon photonic technology
Processor-memory network architecture exploration
Manycore system using silicon photonics
Conclusion



Unified on-chip/off-chip photonic link

Supports dense wavelength-division multiplexing that improves bandwidth density
Uses monolithic integration that reduces energy consumption
Utilizes the standard bulk CMOS flow



Optical link components

65 nm bulk CMOS chip designed to test various optical devices



Silicon photonics area and energy advantage

Metric                                               Energy (pJ/b)   Bandwidth density (Gb/s/µm)
Global on-chip photonic link                         0.25            160-320
Global on-chip optimally repeated electrical link    1               5
Off-chip photonic link (50 µm coupler pitch)         0.25            13-26
Off-chip electrical SERDES (100 µm pitch)            5               0.1
On-chip/off-chip seamless photonic link              0.25
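To make the energy column concrete, the sketch below converts energy per bit into link power at a given traffic level. The energy figures come from the table above; the 1 Tb/s target is a hypothetical value chosen only for illustration.

```python
# Energy per bit (pJ/b), copied from the table above.
ENERGY_PJ_PER_BIT = {
    "on-chip photonic": 0.25,
    "on-chip electrical (repeated)": 1.0,
    "off-chip photonic": 0.25,
    "off-chip electrical SERDES": 5.0,
}


def link_power_watts(energy_pj_per_bit, bandwidth_gbps):
    """Power = energy per bit x bits per second."""
    return energy_pj_per_bit * 1e-12 * bandwidth_gbps * 1e9


target_gbps = 1000.0  # hypothetical 1 Tb/s of traffic, not from the slides
for name, energy in ENERGY_PJ_PER_BIT.items():
    print(f"{name:32s} {link_power_watts(energy, target_gbps):5.2f} W")
```

At the same traffic level, the off-chip electrical SERDES draws 20x the power of the photonic link, which is the ratio of the two energy entries.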



Outline

Motivation
Monolithic silicon photonic technology
Processor-memory network architecture exploration
  Baseline electrical mesh topology
  Electrical mesh with optical global crossbar topology
Manycore system using silicon photonics
Conclusion



Baseline electrical system architecture

Access point per DM distributed across the chip
Two on-chip electrical mesh networks
  Request path: core → access point → DRAM module
  Response path: DRAM module → access point → core

[Figures: mesh physical view; mesh logical view]

C = core, DM = DRAM module



Interconnect network design methodology

Ideal throughput and zero load latency used as design metrics
Energy constrained approach is adopted
Energy components in a network
  Mesh energy (E_m) (router-to-router links (RRL), routers)
  IO energy (E_io) (logic-to-memory links (LML))

[Flowchart: inputs are the flit width and the total energy budget. Boxes: calculate on-chip RRL energy; calculate on-chip router energy; calculate mesh throughput; calculate total mesh energy; calculate energy budget for LML; calculate LML width; calculate I/O throughput; calculate zero load latency]
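The energy-constrained flow can be sketched as a small function: pick a flit width, charge the mesh for its links and routers, and spend whatever remains of the budget on logic-to-memory links. This is a minimal sketch, not the authors' model. Every per-bit energy, the link and router counts, and `mesh_flits_per_cycle` are hypothetical inputs, and the sketch assumes every link and router moves one flit per cycle.

```python
def size_network(flit_width_bits, total_budget_pj_per_cycle,
                 n_links, n_routers,
                 e_rrl_pj_per_bit, e_router_pj_per_bit, e_lml_pj_per_bit,
                 mesh_flits_per_cycle):
    """Split a per-cycle energy budget between the mesh and the LMLs."""
    # Mesh energy E_m: router-to-router links plus routers.
    e_mesh = flit_width_bits * (n_links * e_rrl_pj_per_bit
                                + n_routers * e_router_pj_per_bit)

    # The rest of the budget is the IO energy E_io available for LMLs.
    e_io_budget = total_budget_pj_per_cycle - e_mesh
    if e_io_budget <= 0:
        raise ValueError("mesh alone exceeds the energy budget")

    # LML width: how many bits per cycle the remaining budget can pay for.
    lml_width_bits = int(e_io_budget // e_lml_pj_per_bit)

    # Throughputs in bits per cycle; the smaller one limits the system.
    mesh_throughput = mesh_flits_per_cycle * flit_width_bits
    io_throughput = lml_width_bits
    return {
        "mesh_energy_pj": e_mesh,
        "lml_width_bits": lml_width_bits,
        "mesh_throughput_bits_per_cycle": mesh_throughput,
        "io_throughput_bits_per_cycle": io_throughput,
        "ideal_throughput_bits_per_cycle": min(mesh_throughput, io_throughput),
    }


# Hypothetical 16x16 mesh (480 links, 256 routers) with made-up energies.
result = size_network(flit_width_bits=16, total_budget_pj_per_cycle=8000.0,
                      n_links=480, n_routers=256,
                      e_rrl_pj_per_bit=0.25, e_router_pj_per_bit=0.25,
                      e_lml_pj_per_bit=5.0, mesh_flits_per_cycle=32)
print(result)
```

Sweeping `flit_width_bits` in this sketch shows the trade-off the methodology explores: a wider mesh raises mesh throughput but leaves less energy for I/O.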



Network throughput and zero load latency

System throughput limited by on-chip mesh or I/O links
On-chip mesh could be over-provisioned to overcome mesh bottleneck
Zero load latency limited by data serialization

(22nm tech, 256 cores @ 2.5 GHz, 8 nJ/cyc energy budget)
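Two quick calculations help read these parameters. The first converts the per-cycle energy budget into watts using the slide's own numbers. The second is a simple zero load latency model (hop latency plus serialization latency) that shows why narrow links dominate latency; the hop count, per-hop delay, message size and link widths are hypothetical.

```python
import math


def power_budget_watts(energy_nj_per_cycle, clock_ghz):
    """nJ/cycle x Gcycles/s = W."""
    return energy_nj_per_cycle * clock_ghz


def zero_load_latency_cycles(hops, cycles_per_hop, message_bits, link_width_bits):
    """Hop latency plus the cycles needed to serialize one message."""
    serialization = math.ceil(message_bits / link_width_bits)
    return hops * cycles_per_hop + serialization


# Slide parameters: 8 nJ/cycle at 2.5 GHz.
print(power_budget_watts(8.0, 2.5), "W")

# Hypothetical 512-bit message over 10 hops at 1 cycle per hop.
for width in (16, 64):
    print(width, "bit link ->", zero_load_latency_cycles(10, 1, 512, width), "cycles")
```

With these assumed values, the 16-bit link spends 32 of its 42 cycles on serialization, while the 64-bit link spends only 8 of 18.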






[Figure: curves for OPF: 1, OPF: 2 and OPF: 4]





[Figure: regions labelled on-chip serialization and off-chip serialization; curves for OPF: 1, OPF: 2 and OPF: 4]





Optical system architecture

Off-chip electrical links replaced with optical links
Electrical to optical conversion at access point
Wavelengths in each optical link distributed across various core-DRAM module pairs

[Figures: mesh physical view; mesh logical view]

C = core, DM = DRAM module




Reduced I/O cost improves system bandwidth
Reduction in latency due to lower serialization latency
On-chip network is the new bottleneck





Optical multi-group system architecture

Break the single on-chip electrical mesh into several groups
  Each group has its own smaller mesh
  Each group still has one AP for each DM
  More APs → each AP is narrower (uses fewer λs)
Use optical network as a very efficient global crossbar
Need a crossbar switch at the memory for arbitration

C_i = core in group i, DM = DRAM module, S = global crossbar switch
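The effect of grouping on mesh size and access point width can be sketched as follows. The 256-core, 16-DM configuration is the one used elsewhere in the talk; the total wavelength count is a hypothetical placeholder, and the mean hop count between two random nodes is only a rough proxy for how much traffic stays on the electrical mesh.

```python
def mean_hops_1d(n):
    """Mean |i - j| over all ordered pairs of positions on a line of n nodes."""
    return sum(abs(i - j) for i in range(n) for j in range(n)) / (n * n)


def group_summary(total_cores, dram_modules, groups, rows, cols, total_wavelengths):
    """Summarize one grouping choice; each group is a rows x cols mesh."""
    cores_per_group = total_cores // groups
    if rows * cols != cores_per_group:
        raise ValueError("rows x cols must equal the cores per group")
    access_points = groups * dram_modules  # one AP per DM in every group
    return {
        "cores_per_group": cores_per_group,
        "access_points": access_points,
        "wavelengths_per_ap": total_wavelengths / access_points,
        "mean_mesh_hops": mean_hops_1d(rows) + mean_hops_1d(cols),
    }


TOTAL_WAVELENGTHS = 1024  # hypothetical budget, not from the slides
for groups, rows, cols in ((1, 16, 16), (4, 8, 8), (16, 4, 4)):
    print(groups, group_summary(256, 16, groups, rows, cols, TOTAL_WAVELENGTHS))
```

Going from 1 to 16 groups in this sketch cuts the mean mesh distance from about 10.6 hops to 2.5, while each access point shrinks from 64 to 4 wavelengths.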



Network throughput vs zero load latency

Grouping moves traffic from energy-inefficient mesh channels to energy-efficient photonic channels
Grouping and silicon photonics provide 10x-15x throughput improvement
Grouping reduces ZLL in photonic range, but increases ZLL in electrical range

[Figure annotations: points A and B; 10x-15x]



Simulation results

Grouping
  2x improvement in bandwidth at comparable latency
Overprovisioning
  2x-3x improvement in bandwidth for small group count at comparable latency
  Minimal improvement for large group count

[Figures: two plots, each for 256 cores, 16 DM, uniform random traffic]



Simulation results

Replacing off-chip electrical with photonics (Eg1x4 → Og1x4)
  2x improvement in bandwidth at comparable latency
Using opto-electrical global crossbar (Eg4x2 → Og16x1)
  8x-10x improvement in bandwidth at comparable latency

[Figures: two plots, each for 256 cores, 16 DM, uniform random traffic]





Simplified 16-core system design



Full 256-core system design





Conclusion

On-chip network design and memory bandwidth will limit manycore system performance
Unified on-chip/off-chip photonic link is proposed to solve this problem
Grouping with optical global crossbar improves system throughput
For an energy-constrained approach, photonics provide 8-10x improvement in throughput at comparable latency



Backup



MIT Eos1 65 nm test chip

Texas Instruments standard 65 nm bulk CMOS process
First ever photonic chip in sub-100 nm CMOS
Automated photonic device layout
Monolithic integration with electrical modulator drivers



Ring modulator

Paperclips

Waveguide crossings

M-Z test structures

Digital driver

4 ring filter banks

Photo detector

Two-ring filter

One-ring filter

Vertical coupler grating



Optical waveguide

Waveguide made of polysilicon
Silicon substrate under waveguide etched away to provide optical cladding
64 wavelengths per waveguide in opposite directions

[Figures: SEM image of a polysilicon waveguide; cross-sectional view of a photonic chip]
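The wavelength count translates into aggregate waveguide bandwidth once a per-wavelength data rate is chosen. The sketch below reads the slide as 64 wavelengths in each of two directions, which is an interpretation; the 10 Gb/s per-wavelength rate and the 4 µm waveguide pitch are hypothetical values that do not appear on the slide.

```python
def waveguide_bandwidth_gbps(wavelengths_per_direction, gbps_per_wavelength,
                             directions=2):
    """Aggregate bandwidth carried by one waveguide."""
    return wavelengths_per_direction * gbps_per_wavelength * directions


def bandwidth_density_gbps_per_um(waveguide_gbps, waveguide_pitch_um):
    """Bandwidth per micron of chip edge or routing width."""
    return waveguide_gbps / waveguide_pitch_um


# Hypothetical per-wavelength rate and pitch, for illustration only.
total = waveguide_bandwidth_gbps(64, 10.0)
print(total, "Gb/s per waveguide")
print(bandwidth_density_gbps_per_um(total, 4.0), "Gb/s/um")
```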



Photodetectors

Embedded SiGe used to create photodetectors
Monolithic integration enables good optical coupling
Sub-100 fJ/bit energy cost required for the receiver
