Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks
Aniruddha N. Udipi
with Naveen Muralimanohar*, Rajeev Balasubramonian
University of Utah and *HP Labs
Feb 25, 2016
University of Utah 2
Motivation - I
• Future CMPs are likely to be power-limited
  – On-chip networks consume 20-36% of total chip power
  – Network power is dominated by routers
• Chip design and verification costs are tremendous
  – Directory-based protocols are complicated and have the inherent problem of indirection
  – Snooping-based protocols are well understood and simple to design
• Metal and wiring are cheap and plentiful
• We are no longer pin limited for the interconnection network
Motivation - II
• Future of multi-core computing likely to diverge into two separate tracks
  – Mid-range multicore machines for home/office
    • 16-64 cores
  – Many-core machines for scientific/server applications
    • 1000s of cores
• Even machines with large core counts are likely to be virtualized, with communication localized to small chunks of approx. 64 cores
• Design energy-efficient networks for moderate core-counts
Executive Summary
• Elimination of routers leads us back to bus-based networks
• Dramatic reduction in energy consumption, little or no loss in performance, reduction in design complexity
• Enhancing the life of buses for moderately sized CMPs
  – Filtered segmented bus, low-swing wiring, address interleaved buses, page coloring
Outline
• Overview
• Proposal I - Filtered Segmented Bus
• Proposal II - Low-swing Wiring
• Proposal III - Address Interleaved Buses
• Proposal IV - Page Coloring
• Evaluation
• Conclusion
Baseline Chip and Interconnect Organization
[Figure: baseline chip organization. Labels: Core, L1, L2, Router]
• Simple mesh used for illustration here, other options discussed in the paper
• Static-NUCA shared L2, each line has a “home” slice based on its address
Where does energy go in the network?
• Router: 1.39e-10 J/access
• Link: 1.56e-11 J/access (8X)
• Energy estimates based on CACTI 6.0 and Orion 2.0
What is the solution?
• We are left with... a bus!
• Could we really just use a bus?
• Not really
  – Too many links activated on every transaction
  – Energy gained by eliminating routers is lost by activating more links
  – Poor performance due to increased arbitration times and network contention
We can do better..
Useless snoop: Particular cache line not present in any other core
• Segment and filter snoop transactions at intermediate points
• Two types of filters
  – Out-filter
  – In-filter
• Reduces number of links activated
• Allows for safe parallelism (serialization happens at the central bus if required)
Filtered Bus
[Figure: filtered segmented bus. Legend: bus link, filter]
Filters
• Each “filter” depicted in the figure is a combination of an “Out-filter” and an “In-filter”
• Each of these is a Counting Bloom Filter
– 2 arrays of 10-bit entries
– Subsets of the address bits hashed into each of these arrays, incremented to add entries, decremented to remove entries
– To test for membership, simply check if entries in both arrays are non-zero
– Compact representation, false positives possible
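The filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design: the array size (`index_bits`), the line offset (`line_offset_bits`), and the choice of which address bits feed each array are assumptions made for the example; only the two counter arrays, the 10-bit counters, and the add/remove/test rules come from the slide.

```python
class CountingBloomFilter:
    """Two counter arrays, each indexed by a different subset of address bits."""

    COUNTER_MAX = (1 << 10) - 1  # 10-bit entries, as on the slide

    def __init__(self, index_bits=12, line_offset_bits=6):
        # index_bits and line_offset_bits are illustrative assumptions.
        self.index_bits = index_bits
        self.line_offset_bits = line_offset_bits
        self.arrays = [[0] * (1 << index_bits) for _ in range(2)]

    def _indices(self, address):
        line = address >> self.line_offset_bits
        mask = (1 << self.index_bits) - 1
        # Assumed hash: array 0 uses the low line-address bits,
        # array 1 uses the next group of bits above them.
        return [line & mask, (line >> self.index_bits) & mask]

    def add(self, address):
        for array, i in zip(self.arrays, self._indices(address)):
            if array[i] == self.COUNTER_MAX:
                raise OverflowError("10-bit counter would overflow")
            array[i] += 1

    def remove(self, address):
        for array, i in zip(self.arrays, self._indices(address)):
            if array[i] > 0:
                array[i] -= 1

    def may_contain(self, address):
        # Member only if the entries in both arrays are non-zero.
        return all(array[i] > 0
                   for array, i in zip(self.arrays, self._indices(address)))


f = CountingBloomFilter()
a = 0x001005 << 6   # array indices (0x005, 0x001)
b = 0x002006 << 6   # array indices (0x006, 0x002)
c = 0x002005 << 6   # array indices (0x005, 0x002): never added
f.add(a)
f.add(b)
print(f.may_contain(a), f.may_contain(c))
```

Address `c` was never added, yet it tests positive because it shares one index with `a` and the other with `b`. That is the false-positive case the slide mentions; a line that was added and not removed can never test negative.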
[Figure legend: bus link, In + Out filter]
Out-filter - Case 1
• Bloom filter in every segment keeps track of a superset of lines that call that segment “home” and have been sent “out” of that segment
• If a line has never left a segment, none of its transactions need to be seen outside
• Completely localized transaction
• Only home segment activated
[Figure: request R handled inside its home segment, with the energy saved marked. Legend: bus link, activated bus, activated filter, In-filter, Out-filter; R = requested address]
Out-filter – Case 2
• If the line is being requested from outside its home segment, transaction has to go out on the central bus
• The out-filter of the home segment is updated appropriately
• The in-filter then takes over
[Figure: request R issued outside the home segment, with the home segment's out-filter update marked. Legend: bus link, activated bus, activated filter, In-filter, Out-filter; R = requested address]
In-filter
• Bloom filters keep track of a superset of lines currently present in the segment
• Only broadcast within the local segment if required
[Figure: in-filter handling of request R, with the energy saved marked. Legend: bus link, activated bus, activated filter, In-filter, Out-filter; R = requested address]
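The out-filter and in-filter cases can be put together in a small sketch that reports which sub-bus segments see a given snoop. Several things here are assumptions for illustration and are not stated on the slides: the filters are modelled as exact Python sets rather than Bloom filters, the home segment is always activated when a request arrives from outside it, and the requester's in-filter is updated as soon as it asks for the line. The function name `activated_segments` is hypothetical.

```python
def activated_segments(requester, home, address, out_filters, in_filters):
    """Return (segments that see the snoop, whether the central bus was used).

    out_filters[s]: lines homed in segment s that have left it.
    in_filters[s]:  lines currently present in segment s.
    Both are plain sets here, standing in for counting Bloom filters.
    """
    active = {requester}  # the local sub-bus always broadcasts

    # Out-filter, case 1: line never left its home segment, stay local.
    if requester == home and address not in out_filters[home]:
        return active, False

    # Out-filter, case 2: request comes from outside the home segment.
    if requester != home:
        out_filters[home].add(address)      # line is now "out" of home
        in_filters[requester].add(address)  # assumption: tracked on request
        active.add(home)                    # assumption: home must respond

    # In-filter: other segments broadcast locally only if the line may be there.
    for segment, present in enumerate(in_filters):
        if segment != requester and address in present:
            active.add(segment)
    return active, True


out_filters = [set() for _ in range(4)]
in_filters = [set() for _ in range(4)]
local = activated_segments(0, 0, 0x40, out_filters, in_filters)
remote = activated_segments(2, 0, 0x40, out_filters, in_filters)
print(local, remote)
```

The first call stays entirely inside segment 0. The second crosses the central bus, and from then on a request for the same line in segment 0 also has to be seen in segment 2.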
Arbitration
• Global arbitration delay is non-trivial for a single bus connecting even 16 cores
• Multi-step arbitration, as required
• On every request
  – arbitrate for local bus and broadcast
  – if filter indicates that the transaction is complete, “validate” broadcast via wired-OR
  – if not, arbitrate for central bus and hold broadcast in a single-entry buffer until the central bus is available
  – at the remote sub-buses, priority is given to requests originating from the central bus
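The last rule, priority for central-bus requests at a remote sub-bus, can be sketched as a two-queue arbiter. The class name `SubBusArbiter`, the one-grant-per-call model, and the first-come-first-served order within each class of request are assumptions for illustration; the slides only state the priority rule.

```python
from collections import deque


class SubBusArbiter:
    """Grants one request per call; central-bus requests win over local ones."""

    def __init__(self):
        self.local = deque()
        self.from_central = deque()

    def request(self, who, from_central=False):
        queue = self.from_central if from_central else self.local
        queue.append(who)

    def grant(self):
        # Requests arriving from the central bus are served first.
        if self.from_central:
            return self.from_central.popleft()
        if self.local:
            return self.local.popleft()
        return None  # bus idle


arbiter = SubBusArbiter()
arbiter.request("core3")
arbiter.request("core1")
arbiter.request("central", from_central=True)
grants = [arbiter.grant() for _ in range(4)]
print(grants)
```

Even though the central-bus request arrived last, it is granted before the two local cores.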
Low-swing Wiring
• Differential low-swing wiring up to 10X more energy efficient than regular wiring
• These have less impact on packet-switched networks since routers are the bottleneck anyway
  – Amdahl’s law!
• Slightly increased latency, more metal requirement
Address Interleaved Buses
• As core counts increase, increased pressure on the bus due to contention
• At 64 cores, even though bus-based networks continue to be highly energy efficient, performance begins to dip
• To shore up performance, increase the number of buses
  – different buses handle mutually exclusive addresses
  – increased metal requirement
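One simple way to give each bus a mutually exclusive set of addresses is to interleave on the low bits of the cache-line address. The sketch below assumes 64-byte lines and modulo interleaving; neither detail is given on the slide, and `bus_for_address` is a hypothetical helper name.

```python
def bus_for_address(address, num_buses=2, line_bytes=64):
    """Map an address to exactly one bus (assumed modulo interleaving)."""
    line = address // line_bytes  # all bytes of a line share a bus
    return line % num_buses


# Consecutive cache lines alternate between the two buses.
mapping = [bus_for_address(line * 64) for line in range(4)]
print(mapping)
```

Because every address maps to a single bus, transactions to the same line are still serialized on one bus, while unrelated lines can proceed in parallel on the other.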
Page Coloring
• OS-assisted page-coloring for L2 cache
• We use a simple first-touch approach
• Improved locality helps any network, but is especially well-suited for our network because
  – More flexibility in page placement
  – Less negative impact by sub-optimal page placement
  – Improves filter behavior
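A first-touch policy can be sketched as a table from page number to L2 slice: the first core to touch a page fixes where that page lives. This is a simplified model under stated assumptions. A real OS would pick a physical page whose color bits select the desired slice, whereas the hypothetical `FirstTouchColoring` class below just records the decision; the 4KB page size is the one used in the evaluation.

```python
PAGE_BYTES = 4096  # 4KB pages


class FirstTouchColoring:
    """Remember, per page, the slice of the core that touched it first."""

    def __init__(self):
        self.page_to_slice = {}

    def home_slice(self, address, toucher_slice):
        page = address // PAGE_BYTES
        # setdefault keeps the first toucher and ignores later ones.
        return self.page_to_slice.setdefault(page, toucher_slice)


coloring = FirstTouchColoring()
first = coloring.home_slice(0x1000, toucher_slice=3)   # core near slice 3 touches first
later = coloring.home_slice(0x1FFF, toucher_slice=5)   # same page, different core
other = coloring.home_slice(0x2000, toucher_slice=5)   # a new page
print(first, later, other)
```

Pages used by one core or one segment end up homed there, which keeps more transactions local.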
Methodology
• Virtutech SIMICS full-system simulator
  – “g-cache” significantly modified to add network models
• CACTI 6.0 and Orion 2.0 for router/link energy computation
• 16 cores for most experiments, sensitivity analysis for 32- and 64-core systems
• 32nm process, 3GHz clock
• 32K D-L1, 16K I-L1, 2MB/slice shared L2
• 200 cycle main memory latency
• 4KB page size
• PARSEC, NAS, SPLASH-2 benchmark suites – run for entire Region-Of-Interest/parallel section
• Baseline routers - 4 VCs, 8 buffers/VC
Energy Consumption – Address Network
[Figure annotations: Ring – 20x, Grid – 27x, Fbfly – 31x]
Energy Consumption – Data Network
[Figure annotations: Ring – 2x, Grid – 2.5x, Fbfly – 3x]
How does energy consumption reduce?
• Router : Link energy ratio is high enough to significantly impact energy characteristics
• Efficient Bloom filters, at 16KB/filter
– Out-filters are 85% accurate (note that there are only false positives, no false negatives)
– In-filters are 90% accurate
Effect of Page Coloring
• More locality
• Better filtering
  – Out-filter accuracy increases from 85% to 97%
System Performance
[Figure annotations: Ring – 7%, Grid – 3%, Fbfly – 1%]
How does performance improve?
• Two basic reasons
  – Inherent indirection in directory-based protocols
  – Deep pipelines in routers increasing the no-load latency
• Avg. latency in bus-based network is 16.4 cycles
  – Arbitration (3.7 cyc) + Contention (1 cyc) + Bloom filter (1.2 cyc) + Link latency (10.5 cyc)
• Even in the most connected FBFLY, average of 1.5 hops per message, bare minimum two messages per transaction – 3 hops – 15 cycles without contention
– Link (6 cyc) + Router (9 cyc)
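The two latency figures can be checked by adding up the components quoted above. The snippet is plain arithmetic on the slide's numbers; the variable names are only for illustration.

```python
# Bus-based network: average cycles per transaction, by component.
bus_cycles = {"arbitration": 3.7, "contention": 1.0,
              "bloom_filter": 1.2, "link": 10.5}
bus_total = sum(bus_cycles.values())

# Flattened butterfly: 1.5 hops per message, at least two messages.
hops = 1.5 * 2
fbfly_cycles = {"link": 6, "router": 9}
fbfly_total = sum(fbfly_cycles.values())

print(f"bus: {bus_total:.1f} cycles")
print(f"fbfly: {hops:.0f} hops, {fbfly_total} cycles before contention")
```

The bus total already includes arbitration and contention, while the flattened butterfly figure is a contention-free minimum, so the two are closer than the raw numbers suggest.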
Scaling – 32 Cores – Energy
Average energy reduction of 19X in address network, 3X in data network
32 Cores – Performance
Average 5% drop in performance
Scaling - 64 Cores – Energy
Average reduction of 13X in address network, 2.5X in data network
64 Core - Performance
Average 39% increase in execution time compared to fbfly, only 12% increase with just two interleaved buses
Router Optimizations
• For packet-switched networks to be as energy efficient as bus-based networks, Router : Link energy ratio should be less than
  – 3.5X at 16 cores
  – 4.5X at 32 cores
  – 7X at 64 cores
• Current energy ratio is approx. 70X
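To see how far routers are from break-even, the current ratio can be divided by each threshold. The calculation assumes link energy stays fixed, so the whole gap has to be closed by cheaper routers; that assumption is ours, not the slide's.

```python
current_ratio = 70.0                         # approx. Router : Link energy today
break_even = {16: 3.5, 32: 4.5, 64: 7.0}     # cores -> maximum ratio

# Factor by which router energy would have to shrink (links held fixed).
required_reduction = {cores: current_ratio / ratio
                      for cores, ratio in break_even.items()}

for cores, factor in required_reduction.items():
    print(f"{cores} cores: routers must be about {factor:.1f}x more efficient")
```

The required improvement is largest at 16 cores and shrinks as the core count grows, which matches the bus advantage narrowing at larger scales.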
Related Work
• Packet Switched Networks
  – Dally/Towles (DAC ’01), Kim et al. (MICRO ’07), Grot et al. (HPCA ’09), TRIPS, TILERA
• Hierarchical Networks
  – Muralimanohar et al. (ISCA ’07), Das et al. (HPCA ’09)
• Snoop Filtering
  – Moshovos et al. (HPCA ’01), Strauss et al. (ISCA ’06), Salapura et al. (HPCA ’08)
• Bus applications in CMPs
  – Manevich et al. (NOCS ’09)
Key Contributions
• For moderate core counts, buses just work!
  – Dramatic energy reduction
  – Little or no loss in performance
  – Simple snooping protocols, reduction in design complexity
• Low-swing wiring
• Multiple address interleaved buses
• OS-assisted page coloring
• Potential for router optimization
Thank you..
• Questions?