Implementing a Hybrid SRAM / eDRAM NUCA Architecture

Javier Lira (UPC, Spain) [email protected]
Carlos Molina (URV, Spain) [email protected]
David Brooks (Harvard, USA) [email protected]
Antonio González (Intel-UPC, Spain) [email protected]

HiPC 2011, Bangalore (India) – December 21, 2011
Transcript
Page 1:

Implementing a Hybrid SRAM / eDRAM NUCA Architecture

Javier Lira (UPC, Spain) [email protected]
Carlos Molina (URV, Spain) [email protected]
David Brooks (Harvard, USA) [email protected]
Antonio González (Intel-UPC, Spain) [email protected]

HiPC 2011, Bangalore (India) – December 21, 2011

Page 2: Motivation

CMPs incorporate large LLCs, which occupy 40-45% of chip area.

POWER7 implements its L3 cache with eDRAM:
◦ 3x density.
◦ 3.5x lower energy consumption.
◦ Increases latency by a few cycles.

We propose a placement policy to accommodate both technologies in a NUCA cache.

Page 3: NUCA caches [1]

NUCA divides a large cache into smaller, faster banks.

Cache access latency consists of the routing latency plus the bank access latency.

Banks close to the cache controller have smaller latencies than farther banks.

[1] Kim et al. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Architectures. ASPLOS'02
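The latency decomposition above can be sketched with a toy model. The per-hop and bank figures below follow the simulation parameters given later in the talk (1-cycle router, 1-cycle wire, 4-cycle bank access); the hop counts are illustrative:

```python
# Toy NUCA latency model: access latency = routing latency + bank access
# latency. Per-hop and bank figures follow the talk's simulation parameters
# (1-cycle router + 1-cycle wire per hop, 4-cycle bank access).
ROUTER_DELAY = 1   # cycles per router traversal
WIRE_DELAY = 1     # cycles per on-chip link
BANK_LATENCY = 4   # cycles per bank access

def nuca_access_latency(hops: int) -> int:
    """Latency to access a bank `hops` network hops from the controller."""
    return hops * (ROUTER_DELAY + WIRE_DELAY) + BANK_LATENCY

# Banks close to the cache controller are faster than farther banks.
assert nuca_access_latency(1) < nuca_access_latency(7)
```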

Page 4: SRAM vs. eDRAM

SRAM provides high performance.

eDRAM provides low power and high density.

                  SRAM    eDRAM
Latency           1x      1.5x
Density           1x      3x
Leakage           2x      1x
Dynamic energy    1.5x    1x
Needs refresh?    No      Yes

Page 5: Outline

Introduction
Methodology
Implementing a hybrid NUCA cache
Analysis of our design
Exploiting architectural benefits
Conclusions

Page 6: Baseline architecture [2]

[Figure: 8-core CMP (Core 0 to Core 7) surrounding the NUCA banks.]

Placement: 16 possible positions per data block.
Access: partitioned multicast.
Migration: gradual promotion.
Replacement: LRU + zero-copy.

[2] Beckmann and Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. MICRO'04
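The gradual-promotion migration policy named above can be sketched as follows. The list-based bankset model and the names are ours, for illustration only:

```python
# Sketch of "gradual promotion" migration in a D-NUCA bankset: on a hit,
# the block swaps places with the block one bank closer to the requesting
# core, so hot blocks drift toward the fast banks over repeated hits.
# The bankset is modeled as a list ordered from closest (index 0) to
# farthest bank, one resident block per bank.
def gradual_promotion(bankset: list, hit_index: int) -> None:
    """Promote the block hit at `hit_index` one bank closer to the core."""
    if hit_index > 0:
        bankset[hit_index - 1], bankset[hit_index] = (
            bankset[hit_index], bankset[hit_index - 1])

banks = ["A", "B", "C", "D"]   # one block per bank, closest first
gradual_promotion(banks, 2)    # hit on "C": it moves one bank closer
assert banks == ["A", "C", "B", "D"]
```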

Page 7: Experimental framework

Number of cores: 8 x UltraSPARC IIIi
Frequency: 1.5 GHz
Main memory size: 4 GBytes
Memory bandwidth: 512 Bytes/cycle
Private L1 caches: 8 x 32 KBytes, 2-way
Shared L2 NUCA cache: 8 MBytes, 128 banks
NUCA bank: 64 KBytes, 8-way
L1 cache latency: 3 cycles
NUCA bank latency: 4 cycles
Router delay: 1 cycle
On-chip wire delay: 1 cycle
Main memory latency: 250 cycles (from core)

Simulation stack: Simics + GEMS (Ruby, Garnet, Orion), running Solaris 10.
Workloads: PARSEC and SPEC CPU2006.

Page 8: Outline (repeated)

Page 9: Homogeneous approach

[Figure: 8-core CMP; fast SRAM banks at the edges, eDRAM banks in the center.]

Fast SRAM banks are located close to the cores.

Slower eDRAM banks sit in the center of the NUCA cache.

PROBLEM: Migration tends to concentrate shared data in the central banks.

Page 10: Data usage analysis

A significant amount of data in the LLC is never accessed during its lifetime.

SRAM banks store the most frequently accessed data.

eDRAM banks allocate data blocks that either:
◦ just arrived in the NUCA, or
◦ were evicted from SRAM banks.

Page 11: Heterogeneous approach

[Figure: 8-core CMP with separate SRAM and eDRAM bank groups.]

A data block first goes to an eDRAM bank.

If accessed, it moves to an SRAM bank.

Features:
◦ Migration only between SRAM banks.
◦ No communication among eDRAM banks.
◦ No eviction from SRAM banks.
◦ eDRAM is extra storage for SRAM.

PROBLEM: The access scheme must search twice as many banks.
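The placement rules above (new blocks enter eDRAM, eDRAM hits promote to SRAM, SRAM victims demote to eDRAM) can be sketched as a small set model. The class, the LRU modeling, and all names are ours, for illustration only:

```python
# Minimal sketch of the heterogeneous placement policy for one cache set:
# new blocks enter eDRAM; a block hit in eDRAM is promoted to SRAM; SRAM
# victims fall back to eDRAM, so eDRAM acts as extra storage for SRAM.
# Fixed-capacity LRU sets are modeled with OrderedDict (MRU at the end).
from collections import OrderedDict

class HybridNUCASet:
    def __init__(self, sram_ways: int, edram_ways: int):
        self.sram = OrderedDict()
        self.edram = OrderedDict()
        self.sram_ways, self.edram_ways = sram_ways, edram_ways

    def access(self, addr) -> bool:
        """Look up `addr`; returns True on hit."""
        if addr in self.sram:
            self.sram.move_to_end(addr)
            return True
        if addr in self.edram:            # hit in eDRAM: promote to SRAM
            del self.edram[addr]
            self._insert_sram(addr)
            return True
        self._insert_edram(addr)          # miss: new block goes to eDRAM
        return False

    def _insert_sram(self, addr):
        if len(self.sram) >= self.sram_ways:
            victim, _ = self.sram.popitem(last=False)
            self._insert_edram(victim)    # SRAM victim demoted to eDRAM
        self.sram[addr] = True

    def _insert_edram(self, addr):
        if len(self.edram) >= self.edram_ways:
            self.edram.popitem(last=False)   # eDRAM victim leaves the NUCA
        self.edram[addr] = True

s = HybridNUCASet(sram_ways=2, edram_ways=2)
assert s.access(0x100) is False      # miss: placed in eDRAM
assert s.access(0x100) is True       # eDRAM hit: promoted to SRAM
assert 0x100 in s.sram
```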

Page 12: TDA

The Tag Directory Array (TDA) stores the tags of the eDRAM banks.

Using the TDA, the access scheme looks up at most 17 banks.

The TDA requires 512 KBytes for an 8-MByte (4S-4D) hybrid NUCA cache.
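The 512-KByte figure is consistent with simple sizing arithmetic. The arithmetic below is our own; the talk gives only the totals, and the 64-byte line size and 8 bytes of tag/state per entry are assumptions:

```python
# Back-of-the-envelope TDA sizing for the 8-MByte 4S-4D configuration.
# Assumptions (ours): 64-byte cache lines, 8 bytes of tag + state per
# tracked eDRAM line.
EDRAM_BYTES = 4 * 1024 * 1024      # 4 MBytes of eDRAM in 4S-4D
LINE_BYTES = 64
ENTRY_BYTES = 8                    # assumed tag + metadata per eDRAM line

entries = EDRAM_BYTES // LINE_BYTES    # number of eDRAM lines to track
tda_bytes = entries * ENTRY_BYTES
assert tda_bytes == 512 * 1024         # 512 KBytes, matching the talk
```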

Page 13: Performance results

Heterogeneous + TDA outperforms the other hybrid alternatives.

We use Heterogeneous + TDA as the hybrid NUCA cache in further analysis.

Page 14: Outline (repeated)

Page 15: Performance

Well-balanced configurations achieve performance similar to an all-SRAM NUCA cache.

The majority of hits are in SRAM banks.

Page 16: Power and Area

The hybrid NUCA pays the overhead of the TDA.

The less SRAM the hybrid NUCA uses, the better.

[Figure: power breakdown (0-100%) for All SRAM, 7S-1D, 6S-2D, 5S-3D, 4S-4D, 3S-5D, 2S-6D, 1S-7D and All DRAM; components: Dyn. Network, Dyn. Cache, Sta. Network, Sta. SRAM, Sta. eDRAM, Sta. TDA.]

Page 17: The best configuration: 4S-4D

Achieves performance similar to all-SRAM.
Reduces power consumption by 10%.
Occupies 15% less area than all-SRAM.

Page 18: Outline (repeated)

Page 19: New configurations

Compared to all SRAM banks, 4S-4D (SRAM: 4 MBytes, eDRAM: 4 MBytes) gives a 15% reduction in area. Reinvesting that area yields:
◦ +1 MByte in SRAM banks → 5S-4D.
◦ +2 MBytes in eDRAM banks → 4S-6D.

Page 20: Exploiting benefits

Both configurations increase performance by 4% and do not increase power consumption.

[Figure: power breakdown (0-100%) for All SRAM, 4S-4D, 5S-4D and 4S-6D; components: Dyn. Network, Dyn. Cache, Sta. Network, Sta. SRAM, Sta. eDRAM, Sta. TDA.]

Page 21: Outline (repeated)

Page 22: Conclusions

IBM integrates eDRAM in its latest general-purpose processor (POWER7).

We implement a hybrid NUCA cache that effectively combines SRAM and eDRAM technologies.

Our placement policy succeeds in concentrating most accesses in the SRAM banks.

A well-balanced hybrid cache achieves performance similar to the all-SRAM configuration, but occupies 15% less area and dissipates 10% less power.

By exploiting the architectural benefits, we improve performance by up to 10%, and by 4% on average.

Page 23:

Implementing a Hybrid SRAM / eDRAM NUCA Architecture

Questions?

HiPC 2011, Bangalore (India) – December 21, 2011