CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators

Coherence For NDAs

Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Rachata Ausavarungnirun, Kevin Hsieh, Nastaran Hajinazar, Krishna Malladi, Hongzhong Zheng, Onur Mutlu

Application Analysis

Analysis of Existing Coherence Mechanisms

Architecture Support Evaluation

CoNDA consistently retains most of Ideal-NDA’s benefits, coming within 10.4% of the Ideal-NDA performance

CoNDA significantly reduces energy consumption and comes within 4.4% of Ideal-NDA

Challenge: Coherence between NDAs and CPUs

[Figure: four CPUs with L1 and L2 caches, and DRAM with an NDA compute unit]

(1) Large cost of off-chip communication

It is impractical to use traditional coherence protocols

(2) NDA applications generate a large amount of off-chip data movement

1st key observation: CPU threads often concurrently access the same region of data that NDA kernels are accessing, which leads to significant data sharing

Graph Processing Hybrid Databases (HTAP)

We find that not all portions of applications benefit from NDA

1 Memory-intensive portions benefit from NDA
2 Compute-intensive or cache-friendly portions should remain on the CPU

[Figure: hybrid database (HTAP) in which transactions run on the CPUs and analytics run on the NDA, with data sharing between them]

2nd key observation: CPU threads and NDA kernels typically do not concurrently access the same cache lines

CPU threads rarely update the same data that an NDA is actively working on

For the Connected Components application, only 5.1% of the CPU accesses collide with NDA accesses

Poor handling of coherence eliminates much of an NDA's performance and energy benefits

[Figure: speedup of CPU-only, NC, CG, FG, and Ideal-NDA for CC, Radii, and PageRank on arXiV and Gnutella, plus GMEAN]

[Figure: normalized energy of CPU-only, NC, CG, FG, and Ideal-NDA for CC, Radii, and PageRank on arXiV and Gnutella, plus GMEAN]

CoNDA

We propose CoNDA, a mechanism that uses optimistic NDA execution to avoid unnecessary coherence traffic

[Figure: timeline of optimistic execution on the CPU and the NDA. Stages shown: offload NDA kernel; concurrent CPU + NDA execution (CPU thread execution alongside optimistic NDA execution, with no coherence requests); send signatures; coherence resolution; commit or re-execute]

Identifying Coherence Violations

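[Figure: timeline of CPU and NDA operations, with a coherence check after each NDA execution]

CPU: C1. Wr Z, C2. Rd A, C3. Wr B
NDA: N1. Rd X, N2. Wr Y, N3. Rd Z
Any coherence violation? Yes. Flush Z to DRAM

NDA: N4. Rd X, N5. Wr Y, N6. Rd Z
CPU: C4. Wr Y, C5. Rd Y
Any coherence violation? No. Commit NDA operations

CPU: C6. Wr X

Effective ordering: C1. Wr X, C2. Rd X, C3. Rd Y, C4. Wr Y, C5. Wr Y, C6. Rd Y, N4. Rd Z, N5. Wr Y, N6. Rd X, C7. Wr X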
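The check in this timeline can be sketched in a few lines of Python. This is only an illustration, assuming that a violation means the CPU wrote an address that the NDA read during the same execution window. It also assumes that the second window contains only C4 and C5 on the CPU side. The function name and the tuple encoding of operations are hypothetical.

```python
def coherence_violations(cpu_ops, nda_ops):
    """Return the addresses written by the CPU and read by the NDA in one window."""
    cpu_writes = {addr for kind, addr in cpu_ops if kind == "Wr"}
    nda_reads = {addr for kind, addr in nda_ops if kind == "Rd"}
    return cpu_writes & nda_reads


# First window: C1-C3 run concurrently with N1-N3.
first_cpu = [("Wr", "Z"), ("Rd", "A"), ("Wr", "B")]
first_nda = [("Rd", "X"), ("Wr", "Y"), ("Rd", "Z")]

# Second window (assumed membership): C4-C5 run concurrently with N4-N6.
second_cpu = [("Wr", "Y"), ("Rd", "Y")]
second_nda = [("Rd", "X"), ("Wr", "Y"), ("Rd", "Z")]

print(coherence_violations(first_cpu, first_nda))    # {'Z'}: flush Z and re-execute
print(coherence_violations(second_cpu, second_nda))  # set(): commit
```

Under these assumptions the first window conflicts only on Z, and the second window has no conflict even though both sides write Y, because a write by both sides is not an NDA read of CPU-written data.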

Non-Cacheable Approach

[Figure: hybrid database (HTAP) in which transactions run on the CPUs and analytics run on the NDA, with data sharing between them]

Mark the NDA data as non-cacheable

(1) Generates a large number of off-chip accesses
(2) Significantly hurts CPU thread performance

NC fails to provide any energy savings and performs 6.0% worse than CPU-only

[Figure: components shown are the CPUs with their L1 caches, the shared LLC, the CPUWriteSet, coherence resolution, DRAM, the NDA core with its L1 cache, the NDAReadSet, and the NDAWriteSet]

High Level Architecture of CoNDA

[Figure: CPU with L1 cache and CPUWriteSet; NDA core with L1 cache, NDAReadSet, and NDAWriteSet]

Per-word dirty bit mask to mark all uncommitted data updates

The NDAReadSet and NDAWriteSet are used to track memory accesses from the NDA
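A minimal sketch of how a per-word dirty bit mask could mark uncommitted updates. The line geometry (16 words per line), the class name, and the commit-by-merging behavior are assumptions made for illustration; the poster states only that the mask marks uncommitted data updates.

```python
WORDS_PER_LINE = 16  # assumption: 64-byte cache line with 4-byte words


class NdaCacheLine:
    """Hypothetical NDA L1 cache line with one dirty bit per word."""

    def __init__(self, words):
        self.words = list(words)  # private copy of the line's data
        self.dirty_mask = 0       # bit i set => word i holds an uncommitted update

    def write(self, index, value):
        self.words[index] = value
        self.dirty_mask |= 1 << index

    def commit_into(self, memory_line):
        """Copy only the dirty words into memory_line, then clear the mask."""
        for i in range(WORDS_PER_LINE):
            if self.dirty_mask & (1 << i):
                memory_line[i] = self.words[i]
        self.dirty_mask = 0


memory = [0] * WORDS_PER_LINE
line = NdaCacheLine(memory)
line.write(3, 7)       # uncommitted NDA update to word 3
memory[5] = 9          # a different word changes elsewhere in the meantime
line.commit_into(memory)
print(memory[3], memory[5])  # 7 9
```

Tracking dirtiness per word rather than per line means a commit touches only the words the NDA actually wrote, so updates to other words of the same line survive.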

Optimistic Execution

[Figure: speedup of CPU-only, NDA-only, FG, CoNDA, and Ideal-NDA for CC, Radii, and PR on arXiV, Gnutella, and Enron, for HTAP (128, 256), and GMEAN]

[Figure: normalized energy of CPU-only, FG, CoNDA, and Ideal-NDA for CC, Radii, and PR on arXiV, Gnutella, and Enron, for HTAP (128, 256), and GMEAN]

[Figure: an address is hashed by functions h0, h1, …, hk-1 to set bits in a signature; the NDAReadSet and the CPUWriteSet are compared to detect a conflict]

If conflicts happen:
•  The CPU flushes the dirty cache lines that match addresses in the NDAReadSet
•  The NDA invalidates all uncommitted cache lines
•  Signatures are erased and the NDA restarts execution

If there are no conflicts:
•  Any clean cache lines in the CPU that match an address in the NDAWriteSet are invalidated
•  The NDA commits data updates
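The two outcomes above can be sketched as a single decision function. This is a simplified model, not the hardware mechanism: plain Python sets stand in for the signatures, and the function name, parameters, and return shape are hypothetical.

```python
def resolve(cpu_dirty, cpu_clean, cpu_write_set, nda_read_set, nda_write_set):
    """Return (action, cpu_lines), where cpu_lines are flushed or invalidated.

    Exact sets stand in for the signatures; this is an illustrative assumption.
    """
    if cpu_write_set & nda_read_set:
        # Conflict: the CPU flushes dirty lines that match the NDAReadSet.
        # The NDA then drops its uncommitted lines, the signatures are
        # erased, and the NDA restarts (not modeled here).
        flushed = {addr for addr in cpu_dirty if addr in nda_read_set}
        return "re-execute", flushed
    # No conflict: clean CPU lines that match the NDAWriteSet are
    # invalidated, and the NDA commits its updates.
    invalidated = {addr for addr in cpu_clean if addr in nda_write_set}
    return "commit", invalidated


print(resolve(cpu_dirty={"Z", "B"}, cpu_clean={"Y"},
              cpu_write_set={"Z", "B"},
              nda_read_set={"X", "Z"}, nda_write_set={"Y"}))
# ('re-execute', {'Z'})

print(resolve(cpu_dirty=set(), cpu_clean={"Y", "A"},
              cpu_write_set=set(),
              nda_read_set={"X", "Z"}, nda_write_set={"Y"}))
# ('commit', {'Y'})
```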

Coherence Resolution

A Bloom-filter-based signature has two benefits:
•  Allows us to easily perform coherence resolution
•  Allows for a large number of addresses to be stored within a fixed-length register
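A minimal sketch of a Bloom-filter signature illustrates both benefits. The register width (256 bits), the number of hash functions (4), and the use of SHA-256 as the hash are assumptions chosen for readability, not the parameters of the actual design.

```python
import hashlib


class Signature:
    """Hypothetical fixed-length Bloom-filter signature over addresses."""

    def __init__(self, num_bits=256, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # the whole signature fits in one fixed-width register

    def _positions(self, address):
        # Derive num_hashes bit positions from the address.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{address}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, address):
        for pos in self._positions(address):
            self.bits |= 1 << pos

    def may_contain(self, address):
        return all((self.bits >> pos) & 1 for pos in self._positions(address))

    def may_intersect(self, other):
        # Conservative test: a shared address always yields shared bits,
        # but shared bits can also come from different addresses.
        return (self.bits & other.bits) != 0

    def clear(self):
        self.bits = 0


nda_read_set = Signature()
cpu_write_set = Signature()
nda_read_set.insert(0x1000)
cpu_write_set.insert(0x1000)
print(nda_read_set.may_intersect(cpu_write_set))  # True
```

The intersection test is a single bitwise AND regardless of how many addresses were inserted, which is what makes resolution cheap. The cost is false positives: two signatures with no common address can still share a bit and trigger an unnecessary re-execution, but a real conflict is never missed.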

Fine-Grained Coherence

[Figure: CPU, CPU, NDA]

High amount of off-chip coherence traffic

FG eliminates 71.8% of the energy benefits of an ideal NDA mechanism

Using fine-grained coherence has two benefits:

1 Simplifies the NDA programming model
2 Allows us to get permissions for only the pieces of data that are actually accessed

Coarse-Grained Coherence

[Figure: CPU, CPU, NDA]

Get coherence permission for the NDA region

Unnecessarily flushes a large amount of dirty data

Use coarse-grained locks to provide exclusive access

[Figure: timeline of CPU and NDA execution; the CPU stalls on an access to NDA data]

Blocks CPU threads when they access NDA data regions

CG fails to provide any performance benefit of NDA and performs 0.4% worse than CPU-only
