Enhancing Cache Coherent Architectures with Access Patterns for Embedded
Manycore Systems
Jussara Marandola, Stéphane Louise, Loïc Cudennec, David A. Bader
[email protected]
Oct 11-12, 2012, SoC 2012
Introduction CoCCA approach evaluation Conclusion and perspective
Background
Multicore and manycore systems: the architecture and its future:
Single-processor time is over
Multiprocessors are here and will remain
Down to embedded systems (e.g. my cellphone)
Manycore systems are on the verge of appearing (e.g. Tilera, but others are on the way)
The future is manycore, even in the embedded world
We have to prepare for this. Programmability?
Programmability
What are the programming paradigms for manycores? How do we program them?
New paradigms (e.g. stream programming) are still young; we need to learn a new way of programming. Bad for legacy software (porting costs!)
MPI (OK for HPC applications, but also heavy work for parallelization)
OpenMP and the like: "only" adding some pragmas to parallelize an application.
OpenMP relies on a shared memory model, so shared-memory behavior must be provided, if possible in hardware (because it is faster)
Shared memory consistency for multicore/manycores
For manycore systems, memory consistency = cache coherence mechanisms:
Based on the four-state MESI protocol:
Modified: a single valid copy of the data exists in the system and was modified since its fetch from memory
Exclusive: the value is only in one core's cache and wasn't modified since it was accessed from memory
Shared: multiple copies of the value exist in the system, and only read operations were done
Invalid: the current copy that the core has must not be used and will be discarded
Use of home nodes to keep the state consistency:
For a given address in memory, only one core of the system keeps the coherence state
The distribution of home nodes is done as a modulo on an address mask (round-robin, line size) to avoid hot spots
A processor mask tracks the cores that share the cache line
The baseline protocol is the basis for all state-of-the-art memory consistency systems
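The MESI states and the modulo-based home-node distribution above can be sketched in a few lines of C. Line size and core count below are illustrative assumptions, not values from the talk.

```c
#include <stdint.h>

/* The four MESI states described above (a software sketch; enum names are ours). */
typedef enum { MESI_MODIFIED, MESI_EXCLUSIVE, MESI_SHARED, MESI_INVALID } mesi_state_t;

#define LINE_SIZE 64u   /* bytes per cache line (assumption) */
#define NUM_CORES 16u   /* cores holding directory slices (assumption) */

/* Home-node selection as described: a modulo over the line-granular address
 * (round-robin at cache-line granularity), which spreads directory traffic
 * across cores and avoids hot spots. */
static inline unsigned home_node(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_CORES);
}
```

With this scheme, consecutive cache lines map to consecutive cores, and the mapping wraps around after NUM_CORES lines.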
Baseline implementation in a manycore system
Figure: each tile holds an L2 cache, a cache coherence directory (entries of address plus coherence info: state and vector bit fields), a memory interface, and a network interface.
Modification of a shared value by a given core
Figure: animated sequence (steps ➀-➃) showing the coherence messages exchanged with the processors holding a shared copy of the data.
Issues of the baseline protocol and memory access patterns
Sometimes a single write on a shared value triggers a lot of coherence traffic on the NoC
Regular but non-contiguous accesses generate lots of accesses
Typical example: reading an image by column. But the accesses are simple and deterministic
In some areas the baseline protocol does not work as well as it could and lacks a bit of scalability
In the embedded world, a lot of low-level data processing displays a regular behavior with respect to its memory accesses:
Convolutions on images
Usual transformations (e.g. FFT, DCT)
Vector operations
etc.
The idea: take advantage of these regular memory access patterns to reduce the coherence traffic and enable memory prefetch
State of the art: memory patterns and shared memory coherence
Use of memory patterns:
Intel: use of special instructions to perform regular accesses to memory, limited to a single core; Patent US 7,143,264 (2006)
IBM: special instructions used to detect and apply patterns, also limited to a single cache; Patent US 7,395,407 (2008)
Other platforms:
The STAR project aims to provide a scalable manycore with a coherent shared memory
Cache Coherence Architecture with patterns
Our enhancement to Cache Coherence Architecture (CCA):
Relies on the baseline protocol (adds to it, does not replace it)
Updates it with special cases for pattern management
Adds storage to each core for pattern storage and detection
Patterns are a result of the compilation process
It cannot work worse than baseline, because baseline is still the default.
Modifications:
Core IP with the pattern storage and matching
Addition of the speculative protocol to the baseline protocol
The patterns (and the speculative protocol) have their own determination of the home node (which can be the same as or differ from the baseline home node)
We call this modified system CoCCA (Codesigned CCA)
CoCCA architecture scheme
Figure: the baseline tile extended with a CoCCA pattern table next to the cache coherence directory; each pattern-table entry holds an address pattern plus coherence info (state and bit fields), alongside the memory and network interfaces.
Chip area overhead: ~+3%?
Pattern definition and storage
Patterns are not stored the same way on the nodes that use them and on home nodes
The minimum implementation uses a 2D strided access shape:
a start address
a stride length
a pattern length
on the home node: a pattern size
A speculative access fetches cache lines (as the baseline protocol does), but the access pattern may need to be more fine-grained in its specification (overlaps)
Definition of triggers: a way of detecting the signature of a pattern to fetch
The simplest trigger is the first address of the pattern access
Triggers and pattern definition
Pattern matching principle (hardware). Pattern calculation (simple case):
Desc = fn(B_addr, s, δ)
B_addr: base address
s: size of the pattern
δ: interval (stride) between 2 consecutive accesses
E.g.: Pat(1, 4, 2)(@1) = {@2, @5, @8, @11}
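A small sketch of the pattern expansion above. The reading of the parameters is our assumption: δ is taken as the number of addresses skipped between two consecutive accesses (so the step is δ + 1), which is the interpretation that reproduces the slide's example Pat(1, 4, 2) fired at @1 giving {@2, @5, @8, @11}.

```c
#include <stddef.h>
#include <stdint.h>

/* Descriptor for the simple case Desc = fn(B_addr, s, delta).
 * Field names are ours, not the hardware's. */
struct pat_desc {
    uintptr_t baddr; /* base offset relative to the trigger address (assumption) */
    size_t    s;     /* number of accesses in the pattern */
    size_t    delta; /* addresses skipped between two consecutive accesses (assumption) */
};

/* Expand a pattern fired by trigger address `trig` into `out`
 * (out must hold at least d.s entries).  Returns the access count. */
static size_t pat_expand(struct pat_desc d, uintptr_t trig, uintptr_t *out)
{
    for (size_t i = 0; i < d.s; i++)
        out[i] = trig + d.baddr + i * (d.delta + 1);
    return d.s;
}
```

With d = {1, 4, 2} and trigger @1, this yields the addresses 2, 5, 8, 11, matching the slide's example.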
Base of the modified protocol
Figure: flow chart of a read request. The requester performs a directory lookup, then a pattern-table lookup; a pattern-table hit sends SPEC_RQ (speculative request), a miss sends RD_RQ (baseline read request). On the baseline home node, a directory lookup leads to a memory access and a RD_RQ_AK reply; on the hybrid (CoCCA) home node, a pattern lookup on hit answers for the whole pattern length, falling back to the directory lookup and memory access on miss.
Without pattern information or in case of pattern miss: the system acts as an ordinary baseline architecture
In case of pattern hit: the speculative protocol is fired
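The requester-side decision above reduces to a small dispatch, sketched here. The message names (RD_RQ, SPEC_RQ) come from the slide; the function and enum names are ours, and the real hardware performs the lookups, not a boolean parameter.

```c
#include <stdbool.h>

/* Which request, if any, the requester puts on the NoC. */
typedef enum { SENT_NOTHING, SENT_RD_RQ, SENT_SPEC_RQ } sent_t;

/* On an L2 hit the access is served locally; otherwise a pattern-table
 * hit fires the speculative protocol (SPEC_RQ) and a miss falls back to
 * the ordinary baseline read request (RD_RQ). */
static sent_t request(bool l2_hit, bool pattern_hit)
{
    if (l2_hit)
        return SENT_NOTHING;
    return pattern_hit ? SENT_SPEC_RQ : SENT_RD_RQ;
}
```

This is why CoCCA cannot behave worse than baseline: every path that does not hit the pattern table is exactly the baseline path.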
Hardware tables and special instructions
A C-language description of pattern storing tables:
unsigned long  capacity; /* sizeof(address) */
unsigned long  size;     /* address number  */
unsigned long *offset;   /* pattern offset  */
unsigned long *length;   /* pattern length  */
unsigned long *stride;   /* pattern stride  */
So it is possible to have a rough estimate of the size of an entry in the pattern table.
A few specialized instructions to manage pattern tables:
PatternNew: to create a pattern,
PatternAddOffset: to add an offset entry,
PatternAddLength: to add a length entry,
PatternAddStride: to add a stride entry,
PatternFree: to release the pattern after use.
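Gathering the five fields above into one struct makes the per-entry size estimate concrete: five word-sized fields, e.g. 5 × 8 = 40 bytes per entry header on an LP64 target, before the pointed-to arrays. This is a software model of the table, not its hardware layout.

```c
#include <stddef.h>

/* The pattern-table fields from the slide gathered into one entry. */
struct pattern_entry {
    unsigned long  capacity; /* sizeof(address) */
    unsigned long  size;     /* address number  */
    unsigned long *offset;   /* pattern offset  */
    unsigned long *length;   /* pattern length  */
    unsigned long *stride;   /* pattern stride  */
};

/* Rough per-entry storage cost: two words plus three pointers. */
static size_t entry_header_bytes(void)
{
    return sizeof(struct pattern_entry);
}
```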
A first benchmark program for early evaluation
The choice of a benchmark program for our speculative protocol:
be representative of a typical embedded application
stress the protocol proposal on several aspects
We chose a 2-step image cascading filtering:
the first filter's result is the source of the second filter
5x5 filter
applied on chunks of the image, one per core, with shared cache lines both in read mode and in write mode
the result of the second filter is written back on the source (write invalidation)
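A minimal single-core sketch of the benchmark's compute kernel. The cascade structure (first filter feeds the second, whose result overwrites the source) follows the slide; the mean kernel and edge clamping are our simplifications, and the per-core chunking is omitted.

```c
/* 5x5 box filter with clamped borders (kernel choice is ours). */
#define K 5

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

static void filter5x5(const unsigned char *src, unsigned char *dst, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int acc = 0;
            for (int dy = -K / 2; dy <= K / 2; dy++)
                for (int dx = -K / 2; dx <= K / 2; dx++)
                    acc += src[clampi(y + dy, 0, h - 1) * w + clampi(x + dx, 0, w - 1)];
            dst[y * w + x] = (unsigned char)(acc / (K * K));
        }
}

/* Cascade: src -> tmp -> src; the second write-back over the source is
 * what triggers the write invalidations discussed in the talk. */
static void cascade(unsigned char *src, unsigned char *tmp, int w, int h)
{
    filter5x5(src, tmp, w, h);
    filter5x5(tmp, src, w, h);
}
```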
Memory mapping of our benchmark program
Instrumentation choice: Pin/Pintools
Pin/Pintools:
Pin is an instrumentation framework for binaries based on a JIT technique to accelerate the instrumentation. It is an Intel project
Pin acts in association with a programmable instrumentation tool called a Pintool
Several Pintools are provided in the basic distribution of Pin
We used:
inscount: a Pintool which gives the number of executed instructions
pinatrace: a Pintool which traces and logs all the memory accesses (load/store operations)
See paper for details.
Data sharing and prefetch
Figure: read data sharing in conterminous rectangles (Rect. i, i+1, i+7, i+8): exclusive data (1 core only), data shared by 2 cores, data shared by 4 cores.
We can define three kinds of patterns on this benchmark:
Source image prefetch and setting of old Shared values (S) to Exclusive values (E) when the source image becomes the destination (2 patterns per core)
False concurrency of write accesses between two rectangles of the destination image. This happens because the frontiers are not aligned with L2 cache lines. The associated pattern is 6 vertical lines with 0 bytes in common
Shared read data (because convolution kernels read pixels in conterminous rectangles, see the figure). There are 6 vertical lines and 3 sets of two horizontal lines for these patterns
After simplification, only 6 patterns are required
Evaluation results

Condition                          MESI    CoCCA
Shared line invalidation          34560    17283
Exclusive line sharing (2 cores)  12768    12768
Exclusive line sharing (4 cores)   1344      772
Total throughput                  48672    30723

Reduction of 37% of coherence message throughput
Prefetch stands for 10% of cache accesses
This means that without prefetch the application runs 67% slower (20 cycles for an on-chip shared cache access and 80 cycles for external memory accesses)
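The 37% figure follows directly from the table's totals: (48672 - 30723) / 48672 ≈ 36.9%, rounded to 37%. A quick check:

```c
/* Percentage reduction in coherence-message throughput, rounded to the
 * nearest percent, from the totals in the table above. */
static int reduction_percent(long baseline, long cocca)
{
    return (int)(100.0 * (double)(baseline - cocca) / (double)baseline + 0.5);
}
```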
Contributions
Shared memory and coherence are important for the programmability of CMPs
State-of-the-art cache coherence mechanisms fall into worst-case behaviors for scenarios that seem simple: regular accesses to memory with patterns
We defined an extension of cores to store patterns
We extended the baseline protocol with a speculative protocol
For embedded systems: tables are part of the compilation process
Only a few pattern entries are necessary for each typical low-level filter
Patterns can significantly reduce coherence message throughput
Patterns allow for early and efficient cache preloading, which significantly accelerates applications
May provide a path to cache coherency in massive manycores
Future work and perspective
Extend the number of benchmark applications to draw more general conclusions
Apply our ideas in a NoC simulator to do cycle-accurate simulations
Include it in a full-scale simulator (e.g. SoCLib)
Extend our work toward an HPC-friendly architecture that would determine patterns dynamically at runtime
Thank you for your attention
Questions?
ALCHEMY workshop @ ICCS 2013 (Barcelona)
The International Conference on Computational Science (ICCS) can be a good place to talk with people using HPC architectures for their needs. Loïc Cudennec and I are organizing a workshop on the issues arising with future manycore systems (number of cores > 1000 and beyond):
ALCHEMY workshop: Architecture, Language, Compilation and Hardware support for Emerging ManYcore systems
Topics:
Advanced architecture support for massive parallelism management
Advanced architecture support for enhanced communication for manycores
Full paper submission: December 15th. Notification: Feb. 10