Enhancing Cache Coherent Architectures with Access Patterns for Embedded
Manycore Systems
Jussara Marandola, Stéphane Louise, Loïc Cudennec, David A. Bader
[email protected]
Oct 11-12, 2012, SoC 2012
Introduction CoCCA approach evaluation Conclusion and perspective
Background
Multicore and manycore systems: the architecture and its future:
Single-processor time is over
Multiprocessors are here and will remain
Down to embedded systems (e.g. my cellphone)
Manycore systems are on the verge of appearing (e.g. Tilera, but others are on the way)
The future is manycore, even in the embedded world
We have to prepare for this. Programmability?
Programmability
What are the programming paradigms for manycores? How do we program them?
New paradigms (e.g. stream programming) are still young; we need to learn a new way of programming. Bad for legacy software (porting costs!)
MPI (OK for HPC applications, but also heavy work for parallelization)
OpenMP and the like: "only" adding some pragmas to parallelize an application.
OpenMP relies on a shared memory model, so shared-memory behavior must be provided, if possible in hardware (because it is faster)
Shared memory consistency for multicore/manycores
For manycore systems, memory consistency = cache coherence mechanisms:
Based on the four-state MESI protocol:
Modified: a single valid copy of the data exists in the system and was modified since its fetch from memory
Exclusive: the value is only in one core's cache and wasn't modified since it was accessed from memory
Shared: multiple copies of the value exist in the system, and only read operations were done
Invalid: the current copy that the core has must not be used and will be discarded
Use of home nodes to keep the state consistency:
For a given address in memory, only one core of the system keeps the coherence state
The distribution of home nodes is done as a modulo on an address mask (round-robin, line size) to avoid hot spots
A processor mask tracks the cores that share the cache line
The baseline protocol is the basis for all state-of-the-art memory consistency systems
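The MESI states and the modulo-based home-node distribution above can be sketched in a few lines of C. Line size and core count below are illustrative assumptions, not values from the talk.

```c
#include <stdint.h>

/* The four MESI states described above (a software sketch; enum names are ours). */
typedef enum { MESI_MODIFIED, MESI_EXCLUSIVE, MESI_SHARED, MESI_INVALID } mesi_state_t;

#define LINE_SIZE 64u   /* bytes per cache line (assumption) */
#define NUM_CORES 16u   /* cores holding directory slices (assumption) */

/* Home-node selection as described: a modulo over the line-granular address
 * (round-robin at cache-line granularity), which spreads directory traffic
 * across cores and avoids hot spots. */
static inline unsigned home_node(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_CORES);
}
```

With this scheme, consecutive cache lines map to consecutive cores, and the mapping wraps around after NUM_CORES lines.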
Baseline implementation in a manycore system
Figure: each tile holds an L2 cache, a cache coherence directory (entries of address plus coherence info: state and vector bit fields), a memory interface, and a network interface.
Modification of a shared value by a given core
Figure: animated sequence (steps ➀-➃) showing the coherence messages exchanged with the processors holding a shared copy of the data.
Issues of the baseline protocol and memory access patterns
Sometimes a single write on a shared value triggers a lot of coherence traffic on the NoC
Regular but non-contiguous accesses generate lots of accesses
Typical example: reading an image by column. But the accesses are simple and deterministic
In some areas the baseline protocol does not work as well as it could and lacks a bit of scalability
In the embedded world, a lot of low-level data processing displays a regular behavior with respect to its memory accesses:
Convolutions on images
Usual transformations (e.g. FFT, DCT)
Vector operations
etc.
The idea: take advantage of these regular memory access patterns to reduce the coherence traffic and enable memory prefetch
State of the art: memory patterns and shared memory coherence
Use of memory patterns:
Intel: use of special instructions to perform regular accesses to memory, limited to a single core; Patent US 7,143,264 (2006)
IBM: special instructions used to detect and apply patterns, also limited to a single cache; Patent US 7,395,407 (2008)
Other platforms:
The STAR project aims to provide a scalable manycore with a coherent shared memory
Cache Coherence Architecture with patterns
Our enhancement to Cache Coherence Architecture (CCA):
Relies on the baseline protocol (adds to it, does not replace it)
Updates it with special cases for pattern management
Adds storage to each core for pattern storage and detection
Patterns are a result of the compilation process
It cannot work worse than baseline, because baseline is still the default.
Modifications:
Core IP with the pattern storage and matching
Addition of the speculative protocol to the baseline protocol
The patterns (and the speculative protocol) have their own determination of the home node (which can be the same as or differ from the baseline home node)
We call this modified system CoCCA (Codesigned CCA)
CoCCA architecture scheme
Figure: the baseline tile extended with a CoCCA pattern table next to the cache coherence directory; each pattern-table entry holds an address pattern plus coherence info (state and bit fields), alongside the memory and network interfaces.
Chip area overhead: ~+3%?
Pattern definition and storage
Patterns are not stored the same way on the nodes that use them and on home nodes
The minimum implementation uses a 2D strided access shape:
a start address
a stride length
a pattern length
on the home node: a pattern size
A speculative access fetches cache lines (as the baseline protocol does), but the access pattern may need to be more fine-grained in its specification (overlaps)
Definition of triggers: a way of detecting the signature of a pattern to fetch
The simplest trigger is the first address of the pattern access
Triggers and pattern definition
Pattern matching principle (hardware). Pattern calculation (simple case):
Desc = fn(B_addr, s, δ)
B_addr: base address
s: size of the pattern
δ: interval (stride) between 2 consecutive accesses
E.g.: Pat(1, 4, 2)(@1) = {@2, @5, @8, @11}
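A small sketch of the pattern expansion above. The reading of the parameters is our assumption: δ is taken as the number of addresses skipped between two consecutive accesses (so the step is δ + 1), which is the interpretation that reproduces the slide's example Pat(1, 4, 2) fired at @1 giving {@2, @5, @8, @11}.

```c
#include <stddef.h>
#include <stdint.h>

/* Descriptor for the simple case Desc = fn(B_addr, s, delta).
 * Field names are ours, not the hardware's. */
struct pat_desc {
    uintptr_t baddr; /* base offset relative to the trigger address (assumption) */
    size_t    s;     /* number of accesses in the pattern */
    size_t    delta; /* addresses skipped between two consecutive accesses (assumption) */
};

/* Expand a pattern fired by trigger address `trig` into `out`
 * (out must hold at least d.s entries).  Returns the access count. */
static size_t pat_expand(struct pat_desc d, uintptr_t trig, uintptr_t *out)
{
    for (size_t i = 0; i < d.s; i++)
        out[i] = trig + d.baddr + i * (d.delta + 1);
    return d.s;
}
```

With d = {1, 4, 2} and trigger @1, this yields the addresses 2, 5, 8, 11, matching the slide's example.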
Base of the modified protocol
Figure: flow chart of a read request. The requester performs a directory lookup, then a pattern-table lookup; a pattern-table hit sends SPEC_RQ (speculative request), a miss sends RD_RQ (baseline read request). On the baseline home node, a directory lookup leads to a memory access and a RD_RQ_AK reply; on the hybrid (CoCCA) home node, a pattern lookup on hit answers for the whole pattern length, falling back to the directory lookup and memory access on miss.
Without pattern information or in case of pattern miss: the system acts as an ordinary baseline architecture
In case of pattern hit: the speculative protocol is fired
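The requester-side decision above reduces to a small dispatch, sketched here. The message names (RD_RQ, SPEC_RQ) come from the slide; the function and enum names are ours, and the real hardware performs the lookups, not a boolean parameter.

```c
#include <stdbool.h>

/* Which request, if any, the requester puts on the NoC. */
typedef enum { SENT_NOTHING, SENT_RD_RQ, SENT_SPEC_RQ } sent_t;

/* On an L2 hit the access is served locally; otherwise a pattern-table
 * hit fires the speculative protocol (SPEC_RQ) and a miss falls back to
 * the ordinary baseline read request (RD_RQ). */
static sent_t request(bool l2_hit, bool pattern_hit)
{
    if (l2_hit)
        return SENT_NOTHING;
    return pattern_hit ? SENT_SPEC_RQ : SENT_RD_RQ;
}
```

This is why CoCCA cannot behave worse than baseline: every path that does not hit the pattern table is exactly the baseline path.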
Hardware tables and special instructions
A C-language description of pattern storing tables:
unsigned long  capacity; /* sizeof(address) */
unsigned long  size;     /* address number  */
unsigned long *offset;   /* pattern offset  */
unsigned long *length;   /* pattern length  */
unsigned long *stride;   /* pattern stride  */
So it is possible to have a rough estimate of the size of an entry in the pattern table.
A few specialized instructions to manage pattern tables:
PatternNew: to create a pattern,
PatternAddOffset: to add an offset entry,
PatternAddLength: to add a length entry,
PatternAddStride: to add a stride entry,
PatternFree: to release the pattern after use.
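Gathering the five fields above into one struct makes the per-entry size estimate concrete: five word-sized fields, e.g. 5 × 8 = 40 bytes per entry header on an LP64 target, before the pointed-to arrays. This is a software model of the table, not its hardware layout.

```c
#include <stddef.h>

/* The pattern-table fields from the slide gathered into one entry. */
struct pattern_entry {
    unsigned long  capacity; /* sizeof(address) */
    unsigned long  size;     /* address number  */
    unsigned long *offset;   /* pattern offset  */
    unsigned long *length;   /* pattern length  */
    unsigned long *stride;   /* pattern stride  */
};

/* Rough per-entry storage cost: two words plus three pointers. */
static size_t entry_header_bytes(void)
{
    return sizeof(struct pattern_entry);
}
```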
A first benchmark program for early evaluation
The choice of a benchmark program for our speculative protocol:
be representative of a typical embedded application
stress the protocol proposal on several aspects
We chose a 2-step image cascading filtering:
the first filter's result is the source of the second filter
5x5 filter
applied on chunks of the image, one per core, with shared cache lines both in read mode and in write mode
the result of the second filter is written back on the source (write invalidation)
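A minimal single-core sketch of the benchmark's compute kernel. The cascade structure (first filter feeds the second, whose result overwrites the source) follows the slide; the mean kernel and edge clamping are our simplifications, and the per-core chunking is omitted.

```c
/* 5x5 box filter with clamped borders (kernel choice is ours). */
#define K 5

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

static void filter5x5(const unsigned char *src, unsigned char *dst, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int acc = 0;
            for (int dy = -K / 2; dy <= K / 2; dy++)
                for (int dx = -K / 2; dx <= K / 2; dx++)
                    acc += src[clampi(y + dy, 0, h - 1) * w + clampi(x + dx, 0, w - 1)];
            dst[y * w + x] = (unsigned char)(acc / (K * K));
        }
}

/* Cascade: src -> tmp -> src; the second write-back over the source is
 * what triggers the write invalidations discussed in the talk. */
static void cascade(unsigned char *src, unsigned char *tmp, int w, int h)
{
    filter5x5(src, tmp, w, h);
    filter5x5(tmp, src, w, h);
}
```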
Memory mapping of our benchmark program
Instrumentation choice: Pin/Pintools
Pin/Pintools:
Pin is an instrumentation framework for binaries based on a JIT technique to accelerate the instrumentation. It is an Intel project
Pin acts in association with a programmable instrumentation tool called a Pintool
Several Pintools are provided in the basic distribution of Pin
We used:
inscount: a Pintool which gives the number of executed instructions
pinatrace: a Pintool which traces and logs all the memory accesses (load/store operations)
See paper for details.
Data sharing and prefetch
Figure: read data sharing in conterminous rectangles (Rect. i, i+1, i+7, i+8): exclusive data (1 core only), data shared by 2 cores, data shared by 4 cores.
We can define three kinds of patterns on this benchmark:
Source image prefetch and setting of old Shared values (S) to Exclusive values (E) when the source image becomes the destination (2 patterns per core)
False concurrency of write accesses between two rectangles of the destination image. This happens because the frontiers are not aligned with L2 cache lines. The associated pattern is 6 vertical lines with 0 bytes in common
Shared read data (because convolution kernels read pixels in conterminous rectangles, see the figure). There are 6 vertical lines and 3 sets of two horizontal lines for these patterns
After simplification, only 6 patterns are required
Evaluation results

Condition                          MESI    CoCCA
Shared line invalidation          34560    17283
Exclusive line sharing (2 cores)  12768    12768
Exclusive line sharing (4 cores)   1344      772
Total throughput                  48672    30723

Reduction of 37% of coherence message throughput
Prefetch stands for 10% of cache accesses
This means that without prefetch the application runs 67% slower (20 cycles for an on-chip shared cache access and 80 cycles for external memory accesses)
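The 37% figure follows directly from the table's totals: (48672 - 30723) / 48672 ≈ 36.9%, rounded to 37%. A quick check:

```c
/* Percentage reduction in coherence-message throughput, rounded to the
 * nearest percent, from the totals in the table above. */
static int reduction_percent(long baseline, long cocca)
{
    return (int)(100.0 * (double)(baseline - cocca) / (double)baseline + 0.5);
}
```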
Contributions
Shared memory and coherence are important for the programmability of CMPs
State-of-the-art cache coherence mechanisms fall into worst-case behaviors for scenarios that seem simple: regular accesses to memory with patterns
We defined an extension of cores to store patterns
We extended the baseline protocol with a speculative protocol
For embedded systems: tables are part of the compilation process
Only a few pattern entries are necessary for each typical low-level filter
Patterns can significantly reduce coherence message throughput
Patterns allow for early and efficient cache preloading, which significantly accelerates applications
May provide a path to cache coherency in massive manycores
Future work and perspective
Extend the number of benchmark applications to draw more general conclusions
Apply our ideas in a NoC simulator to do cycle-accurate simulations
Include it in a full-scale simulator (e.g. SoCLib)
Extend our work toward an HPC-friendly architecture that would determine patterns dynamically at runtime
Thank you for your attention
Questions?
ALCHEMY workshop @ ICCS 2013 (Barcelona)
The International Conference on Computational Science (ICCS) can be a good place to talk with people using HPC architectures for their needs. Loïc Cudennec and I are organizing a workshop on the issues arising with future manycore systems (number of cores > 1000 and beyond):
ALCHEMY workshop: Architecture, Language, Compilation and Hardware support for Emerging ManYcore systems
Topics:
Advanced architecture support for massive parallelism management
Advanced architecture support for enhanced communication for manycores
Full paper submission: December 15th. Notification: Feb. 10