Page 1: SoC-2012-pres-2

Enhancing Cache Coherent Architectures with Access Patterns for Embedded Manycore Systems

Jussara Marandola, Stéphane Louise, Loïc Cudennec, David A. Bader

[email protected]

Oct 11-12, 2012, SoC 2012


Page 2: SoC-2012-pres-2


Background

Multicore and manycore systems: Architecture and its future:

The single-processor era is over

Multiprocessors are here and will remain

Down to embedded systems (e.g. my cellphone)

Manycore systems are on the verge of appearing (e.g. Tilera, but others are on the way)

The future is manycore, even in the embedded world

We have to prepare for this. Programmability?


Page 3: SoC-2012-pres-2


Programmability

What are the programming paradigms for manycores? How do we program them?

New paradigms (e.g. stream programming) are still young; we need to learn new ways of programming. Bad for legacy software (porting costs!)

MPI (OK for HPC applications, but parallelization is heavy work)

OpenMP and the like: “only” add some pragmas to parallelize an application (see the sketch below)

OpenMP relies on a shared memory model, so a shared memory behavior must be provided, if possible in hardware (because it is faster)
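A minimal illustration of the “only adding some pragmas” point (our own example, not taken from the presentation): one OpenMP directive parallelizes a sequential loop, relying on the shared memory model underneath.

/* One pragma is enough to distribute the loop iterations over the cores. */
void scale(float *a, int n, float k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= k;
}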


Page 7: SoC-2012-pres-2


Shared memory consistency for multicore/manycores
For manycore systems, memory consistency = cache coherence mechanisms:

Based on the four-state MESI protocol:
Modified: a single valid copy of the data exists in the system and was modified since its fetch from memory
Exclusive: the value is only in one core’s cache and wasn’t modified since it was accessed from memory
Shared: multiple copies of the value exist in the system, and only read operations were done
Invalid: the current copy that the core has must not be used and will be discarded

Use of Home Nodes to keep the state consistency (a sketch follows this list):
For a given address in memory, only one core of the system will keep the coherence state
The distribution of home nodes is done as a modulo on an address mask (round-robin, line size) to avoid hot spots
A processor mask tracks the cores that share the cache line
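A hedged sketch of what a baseline directory entry and the round-robin home-node computation could look like. Only the general scheme (a MESI state, a sharer bit vector, a per-cache-line modulo distribution) comes from the slides; the type names and the NUM_CORES / LINE_SIZE constants are assumptions.

#include <stdint.h>

enum mesi_state { MESI_MODIFIED, MESI_EXCLUSIVE, MESI_SHARED, MESI_INVALID };

/* One baseline directory entry: coherence state plus a bit vector of the
   cores that currently share the cache line. */
struct dir_entry {
    uint64_t        addr;     /* cache-line address               */
    enum mesi_state state;    /* M / E / S / I                    */
    uint64_t        sharers;  /* processor mask, one bit per core */
};

#define NUM_CORES 64          /* assumed number of cores          */
#define LINE_SIZE 64          /* assumed cache-line size in bytes */

/* Round-robin home-node selection: a modulo on the cache-line index, so
   consecutive lines land on consecutive cores and hot spots are avoided. */
static inline unsigned home_node(uint64_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_CORES);
}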

This baseline protocol is the basis of all state-of-the-art memory consistency systems


Page 9: SoC-2012-pres-2


Baseline implementation in a manycore system

[Figure: baseline tile with an L2 Cache, a Memory Interface, a Network Interface, and a Cache Coherence Directory whose entries hold an address plus coherence info. (state and vector bit fields)]


Page 10: SoC-2012-pres-2


Modification of a shared value by a given core

[Figure legend: processor with shared copy of the data]


Page 14: SoC-2012-pres-2


Issues of baseline protocol and memory access patterns

Sometimes a single write on a shared value triggers a lot of coherence traffic on the NoC

For regular but non-contiguous accesses, a lot of individual accesses are generated

Typical example: reading an image by column
But the accesses are simple and deterministic

In some areas the baseline protocol does not work as well as it could and lacks a bit of scalability

In the embedded world, lots of low-level data processing displays a regular behavior with respect to its memory accesses:

Convolutions on images

usual transformations (e.g. FFT, DCT)

vector operations

etc.

The idea: take advantage of these regular memory access patterns to reduce the coherence traffic and enable memory prefetch


Page 16: SoC-2012-pres-2


State of the Art: memory patterns and shared memory coherence

Use of memory patterns:

Intel: use of special instructions to perform regular accesses to memory, limited to a single core; Patent US 7,143,264 (2006)

IBM: special instructions used to detect and apply patterns, also limited to a single cache; Patent US 7,395,407 (2008)

Other platforms:

The STAR project aims to provide a scalable manycore with a coherent shared memory


Page 17: SoC-2012-pres-2


Cache Coherence Architecture with patterns

Our enhancement to Cache Coherence Architecture (CCA)

Relies on the baseline protocol (adds to it, does not replace it)

Updates it with special cases for pattern management

Adds storage to each core for pattern storage and detection

Patterns are a result of the compilation process

It cannot work worse than the baseline, because the baseline is still the default.
Modifications:

Core IP, with the pattern storage and matching

The speculative protocol is added to the baseline protocol

The patterns (and the speculative protocol) have their own determination of Home Node (which can be the same as or differ from the baseline Home Node)

We call this modified system CoCCA (Codesigned CCA)


Page 19: SoC-2012-pres-2


CoCCA architecture scheme

[Figure: CoCCA tile with an L2 cache, a Memory Interface, a Network Interface, the baseline Cache Coherence Directory (entries: address plus coherence info., state and vector bit fields), and an additional CoCCA Pattern Table (entries: address, pattern, coherence info., state and bit fields)]

Chip area overhead: ~+3%?


Page 20: SoC-2012-pres-2


Pattern definition and storage

Patterns are not stored the same way on use nodes and home nodes

The minimum implementation uses a 2D strided access shape:

a start address
a stride length
a pattern length
On the home node: a pattern size (a possible C structure is sketched below)

A speculative access fetches cache lines (as the baseline protocol does), but the access pattern may need to be more fine-grained in its specification (overlaps)
Definition of triggers: a way of detecting the signature of a pattern to fetch

the simplest trigger is the first address of the pattern access
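A possible C rendering of this minimal 2D strided descriptor (our own illustration; the field names and types are assumptions, only the listed fields come from the slide):

#include <stdint.h>

/* Minimal 2D strided access shape as stored on a use node; the home node
   additionally keeps the overall pattern size. */
struct cocca_shape {
    uint64_t start_addr;   /* start address of the pattern         */
    uint64_t stride;       /* stride length between two accesses   */
    uint64_t length;       /* pattern length (number of accesses)  */
    uint64_t size;         /* home node only: overall pattern size */
};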


Page 25: SoC-2012-pres-2


Triggers and pattern definition

Pattern matching principle (hw): pattern calculation (simple case):

Desc = fn(Baddr, s, δ)

Baddr: base address

s: size of the pattern

δ: interval (stride) between 2 consecutive accesses

E.g.: Pat(1, 4, 2)(@1) = {@2, @5, @8, @11}
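A minimal sketch of how such a descriptor could be expanded, assuming a convention that reproduces the example above: the first access is one address after the trigger (Baddr), and δ counts the addresses skipped between two consecutive accesses. Both conventions are our assumptions, not the slide's.

#include <stdio.h>

/* Expand Pat(Baddr, s, delta) fired at trigger address Baddr.
   With Pat(1, 4, 2) and trigger @1 this prints @2, @5, @8, @11. */
static void expand_pattern(unsigned long baddr, unsigned long s,
                           unsigned long delta)
{
    unsigned long addr = baddr + 1;          /* first access after trigger */
    for (unsigned long i = 0; i < s; i++) {
        printf("@%lu%s", addr, (i + 1 < s) ? ", " : "\n");
        addr += delta + 1;                   /* skip delta addresses       */
    }
}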


Page 26: SoC-2012-pres-2


Base of the modified protocol

[Flowchart: the Requester performs a DIR Lookup and a PT Lookup (hit/miss) and sends either RD_RQ or SPEC_RQ; the Baseline Home Node side shows L2 Cache Read, DIR Lookup and Memory Access steps answered with RD_RQ_AK; the Hybrid (CoCCA) Home Node side adds a Pattern Lookup and iterates over the pattern length before answering with RD_RQ_AK]

Without pattern information, or in case of a pattern miss: the system acts as an ordinary baseline architecture

In case of pattern hit: the speculative protocol is fired
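A hedged sketch of the requester-side choice described in the two bullets above; the message names come from the slide, the enum and helper function are our illustration.

typedef enum { SEND_RD_RQ, SEND_SPEC_RQ } cocca_request_t;

/* Pattern-table hit: fire the speculative protocol towards the hybrid
   (CoCCA) home node; otherwise behave exactly like the baseline. */
static cocca_request_t requester_decision(int pattern_table_hit)
{
    return pattern_table_hit ? SEND_SPEC_RQ : SEND_RD_RQ;
}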


Page 29: SoC-2012-pres-2


Hardware tables and special instructions

A C-language description of pattern storing tables (gathered here into one structure; the struct name is ours):

struct pattern_table {
    unsigned long  capacity;  /* sizeof(address) */
    unsigned long  size;      /* address number  */
    unsigned long *offset;    /* pattern offset  */
    unsigned long *length;    /* pattern length  */
    unsigned long *stride;    /* pattern stride  */
};

So it is possible to have a rough estimate of the size of an entry in the pattern table

A few specialized instructions to manage pattern tables (an illustrative usage sketch follows the list):

PatternNew: to create a pattern,

PatternAddOffset: to add an offset entry,

PatternAddLength: to add a length entry,

PatternAddStride: to add a stride entry,

PatternFree: to release the pattern after use.
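An illustrative sketch of how compiler-generated code might drive these instructions, treating them as intrinsics: the instruction names come from the slide, but the operand lists and the surrounding helper are assumptions.

/* Hypothetical intrinsic prototypes (signatures assumed, names from the slide). */
extern int  PatternNew(void);                                 /* returns a pattern id */
extern void PatternAddOffset(int pat, unsigned long offset);
extern void PatternAddLength(int pat, unsigned long length);
extern void PatternAddStride(int pat, unsigned long stride);
extern void PatternFree(int pat);

/* Register a simple strided (e.g. column-wise) access pattern around a kernel. */
void register_column_pattern(unsigned long base, unsigned long rows,
                             unsigned long row_bytes)
{
    int pat = PatternNew();
    PatternAddOffset(pat, base);       /* first address of the column   */
    PatternAddLength(pat, rows);       /* number of accesses            */
    PatternAddStride(pat, row_bytes);  /* distance between two accesses */
    /* ... run the kernel that walks the column ... */
    PatternFree(pat);
}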


Page 31: SoC-2012-pres-2


A first benchmark program for early evaluation

The choice of a benchmark program for our speculative protocol:

be representative of typical embedded applications

stress the protocol proposal on several aspects

We chose a two-step cascading image filter:

the first filter result is the source of the second filter

5x5 filter

applied on chunks of the image, for each core, with shared cache lines both in read mode and in write mode

the result of the second filter is written back over the source (write invalidation); a rough sketch of the per-core filtering step follows
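As a rough illustration of the per-core work (our own sketch, not the authors' code), a 5x5 filter applied to one rectangular chunk of the image could look as follows; the buffer layout, border handling and kernel values are assumptions. The cascade applies such a step twice, the output of the first pass feeding the second before the result is written back over the source.

/* Apply a 5x5 filter to the chunk [x0,x1) x [y0,y1) of a width x height
   image; halo pixels read from neighbouring chunks are the shared cache
   lines discussed on the next slides. */
static void filter5x5_chunk(const float *src, float *dst,
                            long width, long height, const float k[5][5],
                            long x0, long x1, long y0, long y1)
{
    for (long y = y0; y < y1; y++) {
        for (long x = x0; x < x1; x++) {
            float acc = 0.0f;
            for (long dy = -2; dy <= 2; dy++) {
                for (long dx = -2; dx <= 2; dx++) {
                    long sy = y + dy, sx = x + dx;
                    /* clamp to the image border (one possible convention) */
                    if (sy < 0) sy = 0; else if (sy >= height) sy = height - 1;
                    if (sx < 0) sx = 0; else if (sx >= width)  sx = width - 1;
                    acc += k[dy + 2][dx + 2] * src[sy * width + sx];
                }
            }
            dst[y * width + x] = acc;
        }
    }
}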


Page 32: SoC-2012-pres-2


Memory mapping of our benchmark program


Page 33: SoC-2012-pres-2


Instrumentation choice: Pin/Pintools

Pin/Pintools:

Pin is a binary instrumentation framework based on JIT techniques to accelerate the instrumentation. It is an Intel project

Pin acts in association with a programmable instrumentation tool called a Pintool

Several Pintools are provided in the basic distribution of Pin

We used:

inscount: a pintool which gives the number of executed instructions

pinatrace: a pintool which traces and logs all the memory accesses (load/store operations)

See paper for details.


Page 35: SoC-2012-pres-2


Data sharing and prefetch

[Figure: the image is split into rectangles Rect. i, Rect. i+1, ..., Rect. i+7, Rect. i+8; legend: exclusive data (1 core only), data shared by 2 cores, data shared by 4 cores]

Figure: Read data sharing in conterminous rectangles

We can define three kinds of patterns on this benchmark:

Source image prefetch and setting of old Shared values (S) to Exclusive values (E) when the source image becomes the destination (2 patterns per core)

False concurrency of write accesses between two rectangles of the destination image. This happens because the frontiers are not aligned with L2 cache lines. The associated pattern is 6 vertical lines with 0 bytes in common

Shared read data (because convolution kernels read pixels in conterminous rectangles, see the figure above). There are 6 vertical lines and 3 sets of two horizontal lines for these patterns

After simplification, only 6 patterns are required

Page 39: SoC-2012-pres-2


Evaluation results

Condition                          MESI    CoCCA
Shared line invalidation           34560   17283
Exclusive line sharing (2 cores)   12768   12768
Exclusive line sharing (4 cores)    1344     772
Total throughput                   48672   30723

reduction of 37% of the coherence message throughput

prefetch accounts for 10% of cache accesses

This means that without prefetch the application would run 67% slower (20 cycles for an on-chip shared cache access and 80 cycles for external memory accesses)


Page 43: SoC-2012-pres-2


Contributions

Shared memory and coherence are important for the programmability of CMPs

State-of-the-art cache coherence mechanisms fall into worst-case behaviors for scenarios that seem simple: regular accesses to memory with patterns

We defined an extension of cores to store patterns

We extended the baseline protocol with a speculative protocol

For embedded systems: tables are part of the compilation process

Only a few pattern entries are necessary for each typical low-level filter

Patterns can significantly reduce coherence message throughput

Patterns allow for early and efficient cache preloading, which significantly accelerates applications

May provide a path to cache coherency in massive manycores


Page 46: SoC-2012-pres-2


Future work and perspective

extend the number of benchmark applications to draw more general conclusions

apply our ideas in a NoC simulator to do cycle-accurate simulations

include it in a full-scale simulator (e.g. SoCLib)

extend our work toward an HPC-friendly architecture that would determine patterns dynamically at runtime


Page 49: SoC-2012-pres-2


Thank you for your attention

Questions?


Page 50: SoC-2012-pres-2


ALCHEMY workshop @ ICCS 2013 (Barcelona)

The International Conference on Computational Science (ICCS) can be a good place to talk with people using HPC architectures for their needs. Loïc Cudennec and I are organizing a workshop on the issues that arise with future manycore systems (number of cores > 1000 and beyond)

Architecture, Language, Compilation and Hardware support for Emerging ManYcore systems

ALCHEMY workshop

Topics:

Advanced architecture support for massive parallelism management

Advanced architecture support for enhanced communication for manycores

Full paper submission: December 15th. Notification: Feb. 10.
