Parallel gem5 Simulation of Many-Core Systems
with Software-Programmable Memories
Bryan Donyanavard*, Tiago Muck*,
Majid Shoushtari*, Nikil Dutt
Computer Science
*all listed authors are equal contributors
Target Platform
• A mesh-based many-core architecture with distributed, software-programmable data memories
2nd gem5 User Workshop, June 2015 (6/11/2015)
[Figure: 3x3 mesh of tiles (Tile1 to Tile9), each attached to a router (R), with main memory (MM) blocks. Tile components shown: CPU Core, NI, PMMU, Instruction Cache.]
Enabling Simulation of Target Platform
• Software Programmable On-chip Memories (SPMs)
  • SPM Architectural Components
  • SPM Programming API
  • Communication Infrastructure for Distributed SPMs
• Simulator Speedup via Parallelization
  • Event Queues
SPM Integration for Many-cores
Existing Memory Hierarchies
• Classic Memory Model
  • No explicit memory controller
  • No NoC
• Ruby
  • Protocol dependent
New Memory Hierarchy
• We add the following infrastructure to the Classic Memory Model:
  • SPM
  • Paged Memory Management Unit (PMMU)
  • Address Translation Table (ATT)
  • SPM Governor
[Figure: block diagram of the new memory hierarchy. Components shown: CPU 1, CPU 2, SPM 1, SPM 2, PMMU 1, PMMU 2, ATT1, ATT2, two DMAs, SPM Governor, Mesh NoC, BUS, Main Memory.]
New Memory Hierarchy
• SPM
  • Receives and responds to memory requests from the CPU (in place of the old cache)
  • Receives and responds to memory requests from the PMMU (remote requests over the NoC)
  • Forwards all main memory requests to the memory bus (mimicking a DMA)
New Memory Hierarchy
• PMMU
  • Receives and responds to memory requests from the SPM (local CPU)
  • Receives and responds to memory requests from the NoC (remote CPU)
  • Processes SPM allocate and free requests from the SPM Governor
New Memory Hierarchy
• ATT
  • Holds the mapping from a thread's virtual addresses to SPM physical addresses
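The ATT's role can be illustrated with a minimal C sketch. Everything in it is an assumption made for illustration: the entry layout, the names att_entry and att_lookup, and the linear search are hypothetical and are not taken from the gem5 implementation presented here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ATT entry: maps one contiguous virtual range to an SPM range. */
typedef struct {
    uint64_t vaddr_base;  /* first virtual address covered by this entry */
    uint64_t length;      /* size of the mapped range in bytes */
    uint32_t spm_id;      /* index of the SPM (tile) holding the data */
    uint64_t spm_offset;  /* physical offset of the range inside that SPM */
    bool     valid;       /* entry is in use */
} att_entry;

/*
 * Translate a virtual address. Returns true and fills the outputs on a hit.
 * What happens on a miss (here: the caller falls back to main memory) is an
 * assumption of this sketch.
 */
bool att_lookup(const att_entry *table, size_t n, uint64_t vaddr,
                uint32_t *spm_id, uint64_t *spm_offset)
{
    for (size_t i = 0; i < n; i++) {
        const att_entry *e = &table[i];
        if (e->valid && vaddr >= e->vaddr_base &&
            vaddr - e->vaddr_base < e->length) {
            *spm_id = e->spm_id;
            *spm_offset = e->spm_offset + (vaddr - e->vaddr_base);
            return true;
        }
    }
    return false;
}
```

In this sketch a hit on a remote spm_id would be the case the PMMU forwards over the NoC, while a hit on the local SPM is served directly.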
New Memory Hierarchy
• Governor
  • Maintains global state of all memory mapped to SPMs
  • Receives SPM alloc and free requests from all executing threads (via pseudo instructions)
  • Determines memory mapping based on gem5-user-defined policy and system state
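As one example of what a user-defined policy could look like, the C sketch below implements a simple "local first, then any SPM with room" policy over per-SPM free-space counters. The policy itself, NUM_SPMS, SPM_NONE, and all function names are hypothetical; the slides do not specify which policies the Governor ships with.

```c
#include <stdint.h>

#define NUM_SPMS 4      /* hypothetical number of tiles */
#define SPM_NONE (-1)   /* no SPM chosen: data stays in main memory */

/* Global state tracked by this sketch: free bytes in each SPM. */
typedef struct {
    uint64_t free_bytes[NUM_SPMS];
} governor_state;

/*
 * Local-first policy: try the requester's own SPM, then the first other
 * SPM with enough room. Returns the chosen SPM index or SPM_NONE.
 */
int governor_alloc(governor_state *g, int requester, uint64_t bytes)
{
    if (requester < 0 || requester >= NUM_SPMS)
        return SPM_NONE;
    if (g->free_bytes[requester] >= bytes) {
        g->free_bytes[requester] -= bytes;
        return requester;
    }
    for (int i = 0; i < NUM_SPMS; i++) {
        if (g->free_bytes[i] >= bytes) {
            g->free_bytes[i] -= bytes;
            return i;
        }
    }
    return SPM_NONE;
}

/* Return previously allocated bytes to the given SPM. */
void governor_free(governor_state *g, int spm, uint64_t bytes)
{
    if (spm >= 0 && spm < NUM_SPMS)
        g->free_bytes[spm] += bytes;
}
```

A different policy would only change the body of governor_alloc; the global bookkeeping stays the same.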
SPM Programming API
• Programmer’s Interface
  • SPM_ARRAY_ALLOC (BASE_PTR, LENGTH, DATA_TYPE)
  • SPM_ARRAY_FREE (BASE_PTR, LENGTH, DATA_TYPE)

int *arr1 = (int*) malloc (LENGTH*sizeof(int));
...
SPM_ARRAY_ALLOC (arr1, LENGTH, int);
for (i = 0; i<LENGTH; i++) {
    arr1 [i] = i;
}
SPM_ARRAY_FREE (arr1, LENGTH, int);
...
free(arr1);
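To try the usage pattern above outside the simulator, the two macros can be stubbed out. The definitions below are hypothetical host-side stand-ins that only count mapped bytes; the real macros issue pseudo instructions to the SPM Governor, and their actual definitions are not shown in these slides. LENGTH and fill_with_spm are likewise chosen here for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins: track mapped bytes instead of issuing pseudo instructions. */
static size_t spm_bytes_mapped = 0;

#define SPM_ARRAY_ALLOC(BASE_PTR, LENGTH_, DATA_TYPE) \
    do { (void)(BASE_PTR); spm_bytes_mapped += (size_t)(LENGTH_) * sizeof(DATA_TYPE); } while (0)
#define SPM_ARRAY_FREE(BASE_PTR, LENGTH_, DATA_TYPE) \
    do { (void)(BASE_PTR); spm_bytes_mapped -= (size_t)(LENGTH_) * sizeof(DATA_TYPE); } while (0)

#define LENGTH 1024  /* assumed array length for this example */

/* Fill an array while it is marked as SPM-mapped; returns the sum of its elements. */
long fill_with_spm(void)
{
    int *arr1 = (int *) malloc(LENGTH * sizeof(int));
    if (arr1 == NULL)
        return -1;

    SPM_ARRAY_ALLOC(arr1, LENGTH, int);   /* request SPM mapping for the array */
    long sum = 0;
    for (int i = 0; i < LENGTH; i++) {
        arr1[i] = i;
        sum += arr1[i];
    }
    SPM_ARRAY_FREE(arr1, LENGTH, int);    /* release the mapping before free() */

    free(arr1);
    return sum;
}
```

The ordering matters in the original example as well: the array is allocated with malloc first, mapped, used, unmapped, and only then freed.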
NoC Integration
• Integrated the simple-network mesh NoC from Ruby into the Classic Memory Model
• PMMU handles crossover from QueuedPorts to MessageBuffers, and acts as network node
[Figure: the PMMU as a network node between the SPM, the Governor, and the Network. Labels shown: QueuedPorts (Master/Slave) and the message buffers m_PMMU_responseToNetwork_ptr, m_PMMU_requestToNetwork_ptr, m_PMMU_responseFromSPM_ptr, m_PMMU_requestFromSPM_ptr, m_PMMU_requestToSPM_ptr, m_PMMU_responseToSPM_ptr.]
SPM Communication Protocol
• Protocol to enable SPM
  • SPMRequestMsg : NetworkMessage
    • SPMRequestType_READ
    • SPMRequestType_WRITE
    • SPMRequestType_ALLOC
    • SPMRequestType_DEALLOC
  • SPMResponseMsg : NetworkMessage
    • SPMResponseType_WRITE_ACK
    • SPMResponseType_DATA
    • SPMResponseType_GOV_ACK
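The message types listed above can be written down as plain C enums to make the request/response pairing explicit. This is only a sketch: the enumerator names come from the slide, but the struct fields and the pairing in spm_expected_response (READ answered by DATA, WRITE by WRITE_ACK, ALLOC and DEALLOC by GOV_ACK) are assumptions inferred from the names, not confirmed by the source.

```c
#include <stdint.h>

/* Request types named on the slide. */
typedef enum {
    SPMRequestType_READ,
    SPMRequestType_WRITE,
    SPMRequestType_ALLOC,
    SPMRequestType_DEALLOC
} SPMRequestType;

/* Response types named on the slide. */
typedef enum {
    SPMResponseType_WRITE_ACK,
    SPMResponseType_DATA,
    SPMResponseType_GOV_ACK
} SPMResponseType;

/* Hypothetical request payload; the field set is assumed for illustration. */
typedef struct {
    SPMRequestType type;
    uint32_t src_node;  /* sending network node */
    uint32_t dst_node;  /* receiving network node */
    uint64_t addr;      /* address the request refers to */
} SPMRequestMsg;

/* Assumed pairing of each request type with the response that answers it. */
SPMResponseType spm_expected_response(SPMRequestType t)
{
    switch (t) {
    case SPMRequestType_READ:
        return SPMResponseType_DATA;
    case SPMRequestType_WRITE:
        return SPMResponseType_WRITE_ACK;
    case SPMRequestType_ALLOC:
    case SPMRequestType_DEALLOC:
        break;
    }
    return SPMResponseType_GOV_ACK;
}
```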
Simulator Speedup
Simulation Speedup via Parallelization
• Built on top of the current multiple-queue infrastructure
• Main changes
  • Assigning a separate queue to each CPU and its “private” SimObjects (e.g. I$, D$) by setting the eventq_index param. All other SimObjects are associated with a global queue
  • Quantum-based synchronization replaced by event-based synchronization
Event Synchronization
• Event on the system queue executes after all CPU queues
• CPU queues block until shared event is handled
• Using synchronization events to create a barrier
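The three rules above can be modeled as a small piece of bookkeeping. The C sketch below is a single-threaded model of that ordering only, under the assumption that each CPU queue signals arrival at a barrier event; sync_point and all function names are hypothetical, and a real multi-threaded simulator would need proper thread synchronization primitives rather than plain counters.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for one synchronization point. */
typedef struct {
    int  num_cpu_queues;  /* CPU event queues taking part */
    int  arrived;         /* CPU queues that have reached the barrier event */
    bool system_done;     /* shared system-queue event has been handled */
} sync_point;

void sync_init(sync_point *s, int num_cpu_queues)
{
    s->num_cpu_queues = num_cpu_queues;
    s->arrived = 0;
    s->system_done = false;
}

/* A CPU queue reaches its barrier event for this tick. */
void sync_cpu_arrive(sync_point *s)
{
    if (s->arrived < s->num_cpu_queues)
        s->arrived++;
}

/* The system-queue event may run only after all CPU queues have arrived. */
bool sync_system_may_run(const sync_point *s)
{
    return s->arrived == s->num_cpu_queues;
}

/* Try to handle the shared event; returns true if it was handled. */
bool sync_system_run(sync_point *s)
{
    if (!sync_system_may_run(s))
        return false;
    s->system_done = true;
    return true;
}

/* CPU queues stay blocked until the shared event has been handled. */
bool sync_cpu_may_resume(const sync_point *s)
{
    return s->system_done;
}
```

With two CPU queues, the shared event cannot run after only one arrival, and neither CPU queue may resume until sync_system_run has succeeded.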
[Figure: CPU 0 and CPU 1, each with $I and $D, on their own event queues (CPU EQ 0, CPU EQ 1); toL2Bus on the System EQ. Events labeled Ev and BrEv are placed on the queues at Tick x-1, Tick x, and Tick x+1.]
Synchronization Layers for Race Conditions
Preliminary Simulator Speedups
• Microbenchmarks on cache-based classic memory
  • Cache-light: computationally intensive
    • Small number of accesses to L2/system queue
  • Cache-heavy: memory intensive with very high $ miss rate
    • Many accesses to L2/system queue
    • CPU queues block all the time
• Advantageous for non-coherent memory (e.g. SPM)
Feedback
• Comments and suggestions?
• Interested in participating?
• Contact us:
  • Bryan – [email protected]
  • Majid – [email protected]
  • Tiago – [email protected]
Thank you
Q&A
duttgroup.ics.uci.edu