Heap Data Management for Limited Local Memory (LLM) Multicore Processors


Ke Bai, Aviral Shrivastava
Compiler Micro-architecture Lab

From multi- to many-core processors

• Simpler design and verification
  – Reuse the cores
• Can improve performance without much increase in power
  – Each core can run at a lower frequency
• Tackle thermal and reliability problems at core granularity

[Figure: example many-core chips: IBM XCell 8i, GeForce 9800 GT, Tilera TILE64]

04/19/2023

http://www.public.asu.edu/~ashriva6/cml

Memory Scaling Challenge

• In Chip Multi Processors (CMPs), caches provide the illusion of a large unified memory
  – Bring required data from wherever into the cache
  – Make sure that the application gets the latest copy of the data
• Caches consume too much power
  – 44% power, and greater than 34% area
• Cache coherency protocols do not scale well
  – Intel 48-core Single-chip Cloud Computer, and Intel 80-core processors have non-coherent caches

[Figure: Intel 80 core chip]
[Figure: Strong ARM 1100 power breakdown: arm9 25%, I Cache 25%, D Cache 19%, BIU 8%, D MMU 5%, I MMU 4%, Clocks 4%, Other 4%, SysCtl 3%, CP15 2%, PATagRAM 1%]


Limited Local Memory Architecture

• Cores have small local memories (scratch pad)
  – Core can only access local memory
  – Accesses to global memory through explicit DMAs in the program
• E.g. IBM Cell architecture, which is in Sony PS3.

[Figure: IBM Cell block diagram: a PPE and SPE 0 to SPE 7, each SPE with an SPU and an LS, connected through the Element Interconnect Bus (EIB) to off-chip global memory]

PPE: Power Processor Element
SPE: Synergistic Processor Element
LS: Local Store


LLM Programming

• Thread based programming, MPI like communication

Main Core

#include <libspe2.h>

extern spe_program_handle_t hello_spu;

int main(void) {
  int speid, status;
  /* launch the SPE program on a local core */
  speid = spe_create_thread(&hello_spu);
}

Local Core

#include <stdio.h>
#include <spu_mfcio.h>

int main(unsigned long long speid, unsigned long long argp) {
  printf("Hello world!\n");
}

[Figure: the main core creates threads on several local cores; each local core runs the same program]

• Extremely power-efficient computation
  – If all code and data fit into the local memory of the cores


What if thread data is too large?

[Figure: two threads with 32 KB memory each, and three cores with 24 KB memory each]

Two Options
1. Repartition and re-parallelize the application
  – Can be counter-intuitive and hard
2. Manage data to execute in limited memory of core
  – Easier and portable


Managing data

Original Code

int global;

f1() {
  int a, b;
  global = a + b;
  f2();
}

Local Memory Aware Code

int global;

f1() {
  int a, b;
  DMA.fetch(global)
  global = a + b;
  DMA.writeback(global)
  DMA.fetch(f2)
  f2();
}
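The fetch, compute, writeback pattern in the local-memory-aware code can be sketched in portable C. This is only an illustration: `dma_fetch`, `dma_writeback`, `global_mem_value` and `read_global` are hypothetical names, and `memcpy` stands in for the asynchronous DMA transfers that real hardware would issue.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a variable that lives in off-chip global memory. */
int global_mem_value = 0;

/* Hypothetical DMA helpers: memcpy replaces the real DMA engine here. */
void dma_fetch(void *local, const void *global, size_t n) {
    memcpy(local, global, n);       /* global memory -> local memory */
}

void dma_writeback(void *global, const void *local, size_t n) {
    memcpy(global, local, n);       /* local memory -> global memory */
}

/* Local-memory-aware f1(): the core only ever touches its local copy. */
void f1(int a, int b) {
    int local_copy;                 /* copy of `global` in local memory */
    dma_fetch(&local_copy, &global_mem_value, sizeof local_copy);
    local_copy = a + b;
    dma_writeback(&global_mem_value, &local_copy, sizeof local_copy);
}

/* Helper for inspecting the value held in (simulated) global memory. */
int read_global(void) {
    return global_mem_value;
}
```

After calling `f1(2, 3)`, `read_global()` returns the sum, because the result was written back from the local copy.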


Heap Data Management

• All code and data need to be managed
  – Stack, heap, code and global
• This paper focuses on heap data management
  – Heap data management is difficult
    • Heap size is dynamic, while the sizes of code and global data are statically known
    • Heap data size can be unbounded
  – Cell programming manual suggests “Use heap data at your own risk”.
• Restricting heap usage limits what programmers can do

main() {
  for (i=0; i<N; i++) {
    item[i] = malloc(sizeof(Item));
  }
  F1();
}

[Figure: local memory layout with code, global, stack and heap regions]


Outline of the talk

• Motivation

• Related works on heap data management

• Our Approach of Heap Data Management

• Experiments


Related Works

• Local memories in each core are similar to SPMs
• Extensive work has been proposed for SPMs
  – Stack: Udayakumaran2006, Dominguez2005, Kannan2009
  – Global: Avissar2002, Gao2005, Kandemir2002, Steinke2002
  – Code: Janapsatya2006, Egger2006, Angiolini2004, Pabalkar2008
  – Heap: Dominguez2005, McIlroy2008

[Figure: ARM memory architecture (ARM, SPM, global memory, DMA, direct access to global memory): SPM is for optimization]
[Figure: IBM Cell memory architecture (SPE, LLM, global memory, DMA): SPM is essential]


Our Approach

typedef struct {
  int id;
  float score;
} Student;

main() {
  for (i=0; i<N; i++) {
    student[i] = malloc(sizeof(Student));
  }
  for (i=0; i<N; i++) {
    student[i]->id = i;
  }
}

Heap Size = 32 bytes, sizeof(Student) = 16 bytes

[Figure: the local memory heap (pointer HP) holds malloc1 and malloc2; malloc3 no longer fits in 32 bytes, so an older object moves to the heap region in global memory (pointer GM_HP)]

• malloc()
  – Allocates space in local memory
• mymalloc()
  – May need to evict older heap objects to global memory
  – May need to allocate more global memory
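A minimal sketch of the eviction idea behind mymalloc(), assuming a circular local heap that overwrites its oldest bytes first. The object table, `evict_range`, the returned object ids and the 1024-byte global region are all assumptions made for this illustration, and `memcpy` stands in for the DMA transfer; the slides do not specify the actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_HEAP_SIZE  32     /* heap bytes in local memory (value from the slide) */
#define GLOBAL_HEAP_SIZE 1024   /* assumed size of the global backing region */
#define MAX_OBJECTS      64     /* assumed table capacity */

typedef struct {
    size_t global_off;  /* home location in global memory */
    size_t local_off;   /* location in the local heap while resident */
    size_t size;
    int    resident;    /* 1 while the object is in local memory */
} HeapEntry;

unsigned char local_heap[LOCAL_HEAP_SIZE];
unsigned char global_heap[GLOBAL_HEAP_SIZE];
HeapEntry table[MAX_OBJECTS];
int    num_objects = 0;
size_t hp = 0;          /* HP: next free byte in the local heap */
size_t gm_hp = 0;       /* GM_HP: next free byte in the global heap */
int    evictions = 0;

/* Copy out every resident object that overlaps [start, start + size). */
void evict_range(size_t start, size_t size) {
    for (int i = 0; i < num_objects; i++) {
        HeapEntry *e = &table[i];
        if (e->resident && e->local_off < start + size &&
            start < e->local_off + e->size) {
            /* memcpy stands in for a DMA write to global memory */
            memcpy(&global_heap[e->global_off], &local_heap[e->local_off], e->size);
            e->resident = 0;
            evictions++;
        }
    }
}

/* Returns an object id (index into table), or -1 if the request cannot be met. */
int mymalloc(size_t size) {
    if (size == 0 || size > LOCAL_HEAP_SIZE || num_objects == MAX_OBJECTS ||
        gm_hp + size > GLOBAL_HEAP_SIZE)
        return -1;
    if (hp + size > LOCAL_HEAP_SIZE)
        hp = 0;                     /* wrap around and reuse the oldest space */
    evict_range(hp, size);
    HeapEntry *e = &table[num_objects];
    e->global_off = gm_hp;
    e->local_off = hp;
    e->size = size;
    e->resident = 1;
    gm_hp += size;
    hp += size;
    return num_objects++;
}
```

With a 32-byte local heap and 16-byte objects, the first two calls fit in local memory and the third evicts the first object, matching the malloc1, malloc2, malloc3 scenario on the slide.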


How to evict data to global memory?

• Can use DMA to transfer heap object to global memory
  – DMA is very fast: no core-to-core communication
• But eventually, you can overwrite some other data
• Need OS mediation
• Thread communication between cores is slow!

[Figure: two options. Either malloc on the execution core asks malloc on the main core, which manages global memory, or malloc on the execution core writes to global memory directly with DMA]

Hybrid DMA + Communication

• Can use DMA to transfer heap object to global memory
  – DMA is very fast: no core-to-core communication
• But eventually, you can overwrite some other data
• Need OS mediation

malloc() {
  if (enough space in global memory)
    then write heap object using DMA
  else
    request more space in global memory
}

[Figure: the execution thread on the execution core sends the requested size S to the main core through mail-box based communication; the main core allocates ≥ S space in global memory and returns startAddr and endAddr; the execution core then does a DMA write from local memory to global memory]

• free() frees global space.
  – Communication is similar to malloc().
  – Send the global address to the thread on the main core
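The hybrid scheme can be sketched as follows: the execution thread writes directly into a global region it already owns and only contacts the main core when that region is exhausted. The 64-byte chunk size, the `Region` type, and the names `main_core_allocate` and `evict_to_global` are assumptions for illustration; a function call stands in for the mailbox round trip and `memcpy` for the DMA write. The sketch also abandons any unused tail of the old region, which a real implementation might handle differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GLOBAL_SIZE 4096
#define CHUNK_SIZE  64          /* assumed allocation unit of the main core */

typedef struct { size_t start, end; } Region;   /* [startAddr, endAddr) */

unsigned char global_mem[GLOBAL_SIZE];
size_t global_brk = 0;          /* managed only by the main core */
int mailbox_requests = 0;       /* slow core-to-core round trips */
int dma_writes = 0;             /* fast direct transfers */

/* Main-core side: allocate at least s bytes, rounded up to whole chunks.
 * Returns an empty region when global memory is exhausted. */
Region main_core_allocate(size_t s) {
    size_t bytes = ((s + CHUNK_SIZE - 1) / CHUNK_SIZE) * CHUNK_SIZE;
    Region r = { 0, 0 };
    if (global_brk + bytes <= GLOBAL_SIZE) {
        r.start = global_brk;
        r.end = global_brk + bytes;
        global_brk = r.end;
    }
    return r;
}

/* Execution-thread state: the region it currently owns in global memory. */
Region my_region = { 0, 0 };
size_t my_next = 0;

/* Evict s bytes to global memory; returns the global address or (size_t)-1. */
size_t evict_to_global(const void *obj, size_t s) {
    if (my_next + s > my_region.end) {
        mailbox_requests++;                 /* stands in for mailbox messages */
        Region r = main_core_allocate(s);
        if (r.end - r.start < s)
            return (size_t)-1;
        my_region = r;
        my_next = r.start;
    }
    size_t addr = my_next;
    memcpy(&global_mem[addr], obj, s);      /* stands in for a DMA write */
    dma_writes++;
    my_next += s;
    return addr;
}
```

Under these assumptions, evicting eight 16-byte objects needs eight DMA writes but only two requests to the main core, which is the point of combining the two mechanisms.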


Address Translation Functions

• Mapping from SPU address to global address is one to many.
  – Cannot easily find global address from SPU address
• All heap accesses must happen through global addresses

main() {
  for (i=0; i<N; i++) {
    student[i] = malloc(sizeof(Student));
  }
  for (i=0; i<N; i++) {
    student[i] = p2s(student[i]);
    student[i]->id = i;
    student[i] = s2p(student[i]);
  }
}

Heap Size = 32 bytes, sizeof(Student) = 16 bytes

[Figure: local memory heap (HP) holding malloc1 and malloc2, with malloc3 placed in the global memory heap (GM_HP)]

• p2s() will translate the global address to SPU address
  – Make sure the heap object is in the local memory
• s2p() will translate the SPU address to global address

More details in the paper
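One way to picture p2s() and s2p() is a small table that records which global object currently occupies each local slot. The sketch below assumes fixed 16-byte slots, round-robin replacement, and global addresses expressed as byte offsets into a simulated global array; `slot_owner`, `next_victim` and `fetches` are hypothetical names, and the real translation scheme is described in the paper rather than in these slides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE      16               /* one heap object per slot (assumed) */
#define NUM_SLOTS      2                /* 32-byte local heap, as on the slide */
#define GLOBAL_OBJECTS 8
#define NO_OBJECT      ((size_t)-1)

unsigned char g_mem[GLOBAL_OBJECTS * SLOT_SIZE];    /* simulated global memory */
unsigned char l_mem[NUM_SLOTS * SLOT_SIZE];         /* simulated local heap */
size_t slot_owner[NUM_SLOTS] = { NO_OBJECT, NO_OBJECT };
int next_victim = 0;
int fetches = 0;

/* Global address -> local address; fetches the object if it is not resident. */
void *p2s(size_t gaddr) {
    assert(gaddr % SLOT_SIZE == 0 && gaddr < sizeof g_mem);
    for (int s = 0; s < NUM_SLOTS; s++)
        if (slot_owner[s] == gaddr)
            return &l_mem[s * SLOT_SIZE];           /* already in local memory */
    int s = next_victim;                            /* round-robin replacement */
    next_victim = (next_victim + 1) % NUM_SLOTS;
    if (slot_owner[s] != NO_OBJECT)                 /* write the old occupant back */
        memcpy(&g_mem[slot_owner[s]], &l_mem[s * SLOT_SIZE], SLOT_SIZE);
    memcpy(&l_mem[s * SLOT_SIZE], &g_mem[gaddr], SLOT_SIZE);
    slot_owner[s] = gaddr;
    fetches++;
    return &l_mem[s * SLOT_SIZE];
}

/* Local address -> global address of the object currently in that slot. */
size_t s2p(const void *spu_addr) {
    size_t off = (size_t)((const unsigned char *)spu_addr - l_mem);
    assert(off < sizeof l_mem);
    return slot_owner[off / SLOT_SIZE];
}
```

Because a slot is reused by many objects over time, the local address alone cannot identify an object; `slot_owner` is what lets s2p() recover the global address, which is why the program keeps global addresses between accesses.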

Heap Management API

malloc()
• Allocates space in local memory and global memory and returns the global address
free()
• Frees space in the global memory
p2s()
• Ensures the heap variable exists in the local memory so that it can be accessed through its spuAddr
s2p()
• Translates the spuAddr back to ppuAddr

• Original Code

main() {
  for (i=0; i<N; i++) {
    student[i] = malloc(sizeof(Student));
    student[i]->id = i;
  }
}

• Code with Heap Management

main() {
  for (i=0; i<N; i++) {
    student[i] = malloc(sizeof(Student));
    student[i] = p2s(student[i]);
    student[i]->id = i;
    student[i] = s2p(student[i]);
  }
}

Our approach provides an illusion of unlimited space in the local memory!


Experimental Setup

• Sony PlayStation 3 running Fedora Core 9 Linux

• MiBench Benchmark Suite and other possible applications

http://www.public.asu.edu/~kbai3/publications.html

• Runtimes are measured with spu_decrementer() on the SPE and _mftb() on the PPE, both provided with IBM Cell SDK 3.1


Unrestricted Heap Size

[Figure: runtime (us, log scale) against number of nodes in rbTree, including a no-management baseline. Annotations: "N>6800 Program crashes!!!" and "Runtimes are comparable"]


Larger Heap Space, Lower Runtime

[Figure: log of runtime (us) against heap size (bytes, 4 to 16384) for DFS, dijkstra, fft, fft_inverse, MST, rbTree and stringsearch]


Runtime decreases with Granularity

• Granularity: # of heap objects combined as a transfer unit

[Figure: log of runtime (us) against granularity (1 to 256) for DFS, dijkstra, fft, invfft, MST and rbTree]
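To see why a larger transfer unit can lower runtime, the sketch below counts transfers for a purely sequential walk over heap objects. It assumes that the local buffer holds exactly one transfer unit and that objects are visited in allocation order; `count_transfers` is a hypothetical helper, and real access patterns and DMA costs will differ.

```c
#include <assert.h>
#include <stddef.h>

/* Count transfers for a sequential walk over n_objects heap objects when
 * `granularity` consecutive objects are moved as one transfer unit. */
size_t count_transfers(size_t n_objects, size_t granularity) {
    assert(granularity > 0);
    size_t transfers = 0;
    size_t resident_unit = (size_t)-1;      /* no unit loaded yet */
    for (size_t i = 0; i < n_objects; i++) {
        size_t unit = i / granularity;      /* unit that contains object i */
        if (unit != resident_unit) {        /* miss: bring in the whole unit */
            resident_unit = unit;
            transfers++;
        }
    }
    return transfers;
}
```

Under these assumptions the number of transfers falls roughly in proportion to the granularity, since each transfer brings in several objects that are about to be used.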


Embedded Systems Optimization

• If the maximum heap space needed is known
  – No thread communication is needed.
  – DMAs are sufficient
• Average 14% improvement

[Figure: normalized optimization runtime (static/dynamic), from 0 to 1, for Dijkstra, fft, fft_inv, String_search, DFS, MST, rbTree and the average]


Scalability of Heap Management

[Figure: log of runtime (us) against number of cores (1 to 6) for DFS, dijkstra, fft, fft_inverse, MST and rbTree]


Summary

• Moving from multi-core to many-core systems
• Scaling the memory architecture is a major challenge
• Limited Local Memory architectures are promising
• Code and data should be managed if they cannot fit in the limited local memory
• We propose a heap data management scheme
  – Manages any size of heap data in a constant space in local memory
  – It is automatable, which can increase programmer productivity
  – It scales with the number of cores
  – Overhead ~ 4-20%
• Comparison with software cache. The software cache:
  – Does not support pointers
  – Needs one SW cache per data type
  – Cannot be optimized any further

