STL on Limited Local Memory (LLM) Multi-core Processors
Di Lu, Master Thesis Defense
Feb 14, 2016
Committee Members: Aviral Shrivastava, Karamvir Chatha, Partha Dasgupta
CMLhttp://www.aviral.lab.asu.edu
Why Multi-core
- Adding more resources to a single core may increase the latency of a CPU cycle
- Thermal hazards arise as the CPU clock speed grows
- Adding one core is more energy-efficient than increasing frequency
- Alternative solution: add an additional core running at a lower frequency
Hardware Cache in Multi-Core: Memory Scaling
- Performance: the existing cache coherence protocols cannot scale to hundreds of cores
  - The Intel 48-core Single-chip Cloud Computer has non-coherent caches
- Power: cache consumes more than 40% of the power in a single-core architecture [Rajeshwari Banakar, CODES 2002]
  - Expected to consume even more power in multi-core, since cache coherency introduces data snooping
[Figure: StrongARM 1100 power breakdown: I Cache 25%, ARM9 25%, D Cache 19%, BIU 8%, D MMU 5%, I MMU 4%, Clocks 4%, SysCtl 3%, CP15 2%, PATag RAM 1%, Other 4%]
Limited Local Memory Architecture
- Cores have only a small local memory instead of a cache
  - Example: each Synergistic Processing Element (SPE) in the IBM Cell B.E. processor has only a 256 KB local memory
- Communication channel: on-chip network, DMA
  - Suited to large data transfers
  - Can overlap communication and computation
[Figure: LLM architecture: the Main PE (ALU plus cache) and several PEs (each an ALU plus a small local memory) connect through an on-chip network to the global memory]
Standard Template Library
- A generic programming library
- Components:
  - Containers: data structures that hold a collection of data objects, e.g. vector, stack, queue, tree
    - An Allocator is used inside each Container class for memory management
  - Algorithms: procedures applied to containers to process their data, e.g. sort, search
  - Iterators: a kind of smart pointer that gives algorithms a uniform interface for traversing different containers
Limitation of Existing STL on LLM
- The local memory in each core is limited, but the existing STL implementation assumes almost infinite memory

  main() {
      pthread_create(... spe_context_run(speID) ...);
  }
  (a) main PE code

  main() {
      vector<int> vec1;
      for (i = 0; i < N; i++)
          vec1.push_back(i);
  }
  (b) PE code

- Assuming LLM size = 256 KB, the PE code will crash when i == 8192
- Why it crashes: the local store must hold the code region, global and static data, stack data, and the vector data, and each reallocation keeps two copies of the vector data alive
  - Program code size: > 130 KB
  - Vector data size: 32 KB
  - Reallocation size: 64 KB
  - Fragmentation size: 32 KB - 4 B
- Maximum container data size before crashing (LLM size = 256 KB):

  Container | Approx. Code Size (Bytes) | Approx. Data Size (Bytes)
  Vector    | 138388                    | 32768
  Deque     | 139364                    | 102908
  Set       | 141284                    | 11464
  List      | 134924                    | 21976

- These maximum data limits strongly restrict the use of the container classes
Challenges – Minimize Changes
- Preserve syntax and semantics
  - How do we hide the architectural differences behind the same interface?
  - How do we ensure the different components still cooperate correctly on a different architecture? E.g. a local pointer to container data that now holds a global address must still work
  - How do we handle template types while modifying the code? E.g. we cannot tell whether a datum is a pointer or an iterator
Challenges – How to Manage Data
- Cache data and utilize DMA
  - Which data should be placed in the local memory?
  - When to DMA, and what to DMA?
- Dynamic memory allocation on global memory
  - The allocator on a PE can only allocate in local memory
  - A static buffer in global memory is a poor fit: memory usage is low when the container data is small, and the buffer overflows when the container data is large
  - Therefore, we need to design a dynamic allocation scheme
Challenges – External Hazard
- External pointer hazard: what if an external pointer points to a container element that has been moved to global memory?
- Two address spaces co-exist; no matter what scheme is implemented, the pointer issue exists
[Figure: (a) a struct field int* ptr points to a vector element in local memory; (b) after the vector element is moved to main memory, the pointer is left dangling]
Related Works – STL Extension
- STL has been extended to allow parallel execution
- Shared memory architecture: controls data access from the different execution PEs
- Distributed memory architecture: concurrency control, awareness of data locality, automatic data transfer
[Figure: in shared memory, threads on several execution PEs access container data in one global shared memory; in distributed memory, each execution PE works on data in its own local memory and fetches remote data on demand]
- In both cases, the global memory (shared memory) or the local memory (distributed memory) is assumed to be sufficiently large!
Problems in LLM
- Our problem is orthogonal: none of these works considers the small-local-memory problem or uses main memory for data storage
[Figure: in LLM, each execution PE has only a small local memory, and each thread's container data lives in a region of global memory allocated for it]
Related Works – Software Cache (1)
- A software cache (SC) can manage data between local memory and global memory
- But the software cache solution still has a data-size limitation: the maximum heap that can be allocated depends on the available local memory

  int* arr = (int*)0x100000;               // global-memory address
  int* temp_int = cache_access(arr + 5);   // translated through the SC

[Figure: the execution thread on the PE core accesses global memory (0x100000-0xFFFFFFFF) through a software cache held in local memory (0x00000-0x40000)]
Related Works – Software Cache (2)
- Existing software cache solutions do not consider C++ references
- Example: a C++ function may bind a reference to a variable in the software cache buffer; once that line is evicted, a subsequent write through the reference goes to stale local memory

  tree_node *ptr, *ptr2;
  void F() {
      tree_node*& d1 = cache_ref(ptr)->left;
      *(cache_ref(d2)) = val2;   // may evict the line holding d1
      /* some code executed */
      d1 = ptr2;                 // Error! writes through an evicted line
  }
Related Works – Works on LLM
- Manage the data in the different regions of local memory, leveraging the global memory for data storage and transferring data automatically per region:
  - Heap region: Bai et al., CODES 2010
  - Stack region: Bai et al., ASAP 2011
  - Code region: Jung et al., ASAP 2010
- The heap management only handles C program code
- The code and stack management does not manage memory for container data
Preserve Syntax and Semantics
- Hide the programming complexities (e.g. DMA) inside the Container functions, and overload the operators
- Use one common software cache for all STL components; apply the software cache to global pointers
[Figure: the Container, Iterator, and Algorithm components all reach global memory through a single shared software cache]
- Handle the template data types: separate the pointer implementation from the iterator implementation into different functions, via template specialization and function overloading, to avoid compile-time errors
Software Cache Structure
- Direct-mapped cache with a small FIFO list
  - Hash table: for fast lookup
  - FIFO list: increases the effective associativity of the whole cache
- Each hash-table entry holds: Addr, Valid, Pointer to Block

  Item       | 32 KB cache, 32 B blocks | 64 KB cache, 128 B blocks
  Hash table | 1024 * (4+12) = 16 KB    | 512 * (4+12) = 8 KB
  FIFO list  | 4 * (12+12) = 96 Bytes   | 4 * (12+12) = 96 Bytes
  Total      | 16480 Bytes              | 8288 Bytes
Software Cache Structure (Cont'd)
- Cache lookup & miss handling; cache line size: 16 B
[Figure: a lookup on address 0x01220 indexes the hash table in local memory; conflicting addresses such as 0x01200, 0x11200, and 0x21200 map to the same slot, with evicted lines held in the FIFO list and missing blocks fetched from global memory]
Solution to Global Memory Allocation
- Use a thread on the main PE
- Use both message passing and DMA
  - Message passing notifies the main PE thread
  - DMA transfers the extra parameters

  struct msgStruct {
      uint32_t request_size;
      uint32_t padding[3];
  };

- Protocol: (1) the PE transfers the parameters by DMA, (2) sends the operation type by message, (3) the main PE thread operates on the container data in global memory, (4) sends a restart signal, and (5) the PE gets the newly allocated address by DMA
Resolve External Pointers
- Step 1: Identify the initial potential pointers (only a few functions return a reference to an element)
- Step 2: Identify the other potential pointers (propagated through assignments such as b = a, c = b)
- Step 3: Perform the code transformation

  (a) Original Program
  1: main() {
  2:   vector<int> vec; ...
  3:   int* a = &vec.at(idx_exp); ...
  4:   int sum = 1 + *a; ...
  5:   int* b = a; ...
  6:   int* c = b; ...
  7:   sum = sum + *c;
  8: }

  (b) Transformed Program
  1: main() {
  2:   vector<int> vec; ...
  3:   int* a = ppu_addr(&vec.at(idx_exp)); ...
  4:   int sum = 1 + *(cache_access(a));
  5:   ...
  6:   int* b = a; ...
  7:   int* c = b; ...
  8:   sum = sum + *(cache_access(c)); ...
  9: }
Experiment Evaluation
- Hardware: PlayStation 3 with the IBM Cell BE
- Software:
  - Operating system: Linux Fedora 9, with IBM SDK 3.1
  - Cycle-accurate IBM SystemSim simulator for the Cell BE
  - Cachegrind (Valgrind)
- Benchmarks: applications that use STL containers
Programmability Improvement
- We measure the running time for inserting elements into an STL container
[Figure: four log-scale plots (vector, deque, list, set) of run time in ns vs. number of objects, comparing each original STL container against the corresponding new container]
Communication Overhead
- DMA overhead for data management; cache size: 32 KB, line size: 128 Bytes
[Figure: four plots of the change in cache misses vs. size of data for Heapsort, Dijkstra, MMints Compress, and MMints Wavelet, each with a software cache (Sw Cache) curve for comparison]
Instruction Overhead
- Static code size increase:

  Container | STL    | New STL | Increase
  Vector    | 138388 | 155036  | 12%
  Deque     | 139364 | 156132  | 12%
  Set       | 141284 | 166228  | 17.7%
  List      | 134924 | 151228  | 12%
- Runtime instruction increase for each benchmark, coming mainly from the software cache
[Figure: normalized instruction count per benchmark in two panels: intensive STL uses (y-axis up to 18) and normal STL uses (y-axis up to 1.8); benchmarks include olden_power, basicmath, list_merge, kruskal, crc32]
Scalability
- Run the same copy of the code on different numbers of cores
- The increased runtime comes from competition for the DMA channel
[Figure: run time in seconds vs. number of cores (1-6) for Heapsort, Dijkstra, Kruskal, Edmonds-Karp, Anagram, MMints Compress, and MMints Wavelet]
Conclusion
- The capacity of the container classes on the LLM architecture increases significantly
- The communication overhead and the runtime instruction overhead are reasonable
Publications
- Ke Bai, D. Lu, A. Shrivastava. "Vector Class on Limited Local Memory (LLM) Multi-core Processors." CASES '11: Proceedings of the 2011 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, October 9-14, 2011, Taipei, Taiwan.
- Di Lu, A. Shrivastava. "Enabling Standard Template Libraries (STL) on Limited Local Memory Multicore Architectures." ACM Transactions on Design Automation of Electronic Systems (TODAES). (submitted)
Thank you & Questions