Page 1: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Vector Class on Limited Local Memory (LLM) Multi-core Processors

Ke Bai, Di Lu and Aviral Shrivastava

Compiler Microarchitecture Lab, Arizona State University, USA

Page 2: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Summary

Cannot improve performance without improving power-efficiency.
Cores are becoming simpler in multicore architectures; caches are not scalable (in both power and performance).
Limited Local Memory multicore architectures: each core has a scratch pad (e.g., the Cell processor) and needs explicit DMAs to communicate with global memory.

Objective: how to enable the vector data structure (dynamic arrays) on the LLM cores?

Challenges:
1. Use the local store as a temporary buffer (e.g., a software cache) for vector data
2. Dynamic global memory management, and core request arbitration
3. How to use pointers when the data pointed to may have moved?

Experiments: any size of vector is supported; all SPUs may use the vector library simultaneously, and the approach is scalable.

Page 3: Vector Class on Limited Local Memory (LLM) Multi-core Processors


From multi- to many-core processors

IBM XCell 8i, GeForce 9800 GT, Tilera TILE64

Simpler design and verification; reuse the cores.

Can improve performance without much increase in power

Each core can run at a lower frequency.
Tackle thermal and reliability problems at core granularity.

Page 4: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Memory Scaling Challenge

[Figure: StrongARM 1100 power breakdown: D-cache 19%, I-cache 25%, D-MMU 5%, I-MMU 4%, ARM9 core 25%, PA Tag RAM 1%, CP15 2%, BIU 8%, SysCtl 3%, clocks 4%, other 4%]
[Figure: Intel 48-core chip]

In Chip Multi-Processors (CMPs), caches guarantee data coherency: they bring the required data from wherever it resides into the cache and make sure that the application gets the latest copy of the data.
Caches consume too much power: 44% of power, and greater than 34% of area (StrongARM 1100).
Cache coherency protocols do not scale well: the Intel 48-core Single-Chip Cloud Computer has non-coherent caches.

Page 5: Vector Class on Limited Local Memory (LLM) Multi-core Processors


[Figure: IBM Cell architecture: a PPE (Power Processor Element) and eight SPEs (Synergistic Processor Elements), each with an SPU and a Local Store (LS), connected by the Element Interconnect Bus (EIB) to off-chip global memory]

Limited Local Memory Architecture

Cores have small local memories (scratch pads).
A core can only access its local memory; accesses to global memory go through explicit DMAs in the program.
e.g., the IBM Cell architecture, which is in the Sony PS3.


Page 6: Vector Class on Limited Local Memory (LLM) Multi-core Processors


LLM Programming

Task-based programming, MPI-like communication.

Main Core:

    #include <libspe2.h>

    extern spe_program_handle_t hello_spu;

    int main(void) {
        int speid, status;
        speid = spe_create_thread(&hello_spu);
    }

Local Core (the same code runs on each SPE):

    #include <spu_mfcio.h>

    int main(speid, argp) {
        printf("Hello world!\n");
    }

Extremely power-efficient computation if all code and data fit into the local memory of the cores.
Otherwise, efficient data management is required!

Page 7: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Managing data

Original Code:

    int global;

    f1() {
        int a, b;
        global = a + b;
        f2();
    }

Local Memory Aware Code:

    int global;

    f1() {
        int a, b;
        DMA.fetch(global);
        global = a + b;
        DMA.writeback(global);
        DMA.fetch(f2);
        f2();
    }
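On the Cell SPU, the DMA.fetch / DMA.writeback pseudo-calls above correspond roughly to the MFC intrinsics in spu_mfcio.h. A minimal sketch, assuming a single DMA tag and 128-byte alignment chosen for illustration (the wrapper functions and variable names are not from the slides):

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define TAG 1                                                 /* illustrative DMA tag */

    /* local copy of the global variable; aligned for DMA */
    static int global __attribute__((aligned(128)));

    static void fetch_global(uint64_t global_ea) {
        mfc_get(&global, global_ea, sizeof(global), TAG, 0, 0);   /* global memory -> local store */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();                                /* wait for the DMA to finish */
    }

    static void writeback_global(uint64_t global_ea) {
        mfc_put(&global, global_ea, sizeof(global), TAG, 0, 0);   /* local store -> global memory */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }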

Page 8: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Vector Class Introduction

The vector class is a widely used library for programming!

It is one of the classes in the Standard Template Library (STL) for C++, implemented as a dynamic array: a sequential container whose elements are stored in contiguous storage locations.
Elements can be accessed using iterators or offsets on regular pointers to elements.

Compared to arrays: vectors can be easily resized, and capacity increases and decreases are handled automatically. They usually consume more memory than arrays when their capacity is handled automatically, in order to accommodate extra storage space for future growth.
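As an illustration of the automatic resizing (an example added here, not from the slides):

    #include <cstdio>
    #include <vector>

    int main() {
        int arr[4] = {0, 1, 2, 3};            // a plain array has a fixed capacity
        std::vector<int> vec(arr, arr + 4);   // a vector starts with the same elements
        vec.push_back(4);                     // ...but grows automatically when needed
        std::printf("size = %zu, capacity = %zu\n", vec.size(), vec.capacity());
        return 0;
    }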

Page 9: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Vector Class Management

SPE code:

    main() {
        vector<int> vec;
        for (int i = 0; i < N; i++)
            vec.push_back(i);
    }

The maximum N is 8192 (N0). 8192 ints is only 32 KB, far less than the 256 KB of local memory. Why does it crash so early?

All code and data need to be managed; this paper focuses on vector data management.
Vector management is difficult: the vector size is dynamic and can be unbounded.
The Cell programming manual suggests "Use dynamic data at your own risk", but restricting the usage of dynamic data is restrictive for programmers.

Page 10: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Outline of the Talk

Motivation

Related Works on Vector Data Management

Our Approach of Vector Data Management

Experiments

Page 11: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Related Works

[Figure: LLM architecture: each SPE's local memory communicates with global memory through DMA]

Prior libraries provide efficient parallel implementations: they abstract platform details, give programmers an interface to express the parallelism of their problems, and automatically translate data from one space to another.
Shared memory: MPTL [Baertschiger 2006], MCSTL [Singler 2007] and Intel TBB [Intel 2006].
Distributed memory: POOMA [Reynders 1996], AVTL [Sheffler 1995], STAPL [Buss 2010] and PSTL [Johnson 1998].
They ensure data coherency across different address spaces, and different threads can access a vector concurrently, no matter whether it is in one address space or in different spaces. But what if the size of the local memory is small?

Page 12: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Space Allocation and Reallocation

An unlimited vector requires evicting older vector data to global memory and reallocating more global memory!

[Figure: (a) when the vector uses up its allocated space in global memory, (b) we allocate a larger space at a new global address and move all the data]

push_back & insert add elements; the vector needs to be reallocated to a larger space when there is no unused space left.
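A minimal sketch of the reallocate-and-move step in ordinary heap terms; the doubling policy and the helper name grow are assumptions for illustration, and on the LLM cores the allocation has to happen in global memory through the main core, as the next slide describes:

    #include <cstddef>

    // Reallocate a larger space and move all existing data into it.
    static void grow(int*& data, std::size_t& capacity, std::size_t size) {
        std::size_t new_capacity = capacity ? 2 * capacity : 4;  // assumed doubling policy
        int* new_data = new int[new_capacity];                   // allocate the larger space
        for (std::size_t i = 0; i < size; ++i)
            new_data[i] = data[i];                               // move all the old elements
        delete[] data;                                           // release the old space
        data = new_data;
        capacity = new_capacity;
    }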

Page 13: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Space Allocation and Reallocation

Static buffer? A small vector gives low utilization; a large vector overflows.
An SPU thread can't use malloc() and free() on global memory.
Hybrid: DMA + mailbox. The SPE thread and the PPE thread cooperate as follows (see the sketch after this list):

    struct msgStruct {
        int vector_id;
        int request_size;
        int data_size;
        int new_gAddr;
    };

(1) The SPE thread transfers the parameters (msgStruct) to global memory by DMA.
(2) The SPE thread sends the operation type to the PPE through the mailbox.
(3) The PPE thread operates on the vector and updates new_gAddr in the data structure.
(4) The PPE thread sends a restart signal back through the mailbox.
(5) The SPE thread gets the new vector address by DMA.
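A minimal sketch of the SPE side of this handshake; the tag number, the op_type encoding and the request_realloc wrapper are illustrative assumptions rather than the paper's exact protocol, while mfc_put, mfc_get, spu_write_out_mbox and spu_read_in_mbox are the standard intrinsics from spu_mfcio.h:

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define TAG 2                                  /* illustrative DMA tag */

    struct msgStruct { int vector_id; int request_size; int data_size; int new_gAddr; };

    static struct msgStruct msg __attribute__((aligned(16)));

    /* Ask the PPE thread to (re)allocate global space for a vector. */
    static int request_realloc(uint64_t msg_ea, int vector_id,
                               int request_size, int data_size, uint32_t op_type) {
        msg.vector_id = vector_id;
        msg.request_size = request_size;
        msg.data_size = data_size;

        /* (1) transfer the parameters to global memory by DMA */
        mfc_put(&msg, msg_ea, sizeof(msg), TAG, 0, 0);
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();

        /* (2) send the operation type through the outbound mailbox;
           (3) the PPE thread now operates on the vector and updates new_gAddr */
        spu_write_out_mbox(op_type);

        /* (4) block until the PPE writes the restart signal to the inbound mailbox */
        (void)spu_read_in_mbox();

        /* (5) fetch the updated structure, which carries the new global address */
        mfc_get(&msg, msg_ea, sizeof(msg), TAG, 0, 0);
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
        return msg.new_gAddr;
    }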


Page 14: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Element Retrieving

133rd element: block index = 133 / 16 * 16 = 128 (block size is 16, using integer division).

[Figure: vector elements grouped into blocks of 16 (Block 0 holds the 0th to 15th elements, and so on); the 128th to 143rd elements lie in the block with block index 128]

Block index: the index of the 1st element in the block. Each block contains a block index besides the data, and the blocks are kept in a linked list.
Based on the block's global address, we can tell whether the block is in the local memory or not; if not, we fetch it.
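A minimal sketch of the lookup, assuming a small direct-mapped buffer of blocks in the local store; the buffer size, the direct-mapped placement and the fetch_block stub are illustrative assumptions (a real fetch would be a DMA from the block's global address):

    enum { BLOCK_SIZE = 16, NUM_BUFFERED_BLOCKS = 8 };   // illustrative sizes

    struct Block {
        int  index;               // index of the 1st element in the block
        bool valid;               // does this buffer slot hold a block?
        int  data[BLOCK_SIZE];    // element storage
    };

    static Block local_buffer[NUM_BUFFERED_BLOCKS];      // blocks cached in the local store

    // Stand-in for the DMA fetch of the block whose 1st element has index block_index.
    static Block* fetch_block(int block_index) {
        Block* b = &local_buffer[(block_index / BLOCK_SIZE) % NUM_BUFFERED_BLOCKS];
        /* in practice: mfc_get(b->data, global_address_of(block_index), sizeof(b->data), tag, 0, 0); */
        b->index = block_index;
        b->valid = true;
        return b;
    }

    int get_element(int i) {
        int block_index = (i / BLOCK_SIZE) * BLOCK_SIZE;            // e.g. 133 -> 128
        Block* b = &local_buffer[(block_index / BLOCK_SIZE) % NUM_BUFFERED_BLOCKS];
        if (!b->valid || b->index != block_index)                   // block not in local memory?
            b = fetch_block(block_index);                           // bring it in
        return b->data[i - block_index];
    }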

Page 15: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Vector Function Implementation

In order to keep the semantics, we implemented all the vector functions; only insert is shown here. The original insertion can take advantage of pointers to shift elements:

    for (......) (*b++) = (*a++);

[Figure: inserting a new element shifts the existing elements; under LLM, the elements live in global memory and must pass through the local memory]

But element shifting is now a challenging task under the LLM architecture, because we cannot use pointers in the local memory to access global memory, and DMA requires alignment.

Page 16: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Pointer Problem

In order to support limitless vector data, global memory must be leveraged.
Two address spaces co-exist; no matter what scheme is implemented, the pointer issue exists. The pointer problem needs to be solved!

[Figure: (a) a pointer field (int* ptr) in a struct in local memory points to a vector element; (b) after the vector element is moved to global memory, the local pointer no longer points to the data]

Page 17: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Pointer Resolution

(a) Original Program:

    main() {
        vector<int> vec;
        int* a = &vec.at(index);
        int sum = 1 + *a;
        int* b = a;
    }

(b) Transformed Program:

    main() {
        vector<int> vec;
        int* a = ppu_addr(vec, index);
        a = ptrChecker(a);
        int sum = 1 + *a;
        a = s2p(a);
        int* b = a;
    }

• ppu_addr: returns the global address pointing to the vector element.
• ptrChecker (see the sketch below):
  – checks whether the pointer points to vector data;
  – guarantees that the data pointed to is in the local memory;
  – returns the local address.
• s2p: transforms the local address back to a global address.
• The local address should not be used to identify the data.
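A minimal sketch of the ptrChecker logic, assuming the block buffering from the "Element Retrieving" slide; the helper names (is_vector_data, ensure_block_local, global_to_local) and the overall structure are illustrative assumptions, not the paper's implementation:

    // Hypothetical helpers over the block buffer in the local store.
    bool is_vector_data(int* gaddr);       // does gaddr fall inside any vector's global range?
    void ensure_block_local(int* gaddr);   // DMA the containing block in if it is absent
    int* global_to_local(int* gaddr);      // map the global address to its local copy

    int* ptrChecker(int* a) {
        if (!is_vector_data(a))            // not a pointer into vector data: leave it untouched
            return a;
        ensure_block_local(a);             // guarantee the pointed-to data is in local memory
        return global_to_local(a);         // return the local address for the dereference
    }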

Page 19: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Unlimited Vector Data

[Figure: run time in seconds (log scale, 0.001 to 100) versus the total number of integers, for our improved vector class and the original vector class; the original class only works up to N0 = 8192 integers]

[Figure: the space allocated by the original vector class doubles on every reallocation (4 B, 8 B, 16 B, ..., 2^(n+2) B, where B is bytes), and the successive allocations start at offsets s0, s0+4, s0+12, s0+28, ..., s0+2^(n+2)-4]

Why?
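Reading the offsets in the figure (a check added here, not text from the slides): if every reallocation doubles the space starting from 4 B, the block of size $2^{n+2}$ B begins at

$$ s_0 + \sum_{k=2}^{n+1} 2^k = s_0 + \left(2^{n+2} - 4\right), $$

which matches the last offset shown, so the successive allocations of the original vector class sweep a geometrically growing region of memory.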

Page 20: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Impact of Block Size

[Figure: run time in seconds (log scale, 1 to 100) versus block size (number of elements in one block: 4, 8, 16, 32, 64, 128, 256) for heap sort, radix sort, FFT, invfft, dijkstra, SOR and sparse matrix]

Page 21: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Impact of Buffer Space

[Figure: run time in seconds (0 to 30) versus buffer size (number of elements in one buffer: 512, 1024, 2048, 4096) for heap sort, radix sort, FFT, invfft, dijkstra, SOR and sparse matrix]

buffer_size = number_of_blocks × block_size

Page 22: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Impact of Associativity

[Figure: run time in seconds (0 to 35) on the benchmarks heap sort, radix sort, FFT, invfft, dijkstra, SOR and sparse matrix, comparing direct-mapped, 2-way, 4-way and 8-way associative configurations]

Higher associativity means more computation is spent looking up the data structure, but the miss ratio is lower.

Page 23: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Scalability

[Figure: run time in seconds (0 to 30) versus the number of cores (1 to 6) for heap sort, radix sort, FFT, invfft, dijkstra, SOR and sparse matrix]

Page 24: Vector Class on Limited Local Memory (LLM) Multi-core Processors


Summary

Cannot improve performance without improving power-efficiency.
Cores are becoming simpler in multicore architectures; caches are not scalable (in both power and performance).
Limited Local Memory multicore architectures: each core has a scratch pad (e.g., the Cell processor) and needs explicit DMAs to communicate with global memory.

Objective: how to enable the vector data structure (dynamic arrays) on the LLM cores?

Challenges:
1. Use the local store as a temporary buffer (e.g., a software cache) for vector data
2. Dynamic global memory management, and core request arbitration
3. How to use pointers when the data pointed to may have moved?

Experiments: any size of vector is supported; all SPUs may use the vector library simultaneously, and the approach is scalable.