
Page 1

CSE 160/Berman

Models of Parallel Computation

W+A: Appendix D

“LogP: Towards a Realistic Model of Parallel Computation”, PPOPP, May 1993

Alpern, B., L. Carter, and J. Ferrante, “Modeling Parallel Computers as Memory Hierarchies,” in Programming Models for Massively Parallel Computers, W. K. Giloi, S. Jahnichen, and B. D. Shriver, eds., IEEE Press, 1993.

Page 2

Computation Models

• Model provides underlying abstraction useful for analysis of costs, design of algorithms

• Serial computational models use RAM or TM as underlying models for algorithm design

Page 3

RAM [Random Access Machine]

• an unalterable program consisting of optionally labeled instructions

• memory is composed of a sequence of words, each capable of containing an arbitrary integer.

• an accumulator, referenced implicitly by most instructions.

• a read-only input tape
• a write-only output tape

Page 4

RAM Assumptions

• We assume:
– all instructions take the same time to execute
– word length is unbounded
– the RAM has arbitrary amounts of memory
– arbitrary memory locations can be accessed in the same amount of time

• RAM provides an ideal model of a serial computer for analyzing the efficiency of serial algorithms.

Page 5

PRAM [Parallel Random Access Machine]

• PRAM provides an ideal model of a parallel computer for analyzing the efficiency of parallel algorithms.

• PRAM is composed of:
– P unmodifiable programs, each composed of optionally labeled instructions
– a single shared memory composed of a sequence of words, each capable of containing an arbitrary integer
– P accumulators, one associated with each program
– a read-only input tape
– a write-only output tape

Page 6

More PRAM
• PRAM is a synchronous, MIMD, shared-memory parallel computer.
• Different protocols can be used for reading and writing shared memory:
– EREW (exclusive read, exclusive write)
– CREW (concurrent read, exclusive write)
– CRCW (concurrent read, concurrent write) -- requires an additional protocol for arbitrating write conflicts

• PRAM can emulate a message-passing machine by logically dividing shared memory into private memories for the P processors.

Page 7

Broadcasting on a PRAM

• “Broadcast” can be done on a CREW PRAM in O(1):
– Broadcaster sends value to shared memory
– Processors read from shared memory

Page 8

LogP machine model
• Model of a distributed memory multicomputer
• Developed by Culler, Karp, Patterson, et al.
• Authors tried to model prevailing parallel architectures (circa 1993).
• Machine model represents the prevalent MPP organization:
– machine constructed from at most a few thousand nodes
– each node contains a powerful processor
– each node contains substantial memory
– interconnection structure has limited bandwidth
– interconnection structure has significant latency

Page 9

LogP parameters

• L: upper bound on latency incurred by sending a message from a source to a destination

• o: overhead, defined as the time the processor is engaged in sending or receiving a message, during which time it cannot do anything else

• g: gap, defined as the minimum time between consecutive message transmissions or receptions

• P: number of processor/memory modules
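As a quick sketch (the class and helper below are illustrative, not from the paper), the four parameters can be collected in one record, and the cost of a single point-to-point message -- send overhead, latency, receive overhead, i.e. o + L + o -- falls out directly:

```python
from dataclasses import dataclass

@dataclass
class LogP:
    L: int  # upper bound on network latency (cycles)
    o: int  # send/receive overhead (cycles)
    g: int  # minimum gap between consecutive sends (cycles)
    P: int  # number of processor/memory modules

    def message_time(self) -> int:
        # One point-to-point message: sender overhead + latency + receiver overhead
        return self.o + self.L + self.o

m = LogP(L=6, o=2, g=4, P=8)
print(m.message_time())  # 10
```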

Page 10

LogP Assumptions
• network has finite capacity:
– at most ceiling(L/g) messages can be in transit from any one processor to any other at one time
• asynchronous communication:
– latency and order of messages is unpredictable
• all messages are small
• context-switching overhead is 0 (not modeled)
• multithreading (virtual processors) may be employed, but only up to a limit of L/g virtual processors
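The finite-capacity bound is a one-line computation; the helper name below is mine:

```python
import math

def capacity(L: int, g: int) -> int:
    # LogP's per-destination capacity bound: at most ceiling(L/g)
    # messages can be in transit from one processor to another.
    return math.ceil(L / g)

print(capacity(14, 4))  # 4
```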

Page 11

LogP notes
• All parameters are measured in processor cycles
• Local operations take one cycle
• Messages are assumed to be small
• LogP was particularly well suited to modeling the CM-5. It is not clear whether the same correlation is found with other machines.

Page 12

LogP Analysis of PRAM Broadcasting Algorithm

• Algorithm:
– Broadcaster sends value to shared memory (we’ll assume the value is in P0’s memory)
– P processors read from shared memory (the other processors receive messages from P0)
• Time for P0 to send P messages = o + g(P-1)
• Maximum time for other processors to receive messages = o + (P-2)g + o + L + o
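The two expressions above can be evaluated directly. A minimal Python sketch using the slide’s formulas verbatim (function names and the sample parameter values are illustrative, not from the slides):

```python
def p0_send_time(P: int, o: int, g: int) -> int:
    # Slide's cost for P0 issuing its sends: o + g(P-1)
    return o + g * (P - 1)

def last_receive_time(P: int, o: int, g: int, L: int) -> int:
    # Slide's worst-case receive completion: o + (P-2)g + o + L + o
    return 3 * o + (P - 2) * g + L

print(p0_send_time(8, 2, 4))         # 30
print(last_receive_time(7, 2, 4, 6)) # 32
```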

Page 13

Efficient Broadcasting in LogP Model

The gap includes the overhead time, so overhead < gap.

P0P1P2P3P4P5P6P7

time

og

og

og

oo

o

o o

L

LL

oLL

o og

oo

L

o

L

Page 14

Mapping induced by the LogP Broadcasting algorithm on 8 processors

[Figure: broadcast tree induced on 8 processors, with value-arrival times 0, 10, 14, 18, 20, 22, 24, 24, together with the corresponding o/g/L timeline for P0-P7.]

Page 15

Analysis of the LogP Broadcasting Algorithm to 7 Processors
• Time to receive one message from P0 for the first processor (P5) is L + 2o
• Time to receive the message for the last processor is max{3g+L+2o, 2g+L+2o, g+2L+4o, 4o+2L, g+4o+2L} = max{3g+L+2o, g+2L+4o}
• Compare to the LogP analysis of PRAM Broadcast, which is o + (P-2)g + o + L + o = 5g + 3o + L

[Figure: the same LogP broadcast timeline on P0-P7.]

Page 16

Scalable Performance

• LogP Broadcast uses a tree structure to optimize broadcast time

• The tree depends on the values of L, o, g, and P

• The strategy is much more scalable (and ultimately more efficient) than PRAM Broadcast
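One way to see why the tree wins is to simulate the greedy LogP broadcast: every processor that has the value keeps forwarding it, a sender may begin a new send every g cycles, and each message takes o + L + o end to end. The sketch below is my own (it assumes overhead < gap, as the slides state); with L=6, o=2, g=4 on 8 processors it reproduces the finish time of 24 cycles shown in the arrival-time figure:

```python
import heapq

def broadcast_finish_time(P: int, L: int, o: int, g: int) -> int:
    # Min-heap of times at which informed processors can next begin a send.
    ready = [0]          # P0 holds the value at time 0
    informed = 1
    finish = 0
    while informed < P:
        t = heapq.heappop(ready)
        arrival = t + o + L + o        # receiver has the value at this time
        finish = max(finish, arrival)
        informed += 1
        heapq.heappush(ready, t + g)   # sender may send again after the gap
        heapq.heappush(ready, arrival) # new receiver starts forwarding
    return finish

print(broadcast_finish_time(8, 6, 2, 4))  # 24
```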


Page 17

Moral

• An analysis can be no better than its underlying model. The more accurate the model, the more accurate the analysis.

• (This is why we use the TM to determine undecidability but the RAM to determine complexity.)

Page 18

Other Models used for Analysis

• BSP (Bulk Synchronous Parallel)
– Slight precursor to, and competitor of, LogP
• PMH (Parallel Memory Hierarchy)
– Focuses on memory costs

Page 19

BSP [Bulk Synchronous Parallel]
• BSP proposed by Valiant
• BSP model consists of:
– P processors, each with local memory
– Communication network for point-to-point message passing between processors
– Mechanism for synchronizing all or some of the processors at defined intervals

Page 20

BSP Programs
• BSP programs are composed of supersteps
• In each superstep, processors execute L computational steps using locally stored data, and send and receive messages
• Processors are synchronized at the end of the superstep (at which time all messages have been received)
• BSP programs can be implemented through mechanisms like the Oxford BSP library (C routines for implementing BSP programs) and BSP-L.

[Figure: alternating supersteps and synchronization barriers.]

Page 21

BSP Parameters
• P: number of processors (with memory)
• L: synchronization periodicity
• g: communication cost
• s: processor speed (measured in number of time steps/second)
• A processor sends at most h messages and receives at most h messages in a single superstep (such a communication pattern is called an h-relation)

[Figure: alternating supersteps and synchronization barriers.]

Page 22

BSP Notes
• Complete program = set of supersteps
• Communication startup is not modeled; g is for continuous traffic conditions
• Message size is one data word
• More than one process or thread can be executed by a processor
• Generally assumed that computation and communication are not overlapped
• Time for a superstep = (max number of local operations performed by any processor) + g*(max number of messages sent or received by any processor) + L
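The superstep cost formula is easy to encode; a tiny helper (names and sample values are mine, not from the slides):

```python
def superstep_time(w: int, h: int, g: int, L: int) -> int:
    # BSP superstep cost: max local work (w) + g * max messages
    # sent or received by any processor (h) + synchronization cost L.
    return w + g * h + L

print(superstep_time(100, 5, 4, 20))  # 140
```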

Page 23

BSP Analysis of PRAM Broadcast

• Algorithm:
– Broadcaster sends value to shared memory (we’ll assume the value is in P0’s memory)
– P processors read from shared memory (the other processors receive messages from P0)

• In the BSP model, processors are only allowed to send or receive at most h messages in a single superstep, so a broadcast to more than h processors would require a tree structure
– If there were more than Lh processors, then a tree broadcast would require more than one superstep

• How much time does it take for a P processor broadcast?

Page 24

BSP Analysis of PRAM Broadcast

• How much time does it take for a P processor broadcast?

[Figure: h-ary broadcast tree.]
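Under the h-message limit, each informed processor can forward the value to h others per superstep, so the number of informed processors grows by a factor of h+1 each superstep. The sketch below counts supersteps under that accounting (my own, not from the slides):

```python
def broadcast_supersteps(P: int, h: int) -> int:
    # Each superstep, every informed processor sends to h new
    # processors, multiplying the informed count by h + 1.
    steps = 0
    informed = 1
    while informed < P:
        informed *= (h + 1)
        steps += 1
    return steps

print(broadcast_supersteps(100, 9))  # 2
```

The total broadcast time would then be the number of supersteps times the per-superstep cost from the BSP Notes slide.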

Page 25

PMH [Parallel Memory Hierarchy] Model

• PMH seeks to represent memory. The goal is to model algorithms so that good decisions can be made about where to allocate data during execution.

• Model represents costs of interprocessor communication and memory hierarchy traffic (e.g. between main memory and disk, between registers and cache).

• Proposed by Carter, Ferrante, and Alpern

Page 26

PMH Model
• Computer is modeled as a tree of memory modules with the processors at the leaves.
• All data movement takes the form of block transfers between children and their parents.
• PMH is composed of a tree of modules:
– all modules hold data
– leaf modules also perform computation
– data in a module is partitioned into blocks
– each module has 4 parameters

Page 27

Un-parameterized PMH Models for a Cluster of Workstations
• Bandwidth from processor to disk > bandwidth from processor to network
• Bandwidth between 2 processors > bandwidth to disk

[Figure: two PMH trees for a cluster of workstations -- ALU/registers, caches, main memories, and disks joined through a network, and a variant with a shared disk system.]

Page 28

PMH Module Parameters

• Blocksize s_m tells how many bytes there are per block of m
• Blockcount n_m tells how many blocks fit in m
• Childcount c_m tells how many children m has
• Transfer time t_m tells how many cycles it takes to transfer a block between m and its parent
• Size of "node" and length of "edge" in the PMH graph should correspond to blocksize, blockcount, and transfer time
• Generally all modules at a given level of the tree will have the same parameters
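The four per-module parameters can be collected in a small record; this sketch (field names are mine, following the s_m, n_m, c_m, t_m notation) also derives a module's total capacity:

```python
from dataclasses import dataclass

@dataclass
class PMHModule:
    blocksize: int      # s_m: bytes per block of module m
    blockcount: int     # n_m: blocks that fit in m
    childcount: int     # c_m: number of children of m
    transfer_time: int  # t_m: cycles to move a block between m and its parent

    def capacity_bytes(self) -> int:
        # Total data the module can hold
        return self.blocksize * self.blockcount

cache = PMHModule(blocksize=64, blockcount=512, childcount=1, transfer_time=10)
print(cache.capacity_bytes())  # 32768
```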

Page 29

Summary

• Goal of parallel computation models is to provide a realistic representation of the costs of programming.

• The model gives algorithm designers and programmers a measure of algorithm complexity that helps them decide what is “good” (i.e., performance-efficient).

• Next up: Mapping and Scheduling