Transcript
Page 1:

Department of Computer Science, University of the West Indies

Page 2:

Architecture Classification

The Flynn taxonomy (proposed in 1966!)

Functional taxonomy based on the notion of streams of information: data and instructions

Platforms are classified according to whether they have a single (S) or multiple (M) instruction stream and a single (S) or multiple (M) data stream.

Page 3:

Flynn’s Classification

Architecture Categories

SISD - single instruction, single data
SIMD - single instruction, multiple data
MISD - multiple instruction, single data
MIMD - multiple instruction, multiple data

Page 4:

SISD

Classic von Neumann machine

Basic components: CPU (control unit, ALU) and Main Memory (RAM)

Connected via Bus (aka von Neumann bottleneck)

Examples: standard desktop computer, laptop

Page 5:

SISD

[Figure: SISD organisation - a control unit (C) issues a single instruction stream (IS) to one processor (P), which exchanges a single data stream (DS) with memory (M).]

Page 6:

SIMD

Pure SIMD machine: a single CPU devoted exclusively to control, plus a collection of subordinate ALUs, each with a small amount of memory.

Instruction cycle: the CPU broadcasts an instruction and the ALUs either execute it or idle; progress is in lock step (effectively a global clock).

Key point: completely synchronous execution of statements

Vector and matrix computation lend themselves to an SIMD implementation

Examples of SIMD computers: Illiac IV, MPP, DAP, CM-2, and MasPar MP-2

Page 7:

SIMD

[Figure: SIMD organisation - a grid of processing elements (PEs), all driven by a single control processor.]

Page 8:

Data Parallel Systems

Programming model:
Operations are performed in parallel on each element of a data structure.
Logically there is a single thread of control, which performs sequential or parallel steps.
Conceptually, a processor is associated with each data element.

Architectural model:
An array of many simple, cheap processors, each with a little memory.
The processors do not sequence through instructions themselves; they are attached to a control processor that issues the instructions.
Specialized and general communication; cheap global synchronization.

Original motivations:
Matches simple differential equation solvers.
Centralizes the high cost of instruction fetch/sequencing.

Page 9:

Data Parallel Programming

In this approach, we must determine how large amounts of data can be split up. In other words, we need to identify small chunks of data which require similar processing.

These chunks of data are then assigned to different sites where they can be processed. The computations at each node may require some intermediate results from peer nodes.

The same executable could be running on each processing site, but each processing site would have a different dataset.

For data parallelism to work best, the volume of communicated values should be small compared with the volume of locally computed results.
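As an illustration of this decomposition (not from the original slides), the C sketch below splits an array into contiguous chunks and runs the same worker code on each chunk; shared-memory POSIX threads stand in for the processing sites, and the names worker and chunk_t are invented for the example.

/* Hypothetical illustration of data-parallel decomposition:
 * the same code runs in every worker, but on a different chunk of the data. */
#include <pthread.h>
#include <stdio.h>

#define N        1024
#define NWORKERS 4

static double data[N];

typedef struct { int lo, hi; } chunk_t;   /* half-open range [lo, hi) */

static void *worker(void *arg)
{
    chunk_t *c = (chunk_t *)arg;
    for (int i = c->lo; i < c->hi; i++)
        data[i] = data[i] * data[i];      /* the "similar processing" on each element */
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    chunk_t   chunk[NWORKERS];

    for (int i = 0; i < N; i++) data[i] = i;

    /* Split the data into equal contiguous chunks, one per worker. */
    for (int w = 0; w < NWORKERS; w++) {
        chunk[w].lo = w * (N / NWORKERS);
        chunk[w].hi = (w + 1) * (N / NWORKERS);
        pthread_create(&tid[w], NULL, worker, &chunk[w]);
    }
    for (int w = 0; w < NWORKERS; w++)
        pthread_join(tid[w], NULL);

    printf("data[10] = %f\n", data[10]);  /* 100.0 */
    return 0;
}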

Page 10:

Data Parallel Programming

Data Parallel decomposition can be implemented using a SPMD (single program multiple data) programming model.

One processing element is regarded as "first among equals":

This processor starts up the program and initialises the other processors. It then works as an equal to these processors.

Each PE is doing approximately the same calculation on different data.
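A minimal SPMD sketch, assuming MPI as the message-passing layer (MPI is not mentioned on the slide; it is simply a common way to run one program on many processing elements). Rank 0 plays the "first among equals" role: it initialises a parameter, broadcasts it, and then takes part in the same computation as every other rank.

/* Hypothetical SPMD sketch using MPI: every process runs this same program. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a = 0.0;
    if (rank == 0)
        a = 2.5;                      /* rank 0 initialises the shared parameter */
    MPI_Bcast(&a, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Every PE does approximately the same calculation on a different slice. */
    double local = 0.0;
    for (int i = rank; i < N; i += size)
        local += a * i;

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}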

Page 11:

Data Parallel Programming

Data-parallel architectures introduced the new programming-language concept of a distributed or parallel array. Typically the set of semantic operations allowed on a distributed array was somewhat different to the operations allowed on a sequential array.

Unfortunately, each data parallel language had features tied to a particular manufacturer's parallel computer architecture, e.g.

*LISP, C* and CM Fortran for Thinking Machines Corporation’s Connection Machine series of computers.

In the 1980s and 1990s microprocessors grew in power and availability, and fell in price. Building SIMD computers out of simple but specialized compute nodes gradually became less economical than putting a general purpose commodity microprocessor at every node. Eventually SIMD computers were displaced almost completely by Multiple Instruction Multiple Data (MIMD) parallel computer architectures.

Page 12:

Example - ILLIAC IV

ILLIAC IV was the first large system to employ semiconductor primary memory, built in 1974 at the University of Illinois.

The ILLIAC IV was a SIMD computer for array processing.

It consisted of: a control unit (CU) and 64 processing elements (PEs).

Each processing element had 2K (2,048) 64-bit words of memory associated with it, for 64 x 2K = 128K words in total. The CU could access all 128K words of memory through a bus, but each PE could only directly access its own local memory.

Page 13:

Example - ILLIAC IV

An 8 by 8 grid interconnect joined each PE to 4 neighbours.

The CU interpreted program instructions scattered across the memory, and broadcast them to the PEs.

Neither the PEs nor the CU were general-purpose computers in the modern sense; the CU had quite limited arithmetic capabilities.

Between 1975 and 1981 it was the world's fastest computer.

Page 14:

Example - ILLIAC IV

The ILLIAC IV had thirteen rotating fixed head disks which comprised part of the central system memory.

The ILLIAC IV was one of the first computers to use an all-semiconductor main memory.

Page 15:

Example - ILLIAC IV

Page 16:

Example - ILLIAC IV

Page 17:

Data Parallel Languages

CFD was a data parallel language developed in the early 70s at the Computational Fluid Dynamics Branch of Ames Research Center.

CFD was a "FORTRAN-like" language, rather than a FORTRAN dialect.

The language design was extremely pragmatic. No attempt was made to hide the hardware peculiarities from the user; in fact, every attempt was made to give programmers access and control of all of the ILLIAC hardware so they could construct an efficient program.

CFD had five basic datatypes: CU INTEGER, CU REAL, CU LOGICAL, PE REAL, and PE INTEGER.

Page 18:

Data Parallel Languages

The type of a variable statically encoded its home:

either on the control unit or on the processing elements.

Apart from restrictions on their home, the two INTEGER and REAL types behaved like the corresponding types in ordinary FORTRAN.

The CU LOGICAL type was more idiosyncratic:

it had 64 independent bits that acted as flags controlling activity of the PEs.

Page 19:

Data Parallel Languages

Scalars and arrays of the five types could be declared as in FORTRAN.

An ordinary variable or array of type CU REAL, for example, would be allocated in the (very small) control unit memory.

An ordinary variable or array of type PE REAL would be allocated somewhere in the collective memory of the processing elements (accessible by the control unit over the data bus) e.g.

CU REAL A, B(100)

PE INTEGER I

PE REAL D(25), E(1000)

The last data structure available in CFD was a new kind of array called a vector-aligned array.

Page 20:

Data Parallel Languages

Only the first dimension could be distributed, and the extent of that dimension had to be exactly 64.

A vector-aligned array would be of PE INTEGER or PE REAL type, and the syntax for the distributed dimension involved an asterisk:

PE INTEGER J(*)
PE REAL X(*,4), Y(*,2,8)

These are parallel arrays.

J(1) is stored on the first PE, J(2) is stored on the second PE, and so on.

Similarly, X(1,1), X(1,2), X(1,3), X(1,4) are stored on PE 1; X(2,1), X(2,2), X(2,3), X(2,4) are stored on PE 2; etc.
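A small sketch (not from the slides) of how this layout rule can be expressed in C; the helper locate and the 0-based local offsets are assumptions made purely for illustration.

/* Hypothetical sketch of the vector-aligned layout described above:
 * the first (distributed) dimension picks the PE, the rest are local. */
#include <stdio.h>

#define NPE 64   /* the ILLIAC IV had 64 processing elements */

/* For a CFD declaration like  PE REAL X(*,4):
 * X(i, j) lives on PE i, at local word (j - 1) of that PE's memory
 * (using CFD's 1-based indexing). */
static void locate(int i, int j, int *pe, int *local_offset)
{
    *pe = i;                 /* distributed dimension selects the PE */
    *local_offset = j - 1;   /* remaining dimensions index the PE's local memory */
}

int main(void)
{
    int pe, off;
    for (int j = 1; j <= 4; j++) {
        locate(2, j, &pe, &off);
        printf("X(2,%d) -> PE %d, local word %d\n", j, pe, off);
    }
    return 0;
}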

Page 21:

Data Parallel Languages

A vector expression was a vector-aligned array with a (*) subscript in the first dimension.

Communication between neighbouring PEs was captured by allowing the (*) to have some shift added, as in:

DIFP(*) = P(* + 1) - P(* - 1)

All shifts were cyclic (end-around) shifts, so this parallel statement is equivalent to the sequential statements:

DIFP(1)  = P(2) - P(64)
DIFP(2)  = P(3) - P(1)
...
DIFP(64) = P(1) - P(63)
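For comparison, a sequential C version of the same end-around difference (0-based indexing, 64 elements) might look like the sketch below; on the ILLIAC IV all 64 subtractions happen at once, one per PE.

/* Hypothetical sequential C equivalent of  DIFP(*) = P(*+1) - P(*-1)
 * with cyclic (end-around) shifts, using 0-based indexing. */
#define NPE 64

void cyclic_diff(const double p[NPE], double difp[NPE])
{
    for (int i = 0; i < NPE; i++) {
        int right = (i + 1) % NPE;        /* wraps 63 -> 0 */
        int left  = (i - 1 + NPE) % NPE;  /* wraps 0 -> 63 */
        difp[i] = p[right] - p[left];
    }
}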

Page 22:

Data Parallel Languages

Essential flexibility was added by allowing vector assignments to be executed conditionally with a vector test, e.g.

IF(A(*) .LT. 0) A(*) = -A(*)

Less structured methods of masking operations by explicitly assigning PE activity flags in CU LOGICAL variables were also available;

there were special primitives for restricting activity to simply-specified ranges of PEs.

PEs could concurrently access different addresses in their local memory by using vector subscripts: DIAG(*) = RHO(*, X(*))
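Sequential C sketches of these two constructs (my reading of the semantics, with 0-based indexing and an assumed extent M for RHO's second dimension): the masked assignment becomes an if inside a loop, and the vector subscript becomes a per-element indexed access (a gather).

/* Hypothetical sequential C equivalents of the two CFD constructs above. */
#define NPE 64
#define M   32   /* assumed extent of RHO's second dimension */

/* IF(A(*) .LT. 0) A(*) = -A(*)  --  a masked (conditional) vector operation */
void mask_negate(double a[NPE])
{
    for (int i = 0; i < NPE; i++)
        if (a[i] < 0.0)          /* on the ILLIAC, PEs failing the test simply idle */
            a[i] = -a[i];
}

/* DIAG(*) = RHO(*, X(*))  --  each PE uses its own subscript into local memory */
void gather_diag(double diag[NPE], const double rho[NPE][M], const int x[NPE])
{
    for (int i = 0; i < NPE; i++)
        diag[i] = rho[i][x[i]];  /* a per-PE indexed access (a "gather") */
}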

Page 23:

Connection Machine

(Tucker, IEEE Computer, Aug. 1988)

Page 24:

CM-5

Repackaged SparcStation, 4 per board.

Fat-tree network.

Control network for global synchronization.

Page 25:

Whither SIMD machines?

Trade off individual processor performance for collective performance: the CM-1 had 64K PEs, each 1-bit!

Problems with SIMD:
Inflexible - not all problems can use this style of parallelism.
Cannot leverage off microprocessor technology.
=> cannot be general-purpose architectures.

Special-purpose SIMD architectures are still viable (array processors, DSP chips).

Page 26:

Vector Processors

Definition: a processor that can do element-wise operations on entire vectors with a single instruction, called a vector instruction.

These are specified as operations on vector registers; a processor comes with some number of such registers.

A vector register holds ~32-64 elements. The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes (say 2-4).

The hardware performs a full vector operation in #elements-per-vector-register / #pipes steps (e.g., a 64-element register with 4 pipes takes 16 steps).

[Figure: a scalar add r3 = r1 + r2 alongside a vector add vr3 = vr1 + vr2; logically the vector add performs #elements additions in parallel, but the hardware actually performs #pipes additions at a time.]
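A plain-C sketch of that timing (the constants are assumptions): logically the whole 64-element add is one instruction, but the hardware steps through it NPIPES elements at a time, taking VLEN / NPIPES = 16 steps.

/* Hypothetical sketch of the lanes idea. */
#define VLEN   64   /* elements per vector register (assumed) */
#define NPIPES 4    /* parallel pipes/lanes (assumed)         */

void vector_add(const double vr1[VLEN], const double vr2[VLEN], double vr3[VLEN])
{
    for (int step = 0; step < VLEN / NPIPES; step++)   /* 16 steps in total */
        for (int lane = 0; lane < NPIPES; lane++) {    /* NPIPES adds per step */
            int i = step * NPIPES + lane;
            vr3[i] = vr1[i] + vr2[i];
        }
}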

Page 27:

Vector Processors

Advantages:
Quick fetch and decode of a single instruction for multiple operations.
The instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion.
The compiler does the work for you, of course.

Memory-to-memory vector processors:
No vector registers; can process very long vectors, but startup time is large.
Appeared in the 70s and died in the 80s.

Examples: Cray, Fujitsu, Hitachi, NEC

Page 28:

Vector Processors

What about:

for (j = 0; j < 100; j++)
    A[j] = B[j] * C[j];

Scalar code: load, operate, store for each iteration

Both instructions and data consume memory bandwidth

The solution: A vector instruction

Page 29:

Vector Processors

A[0:99] = B[0:99] * C[0:99]

Single instruction requires memory bandwidth for data only.

No control overhead for loops

Pitfalls: extensions to the instruction set, vector functional units, vector registers, and memory subsystem changes for vectors.
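Modern microprocessors expose short vector instructions through SIMD extensions such as x86 AVX (a different beast from the long-vector machines above, but the idea is the same). A sketch of the A = B * C statement using AVX intrinsics, assuming an AVX-capable compiler (-mavx); the function name vec_mul is invented for the example.

/* Hypothetical sketch using x86 AVX intrinsics:
 * one vector instruction multiplies 8 floats at a time. */
#include <immintrin.h>

#define N 100

void vec_mul(const float *B, const float *C, float *A)
{
    int j = 0;
    for (; j + 8 <= N; j += 8) {                        /* 12 vector iterations cover 96 elements */
        __m256 b = _mm256_loadu_ps(&B[j]);              /* load 8 elements of B */
        __m256 c = _mm256_loadu_ps(&C[j]);              /* load 8 elements of C */
        _mm256_storeu_ps(&A[j], _mm256_mul_ps(b, c));   /* A[j..j+7] = B * C */
    }
    for (; j < N; j++)                                  /* scalar cleanup for the last 4 elements */
        A[j] = B[j] * C[j];
}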

Page 30:

Vector Processors

Merits of vector processor

1. Very deep pipeline without data hazards: the computation of each result is independent of the computation of previous results.

2. Instruction bandwidth requirement is reduced: a vector instruction specifies a great deal of work.

3. Control hazards are nonexistent: a vector instruction represents an entire loop, so there is no loop branch.

Page 31:

Vector Processors (Cont’d)

The high latency of initiating a main memory access is amortized:
a single access is initiated for the entire vector rather than for a single word.

Known access pattern, so interleaved memory banks can be exploited.

A vector operation is faster than a sequence of scalar operations on the same number of data items!

Page 32:

Vector Programming Example

      LD    F0, a
      ADDI  R4, Rx, #512  ; last address to load
Loop: LD    F2, 0(Rx)     ; load X(i)
      MULTD F2, F0, F2    ; a x X(i)
      LD    F4, 0(Ry)     ; load Y(i)
      ADDD  F4, F2, F4    ; a x X(i) + Y(i)
      SD    F4, 0(Ry)     ; store into Y(i)
      ADDI  Rx, Rx, #8    ; increment index to X
      ADDI  Ry, Ry, #8    ; increment index to Y
      SUB   R20, R4, Rx   ; compute bound
      BNZ   R20, Loop     ; check if done

RISC machine: the loop repeats 64 times to compute Y = a * X + Y.
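For reference, the loop that this scalar code (and the vector version on the next page) implements is the classic DAXPY kernel; in C it is simply:

/* Y = a * X + Y over 64 double-precision elements */
void daxpy(double a, const double x[64], double y[64])
{
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}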

Page 33:

Vector Programming Example(Cont’d)

LD     F0, a       ; load scalar
LV     V1, Rx      ; load vector X
MULTSV V2, F0, V1  ; vector-scalar multiply
LV     V3, Ry      ; load vector Y
ADDV   V4, V2, V3  ; add
SV     Ry, V4      ; store the result

Vector machine: 6 instructions (low instruction bandwidth) to compute Y = a * X + Y.

Page 34:

A Vector-Register Architecture (DLXV)

[Figure: the DLXV vector-register architecture - main memory feeds a vector load-store unit; the vector registers and scalar registers are connected through crossbars to pipelined floating-point functional units such as FP add/subtract.]

Page 35:

Vector Machines

Machine         Registers   Elements per register   Load/store units   Functional units
CRAY-1          8           64                      1                  6
CRAY-2          8           64                      1                  5
CRAY X-MP       8           64                      2 Ld / 1 St        8
CRAY C-90       8           128                     4                  8
NEC SX/2        8 + 8192    256                     8                  16
NEC SX/4        8 + 8192    256                     8                  16
Fujitsu VP200   8 - 256     32 - 1024               2                  3
Hitachi S820    32          256                     4                  4
Convex C-1      8           128                     1                  4
CRAY Y-MP       8           64                      2 Ld / 1 St        8

Page 36:

MISD

Multiple instruction, single data

Doesn’t really exist, unless you consider pipelining an MISD configuration

Page 37:

MISD

[Figure: MISD organisation - several control units (C) each issue their own instruction stream (IS) to a processor (P), while all the processors operate on a single data stream (DS) from memory (M).]