SIMD Computers

ECE/CS 757 Spring 2007
J. E. Smith

Copyright (C) 2007 by James E. Smith (unless noted otherwise). All rights reserved. Except for use in ECE/CS 757, no part of these notes may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission from the author.


04/07 ECE/CS 757; copyright J. E. Smith, 2007 2

Outline

Automatic Parallelization
Vector Architectures
• Cray-1 case study
Data Parallel Programming
• CM-2 case study
CUDA Overview (separate slides)
Readings
• W. Daniel Hillis and Guy L. Steele, Data Parallel Algorithms, Communications of the ACM, December 1986, pp. 1170-1183.
• S. Ryoo, et al., Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA, Proceedings of PPoPP, Feb. 2008.


Automatic Parallelization

Start with a sequential programming model
Let the compiler attempt to find parallelism
• It can be done…
• We will look at one of the success stories
Commonly used for SIMD computing – vectorization
• Useful for MIMD systems, also – concurrentization
Often done with FORTRAN
• But some success can be achieved with C
  (compiler address disambiguation is more difficult with C)


Automatic Parallelization

Consider operations on arrays of data

  do I=1,N
    A(I,J) = B(I,J) + C(I,J)
  end do

• Operations along one dimension involve vectors
Loop-level parallelism
• Do all – all loop iterations are independent: completely parallel
• Do across – some dependence across loop iterations: partly parallel

  A(I,J) = A(I-1,J) + C(I,J) * B(I,J)


Data Dependence

Independence permits parallelism; OR: dependence inhibits parallelism

  S1: A = B + C
  S2: D = A + 2
  S3: A = E + F

True dependence (RAW):   S1 -> S2
Antidependence (WAR):    S2 -> S3
Output dependence (WAW): S1 -> S3
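The three dependence kinds above can be computed mechanically from each statement's read and write sets. The following Python sketch (the slides use Fortran; this helper and its data layout are hypothetical) classifies dependences for the slide's three statements:

```python
# Hypothetical sketch: classify dependences between straight-line statements
# by intersecting read/write sets; statement names match the slide's example.

def dependences(stmts):
    """stmts: list of (label, writes, reads) in program order.
    Returns (kind, earlier, later) tuples for RAW, WAR, WAW dependences."""
    deps = []
    for i, (si, wi, ri) in enumerate(stmts):
        for sj, wj, rj in stmts[i + 1:]:
            if wi & rj:
                deps.append(("RAW", si, sj))   # true dependence
            if ri & wj:
                deps.append(("WAR", si, sj))   # antidependence
            if wi & wj:
                deps.append(("WAW", si, sj))   # output dependence
    return deps

# S1: A=B+C   S2: D=A+2   S3: A=E+F
stmts = [("S1", {"A"}, {"B", "C"}),
         ("S2", {"D"}, {"A"}),
         ("S3", {"A"}, {"E", "F"})]
print(dependences(stmts))   # RAW S1->S2, WAW S1->S3, WAR S2->S3
```

The output reproduces exactly the three dependences listed on the slide.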


Data Dependence Applied to Loops

Similar relationships for loops
• But consider iterations

  do I=1,2
    S1: A(I) = B(I) + C(I)
    S2: D(I) = A(I)
  end do

S1 -> S2 (direction =)
• Dependence involving A, but on the same loop iteration


Data Dependence Applied to Loops

S1 -> S2 (direction <)

  do I=1,2
    S1: A(I) = B(I) + C(I)
    S2: D(I) = A(I-1)
  end do

• Dependence involving A, but the read occurs on the next loop iteration
• Loop-carried dependence

S2 -> S1 (antidependence, direction <)

  do I=1,2
    S1: A(I) = B(I) + C(I)
    S2: D(I) = A(I+1)
  end do

• Antidependence involving A; the write occurs on the next loop iteration


Loop Carried Dependence: Definition

  do I = 1, N
    S1: X(f(i)) = F(...)
    S2: A = X(g(i)) ...
  end do

S1 -> S2 is loop-carried if there exist i1, i2 where
  1 <= i1 < i2 <= N and f(i1) = g(i2)

If f and g can be arbitrary functions, the problem is essentially unsolvable.

However, if (for example)
  f(i) = c*i + j and g(i) = d*i + k
there are methods for detecting dependence.
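For small N and known subscript functions, the definition above can be checked directly by brute force. A minimal Python sketch (a hypothetical helper, not from the slides) that records which iteration first writes each element and looks for a later read of it:

```python
# Minimal sketch: brute-force test of the loop-carried dependence definition,
# i.e. whether some i1 < i2 in [1, N] satisfies f(i1) == g(i2).
# Practical only when N is small and f, g are concretely known.

def loop_carried(f, g, N):
    writes = {}                      # element index -> earliest writing iteration
    for i in range(1, N + 1):
        if f(i) not in writes:
            writes[f(i)] = i
    # does any later iteration read an element written by an earlier one?
    return any(g(i2) in writes and writes[g(i2)] < i2
               for i2 in range(2, N + 1))

# X(2i) written, X(2i+1) read: even vs. odd indices never meet
print(loop_carried(lambda i: 2 * i, lambda i: 2 * i + 1, 100))   # False
# X(i+1) written, X(i) read: iteration i reads what i-1 wrote
print(loop_carried(lambda i: i + 1, lambda i: i, 100))           # True
```

This exact check is what the compile-time tests on the next slide approximate conservatively.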


Loop Carried Dependences: GCD Test

  do I = 1, N
    S1: X(c*I + j) = F(...)
    S2: A = X(d*I + k) ...
  end do

f(i1) = g(i2) if c*i1 + j = d*i2 + k
This has an integer solution iff gcd(c, d) | k - j

Example
  A(2*I) = ...
  ...    = A(2*I + 1)
  gcd(2,2) = 2 does not divide 1 - 0, so the references are independent

The GCD test is of limited use because it is very conservative; often gcd(c,d) = 1
  X(4i+1) = F(X(5i+2))
Other, more complex tests have been developed
• e.g., Banerjee's Inequality
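The GCD test itself is a one-line divisibility check. A Python sketch (helper name is illustrative) applying it to both examples above:

```python
# Sketch of the GCD test for subscripts f(i) = c*i + j and g(i) = d*i + k:
# c*i1 + j = d*i2 + k can only have an integer solution when gcd(c, d)
# divides (k - j). The test is conservative: "possible" does not mean "yes".

from math import gcd

def gcd_test(c, j, d, k):
    """False -> provably independent; True -> a dependence is possible."""
    return (k - j) % gcd(c, d) == 0

# Slide example: A(2*I) written, A(2*I + 1) read; gcd(2,2)=2 does not divide 1
print(gcd_test(2, 0, 2, 1))   # False -> provably independent
# X(4i+1) = F(X(5i+2)): gcd(4,5)=1 divides everything -> inconclusive
print(gcd_test(4, 1, 5, 2))   # True -> dependence possible
```

The second call shows the weakness noted on the slide: whenever gcd(c,d) = 1 the test can never prove independence.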


Vector Code Generation

In a vector architecture, a vector instruction performs identical operations on vectors of data
Generally, the vector operations are independent
• A common exception is reductions
In general, to vectorize:
• There should be no cycles in the dependence graph
• Dependence flows should be downward
• Some rearranging of code may be needed


Vector Code Generation: Example

  do I = 1, N
    S1: A(I) = B(I)
    S2: C(I) = A(I) + B(I)
    S3: E(I) = C(I+1)
  end do

Construct the dependence graph:
  S1 -> S2 (true dependence on A, same iteration)
  S3 -> S2 (antidependence on C, loop-carried)

Vectorizes (after reordering S2 and S3 due to the antidependence):
  S1: A(1:N) = B(1:N)
  S3: E(1:N) = C(2:N+1)
  S2: C(1:N) = A(1:N) + B(1:N)


Multiple Processors (Concurrentization)

Often used on outer loops
Example

  do I = 1, N
    do J = 2, N
      S1: A(I,J) = B(I,J) + C(I,J)
      S2: C(I,J) = D(I,J)/2
      S3: E(I,J) = A(I,J-1)**2 + E(I,J-1)
    end do
  end do

Data Dependences & Directions
  S1 -> S3 (=, <)
  S1 -> S2 (=, =)
  S3 -> S3 (=, <)

Observations
• All dependence directions for the I loop are =, so iterations of the I loop can be scheduled in parallel


Scheduling

Data Parallel Programming Model
• SPMD (single program, multiple data)
Compiler can pre-schedule:
• Processor 1 executes 1st N/P iterations
• Processor 2 executes next N/P iterations
• …
• Processor P executes last N/P iterations
• Pre-scheduling is effective if execution time is nearly identical for each iteration
Self-scheduling is often used:
• If each iteration is large
• If time varies from iteration to iteration
  - Iterations are placed in a "work queue"
  - A processor that is idle, or becomes idle, takes the next block of work from the queue (critical section)
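A minimal self-scheduling sketch in Python (the slides are architecture-neutral; the thread count, block size, and loop body here are all illustrative). The shared queue plays the role of the work queue, and pulling from it is the critical section:

```python
# Hedged sketch of self-scheduling: idle workers repeatedly take the next
# block of iterations from a shared work queue until it is empty.

import queue
import threading

N, BLOCK, P = 1000, 50, 4            # illustrative sizes, not from the slides
work = queue.Queue()
for start in range(1, N + 1, BLOCK):
    work.put((start, min(start + BLOCK - 1, N)))

results = [0] * (N + 1)              # 1-indexed like the Fortran loops

def worker():
    while True:
        try:
            lo, hi = work.get_nowait()   # Queue serializes the critical section
        except queue.Empty:
            return
        for i in range(lo, hi + 1):      # stand-in loop body: A(I) = 2*I
            results[i] = 2 * i

threads = [threading.Thread(target=worker) for _ in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[1], results[N])   # 2 2000
```

Because each iteration writes a distinct element, no further synchronization is needed beyond the queue itself.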


Code Generation with Dependences

  do I = 2, N
    S1: A(I) = B(I) + C(I)
    S2: C(I) = D(I) * 2
    S3: E(I) = C(I) + A(I-1)
  end do

Data Dependences & Directions
  S1 -> S2 (antidependence, =)
  S1 -> S3 (<)
  S2 -> S3 (=)

Parallel Code on N-1 Processors
  S1: A(I) = B(I) + C(I)
      signal(I)
  S2: C(I) = D(I) * 2
      if (I > 2) wait(I-1)
  S3: E(I) = C(I) + A(I-1)

Observation
• Weak data-dependence tests may add unnecessary synchronization. Good dependence testing is crucial for high performance


Reducing Synchronization

  do I = 1, N
    S1: A(I) = B(I) + C(I)
    S2: D(I) = A(I) * 2
    S3: SUM = SUM + A(I)
  end do

Parallel Code: Version 1
  do I = p, N, P
    S1: A(I) = B(I) + C(I)
    S2: D(I) = A(I) * 2
        if (I > 1) wait(I-1)
    S3: SUM = SUM + A(I)
        signal(I)
  end do


Reducing Synchronization, contd.

Parallel Code: Version 2
  SUMX(p) = 0
  do I = p, N, P
    S1: A(I) = B(I) + C(I)
    S2: D(I) = A(I) * 2
    S3: SUMX(p) = SUMX(p) + A(I)
  end do
  barrier synchronize
  add partial sums
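Version 2 can be sketched in Python by simulating each of the P processors in turn (the arrays and sizes are illustrative). Each simulated processor accumulates into its own SUMX slot, and the "add partial sums" step after the barrier combines them:

```python
# Sketch of Version 2: each of P (simulated) processors sums its own
# iterations I = p, p+P, p+2P, ... into a private SUMX(p); partial sums
# are combined after the barrier, removing per-iteration synchronization.

N, P = 100, 4
A = [0.0] * (N + 1)                    # 1-indexed like the Fortran arrays
B = [float(i) for i in range(N + 1)]   # B(I) = I, illustrative data
C = [1.0] * (N + 1)

SUMX = [0.0] * (P + 1)
for p in range(1, P + 1):              # each pass simulates one processor
    for I in range(p, N + 1, P):       # do I = p, N, P
        A[I] = B[I] + C[I]             # S1
        SUMX[p] += A[I]                # S3: private partial sum

total = sum(SUMX[1:])                  # "add partial sums" after the barrier
print(total)   # sum of (I + 1) for I = 1..100 = 5050 + 100 = 5150
```

The result is identical to the sequential loop, but the only synchronization points are the single barrier and the final combine.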


Vectorization vs Concurrentization

When a system is a vector MP, when should vector/concurrent code be generated?

  do J = 1,N
    do I = 1,N
      S1: A(I,J+1) = B(I,J) + C(I,J)
      S2: D(I,J) = A(I,J) * 2
    end do
  end do

Parallel & Vector Code: Version 1
  doacross J = 1,N
    S1: A(1:N,J+1) = B(1:N,J) + C(1:N,J)
        signal(J)
        if (J > 1) wait(J-1)
    S2: D(1:N,J) = A(1:N,J) * 2
  end do


Vectorization vs Concurrentization

Parallel & Vector Code: Version 2
Vectorize on J, but with non-unit-stride memory access (assuming Fortran column-major storage order)

  doall I = 1,N
    S1: A(I,2:N+1) = B(I,1:N) + C(I,1:N)
    S2: D(I,1:N) = A(I,1:N) * 2
  end do


Summary

Vectorizing compilers have been a success
Dependence analysis is critical to any auto-parallelizing scheme
• Software (static) disambiguation
• C pointers are especially difficult
Can also be used for improving performance of sequential programs
• Loop interchange
• Fusion
• Etc. (see add'l slides at end of lecture)


Cray-1 Architecture

Circa 1976
80 MHz clock
• When high-performance mainframes were 20 MHz
Scalar instruction set
• 16/32-bit instruction sizes
• Otherwise conventional RISC
• 8 S registers (64 bits)
• 8 A registers (24 bits)
In-order pipeline
• Issue in order
• Can complete out of order (no precise traps)


Cray-1 Vector ISA

8 vector registers
• 64 elements
• 64 bits per element (word length)
• Vector length (VL) register
RISC format
• Vi <- Vj OP Vk
• Vi <- mem(Aj, disp)
Conditionals via vector mask (VM) register
• VM <- Vi pred Vj
• Vi <- V2 conditional on VM


Vector Example

      Do 10 i=1,looplength
         a(i) = b(i) * x + c(i)
 10   continue

  A1 <- looplength       .initial values:
  A2 <- address(a)       .for the arrays
  A3 <- address(b)       .
  A4 <- address(c)       .
  A5 <- 0                .index value
  A6 <- 64               .max hardware VL
  S1 <- x                .scalar x in register S1
  VL <- A1               .set VL – performs mod function
        BrC done, A1<=0  .branch if nothing to do
more:
  V3 <- A4,A5            .load c indexed by A5 – addr mode not in Cray-1
  V1 <- A3,A5            .load b indexed by A5
  V2 <- V1 * S1          .vector times scalar
  V4 <- V2 + V3          .add in c
  A2,A5 <- V4            .store to a indexed by A5
  A7 <- VL               .read actual VL
  A1 <- A1 - A7          .remaining iteration count
  A5 <- A5 + A7          .increment index value
  VL <- A6               .set VL for next iteration
        BrC more, A1>0   .branch if more work
done:
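The strip-mining control flow above (first strip of length N mod 64, then full 64-element strips) can be rendered in Python as follows. This is a hypothetical emulation of the loop structure, not Cray code; names and sizes are illustrative:

```python
# Hypothetical Python rendering of the Cray-1 strip-mining logic: process
# the loop in chunks of at most MAXVL = 64 elements, with the first chunk
# taking the remainder (the "mod function" the initial VL set performs).

MAXVL = 64

def vector_loop(a, b, c, x):
    n = len(b)
    i = 0
    vl = n % MAXVL or MAXVL            # first strip: n mod 64 (or a full 64)
    while i < n:
        for k in range(i, i + vl):     # one vector instruction's worth of work
            a[k] = b[k] * x + c[k]
        i += vl                        # A5 <- A5 + A7
        vl = MAXVL                     # VL <- A6 for all later strips
    return a

n = 150                                # 150 = 22 + 64 + 64: three strips
a = vector_loop([0.0] * n, [1.0] * n, [2.0] * n, 3.0)
print(a[0], a[n - 1])   # 5.0 5.0
```

Taking the remainder first means every subsequent strip uses the full hardware vector length, matching the register code above.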


Compare with Scalar

      Do 10 i=1,looplength
         a(i) = b(i) * x + c(i)
 10   continue

Per iteration:
  2 loads
  1 store
  2 FP ops
  1 branch
  1 index increment (at least)
  1 loop count increment
  total – 8 instructions per iteration

4-wide superscalar => up to 1 FP op per cycle
Vector, with chaining => up to 2 FP ops per cycle (assuming memory bandwidth)

Also, in a CMOS microprocessor, the vector version would save a lot of energy.


Vector Conditional Loop

      do 80 i = 1,looplen
         if (a(i).eq.b(i)) then
            c(i) = a(i) + e(i)
         endif
 80   continue

  V1 <- A1            .load a(i)
  V2 <- A2            .load b(i)
  VM <- V1 == V2      .compare a and b; result to VM
  V3 <- A3; VM        .load e(i) under mask
  V4 <- V1 + V3; VM   .add under mask
  A4 <- V4; VM        .store to c(i) under mask


Vector Conditional Loop

Gather/Scatter Method (used in later Cray machines)

      do 80 i = 1,looplen
         if (a(i).eq.b(i)) then
            c(i) = a(i) + e(i)
         endif
 80   continue

  V1 <- A1            .load a(i)
  V2 <- A2            .load b(i)
  VM <- V1 == V2      .compare a and b; result to VM
  V5 <- IOTA(VM)      .form index set
  VL <- pop(VM)       .find new VL (population count)
  V6 <- A1, V5        .gather a(i) values
  V3 <- A3, V5        .gather e(i) values
  V4 <- V6 + V3       .add a and e
  A4,V5 <- V4         .scatter sum into c(i)
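The two strategies can be contrasted in plain Python (a sketch; list-based emulation, not vector hardware). Masked execution touches every element position, while gather/scatter first compresses the index set and then operates only on the selected elements:

```python
# Sketch comparing the two vector conditional strategies: masked execution
# vs. gather/scatter (compress the index set, work on it, scatter results).

def masked(a, b, c, e):
    vm = [x == y for x, y in zip(a, b)]          # compare; result to mask
    for i, m in enumerate(vm):
        if m:
            c[i] = a[i] + e[i]                   # add/store under mask
    return c

def gather_scatter(a, b, c, e):
    vm = [x == y for x, y in zip(a, b)]
    idx = [i for i, m in enumerate(vm) if m]     # IOTA(VM); len(idx) = new VL
    ga = [a[i] for i in idx]                     # gather a values
    ge = [e[i] for i in idx]                     # gather e values
    for j, i in enumerate(idx):
        c[i] = ga[j] + ge[j]                     # scatter sums into c
    return c

a, b, e = [1, 2, 3, 4], [1, 0, 3, 0], [10, 20, 30, 40]
print(masked(a, b, [0] * 4, e))           # [11, 0, 33, 0]
print(gather_scatter(a, b, [0] * 4, e))   # [11, 0, 33, 0]
```

Both produce the same result; gather/scatter wins when few elements are selected, because the arithmetic runs at the reduced vector length.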


Thinking Machines CM-1/CM-2

Fine-grain parallelism
Looks like intelligent RAM to host (front-end)
Front-end dispatches "macro" instructions to sequencer
Macro instructions decoded by sequencer and broadcast to bit-serial parallel processors

[Figure: the host issues instructions to a sequencer, which broadcasts them to an array of bit-serial processors (P), each with its own memory (M)]


CM Basics, contd.

All instructions are executed by all processors, subject to a context flag
Context flags
• Processor is selected if context flag = 1
• Saving and restoring of context is unconditional
• AND, OR, NOT operations can be done on the context flag
Operations
• Can do logical, integer, floating point as a series of bit-serial operations


CM Basics, contd.

Front-end can broadcast data
• (e.g., immediate values)
SEND instruction does communication
• Within each processor, pointers can be computed, stored, and re-used
Virtual processor abstraction
• Time multiplexing of processors


Data Parallel Programming Model

"Parallel operations across large sets of data"
SIMD is an example, but the model can also be driven by multiple (identical) threads
• Thinking Machines CM-2 used SIMD
• Thinking Machines CM-5 used multiple threads


Connection Machine Architecture

Nexus: 4x4, 32 bits wide
• Cross-bar interconnect for host communications
16K processors per sequencer
Memory
• 4K mem per processor (CM-1)
• 64K mem per processor (CM-2)
CM-1 processor
• 16 processors per processor chip


Instruction Processing

HLLs: C* and FORTRAN 8X
Paris virtual machine instruction set
Virtual processors
• Allows some hardware independence
• Time-share real processors
• V virtual processors per real processor => 1/V as much memory per virtual processor
Nexus contains sequencer
• AMD 2900 bit-sliced micro-sequencer
• 16K of 96-bit horizontal microcode
Instruction processing:
• 32-bit virtual machine insts (host)
  -> 96-bit microcode (nexus sequencer)
  -> nanocode (to processors)


CM-2

Re-designed sequencer; 4x microcode memory
New processor chip
FP accelerator (1 per 32 processors)
16x memory capacity (4K -> 64K)
SEC/DED on RAM
I/O subsystem
Data vault
Graphics system
Improved router


Performance

Computation
• 4000 MIPS 32-bit integer
• 20 GFLOPS 32-bit FP
• 4K x 4K matrix mult: 5 GFLOPS
Communication
• 2-d grid: 3 microseconds per bit; 96 microseconds per 32 bits; 20 billion bits/sec
• General router: 600 microseconds per 32 bits; 3 billion bits/sec
Compare with CRAY Y-MP (8 procs.)
• 2.4 GFLOPS, but could come much closer to peak than CM-2
• 246 billion bits/sec to/from shared memory


Outline

Automatic Parallelization
Vector Architectures
• Cray-1 case study
Data Parallel Programming
• CM-1/2 case study
CUDA Overview (separate slides)
Readings
• W. Daniel Hillis and Guy L. Steele, Data Parallel Algorithms, Communications of the ACM, December 1986, pp. 1170-1183.
• S. Ryoo, et al., Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA, Proceedings of PPoPP, Feb. 2008.


Additional Slides on Code Generation


Improving Parallelism

Loop Interchange

  do J = 1,N
    do I = 2,N
      S1: A(I,J) = A(I-1,J) + B(I)
    end do
  end do

  do I = 2,N
    do J = 1,N
      S1: A(I,J) = A(I-1,J) + B(I)
    end do
  end do

Loop Interchange

[Two figure slides illustrating array access order before and after interchange; images not captured in the transcript]

Improving Parallelism

Induction Variable Recognition
• Successive values are an arithmetic progression

  INC = N
  do I = 1,N
    I2 = 2*I-1
    X(INC) = Y(I) + Z(I2)
    INC = INC - 1
  end do

  do I = 1,N
    X(N-I+1) = Y(I) + Z(2*I - 1)
  end do

  X(N:1:-1) = Y(1:N) + Z(1:2*N-1:2)


Improving Parallelism

Wraparound Variable Recognition

  J = N
  do I = 1,N
    B(I) = (A(J) + A(I))/2
    J = I                  (Is J an induction variable?)
  end do

Peel first iteration:
  if (N >= 1) then
    B(1) = (A(N) + A(1))/2
    do I = 2,N
      B(I) = (A(I-1) + A(I))/2
    end do
  end if

  if (N >= 1) then
    B(1) = (A(N) + A(1))/2
    B(2:N) = (A(1:N-1) + A(2:N))/2
  end if


Improving Parallelism

Symbolic Dependence Testing

  do I = 1,N
    S1: A(LOW+I-1) = B(I)
    S2: B(I+N) = A(LOW+I)
  end do

Global Forward Substitution

  NP1 = N+1
  NP2 = N+2
  ...
  do I = 1,N
    S1: B(I) = A(NP1) + C(I)
    S2: A(I) = A(I) - 1
    do J = 2,N
      S3: D(J,NP1) = D(J-1,NP2)*C(J)+1
    end do
  end do

Observations
• Useful for symbolic dependence testing
• Constant propagation is a special case


Improving Parallelism

Semantic Analysis

  do I = LOW,HIGH
    S1: A(I) = B(I) + A(I+M)     (M is unknown)
  end do

  if (M >= 0) then
    do I = LOW,HIGH
      S1: A(I) = B(I) + A(I+M)
    end do
  else
    do I = LOW,HIGH
      S1: A(I) = B(I) + A(I+M)
    end do
  end if

The loop in the "then" branch (M >= 0) is vectorizable


Improving Parallelism

Interprocedural Dependence Analysis
• If a procedure call is present in a loop, can we generate parallel code?
Procedure in-lining
• May reduce overhead and lead to exact dependence analysis
• May increase compilation time


Improving Parallelism

Removal of Output and Antidependences
Eliminate them by renaming
• But may require extra vector storage

  do I = 1,N
    S1: A(I) = B(I) + C(I)
    S2: D(I) = (A(I) + A(I+1))/2
  end do

  do I = 1,N
    S3: ATEMP(I) = A(I+1)
    S1: A(I) = B(I) + C(I)
    S2: D(I) = (A(I) + ATEMP(I))/2
  end do


Improving Parallelism

Scalar Expansion

  do I = 1,N
    S1: X = A(I) + B(I)
    S2: C(I) = X ** 2
  end do

  allocate (XTEMP(1:N))
  do I = 1,N
    S1: XTEMP(I) = A(I) + B(I)
    S2: C(I) = XTEMP(I) ** 2
  end do
  X = XTEMP(N)
  free (XTEMP)


Improving Parallelism

Fission by Name
• Break a loop into several adjacent loops to improve performance of memory

Loop Fusion

  do I = 2,N
    S1: A(I) = B(I) + C(I)
  end do
  do I = 2,N
    S2: D(I) = A(I-1)
  end do

  do I = 2,N
    S1: A(I) = B(I) + C(I)
    S2: D(I) = A(I-1)
  end do

Observations
• Fusion reduces start-up costs for loops
• When does fusion yield benefits over fission?


Improving Parallelism

Strip Mining

  do I = 1,N
    A(I) = B(I) + 1
    D(I) = B(I) - 1
  end do

  do J = 1,N,32            (vector reg. len. = 32)
    do I = J,MIN(J+31,N)
      A(I) = B(I) + 1
      D(I) = B(I) - 1
    end do
  end do


Loop Collapsing

  real A(5,5), B(5,5)
  do I = 1,5
    do J = 1,5
      A(I,J) = B(I,J) + 2
    end do
  end do

  real A(25), B(25)
  do IJ = 1,25
    A(IJ) = B(IJ) + 2
  end do


Conditional Statements

  do I = 1,N
    IF (A(I) .LE. 0) GOTO 100
    A(I+1) = B(I) + 3
  end do

  do I = 1,N                         (Is this vectorizable?)
    BR1 = A(I) .LE. 0
    IF (.NOT. BR1) A(I+1) = B(I) + 3
  end do

  do I = 1,N                         (Is this vectorizable?)
    BR1 = A(I) .LE. 0
    IF (.NOT. BR1) A(I) = B(I) + 3
  end do


Conditional Statements, contd.

  do I = 1,N
    BR1(I) = A(I) .LE. 0
    IF (.NOT. BR1(I)) A(I) = B(I) + 3
  end do

  BR1(1:N) = A(1:N) .LE. 0                     (scalar expansion)
  WHERE (.NOT. BR1(1:N)) A(1:N) = B(1:N) + 3


Data Parallel Examples


Example: Reduction

Each processor (k) unconditionally tests its context flag
• It contributes its value if the flag is set, else 0
Then perform an unconditional summation of the integers
• Special case: count processors
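The scheme above can be sketched in a few lines of Python (a simulation of the processors; the data values are illustrative). Every processor participates unconditionally; inactive ones simply contribute zero:

```python
# Minimal sketch of the CM-style reduction: every (simulated) processor
# contributes its value if its context flag is set and 0 otherwise, then
# all contributions are summed unconditionally.

def reduce_sum(values, context):
    # processor k contributes values[k] or 0; the sum itself is unconditional
    return sum(v if flag else 0 for v, flag in zip(values, context))

values  = [5, 7, 1, 3]
context = [1, 0, 1, 1]
print(reduce_sum(values, context))               # 9
# Special case: counting active processors = reducing an all-ones vector
print(reduce_sum([1] * len(context), context))   # 3
```

Making the sum unconditional is what keeps the operation SIMD: all processors execute the same instruction sequence regardless of their flags.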


Example: Compute Partial Sums

Similar to reduction, except use a sum-prefix computation
• Special case: enumerate – give each active processor a sequence number
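The sum-prefix step itself can be sketched as the log-step scan described in the Hillis/Steele reading: in each of log2(n) steps, every element adds the value a distance 2^step to its left. This Python version simulates the "all processors update at once" semantics with a fresh list per step:

```python
# Hedged sketch of the data-parallel sum-prefix (scan): log2(n) steps, each
# adding the value 2^step positions to the left. "Enumerate" falls out as
# the special case of scanning an all-ones vector.

def prefix_sums(x):
    x = list(x)
    step = 1
    while step < len(x):
        # all elements update "simultaneously": read old values, then write
        x = [x[i] + (x[i - step] if i >= step else 0) for i in range(len(x))]
        step *= 2
    return x

print(prefix_sums([3, 1, 4, 1, 5, 9, 2, 6]))   # [3, 4, 8, 9, 14, 23, 25, 31]
print(prefix_sums([1] * 5))                    # [1, 2, 3, 4, 5]  (enumerate)
```

Reading old values before writing new ones is essential; updating in place would let one step's result contaminate another's input.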


Example: Radix Sort


Example: Find End of Linked List


Bonus Slides: BSP


Burroughs BSP

Developed during the "supercomputer wars" of the late '70s and early '80s
Taken to prototype stage, but never shipped
Draws a distinct division between vector and scalar processing
• Control and parallel processors have totally different memories (for both insts. and data)


Burroughs BSP

[Figure slide: BSP system diagram; image not captured in the transcript]


Burroughs BSP

Control (scalar) processor
• Processes all instructions from control memory
• 80 ns clock => up to 1.5 MFLOPS
Parallel (vector) processor
• 16 processors
• 160 ns clock
• 2 cp latency for major FP operations
• Pipelined at a high level


BSP FAPAS Pipeline

Stages: Fetch, Align, Process, Align, Store (FAPAS)

[Figure: memory feeds the Fetch stage; input and output alignment networks surround the Process stage; results return to memory through the Store stage]


Example

Example: D = A * B + C

  Cycle:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
  Fetch:   A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4
  Align:      A1 B1 C1 A2 B2 C2 A3 B3 C3 A4
  Process:        *  *  +  +  *  *  +  +  *  *  +
  Align:                     D1          D2
  Store:                        D1          D2

Note that memory bandwidth and FP bandwidth are both fully committed for the triad.


FAPAS Pipeline, contd.

Throughput: 1 FP op per 320 ns per processor, times 16 processors in parallel => 50 MFLOPS peak
=> big difference between performance in scalar and vector modes
Also, many scalar arithmetic operations were done in the vector unit because of the partitioning of memory


Architecture

Memory to memory
Unlimited vector lengths
• (strip mining is automatic in hardware)
Up to 5 input operands per instruction (pentads)
• Takes into account all evaluation trees
• Refer to table 1 for list of forms
• Any fixed stride
Supports two levels of loop nesting


Example:

  VFORM TRIAD, op1, op2
  OBV                   (operand bit vector)
  RBV                   (result bit vector)
  VOPERAND A
  VOPERAND B
  VOPERAND C
  VRESULT Z

  => RBV,Z = A op1 B op2 C, OBV


FORTRAN Example

Livermore Fortran Kernel 1, Hydro Excerpt
• Inner loop: x(k) = u(k)*(r*z(k+10) + t*z(k+11))

  VLEN = "one level of nesting, length 100"
  VFORM PENTAD2, *, +, *, *, no bit vectors
    no OBV
    no RBV
  VOPERAND r      (broadcast)
  VOPERAND z+10   (stride 1)
  VOPERAND t      (broadcast)
  VOPERAND z+11   (stride 1)
  VOPERAND u      (stride 1)
  VRESULT x       (stride 1)


Architecture, contd.

Strided loads/stores
• Implemented with a prime memory system to avoid conflicts
Sparse matrices
• compress
• expand
• random fetch
• random store
IF statements
• Bit vectors embedded in vector insts.
Recurrences/reductions
• Special instructions
Chaining
• Built into instructions via multi-operands
• Saves loads and stores in a mem-to-mem arch.
• 10 temp registers per arithmetic unit


Template Processing

[Figure: the control processor fills fields of 120-bit entries in a 16-entry Vector Data Buffer, drawing on the 16 GPRs if needed; the Vector Input and Validation Unit assembles the sequence of instructions into a global description of the operation, checks memory hazards, and ships it to the vector unit; a 16-entry Template Descriptor Memory and the Template Control Unit supply read/write control to the FAPAS stages from control memory]