Lecture 12, Slide 1
Computer Architecture
Vector Computers
Lecture 12, Slide 2
Contents
1. Why Vector Processors?
2. Basic Vector Architecture
3. How Vector Processors Work
4. Vector Length and Stride
5. Effectiveness of Compiler Vectorization
6. Enhancing Vector Performance
7. Performance of Vector Processors
Lecture 12, Slide 3
Vector Processors
I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors. . . . One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made.
Seymour Cray Public lecture at Lawrence Livermore Laboratories
on the introduction of the Cray-1 (1976)
Lecture 12, Slide 4
Supercomputers
Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray
CDC6600 (Cray, 1964) regarded as first supercomputer
Lecture 12, Slide 5
Supercomputer Applications
Typical application areas
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)
All involve huge computations on large data sets
In the 70s-80s, supercomputer meant vector machine
Lecture 12, Slide 6
1. Why Vector Processors?
• A single vector instruction specifies a great deal of work—it is equivalent to executing an entire loop.
• The computation of each result in the vector is independent of the computation of other results in the same vector and so hardware does not have to check for data hazards within a vector instruction.
• Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors.
• Vector instructions that access memory have a known access pattern.
• Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.
Lecture 12, Slide 7
2. Basic Vector Architecture
• There are two primary types of architectures for vector processors: vector-register processors and memory-memory vector processors.
– In a vector-register processor, all vector operations—except load and store—are among the vector registers.
– In a memory-memory vector processor, all vector operations are memory to memory.
Lecture 12, Slide 8 Vector Memory-Memory versus Vector Register Machines
• Vector memory-memory instructions hold all vector operands in main memory
• The first vector machines, CDC Star-100 (‘73) and TI ASC (‘71), were memory-memory machines
• Cray-1 (’76) was first vector register machine
for (i=0; i<N; i++)
{
C[i] = A[i] + B[i];
D[i] = A[i] - B[i];
}
Example Source Code
ADDV C, A, B
SUBV D, A, B
Vector Memory-Memory Code
LV V1, A
LV V2, B
ADDV V3, V1, V2
SV V3, C
SUBV V4, V1, V2
SV V4, D
Vector Register Code
Lecture 12, Slide 9 Vector Memory-Memory vs. Vector Register Machines
• Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?
– All operands must be read in and out of memory
• VMMAs make it difficult to overlap execution of multiple vector operations, why?
– Must check dependencies on memory addresses
• VMMAs incur greater startup latency
– Scalar code was faster on CDC Star-100 for vectors < 100 elements
– For Cray-1, vector/scalar breakeven point was around 2 elements
Apart from CDC follow-ons (Cyber-205, ETA-10) all major vector machines since Cray-1 have had vector register architectures
(we ignore vector memory-memory from now on)
Lecture 12, Slide 10 The basic structure of a vector-register architecture
VMIPS
Lecture 12, Slide 11
Primary Components of VMIPS
• Vector registers — VMIPS has eight vector registers, and each holds 64 elements. Each vector register must have at least two read ports and one write port.
• Vector functional units — Each unit is fully pipelined and can start a new operation on every clock cycle.
• Vector load-store unit —The VMIPS vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of 1 word per clock cycle, after an initial latency.
• A set of scalar registers —Scalar registers can also provide data as input to the vector functional units, as well as compute addresses to pass to the vector load-store unit.
Lecture 12, Slide 12
Vector Supercomputers
Epitomized by Cray-1, 1976:
Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory
Lecture 12, Slide 13 Cray-1 (1976)
Lecture 12, Slide 14 Cray-1 (1976)
[Figure: Cray-1 datapath. Single-port memory of 16 banks of 64-bit words + 8-bit SECDED; 80 MW/sec data load/store and 320 MW/sec instruction buffer refill. Four instruction buffers (64-bit x 16) feed NIP/LIP/CIP. Eight scalar registers S0-S7 backed by 64 T registers; eight address registers A0-A7 backed by 64 B registers; eight vector registers V0-V7 of 64 elements each, with a vector mask and a vector length register. Functional units: FP Add, FP Mul, FP Recip; Int Add, Int Logic, Int Shift, Pop Cnt; Addr Add, Addr Mul.]
memory bank cycle 50 ns, processor cycle 12.5 ns (80 MHz)
Lecture 12, Slide 15
Vector Programming Model
[Figure: scalar registers r0-r7 alongside vector registers v0-v7, each vector register holding elements [0], [1], ..., [VLRMAX-1], with a vector length register VLR.]
Vector arithmetic instructions operate elementwise, e.g. ADDV v3, v1, v2 computes v3[i] = v1[i] + v2[i] for i = 0 ... VLR-1.
Vector load and store instructions move a vector between memory and a vector register, e.g. LV v1, r1, r2 with base address in r1 and stride in r2.
Lecture 12, Slide 16 • In VMIPS, vector operations use the same names as MIPS operations, but with the letter “V” appended.
Lecture 12, Slide 17
Vector Code Example
# Scalar Code
LI R4, 64
loop:
L.D F0, 0(R1)
L.D F2, 0(R2)
ADD.D F4, F2, F0
S.D F4, 0(R3)
DADDIU R1, 8
DADDIU R2, 8
DADDIU R3, 8
DSUBIU R4, 1
BNEZ R4, loop
# Vector Code
LI VLR, 64
LV V1, R1
LV V2, R2
ADDV.D V3, V1, V2
SV V3, R3
# C code
for (i=0; i<64; i++)
C[i] = A[i] + B[i];
Lecture 12, Slide 18
Vector Instruction Set Advantages
• Compact– one short instruction encodes N operations
• Expressive, tells hardware that these N operations:– are independent
– use the same functional unit
– access disjoint registers
– access registers in the same pattern as previous instructions
– access a contiguous block of memory (unit-stride load/store)
– access memory in a known pattern (strided load/store)
• Scalable– can run same object code on more parallel pipelines or lanes
Lecture 12, Slide 19 3. How Vector Processors Work 3.1 An Example• Let’s take a typical vector problem, X and Y are vectors,
a is a scalar.
• Y = a×X + Y
• This is the socalled SAXPY or DAXPY loop that forms the inner loop of the Linpack benchmark.
• Example Show the code for MIPS and VMIPS for the DAXPY loop.
• Assume that the starting addresses of X and Y are in Rx and Ry. And the number of elements, or length, of a vector register(64) matches the length of the vector operation.
Lecture 12, Slide 20
• Here is the MIPS code.
L.D F0,a ;load scalar a
DADDIU R4,Rx,#512 ;last address to load
Loop: L.D F2,0(Rx) ;load X(i)
MUL.D F2,F2,F0 ;a × X(i)
L.D F4,0(Ry) ;load Y(i)
ADD.D F4,F4,F2 ;a × X(i) + Y(i)
S.D 0(Ry),F4 ;store into Y(i)
DADDIU Rx,Rx,#8 ;increment index to X
DADDIU Ry,Ry,#8 ;increment index to Y
DSUBU R20,R4,Rx ;compute bound
BNEZ R20,Loop ;check if done
Lecture 12, Slide 21 • Here is the VMIPS code for DAXPY.
L.D F0,a ;load scalar a
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV V3,Ry ;load vector Y
ADDV.D V4,V2,V3 ;add
SV Ry,V4 ;store the result
• The most dramatic comparison is that the vector processor greatly reduces the dynamic instruction bandwidth.
• Another important difference is the frequency of pipeline interlocks. (Pipeline stalls are required only once per vector operation, rather than once per vector element.)
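The reduction in instruction bandwidth can be made concrete by counting dynamic instructions for the 64-element DAXPY. A sketch (counting the 2 setup instructions plus the 9-instruction loop body of the MIPS version against the 6 VMIPS instructions):

```python
# Dynamic instruction counts for a 64-element DAXPY (a sketch).
# MIPS: L.D + DADDIU setup, then 9 instructions per loop iteration
# (L.D, MUL.D, L.D, ADD.D, S.D, DADDIU, DADDIU, DSUBU, BNEZ).
n = 64
mips_count = 2 + 9 * n   # 578 dynamic instructions
vmips_count = 6          # L.D, LV, MULVS.D, LV, ADDV.D, SV

print(mips_count, vmips_count)  # 578 vs 6
```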
Lecture 12, Slide 22 Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: V3 <- V1 * V2 flowing through a six-stage multiply pipeline]
Lecture 12, Slide 23 3.2 Vector Load-Store Units and Vector Memory Systems
Start-up penalties (in clock cycles) on VMIPS
To maintain an initiation rate of 1 word fetched or stored per clock, the memory system must be capable of producing or accepting this much data. This is usually done by spreading accesses across multiple independent memory banks.
Operation            Start-up penalty
Vector add           6
Vector multiply      7
Vector divide        20
Vector load / store  12
Lecture 12, Slide 24
Vector Memory System
[Figure: an address generator adds base + stride to produce a stream of addresses, spread across 16 memory banks (0-F) that fill the vector registers.]
Cray-1: 16 banks, 4 cycle bank busy time, 12 cycle latency
• Bank busy time: cycles between accesses to same bank
Lecture 12, Slide 25
• Example
Suppose we want to fetch a vector of 64 elements starting at byte address 136, and a memory access takes 6 clocks. How many memory banks must we have to support one fetch per clock cycle? With what addresses are the banks accessed? When will the various elements arrive at the CPU?
• Answer
Six clocks per access require at least six banks, but because we want the number of banks to be a power of two, we choose to have eight banks. Figure on next page shows the timing for the first few sets of accesses for an eight-bank system with a 6-clock-cycle access latency.
Lecture 12, Slide 26
The CPU cannot keep all eight banks busy all the time because it is limited to supplying one new address and receiving one data item each cycle.
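A small simulation can check the bank arithmetic in the example above. This sketch assumes word-interleaved banks and one new address issued per clock:

```python
# Sketch: 8 interleaved banks, 6-clock access time, one address issued per clock.
# Element i of a 64-element double-word vector sits at byte address 136 + 8*i.
NBANKS, LATENCY = 8, 6

def access_schedule(base=136, n=64):
    sched = []
    for i in range(n):
        addr = base + 8 * i
        bank = (addr // 8) % NBANKS   # word-interleaved bank assignment
        issue = i                     # one new address per cycle
        sched.append((i, addr, bank, issue + LATENCY))  # (elem, addr, bank, arrival)
    return sched

s = access_schedule()
# Element 0 lives in bank (136 // 8) % 8 = bank 1 and arrives at cycle 6.
# A bank is revisited only every 8 cycles, which covers the 6-cycle busy time,
# so one element arrives at the CPU every clock after the initial latency.
```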
Lecture 12, Slide 27
4. Two Real-World Issues: Vector Length and Stride
• What do you do when the vector length in a program is not exactly 64?
• How do you deal with nonadjacent elements in vectors that reside in memory?
4.1 Vector-Length Control
do 10 i = 1,n
10 Y(i) = a × X(i) + Y(i)
n may not even be known until run time
Lecture 12, Slide 28
• The solution is to create a vector-length register (VLR), which controls the length of any vector operation.
• The value in the VLR, however, cannot be greater than the length of the vector registers — maximum vector length (MVL).
• If the vector is longer than the maximum length, a technique called strip mining is used.
Lecture 12, Slide 29 Vector Stripmining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers, “Stripmining”
      ANDI R1, N, #63     ; N mod 64
      MTC1 VLR, R1        ; Do remainder
loop: LV V1, RA
      DSLL R2, R1, #3     ; Multiply by 8
      DADDU RA, RA, R2    ; Bump pointer
      LV V2, RB
      DADDU RB, RB, R2
      ADDV.D V3, V1, V2
      SV V3, RC
      DADDU RC, RC, R2
      DSUBU N, N, R1      ; Subtract elements
      LI R1, #64
      MTC1 VLR, R1        ; Reset full length
      BGTZ N, loop        ; Any more to do?
for (i=0; i<N; i++) C[i] = A[i]+B[i];
[Figure: A + B -> C processed as one odd-sized remainder piece followed by full 64-element pieces]
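The sequence of segment lengths the stripmined loop runs through can be sketched as (odd-sized remainder first, then full MVL-length pieces):

```python
# Sketch: strip-mine a loop of n iterations into vector segments of at most MVL.
MVL = 64

def segments(n):
    segs = []
    first = n % MVL             # odd-sized first piece (may be empty)
    if first:
        segs.append(first)
    segs += [MVL] * (n // MVL)  # remaining full-length pieces
    return segs

print(segments(200))  # [8, 64, 64, 64]
```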
Lecture 12, Slide 30
4.2 Vector Stride
do 10 i = 1,100
  do 10 j = 1,100
    A(i,j) = 0.0
    do 10 k = 1,100
10  A(i,j) = A(i,j)+B(i,k)*C(k,j)
At the statement labeled 10 we could vectorize the multiplication of each row of B with each column of C.
When an array is allocated memory, it is linearized and must be laid out in either row-major or column-major order. This linearization means that either the elements in the row or the elements in the column are not adjacent in memory.
Lecture 12, Slide 31
Vector Stride
• The distance separating elements that are to be gathered into a single register is called the stride.
• The vector stride, like the vector starting address, can be put in a general-purpose register.
• Then the VMIPS instruction LVWS (load vector with stride) can be used to fetch the vector into a vector register.
• Likewise, when a nonunit stride vector is being stored, SVWS (store vector with stride) can be used.
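The stride arithmetic can be illustrated for the matrix example above. This sketch assumes a 100x100 matrix of 8-byte doubles in column-major (Fortran) order, so walking down a column is unit stride while walking along a row is not:

```python
# Sketch: byte addresses of elements of a 100x100 matrix of 8-byte doubles
# stored in column-major (Fortran) order.
ROWS, COLS, ELEM = 100, 100, 8

def addr(i, j, base=0):
    # column-major: element (i, j) lives at base + (j*ROWS + i) * ELEM
    return base + (j * ROWS + i) * ELEM

col_stride = addr(1, 0) - addr(0, 0)  # down a column: 8 bytes (unit stride)
row_stride = addr(0, 1) - addr(0, 0)  # along a row: 800 bytes (nonunit stride)
```

Loading a row into a vector register therefore needs LVWS with a stride of 800 bytes (100 elements), while a column load is an ordinary unit-stride LV.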
Lecture 12, Slide 32
5. Effectiveness of Compiler Vectorization
• Two factors affect the success with which a program can be run in vector mode.
• The first factor is the structure of the program itself. This factor is influenced by the algorithms chosen and by how they are coded.
• The second factor is the capability of the compiler.
Lecture 12, Slide 33 Automatic Code Vectorization
for (i=0; i < N; i++) C[i] = A[i] + B[i];
Scalar Sequential Code
[Figure: each iteration (Iter. 1, Iter. 2, ...) performs load, load, add, store in sequence down the time axis]
Vectorization is a massive compile-time reordering of operation sequencing
requires extensive loop dependence analysis
Vectorized Code
[Figure: each vector instruction (load, load, add, store) covers all iterations at once]
Lecture 12, Slide 34
6. Enhancing Vector Performance
In this section we present five techniques for improving the performance of a vector processor.
• Chaining
• Conditionally Executed Statements
• Sparse Matrices
• Multiple Lanes
• Pipelined Instruction Start-Up
Lecture 12, Slide 35
(1) Vector Chaining
• The concept of forwarding extended to vector registers
• Vector version of register bypassing– introduced with Cray-1
[Figure: the load unit fills V1 from memory; V1 chains into the multiplier (V3 <- V1 * V2); V3 in turn chains into the adder (V5 <- V3 + V4)]
LV v1
MULV v3,v1,v2
ADDV v5, v3, v4
Lecture 12, Slide 36
Vector Chaining Advantage
• With chaining, can start dependent instruction as soon as first result appears
[Timeline figure: without chaining, Load, Mul, and Add execute back to back; with chaining, Mul starts as soon as Load's first element is available, and Add overlaps Mul]
• Without chaining, must wait for last element of result to be written before starting dependent instruction
Lecture 12, Slide 37
Implementations of Chaining
• Early implementations worked like forwarding, but this restricted the timing of the source and destination instructions in the chain.
• Recent implementations use flexible chaining, which requires simultaneous access to the same vector register by different vector instructions. This can be implemented either by adding more read and write ports or by organizing the vector-register file storage into interleaved banks, in a similar way to the memory system.
Lecture 12, Slide 38 (2) Vector Conditional Execution
Problem: Want to vectorize loops with conditional code:
for (i=0; i<N; i++) if (A[i]>0) then A[i] = B[i];
Solution: Add vector mask (or flag) registers
– vector version of predicate registers, 1 bit per element
…and maskable vector instructions
– vector operation becomes NOP at elements where mask bit is clear
Code example:
CVM ; Turn on all elements
LV VA, RA ; Load entire A vector
L.D F0,#0 ; Load FP zero into F0
SGTVS.D VA, F0 ; Set bits in mask register where A>0
LV VA, RB ; Load B vector into A under mask
SV VA, RA ; Store A back to memory under mask
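A sketch of what the masked sequence computes, element by element (plain Python lists standing in for the vector registers and mask):

```python
# Sketch of masked execution: A[i] = B[i] only where A[i] > 0.
def masked_copy(a, b):
    mask = [x > 0 for x in a]            # SGTVS.D: set mask bit where A > 0
    return [bi if m else ai              # masked LV/SV: update only where set
            for ai, bi, m in zip(a, b, mask)]

print(masked_copy([1, -2, 3, 0], [10, 20, 30, 40]))  # [10, -2, 30, 0]
```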
Lecture 12, Slide 39
Masked Vector Instructions
Simple Implementation
– execute all N operations, turn off result writeback according to mask
[Figure: element pairs A[3] B[3] ... A[7] B[7] all flow through the pipeline; mask bits M[0..7] gate the write enable on the write data port, so only results with M[i]=1 (here M[1], M[4], M[5], M[7]) are written]
Density-Time Implementation
– scan mask vector and only execute elements with non-zero masks
[Figure: only element pairs whose mask bit is 1 enter the pipeline, so execution time depends on mask density]
Lecture 12, Slide 40 Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at start of destination vector register
– population count of mask vector gives packed vector length
• Expand performs inverse operation
[Figure: with mask bits M[1], M[4], M[5], M[7] set, Compress packs A[1], A[4], A[5], A[7] into the front of the destination register; Expand scatters them back to positions 1, 4, 5, 7, with the masked-off positions filled from a second vector B]
Used for density-time conditionals and also for general selection operations
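A sketch of both operations on the mask pattern from the figure (Python lists standing in for vector registers):

```python
# Sketch: compress packs the mask=1 elements to the front of the destination;
# expand inverts it, filling masked-off slots from a second source vector.
def compress(a, mask):
    return [x for x, m in zip(a, mask) if m]

def expand(packed, mask, fill):
    it = iter(packed)
    return [next(it) if m else f for m, f in zip(mask, fill)]

mask = [0, 1, 0, 0, 1, 1, 0, 1]
a = [f"A{i}" for i in range(8)]
b = [f"B{i}" for i in range(8)]
packed = compress(a, mask)          # ['A1', 'A4', 'A5', 'A7']
assert len(packed) == sum(mask)     # popcount of mask gives packed length
restored = expand(packed, mask, b)  # A-elements back in slots 1, 4, 5, 7
```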
Lecture 12, Slide 41
(3) Sparse Matrices
Lecture 12, Slide 42
Vector Scatter/Gather
Want to vectorize loops with indirect accesses:
(index vector D designates the nonzero elements of C)
for (i=0; i<N; i++)
A[i] = B[i] + C[D[i]]
Indexed load instruction (Gather)
LV VD, RD ; Load indices in D vector
LVI VC,(RC, VD) ; Load indirect from RC base
LV VB, RB ; Load B vector
ADDV.D VA, VB, VC ; Do add
SV VA, RA ; Store result
Lecture 12, Slide 43
Vector Scatter/Gather
Scatter example:
for (i=0; i<N; i++)
A[B[i]]++;
Is the following a correct translation?
LV VB, RB ; Load indices in B vector
LVI VA,(RA, VB) ; Gather initial A values
ADDV VA, VA, 1 ; Increment
SVI VA,(RA, VB) ; Scatter incremented values
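The answer is no when indices in B repeat: every gathered copy of a repeated element sees the same initial value, so multiple increments collapse into one, whereas the scalar loop increments one element at a time. A sketch:

```python
# Sketch: why gather / increment / scatter is wrong when indices repeat.
def scalar_histogram(a, b):
    a = a[:]
    for i in b:
        a[i] += 1                 # A[B[i]]++, one element at a time
    return a

def vector_histogram(a, b):
    a = a[:]
    vals = [a[i] for i in b]      # LVI: gather initial A values
    vals = [v + 1 for v in vals]  # ADDV: increment every gathered copy
    for i, v in zip(b, vals):
        a[i] = v                  # SVI: scatter; repeated indices collide
    return a

# Index 1 appears twice: the scalar loop counts both, the vector version loses one.
print(scalar_histogram([0, 0, 0], [1, 1, 2]))  # [0, 2, 1]
print(vector_histogram([0, 0, 0], [1, 1, 2]))  # [0, 1, 1]
```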
Lecture 12, Slide 44
(4) Multiple Lanes
Vector Instruction Execution: ADDV C,A,B
[Figure, left: execution using one pipelined functional unit; element pairs A[3] B[3] ... A[6] B[6] enter one per cycle, producing C[0], C[1], C[2], ... in sequence]
[Figure, right: execution using four pipelined functional units; lane j handles elements j, j+4, j+8, ..., so the four pipelines together retire C[0]-C[3], then C[4]-C[7], and so on, four results per cycle]
Lecture 12, Slide 45
Vector Unit Structure
[Figure: the vector unit is divided into lanes; each lane contains one pipeline of each functional unit plus a slice of the vector register file (lane 0 holds elements 0, 4, 8, ...; lane 1 holds 1, 5, 9, ...; lane 2 holds 2, 6, 10, ...; lane 3 holds 3, 7, 11, ...), all connected to the memory subsystem]
Lecture 12, Slide 46 T0 Vector Microprocessor (1995)
[Figure: vector register elements striped over eight lanes; lane 0 holds elements [0][8][16][24], lane 1 holds [1][9][17][25], and so on through lane 7, which holds [7][15][23][31]]
Lecture 12, Slide 47
Vector Instruction Parallelism
Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes
[Figure: instruction issue over time; successive load, mul, and add instructions occupy the Load Unit, Multiply Unit, and Add Unit simultaneously]
Complete 24 operations/cycle while issuing 1 short instruction/cycle
Lecture 12, Slide 48
(5) Pipelined Instruction Start-Up
• The simplest case to consider is when two vector instructions access a different set of vector registers.
• For example, in the code sequence
ADDV.D V1,V2,V3
ADDV.D V4,V5,V6
• It becomes critical to reduce start-up overhead by allowing the start of one vector instruction to be overlapped with the completion of preceding vector instructions.
• An implementation can allow the first element of the second vector instruction to immediately follow the last element of the first vector instruction down the FP adder pipeline.
Lecture 12, Slide 49 Vector Startup
Two components of vector startup penalty
– functional unit latency (time through pipeline)
– dead time or recovery time (time before another vector instruction can start down pipeline)
[Figure: each element flows through pipeline stages R X X X W; the first vector instruction's elements enter one per cycle (functional unit latency is the time from R to W); after its last element, the pipeline sits idle for the dead time before the second vector instruction's first element can enter]
Lecture 12, Slide 50
Dead Time and Short Vectors
Cray C90, two lanes: 4 cycles of dead time between vector instructions to the same unit; with 128-element vectors a unit is active for 64 cycles, giving a maximum efficiency of 94%.
T0, eight lanes: no dead time, so 100% efficiency even with 8-element vectors.
Lecture 12, Slide 51
• Example The Cray C90 has two lanes but requires 4 clock cycles of dead time between any two vector instructions to the same functional unit. For the maximum vector length of 128 elements, what is the reduction in achievable peak performance caused by the dead time? What would be the reduction if the number of lanes were increased to 16?
• Answer A maximum length vector of 128 elements is divided over the two lanes and occupies a vector functional unit for 64 clock cycles. The dead time adds another 4 cycles of occupancy, reducing the peak performance to 64/(64 + 4) = 94.1% of the value without dead time.
• If the number of lanes is increased to 16, maximum length vector instructions will occupy a functional unit for only 128/16 = 8 cycles, and the dead time will reduce peak performance to 8/(8 + 4) = 66.6% of the value without dead time.
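The efficiency arithmetic in the answer can be sketched as:

```python
# Sketch: peak-performance loss from dead time on a C90-style machine.
def efficiency(vlen, lanes, dead=4):
    busy = vlen // lanes        # cycles a functional unit is occupied
    return busy / (busy + dead)

assert round(efficiency(128, 2), 3) == 0.941    # 64/(64 + 4)
assert round(efficiency(128, 16), 3) == 0.667   # 8/(8 + 4)
```

More lanes shorten the useful occupancy per instruction, so a fixed dead time costs proportionally more.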
Lecture 12, Slide 52 7. Performance of Vector Processors
Vector Execution Time
The execution time of a sequence of vector operations primarily depends on three factors:
• the length of the operand vectors
• structural hazards among the operations
• data dependences
Lecture 12, Slide 53
Convoy and Chime
• Convoy is the set of vector instructions that could potentially begin execution together in one clock period.
– The instructions in a convoy must not contain any structural or data hazards; if such hazards were present, the instructions in the potential convoy would need to be serialized and initiated in different convoys.
• A chime is the unit of time taken to execute one convoy.
– A chime is an approximate measure of execution time for a vector sequence; a chime measurement is independent of vector length.
– A vector sequence that consists of m convoys executes in m chimes, and for a vector length of n, this is approximately m × n clock cycles.
Lecture 12, Slide 54
• Example Show how the following code sequence lays out in convoys, assuming a single copy of each vector functional unit:
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV V3,Ry ;load vector Y
ADDV.D V4,V2,V3 ;add
SV Ry,V4 ;store the result
• How many chimes will this vector sequence take?
• How many cycles per FLOP (floating-point operation) are needed, ignoring vector instruction issue overhead?
Lecture 12, Slide 55
• Answer The first convoy is occupied by the first LV instruction. The MULVS.D is dependent on the first LV, so it cannot be in the same convoy. The second LV instruction can be in the same convoy as the MULVS.D. The ADDV.D is dependent on the second LV, so it must come in yet a third convoy, and finally the SV depends on the ADDV.D, so it must go in a following convoy.
1. LV
2. MULVS.D LV
3. ADDV.D
4. SV
• The sequence requires four convoys and hence takes four chimes. Since the sequence takes a total of four chimes and there are two floating-point operations per result, the number of cycles per FLOP is 2 (ignoring any vector instruction issue overhead).
Lecture 12, Slide 56
• The most important source of overhead ignored by the chime model is vector start-up time.
• The start-up time comes from the pipelining latency of the vector operation and is principally determined by how deep the pipeline is for the functional unit used.
Start-up overhead
Unit                 Start-up overhead (cycles)
Load and store unit  12
Multiply unit        7
Add unit             6
Lecture 12, Slide 57
• Example Assume the start-up overhead for functional units is as shown in the figure on the previous page.
• Show the time that each convoy can begin and the total number of cycles needed. How does the time compare to the chime approximation for a vector of length 64?
• Answer
Each convoy starts when the previous one completes. For a vector of length n: convoy 1 (LV) starts at 0; convoy 2 (MULVS.D LV) at 12 + n; convoy 3 (ADDV.D) at 24 + 2n; convoy 4 (SV) at 30 + 3n, producing its last result at 41 + 4n, for a total of 4n + 42 cycles.
The time per result for a vector of length 64 is therefore 4 + (42/64) = 4.65 clock cycles, while the chime approximation would be 4.
Lecture 12, Slide 58
Running Time of a Strip-mined Loop
There are two key factors that contribute to the running time of a strip-mined loop consisting of a sequence of convoys:
1. The number of convoys in the loop, which determines the number of chimes. We use the notation Tchime for the execution time in chimes.
2. The overhead for each strip-mined sequence of convoys. This overhead consists of the cost of executing the scalar code for strip-mining each block, Tloop, plus the vector start-up cost for each convoy, Tstart.
• the total running time for a vector sequence operating on a vector of length n:
Tn = ⌈n/MVL⌉ × (Tloop + Tstart) + n × Tchime
Lecture 12, Slide 59
• Example What is the execution time on VMIPS for the vector operation A = B × s, where s is a scalar and the length of the vectors A and B is 200?
• Answer – Assume the addresses of A and B are initially in Ra and
Rb, s is in Fs, and recall that for MIPS (and VMIPS) R0 always holds 0.
– The first iteration of the strip-mined loop will execute for a vector length of (200 mod 64) = 8 elements, and the following iterations will execute for a vector length of 64 elements.
– Since the vector length is either 8 or 64, we increment the address registers by 8 × 8 = 64 after the first segment and 8 × 64 = 512 for later segments. The total number of bytes in the vector is 8 × 200 = 1600, and we test for completion by comparing the address of the next vector segment to the initial address plus 1600.
Lecture 12, Slide 60 Here is the actual code:
DADDUI R2,R0,#1600 ;total # bytes in vector
DADDU R2,R2,Ra ;address of the end of A vector
DADDUI R1,R0,#8 ;loads length of 1st segment
MTC1 VLR,R1 ;load vector length in VLR
DADDUI R1,R0,#64 ;length in bytes of 1st segment
DADDUI R3,R0,#64 ;vector length of other segments
Loop: LV V1,Rb ;load B
MULVS.D V2,V1,Fs ;vector * scalar
SV Ra,V2 ;store A
DADDU Ra,Ra,R1 ;address of next segment of A
DADDU Rb,Rb,R1 ;address of next segment of B
DADDUI R1,R0,#512 ;load byte offset next segment
MTC1 VLR,R3 ;set length to 64 elements
DSUBU R4,R2,Ra ;at the end of A?
BNEZ R4,Loop ;if not, go back
Lecture 12, Slide 61
• The three vector instructions in the loop are dependent and must go into three convoys, hence Tchime = 3.
• Use our basic formula with MVL = 64 and Tloop = 15 cycles of strip-mining overhead:
T200 = ⌈200/64⌉ × (Tloop + Tstart) + 200 × Tchime = 4 × (15 + Tstart) + 200 × 3 = 660 + 4 × Tstart
• The value of Tstart is given by Tstart = 12 + 7 + 12 = 31 (the start-up overheads of the load, multiply, and store units)
• So, the overall value becomes T200 = 660 + 4 × 31 = 784
• The execution time per element with all start-up costs is then 784/200 = 3.9, compared with a chime approximation of three.
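The whole running-time model can be sketched as follows; it assumes Tloop = 15 cycles of strip-mining overhead, consistent with the 660 in the result above:

```python
# Sketch of the model Tn = ceil(n/MVL) * (Tloop + Tstart) + n * Tchime,
# with MVL = 64, Tloop = 15 (assumed strip-mining overhead),
# Tstart = 31, and Tchime = 3 from the example.
from math import ceil

def total_time(n, mvl=64, t_loop=15, t_start=31, t_chime=3):
    return ceil(n / mvl) * (t_loop + t_start) + n * t_chime

assert total_time(200) == 784      # matches the worked example
print(total_time(200) / 200)       # ~3.9 cycles per element
```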