CS141-L2-1 Tarun Soni, Summer ‘03
Performance, ALUs and such like
The good news: no quiz today !
Homework #1 is on the net now, so are the slides from previous class. Home page is www.cs.ucsd.edu/~tsoni/cse141
Finals will be the last day of class, no special time slot.
Add-drops shall be handled at break.
Today: Chapters 2 and 4 of the text.
• Procedures?

int PairDiff(int a, int b, int c, int d)
{
    int temp;
    temp = (a+b) - (c+d);
    return temp;
}

Assume the caller puts a,b,c,d in $a0-$a3 and wants the result in $v0.

PairDiff:
    sub $sp,$sp,12     // Make space for 3 temp locations
    sw  $t1, 8($sp)    // save $t1 (optional if MIPS convention)
    sw  $t0, 4($sp)    // save $t0 (optional if MIPS convention)
    sw  $s0, 0($sp)    // save $s0
    add $t0,$a0,$a1    // $t0 = a+b
    add $t1,$a2,$a3    // $t1 = c+d
    sub $s0,$t0,$t1    // $s0 = (a+b)-(c+d)
    add $v0,$s0,$zero  // store return value in $v0
    lw  $s0, 0($sp)    // restore registers
    lw  $t0, 4($sp)    // (optional if MIPS convention)
    lw  $t1, 8($sp)    // (optional if MIPS convention)
    add $sp,$sp,12     // ‘pop’ the stack
    jr  $ra            // The actual return to the calling routine
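The computation PairDiff performs can be sketched in Python (a minimal illustration; the function name is my own, and the MIPS version additionally saves and restores $t0, $t1, $s0 on the stack around this work):

```python
# What PairDiff computes, step for step:
def pair_diff(a, b, c, d):
    t0 = a + b        # add $t0,$a0,$a1
    t1 = c + d        # add $t1,$a2,$a3
    return t0 - t1    # sub $s0,$t0,$t1 ; result returned in $v0

print(pair_diff(7, 3, 4, 1))  # → 5
```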
CS141-L2-6 Tarun Soni, Summer ‘03
Example: Nested_procedure()
• What about nested procedures? What happens to $ra?
• Recursive procedures?
int fact(int n)
{
    if (n < 1) return 1;
    else return n * fact(n-1);
}

Assume $a0 = n

fact:
    sub  $sp,$sp,8      // Make space for 2 temp locations
    sw   $ra, 4($sp)    // save return address
    sw   $a0, 0($sp)    // save argument n
    slti $t0,$a0,1      // test for n < 1
    beq  $t0,$zero,L1   // if (n >= 1) goto L1
    add  $v0,$zero,1    // base case: $v0 = 1
    add  $sp,$sp,8      // ‘pop’ the stack
    jr   $ra            // return
L1: sub  $a0,$a0,1      // argument becomes n-1
    jal  fact           // recursive call: $v0 = fact(n-1)
    lw   $a0, 0($sp)    // restore saved n
    lw   $ra, 4($sp)    // restore return address
    add  $sp,$sp,8      // ‘pop’ the stack
    mul  $v0,$a0,$v0    // $v0 = n * fact(n-1)
    jr   $ra            // return to caller
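The stack discipline the assembly relies on can be mimicked in Python with an explicit stack: arguments are saved on the way down and popped and multiplied on the way back up (a sketch; the function name is my own):

```python
def fact_with_explicit_stack(n):
    """Iterative factorial that mirrors the MIPS code's use of $sp."""
    stack = []
    while n >= 1:              # the recursive descent: save n, call fact(n-1)
        stack.append(n)        # sw $a0, 0($sp)
        n -= 1
    result = 1                 # base case: $v0 = 1
    while stack:
        result *= stack.pop()  # lw $a0,0($sp) ; mul $v0,$a0,$v0
    return result

print(fact_with_explicit_stack(5))  # → 120
```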
Performance now improves about 50% per year (2x every 1.5 years).
But what is performance?
CS141-L2-10 Tarun Soni, Summer ‘03
Performance depends on the eyes of the beholder?
• Purchasing perspective – given a collection of machines, which has the
  – best performance?
  – least cost?
  – best performance / cost?
• Design perspective – faced with design options, which has the
  – best performance improvement?
  – least cost?
  – best performance / cost?
• Both require
  – a basis for comparison
  – a metric for evaluation
• Our goal is to understand cost & performance implications of architectural choices
CS141-L2-11 Tarun Soni, Summer ‘03
Two ideas
° Time to do the task (Execution Time)
  – execution time, response time, latency
° Tasks per day, hour, week, sec, ns ... (Performance)
  – throughput, bandwidth

Response time and throughput are often in opposition.
Plane        Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747   610 mph    6.5 hours     470          286,700
Concorde     1350 mph   3 hours       132          178,200

Which has higher performance?
• How much faster is the Concorde compared to the 747?
• How much bigger is the 747 than the Douglas DC-8?
CS141-L2-12 Tarun Soni, Summer ‘03
° Time to do the task from start to finish
  – execution time, response time, latency
° Tasks per unit time
  – throughput, bandwidth (mostly used for data movement)

Two mechanisms of getting to the Bay Area:

Vehicle     Speed     Time to Bay Area   Passengers   Throughput (pm/h)
Ferrari     160 mph   3.1 hours          2            320
Greyhound   65 mph    7.7 hours          60           3900

Response time and throughput are often in opposition.
CS141-L2-13 Tarun Soni, Summer ‘03
Relative performance ?
• can be confusing
• A runs in 12 seconds
• B runs in 20 seconds
  – A/B = 0.6, so A is 40% faster, or 1.4X faster, or B is 40% slower
  – B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower
• needs a precise definition
CS141-L2-14 Tarun Soni, Summer ‘03
Relative performance ?
• Performance is in units of things-per-second
  – bigger is better
• If we are primarily concerned with response time
  – performance(X) = 1 / execution_time(X)

"X is n times faster than Y" means

  n = Performance(X) / Performance(Y) = Execution Time(Y) / Execution Time(X)
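Plugging the A = 12 s, B = 20 s example into this definition (a minimal Python sketch, names my own):

```python
# Relative performance for the A = 12 s, B = 20 s example.
def performance(exec_time):
    return 1.0 / exec_time

time_a, time_b = 12.0, 20.0
n = performance(time_a) / performance(time_b)   # same as time_b / time_a
print(round(n, 2))  # → 1.67: A is 1.67x faster than B
```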
CS141-L2-15 Tarun Soni, Summer ‘03
How many times ?
• Time of Concorde vs. Boeing 747?
• Concorde is 1350 mph / 610 mph = 2.2 times faster

Amdahl's Law:

  Execution time after improvement =
      Execution time unaffected + (Execution Time Affected / Amount of Improvement)
• Example:
"Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"
How about making it 5 times faster?
• Principle: Make the common case fast
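The multiply example above works out as follows under Amdahl's Law (a sketch in Python; the variable names are my own):

```python
# Amdahl's Law applied to the multiply example.
total, affected = 100.0, 80.0
unaffected = total - affected             # 20 s cannot be improved

# Target: run 4x faster, i.e. in 25 s total.
target = total / 4
needed = affected / (target - unaffected)  # 80 / 5
print(needed)  # → 16.0: multiply must get 16x faster

# Target: 5x faster, i.e. 20 s total -- exactly the unaffected time,
# so no finite multiply speedup can achieve it.
```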
CS141-L2-25 Tarun Soni, Summer ‘03
MIPS, MFLOPS etc.
• MIPS - million instructions per second

  MIPS = (number of instructions executed in program) / (execution time in seconds * 10^6)
       = clock rate / (CPI * 10^6)

• MFLOPS - million floating point operations per second

  MFLOPS = (number of floating point operations executed in program) / (execution time in seconds * 10^6)

• program-independent
• deceptive
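The second form of the MIPS formula can be evaluated directly from a machine's clock rate and CPI (a sketch with hypothetical numbers; the 500 MHz clock is an assumption for illustration, and the CPI of 2.2 matches the example machine on the next slide):

```python
# MIPS = clock rate / (CPI * 10^6)
clock_rate_hz = 500e6     # hypothetical 500 MHz machine
cpi = 2.2                 # average cycles per instruction
mips = clock_rate_hz / (cpi * 1e6)
print(round(mips, 1))  # → 227.3
```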
CS141-L2-26 Tarun Soni, Summer ‘03
Example RISC Processor
Base Machine (Reg / Reg), Typical Mix

Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
Total CPI:               2.2
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
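The first two what-if questions can be answered directly from the table by recomputing the weighted CPI (a sketch; variable names are my own):

```python
# What-if analysis on the CPI table above.
freq   = {'ALU': 0.5, 'Load': 0.2, 'Store': 0.1, 'Branch': 0.2}
cycles = {'ALU': 1,   'Load': 5,   'Store': 3,   'Branch': 2}

base_cpi = sum(freq[op] * cycles[op] for op in freq)        # 2.2

# Better data cache: loads take 2 cycles instead of 5.
cycles_cache = dict(cycles, Load=2)
cpi_cache = sum(freq[op] * cycles_cache[op] for op in freq)  # 1.6
print(round(base_cpi / cpi_cache, 3))   # → 1.375 (x faster)

# Branch prediction: branches take 1 cycle instead of 2.
cycles_bp = dict(cycles, Branch=1)
cpi_bp = sum(freq[op] * cycles_bp[op] for op in freq)        # 2.0
print(round(base_cpi / cpi_bp, 2))      # → 1.1 (x faster)
```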
CS141-L2-27 Tarun Soni, Summer ‘03
SPEC
Which Programs?
• peak throughput measures (simple programs)
• synthetic benchmarks (Whetstone, Dhrystone, ...)
• real applications
• SPEC (best of both worlds, but with problems of their own)
  – System Performance Evaluation Cooperative
  – provides a common set of real applications along with strict guidelines for how to run them
  – provides a relatively unbiased means to compare machines

• Performance is specific to a particular program (or programs)
  – total execution time is a consistent summary of performance
• For a given architecture, performance increases come from:
  – increases in clock rate (without adverse CPI effects)
  – improvements in processor organization that lower CPI
  – compiler enhancements that lower CPI and/or instruction count
• Pitfall: expecting improvement in one aspect of a machine's performance to improve the total performance proportionally
• You should not always believe everything you read! Read carefully!
CS141-L2-31 Tarun Soni, Summer ‘03
Computer Arithmetic
bits (011011011100010 ... 01)
  • instruction: R-format, I-format, ...
  • data
    – number
      • integer: signed, unsigned
      • floating point: single precision, double precision
    – text: chars, ...
What do all those bits mean now?
CS141-L2-32 Tarun Soni, Summer ‘03
Computer Arithmetic
• How do you represent
  – negative numbers?
  – fractions?
  – really large numbers?
  – really small numbers?
• How do you
  – do arithmetic?
  – identify errors (e.g. overflow)?
• What is an ALU and what does it look like?– ALU=arithmetic logic unit
CS141-L2-33 Tarun Soni, Summer ‘03
Big Endian vs. Little Endian
[Diagram: memory as a sequence of 8-bit bytes at addresses 0, 1, 2, ...; a 32-bit word shown with bit 31 down to bit 0, least-significant bit at position 0.]

Big Endian (IBM, Motorola, HP, Sun): the most-significant byte goes at the lowest address.
Little Endian (DEC, Intel): the least-significant byte goes at the lowest address.
Some processors (e.g. PowerPC) provide both
  – if you can figure out how to switch modes, or get the compiler to issue "byte-reversed" loads and stores
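The byte ordering of the machine you are on can be observed directly by packing a known 32-bit value in native order and looking at which byte lands first (a sketch using Python's standard `struct` and `sys` modules):

```python
import struct
import sys

# Pack 0x01020304 in native byte order and inspect the first byte.
raw = struct.pack('=I', 0x01020304)

if raw[0] == 0x01:
    order = 'big'      # most-significant byte at the lowest address
else:
    order = 'little'   # least-significant byte at the lowest address

print(order, sys.byteorder)  # the two should agree
```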
CS141-L2-34 Tarun Soni, Summer ‘03
Binary Numbers: An Introduction
Consider a 4-bit binary number
Decimal   Binary        Decimal   Binary
0         0000          4         0100
1         0001          5         0101
2         0010          6         0110
3         0011          7         0111

Examples of binary arithmetic:

  3 + 2 = 5               3 + 3 = 6

  carries:    1           carries:  1 1
      0 0 1 1                 0 0 1 1
    + 0 0 1 0               + 0 0 1 1
    ---------               ---------
      0 1 0 1                 0 1 1 0
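The two 4-bit additions worked above can be checked in Python by masking the result to 4 bits (a sketch; the helper name is my own):

```python
# 4-bit binary addition: keep only the low 4 bits of the result.
def add4(a, b):
    return (a + b) & 0b1111

print(format(add4(0b0011, 0b0010), '04b'))  # → 0101 (3 + 2 = 5)
print(format(add4(0b0011, 0b0011), '04b'))  # → 0110 (3 + 3 = 6)
```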
CS141-L2-35 Tarun Soni, Summer ‘03
Negative Numbers: Some options
• We would like a number system that provides
  – obvious representation of 0, 1, 2, ...
  – uses adder for addition
  – single value of 0
  – equal coverage of positive and negative numbers
  – easy detection of sign
  – easy negation
• Sign Magnitude -- MSB is the sign bit, rest the same
  – -1 == 1001
  – -5 == 1101
• One's complement -- flip all bits to negate
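The 4-bit encodings of -5 under these schemes can be computed with bit operations (a sketch; the function names are my own, and two's complement, the scheme MIPS actually uses, is included for comparison even though it is introduced later):

```python
def sign_magnitude(n, bits=4):
    # MSB is the sign bit, remaining bits hold the magnitude.
    return (1 << (bits - 1)) | abs(n) if n < 0 else n

def ones_complement(n, bits=4):
    # Flip all bits to negate.
    mask = (1 << bits) - 1
    return (~abs(n)) & mask if n < 0 else n

def twos_complement(n, bits=4):
    # Masking a negative Python int yields its two's-complement bits.
    return n & ((1 << bits) - 1)

print(format(sign_magnitude(-5), '04b'))   # → 1101
print(format(ones_complement(-5), '04b'))  # → 1010
print(format(twos_complement(-5), '04b'))  # → 1011
```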
• The adder we just built is called a “Ripple Carry Adder”
  – The carry bit may have to propagate from LSB to MSB
  – Worst case delay for an N-bit RC adder: 2N-gate delay

[Figure: 1-bit full adder cell with inputs A, B, CarryIn and output CarryOut.]

• E.g. (back-of-the-envelope approximations):
  • Single gate delay = 0.02 ns (inverter “speed” of 50 GHz)
  • 32-bit adder => 64 gate delays => 1.28 ns delay => maximum clock of about 781 MHz
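The back-of-the-envelope numbers above follow directly from the 2N-gate-delay bound (a sketch; variable names are my own):

```python
# Ripple-carry timing estimate from the slide's assumptions.
gate_delay_ns = 0.02                          # one gate delay (50 GHz inverter)
n_bits = 32
worst_case_ns = 2 * n_bits * gate_delay_ns    # 2N gate delays
max_clock_mhz = 1000.0 / worst_case_ns        # ns -> MHz
print(round(worst_case_ns, 2), round(max_clock_mhz, 1))  # → 1.28 781.2
```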
CS141-L2-65 Tarun Soni, Summer ‘03
Ripple Carry Adders
• Is there more than one way to do addition?– two extremes: ripple carry and sum-of-products
Can you see the ripple? How could you get rid of it?
• An approach in-between our two extremes
• Motivation:
  – If we didn't know the value of carry-in, what could we do?
  – When would we always generate a carry?  gi = ai · bi
  – When would we propagate the carry?      pi = ai + bi
Inputs             Outputs
A  B  CarryIn   |  Sum  CarryOut   Comments
0  0  0         |  0    0          0 + 0 + 0 = 00
0  0  1         |  1    0          0 + 0 + 1 = 01
0  1  0         |  1    0          0 + 1 + 0 = 01
0  1  1         |  0    1          0 + 1 + 1 = 10
1  0  0         |  1    0          1 + 0 + 0 = 01
1  0  1         |  0    1          1 + 0 + 1 = 10
1  1  0         |  0    1          1 + 1 + 0 = 10
1  1  1         |  1    1          1 + 1 + 1 = 11
c1 = g0 + p0·c0
c2 = g1 + p1·c1
c3 = g2 + p2·c2
c4 = g3 + p3·c3

Generate carry:  CarryOut = 1 (independent of CarryIn)
Propagate carry: CarryOut = CarryIn
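The carry recurrence above can be simulated directly from the generate and propagate signals (a sketch; the function name is my own, and bits are listed least-significant first):

```python
# 4-bit carry lookahead: compute all carries from g and p,
# using c_{i+1} = g_i + p_i * c_i.
def cla_carries(a_bits, b_bits, c0):
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate:  a AND b
    p = [a | b for a, b in zip(a_bits, b_bits)]   # propagate: a OR b
    c = [c0]
    for i in range(len(a_bits)):
        c.append(g[i] | (p[i] & c[i]))
    return c

# 0011 + 0011 (3 + 3, bit 0 first): carries into bits 1 and 2.
print(cla_carries([1, 1, 0, 0], [1, 1, 0, 0], 0))  # → [0, 1, 1, 0, 0]
```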
CS141-L2-67 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
The Propagate and Generate machinery.
Worst-case delay of one gate: every pi and gi is computed in parallel, directly from ai and bi.
CS141-L2-68 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
The Generation of the CarryOut.
The delay (and size) still grows with the number of bits.
CS141-L2-69 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
The Generation of the Result.
  Sum_i = P_i xor C_(i-1)

• Note: pi = ai + bi (OR) is sufficient for the carry equations, but the sum requires pi = ai xor bi; when ai = bi = 1 the OR form would give the wrong sum bit.
CS141-L2-70 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
• It is very expensive to build a “full” carry lookahead adder
  – Just imagine the length of the equation for Cin31
• Common practices:
  – Connect several N-bit lookahead adders to form a big adder
  – Example: connect four 8-bit carry lookahead adders to form a 32-bit partial carry lookahead adder
[Figure: four 8-bit carry-lookahead adders chained to form a 32-bit adder. Each block takes one byte of each operand (A[7:0]/B[7:0] through A[31:24]/B[31:24]) and produces Result[7:0] through Result[31:24]; the carries C0, C8, C16, C24 ripple between the 8-bit blocks.]
CS141-L2-71 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
[Figure: a 16-bit adder built from four 4-bit ALUs (ALU0..ALU3 producing Result0-3 through Result12-15, operands a0b0..a15b15). Each ALU exports block propagate/generate signals P0..P3 and G0..G3 to a carry-lookahead unit, which computes the block carries C1..C4 using the same recurrence ci+1 = gi + pi·ci, one level up.]
• Can't build a 16-bit adder this way ... (too big)
• Could use ripple carry of 4-bit CLA adders
• Better: use the CLA principle again!
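Using the CLA principle again means each 4-bit block exports a block propagate P = p3·p2·p1·p0 and block generate G = g3 + p3·g2 + p3·p2·g1 + p3·p2·p1·g0, and a second-level unit applies the same carry recurrence to the blocks (a sketch; function names are my own, bits listed least-significant first):

```python
# Two-level carry lookahead: the "CLA principle used again".
def block_pg(a_bits, b_bits):
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a | b for a, b in zip(a_bits, b_bits)]
    P = p[0] & p[1] & p[2] & p[3]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return P, G

def block_carries(PG, c0):
    c = [c0]
    for P, G in PG:                  # same recurrence, one level up:
        c.append(G | (P & c[-1]))    # C_{i+1} = G_i + P_i * C_i
    return c

# Example: 0b1111 + 0b0001 in the low block generates a carry out of it.
PG = [block_pg([1, 1, 1, 1], [1, 0, 0, 0])] + [block_pg([0] * 4, [0] * 4)] * 3
print(block_carries(PG, 0))  # → [0, 1, 0, 0, 0]
```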
CS141-L2-72 Tarun Soni, Summer ‘03
What did we cover today?
• Last pieces of ISA class
• Performance: how to quantify it
• Binary representation: integers, positive and negative
• Basic ALU design
  – 1-bit addition
  – Handling the carry
  – Carry lookahead
  – Subtraction
  – Set on less than
  – Condition codes such as overflow, zero