1 Code Optimization
Jan 26, 2016
1
Code Optimization
2
Outline
• Optimization Blockers
– Memory aliasing
– Side effects in function calls
• Understanding Modern Processors
– Superscalar
– Out-of-order execution
• More Code Optimization Techniques
• Performance Tuning
• Suggested reading
– 5.1, 5.7 ~ 5.16
3
5.1 Capabilities and Limitations of Optimizing Compilers
Review of
5.3 Program Example
5.4 Eliminating Loop Inefficiencies
5.5 Reducing Procedure Calls
5.6 Eliminating Unneeded Memory References
4
void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}
Example P387
5
void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);   /* moved out of the loop test */
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}
Example P388
6
void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);   /* direct access to the array */
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest OPER data[i];
    }
}
Example P392
7
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;                  /* accumulate in a local variable */
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}
Example P394
8
Machine Independent Opt. Results
• Optimizations
– Reduce function calls and memory references within loop
9
Machine Independent Opt. Results
• Performance Anomaly
– Computing FP product of all elements exceptionally slow
– Very large speedup when accumulating in temporary
– Memory uses 64-bit format, registers use 80-bit format
– Benchmark data caused overflow of 64 bits, but not 80

Cycles per element (CPE):
Method            Function   Page   Integer +   Integer *   FP +     FP *
Abstract -g       Combine1   P385   42.06       41.86       41.44    160.00
Abstract -O2      Combine1   P385   31.25       33.25       31.25    143.00
Move vec_length   Combine2   P388   22.61       21.25       21.15    135.00
data access       Combine3   P392    6.00        9.00        8.00    117.00
Accum. in temp    Combine4   P394    2.00        4.00        3.00      5.00
10
Optimization Blockers P394
void combine4(vec_ptr v, int *dest)
{
int i;
int length = vec_length(v);
int *data = get_vec_start(v);
int sum = 0;
for (i = 0; i < length; i++)
sum += data[i];
*dest = sum;
}
11
Optimization Blocker: Memory Aliasing P394
• Aliasing– Two different memory references specify single
location
• Example
– v: [3, 2, 17]
– combine3(v, get_vec_start(v)+2) -->
?
– combine4(v, get_vec_start(v)+2) -->
?
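The two question marks can be checked with a small experiment. The sketch below is only an illustration: sum_via_memory and sum_via_local are hypothetical stand-ins for combine3 and combine4 with OPER = + and IDENT = 0, and they work on a plain int array rather than the vec_ptr type from the slides.

```c
#include <stddef.h>

/* Stand-in for combine3: accumulates directly in *dest (assumed OPER = +, IDENT = 0). */
void sum_via_memory(const int *data, size_t length, int *dest)
{
    size_t i;
    *dest = 0;                       /* may overwrite an element of data */
    for (i = 0; i < length; i++)
        *dest = *dest + data[i];     /* reads and writes memory on every iteration */
}

/* Stand-in for combine4: accumulates in a local that cannot alias data. */
void sum_via_local(const int *data, size_t length, int *dest)
{
    size_t i;
    int x = 0;
    for (i = 0; i < length; i++)
        x = x + data[i];
    *dest = x;                       /* single write at the end */
}

/* Runs one variant on {3, 2, 17} with dest aliased to the last element. */
int aliased_result(int use_local)
{
    int v[3] = {3, 2, 17};
    if (use_local)
        sum_via_local(v, 3, &v[2]);
    else
        sum_via_memory(v, 3, &v[2]);
    return v[2];
}
```

With the destination aliased to the last element, the memory-accumulating version ends with 10 in that element (0, then 3, then 5, then 5 + 5), while the local-accumulating version ends with 22. Because the two functions can differ, a compiler may not turn the first into the second on its own.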
12
Optimization Blocker: Memory Aliasing
• Observations
– Easy to have happen in C
• Since allowed to do address arithmetic
• Direct access to storage structures
– Get in habit of introducing local variables
• Accumulating within loops
• Your way of telling compiler not to check for aliasing
13
Optimizing Compilers
• Provide efficient mapping of program to
machine
– register allocation
– code selection and ordering
– eliminating minor inefficiencies
14
Optimizing Compilers
• Don’t (usually) improve asymptotic efficiency
– up to programmer to select best overall algorithm
– big-O savings are (often) more important than constant factors
• but constant factors also matter
• Have difficulty overcoming “optimization blockers”
– potential memory aliasing
– potential procedure side-effects
15
Limitations of Optimizing Compilers
• Operate Under Fundamental Constraint
– Must not cause any change in program
behavior under any possible condition
– Often prevents it from making optimizations
when would only affect behavior under
pathological conditions.
16
Limitations of Optimizing Compilers
• Behavior that may be obvious to the programmer
can be obfuscated by languages and coding
styles
– e.g., data ranges may be more limited than variable
types suggest
• e.g., using an “int” in C for what could be an enumerated type
(obfuscated: made confusing)
17
Limitations of Optimizing Compilers
• Most analysis is performed only within procedures
– whole-program analysis is too expensive in most cases
• Most analysis is based only on static information
– compiler has difficulty anticipating run-time inputs
• When in doubt, the compiler must be conservative
18
Optimization Blockers P380
• Memory aliasing
void twiddle1(int *xp, int *yp)
{
*xp += *yp ;
*xp += *yp ;
}
void twiddle2(int *xp, int *yp)
{
*xp += 2* *yp ;
}
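To see why a compiler cannot replace twiddle1 with twiddle2, consider the case where both pointers refer to the same int. The sketch below repeats the two functions from the slide and adds two hypothetical helpers (twiddle1_aliased, twiddle2_aliased) that call them with xp == yp.

```c
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;          /* second read of *yp sees the updated value if xp == yp */
}

void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}

/* Hypothetical helpers: call each function with both pointers aliased to x. */
int twiddle1_aliased(int x)
{
    twiddle1(&x, &x);    /* x doubles twice: result is 4 * x */
    return x;
}

int twiddle2_aliased(int x)
{
    twiddle2(&x, &x);    /* x + 2 * x: result is 3 * x */
    return x;
}
```

For distinct pointers both functions add twice *yp to *xp. When aliased, twiddle1 quadruples the value and twiddle2 triples it, so the compiler must assume the worst case and keep twiddle1 as written.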
19
Optimization Blockers P381
• Function call and side effect
int f(int);
int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}
int func2(int x)
{
    return 4 * f(x);
}
20
Optimization Blockers P381
• Function call and side effect
int counter = 0;
int f(int x)
{
return counter++ ;
}
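Putting the two slides together shows the problem concretely. This is a minimal sketch that reuses f, func1 and func2 from the slides; run_func1 and run_func2 are hypothetical helpers that reset the counter before each call.

```c
static int counter = 0;

/* f has a side effect: every call changes the global counter. */
int f(int x)
{
    (void)x;             /* argument unused, as on the slide */
    return counter++;
}

int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);   /* four calls: 0 + 1 + 2 + 3 */
}

int func2(int x)
{
    return 4 * f(x);                    /* one call: 4 * 0 */
}

/* Hypothetical helpers that start each experiment from counter == 0. */
int run_func1(void) { counter = 0; return func1(0); }
int run_func2(void) { counter = 0; return func2(0); }
```

Starting from counter == 0, func1 returns 6 and func2 returns 0, so replacing four calls with one changes program behavior. Unless the compiler can see that f has no side effects, it has to leave the calls alone.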
21
5.7 Understanding Modern Processors
5.7.1 Overall Operation
22
Modern CPU Design Figure 5.11 P396
[Block diagram. Instruction Control: Fetch Control, Instruction Decode, Instruction Cache, Retirement Unit, Register File. Execution: Functional Units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store) and Data Cache. Signals: Address, Instructions, Operations, Prediction OK?, Operation Results, Register Updates, Addr., Data.]
23
[Figure 5.11 repeated with its components numbered 1) to 5) and (1) to (7); the following slides refer to these numbers.]
24
Modern Processor P396
• Superscalar
– Perform multiple operations on every clock cycle
• Out-of-order execution
– The order in which the instructions execute need
not correspond to their ordering in the assembly
program
25
Modern Processor P396
• Two main parts
– Instruction Control Unit
• Responsible for reading a sequence of instructions
from memory
• Generating from above instructions a set of primitive
operations to perform on program data
– Execution Unit
26
1) Instruction Control Unit
• Instruction Cache
– A special, high-speed memory containing the most recently accessed instructions
27
1) Instruction Control Unit
• Instruction Decoding Logic
– Takes actual program instructions
– Converts them into a set of primitive operations
– Each primitive operation performs some simple task
• Simple arithmetic, Load, Store
• addl %eax, 4(%edx) becomes three operations
load 4(%edx) → t1
addl %eax, t1 → t2
store t2, 4(%edx)
– Register renaming
P397
P398
28
2) Fetch Control
• Fetch Ahead P396
– Fetches well ahead of currently accessed
instructions
– ICU has enough time to decode these
– ICU has enough time to send decoded
operations down to the EU
29
Fetch Control
• Branch Prediction P397
– Branch taken or fall through
– Guess whether branch is taken or not
• Speculative Execution P397
– Fetch, decode and execute according to the branch prediction
– Before the branch outcome has been determined
30
5.7 Understanding Modern Processors
5.7.2 Functional Unit Performance
31
Multi-functional Units
• Multiple Instructions Can Execute in
Parallel
– 1 load
– 1 store
– 2 integer (one may be branch)
– 1 FP Addition
– 1 FP Multiplication or Division
32
Multi-functional Units Figure 5.12 P400
• Some Instructions Take > 1 Cycle, but Can be Pipelined
Instruction                  Latency   Cycles/Issue
Load / Store                 3         1
Integer Multiply             4         1
Integer Divide               36        36
Double/Single FP Multiply    5         2
Double/Single FP Add         3         1
Double/Single FP Divide      38        38
33
5.7 Understanding Modern Processors
5.7.1 Overall Operation
34
Execution Unit
• Receives operations from ICU
• Each cycle it may receive more than one
operation
• Operations are queued in buffer
35
Execution Unit
• Operation is dispatched to one of the functional units, whenever
– All the operands of an operation are ready
– Suitable functional units are available
• Execution results are passed among functional units
• (7) Data Cache P398
– A high-speed memory containing the most recently accessed data values
36
4) Retirement Unit P398
• Instructions need to commit in serial order
– Misprediction
– Exception
• Updates Architecture status
– Memory and register values
37
5.7.3 A Closer Look at Processor Operation
Translating Instructions into Operations
38
Translation Example P401
.L24:                            # Loop:
  imull (%eax,%edx,4),%ecx       # t *= data[i]
  incl %edx                      # i++
  cmpl %esi,%edx                 # i:length
  jl .L24                        # if < goto Loop

is translated into the operations

load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1
incl %edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1
39
Understanding Translation Example P401
• Split into two operations
– Load reads from memory to generate temporary result t.1
– Multiply operation just operates on registers
imull (%eax,%edx,4),%ecx
becomes
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1
40
Understanding Translation Example P401
• Operands
– Register %eax does not change in loop. Its value will be retrieved from the register file during decoding
imull (%eax,%edx,4),%ecx
becomes
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1
41
Understanding Translation Example P401
• Operands
– Register %ecx changes on every iteration.
– Uniquely identify different versions as
• %ecx.0, %ecx.1, %ecx.2, …
– Register renaming
• Values passed directly from producer to consumers
imull (%eax,%edx,4),%ecx
becomes
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1
42
Understanding Translation Example P402
• Register %edx changes on each iteration
• Renamed as %edx.0, %edx.1, %edx.2, …
incl %edx
becomes
incl %edx.0 → %edx.1
43
Understanding Translation Example P402
• Condition codes are treated similar to
registers
• Assign tag to define connection between
producer and consumer
cmpl %esi,%edx
becomes
cmpl %esi, %edx.1 → cc.1
44
Understanding Translation Example P402
• Instruction control unit determines
destination of jump
• Predicts whether target will be taken
• Starts fetching instruction at predicted
destination
jl .L24
becomes
jl-taken cc.1
45
Understanding Translation Example P401
• Execution unit simply checks whether or not prediction was OK
• If not, it signals instruction control
– Instruction control then “invalidates” any operations generated from misfetched instructions
– Begins fetching and decoding instructions at correct target
jl .L24
becomes
jl-taken cc.1
46
• Operations
– Vertical position denotes time at which executed
• Cannot begin operation until operands available
– Height denotes latency
• Operands
– Arcs shown only for operands that are passed within execution unit
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1
incl %edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1
[Figure: the operations above drawn as a dependency graph against a vertical Time axis.]
Visualizing Operations Figure 5.13 P403
47
• Operations
– Same as before, except that add has latency of 1
load (%eax,%edx,4) → t.1
iaddl t.1, %ecx.0 → %ecx.1
incl %edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1
[Figure: the operations above drawn as a dependency graph against a vertical Time axis.]
Visualizing Operations Figure 5.14 P403
48
[Figure: iterations 1 to 3 (i=0, 1, 2) of the combining product loop, each consisting of load, imull, incl, cmpl, jl, scheduled over cycles 1 to 15.]
• Unlimited Resource Analysis
– Assume operation can start as soon as operands available
– Operations for multiple iterations overlap in time
• Performance
– Limiting factor becomes latency of integer multiplier
– Gives CPE of 4.0
3 Iterations of Combining Product Figure 5.15 P404
49
• Unlimited Resource Analysis
• Performance
– Can begin a new iteration on each clock cycle
– Should give CPE of 1.0
– Would require executing 4 integer operations in parallel
[Figure: iterations 1 to 4 (i=0 to 3) of the combining sum loop, each consisting of load, addl, incl, cmpl, jl, scheduled over cycles 1 to 7; annotation: 4 integer ops.]
4 Iterations of Combining Sum Figure 5.16 P405
50
Combining Product: Resource Constraints Figure 5.17 P406
• Figure 5.17 P406
51
[Figure: iterations 4 to 8 (i=3 to 7) of the combining sum loop scheduled over cycles 6 to 18 under resource constraints.]
Combining Sum: Resource Constraints Figure 5.18 P408
52
Combining Sum: Resource Constraints
• Only have two integer functional units
• Some operations delayed even though operands available
• Set priority based on program order
• Performance
– Sustain CPE of 2.0
53
5.8 Reducing Loop Overhead
54
Loop unrolling P409
void combine5(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;
    /* combine 3 elements at a time */
    for (i = 0; i < length-2; i += 3)
        x = x OPER data[i] OPER data[i+1] OPER data[i+2];
    /* finish any remaining elements */
    for (; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}
55
– Loads can pipeline, since don’t have dependencies
– Only one set of loop control operations
load (%eax,%edx.0,4) → t.1a
iaddl t.1a, %ecx.0c → %ecx.1a
load 4(%eax,%edx.0,4) → t.1b
iaddl t.1b, %ecx.1a → %ecx.1b
load 8(%eax,%edx.0,4) → t.1c
iaddl t.1c, %ecx.1b → %ecx.1c
iaddl $3,%edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1
Visualizing Unrolled Loop P410
56
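[Figure: dependency graph of one unrolled iteration against a Time axis. Three loads (t.1a, t.1b, t.1c) feed three chained addl operations (%ecx.0c → %ecx.1a → %ecx.1b → %ecx.1c); addl, cmpl and jl handle loop control (%edx.0 → %edx.1, cc.1).]
Measured CPE = 1.33
Visualizing Unrolled Loop Figure 5.20 P410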
57
[Figure: iterations 3 and 4 (i=6, i=9) of the unrolled loop scheduled over cycles 5 to 15.]
Executing with Loop Unrolling Figure 5.21 P411
58
Executing with Loop Unrolling
• Predicted Performance
– Can complete iteration in 3 cycles
– Should give CPE of 1.0
• Measured Performance
– CPE of 1.33
– One iteration every 4 cycles
59
Unrolling Degree   1      2      3      4      8      16
Integer Sum        2.00   1.50   1.33   1.50   1.25   1.06
Integer Product    4.00
FP Sum             3.00
FP Product         5.00
Effect of Unrolling P411
60
Effect of Unrolling
• Only helps integer sum for our examples
– Other cases constrained by functional unit
latencies
• Effect is nonlinear with degree of
unrolling
– Many subtle effects determine exact
scheduling of operations
61
5.9 Converting to Pointer Code
62
void combine4p(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;      /* one past the last element */
    int x = IDENT;
    for (; data < dend; data++)
        x = x OPER *data;
    *dest = x;
}
Example P413
63
• Some compilers and processors do a better job optimizing array code
Function     Integer +   Integer *   FP +   FP *
Combine4     2.00        4.00        3.00   5.00
Combine4p    3.00        4.00        3.00   5.00
Pointer Code vs. Array Code P414
64
Array code:
.L24:                          # Loop:
  addl (%eax,%edx,4),%ecx      # x += data[i]
  incl %edx                    # i++
  cmpl %esi,%edx               # i:length
  jl .L24                      # if < goto Loop

Pointer code:
.L30:                          # Loop:
  addl (%eax),%ecx             # x += *data
  addl $4,%eax                 # data++
  cmpl %edx,%eax               # data:dend
  jb .L30                      # if < goto Loop
Pointer vs. Array Code Inner Loops P414
65
• Performance
– Array Code: 4 instructions in 2 clock cycles
– Pointer Code: Almost same 4 instructions in 3 clock cycles
Pointer vs. Array Code Inner Loops
66
5.10 Enhancing Parallelism
67
void combine6(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x0 = IDENT, x1 = IDENT;
    /* combine 2 elements at a time; stop early so data[i+1] stays in range */
    for (i = 0; i < length-1; i += 2) {
        x0 = x0 OPER data[i];
        x1 = x1 OPER data[i+1];
    }
    /* finish any remaining elements */
    for (; i < length; i++)
        x0 = x0 OPER data[i];
    *dest = x0 OPER x1;
}
Loop Splitting P416
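A concrete instance of the two-accumulator pattern may help when checking the loop bounds. The sketch below assumes OPER = * and IDENT = 1 on a plain int array; product_2x2 is a hypothetical name, not a function from the book.

```c
#include <stddef.h>

/* Product of data[0..length-1] using two independent accumulators. */
int product_2x2(const int *data, size_t length)
{
    size_t i = 0;
    int x0 = 1, x1 = 1;

    /* Main loop handles pairs; i + 1 < length keeps data[i+1] in range
       and avoids unsigned underflow when length is 0. */
    for (; i + 1 < length; i += 2) {
        x0 *= data[i];       /* even-indexed elements */
        x1 *= data[i + 1];   /* odd-indexed elements  */
    }

    /* Finish the leftover element when length is odd. */
    for (; i < length; i++)
        x0 *= data[i];

    return x0 * x1;          /* combine the two partial products */
}
```

The two multiplications in the main loop do not depend on each other, which is what lets the multiplier pipeline overlap them. The main loop must stop one element short of the end, otherwise an odd length reads past the array.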
68
Loop Splitting
• Optimization
– Accumulate in two different sums
• Can be performed simultaneously
– Combine at end
– Exploits property that integer addition & multiplication are associative & commutative
– FP addition & multiplication are not associative, but the transformation is usually acceptable
Associative: (a OPER b) OPER c = a OPER (b OPER c). Commutative: a OPER b = b OPER a.
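The caveat about floating point can be seen with three values of very different magnitude. This is a small sketch; sum_left and sum_right are hypothetical helper names, and the inputs 1e20, -1e20 and 3.14 are chosen only for illustration.

```c
/* Adds three doubles grouping the first two together. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Adds the same three doubles grouping the last two together. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20, c = 3.14, the left grouping cancels the large values first and returns 3.14. The right grouping first adds 3.14 to -1e20, where it is lost to rounding, and returns 0.0. Reordering FP operations, as loop splitting does, can therefore change results, although for typical data the difference is small.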
69
load (%eax,%edx.0,4) → t.1a
imull t.1a, %ecx.0 → %ecx.1
load 4(%eax,%edx.0,4) → t.1b
imull t.1b, %ebx.0 → %ebx.1
iaddl $2,%edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1
Visualizing Parallel Loop P417
• Two multiplies within loop no longer have data dependency
• Allows them to pipeline
70
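[Figure: dependency graph of one iteration of the parallel loop against a Time axis. Two loads (t.1a, t.1b) feed two independent imull operations (%ecx.0 → %ecx.1 and %ebx.0 → %ebx.1); addl, cmpl and jl handle loop control (%edx.0 → %edx.1, cc.1).]
Visualizing Parallel Loop Figure 5.25 P417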
71
[Figure: iterations 1 to 3 (i=0, 2, 4) of the parallel loop scheduled over cycles 1 to 16.]
Executing with Parallel Loop Figure 5.26 P418
72
Method             Integer +   Integer *   FP +   FP *
Unroll 4           1.50        4.00        3.00   5.00
Unroll 16          1.06        4.00        3.00   5.00
2 X 2              1.50        2.00        2.00   2.50
4 X 4              1.50        2.00        1.50   2.50
8 X 4              1.25        1.25        1.50   2.00
Theoretical Opt.   1.00        1.00        1.00   2.00
Optimization Results for Combining P419
73
Optimization Results for Combining
• Register spilling
– only 6 registers available
– Using memory as storage
• Register spilling
– movl -12(%ebp), %edi
– imull 24(%eax), %edi
– movl %edi, -12(%ebp)
74
5.11 Putting it Together: Summary of Results for Optimizing Combining Code
5.11.1 Floating-Point Performance Anomaly
5.11.2 Changing Platforms
75
5.12 Branch Prediction and Misprediction Penalties
76
What About Branches?
• Challenge
– Instruction Control Unit must work well ahead of Exec. Unit
– To generate enough operations to keep EU busy
77
80489f3: movl $0x1,%ecx
80489f8: xorl %edx,%edx
80489fa: cmpl %esi,%edx
80489fc: jnl 8048a25
80489fe: movl %esi,%esi
8048a00: imull (%eax,%edx,4),%ecx
[Figure labels: earlier instructions are marked Executing, later ones Fetching & Decoding.]
What About Branches?
78
What About Branches?
• Challenge
– When encounters conditional branch, cannot
reliably determine where to continue fetching
79
Branch Outcomes
• When encounter conditional branch,
cannot determine where to continue
fetching
– Branch Taken: Transfer control to branch target
– Branch Not-Taken: Continue with next
instruction in sequence
• Cannot resolve until outcome determined
by branch/integer unit
80
80489f3: movl $0x1,%ecx
80489f8: xorl %edx,%edx
80489fa: cmpl %esi,%edx
80489fc: jnl 8048a25
80489fe: movl %esi,%esi                  <- Branch Not-Taken
8048a00: imull (%eax,%edx,4),%ecx

8048a25: cmpl %edi,%edx                  <- Branch Taken
8048a27: jl 8048a20
8048a29: movl 0xc(%ebp),%eax
8048a2c: leal 0xffffffe8(%ebp),%esp
8048a2f: movl %ecx,(%eax)
Branch Outcomes
81
Branch Prediction
• Idea
– Guess which way branch will go
– Begin executing instructions at predicted
position
• But don’t actually modify register or memory data
82
80489f3: movl $0x1,%ecx
80489f8: xorl %edx,%edx
80489fa: cmpl %esi,%edx
80489fc: jnl 8048a25                     <- Predict Taken
. . .
8048a25: cmpl %edi,%edx                  <- Execute
8048a27: jl 8048a20
8048a29: movl 0xc(%ebp),%eax
8048a2c: leal 0xffffffe8(%ebp),%esp
8048a2f: movl %ecx,(%eax)
Branch Prediction
83
Assume vector length = 100
The same loop body is shown for four consecutive iterations:
80488b1: movl (%ecx,%edx,4),%eax
80488b4: addl %eax,(%edi)
80488b6: incl %edx
80488b7: cmpl %esi,%edx
80488b9: jl 80488b1
i = 98: Predict Taken (OK)
i = 99: Predict Taken (Oops)
i = 100: Read invalid location (Executed)
i = 101: (Fetched)
Branch Prediction Through Loop
84
Assume vector length = 100
80488b1: movl (%ecx,%edx,4),%eax
80488b4: addl %eax,(%edi)
80488b6: incl %edx
80488b7: cmpl %esi,%edx
80488b9: jl 80488b1
i = 98: Predict Taken (OK)
i = 99: Predict Taken (Oops)
i = 100: Invalidate
i = 101: Invalidate (only movl, addl, incl fetched so far)
Branch Misprediction Invalidation
85
Assume vector length = 100
i = 98: Predict Taken (OK)
80488b1: movl (%ecx,%edx,4),%eax
80488b4: addl %eax,(%edi)
80488b6: incl %edx
80488b7: cmpl %esi,%edx
80488b9: jl 80488b1
i = 99: Definitely not taken
80488b1: movl (%ecx,%edx,4),%eax
80488b4: addl %eax,(%edi)
80488b6: incl %edx
80488b7: cmpl %esi,%edx
80488b9: jl 80488b1
80488bb: leal 0xffffffe8(%ebp),%esp
80488be: popl %ebx
80488bf: popl %esi
80488c0: popl %edi
Branch Misprediction Recovery
86
Branch Misprediction Recovery P427
• Performance Cost
– Misprediction on Pentium III wastes ~14 clock
cycles
– That’s a lot of time on a high performance
processor
87
• Misprediction penalty is about 14 cycles in PIII
machine
• Conditional mov is used to avoid the
misprediction penalty when the branch outcome
is not predictable
• For example:
int absval(int val) {
    return (val < 0) ? -val : val;
}
Conditional Jump Figure 5.29 P427
88
Conditional Jump P428
movl 8(%ebp), %eax     # Get val as result
movl %eax, %edx        # Copy to %edx
negl %edx              # Negate %edx
testl %eax, %eax       # Test val
cmovl %edx, %eax       # If < 0, copy %edx to result
89
5.13 Understanding Memory Performance
90
typedef struct ELE {
struct ELE *next ;
int data ;
} list_ele, *list_ptr ;
int list_len(list_ptr ls)
{
int len = 0 ;
for (;ls;ls=ls->next)
len++ ;
return len ;
}
Assembly Instructions
.L27:
  incl %eax
  movl (%edx), %edx
  testl %edx, %edx
  jne .L27
Execution unit operations
incl %eax.0 → %eax.1
load (%edx.0) → %edx.1
testl %edx.1, %edx.1 → cc.1
jne-taken cc.1
Load Latency P429, P430
Figure 5.30 P430
91
[Figure: three iterations (i=1, 2, 3) of the list_len loop, each consisting of incl, load, testl, jne, scheduled over cycles 1 to 11.]
Figure 5.31 P430
92
Store Latency Figure 5.32 P431
void array_clear(int *dest, int n)
{
int i;
for ( i = 0 ; i < n ; i++)
dest[i] = 0 ;
}
CPE 2.0
93
Store Latency Figure 5.32 P431
void array_clear(int *dest, int n)
{
    int i;
    int len = n - 7;
    /* clear 8 elements per iteration */
    for (i = 0; i < len; i += 8) {
        dest[i] = dest[i+1] = dest[i+2] = dest[i+3] = 0;
        dest[i+4] = dest[i+5] = dest[i+6] = dest[i+7] = 0;
    }
    /* finish any remaining elements */
    for ( ; i < n; i++)
        dest[i] = 0;
}
CPE 1.25
94
Store latency Figure 5.33 P432
void write_read(int *src, int *dest, int n)
{
int cnt = n;
int val = 0;
while (cnt--) {
*dest = val;
val = (*src)+1;
}
}
95
Store latency Figure 5.33 P432
write_read(&a[0], &a[1], 3)
       initial     iter. 1     iter. 2     iter. 3
cnt    3           2           1           0
a      (-10, 17)   (-10, 0)    (-10, -9)   (-10, -9)
val    0           -9          -9          -9

write_read(&a[0], &a[0], 3)
       initial     iter. 1     iter. 2     iter. 3
cnt    3           2           1           0
a      (-10, 17)   (0, 17)     (1, 17)     (2, 17)
val    0           1           2           3
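The two traces can be reproduced by running write_read on a two-element array. In this sketch, write_read is copied from the slide, while example_distinct and example_same are hypothetical helpers that return the element written through dest.

```c
/* From the slide: repeatedly store val to *dest, then reload *src + 1. */
void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;
    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }
}

/* src and dest differ: the load never sees the stored value. */
int example_distinct(void)
{
    int a[2] = {-10, 17};
    write_read(&a[0], &a[1], 3);
    return a[1];
}

/* src and dest are the same: each load depends on the preceding store. */
int example_same(void)
{
    int a[2] = {-10, 17};
    write_read(&a[0], &a[0], 3);
    return a[0];
}
```

In the first case a[1] ends at -9, because every load reads the unchanged a[0]. In the second case a[0] ends at 2, because each iteration reads back what the previous store wrote. That store-to-load dependency is what lengthens the critical path in the second case.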
97
Store latency P434
.L32:
  movl %edx, (%ecx)
  movl (%ebx), %edx
  incl %edx
  decl %eax
  jnc .L32

storeaddr (%ecx)
storedata %edx.0
load (%ebx) → %edx.1a
incl %edx.1a → %edx.1b
decl %eax.0 → %eax.1
jnc-taken cc.1
98
[Figure: two iterations of the write_read loop (storeaddr, storedata, load, incl, decl, jnc) scheduled over cycles 1 to 7.]
Store latency Figure 5.35 P434
99
[Figure: the same two iterations scheduled over cycles 1 to 7, with "=" marking the address comparison between store and load.]
Figure 5.36 P435
100
5.14 Life in the Real World: Performance Improvement Techniques
101
Basic Strategies for Optimizing Program Performance
• High-level design
• Basic coding principles
– Eliminate excessive function calls
– Eliminate unnecessary memory references
• Low-level optimizations
– Try various forms of pointer versus array code
– Reduce loop overhead by unrolling loops
– Find ways to make use of the pipelined functional units by techniques such as iteration splitting
102
5.15 Identifying and Eliminating Performance Bottlenecks
103
Performance Tuning
• Identify
– Which is the hottest part of the program
– Using a very useful method: profiling
• Instrument the program
• Run it with typical input data
• Collect information from the result
• Analyze the result
– gprof example
• $ gcc -O2 -pg prog.c -o prog
• $ ./prog file.txt (generates new file gmon.out)
• $ gprof prog (with gmon.out)
104
Example
• Task
– Count word frequencies in text document
– Sort the words in descending order of occurrence
• Steps
– Convert strings to lower case
– Apply hash function
– Read words and insert into hash table
• Mostly list operations
• Maintain counter for each unique word
– Sort results
105
Examples
unix> gcc -O2 -pg prog.c -o prog
unix> ./prog file.txt
unix> gprof prog
% cumulative self self total
time seconds seconds calls ms/call ms/call name
86.60 8.21 8.21 1 8210.00 8210.00 sort_words
5.80 8.76 0.55 946596 0.00 0.00 lower1
4.75 9.21 0.45 946596 0.00 0.00 find_ele_rec
1.27 9.33 0.12 946596 0.00 0.00 h_add
107
4872758 find_ele_rec [5]
0.60 0.01 946596/946596 insert_string [4]
[5] 6.7 0.60 0.01 946596+4872758 find_ele_rec [5]
0.00 0.01 26946/26946 save_string [9]
0.00 0.00 26946/26946 new_ele [11]
4872758 find_ele_rec [5]
Example P439
108
Principle
• Interval counting
– Maintain a counter for each function
• Record the time spent executing this function
– Interrupted at regular time (1ms)
• Check which function is executing when interrupt
occurs
• Increment the counter for this function
109
Data Set P439
• Collected works of Shakespeare
• 946,596 total words, 26,596 unique
• Initial implementation: 9.2 seconds
110
Code Optimizations
– First step: Use more efficient sorting function
– Library function qsort
[Figure: stacked bar chart of CPU seconds (axis 0 to 10), split into Sort, List, Lower, Hash and Rest, for versions 1) Initial, 2) Quicksort, 3) Iter First, 4) Iter Last, 5) Big Table, 6) Better Hash, 7) Linear Lower.]
Figure 5.37 P441
111
Further Optimizations
[Figure: the same stacked bar chart for versions 1) to 7), with the CPU seconds axis rescaled to 0 to 2.]
112
Example
• 3) Iter first: Use iterative function to insert elements in linked list
– Causes code to slow down
• 4) Iter last: Iterative function, places new entry at end of list
– Tends to place most common words at front of list
• 5) Big table: Increase number of hash buckets
• 6) Better hash: Use more sophisticated hash function
• 7) Linear lower: Move strlen out of loop
113
Code Motion Example #2
void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

Equivalent goto form, which shows that strlen is called on every iteration:
void lower(char *s)
{
    int i = 0;
    if (i >= strlen(s))
        goto done;
loop:
    if (s[i] >= 'A' && s[i] <= 'Z')
        s[i] -= ('A' - 'a');
    i++;
    if (i < strlen(s))
        goto loop;
done:
    ;
}
114
Lower Case Conversion Performance
• Time quadruples when string length doubles
• Quadratic performance
[Figure: log-log plot of CPU seconds (0.0001 to 1000) against string length (256 to 262144) for lower1.]
116
• Move call to strlen outside of loop
• Since result does not change from one iteration to another
• Form of code motion
void lower(char *s)
{
    int i;
    int len = strlen(s);    /* computed once, before the loop */
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
Improving Performance
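One way to make the quadratic cost visible without timing anything is to count how many characters the length scan touches. The sketch below is an illustration under stated assumptions: counting_strlen is a hypothetical instrumented replacement for strlen, lower1 and lower2 follow the two versions of lower shown on the slides, and scan_cost is a hypothetical test driver.

```c
#include <stddef.h>

static unsigned long chars_scanned = 0;   /* instrumentation counter (assumption) */

/* Behaves like strlen but counts every character it examines. */
static size_t counting_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
        chars_scanned++;
    }
    return n;
}

/* Length recomputed in the loop test on every iteration. */
void lower1(char *s)
{
    size_t i;
    for (i = 0; i < counting_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Length computed once, before the loop. */
void lower2(char *s)
{
    size_t i;
    size_t len = counting_strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Builds a string of n 'A' characters and returns the characters scanned. */
unsigned long scan_cost(int use_lower2, size_t n)
{
    char buf[1024];
    size_t i;
    if (n >= sizeof buf)
        n = sizeof buf - 1;           /* keep the sketch within its buffer */
    for (i = 0; i < n; i++)
        buf[i] = 'A';
    buf[n] = '\0';
    chars_scanned = 0;
    if (use_lower2)
        lower2(buf);
    else
        lower1(buf);
    return chars_scanned;
}
```

For a string of length n, lower1 evaluates the loop test n + 1 times and scans n characters each time, so it touches n(n + 1) characters. lower2 touches n. Doubling n roughly quadruples the first count and doubles the second, which matches the timing behavior described on these slides.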
117
Lower Case Conversion Performance
– Time doubles when string length doubles
– Linear performance
[Figure: log-log plot of CPU seconds (0.000001 to 1000) against string length (256 to 262144) for lower1 and lower2.]
118
• Benefits
– Helps identify performance bottlenecks
– Especially useful when have complex system with many components
• Limitations
– Only shows performance for data tested
– E.g., linear lower did not show big gain, since words are short
• Quadratic inefficiency could remain lurking in code
– Timing mechanism fairly crude
• Only works for programs that run for > 3 seconds
Performance Tuning
119
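T_{new} = (1-\alpha) T_{old} + (\alpha T_{old})/k
        = T_{old} [(1-\alpha) + \alpha/k]
S = T_{old} / T_{new} = 1 / [(1-\alpha) + \alpha/k]
S_{\infty} = 1 / (1-\alpha)    (limit as k \to \infty)
where \alpha is the fraction of the original time that is sped up and k is the speedup factor for that part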
Amdahl’s Law P443
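As a quick numeric check of the formula, the sketch below evaluates the speedup S = 1 / [(1 - alpha) + alpha / k], where alpha is the fraction of the original run time that is improved and k is the factor by which that part is sped up. The function names amdahl_speedup and amdahl_limit are hypothetical.

```c
/* Overall speedup when a fraction alpha of the run time is sped up by factor k.
   Assumes 0 <= alpha <= 1 and k > 0. */
double amdahl_speedup(double alpha, double k)
{
    return 1.0 / ((1.0 - alpha) + alpha / k);
}

/* Upper bound on the speedup as k grows without limit. Assumes alpha < 1. */
double amdahl_limit(double alpha)
{
    return 1.0 / (1.0 - alpha);
}
```

For example, speeding up 60% of a program by a factor of 3 gives 1 / (0.4 + 0.2), about 1.67 overall. Even an infinitely fast version of that 60% cannot exceed 1 / 0.4 = 2.5.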